Commit | Line | Data |
---|---|---|
95f51977 | 1 | .\" @(#)refer 6.1 (Berkeley) 5/22/86 |
6f0fb0bd KD |
2 | .\" |
3 | .... refer | tbl | nroff -ms | |
575e3d36 KD |
4 | .EH 'USD:30-%''Some Applications of Inverted Indexes on the UNIX System' |
5 | .OH 'Some Applications of Inverted Indexes on the UNIX System''USD:30-%' | |
6 | .nr LL 6.5i | |
7 | .nr LT 6.5i | |
6f0fb0bd KD |
8 | .de UC |
9 | \\s-2\\$1\\s0\\$2 | |
10 | .. | |
11 | .ds . \&\s+2.\s0 | |
12 | .if t .ds -- \(em | |
13 | .if n .ds -- -- | |
14 | .TR 69 | |
575e3d36 | 15 | \".TM 77-1274-17 39199 39199-11 |
6f0fb0bd KD |
16 | .ND October 27, 1977 |
17 | .ND June 21, 1978 | |
18 | .TL | |
19 | Some Applications of Inverted Indexes on the UNIX System | |
20 | .AU "MH 2C-572" 6377 | |
21 | M. E. Lesk | |
22 | .AI | |
23 | .MH | |
575e3d36 KD |
24 | .\".AB |
25 | .\".LP | |
26 | .\".ft B | |
27 | .\"I. Some Applications of Inverted Indexes \- Overview | |
28 | .\".ft R | |
29 | .\".PP | |
30 | .\"This memorandum describes a set of programs which | |
31 | .\"make inverted indexes to | |
32 | .\"UNIX* | |
33 | .\"text files, and their | |
34 | .\"application to | |
35 | .\"retrieving and formatting citations for documents prepared using | |
36 | .\".I troff. | |
37 | .\".PP | |
38 | .\"The indexing and searching programs make keyword | |
39 | .\"indexes to volumes of material too large for linear searching. | |
40 | .\"Searches for combinations of single words can be performed quickly. | |
41 | .\"The programs for general searching are divided into | |
42 | .\"two phases. The first makes an index from the original | |
43 | .\"data; the second searches the index and retrieves | |
44 | .\"items. | |
45 | .\"Both of these phases are further divided into two parts | |
46 | .\"to separate the data-dependent and algorithm dependent | |
47 | .\"code. | |
48 | .\".PP | |
49 | .\"The major current application of these programs is | |
50 | .\"the | |
51 | .\".I troff | |
52 | .\"preprocessor | |
53 | .\".I refer. | |
54 | .\"A list of 4300 references is maintained on line, | |
55 | .\"containing primarily papers written and cited by | |
56 | .\"local authors. | |
57 | .\"Whenever one of these references is required | |
58 | .\"in a paper, a few words from the title or author list | |
59 | .\"will retrieve it, and the user need not bother to re-enter | |
60 | .\"the exact citation. | |
61 | .\"Alternatively, authors can use their own lists of papers. | |
62 | .\".PP | |
63 | .\"This memorandum is of interest to | |
64 | .\"those who are interested in facilities for searching large | |
65 | .\"but relatively unchanging text files on | |
66 | .\"the | |
67 | .\"UNIX | |
68 | .\"system, | |
69 | .\"and those who are interested in handling bibliographic | |
70 | .\"citations with | |
71 | .\"UNIX | |
72 | .\".I troff. | |
73 | .\".LP | |
74 | .\".ft B | |
75 | .\"II. Updating Publication Lists | |
76 | .\".PP | |
77 | .\"This section is a brief note describing the | |
78 | .\"auxiliary programs for managing the updating | |
79 | .\"processing. | |
80 | .\"It is written to aid clerical users in | |
81 | .\"maintaining lists of references. | |
82 | .\"Primarily, the programs described permit a large | |
83 | .\"amount of individual control over the content | |
84 | .\"of publication lists while retaining the | |
85 | .\"usefulness of the files to other users. | |
86 | .\".LP | |
87 | .\".ft B | |
88 | .\"III. Manual Pages | |
89 | .\".PP | |
90 | .\"This section contains the pages from the | |
91 | .\"UNIX programmer's manual | |
92 | .\"dealing with these commands. | |
93 | .\"It is useful for reference. | |
94 | .\".sp | |
95 | .\"\l'3i' | |
96 | .\".br | |
97 | .\"* UNIX is a trademark of Bell Laboratories. | |
98 | .\".AE | |
6f0fb0bd KD |
99 | .CS 10 4 14 0 0 4 |
100 | .NH | |
101 | Introduction. | |
102 | .PP | |
103 | The | |
104 | .UX | |
105 | system | |
106 | has many utilities | |
107 | (e.g. \fIgrep, awk, lex, egrep, fgrep, ...\fR) | |
108 | to search through files of text, | |
109 | but most of them are based on a linear scan through the | |
110 | entire file, using some deterministic automaton. | |
111 | .ev 1 | |
112 | .ps 8 | |
113 | .vs 10p | |
114 | .ev | |
115 | This memorandum discusses a program which uses inverted | |
116 | indexes | |
117 | .[ | |
118 | %A D. Knuth | |
119 | %T The Art of Computer Programming: Vol. 3, Sorting and Searching | |
120 | %I Addison-Wesley | |
121 | %C Reading, Mass. | |
122 | %D 1977 | |
123 | %O See section 6.5. | |
124 | .] | |
125 | and can thus be used on much larger data bases. | |
126 | .PP | |
127 | As with any indexing system, of course, there are some disadvantages; | |
128 | once an index is made, the files that have been indexed can not be changed | |
129 | without remaking the index. | |
130 | Thus applications are restricted | |
131 | to those making many searches | |
132 | of relatively stable data. | |
133 | Furthermore, these programs depend on hashing, and can only | |
134 | search for exact matches of whole keywords. | |
135 | It is not possible to look for | |
136 | arithmetic or logical expressions (e.g. ``date greater than 1970'') or | |
137 | for regular expression searching such as that in | |
138 | .I lex. | |
139 | .[ | |
140 | lex lesk cstr | |
141 | .] | |
142 | .PP | |
143 | Currently there are two uses of this software, | |
144 | the | |
145 | .I refer | |
146 | preprocessor to format references, | |
147 | and the | |
148 | .I lookall | |
149 | command to search through all text files on | |
150 | the | |
151 | .UX | |
575e3d36 KD |
152 | system.\(dd |
153 | .FS | |
154 | \(dd \fIlookall\fP is not part of the Berkeley UNIX distribution. | |
155 | .FE | |
6f0fb0bd KD |
156 | .PP |
157 | The remaining sections of this memorandum discuss | |
158 | the searching programs and their uses. | |
159 | Section 2 explains the operation of the searching algorithm and describes | |
160 | the data collected for use with the | |
161 | .I lookall | |
162 | command. | |
163 | The more important application, | |
164 | .I refer | |
165 | has a user's description in section 3. | |
166 | Section 4 goes into more detail on | |
167 | reference files | |
168 | for the benefit of those who | |
169 | wish to add references to data bases or | |
170 | write new | |
171 | .I troff | |
172 | macros for use with | |
173 | .I refer. | |
174 | The options to make | |
175 | .I refer | |
176 | collect identical citations, or otherwise relocate and adjust references, | |
177 | are described in section 5. | |
6f0fb0bd KD |
178 | .NH |
179 | Searching. | |
180 | .PP | |
181 | The indexing and searching process is divided into two phases, | |
182 | each made of two parts. | |
183 | These are | |
184 | shown below. | |
185 | .IP A. | |
186 | Construct the index. | |
187 | .RS | |
188 | .IP (1) | |
189 | Find keys \*(-- turn the input files into a sequence of tags and keys, | |
190 | where each tag identifies a distinct item in the input | |
191 | and the keys for each such item are the strings under which it is | |
192 | to be indexed. | |
193 | .IP (2) | |
194 | Hash and sort \*(-- | |
195 | prepare a set of inverted indexes from which, given a set of keys, | |
196 | the appropriate item tags can be found quickly. | |
197 | .RE | |
198 | .IP B. | |
199 | Retrieve an item in response to a query. | |
200 | .RS | |
201 | .IP (3) | |
202 | Search \*(-- | |
203 | Given some keys, look through the files prepared by the hashing | |
204 | and sorting facility and derive the appropriate tags. | |
205 | .IP (4) | |
206 | Deliver \*(-- | |
207 | Given the tags, find the original items. This completes the | |
208 | searching process. | |
209 | .RE | |
210 | .LP | |
211 | The first phase, making the index, is presumably done relatively infrequently. | |
212 | It should, of course, be done whenever the data being | |
213 | indexed change. | |
214 | In contrast, the second phase, retrieving items, | |
215 | is presumably done often, and must be rapid. | |
216 | .PP | |
217 | An effort is made to separate code which depends on the data | |
218 | being handled from code which depends on the searching procedure. | |
219 | The search algorithm is involved only in programs | |
220 | (2) and (3), while knowledge of the actual data files is | |
221 | needed only by programs (1) and (4). | |
222 | Thus it is easy to adapt to different data files or different | |
223 | search algorithms. | |
224 | .PP | |
225 | To start with, it is necessary to have some way of selecting | |
226 | or generating keys from input files. | |
227 | For dealing with files that are basically English, we have | |
228 | a key-making program which automatically selects words | |
229 | and passes them to the hashing and sorting program (step 2). | |
230 | The format used has one line for each input item, | |
231 | arranged | |
232 | as follows: | |
233 | .DS | |
234 | name:start,length (tab) key1 key2 key3 ... | |
235 | .DE | |
236 | where | |
237 | .I name | |
238 | is the file name, | |
239 | .I start | |
240 | is the starting byte number, | |
241 | and | |
242 | .I length | |
243 | is the number of bytes in the entry. | |
244 | .PP | |
245 | These lines are the only input used to make the | |
246 | index. | |
247 | The first field (the file name, byte position, and byte count) | |
248 | is the tag of the item | |
249 | and can be used to retrieve it quickly. | |
250 | Normally, an item is either a whole file or a section of a file | |
251 | delimited by blank lines. | |
252 | After the tab, the second field contains the keys. | |
253 | The keys, if selected by the automatic program, are | |
254 | any alphanumeric strings which | |
255 | are not among the 100 most frequent words in English | |
256 | and which are not entirely numeric (except for four-digit | |
257 | numbers beginning 19, which are accepted as dates). | |
258 | Keys are truncated to six characters and converted to lower case. | |
259 | Some selection is needed if the original items are very large. | |
260 | We normally just take the first | |
261 | .I n | |
262 | keys, with | |
263 | .I n | |
264 | less than 100 or so; this replaces any attempt at intelligent selection. | |
265 | One file in our system is | |
266 | a complete English dictionary; it would presumably be retrieved for all queries. | |
267 | .PP | |
268 | To generate an inverted index to the list of record tags and keys, | |
269 | the keys | |
270 | are hashed | |
271 | and sorted to produce an index. | |
272 | What is wanted, ideally, is a series of lists showing the tags associated | |
273 | with each key. | |
274 | To condense this, | |
275 | what is actually produced is a list showing the tags associated | |
276 | with each hash code, and thus with some set of keys. | |
277 | To speed up access and further save space, | |
278 | a set of three or possibly four files is produced. | |
279 | These files are: | |
280 | .KS | |
281 | .bd 2 2 | |
282 | .TS | |
283 | center; | |
284 | c c | |
285 | lI l. | |
286 | File Contents | |
287 | entry Pointers to posting file | |
288 | for each hash code | |
289 | posting Lists of tag pointers for | |
290 | each hash code | |
291 | tag Tags for each item | |
292 | key Keys for each item | |
293 | (optional) | |
294 | .TE | |
295 | .bd 2 | |
296 | .KE | |
297 | The posting file comprises the real data: it contains a sequence of lists | |
298 | of items posted under each hash code. To speed up searching, | |
299 | the entry file is an array of pointers into the posting file, one per potential | |
300 | hash code. | |
301 | Furthermore, the items in the lists in the posting file are not referred to by their | |
302 | complete tag, but just by an address in the tag file, which | |
303 | gives the complete tags. | |
304 | The key file is optional and contains a copy of the keys | |
305 | used in the indexing. | |
306 | .PP | |
307 | The searching process starts with a query, containing several keys. | |
308 | The goal is to obtain all items which were indexed under these keys. | |
309 | The query keys are hashed, and the pointers in the entry file used | |
310 | to access the lists in the posting file. These lists | |
311 | are addresses in the tag file of documents posted under the | |
312 | hash codes derived from the query. | |
313 | The common items from all lists are determined; | |
314 | this must include the items indexed by every key, but may also | |
315 | contain some items which are false drops, since items referenced by | |
316 | the correct hash codes need not actually have contained the correct keys. | |
317 | Normally, if there are several keys in the query, there are not | |
318 | likely to be many false drops in the final combined list even though | |
319 | each hash code is somewhat ambiguous. | |
320 | The actual tags are then obtained from the tag file, and to guard against | |
321 | the possibility that an item has false-dropped on some hash code | |
322 | in the query, the original items are normally obtained from the delivery | |
323 | program (4) and the query keys checked against them | |
324 | by string comparison. | |
325 | .PP | |
326 | Usually, therefore, the check for bad drops is made against the original file. | |
327 | However, if the key derivation procedure is complex, it may be preferable | |
328 | to check against the keys fed to program (2). | |
329 | In this case the optional key file which contains the | |
330 | keys associated with each item is generated, and the item tag is supplemented | |
331 | by a string | |
332 | .DS | |
333 | ;start,length | |
334 | .DE | |
335 | which indicates the starting byte number in the key file and the length of | |
336 | the string of keys for each item. | |
337 | This file is not usually necessary with the present | |
338 | key-selection program, since the keys | |
339 | always appear in the original document. | |
340 | .PP | |
341 | There is also an option | |
342 | (\f3-C\f2n\|\f1) | |
343 | for coordination level searching. | |
344 | This retrieves items which match all but | |
345 | .I n | |
346 | of the query keys. | |
347 | The items are retrieved in the order of the number | |
348 | of keys that they match. | |
349 | Of course, | |
350 | .I n | |
351 | must be less than the number of query keys (nothing is | |
352 | retrieved unless it matches at least one key). | |
353 | .PP | |
354 | As an example, consider one set of 4377 references, comprising | |
355 | 660,000 bytes. | |
356 | This included 51,000 keys, of which 5,900 were distinct | |
357 | keys. | |
358 | The hash table is kept full to save space (at the expense of time); | |
359 | 995 of 997 possible hash codes were used. | |
360 | The total set of index files (no key file) included 171,000 bytes, | |
361 | about 26% of the original file size. | |
362 | It took 8 minutes of processor time to | |
363 | hash, sort, and write the index. | |
364 | To search for a single query with the resulting index took 1.9 seconds | |
365 | of processor time, | |
366 | while to find the same paper | |
367 | with a sequential linear search | |
368 | using | |
369 | .I grep | |
370 | (reading all of the tags and keys) | |
371 | took 12.3 seconds of processor time. | |
372 | .PP | |
373 | We have also used this software to index all of the English stored on our | |
374 | .UX | |
375 | system. | |
376 | This is the index searched by the | |
377 | .I lookall | |
378 | command. | |
379 | On a typical day there were | |
380 | 29,000 files in our user file system, containing about 152,000,000 | |
381 | bytes. | |
382 | Of these | |
383 | 5,300 files, containing 32,000,000 bytes (about 21%) | |
384 | were English text. | |
385 | The total number of `words' (determined mechanically) | |
386 | was 5,100,000. | |
387 | Of these 227,000 were selected as keys; | |
388 | 19,000 were distinct, hashing to 4,900 (of 5,000 possible) different hash codes. | |
389 | The | |
390 | resulting inverted file indexes used 845,000 bytes, or about | |
391 | 2.6% of the size of the original files. | |
392 | The particularly small indexes are caused by the | |
393 | fact that keys are taken from only the first 50 non-common words of | |
394 | some very long input files. | |
395 | .PP | |
396 | Even this large \f2lookall\f1 index can be searched quickly. | |
397 | For example, to find this document | |
398 | by looking for the keys | |
399 | ``lesk inverted indexes'' | |
400 | required | |
401 | 1.7 seconds of processor time | |
402 | and system time. | |
403 | By comparison, just to search the 800,000 byte dictionary (smaller than even | |
404 | the inverted indexes, let alone the 27,000,000 bytes of text files) with | |
405 | .I grep | |
406 | takes 29 seconds of processor time. | |
407 | The | |
408 | .I lookall | |
409 | program is thus useful when looking for a document which you believe | |
410 | is stored on-line, but do not know where. For example, many memos | |
411 | from our center are in the file system, but it is often | |
412 | difficult to guess where a particular memo might be (it might have several | |
413 | authors, each with many directories, and have been worked on by | |
414 | a secretary with yet more directories). | |
415 | Instructions for the use of the | |
416 | .I lookall | |
417 | command are given in the manual section, shown | |
418 | in the appendix to this memorandum. | |
419 | .PP | |
420 | The only indexes maintained routinely are those of publication lists and | |
421 | all English files. | |
422 | To make other indexes, the programs for making keys, sorting them, | |
423 | searching the indexes, and delivering answers must be used. | |
424 | Since they are usually invoked as parts of higher-level commands, | |
425 | they are not in the default command | |
426 | directory, but are available to any user in the directory | |
427 | .I /usr/lib/refer . | |
428 | Three programs are of interest: | |
429 | .I mkey , | |
430 | which isolates keys from input files; | |
431 | .I inv , | |
432 | which makes an index from a set of keys; | |
433 | and | |
434 | .I hunt , | |
435 | which searches the index and delivers the items. | |
436 | Note that the two parts of the retrieval phase are combined into | |
437 | one program, to avoid the excessive system work and delay which | |
438 | would result from running these as separate processes. | |
439 | .PP | |
440 | These three commands have a large number of options to adapt to different | |
441 | kinds of input. | |
442 | The user not interested in the detailed description that now follows may | |
443 | skip to section 3, which describes the | |
444 | .I refer | |
445 | program, a packaged-up version of these tools specifically | |
446 | oriented towards formatting references. | |
447 | .PP | |
448 | .B | |
449 | Make Keys. | |
450 | .R | |
451 | The program | |
452 | .I mkey | |
453 | is the key-making program corresponding to step (1) in phase A. | |
454 | Normally, it reads its input from the file names given as arguments, | |
455 | and if there are no arguments it reads from the standard input. | |
456 | It assumes that blank lines in the input delimit | |
457 | separate items, for each of which a different line of | |
458 | keys should be generated. | |
459 | The lines of keys are written on the standard output. | |
460 | Keys are any alphanumeric string in the input not | |
461 | among the most frequent words in English and not entirely numeric | |
462 | (except that all-numeric strings are acceptable if they | |
463 | are between 1900 and 1999). | |
464 | In the output, keys are translated to lower case, and truncated | |
465 | to six characters in length; any associated punctuation is removed. | |
466 | The following flag arguments are recognized by | |
467 | .I mkey: | |
468 | .TS | |
469 | center; | |
470 | lB lw(4i). | |
471 | \-c \f2name T{ | |
472 | Name of file of common words; | |
473 | default is | |
474 | .I /usr/lib/eign. | |
475 | T} | |
476 | \-f \f2name T{ | |
477 | Read a list of files from | |
478 | .I name | |
479 | and take each as an input argument. | |
480 | T} | |
481 | \-i \f2chars T{ | |
482 | Ignore all lines which begin with `%' followed by any character | |
483 | in | |
484 | .I chars . | |
485 | T} | |
486 | \-k\f2n T{ | |
487 | Use at most | |
488 | .I n | |
489 | keys per input item. | |
490 | T} | |
491 | \-l\f2n T{ | |
492 | Ignore items shorter than | |
493 | .I n | |
494 | letters long. | |
495 | T} | |
496 | \-n\f2m T{ | |
497 | Ignore as a key any word in the first | |
498 | .I m | |
499 | words of the list of common English words. | |
500 | The default is 100. | |
501 | T} | |
502 | \-s T{ | |
503 | Remove the labels | |
504 | .I (file:start,length) | |
505 | from the output; just give the keys. | |
506 | Used when searching rather than indexing. | |
507 | T} | |
508 | \-w T{ | |
509 | Each whole file is a separate item; | |
510 | blank lines in files are irrelevant. | |
511 | T} | |
512 | .TE | |
513 | .PP | |
514 | The normal arguments for indexing references are | |
515 | the defaults, which are | |
516 | .I "\-c /usr/lib/eign" , | |
517 | .I \-n100 , | |
518 | and | |
519 | .I \-l3 . | |
520 | For searching, the | |
521 | .I \-s | |
522 | option is also needed. | |
523 | When the big | |
524 | .I lookall | |
525 | index of all English files is run, | |
526 | the options are | |
527 | .I \-w , | |
528 | .I \-k50 , | |
529 | and | |
530 | .I "\-f (filelist)" . | |
531 | When running on textual input, | |
532 | the | |
533 | .I mkey | |
534 | program processes about 1000 English words per processor second. | |
535 | Unless the | |
536 | .I \-k | |
537 | option is used (and the input files are long enough for | |
538 | it to take effect) | |
539 | the output of | |
540 | .I mkey | |
541 | is comparable in size to its input. | |
542 | .PP | |
543 | .B | |
544 | Hash and invert. | |
545 | .R | |
546 | The | |
547 | .I inv | |
548 | program computes the hash codes and writes | |
549 | the inverted files. | |
550 | It reads the output of | |
551 | .I mkey | |
552 | and writes the set of files described earlier | |
553 | in this section. | |
554 | It expects one argument, which is used as the base name for | |
555 | the three (or four) files to be written. | |
556 | Assuming an argument of | |
557 | .I Index | |
558 | (the default) | |
559 | the entry file is named | |
560 | .I Index.ia , | |
561 | the posting file | |
562 | .I Index.ib , | |
563 | the tag file | |
564 | .I Index.ic , | |
565 | and the key file (if present) | |
566 | .I Index.id . | |
567 | The | |
568 | .I inv | |
569 | program recognizes the following options: | |
570 | .TS | |
571 | center; | |
572 | lB lw(4i). | |
573 | \-a T{ | |
574 | Append the new keys to a previous set of inverted files, | |
575 | making new files if there is no old set using the same base name. | |
576 | T} | |
577 | \-d T{ | |
578 | Write the optional key file. | |
579 | This is needed when you can not check for false drops by looking | |
580 | for the keys in the original inputs, i.e. when the key derivation | |
581 | procedure is complicated and | |
582 | the output keys are not words from the input files. | |
583 | T} | |
584 | \-h\f2n T{ | |
585 | The hash table size is | |
586 | .I n | |
587 | (default 997); | |
588 | .I n | |
589 | should be prime. | |
590 | Making \f2n\f1 bigger saves search time and spends disk space. | |
591 | T} | |
592 | \-i[u] \f2name T{ | |
593 | Take input from file | |
594 | .I name , | |
595 | instead of the standard input; | |
596 | if | |
597 | .B u | |
598 | is present | |
599 | .I name | |
600 | is unlinked when the sort is started. | |
601 | Using this option permits the sort scratch space | |
602 | to overlap the disk space used for input keys. | |
603 | T} | |
604 | \-n T{ | |
605 | Make a completely new set of inverted files, ignoring | |
606 | previous files. | |
607 | T} | |
608 | \-p T{ | |
609 | Pipe into the sort program, rather than writing a temporary | |
610 | input file. | |
611 | This saves disk space and spends processor time. | |
612 | T} | |
613 | \-v T{ | |
614 | Verbose mode; print a summary of the number of keys which | |
615 | finished indexing. | |
616 | T} | |
617 | .TE | |
618 | .PP | |
619 | About half the time used in | |
620 | .I inv | |
621 | is in the contained sort. | |
622 | Assuming the sort is roughly linear, however, | |
623 | a guess at the total timing for | |
624 | .I inv | |
625 | is 250 keys per second. | |
626 | The space used is usually of more importance: | |
627 | the entry file uses four bytes per possible hash (note | |
628 | the | |
629 | .B \-h | |
630 | option), | |
631 | and the tag file around 15-20 bytes per item indexed. | |
632 | Roughly, the posting file contains one item for each key instance | |
633 | and one item for each possible hash code; the items are two bytes | |
634 | long if the tag file is less than 65336 bytes long, and the | |
635 | items are four bytes wide if the tag file is greater than | |
636 | 65536 bytes long. | |
637 | Note that to minimize storage, the hash tables should be | |
638 | over-full; | |
639 | for most of the files indexed in this way, there is no | |
640 | other real choice, since the | |
641 | .I entry | |
642 | file must fit in memory. | |
643 | .PP | |
644 | .B | |
645 | Searching and Retrieving. | |
646 | .R | |
647 | The | |
648 | .I hunt | |
649 | program retrieves items from an index. | |
650 | It combines, as mentioned above, the two parts of phase (B): | |
651 | search and delivery. | |
652 | The reason why it is efficient to combine delivery and search | |
653 | is partly to avoid starting unnecessary processes, and partly | |
654 | because the delivery operation must be a part of the search | |
655 | operation in any case. | |
656 | Because of the hashing, the search part takes place in two stages: | |
657 | first items are retrieved which have the right hash codes associated with them, | |
658 | and then the actual items are inspected to determine false drops, i.e. | |
659 | to determine if anything with the right hash codes doesn't really have the right | |
660 | keys. | |
661 | Since the original item is retrieved to check on false drops, | |
662 | it is efficient to present it immediately, rather than only | |
663 | giving the tag as output and later retrieving the | |
664 | item again. | |
665 | If there were a separate key file, this argument would not apply, | |
666 | but separate key files are not common. | |
667 | .PP | |
668 | Input to | |
669 | .I hunt | |
670 | is taken from the standard input, | |
671 | one query per line. | |
672 | Each query should be in | |
673 | .I "mkey \-s" | |
674 | output format; | |
675 | all lower case, no punctuation. | |
676 | The | |
677 | .I hunt | |
678 | program takes one argument which specifies the base name of the index | |
679 | files to be searched. | |
680 | Only one set of index files can be searched at a time, | |
681 | although many text files may be indexed as a group, of course. | |
682 | If one of the text files has been changed since the index, that file | |
683 | is searched with | |
684 | .I fgrep; | |
685 | this may occasionally slow down the searching, and care should be taken to | |
686 | avoid having many out of date files. | |
687 | The following option arguments are recognized by | |
688 | .I hunt: | |
689 | .TS | |
690 | center; | |
691 | lB lw(4i). | |
692 | \-a T{ | |
693 | Give all output; ignore checking for false drops. | |
694 | T} | |
695 | \-C\f2n T{ | |
696 | Coordination level | |
697 | .I n; | |
698 | retrieve items with not more than | |
699 | .I n | |
700 | terms of the input missing; | |
701 | default | |
702 | .I C0 , | |
703 | implying that each search term must be in the output items. | |
704 | T} | |
705 | \-F[yn\f2d\f3\|] T{ | |
706 | ``\-Fy'' gives the text of all the items found; | |
707 | ``\-Fn'' suppresses them. | |
708 | ``\-F\f2d\|\f1'' where \f2d\f1\| is an integer | |
709 | gives the text of the first \f2d\f1 items. | |
710 | The default is | |
711 | .I \-Fy. | |
712 | T} | |
713 | \-g T{ | |
714 | Do not use | |
715 | .I fgrep | |
716 | to search files changed since the index was made; | |
717 | print an error comment instead. | |
718 | T} | |
719 | \-i \f2string T{ | |
720 | Take | |
721 | .I string | |
722 | as input, instead of reading the standard input. | |
723 | T} | |
724 | \-l \f2n T{ | |
725 | The maximum length of internal lists of candidate | |
726 | items is | |
727 | .I n; | |
728 | default 1000. | |
729 | T} | |
730 | \-o \f2string T{ | |
731 | Put text output (``\-Fy'') in | |
732 | .I string; | |
733 | of use | |
734 | .I only | |
735 | when | |
736 | invoked from another program. | |
737 | T} | |
738 | \-p T{ | |
739 | Print hash code frequencies; mostly | |
740 | for use in optimizing hash table sizes. | |
741 | T} | |
742 | \-T[yn\f2d\|\f3] T{ | |
743 | ``\-Ty'' gives the tags of the items found; | |
744 | ``\-Tn'' suppresses them. | |
745 | ``\-T\f2d\f1\|'' where \f2d\f1\| is an integer | |
746 | gives the first \f2d\f1 tags. | |
747 | The default is | |
748 | .I \-Tn . | |
749 | T} | |
750 | \-t \f2string T{ | |
751 | Put tag output (``\-Ty'') in | |
752 | .I string; | |
753 | of use | |
754 | .I only | |
755 | when invoked from another program. | |
756 | T} | |
757 | .TE | |
758 | .PP | |
759 | The timing of | |
760 | .I hunt | |
761 | is complex. | |
762 | Normally the hash table is overfull, so that there will | |
763 | be many false drops on any single term; | |
764 | but a multi-term query will have few false drops on | |
765 | all terms. | |
766 | Thus if a query is underspecified (one search term) | |
767 | many potential items will be examined and discarded as false | |
768 | drops, wasting time. | |
769 | If the query is overspecified (a dozen search terms) | |
770 | many keys will be examined only to verify that | |
771 | the single item under consideration has that key posted. | |
772 | The variation of search time with number of keys is | |
773 | shown in the table below. | |
774 | Queries of varying length were constructed to retrieve | |
775 | a particular document from the file of references. | |
776 | In the sequence to the left, search terms were chosen so as | |
777 | to select the desired paper as quickly as possible. | |
778 | In the sequence on the right, terms were chosen inefficiently, | |
779 | so that the query did not uniquely select the desired document | |
780 | until four keys had been used. | |
781 | The same document was the target in each case, | |
782 | and the final set of eight keys are also identical; the differences | |
783 | at five, six and seven keys are produced by measurement error, not | |
784 | by the slightly different key lists. | |
785 | .TS | |
786 | center; | |
787 | c s s s5 | c s s s | |
788 | cp8 cp8 cp8 cp8 | cp8 cp8 cp8 cp8 | |
789 | cp8 cp8 cp8 cp8 | cp8 cp8 cp8 cp8 | |
790 | n n n n | n n n n . | |
791 | Efficient Keys Inefficient Keys | |
792 | No. keys Total drops Retrieved Search time No. keys Total drops Retrieved Search time | |
793 | (incl. false) Documents (seconds) (incl. false) Documents (seconds) | |
794 | 1 15 3 1.27 1 68 55 5.96 | |
795 | 2 1 1 0.11 2 29 29 2.72 | |
796 | 3 1 1 0.14 3 8 8 0.95 | |
797 | 4 1 1 0.17 4 1 1 0.18 | |
798 | 5 1 1 0.19 5 1 1 0.21 | |
799 | 6 1 1 0.23 6 1 1 0.22 | |
800 | 7 1 1 0.27 7 1 1 0.26 | |
801 | 8 1 1 0.29 8 1 1 0.29 | |
802 | .TE | |
803 | As would be expected, the optimal search is achieved | |
804 | when the query just specifies the answer; however, | |
805 | overspecification is quite cheap. | |
806 | Roughly, the time required by | |
807 | .I hunt | |
808 | can be approximated as | |
809 | 30 milliseconds per search key plus 75 milliseconds | |
810 | per dropped document (whether it is a false drop or | |
811 | a real answer). | |
812 | In general, overspecification can be recommended; | |
813 | it protects the user against additions to the data base | |
814 | which turn previously uniquely-answered queries | |
815 | into ambiguous queries. | |
816 | .PP | |
817 | The careful reader will have noted an enormous discrepancy between these times | |
818 | and the earlier quoted time of around 1.9 seconds for a search. The times | |
819 | here are purely for the search and retrieval: they are measured by | |
820 | running many searches through a single invocation of the | |
821 | .I hunt | |
822 | program alone. | |
823 | The normal retrieval operation involves using the shell to | |
824 | set up a pipeline through | |
825 | .I mkey | |
826 | to | |
827 | .I hunt | |
828 | and starting both processes; this adds a fixed overhead of about 1.7 seconds | |
829 | of processor time | |
830 | to any single search. | |
831 | Furthermore, remember that all these times are processor times: | |
832 | on a typical morning on our \s-2PDP\s0 11/70 system, with about one dozen | |
833 | people logged on, | |
834 | to obtain 1 second of processor time for the search program | |
835 | took between 2 and 12 seconds of real time, with a median of | |
836 | 3.9 seconds and a mean of 4.8 seconds. | |
837 | Thus, although the work involved in a single search may be only | |
838 | 200 milliseconds, after you add the 1.7 seconds of startup processor | |
839 | time | |
840 | and then assume a 4:1 elapsed/processor time | |
841 | ratio, it will be 8 seconds before any response is printed. | |
842 | .NH | |
843 | Selecting and Formatting References for T\s-2ROFF\s0 | |
844 | .PP | |
845 | The major application of the retrieval software | |
846 | is | |
847 | .I refer, | |
848 | which is a | |
849 | .I troff | |
850 | preprocessor | |
851 | like | |
852 | .I eqn . | |
853 | .[ | |
854 | kernighan cherry acm 1975 | |
855 | .] | |
856 | It scans its input looking for items of the form | |
857 | .DS | |
858 | \*.[ | |
859 | imprecise citation | |
860 | \*.\^] | |
861 | .DE | |
862 | where an imprecise citation is merely a string | |
863 | of words found in the relevant bibliographic citation. | |
864 | This is translated into a properly formatted reference. | |
865 | If the imprecise citation does not correctly identify | |
866 | a single paper | |
867 | (either | |
868 | selecting no papers or too many) a message is given. | |
869 | The data base of citations searched may be tailored to each | |
870 | system, and individual users may specify their own | |
871 | citation | |
872 | files. | |
873 | On our system, the default data base is accumulated from | |
874 | the publication lists of the members of our organization, plus | |
875 | about half a dozen personal bibliographies that were collected. | |
876 | The present total is about 4300 citations, but this increases steadily. | |
877 | Even now, | |
878 | the data base covers a large fraction of local citations. | |
879 | .PP | |
880 | For example, the reference for the | |
881 | .I eqn | |
882 | paper above was specified as | |
883 | .DS | |
884 | \&\*.\*.\*. | |
885 | \&preprocessor like | |
886 | \&.I eqn. | |
887 | \&.[ | |
888 | \&kernighan cherry acm 1975 | |
889 | \&.] | |
890 | \&It scans its input looking for items | |
891 | \&\*.\*.\*. | |
892 | .DE | |
893 | This paper was itself printed using | |
894 | .I refer. | |
895 | The above input text was processed by | |
896 | .I refer | |
897 | as well as | |
898 | .I tbl | |
899 | and | |
900 | .I troff | |
901 | by the command | |
902 | .DS | |
903 | .ft I | |
904 | refer memo-file | tbl | troff \-ms | |
905 | .ft R | |
906 | .DE | |
907 | and the reference was automatically translated into a correct | |
908 | citation to the ACM paper on mathematical typesetting. | |
909 | .PP | |
910 | The procedure to use to place a reference in a paper | |
911 | using | |
912 | .I refer | |
913 | is as follows. | |
914 | First, use the | |
915 | .I lookbib | |
916 | command to check that the paper is in the data base | |
917 | and to find out what keys are necessary to retrieve it. | |
918 | This is done by typing | |
919 | .I lookbib | |
920 | and then typing some potential queries until | |
921 | a suitable query is found. | |
922 | For example, had one started to find | |
923 | the | |
924 | .I eqn | |
925 | paper shown above by presenting the query | |
926 | .DS | |
927 | $ lookbib | |
928 | kernighan cherry | |
929 | (EOT) | |
930 | .DE | |
931 | .I lookbib | |
932 | would have found several items; experimentation would quickly | |
933 | have shown that the query given above is adequate. | |
934 | Overspecifying the query is of course harmless. | |
935 | A particularly careful reader may have noticed that ``acm'' does not | |
936 | appear in the printed citation; | |
937 | we have supplemented some of the data base items with common | |
938 | extra keywords, such as common abbreviations for journals | |
939 | or other sources, to aid in searching. | |
940 | .PP | |
941 | If the reference is in the data base, the query | |
942 | that retrieved it can be inserted in the text, | |
943 | between | |
944 | .B \*.[ | |
945 | and | |
946 | .B \*.\^] | |
947 | brackets. | |
948 | If it is not in the data base, it can be typed | |
949 | into a private file of references, using the format | |
950 | discussed in the next section, and then | |
951 | the | |
952 | .B \-p | |
953 | option | |
954 | used to search this private file. | |
955 | Such a command might read | |
956 | (if the private references are called | |
957 | .I myfile ) | |
958 | .DS | |
959 | .ft 2 | |
960 | refer \-p myfile document | tbl | eqn | troff \-ms \*. \*. \*. | |
961 | .ft 1 | |
962 | .DE | |
963 | where | |
964 | .I tbl | |
965 | and/or | |
966 | .I eqn | |
967 | could be omitted if not needed. | |
968 | The use | |
969 | of the | |
970 | .I \-ms | |
971 | macros | |
972 | .[ | |
973 | lesk typing documents unix gcos | |
974 | .] | |
975 | or some other macro package, however, | |
976 | is essential. | |
977 | .I Refer | |
978 | only generates the data for the references; exact formatting | |
979 | is done by some macro package, and if none is supplied the | |
980 | references will not be printed. | |
981 | .PP | |
982 | By default, | |
983 | the references are numbered sequentially, | |
984 | and | |
985 | the | |
986 | .I \-ms | |
987 | macros format references as footnotes at the bottom of the page. | |
988 | This memorandum is an example of that style. | |
989 | Other possibilities are discussed in section 5 below. | |
990 | .NH | |
991 | Reference Files. | |
992 | .PP | |
993 | A reference file is a set of bibliographic references usable with | |
994 | .I refer. | |
995 | It can be indexed using the software described in section 2 | |
996 | for fast searching. | |
997 | What | |
998 | .I refer | |
999 | does is to read the input document stream, | |
1000 | looking for imprecise citation references. | |
1001 | It then searches through reference files to find | |
1002 | the full citations, and inserts them into the | |
1003 | document. | |
1004 | The format of the full citation is arranged to make it | |
1005 | convenient for a macro package, such as the | |
1006 | .I \-ms | |
1007 | macros, to format the reference | |
1008 | for printing. | |
1009 | Since | |
1010 | the format of the final reference is determined | |
1011 | by the desired style of output, | |
1012 | which is determined by the macros used, | |
1013 | .I refer | |
1014 | avoids forcing any kind of reference appearance. | |
1015 | All it does is define a set of string registers which | |
1016 | contain the basic information about the reference; | |
1017 | and provide a macro call which is expanded by the macro | |
1018 | package to format the reference. | |
1019 | It is the responsibility of the final macro package | |
1020 | to see that the reference is actually printed; if no | |
1021 | macros are used, and the output of | |
1022 | .I refer | |
1023 | fed untranslated to | |
1024 | .I troff, | |
1025 | nothing at all will be printed. | |
1026 | .PP | |
1027 | The strings defined by | |
1028 | .I refer | |
1029 | are taken directly from the files of references, which | |
1030 | are in the following format. | |
1031 | The references should be separated | |
1032 | by blank lines. | |
1033 | Each reference is a sequence of lines beginning with | |
1034 | .B % | |
1035 | and followed | |
1036 | by a key-letter. | |
1037 | The remainder of that line, and successive lines until the next line beginning | |
1038 | with | |
1039 | .B % , | |
1040 | contain the information specified by the key-letter. | |
1041 | In general, | |
1042 | .I refer | |
1043 | does not interpret the information, but merely presents | |
1044 | it to the macro package for final formatting. | |
1045 | A user with a separate macro package, for example, | |
1046 | can add new key-letters or use the existing ones for other purposes | |
1047 | without bothering | |
1048 | .I refer. | |
1049 | .PP | |
1050 | The meaning of the key-letters given below, in particular, | |
1051 | is that assigned by the | |
1052 | .I \-ms | |
1053 | macros. | |
1054 | Not all information, obviously, is used with each citation. | |
1055 | For example, if a document is both an internal memorandum and a journal article, | |
1056 | the macros ignore the memorandum version and cite only the journal article. | |
1057 | Some kinds of information are not used at all in printing the reference; | |
1058 | if a user does not like finding references by specifying title | |
1059 | or author keywords, and prefers to add specific keywords to the | |
1060 | citation, a field is available which is searched but not | |
1061 | printed (\f3K\f1). | |
1062 | .PP | |
1063 | The key letters currently recognized by | |
1064 | .I refer | |
1065 | and | |
1066 | .I \-ms, | |
1067 | with the kind of information implied, are: | |
1068 | .KS | |
1069 | .TS | |
1070 | center; | |
1071 | c c6 c c | |
1072 | c l c l. | |
1073 | Key Information specified Key Information specified | |
1074 | A Author's name N Issue number | |
1075 | B Title of book containing item O Other information | |
1076 | C City of publication P Page(s) of article | |
1077 | D Date R Technical report reference | |
1078 | E Editor of book containing item T Title | |
1079 | G Government (NTIS) ordering number V Volume number | |
1080 | I Issuer (publisher) | |
1081 | J Journal name | |
1082 | K Keys (for searching) X or | |
1083 | L Label Y or | |
1084 | M Memorandum label Z Information not used by \f2refer\f1 | |
1085 | .TE | |
1086 | .KE | |
1087 | For example, a sample reference could be | |
1088 | typed as: | |
1089 | .DS | |
1090 | %T Bounds on the Complexity of the Maximal | |
1091 | Common Subsequence Problem | |
1092 | %Z ctr127 | |
1093 | %A A. V. Aho | |
1094 | %A D. S. Hirschberg | |
1095 | %A J. D. Ullman | |
1096 | %J J. ACM | |
1097 | %V 23 | |
1098 | %N 1 | |
1099 | %P 1-12 | |
1100 | .\"%M TM 75-1271-7 | |
1101 | %M abcd-78 | |
1102 | %D Jan. 1976 | |
1103 | .DE | |
1104 | Order is irrelevant, except that authors are shown in the order | |
1105 | given. The output of | |
1106 | .I refer | |
1107 | is a stream of string definitions, one | |
1108 | for each of the fields of each reference, as | |
1109 | shown below. | |
1110 | .DS | |
1111 | \*.]- | |
1112 | \*.ds [A authors' names \*.\*.\*. | |
1113 | \*.ds [T title \*.\*.\*. | |
1114 | \*.ds [J journal \*.\*.\*. | |
1115 | \*.\*.\*. | |
1116 | \*.]\|[ type-number | |
1117 | .DE | |
1118 | The special macro | |
1119 | .B \&\*.]\- | |
1120 | precedes the string definitions | |
1121 | and the special macro | |
1122 | .B \*.]\|[ | |
1123 | follows. | |
1124 | These are changed from the input | |
1125 | .B \*.[ | |
1126 | and | |
1127 | .B \*.\^] | |
1128 | so that running the same file through | |
1129 | .I refer | |
1130 | again is harmless. | |
1131 | The | |
1132 | .B \*.]\- | |
1133 | macro can be used by the macro package to | |
1134 | initialize. | |
1135 | The | |
1136 | .B \*.]\|[ | |
1137 | macro, which should be used | |
1138 | to print the reference, is given an | |
1139 | argument | |
1140 | .I type-number | |
1141 | to indicate the kind of reference, as follows: | |
1142 | .KS | |
1143 | .TS | |
1144 | center; | |
1145 | c c | |
1146 | n l. | |
1147 | Value Kind of reference | |
1148 | 1 Journal article | |
1149 | 2 Book | |
1150 | 3 Article within book | |
1151 | 4 Technical report | |
1152 | 5 Bell Labs technical memorandum | |
1153 | 0 Other | |
1154 | .TE | |
1155 | .KE | |
1156 | The reference is flagged in the text | |
1157 | with the sequence | |
1158 | .DS | |
1159 | \e*\|([\*.number\e*\|(\*.\^] | |
1160 | .DE | |
1161 | where | |
1162 | .I number | |
1163 | is the footnote number. | |
1164 | The strings | |
1165 | .B [\*. | |
1166 | and | |
1167 | .B \*.\^] | |
1168 | should be used by the macro package | |
1169 | to format the reference flag in the text. | |
1170 | These strings can be replaced for a particular | |
1171 | footnote, as described in section 5. | |
1172 | The footnote number (or other signal) is available | |
1173 | to the reference macro | |
1174 | .B \*.]\|[ | |
1175 | as the | |
1176 | string register | |
1177 | .B [F . | |
1178 | .PP | |
1179 | In some cases users wish to suspend the searching, and merely | |
1180 | use the reference macro formatting. | |
1181 | That is, the user doesn't want to provide a search key | |
1182 | between | |
1183 | .B \*.[ | |
1184 | and | |
1185 | .B \*.\^] | |
1186 | brackets, but merely | |
1187 | the reference lines for the appropriate document. | |
1188 | Alternatively, the user | |
1189 | can wish | |
1190 | to add a few fields to those in the reference | |
1191 | as in the standard file, or | |
1192 | override some fields. | |
1193 | Altering or replacing fields, or supplying whole references, is easily done | |
1194 | by inserting lines beginning | |
1195 | with | |
1196 | .B % ; | |
1197 | any such line is taken as | |
1198 | direct input to the reference | |
1199 | processor rather than keys to be searched. | |
1200 | Thus | |
1201 | .DS | |
1202 | \*.[ | |
1203 | key1 key2 key3 \*.\*.\*. | |
1204 | %Q New format item | |
1205 | %R Override report name | |
1206 | \*.\^] | |
1207 | .DE | |
575e3d36 | 1208 | makes the indicated changes to the result of searching for |
6f0fb0bd KD |
1209 | the keys. |
1210 | All of the search keys must be given before the first | |
1211 | \f3%\f1 line. | |
1212 | .PP | |
1213 | If no search keys are provided, an entire citation can | |
1214 | be provided in-line in the text. | |
1215 | For example, if the | |
1216 | .I eqn | |
1217 | paper citation were to be inserted in | |
1218 | this way, rather than by searching for it in the data base, | |
1219 | the input would read | |
1220 | .DS | |
1221 | \&\*.\*.\*. | |
1222 | \&preprocessor like | |
1223 | \&.I eqn. | |
1224 | \&.[ | |
1225 | \&%A B. W. Kernighan | |
1226 | \&%A L. L. Cherry | |
1227 | \&%T A System for Typesetting Mathematics | |
1228 | \&%J Comm. ACM | |
1229 | \&%V 18 | |
1230 | \&%N 3 | |
1231 | \&%P 151-157 | |
1232 | \&%D March 1975 | |
1233 | \&.] | |
1234 | \&It scans its input looking for items | |
1235 | \&\*.\*.\*. | |
1236 | .DE | |
1237 | This would produce a citation of the same appearance as that | |
1238 | resulting from the file search. | |
1239 | .PP | |
1240 | As shown, fields are normally turned into | |
1241 | .I troff | |
1242 | strings. | |
1243 | Sometimes users would rather have them defined as macros, | |
1244 | so that other | |
1245 | .I troff | |
1246 | commands can be placed into the data. | |
1247 | When this is necessary, simply double the control character | |
1248 | .B % | |
1249 | in the data. | |
1250 | Thus the input | |
1251 | .DS | |
1252 | \&.[ | |
1253 | %V 23 | |
1254 | %%M | |
1255 | Bell Laboratories, | |
1256 | Murray Hill, N.J. 07974 | |
1257 | \&.] | |
1258 | .DE | |
1259 | is processed by | |
1260 | .I refer | |
1261 | into | |
1262 | .DS | |
1263 | \&.ds [V 23 | |
1264 | \&.de [M | |
1265 | Bell Laboratories, | |
1266 | Murray Hill, N.J. 07974 | |
1267 | \&.. | |
1268 | .DE | |
1269 | The information after | |
1270 | .B %%M | |
1271 | is defined as a macro to be invoked by | |
1272 | .B .[M | |
1273 | while the information after | |
1274 | .B %V | |
1275 | is turned into a string to be invoked by | |
1276 | .B \e\(**([V . | |
1277 | At present | |
1278 | .I \-ms | |
1279 | expects all information as strings. | |
1280 | .NH | |
1281 | Collecting References and other Refer Options | |
1282 | .PP | |
1283 | Normally, the combination of | |
1284 | .I refer | |
1285 | and | |
1286 | .I \-ms | |
1287 | formats output as | |
1288 | .I troff | |
1289 | footnotes which are consecutively numbered and placed | |
1290 | at the bottom of the page. However, | |
1291 | options exist to | |
1292 | place the references at the end; to arrange references alphabetically | |
1293 | by senior author; and to indicate references by strings in the text of the form | |
1294 | [Name1975a] | |
1295 | rather than by number. | |
1296 | Whenever references are not placed at the bottom of a page | |
1297 | identical references are coalesced. | |
1298 | .PP | |
1299 | For example, the | |
1300 | .B \-e | |
1301 | option to | |
1302 | .I refer | |
1303 | specifies that references are to be collected; in this case | |
1304 | they are output whenever the sequence | |
1305 | .DS | |
1306 | \*.[ | |
1307 | $LIST$ | |
1308 | \*.\^] | |
1309 | .DE | |
1310 | is encountered. | |
1311 | Thus, to place references at the end of a paper, the user would run | |
1312 | .I refer | |
1313 | with the | |
1314 | .I \-e | |
1315 | option and place the above $LIST$ commands after the last | |
1316 | line of the text. | |
1317 | .I Refer | |
1318 | will then move all the references to that point. | |
1319 | To aid in formatting the collected references, | |
1320 | .I refer | |
1321 | writes the references preceded by the line | |
1322 | .DS | |
1323 | .B .]< | |
1324 | .DE | |
1325 | and | |
1326 | followed by the line | |
1327 | .DS | |
1328 | .B .]> | |
1329 | .DE | |
1330 | to invoke special macros before and after the references. | |
1331 | .PP | |
1332 | Another possible option to | |
1333 | .I refer | |
1334 | is the | |
1335 | .B \-s | |
1336 | option to specify | |
1337 | sorting of references. The default, | |
1338 | of course, is to list references in the order presented. | |
1339 | The | |
1340 | .B \-s | |
1341 | option implies the | |
1342 | .B \-e | |
1343 | option, and thus requires | |
1344 | a | |
1345 | .DS | |
1346 | \*.[ | |
1347 | $LIST$ | |
1348 | \*.\^] | |
1349 | .DE | |
1350 | entry to call out the reference list. | |
1351 | The | |
1352 | .B \-s | |
1353 | option may be followed by a string of letters, numbers, and `+' signs indicating how | |
1354 | the references are to be sorted. | |
1355 | The sort is done using the fields whose key-letters are | |
1356 | in the string as sorting keys; the numbers indicate how many | |
1357 | of the fields are to be considered, with `+' | |
1358 | taken as a large number. | |
1359 | Thus the default is | |
1360 | .B \-sAD | |
1361 | meaning ``Sort on senior author, then date.'' To | |
1362 | sort on all authors and then title, specify | |
1363 | .B \-sA+T . | |
1364 | And to sort on two authors and then the journal, | |
1365 | write | |
1366 | .B \-sA2J . | |
1367 | .PP | |
1368 | Other options to | |
1369 | .I refer | |
1370 | change the signal or label inserted in the text for each reference. | |
1371 | Normally these are just sequential numbers, | |
1372 | and their exact placement (within brackets, as superscripts, etc.) is determined | |
1373 | by the macro package. | |
1374 | The | |
1375 | .B \-l | |
1376 | option replaces reference numbers by | |
1377 | strings composed of the senior author's last name, the date, | |
1378 | and a disambiguating letter. | |
1379 | If a number follows the | |
1380 | .B l | |
1381 | as in | |
1382 | .B \-l3 | |
1383 | only that many letters of the last name are used | |
1384 | in the label string. | |
1385 | To abbreviate the date as well the form | |
1386 | \f3-l\f2m,n\f1 | |
1387 | shortens the last name to the | |
1388 | first | |
1389 | .I m | |
1390 | letters and the date to the | |
1391 | last | |
1392 | .I n | |
1393 | digits. | |
1394 | For example, the option | |
1395 | .B \-l3,2 | |
1396 | would refer to the | |
1397 | .I eqn | |
1398 | paper (reference 3) by the signal | |
1399 | .I Ker75a , | |
1400 | since it is the first cited reference by Kernighan in 1975. | |
1401 | .PP | |
1402 | A user wishing to specify particular labels for | |
1403 | a private bibliography may use the | |
1404 | .B \-k | |
1405 | option. | |
1406 | Specifying | |
1407 | \f3\-k\f2x\f1 | |
1408 | causes the field \f2x\f1 to be used as a label. | |
1409 | The default is \f3L\f1. | |
1410 | If this field ends in \f3\-\f1, that character | |
1411 | is replaced by a sequence letter; otherwise the field | |
1412 | is used exactly as given. | |
1413 | .PP | |
1414 | If none of the | |
1415 | .I refer -produced | |
1416 | signals are desired, | |
1417 | the | |
1418 | .B \-b | |
1419 | option entirely suppresses automatic text signals. | |
1420 | .PP | |
1421 | If the user wishes to override the | |
1422 | .I \-ms | |
1423 | treatment of the reference signal (which is normally to | |
1424 | enclose the number in brackets in | |
1425 | .I nroff | |
1426 | and make it a superscript in | |
1427 | .I troff\\| ) | |
1428 | this can be done easily. | |
1429 | If the lines | |
1430 | .B \&.[ | |
1431 | or | |
1432 | .B \&.] | |
1433 | contain anything following these characters, | |
1434 | the remainders of these lines are used to surround | |
1435 | the reference signal, instead of the default. | |
1436 | Thus, for example, to say ``See reference (2).'' | |
1437 | and avoid | |
1438 | ``See reference.\s-3\u2\d\s+3'' the | |
1439 | input might appear | |
1440 | .DS | |
1441 | \&See reference | |
1442 | \&\*.[ ( | |
1443 | imprecise citation ... | |
1444 | \&\*.\^])\*. | |
1445 | .DE | |
1446 | Note that blanks are significant in this construction. | |
1447 | If a permanent change is desired in the style of reference | |
1448 | signals, however, it is probably easier to redefine the strings | |
1449 | .B \&[. | |
1450 | and | |
1451 | .B \&.] | |
1452 | (which are used to bracket each signal) | |
1453 | than to change each citation. | |
1454 | .PP | |
1455 | Although normally | |
1456 | .I refer | |
1457 | limits itself to retrieving the data for the reference, | |
1458 | and leaves to a macro package the job of arranging that | |
1459 | data as required by the local format, there are two | |
1460 | special options for rearrangements that can not be | |
1461 | done by macro packages. | |
1462 | The | |
1463 | .B \-c | |
1464 | option puts fields into all upper case | |
1465 | (C\s-2APS\s+2-S\s-2MALL\s+2 C\s-2APS\s+2 | |
1466 | in | |
1467 | .I troff | |
1468 | output). | |
1469 | The key-letters indicated what information is to be translated | |
1470 | to upper case follow the | |
1471 | .B c , | |
1472 | so that | |
1473 | .B \-cAJ | |
1474 | means that authors' names and journals are to be in caps. | |
1475 | The | |
1476 | .B \-a | |
1477 | option writes the names of authors last name first, that is | |
1478 | .I "A. D. Hall, Jr." | |
1479 | is written as | |
1480 | .I "Hall, A. D. Jr" . | |
1481 | The citation form of | |
1482 | the | |
1483 | .I "Journal of the ACM" , | |
1484 | for example, would require | |
1485 | both | |
1486 | .B \-cA | |
1487 | and | |
1488 | .B \-a | |
1489 | options. | |
1490 | This produces authors' names in the style | |
1491 | .I | |
1492 | K\s-2ERNIGHAN\s0, B. W. \s-2AND\s0 C\s-2HERRY\s0, L. L.\& | |
1493 | .R | |
1494 | for the previous example. | |
1495 | The | |
1496 | .B \-a | |
1497 | option may be followed by a number to indicate how many | |
1498 | author names should be reversed; | |
1499 | .B \-a1 | |
1500 | (without any | |
1501 | .B \-c | |
1502 | option) | |
1503 | would produce | |
1504 | .I | |
1505 | Kernighan, B. W. and L. L. Cherry, | |
1506 | .R | |
1507 | for example. | |
1508 | .PP | |
1509 | Finally, there is also the previously-mentioned | |
1510 | .B \-p | |
1511 | option to let the user specify | |
1512 | a private file of references to be searched before the public files. | |
1513 | Note that | |
1514 | .I refer | |
1515 | does not insist on a previously made index for these files. | |
1516 | If a file is named which contains reference | |
1517 | data but is not indexed, it will be searched | |
1518 | (more slowly) | |
1519 | by | |
1520 | .I refer | |
1521 | using | |
1522 | .I fgrep. | |
1523 | In this way | |
1524 | it is easy for users to keep small files of | |
1525 | new references, which can later be added to the | |
1526 | public data bases. | |
1527 | .SG MH-1274-MEL-\s8UNIX\s0 |