BSD 4_3_Reno release
[unix-history] / usr / src / share / doc / usd / 30.invert / refer
CommitLineData
95f51977 1.\" @(#)refer 6.1 (Berkeley) 5/22/86
6f0fb0bd
KD
2.\"
3.... refer | tbl | nroff -ms
575e3d36
KD
4.EH 'USD:30-%''Some Applications of Inverted Indexes on the UNIX System'
5.OH 'Some Applications of Inverted Indexes on the UNIX System''USD:30-%'
6.nr LL 6.5i
7.nr LT 6.5i
6f0fb0bd
KD
8.de UC
9\\s-2\\$1\\s0\\$2
10..
11.ds . \&\s+2.\s0
12.if t .ds -- \(em
13.if n .ds -- --
14.TR 69
575e3d36 15\".TM 77-1274-17 39199 39199-11
6f0fb0bd
KD
16.ND October 27, 1977
17.ND June 21, 1978
18.TL
19Some Applications of Inverted Indexes on the UNIX System
20.AU "MH 2C-572" 6377
21M. E. Lesk
22.AI
23.MH
575e3d36
KD
24.\".AB
25.\".LP
26.\".ft B
27.\"I. Some Applications of Inverted Indexes \- Overview
28.\".ft R
29.\".PP
30.\"This memorandum describes a set of programs which
31.\"make inverted indexes to
32.\"UNIX*
33.\"text files, and their
34.\"application to
35.\"retrieving and formatting citations for documents prepared using
36.\".I troff.
37.\".PP
38.\"The indexing and searching programs make keyword
39.\"indexes to volumes of material too large for linear searching.
40.\"Searches for combinations of single words can be performed quickly.
41.\"The programs for general searching are divided into
42.\"two phases. The first makes an index from the original
43.\"data; the second searches the index and retrieves
44.\"items.
45.\"Both of these phases are further divided into two parts
46.\"to separate the data-dependent and algorithm dependent
47.\"code.
48.\".PP
49.\"The major current application of these programs is
50.\"the
51.\".I troff
52.\"preprocessor
53.\".I refer.
54.\"A list of 4300 references is maintained on line,
55.\"containing primarily papers written and cited by
56.\"local authors.
57.\"Whenever one of these references is required
58.\"in a paper, a few words from the title or author list
59.\"will retrieve it, and the user need not bother to re-enter
60.\"the exact citation.
61.\"Alternatively, authors can use their own lists of papers.
62.\".PP
63.\"This memorandum is of interest to
64.\"those who are interested in facilities for searching large
65.\"but relatively unchanging text files on
66.\"the
67.\"UNIX
68.\"system,
69.\"and those who are interested in handling bibliographic
70.\"citations with
71.\"UNIX
72.\".I troff.
73.\".LP
74.\".ft B
75.\"II. Updating Publication Lists
76.\".PP
77.\"This section is a brief note describing the
78.\"auxiliary programs for managing the updating
79.\"processing.
80.\"It is written to aid clerical users in
81.\"maintaining lists of references.
82.\"Primarily, the programs described permit a large
83.\"amount of individual control over the content
84.\"of publication lists while retaining the
85.\"usefulness of the files to other users.
86.\".LP
87.\".ft B
88.\"III. Manual Pages
89.\".PP
90.\"This section contains the pages from the
91.\"UNIX programmer's manual
92.\"dealing with these commands.
93.\"It is useful for reference.
94.\".sp
95.\"\l'3i'
96.\".br
97.\"* UNIX is a trademark of Bell Laboratories.
98.\".AE
6f0fb0bd
KD
99.CS 10 4 14 0 0 4
100.NH
101Introduction.
102.PP
103The
104.UX
105system
106has many utilities
107(e.g. \fIgrep, awk, lex, egrep, fgrep, ...\fR)
108to search through files of text,
109but most of them are based on a linear scan through the
110entire file, using some deterministic automaton.
111.ev 1
112.ps 8
113.vs 10p
114.ev
115This memorandum discusses a program which uses inverted
116indexes
117.[
118%A D. Knuth
119%T The Art of Computer Programming: Vol. 3, Sorting and Searching
120%I Addison-Wesley
121%C Reading, Mass.
122%D 1977
123%O See section 6.5.
124.]
125and can thus be used on much larger data bases.
126.PP
127As with any indexing system, of course, there are some disadvantages;
128once an index is made, the files that have been indexed can not be changed
129without remaking the index.
130Thus applications are restricted
131to those making many searches
132of relatively stable data.
133Furthermore, these programs depend on hashing, and can only
134search for exact matches of whole keywords.
135It is not possible to look for
136arithmetic or logical expressions (e.g. ``date greater than 1970'') or
137for regular expression searching such as that in
138.I lex.
139.[
140lex lesk cstr
141.]
142.PP
143Currently there are two uses of this software,
144the
145.I refer
146preprocessor to format references,
147and the
148.I lookall
149command to search through all text files on
150the
151.UX
575e3d36
KD
152system.\(dd
153.FS
154\(dd \fIlookall\fP is not part of the Berkeley UNIX distribution.
155.FE
6f0fb0bd
KD
156.PP
157The remaining sections of this memorandum discuss
158the searching programs and their uses.
159Section 2 explains the operation of the searching algorithm and describes
160the data collected for use with the
161.I lookall
162command.
163The more important application,
164.I refer
165has a user's description in section 3.
166Section 4 goes into more detail on
167reference files
168for the benefit of those who
169wish to add references to data bases or
170write new
171.I troff
172macros for use with
173.I refer.
174The options to make
175.I refer
176collect identical citations, or otherwise relocate and adjust references,
177are described in section 5.
6f0fb0bd
KD
178.NH
179Searching.
180.PP
181The indexing and searching process is divided into two phases,
182each made of two parts.
183These are
184shown below.
185.IP A.
186Construct the index.
187.RS
188.IP (1)
189Find keys \*(-- turn the input files into a sequence of tags and keys,
190where each tag identifies a distinct item in the input
191and the keys for each such item are the strings under which it is
192to be indexed.
193.IP (2)
194Hash and sort \*(--
195prepare a set of inverted indexes from which, given a set of keys,
196the appropriate item tags can be found quickly.
197.RE
198.IP B.
199Retrieve an item in response to a query.
200.RS
201.IP (3)
202Search \*(--
203Given some keys, look through the files prepared by the hashing
204and sorting facility and derive the appropriate tags.
205.IP (4)
206Deliver \*(--
207Given the tags, find the original items. This completes the
208searching process.
209.RE
210.LP
211The first phase, making the index, is presumably done relatively infrequently.
212It should, of course, be done whenever the data being
213indexed change.
214In contrast, the second phase, retrieving items,
215is presumably done often, and must be rapid.
216.PP
217An effort is made to separate code which depends on the data
218being handled from code which depends on the searching procedure.
219The search algorithm is involved only in programs
220(2) and (3), while knowledge of the actual data files is
221needed only by programs (1) and (4).
222Thus it is easy to adapt to different data files or different
223search algorithms.
224.PP
225To start with, it is necessary to have some way of selecting
226or generating keys from input files.
227For dealing with files that are basically English, we have
228a key-making program which automatically selects words
229and passes them to the hashing and sorting program (step 2).
230The format used has one line for each input item,
231arranged
232as follows:
233.DS
234name:start,length (tab) key1 key2 key3 ...
235.DE
236where
237.I name
238is the file name,
239.I start
240is the starting byte number,
241and
242.I length
243is the number of bytes in the entry.
244.PP
245These lines are the only input used to make the
246index.
247The first field (the file name, byte position, and byte count)
248is the tag of the item
249and can be used to retrieve it quickly.
250Normally, an item is either a whole file or a section of a file
251delimited by blank lines.
252After the tab, the second field contains the keys.
253The keys, if selected by the automatic program, are
254any alphanumeric strings which
255are not among the 100 most frequent words in English
256and which are not entirely numeric (except for four-digit
257numbers beginning 19, which are accepted as dates).
258Keys are truncated to six characters and converted to lower case.
259Some selection is needed if the original items are very large.
260We normally just take the first
261.I n
262keys, with
263.I n
264less than 100 or so; this replaces any attempt at intelligent selection.
265One file in our system is
266a complete English dictionary; it would presumably be retrieved for all queries.
267.PP
268To generate an inverted index to the list of record tags and keys,
269the keys
270are hashed
271and sorted to produce an index.
272What is wanted, ideally, is a series of lists showing the tags associated
273with each key.
274To condense this,
275what is actually produced is a list showing the tags associated
276with each hash code, and thus with some set of keys.
277To speed up access and further save space,
278a set of three or possibly four files is produced.
279These files are:
280.KS
281.bd 2 2
282.TS
283center;
284c c
285lI l.
286File Contents
287entry Pointers to posting file
288 for each hash code
289posting Lists of tag pointers for
290 each hash code
291tag Tags for each item
292key Keys for each item
293 (optional)
294.TE
295.bd 2
296.KE
297The posting file comprises the real data: it contains a sequence of lists
298of items posted under each hash code. To speed up searching,
299the entry file is an array of pointers into the posting file, one per potential
300hash code.
301Furthermore, the items in the lists in the posting file are not referred to by their
302complete tag, but just by an address in the tag file, which
303gives the complete tags.
304The key file is optional and contains a copy of the keys
305used in the indexing.
306.PP
307The searching process starts with a query, containing several keys.
308The goal is to obtain all items which were indexed under these keys.
309The query keys are hashed, and the pointers in the entry file used
310to access the lists in the posting file. These lists
311are addresses in the tag file of documents posted under the
312hash codes derived from the query.
313The common items from all lists are determined;
314this must include the items indexed by every key, but may also
315contain some items which are false drops, since items referenced by
316the correct hash codes need not actually have contained the correct keys.
317Normally, if there are several keys in the query, there are not
318likely to be many false drops in the final combined list even though
319each hash code is somewhat ambiguous.
320The actual tags are then obtained from the tag file, and to guard against
321the possibility that an item has false-dropped on some hash code
322in the query, the original items are normally obtained from the delivery
323program (4) and the query keys checked against them
324by string comparison.
325.PP
326Usually, therefore, the check for bad drops is made against the original file.
327However, if the key derivation procedure is complex, it may be preferable
328to check against the keys fed to program (2).
329In this case the optional key file which contains the
330keys associated with each item is generated, and the item tag is supplemented
331by a string
332.DS
333;start,length
334.DE
335which indicates the starting byte number in the key file and the length of
336the string of keys for each item.
337This file is not usually necessary with the present
338key-selection program, since the keys
339always appear in the original document.
340.PP
341There is also an option
342(\f3-C\f2n\|\f1)
343for coordination level searching.
344This retrieves items which match all but
345.I n
346of the query keys.
347The items are retrieved in the order of the number
348of keys that they match.
349Of course,
350.I n
351must be less than the number of query keys (nothing is
352retrieved unless it matches at least one key).
353.PP
354As an example, consider one set of 4377 references, comprising
355660,000 bytes.
356This included 51,000 keys, of which 5,900 were distinct
357keys.
358The hash table is kept full to save space (at the expense of time);
359995 of 997 possible hash codes were used.
360The total set of index files (no key file) included 171,000 bytes,
361about 26% of the original file size.
362It took 8 minutes of processor time to
363hash, sort, and write the index.
364To search for a single query with the resulting index took 1.9 seconds
365of processor time,
366while to find the same paper
367with a sequential linear search
368using
369.I grep
370(reading all of the tags and keys)
371took 12.3 seconds of processor time.
372.PP
373We have also used this software to index all of the English stored on our
374.UX
375system.
376This is the index searched by the
377.I lookall
378command.
379On a typical day there were
38029,000 files in our user file system, containing about 152,000,000
381bytes.
382Of these
3835,300 files, containing 32,000,000 bytes (about 21%)
384were English text.
385The total number of `words' (determined mechanically)
386was 5,100,000.
387Of these 227,000 were selected as keys;
38819,000 were distinct, hashing to 4,900 (of 5,000 possible) different hash codes.
389The
390resulting inverted file indexes used 845,000 bytes, or about
3912.6% of the size of the original files.
392The particularly small indexes are caused by the
393fact that keys are taken from only the first 50 non-common words of
394some very long input files.
395.PP
396Even this large \f2lookall\f1 index can be searched quickly.
397For example, to find this document
398by looking for the keys
399``lesk inverted indexes''
400required
4011.7 seconds of processor time
402and system time.
403By comparison, just to search the 800,000 byte dictionary (smaller than even
404the inverted indexes, let alone the 27,000,000 bytes of text files) with
405.I grep
406takes 29 seconds of processor time.
407The
408.I lookall
409program is thus useful when looking for a document which you believe
410is stored on-line, but do not know where. For example, many memos
411from our center are in the file system, but it is often
412difficult to guess where a particular memo might be (it might have several
413authors, each with many directories, and have been worked on by
414a secretary with yet more directories).
415Instructions for the use of the
416.I lookall
417command are given in the manual section, shown
418in the appendix to this memorandum.
419.PP
420The only indexes maintained routinely are those of publication lists and
421all English files.
422To make other indexes, the programs for making keys, sorting them,
423searching the indexes, and delivering answers must be used.
424Since they are usually invoked as parts of higher-level commands,
425they are not in the default command
426directory, but are available to any user in the directory
427.I /usr/lib/refer .
428Three programs are of interest:
429.I mkey ,
430which isolates keys from input files;
431.I inv ,
432which makes an index from a set of keys;
433and
434.I hunt ,
435which searches the index and delivers the items.
436Note that the two parts of the retrieval phase are combined into
437one program, to avoid the excessive system work and delay which
438would result from running these as separate processes.
439.PP
440These three commands have a large number of options to adapt to different
441kinds of input.
442The user not interested in the detailed description that now follows may
443skip to section 3, which describes the
444.I refer
445program, a packaged-up version of these tools specifically
446oriented towards formatting references.
447.PP
448.B
449Make Keys.
450.R
451The program
452.I mkey
453is the key-making program corresponding to step (1) in phase A.
454Normally, it reads its input from the file names given as arguments,
455and if there are no arguments it reads from the standard input.
456It assumes that blank lines in the input delimit
457separate items, for each of which a different line of
458keys should be generated.
459The lines of keys are written on the standard output.
460Keys are any alphanumeric string in the input not
461among the most frequent words in English and not entirely numeric
462(except that all-numeric strings are acceptable if they
463are between 1900 and 1999).
464In the output, keys are translated to lower case, and truncated
465to six characters in length; any associated punctuation is removed.
466The following flag arguments are recognized by
467.I mkey:
468.TS
469center;
470lB lw(4i).
471\-c \f2name T{
472Name of file of common words;
473default is
474.I /usr/lib/eign.
475T}
476\-f \f2name T{
477Read a list of files from
478.I name
479and take each as an input argument.
480T}
481\-i \f2chars T{
482Ignore all lines which begin with `%' followed by any character
483in
484.I chars .
485T}
486\-k\f2n T{
487Use at most
488.I n
489keys per input item.
490T}
491\-l\f2n T{
492Ignore items shorter than
493.I n
494letters long.
495T}
496\-n\f2m T{
497Ignore as a key any word in the first
498.I m
499words of the list of common English words.
500The default is 100.
501T}
502\-s T{
503Remove the labels
504.I (file:start,length)
505from the output; just give the keys.
506Used when searching rather than indexing.
507T}
508\-w T{
509Each whole file is a separate item;
510blank lines in files are irrelevant.
511T}
512.TE
513.PP
514The normal arguments for indexing references are
515the defaults, which are
516.I "\-c /usr/lib/eign" ,
517.I \-n100 ,
518and
519.I \-l3 .
520For searching, the
521.I \-s
522option is also needed.
523When the big
524.I lookall
525index of all English files is run,
526the options are
527.I \-w ,
528.I \-k50 ,
529and
530.I "\-f (filelist)" .
531When running on textual input,
532the
533.I mkey
534program processes about 1000 English words per processor second.
535Unless the
536.I \-k
537option is used (and the input files are long enough for
538it to take effect)
539the output of
540.I mkey
541is comparable in size to its input.
542.PP
543.B
544Hash and invert.
545.R
546The
547.I inv
548program computes the hash codes and writes
549the inverted files.
550It reads the output of
551.I mkey
552and writes the set of files described earlier
553in this section.
554It expects one argument, which is used as the base name for
555the three (or four) files to be written.
556Assuming an argument of
557.I Index
558(the default)
559the entry file is named
560.I Index.ia ,
561the posting file
562.I Index.ib ,
563the tag file
564.I Index.ic ,
565and the key file (if present)
566.I Index.id .
567The
568.I inv
569program recognizes the following options:
570.TS
571center;
572lB lw(4i).
573\-a T{
574Append the new keys to a previous set of inverted files,
575making new files if there is no old set using the same base name.
576T}
577\-d T{
578Write the optional key file.
579This is needed when you can not check for false drops by looking
580for the keys in the original inputs, i.e. when the key derivation
581procedure is complicated and
582the output keys are not words from the input files.
583T}
584\-h\f2n T{
585The hash table size is
586.I n
587(default 997);
588.I n
589should be prime.
590Making \f2n\f1 bigger saves search time and spends disk space.
591T}
592\-i[u] \f2name T{
593Take input from file
594.I name ,
595instead of the standard input;
596if
597.B u
598is present
599.I name
600is unlinked when the sort is started.
601Using this option permits the sort scratch space
602to overlap the disk space used for input keys.
603T}
604\-n T{
605Make a completely new set of inverted files, ignoring
606previous files.
607T}
608\-p T{
609Pipe into the sort program, rather than writing a temporary
610input file.
611This saves disk space and spends processor time.
612T}
613\-v T{
614Verbose mode; print a summary of the number of keys which
615finished indexing.
616T}
617.TE
618.PP
619About half the time used in
620.I inv
621is in the contained sort.
622Assuming the sort is roughly linear, however,
623a guess at the total timing for
624.I inv
625is 250 keys per second.
626The space used is usually of more importance:
627the entry file uses four bytes per possible hash (note
628the
629.B \-h
630option),
631and the tag file around 15-20 bytes per item indexed.
632Roughly, the posting file contains one item for each key instance
633and one item for each possible hash code; the items are two bytes
634long if the tag file is less than 65336 bytes long, and the
635items are four bytes wide if the tag file is greater than
63665536 bytes long.
637Note that to minimize storage, the hash tables should be
638over-full;
639for most of the files indexed in this way, there is no
640other real choice, since the
641.I entry
642file must fit in memory.
643.PP
644.B
645Searching and Retrieving.
646.R
647The
648.I hunt
649program retrieves items from an index.
650It combines, as mentioned above, the two parts of phase (B):
651search and delivery.
652The reason why it is efficient to combine delivery and search
653is partly to avoid starting unnecessary processes, and partly
654because the delivery operation must be a part of the search
655operation in any case.
656Because of the hashing, the search part takes place in two stages:
657first items are retrieved which have the right hash codes associated with them,
658and then the actual items are inspected to determine false drops, i.e.
659to determine if anything with the right hash codes doesn't really have the right
660keys.
661Since the original item is retrieved to check on false drops,
662it is efficient to present it immediately, rather than only
663giving the tag as output and later retrieving the
664item again.
665If there were a separate key file, this argument would not apply,
666but separate key files are not common.
667.PP
668Input to
669.I hunt
670is taken from the standard input,
671one query per line.
672Each query should be in
673.I "mkey \-s"
674output format;
675all lower case, no punctuation.
676The
677.I hunt
678program takes one argument which specifies the base name of the index
679files to be searched.
680Only one set of index files can be searched at a time,
681although many text files may be indexed as a group, of course.
682If one of the text files has been changed since the index, that file
683is searched with
684.I fgrep;
685this may occasionally slow down the searching, and care should be taken to
686avoid having many out of date files.
687The following option arguments are recognized by
688.I hunt:
689.TS
690center;
691lB lw(4i).
692\-a T{
693Give all output; ignore checking for false drops.
694T}
695\-C\f2n T{
696Coordination level
697.I n;
698retrieve items with not more than
699.I n
700terms of the input missing;
701default
702.I C0 ,
703implying that each search term must be in the output items.
704T}
705\-F[yn\f2d\f3\|] T{
706``\-Fy'' gives the text of all the items found;
707``\-Fn'' suppresses them.
708``\-F\f2d\|\f1'' where \f2d\f1\| is an integer
709gives the text of the first \f2d\f1 items.
710The default is
711.I \-Fy.
712T}
713\-g T{
714Do not use
715.I fgrep
716to search files changed since the index was made;
717print an error comment instead.
718T}
719\-i \f2string T{
720Take
721.I string
722as input, instead of reading the standard input.
723T}
724\-l \f2n T{
725The maximum length of internal lists of candidate
726items is
727.I n;
728default 1000.
729T}
730\-o \f2string T{
731Put text output (``\-Fy'') in
732.I string;
733of use
734.I only
735when
736invoked from another program.
737T}
738\-p T{
739Print hash code frequencies; mostly
740for use in optimizing hash table sizes.
741T}
742\-T[yn\f2d\|\f3] T{
743``\-Ty'' gives the tags of the items found;
744``\-Tn'' suppresses them.
745``\-T\f2d\f1\|'' where \f2d\f1\| is an integer
746gives the first \f2d\f1 tags.
747The default is
748.I \-Tn .
749T}
750\-t \f2string T{
751Put tag output (``\-Ty'') in
752.I string;
753of use
754.I only
755when invoked from another program.
756T}
757.TE
758.PP
759The timing of
760.I hunt
761is complex.
762Normally the hash table is overfull, so that there will
763be many false drops on any single term;
764but a multi-term query will have few false drops on
765all terms.
766Thus if a query is underspecified (one search term)
767many potential items will be examined and discarded as false
768drops, wasting time.
769If the query is overspecified (a dozen search terms)
770many keys will be examined only to verify that
771the single item under consideration has that key posted.
772The variation of search time with number of keys is
773shown in the table below.
774Queries of varying length were constructed to retrieve
775a particular document from the file of references.
776In the sequence to the left, search terms were chosen so as
777to select the desired paper as quickly as possible.
778In the sequence on the right, terms were chosen inefficiently,
779so that the query did not uniquely select the desired document
780until four keys had been used.
781The same document was the target in each case,
782and the final set of eight keys are also identical; the differences
783at five, six and seven keys are produced by measurement error, not
784by the slightly different key lists.
785.TS
786center;
787c s s s5 | c s s s
788cp8 cp8 cp8 cp8 | cp8 cp8 cp8 cp8
789cp8 cp8 cp8 cp8 | cp8 cp8 cp8 cp8
790n n n n | n n n n .
791Efficient Keys Inefficient Keys
792No. keys Total drops Retrieved Search time No. keys Total drops Retrieved Search time
793 (incl. false) Documents (seconds) (incl. false) Documents (seconds)
7941 15 3 1.27 1 68 55 5.96
7952 1 1 0.11 2 29 29 2.72
7963 1 1 0.14 3 8 8 0.95
7974 1 1 0.17 4 1 1 0.18
7985 1 1 0.19 5 1 1 0.21
7996 1 1 0.23 6 1 1 0.22
8007 1 1 0.27 7 1 1 0.26
8018 1 1 0.29 8 1 1 0.29
802.TE
803As would be expected, the optimal search is achieved
804when the query just specifies the answer; however,
805overspecification is quite cheap.
806Roughly, the time required by
807.I hunt
808can be approximated as
80930 milliseconds per search key plus 75 milliseconds
810per dropped document (whether it is a false drop or
811a real answer).
812In general, overspecification can be recommended;
813it protects the user against additions to the data base
814which turn previously uniquely-answered queries
815into ambiguous queries.
816.PP
817The careful reader will have noted an enormous discrepancy between these times
818and the earlier quoted time of around 1.9 seconds for a search. The times
819here are purely for the search and retrieval: they are measured by
820running many searches through a single invocation of the
821.I hunt
822program alone.
823The normal retrieval operation involves using the shell to
824set up a pipeline through
825.I mkey
826to
827.I hunt
828and starting both processes; this adds a fixed overhead of about 1.7 seconds
829of processor time
830to any single search.
831Furthermore, remember that all these times are processor times:
832on a typical morning on our \s-2PDP\s0 11/70 system, with about one dozen
833people logged on,
834to obtain 1 second of processor time for the search program
835took between 2 and 12 seconds of real time, with a median of
8363.9 seconds and a mean of 4.8 seconds.
837Thus, although the work involved in a single search may be only
838200 milliseconds, after you add the 1.7 seconds of startup processor
839time
840and then assume a 4:1 elapsed/processor time
841ratio, it will be 8 seconds before any response is printed.
842.NH
843Selecting and Formatting References for T\s-2ROFF\s0
844.PP
845The major application of the retrieval software
846is
847.I refer,
848which is a
849.I troff
850preprocessor
851like
852.I eqn .
853.[
854kernighan cherry acm 1975
855.]
856It scans its input looking for items of the form
857.DS
858\*.[
859imprecise citation
860\*.\^]
861.DE
862where an imprecise citation is merely a string
863of words found in the relevant bibliographic citation.
864This is translated into a properly formatted reference.
865If the imprecise citation does not correctly identify
866a single paper
867(either
868selecting no papers or too many) a message is given.
869The data base of citations searched may be tailored to each
870system, and individual users may specify their own
871citation
872files.
873On our system, the default data base is accumulated from
874the publication lists of the members of our organization, plus
875about half a dozen personal bibliographies that were collected.
876The present total is about 4300 citations, but this increases steadily.
877Even now,
878the data base covers a large fraction of local citations.
879.PP
880For example, the reference for the
881.I eqn
882paper above was specified as
883.DS
884\&\*.\*.\*.
885\&preprocessor like
886\&.I eqn.
887\&.[
888\&kernighan cherry acm 1975
889\&.]
890\&It scans its input looking for items
891\&\*.\*.\*.
892.DE
893This paper was itself printed using
894.I refer.
895The above input text was processed by
896.I refer
897as well as
898.I tbl
899and
900.I troff
901by the command
902.DS
903.ft I
904refer memo-file | tbl | troff \-ms
905.ft R
906.DE
907and the reference was automatically translated into a correct
908citation to the ACM paper on mathematical typesetting.
909.PP
910The procedure to use to place a reference in a paper
911using
912.I refer
913is as follows.
914First, use the
915.I lookbib
916command to check that the paper is in the data base
917and to find out what keys are necessary to retrieve it.
918This is done by typing
919.I lookbib
920and then typing some potential queries until
921a suitable query is found.
922For example, had one started to find
923the
924.I eqn
925paper shown above by presenting the query
926.DS
927 $ lookbib
928 kernighan cherry
929 (EOT)
930.DE
931.I lookbib
932would have found several items; experimentation would quickly
933have shown that the query given above is adequate.
934Overspecifying the query is of course harmless.
935A particularly careful reader may have noticed that ``acm'' does not
936appear in the printed citation;
937we have supplemented some of the data base items with common
938extra keywords, such as common abbreviations for journals
939or other sources, to aid in searching.
940.PP
941If the reference is in the data base, the query
942that retrieved it can be inserted in the text,
943between
944.B \*.[
945and
946.B \*.\^]
947brackets.
948If it is not in the data base, it can be typed
949into a private file of references, using the format
950discussed in the next section, and then
951the
952.B \-p
953option
954used to search this private file.
955Such a command might read
956(if the private references are called
957.I myfile )
958.DS
959.ft 2
960refer \-p myfile document | tbl | eqn | troff \-ms \*. \*. \*.
961.ft 1
962.DE
963where
964.I tbl
965and/or
966.I eqn
967could be omitted if not needed.
968The use
969of the
970.I \-ms
971macros
972.[
973lesk typing documents unix gcos
974.]
975or some other macro package, however,
976is essential.
977.I Refer
978only generates the data for the references; exact formatting
979is done by some macro package, and if none is supplied the
980references will not be printed.
981.PP
982By default,
983the references are numbered sequentially,
984and
985the
986.I \-ms
987macros format references as footnotes at the bottom of the page.
988This memorandum is an example of that style.
989Other possibilities are discussed in section 5 below.
990.NH
991Reference Files.
992.PP
993A reference file is a set of bibliographic references usable with
994.I refer.
995It can be indexed using the software described in section 2
996for fast searching.
997What
998.I refer
999does is to read the input document stream,
1000looking for imprecise citation references.
1001It then searches through reference files to find
1002the full citations, and inserts them into the
1003document.
1004The format of the full citation is arranged to make it
1005convenient for a macro package, such as the
1006.I \-ms
1007macros, to format the reference
1008for printing.
1009Since
1010the format of the final reference is determined
1011by the desired style of output,
1012which is determined by the macros used,
1013.I refer
1014avoids forcing any kind of reference appearance.
1015All it does is define a set of string registers which
1016contain the basic information about the reference;
1017and provide a macro call which is expanded by the macro
1018package to format the reference.
1019It is the responsibility of the final macro package
1020to see that the reference is actually printed; if no
1021macros are used, and the output of
1022.I refer
1023fed untranslated to
1024.I troff,
1025nothing at all will be printed.
1026.PP
1027The strings defined by
1028.I refer
1029are taken directly from the files of references, which
1030are in the following format.
1031The references should be separated
1032by blank lines.
1033Each reference is a sequence of lines beginning with
1034.B %
1035and followed
1036by a key-letter.
1037The remainder of that line, and successive lines until the next line beginning
1038with
1039.B % ,
1040contain the information specified by the key-letter.
1041In general,
1042.I refer
1043does not interpret the information, but merely presents
1044it to the macro package for final formatting.
1045A user with a separate macro package, for example,
1046can add new key-letters or use the existing ones for other purposes
1047without bothering
1048.I refer.
1049.PP
1050The meaning of the key-letters given below, in particular,
1051is that assigned by the
1052.I \-ms
1053macros.
1054Not all information, obviously, is used with each citation.
1055For example, if a document is both an internal memorandum and a journal article,
1056the macros ignore the memorandum version and cite only the journal article.
1057Some kinds of information are not used at all in printing the reference;
1058if a user does not like finding references by specifying title
1059or author keywords, and prefers to add specific keywords to the
1060citation, a field is available which is searched but not
1061printed (\f3K\f1).
1062.PP
1063The key letters currently recognized by
1064.I refer
1065and
1066.I \-ms,
1067with the kind of information implied, are:
1068.KS
1069.TS
1070center;
1071c c6 c c
1072c l c l.
1073Key Information specified Key Information specified
1074A Author's name N Issue number
1075B Title of book containing item O Other information
1076C City of publication P Page(s) of article
1077D Date R Technical report reference
1078E Editor of book containing item T Title
1079G Government (NTIS) ordering number V Volume number
1080I Issuer (publisher)
1081J Journal name
1082K Keys (for searching) X or
1083L Label Y or
1084M Memorandum label Z Information not used by \f2refer\f1
1085.TE
1086.KE
1087For example, a sample reference could be
1088typed as:
1089.DS
1090%T Bounds on the Complexity of the Maximal
1091Common Subsequence Problem
1092%Z ctr127
1093%A A. V. Aho
1094%A D. S. Hirschberg
1095%A J. D. Ullman
1096%J J. ACM
1097%V 23
1098%N 1
1099%P 1-12
1100.\"%M TM 75-1271-7
1101%M abcd-78
1102%D Jan. 1976
1103.DE
1104Order is irrelevant, except that authors are shown in the order
1105given. The output of
1106.I refer
1107is a stream of string definitions, one
1108for each of the fields of each reference, as
1109shown below.
1110.DS
1111\*.]-
1112\*.ds [A authors' names \*.\*.\*.
1113\*.ds [T title \*.\*.\*.
1114\*.ds [J journal \*.\*.\*.
1115\*.\*.\*.
1116\*.]\|[ type-number
1117.DE
1118The special macro
1119.B \&\*.]\-
1120precedes the string definitions
1121and the special macro
1122.B \*.]\|[
1123follows.
1124These are changed from the input
1125.B \*.[
1126and
1127.B \*.\^]
1128so that running the same file through
1129.I refer
1130again is harmless.
1131The
1132.B \*.]\-
1133macro can be used by the macro package to
1134initialize.
1135The
1136.B \*.]\|[
1137macro, which should be used
1138to print the reference, is given an
1139argument
1140.I type-number
1141to indicate the kind of reference, as follows:
1142.KS
1143.TS
1144center;
1145c c
1146n l.
1147Value Kind of reference
11481 Journal article
11492 Book
11503 Article within book
11514 Technical report
11525 Bell Labs technical memorandum
11530 Other
1154.TE
1155.KE
1156The reference is flagged in the text
1157with the sequence
1158.DS
1159\e*\|([\*.number\e*\|(\*.\^]
1160.DE
1161where
1162.I number
1163is the footnote number.
1164The strings
1165.B [\*.
1166and
1167.B \*.\^]
1168should be used by the macro package
1169to format the reference flag in the text.
1170These strings can be replaced for a particular
1171footnote, as described in section 5.
1172The footnote number (or other signal) is available
1173to the reference macro
1174.B \*.]\|[
1175as the
1176string register
1177.B [F .
1178.PP
1179In some cases users wish to suspend the searching, and merely
1180use the reference macro formatting.
1181That is, the user doesn't want to provide a search key
1182between
1183.B \*.[
1184and
1185.B \*.\^]
1186brackets, but merely
1187the reference lines for the appropriate document.
1188Alternatively, the user
1189can wish
1190to add a few fields to those in the reference
1191as in the standard file, or
1192override some fields.
1193Altering or replacing fields, or supplying whole references, is easily done
1194by inserting lines beginning
1195with
1196.B % ;
1197any such line is taken as
1198direct input to the reference
1199processor rather than keys to be searched.
1200Thus
1201.DS
1202\*.[
1203key1 key2 key3 \*.\*.\*.
1204%Q New format item
1205%R Override report name
1206\*.\^]
1207.DE
575e3d36 1208makes the indicated changes to the result of searching for
6f0fb0bd
KD
1209the keys.
1210All of the search keys must be given before the first
1211\f3%\f1 line.
1212.PP
1213If no search keys are provided, an entire citation can
1214be provided in-line in the text.
1215For example, if the
1216.I eqn
1217paper citation were to be inserted in
1218this way, rather than by searching for it in the data base,
1219the input would read
1220.DS
1221\&\*.\*.\*.
1222\&preprocessor like
1223\&.I eqn.
1224\&.[
1225\&%A B. W. Kernighan
1226\&%A L. L. Cherry
1227\&%T A System for Typesetting Mathematics
1228\&%J Comm. ACM
1229\&%V 18
1230\&%N 3
1231\&%P 151-157
1232\&%D March 1975
1233\&.]
1234\&It scans its input looking for items
1235\&\*.\*.\*.
1236.DE
1237This would produce a citation of the same appearance as that
1238resulting from the file search.
1239.PP
1240As shown, fields are normally turned into
1241.I troff
1242strings.
1243Sometimes users would rather have them defined as macros,
1244so that other
1245.I troff
1246commands can be placed into the data.
1247When this is necessary, simply double the control character
1248.B %
1249in the data.
1250Thus the input
1251.DS
1252\&.[
1253%V 23
1254%%M
1255Bell Laboratories,
1256Murray Hill, N.J. 07974
1257\&.]
1258.DE
1259is processed by
1260.I refer
1261into
1262.DS
1263\&.ds [V 23
1264\&.de [M
1265Bell Laboratories,
1266Murray Hill, N.J. 07974
1267\&..
1268.DE
1269The information after
1270.B %%M
1271is defined as a macro to be invoked by
1272.B .[M
1273while the information after
1274.B %V
1275is turned into a string to be invoked by
1276.B \e\(**([V .
1277At present
1278.I \-ms
1279expects all information as strings.
1280.NH
1281Collecting References and other Refer Options
1282.PP
1283Normally, the combination of
1284.I refer
1285and
1286.I \-ms
1287formats output as
1288.I troff
1289footnotes which are consecutively numbered and placed
1290at the bottom of the page. However,
1291options exist to
1292place the references at the end; to arrange references alphabetically
1293by senior author; and to indicate references by strings in the text of the form
1294[Name1975a]
1295rather than by number.
1296Whenever references are not placed at the bottom of a page
1297identical references are coalesced.
1298.PP
1299For example, the
1300.B \-e
1301option to
1302.I refer
1303specifies that references are to be collected; in this case
1304they are output whenever the sequence
1305.DS
1306\*.[
1307$LIST$
1308\*.\^]
1309.DE
1310is encountered.
1311Thus, to place references at the end of a paper, the user would run
1312.I refer
1313with the
1314.I \-e
1315option and place the above $LIST$ commands after the last
1316line of the text.
1317.I Refer
1318will then move all the references to that point.
1319To aid in formatting the collected references,
1320.I refer
1321writes the references preceded by the line
1322.DS
1323.B .]<
1324.DE
1325and
1326followed by the line
1327.DS
1328.B .]>
1329.DE
1330to invoke special macros before and after the references.
1331.PP
1332Another possible option to
1333.I refer
1334is the
1335.B \-s
1336option to specify
1337sorting of references. The default,
1338of course, is to list references in the order presented.
1339The
1340.B \-s
1341option implies the
1342.B \-e
1343option, and thus requires
1344a
1345.DS
1346\*.[
1347$LIST$
1348\*.\^]
1349.DE
1350entry to call out the reference list.
1351The
1352.B \-s
1353option may be followed by a string of letters, numbers, and `+' signs indicating how
1354the references are to be sorted.
1355The sort is done using the fields whose key-letters are
1356in the string as sorting keys; the numbers indicate how many
1357of the fields are to be considered, with `+'
1358taken as a large number.
1359Thus the default is
1360.B \-sAD
1361meaning ``Sort on senior author, then date.'' To
1362sort on all authors and then title, specify
1363.B \-sA+T .
1364And to sort on two authors and then the journal,
1365write
1366.B \-sA2J .
1367.PP
1368Other options to
1369.I refer
1370change the signal or label inserted in the text for each reference.
1371Normally these are just sequential numbers,
1372and their exact placement (within brackets, as superscripts, etc.) is determined
1373by the macro package.
1374The
1375.B \-l
1376option replaces reference numbers by
1377strings composed of the senior author's last name, the date,
1378and a disambiguating letter.
1379If a number follows the
1380.B l
1381as in
1382.B \-l3
1383only that many letters of the last name are used
1384in the label string.
1385To abbreviate the date as well the form
1386\f3-l\f2m,n\f1
1387shortens the last name to the
1388first
1389.I m
1390letters and the date to the
1391last
1392.I n
1393digits.
1394For example, the option
1395.B \-l3,2
1396would refer to the
1397.I eqn
1398paper (reference 3) by the signal
1399.I Ker75a ,
1400since it is the first cited reference by Kernighan in 1975.
1401.PP
1402A user wishing to specify particular labels for
1403a private bibliography may use the
1404.B \-k
1405option.
1406Specifying
1407\f3\-k\f2x\f1
1408causes the field \f2x\f1 to be used as a label.
1409The default is \f3L\f1.
1410If this field ends in \f3\-\f1, that character
1411is replaced by a sequence letter; otherwise the field
1412is used exactly as given.
1413.PP
1414If none of the
1415.I refer -produced
1416signals are desired,
1417the
1418.B \-b
1419option entirely suppresses automatic text signals.
1420.PP
1421If the user wishes to override the
1422.I \-ms
1423treatment of the reference signal (which is normally to
1424enclose the number in brackets in
1425.I nroff
1426and make it a superscript in
1427.I troff\\| )
1428this can be done easily.
1429If the lines
1430.B \&.[
1431or
1432.B \&.]
1433contain anything following these characters,
1434the remainders of these lines are used to surround
1435the reference signal, instead of the default.
1436Thus, for example, to say ``See reference (2).''
1437and avoid
1438``See reference.\s-3\u2\d\s+3'' the
1439input might appear
1440.DS
1441\&See reference
1442\&\*.[ (
1443imprecise citation ...
1444\&\*.\^])\*.
1445.DE
1446Note that blanks are significant in this construction.
1447If a permanent change is desired in the style of reference
1448signals, however, it is probably easier to redefine the strings
1449.B \&[.
1450and
1451.B \&.]
1452(which are used to bracket each signal)
1453than to change each citation.
1454.PP
1455Although normally
1456.I refer
1457limits itself to retrieving the data for the reference,
1458and leaves to a macro package the job of arranging that
1459data as required by the local format, there are two
1460special options for rearrangements that can not be
1461done by macro packages.
1462The
1463.B \-c
1464option puts fields into all upper case
1465(C\s-2APS\s+2-S\s-2MALL\s+2 C\s-2APS\s+2
1466in
1467.I troff
1468output).
1469The key-letters indicated what information is to be translated
1470to upper case follow the
1471.B c ,
1472so that
1473.B \-cAJ
1474means that authors' names and journals are to be in caps.
1475The
1476.B \-a
1477option writes the names of authors last name first, that is
1478.I "A. D. Hall, Jr."
1479is written as
1480.I "Hall, A. D. Jr" .
1481The citation form of
1482the
1483.I "Journal of the ACM" ,
1484for example, would require
1485both
1486.B \-cA
1487and
1488.B \-a
1489options.
1490This produces authors' names in the style
1491.I
1492K\s-2ERNIGHAN\s0, B. W. \s-2AND\s0 C\s-2HERRY\s0, L. L.\&
1493.R
1494for the previous example.
1495The
1496.B \-a
1497option may be followed by a number to indicate how many
1498author names should be reversed;
1499.B \-a1
1500(without any
1501.B \-c
1502option)
1503would produce
1504.I
1505Kernighan, B. W. and L. L. Cherry,
1506.R
1507for example.
1508.PP
1509Finally, there is also the previously-mentioned
1510.B \-p
1511option to let the user specify
1512a private file of references to be searched before the public files.
1513Note that
1514.I refer
1515does not insist on a previously made index for these files.
1516If a file is named which contains reference
1517data but is not indexed, it will be searched
1518(more slowly)
1519by
1520.I refer
1521using
1522.I fgrep.
1523In this way
1524it is easy for users to keep small files of
1525new references, which can later be added to the
1526public data bases.
1527.SG MH-1274-MEL-\s8UNIX\s0