BSD 4_1_snap development
[unix-history] / usr / doc / refer / refer
CommitLineData
f2a0d81d
C
1.... refer | tbl | nroff -ms
2.de UC
3\\s-2\\$1\\s0\\$2
4..
5.ds . \&\s+2.\s0
6.if t .ds -- \(em
7.if n .ds -- --
8.TR 69
9.TM 77-1274-17 39199 39199-11
10.ND October 27, 1977
11.ND June 21, 1978
12.TL
13Some Applications of Inverted Indexes on the UNIX System
14.AU "MH 2C-572" 6377
15M. E. Lesk
16.AI
17.MH
18.AB
19.LP
20.ft B
21I. Some Applications of Inverted Indexes \- Overview
22.ft R
23.PP
24This memorandum describes a set of programs which
25make inverted indexes to
26UNIX*
27text files, and their
28application to
29retrieving and formatting citations for documents prepared using
30.I troff.
31.PP
32The indexing and searching programs make keyword
33indexes to volumes of material too large for linear searching.
34Searches for combinations of single words can be performed quickly.
35The programs for general searching are divided into
36two phases. The first makes an index from the original
37data; the second searches the index and retrieves
38items.
39Both of these phases are further divided into two parts
40to separate the data-dependent and algorithm dependent
41code.
42.PP
43The major current application of these programs is
44the
45.I troff
46preprocessor
47.I refer.
48A list of 4300 references is maintained on line,
49containing primarily papers written and cited by
50local authors.
51Whenever one of these references is required
52in a paper, a few words from the title or author list
53will retrieve it, and the user need not bother to re-enter
54the exact citation.
55Alternatively, authors can use their own lists of papers.
56.PP
57This memorandum is of interest to
58those who are interested in facilities for searching large
59but relatively unchanging text files on
60the
61UNIX
62system,
63and those who are interested in handling bibliographic
64citations with
65UNIX
66.I troff.
67.LP
68.ft B
69II. Updating Publication Lists
70.PP
71This section is a brief note describing the
72auxiliary programs for managing the updating
73processing.
74It is written to aid clerical users in
75maintaining lists of references.
76Primarily, the programs described permit a large
77amount of individual control over the content
78of publication lists while retaining the
79usefulness of the files to other users.
80.LP
81.ft B
82III. Manual Pages
83.PP
84This section contains the pages from the
85UNIX programmer's manual
86dealing with these commands.
87It is useful for reference.
88.sp
89\l'3i'
90.br
91* UNIX is a Trademark of Bell Laboratories.
92.AE
93.CS 10 4 14 0 0 4
94.NH
95Introduction.
96.PP
97The
98.UX
99system
100has many utilities
101(e.g. \fIgrep, awk, lex, egrep, fgrep, ...\fR)
102to search through files of text,
103but most of them are based on a linear scan through the
104entire file, using some deterministic automaton.
105.ev 1
106.ps 8
107.vs 10p
108.ev
109This memorandum discusses a program which uses inverted
110indexes
111.[
112%A D. Knuth
113%T The Art of Computer Programming: Vol. 3, Sorting and Searching
114%I Addison-Wesley
115%C Reading, Mass.
116%D 1977
117%O See section 6.5.
118.]
119and can thus be used on much larger data bases.
120.PP
121As with any indexing system, of course, there are some disadvantages;
122once an index is made, the files that have been indexed can not be changed
123without remaking the index.
124Thus applications are restricted
125to those making many searches
126of relatively stable data.
127Furthermore, these programs depend on hashing, and can only
128search for exact matches of whole keywords.
129It is not possible to look for
130arithmetic or logical expressions (e.g. ``date greater than 1970'') or
131for regular expression searching such as that in
132.I lex.
133.[
134lex lesk cstr
135.]
136.PP
137Currently there are two uses of this software,
138the
139.I refer
140preprocessor to format references,
141and the
142.I lookall
143command to search through all text files on
144the
145.UX
146system.
147.PP
148The remaining sections of this memorandum discuss
149the searching programs and their uses.
150Section 2 explains the operation of the searching algorithm and describes
151the data collected for use with the
152.I lookall
153command.
154The more important application,
155.I refer
156has a user's description in section 3.
157Section 4 goes into more detail on
158reference files
159for the benefit of those who
160wish to add references to data bases or
161write new
162.I troff
163macros for use with
164.I refer.
165The options to make
166.I refer
167collect identical citations, or otherwise relocate and adjust references,
168are described in section 5.
169The
170.UX
171manual sections for
172.I "refer, lookall,"
173and associated commands are attached as appendices.
174.NH
175Searching.
176.PP
177The indexing and searching process is divided into two phases,
178each made of two parts.
179These are
180shown below.
181.IP A.
182Construct the index.
183.RS
184.IP (1)
185Find keys \*(-- turn the input files into a sequence of tags and keys,
186where each tag identifies a distinct item in the input
187and the keys for each such item are the strings under which it is
188to be indexed.
189.IP (2)
190Hash and sort \*(--
191prepare a set of inverted indexes from which, given a set of keys,
192the appropriate item tags can be found quickly.
193.RE
194.IP B.
195Retrieve an item in response to a query.
196.RS
197.IP (3)
198Search \*(--
199Given some keys, look through the files prepared by the hashing
200and sorting facility and derive the appropriate tags.
201.IP (4)
202Deliver \*(--
203Given the tags, find the original items. This completes the
204searching process.
205.RE
206.LP
207The first phase, making the index, is presumably done relatively infrequently.
208It should, of course, be done whenever the data being
209indexed change.
210In contrast, the second phase, retrieving items,
211is presumably done often, and must be rapid.
212.PP
213An effort is made to separate code which depends on the data
214being handled from code which depends on the searching procedure.
215The search algorithm is involved only in programs
216(2) and (3), while knowledge of the actual data files is
217needed only by programs (1) and (4).
218Thus it is easy to adapt to different data files or different
219search algorithms.
220.PP
221To start with, it is necessary to have some way of selecting
222or generating keys from input files.
223For dealing with files that are basically English, we have
224a key-making program which automatically selects words
225and passes them to the hashing and sorting program (step 2).
226The format used has one line for each input item,
227arranged
228as follows:
229.DS
230name:start,length (tab) key1 key2 key3 ...
231.DE
232where
233.I name
234is the file name,
235.I start
236is the starting byte number,
237and
238.I length
239is the number of bytes in the entry.
240.PP
241These lines are the only input used to make the
242index.
243The first field (the file name, byte position, and byte count)
244is the tag of the item
245and can be used to retrieve it quickly.
246Normally, an item is either a whole file or a section of a file
247delimited by blank lines.
248After the tab, the second field contains the keys.
249The keys, if selected by the automatic program, are
250any alphanumeric strings which
251are not among the 100 most frequent words in English
252and which are not entirely numeric (except for four-digit
253numbers beginning 19, which are accepted as dates).
254Keys are truncated to six characters and converted to lower case.
255Some selection is needed if the original items are very large.
256We normally just take the first
257.I n
258keys, with
259.I n
260less than 100 or so; this replaces any attempt at intelligent selection.
261One file in our system is
262a complete English dictionary; it would presumably be retrieved for all queries.
263.PP
264To generate an inverted index to the list of record tags and keys,
265the keys
266are hashed
267and sorted to produce an index.
268What is wanted, ideally, is a series of lists showing the tags associated
269with each key.
270To condense this,
271what is actually produced is a list showing the tags associated
272with each hash code, and thus with some set of keys.
273To speed up access and further save space,
274a set of three or possibly four files is produced.
275These files are:
276.KS
277.bd 2 2
278.TS
279center;
280c c
281lI l.
282File Contents
283entry Pointers to posting file
284 for each hash code
285posting Lists of tag pointers for
286 each hash code
287tag Tags for each item
288key Keys for each item
289 (optional)
290.TE
291.bd 2
292.KE
293The posting file comprises the real data: it contains a sequence of lists
294of items posted under each hash code. To speed up searching,
295the entry file is an array of pointers into the posting file, one per potential
296hash code.
297Furthermore, the items in the lists in the posting file are not referred to by their
298complete tag, but just by an address in the tag file, which
299gives the complete tags.
300The key file is optional and contains a copy of the keys
301used in the indexing.
302.PP
303The searching process starts with a query, containing several keys.
304The goal is to obtain all items which were indexed under these keys.
305The query keys are hashed, and the pointers in the entry file used
306to access the lists in the posting file. These lists
307are addresses in the tag file of documents posted under the
308hash codes derived from the query.
309The common items from all lists are determined;
310this must include the items indexed by every key, but may also
311contain some items which are false drops, since items referenced by
312the correct hash codes need not actually have contained the correct keys.
313Normally, if there are several keys in the query, there are not
314likely to be many false drops in the final combined list even though
315each hash code is somewhat ambiguous.
316The actual tags are then obtained from the tag file, and to guard against
317the possibility that an item has false-dropped on some hash code
318in the query, the original items are normally obtained from the delivery
319program (4) and the query keys checked against them
320by string comparison.
321.PP
322Usually, therefore, the check for bad drops is made against the original file.
323However, if the key derivation procedure is complex, it may be preferable
324to check against the keys fed to program (2).
325In this case the optional key file which contains the
326keys associated with each item is generated, and the item tag is supplemented
327by a string
328.DS
329;start,length
330.DE
331which indicates the starting byte number in the key file and the length of
332the string of keys for each item.
333This file is not usually necessary with the present
334key-selection program, since the keys
335always appear in the original document.
336.PP
337There is also an option
338(\f3-C\f2n\|\f1)
339for coordination level searching.
340This retrieves items which match all but
341.I n
342of the query keys.
343The items are retrieved in the order of the number
344of keys that they match.
345Of course,
346.I n
347must be less than the number of query keys (nothing is
348retrieved unless it matches at least one key).
349.PP
350As an example, consider one set of 4377 references, comprising
351660,000 bytes.
352This included 51,000 keys, of which 5,900 were distinct
353keys.
354The hash table is kept full to save space (at the expense of time);
355995 of 997 possible hash codes were used.
356The total set of index files (no key file) included 171,000 bytes,
357about 26% of the original file size.
358It took 8 minutes of processor time to
359hash, sort, and write the index.
360To search for a single query with the resulting index took 1.9 seconds
361of processor time,
362while to find the same paper
363with a sequential linear search
364using
365.I grep
366(reading all of the tags and keys)
367took 12.3 seconds of processor time.
368.PP
369We have also used this software to index all of the English stored on our
370.UX
371system.
372This is the index searched by the
373.I lookall
374command.
375On a typical day there were
37629,000 files in our user file system, containing about 152,000,000
377bytes.
378Of these
3795,300 files, containing 32,000,000 bytes (about 21%)
380were English text.
381The total number of `words' (determined mechanically)
382was 5,100,000.
383Of these 227,000 were selected as keys;
38419,000 were distinct, hashing to 4,900 (of 5,000 possible) different hash codes.
385The
386resulting inverted file indexes used 845,000 bytes, or about
3872.6% of the size of the original files.
388The particularly small indexes are caused by the
389fact that keys are taken from only the first 50 non-common words of
390some very long input files.
391.PP
392Even this large \f2lookall\f1 index can be searched quickly.
393For example, to find this document
394by looking for the keys
395``lesk inverted indexes''
396required
3971.7 seconds of processor time
398and system time.
399By comparison, just to search the 800,000 byte dictionary (smaller than even
400the inverted indexes, let alone the 27,000,000 bytes of text files) with
401.I grep
402takes 29 seconds of processor time.
403The
404.I lookall
405program is thus useful when looking for a document which you believe
406is stored on-line, but do not know where. For example, many memos
407from our center are in the file system, but it is often
408difficult to guess where a particular memo might be (it might have several
409authors, each with many directories, and have been worked on by
410a secretary with yet more directories).
411Instructions for the use of the
412.I lookall
413command are given in the manual section, shown
414in the appendix to this memorandum.
415.PP
416The only indexes maintained routinely are those of publication lists and
417all English files.
418To make other indexes, the programs for making keys, sorting them,
419searching the indexes, and delivering answers must be used.
420Since they are usually invoked as parts of higher-level commands,
421they are not in the default command
422directory, but are available to any user in the directory
423.I /usr/lib/refer .
424Three programs are of interest:
425.I mkey ,
426which isolates keys from input files;
427.I inv ,
428which makes an index from a set of keys;
429and
430.I hunt ,
431which searches the index and delivers the items.
432Note that the two parts of the retrieval phase are combined into
433one program, to avoid the excessive system work and delay which
434would result from running these as separate processes.
435.PP
436These three commands have a large number of options to adapt to different
437kinds of input.
438The user not interested in the detailed description that now follows may
439skip to section 3, which describes the
440.I refer
441program, a packaged-up version of these tools specifically
442oriented towards formatting references.
443.PP
444.B
445Make Keys.
446.R
447The program
448.I mkey
449is the key-making program corresponding to step (1) in phase A.
450Normally, it reads its input from the file names given as arguments,
451and if there are no arguments it reads from the standard input.
452It assumes that blank lines in the input delimit
453separate items, for each of which a different line of
454keys should be generated.
455The lines of keys are written on the standard output.
456Keys are any alphanumeric string in the input not
457among the most frequent words in English and not entirely numeric
458(except that all-numeric strings are acceptable if they
459are between 1900 and 1999).
460In the output, keys are translated to lower case, and truncated
461to six characters in length; any associated punctuation is removed.
462The following flag arguments are recognized by
463.I mkey:
464.TS
465center;
466lB lw(4i).
467\-c \f2name T{
468Name of file of common words;
469default is
470.I /usr/lib/eign.
471T}
472\-f \f2name T{
473Read a list of files from
474.I name
475and take each as an input argument.
476T}
477\-i \f2chars T{
478Ignore all lines which begin with `%' followed by any character
479in
480.I chars .
481T}
482\-k\f2n T{
483Use at most
484.I n
485keys per input item.
486T}
487\-l\f2n T{
488Ignore items shorter than
489.I n
490letters long.
491T}
492\-n\f2m T{
493Ignore as a key any word in the first
494.I m
495words of the list of common English words.
496The default is 100.
497T}
498\-s T{
499Remove the labels
500.I (file:start,length)
501from the output; just give the keys.
502Used when searching rather than indexing.
503T}
504\-w T{
505Each whole file is a separate item;
506blank lines in files are irrelevant.
507T}
508.TE
509.PP
510The normal arguments for indexing references are
511the defaults, which are
512.I "\-c /usr/lib/eign" ,
513.I \-n100 ,
514and
515.I \-l3 .
516For searching, the
517.I \-s
518option is also needed.
519When the big
520.I lookall
521index of all English files is run,
522the options are
523.I \-w ,
524.I \-k50 ,
525and
526.I "\-f (filelist)" .
527When running on textual input,
528the
529.I mkey
530program processes about 1000 English words per processor second.
531Unless the
532.I \-k
533option is used (and the input files are long enough for
534it to take effect)
535the output of
536.I mkey
537is comparable in size to its input.
538.PP
539.B
540Hash and invert.
541.R
542The
543.I inv
544program computes the hash codes and writes
545the inverted files.
546It reads the output of
547.I mkey
548and writes the set of files described earlier
549in this section.
550It expects one argument, which is used as the base name for
551the three (or four) files to be written.
552Assuming an argument of
553.I Index
554(the default)
555the entry file is named
556.I Index.ia ,
557the posting file
558.I Index.ib ,
559the tag file
560.I Index.ic ,
561and the key file (if present)
562.I Index.id .
563The
564.I inv
565program recognizes the following options:
566.TS
567center;
568lB lw(4i).
569\-a T{
570Append the new keys to a previous set of inverted files,
571making new files if there is no old set using the same base name.
572T}
573\-d T{
574Write the optional key file.
575This is needed when you can not check for false drops by looking
576for the keys in the original inputs, i.e. when the key derivation
577procedure is complicated and
578the output keys are not words from the input files.
579T}
580\-h\f2n T{
581The hash table size is
582.I n
583(default 997);
584.I n
585should be prime.
586Making \f2n\f1 bigger saves search time and spends disk space.
587T}
588\-i[u] \f2name T{
589Take input from file
590.I name ,
591instead of the standard input;
592if
593.B u
594is present
595.I name
596is unlinked when the sort is started.
597Using this option permits the sort scratch space
598to overlap the disk space used for input keys.
599T}
600\-n T{
601Make a completely new set of inverted files, ignoring
602previous files.
603T}
604\-p T{
605Pipe into the sort program, rather than writing a temporary
606input file.
607This saves disk space and spends processor time.
608T}
609\-v T{
610Verbose mode; print a summary of the number of keys which
611finished indexing.
612T}
613.TE
614.PP
615About half the time used in
616.I inv
617is in the contained sort.
618Assuming the sort is roughly linear, however,
619a guess at the total timing for
620.I inv
621is 250 keys per second.
622The space used is usually of more importance:
623the entry file uses four bytes per possible hash (note
624the
625.B \-h
626option),
627and the tag file around 15-20 bytes per item indexed.
628Roughly, the posting file contains one item for each key instance
629and one item for each possible hash code; the items are two bytes
630long if the tag file is less than 65336 bytes long, and the
631items are four bytes wide if the tag file is greater than
63265536 bytes long.
633Note that to minimize storage, the hash tables should be
634over-full;
635for most of the files indexed in this way, there is no
636other real choice, since the
637.I entry
638file must fit in memory.
639.PP
640.B
641Searching and Retrieving.
642.R
643The
644.I hunt
645program retrieves items from an index.
646It combines, as mentioned above, the two parts of phase (B):
647search and delivery.
648The reason why it is efficient to combine delivery and search
649is partly to avoid starting unnecessary processes, and partly
650because the delivery operation must be a part of the search
651operation in any case.
652Because of the hashing, the search part takes place in two stages:
653first items are retrieved which have the right hash codes associated with them,
654and then the actual items are inspected to determine false drops, i.e.
655to determine if anything with the right hash codes doesn't really have the right
656keys.
657Since the original item is retrieved to check on false drops,
658it is efficient to present it immediately, rather than only
659giving the tag as output and later retrieving the
660item again.
661If there were a separate key file, this argument would not apply,
662but separate key files are not common.
663.PP
664Input to
665.I hunt
666is taken from the standard input,
667one query per line.
668Each query should be in
669.I "mkey \-s"
670output format;
671all lower case, no punctuation.
672The
673.I hunt
674program takes one argument which specifies the base name of the index
675files to be searched.
676Only one set of index files can be searched at a time,
677although many text files may be indexed as a group, of course.
678If one of the text files has been changed since the index, that file
679is searched with
680.I fgrep;
681this may occasionally slow down the searching, and care should be taken to
682avoid having many out of date files.
683The following option arguments are recognized by
684.I hunt:
685.TS
686center;
687lB lw(4i).
688\-a T{
689Give all output; ignore checking for false drops.
690T}
691\-C\f2n T{
692Coordination level
693.I n;
694retrieve items with not more than
695.I n
696terms of the input missing;
697default
698.I C0 ,
699implying that each search term must be in the output items.
700T}
701\-F[yn\f2d\f3\|] T{
702``\-Fy'' gives the text of all the items found;
703``\-Fn'' suppresses them.
704``\-F\f2d\|\f1'' where \f2d\f1\| is an integer
705gives the text of the first \f2d\f1 items.
706The default is
707.I \-Fy.
708T}
709\-g T{
710Do not use
711.I fgrep
712to search files changed since the index was made;
713print an error comment instead.
714T}
715\-i \f2string T{
716Take
717.I string
718as input, instead of reading the standard input.
719T}
720\-l \f2n T{
721The maximum length of internal lists of candidate
722items is
723.I n;
724default 1000.
725T}
726\-o \f2string T{
727Put text output (``\-Fy'') in
728.I string;
729of use
730.I only
731when
732invoked from another program.
733T}
734\-p T{
735Print hash code frequencies; mostly
736for use in optimizing hash table sizes.
737T}
738\-T[yn\f2d\|\f3] T{
739``\-Ty'' gives the tags of the items found;
740``\-Tn'' suppresses them.
741``\-T\f2d\f1\|'' where \f2d\f1\| is an integer
742gives the first \f2d\f1 tags.
743The default is
744.I \-Tn .
745T}
746\-t \f2string T{
747Put tag output (``\-Ty'') in
748.I string;
749of use
750.I only
751when invoked from another program.
752T}
753.TE
754.PP
755The timing of
756.I hunt
757is complex.
758Normally the hash table is overfull, so that there will
759be many false drops on any single term;
760but a multi-term query will have few false drops on
761all terms.
762Thus if a query is underspecified (one search term)
763many potential items will be examined and discarded as false
764drops, wasting time.
765If the query is overspecified (a dozen search terms)
766many keys will be examined only to verify that
767the single item under consideration has that key posted.
768The variation of search time with number of keys is
769shown in the table below.
770Queries of varying length were constructed to retrieve
771a particular document from the file of references.
772In the sequence to the left, search terms were chosen so as
773to select the desired paper as quickly as possible.
774In the sequence on the right, terms were chosen inefficiently,
775so that the query did not uniquely select the desired document
776until four keys had been used.
777The same document was the target in each case,
778and the final set of eight keys are also identical; the differences
779at five, six and seven keys are produced by measurement error, not
780by the slightly different key lists.
781.TS
782center;
783c s s s5 | c s s s
784cp8 cp8 cp8 cp8 | cp8 cp8 cp8 cp8
785cp8 cp8 cp8 cp8 | cp8 cp8 cp8 cp8
786n n n n | n n n n .
787Efficient Keys Inefficient Keys
788No. keys Total drops Retrieved Search time No. keys Total drops Retrieved Search time
789 (incl. false) Documents (seconds) (incl. false) Documents (seconds)
7901 15 3 1.27 1 68 55 5.96
7912 1 1 0.11 2 29 29 2.72
7923 1 1 0.14 3 8 8 0.95
7934 1 1 0.17 4 1 1 0.18
7945 1 1 0.19 5 1 1 0.21
7956 1 1 0.23 6 1 1 0.22
7967 1 1 0.27 7 1 1 0.26
7978 1 1 0.29 8 1 1 0.29
798.TE
799As would be expected, the optimal search is achieved
800when the query just specifies the answer; however,
801overspecification is quite cheap.
802Roughly, the time required by
803.I hunt
804can be approximated as
80530 milliseconds per search key plus 75 milliseconds
806per dropped document (whether it is a false drop or
807a real answer).
808In general, overspecification can be recommended;
809it protects the user against additions to the data base
810which turn previously uniquely-answered queries
811into ambiguous queries.
812.PP
813The careful reader will have noted an enormous discrepancy between these times
814and the earlier quoted time of around 1.9 seconds for a search. The times
815here are purely for the search and retrieval: they are measured by
816running many searches through a single invocation of the
817.I hunt
818program alone.
819The normal retrieval operation involves using the shell to
820set up a pipeline through
821.I mkey
822to
823.I hunt
824and starting both processes; this adds a fixed overhead of about 1.7 seconds
825of processor time
826to any single search.
827Furthermore, remember that all these times are processor times:
828on a typical morning on our \s-2PDP\s0 11/70 system, with about one dozen
829people logged on,
830to obtain 1 second of processor time for the search program
831took between 2 and 12 seconds of real time, with a median of
8323.9 seconds and a mean of 4.8 seconds.
833Thus, although the work involved in a single search may be only
834200 milliseconds, after you add the 1.7 seconds of startup processor
835time
836and then assume a 4:1 elapsed/processor time
837ratio, it will be 8 seconds before any response is printed.
838.NH
839Selecting and Formatting References for T\s-2ROFF\s0
840.PP
841The major application of the retrieval software
842is
843.I refer,
844which is a
845.I troff
846preprocessor
847like
848.I eqn .
849.[
850kernighan cherry acm 1975
851.]
852It scans its input looking for items of the form
853.DS
854\*.[
855imprecise citation
856\*.\^]
857.DE
858where an imprecise citation is merely a string
859of words found in the relevant bibliographic citation.
860This is translated into a properly formatted reference.
861If the imprecise citation does not correctly identify
862a single paper
863(either
864selecting no papers or too many) a message is given.
865The data base of citations searched may be tailored to each
866system, and individual users may specify their own
867citation
868files.
869On our system, the default data base is accumulated from
870the publication lists of the members of our organization, plus
871about half a dozen personal bibliographies that were collected.
872The present total is about 4300 citations, but this increases steadily.
873Even now,
874the data base covers a large fraction of local citations.
875.PP
876For example, the reference for the
877.I eqn
878paper above was specified as
879.DS
880\&\*.\*.\*.
881\&preprocessor like
882\&.I eqn.
883\&.[
884\&kernighan cherry acm 1975
885\&.]
886\&It scans its input looking for items
887\&\*.\*.\*.
888.DE
889This paper was itself printed using
890.I refer.
891The above input text was processed by
892.I refer
893as well as
894.I tbl
895and
896.I troff
897by the command
898.DS
899.ft I
900refer memo-file | tbl | troff \-ms
901.ft R
902.DE
903and the reference was automatically translated into a correct
904citation to the ACM paper on mathematical typesetting.
905.PP
906The procedure to use to place a reference in a paper
907using
908.I refer
909is as follows.
910First, use the
911.I lookbib
912command to check that the paper is in the data base
913and to find out what keys are necessary to retrieve it.
914This is done by typing
915.I lookbib
916and then typing some potential queries until
917a suitable query is found.
918For example, had one started to find
919the
920.I eqn
921paper shown above by presenting the query
922.DS
923 $ lookbib
924 kernighan cherry
925 (EOT)
926.DE
927.I lookbib
928would have found several items; experimentation would quickly
929have shown that the query given above is adequate.
930Overspecifying the query is of course harmless.
931A particularly careful reader may have noticed that ``acm'' does not
932appear in the printed citation;
933we have supplemented some of the data base items with common
934extra keywords, such as common abbreviations for journals
935or other sources, to aid in searching.
936.PP
937If the reference is in the data base, the query
938that retrieved it can be inserted in the text,
939between
940.B \*.[
941and
942.B \*.\^]
943brackets.
944If it is not in the data base, it can be typed
945into a private file of references, using the format
946discussed in the next section, and then
947the
948.B \-p
949option
950used to search this private file.
951Such a command might read
952(if the private references are called
953.I myfile )
954.DS
955.ft 2
956refer \-p myfile document | tbl | eqn | troff \-ms \*. \*. \*.
957.ft 1
958.DE
959where
960.I tbl
961and/or
962.I eqn
963could be omitted if not needed.
964The use
965of the
966.I \-ms
967macros
968.[
969lesk typing documents unix gcos
970.]
971or some other macro package, however,
972is essential.
973.I Refer
974only generates the data for the references; exact formatting
975is done by some macro package, and if none is supplied the
976references will not be printed.
977.PP
978By default,
979the references are numbered sequentially,
980and
981the
982.I \-ms
983macros format references as footnotes at the bottom of the page.
984This memorandum is an example of that style.
985Other possibilities are discussed in section 5 below.
986.NH
987Reference Files.
988.PP
989A reference file is a set of bibliographic references usable with
990.I refer.
991It can be indexed using the software described in section 2
992for fast searching.
993What
994.I refer
995does is to read the input document stream,
996looking for imprecise citation references.
997It then searches through reference files to find
998the full citations, and inserts them into the
999document.
1000The format of the full citation is arranged to make it
1001convenient for a macro package, such as the
1002.I \-ms
1003macros, to format the reference
1004for printing.
1005Since
1006the format of the final reference is determined
1007by the desired style of output,
1008which is determined by the macros used,
1009.I refer
1010avoids forcing any kind of reference appearance.
1011All it does is define a set of string registers which
1012contain the basic information about the reference;
1013and provide a macro call which is expanded by the macro
1014package to format the reference.
1015It is the responsibility of the final macro package
1016to see that the reference is actually printed; if no
1017macros are used, and the output of
1018.I refer
1019fed untranslated to
1020.I troff,
1021nothing at all will be printed.
1022.PP
1023The strings defined by
1024.I refer
1025are taken directly from the files of references, which
1026are in the following format.
1027The references should be separated
1028by blank lines.
1029Each reference is a sequence of lines beginning with
1030.B %
1031and followed
1032by a key-letter.
1033The remainder of that line, and successive lines until the next line beginning
1034with
1035.B % ,
1036contain the information specified by the key-letter.
1037In general,
1038.I refer
1039does not interpret the information, but merely presents
1040it to the macro package for final formatting.
1041A user with a separate macro package, for example,
1042can add new key-letters or use the existing ones for other purposes
1043without bothering
1044.I refer.
1045.PP
1046The meaning of the key-letters given below, in particular,
1047is that assigned by the
1048.I \-ms
1049macros.
1050Not all information, obviously, is used with each citation.
1051For example, if a document is both an internal memorandum and a journal article,
1052the macros ignore the memorandum version and cite only the journal article.
1053Some kinds of information are not used at all in printing the reference;
1054if a user does not like finding references by specifying title
1055or author keywords, and prefers to add specific keywords to the
1056citation, a field is available which is searched but not
1057printed (\f3K\f1).
1058.PP
1059The key letters currently recognized by
1060.I refer
1061and
1062.I \-ms,
1063with the kind of information implied, are:
1064.KS
1065.TS
1066center;
1067c c6 c c
1068c l c l.
1069Key Information specified Key Information specified
1070A Author's name N Issue number
1071B Title of book containing item O Other information
1072C City of publication P Page(s) of article
1073D Date R Technical report reference
1074E Editor of book containing item T Title
1075G Government (NTIS) ordering number V Volume number
1076I Issuer (publisher)
1077J Journal name
1078K Keys (for searching) X or
1079L Label Y or
1080M Memorandum label Z Information not used by \f2refer\f1
1081.TE
1082.KE
1083For example, a sample reference could be
1084typed as:
1085.DS
1086%T Bounds on the Complexity of the Maximal
1087Common Subsequence Problem
1088%Z ctr127
1089%A A. V. Aho
1090%A D. S. Hirschberg
1091%A J. D. Ullman
1092%J J. ACM
1093%V 23
1094%N 1
1095%P 1-12
1096.\"%M TM 75-1271-7
1097%M abcd-78
1098%D Jan. 1976
1099.DE
1100Order is irrelevant, except that authors are shown in the order
1101given. The output of
1102.I refer
1103is a stream of string definitions, one
1104for each of the fields of each reference, as
1105shown below.
1106.DS
1107\*.]-
1108\*.ds [A authors' names \*.\*.\*.
1109\*.ds [T title \*.\*.\*.
1110\*.ds [J journal \*.\*.\*.
1111\*.\*.\*.
1112\*.]\|[ type-number
1113.DE
1114The special macro
1115.B \&\*.]\-
1116precedes the string definitions
1117and the special macro
1118.B \*.]\|[
1119follows.
1120These are changed from the input
1121.B \*.[
1122and
1123.B \*.\^]
1124so that running the same file through
1125.I refer
1126again is harmless.
1127The
1128.B \*.]\-
1129macro can be used by the macro package to
1130initialize.
1131The
1132.B \*.]\|[
1133macro, which should be used
1134to print the reference, is given an
1135argument
1136.I type-number
1137to indicate the kind of reference, as follows:
1138.KS
1139.TS
1140center;
1141c c
1142n l.
1143Value Kind of reference
11441 Journal article
11452 Book
11463 Article within book
11474 Technical report
11485 Bell Labs technical memorandum
11490 Other
1150.TE
1151.KE
1152The reference is flagged in the text
1153with the sequence
1154.DS
1155\e*\|([\*.number\e*\|(\*.\^]
1156.DE
1157where
1158.I number
1159is the footnote number.
1160The strings
1161.B [\*.
1162and
1163.B \*.\^]
1164should be used by the macro package
1165to format the reference flag in the text.
1166These strings can be replaced for a particular
1167footnote, as described in section 5.
1168The footnote number (or other signal) is available
1169to the reference macro
1170.B \*.]\|[
1171as the
1172string register
1173.B [F .
1174.PP
1175In some cases users wish to suspend the searching, and merely
1176use the reference macro formatting.
1177That is, the user doesn't want to provide a search key
1178between
1179.B \*.[
1180and
1181.B \*.\^]
1182brackets, but merely
1183the reference lines for the appropriate document.
1184Alternatively, the user
1185can wish
1186to add a few fields to those in the reference
1187as in the standard file, or
1188override some fields.
1189Altering or replacing fields, or supplying whole references, is easily done
1190by inserting lines beginning
1191with
1192.B % ;
1193any such line is taken as
1194direct input to the reference
1195processor rather than keys to be searched.
1196Thus
1197.DS
1198\*.[
1199key1 key2 key3 \*.\*.\*.
1200%Q New format item
1201%R Override report name
1202\*.\^]
1203.DE
1204makes the indicates changes to the result of searching for
1205the keys.
1206All of the search keys must be given before the first
1207\f3%\f1 line.
1208.PP
1209If no search keys are provided, an entire citation can
1210be provided in-line in the text.
1211For example, if the
1212.I eqn
1213paper citation were to be inserted in
1214this way, rather than by searching for it in the data base,
1215the input would read
1216.DS
1217\&\*.\*.\*.
1218\&preprocessor like
1219\&.I eqn.
1220\&.[
1221\&%A B. W. Kernighan
1222\&%A L. L. Cherry
1223\&%T A System for Typesetting Mathematics
1224\&%J Comm. ACM
1225\&%V 18
1226\&%N 3
1227\&%P 151-157
1228\&%D March 1975
1229\&.]
1230\&It scans its input looking for items
1231\&\*.\*.\*.
1232.DE
1233This would produce a citation of the same appearance as that
1234resulting from the file search.
1235.PP
1236As shown, fields are normally turned into
1237.I troff
1238strings.
1239Sometimes users would rather have them defined as macros,
1240so that other
1241.I troff
1242commands can be placed into the data.
1243When this is necessary, simply double the control character
1244.B %
1245in the data.
1246Thus the input
1247.DS
1248\&.[
1249%V 23
1250%%M
1251Bell Laboratories,
1252Murray Hill, N.J. 07974
1253\&.]
1254.DE
1255is processed by
1256.I refer
1257into
1258.DS
1259\&.ds [V 23
1260\&.de [M
1261Bell Laboratories,
1262Murray Hill, N.J. 07974
1263\&..
1264.DE
1265The information after
1266.B %%M
1267is defined as a macro to be invoked by
1268.B .[M
1269while the information after
1270.B %V
1271is turned into a string to be invoked by
1272.B \e\(**([V .
1273At present
1274.I \-ms
1275expects all information as strings.
1276.NH
1277Collecting References and other Refer Options
1278.PP
1279Normally, the combination of
1280.I refer
1281and
1282.I \-ms
1283formats output as
1284.I troff
1285footnotes which are consecutively numbered and placed
1286at the bottom of the page. However,
1287options exist to
1288place the references at the end; to arrange references alphabetically
1289by senior author; and to indicate references by strings in the text of the form
1290[Name1975a]
1291rather than by number.
1292Whenever references are not placed at the bottom of a page
1293identical references are coalesced.
1294.PP
1295For example, the
1296.B \-e
1297option to
1298.I refer
1299specifies that references are to be collected; in this case
1300they are output whenever the sequence
1301.DS
1302\*.[
1303$LIST$
1304\*.\^]
1305.DE
1306is encountered.
1307Thus, to place references at the end of a paper, the user would run
1308.I refer
1309with the
1310.I \-e
1311option and place the above $LIST$ commands after the last
1312line of the text.
1313.I Refer
1314will then move all the references to that point.
1315To aid in formatting the collected references,
1316.I refer
1317writes the references preceded by the line
1318.DS
1319.B .]<
1320.DE
1321and
1322followed by the line
1323.DS
1324.B .]>
1325.DE
1326to invoke special macros before and after the references.
1327.PP
1328Another possible option to
1329.I refer
1330is the
1331.B \-s
1332option to specify
1333sorting of references. The default,
1334of course, is to list references in the order presented.
1335The
1336.B \-s
1337option implies the
1338.B \-e
1339option, and thus requires
1340a
1341.DS
1342\*.[
1343$LIST$
1344\*.\^]
1345.DE
1346entry to call out the reference list.
1347The
1348.B \-s
1349option may be followed by a string of letters, numbers, and `+' signs indicating how
1350the references are to be sorted.
1351The sort is done using the fields whose key-letters are
1352in the string as sorting keys; the numbers indicate how many
1353of the fields are to be considered, with `+'
1354taken as a large number.
1355Thus the default is
1356.B \-sAD
1357meaning ``Sort on senior author, then date.'' To
1358sort on all authors and then title, specify
1359.B \-sA+T .
1360And to sort on two authors and then the journal,
1361write
1362.B \-sA2J .
1363.PP
1364Other options to
1365.I refer
1366change the signal or label inserted in the text for each reference.
1367Normally these are just sequential numbers,
1368and their exact placement (within brackets, as superscripts, etc.) is determined
1369by the macro package.
1370The
1371.B \-l
1372option replaces reference numbers by
1373strings composed of the senior author's last name, the date,
1374and a disambiguating letter.
1375If a number follows the
1376.B l
1377as in
1378.B \-l3
1379only that many letters of the last name are used
1380in the label string.
1381To abbreviate the date as well the form
1382\f3-l\f2m,n\f1
1383shortens the last name to the
1384first
1385.I m
1386letters and the date to the
1387last
1388.I n
1389digits.
1390For example, the option
1391.B \-l3,2
1392would refer to the
1393.I eqn
1394paper (reference 3) by the signal
1395.I Ker75a ,
1396since it is the first cited reference by Kernighan in 1975.
1397.PP
1398A user wishing to specify particular labels for
1399a private bibliography may use the
1400.B \-k
1401option.
1402Specifying
1403\f3\-k\f2x\f1
1404causes the field \f2x\f1 to be used as a label.
1405The default is \f3L\f1.
1406If this field ends in \f3\-\f1, that character
1407is replaced by a sequence letter; otherwise the field
1408is used exactly as given.
1409.PP
1410If none of the
1411.I refer -produced
1412signals are desired,
1413the
1414.B \-b
1415option entirely suppresses automatic text signals.
1416.PP
1417If the user wishes to override the
1418.I \-ms
1419treatment of the reference signal (which is normally to
1420enclose the number in brackets in
1421.I nroff
1422and make it a superscript in
1423.I troff\\| )
1424this can be done easily.
1425If the lines
1426.B \&.[
1427or
1428.B \&.]
1429contain anything following these characters,
1430the remainders of these lines are used to surround
1431the reference signal, instead of the default.
1432Thus, for example, to say ``See reference (2).''
1433and avoid
1434``See reference.\s-3\u2\d\s+3'' the
1435input might appear
1436.DS
1437\&See reference
1438\&\*.[ (
1439imprecise citation ...
1440\&\*.\^])\*.
1441.DE
1442Note that blanks are significant in this construction.
1443If a permanent change is desired in the style of reference
1444signals, however, it is probably easier to redefine the strings
1445.B \&[.
1446and
1447.B \&.]
1448(which are used to bracket each signal)
1449than to change each citation.
1450.PP
1451Although normally
1452.I refer
1453limits itself to retrieving the data for the reference,
1454and leaves to a macro package the job of arranging that
1455data as required by the local format, there are two
1456special options for rearrangements that can not be
1457done by macro packages.
1458The
1459.B \-c
1460option puts fields into all upper case
1461(C\s-2APS\s+2-S\s-2MALL\s+2 C\s-2APS\s+2
1462in
1463.I troff
1464output).
1465The key-letters indicated what information is to be translated
1466to upper case follow the
1467.B c ,
1468so that
1469.B \-cAJ
1470means that authors' names and journals are to be in caps.
1471The
1472.B \-a
1473option writes the names of authors last name first, that is
1474.I "A. D. Hall, Jr."
1475is written as
1476.I "Hall, A. D. Jr" .
1477The citation form of
1478the
1479.I "Journal of the ACM" ,
1480for example, would require
1481both
1482.B \-cA
1483and
1484.B \-a
1485options.
1486This produces authors' names in the style
1487.I
1488K\s-2ERNIGHAN\s0, B. W. \s-2AND\s0 C\s-2HERRY\s0, L. L.\&
1489.R
1490for the previous example.
1491The
1492.B \-a
1493option may be followed by a number to indicate how many
1494author names should be reversed;
1495.B \-a1
1496(without any
1497.B \-c
1498option)
1499would produce
1500.I
1501Kernighan, B. W. and L. L. Cherry,
1502.R
1503for example.
1504.PP
1505Finally, there is also the previously-mentioned
1506.B \-p
1507option to let the user specify
1508a private file of references to be searched before the public files.
1509Note that
1510.I refer
1511does not insist on a previously made index for these files.
1512If a file is named which contains reference
1513data but is not indexed, it will be searched
1514(more slowly)
1515by
1516.I refer
1517using
1518.I fgrep.
1519In this way
1520it is easy for users to keep small files of
1521new references, which can later be added to the
1522public data bases.
1523.SG MH-1274-MEL-\s8UNIX\s0