modern syntax for asgops & inits cause Donn's latest ccom rejects the old.
[unix-history] / usr / src / old / lex / PSD.doc / lex.ms
CommitLineData
5652ef93 1.\" @(#)lex.ms 6.1 (Berkeley) %G%
b1716fed 2.\"
5652ef93
KM
3.EH 'PS1:16-%''Lex \- A Lexical Analyzer Generator'
4.OH 'Lex \- A Lexical Analyzer Generator''PS1:16-%'
b1716fed
KM
5.hc ~
6.bd I 2
7.de TS
8.br
9.nf
10.SP 1v
11.ul 0
12..
13.de TE
14.SP 1v
15.fi
16..
5652ef93
KM
17.\".de PT
18.\".if \\n%>1 'tl ''\s7LEX\s0\s9\(mi%\s0''
19.\".if \\n%>1 'sp
20.\"..
b1716fed 21.ND July 21, 1975
5652ef93
KM
22.\".RP
23.\".TM 75-1274-15 39199 39199-11
b1716fed
KM
24.TL
25Lex \- A Lexical Analyzer ~Generator~
26.AU ``MH 2C-569'' 6377
27M. E. Lesk and E. Schmidt
28.AI
29.MH
30.AB
31.sp
32.bd I 2
5652ef93
KM
33.\".nr PS 8
34.\".nr VS 9
35.\".ps 8
36.\".vs 9p
b1716fed
KM
37Lex helps write programs whose control flow
38is directed by instances of regular
39expressions in the input stream.
40It is well suited for editor-script type transformations and
41for segmenting input in preparation for
42a parsing routine.
43.PP
44Lex source is a table of regular expressions and corresponding program fragments.
45The table is translated to a program
46which reads an input stream, copying it to an output stream
47and partitioning the input
48into strings which match the given expressions.
49As each such string is recognized the corresponding
50program fragment is executed.
51The recognition of the expressions
52is performed by a deterministic finite automaton
53generated by Lex.
54The program fragments written by the user are executed in the order in which the
55corresponding regular expressions occur in the input stream.
56.if n .if \n(tm .ig
57.PP
58The lexical analysis
59programs written with Lex accept ambiguous specifications
60and choose the longest
61match possible at each input point.
62If necessary, substantial look~ahead
63is performed on the input, but the
64input stream will be backed up to the
65end of the current partition, so that the user
66has general freedom to manipulate it.
67.PP
68Lex can generate analyzers in either C or Ratfor, a language
69which can be translated automatically to portable Fortran.
70It is available on the PDP-11 UNIX, Honeywell GCOS,
71and IBM OS systems.
72This manual, however, will only discuss generating analyzers
73in C on the UNIX system, which is the only supported
74form of Lex under UNIX Version 7.
75Lex is designed to simplify
76interfacing with Yacc, for those
77with access to this compiler-compiler system.
78..
5652ef93
KM
79.\".nr PS 9
80.\".nr VS 11
b1716fed 81.AE
b1716fed
KM
82.2C
83.NH
84Introduction.
85.PP
86Lex is a program generator designed for
87lexical processing of character input streams.
88It accepts a high-level, problem oriented specification
89for character string matching,
90and
91produces a program in a general purpose language which recognizes
92regular expressions.
93The regular expressions are specified by the user in the
94source specifications given to Lex.
95The Lex written code recognizes these expressions
96in an input stream and partitions the input stream into
97strings matching the expressions. At the bound~aries
98between strings
99program sections
100provided by the user are executed.
101The Lex source file associates the regular expressions and the
102program fragments.
103As each expression appears in the input to the program written by Lex,
104the corresponding fragment is executed.
105.PP
106.de MH
107Bell Laboratories, Murray Hill, NJ 07974.
108..
109The user supplies the additional code
110beyond expression matching
111needed to complete his tasks, possibly
112including code written by other generators.
113The program that recognizes the expressions is generated in the
114general purpose programming language employed for the
115user's program fragments.
116Thus, a high level expression
117language is provided to write the string expressions to be
118matched while the user's freedom to write actions
119is unimpaired.
120This avoids forcing the user who wishes to use a string manipulation
121language for input analysis to write processing programs in the same
122and often inappropriate string handling language.
123.PP
124Lex is not a complete language, but rather a generator representing
125a new language feature which can be added to
126different programming languages, called ``host languages.''
127Just as general purpose languages
128can produce code to run on different computer hardware,
129Lex can write code in different host languages.
130The host language is used for the output code generated by Lex
131and also for the program fragments added by the user.
132Compatible run-time libraries for the different host languages
133are also provided.
134This makes Lex adaptable to different environments and
135different users.
136Each application
137may be directed to the combination of hardware and host language appropriate
138to the task, the user's background, and the properties of local
139implementations.
140At present, the only supported host language is C,
141although Fortran (in the form of Ratfor [2] has been available
142in the past.
143Lex itself exists on UNIX, GCOS, and OS/370; but the
144code generated by Lex may be taken anywhere the appropriate
145compilers exist.
146.PP
147Lex turns the user's expressions and actions
148(called
149.ul
150source
151in this memo) into the host general-purpose language;
152the generated program is named
153.ul
154yylex.
155The
156.ul
157yylex
158program
159will recognize expressions
160in a stream
161(called
162.ul
163input
164in this memo)
165and perform the specified actions for each expression as it is detected.
166See Figure 1.
167.GS
168.TS
169center;
170l _ r
171l|c|r
172l _ r
173l _ r
174l|c|r
175l _ r
176c s s
177c s s.
178
179Source \(-> Lex \(-> yylex
180
181.sp 2
182
183Input \(-> yylex \(-> Output
184
185.sp
186An overview of Lex
b1716fed
KM
187Figure 1
188.TE
189.GE
190.PP
191For a trivial example, consider a program to delete
192from the input
193all blanks or tabs at the ends of lines.
194.TS
195center;
196l l.
197%%
198[ \et]+$ ;
199.TE
200is all that is required.
201The program
202contains a %% delimiter to mark the beginning of the rules, and
203one rule.
204This rule contains a regular expression
205which matches one or more
206instances of the characters blank or tab
207(written \et for visibility, in accordance with the C language convention)
208just prior to the end of a line.
209The brackets indicate the character
210class made of blank and tab; the + indicates ``one or more ...'';
211and the $ indicates ``end of line,'' as in QED.
212No action is specified,
213so the program generated by Lex (yylex) will ignore these characters.
214Everything else will be copied.
215To change any remaining
216string of blanks or tabs to a single blank,
217add another rule:
218.TS
219center;
220l l.
221%%
222[ \et]+$ ;
223[ \et]+ printf(" ");
224.TE
225The finite automaton generated for this
226source will scan for both rules at once,
227observing at
228the termination of the string of blanks or tabs
229whether or not there is a newline character, and executing
230the desired rule action.
231The first rule matches all strings of blanks or tabs
232at the end of lines, and the second
233rule all remaining strings of blanks or tabs.
234.PP
235Lex can be used alone for simple transformations, or
236for analysis and statistics gathering on a lexical level.
237Lex can also be used with a parser generator
238to perform the lexical analysis phase; it is particularly
239easy to interface Lex and Yacc [3].
240Lex programs recognize only regular expressions;
241Yacc writes parsers that accept a large class of context free grammars,
242but require a lower level analyzer to recognize input tokens.
243Thus, a combination of Lex and Yacc is often appropriate.
244When used as a preprocessor for a later parser generator,
245Lex is used to partition the input stream,
246and the parser generator assigns structure to
247the resulting pieces.
248The flow of control
249in such a case (which might be the first half of a compiler,
250for example) is shown in Figure 2.
251Additional programs,
252written by other generators
253or by hand, can
254be added easily to programs written by Lex.
255.BS 2
5652ef93
KM
256.ps 9
257.vs 11
b1716fed
KM
258.TS
259center;
260l c c c l
261l c c c l
262l c c c l
263l _ c _ l
264l|c|c|c|l
265l _ c _ l
266l c c c l
267l _ c _ l
268l|c|c|c|l
269l _ c _ l
270l c s s l
271l c s s l.
272 lexical grammar
273 rules rules
274 \(da \(da
275
276 Lex Yacc
277
278 \(da \(da
279
280Input \(-> yylex \(-> yyparse \(-> Parsed input
281
282.sp
283 Lex with Yacc
b1716fed
KM
284 Figure 2
285.TE
5652ef93
KM
286.ps 10
287.vs 12
b1716fed
KM
288.BE
289Yacc users
290will realize that the name
291.ul
292yylex
293is what Yacc expects its lexical analyzer to be named,
294so that the use of this name by Lex simplifies
295interfacing.
296.PP
297Lex generates a deterministic finite automaton from the regular expressions
298in the source [4].
299The automaton is interpreted, rather than compiled, in order
300to save space.
301The result is still a fast analyzer.
302In particular, the time taken by a Lex program
303to recognize and partition an input stream is
304proportional to the length of the input.
305The number of Lex rules or
306the complexity of the rules is
307not important in determining speed,
308unless rules which include
309forward context require a significant amount of re~scanning.
310What does increase with the number and complexity of rules
311is the size of the finite
312automaton, and therefore the size of the program
313generated by Lex.
314.PP
315In the program written by Lex, the user's fragments
316(representing the
317.ul
318actions
319to be performed as each regular expression
320is found)
321are gathered
322as cases of a switch.
323The automaton interpreter directs the control flow.
324Opportunity is provided for the user to insert either
325declarations or additional statements in the routine containing
326the actions, or to
327add subroutines outside this action routine.
328.PP
329Lex is not limited to source which can
330be interpreted on the basis of one character
331look~ahead.
332For example,
333if there are two rules, one looking for
334.I ab
335and another for
336.I abcdefg ,
337and the input stream is
338.I abcdefh ,
339Lex will recognize
340.I ab
341and leave
342the input pointer just before
343.I "cd. . ."
344Such backup is more costly
345than the processing of simpler languages.
346.2C
347.NH
348Lex Source.
349.PP
350The general format of Lex source is:
351.TS
352center;
353l.
354{definitions}
355%%
356{rules}
357%%
358{user subroutines}
359.TE
360where the definitions and the user subroutines
361are often omitted.
362The second
363.I %%
364is optional, but the first is required
365to mark the beginning of the rules.
366The absolute minimum Lex program is thus
367.TS
368center;
369l.
370%%
371.TE
372(no definitions, no rules) which translates into a program
373which copies the input to the output unchanged.
374.PP
375In the outline of Lex programs shown above, the
376.I
377rules
378.R
379represent the user's control
380decisions; they are a table, in which the left column
381contains
382.I
383regular expressions
384.R
385(see section 3)
386and the right column contains
387.I
388actions,
389.R
390program fragments to be executed when the expressions
391are recognized.
392Thus an individual rule might appear
393.TS
394center;
395l l.
396integer printf("found keyword INT");
397.TE
398to look for the string
399.I integer
400in the input stream and
401print the message ``found keyword INT'' whenever it appears.
402In this example the host procedural language is C and
403the C library function
404.I
405printf
406.R
407is used to print the string.
408The end
409of the expression is indicated by the first blank or tab character.
410If the action is merely a single C expression,
411it can just be given on the right side of the line; if it is
412compound, or takes more than a line, it should be enclosed in
413braces.
414As a slightly more useful example, suppose it is desired to
415change a number of words from British to American spelling.
416Lex rules such as
417.TS
418center;
419l l.
420colour printf("color");
421mechanise printf("mechanize");
422petrol printf("gas");
423.TE
424would be a start. These rules are not quite enough,
425since
426the word
427.I petroleum
428would become
429.I gaseum ;
430a way of dealing
431with this will be described later.
432.2C
433.NH
434Lex Regular Expressions.
435.PP
436The definitions of regular expressions are very similar to those
437in QED [5].
438A regular
439expression specifies a set of strings to be matched.
440It contains text characters (which match the corresponding
441characters in the strings being compared)
442and operator characters (which specify
443repetitions, choices, and other features).
444The letters of the alphabet and the digits are
445always text characters; thus the regular expression
446.TS
447center;
448l l.
449integer
450.TE
451matches the string
452.ul
453integer
454wherever it appears
455and the expression
456.TS
457center;
458l.
459a57D
460.TE
461looks for the string
462.ul
463a57D.
464.PP
465.I
466Operators.
467.R
468The operator characters are
469.TS
470center;
471l.
472" \e [ ] ^ \- ? . \(** + | ( ) $ / { } % < >
473.TE
474and if they are to be used as text characters, an escape
475should be used.
476The quotation mark operator (")
477indicates that whatever is contained between a pair of quotes
478is to be taken as text characters.
479Thus
480.TS
481center;
482l.
483xyz"++"
484.TE
485matches the string
486.I xyz++
487when it appears. Note that a part of a string may be quoted.
488It is harmless but unnecessary to quote an ordinary
489text character; the expression
490.TS
491center;
492l.
493"xyz++"
494.TE
495is the same as the one above.
496Thus by quoting every non-alphanumeric character
497being used as a text character, the user can avoid remembering
498the list above of current
499operator characters, and is safe should further extensions to Lex
500lengthen the list.
501.PP
502An operator character may also be turned into a text character
503by preceding it with \e as in
504.TS
505center;
506l.
507xyz\e+\e+
508.TE
509which
510is another, less readable, equivalent of the above expressions.
511Another use of the quoting mechanism is to get a blank into
512an expression; normally, as explained above, blanks or tabs end
513a rule.
514Any blank character not contained within [\|] (see below) must
515be quoted.
516Several normal C escapes with \e
517are recognized: \en is newline, \et is tab, and \eb is backspace.
518To enter \e itself, use \e\e.
519Since newline is illegal in an expression, \en must be used;
520it is not
521required to escape tab and backspace.
522Every character but blank, tab, newline and the list above is always
523a text character.
524.PP
525.I
526Character classes.
527.R
528Classes of characters can be specified using the operator pair [\|].
529The construction
530.I [abc]
531matches a
532single character, which may be
533.I a ,
534.I b ,
535or
536.I c .
537Within square brackets,
538most operator meanings are ignored.
539Only three characters are special:
540these are \e \(mi and ^. The \(mi character
541indicates ranges. For example,
542.TS
543center;
544l.
545[a\(miz0\(mi9<>_]
546.TE
547indicates the character class containing all the lower case letters,
548the digits,
549the angle brackets, and underline.
550Ranges may be given in either order.
551Using \(mi between any pair of characters which are
552not both upper case letters, both lower case letters, or both digits
553is implementation dependent and will get a warning message.
554(E.g., [0\-z] in ASCII is many more characters
555than it is in EBCDIC).
556If it is desired to include the
557character \(mi in a character class, it should be first or
558last; thus
559.TS
560center;
561l.
562[\(mi+0\(mi9]
563.TE
564matches all the digits and the two signs.
565.PP
566In character classes,
567the ^ operator must appear as the first character
568after the left bracket; it indicates that the resulting string
569is to be complemented with respect to the computer character set.
570Thus
571.TS
572center;
573l.
574[^abc]
575.TE
576matches all characters except a, b, or c, including
577all special or control characters; or
578.TS
579center;
580l.
581[^a\-zA\-Z]
582.TE
583is any character which is not a letter.
584The \e character provides the usual escapes within
585character class brackets.
586.PP
587.I
588Arbitrary character.
589.R
590To match almost any character, the operator character
591.TS
592center;
593l.
594\&.
595.TE
596is the class of all characters except newline.
597Escaping into octal is possible although non-portable:
598.TS
599center;
600l.
601[\e40\-\e176]
602.TE
603matches all printable characters in the ASCII character set, from octal
60440 (blank) to octal 176 (tilde).
605.PP
606.I
607Optional expressions.
608.R
609The operator
610.I ?
611indicates
612an optional element of an expression.
613Thus
614.TS
615center;
616l.
617ab?c
618.TE
619matches either
620.I ac
621or
622.I abc .
623.PP
624.I
625Repeated expressions.
626.R
627Repetitions of classes are indicated by the operators
628.I \(**
629and
630.I + .
631.TS
632center;
633l.
634\f2a\(**\f1
635.TE
636is any number of consecutive
637.I a
638characters, including zero; while
639.TS
640center;
641l.
642a+
643.TE
644is one or more instances of
645.I a.
646For example,
647.TS
648center;
649l.
650[a\-z]+
651.TE
652is all strings of lower case letters.
653And
654.TS
655center;
656l.
657[A\(miZa\(miz][A\(miZa\(miz0\(mi9]\(**
658.TE
659indicates all alphanumeric strings with a leading
660alphabetic character.
661This is a typical expression for recognizing identifiers in
662computer languages.
663.PP
664.I
665Alternation and Grouping.
666.R
667The operator |
668indicates alternation:
669.TS
670center;
671l.
672(ab\||\|cd)
673.TE
674matches either
675.ul
676ab
677or
678.ul
679cd.
680Note that parentheses are used for grouping, although
681they are
682not necessary on the outside level;
683.TS
684center;
685l.
686ab\||\|cd
687.TE
688would have sufficed.
689Parentheses
690can be used for more complex expressions:
691.TS
692center;
693l.
694(ab\||\|cd+)?(ef)\(**
695.TE
696matches such strings as
697.I abefef ,
698.I efefef ,
699.I cdef ,
700or
701.I cddd\| ;
702but not
703.I abc ,
704.I abcd ,
705or
706.I abcdef .
707.PP
708.I
709Context sensitivity.
710.R
711Lex will recognize a small amount of surrounding
712context. The two simplest operators for this are
713.I ^
714and
715.I $ .
716If the first character of an expression is
717.I ^ ,
718the expression will only be matched at the beginning
719of a line (after a newline character, or at the beginning of
720the input stream).
721This can never conflict with the other meaning of
722.I ^ ,
723complementation
724of character classes, since that only applies within
725the [\|] operators.
726If the very last character is
727.I $ ,
728the expression will only be matched at the end of a line (when
729immediately followed by newline).
730The latter operator is a special case of the
731.I /
732operator character,
733which indicates trailing context.
734The expression
735.TS
736center;
737l.
738ab/cd
739.TE
740matches the string
741.I ab ,
742but only if followed by
743.ul
744cd.
745Thus
746.TS
747center;
748l.
749ab$
750.TE
751is the same as
752.TS
753center;
754l.
755ab/\en
756.TE
757Left context is handled in Lex by
758.I
759start conditions
760.R
761as explained in section 10. If a rule is only to be executed
762when the Lex automaton interpreter is in start condition
763.I
764x,
765.R
766the rule should be prefixed by
767.TS
768center;
769l.
770<x>
771.TE
772using the angle bracket operator characters.
773If we considered ``being at the beginning of a line'' to be
774start condition
775.I ONE ,
776then the ^ operator
777would be equivalent to
778.TS
779center;
780l.
781<ONE>
782.TE
783Start conditions are explained more fully later.
784.PP
785.I
786Repetitions and Definitions.
787.R
788The operators {} specify
789either repetitions (if they enclose numbers)
790or
791definition expansion (if they enclose a name). For example
792.TS
793center;
794l.
795{digit}
796.TE
797looks for a predefined string named
798.I digit
799and inserts it
800at that point in the expression.
801The definitions are given in the first part of the Lex
802input, before the rules.
803In contrast,
804.TS
805center;
806l.
807a{1,5}
808.TE
809looks for 1 to 5 occurrences of
810.I a .
811.PP
812Finally, initial
813.I %
814is special, being the separator
815for Lex source segments.
816.2C
817.NH
818Lex Actions.
819.PP
820When an expression written as above is matched, Lex
821executes the corresponding action. This section describes
822some features of Lex which aid in writing actions. Note
823that there is a default action, which
824consists of copying the input to the output. This
825is performed on all strings not otherwise matched. Thus
826the Lex user who wishes to absorb the entire input, without
827producing any output, must provide rules to match everything.
828When Lex is being used with Yacc, this is the normal
829situation.
830One may consider that actions are what is done instead of
831copying the input to the output; thus, in general,
832a rule which merely copies can be omitted.
833Also, a character combination
834which is omitted from the rules
835and which appears as input
836is likely to be printed on the output, thus calling
837attention to the gap in the rules.
838.PP
839One of the simplest things that can be done is to ignore
840the input. Specifying a C null statement, \fI;\fR as an action
841causes this result. A frequent rule is
842.TS
843center;
844l l.
845[ \et\en] ;
846.TE
847which causes the three spacing characters (blank, tab, and newline)
848to be ignored.
849.PP
850Another easy way to avoid writing actions is the action character
851|, which indicates that the action for this rule is the action
852for the next rule.
853The previous example could also have been written
854.TS
855center;
856l l.
857" " |
858"\et" |
859"\en" ;
860.TE
861with the same result, although in different style.
862The quotes around \en and \et are not required.
863.PP
864In more complex actions, the user
865will
866often want to know the actual text that matched some expression
867like
868.I [a\(miz]+ .
869Lex leaves this text in an external character
870array named
871.I
872yytext.
873.R
874Thus, to print the name found,
875a rule like
876.TS
877center;
878l l.
879[a\-z]+ printf("%s", yytext);
880.TE
881will print
882the string in
883.I
884yytext.
885.R
886The C function
887.I
888printf
889.R
890accepts a format argument and data to be printed;
891in this case, the format is ``print string'' (% indicating
892data conversion, and
893.I s
894indicating string type),
895and the data are the characters
896in
897.I
898yytext.
899.R
900So this just places
901the matched string
902on the output.
903This action
904is so common that
905it may be written as ECHO:
906.TS
907center;
908l l.
909[a\-z]+ ECHO;
910.TE
911is the same as the above.
912Since the default action is just to
913print the characters found, one might ask why
914give a rule, like this one, which merely specifies
915the default action?
916Such rules are often required
917to avoid matching some other rule
918which is not desired. For example, if there is a rule
919which matches
920.I read
921it will normally match the instances of
922.I read
923contained in
924.I bread
925or
926.I readjust ;
927to avoid
928this,
929a rule
930of the form
931.I [a\(miz]+
932is needed.
933This is explained further below.
934.PP
935Sometimes it is more convenient to know the end of what
936has been found; hence Lex also provides a count
937.I
938yyleng
939.R
940of the number of characters matched.
941To count both the number
942of words and the number of characters in words in the input, the user might write
943.TS
944center;
945l l.
946[a\-zA\-Z]+ {words++; chars += yyleng;}
947.TE
948which accumulates in
949.ul
950chars
951the number
952of characters in the words recognized.
953The last character in the string matched can
954be accessed by
955.TS
956center;
957l.
958yytext[yyleng\-1]
959.TE
960.PP
961Occasionally, a Lex
962action may decide that a rule has not recognized the correct
963span of characters.
964Two routines are provided to aid with this situation.
965First,
966.I
967yymore()
968.R
969can be called to indicate that the next input expression recognized is to be
970tacked on to the end of this input. Normally,
971the next input string would overwrite the current
972entry in
973.I
974yytext.
975.R
976Second,
977.I
978yyless (n)
979.R
980may be called to indicate that not all the characters matched
981by the currently successful expression are wanted right now.
982The argument
983.I
984n
985.R
986indicates the number of characters
987in
988.I
989yytext
990.R
991to be retained.
992Further characters previously matched
993are
994returned to the input. This provides the same sort of
995look~ahead offered by the / operator,
996but in a different form.
997.PP
998.I
999Example:
1000.R
1001Consider a language which defines
1002a string as a set of characters between quotation (") marks, and provides that
1003to include a " in a string it must be preceded by a \e. The
1004regular expression which matches that is somewhat confusing,
1005so that it might be preferable to write
1006.TS
1007center;
1008l l.
1009\e"[^"]\(** {
1010 if (yytext[yyleng\-1] == \(fm\e\e\(fm)
1011 yymore();
1012 else
1013 ... normal user processing
1014 }
1015.TE
1016which will, when faced with a string such as
1017.I
1018"abc\e"def\|"
1019.R
1020first match
1021the five characters
1022\fI"abc\e\|\fR;
1023then
1024the call to
1025.I yymore()
1026will
1027cause the next part of the string,
1028\fI"def\|\fR,
1029to be tacked on the end.
1030Note that the final quote terminating the string should be picked
1031up in the code labeled ``normal processing''.
1032.PP
1033The function
1034.I
1035yyless()
1036.R
1037might be used to reprocess
1038text in various circumstances. Consider the C problem of distinguishing
1039the ambiguity of ``=\(mia''.
1040Suppose it is desired to treat this as ``=\(mi a''
1041but print a message. A rule might be
5652ef93
KM
1042.ps 9
1043.vs 11
b1716fed
KM
1044.TS
1045center;
1046l l.
1047=\(mi[a\-zA\-Z] {
5652ef93 1048 printf("Op (=\(mi) ambiguous\en");
b1716fed
KM
1049 yyless(yyleng\-1);
1050 ... action for =\(mi ...
1051 }
1052.TE
5652ef93
KM
1053.ps 10
1054.vs 12
b1716fed
KM
1055which prints a message, returns the letter after the
1056operator to the input stream, and treats the operator as ``=\(mi''.
1057Alternatively it might be desired to treat this as ``= \(mia''.
1058To do this, just return the minus
1059sign as well as the letter to the input:
5652ef93
KM
1060.ps 9
1061.vs 11
b1716fed
KM
1062.TS
1063center;
1064l l.
1065=\(mi[a\-zA\-Z] {
5652ef93 1066 printf("Op (=\(mi) ambiguous\en");
b1716fed
KM
1067 yyless(yyleng\-2);
1068 ... action for = ...
1069 }
1070.TE
5652ef93
KM
1071.ps 10
1072.vs 12
b1716fed
KM
1073will perform the other interpretation.
1074Note that the expressions for the two cases might more easily
1075be written
1076.TS
1077center;
1078l l.
1079=\(mi/[A\-Za\-z]
1080.TE
1081in the first case and
1082.TS
1083center;
1084l.
1085=/\-[A\-Za\-z]
1086.TE
1087in the second;
1088no backup would be required in the rule action.
1089It is not necessary to recognize the whole identifier
1090to observe the ambiguity.
1091The
1092possibility of ``=\(mi3'', however, makes
1093.TS
1094center;
1095l.
1096=\(mi/[^ \et\en]
1097.TE
1098a still better rule.
1099.PP
1100In addition to these routines, Lex also permits
1101access to the I/O routines
1102it uses.
1103They are:
1104.IP 1)
1105.I
1106input()
1107.R
1108which returns the next input character;
1109.IP 2)
1110.I
1111output(c)
1112.R
1113which writes the character
1114.I
1115c
1116.R
1117on the output; and
1118.IP 3)
1119.I
1120unput(c)
1121.R
1122pushes the character
1123.I
1124c
1125.R
1126back onto the input stream to be read later by
1127.I
1128input().
1129.R
1130.LP
1131By default these routines are provided as macro definitions,
1132but the user can override them and supply private versions.
1133These routines
1134define the relationship between external files and
1135internal characters, and must all be retained
1136or modified consistently.
1137They may be redefined, to
1138cause input or output to be transmitted to or from strange
1139places, including other programs or internal memory;
1140but the character set used must be consistent in all routines;
1141a value of zero returned by
1142.I
1143input
1144.R
1145must mean end of file; and
1146the relationship between
1147.I
1148unput
1149.R
1150and
1151.I
1152input
1153.R
1154must be retained
1155or the Lex look~ahead will not work.
1156Lex does not look ahead at all if it does not have to,
1157but every rule ending in
1158.ft I
1159+ \(** ?
1160.ft R
1161or
1162.ft I
1163$
1164.ft R
1165or containing
1166.ft I
1167/
1168.ft R
1169implies look~ahead.
1170Look~ahead is also necessary to match an expression that is a prefix
1171of another expression.
1172See below for a discussion of the character set used by Lex.
1173The standard Lex library imposes
1174a 100 character limit on backup.
1175.PP
1176Another Lex library routine that the user will sometimes want
1177to redefine is
1178.I
1179yywrap()
1180.R
1181which is called whenever Lex reaches an end-of-file.
1182If
1183.I
1184yywrap
1185.R
1186returns a 1, Lex continues with the normal wrapup on end of input.
1187Sometimes, however, it is convenient to arrange for more
1188input to arrive
1189from a new source.
1190In this case, the user should provide
1191a
1192.I
1193yywrap
1194.R
1195which
1196arranges for new input and
1197returns 0. This instructs Lex to continue processing.
1198The default
1199.I
1200yywrap
1201.R
1202always returns 1.
1203.PP
1204This routine is also a convenient place
1205to print tables, summaries, etc. at the end
1206of a program. Note that it is not
1207possible to write a normal rule which recognizes
1208end-of-file; the only access to this condition is
1209through
1210.I
1211yywrap.
1212.R
1213In fact, unless a private version of
1214.I
1215input()
1216.R
1217is supplied
1218a file containing nulls
1219cannot be handled,
1220since a value of 0 returned by
1221.I
1222input
1223.R
1224is taken to be end-of-file.
1225.PP
1226.2C
1227.NH
1228Ambiguous Source Rules.
1229.PP
1230Lex can handle ambiguous specifications.
1231When more than one expression can match the
1232current input, Lex chooses as follows:
1233.IP 1)
1234The longest match is preferred.
1235.IP 2)
1236Among rules which matched the same number of characters,
1237the rule given first is preferred.
1238.LP
1239Thus, suppose the rules
1240.TS
1241center;
1242l l.
1243integer keyword action ...;
1244[a\-z]+ identifier action ...;
1245.TE
1246to be given in that order. If the input is
1247.I integers ,
1248it is taken as an identifier, because
1249.I [a\-z]+
1250matches 8 characters while
1251.I integer
1252matches only 7.
1253If the input is
1254.I integer ,
1255both rules match 7 characters, and
1256the keyword rule is selected because it was given first.
1257Anything shorter (e.g. \fIint\fR\|) will
1258not match the expression
1259.I integer
1260and so the identifier interpretation is used.
1261.PP
1262The principle of preferring the longest
1263match makes rules containing
1264expressions like
1265.I \&.\(**
1266dangerous.
1267For example,
1268.TS
1269center;
1270l.
1271\&\(fm.\(**\(fm
1272.TE
1273might seem a good way of recognizing
1274a string in single quotes.
1275But it is an invitation for the program to read far
1276ahead, looking for a distant
1277single quote.
1278Presented with the input
1279.TS
1280center;
1281l l.
1282\&\(fmfirst\(fm quoted string here, \(fmsecond\(fm here
1283.TE
1284the above expression will match
1285.TS
1286center;
1287l l.
1288\&\(fmfirst\(fm quoted string here, \(fmsecond\(fm
1289.TE
1290which is probably not what was wanted.
1291A better rule is of the form
1292.TS
1293center;
1294l.
1295\&\(fm[^\(fm\en]\(**\(fm
1296.TE
1297which, on the above input, will stop
1298after
1299.I \(fmfirst\(fm .
1300The consequences
1301of errors like this are mitigated by the fact
1302that the
1303.I \&.
1304operator will not match newline.
1305Thus expressions like
1306.I \&.\(**
1307stop on the
1308current line.
1309Don't try to defeat this with expressions like
1310.I [.\en]+
1311or
1312equivalents;
1313the Lex generated program will try to read
1314the entire input file, causing
1315internal buffer overflows.
1316.PP
1317Note that Lex is normally partitioning
1318the input stream, not searching for all possible matches
1319of each expression.
1320This means that each character is accounted for
1321once and only once.
1322For example, suppose it is desired to
1323count occurrences of both \fIshe\fR and \fIhe\fR in an input text.
1324Some Lex rules to do this might be
1325.TS
1326center;
1327l l.
1328she s++;
1329he h++;
1330\en |
1331\&. ;
1332.TE
1333where the last two rules ignore everything besides \fIhe\fR and \fIshe\fR.
1334Remember that . does not include newline.
1335Since \fIshe\fR includes \fIhe\fR, Lex will normally
1336.I
1337not
1338.R
1339recognize
1340the instances of \fIhe\fR included in \fIshe\fR,
1341since once it has passed a \fIshe\fR those characters are gone.
1342.PP
1343Sometimes the user would like to override this choice. The action
1344REJECT
1345means ``go do the next alternative.''
1346It causes whatever rule was second choice after the current
1347rule to be executed.
1348The position of the input pointer is adjusted accordingly.
1349Suppose the user really wants to count the included instances of \fIhe\fR:
1350.TS
1351center;
1352l l.
1353she {s++; REJECT;}
1354he {h++; REJECT;}
1355\en |
1356\&. ;
1357.TE
1358these rules are one way of changing the previous example
1359to do just that.
1360After counting each expression, it is rejected; whenever appropriate,
1361the other expression will then be counted. In this example, of course,
1362the user could note that \fIshe\fR includes \fIhe\fR but not
1363vice versa, and omit the REJECT action on \fIhe\fR;
1364in other cases, however, it
1365would not be possible a priori to tell
1366which input characters
1367were in both classes.
1368.PP
1369Consider the two rules
1370.TS
1371center;
1372l l.
1373a[bc]+ { ... ; REJECT;}
1374a[cd]+ { ... ; REJECT;}
1375.TE
1376If the input is
1377.I ab ,
1378only the first rule matches,
1379and on
1380.I ad
1381only the second matches.
1382The input string
1383.I accb
1384matches the first rule for four characters
1385and then the second rule for three characters.
1386In contrast, the input
1387.I accd
1388agrees with
1389the second rule for four characters and then the first
1390rule for three.
1391.PP
1392In general, REJECT is useful whenever
1393the purpose of Lex is not to partition the input
1394stream but to detect all examples of some items
1395in the input, and the instances of these items
1396may overlap or include each other.
1397Suppose a digram table of the input is desired;
1398normally the digrams overlap, that is the word
1399.I the
1400is considered to contain
1401both
1402.I th
1403and
1404.I he .
1405Assuming a two-dimensional array named
1406.ul
1407digram
1408to be incremented, the appropriate
1409source is
1410.TS
1411center;
1412l l.
1413%%
5652ef93
KM
1414[a\-z][a\-z] {
1415 digram[yytext[0]][yytext[1]]++;
1416 REJECT;
1417 }
1418\. ;
b1716fed
KM
1419\en ;
1420.TE
1421where the REJECT is necessary to pick up
1422a letter pair beginning at every character, rather than at every
1423other character.
1424.2C
1425.NH
1426Lex Source Definitions.
1427.PP
1428Remember the format of the Lex
1429source:
1430.TS
1431center;
1432l.
1433{definitions}
1434%%
1435{rules}
1436%%
1437{user routines}
1438.TE
1439So far only the rules have been described. The user needs
1440additional options,
1441though, to define variables for use in his program and for use
1442by Lex.
1443These can go either in the definitions section
1444or in the rules section.
1445.PP
1446Remember that Lex is turning the rules into a program.
1447Any source not intercepted by Lex is copied
1448into the generated program. There are three classes
1449of such things.
1450.IP 1)
1451Any line which is not part of a Lex rule or action
1452which begins with a blank or tab is copied into
1453the Lex generated program.
1454Such source input prior to the first %% delimiter will be external
1455to any function in the code; if it appears immediately after the first
1456%%,
1457it appears in an appropriate place for declarations
1458in the function written by Lex which contains the actions.
1459This material must look like program fragments,
1460and should precede the first Lex rule.
1461.IP
1462As a side effect of the above, lines which begin with a blank
1463or tab, and which contain a comment,
1464are passed through to the generated program.
1465This can be used to include comments in either the Lex source or
1466the generated code. The comments should follow the host
1467language convention.
1468.IP 2)
1469Anything included between lines containing
1470only
1471.I %{
1472and
1473.I %}
1474is
1475copied out as above. The delimiters are discarded.
1476This format permits entering text like preprocessor statements that
1477must begin in column 1,
1478or copying lines that do not look like programs.
1479.IP 3)
1480Anything after the third %% delimiter, regardless of formats, etc.,
1481is copied out after the Lex output.
1482.PP
1483Definitions intended for Lex are given
1484before the first %% delimiter. Any line in this section
1485not contained between %{ and %}, and begining
1486in column 1, is assumed to define Lex substitution strings.
1487The format of such lines is
1488.TS
1489center;
1490l l.
1491name translation
1492.TE
1493and it
1494causes the string given as a translation to
1495be associated with the name.
1496The name and translation
1497must be separated by at least one blank or tab, and the name must begin with a letter.
1498The translation can then be called out
1499by the {name} syntax in a rule.
1500Using {D} for the digits and {E} for an exponent field,
1501for example, might abbreviate rules to recognize numbers:
1502.TS
1503center;
1504l l.
1505D [0\-9]
1506E [DEde][\-+]?{D}+
1507%%
1508{D}+ printf("integer");
1509{D}+"."{D}\(**({E})? |
1510{D}\(**"."{D}+({E})? |
1511{D}+{E} printf("real");
1512.TE
1513Note the first two rules for real numbers;
1514both require a decimal point and contain
1515an optional exponent field,
1516but the first requires at least one digit before the
1517decimal point and the second requires at least one
1518digit after the decimal point.
1519To correctly handle the problem
1520posed by a Fortran expression such as
1521.I 35.EQ.I ,
1522which does not contain a real number, a context-sensitive
1523rule such as
1524.TS
1525center;
1526l l.
1527[0\-9]+/"."EQ printf("integer");
1528.TE
1529could be used in addition to the normal rule for integers.
1530.PP
1531The definitions
1532section may also contain other commands, including the
1533selection of a host language, a character set table,
1534a list of start conditions, or adjustments to the default
1535size of arrays within Lex itself for larger source programs.
1536These possibilities
1537are discussed below under ``Summary of Source Format,''
1538section 12.
1539.2C
1540.NH
1541Usage.
1542.PP
1543There are two steps in
1544compiling a Lex source program.
1545First, the Lex source must be turned into a generated program
1546in the host general purpose language.
1547Then this program must be compiled and loaded, usually with
1548a library of Lex subroutines.
1549The generated program
1550is on a file named
1551.I lex.yy.c .
1552The I/O library is defined in terms of the C standard
1553library [6].
1554.PP
1555The C programs generated by Lex are slightly different
1556on OS/370, because the
1557OS compiler is less powerful than the UNIX or GCOS compilers,
1558and does less at compile time.
1559C programs generated on GCOS and UNIX are the same.
1560.PP
1561.I
1562UNIX.
1563.R
1564The library is accessed by the loader flag
1565.I \-ll .
1566So an appropriate
1567set of commands is
1568.KS
1569.in 5
1570lex source
1571cc lex.yy.c \-ll
1572.in 0
1573.KE
1574The resulting program is placed on the usual file
1575.I
1576a.out
1577.R
1578for later execution.
1579To use Lex with Yacc see below.
1580Although the default Lex I/O routines use the C standard library,
1581the Lex automata themselves do not do so;
1582if private versions of
1583.I
1584input,
1585output
1586.R
1587and
1588.I unput
1589are given, the library can be avoided.
1590.PP
1591.2C
1592.NH
1593Lex and Yacc.
1594.PP
1595If you want to use Lex with Yacc, note that what Lex writes is a program
1596named
1597.I
1598yylex(),
1599.R
1600the name required by Yacc for its analyzer.
1601Normally, the default main program on the Lex library
1602calls this routine, but if Yacc is loaded, and its main
1603program is used, Yacc will call
1604.I
1605yylex().
1606.R
1607In this case each Lex rule should end with
1608.TS
1609center;
1610l.
1611return(token);
1612.TE
1613where the appropriate token value is returned.
1614An easy way to get access
1615to Yacc's names for tokens is to
1616compile the Lex output file as part of
1617the Yacc output file by placing the line
1618.TS
1619center;
1620l.
1621# include "lex.yy.c"
1622.TE
1623in the last section of Yacc input.
1624Supposing the grammar to be
1625named ``good'' and the lexical rules to be named ``better''
1626the UNIX command sequence can just be:
1627.TS
1628center;
1629l.
1630yacc good
1631lex better
1632cc y.tab.c \-ly \-ll
1633.TE
1634The Yacc library (\-ly) should be loaded before the Lex library,
1635to obtain a main program which invokes the Yacc parser.
1636The generations of Lex and Yacc programs can be done in
1637either order.
1638.2C
1639.NH
1640Examples.
1641.PP
1642As a trivial problem, consider copying an input file while
1643adding 3 to every positive number divisible by 7.
1644Here is a suitable Lex source program
1645.TS
1646center;
1647l l.
1648%%
1649 int k;
1650[0\-9]+ {
1651 k = atoi(yytext);
1652 if (k%7 == 0)
1653 printf("%d", k+3);
1654 else
1655 printf("%d",k);
1656 }
1657.TE
1658to do just that.
1659The rule [0\-9]+ recognizes strings of digits;
1660.I
1661atoi
1662.R
1663converts the digits to binary
1664and stores the result in
1665.ul
1666k.
1667The operator % (remainder) is used to check whether
1668.ul
1669k
1670is divisible by 7; if it is,
1671it is incremented by 3 as it is written out.
1672It may be objected that this program will alter such
1673input items as
1674.I 49.63
1675or
1676.I X7 .
1677Furthermore, it increments the absolute value
1678of all negative numbers divisible by 7.
1679To avoid this, just add a few more rules after the active one,
1680as here:
1681.TS
1682center;
1683l l.
1684%%
1685 int k;
1686\-?[0\-9]+ {
1687 k = atoi(yytext);
5652ef93
KM
1688 printf("%d",
1689 k%7 == 0 ? k+3 : k);
b1716fed
KM
1690 }
1691\-?[0\-9.]+ ECHO;
1692[A-Za-z][A-Za-z0-9]+ ECHO;
1693.TE
1694Numerical strings containing
1695a ``.'' or preceded by a letter will be picked up by
1696one of the last two rules, and not changed.
1697The
1698.I if\-else
1699has been replaced by
1700a C conditional expression to save space;
1701the form
1702.ul
1703a?b:c
1704means ``if
1705.I a
1706then
1707.I b
1708else
1709.I c ''.
1710.PP
1711For an example of statistics gathering, here
1712is a program which histograms the lengths
1713of words, where a word is defined as a string of letters.
1714.TS
1715center;
1716l l.
1717 int lengs[100];
1718%%
1719[a\-z]+ lengs[yyleng]++;
1720\&. |
1721\en ;
1722%%
1723.T&
1724l s.
1725yywrap()
1726{
1727int i;
1728printf("Length No. words\en");
1729for(i=0; i<100; i++)
1730 if (lengs[i] > 0)
1731 printf("%5d%10d\en",i,lengs[i]);
1732return(1);
1733}
1734.TE
1735This program
1736accumulates the histogram, while producing no output. At the end
1737of the input it prints the table.
1738The final statement
1739.I
1740return(1);
1741.R
1742indicates that Lex is to perform wrapup. If
1743.I
1744yywrap
1745.R
1746returns zero (false)
1747it implies that further input is available
1748and the program is
1749to continue reading and processing.
1750To provide a
1751.I
1752yywrap
1753.R
1754that never
1755returns true causes an infinite loop.
1756.PP
1757As a larger example,
1758here are some parts of a program written by N. L. Schryer
1759to convert double precision Fortran to single precision Fortran.
1760Because Fortran does not distinguish upper and lower case letters,
1761this routine begins by defining a set of classes including
1762both cases of each letter:
1763.TS
1764center;
1765l l.
1766a [aA]
1767b [bB]
1768c [cC]
1769\&...
1770z [zZ]
1771.TE
1772An additional class recognizes white space:
1773.TS
1774center;
1775l l.
1776W [ \et]\(**
1777.TE
1778The first rule changes
1779``double precision'' to ``real'', or ``DOUBLE PRECISION'' to ``REAL''.
1780.TS
1781center;
1782l.
1783{d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} {
1784 printf(yytext[0]==\(fmd\(fm? "real" : "REAL");
1785 }
1786.TE
1787Care is taken throughout this program to preserve the case
1788(upper or lower)
1789of the original program.
1790The conditional operator is used to
1791select the proper form of the keyword.
1792The next rule copies continuation card indications to
1793avoid confusing them with constants:
1794.TS
1795center;
1796l l.
1797^" "[^ 0] ECHO;
1798.TE
1799In the regular expression, the quotes surround the
1800blanks.
1801It is interpreted as
1802``beginning of line, then five blanks, then
1803anything but blank or zero.''
1804Note the two different meanings of
1805.I ^ .
1806There follow some rules to change double precision
1807constants to ordinary floating constants.
1808.TS
1809center;
1810l.
1811[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+ |
1812[0\-9]+{W}"."{W}{d}{W}[+\-]?{W}[0\-9]+ |
1813"."{W}[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+ {
1814 /\(** convert constants \(**/
1815 for(p=yytext; \(**p != 0; p++)
1816 {
1817 if (\(**p == \(fmd\(fm || \(**p == \(fmD\(fm)
1818 \(**p=+ \(fme\(fm\- \(fmd\(fm;
1819 ECHO;
1820 }
1821.TE
1822After the floating point constant is recognized, it is
1823scanned by the
1824.ul
1825for
1826loop
1827to find the letter
1828.I d
1829or
1830.I D .
1831The program than adds
1832.c
1833.I \(fme\(fm\-\(fmd\(fm ,
1834which converts
1835it to the next letter of the alphabet.
1836The modified constant, now single-precision,
1837is written out again.
1838There follow a series of names which must be respelled to remove
1839their initial \fId\fR.
1840By using the
1841array
1842.I
1843yytext
1844.R
1845the same action suffices for all the names (only a sample of
1846a rather long list is given here).
1847.TS
1848center;
1849l l.
1850{d}{s}{i}{n} |
1851{d}{c}{o}{s} |
1852{d}{s}{q}{r}{t} |
1853{d}{a}{t}{a}{n} |
1854\&...
1855{d}{f}{l}{o}{a}{t} printf("%s",yytext+1);
1856.TE
1857Another list of names must have initial \fId\fR changed to initial \fIa\fR:
1858.TS
1859center;
1860l l.
1861{d}{l}{o}{g} |
1862{d}{l}{o}{g}10 |
1863{d}{m}{i}{n}1 |
1864{d}{m}{a}{x}1 {
1865 yytext[0] =+ \(fma\(fm \- \(fmd\(fm;
1866 ECHO;
1867 }
1868.TE
1869And one routine
1870must have initial \fId\fR changed to initial \fIr\fR:
1871.TS
1872center;
1873l l.
1874{d}1{m}{a}{c}{h} {yytext[0] =+ \(fmr\(fm \- \(fmd\(fm;
1875 ECHO;
1876 }
1877.TE
1878To avoid such names as \fIdsinx\fR being detected as instances
1879of \fIdsin\fR, some final rules pick up longer words as identifiers
1880and copy some surviving characters:
1881.TS
1882center;
1883l l.
1884[A\-Za\-z][A\-Za\-z0\-9]\(** |
1885[0\-9]+ |
1886\en |
1887\&. ECHO;
1888.TE
1889Note that this program is not complete; it
1890does not deal with the spacing problems in Fortran or
1891with the use of keywords as identifiers.
1892.br
1893.2C
1894.NH
1895Left Context Sensitivity.
1896.PP
1897Sometimes
1898it is desirable to have several sets of lexical rules
1899to be applied at different times in the input.
1900For example, a compiler preprocessor might distinguish
1901preprocessor statements and analyze them differently
1902from ordinary statements.
1903This requires
1904sensitivity
1905to prior context, and there are several ways of handling
1906such problems.
1907The \fI^\fR operator, for example, is a prior context operator,
1908recognizing immediately preceding left context just as \fI$\fR recognizes
1909immediately following right context.
1910Adjacent left context could be extended, to produce a facility similar to
1911that for adjacent right context, but it is unlikely
1912to be as useful, since often the relevant left context
1913appeared some time earlier, such as at the beginning of a line.
1914.PP
1915This section describes three means of dealing
1916with different environments: a simple use of flags,
1917when only a few rules change from one environment to another,
1918the use of
1919.I
1920start conditions
1921.R
1922on rules,
1923and the possibility of making multiple lexical analyzers all run
1924together.
1925In each case, there are rules which recognize the need to change the
1926environment in which the
1927following input text is analyzed, and set some parameter
1928to reflect the change. This may be a flag explicitly tested by
1929the user's action code; such a flag is the simplest way of dealing
1930with the problem, since Lex is not involved at all.
1931It may be more convenient,
1932however,
1933to have Lex remember the flags as initial conditions on the rules.
1934Any rule may be associated with a start condition. It will only
1935be recognized when Lex is in
1936that start condition.
1937The current start condition may be changed at any time.
1938Finally, if the sets of rules for the different environments
1939are very dissimilar,
1940clarity may be best achieved by writing several distinct lexical
1941analyzers, and switching from one to another as desired.
1942.PP
1943Consider the following problem: copy the input to the output,
1944changing the word \fImagic\fR to \fIfirst\fR on every line which began
1945with the letter \fIa\fR, changing \fImagic\fR to \fIsecond\fR on every line
1946which began with the letter \fIb\fR, and changing
1947\fImagic\fR to \fIthird\fR on every line which began
1948with the letter \fIc\fR. All other words and all other lines
1949are left unchanged.
1950.PP
1951These rules are so simple that the easiest way
1952to do this job is with a flag:
1953.TS
1954center;
1955l l.
1956 int flag;
1957%%
1958^a {flag = \(fma\(fm; ECHO;}
1959^b {flag = \(fmb\(fm; ECHO;}
1960^c {flag = \(fmc\(fm; ECHO;}
1961\en {flag = 0 ; ECHO;}
1962magic {
1963 switch (flag)
1964 {
1965 case \(fma\(fm: printf("first"); break;
1966 case \(fmb\(fm: printf("second"); break;
1967\11\11 case \(fmc\(fm: printf("third"); break;
1968 default: ECHO; break;
1969 }
1970 }
1971.TE
1972should be adequate.
1973.PP
1974To handle the same problem with start conditions, each
1975start condition must be introduced to Lex in the definitions section
1976with a line reading
1977.TS
1978center;
1979l l.
1980%Start name1 name2 ...
1981.TE
1982where the conditions may be named in any order.
1983The word \fIStart\fR may be abbreviated to \fIs\fR or \fIS\fR.
1984The conditions may be referenced at the
1985head of a rule with the <> brackets:
1986.TS
1987center;
1988l.
1989<name1>expression
1990.TE
1991is a rule which is only recognized when Lex is in the
1992start condition \fIname1\fR.
1993To enter a start condition,
1994execute the action statement
1995.TS
1996center;
1997l.
1998BEGIN name1;
1999.TE
2000which changes the start condition to \fIname1\fR.
2001To resume the normal state,
2002.TS
2003center;
2004l.
2005BEGIN 0;
2006.TE
2007resets the initial condition
2008of the Lex automaton interpreter.
2009A rule may be active in several
2010start conditions:
2011.TS
2012center;
2013l.
2014<name1,name2,name3>
2015.TE
2016is a legal prefix. Any rule not beginning with the
2017<> prefix operator is always active.
2018.PP
2019The same example as before can be written:
2020.TS
2021center;
2022l l.
2023%START AA BB CC
2024%%
2025^a {ECHO; BEGIN AA;}
2026^b {ECHO; BEGIN BB;}
2027^c {ECHO; BEGIN CC;}
2028\en {ECHO; BEGIN 0;}
2029<AA>magic printf("first");
2030<BB>magic printf("second");
2031<CC>magic printf("third");
2032.TE
2033where the logic is exactly the same as in the previous
2034method of handling the problem, but Lex does the work
2035rather than the user's code.
2036.2C
2037.NH
2038Character Set.
2039.PP
2040The programs generated by Lex handle
2041character I/O only through the routines
2042.I
2043input,
2044output,
2045.R
2046and
2047.I
2048unput.
2049.R
2050Thus the character representation
2051provided in these routines
2052is accepted by Lex and employed to return
2053values in
2054.I
2055yytext.
2056.R
2057For internal use
2058a character is represented as a small integer
2059which, if the standard library is used,
2060has a value equal to the integer value of the bit
2061pattern representing the character on the host computer.
2062Normally, the letter
2063.I a
2064is represented as the same form as the character constant
2065.I \(fma\(fm .
2066If this interpretation is changed, by providing I/O
2067routines which translate the characters,
2068Lex must be told about
2069it, by giving a translation table.
2070This table must be in the definitions section,
2071and must be bracketed by lines containing only
2072``%T''.
2073The table contains lines of the form
2074.TS
2075center;
2076l.
2077{integer} {character string}
2078.TE
2079which indicate the value associated with each character.
2080Thus the next example
2081.GS 2
2082.TS
2083center;
2084l l.
2085%T
2086 1 Aa
2087 2 Bb
2088\&...
208926 Zz
209027 \en
209128 +
209229 \-
209330 0
209431 1
2095\&...
209639 9
2097%T
2098.TE
2099.sp
2100.ce 1
2101Sample character table.
2102.GE
2103maps the lower and upper case letters together into the integers 1 through 26,
2104newline into 27, + and \- into 28 and 29, and the
2105digits into 30 through 39.
2106Note the escape for newline.
2107If a table is supplied, every character that is to appear either
2108in the rules or in any valid input must be included
2109in the table.
2110No character
2111may be assigned the number 0, and no character may be
2112assigned a bigger number than the size of the hardware character set.
2113.2C
2114.NH
2115Summary of Source Format.
2116.PP
2117The general form of a Lex source file is:
2118.TS
2119center;
2120l.
2121{definitions}
2122%%
2123{rules}
2124%%
2125{user subroutines}
2126.TE
2127The definitions section contains
2128a combination of
2129.IP 1)
2130Definitions, in the form ``name space translation''.
2131.IP 2)
2132Included code, in the form ``space code''.
2133.IP 3)
2134Included code, in the form
2135.TS
2136center;
2137l.
2138%{
2139code
2140%}
2141.TE
2142.ns
2143.IP 4)
2144Start conditions, given in the form
2145.TS
2146center;
2147l.
2148%S name1 name2 ...
2149.TE
2150.ns
2151.IP 5)
2152Character set tables, in the form
2153.TS
2154center;
2155l.
2156%T
2157number space character-string
2158\&...
2159%T
2160.TE
2161.ns
2162.IP 6)
2163Changes to internal array sizes, in the form
2164.TS
2165center;
2166l.
2167%\fIx\fR\0\0\fInnn\fR
2168.TE
2169where \fInnn\fR is a decimal integer representing an array size
2170and \fIx\fR selects the parameter as follows:
2171.TS
2172center;
2173c c
2174c l.
2175Letter Parameter
2176p positions
2177n states
2178e tree nodes
2179a transitions
2180k packed character classes
2181o output array size
2182.TE
2183.LP
2184Lines in the rules section have the form ``expression action''
2185where the action may be continued on succeeding
2186lines by using braces to delimit it.
2187.PP
2188Regular expressions in Lex use the following
2189operators:
2190.br
2191.TS
2192center;
2193l l.
2194x the character "x"
2195"x" an "x", even if x is an operator.
2196\ex an "x", even if x is an operator.
2197[xy] the character x or y.
2198[x\-z] the characters x, y or z.
2199[^x] any character but x.
2200\&. any character but newline.
2201^x an x at the beginning of a line.
2202<y>x an x when Lex is in start condition y.
2203x$ an x at the end of a line.
2204x? an optional x.
2205x\(** 0,1,2, ... instances of x.
2206x+ 1,2,3, ... instances of x.
2207x|y an x or a y.
2208(x) an x.
2209x/y an x but only if followed by y.
5652ef93
KM
2210{xx} the translation of xx from the
2211 definitions section.
b1716fed
KM
2212x{m,n} \fIm\fR through \fIn\fR occurrences of x
2213.TE
2214.NH
2215Caveats and Bugs.
2216.PP
2217There are pathological expressions which
2218produce exponential growth of the tables when
2219converted to deterministic machines;
2220fortunately, they are rare.
2221.PP
2222REJECT does not rescan the input; instead it remembers the results of the previous
2223scan. This means that if a rule with trailing context is found, and
2224REJECT executed, the user
2225must not have used
2226.ul
2227unput
2228to change the characters forthcoming
2229from the input stream.
2230This is the only restriction on the user's ability to manipulate
2231the not-yet-processed input.
2232.PP
2233.2C
2234.NH
2235Acknowledgments.
2236.PP
2237As should
2238be obvious from the above, the outside of Lex
2239is patterned
2240on Yacc and the inside on Aho's string matching routines.
2241Therefore, both S. C. Johnson and A. V. Aho
2242are really originators
2243of much of Lex,
2244as well as debuggers of it.
2245Many thanks are due to both.
2246.PP
2247The code of the current version of Lex was designed, written,
2248and debugged by Eric Schmidt.
2249.SG MH-1274-MEL-unix
2250.sp 1
2251.2C
2252.NH
2253References.
2254.SP 1v
2255.IP 1.
2256B. W. Kernighan and D. M. Ritchie,
2257.I
2258The C Programming Language,
2259.R
2260Prentice-Hall, N. J. (1978).
2261.IP 2.
2262B. W. Kernighan,
2263.I
2264Ratfor: A Preprocessor for a Rational Fortran,
2265.R
2266Software \- Practice and Experience,
2267\fB5\fR, pp. 395-496 (1975).
2268.IP 3.
2269S. C. Johnson,
2270.I
2271Yacc: Yet Another Compiler Compiler,
2272.R
2273Computing Science Technical Report No. 32,
22741975,
2275.MH
2276.if \n(tm (also TM 75-1273-6)
2277.IP 4.
2278A. V. Aho and M. J. Corasick,
2279.I
2280Efficient String Matching: An Aid to Bibliographic Search,
2281.R
2282Comm. ACM
2283.B
228418,
2285.R
2286333-340 (1975).
2287.IP 5.
2288B. W. Kernighan, D. M. Ritchie and K. L. Thompson,
2289.I
2290QED Text Editor,
2291.R
2292Computing Science Technical Report No. 5,
22931972,
2294.MH
2295.IP 6.
2296D. M. Ritchie,
2297private communication.
2298See also
2299M. E. Lesk,
2300.I
2301The Portable C Library,
2302.R
2303Computing Science Technical Report No. 31,
2304.MH
2305.if \n(tm (also TM 75-1274-11)