new doc.mk; document numbering changes for 4.4BSD
[unix-history] / usr / src / old / lex / PSD.doc / lex.ms
CommitLineData
3edcb7c8
KB
1.\" %sccs.include.proprietary.roff%
2.\"
c2eda2a7 3.\" @(#)lex.ms 6.4 (Berkeley) %G%
b1716fed 4.\"
c2eda2a7
KB
5.EH 'PSD:16-%''Lex \- A Lexical Analyzer Generator'
6.OH 'Lex \- A Lexical Analyzer Generator''PSD:16-%'
b1716fed
KM
7.hc ~
8.bd I 2
9.de TS
10.br
11.nf
12.SP 1v
13.ul 0
14..
15.de TE
16.SP 1v
17.fi
18..
5652ef93
KM
19.\".de PT
20.\".if \\n%>1 'tl ''\s7LEX\s0\s9\(mi%\s0''
21.\".if \\n%>1 'sp
22.\"..
b1716fed 23.ND July 21, 1975
5652ef93
KM
24.\".RP
25.\".TM 75-1274-15 39199 39199-11
b1716fed
KM
26.TL
27Lex \- A Lexical Analyzer ~Generator~
28.AU ``MH 2C-569'' 6377
29M. E. Lesk and E. Schmidt
30.AI
31.MH
32.AB
33.sp
34.bd I 2
5652ef93
KM
35.\".nr PS 8
36.\".nr VS 9
37.\".ps 8
38.\".vs 9p
b1716fed
KM
39Lex helps write programs whose control flow
40is directed by instances of regular
41expressions in the input stream.
42It is well suited for editor-script type transformations and
43for segmenting input in preparation for
44a parsing routine.
45.PP
46Lex source is a table of regular expressions and corresponding program fragments.
47The table is translated to a program
48which reads an input stream, copying it to an output stream
49and partitioning the input
50into strings which match the given expressions.
51As each such string is recognized the corresponding
52program fragment is executed.
53The recognition of the expressions
54is performed by a deterministic finite automaton
55generated by Lex.
56The program fragments written by the user are executed in the order in which the
57corresponding regular expressions occur in the input stream.
58.if n .if \n(tm .ig
59.PP
60The lexical analysis
61programs written with Lex accept ambiguous specifications
62and choose the longest
63match possible at each input point.
64If necessary, substantial look~ahead
65is performed on the input, but the
66input stream will be backed up to the
67end of the current partition, so that the user
68has general freedom to manipulate it.
69.PP
70Lex can generate analyzers in either C or Ratfor, a language
71which can be translated automatically to portable Fortran.
72It is available on the PDP-11 UNIX, Honeywell GCOS,
73and IBM OS systems.
74This manual, however, will only discuss generating analyzers
75in C on the UNIX system, which is the only supported
76form of Lex under UNIX Version 7.
77Lex is designed to simplify
78interfacing with Yacc, for those
79with access to this compiler-compiler system.
80..
5652ef93
KM
81.\".nr PS 9
82.\".nr VS 11
b1716fed 83.AE
b1716fed
KM
84.2C
85.NH
86Introduction.
87.PP
88Lex is a program generator designed for
89lexical processing of character input streams.
90It accepts a high-level, problem oriented specification
91for character string matching,
92and
93produces a program in a general purpose language which recognizes
94regular expressions.
95The regular expressions are specified by the user in the
96source specifications given to Lex.
97The Lex written code recognizes these expressions
98in an input stream and partitions the input stream into
99strings matching the expressions. At the bound~aries
100between strings
101program sections
102provided by the user are executed.
103The Lex source file associates the regular expressions and the
104program fragments.
105As each expression appears in the input to the program written by Lex,
106the corresponding fragment is executed.
107.PP
108.de MH
109Bell Laboratories, Murray Hill, NJ 07974.
110..
111The user supplies the additional code
112beyond expression matching
113needed to complete his tasks, possibly
114including code written by other generators.
115The program that recognizes the expressions is generated in the
116general purpose programming language employed for the
117user's program fragments.
118Thus, a high level expression
119language is provided to write the string expressions to be
120matched while the user's freedom to write actions
121is unimpaired.
122This avoids forcing the user who wishes to use a string manipulation
123language for input analysis to write processing programs in the same
124and often inappropriate string handling language.
125.PP
126Lex is not a complete language, but rather a generator representing
127a new language feature which can be added to
128different programming languages, called ``host languages.''
129Just as general purpose languages
130can produce code to run on different computer hardware,
131Lex can write code in different host languages.
132The host language is used for the output code generated by Lex
133and also for the program fragments added by the user.
134Compatible run-time libraries for the different host languages
135are also provided.
136This makes Lex adaptable to different environments and
137different users.
138Each application
139may be directed to the combination of hardware and host language appropriate
140to the task, the user's background, and the properties of local
141implementations.
142At present, the only supported host language is C,
143although Fortran (in the form of Ratfor [2] has been available
144in the past.
145Lex itself exists on UNIX, GCOS, and OS/370; but the
146code generated by Lex may be taken anywhere the appropriate
147compilers exist.
148.PP
149Lex turns the user's expressions and actions
150(called
151.ul
152source
153in this memo) into the host general-purpose language;
154the generated program is named
155.ul
156yylex.
157The
158.ul
159yylex
160program
161will recognize expressions
162in a stream
163(called
164.ul
165input
166in this memo)
167and perform the specified actions for each expression as it is detected.
168See Figure 1.
169.GS
170.TS
171center;
172l _ r
173l|c|r
174l _ r
175l _ r
176l|c|r
177l _ r
178c s s
179c s s.
180
181Source \(-> Lex \(-> yylex
182
183.sp 2
184
185Input \(-> yylex \(-> Output
186
187.sp
188An overview of Lex
b1716fed
KM
189Figure 1
190.TE
191.GE
192.PP
193For a trivial example, consider a program to delete
194from the input
195all blanks or tabs at the ends of lines.
196.TS
197center;
198l l.
199%%
200[ \et]+$ ;
201.TE
202is all that is required.
203The program
204contains a %% delimiter to mark the beginning of the rules, and
205one rule.
206This rule contains a regular expression
207which matches one or more
208instances of the characters blank or tab
209(written \et for visibility, in accordance with the C language convention)
210just prior to the end of a line.
211The brackets indicate the character
212class made of blank and tab; the + indicates ``one or more ...'';
213and the $ indicates ``end of line,'' as in QED.
214No action is specified,
215so the program generated by Lex (yylex) will ignore these characters.
216Everything else will be copied.
217To change any remaining
218string of blanks or tabs to a single blank,
219add another rule:
220.TS
221center;
222l l.
223%%
224[ \et]+$ ;
225[ \et]+ printf(" ");
226.TE
227The finite automaton generated for this
228source will scan for both rules at once,
229observing at
230the termination of the string of blanks or tabs
231whether or not there is a newline character, and executing
232the desired rule action.
233The first rule matches all strings of blanks or tabs
234at the end of lines, and the second
235rule all remaining strings of blanks or tabs.
236.PP
237Lex can be used alone for simple transformations, or
238for analysis and statistics gathering on a lexical level.
239Lex can also be used with a parser generator
240to perform the lexical analysis phase; it is particularly
241easy to interface Lex and Yacc [3].
242Lex programs recognize only regular expressions;
243Yacc writes parsers that accept a large class of context free grammars,
244but require a lower level analyzer to recognize input tokens.
245Thus, a combination of Lex and Yacc is often appropriate.
246When used as a preprocessor for a later parser generator,
247Lex is used to partition the input stream,
248and the parser generator assigns structure to
249the resulting pieces.
250The flow of control
251in such a case (which might be the first half of a compiler,
252for example) is shown in Figure 2.
253Additional programs,
254written by other generators
255or by hand, can
256be added easily to programs written by Lex.
257.BS 2
5652ef93
KM
258.ps 9
259.vs 11
b1716fed
KM
260.TS
261center;
262l c c c l
263l c c c l
264l c c c l
265l _ c _ l
266l|c|c|c|l
267l _ c _ l
268l c c c l
269l _ c _ l
270l|c|c|c|l
271l _ c _ l
272l c s s l
273l c s s l.
274 lexical grammar
275 rules rules
276 \(da \(da
277
278 Lex Yacc
279
280 \(da \(da
281
282Input \(-> yylex \(-> yyparse \(-> Parsed input
283
284.sp
285 Lex with Yacc
b1716fed
KM
286 Figure 2
287.TE
5652ef93
KM
288.ps 10
289.vs 12
b1716fed
KM
290.BE
291Yacc users
292will realize that the name
293.ul
294yylex
295is what Yacc expects its lexical analyzer to be named,
296so that the use of this name by Lex simplifies
297interfacing.
298.PP
299Lex generates a deterministic finite automaton from the regular expressions
300in the source [4].
301The automaton is interpreted, rather than compiled, in order
302to save space.
303The result is still a fast analyzer.
304In particular, the time taken by a Lex program
305to recognize and partition an input stream is
306proportional to the length of the input.
307The number of Lex rules or
308the complexity of the rules is
309not important in determining speed,
310unless rules which include
311forward context require a significant amount of re~scanning.
312What does increase with the number and complexity of rules
313is the size of the finite
314automaton, and therefore the size of the program
315generated by Lex.
316.PP
317In the program written by Lex, the user's fragments
318(representing the
319.ul
320actions
321to be performed as each regular expression
322is found)
323are gathered
324as cases of a switch.
325The automaton interpreter directs the control flow.
326Opportunity is provided for the user to insert either
327declarations or additional statements in the routine containing
328the actions, or to
329add subroutines outside this action routine.
330.PP
331Lex is not limited to source which can
332be interpreted on the basis of one character
333look~ahead.
334For example,
335if there are two rules, one looking for
336.I ab
337and another for
338.I abcdefg ,
339and the input stream is
340.I abcdefh ,
341Lex will recognize
342.I ab
343and leave
344the input pointer just before
345.I "cd. . ."
346Such backup is more costly
347than the processing of simpler languages.
348.2C
349.NH
350Lex Source.
351.PP
352The general format of Lex source is:
353.TS
354center;
355l.
356{definitions}
357%%
358{rules}
359%%
360{user subroutines}
361.TE
362where the definitions and the user subroutines
363are often omitted.
364The second
365.I %%
366is optional, but the first is required
367to mark the beginning of the rules.
368The absolute minimum Lex program is thus
369.TS
370center;
371l.
372%%
373.TE
374(no definitions, no rules) which translates into a program
375which copies the input to the output unchanged.
376.PP
377In the outline of Lex programs shown above, the
378.I
379rules
380.R
381represent the user's control
382decisions; they are a table, in which the left column
383contains
384.I
385regular expressions
386.R
387(see section 3)
388and the right column contains
389.I
390actions,
391.R
392program fragments to be executed when the expressions
393are recognized.
394Thus an individual rule might appear
395.TS
396center;
397l l.
398integer printf("found keyword INT");
399.TE
400to look for the string
401.I integer
402in the input stream and
403print the message ``found keyword INT'' whenever it appears.
404In this example the host procedural language is C and
405the C library function
406.I
407printf
408.R
409is used to print the string.
410The end
411of the expression is indicated by the first blank or tab character.
412If the action is merely a single C expression,
413it can just be given on the right side of the line; if it is
414compound, or takes more than a line, it should be enclosed in
415braces.
416As a slightly more useful example, suppose it is desired to
417change a number of words from British to American spelling.
418Lex rules such as
419.TS
420center;
421l l.
422colour printf("color");
423mechanise printf("mechanize");
424petrol printf("gas");
425.TE
426would be a start. These rules are not quite enough,
427since
428the word
429.I petroleum
430would become
431.I gaseum ;
432a way of dealing
433with this will be described later.
434.2C
435.NH
436Lex Regular Expressions.
437.PP
438The definitions of regular expressions are very similar to those
439in QED [5].
440A regular
441expression specifies a set of strings to be matched.
442It contains text characters (which match the corresponding
443characters in the strings being compared)
444and operator characters (which specify
445repetitions, choices, and other features).
446The letters of the alphabet and the digits are
447always text characters; thus the regular expression
448.TS
449center;
450l l.
451integer
452.TE
453matches the string
454.ul
455integer
456wherever it appears
457and the expression
458.TS
459center;
460l.
461a57D
462.TE
463looks for the string
464.ul
465a57D.
466.PP
467.I
468Operators.
469.R
470The operator characters are
471.TS
472center;
473l.
474" \e [ ] ^ \- ? . \(** + | ( ) $ / { } % < >
475.TE
476and if they are to be used as text characters, an escape
477should be used.
478The quotation mark operator (")
479indicates that whatever is contained between a pair of quotes
480is to be taken as text characters.
481Thus
482.TS
483center;
484l.
485xyz"++"
486.TE
487matches the string
488.I xyz++
489when it appears. Note that a part of a string may be quoted.
490It is harmless but unnecessary to quote an ordinary
491text character; the expression
492.TS
493center;
494l.
495"xyz++"
496.TE
497is the same as the one above.
498Thus by quoting every non-alphanumeric character
499being used as a text character, the user can avoid remembering
500the list above of current
501operator characters, and is safe should further extensions to Lex
502lengthen the list.
503.PP
504An operator character may also be turned into a text character
505by preceding it with \e as in
506.TS
507center;
508l.
509xyz\e+\e+
510.TE
511which
512is another, less readable, equivalent of the above expressions.
513Another use of the quoting mechanism is to get a blank into
514an expression; normally, as explained above, blanks or tabs end
515a rule.
516Any blank character not contained within [\|] (see below) must
517be quoted.
518Several normal C escapes with \e
519are recognized: \en is newline, \et is tab, and \eb is backspace.
520To enter \e itself, use \e\e.
521Since newline is illegal in an expression, \en must be used;
522it is not
523required to escape tab and backspace.
524Every character but blank, tab, newline and the list above is always
525a text character.
526.PP
527.I
528Character classes.
529.R
530Classes of characters can be specified using the operator pair [\|].
531The construction
532.I [abc]
533matches a
534single character, which may be
535.I a ,
536.I b ,
537or
538.I c .
539Within square brackets,
540most operator meanings are ignored.
541Only three characters are special:
542these are \e \(mi and ^. The \(mi character
543indicates ranges. For example,
544.TS
545center;
546l.
547[a\(miz0\(mi9<>_]
548.TE
549indicates the character class containing all the lower case letters,
550the digits,
551the angle brackets, and underline.
552Ranges may be given in either order.
553Using \(mi between any pair of characters which are
554not both upper case letters, both lower case letters, or both digits
555is implementation dependent and will get a warning message.
556(E.g., [0\-z] in ASCII is many more characters
557than it is in EBCDIC).
558If it is desired to include the
559character \(mi in a character class, it should be first or
560last; thus
561.TS
562center;
563l.
564[\(mi+0\(mi9]
565.TE
566matches all the digits and the two signs.
567.PP
568In character classes,
569the ^ operator must appear as the first character
570after the left bracket; it indicates that the resulting string
571is to be complemented with respect to the computer character set.
572Thus
573.TS
574center;
575l.
576[^abc]
577.TE
578matches all characters except a, b, or c, including
579all special or control characters; or
580.TS
581center;
582l.
583[^a\-zA\-Z]
584.TE
585is any character which is not a letter.
586The \e character provides the usual escapes within
587character class brackets.
588.PP
589.I
590Arbitrary character.
591.R
592To match almost any character, the operator character
593.TS
594center;
595l.
596\&.
597.TE
598is the class of all characters except newline.
599Escaping into octal is possible although non-portable:
600.TS
601center;
602l.
603[\e40\-\e176]
604.TE
605matches all printable characters in the ASCII character set, from octal
60640 (blank) to octal 176 (tilde).
607.PP
608.I
609Optional expressions.
610.R
611The operator
612.I ?
613indicates
614an optional element of an expression.
615Thus
616.TS
617center;
618l.
619ab?c
620.TE
621matches either
622.I ac
623or
624.I abc .
625.PP
626.I
627Repeated expressions.
628.R
629Repetitions of classes are indicated by the operators
630.I \(**
631and
632.I + .
633.TS
634center;
635l.
636\f2a\(**\f1
637.TE
638is any number of consecutive
639.I a
640characters, including zero; while
641.TS
642center;
643l.
644a+
645.TE
646is one or more instances of
647.I a.
648For example,
649.TS
650center;
651l.
652[a\-z]+
653.TE
654is all strings of lower case letters.
655And
656.TS
657center;
658l.
659[A\(miZa\(miz][A\(miZa\(miz0\(mi9]\(**
660.TE
661indicates all alphanumeric strings with a leading
662alphabetic character.
663This is a typical expression for recognizing identifiers in
664computer languages.
665.PP
666.I
667Alternation and Grouping.
668.R
669The operator |
670indicates alternation:
671.TS
672center;
673l.
674(ab\||\|cd)
675.TE
676matches either
677.ul
678ab
679or
680.ul
681cd.
682Note that parentheses are used for grouping, although
683they are
684not necessary on the outside level;
685.TS
686center;
687l.
688ab\||\|cd
689.TE
690would have sufficed.
691Parentheses
692can be used for more complex expressions:
693.TS
694center;
695l.
696(ab\||\|cd+)?(ef)\(**
697.TE
698matches such strings as
699.I abefef ,
700.I efefef ,
701.I cdef ,
702or
703.I cddd\| ;
704but not
705.I abc ,
706.I abcd ,
707or
708.I abcdef .
709.PP
710.I
711Context sensitivity.
712.R
713Lex will recognize a small amount of surrounding
714context. The two simplest operators for this are
715.I ^
716and
717.I $ .
718If the first character of an expression is
719.I ^ ,
720the expression will only be matched at the beginning
721of a line (after a newline character, or at the beginning of
722the input stream).
723This can never conflict with the other meaning of
724.I ^ ,
725complementation
726of character classes, since that only applies within
727the [\|] operators.
728If the very last character is
729.I $ ,
730the expression will only be matched at the end of a line (when
731immediately followed by newline).
732The latter operator is a special case of the
733.I /
734operator character,
735which indicates trailing context.
736The expression
737.TS
738center;
739l.
740ab/cd
741.TE
742matches the string
743.I ab ,
744but only if followed by
745.ul
746cd.
747Thus
748.TS
749center;
750l.
751ab$
752.TE
753is the same as
754.TS
755center;
756l.
757ab/\en
758.TE
759Left context is handled in Lex by
760.I
761start conditions
762.R
763as explained in section 10. If a rule is only to be executed
764when the Lex automaton interpreter is in start condition
765.I
766x,
767.R
768the rule should be prefixed by
769.TS
770center;
771l.
772<x>
773.TE
774using the angle bracket operator characters.
775If we considered ``being at the beginning of a line'' to be
776start condition
777.I ONE ,
778then the ^ operator
779would be equivalent to
780.TS
781center;
782l.
783<ONE>
784.TE
785Start conditions are explained more fully later.
786.PP
787.I
788Repetitions and Definitions.
789.R
790The operators {} specify
791either repetitions (if they enclose numbers)
792or
793definition expansion (if they enclose a name). For example
794.TS
795center;
796l.
797{digit}
798.TE
799looks for a predefined string named
800.I digit
801and inserts it
802at that point in the expression.
803The definitions are given in the first part of the Lex
804input, before the rules.
805In contrast,
806.TS
807center;
808l.
809a{1,5}
810.TE
811looks for 1 to 5 occurrences of
812.I a .
813.PP
814Finally, initial
815.I %
816is special, being the separator
817for Lex source segments.
818.2C
819.NH
820Lex Actions.
821.PP
822When an expression written as above is matched, Lex
823executes the corresponding action. This section describes
824some features of Lex which aid in writing actions. Note
825that there is a default action, which
826consists of copying the input to the output. This
827is performed on all strings not otherwise matched. Thus
828the Lex user who wishes to absorb the entire input, without
829producing any output, must provide rules to match everything.
830When Lex is being used with Yacc, this is the normal
831situation.
832One may consider that actions are what is done instead of
833copying the input to the output; thus, in general,
834a rule which merely copies can be omitted.
835Also, a character combination
836which is omitted from the rules
837and which appears as input
838is likely to be printed on the output, thus calling
839attention to the gap in the rules.
840.PP
841One of the simplest things that can be done is to ignore
842the input. Specifying a C null statement, \fI;\fR as an action
843causes this result. A frequent rule is
844.TS
845center;
846l l.
847[ \et\en] ;
848.TE
849which causes the three spacing characters (blank, tab, and newline)
850to be ignored.
851.PP
852Another easy way to avoid writing actions is the action character
853|, which indicates that the action for this rule is the action
854for the next rule.
855The previous example could also have been written
856.TS
857center;
858l l.
859" " |
860"\et" |
861"\en" ;
862.TE
863with the same result, although in different style.
864The quotes around \en and \et are not required.
865.PP
866In more complex actions, the user
867will
868often want to know the actual text that matched some expression
869like
870.I [a\(miz]+ .
871Lex leaves this text in an external character
872array named
873.I
874yytext.
875.R
876Thus, to print the name found,
877a rule like
878.TS
879center;
880l l.
881[a\-z]+ printf("%s", yytext);
882.TE
883will print
884the string in
885.I
886yytext.
887.R
888The C function
889.I
890printf
891.R
892accepts a format argument and data to be printed;
893in this case, the format is ``print string'' (% indicating
894data conversion, and
895.I s
896indicating string type),
897and the data are the characters
898in
899.I
900yytext.
901.R
902So this just places
903the matched string
904on the output.
905This action
906is so common that
907it may be written as ECHO:
908.TS
909center;
910l l.
911[a\-z]+ ECHO;
912.TE
913is the same as the above.
914Since the default action is just to
915print the characters found, one might ask why
916give a rule, like this one, which merely specifies
917the default action?
918Such rules are often required
919to avoid matching some other rule
920which is not desired. For example, if there is a rule
921which matches
922.I read
923it will normally match the instances of
924.I read
925contained in
926.I bread
927or
928.I readjust ;
929to avoid
930this,
931a rule
932of the form
933.I [a\(miz]+
934is needed.
935This is explained further below.
936.PP
937Sometimes it is more convenient to know the end of what
938has been found; hence Lex also provides a count
939.I
940yyleng
941.R
942of the number of characters matched.
943To count both the number
944of words and the number of characters in words in the input, the user might write
945.TS
946center;
947l l.
948[a\-zA\-Z]+ {words++; chars += yyleng;}
949.TE
950which accumulates in
951.ul
952chars
953the number
954of characters in the words recognized.
955The last character in the string matched can
956be accessed by
957.TS
958center;
959l.
960yytext[yyleng\-1]
961.TE
962.PP
963Occasionally, a Lex
964action may decide that a rule has not recognized the correct
965span of characters.
966Two routines are provided to aid with this situation.
967First,
968.I
969yymore()
970.R
971can be called to indicate that the next input expression recognized is to be
972tacked on to the end of this input. Normally,
973the next input string would overwrite the current
974entry in
975.I
976yytext.
977.R
978Second,
979.I
980yyless (n)
981.R
982may be called to indicate that not all the characters matched
983by the currently successful expression are wanted right now.
984The argument
985.I
986n
987.R
988indicates the number of characters
989in
990.I
991yytext
992.R
993to be retained.
994Further characters previously matched
995are
996returned to the input. This provides the same sort of
997look~ahead offered by the / operator,
998but in a different form.
999.PP
1000.I
1001Example:
1002.R
1003Consider a language which defines
1004a string as a set of characters between quotation (") marks, and provides that
1005to include a " in a string it must be preceded by a \e. The
1006regular expression which matches that is somewhat confusing,
1007so that it might be preferable to write
1008.TS
1009center;
1010l l.
1011\e"[^"]\(** {
1012 if (yytext[yyleng\-1] == \(fm\e\e\(fm)
1013 yymore();
1014 else
1015 ... normal user processing
1016 }
1017.TE
1018which will, when faced with a string such as
1019.I
1020"abc\e"def\|"
1021.R
1022first match
1023the five characters
1024\fI"abc\e\|\fR;
1025then
1026the call to
1027.I yymore()
1028will
1029cause the next part of the string,
1030\fI"def\|\fR,
1031to be tacked on the end.
1032Note that the final quote terminating the string should be picked
1033up in the code labeled ``normal processing''.
1034.PP
1035The function
1036.I
1037yyless()
1038.R
1039might be used to reprocess
1040text in various circumstances. Consider the C problem of distinguishing
1041the ambiguity of ``=\(mia''.
1042Suppose it is desired to treat this as ``=\(mi a''
1043but print a message. A rule might be
5652ef93
KM
1044.ps 9
1045.vs 11
b1716fed
KM
1046.TS
1047center;
1048l l.
1049=\(mi[a\-zA\-Z] {
5652ef93 1050 printf("Op (=\(mi) ambiguous\en");
b1716fed
KM
1051 yyless(yyleng\-1);
1052 ... action for =\(mi ...
1053 }
1054.TE
5652ef93
KM
1055.ps 10
1056.vs 12
b1716fed
KM
1057which prints a message, returns the letter after the
1058operator to the input stream, and treats the operator as ``=\(mi''.
1059Alternatively it might be desired to treat this as ``= \(mia''.
1060To do this, just return the minus
1061sign as well as the letter to the input:
5652ef93
KM
1062.ps 9
1063.vs 11
b1716fed
KM
1064.TS
1065center;
1066l l.
1067=\(mi[a\-zA\-Z] {
5652ef93 1068 printf("Op (=\(mi) ambiguous\en");
b1716fed
KM
1069 yyless(yyleng\-2);
1070 ... action for = ...
1071 }
1072.TE
5652ef93
KM
1073.ps 10
1074.vs 12
b1716fed
KM
1075will perform the other interpretation.
1076Note that the expressions for the two cases might more easily
1077be written
1078.TS
1079center;
1080l l.
1081=\(mi/[A\-Za\-z]
1082.TE
1083in the first case and
1084.TS
1085center;
1086l.
1087=/\-[A\-Za\-z]
1088.TE
1089in the second;
1090no backup would be required in the rule action.
1091It is not necessary to recognize the whole identifier
1092to observe the ambiguity.
1093The
1094possibility of ``=\(mi3'', however, makes
1095.TS
1096center;
1097l.
1098=\(mi/[^ \et\en]
1099.TE
1100a still better rule.
1101.PP
1102In addition to these routines, Lex also permits
1103access to the I/O routines
1104it uses.
1105They are:
1106.IP 1)
1107.I
1108input()
1109.R
1110which returns the next input character;
1111.IP 2)
1112.I
1113output(c)
1114.R
1115which writes the character
1116.I
1117c
1118.R
1119on the output; and
1120.IP 3)
1121.I
1122unput(c)
1123.R
1124pushes the character
1125.I
1126c
1127.R
1128back onto the input stream to be read later by
1129.I
1130input().
1131.R
1132.LP
1133By default these routines are provided as macro definitions,
1134but the user can override them and supply private versions.
1135These routines
1136define the relationship between external files and
1137internal characters, and must all be retained
1138or modified consistently.
1139They may be redefined, to
1140cause input or output to be transmitted to or from strange
1141places, including other programs or internal memory;
1142but the character set used must be consistent in all routines;
1143a value of zero returned by
1144.I
1145input
1146.R
1147must mean end of file; and
1148the relationship between
1149.I
1150unput
1151.R
1152and
1153.I
1154input
1155.R
1156must be retained
1157or the Lex look~ahead will not work.
1158Lex does not look ahead at all if it does not have to,
1159but every rule ending in
1160.ft I
1161+ \(** ?
1162.ft R
1163or
1164.ft I
1165$
1166.ft R
1167or containing
1168.ft I
1169/
1170.ft R
1171implies look~ahead.
1172Look~ahead is also necessary to match an expression that is a prefix
1173of another expression.
1174See below for a discussion of the character set used by Lex.
1175The standard Lex library imposes
1176a 100 character limit on backup.
1177.PP
1178Another Lex library routine that the user will sometimes want
1179to redefine is
1180.I
1181yywrap()
1182.R
1183which is called whenever Lex reaches an end-of-file.
1184If
1185.I
1186yywrap
1187.R
1188returns a 1, Lex continues with the normal wrapup on end of input.
1189Sometimes, however, it is convenient to arrange for more
1190input to arrive
1191from a new source.
1192In this case, the user should provide
1193a
1194.I
1195yywrap
1196.R
1197which
1198arranges for new input and
1199returns 0. This instructs Lex to continue processing.
1200The default
1201.I
1202yywrap
1203.R
1204always returns 1.
1205.PP
1206This routine is also a convenient place
1207to print tables, summaries, etc. at the end
1208of a program. Note that it is not
1209possible to write a normal rule which recognizes
1210end-of-file; the only access to this condition is
1211through
1212.I
1213yywrap.
1214.R
1215In fact, unless a private version of
1216.I
1217input()
1218.R
1219is supplied
1220a file containing nulls
1221cannot be handled,
1222since a value of 0 returned by
1223.I
1224input
1225.R
1226is taken to be end-of-file.
1227.PP
1228.2C
1229.NH
1230Ambiguous Source Rules.
1231.PP
1232Lex can handle ambiguous specifications.
1233When more than one expression can match the
1234current input, Lex chooses as follows:
1235.IP 1)
1236The longest match is preferred.
1237.IP 2)
1238Among rules which matched the same number of characters,
1239the rule given first is preferred.
1240.LP
1241Thus, suppose the rules
1242.TS
1243center;
1244l l.
1245integer keyword action ...;
1246[a\-z]+ identifier action ...;
1247.TE
1248to be given in that order. If the input is
1249.I integers ,
1250it is taken as an identifier, because
1251.I [a\-z]+
1252matches 8 characters while
1253.I integer
1254matches only 7.
1255If the input is
1256.I integer ,
1257both rules match 7 characters, and
1258the keyword rule is selected because it was given first.
1259Anything shorter (e.g. \fIint\fR\|) will
1260not match the expression
1261.I integer
1262and so the identifier interpretation is used.
1263.PP
1264The principle of preferring the longest
1265match makes rules containing
1266expressions like
1267.I \&.\(**
1268dangerous.
1269For example,
1270.TS
1271center;
1272l.
1273\&\(fm.\(**\(fm
1274.TE
1275might seem a good way of recognizing
1276a string in single quotes.
1277But it is an invitation for the program to read far
1278ahead, looking for a distant
1279single quote.
1280Presented with the input
1281.TS
1282center;
1283l l.
1284\&\(fmfirst\(fm quoted string here, \(fmsecond\(fm here
1285.TE
1286the above expression will match
1287.TS
1288center;
1289l l.
1290\&\(fmfirst\(fm quoted string here, \(fmsecond\(fm
1291.TE
1292which is probably not what was wanted.
1293A better rule is of the form
1294.TS
1295center;
1296l.
1297\&\(fm[^\(fm\en]\(**\(fm
1298.TE
1299which, on the above input, will stop
1300after
1301.I \(fmfirst\(fm .
1302The consequences
1303of errors like this are mitigated by the fact
1304that the
1305.I \&.
1306operator will not match newline.
1307Thus expressions like
1308.I \&.\(**
1309stop on the
1310current line.
1311Don't try to defeat this with expressions like
9c343507 1312.I (.|\en)+
b1716fed
KM
1313or
1314equivalents;
1315the Lex generated program will try to read
1316the entire input file, causing
1317internal buffer overflows.
1318.PP
1319Note that Lex is normally partitioning
1320the input stream, not searching for all possible matches
1321of each expression.
1322This means that each character is accounted for
1323once and only once.
1324For example, suppose it is desired to
1325count occurrences of both \fIshe\fR and \fIhe\fR in an input text.
1326Some Lex rules to do this might be
1327.TS
1328center;
1329l l.
1330she s++;
1331he h++;
1332\en |
1333\&. ;
1334.TE
1335where the last two rules ignore everything besides \fIhe\fR and \fIshe\fR.
1336Remember that . does not include newline.
1337Since \fIshe\fR includes \fIhe\fR, Lex will normally
1338.I
1339not
1340.R
1341recognize
1342the instances of \fIhe\fR included in \fIshe\fR,
1343since once it has passed a \fIshe\fR those characters are gone.
1344.PP
1345Sometimes the user would like to override this choice. The action
1346REJECT
1347means ``go do the next alternative.''
1348It causes whatever rule was second choice after the current
1349rule to be executed.
1350The position of the input pointer is adjusted accordingly.
1351Suppose the user really wants to count the included instances of \fIhe\fR:
1352.TS
1353center;
1354l l.
1355she {s++; REJECT;}
1356he {h++; REJECT;}
1357\en |
1358\&. ;
1359.TE
1360these rules are one way of changing the previous example
1361to do just that.
1362After counting each expression, it is rejected; whenever appropriate,
1363the other expression will then be counted. In this example, of course,
1364the user could note that \fIshe\fR includes \fIhe\fR but not
1365vice versa, and omit the REJECT action on \fIhe\fR;
1366in other cases, however, it
1367would not be possible a priori to tell
1368which input characters
1369were in both classes.
1370.PP
1371Consider the two rules
1372.TS
1373center;
1374l l.
1375a[bc]+ { ... ; REJECT;}
1376a[cd]+ { ... ; REJECT;}
1377.TE
1378If the input is
1379.I ab ,
1380only the first rule matches,
1381and on
1382.I ad
1383only the second matches.
1384The input string
1385.I accb
1386matches the first rule for four characters
1387and then the second rule for three characters.
1388In contrast, the input
1389.I accd
1390agrees with
1391the second rule for four characters and then the first
1392rule for three.
1393.PP
1394In general, REJECT is useful whenever
1395the purpose of Lex is not to partition the input
1396stream but to detect all examples of some items
1397in the input, and the instances of these items
1398may overlap or include each other.
1399Suppose a digram table of the input is desired;
1400normally the digrams overlap, that is the word
1401.I the
1402is considered to contain
1403both
1404.I th
1405and
1406.I he .
1407Assuming a two-dimensional array named
1408.ul
1409digram
1410to be incremented, the appropriate
1411source is
1412.TS
1413center;
1414l l.
1415%%
5652ef93
KM
1416[a\-z][a\-z] {
1417 digram[yytext[0]][yytext[1]]++;
1418 REJECT;
1419 }
1420\. ;
b1716fed
KM
1421\en ;
1422.TE
1423where the REJECT is necessary to pick up
1424a letter pair beginning at every character, rather than at every
1425other character.
1426.2C
1427.NH
1428Lex Source Definitions.
1429.PP
1430Remember the format of the Lex
1431source:
1432.TS
1433center;
1434l.
1435{definitions}
1436%%
1437{rules}
1438%%
1439{user routines}
1440.TE
1441So far only the rules have been described. The user needs
1442additional options,
1443though, to define variables for use in his program and for use
1444by Lex.
1445These can go either in the definitions section
1446or in the rules section.
1447.PP
1448Remember that Lex is turning the rules into a program.
1449Any source not intercepted by Lex is copied
1450into the generated program. There are three classes
1451of such things.
1452.IP 1)
1453Any line which is not part of a Lex rule or action
1454which begins with a blank or tab is copied into
1455the Lex generated program.
1456Such source input prior to the first %% delimiter will be external
1457to any function in the code; if it appears immediately after the first
1458%%,
1459it appears in an appropriate place for declarations
1460in the function written by Lex which contains the actions.
1461This material must look like program fragments,
1462and should precede the first Lex rule.
1463.IP
1464As a side effect of the above, lines which begin with a blank
1465or tab, and which contain a comment,
1466are passed through to the generated program.
1467This can be used to include comments in either the Lex source or
1468the generated code. The comments should follow the host
1469language convention.
1470.IP 2)
1471Anything included between lines containing
1472only
1473.I %{
1474and
1475.I %}
1476is
1477copied out as above. The delimiters are discarded.
1478This format permits entering text like preprocessor statements that
1479must begin in column 1,
1480or copying lines that do not look like programs.
1481.IP 3)
1482Anything after the third %% delimiter, regardless of formats, etc.,
1483is copied out after the Lex output.
1484.PP
1485Definitions intended for Lex are given
1486before the first %% delimiter. Any line in this section
1487not contained between %{ and %}, and begining
1488in column 1, is assumed to define Lex substitution strings.
1489The format of such lines is
1490.TS
1491center;
1492l l.
1493name translation
1494.TE
1495and it
1496causes the string given as a translation to
1497be associated with the name.
1498The name and translation
1499must be separated by at least one blank or tab, and the name must begin with a letter.
1500The translation can then be called out
1501by the {name} syntax in a rule.
1502Using {D} for the digits and {E} for an exponent field,
1503for example, might abbreviate rules to recognize numbers:
1504.TS
1505center;
1506l l.
1507D [0\-9]
1508E [DEde][\-+]?{D}+
1509%%
1510{D}+ printf("integer");
1511{D}+"."{D}\(**({E})? |
1512{D}\(**"."{D}+({E})? |
1513{D}+{E} printf("real");
1514.TE
1515Note the first two rules for real numbers;
1516both require a decimal point and contain
1517an optional exponent field,
1518but the first requires at least one digit before the
1519decimal point and the second requires at least one
1520digit after the decimal point.
1521To correctly handle the problem
1522posed by a Fortran expression such as
1523.I 35.EQ.I ,
1524which does not contain a real number, a context-sensitive
1525rule such as
1526.TS
1527center;
1528l l.
1529[0\-9]+/"."EQ printf("integer");
1530.TE
1531could be used in addition to the normal rule for integers.
1532.PP
1533The definitions
1534section may also contain other commands, including the
1535selection of a host language, a character set table,
1536a list of start conditions, or adjustments to the default
1537size of arrays within Lex itself for larger source programs.
1538These possibilities
1539are discussed below under ``Summary of Source Format,''
1540section 12.
1541.2C
1542.NH
1543Usage.
1544.PP
1545There are two steps in
1546compiling a Lex source program.
1547First, the Lex source must be turned into a generated program
1548in the host general purpose language.
1549Then this program must be compiled and loaded, usually with
1550a library of Lex subroutines.
1551The generated program
1552is on a file named
1553.I lex.yy.c .
1554The I/O library is defined in terms of the C standard
1555library [6].
1556.PP
1557The C programs generated by Lex are slightly different
1558on OS/370, because the
1559OS compiler is less powerful than the UNIX or GCOS compilers,
1560and does less at compile time.
1561C programs generated on GCOS and UNIX are the same.
1562.PP
1563.I
1564UNIX.
1565.R
1566The library is accessed by the loader flag
1567.I \-ll .
1568So an appropriate
1569set of commands is
1570.KS
1571.in 5
1572lex source
1573cc lex.yy.c \-ll
1574.in 0
1575.KE
1576The resulting program is placed on the usual file
1577.I
1578a.out
1579.R
1580for later execution.
1581To use Lex with Yacc see below.
1582Although the default Lex I/O routines use the C standard library,
1583the Lex automata themselves do not do so;
1584if private versions of
1585.I
1586input,
1587output
1588.R
1589and
1590.I unput
1591are given, the library can be avoided.
1592.PP
1593.2C
1594.NH
1595Lex and Yacc.
1596.PP
1597If you want to use Lex with Yacc, note that what Lex writes is a program
1598named
1599.I
1600yylex(),
1601.R
1602the name required by Yacc for its analyzer.
1603Normally, the default main program on the Lex library
1604calls this routine, but if Yacc is loaded, and its main
1605program is used, Yacc will call
1606.I
1607yylex().
1608.R
1609In this case each Lex rule should end with
1610.TS
1611center;
1612l.
1613return(token);
1614.TE
1615where the appropriate token value is returned.
1616An easy way to get access
1617to Yacc's names for tokens is to
1618compile the Lex output file as part of
1619the Yacc output file by placing the line
1620.TS
1621center;
1622l.
1623# include "lex.yy.c"
1624.TE
1625in the last section of Yacc input.
1626Supposing the grammar to be
1627named ``good'' and the lexical rules to be named ``better''
1628the UNIX command sequence can just be:
1629.TS
1630center;
1631l.
1632yacc good
1633lex better
1634cc y.tab.c \-ly \-ll
1635.TE
1636The Yacc library (\-ly) should be loaded before the Lex library,
1637to obtain a main program which invokes the Yacc parser.
1638The generations of Lex and Yacc programs can be done in
1639either order.
1640.2C
1641.NH
1642Examples.
1643.PP
1644As a trivial problem, consider copying an input file while
1645adding 3 to every positive number divisible by 7.
1646Here is a suitable Lex source program
1647.TS
1648center;
1649l l.
1650%%
1651 int k;
1652[0\-9]+ {
1653 k = atoi(yytext);
1654 if (k%7 == 0)
1655 printf("%d", k+3);
1656 else
1657 printf("%d",k);
1658 }
1659.TE
1660to do just that.
1661The rule [0\-9]+ recognizes strings of digits;
1662.I
1663atoi
1664.R
1665converts the digits to binary
1666and stores the result in
1667.ul
1668k.
1669The operator % (remainder) is used to check whether
1670.ul
1671k
1672is divisible by 7; if it is,
1673it is incremented by 3 as it is written out.
1674It may be objected that this program will alter such
1675input items as
1676.I 49.63
1677or
1678.I X7 .
1679Furthermore, it increments the absolute value
1680of all negative numbers divisible by 7.
1681To avoid this, just add a few more rules after the active one,
1682as here:
1683.TS
1684center;
1685l l.
1686%%
1687 int k;
1688\-?[0\-9]+ {
1689 k = atoi(yytext);
5652ef93
KM
1690 printf("%d",
1691 k%7 == 0 ? k+3 : k);
b1716fed
KM
1692 }
1693\-?[0\-9.]+ ECHO;
1694[A-Za-z][A-Za-z0-9]+ ECHO;
1695.TE
1696Numerical strings containing
1697a ``.'' or preceded by a letter will be picked up by
1698one of the last two rules, and not changed.
1699The
1700.I if\-else
1701has been replaced by
1702a C conditional expression to save space;
1703the form
1704.ul
1705a?b:c
1706means ``if
1707.I a
1708then
1709.I b
1710else
1711.I c ''.
1712.PP
1713For an example of statistics gathering, here
1714is a program which histograms the lengths
1715of words, where a word is defined as a string of letters.
1716.TS
1717center;
1718l l.
1719 int lengs[100];
1720%%
1721[a\-z]+ lengs[yyleng]++;
1722\&. |
1723\en ;
1724%%
1725.T&
1726l s.
1727yywrap()
1728{
1729int i;
1730printf("Length No. words\en");
1731for(i=0; i<100; i++)
1732 if (lengs[i] > 0)
1733 printf("%5d%10d\en",i,lengs[i]);
1734return(1);
1735}
1736.TE
1737This program
1738accumulates the histogram, while producing no output. At the end
1739of the input it prints the table.
1740The final statement
1741.I
1742return(1);
1743.R
1744indicates that Lex is to perform wrapup. If
1745.I
1746yywrap
1747.R
1748returns zero (false)
1749it implies that further input is available
1750and the program is
1751to continue reading and processing.
1752To provide a
1753.I
1754yywrap
1755.R
1756that never
1757returns true causes an infinite loop.
1758.PP
1759As a larger example,
1760here are some parts of a program written by N. L. Schryer
1761to convert double precision Fortran to single precision Fortran.
1762Because Fortran does not distinguish upper and lower case letters,
1763this routine begins by defining a set of classes including
1764both cases of each letter:
1765.TS
1766center;
1767l l.
1768a [aA]
1769b [bB]
1770c [cC]
1771\&...
1772z [zZ]
1773.TE
1774An additional class recognizes white space:
1775.TS
1776center;
1777l l.
1778W [ \et]\(**
1779.TE
1780The first rule changes
1781``double precision'' to ``real'', or ``DOUBLE PRECISION'' to ``REAL''.
1782.TS
1783center;
1784l.
1785{d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} {
1786 printf(yytext[0]==\(fmd\(fm? "real" : "REAL");
1787 }
1788.TE
1789Care is taken throughout this program to preserve the case
1790(upper or lower)
1791of the original program.
1792The conditional operator is used to
1793select the proper form of the keyword.
1794The next rule copies continuation card indications to
1795avoid confusing them with constants:
1796.TS
1797center;
1798l l.
1799^" "[^ 0] ECHO;
1800.TE
1801In the regular expression, the quotes surround the
1802blanks.
1803It is interpreted as
1804``beginning of line, then five blanks, then
1805anything but blank or zero.''
1806Note the two different meanings of
1807.I ^ .
1808There follow some rules to change double precision
1809constants to ordinary floating constants.
1810.TS
1811center;
1812l.
1813[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+ |
1814[0\-9]+{W}"."{W}{d}{W}[+\-]?{W}[0\-9]+ |
1815"."{W}[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+ {
1816 /\(** convert constants \(**/
1817 for(p=yytext; \(**p != 0; p++)
1818 {
1819 if (\(**p == \(fmd\(fm || \(**p == \(fmD\(fm)
1820 \(**p=+ \(fme\(fm\- \(fmd\(fm;
1821 ECHO;
1822 }
1823.TE
1824After the floating point constant is recognized, it is
1825scanned by the
1826.ul
1827for
1828loop
1829to find the letter
1830.I d
1831or
1832.I D .
1833The program than adds
1834.c
1835.I \(fme\(fm\-\(fmd\(fm ,
1836which converts
1837it to the next letter of the alphabet.
1838The modified constant, now single-precision,
1839is written out again.
1840There follow a series of names which must be respelled to remove
1841their initial \fId\fR.
1842By using the
1843array
1844.I
1845yytext
1846.R
1847the same action suffices for all the names (only a sample of
1848a rather long list is given here).
1849.TS
1850center;
1851l l.
1852{d}{s}{i}{n} |
1853{d}{c}{o}{s} |
1854{d}{s}{q}{r}{t} |
1855{d}{a}{t}{a}{n} |
1856\&...
1857{d}{f}{l}{o}{a}{t} printf("%s",yytext+1);
1858.TE
1859Another list of names must have initial \fId\fR changed to initial \fIa\fR:
1860.TS
1861center;
1862l l.
1863{d}{l}{o}{g} |
1864{d}{l}{o}{g}10 |
1865{d}{m}{i}{n}1 |
1866{d}{m}{a}{x}1 {
1867 yytext[0] =+ \(fma\(fm \- \(fmd\(fm;
1868 ECHO;
1869 }
1870.TE
1871And one routine
1872must have initial \fId\fR changed to initial \fIr\fR:
1873.TS
1874center;
1875l l.
1876{d}1{m}{a}{c}{h} {yytext[0] =+ \(fmr\(fm \- \(fmd\(fm;
1877 ECHO;
1878 }
1879.TE
1880To avoid such names as \fIdsinx\fR being detected as instances
1881of \fIdsin\fR, some final rules pick up longer words as identifiers
1882and copy some surviving characters:
1883.TS
1884center;
1885l l.
1886[A\-Za\-z][A\-Za\-z0\-9]\(** |
1887[0\-9]+ |
1888\en |
1889\&. ECHO;
1890.TE
1891Note that this program is not complete; it
1892does not deal with the spacing problems in Fortran or
1893with the use of keywords as identifiers.
1894.br
1895.2C
1896.NH
1897Left Context Sensitivity.
1898.PP
1899Sometimes
1900it is desirable to have several sets of lexical rules
1901to be applied at different times in the input.
1902For example, a compiler preprocessor might distinguish
1903preprocessor statements and analyze them differently
1904from ordinary statements.
1905This requires
1906sensitivity
1907to prior context, and there are several ways of handling
1908such problems.
1909The \fI^\fR operator, for example, is a prior context operator,
1910recognizing immediately preceding left context just as \fI$\fR recognizes
1911immediately following right context.
1912Adjacent left context could be extended, to produce a facility similar to
1913that for adjacent right context, but it is unlikely
1914to be as useful, since often the relevant left context
1915appeared some time earlier, such as at the beginning of a line.
1916.PP
1917This section describes three means of dealing
1918with different environments: a simple use of flags,
1919when only a few rules change from one environment to another,
1920the use of
1921.I
1922start conditions
1923.R
1924on rules,
1925and the possibility of making multiple lexical analyzers all run
1926together.
1927In each case, there are rules which recognize the need to change the
1928environment in which the
1929following input text is analyzed, and set some parameter
1930to reflect the change. This may be a flag explicitly tested by
1931the user's action code; such a flag is the simplest way of dealing
1932with the problem, since Lex is not involved at all.
1933It may be more convenient,
1934however,
1935to have Lex remember the flags as initial conditions on the rules.
1936Any rule may be associated with a start condition. It will only
1937be recognized when Lex is in
1938that start condition.
1939The current start condition may be changed at any time.
1940Finally, if the sets of rules for the different environments
1941are very dissimilar,
1942clarity may be best achieved by writing several distinct lexical
1943analyzers, and switching from one to another as desired.
1944.PP
1945Consider the following problem: copy the input to the output,
1946changing the word \fImagic\fR to \fIfirst\fR on every line which began
1947with the letter \fIa\fR, changing \fImagic\fR to \fIsecond\fR on every line
1948which began with the letter \fIb\fR, and changing
1949\fImagic\fR to \fIthird\fR on every line which began
1950with the letter \fIc\fR. All other words and all other lines
1951are left unchanged.
1952.PP
1953These rules are so simple that the easiest way
1954to do this job is with a flag:
1955.TS
1956center;
1957l l.
1958 int flag;
1959%%
1960^a {flag = \(fma\(fm; ECHO;}
1961^b {flag = \(fmb\(fm; ECHO;}
1962^c {flag = \(fmc\(fm; ECHO;}
1963\en {flag = 0 ; ECHO;}
1964magic {
1965 switch (flag)
1966 {
1967 case \(fma\(fm: printf("first"); break;
1968 case \(fmb\(fm: printf("second"); break;
1969\11\11 case \(fmc\(fm: printf("third"); break;
1970 default: ECHO; break;
1971 }
1972 }
1973.TE
1974should be adequate.
1975.PP
1976To handle the same problem with start conditions, each
1977start condition must be introduced to Lex in the definitions section
1978with a line reading
1979.TS
1980center;
1981l l.
1982%Start name1 name2 ...
1983.TE
1984where the conditions may be named in any order.
1985The word \fIStart\fR may be abbreviated to \fIs\fR or \fIS\fR.
1986The conditions may be referenced at the
1987head of a rule with the <> brackets:
1988.TS
1989center;
1990l.
1991<name1>expression
1992.TE
1993is a rule which is only recognized when Lex is in the
1994start condition \fIname1\fR.
1995To enter a start condition,
1996execute the action statement
1997.TS
1998center;
1999l.
2000BEGIN name1;
2001.TE
2002which changes the start condition to \fIname1\fR.
2003To resume the normal state,
2004.TS
2005center;
2006l.
2007BEGIN 0;
2008.TE
2009resets the initial condition
2010of the Lex automaton interpreter.
2011A rule may be active in several
2012start conditions:
2013.TS
2014center;
2015l.
2016<name1,name2,name3>
2017.TE
2018is a legal prefix. Any rule not beginning with the
2019<> prefix operator is always active.
2020.PP
2021The same example as before can be written:
2022.TS
2023center;
2024l l.
2025%START AA BB CC
2026%%
2027^a {ECHO; BEGIN AA;}
2028^b {ECHO; BEGIN BB;}
2029^c {ECHO; BEGIN CC;}
2030\en {ECHO; BEGIN 0;}
2031<AA>magic printf("first");
2032<BB>magic printf("second");
2033<CC>magic printf("third");
2034.TE
2035where the logic is exactly the same as in the previous
2036method of handling the problem, but Lex does the work
2037rather than the user's code.
2038.2C
2039.NH
2040Character Set.
2041.PP
2042The programs generated by Lex handle
2043character I/O only through the routines
2044.I
2045input,
2046output,
2047.R
2048and
2049.I
2050unput.
2051.R
2052Thus the character representation
2053provided in these routines
2054is accepted by Lex and employed to return
2055values in
2056.I
2057yytext.
2058.R
2059For internal use
2060a character is represented as a small integer
2061which, if the standard library is used,
2062has a value equal to the integer value of the bit
2063pattern representing the character on the host computer.
2064Normally, the letter
2065.I a
2066is represented as the same form as the character constant
2067.I \(fma\(fm .
2068If this interpretation is changed, by providing I/O
2069routines which translate the characters,
2070Lex must be told about
2071it, by giving a translation table.
2072This table must be in the definitions section,
2073and must be bracketed by lines containing only
2074``%T''.
2075The table contains lines of the form
2076.TS
2077center;
2078l.
2079{integer} {character string}
2080.TE
2081which indicate the value associated with each character.
2082Thus the next example
2083.GS 2
2084.TS
2085center;
2086l l.
2087%T
2088 1 Aa
2089 2 Bb
2090\&...
209126 Zz
209227 \en
209328 +
209429 \-
209530 0
209631 1
2097\&...
209839 9
2099%T
2100.TE
2101.sp
2102.ce 1
2103Sample character table.
2104.GE
2105maps the lower and upper case letters together into the integers 1 through 26,
2106newline into 27, + and \- into 28 and 29, and the
2107digits into 30 through 39.
2108Note the escape for newline.
2109If a table is supplied, every character that is to appear either
2110in the rules or in any valid input must be included
2111in the table.
2112No character
2113may be assigned the number 0, and no character may be
2114assigned a bigger number than the size of the hardware character set.
2115.2C
2116.NH
2117Summary of Source Format.
2118.PP
2119The general form of a Lex source file is:
2120.TS
2121center;
2122l.
2123{definitions}
2124%%
2125{rules}
2126%%
2127{user subroutines}
2128.TE
2129The definitions section contains
2130a combination of
2131.IP 1)
2132Definitions, in the form ``name space translation''.
2133.IP 2)
2134Included code, in the form ``space code''.
2135.IP 3)
2136Included code, in the form
2137.TS
2138center;
2139l.
2140%{
2141code
2142%}
2143.TE
2144.ns
2145.IP 4)
2146Start conditions, given in the form
2147.TS
2148center;
2149l.
2150%S name1 name2 ...
2151.TE
2152.ns
2153.IP 5)
2154Character set tables, in the form
2155.TS
2156center;
2157l.
2158%T
2159number space character-string
2160\&...
2161%T
2162.TE
2163.ns
2164.IP 6)
2165Changes to internal array sizes, in the form
2166.TS
2167center;
2168l.
2169%\fIx\fR\0\0\fInnn\fR
2170.TE
2171where \fInnn\fR is a decimal integer representing an array size
2172and \fIx\fR selects the parameter as follows:
2173.TS
2174center;
2175c c
2176c l.
2177Letter Parameter
2178p positions
2179n states
2180e tree nodes
2181a transitions
2182k packed character classes
2183o output array size
2184.TE
2185.LP
2186Lines in the rules section have the form ``expression action''
2187where the action may be continued on succeeding
2188lines by using braces to delimit it.
2189.PP
2190Regular expressions in Lex use the following
2191operators:
2192.br
2193.TS
2194center;
2195l l.
2196x the character "x"
2197"x" an "x", even if x is an operator.
2198\ex an "x", even if x is an operator.
2199[xy] the character x or y.
2200[x\-z] the characters x, y or z.
2201[^x] any character but x.
2202\&. any character but newline.
2203^x an x at the beginning of a line.
2204<y>x an x when Lex is in start condition y.
2205x$ an x at the end of a line.
2206x? an optional x.
2207x\(** 0,1,2, ... instances of x.
2208x+ 1,2,3, ... instances of x.
2209x|y an x or a y.
2210(x) an x.
2211x/y an x but only if followed by y.
5652ef93
KM
2212{xx} the translation of xx from the
2213 definitions section.
b1716fed
KM
2214x{m,n} \fIm\fR through \fIn\fR occurrences of x
2215.TE
2216.NH
2217Caveats and Bugs.
2218.PP
2219There are pathological expressions which
2220produce exponential growth of the tables when
2221converted to deterministic machines;
2222fortunately, they are rare.
2223.PP
2224REJECT does not rescan the input; instead it remembers the results of the previous
2225scan. This means that if a rule with trailing context is found, and
2226REJECT executed, the user
2227must not have used
2228.ul
2229unput
2230to change the characters forthcoming
2231from the input stream.
2232This is the only restriction on the user's ability to manipulate
2233the not-yet-processed input.
2234.PP
2235.2C
2236.NH
2237Acknowledgments.
2238.PP
2239As should
2240be obvious from the above, the outside of Lex
2241is patterned
2242on Yacc and the inside on Aho's string matching routines.
2243Therefore, both S. C. Johnson and A. V. Aho
2244are really originators
2245of much of Lex,
2246as well as debuggers of it.
2247Many thanks are due to both.
2248.PP
2249The code of the current version of Lex was designed, written,
2250and debugged by Eric Schmidt.
2251.SG MH-1274-MEL-unix
2252.sp 1
2253.2C
2254.NH
2255References.
2256.SP 1v
2257.IP 1.
2258B. W. Kernighan and D. M. Ritchie,
2259.I
2260The C Programming Language,
2261.R
2262Prentice-Hall, N. J. (1978).
2263.IP 2.
2264B. W. Kernighan,
2265.I
2266Ratfor: A Preprocessor for a Rational Fortran,
2267.R
2268Software \- Practice and Experience,
2269\fB5\fR, pp. 395-496 (1975).
2270.IP 3.
2271S. C. Johnson,
2272.I
2273Yacc: Yet Another Compiler Compiler,
2274.R
2275Computing Science Technical Report No. 32,
22761975,
2277.MH
2278.if \n(tm (also TM 75-1273-6)
2279.IP 4.
2280A. V. Aho and M. J. Corasick,
2281.I
2282Efficient String Matching: An Aid to Bibliographic Search,
2283.R
2284Comm. ACM
2285.B
228618,
2287.R
2288333-340 (1975).
2289.IP 5.
2290B. W. Kernighan, D. M. Ritchie and K. L. Thompson,
2291.I
2292QED Text Editor,
2293.R
2294Computing Science Technical Report No. 5,
22951972,
2296.MH
2297.IP 6.
2298D. M. Ritchie,
2299private communication.
2300See also
2301M. E. Lesk,
2302.I
2303The Portable C Library,
2304.R
2305Computing Science Technical Report No. 31,
2306.MH
2307.if \n(tm (also TM 75-1274-11)