BSD 3 development
[unix-history] / usr / doc / lex
CommitLineData
2074ceed
BJ
1.hc ~
2.bd I 2
3.de TS
4.br
5.nf
6.SP 1v
7.ul 0
8..
9.de TE
10.SP 1v
11.fi
12..
13.de PT
14.if \\n%>1 'tl ''\s7LEX\s0\s9\(mi%\s0''
15.if \\n%>1 'sp
16..
17.ND July 21, 1975
18.RP
19.TM 75-1274-15 39199 39199-11
20.TL
21Lex \- A Lexical Analyzer ~Generator~
22.AU ``MH 2C-569'' 6377
23M. E. Lesk and E. Schmidt
24.AI
25.MH
26.AB
27.sp
28.bd I 2
29.nr PS 8
30.nr VS 9
31.ps 8
32.vs 9p
33Lex helps write programs whose control flow
34is directed by instances of regular
35expressions in the input stream.
36It is well suited for editor-script type transformations and
37for segmenting input in preparation for
38a parsing routine.
39.PP
40Lex source is a table of regular expressions and corresponding program fragments.
41The table is translated to a program
42which reads an input stream, copying it to an output stream
43and partitioning the input
44into strings which match the given expressions.
45As each such string is recognized the corresponding
46program fragment is executed.
47The recognition of the expressions
48is performed by a deterministic finite automaton
49generated by Lex.
50The program fragments written by the user are executed in the order in which the
51corresponding regular expressions occur in the input stream.
52.if n .if \n(tm .ig
53.PP
54The lexical analysis
55programs written with Lex accept ambiguous specifications
56and choose the longest
57match possible at each input point.
58If necessary, substantial look~ahead
59is performed on the input, but the
60input stream will be backed up to the
61end of the current partition, so that the user
62has general freedom to manipulate it.
63.PP
64Lex can generate analyzers in either C or Ratfor, a language
65which can be translated automatically to portable Fortran.
66It is available on the PDP-11 UNIX, Honeywell GCOS,
67and IBM OS systems.
68This manual, however, will only discuss generating analyzers
69in C on the UNIX system, which is the only supported
70form of Lex under UNIX Version 7.
71Lex is designed to simplify
72interfacing with Yacc, for those
73with access to this compiler-compiler system.
74..
75.nr PS 9
76.nr VS 11
77.AE
78.SH
79.ce 1
80Table of Contents
81.LP
82.ce 100
83.TS
84r 1l 2r .
851. Introduction. 1
862. Lex Source. 3
873. Lex Regular Expressions. 3
884. Lex Actions. 5
895. Ambiguous Source Rules. 7
906. Lex Source Definitions. 8
917. Usage. 8
928. Lex and Yacc. 9
939. Examples. 10
9410. Left Context Sensitivity. 11
9511. Character Set. 12
9612. Summary of Source Format. 12
9713. Caveats and Bugs. 13
9814. Acknowledgments. 13
9915. References. 13
100.TE
101.ce 0
102.2C
103.NH
104Introduction.
105.PP
106Lex is a program generator designed for
107lexical processing of character input streams.
108It accepts a high-level, problem oriented specification
109for character string matching,
110and
111produces a program in a general purpose language which recognizes
112regular expressions.
113The regular expressions are specified by the user in the
114source specifications given to Lex.
115The Lex written code recognizes these expressions
116in an input stream and partitions the input stream into
117strings matching the expressions. At the bound~aries
118between strings
119program sections
120provided by the user are executed.
121The Lex source file associates the regular expressions and the
122program fragments.
123As each expression appears in the input to the program written by Lex,
124the corresponding fragment is executed.
125.PP
126.de MH
127Bell Laboratories, Murray Hill, NJ 07974.
128..
129The user supplies the additional code
130beyond expression matching
131needed to complete his tasks, possibly
132including code written by other generators.
133The program that recognizes the expressions is generated in the
134general purpose programming language employed for the
135user's program fragments.
136Thus, a high level expression
137language is provided to write the string expressions to be
138matched while the user's freedom to write actions
139is unimpaired.
140This avoids forcing the user who wishes to use a string manipulation
141language for input analysis to write processing programs in the same
142and often inappropriate string handling language.
143.PP
144Lex is not a complete language, but rather a generator representing
145a new language feature which can be added to
146different programming languages, called ``host languages.''
147Just as general purpose languages
148can produce code to run on different computer hardware,
149Lex can write code in different host languages.
150The host language is used for the output code generated by Lex
151and also for the program fragments added by the user.
152Compatible run-time libraries for the different host languages
153are also provided.
154This makes Lex adaptable to different environments and
155different users.
156Each application
157may be directed to the combination of hardware and host language appropriate
158to the task, the user's background, and the properties of local
159implementations.
160At present, the only supported host language is C,
161although Fortran (in the form of Ratfor [2] has been available
162in the past.
163Lex itself exists on UNIX, GCOS, and OS/370; but the
164code generated by Lex may be taken anywhere the appropriate
165compilers exist.
166.PP
167Lex turns the user's expressions and actions
168(called
169.ul
170source
171in this memo) into the host general-purpose language;
172the generated program is named
173.ul
174yylex.
175The
176.ul
177yylex
178program
179will recognize expressions
180in a stream
181(called
182.ul
183input
184in this memo)
185and perform the specified actions for each expression as it is detected.
186See Figure 1.
187.GS
188.TS
189center;
190l _ r
191l|c|r
192l _ r
193l _ r
194l|c|r
195l _ r
196c s s
197c s s.
198
199Source \(-> Lex \(-> yylex
200
201.sp 2
202
203Input \(-> yylex \(-> Output
204
205.sp
206An overview of Lex
207.sp
208Figure 1
209.TE
210.GE
211.PP
212For a trivial example, consider a program to delete
213from the input
214all blanks or tabs at the ends of lines.
215.TS
216center;
217l l.
218%%
219[ \et]+$ ;
220.TE
221is all that is required.
222The program
223contains a %% delimiter to mark the beginning of the rules, and
224one rule.
225This rule contains a regular expression
226which matches one or more
227instances of the characters blank or tab
228(written \et for visibility, in accordance with the C language convention)
229just prior to the end of a line.
230The brackets indicate the character
231class made of blank and tab; the + indicates ``one or more ...'';
232and the $ indicates ``end of line,'' as in QED.
233No action is specified,
234so the program generated by Lex (yylex) will ignore these characters.
235Everything else will be copied.
236To change any remaining
237string of blanks or tabs to a single blank,
238add another rule:
239.TS
240center;
241l l.
242%%
243[ \et]+$ ;
244[ \et]+ printf(" ");
245.TE
246The finite automaton generated for this
247source will scan for both rules at once,
248observing at
249the termination of the string of blanks or tabs
250whether or not there is a newline character, and executing
251the desired rule action.
252The first rule matches all strings of blanks or tabs
253at the end of lines, and the second
254rule all remaining strings of blanks or tabs.
255.PP
256Lex can be used alone for simple transformations, or
257for analysis and statistics gathering on a lexical level.
258Lex can also be used with a parser generator
259to perform the lexical analysis phase; it is particularly
260easy to interface Lex and Yacc [3].
261Lex programs recognize only regular expressions;
262Yacc writes parsers that accept a large class of context free grammars,
263but require a lower level analyzer to recognize input tokens.
264Thus, a combination of Lex and Yacc is often appropriate.
265When used as a preprocessor for a later parser generator,
266Lex is used to partition the input stream,
267and the parser generator assigns structure to
268the resulting pieces.
269The flow of control
270in such a case (which might be the first half of a compiler,
271for example) is shown in Figure 2.
272Additional programs,
273written by other generators
274or by hand, can
275be added easily to programs written by Lex.
276.BS 2
277.TS
278center;
279l c c c l
280l c c c l
281l c c c l
282l _ c _ l
283l|c|c|c|l
284l _ c _ l
285l c c c l
286l _ c _ l
287l|c|c|c|l
288l _ c _ l
289l c s s l
290l c s s l.
291 lexical grammar
292 rules rules
293 \(da \(da
294
295 Lex Yacc
296
297 \(da \(da
298
299Input \(-> yylex \(-> yyparse \(-> Parsed input
300
301.sp
302 Lex with Yacc
303.sp
304 Figure 2
305.TE
306.BE
307Yacc users
308will realize that the name
309.ul
310yylex
311is what Yacc expects its lexical analyzer to be named,
312so that the use of this name by Lex simplifies
313interfacing.
314.PP
315Lex generates a deterministic finite automaton from the regular expressions
316in the source [4].
317The automaton is interpreted, rather than compiled, in order
318to save space.
319The result is still a fast analyzer.
320In particular, the time taken by a Lex program
321to recognize and partition an input stream is
322proportional to the length of the input.
323The number of Lex rules or
324the complexity of the rules is
325not important in determining speed,
326unless rules which include
327forward context require a significant amount of re~scanning.
328What does increase with the number and complexity of rules
329is the size of the finite
330automaton, and therefore the size of the program
331generated by Lex.
332.PP
333In the program written by Lex, the user's fragments
334(representing the
335.ul
336actions
337to be performed as each regular expression
338is found)
339are gathered
340as cases of a switch.
341The automaton interpreter directs the control flow.
342Opportunity is provided for the user to insert either
343declarations or additional statements in the routine containing
344the actions, or to
345add subroutines outside this action routine.
346.PP
347Lex is not limited to source which can
348be interpreted on the basis of one character
349look~ahead.
350For example,
351if there are two rules, one looking for
352.I ab
353and another for
354.I abcdefg ,
355and the input stream is
356.I abcdefh ,
357Lex will recognize
358.I ab
359and leave
360the input pointer just before
361.I "cd. . ."
362Such backup is more costly
363than the processing of simpler languages.
364.2C
365.NH
366Lex Source.
367.PP
368The general format of Lex source is:
369.TS
370center;
371l.
372{definitions}
373%%
374{rules}
375%%
376{user subroutines}
377.TE
378where the definitions and the user subroutines
379are often omitted.
380The second
381.I %%
382is optional, but the first is required
383to mark the beginning of the rules.
384The absolute minimum Lex program is thus
385.TS
386center;
387l.
388%%
389.TE
390(no definitions, no rules) which translates into a program
391which copies the input to the output unchanged.
392.PP
393In the outline of Lex programs shown above, the
394.I
395rules
396.R
397represent the user's control
398decisions; they are a table, in which the left column
399contains
400.I
401regular expressions
402.R
403(see section 3)
404and the right column contains
405.I
406actions,
407.R
408program fragments to be executed when the expressions
409are recognized.
410Thus an individual rule might appear
411.TS
412center;
413l l.
414integer printf("found keyword INT");
415.TE
416to look for the string
417.I integer
418in the input stream and
419print the message ``found keyword INT'' whenever it appears.
420In this example the host procedural language is C and
421the C library function
422.I
423printf
424.R
425is used to print the string.
426The end
427of the expression is indicated by the first blank or tab character.
428If the action is merely a single C expression,
429it can just be given on the right side of the line; if it is
430compound, or takes more than a line, it should be enclosed in
431braces.
432As a slightly more useful example, suppose it is desired to
433change a number of words from British to American spelling.
434Lex rules such as
435.TS
436center;
437l l.
438colour printf("color");
439mechanise printf("mechanize");
440petrol printf("gas");
441.TE
442would be a start. These rules are not quite enough,
443since
444the word
445.I petroleum
446would become
447.I gaseum ;
448a way of dealing
449with this will be described later.
450.2C
451.NH
452Lex Regular Expressions.
453.PP
454The definitions of regular expressions are very similar to those
455in QED [5].
456A regular
457expression specifies a set of strings to be matched.
458It contains text characters (which match the corresponding
459characters in the strings being compared)
460and operator characters (which specify
461repetitions, choices, and other features).
462The letters of the alphabet and the digits are
463always text characters; thus the regular expression
464.TS
465center;
466l l.
467integer
468.TE
469matches the string
470.ul
471integer
472wherever it appears
473and the expression
474.TS
475center;
476l.
477a57D
478.TE
479looks for the string
480.ul
481a57D.
482.PP
483.I
484Operators.
485.R
486The operator characters are
487.TS
488center;
489l.
490" \e [ ] ^ \- ? . \(** + | ( ) $ / { } % < >
491.TE
492and if they are to be used as text characters, an escape
493should be used.
494The quotation mark operator (")
495indicates that whatever is contained between a pair of quotes
496is to be taken as text characters.
497Thus
498.TS
499center;
500l.
501xyz"++"
502.TE
503matches the string
504.I xyz++
505when it appears. Note that a part of a string may be quoted.
506It is harmless but unnecessary to quote an ordinary
507text character; the expression
508.TS
509center;
510l.
511"xyz++"
512.TE
513is the same as the one above.
514Thus by quoting every non-alphanumeric character
515being used as a text character, the user can avoid remembering
516the list above of current
517operator characters, and is safe should further extensions to Lex
518lengthen the list.
519.PP
520An operator character may also be turned into a text character
521by preceding it with \e as in
522.TS
523center;
524l.
525xyz\e+\e+
526.TE
527which
528is another, less readable, equivalent of the above expressions.
529Another use of the quoting mechanism is to get a blank into
530an expression; normally, as explained above, blanks or tabs end
531a rule.
532Any blank character not contained within [\|] (see below) must
533be quoted.
534Several normal C escapes with \e
535are recognized: \en is newline, \et is tab, and \eb is backspace.
536To enter \e itself, use \e\e.
537Since newline is illegal in an expression, \en must be used;
538it is not
539required to escape tab and backspace.
540Every character but blank, tab, newline and the list above is always
541a text character.
542.PP
543.I
544Character classes.
545.R
546Classes of characters can be specified using the operator pair [\|].
547The construction
548.I [abc]
549matches a
550single character, which may be
551.I a ,
552.I b ,
553or
554.I c .
555Within square brackets,
556most operator meanings are ignored.
557Only three characters are special:
558these are \e \(mi and ^. The \(mi character
559indicates ranges. For example,
560.TS
561center;
562l.
563[a\(miz0\(mi9<>_]
564.TE
565indicates the character class containing all the lower case letters,
566the digits,
567the angle brackets, and underline.
568Ranges may be given in either order.
569Using \(mi between any pair of characters which are
570not both upper case letters, both lower case letters, or both digits
571is implementation dependent and will get a warning message.
572(E.g., [0\-z] in ASCII is many more characters
573than it is in EBCDIC).
574If it is desired to include the
575character \(mi in a character class, it should be first or
576last; thus
577.TS
578center;
579l.
580[\(mi+0\(mi9]
581.TE
582matches all the digits and the two signs.
583.PP
584In character classes,
585the ^ operator must appear as the first character
586after the left bracket; it indicates that the resulting string
587is to be complemented with respect to the computer character set.
588Thus
589.TS
590center;
591l.
592[^abc]
593.TE
594matches all characters except a, b, or c, including
595all special or control characters; or
596.TS
597center;
598l.
599[^a\-zA\-Z]
600.TE
601is any character which is not a letter.
602The \e character provides the usual escapes within
603character class brackets.
604.PP
605.I
606Arbitrary character.
607.R
608To match almost any character, the operator character
609.TS
610center;
611l.
612\&.
613.TE
614is the class of all characters except newline.
615Escaping into octal is possible although non-portable:
616.TS
617center;
618l.
619[\e40\-\e176]
620.TE
621matches all printable characters in the ASCII character set, from octal
62240 (blank) to octal 176 (tilde).
623.PP
624.I
625Optional expressions.
626.R
627The operator
628.I ?
629indicates
630an optional element of an expression.
631Thus
632.TS
633center;
634l.
635ab?c
636.TE
637matches either
638.I ac
639or
640.I abc .
641.PP
642.I
643Repeated expressions.
644.R
645Repetitions of classes are indicated by the operators
646.I \(**
647and
648.I + .
649.TS
650center;
651l.
652\f2a\(**\f1
653.TE
654is any number of consecutive
655.I a
656characters, including zero; while
657.TS
658center;
659l.
660a+
661.TE
662is one or more instances of
663.I a.
664For example,
665.TS
666center;
667l.
668[a\-z]+
669.TE
670is all strings of lower case letters.
671And
672.TS
673center;
674l.
675[A\(miZa\(miz][A\(miZa\(miz0\(mi9]\(**
676.TE
677indicates all alphanumeric strings with a leading
678alphabetic character.
679This is a typical expression for recognizing identifiers in
680computer languages.
681.PP
682.I
683Alternation and Grouping.
684.R
685The operator |
686indicates alternation:
687.TS
688center;
689l.
690(ab\||\|cd)
691.TE
692matches either
693.ul
694ab
695or
696.ul
697cd.
698Note that parentheses are used for grouping, although
699they are
700not necessary on the outside level;
701.TS
702center;
703l.
704ab\||\|cd
705.TE
706would have sufficed.
707Parentheses
708can be used for more complex expressions:
709.TS
710center;
711l.
712(ab\||\|cd+)?(ef)\(**
713.TE
714matches such strings as
715.I abefef ,
716.I efefef ,
717.I cdef ,
718or
719.I cddd\| ;
720but not
721.I abc ,
722.I abcd ,
723or
724.I abcdef .
725.PP
726.I
727Context sensitivity.
728.R
729Lex will recognize a small amount of surrounding
730context. The two simplest operators for this are
731.I ^
732and
733.I $ .
734If the first character of an expression is
735.I ^ ,
736the expression will only be matched at the beginning
737of a line (after a newline character, or at the beginning of
738the input stream).
739This can never conflict with the other meaning of
740.I ^ ,
741complementation
742of character classes, since that only applies within
743the [\|] operators.
744If the very last character is
745.I $ ,
746the expression will only be matched at the end of a line (when
747immediately followed by newline).
748The latter operator is a special case of the
749.I /
750operator character,
751which indicates trailing context.
752The expression
753.TS
754center;
755l.
756ab/cd
757.TE
758matches the string
759.I ab ,
760but only if followed by
761.ul
762cd.
763Thus
764.TS
765center;
766l.
767ab$
768.TE
769is the same as
770.TS
771center;
772l.
773ab/\en
774.TE
775Left context is handled in Lex by
776.I
777start conditions
778.R
779as explained in section 10. If a rule is only to be executed
780when the Lex automaton interpreter is in start condition
781.I
782x,
783.R
784the rule should be prefixed by
785.TS
786center;
787l.
788<x>
789.TE
790using the angle bracket operator characters.
791If we considered ``being at the beginning of a line'' to be
792start condition
793.I ONE ,
794then the ^ operator
795would be equivalent to
796.TS
797center;
798l.
799<ONE>
800.TE
801Start conditions are explained more fully later.
802.PP
803.I
804Repetitions and Definitions.
805.R
806The operators {} specify
807either repetitions (if they enclose numbers)
808or
809definition expansion (if they enclose a name). For example
810.TS
811center;
812l.
813{digit}
814.TE
815looks for a predefined string named
816.I digit
817and inserts it
818at that point in the expression.
819The definitions are given in the first part of the Lex
820input, before the rules.
821In contrast,
822.TS
823center;
824l.
825a{1,5}
826.TE
827looks for 1 to 5 occurrences of
828.I a .
829.PP
830Finally, initial
831.I %
832is special, being the separator
833for Lex source segments.
834.2C
835.NH
836Lex Actions.
837.PP
838When an expression written as above is matched, Lex
839executes the corresponding action. This section describes
840some features of Lex which aid in writing actions. Note
841that there is a default action, which
842consists of copying the input to the output. This
843is performed on all strings not otherwise matched. Thus
844the Lex user who wishes to absorb the entire input, without
845producing any output, must provide rules to match everything.
846When Lex is being used with Yacc, this is the normal
847situation.
848One may consider that actions are what is done instead of
849copying the input to the output; thus, in general,
850a rule which merely copies can be omitted.
851Also, a character combination
852which is omitted from the rules
853and which appears as input
854is likely to be printed on the output, thus calling
855attention to the gap in the rules.
856.PP
857One of the simplest things that can be done is to ignore
858the input. Specifying a C null statement, \fI;\fR as an action
859causes this result. A frequent rule is
860.TS
861center;
862l l.
863[ \et\en] ;
864.TE
865which causes the three spacing characters (blank, tab, and newline)
866to be ignored.
867.PP
868Another easy way to avoid writing actions is the action character
869|, which indicates that the action for this rule is the action
870for the next rule.
871The previous example could also have been written
872.TS
873center;
874l l.
875" " |
876"\et" |
877"\en" ;
878.TE
879with the same result, although in different style.
880The quotes around \en and \et are not required.
881.PP
882In more complex actions, the user
883will
884often want to know the actual text that matched some expression
885like
886.I [a\(miz]+ .
887Lex leaves this text in an external character
888array named
889.I
890yytext.
891.R
892Thus, to print the name found,
893a rule like
894.TS
895center;
896l l.
897[a\-z]+ printf("%s", yytext);
898.TE
899will print
900the string in
901.I
902yytext.
903.R
904The C function
905.I
906printf
907.R
908accepts a format argument and data to be printed;
909in this case, the format is ``print string'' (% indicating
910data conversion, and
911.I s
912indicating string type),
913and the data are the characters
914in
915.I
916yytext.
917.R
918So this just places
919the matched string
920on the output.
921This action
922is so common that
923it may be written as ECHO:
924.TS
925center;
926l l.
927[a\-z]+ ECHO;
928.TE
929is the same as the above.
930Since the default action is just to
931print the characters found, one might ask why
932give a rule, like this one, which merely specifies
933the default action?
934Such rules are often required
935to avoid matching some other rule
936which is not desired. For example, if there is a rule
937which matches
938.I read
939it will normally match the instances of
940.I read
941contained in
942.I bread
943or
944.I readjust ;
945to avoid
946this,
947a rule
948of the form
949.I [a\(miz]+
950is needed.
951This is explained further below.
952.PP
953Sometimes it is more convenient to know the end of what
954has been found; hence Lex also provides a count
955.I
956yyleng
957.R
958of the number of characters matched.
959To count both the number
960of words and the number of characters in words in the input, the user might write
961.TS
962center;
963l l.
964[a\-zA\-Z]+ {words++; chars += yyleng;}
965.TE
966which accumulates in
967.ul
968chars
969the number
970of characters in the words recognized.
971The last character in the string matched can
972be accessed by
973.TS
974center;
975l.
976yytext[yyleng\-1]
977.TE
978.PP
979Occasionally, a Lex
980action may decide that a rule has not recognized the correct
981span of characters.
982Two routines are provided to aid with this situation.
983First,
984.I
985yymore()
986.R
987can be called to indicate that the next input expression recognized is to be
988tacked on to the end of this input. Normally,
989the next input string would overwrite the current
990entry in
991.I
992yytext.
993.R
994Second,
995.I
996yyless (n)
997.R
998may be called to indicate that not all the characters matched
999by the currently successful expression are wanted right now.
1000The argument
1001.I
1002n
1003.R
1004indicates the number of characters
1005in
1006.I
1007yytext
1008.R
1009to be retained.
1010Further characters previously matched
1011are
1012returned to the input. This provides the same sort of
1013look~ahead offered by the / operator,
1014but in a different form.
1015.PP
1016.I
1017Example:
1018.R
1019Consider a language which defines
1020a string as a set of characters between quotation (") marks, and provides that
1021to include a " in a string it must be preceded by a \e. The
1022regular expression which matches that is somewhat confusing,
1023so that it might be preferable to write
1024.TS
1025center;
1026l l.
1027\e"[^"]\(** {
1028 if (yytext[yyleng\-1] == \(fm\e\e\(fm)
1029 yymore();
1030 else
1031 ... normal user processing
1032 }
1033.TE
1034which will, when faced with a string such as
1035.I
1036"abc\e"def\|"
1037.R
1038first match
1039the five characters
1040\fI"abc\e\|\fR;
1041then
1042the call to
1043.I yymore()
1044will
1045cause the next part of the string,
1046\fI"def\|\fR,
1047to be tacked on the end.
1048Note that the final quote terminating the string should be picked
1049up in the code labeled ``normal processing''.
1050.PP
1051The function
1052.I
1053yyless()
1054.R
1055might be used to reprocess
1056text in various circumstances. Consider the C problem of distinguishing
1057the ambiguity of ``=\(mia''.
1058Suppose it is desired to treat this as ``=\(mi a''
1059but print a message. A rule might be
1060.TS
1061center;
1062l l.
1063=\(mi[a\-zA\-Z] {
1064 printf("Operator (=\(mi) ambiguous\en");
1065 yyless(yyleng\-1);
1066 ... action for =\(mi ...
1067 }
1068.TE
1069which prints a message, returns the letter after the
1070operator to the input stream, and treats the operator as ``=\(mi''.
1071Alternatively it might be desired to treat this as ``= \(mia''.
1072To do this, just return the minus
1073sign as well as the letter to the input:
1074.TS
1075center;
1076l l.
1077=\(mi[a\-zA\-Z] {
1078 printf("Operator (=\(mi) ambiguous\en");
1079 yyless(yyleng\-2);
1080 ... action for = ...
1081 }
1082.TE
1083will perform the other interpretation.
1084Note that the expressions for the two cases might more easily
1085be written
1086.TS
1087center;
1088l l.
1089=\(mi/[A\-Za\-z]
1090.TE
1091in the first case and
1092.TS
1093center;
1094l.
1095=/\-[A\-Za\-z]
1096.TE
1097in the second;
1098no backup would be required in the rule action.
1099It is not necessary to recognize the whole identifier
1100to observe the ambiguity.
1101The
1102possibility of ``=\(mi3'', however, makes
1103.TS
1104center;
1105l.
1106=\(mi/[^ \et\en]
1107.TE
1108a still better rule.
1109.PP
1110In addition to these routines, Lex also permits
1111access to the I/O routines
1112it uses.
1113They are:
1114.IP 1)
1115.I
1116input()
1117.R
1118which returns the next input character;
1119.IP 2)
1120.I
1121output(c)
1122.R
1123which writes the character
1124.I
1125c
1126.R
1127on the output; and
1128.IP 3)
1129.I
1130unput(c)
1131.R
1132pushes the character
1133.I
1134c
1135.R
1136back onto the input stream to be read later by
1137.I
1138input().
1139.R
1140.LP
1141By default these routines are provided as macro definitions,
1142but the user can override them and supply private versions.
1143These routines
1144define the relationship between external files and
1145internal characters, and must all be retained
1146or modified consistently.
1147They may be redefined, to
1148cause input or output to be transmitted to or from strange
1149places, including other programs or internal memory;
1150but the character set used must be consistent in all routines;
1151a value of zero returned by
1152.I
1153input
1154.R
1155must mean end of file; and
1156the relationship between
1157.I
1158unput
1159.R
1160and
1161.I
1162input
1163.R
1164must be retained
1165or the Lex look~ahead will not work.
1166Lex does not look ahead at all if it does not have to,
1167but every rule ending in
1168.ft I
1169+ \(** ?
1170.ft R
1171or
1172.ft I
1173$
1174.ft R
1175or containing
1176.ft I
1177/
1178.ft R
1179implies look~ahead.
1180Look~ahead is also necessary to match an expression that is a prefix
1181of another expression.
1182See below for a discussion of the character set used by Lex.
1183The standard Lex library imposes
1184a 100 character limit on backup.
1185.PP
1186Another Lex library routine that the user will sometimes want
1187to redefine is
1188.I
1189yywrap()
1190.R
1191which is called whenever Lex reaches an end-of-file.
1192If
1193.I
1194yywrap
1195.R
1196returns a 1, Lex continues with the normal wrapup on end of input.
1197Sometimes, however, it is convenient to arrange for more
1198input to arrive
1199from a new source.
1200In this case, the user should provide
1201a
1202.I
1203yywrap
1204.R
1205which
1206arranges for new input and
1207returns 0. This instructs Lex to continue processing.
1208The default
1209.I
1210yywrap
1211.R
1212always returns 1.
1213.PP
1214This routine is also a convenient place
1215to print tables, summaries, etc. at the end
1216of a program. Note that it is not
1217possible to write a normal rule which recognizes
1218end-of-file; the only access to this condition is
1219through
1220.I
1221yywrap.
1222.R
1223In fact, unless a private version of
1224.I
1225input()
1226.R
1227is supplied
1228a file containing nulls
1229cannot be handled,
1230since a value of 0 returned by
1231.I
1232input
1233.R
1234is taken to be end-of-file.
1235.PP
1236.2C
1237.NH
1238Ambiguous Source Rules.
1239.PP
1240Lex can handle ambiguous specifications.
1241When more than one expression can match the
1242current input, Lex chooses as follows:
1243.IP 1)
1244The longest match is preferred.
1245.IP 2)
1246Among rules which matched the same number of characters,
1247the rule given first is preferred.
1248.LP
1249Thus, suppose the rules
1250.TS
1251center;
1252l l.
1253integer keyword action ...;
1254[a\-z]+ identifier action ...;
1255.TE
1256to be given in that order. If the input is
1257.I integers ,
1258it is taken as an identifier, because
1259.I [a\-z]+
1260matches 8 characters while
1261.I integer
1262matches only 7.
1263If the input is
1264.I integer ,
1265both rules match 7 characters, and
1266the keyword rule is selected because it was given first.
1267Anything shorter (e.g. \fIint\fR\|) will
1268not match the expression
1269.I integer
1270and so the identifier interpretation is used.
1271.PP
1272The principle of preferring the longest
1273match makes rules containing
1274expressions like
1275.I \&.\(**
1276dangerous.
1277For example,
1278.TS
1279center;
1280l.
1281\&\(fm.\(**\(fm
1282.TE
1283might seem a good way of recognizing
1284a string in single quotes.
1285But it is an invitation for the program to read far
1286ahead, looking for a distant
1287single quote.
1288Presented with the input
1289.TS
1290center;
1291l l.
1292\&\(fmfirst\(fm quoted string here, \(fmsecond\(fm here
1293.TE
1294the above expression will match
1295.TS
1296center;
1297l l.
1298\&\(fmfirst\(fm quoted string here, \(fmsecond\(fm
1299.TE
1300which is probably not what was wanted.
1301A better rule is of the form
1302.TS
1303center;
1304l.
1305\&\(fm[^\(fm\en]\(**\(fm
1306.TE
1307which, on the above input, will stop
1308after
1309.I \(fmfirst\(fm .
1310The consequences
1311of errors like this are mitigated by the fact
1312that the
1313.I \&.
1314operator will not match newline.
1315Thus expressions like
1316.I \&.\(**
1317stop on the
1318current line.
1319Don't try to defeat this with expressions like
1320.I [.\en]+
1321or
1322equivalents;
1323the Lex generated program will try to read
1324the entire input file, causing
1325internal buffer overflows.
1326.PP
1327Note that Lex is normally partitioning
1328the input stream, not searching for all possible matches
1329of each expression.
1330This means that each character is accounted for
1331once and only once.
1332For example, suppose it is desired to
1333count occurrences of both \fIshe\fR and \fIhe\fR in an input text.
1334Some Lex rules to do this might be
1335.TS
1336center;
1337l l.
1338she s++;
1339he h++;
1340\en |
1341\&. ;
1342.TE
1343where the last two rules ignore everything besides \fIhe\fR and \fIshe\fR.
1344Remember that . does not include newline.
1345Since \fIshe\fR includes \fIhe\fR, Lex will normally
1346.I
1347not
1348.R
1349recognize
1350the instances of \fIhe\fR included in \fIshe\fR,
1351since once it has passed a \fIshe\fR those characters are gone.
1352.PP
1353Sometimes the user would like to override this choice. The action
1354REJECT
1355means ``go do the next alternative.''
1356It causes whatever rule was second choice after the current
1357rule to be executed.
1358The position of the input pointer is adjusted accordingly.
1359Suppose the user really wants to count the included instances of \fIhe\fR:
1360.TS
1361center;
1362l l.
1363she {s++; REJECT;}
1364he {h++; REJECT;}
1365\en |
1366\&. ;
1367.TE
1368these rules are one way of changing the previous example
1369to do just that.
1370After counting each expression, it is rejected; whenever appropriate,
1371the other expression will then be counted. In this example, of course,
1372the user could note that \fIshe\fR includes \fIhe\fR but not
1373vice versa, and omit the REJECT action on \fIhe\fR;
1374in other cases, however, it
1375would not be possible a priori to tell
1376which input characters
1377were in both classes.
1378.PP
1379Consider the two rules
1380.TS
1381center;
1382l l.
1383a[bc]+ { ... ; REJECT;}
1384a[cd]+ { ... ; REJECT;}
1385.TE
1386If the input is
1387.I ab ,
1388only the first rule matches,
1389and on
1390.I ad
1391only the second matches.
1392The input string
1393.I accb
1394matches the first rule for four characters
1395and then the second rule for three characters.
1396In contrast, the input
1397.I accd
1398agrees with
1399the second rule for four characters and then the first
1400rule for three.
1401.PP
1402In general, REJECT is useful whenever
1403the purpose of Lex is not to partition the input
1404stream but to detect all examples of some items
1405in the input, and the instances of these items
1406may overlap or include each other.
1407Suppose a digram table of the input is desired;
1408normally the digrams overlap, that is the word
1409.I the
1410is considered to contain
1411both
1412.I th
1413and
1414.I he .
1415Assuming a two-dimensional array named
1416.ul
1417digram
1418to be incremented, the appropriate
1419source is
1420.TS
1421center;
1422l l.
1423%%
1424[a\-z][a\-z] {digram[yytext[0]][yytext[1]]++; REJECT;}
1425. ;
1426\en ;
1427.TE
1428where the REJECT is necessary to pick up
1429a letter pair beginning at every character, rather than at every
1430other character.
1431.2C
1432.NH
1433Lex Source Definitions.
1434.PP
1435Remember the format of the Lex
1436source:
1437.TS
1438center;
1439l.
1440{definitions}
1441%%
1442{rules}
1443%%
1444{user routines}
1445.TE
1446So far only the rules have been described. The user needs
1447additional options,
1448though, to define variables for use in his program and for use
1449by Lex.
1450These can go either in the definitions section
1451or in the rules section.
1452.PP
1453Remember that Lex is turning the rules into a program.
1454Any source not intercepted by Lex is copied
1455into the generated program. There are three classes
1456of such things.
1457.IP 1)
1458Any line which is not part of a Lex rule or action
1459which begins with a blank or tab is copied into
1460the Lex generated program.
1461Such source input prior to the first %% delimiter will be external
1462to any function in the code; if it appears immediately after the first
1463%%,
1464it appears in an appropriate place for declarations
1465in the function written by Lex which contains the actions.
1466This material must look like program fragments,
1467and should precede the first Lex rule.
1468.IP
1469As a side effect of the above, lines which begin with a blank
1470or tab, and which contain a comment,
1471are passed through to the generated program.
1472This can be used to include comments in either the Lex source or
1473the generated code. The comments should follow the host
1474language convention.
1475.IP 2)
1476Anything included between lines containing
1477only
1478.I %{
1479and
1480.I %}
1481is
1482copied out as above. The delimiters are discarded.
1483This format permits entering text like preprocessor statements that
1484must begin in column 1,
1485or copying lines that do not look like programs.
1486.IP 3)
1487Anything after the third %% delimiter, regardless of formats, etc.,
1488is copied out after the Lex output.
1489.PP
1490Definitions intended for Lex are given
1491before the first %% delimiter. Any line in this section
1492not contained between %{ and %}, and begining
1493in column 1, is assumed to define Lex substitution strings.
1494The format of such lines is
1495.TS
1496center;
1497l l.
1498name translation
1499.TE
1500and it
1501causes the string given as a translation to
1502be associated with the name.
1503The name and translation
1504must be separated by at least one blank or tab, and the name must begin with a letter.
1505The translation can then be called out
1506by the {name} syntax in a rule.
1507Using {D} for the digits and {E} for an exponent field,
1508for example, might abbreviate rules to recognize numbers:
1509.TS
1510center;
1511l l.
1512D [0\-9]
1513E [DEde][\-+]?{D}+
1514%%
1515{D}+ printf("integer");
1516{D}+"."{D}\(**({E})? |
1517{D}\(**"."{D}+({E})? |
1518{D}+{E} printf("real");
1519.TE
1520Note the first two rules for real numbers;
1521both require a decimal point and contain
1522an optional exponent field,
1523but the first requires at least one digit before the
1524decimal point and the second requires at least one
1525digit after the decimal point.
1526To correctly handle the problem
1527posed by a Fortran expression such as
1528.I 35.EQ.I ,
1529which does not contain a real number, a context-sensitive
1530rule such as
1531.TS
1532center;
1533l l.
1534[0\-9]+/"."EQ printf("integer");
1535.TE
1536could be used in addition to the normal rule for integers.
1537.PP
1538The definitions
1539section may also contain other commands, including the
1540selection of a host language, a character set table,
1541a list of start conditions, or adjustments to the default
1542size of arrays within Lex itself for larger source programs.
1543These possibilities
1544are discussed below under ``Summary of Source Format,''
1545section 12.
1546.2C
1547.NH
1548Usage.
1549.PP
1550There are two steps in
1551compiling a Lex source program.
1552First, the Lex source must be turned into a generated program
1553in the host general purpose language.
1554Then this program must be compiled and loaded, usually with
1555a library of Lex subroutines.
1556The generated program
1557is on a file named
1558.I lex.yy.c .
1559The I/O library is defined in terms of the C standard
1560library [6].
1561.PP
1562The C programs generated by Lex are slightly different
1563on OS/370, because the
1564OS compiler is less powerful than the UNIX or GCOS compilers,
1565and does less at compile time.
1566C programs generated on GCOS and UNIX are the same.
1567.PP
1568.I
1569UNIX.
1570.R
1571The library is accessed by the loader flag
1572.I \-ll .
1573So an appropriate
1574set of commands is
1575.KS
1576.in 5
1577lex source
1578cc lex.yy.c \-ll
1579.in 0
1580.KE
1581The resulting program is placed on the usual file
1582.I
1583a.out
1584.R
1585for later execution.
1586To use Lex with Yacc see below.
1587Although the default Lex I/O routines use the C standard library,
1588the Lex automata themselves do not do so;
1589if private versions of
1590.I
1591input,
1592output
1593.R
1594and
1595.I unput
1596are given, the library can be avoided.
1597.PP
1598.2C
1599.NH
1600Lex and Yacc.
1601.PP
1602If you want to use Lex with Yacc, note that what Lex writes is a program
1603named
1604.I
1605yylex(),
1606.R
1607the name required by Yacc for its analyzer.
1608Normally, the default main program on the Lex library
1609calls this routine, but if Yacc is loaded, and its main
1610program is used, Yacc will call
1611.I
1612yylex().
1613.R
1614In this case each Lex rule should end with
1615.TS
1616center;
1617l.
1618return(token);
1619.TE
1620where the appropriate token value is returned.
1621An easy way to get access
1622to Yacc's names for tokens is to
1623compile the Lex output file as part of
1624the Yacc output file by placing the line
1625.TS
1626center;
1627l.
1628# include "lex.yy.c"
1629.TE
1630in the last section of Yacc input.
1631Supposing the grammar to be
1632named ``good'' and the lexical rules to be named ``better''
1633the UNIX command sequence can just be:
1634.TS
1635center;
1636l.
1637yacc good
1638lex better
1639cc y.tab.c \-ly \-ll
1640.TE
1641The Yacc library (\-ly) should be loaded before the Lex library,
1642to obtain a main program which invokes the Yacc parser.
1643The generations of Lex and Yacc programs can be done in
1644either order.
1645.2C
1646.NH
1647Examples.
1648.PP
1649As a trivial problem, consider copying an input file while
1650adding 3 to every positive number divisible by 7.
1651Here is a suitable Lex source program
1652.TS
1653center;
1654l l.
1655%%
1656 int k;
1657[0\-9]+ {
1658 k = atoi(yytext);
1659 if (k%7 == 0)
1660 printf("%d", k+3);
1661 else
1662 printf("%d",k);
1663 }
1664.TE
1665to do just that.
1666The rule [0\-9]+ recognizes strings of digits;
1667.I
1668atoi
1669.R
1670converts the digits to binary
1671and stores the result in
1672.ul
1673k.
1674The operator % (remainder) is used to check whether
1675.ul
1676k
1677is divisible by 7; if it is,
1678it is incremented by 3 as it is written out.
1679It may be objected that this program will alter such
1680input items as
1681.I 49.63
1682or
1683.I X7 .
1684Furthermore, it increments the absolute value
1685of all negative numbers divisible by 7.
1686To avoid this, just add a few more rules after the active one,
1687as here:
1688.TS
1689center;
1690l l.
1691%%
1692 int k;
1693\-?[0\-9]+ {
1694 k = atoi(yytext);
1695 printf("%d", k%7 == 0 ? k+3 : k);
1696 }
1697\-?[0\-9.]+ ECHO;
1698[A-Za-z][A-Za-z0-9]+ ECHO;
1699.TE
1700Numerical strings containing
1701a ``.'' or preceded by a letter will be picked up by
1702one of the last two rules, and not changed.
1703The
1704.I if\-else
1705has been replaced by
1706a C conditional expression to save space;
1707the form
1708.ul
1709a?b:c
1710means ``if
1711.I a
1712then
1713.I b
1714else
1715.I c ''.
1716.PP
1717For an example of statistics gathering, here
1718is a program which histograms the lengths
1719of words, where a word is defined as a string of letters.
1720.TS
1721center;
1722l l.
1723 int lengs[100];
1724%%
1725[a\-z]+ lengs[yyleng]++;
1726\&. |
1727\en ;
1728%%
1729.T&
1730l s.
1731yywrap()
1732{
1733int i;
1734printf("Length No. words\en");
1735for(i=0; i<100; i++)
1736 if (lengs[i] > 0)
1737 printf("%5d%10d\en",i,lengs[i]);
1738return(1);
1739}
1740.TE
1741This program
1742accumulates the histogram, while producing no output. At the end
1743of the input it prints the table.
1744The final statement
1745.I
1746return(1);
1747.R
1748indicates that Lex is to perform wrapup. If
1749.I
1750yywrap
1751.R
1752returns zero (false)
1753it implies that further input is available
1754and the program is
1755to continue reading and processing.
1756To provide a
1757.I
1758yywrap
1759.R
1760that never
1761returns true causes an infinite loop.
1762.PP
1763As a larger example,
1764here are some parts of a program written by N. L. Schryer
1765to convert double precision Fortran to single precision Fortran.
1766Because Fortran does not distinguish upper and lower case letters,
1767this routine begins by defining a set of classes including
1768both cases of each letter:
1769.TS
1770center;
1771l l.
1772a [aA]
1773b [bB]
1774c [cC]
1775\&...
1776z [zZ]
1777.TE
1778An additional class recognizes white space:
1779.TS
1780center;
1781l l.
1782W [ \et]\(**
1783.TE
1784The first rule changes
1785``double precision'' to ``real'', or ``DOUBLE PRECISION'' to ``REAL''.
1786.TS
1787center;
1788l.
1789{d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} {
1790 printf(yytext[0]==\(fmd\(fm? "real" : "REAL");
1791 }
1792.TE
1793Care is taken throughout this program to preserve the case
1794(upper or lower)
1795of the original program.
1796The conditional operator is used to
1797select the proper form of the keyword.
1798The next rule copies continuation card indications to
1799avoid confusing them with constants:
1800.TS
1801center;
1802l l.
1803^" "[^ 0] ECHO;
1804.TE
1805In the regular expression, the quotes surround the
1806blanks.
1807It is interpreted as
1808``beginning of line, then five blanks, then
1809anything but blank or zero.''
1810Note the two different meanings of
1811.I ^ .
1812There follow some rules to change double precision
1813constants to ordinary floating constants.
1814.TS
1815center;
1816l.
1817[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+ |
1818[0\-9]+{W}"."{W}{d}{W}[+\-]?{W}[0\-9]+ |
1819"."{W}[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+ {
1820 /\(** convert constants \(**/
1821 for(p=yytext; \(**p != 0; p++)
1822 {
1823 if (\(**p == \(fmd\(fm || \(**p == \(fmD\(fm)
1824 \(**p=+ \(fme\(fm\- \(fmd\(fm;
1825 ECHO;
1826 }
1827.TE
1828After the floating point constant is recognized, it is
1829scanned by the
1830.ul
1831for
1832loop
1833to find the letter
1834.I d
1835or
1836.I D .
1837The program than adds
1838.c
1839.I \(fme\(fm\-\(fmd\(fm ,
1840which converts
1841it to the next letter of the alphabet.
1842The modified constant, now single-precision,
1843is written out again.
1844There follow a series of names which must be respelled to remove
1845their initial \fId\fR.
1846By using the
1847array
1848.I
1849yytext
1850.R
1851the same action suffices for all the names (only a sample of
1852a rather long list is given here).
1853.TS
1854center;
1855l l.
1856{d}{s}{i}{n} |
1857{d}{c}{o}{s} |
1858{d}{s}{q}{r}{t} |
1859{d}{a}{t}{a}{n} |
1860\&...
1861{d}{f}{l}{o}{a}{t} printf("%s",yytext+1);
1862.TE
1863Another list of names must have initial \fId\fR changed to initial \fIa\fR:
1864.TS
1865center;
1866l l.
1867{d}{l}{o}{g} |
1868{d}{l}{o}{g}10 |
1869{d}{m}{i}{n}1 |
1870{d}{m}{a}{x}1 {
1871 yytext[0] =+ \(fma\(fm \- \(fmd\(fm;
1872 ECHO;
1873 }
1874.TE
1875And one routine
1876must have initial \fId\fR changed to initial \fIr\fR:
1877.TS
1878center;
1879l l.
1880{d}1{m}{a}{c}{h} {yytext[0] =+ \(fmr\(fm \- \(fmd\(fm;
1881 ECHO;
1882 }
1883.TE
1884To avoid such names as \fIdsinx\fR being detected as instances
1885of \fIdsin\fR, some final rules pick up longer words as identifiers
1886and copy some surviving characters:
1887.TS
1888center;
1889l l.
1890[A\-Za\-z][A\-Za\-z0\-9]\(** |
1891[0\-9]+ |
1892\en |
1893\&. ECHO;
1894.TE
1895Note that this program is not complete; it
1896does not deal with the spacing problems in Fortran or
1897with the use of keywords as identifiers.
1898.br
1899.2C
1900.NH
1901Left Context Sensitivity.
1902.PP
1903Sometimes
1904it is desirable to have several sets of lexical rules
1905to be applied at different times in the input.
1906For example, a compiler preprocessor might distinguish
1907preprocessor statements and analyze them differently
1908from ordinary statements.
1909This requires
1910sensitivity
1911to prior context, and there are several ways of handling
1912such problems.
1913The \fI^\fR operator, for example, is a prior context operator,
1914recognizing immediately preceding left context just as \fI$\fR recognizes
1915immediately following right context.
1916Adjacent left context could be extended, to produce a facility similar to
1917that for adjacent right context, but it is unlikely
1918to be as useful, since often the relevant left context
1919appeared some time earlier, such as at the beginning of a line.
1920.PP
1921This section describes three means of dealing
1922with different environments: a simple use of flags,
1923when only a few rules change from one environment to another,
1924the use of
1925.I
1926start conditions
1927.R
1928on rules,
1929and the possibility of making multiple lexical analyzers all run
1930together.
1931In each case, there are rules which recognize the need to change the
1932environment in which the
1933following input text is analyzed, and set some parameter
1934to reflect the change. This may be a flag explicitly tested by
1935the user's action code; such a flag is the simplest way of dealing
1936with the problem, since Lex is not involved at all.
1937It may be more convenient,
1938however,
1939to have Lex remember the flags as initial conditions on the rules.
1940Any rule may be associated with a start condition. It will only
1941be recognized when Lex is in
1942that start condition.
1943The current start condition may be changed at any time.
1944Finally, if the sets of rules for the different environments
1945are very dissimilar,
1946clarity may be best achieved by writing several distinct lexical
1947analyzers, and switching from one to another as desired.
1948.PP
1949Consider the following problem: copy the input to the output,
1950changing the word \fImagic\fR to \fIfirst\fR on every line which began
1951with the letter \fIa\fR, changing \fImagic\fR to \fIsecond\fR on every line
1952which began with the letter \fIb\fR, and changing
1953\fImagic\fR to \fIthird\fR on every line which began
1954with the letter \fIc\fR. All other words and all other lines
1955are left unchanged.
1956.PP
1957These rules are so simple that the easiest way
1958to do this job is with a flag:
1959.TS
1960center;
1961l l.
1962 int flag;
1963%%
1964^a {flag = \(fma\(fm; ECHO;}
1965^b {flag = \(fmb\(fm; ECHO;}
1966^c {flag = \(fmc\(fm; ECHO;}
1967\en {flag = 0 ; ECHO;}
1968magic {
1969 switch (flag)
1970 {
1971 case \(fma\(fm: printf("first"); break;
1972 case \(fmb\(fm: printf("second"); break;
1973\11\11 case \(fmc\(fm: printf("third"); break;
1974 default: ECHO; break;
1975 }
1976 }
1977.TE
1978should be adequate.
1979.PP
1980To handle the same problem with start conditions, each
1981start condition must be introduced to Lex in the definitions section
1982with a line reading
1983.TS
1984center;
1985l l.
1986%Start name1 name2 ...
1987.TE
1988where the conditions may be named in any order.
1989The word \fIStart\fR may be abbreviated to \fIs\fR or \fIS\fR.
1990The conditions may be referenced at the
1991head of a rule with the <> brackets:
1992.TS
1993center;
1994l.
1995<name1>expression
1996.TE
1997is a rule which is only recognized when Lex is in the
1998start condition \fIname1\fR.
1999To enter a start condition,
2000execute the action statement
2001.TS
2002center;
2003l.
2004BEGIN name1;
2005.TE
2006which changes the start condition to \fIname1\fR.
2007To resume the normal state,
2008.TS
2009center;
2010l.
2011BEGIN 0;
2012.TE
2013resets the initial condition
2014of the Lex automaton interpreter.
2015A rule may be active in several
2016start conditions:
2017.TS
2018center;
2019l.
2020<name1,name2,name3>
2021.TE
2022is a legal prefix. Any rule not beginning with the
2023<> prefix operator is always active.
2024.PP
2025The same example as before can be written:
2026.TS
2027center;
2028l l.
2029%START AA BB CC
2030%%
2031^a {ECHO; BEGIN AA;}
2032^b {ECHO; BEGIN BB;}
2033^c {ECHO; BEGIN CC;}
2034\en {ECHO; BEGIN 0;}
2035<AA>magic printf("first");
2036<BB>magic printf("second");
2037<CC>magic printf("third");
2038.TE
2039where the logic is exactly the same as in the previous
2040method of handling the problem, but Lex does the work
2041rather than the user's code.
2042.2C
2043.NH
2044Character Set.
2045.PP
2046The programs generated by Lex handle
2047character I/O only through the routines
2048.I
2049input,
2050output,
2051.R
2052and
2053.I
2054unput.
2055.R
2056Thus the character representation
2057provided in these routines
2058is accepted by Lex and employed to return
2059values in
2060.I
2061yytext.
2062.R
2063For internal use
2064a character is represented as a small integer
2065which, if the standard library is used,
2066has a value equal to the integer value of the bit
2067pattern representing the character on the host computer.
2068Normally, the letter
2069.I a
2070is represented as the same form as the character constant
2071.I \(fma\(fm .
2072If this interpretation is changed, by providing I/O
2073routines which translate the characters,
2074Lex must be told about
2075it, by giving a translation table.
2076This table must be in the definitions section,
2077and must be bracketed by lines containing only
2078``%T''.
2079The table contains lines of the form
2080.TS
2081center;
2082l.
2083{integer} {character string}
2084.TE
2085which indicate the value associated with each character.
2086Thus the next example
2087.GS 2
2088.TS
2089center;
2090l l.
2091%T
2092 1 Aa
2093 2 Bb
2094\&...
209526 Zz
209627 \en
209728 +
209829 \-
209930 0
210031 1
2101\&...
210239 9
2103%T
2104.TE
2105.sp
2106.ce 1
2107Sample character table.
2108.GE
2109maps the lower and upper case letters together into the integers 1 through 26,
2110newline into 27, + and \- into 28 and 29, and the
2111digits into 30 through 39.
2112Note the escape for newline.
2113If a table is supplied, every character that is to appear either
2114in the rules or in any valid input must be included
2115in the table.
2116No character
2117may be assigned the number 0, and no character may be
2118assigned a bigger number than the size of the hardware character set.
2119.2C
2120.NH
2121Summary of Source Format.
2122.PP
2123The general form of a Lex source file is:
2124.TS
2125center;
2126l.
2127{definitions}
2128%%
2129{rules}
2130%%
2131{user subroutines}
2132.TE
2133The definitions section contains
2134a combination of
2135.IP 1)
2136Definitions, in the form ``name space translation''.
2137.IP 2)
2138Included code, in the form ``space code''.
2139.IP 3)
2140Included code, in the form
2141.TS
2142center;
2143l.
2144%{
2145code
2146%}
2147.TE
2148.ns
2149.IP 4)
2150Start conditions, given in the form
2151.TS
2152center;
2153l.
2154%S name1 name2 ...
2155.TE
2156.ns
2157.IP 5)
2158Character set tables, in the form
2159.TS
2160center;
2161l.
2162%T
2163number space character-string
2164\&...
2165%T
2166.TE
2167.ns
2168.IP 6)
2169Changes to internal array sizes, in the form
2170.TS
2171center;
2172l.
2173%\fIx\fR\0\0\fInnn\fR
2174.TE
2175where \fInnn\fR is a decimal integer representing an array size
2176and \fIx\fR selects the parameter as follows:
2177.TS
2178center;
2179c c
2180c l.
2181Letter Parameter
2182p positions
2183n states
2184e tree nodes
2185a transitions
2186k packed character classes
2187o output array size
2188.TE
2189.LP
2190Lines in the rules section have the form ``expression action''
2191where the action may be continued on succeeding
2192lines by using braces to delimit it.
2193.PP
2194Regular expressions in Lex use the following
2195operators:
2196.br
2197.TS
2198center;
2199l l.
2200x the character "x"
2201"x" an "x", even if x is an operator.
2202\ex an "x", even if x is an operator.
2203[xy] the character x or y.
2204[x\-z] the characters x, y or z.
2205[^x] any character but x.
2206\&. any character but newline.
2207^x an x at the beginning of a line.
2208<y>x an x when Lex is in start condition y.
2209x$ an x at the end of a line.
2210x? an optional x.
2211x\(** 0,1,2, ... instances of x.
2212x+ 1,2,3, ... instances of x.
2213x|y an x or a y.
2214(x) an x.
2215x/y an x but only if followed by y.
2216{xx} the translation of xx from the definitions section.
2217x{m,n} \fIm\fR through \fIn\fR occurrences of x
2218.TE
2219.NH
2220Caveats and Bugs.
2221.PP
2222There are pathological expressions which
2223produce exponential growth of the tables when
2224converted to deterministic machines;
2225fortunately, they are rare.
2226.PP
2227REJECT does not rescan the input; instead it remembers the results of the previous
2228scan. This means that if a rule with trailing context is found, and
2229REJECT executed, the user
2230must not have used
2231.ul
2232unput
2233to change the characters forthcoming
2234from the input stream.
2235This is the only restriction on the user's ability to manipulate
2236the not-yet-processed input.
2237.PP
2238.2C
2239.NH
2240Acknowledgments.
2241.PP
2242As should
2243be obvious from the above, the outside of Lex
2244is patterned
2245on Yacc and the inside on Aho's string matching routines.
2246Therefore, both S. C. Johnson and A. V. Aho
2247are really originators
2248of much of Lex,
2249as well as debuggers of it.
2250Many thanks are due to both.
2251.PP
2252The code of the current version of Lex was designed, written,
2253and debugged by Eric Schmidt.
2254.SG MH-1274-MEL-unix
2255.sp 1
2256.2C
2257.NH
2258References.
2259.SP 1v
2260.IP 1.
2261B. W. Kernighan and D. M. Ritchie,
2262.I
2263The C Programming Language,
2264.R
2265Prentice-Hall, N. J. (1978).
2266.IP 2.
2267B. W. Kernighan,
2268.I
2269Ratfor: A Preprocessor for a Rational Fortran,
2270.R
2271Software \- Practice and Experience,
2272\fB5\fR, pp. 395-496 (1975).
2273.IP 3.
2274S. C. Johnson,
2275.I
2276Yacc: Yet Another Compiler Compiler,
2277.R
2278Computing Science Technical Report No. 32,
22791975,
2280.MH
2281.if \n(tm (also TM 75-1273-6)
2282.IP 4.
2283A. V. Aho and M. J. Corasick,
2284.I
2285Efficient String Matching: An Aid to Bibliographic Search,
2286.R
2287Comm. ACM
2288.B
228918,
2290.R
2291333-340 (1975).
2292.IP 5.
2293B. W. Kernighan, D. M. Ritchie and K. L. Thompson,
2294.I
2295QED Text Editor,
2296.R
2297Computing Science Technical Report No. 5,
22981972,
2299.MH
2300.IP 6.
2301D. M. Ritchie,
2302private communication.
2303See also
2304M. E. Lesk,
2305.I
2306The Portable C Library,
2307.R
2308Computing Science Technical Report No. 31,
2309.MH
2310.if \n(tm (also TM 75-1274-11)