This commit was manufactured by cvs2svn to create tag 'FreeBSD-release/1.0'.
[unix-history] / usr.bin / lex / flexdoc.1
CommitLineData
78ed81a3 1.TH FLEX 1 "26 May 1990" "Version 2.3"
15637ed4
RG
2.SH NAME
3flex - fast lexical analyzer generator
4.SH SYNOPSIS
5.B flex
6.B [-bcdfinpstvFILT8 -C[efmF] -Sskeleton]
7.I [filename ...]
8.SH DESCRIPTION
9.I flex
10is a tool for generating
11.I scanners:
12programs which recognized lexical patterns in text.
13.I flex
14reads
15the given input files, or its standard input if no file names are given,
16for a description of a scanner to generate. The description is in
17the form of pairs
18of regular expressions and C code, called
19.I rules. flex
20generates as output a C source file,
21.B lex.yy.c,
22which defines a routine
23.B yylex().
24This file is compiled and linked with the
25.B -lfl
26library to produce an executable. When the executable is run,
27it analyzes its input for occurrences
28of the regular expressions. Whenever it finds one, it executes
29the corresponding C code.
30.SH SOME SIMPLE EXAMPLES
31.LP
32First some simple examples to get the flavor of how one uses
33.I flex.
34The following
35.I flex
36input specifies a scanner which whenever it encounters the string
37"username" will replace it with the user's login name:
38.nf
39
40 %%
41 username printf( "%s", getlogin() );
42
43.fi
44By default, any text not matched by a
45.I flex
46scanner
47is copied to the output, so the net effect of this scanner is
48to copy its input file to its output with each occurrence
49of "username" expanded.
50In this input, there is just one rule. "username" is the
51.I pattern
52and the "printf" is the
53.I action.
54The "%%" marks the beginning of the rules.
55.LP
56Here's another simple example:
57.nf
58
59 int num_lines = 0, num_chars = 0;
60
61 %%
62 \\n ++num_lines; ++num_chars;
63 . ++num_chars;
64
65 %%
66 main()
67 {
68 yylex();
69 printf( "# of lines = %d, # of chars = %d\\n",
70 num_lines, num_chars );
71 }
72
73.fi
74This scanner counts the number of characters and the number
75of lines in its input (it produces no output other than the
76final report on the counts). The first line
77declares two globals, "num_lines" and "num_chars", which are accessible
78both inside
79.B yylex()
80and in the
81.B main()
82routine declared after the second "%%". There are two rules, one
83which matches a newline ("\\n") and increments both the line count and
84the character count, and one which matches any character other than
85a newline (indicated by the "." regular expression).
86.LP
87A somewhat more complicated example:
88.nf
89
90 /* scanner for a toy Pascal-like language */
91
92 %{
93 /* need this for the call to atof() below */
94 #include <math.h>
95 %}
96
97 DIGIT [0-9]
98 ID [a-z][a-z0-9]*
99
100 %%
101
102 {DIGIT}+ {
103 printf( "An integer: %s (%d)\\n", yytext,
104 atoi( yytext ) );
105 }
106
107 {DIGIT}+"."{DIGIT}* {
78ed81a3 108 printf( "A float: %s (%g)\\n", yytext,
15637ed4
RG
109 atof( yytext ) );
110 }
111
112 if|then|begin|end|procedure|function {
113 printf( "A keyword: %s\\n", yytext );
114 }
115
116 {ID} printf( "An identifier: %s\\n", yytext );
117
118 "+"|"-"|"*"|"/" printf( "An operator: %s\\n", yytext );
119
120 "{"[^}\\n]*"}" /* eat up one-line comments */
121
122 [ \\t\\n]+ /* eat up whitespace */
123
124 . printf( "Unrecognized character: %s\\n", yytext );
125
126 %%
127
128 main( argc, argv )
129 int argc;
130 char **argv;
131 {
132 ++argv, --argc; /* skip over program name */
133 if ( argc > 0 )
134 yyin = fopen( argv[0], "r" );
135 else
136 yyin = stdin;
137
138 yylex();
139 }
140
141.fi
142This is the beginnings of a simple scanner for a language like
143Pascal. It identifies different types of
144.I tokens
145and reports on what it has seen.
146.LP
147The details of this example will be explained in the following
148sections.
149.SH FORMAT OF THE INPUT FILE
150The
151.I flex
152input file consists of three sections, separated by a line with just
153.B %%
154in it:
155.nf
156
157 definitions
158 %%
159 rules
160 %%
161 user code
162
163.fi
164The
165.I definitions
166section contains declarations of simple
167.I name
168definitions to simplify the scanner specification, and declarations of
169.I start conditions,
170which are explained in a later section.
171.LP
172Name definitions have the form:
173.nf
174
175 name definition
176
177.fi
178The "name" is a word beginning with a letter or an underscore ('_')
179followed by zero or more letters, digits, '_', or '-' (dash).
180The definition is taken to begin at the first non-white-space character
181following the name and continuing to the end of the line.
182The definition can subsequently be referred to using "{name}", which
183will expand to "(definition)". For example,
184.nf
185
186 DIGIT [0-9]
187 ID [a-z][a-z0-9]*
188
189.fi
190defines "DIGIT" to be a regular expression which matches a
191single digit, and
192"ID" to be a regular expression which matches a letter
193followed by zero-or-more letters-or-digits.
194A subsequent reference to
195.nf
196
197 {DIGIT}+"."{DIGIT}*
198
199.fi
200is identical to
201.nf
202
203 ([0-9])+"."([0-9])*
204
205.fi
206and matches one-or-more digits followed by a '.' followed
207by zero-or-more digits.
208.LP
209The
210.I rules
211section of the
212.I flex
213input contains a series of rules of the form:
214.nf
215
216 pattern action
217
218.fi
219where the pattern must be unindented and the action must begin
220on the same line.
221.LP
222See below for a further description of patterns and actions.
223.LP
224Finally, the user code section is simply copied to
225.B lex.yy.c
226verbatim.
227It is used for companion routines which call or are called
228by the scanner. The presence of this section is optional;
229if it is missing, the second
230.B %%
231in the input file may be skipped, too.
232.LP
233In the definitions and rules sections, any
234.I indented
235text or text enclosed in
236.B %{
237and
238.B %}
239is copied verbatim to the output (with the %{}'s removed).
240The %{}'s must appear unindented on lines by themselves.
241.LP
242In the rules section,
243any indented or %{} text appearing before the
244first rule may be used to declare variables
245which are local to the scanning routine and (after the declarations)
246code which is to be executed whenever the scanning routine is entered.
247Other indented or %{} text in the rule section is still copied to the output,
248but its meaning is not well-defined and it may well cause compile-time
249errors (this feature is present for
250.I POSIX
251compliance; see below for other such features).
252.LP
253In the definitions section, an unindented comment (i.e., a line
254beginning with "/*") is also copied verbatim to the output up
255to the next "*/". Also, any line in the definitions section
256beginning with '#' is ignored, though this style of comment is
257deprecated and may go away in the future.
258.SH PATTERNS
259The patterns in the input are written using an extended set of regular
260expressions. These are:
261.nf
262
263 x match the character 'x'
264 . any character except newline
265 [xyz] a "character class"; in this case, the pattern
266 matches either an 'x', a 'y', or a 'z'
267 [abj-oZ] a "character class" with a range in it; matches
268 an 'a', a 'b', any letter from 'j' through 'o',
269 or a 'Z'
270 [^A-Z] a "negated character class", i.e., any character
271 but those in the class. In this case, any
272 character EXCEPT an uppercase letter.
273 [^A-Z\\n] any character EXCEPT an uppercase letter or
274 a newline
275 r* zero or more r's, where r is any regular expression
276 r+ one or more r's
277 r? zero or one r's (that is, "an optional r")
278 r{2,5} anywhere from two to five r's
279 r{2,} two or more r's
280 r{4} exactly 4 r's
281 {name} the expansion of the "name" definition
282 (see above)
283 "[xyz]\\"foo"
284 the literal string: [xyz]"foo
285 \\X if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v',
286 then the ANSI-C interpretation of \\x.
287 Otherwise, a literal 'X' (used to escape
288 operators such as '*')
289 \\123 the character with octal value 123
290 \\x2a the character with hexadecimal value 2a
291 (r) match an r; parentheses are used to override
292 precedence (see below)
293
294
295 rs the regular expression r followed by the
296 regular expression s; called "concatenation"
297
298
299 r|s either an r or an s
300
301
302 r/s an r but only if it is followed by an s. The
303 s is not part of the matched text. This type
304 of pattern is called as "trailing context".
305 ^r an r, but only at the beginning of a line
306 r$ an r, but only at the end of a line. Equivalent
307 to "r/\\n".
308
309
310 <s>r an r, but only in start condition s (see
311 below for discussion of start conditions)
312 <s1,s2,s3>r
313 same, but in any of start conditions s1,
314 s2, or s3
315
316
317 <<EOF>> an end-of-file
318 <s1,s2><<EOF>>
319 an end-of-file when in start condition s1 or s2
320
321.fi
322The regular expressions listed above are grouped according to
323precedence, from highest precedence at the top to lowest at the bottom.
324Those grouped together have equal precedence. For example,
325.nf
326
327 foo|bar*
328
329.fi
330is the same as
331.nf
332
333 (foo)|(ba(r*))
334
335.fi
336since the '*' operator has higher precedence than concatenation,
337and concatenation higher than alternation ('|'). This pattern
338therefore matches
339.I either
340the string "foo"
341.I or
342the string "ba" followed by zero-or-more r's.
343To match "foo" or zero-or-more "bar"'s, use:
344.nf
345
346 foo|(bar)*
347
348.fi
349and to match zero-or-more "foo"'s-or-"bar"'s:
350.nf
351
352 (foo|bar)*
353
354.fi
355.LP
356Some notes on patterns:
357.IP -
358A negated character class such as the example "[^A-Z]"
359above
360.I will match a newline
361unless "\\n" (or an equivalent escape sequence) is one of the
362characters explicitly present in the negated character class
363(e.g., "[^A-Z\\n]"). This is unlike how many other regular
364expression tools treat negated character classes, but unfortunately
365the inconsistency is historically entrenched.
366Matching newlines means that a pattern like [^"]* can match an entire
367input (overflowing the scanner's input buffer) unless there's another
368quote in the input.
369.IP -
370A rule can have at most one instance of trailing context (the '/' operator
371or the '$' operator). The start condition, '^', and "<<EOF>>" patterns
372can only occur at the beginning of a pattern, and, as well as with '/' and '$',
373cannot be grouped inside parentheses. A '^' which does not occur at
374the beginning of a rule or a '$' which does not occur at the end of
375a rule loses its special properties and is treated as a normal character.
376.IP
377The following are illegal:
378.nf
379
380 foo/bar$
381 <sc1>foo<sc2>bar
382
383.fi
384Note that the first of these, can be written "foo/bar\\n".
385.IP
386The following will result in '$' or '^' being treated as a normal character:
387.nf
388
389 foo|(bar$)
390 foo|^bar
391
392.fi
393If what's wanted is a "foo" or a bar-followed-by-a-newline, the following
394could be used (the special '|' action is explained below):
395.nf
396
397 foo |
398 bar$ /* action goes here */
399
400.fi
401A similar trick will work for matching a foo or a
402bar-at-the-beginning-of-a-line.
403.SH HOW THE INPUT IS MATCHED
404When the generated scanner is run, it analyzes its input looking
405for strings which match any of its patterns. If it finds more than
406one match, it takes the one matching the most text (for trailing
407context rules, this includes the length of the trailing part, even
408though it will then be returned to the input). If it finds two
409or more matches of the same length, the
410rule listed first in the
411.I flex
412input file is chosen.
413.LP
414Once the match is determined, the text corresponding to the match
415(called the
416.I token)
417is made available in the global character pointer
418.B yytext,
419and its length in the global integer
420.B yyleng.
421The
422.I action
423corresponding to the matched pattern is then executed (a more
424detailed description of actions follows), and then the remaining
425input is scanned for another match.
426.LP
427If no match is found, then the
428.I default rule
429is executed: the next character in the input is considered matched and
430copied to the standard output. Thus, the simplest legal
431.I flex
432input is:
433.nf
434
435 %%
436
437.fi
438which generates a scanner that simply copies its input (one character
439at a time) to its output.
440.SH ACTIONS
441Each pattern in a rule has a corresponding action, which can be any
442arbitrary C statement. The pattern ends at the first non-escaped
443whitespace character; the remainder of the line is its action. If the
444action is empty, then when the pattern is matched the input token
445is simply discarded. For example, here is the specification for a program
446which deletes all occurrences of "zap me" from its input:
447.nf
448
449 %%
450 "zap me"
451
452.fi
453(It will copy all other characters in the input to the output since
454they will be matched by the default rule.)
455.LP
456Here is a program which compresses multiple blanks and tabs down to
457a single blank, and throws away whitespace found at the end of a line:
458.nf
459
460 %%
461 [ \\t]+ putchar( ' ' );
462 [ \\t]+$ /* ignore this token */
463
464.fi
465.LP
466If the action contains a '{', then the action spans till the balancing '}'
467is found, and the action may cross multiple lines.
468.I flex
469knows about C strings and comments and won't be fooled by braces found
470within them, but also allows actions to begin with
471.B %{
472and will consider the action to be all the text up to the next
473.B %}
474(regardless of ordinary braces inside the action).
475.LP
476An action consisting solely of a vertical bar ('|') means "same as
477the action for the next rule." See below for an illustration.
478.LP
479Actions can include arbitrary C code, including
480.B return
481statements to return a value to whatever routine called
482.B yylex().
483Each time
484.B yylex()
485is called it continues processing tokens from where it last left
486off until it either reaches
487the end of the file or executes a return. Once it reaches an end-of-file,
488however, then any subsequent call to
489.B yylex()
490will simply immediately return, unless
491.B yyrestart()
492is first called (see below).
493.LP
494Actions are not allowed to modify yytext or yyleng.
495.LP
496There are a number of special directives which can be included within
497an action:
498.IP -
499.B ECHO
500copies yytext to the scanner's output.
501.IP -
502.B BEGIN
503followed by the name of a start condition places the scanner in the
504corresponding start condition (see below).
505.IP -
506.B REJECT
507directs the scanner to proceed on to the "second best" rule which matched the
508input (or a prefix of the input). The rule is chosen as described
509above in "How the Input is Matched", and
510.B yytext
511and
512.B yyleng
513set up appropriately.
514It may either be one which matched as much text
515as the originally chosen rule but came later in the
516.I flex
517input file, or one which matched less text.
518For example, the following will both count the
519words in the input and call the routine special() whenever "frob" is seen:
520.nf
521
522 int word_count = 0;
523 %%
524
525 frob special(); REJECT;
526 [^ \\t\\n]+ ++word_count;
527
528.fi
529Without the
530.B REJECT,
531any "frob"'s in the input would not be counted as words, since the
532scanner normally executes only one action per token.
533Multiple
534.B REJECT's
535are allowed, each one finding the next best choice to the currently
536active rule. For example, when the following scanner scans the token
537"abcd", it will write "abcdabcaba" to the output:
538.nf
539
540 %%
541 a |
542 ab |
543 abc |
544 abcd ECHO; REJECT;
545 .|\\n /* eat up any unmatched character */
546
547.fi
548(The first three rules share the fourth's action since they use
549the special '|' action.)
550.B REJECT
551is a particularly expensive feature in terms scanner performance;
552if it is used in
553.I any
554of the scanner's actions it will slow down
555.I all
556of the scanner's matching. Furthermore,
557.B REJECT
558cannot be used with the
559.I -f
560or
561.I -F
562options (see below).
563.IP
564Note also that unlike the other special actions,
565.B REJECT
566is a
567.I branch;
568code immediately following it in the action will
569.I not
570be executed.
571.IP -
572.B yymore()
573tells the scanner that the next time it matches a rule, the corresponding
574token should be
575.I appended
576onto the current value of
577.B yytext
578rather than replacing it. For example, given the input "mega-kludge"
579the following will write "mega-mega-kludge" to the output:
580.nf
581
582 %%
583 mega- ECHO; yymore();
584 kludge ECHO;
585
586.fi
587First "mega-" is matched and echoed to the output. Then "kludge"
588is matched, but the previous "mega-" is still hanging around at the
589beginning of
590.B yytext
591so the
592.B ECHO
593for the "kludge" rule will actually write "mega-kludge".
594The presence of
595.B yymore()
596in the scanner's action entails a minor performance penalty in the
597scanner's matching speed.
598.IP -
599.B yyless(n)
600returns all but the first
601.I n
602characters of the current token back to the input stream, where they
603will be rescanned when the scanner looks for the next match.
604.B yytext
605and
606.B yyleng
607are adjusted appropriately (e.g.,
608.B yyleng
609will now be equal to
610.I n
611). For example, on the input "foobar" the following will write out
612"foobarbar":
613.nf
614
615 %%
616 foobar ECHO; yyless(3);
617 [a-z]+ ECHO;
618
619.fi
620An argument of 0 to
621.B yyless
622will cause the entire current input string to be scanned again. Unless you've
623changed how the scanner will subsequently process its input (using
624.B BEGIN,
625for example), this will result in an endless loop.
626.IP -
627.B unput(c)
628puts the character
629.I c
630back onto the input stream. It will be the next character scanned.
631The following action will take the current token and cause it
632to be rescanned enclosed in parentheses.
633.nf
634
635 {
636 int i;
637 unput( ')' );
638 for ( i = yyleng - 1; i >= 0; --i )
639 unput( yytext[i] );
640 unput( '(' );
641 }
642
643.fi
644Note that since each
645.B unput()
646puts the given character back at the
647.I beginning
648of the input stream, pushing back strings must be done back-to-front.
649.IP -
650.B input()
651reads the next character from the input stream. For example,
652the following is one way to eat up C comments:
653.nf
654
655 %%
656 "/*" {
657 register int c;
658
659 for ( ; ; )
660 {
661 while ( (c = input()) != '*' &&
662 c != EOF )
663 ; /* eat up text of comment */
664
665 if ( c == '*' )
666 {
667 while ( (c = input()) == '*' )
668 ;
669 if ( c == '/' )
670 break; /* found the end */
671 }
672
673 if ( c == EOF )
674 {
675 error( "EOF in comment" );
676 break;
677 }
678 }
679 }
680
681.fi
682(Note that if the scanner is compiled using
683.B C++,
684then
685.B input()
686is instead referred to as
687.B yyinput(),
688in order to avoid a name clash with the
689.B C++
690stream by the name of
691.I input.)
692.IP -
693.B yyterminate()
694can be used in lieu of a return statement in an action. It terminates
695the scanner and returns a 0 to the scanner's caller, indicating "all done".
696Subsequent calls to the scanner will immediately return unless preceded
697by a call to
698.B yyrestart()
699(see below).
700By default,
701.B yyterminate()
702is also called when an end-of-file is encountered. It is a macro and
703may be redefined.
704.SH THE GENERATED SCANNER
705The output of
706.I flex
707is the file
708.B lex.yy.c,
709which contains the scanning routine
710.B yylex(),
711a number of tables used by it for matching tokens, and a number
712of auxiliary routines and macros. By default,
713.B yylex()
714is declared as follows:
715.nf
716
717 int yylex()
718 {
719 ... various definitions and the actions in here ...
720 }
721
722.fi
723(If your environment supports function prototypes, then it will
724be "int yylex( void )".) This definition may be changed by redefining
725the "YY_DECL" macro. For example, you could use:
726.nf
727
728 #undef YY_DECL
729 #define YY_DECL float lexscan( a, b ) float a, b;
730
731.fi
732to give the scanning routine the name
733.I lexscan,
734returning a float, and taking two floats as arguments. Note that
735if you give arguments to the scanning routine using a
736K&R-style/non-prototyped function declaration, you must terminate
737the definition with a semi-colon (;).
738.LP
739Whenever
740.B yylex()
741is called, it scans tokens from the global input file
742.I yyin
743(which defaults to stdin). It continues until it either reaches
744an end-of-file (at which point it returns the value 0) or
745one of its actions executes a
746.I return
747statement.
748In the former case, when called again the scanner will immediately
749return unless
750.B yyrestart()
751is called to point
752.I yyin
753at the new input file. (
754.B yyrestart()
755takes one argument, a
756.B FILE *
757pointer.)
758In the latter case (i.e., when an action
759executes a return), the scanner may then be called again and it
760will resume scanning where it left off.
761.LP
762By default (and for purposes of efficiency), the scanner uses
763block-reads rather than simple
764.I getc()
765calls to read characters from
766.I yyin.
767The nature of how it gets its input can be controlled by redefining the
768.B YY_INPUT
769macro.
770YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its
771action is to place up to
772.I max_size
773characters in the character array
774.I buf
775and return in the integer variable
776.I result
777either the
778number of characters read or the constant YY_NULL (0 on Unix systems)
779to indicate EOF. The default YY_INPUT reads from the
780global file-pointer "yyin".
781.LP
782A sample redefinition of YY_INPUT (in the definitions
783section of the input file):
784.nf
785
786 %{
787 #undef YY_INPUT
788 #define YY_INPUT(buf,result,max_size) \\
78ed81a3 789 { \\
790 int c = getchar(); \\
791 result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \\
792 }
15637ed4
RG
793 %}
794
795.fi
796This definition will change the input processing to occur
797one character at a time.
798.LP
799You also can add in things like keeping track of the
800input line number this way; but don't expect your scanner to
801go very fast.
802.LP
803When the scanner receives an end-of-file indication from YY_INPUT,
804it then checks the
805.B yywrap()
806function. If
807.B yywrap()
808returns false (zero), then it is assumed that the
809function has gone ahead and set up
810.I yyin
811to point to another input file, and scanning continues. If it returns
812true (non-zero), then the scanner terminates, returning 0 to its
813caller.
814.LP
815The default
816.B yywrap()
817always returns 1. Presently, to redefine it you must first
818"#undef yywrap", as it is currently implemented as a macro. As indicated
819by the hedging in the previous sentence, it may be changed to
820a true function in the near future.
821.LP
822The scanner writes its
823.B ECHO
824output to the
825.I yyout
826global (default, stdout), which may be redefined by the user simply
827by assigning it to some other
828.B FILE
829pointer.
830.SH START CONDITIONS
831.I flex
832provides a mechanism for conditionally activating rules. Any rule
833whose pattern is prefixed with "<sc>" will only be active when
834the scanner is in the start condition named "sc". For example,
835.nf
836
837 <STRING>[^"]* { /* eat up the string body ... */
838 ...
839 }
840
841.fi
842will be active only when the scanner is in the "STRING" start
843condition, and
844.nf
845
846 <INITIAL,STRING,QUOTE>\\. { /* handle an escape ... */
847 ...
848 }
849
850.fi
851will be active only when the current start condition is
852either "INITIAL", "STRING", or "QUOTE".
853.LP
854Start conditions
855are declared in the definitions (first) section of the input
856using unindented lines beginning with either
857.B %s
858or
859.B %x
860followed by a list of names.
861The former declares
862.I inclusive
863start conditions, the latter
864.I exclusive
865start conditions. A start condition is activated using the
866.B BEGIN
867action. Until the next
868.B BEGIN
869action is executed, rules with the given start
870condition will be active and
871rules with other start conditions will be inactive.
872If the start condition is
873.I inclusive,
874then rules with no start conditions at all will also be active.
875If it is
876.I exclusive,
877then
878.I only
879rules qualified with the start condition will be active.
880A set of rules contingent on the same exclusive start condition
881describe a scanner which is independent of any of the other rules in the
882.I flex
883input. Because of this,
884exclusive start conditions make it easy to specify "mini-scanners"
885which scan portions of the input that are syntactically different
886from the rest (e.g., comments).
887.LP
888If the distinction between inclusive and exclusive start conditions
889is still a little vague, here's a simple example illustrating the
890connection between the two. The set of rules:
891.nf
892
893 %s example
894 %%
895 <example>foo /* do something */
896
897.fi
898is equivalent to
899.nf
900
901 %x example
902 %%
903 <INITIAL,example>foo /* do something */
904
905.fi
906.LP
907The default rule (to
908.B ECHO
909any unmatched character) remains active in start conditions.
910.LP
911.B BEGIN(0)
912returns to the original state where only the rules with
913no start conditions are active. This state can also be
914referred to as the start-condition "INITIAL", so
915.B BEGIN(INITIAL)
916is equivalent to
917.B BEGIN(0).
918(The parentheses around the start condition name are not required but
919are considered good style.)
920.LP
921.B BEGIN
922actions can also be given as indented code at the beginning
923of the rules section. For example, the following will cause
924the scanner to enter the "SPECIAL" start condition whenever
925.I yylex()
926is called and the global variable
927.I enter_special
928is true:
929.nf
930
931 int enter_special;
932
933 %x SPECIAL
934 %%
935 if ( enter_special )
936 BEGIN(SPECIAL);
937
938 <SPECIAL>blahblahblah
939 ...more rules follow...
940
941.fi
942.LP
943To illustrate the uses of start conditions,
944here is a scanner which provides two different interpretations
945of a string like "123.456". By default it will treat it as
946as three tokens, the integer "123", a dot ('.'), and the integer "456".
947But if the string is preceded earlier in the line by the string
948"expect-floats"
949it will treat it as a single token, the floating-point number
950123.456:
951.nf
952
953 %{
954 #include <math.h>
955 %}
956 %s expect
957
958 %%
959 expect-floats BEGIN(expect);
960
961 <expect>[0-9]+"."[0-9]+ {
962 printf( "found a float, = %f\\n",
963 atof( yytext ) );
964 }
965 <expect>\\n {
966 /* that's the end of the line, so
967 * we need another "expect-number"
968 * before we'll recognize any more
969 * numbers
970 */
971 BEGIN(INITIAL);
972 }
973
974 [0-9]+ {
975 printf( "found an integer, = %d\\n",
976 atoi( yytext ) );
977 }
978
979 "." printf( "found a dot\\n" );
980
981.fi
982Here is a scanner which recognizes (and discards) C comments while
983maintaining a count of the current input line.
984.nf
985
986 %x comment
987 %%
988 int line_num = 1;
989
990 "/*" BEGIN(comment);
991
992 <comment>[^*\\n]* /* eat anything that's not a '*' */
993 <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */
994 <comment>\\n ++line_num;
995 <comment>"*"+"/" BEGIN(INITIAL);
996
997.fi
998Note that start-conditions names are really integer values and
999can be stored as such. Thus, the above could be extended in the
1000following fashion:
1001.nf
1002
1003 %x comment foo
1004 %%
1005 int line_num = 1;
1006 int comment_caller;
1007
1008 "/*" {
1009 comment_caller = INITIAL;
1010 BEGIN(comment);
1011 }
1012
1013 ...
1014
1015 <foo>"/*" {
1016 comment_caller = foo;
1017 BEGIN(comment);
1018 }
1019
1020 <comment>[^*\\n]* /* eat anything that's not a '*' */
1021 <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */
1022 <comment>\\n ++line_num;
1023 <comment>"*"+"/" BEGIN(comment_caller);
1024
1025.fi
1026One can then implement a "stack" of start conditions using an
1027array of integers. (It is likely that such stacks will become
1028a full-fledged
1029.I flex
1030feature in the future.) Note, though, that
1031start conditions do not have their own name-space; %s's and %x's
1032declare names in the same fashion as #define's.
1033.SH MULTIPLE INPUT BUFFERS
1034Some scanners (such as those which support "include" files)
1035require reading from several input streams. As
1036.I flex
1037scanners do a large amount of buffering, one cannot control
1038where the next input will be read from by simply writing a
1039.B YY_INPUT
1040which is sensitive to the scanning context.
1041.B YY_INPUT
1042is only called when the scanner reaches the end of its buffer, which
1043may be a long time after scanning a statement such as an "include"
1044which requires switching the input source.
1045.LP
1046To negotiate these sorts of problems,
1047.I flex
1048provides a mechanism for creating and switching between multiple
1049input buffers. An input buffer is created by using:
1050.nf
1051
1052 YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )
1053
1054.fi
1055which takes a
1056.I FILE
1057pointer and a size and creates a buffer associated with the given
1058file and large enough to hold
1059.I size
1060characters (when in doubt, use
1061.B YY_BUF_SIZE
1062for the size). It returns a
1063.B YY_BUFFER_STATE
1064handle, which may then be passed to other routines:
1065.nf
1066
1067 void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer )
1068
1069.fi
1070switches the scanner's input buffer so subsequent tokens will
1071come from
1072.I new_buffer.
1073Note that
1074.B yy_switch_to_buffer()
1075may be used by yywrap() to sets things up for continued scanning, instead
1076of opening a new file and pointing
1077.I yyin
1078at it.
1079.nf
1080
1081 void yy_delete_buffer( YY_BUFFER_STATE buffer )
1082
1083.fi
1084is used to reclaim the storage associated with a buffer.
1085.LP
1086.B yy_new_buffer()
1087is an alias for
1088.B yy_create_buffer(),
1089provided for compatibility with the C++ use of
1090.I new
1091and
1092.I delete
1093for creating and destroying dynamic objects.
1094.LP
1095Finally, the
1096.B YY_CURRENT_BUFFER
1097macro returns a
1098.B YY_BUFFER_STATE
1099handle to the current buffer.
1100.LP
1101Here is an example of using these features for writing a scanner
1102which expands include files (the
1103.B <<EOF>>
1104feature is discussed below):
1105.nf
1106
1107 /* the "incl" state is used for picking up the name
1108 * of an include file
1109 */
1110 %x incl
1111
1112 %{
1113 #define MAX_INCLUDE_DEPTH 10
1114 YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
1115 int include_stack_ptr = 0;
1116 %}
1117
1118 %%
1119 include BEGIN(incl);
1120
1121 [a-z]+ ECHO;
1122 [^a-z\\n]*\\n? ECHO;
1123
1124 <incl>[ \\t]* /* eat the whitespace */
1125 <incl>[^ \\t\\n]+ { /* got the include file name */
1126 if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
1127 {
1128 fprintf( stderr, "Includes nested too deeply" );
1129 exit( 1 );
1130 }
1131
1132 include_stack[include_stack_ptr++] =
1133 YY_CURRENT_BUFFER;
1134
1135 yyin = fopen( yytext, "r" );
1136
1137 if ( ! yyin )
1138 error( ... );
1139
1140 yy_switch_to_buffer(
1141 yy_create_buffer( yyin, YY_BUF_SIZE ) );
1142
1143 BEGIN(INITIAL);
1144 }
1145
1146 <<EOF>> {
1147 if ( --include_stack_ptr < 0 )
1148 {
1149 yyterminate();
1150 }
1151
1152 else
1153 yy_switch_to_buffer(
1154 include_stack[include_stack_ptr] );
1155 }
1156
1157.fi
1158.SH END-OF-FILE RULES
1159The special rule "<<EOF>>" indicates
1160actions which are to be taken when an end-of-file is
1161encountered and yywrap() returns non-zero (i.e., indicates
1162no further files to process). The action must finish
1163by doing one of four things:
1164.IP -
1165the special
1166.B YY_NEW_FILE
1167action, if
1168.I yyin
1169has been pointed at a new file to process;
1170.IP -
1171a
1172.I return
1173statement;
1174.IP -
1175the special
1176.B yyterminate()
1177action;
1178.IP -
1179or, switching to a new buffer using
1180.B yy_switch_to_buffer()
1181as shown in the example above.
1182.LP
1183<<EOF>> rules may not be used with other
1184patterns; they may only be qualified with a list of start
1185conditions. If an unqualified <<EOF>> rule is given, it
1186applies to
1187.I all
1188start conditions which do not already have <<EOF>> actions. To
1189specify an <<EOF>> rule for only the initial start condition, use
1190.nf
1191
1192 <INITIAL><<EOF>>
1193
1194.fi
1195.LP
1196These rules are useful for catching things like unclosed comments.
1197An example:
1198.nf
1199
1200 %x quote
1201 %%
1202
1203 ...other rules for dealing with quotes...
1204
1205 <quote><<EOF>> {
1206 error( "unterminated quote" );
1207 yyterminate();
1208 }
1209 <<EOF>> {
1210 if ( *++filelist )
1211 {
1212 yyin = fopen( *filelist, "r" );
1213 YY_NEW_FILE;
1214 }
1215 else
1216 yyterminate();
1217 }
1218
1219.fi
1220.SH MISCELLANEOUS MACROS
1221The macro
1222.bd
1223YY_USER_ACTION
1224can be redefined to provide an action
1225which is always executed prior to the matched rule's action. For example,
1226it could be #define'd to call a routine to convert yytext to lower-case.
1227.LP
1228The macro
1229.B YY_USER_INIT
1230may be redefined to provide an action which is always executed before
1231the first scan (and before the scanner's internal initializations are done).
1232For example, it could be used to call a routine to read
1233in a data table or open a logging file.
1234.LP
1235In the generated scanner, the actions are all gathered in one large
1236switch statement and separated using
1237.B YY_BREAK,
1238which may be redefined. By default, it is simply a "break", to separate
1239each rule's action from the following rule's.
1240Redefining
1241.B YY_BREAK
1242allows, for example, C++ users to
1243#define YY_BREAK to do nothing (while being very careful that every
1244rule ends with a "break" or a "return"!) to avoid suffering from
1245unreachable statement warnings where because a rule's action ends with
1246"return", the
1247.B YY_BREAK
1248is inaccessible.
1249.SH INTERFACING WITH YACC
1250One of the main uses of
1251.I flex
1252is as a companion to the
1253.I yacc
1254parser-generator.
1255.I yacc
1256parsers expect to call a routine named
1257.B yylex()
1258to find the next input token. The routine is supposed to
1259return the type of the next token as well as putting any associated
1260value in the global
1261.B yylval.
1262To use
1263.I flex
1264with
1265.I yacc,
1266one specifies the
1267.B -d
1268option to
1269.I yacc
1270to instruct it to generate the file
1271.B y.tab.h
1272containing definitions of all the
1273.B %tokens
1274appearing in the
1275.I yacc
1276input. This file is then included in the
1277.I flex
1278scanner. For example, if one of the tokens is "TOK_NUMBER",
1279part of the scanner might look like:
1280.nf
1281
1282 %{
1283 #include "y.tab.h"
1284 %}
1285
1286 %%
1287
1288 [0-9]+ yylval = atoi( yytext ); return TOK_NUMBER;
1289
1290.fi
1291.SH TRANSLATION TABLE
1292In the name of POSIX compliance,
1293.I flex
1294supports a
1295.I translation table
1296for mapping input characters into groups.
1297The table is specified in the first section, and its format looks like:
1298.nf
1299
1300 %t
1301 1 abcd
1302 2 ABCDEFGHIJKLMNOPQRSTUVWXYZ
1303 52 0123456789
1304 6 \\t\\ \\n
1305 %t
1306
1307.fi
1308This example specifies that the characters 'a', 'b', 'c', and 'd'
1309are to all be lumped into group #1, upper-case letters
1310in group #2, digits in group #52, tabs, blanks, and newlines into
1311group #6, and
1312.I
1313no other characters will appear in the patterns.
1314The group numbers are actually disregarded by
1315.I flex;
1316.B %t
1317serves, though, to lump characters together. Given the above
1318table, for example, the pattern "a(AA)*5" is equivalent to "d(ZQ)*0".
1319They both say, "match any character in group #1, followed by
1320zero-or-more pairs of characters
1321from group #2, followed by a character from group #52." Thus
1322.B %t
1323provides a crude way for introducing equivalence classes into
1324the scanner specification.
1325.LP
1326Note that the
1327.B -i
1328option (see below) coupled with the equivalence classes which
1329.I flex
1330automatically generates take care of virtually all the instances
1331when one might consider using
1332.B %t.
1333But what the hell, it's there if you want it.
1334.SH OPTIONS
1335.I flex
1336has the following options:
1337.TP
1338.B -b
1339Generate backtracking information to
1340.I lex.backtrack.
1341This is a list of scanner states which require backtracking
1342and the input characters on which they do so. By adding rules one
1343can remove backtracking states. If all backtracking states
1344are eliminated and
1345.B -f
1346or
1347.B -F
1348is used, the generated scanner will run faster (see the
1349.B -p
1350flag). Only users who wish to squeeze every last cycle out of their
1351scanners need worry about this option. (See the section on PERFORMANCE
1352CONSIDERATIONS below.)
1353.TP
1354.B -c
1355is a do-nothing, deprecated option included for POSIX compliance.
1356.IP
1357.B NOTE:
1358in previous releases of
1359.I flex
1360.B -c
1361specified table-compression options. This functionality is
1362now given by the
1363.B -C
1364flag. To ease the the impact of this change, when
1365.I flex
1366encounters
1367.B -c,
1368it currently issues a warning message and assumes that
1369.B -C
1370was desired instead. In the future this "promotion" of
1371.B -c
1372to
1373.B -C
1374will go away in the name of full POSIX compliance (unless
1375the POSIX meaning is removed first).
1376.TP
1377.B -d
1378makes the generated scanner run in
1379.I debug
1380mode. Whenever a pattern is recognized and the global
1381.B yy_flex_debug
1382is non-zero (which is the default),
1383the scanner will write to
1384.I stderr
1385a line of the form:
1386.nf
1387
1388 --accepting rule at line 53 ("the matched text")
1389
1390.fi
1391The line number refers to the location of the rule in the file
1392defining the scanner (i.e., the file that was fed to flex). Messages
1393are also generated when the scanner backtracks, accepts the
1394default rule, reaches the end of its input buffer (or encounters
1395a NUL; at this point, the two look the same as far as the scanner's concerned),
1396or reaches an end-of-file.
1397.TP
1398.B -f
1399specifies (take your pick)
1400.I full table
1401or
1402.I fast scanner.
1403No table compression is done. The result is large but fast.
1404This option is equivalent to
1405.B -Cf
1406(see below).
1407.TP
1408.B -i
1409instructs
1410.I flex
1411to generate a
1412.I case-insensitive
1413scanner. The case of letters given in the
1414.I flex
1415input patterns will
1416be ignored, and tokens in the input will be matched regardless of case. The
1417matched text given in
1418.I yytext
1419will have the preserved case (i.e., it will not be folded).
1420.TP
1421.B -n
1422is another do-nothing, deprecated option included only for
1423POSIX compliance.
1424.TP
1425.B -p
1426generates a performance report to stderr. The report
1427consists of comments regarding features of the
1428.I flex
1429input file which will cause a loss of performance in the resulting scanner.
1430Note that the use of
1431.I REJECT
1432and variable trailing context (see the BUGS section in flex(1))
1433entails a substantial performance penalty; use of
1434.I yymore(),
1435the
1436.B ^
1437operator,
1438and the
1439.B -I
1440flag entail minor performance penalties.
1441.TP
1442.B -s
1443causes the
1444.I default rule
1445(that unmatched scanner input is echoed to
1446.I stdout)
1447to be suppressed. If the scanner encounters input that does not
1448match any of its rules, it aborts with an error. This option is
1449useful for finding holes in a scanner's rule set.
1450.TP
1451.B -t
1452instructs
1453.I flex
1454to write the scanner it generates to standard output instead
1455of
1456.B lex.yy.c.
1457.TP
1458.B -v
1459specifies that
1460.I flex
1461should write to
1462.I stderr
1463a summary of statistics regarding the scanner it generates.
1464Most of the statistics are meaningless to the casual
1465.I flex
1466user, but the
1467first line identifies the version of
1468.I flex,
1469which is useful for figuring
1470out where you stand with respect to patches and new releases,
1471and the next two lines give the date when the scanner was created
1472and a summary of the flags which were in effect.
1473.TP
1474.B -F
1475specifies that the
1476.ul
1477fast
1478scanner table representation should be used. This representation is
1479about as fast as the full table representation
1480.ul
1481(-f),
1482and for some sets of patterns will be considerably smaller (and for
1483others, larger). In general, if the pattern set contains both "keywords"
1484and a catch-all, "identifier" rule, such as in the set:
1485.nf
1486
1487 "case" return TOK_CASE;
1488 "switch" return TOK_SWITCH;
1489 ...
1490 "default" return TOK_DEFAULT;
1491 [a-z]+ return TOK_ID;
1492
1493.fi
1494then you're better off using the full table representation. If only
1495the "identifier" rule is present and you then use a hash table or some such
1496to detect the keywords, you're better off using
1497.ul
1498-F.
1499.IP
1500This option is equivalent to
1501.B -CF
1502(see below).
1503.TP
1504.B -I
1505instructs
1506.I flex
1507to generate an
1508.I interactive
1509scanner. Normally, scanners generated by
1510.I flex
1511always look ahead one
1512character before deciding that a rule has been matched. At the cost of
1513some scanning overhead,
1514.I flex
1515will generate a scanner which only looks ahead
1516when needed. Such scanners are called
1517.I interactive
1518because if you want to write a scanner for an interactive system such as a
1519command shell, you will probably want the user's input to be terminated
1520with a newline, and without
1521.B -I
1522the user will have to type a character in addition to the newline in order
1523to have the newline recognized. This leads to dreadful interactive
1524performance.
1525.IP
1526If all this seems to confusing, here's the general rule: if a human will
1527be typing in input to your scanner, use
1528.B -I,
1529otherwise don't; if you don't care about squeezing the utmost performance
1530from your scanner and you
1531don't want to make any assumptions about the input to your scanner,
1532use
1533.B -I.
1534.IP
1535Note,
1536.B -I
1537cannot be used in conjunction with
1538.I full
1539or
1540.I fast tables,
1541i.e., the
1542.B -f, -F, -Cf,
1543or
1544.B -CF
1545flags.
1546.TP
1547.B -L
1548instructs
1549.I flex
1550not to generate
1551.B #line
1552directives. Without this option,
1553.I flex
1554peppers the generated scanner
1555with #line directives so error messages in the actions will be correctly
1556located with respect to the original
1557.I flex
1558input file, and not to
1559the fairly meaningless line numbers of
1560.B lex.yy.c.
1561(Unfortunately
1562.I flex
1563does not presently generate the necessary directives
1564to "retarget" the line numbers for those parts of
1565.B lex.yy.c
1566which it generated. So if there is an error in the generated code,
1567a meaningless line number is reported.)
1568.TP
1569.B -T
1570makes
1571.I flex
1572run in
1573.I trace
1574mode. It will generate a lot of messages to
1575.I stdout
1576concerning
1577the form of the input and the resultant non-deterministic and deterministic
1578finite automata. This option is mostly for use in maintaining
1579.I flex.
1580.TP
1581.B -8
1582instructs
1583.I flex
1584to generate an 8-bit scanner, i.e., one which can recognize 8-bit
1585characters. On some sites,
1586.I flex
1587is installed with this option as the default. On others, the default
1588is 7-bit characters. To see which is the case, check the verbose
1589.B (-v)
1590output for "equivalence classes created". If the denominator of
1591the number shown is 128, then by default
1592.I flex
1593is generating 7-bit characters. If it is 256, then the default is
15948-bit characters and the
1595.B -8
1596flag is not required (but may be a good idea to keep the scanner
1597specification portable). Feeding a 7-bit scanner 8-bit characters
1598will result in infinite loops, bus errors, or other such fireworks,
1599so when in doubt, use the flag. Note that if equivalence classes
1600are used, 8-bit scanners take only slightly more table space than
16017-bit scanners (128 bytes, to be exact); if equivalence classes are
1602not used, however, then the tables may grow up to twice their
16037-bit size.
1604.TP
1605.B -C[efmF]
1606controls the degree of table compression.
1607.IP
1608.B -Ce
1609directs
1610.I flex
1611to construct
1612.I equivalence classes,
1613i.e., sets of characters
1614which have identical lexical properties (for example, if the only
1615appearance of digits in the
1616.I flex
1617input is in the character class
1618"[0-9]" then the digits '0', '1', ..., '9' will all be put
1619in the same equivalence class). Equivalence classes usually give
1620dramatic reductions in the final table/object file sizes (typically
1621a factor of 2-5) and are pretty cheap performance-wise (one array
1622look-up per character scanned).
1623.IP
1624.B -Cf
1625specifies that the
1626.I full
1627scanner tables should be generated -
1628.I flex
1629should not compress the
1630tables by taking advantages of similar transition functions for
1631different states.
1632.IP
1633.B -CF
1634specifies that the alternate fast scanner representation (described
1635above under the
1636.B -F
1637flag)
1638should be used.
1639.IP
1640.B -Cm
1641directs
1642.I flex
1643to construct
1644.I meta-equivalence classes,
1645which are sets of equivalence classes (or characters, if equivalence
1646classes are not being used) that are commonly used together. Meta-equivalence
1647classes are often a big win when using compressed tables, but they
1648have a moderate performance impact (one or two "if" tests and one
1649array look-up per character scanned).
1650.IP
1651A lone
1652.B -C
1653specifies that the scanner tables should be compressed but neither
1654equivalence classes nor meta-equivalence classes should be used.
1655.IP
1656The options
1657.B -Cf
1658or
1659.B -CF
1660and
1661.B -Cm
1662do not make sense together - there is no opportunity for meta-equivalence
1663classes if the table is not being compressed. Otherwise the options
1664may be freely mixed.
1665.IP
1666The default setting is
1667.B -Cem,
1668which specifies that
1669.I flex
1670should generate equivalence classes
1671and meta-equivalence classes. This setting provides the highest
1672degree of table compression. You can trade off
1673faster-executing scanners at the cost of larger tables with
1674the following generally being true:
1675.nf
1676
1677 slowest & smallest
1678 -Cem
1679 -Cm
1680 -Ce
1681 -C
1682 -C{f,F}e
1683 -C{f,F}
1684 fastest & largest
1685
1686.fi
1687Note that scanners with the smallest tables are usually generated and
1688compiled the quickest, so
1689during development you will usually want to use the default, maximal
1690compression.
1691.IP
1692.B -Cfe
1693is often a good compromise between speed and size for production
1694scanners.
1695.IP
1696.B -C
1697options are not cumulative; whenever the flag is encountered, the
1698previous -C settings are forgotten.
1699.TP
1700.B -Sskeleton_file
1701overrides the default skeleton file from which
1702.I flex
1703constructs its scanners. You'll never need this option unless you are doing
1704.I flex
1705maintenance or development.
1706.SH PERFORMANCE CONSIDERATIONS
1707The main design goal of
1708.I flex
1709is that it generate high-performance scanners. It has been optimized
1710for dealing well with large sets of rules. Aside from the effects
1711of table compression on scanner speed outlined above,
1712there are a number of options/actions which degrade performance. These
1713are, from most expensive to least:
1714.nf
1715
1716 REJECT
1717
1718 pattern sets that require backtracking
1719 arbitrary trailing context
1720
1721 '^' beginning-of-line operator
1722 yymore()
1723
1724.fi
1725with the first three all being quite expensive and the last two
1726being quite cheap.
1727.LP
1728.B REJECT
1729should be avoided at all costs when performance is important.
1730It is a particularly expensive option.
1731.LP
1732Getting rid of backtracking is messy and often may be an enormous
1733amount of work for a complicated scanner. In principal, one begins
1734by using the
1735.B -b
1736flag to generate a
1737.I lex.backtrack
1738file. For example, on the input
1739.nf
1740
1741 %%
1742 foo return TOK_KEYWORD;
1743 foobar return TOK_KEYWORD;
1744
1745.fi
1746the file looks like:
1747.nf
1748
1749 State #6 is non-accepting -
1750 associated rule line numbers:
1751 2 3
1752 out-transitions: [ o ]
1753 jam-transitions: EOF [ \\001-n p-\\177 ]
1754
1755 State #8 is non-accepting -
1756 associated rule line numbers:
1757 3
1758 out-transitions: [ a ]
1759 jam-transitions: EOF [ \\001-` b-\\177 ]
1760
1761 State #9 is non-accepting -
1762 associated rule line numbers:
1763 3
1764 out-transitions: [ r ]
1765 jam-transitions: EOF [ \\001-q s-\\177 ]
1766
1767 Compressed tables always backtrack.
1768
1769.fi
1770The first few lines tell us that there's a scanner state in
1771which it can make a transition on an 'o' but not on any other
1772character, and that in that state the currently scanned text does not match
1773any rule. The state occurs when trying to match the rules found
1774at lines 2 and 3 in the input file.
1775If the scanner is in that state and then reads
1776something other than an 'o', it will have to backtrack to find
1777a rule which is matched. With
1778a bit of headscratching one can see that this must be the
1779state it's in when it has seen "fo". When this has happened,
1780if anything other than another 'o' is seen, the scanner will
1781have to back up to simply match the 'f' (by the default rule).
1782.LP
1783The comment regarding State #8 indicates there's a problem
1784when "foob" has been scanned. Indeed, on any character other
1785than a 'b', the scanner will have to back up to accept "foo".
1786Similarly, the comment for State #9 concerns when "fooba" has
1787been scanned.
1788.LP
1789The final comment reminds us that there's no point going to
1790all the trouble of removing backtracking from the rules unless
1791we're using
1792.B -f
1793or
1794.B -F,
1795since there's no performance gain doing so with compressed scanners.
1796.LP
1797The way to remove the backtracking is to add "error" rules:
1798.nf
1799
1800 %%
1801 foo return TOK_KEYWORD;
1802 foobar return TOK_KEYWORD;
1803
1804 fooba |
1805 foob |
1806 fo {
1807 /* false alarm, not really a keyword */
1808 return TOK_ID;
1809 }
1810
1811.fi
1812.LP
1813Eliminating backtracking among a list of keywords can also be
1814done using a "catch-all" rule:
1815.nf
1816
1817 %%
1818 foo return TOK_KEYWORD;
1819 foobar return TOK_KEYWORD;
1820
1821 [a-z]+ return TOK_ID;
1822
1823.fi
1824This is usually the best solution when appropriate.
1825.LP
1826Backtracking messages tend to cascade.
1827With a complicated set of rules it's not uncommon to get hundreds
1828of messages. If one can decipher them, though, it often
1829only takes a dozen or so rules to eliminate the backtracking (though
1830it's easy to make a mistake and have an error rule accidentally match
1831a valid token. A possible future
1832.I flex
1833feature will be to automatically add rules to eliminate backtracking).
1834.LP
1835.I Variable
1836trailing context (where both the leading and trailing parts do not have
1837a fixed length) entails almost the same performance loss as
1838.I REJECT
1839(i.e., substantial). So when possible a rule like:
1840.nf
1841
1842 %%
1843 mouse|rat/(cat|dog) run();
1844
1845.fi
1846is better written:
1847.nf
1848
1849 %%
1850 mouse/cat|dog run();
1851 rat/cat|dog run();
1852
1853.fi
1854or as
1855.nf
1856
1857 %%
1858 mouse|rat/cat run();
1859 mouse|rat/dog run();
1860
1861.fi
1862Note that here the special '|' action does
1863.I not
1864provide any savings, and can even make things worse (see
1865.B BUGS
1866in flex(1)).
1867.LP
1868Another area where the user can increase a scanner's performance
1869(and one that's easier to implement) arises from the fact that
1870the longer the tokens matched, the faster the scanner will run.
1871This is because with long tokens the processing of most input
1872characters takes place in the (short) inner scanning loop, and
1873does not often have to go through the additional work of setting up
1874the scanning environment (e.g.,
1875.B yytext)
1876for the action. Recall the scanner for C comments:
1877.nf
1878
1879 %x comment
1880 %%
1881 int line_num = 1;
1882
1883 "/*" BEGIN(comment);
1884
1885 <comment>[^*\\n]*
1886 <comment>"*"+[^*/\\n]*
1887 <comment>\\n ++line_num;
1888 <comment>"*"+"/" BEGIN(INITIAL);
1889
1890.fi
1891This could be sped up by writing it as:
1892.nf
1893
1894 %x comment
1895 %%
1896 int line_num = 1;
1897
1898 "/*" BEGIN(comment);
1899
1900 <comment>[^*\\n]*
1901 <comment>[^*\\n]*\\n ++line_num;
1902 <comment>"*"+[^*/\\n]*
1903 <comment>"*"+[^*/\\n]*\\n ++line_num;
1904 <comment>"*"+"/" BEGIN(INITIAL);
1905
1906.fi
1907Now instead of each newline requiring the processing of another
1908action, recognizing the newlines is "distributed" over the other rules
1909to keep the matched text as long as possible. Note that
1910.I adding
1911rules does
1912.I not
1913slow down the scanner! The speed of the scanner is independent
1914of the number of rules or (modulo the considerations given at the
1915beginning of this section) how complicated the rules are with
1916regard to operators such as '*' and '|'.
1917.LP
1918A final example in speeding up a scanner: suppose you want to scan
1919through a file containing identifiers and keywords, one per line
1920and with no other extraneous characters, and recognize all the
1921keywords. A natural first approach is:
1922.nf
1923
1924 %%
1925 asm |
1926 auto |
1927 break |
1928 ... etc ...
1929 volatile |
1930 while /* it's a keyword */
1931
1932 .|\\n /* it's not a keyword */
1933
1934.fi
1935To eliminate the back-tracking, introduce a catch-all rule:
1936.nf
1937
1938 %%
1939 asm |
1940 auto |
1941 break |
1942 ... etc ...
1943 volatile |
1944 while /* it's a keyword */
1945
1946 [a-z]+ |
1947 .|\\n /* it's not a keyword */
1948
1949.fi
1950Now, if it's guaranteed that there's exactly one word per line,
1951then we can reduce the total number of matches by a half by
1952merging in the recognition of newlines with that of the other
1953tokens:
1954.nf
1955
1956 %%
1957 asm\\n |
1958 auto\\n |
1959 break\\n |
1960 ... etc ...
1961 volatile\\n |
1962 while\\n /* it's a keyword */
1963
1964 [a-z]+\\n |
1965 .|\\n /* it's not a keyword */
1966
1967.fi
1968One has to be careful here, as we have now reintroduced backtracking
1969into the scanner. In particular, while
1970.I we
1971know that there will never be any characters in the input stream
1972other than letters or newlines,
1973.I flex
1974can't figure this out, and it will plan for possibly needing backtracking
1975when it has scanned a token like "auto" and then the next character
1976is something other than a newline or a letter. Previously it would
1977then just match the "auto" rule and be done, but now it has no "auto"
1978rule, only a "auto\\n" rule. To eliminate the possibility of backtracking,
1979we could either duplicate all rules but without final newlines, or,
1980since we never expect to encounter such an input and therefore don't
1981how it's classified, we can introduce one more catch-all rule, this
1982one which doesn't include a newline:
1983.nf
1984
1985 %%
1986 asm\\n |
1987 auto\\n |
1988 break\\n |
1989 ... etc ...
1990 volatile\\n |
1991 while\\n /* it's a keyword */
1992
1993 [a-z]+\\n |
1994 [a-z]+ |
1995 .|\\n /* it's not a keyword */
1996
1997.fi
1998Compiled with
1999.B -Cf,
2000this is about as fast as one can get a
2001.I flex
2002scanner to go for this particular problem.
2003.LP
2004A final note:
2005.I flex
2006is slow when matching NUL's, particularly when a token contains
2007multiple NUL's.
2008It's best to write rules which match
2009.I short
2010amounts of text if it's anticipated that the text will often include NUL's.
2011.SH INCOMPATIBILITIES WITH LEX AND POSIX
2012.I flex
2013is a rewrite of the Unix
2014.I lex
2015tool (the two implementations do not share any code, though),
2016with some extensions and incompatibilities, both of which
2017are of concern to those who wish to write scanners acceptable
2018to either implementation. At present, the POSIX
2019.I lex
2020draft is
2021very close to the original
2022.I lex
2023implementation, so some of these
2024incompatibilities are also in conflict with the POSIX draft. But
2025the intent is that except as noted below,
2026.I flex
2027as it presently stands will
2028ultimately be POSIX conformant (i.e., that those areas of conflict with
2029the POSIX draft will be resolved in
2030.I flex's
2031favor). Please bear in
2032mind that all the comments which follow are with regard to the POSIX
2033.I draft
2034standard of Summer 1989, and not the final document (or subsequent
2035drafts); they are included so
2036.I flex
2037users can be aware of the standardization issues and those areas where
2038.I flex
2039may in the near future undergo changes incompatible with
2040its current definition.
2041.LP
2042.I flex
2043is fully compatible with
2044.I lex
2045with the following exceptions:
2046.IP -
78ed81a3 2047The undocumented
2048.I lex
2049scanner internal variable
2050.B yylineno
2051is not supported. It is difficult to support this option efficiently,
2052since it requires examining every character scanned and reexamining
2053the characters when the scanner backs up.
2054Things get more complicated when the end of buffer or file is reached or a
2055NUL is scanned (since the scan must then be restarted with the proper line
2056number count), or the user uses the yyless(), unput(), or REJECT actions,
2057or the multiple input buffer functions.
2058.IP
2059The fix is to add rules which, upon seeing a newline, increment
2060yylineno. This is usually an easy process, though it can be a drag if some
2061of the patterns can match multiple newlines along with other characters.
2062.IP
2063yylineno is not part of the POSIX draft.
2064.IP -
2065The
2066.B input()
2067routine is not redefinable, though it may be called to read characters
2068following whatever has been matched by a rule. If
2069.B input()
2070encounters an end-of-file the normal
2071.B yywrap()
2072processing is done. A ``real'' end-of-file is returned by
2073.B input()
2074as
2075.I EOF.
2076.IP
2077Input is instead controlled by redefining the
2078.B YY_INPUT
2079macro.
2080.IP
2081The
2082.I flex
2083restriction that
2084.B input()
2085cannot be redefined is in accordance with the POSIX draft, but
2086.B YY_INPUT
2087has not yet been accepted into the draft (and probably won't; it looks
2088like the draft will simply not specify any way of controlling the
2089scanner's input other than by making an initial assignment to
2090.I yyin).
2091.IP -
2092.I flex
2093scanners do not use stdio for input. Because of this, when writing an
2094interactive scanner one must explicitly call fflush() on the
2095stream associated with the terminal after writing out a prompt.
2096With
2097.I lex
2098such writes are automatically flushed since
2099.I lex
2100scanners use
2101.B getchar()
2102for their input. Also, when writing interactive scanners with
2103.I flex,
2104the
2105.B -I
2106flag must be used.
2107.IP -
2108.I flex
2109scanners are not as reentrant as
2110.I lex
2111scanners. In particular, if you have an interactive scanner and
2112an interrupt handler which long-jumps out of the scanner, and
2113the scanner is subsequently called again, you may get the following
2114message:
2115.nf
2116
2117 fatal flex scanner internal error--end of buffer missed
2118
2119.fi
2120To reenter the scanner, first use
2121.nf
2122
2123 yyrestart( yyin );
2124
2125.fi
2126.IP -
2127.B output()
2128is not supported.
2129Output from the
2130.B ECHO
2131macro is done to the file-pointer
2132.I yyout
2133(default
2134.I stdout).
2135.IP
2136The POSIX draft mentions that an
2137.B output()
2138routine exists but currently gives no details as to what it does.
2139.IP -
15637ed4
RG
2140.I lex
2141does not support exclusive start conditions (%x), though they
2142are in the current POSIX draft.
2143.IP -
2144When definitions are expanded,
2145.I flex
2146encloses them in parentheses.
2147With lex, the following:
2148.nf
2149
2150 NAME [A-Z][A-Z0-9]*
2151 %%
2152 foo{NAME}? printf( "Found it\\n" );
2153 %%
2154
2155.fi
2156will not match the string "foo" because when the macro
2157is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?"
2158and the precedence is such that the '?' is associated with
2159"[A-Z0-9]*". With
2160.I flex,
2161the rule will be expanded to
2162"foo([A-Z][A-Z0-9]*)?" and so the string "foo" will match.
2163Note that because of this, the
2164.B ^, $, <s>, /,
2165and
2166.B <<EOF>>
2167operators cannot be used in a
2168.I flex
2169definition.
2170.IP
2171The POSIX draft interpretation is the same as
2172.I flex's.
2173.IP -
2174To specify a character class which matches anything but a left bracket (']'),
2175in
2176.I lex
2177one can use "[^]]" but with
2178.I flex
2179one must use "[^\\]]". The latter works with
2180.I lex,
2181too.
2182.IP -
15637ed4
RG
2183The
2184.I lex
2185.B %r
2186(generate a Ratfor scanner) option is not supported. It is not part
2187of the POSIX draft.
2188.IP -
2189If you are providing your own yywrap() routine, you must include a
2190"#undef yywrap" in the definitions section (section 1). Note that
2191the "#undef" will have to be enclosed in %{}'s.
2192.IP
2193The POSIX draft
78ed81a3 2194specifies that yywrap() is a function and this is very unlikely to change; so
15637ed4
RG
2195.I flex users are warned
2196that
2197.B yywrap()
2198is likely to be changed to a function in the near future.
2199.IP -
2200After a call to
2201.B unput(),
2202.I yytext
2203and
2204.I yyleng
2205are undefined until the next token is matched. This is not the case with
2206.I lex
2207or the present POSIX draft.
2208.IP -
2209The precedence of the
2210.B {}
2211(numeric range) operator is different.
2212.I lex
2213interprets "abc{1,3}" as "match one, two, or
2214three occurrences of 'abc'", whereas
2215.I flex
2216interprets it as "match 'ab'
2217followed by one, two, or three occurrences of 'c'". The latter is
2218in agreement with the current POSIX draft.
2219.IP -
2220The precedence of the
2221.B ^
2222operator is different.
2223.I lex
2224interprets "^foo|bar" as "match either 'foo' at the beginning of a line,
2225or 'bar' anywhere", whereas
2226.I flex
2227interprets it as "match either 'foo' or 'bar' if they come at the beginning
2228of a line". The latter is in agreement with the current POSIX draft.
2229.IP -
2230To refer to yytext outside of the scanner source file,
2231the correct definition with
2232.I flex
2233is "extern char *yytext" rather than "extern char yytext[]".
2234This is contrary to the current POSIX draft but a point on which
2235.I flex
2236will not be changing, as the array representation entails a
2237serious performance penalty. It is hoped that the POSIX draft will
2238be emended to support the
2239.I flex
2240variety of declaration (as this is a fairly painless change to
2241require of
2242.I lex
2243users).
2244.IP -
2245.I yyin
2246is
2247.I initialized
2248by
2249.I lex
2250to be
2251.I stdin;
2252.I flex,
2253on the other hand,
2254initializes
2255.I yyin
2256to NULL
2257and then
2258.I assigns
2259it to
2260.I stdin
2261the first time the scanner is called, providing
2262.I yyin
2263has not already been assigned to a non-NULL value. The difference is
2264subtle, but the net effect is that with
2265.I flex
2266scanners,
2267.I yyin
2268does not have a valid value until the scanner has been called.
2269.IP -
2270The special table-size declarations such as
2271.B %a
2272supported by
2273.I lex
2274are not required by
2275.I flex
2276scanners;
2277.I flex
2278ignores them.
2279.IP -
2280The name
2281.bd
2282FLEX_SCANNER
2283is #define'd so scanners may be written for use with either
2284.I flex
2285or
2286.I lex.
2287.LP
2288The following
2289.I flex
2290features are not included in
2291.I lex
2292or the POSIX draft standard:
2293.nf
2294
2295 yyterminate()
2296 <<EOF>>
2297 YY_DECL
2298 #line directives
2299 %{}'s around actions
2300 yyrestart()
2301 comments beginning with '#' (deprecated)
2302 multiple actions on a line
2303
2304.fi
2305This last feature refers to the fact that with
2306.I flex
2307you can put multiple actions on the same line, separated with
2308semi-colons, while with
2309.I lex,
2310the following
2311.nf
2312
2313 foo handle_foo(); ++num_foos_seen;
2314
2315.fi
2316is (rather surprisingly) truncated to
2317.nf
2318
2319 foo handle_foo();
2320
2321.fi
2322.I flex
2323does not truncate the action. Actions that are not enclosed in
2324braces are simply terminated at the end of the line.
2325.SH DIAGNOSTICS
2326.I reject_used_but_not_detected undefined
2327or
2328.I yymore_used_but_not_detected undefined -
2329These errors can occur at compile time. They indicate that the
2330scanner uses
2331.B REJECT
2332or
2333.B yymore()
2334but that
2335.I flex
2336failed to notice the fact, meaning that
2337.I flex
2338scanned the first two sections looking for occurrences of these actions
2339and failed to find any, but somehow you snuck some in (via a #include
2340file, for example). Make an explicit reference to the action in your
2341.I flex
2342input file. (Note that previously
2343.I flex
2344supported a
2345.B %used/%unused
2346mechanism for dealing with this problem; this feature is still supported
2347but now deprecated, and will go away soon unless the author hears from
2348people who can argue compellingly that they need it.)
2349.LP
2350.I flex scanner jammed -
2351a scanner compiled with
2352.B -s
2353has encountered an input string which wasn't matched by
2354any of its rules.
2355.LP
2356.I flex input buffer overflowed -
2357a scanner rule matched a string long enough to overflow the
2358scanner's internal input buffer (16K bytes by default - controlled by
2359.B YY_BUF_SIZE
78ed81a3 2360in "lex.skel". Note that to redefine this macro, you must first
15637ed4
RG
2361.B #undefine
2362it).
2363.LP
2364.I scanner requires -8 flag -
2365Your scanner specification includes recognizing 8-bit characters and
2366you did not specify the -8 flag (and your site has not installed flex
2367with -8 as the default).
2368.LP
78ed81a3 2369.I
2370fatal flex scanner internal error--end of buffer missed -
2371This can occur in an scanner which is reentered after a long-jump
2372has jumped out (or over) the scanner's activation frame. Before
2373reentering the scanner, use:
2374.nf
2375
2376 yyrestart( yyin );
2377
2378.fi
2379.LP
15637ed4
RG
2380.I too many %t classes! -
2381You managed to put every single character into its own %t class.
2382.I flex
2383requires that at least one of the classes share characters.
2384.SH DEFICIENCIES / BUGS
2385See flex(1).
2386.SH "SEE ALSO"
2387.LP
2388flex(1), lex(1), yacc(1), sed(1), awk(1).
2389.LP
2390M. E. Lesk and E. Schmidt,
2391.I LEX - Lexical Analyzer Generator
2392.SH AUTHOR
2393Vern Paxson, with the help of many ideas and much inspiration from
2394Van Jacobson. Original version by Jef Poskanzer. The fast table
2395representation is a partial implementation of a design done by Van
2396Jacobson. The implementation was done by Kevin Gong and Vern Paxson.
2397.LP
2398Thanks to the many
2399.I flex
2400beta-testers, feedbackers, and contributors, especially Casey
78ed81a3 2401Leedom, benson@odi.com, Keith Bostic,
15637ed4
RG
2402Frederic Brehm, Nick Christopher, Jason Coughlin,
2403Scott David Daniels, Leo Eskin,
2404Chris Faylor, Eric Goldman, Eric
2405Hughes, Jeffrey R. Jones, Kevin B. Kenny, Ronald Lamprecht,
2406Greg Lee, Craig Leres, Mohamed el Lozy, Jim Meyering, Marc Nozell, Esmond Pitt,
2407Jef Poskanzer, Jim Roskind,
2408Dave Tallman, Frank Whaley, Ken Yap, and those whose names
2409have slipped my marginal mail-archiving skills but whose contributions
2410are appreciated all the same.
2411.LP
2412Thanks to Keith Bostic, John Gilmore, Craig Leres, Bob
2413Mulcahy, Rich Salz, and Richard Stallman for help with various distribution
2414headaches.
2415.LP
2416Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
2417to Benson Margulies and Fred
2418Burke for C++ support; to Ove Ewerlid for the basics of support for
2419NUL's; and to Eric Hughes for the basics of support for multiple buffers.
2420.LP
2421Work is being done on extending
2422.I flex
2423to generate scanners in which the
2424state machine is directly represented in C code rather than tables.
2425These scanners may well be substantially faster than those generated
2426using -f or -F. If you are working in this area and are interested
2427in comparing notes and seeing whether redundant work can be avoided,
2428contact Ove Ewerlid (ewerlid@mizar.DoCS.UU.SE).
2429.LP
2430This work was primarily done when I was at the Real Time Systems Group
2431at the Lawrence Berkeley Laboratory in Berkeley, CA. Many thanks to all there
2432for the support I received.
2433.LP
2434Send comments to:
2435.nf
2436
2437 Vern Paxson
78ed81a3 2438 Computer Systems Engineering
2439 Bldg. 46A, Room 1123
2440 Lawrence Berkeley Laboratory
2441 University of California
2442 Berkeley, CA 94720
2443
2444 vern@ee.lbl.gov
2445 ucbvax!ee.lbl.gov!vern
15637ed4
RG
2446
2447.fi