BSD 4_3_Net_2 development
[unix-history] / .ref-BSD-4_3_Reno / usr / src / pgrm / lex / flexdoc.1
CommitLineData
810a4854
C
1.TH FLEX 1 "26 May 1990" "Version 2.3"
2.SH NAME
3flex - fast lexical analyzer generator
4.SH SYNOPSIS
5.B flex
6.B [-bcdfinpstvFILT8 -C[efmF] -Sskeleton]
7.I [filename ...]
8.SH DESCRIPTION
9.I flex
10is a tool for generating
11.I scanners:
12programs which recognized lexical patterns in text.
13.I flex
14reads
15the given input files, or its standard input if no file names are given,
16for a description of a scanner to generate. The description is in
17the form of pairs
18of regular expressions and C code, called
19.I rules. flex
20generates as output a C source file,
21.B lex.yy.c,
22which defines a routine
23.B yylex().
24This file is compiled and linked with the
25.B -lfl
26library to produce an executable. When the executable is run,
27it analyzes its input for occurrences
28of the regular expressions. Whenever it finds one, it executes
29the corresponding C code.
30.SH SOME SIMPLE EXAMPLES
31.LP
32First some simple examples to get the flavor of how one uses
33.I flex.
34The following
35.I flex
36input specifies a scanner which whenever it encounters the string
37"username" will replace it with the user's login name:
38.nf
39
40 %%
41 username printf( "%s", getlogin() );
42
43.fi
44By default, any text not matched by a
45.I flex
46scanner
47is copied to the output, so the net effect of this scanner is
48to copy its input file to its output with each occurrence
49of "username" expanded.
50In this input, there is just one rule. "username" is the
51.I pattern
52and the "printf" is the
53.I action.
54The "%%" marks the beginning of the rules.
55.LP
56Here's another simple example:
57.nf
58
59 int num_lines = 0, num_chars = 0;
60
61 %%
62 \\n ++num_lines; ++num_chars;
63 . ++num_chars;
64
65 %%
66 main()
67 {
68 yylex();
69 printf( "# of lines = %d, # of chars = %d\\n",
70 num_lines, num_chars );
71 }
72
73.fi
74This scanner counts the number of characters and the number
75of lines in its input (it produces no output other than the
76final report on the counts). The first line
77declares two globals, "num_lines" and "num_chars", which are accessible
78both inside
79.B yylex()
80and in the
81.B main()
82routine declared after the second "%%". There are two rules, one
83which matches a newline ("\\n") and increments both the line count and
84the character count, and one which matches any character other than
85a newline (indicated by the "." regular expression).
86.LP
87A somewhat more complicated example:
88.nf
89
90 /* scanner for a toy Pascal-like language */
91
92 %{
93 /* need this for the call to atof() below */
94 #include <math.h>
95 %}
96
97 DIGIT [0-9]
98 ID [a-z][a-z0-9]*
99
100 %%
101
102 {DIGIT}+ {
103 printf( "An integer: %s (%d)\\n", yytext,
104 atoi( yytext ) );
105 }
106
107 {DIGIT}+"."{DIGIT}* {
108 printf( "A float: %s (%d)\\n", yytext,
109 atof( yytext ) );
110 }
111
112 if|then|begin|end|procedure|function {
113 printf( "A keyword: %s\\n", yytext );
114 }
115
116 {ID} printf( "An identifier: %s\\n", yytext );
117
118 "+"|"-"|"*"|"/" printf( "An operator: %s\\n", yytext );
119
120 "{"[^}\\n]*"}" /* eat up one-line comments */
121
122 [ \\t\\n]+ /* eat up whitespace */
123
124 . printf( "Unrecognized character: %s\\n", yytext );
125
126 %%
127
128 main( argc, argv )
129 int argc;
130 char **argv;
131 {
132 ++argv, --argc; /* skip over program name */
133 if ( argc > 0 )
134 yyin = fopen( argv[0], "r" );
135 else
136 yyin = stdin;
137
138 yylex();
139 }
140
141.fi
142This is the beginnings of a simple scanner for a language like
143Pascal. It identifies different types of
144.I tokens
145and reports on what it has seen.
146.LP
147The details of this example will be explained in the following
148sections.
149.SH FORMAT OF THE INPUT FILE
150The
151.I flex
152input file consists of three sections, separated by a line with just
153.B %%
154in it:
155.nf
156
157 definitions
158 %%
159 rules
160 %%
161 user code
162
163.fi
164The
165.I definitions
166section contains declarations of simple
167.I name
168definitions to simplify the scanner specification, and declarations of
169.I start conditions,
170which are explained in a later section.
171.LP
172Name definitions have the form:
173.nf
174
175 name definition
176
177.fi
178The "name" is a word beginning with a letter or an underscore ('_')
179followed by zero or more letters, digits, '_', or '-' (dash).
180The definition is taken to begin at the first non-white-space character
181following the name and continuing to the end of the line.
182The definition can subsequently be referred to using "{name}", which
183will expand to "(definition)". For example,
184.nf
185
186 DIGIT [0-9]
187 ID [a-z][a-z0-9]*
188
189.fi
190defines "DIGIT" to be a regular expression which matches a
191single digit, and
192"ID" to be a regular expression which matches a letter
193followed by zero-or-more letters-or-digits.
194A subsequent reference to
195.nf
196
197 {DIGIT}+"."{DIGIT}*
198
199.fi
200is identical to
201.nf
202
203 ([0-9])+"."([0-9])*
204
205.fi
206and matches one-or-more digits followed by a '.' followed
207by zero-or-more digits.
208.LP
209The
210.I rules
211section of the
212.I flex
213input contains a series of rules of the form:
214.nf
215
216 pattern action
217
218.fi
219where the pattern must be unindented and the action must begin
220on the same line.
221.LP
222See below for a further description of patterns and actions.
223.LP
224Finally, the user code section is simply copied to
225.B lex.yy.c
226verbatim.
227It is used for companion routines which call or are called
228by the scanner. The presence of this section is optional;
229if it is missing, the second
230.B %%
231in the input file may be skipped, too.
232.LP
233In the definitions and rules sections, any
234.I indented
235text or text enclosed in
236.B %{
237and
238.B %}
239is copied verbatim to the output (with the %{}'s removed).
240The %{}'s must appear unindented on lines by themselves.
241.LP
242In the rules section,
243any indented or %{} text appearing before the
244first rule may be used to declare variables
245which are local to the scanning routine and (after the declarations)
246code which is to be executed whenever the scanning routine is entered.
247Other indented or %{} text in the rule section is still copied to the output,
248but its meaning is not well-defined and it may well cause compile-time
249errors (this feature is present for
250.I POSIX
251compliance; see below for other such features).
252.LP
253In the definitions section, an unindented comment (i.e., a line
254beginning with "/*") is also copied verbatim to the output up
255to the next "*/". Also, any line in the definitions section
256beginning with '#' is ignored, though this style of comment is
257deprecated and may go away in the future.
258.SH PATTERNS
259The patterns in the input are written using an extended set of regular
260expressions. These are:
261.nf
262
263 x match the character 'x'
264 . any character except newline
265 [xyz] a "character class"; in this case, the pattern
266 matches either an 'x', a 'y', or a 'z'
267 [abj-oZ] a "character class" with a range in it; matches
268 an 'a', a 'b', any letter from 'j' through 'o',
269 or a 'Z'
270 [^A-Z] a "negated character class", i.e., any character
271 but those in the class. In this case, any
272 character EXCEPT an uppercase letter.
273 [^A-Z\\n] any character EXCEPT an uppercase letter or
274 a newline
275 r* zero or more r's, where r is any regular expression
276 r+ one or more r's
277 r? zero or one r's (that is, "an optional r")
278 r{2,5} anywhere from two to five r's
279 r{2,} two or more r's
280 r{4} exactly 4 r's
281 {name} the expansion of the "name" definition
282 (see above)
283 "[xyz]\\"foo"
284 the literal string: [xyz]"foo
285 \\X if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v',
286 then the ANSI-C interpretation of \\x.
287 Otherwise, a literal 'X' (used to escape
288 operators such as '*')
289 \\123 the character with octal value 123
290 \\x2a the character with hexadecimal value 2a
291 (r) match an r; parentheses are used to override
292 precedence (see below)
293
294
295 rs the regular expression r followed by the
296 regular expression s; called "concatenation"
297
298
299 r|s either an r or an s
300
301
302 r/s an r but only if it is followed by an s. The
303 s is not part of the matched text. This type
304 of pattern is called as "trailing context".
305 ^r an r, but only at the beginning of a line
306 r$ an r, but only at the end of a line. Equivalent
307 to "r/\\n".
308
309
310 <s>r an r, but only in start condition s (see
311 below for discussion of start conditions)
312 <s1,s2,s3>r
313 same, but in any of start conditions s1,
314 s2, or s3
315
316
317 <<EOF>> an end-of-file
318 <s1,s2><<EOF>>
319 an end-of-file when in start condition s1 or s2
320
321.fi
322The regular expressions listed above are grouped according to
323precedence, from highest precedence at the top to lowest at the bottom.
324Those grouped together have equal precedence. For example,
325.nf
326
327 foo|bar*
328
329.fi
330is the same as
331.nf
332
333 (foo)|(ba(r*))
334
335.fi
336since the '*' operator has higher precedence than concatenation,
337and concatenation higher than alternation ('|'). This pattern
338therefore matches
339.I either
340the string "foo"
341.I or
342the string "ba" followed by zero-or-more r's.
343To match "foo" or zero-or-more "bar"'s, use:
344.nf
345
346 foo|(bar)*
347
348.fi
349and to match zero-or-more "foo"'s-or-"bar"'s:
350.nf
351
352 (foo|bar)*
353
354.fi
355.LP
356Some notes on patterns:
357.IP -
358A negated character class such as the example "[^A-Z]"
359above
360.I will match a newline
361unless "\\n" (or an equivalent escape sequence) is one of the
362characters explicitly present in the negated character class
363(e.g., "[^A-Z\\n]"). This is unlike how many other regular
364expression tools treat negated character classes, but unfortunately
365the inconsistency is historically entrenched.
366Matching newlines means that a pattern like [^"]* can match an entire
367input (overflowing the scanner's input buffer) unless there's another
368quote in the input.
369.IP -
370A rule can have at most one instance of trailing context (the '/' operator
371or the '$' operator). The start condition, '^', and "<<EOF>>" patterns
372can only occur at the beginning of a pattern, and, as well as with '/' and '$',
373cannot be grouped inside parentheses. A '^' which does not occur at
374the beginning of a rule or a '$' which does not occur at the end of
375a rule loses its special properties and is treated as a normal character.
376.IP
377The following are illegal:
378.nf
379
380 foo/bar$
381 <sc1>foo<sc2>bar
382
383.fi
384Note that the first of these, can be written "foo/bar\\n".
385.IP
386The following will result in '$' or '^' being treated as a normal character:
387.nf
388
389 foo|(bar$)
390 foo|^bar
391
392.fi
393If what's wanted is a "foo" or a bar-followed-by-a-newline, the following
394could be used (the special '|' action is explained below):
395.nf
396
397 foo |
398 bar$ /* action goes here */
399
400.fi
401A similar trick will work for matching a foo or a
402bar-at-the-beginning-of-a-line.
403.SH HOW THE INPUT IS MATCHED
404When the generated scanner is run, it analyzes its input looking
405for strings which match any of its patterns. If it finds more than
406one match, it takes the one matching the most text (for trailing
407context rules, this includes the length of the trailing part, even
408though it will then be returned to the input). If it finds two
409or more matches of the same length, the
410rule listed first in the
411.I flex
412input file is chosen.
413.LP
414Once the match is determined, the text corresponding to the match
415(called the
416.I token)
417is made available in the global character pointer
418.B yytext,
419and its length in the global integer
420.B yyleng.
421The
422.I action
423corresponding to the matched pattern is then executed (a more
424detailed description of actions follows), and then the remaining
425input is scanned for another match.
426.LP
427If no match is found, then the
428.I default rule
429is executed: the next character in the input is considered matched and
430copied to the standard output. Thus, the simplest legal
431.I flex
432input is:
433.nf
434
435 %%
436
437.fi
438which generates a scanner that simply copies its input (one character
439at a time) to its output.
440.SH ACTIONS
441Each pattern in a rule has a corresponding action, which can be any
442arbitrary C statement. The pattern ends at the first non-escaped
443whitespace character; the remainder of the line is its action. If the
444action is empty, then when the pattern is matched the input token
445is simply discarded. For example, here is the specification for a program
446which deletes all occurrences of "zap me" from its input:
447.nf
448
449 %%
450 "zap me"
451
452.fi
453(It will copy all other characters in the input to the output since
454they will be matched by the default rule.)
455.LP
456Here is a program which compresses multiple blanks and tabs down to
457a single blank, and throws away whitespace found at the end of a line:
458.nf
459
460 %%
461 [ \\t]+ putchar( ' ' );
462 [ \\t]+$ /* ignore this token */
463
464.fi
465.LP
466If the action contains a '{', then the action spans till the balancing '}'
467is found, and the action may cross multiple lines.
468.I flex
469knows about C strings and comments and won't be fooled by braces found
470within them, but also allows actions to begin with
471.B %{
472and will consider the action to be all the text up to the next
473.B %}
474(regardless of ordinary braces inside the action).
475.LP
476An action consisting solely of a vertical bar ('|') means "same as
477the action for the next rule." See below for an illustration.
478.LP
479Actions can include arbitrary C code, including
480.B return
481statements to return a value to whatever routine called
482.B yylex().
483Each time
484.B yylex()
485is called it continues processing tokens from where it last left
486off until it either reaches
487the end of the file or executes a return. Once it reaches an end-of-file,
488however, then any subsequent call to
489.B yylex()
490will simply immediately return, unless
491.B yyrestart()
492is first called (see below).
493.LP
494Actions are not allowed to modify yytext or yyleng.
495.LP
496There are a number of special directives which can be included within
497an action:
498.IP -
499.B ECHO
500copies yytext to the scanner's output.
501.IP -
502.B BEGIN
503followed by the name of a start condition places the scanner in the
504corresponding start condition (see below).
505.IP -
506.B REJECT
507directs the scanner to proceed on to the "second best" rule which matched the
508input (or a prefix of the input). The rule is chosen as described
509above in "How the Input is Matched", and
510.B yytext
511and
512.B yyleng
513set up appropriately.
514It may either be one which matched as much text
515as the originally chosen rule but came later in the
516.I flex
517input file, or one which matched less text.
518For example, the following will both count the
519words in the input and call the routine special() whenever "frob" is seen:
520.nf
521
522 int word_count = 0;
523 %%
524
525 frob special(); REJECT;
526 [^ \\t\\n]+ ++word_count;
527
528.fi
529Without the
530.B REJECT,
531any "frob"'s in the input would not be counted as words, since the
532scanner normally executes only one action per token.
533Multiple
534.B REJECT's
535are allowed, each one finding the next best choice to the currently
536active rule. For example, when the following scanner scans the token
537"abcd", it will write "abcdabcaba" to the output:
538.nf
539
540 %%
541 a |
542 ab |
543 abc |
544 abcd ECHO; REJECT;
545 .|\\n /* eat up any unmatched character */
546
547.fi
548(The first three rules share the fourth's action since they use
549the special '|' action.)
550.B REJECT
551is a particularly expensive feature in terms scanner performance;
552if it is used in
553.I any
554of the scanner's actions it will slow down
555.I all
556of the scanner's matching. Furthermore,
557.B REJECT
558cannot be used with the
559.I -f
560or
561.I -F
562options (see below).
563.IP
564Note also that unlike the other special actions,
565.B REJECT
566is a
567.I branch;
568code immediately following it in the action will
569.I not
570be executed.
571.IP -
572.B yymore()
573tells the scanner that the next time it matches a rule, the corresponding
574token should be
575.I appended
576onto the current value of
577.B yytext
578rather than replacing it. For example, given the input "mega-kludge"
579the following will write "mega-mega-kludge" to the output:
580.nf
581
582 %%
583 mega- ECHO; yymore();
584 kludge ECHO;
585
586.fi
587First "mega-" is matched and echoed to the output. Then "kludge"
588is matched, but the previous "mega-" is still hanging around at the
589beginning of
590.B yytext
591so the
592.B ECHO
593for the "kludge" rule will actually write "mega-kludge".
594The presence of
595.B yymore()
596in the scanner's action entails a minor performance penalty in the
597scanner's matching speed.
598.IP -
599.B yyless(n)
600returns all but the first
601.I n
602characters of the current token back to the input stream, where they
603will be rescanned when the scanner looks for the next match.
604.B yytext
605and
606.B yyleng
607are adjusted appropriately (e.g.,
608.B yyleng
609will now be equal to
610.I n
611). For example, on the input "foobar" the following will write out
612"foobarbar":
613.nf
614
615 %%
616 foobar ECHO; yyless(3);
617 [a-z]+ ECHO;
618
619.fi
620An argument of 0 to
621.B yyless
622will cause the entire current input string to be scanned again. Unless you've
623changed how the scanner will subsequently process its input (using
624.B BEGIN,
625for example), this will result in an endless loop.
626.IP -
627.B unput(c)
628puts the character
629.I c
630back onto the input stream. It will be the next character scanned.
631The following action will take the current token and cause it
632to be rescanned enclosed in parentheses.
633.nf
634
635 {
636 int i;
637 unput( ')' );
638 for ( i = yyleng - 1; i >= 0; --i )
639 unput( yytext[i] );
640 unput( '(' );
641 }
642
643.fi
644Note that since each
645.B unput()
646puts the given character back at the
647.I beginning
648of the input stream, pushing back strings must be done back-to-front.
649.IP -
650.B input()
651reads the next character from the input stream. For example,
652the following is one way to eat up C comments:
653.nf
654
655 %%
656 "/*" {
657 register int c;
658
659 for ( ; ; )
660 {
661 while ( (c = input()) != '*' &&
662 c != EOF )
663 ; /* eat up text of comment */
664
665 if ( c == '*' )
666 {
667 while ( (c = input()) == '*' )
668 ;
669 if ( c == '/' )
670 break; /* found the end */
671 }
672
673 if ( c == EOF )
674 {
675 error( "EOF in comment" );
676 break;
677 }
678 }
679 }
680
681.fi
682(Note that if the scanner is compiled using
683.B C++,
684then
685.B input()
686is instead referred to as
687.B yyinput(),
688in order to avoid a name clash with the
689.B C++
690stream by the name of
691.I input.)
692.IP -
693.B yyterminate()
694can be used in lieu of a return statement in an action. It terminates
695the scanner and returns a 0 to the scanner's caller, indicating "all done".
696Subsequent calls to the scanner will immediately return unless preceded
697by a call to
698.B yyrestart()
699(see below).
700By default,
701.B yyterminate()
702is also called when an end-of-file is encountered. It is a macro and
703may be redefined.
704.SH THE GENERATED SCANNER
705The output of
706.I flex
707is the file
708.B lex.yy.c,
709which contains the scanning routine
710.B yylex(),
711a number of tables used by it for matching tokens, and a number
712of auxiliary routines and macros. By default,
713.B yylex()
714is declared as follows:
715.nf
716
717 int yylex()
718 {
719 ... various definitions and the actions in here ...
720 }
721
722.fi
723(If your environment supports function prototypes, then it will
724be "int yylex( void )".) This definition may be changed by redefining
725the "YY_DECL" macro. For example, you could use:
726.nf
727
728 #undef YY_DECL
729 #define YY_DECL float lexscan( a, b ) float a, b;
730
731.fi
732to give the scanning routine the name
733.I lexscan,
734returning a float, and taking two floats as arguments. Note that
735if you give arguments to the scanning routine using a
736K&R-style/non-prototyped function declaration, you must terminate
737the definition with a semi-colon (;).
738.LP
739Whenever
740.B yylex()
741is called, it scans tokens from the global input file
742.I yyin
743(which defaults to stdin). It continues until it either reaches
744an end-of-file (at which point it returns the value 0) or
745one of its actions executes a
746.I return
747statement.
748In the former case, when called again the scanner will immediately
749return unless
750.B yyrestart()
751is called to point
752.I yyin
753at the new input file. (
754.B yyrestart()
755takes one argument, a
756.B FILE *
757pointer.)
758In the latter case (i.e., when an action
759executes a return), the scanner may then be called again and it
760will resume scanning where it left off.
761.LP
762By default (and for purposes of efficiency), the scanner uses
763block-reads rather than simple
764.I getc()
765calls to read characters from
766.I yyin.
767The nature of how it gets its input can be controlled by redefining the
768.B YY_INPUT
769macro.
770YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its
771action is to place up to
772.I max_size
773characters in the character array
774.I buf
775and return in the integer variable
776.I result
777either the
778number of characters read or the constant YY_NULL (0 on Unix systems)
779to indicate EOF. The default YY_INPUT reads from the
780global file-pointer "yyin".
781.LP
782A sample redefinition of YY_INPUT (in the definitions
783section of the input file):
784.nf
785
786 %{
787 #undef YY_INPUT
788 #define YY_INPUT(buf,result,max_size) \\
789 result = ((buf[0] = getchar()) == EOF) ? YY_NULL : 1;
790 %}
791
792.fi
793This definition will change the input processing to occur
794one character at a time.
795.LP
796You also can add in things like keeping track of the
797input line number this way; but don't expect your scanner to
798go very fast.
799.LP
800When the scanner receives an end-of-file indication from YY_INPUT,
801it then checks the
802.B yywrap()
803function. If
804.B yywrap()
805returns false (zero), then it is assumed that the
806function has gone ahead and set up
807.I yyin
808to point to another input file, and scanning continues. If it returns
809true (non-zero), then the scanner terminates, returning 0 to its
810caller.
811.LP
812The default
813.B yywrap()
814always returns 1. Presently, to redefine it you must first
815"#undef yywrap", as it is currently implemented as a macro. As indicated
816by the hedging in the previous sentence, it may be changed to
817a true function in the near future.
818.LP
819The scanner writes its
820.B ECHO
821output to the
822.I yyout
823global (default, stdout), which may be redefined by the user simply
824by assigning it to some other
825.B FILE
826pointer.
827.SH START CONDITIONS
828.I flex
829provides a mechanism for conditionally activating rules. Any rule
830whose pattern is prefixed with "<sc>" will only be active when
831the scanner is in the start condition named "sc". For example,
832.nf
833
834 <STRING>[^"]* { /* eat up the string body ... */
835 ...
836 }
837
838.fi
839will be active only when the scanner is in the "STRING" start
840condition, and
841.nf
842
843 <INITIAL,STRING,QUOTE>\\. { /* handle an escape ... */
844 ...
845 }
846
847.fi
848will be active only when the current start condition is
849either "INITIAL", "STRING", or "QUOTE".
850.LP
851Start conditions
852are declared in the definitions (first) section of the input
853using unindented lines beginning with either
854.B %s
855or
856.B %x
857followed by a list of names.
858The former declares
859.I inclusive
860start conditions, the latter
861.I exclusive
862start conditions. A start condition is activated using the
863.B BEGIN
864action. Until the next
865.B BEGIN
866action is executed, rules with the given start
867condition will be active and
868rules with other start conditions will be inactive.
869If the start condition is
870.I inclusive,
871then rules with no start conditions at all will also be active.
872If it is
873.I exclusive,
874then
875.I only
876rules qualified with the start condition will be active.
877A set of rules contingent on the same exclusive start condition
878describe a scanner which is independent of any of the other rules in the
879.I flex
880input. Because of this,
881exclusive start conditions make it easy to specify "mini-scanners"
882which scan portions of the input that are syntactically different
883from the rest (e.g., comments).
884.LP
885If the distinction between inclusive and exclusive start conditions
886is still a little vague, here's a simple example illustrating the
887connection between the two. The set of rules:
888.nf
889
890 %s example
891 %%
892 <example>foo /* do something */
893
894.fi
895is equivalent to
896.nf
897
898 %x example
899 %%
900 <INITIAL,example>foo /* do something */
901
902.fi
903.LP
904The default rule (to
905.B ECHO
906any unmatched character) remains active in start conditions.
907.LP
908.B BEGIN(0)
909returns to the original state where only the rules with
910no start conditions are active. This state can also be
911referred to as the start-condition "INITIAL", so
912.B BEGIN(INITIAL)
913is equivalent to
914.B BEGIN(0).
915(The parentheses around the start condition name are not required but
916are considered good style.)
917.LP
918.B BEGIN
919actions can also be given as indented code at the beginning
920of the rules section. For example, the following will cause
921the scanner to enter the "SPECIAL" start condition whenever
922.I yylex()
923is called and the global variable
924.I enter_special
925is true:
926.nf
927
928 int enter_special;
929
930 %x SPECIAL
931 %%
932 if ( enter_special )
933 BEGIN(SPECIAL);
934
935 <SPECIAL>blahblahblah
936 ...more rules follow...
937
938.fi
939.LP
940To illustrate the uses of start conditions,
941here is a scanner which provides two different interpretations
942of a string like "123.456". By default it will treat it as
943as three tokens, the integer "123", a dot ('.'), and the integer "456".
944But if the string is preceded earlier in the line by the string
945"expect-floats"
946it will treat it as a single token, the floating-point number
947123.456:
948.nf
949
950 %{
951 #include <math.h>
952 %}
953 %s expect
954
955 %%
956 expect-floats BEGIN(expect);
957
958 <expect>[0-9]+"."[0-9]+ {
959 printf( "found a float, = %f\\n",
960 atof( yytext ) );
961 }
962 <expect>\\n {
963 /* that's the end of the line, so
964 * we need another "expect-number"
965 * before we'll recognize any more
966 * numbers
967 */
968 BEGIN(INITIAL);
969 }
970
971 [0-9]+ {
972 printf( "found an integer, = %d\\n",
973 atoi( yytext ) );
974 }
975
976 "." printf( "found a dot\\n" );
977
978.fi
979Here is a scanner which recognizes (and discards) C comments while
980maintaining a count of the current input line.
981.nf
982
983 %x comment
984 %%
985 int line_num = 1;
986
987 "/*" BEGIN(comment);
988
989 <comment>[^*\\n]* /* eat anything that's not a '*' */
990 <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */
991 <comment>\\n ++line_num;
992 <comment>"*"+"/" BEGIN(INITIAL);
993
994.fi
995Note that start-conditions names are really integer values and
996can be stored as such. Thus, the above could be extended in the
997following fashion:
998.nf
999
1000 %x comment foo
1001 %%
1002 int line_num = 1;
1003 int comment_caller;
1004
1005 "/*" {
1006 comment_caller = INITIAL;
1007 BEGIN(comment);
1008 }
1009
1010 ...
1011
1012 <foo>"/*" {
1013 comment_caller = foo;
1014 BEGIN(comment);
1015 }
1016
1017 <comment>[^*\\n]* /* eat anything that's not a '*' */
1018 <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */
1019 <comment>\\n ++line_num;
1020 <comment>"*"+"/" BEGIN(comment_caller);
1021
1022.fi
1023One can then implement a "stack" of start conditions using an
1024array of integers. (It is likely that such stacks will become
1025a full-fledged
1026.I flex
1027feature in the future.) Note, though, that
1028start conditions do not have their own name-space; %s's and %x's
1029declare names in the same fashion as #define's.
1030.SH MULTIPLE INPUT BUFFERS
1031Some scanners (such as those which support "include" files)
1032require reading from several input streams. As
1033.I flex
1034scanners do a large amount of buffering, one cannot control
1035where the next input will be read from by simply writing a
1036.B YY_INPUT
1037which is sensitive to the scanning context.
1038.B YY_INPUT
1039is only called when the scanner reaches the end of its buffer, which
1040may be a long time after scanning a statement such as an "include"
1041which requires switching the input source.
1042.LP
1043To negotiate these sorts of problems,
1044.I flex
1045provides a mechanism for creating and switching between multiple
1046input buffers. An input buffer is created by using:
1047.nf
1048
1049 YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )
1050
1051.fi
1052which takes a
1053.I FILE
1054pointer and a size and creates a buffer associated with the given
1055file and large enough to hold
1056.I size
1057characters (when in doubt, use
1058.B YY_BUF_SIZE
1059for the size). It returns a
1060.B YY_BUFFER_STATE
1061handle, which may then be passed to other routines:
1062.nf
1063
1064 void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer )
1065
1066.fi
1067switches the scanner's input buffer so subsequent tokens will
1068come from
1069.I new_buffer.
1070Note that
1071.B yy_switch_to_buffer()
1072may be used by yywrap() to sets things up for continued scanning, instead
1073of opening a new file and pointing
1074.I yyin
1075at it.
1076.nf
1077
1078 void yy_delete_buffer( YY_BUFFER_STATE buffer )
1079
1080.fi
1081is used to reclaim the storage associated with a buffer.
1082.LP
1083.B yy_new_buffer()
1084is an alias for
1085.B yy_create_buffer(),
1086provided for compatibility with the C++ use of
1087.I new
1088and
1089.I delete
1090for creating and destroying dynamic objects.
1091.LP
1092Finally, the
1093.B YY_CURRENT_BUFFER
1094macro returns a
1095.B YY_BUFFER_STATE
1096handle to the current buffer.
1097.LP
1098Here is an example of using these features for writing a scanner
1099which expands include files (the
1100.B <<EOF>>
1101feature is discussed below):
1102.nf
1103
1104 /* the "incl" state is used for picking up the name
1105 * of an include file
1106 */
1107 %x incl
1108
1109 %{
1110 #define MAX_INCLUDE_DEPTH 10
1111 YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
1112 int include_stack_ptr = 0;
1113 %}
1114
1115 %%
1116 include BEGIN(incl);
1117
1118 [a-z]+ ECHO;
1119 [^a-z\\n]*\\n? ECHO;
1120
1121 <incl>[ \\t]* /* eat the whitespace */
1122 <incl>[^ \\t\\n]+ { /* got the include file name */
1123 if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
1124 {
1125 fprintf( stderr, "Includes nested too deeply" );
1126 exit( 1 );
1127 }
1128
1129 include_stack[include_stack_ptr++] =
1130 YY_CURRENT_BUFFER;
1131
1132 yyin = fopen( yytext, "r" );
1133
1134 if ( ! yyin )
1135 error( ... );
1136
1137 yy_switch_to_buffer(
1138 yy_create_buffer( yyin, YY_BUF_SIZE ) );
1139
1140 BEGIN(INITIAL);
1141 }
1142
1143 <<EOF>> {
1144 if ( --include_stack_ptr < 0 )
1145 {
1146 yyterminate();
1147 }
1148
1149 else
1150 yy_switch_to_buffer(
1151 include_stack[include_stack_ptr] );
1152 }
1153
1154.fi
1155.SH END-OF-FILE RULES
1156The special rule "<<EOF>>" indicates
1157actions which are to be taken when an end-of-file is
1158encountered and yywrap() returns non-zero (i.e., indicates
1159no further files to process). The action must finish
1160by doing one of four things:
1161.IP -
1162the special
1163.B YY_NEW_FILE
1164action, if
1165.I yyin
1166has been pointed at a new file to process;
1167.IP -
1168a
1169.I return
1170statement;
1171.IP -
1172the special
1173.B yyterminate()
1174action;
1175.IP -
1176or, switching to a new buffer using
1177.B yy_switch_to_buffer()
1178as shown in the example above.
1179.LP
1180<<EOF>> rules may not be used with other
1181patterns; they may only be qualified with a list of start
1182conditions. If an unqualified <<EOF>> rule is given, it
1183applies to
1184.I all
1185start conditions which do not already have <<EOF>> actions. To
1186specify an <<EOF>> rule for only the initial start condition, use
1187.nf
1188
1189 <INITIAL><<EOF>>
1190
1191.fi
1192.LP
1193These rules are useful for catching things like unclosed comments.
1194An example:
1195.nf
1196
1197 %x quote
1198 %%
1199
1200 ...other rules for dealing with quotes...
1201
1202 <quote><<EOF>> {
1203 error( "unterminated quote" );
1204 yyterminate();
1205 }
1206 <<EOF>> {
1207 if ( *++filelist )
1208 {
1209 yyin = fopen( *filelist, "r" );
1210 YY_NEW_FILE;
1211 }
1212 else
1213 yyterminate();
1214 }
1215
1216.fi
1217.SH MISCELLANEOUS MACROS
1218The macro
1219.bd
1220YY_USER_ACTION
1221can be redefined to provide an action
1222which is always executed prior to the matched rule's action. For example,
1223it could be #define'd to call a routine to convert yytext to lower-case.
1224.LP
1225The macro
1226.B YY_USER_INIT
1227may be redefined to provide an action which is always executed before
1228the first scan (and before the scanner's internal initializations are done).
1229For example, it could be used to call a routine to read
1230in a data table or open a logging file.
1231.LP
1232In the generated scanner, the actions are all gathered in one large
1233switch statement and separated using
1234.B YY_BREAK,
1235which may be redefined. By default, it is simply a "break", to separate
1236each rule's action from the following rule's.
1237Redefining
1238.B YY_BREAK
1239allows, for example, C++ users to
1240#define YY_BREAK to do nothing (while being very careful that every
1241rule ends with a "break" or a "return"!) to avoid suffering from
1242unreachable statement warnings where because a rule's action ends with
1243"return", the
1244.B YY_BREAK
1245is inaccessible.
1246.SH INTERFACING WITH YACC
1247One of the main uses of
1248.I flex
1249is as a companion to the
1250.I yacc
1251parser-generator.
1252.I yacc
1253parsers expect to call a routine named
1254.B yylex()
1255to find the next input token. The routine is supposed to
1256return the type of the next token as well as putting any associated
1257value in the global
1258.B yylval.
1259To use
1260.I flex
1261with
1262.I yacc,
1263one specifies the
1264.B -d
1265option to
1266.I yacc
1267to instruct it to generate the file
1268.B y.tab.h
1269containing definitions of all the
1270.B %tokens
1271appearing in the
1272.I yacc
1273input. This file is then included in the
1274.I flex
1275scanner. For example, if one of the tokens is "TOK_NUMBER",
1276part of the scanner might look like:
1277.nf
1278
1279 %{
1280 #include "y.tab.h"
1281 %}
1282
1283 %%
1284
1285 [0-9]+ yylval = atoi( yytext ); return TOK_NUMBER;
1286
1287.fi
1288.SH TRANSLATION TABLE
1289In the name of POSIX compliance,
1290.I flex
1291supports a
1292.I translation table
1293for mapping input characters into groups.
1294The table is specified in the first section, and its format looks like:
1295.nf
1296
1297 %t
1298 1 abcd
1299 2 ABCDEFGHIJKLMNOPQRSTUVWXYZ
1300 52 0123456789
1301 6 \\t\\ \\n
1302 %t
1303
1304.fi
1305This example specifies that the characters 'a', 'b', 'c', and 'd'
1306are to all be lumped into group #1, upper-case letters
1307in group #2, digits in group #52, tabs, blanks, and newlines into
1308group #6, and
1309.I
1310no other characters will appear in the patterns.
1311The group numbers are actually disregarded by
1312.I flex;
1313.B %t
1314serves, though, to lump characters together. Given the above
1315table, for example, the pattern "a(AA)*5" is equivalent to "d(ZQ)*0".
1316They both say, "match any character in group #1, followed by
1317zero-or-more pairs of characters
1318from group #2, followed by a character from group #52." Thus
1319.B %t
1320provides a crude way for introducing equivalence classes into
1321the scanner specification.
1322.LP
1323Note that the
1324.B -i
1325option (see below) coupled with the equivalence classes which
1326.I flex
1327automatically generates take care of virtually all the instances
1328when one might consider using
1329.B %t.
1330But what the hell, it's there if you want it.
1331.SH OPTIONS
1332.I flex
1333has the following options:
1334.TP
1335.B -b
1336Generate backtracking information to
1337.I lex.backtrack.
1338This is a list of scanner states which require backtracking
1339and the input characters on which they do so. By adding rules one
1340can remove backtracking states. If all backtracking states
1341are eliminated and
1342.B -f
1343or
1344.B -F
1345is used, the generated scanner will run faster (see the
1346.B -p
1347flag). Only users who wish to squeeze every last cycle out of their
1348scanners need worry about this option. (See the section on PERFORMANCE
1349CONSIDERATIONS below.)
1350.TP
1351.B -c
1352is a do-nothing, deprecated option included for POSIX compliance.
1353.IP
1354.B NOTE:
1355in previous releases of
1356.I flex
1357.B -c
1358specified table-compression options. This functionality is
1359now given by the
1360.B -C
1361flag. To ease the the impact of this change, when
1362.I flex
1363encounters
1364.B -c,
1365it currently issues a warning message and assumes that
1366.B -C
1367was desired instead. In the future this "promotion" of
1368.B -c
1369to
1370.B -C
1371will go away in the name of full POSIX compliance (unless
1372the POSIX meaning is removed first).
1373.TP
1374.B -d
1375makes the generated scanner run in
1376.I debug
1377mode. Whenever a pattern is recognized and the global
1378.B yy_flex_debug
1379is non-zero (which is the default),
1380the scanner will write to
1381.I stderr
1382a line of the form:
1383.nf
1384
1385 --accepting rule at line 53 ("the matched text")
1386
1387.fi
1388The line number refers to the location of the rule in the file
1389defining the scanner (i.e., the file that was fed to flex). Messages
1390are also generated when the scanner backtracks, accepts the
1391default rule, reaches the end of its input buffer (or encounters
1392a NUL; at this point, the two look the same as far as the scanner's concerned),
1393or reaches an end-of-file.
1394.TP
1395.B -f
1396specifies (take your pick)
1397.I full table
1398or
1399.I fast scanner.
1400No table compression is done. The result is large but fast.
1401This option is equivalent to
1402.B -Cf
1403(see below).
1404.TP
1405.B -i
1406instructs
1407.I flex
1408to generate a
1409.I case-insensitive
1410scanner. The case of letters given in the
1411.I flex
1412input patterns will
1413be ignored, and tokens in the input will be matched regardless of case. The
1414matched text given in
1415.I yytext
1416will have the preserved case (i.e., it will not be folded).
1417.TP
1418.B -n
1419is another do-nothing, deprecated option included only for
1420POSIX compliance.
1421.TP
1422.B -p
1423generates a performance report to stderr. The report
1424consists of comments regarding features of the
1425.I flex
1426input file which will cause a loss of performance in the resulting scanner.
1427Note that the use of
1428.I REJECT
1429and variable trailing context (see the BUGS section in flex(1))
1430entails a substantial performance penalty; use of
1431.I yymore(),
1432the
1433.B ^
1434operator,
1435and the
1436.B -I
1437flag entail minor performance penalties.
1438.TP
1439.B -s
1440causes the
1441.I default rule
1442(that unmatched scanner input is echoed to
1443.I stdout)
1444to be suppressed. If the scanner encounters input that does not
1445match any of its rules, it aborts with an error. This option is
1446useful for finding holes in a scanner's rule set.
1447.TP
1448.B -t
1449instructs
1450.I flex
1451to write the scanner it generates to standard output instead
1452of
1453.B lex.yy.c.
1454.TP
1455.B -v
1456specifies that
1457.I flex
1458should write to
1459.I stderr
1460a summary of statistics regarding the scanner it generates.
1461Most of the statistics are meaningless to the casual
1462.I flex
1463user, but the
1464first line identifies the version of
1465.I flex,
1466which is useful for figuring
1467out where you stand with respect to patches and new releases,
1468and the next two lines give the date when the scanner was created
1469and a summary of the flags which were in effect.
1470.TP
1471.B -F
1472specifies that the
1473.ul
1474fast
1475scanner table representation should be used. This representation is
1476about as fast as the full table representation
1477.ul
1478(-f),
1479and for some sets of patterns will be considerably smaller (and for
1480others, larger). In general, if the pattern set contains both "keywords"
1481and a catch-all, "identifier" rule, such as in the set:
1482.nf
1483
1484 "case" return TOK_CASE;
1485 "switch" return TOK_SWITCH;
1486 ...
1487 "default" return TOK_DEFAULT;
1488 [a-z]+ return TOK_ID;
1489
1490.fi
1491then you're better off using the full table representation. If only
1492the "identifier" rule is present and you then use a hash table or some such
1493to detect the keywords, you're better off using
1494.ul
1495-F.
1496.IP
1497This option is equivalent to
1498.B -CF
1499(see below).
1500.TP
1501.B -I
1502instructs
1503.I flex
1504to generate an
1505.I interactive
1506scanner. Normally, scanners generated by
1507.I flex
1508always look ahead one
1509character before deciding that a rule has been matched. At the cost of
1510some scanning overhead,
1511.I flex
1512will generate a scanner which only looks ahead
1513when needed. Such scanners are called
1514.I interactive
1515because if you want to write a scanner for an interactive system such as a
1516command shell, you will probably want the user's input to be terminated
1517with a newline, and without
1518.B -I
1519the user will have to type a character in addition to the newline in order
1520to have the newline recognized. This leads to dreadful interactive
1521performance.
1522.IP
1523If all this seems to confusing, here's the general rule: if a human will
1524be typing in input to your scanner, use
1525.B -I,
1526otherwise don't; if you don't care about squeezing the utmost performance
1527from your scanner and you
1528don't want to make any assumptions about the input to your scanner,
1529use
1530.B -I.
1531.IP
1532Note,
1533.B -I
1534cannot be used in conjunction with
1535.I full
1536or
1537.I fast tables,
1538i.e., the
1539.B -f, -F, -Cf,
1540or
1541.B -CF
1542flags.
1543.TP
1544.B -L
1545instructs
1546.I flex
1547not to generate
1548.B #line
1549directives. Without this option,
1550.I flex
1551peppers the generated scanner
1552with #line directives so error messages in the actions will be correctly
1553located with respect to the original
1554.I flex
1555input file, and not to
1556the fairly meaningless line numbers of
1557.B lex.yy.c.
1558(Unfortunately
1559.I flex
1560does not presently generate the necessary directives
1561to "retarget" the line numbers for those parts of
1562.B lex.yy.c
1563which it generated. So if there is an error in the generated code,
1564a meaningless line number is reported.)
1565.TP
1566.B -T
1567makes
1568.I flex
1569run in
1570.I trace
1571mode. It will generate a lot of messages to
1572.I stdout
1573concerning
1574the form of the input and the resultant non-deterministic and deterministic
1575finite automata. This option is mostly for use in maintaining
1576.I flex.
1577.TP
1578.B -8
1579instructs
1580.I flex
1581to generate an 8-bit scanner, i.e., one which can recognize 8-bit
1582characters. On some sites,
1583.I flex
1584is installed with this option as the default. On others, the default
1585is 7-bit characters. To see which is the case, check the verbose
1586.B (-v)
1587output for "equivalence classes created". If the denominator of
1588the number shown is 128, then by default
1589.I flex
1590is generating 7-bit characters. If it is 256, then the default is
15918-bit characters and the
1592.B -8
1593flag is not required (but may be a good idea to keep the scanner
1594specification portable). Feeding a 7-bit scanner 8-bit characters
1595will result in infinite loops, bus errors, or other such fireworks,
1596so when in doubt, use the flag. Note that if equivalence classes
1597are used, 8-bit scanners take only slightly more table space than
15987-bit scanners (128 bytes, to be exact); if equivalence classes are
1599not used, however, then the tables may grow up to twice their
16007-bit size.
1601.TP
1602.B -C[efmF]
1603controls the degree of table compression.
1604.IP
1605.B -Ce
1606directs
1607.I flex
1608to construct
1609.I equivalence classes,
1610i.e., sets of characters
1611which have identical lexical properties (for example, if the only
1612appearance of digits in the
1613.I flex
1614input is in the character class
1615"[0-9]" then the digits '0', '1', ..., '9' will all be put
1616in the same equivalence class). Equivalence classes usually give
1617dramatic reductions in the final table/object file sizes (typically
1618a factor of 2-5) and are pretty cheap performance-wise (one array
1619look-up per character scanned).
1620.IP
1621.B -Cf
1622specifies that the
1623.I full
1624scanner tables should be generated -
1625.I flex
1626should not compress the
1627tables by taking advantages of similar transition functions for
1628different states.
1629.IP
1630.B -CF
1631specifies that the alternate fast scanner representation (described
1632above under the
1633.B -F
1634flag)
1635should be used.
1636.IP
1637.B -Cm
1638directs
1639.I flex
1640to construct
1641.I meta-equivalence classes,
1642which are sets of equivalence classes (or characters, if equivalence
1643classes are not being used) that are commonly used together. Meta-equivalence
1644classes are often a big win when using compressed tables, but they
1645have a moderate performance impact (one or two "if" tests and one
1646array look-up per character scanned).
1647.IP
1648A lone
1649.B -C
1650specifies that the scanner tables should be compressed but neither
1651equivalence classes nor meta-equivalence classes should be used.
1652.IP
1653The options
1654.B -Cf
1655or
1656.B -CF
1657and
1658.B -Cm
1659do not make sense together - there is no opportunity for meta-equivalence
1660classes if the table is not being compressed. Otherwise the options
1661may be freely mixed.
1662.IP
1663The default setting is
1664.B -Cem,
1665which specifies that
1666.I flex
1667should generate equivalence classes
1668and meta-equivalence classes. This setting provides the highest
1669degree of table compression. You can trade off
1670faster-executing scanners at the cost of larger tables with
1671the following generally being true:
1672.nf
1673
1674 slowest & smallest
1675 -Cem
1676 -Cm
1677 -Ce
1678 -C
1679 -C{f,F}e
1680 -C{f,F}
1681 fastest & largest
1682
1683.fi
1684Note that scanners with the smallest tables are usually generated and
1685compiled the quickest, so
1686during development you will usually want to use the default, maximal
1687compression.
1688.IP
1689.B -Cfe
1690is often a good compromise between speed and size for production
1691scanners.
1692.IP
1693.B -C
1694options are not cumulative; whenever the flag is encountered, the
1695previous -C settings are forgotten.
1696.TP
1697.B -Sskeleton_file
1698overrides the default skeleton file from which
1699.I flex
1700constructs its scanners. You'll never need this option unless you are doing
1701.I flex
1702maintenance or development.
1703.SH PERFORMANCE CONSIDERATIONS
1704The main design goal of
1705.I flex
1706is that it generate high-performance scanners. It has been optimized
1707for dealing well with large sets of rules. Aside from the effects
1708of table compression on scanner speed outlined above,
1709there are a number of options/actions which degrade performance. These
1710are, from most expensive to least:
1711.nf
1712
1713 REJECT
1714
1715 pattern sets that require backtracking
1716 arbitrary trailing context
1717
1718 '^' beginning-of-line operator
1719 yymore()
1720
1721.fi
1722with the first three all being quite expensive and the last two
1723being quite cheap.
1724.LP
1725.B REJECT
1726should be avoided at all costs when performance is important.
1727It is a particularly expensive option.
1728.LP
1729Getting rid of backtracking is messy and often may be an enormous
1730amount of work for a complicated scanner. In principal, one begins
1731by using the
1732.B -b
1733flag to generate a
1734.I lex.backtrack
1735file. For example, on the input
1736.nf
1737
1738 %%
1739 foo return TOK_KEYWORD;
1740 foobar return TOK_KEYWORD;
1741
1742.fi
1743the file looks like:
1744.nf
1745
1746 State #6 is non-accepting -
1747 associated rule line numbers:
1748 2 3
1749 out-transitions: [ o ]
1750 jam-transitions: EOF [ \\001-n p-\\177 ]
1751
1752 State #8 is non-accepting -
1753 associated rule line numbers:
1754 3
1755 out-transitions: [ a ]
1756 jam-transitions: EOF [ \\001-` b-\\177 ]
1757
1758 State #9 is non-accepting -
1759 associated rule line numbers:
1760 3
1761 out-transitions: [ r ]
1762 jam-transitions: EOF [ \\001-q s-\\177 ]
1763
1764 Compressed tables always backtrack.
1765
1766.fi
1767The first few lines tell us that there's a scanner state in
1768which it can make a transition on an 'o' but not on any other
1769character, and that in that state the currently scanned text does not match
1770any rule. The state occurs when trying to match the rules found
1771at lines 2 and 3 in the input file.
1772If the scanner is in that state and then reads
1773something other than an 'o', it will have to backtrack to find
1774a rule which is matched. With
1775a bit of headscratching one can see that this must be the
1776state it's in when it has seen "fo". When this has happened,
1777if anything other than another 'o' is seen, the scanner will
1778have to back up to simply match the 'f' (by the default rule).
1779.LP
1780The comment regarding State #8 indicates there's a problem
1781when "foob" has been scanned. Indeed, on any character other
1782than a 'b', the scanner will have to back up to accept "foo".
1783Similarly, the comment for State #9 concerns when "fooba" has
1784been scanned.
1785.LP
1786The final comment reminds us that there's no point going to
1787all the trouble of removing backtracking from the rules unless
1788we're using
1789.B -f
1790or
1791.B -F,
1792since there's no performance gain doing so with compressed scanners.
1793.LP
1794The way to remove the backtracking is to add "error" rules:
1795.nf
1796
1797 %%
1798 foo return TOK_KEYWORD;
1799 foobar return TOK_KEYWORD;
1800
1801 fooba |
1802 foob |
1803 fo {
1804 /* false alarm, not really a keyword */
1805 return TOK_ID;
1806 }
1807
1808.fi
1809.LP
1810Eliminating backtracking among a list of keywords can also be
1811done using a "catch-all" rule:
1812.nf
1813
1814 %%
1815 foo return TOK_KEYWORD;
1816 foobar return TOK_KEYWORD;
1817
1818 [a-z]+ return TOK_ID;
1819
1820.fi
1821This is usually the best solution when appropriate.
1822.LP
1823Backtracking messages tend to cascade.
1824With a complicated set of rules it's not uncommon to get hundreds
1825of messages. If one can decipher them, though, it often
1826only takes a dozen or so rules to eliminate the backtracking (though
1827it's easy to make a mistake and have an error rule accidentally match
1828a valid token. A possible future
1829.I flex
1830feature will be to automatically add rules to eliminate backtracking).
1831.LP
1832.I Variable
1833trailing context (where both the leading and trailing parts do not have
1834a fixed length) entails almost the same performance loss as
1835.I REJECT
1836(i.e., substantial). So when possible a rule like:
1837.nf
1838
1839 %%
1840 mouse|rat/(cat|dog) run();
1841
1842.fi
1843is better written:
1844.nf
1845
1846 %%
1847 mouse/cat|dog run();
1848 rat/cat|dog run();
1849
1850.fi
1851or as
1852.nf
1853
1854 %%
1855 mouse|rat/cat run();
1856 mouse|rat/dog run();
1857
1858.fi
1859Note that here the special '|' action does
1860.I not
1861provide any savings, and can even make things worse (see
1862.B BUGS
1863in flex(1)).
1864.LP
1865Another area where the user can increase a scanner's performance
1866(and one that's easier to implement) arises from the fact that
1867the longer the tokens matched, the faster the scanner will run.
1868This is because with long tokens the processing of most input
1869characters takes place in the (short) inner scanning loop, and
1870does not often have to go through the additional work of setting up
1871the scanning environment (e.g.,
1872.B yytext)
1873for the action. Recall the scanner for C comments:
1874.nf
1875
1876 %x comment
1877 %%
1878 int line_num = 1;
1879
1880 "/*" BEGIN(comment);
1881
1882 <comment>[^*\\n]*
1883 <comment>"*"+[^*/\\n]*
1884 <comment>\\n ++line_num;
1885 <comment>"*"+"/" BEGIN(INITIAL);
1886
1887.fi
1888This could be sped up by writing it as:
1889.nf
1890
1891 %x comment
1892 %%
1893 int line_num = 1;
1894
1895 "/*" BEGIN(comment);
1896
1897 <comment>[^*\\n]*
1898 <comment>[^*\\n]*\\n ++line_num;
1899 <comment>"*"+[^*/\\n]*
1900 <comment>"*"+[^*/\\n]*\\n ++line_num;
1901 <comment>"*"+"/" BEGIN(INITIAL);
1902
1903.fi
1904Now instead of each newline requiring the processing of another
1905action, recognizing the newlines is "distributed" over the other rules
1906to keep the matched text as long as possible. Note that
1907.I adding
1908rules does
1909.I not
1910slow down the scanner! The speed of the scanner is independent
1911of the number of rules or (modulo the considerations given at the
1912beginning of this section) how complicated the rules are with
1913regard to operators such as '*' and '|'.
1914.LP
1915A final example in speeding up a scanner: suppose you want to scan
1916through a file containing identifiers and keywords, one per line
1917and with no other extraneous characters, and recognize all the
1918keywords. A natural first approach is:
1919.nf
1920
1921 %%
1922 asm |
1923 auto |
1924 break |
1925 ... etc ...
1926 volatile |
1927 while /* it's a keyword */
1928
1929 .|\\n /* it's not a keyword */
1930
1931.fi
1932To eliminate the back-tracking, introduce a catch-all rule:
1933.nf
1934
1935 %%
1936 asm |
1937 auto |
1938 break |
1939 ... etc ...
1940 volatile |
1941 while /* it's a keyword */
1942
1943 [a-z]+ |
1944 .|\\n /* it's not a keyword */
1945
1946.fi
1947Now, if it's guaranteed that there's exactly one word per line,
1948then we can reduce the total number of matches by a half by
1949merging in the recognition of newlines with that of the other
1950tokens:
1951.nf
1952
1953 %%
1954 asm\\n |
1955 auto\\n |
1956 break\\n |
1957 ... etc ...
1958 volatile\\n |
1959 while\\n /* it's a keyword */
1960
1961 [a-z]+\\n |
1962 .|\\n /* it's not a keyword */
1963
1964.fi
1965One has to be careful here, as we have now reintroduced backtracking
1966into the scanner. In particular, while
1967.I we
1968know that there will never be any characters in the input stream
1969other than letters or newlines,
1970.I flex
1971can't figure this out, and it will plan for possibly needing backtracking
1972when it has scanned a token like "auto" and then the next character
1973is something other than a newline or a letter. Previously it would
1974then just match the "auto" rule and be done, but now it has no "auto"
1975rule, only a "auto\\n" rule. To eliminate the possibility of backtracking,
1976we could either duplicate all rules but without final newlines, or,
1977since we never expect to encounter such an input and therefore don't
1978how it's classified, we can introduce one more catch-all rule, this
1979one which doesn't include a newline:
1980.nf
1981
1982 %%
1983 asm\\n |
1984 auto\\n |
1985 break\\n |
1986 ... etc ...
1987 volatile\\n |
1988 while\\n /* it's a keyword */
1989
1990 [a-z]+\\n |
1991 [a-z]+ |
1992 .|\\n /* it's not a keyword */
1993
1994.fi
1995Compiled with
1996.B -Cf,
1997this is about as fast as one can get a
1998.I flex
1999scanner to go for this particular problem.
2000.LP
2001A final note:
2002.I flex
2003is slow when matching NUL's, particularly when a token contains
2004multiple NUL's.
2005It's best to write rules which match
2006.I short
2007amounts of text if it's anticipated that the text will often include NUL's.
2008.SH INCOMPATIBILITIES WITH LEX AND POSIX
2009.I flex
2010is a rewrite of the Unix
2011.I lex
2012tool (the two implementations do not share any code, though),
2013with some extensions and incompatibilities, both of which
2014are of concern to those who wish to write scanners acceptable
2015to either implementation. At present, the POSIX
2016.I lex
2017draft is
2018very close to the original
2019.I lex
2020implementation, so some of these
2021incompatibilities are also in conflict with the POSIX draft. But
2022the intent is that except as noted below,
2023.I flex
2024as it presently stands will
2025ultimately be POSIX conformant (i.e., that those areas of conflict with
2026the POSIX draft will be resolved in
2027.I flex's
2028favor). Please bear in
2029mind that all the comments which follow are with regard to the POSIX
2030.I draft
2031standard of Summer 1989, and not the final document (or subsequent
2032drafts); they are included so
2033.I flex
2034users can be aware of the standardization issues and those areas where
2035.I flex
2036may in the near future undergo changes incompatible with
2037its current definition.
2038.LP
2039.I flex
2040is fully compatible with
2041.I lex
2042with the following exceptions:
2043.IP -
2044.I lex
2045does not support exclusive start conditions (%x), though they
2046are in the current POSIX draft.
2047.IP -
2048When definitions are expanded,
2049.I flex
2050encloses them in parentheses.
2051With lex, the following:
2052.nf
2053
2054 NAME [A-Z][A-Z0-9]*
2055 %%
2056 foo{NAME}? printf( "Found it\\n" );
2057 %%
2058
2059.fi
2060will not match the string "foo" because when the macro
2061is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?"
2062and the precedence is such that the '?' is associated with
2063"[A-Z0-9]*". With
2064.I flex,
2065the rule will be expanded to
2066"foo([A-Z][A-Z0-9]*)?" and so the string "foo" will match.
2067Note that because of this, the
2068.B ^, $, <s>, /,
2069and
2070.B <<EOF>>
2071operators cannot be used in a
2072.I flex
2073definition.
2074.IP
2075The POSIX draft interpretation is the same as
2076.I flex's.
2077.IP -
2078To specify a character class which matches anything but a left bracket (']'),
2079in
2080.I lex
2081one can use "[^]]" but with
2082.I flex
2083one must use "[^\\]]". The latter works with
2084.I lex,
2085too.
2086.IP -
2087The undocumented
2088.I lex
2089scanner internal variable
2090.B yylineno
2091is not supported. (The variable is not part of the POSIX draft.)
2092.IP -
2093The
2094.B input()
2095routine is not redefinable, though it may be called to read characters
2096following whatever has been matched by a rule. If
2097.B input()
2098encounters an end-of-file the normal
2099.B yywrap()
2100processing is done. A ``real'' end-of-file is returned by
2101.B input()
2102as
2103.I EOF.
2104.IP
2105Input is instead controlled by redefining the
2106.B YY_INPUT
2107macro.
2108.IP
2109The
2110.I flex
2111restriction that
2112.B input()
2113cannot be redefined is in accordance with the POSIX draft, but
2114.B YY_INPUT
2115has not yet been accepted into the draft.
2116.IP -
2117.B output()
2118is not supported.
2119Output from the
2120.B ECHO
2121macro is done to the file-pointer
2122.I yyout
2123(default
2124.I stdout).
2125.IP
2126The POSIX draft mentions that an
2127.B output()
2128routine exists but currently gives no details as to what it does.
2129.IP -
2130The
2131.I lex
2132.B %r
2133(generate a Ratfor scanner) option is not supported. It is not part
2134of the POSIX draft.
2135.IP -
2136If you are providing your own yywrap() routine, you must include a
2137"#undef yywrap" in the definitions section (section 1). Note that
2138the "#undef" will have to be enclosed in %{}'s.
2139.IP
2140The POSIX draft
2141specifies that yywrap() is a function and this is unlikely to change; so
2142.I flex users are warned
2143that
2144.B yywrap()
2145is likely to be changed to a function in the near future.
2146.IP -
2147After a call to
2148.B unput(),
2149.I yytext
2150and
2151.I yyleng
2152are undefined until the next token is matched. This is not the case with
2153.I lex
2154or the present POSIX draft.
2155.IP -
2156The precedence of the
2157.B {}
2158(numeric range) operator is different.
2159.I lex
2160interprets "abc{1,3}" as "match one, two, or
2161three occurrences of 'abc'", whereas
2162.I flex
2163interprets it as "match 'ab'
2164followed by one, two, or three occurrences of 'c'". The latter is
2165in agreement with the current POSIX draft.
2166.IP -
2167The precedence of the
2168.B ^
2169operator is different.
2170.I lex
2171interprets "^foo|bar" as "match either 'foo' at the beginning of a line,
2172or 'bar' anywhere", whereas
2173.I flex
2174interprets it as "match either 'foo' or 'bar' if they come at the beginning
2175of a line". The latter is in agreement with the current POSIX draft.
2176.IP -
2177To refer to yytext outside of the scanner source file,
2178the correct definition with
2179.I flex
2180is "extern char *yytext" rather than "extern char yytext[]".
2181This is contrary to the current POSIX draft but a point on which
2182.I flex
2183will not be changing, as the array representation entails a
2184serious performance penalty. It is hoped that the POSIX draft will
2185be emended to support the
2186.I flex
2187variety of declaration (as this is a fairly painless change to
2188require of
2189.I lex
2190users).
2191.IP -
2192.I yyin
2193is
2194.I initialized
2195by
2196.I lex
2197to be
2198.I stdin;
2199.I flex,
2200on the other hand,
2201initializes
2202.I yyin
2203to NULL
2204and then
2205.I assigns
2206it to
2207.I stdin
2208the first time the scanner is called, providing
2209.I yyin
2210has not already been assigned to a non-NULL value. The difference is
2211subtle, but the net effect is that with
2212.I flex
2213scanners,
2214.I yyin
2215does not have a valid value until the scanner has been called.
2216.IP -
2217The special table-size declarations such as
2218.B %a
2219supported by
2220.I lex
2221are not required by
2222.I flex
2223scanners;
2224.I flex
2225ignores them.
2226.IP -
2227The name
2228.bd
2229FLEX_SCANNER
2230is #define'd so scanners may be written for use with either
2231.I flex
2232or
2233.I lex.
2234.LP
2235The following
2236.I flex
2237features are not included in
2238.I lex
2239or the POSIX draft standard:
2240.nf
2241
2242 yyterminate()
2243 <<EOF>>
2244 YY_DECL
2245 #line directives
2246 %{}'s around actions
2247 yyrestart()
2248 comments beginning with '#' (deprecated)
2249 multiple actions on a line
2250
2251.fi
2252This last feature refers to the fact that with
2253.I flex
2254you can put multiple actions on the same line, separated with
2255semi-colons, while with
2256.I lex,
2257the following
2258.nf
2259
2260 foo handle_foo(); ++num_foos_seen;
2261
2262.fi
2263is (rather surprisingly) truncated to
2264.nf
2265
2266 foo handle_foo();
2267
2268.fi
2269.I flex
2270does not truncate the action. Actions that are not enclosed in
2271braces are simply terminated at the end of the line.
2272.SH DIAGNOSTICS
2273.I reject_used_but_not_detected undefined
2274or
2275.I yymore_used_but_not_detected undefined -
2276These errors can occur at compile time. They indicate that the
2277scanner uses
2278.B REJECT
2279or
2280.B yymore()
2281but that
2282.I flex
2283failed to notice the fact, meaning that
2284.I flex
2285scanned the first two sections looking for occurrences of these actions
2286and failed to find any, but somehow you snuck some in (via a #include
2287file, for example). Make an explicit reference to the action in your
2288.I flex
2289input file. (Note that previously
2290.I flex
2291supported a
2292.B %used/%unused
2293mechanism for dealing with this problem; this feature is still supported
2294but now deprecated, and will go away soon unless the author hears from
2295people who can argue compellingly that they need it.)
2296.LP
2297.I flex scanner jammed -
2298a scanner compiled with
2299.B -s
2300has encountered an input string which wasn't matched by
2301any of its rules.
2302.LP
2303.I flex input buffer overflowed -
2304a scanner rule matched a string long enough to overflow the
2305scanner's internal input buffer (16K bytes by default - controlled by
2306.B YY_BUF_SIZE
2307in "flex.skel". Note that to redefine this macro, you must first
2308.B #undefine
2309it).
2310.LP
2311.I scanner requires -8 flag -
2312Your scanner specification includes recognizing 8-bit characters and
2313you did not specify the -8 flag (and your site has not installed flex
2314with -8 as the default).
2315.LP
2316.I too many %t classes! -
2317You managed to put every single character into its own %t class.
2318.I flex
2319requires that at least one of the classes share characters.
2320.SH DEFICIENCIES / BUGS
2321See flex(1).
2322.SH "SEE ALSO"
2323.LP
2324flex(1), lex(1), yacc(1), sed(1), awk(1).
2325.LP
2326M. E. Lesk and E. Schmidt,
2327.I LEX - Lexical Analyzer Generator
2328.SH AUTHOR
2329Vern Paxson, with the help of many ideas and much inspiration from
2330Van Jacobson. Original version by Jef Poskanzer. The fast table
2331representation is a partial implementation of a design done by Van
2332Jacobson. The implementation was done by Kevin Gong and Vern Paxson.
2333.LP
2334Thanks to the many
2335.I flex
2336beta-testers, feedbackers, and contributors, especially Casey
2337Leedom, benson@odi.com,
2338Frederic Brehm, Nick Christopher, Jason Coughlin,
2339Scott David Daniels, Leo Eskin,
2340Chris Faylor, Eric Goldman, Eric
2341Hughes, Jeffrey R. Jones, Kevin B. Kenny, Ronald Lamprecht,
2342Greg Lee, Craig Leres, Mohamed el Lozy, Jim Meyering, Marc Nozell, Esmond Pitt,
2343Jef Poskanzer, Jim Roskind,
2344Dave Tallman, Frank Whaley, Ken Yap, and those whose names
2345have slipped my marginal mail-archiving skills but whose contributions
2346are appreciated all the same.
2347.LP
2348Thanks to Keith Bostic, John Gilmore, Craig Leres, Bob
2349Mulcahy, Rich Salz, and Richard Stallman for help with various distribution
2350headaches.
2351.LP
2352Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
2353to Benson Margulies and Fred
2354Burke for C++ support; to Ove Ewerlid for the basics of support for
2355NUL's; and to Eric Hughes for the basics of support for multiple buffers.
2356.LP
2357Work is being done on extending
2358.I flex
2359to generate scanners in which the
2360state machine is directly represented in C code rather than tables.
2361These scanners may well be substantially faster than those generated
2362using -f or -F. If you are working in this area and are interested
2363in comparing notes and seeing whether redundant work can be avoided,
2364contact Ove Ewerlid (ewerlid@mizar.DoCS.UU.SE).
2365.LP
2366This work was primarily done when I was at the Real Time Systems Group
2367at the Lawrence Berkeley Laboratory in Berkeley, CA. Many thanks to all there
2368for the support I received.
2369.LP
2370Send comments to:
2371.nf
2372
2373 Vern Paxson
2374 Computer Science Department
2375 4126 Upson Hall
2376 Cornell University
2377 Ithaca, NY 14853-7501
2378
2379 vern@cs.cornell.edu
2380 decvax!cornell!vern
2381
2382.fi