BSD 4_3_Net_2 release
[unix-history] / usr / src / usr.bin / lex / flexdoc.1
CommitLineData
af359dea
C
1.\" Copyright (c) 1991 The Regents of the University of California.
2.\" All rights reserved.
3.\"
4.\" new copyright; att/bsd/shared
5.\" This code is derived from software contributed to Berkeley by
6.\" Vern Paxson of Lawrence Berkeley Laboratory.
7.\"
8.\" The United States Government has rights in this work pursuant
9.\" to contract no. DE-AC03-76SF00098 between the United States
10.\" Department of Energy and the University of California.
11.\"
12.\" Redistribution and use in source and binary forms, with or without
13.\" modification, are permitted provided that the following conditions
14.\" are met:
15.\" 1. Redistributions of source code must retain the above copyright
16.\" notice, this list of conditions and the following disclaimer.
17.\" 2. Redistributions in binary form must reproduce the above copyright
18.\" notice, this list of conditions and the following disclaimer in the
19.\" documentation and/or other materials provided with the distribution.
20.\" 3. All advertising materials mentioning features or use of this software
21.\" must display the following acknowledgement:
22.\" This product includes software developed by the University of
23.\" California, Berkeley and its contributors.
24.\" 4. Neither the name of the University nor the names of its contributors
25.\" may be used to endorse or promote products derived from this software
26.\" without specific prior written permission.
27.\"
28.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
29.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
30.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
31.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
32.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
33.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
34.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
35.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
36.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
37.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
38.\" SUCH DAMAGE.
39.\"
40.\" @(#)flexdoc.1 5.2 (Berkeley) 4/12/91
41.\"
42.TH FLEX 1 "April 12, 1991"
810a4854
C
43.SH NAME
44flex - fast lexical analyzer generator
45.SH SYNOPSIS
46.B flex
47.B [-bcdfinpstvFILT8 -C[efmF] -Sskeleton]
48.I [filename ...]
49.SH DESCRIPTION
50.I flex
51is a tool for generating
52.I scanners:
53programs which recognized lexical patterns in text.
54.I flex
55reads
56the given input files, or its standard input if no file names are given,
57for a description of a scanner to generate. The description is in
58the form of pairs
59of regular expressions and C code, called
60.I rules. flex
61generates as output a C source file,
62.B lex.yy.c,
63which defines a routine
64.B yylex().
65This file is compiled and linked with the
66.B -lfl
67library to produce an executable. When the executable is run,
68it analyzes its input for occurrences
69of the regular expressions. Whenever it finds one, it executes
70the corresponding C code.
71.SH SOME SIMPLE EXAMPLES
72.LP
73First some simple examples to get the flavor of how one uses
74.I flex.
75The following
76.I flex
77input specifies a scanner which whenever it encounters the string
78"username" will replace it with the user's login name:
79.nf
80
81 %%
82 username printf( "%s", getlogin() );
83
84.fi
85By default, any text not matched by a
86.I flex
87scanner
88is copied to the output, so the net effect of this scanner is
89to copy its input file to its output with each occurrence
90of "username" expanded.
91In this input, there is just one rule. "username" is the
92.I pattern
93and the "printf" is the
94.I action.
95The "%%" marks the beginning of the rules.
96.LP
97Here's another simple example:
98.nf
99
100 int num_lines = 0, num_chars = 0;
101
102 %%
103 \\n ++num_lines; ++num_chars;
104 . ++num_chars;
105
106 %%
107 main()
108 {
109 yylex();
110 printf( "# of lines = %d, # of chars = %d\\n",
111 num_lines, num_chars );
112 }
113
114.fi
115This scanner counts the number of characters and the number
116of lines in its input (it produces no output other than the
117final report on the counts). The first line
118declares two globals, "num_lines" and "num_chars", which are accessible
119both inside
120.B yylex()
121and in the
122.B main()
123routine declared after the second "%%". There are two rules, one
124which matches a newline ("\\n") and increments both the line count and
125the character count, and one which matches any character other than
126a newline (indicated by the "." regular expression).
127.LP
128A somewhat more complicated example:
129.nf
130
131 /* scanner for a toy Pascal-like language */
132
133 %{
134 /* need this for the call to atof() below */
135 #include <math.h>
136 %}
137
138 DIGIT [0-9]
139 ID [a-z][a-z0-9]*
140
141 %%
142
143 {DIGIT}+ {
144 printf( "An integer: %s (%d)\\n", yytext,
145 atoi( yytext ) );
146 }
147
148 {DIGIT}+"."{DIGIT}* {
149 printf( "A float: %s (%d)\\n", yytext,
150 atof( yytext ) );
151 }
152
153 if|then|begin|end|procedure|function {
154 printf( "A keyword: %s\\n", yytext );
155 }
156
157 {ID} printf( "An identifier: %s\\n", yytext );
158
159 "+"|"-"|"*"|"/" printf( "An operator: %s\\n", yytext );
160
161 "{"[^}\\n]*"}" /* eat up one-line comments */
162
163 [ \\t\\n]+ /* eat up whitespace */
164
165 . printf( "Unrecognized character: %s\\n", yytext );
166
167 %%
168
169 main( argc, argv )
170 int argc;
171 char **argv;
172 {
173 ++argv, --argc; /* skip over program name */
174 if ( argc > 0 )
175 yyin = fopen( argv[0], "r" );
176 else
177 yyin = stdin;
178
179 yylex();
180 }
181
182.fi
183This is the beginnings of a simple scanner for a language like
184Pascal. It identifies different types of
185.I tokens
186and reports on what it has seen.
187.LP
188The details of this example will be explained in the following
189sections.
190.SH FORMAT OF THE INPUT FILE
191The
192.I flex
193input file consists of three sections, separated by a line with just
194.B %%
195in it:
196.nf
197
198 definitions
199 %%
200 rules
201 %%
202 user code
203
204.fi
205The
206.I definitions
207section contains declarations of simple
208.I name
209definitions to simplify the scanner specification, and declarations of
210.I start conditions,
211which are explained in a later section.
212.LP
213Name definitions have the form:
214.nf
215
216 name definition
217
218.fi
219The "name" is a word beginning with a letter or an underscore ('_')
220followed by zero or more letters, digits, '_', or '-' (dash).
221The definition is taken to begin at the first non-white-space character
222following the name and continuing to the end of the line.
223The definition can subsequently be referred to using "{name}", which
224will expand to "(definition)". For example,
225.nf
226
227 DIGIT [0-9]
228 ID [a-z][a-z0-9]*
229
230.fi
231defines "DIGIT" to be a regular expression which matches a
232single digit, and
233"ID" to be a regular expression which matches a letter
234followed by zero-or-more letters-or-digits.
235A subsequent reference to
236.nf
237
238 {DIGIT}+"."{DIGIT}*
239
240.fi
241is identical to
242.nf
243
244 ([0-9])+"."([0-9])*
245
246.fi
247and matches one-or-more digits followed by a '.' followed
248by zero-or-more digits.
249.LP
250The
251.I rules
252section of the
253.I flex
254input contains a series of rules of the form:
255.nf
256
257 pattern action
258
259.fi
260where the pattern must be unindented and the action must begin
261on the same line.
262.LP
263See below for a further description of patterns and actions.
264.LP
265Finally, the user code section is simply copied to
266.B lex.yy.c
267verbatim.
268It is used for companion routines which call or are called
269by the scanner. The presence of this section is optional;
270if it is missing, the second
271.B %%
272in the input file may be skipped, too.
273.LP
274In the definitions and rules sections, any
275.I indented
276text or text enclosed in
277.B %{
278and
279.B %}
280is copied verbatim to the output (with the %{}'s removed).
281The %{}'s must appear unindented on lines by themselves.
282.LP
283In the rules section,
284any indented or %{} text appearing before the
285first rule may be used to declare variables
286which are local to the scanning routine and (after the declarations)
287code which is to be executed whenever the scanning routine is entered.
288Other indented or %{} text in the rule section is still copied to the output,
289but its meaning is not well-defined and it may well cause compile-time
290errors (this feature is present for
291.I POSIX
292compliance; see below for other such features).
293.LP
294In the definitions section, an unindented comment (i.e., a line
295beginning with "/*") is also copied verbatim to the output up
296to the next "*/". Also, any line in the definitions section
297beginning with '#' is ignored, though this style of comment is
298deprecated and may go away in the future.
299.SH PATTERNS
300The patterns in the input are written using an extended set of regular
301expressions. These are:
302.nf
303
304 x match the character 'x'
305 . any character except newline
306 [xyz] a "character class"; in this case, the pattern
307 matches either an 'x', a 'y', or a 'z'
308 [abj-oZ] a "character class" with a range in it; matches
309 an 'a', a 'b', any letter from 'j' through 'o',
310 or a 'Z'
311 [^A-Z] a "negated character class", i.e., any character
312 but those in the class. In this case, any
313 character EXCEPT an uppercase letter.
314 [^A-Z\\n] any character EXCEPT an uppercase letter or
315 a newline
316 r* zero or more r's, where r is any regular expression
317 r+ one or more r's
318 r? zero or one r's (that is, "an optional r")
319 r{2,5} anywhere from two to five r's
320 r{2,} two or more r's
321 r{4} exactly 4 r's
322 {name} the expansion of the "name" definition
323 (see above)
324 "[xyz]\\"foo"
325 the literal string: [xyz]"foo
326 \\X if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v',
327 then the ANSI-C interpretation of \\x.
328 Otherwise, a literal 'X' (used to escape
329 operators such as '*')
330 \\123 the character with octal value 123
331 \\x2a the character with hexadecimal value 2a
332 (r) match an r; parentheses are used to override
333 precedence (see below)
334
335
336 rs the regular expression r followed by the
337 regular expression s; called "concatenation"
338
339
340 r|s either an r or an s
341
342
343 r/s an r but only if it is followed by an s. The
344 s is not part of the matched text. This type
345 of pattern is called as "trailing context".
346 ^r an r, but only at the beginning of a line
347 r$ an r, but only at the end of a line. Equivalent
348 to "r/\\n".
349
350
351 <s>r an r, but only in start condition s (see
352 below for discussion of start conditions)
353 <s1,s2,s3>r
354 same, but in any of start conditions s1,
355 s2, or s3
356
357
358 <<EOF>> an end-of-file
359 <s1,s2><<EOF>>
360 an end-of-file when in start condition s1 or s2
361
362.fi
363The regular expressions listed above are grouped according to
364precedence, from highest precedence at the top to lowest at the bottom.
365Those grouped together have equal precedence. For example,
366.nf
367
368 foo|bar*
369
370.fi
371is the same as
372.nf
373
374 (foo)|(ba(r*))
375
376.fi
377since the '*' operator has higher precedence than concatenation,
378and concatenation higher than alternation ('|'). This pattern
379therefore matches
380.I either
381the string "foo"
382.I or
383the string "ba" followed by zero-or-more r's.
384To match "foo" or zero-or-more "bar"'s, use:
385.nf
386
387 foo|(bar)*
388
389.fi
390and to match zero-or-more "foo"'s-or-"bar"'s:
391.nf
392
393 (foo|bar)*
394
395.fi
396.LP
397Some notes on patterns:
398.IP -
399A negated character class such as the example "[^A-Z]"
400above
401.I will match a newline
402unless "\\n" (or an equivalent escape sequence) is one of the
403characters explicitly present in the negated character class
404(e.g., "[^A-Z\\n]"). This is unlike how many other regular
405expression tools treat negated character classes, but unfortunately
406the inconsistency is historically entrenched.
407Matching newlines means that a pattern like [^"]* can match an entire
408input (overflowing the scanner's input buffer) unless there's another
409quote in the input.
410.IP -
411A rule can have at most one instance of trailing context (the '/' operator
412or the '$' operator). The start condition, '^', and "<<EOF>>" patterns
413can only occur at the beginning of a pattern, and, as well as with '/' and '$',
414cannot be grouped inside parentheses. A '^' which does not occur at
415the beginning of a rule or a '$' which does not occur at the end of
416a rule loses its special properties and is treated as a normal character.
417.IP
418The following are illegal:
419.nf
420
421 foo/bar$
422 <sc1>foo<sc2>bar
423
424.fi
425Note that the first of these, can be written "foo/bar\\n".
426.IP
427The following will result in '$' or '^' being treated as a normal character:
428.nf
429
430 foo|(bar$)
431 foo|^bar
432
433.fi
434If what's wanted is a "foo" or a bar-followed-by-a-newline, the following
435could be used (the special '|' action is explained below):
436.nf
437
438 foo |
439 bar$ /* action goes here */
440
441.fi
442A similar trick will work for matching a foo or a
443bar-at-the-beginning-of-a-line.
444.SH HOW THE INPUT IS MATCHED
445When the generated scanner is run, it analyzes its input looking
446for strings which match any of its patterns. If it finds more than
447one match, it takes the one matching the most text (for trailing
448context rules, this includes the length of the trailing part, even
449though it will then be returned to the input). If it finds two
450or more matches of the same length, the
451rule listed first in the
452.I flex
453input file is chosen.
454.LP
455Once the match is determined, the text corresponding to the match
456(called the
457.I token)
458is made available in the global character pointer
459.B yytext,
460and its length in the global integer
461.B yyleng.
462The
463.I action
464corresponding to the matched pattern is then executed (a more
465detailed description of actions follows), and then the remaining
466input is scanned for another match.
467.LP
468If no match is found, then the
469.I default rule
470is executed: the next character in the input is considered matched and
471copied to the standard output. Thus, the simplest legal
472.I flex
473input is:
474.nf
475
476 %%
477
478.fi
479which generates a scanner that simply copies its input (one character
480at a time) to its output.
481.SH ACTIONS
482Each pattern in a rule has a corresponding action, which can be any
483arbitrary C statement. The pattern ends at the first non-escaped
484whitespace character; the remainder of the line is its action. If the
485action is empty, then when the pattern is matched the input token
486is simply discarded. For example, here is the specification for a program
487which deletes all occurrences of "zap me" from its input:
488.nf
489
490 %%
491 "zap me"
492
493.fi
494(It will copy all other characters in the input to the output since
495they will be matched by the default rule.)
496.LP
497Here is a program which compresses multiple blanks and tabs down to
498a single blank, and throws away whitespace found at the end of a line:
499.nf
500
501 %%
502 [ \\t]+ putchar( ' ' );
503 [ \\t]+$ /* ignore this token */
504
505.fi
506.LP
507If the action contains a '{', then the action spans till the balancing '}'
508is found, and the action may cross multiple lines.
509.I flex
510knows about C strings and comments and won't be fooled by braces found
511within them, but also allows actions to begin with
512.B %{
513and will consider the action to be all the text up to the next
514.B %}
515(regardless of ordinary braces inside the action).
516.LP
517An action consisting solely of a vertical bar ('|') means "same as
518the action for the next rule." See below for an illustration.
519.LP
520Actions can include arbitrary C code, including
521.B return
522statements to return a value to whatever routine called
523.B yylex().
524Each time
525.B yylex()
526is called it continues processing tokens from where it last left
527off until it either reaches
528the end of the file or executes a return. Once it reaches an end-of-file,
529however, then any subsequent call to
530.B yylex()
531will simply immediately return, unless
532.B yyrestart()
533is first called (see below).
534.LP
535Actions are not allowed to modify yytext or yyleng.
536.LP
537There are a number of special directives which can be included within
538an action:
539.IP -
540.B ECHO
541copies yytext to the scanner's output.
542.IP -
543.B BEGIN
544followed by the name of a start condition places the scanner in the
545corresponding start condition (see below).
546.IP -
547.B REJECT
548directs the scanner to proceed on to the "second best" rule which matched the
549input (or a prefix of the input). The rule is chosen as described
550above in "How the Input is Matched", and
551.B yytext
552and
553.B yyleng
554set up appropriately.
555It may either be one which matched as much text
556as the originally chosen rule but came later in the
557.I flex
558input file, or one which matched less text.
559For example, the following will both count the
560words in the input and call the routine special() whenever "frob" is seen:
561.nf
562
563 int word_count = 0;
564 %%
565
566 frob special(); REJECT;
567 [^ \\t\\n]+ ++word_count;
568
569.fi
570Without the
571.B REJECT,
572any "frob"'s in the input would not be counted as words, since the
573scanner normally executes only one action per token.
574Multiple
575.B REJECT's
576are allowed, each one finding the next best choice to the currently
577active rule. For example, when the following scanner scans the token
578"abcd", it will write "abcdabcaba" to the output:
579.nf
580
581 %%
582 a |
583 ab |
584 abc |
585 abcd ECHO; REJECT;
586 .|\\n /* eat up any unmatched character */
587
588.fi
589(The first three rules share the fourth's action since they use
590the special '|' action.)
591.B REJECT
592is a particularly expensive feature in terms scanner performance;
593if it is used in
594.I any
595of the scanner's actions it will slow down
596.I all
597of the scanner's matching. Furthermore,
598.B REJECT
599cannot be used with the
600.I -f
601or
602.I -F
603options (see below).
604.IP
605Note also that unlike the other special actions,
606.B REJECT
607is a
608.I branch;
609code immediately following it in the action will
610.I not
611be executed.
612.IP -
613.B yymore()
614tells the scanner that the next time it matches a rule, the corresponding
615token should be
616.I appended
617onto the current value of
618.B yytext
619rather than replacing it. For example, given the input "mega-kludge"
620the following will write "mega-mega-kludge" to the output:
621.nf
622
623 %%
624 mega- ECHO; yymore();
625 kludge ECHO;
626
627.fi
628First "mega-" is matched and echoed to the output. Then "kludge"
629is matched, but the previous "mega-" is still hanging around at the
630beginning of
631.B yytext
632so the
633.B ECHO
634for the "kludge" rule will actually write "mega-kludge".
635The presence of
636.B yymore()
637in the scanner's action entails a minor performance penalty in the
638scanner's matching speed.
639.IP -
640.B yyless(n)
641returns all but the first
642.I n
643characters of the current token back to the input stream, where they
644will be rescanned when the scanner looks for the next match.
645.B yytext
646and
647.B yyleng
648are adjusted appropriately (e.g.,
649.B yyleng
650will now be equal to
651.I n
652). For example, on the input "foobar" the following will write out
653"foobarbar":
654.nf
655
656 %%
657 foobar ECHO; yyless(3);
658 [a-z]+ ECHO;
659
660.fi
661An argument of 0 to
662.B yyless
663will cause the entire current input string to be scanned again. Unless you've
664changed how the scanner will subsequently process its input (using
665.B BEGIN,
666for example), this will result in an endless loop.
667.IP -
668.B unput(c)
669puts the character
670.I c
671back onto the input stream. It will be the next character scanned.
672The following action will take the current token and cause it
673to be rescanned enclosed in parentheses.
674.nf
675
676 {
677 int i;
678 unput( ')' );
679 for ( i = yyleng - 1; i >= 0; --i )
680 unput( yytext[i] );
681 unput( '(' );
682 }
683
684.fi
685Note that since each
686.B unput()
687puts the given character back at the
688.I beginning
689of the input stream, pushing back strings must be done back-to-front.
690.IP -
691.B input()
692reads the next character from the input stream. For example,
693the following is one way to eat up C comments:
694.nf
695
696 %%
697 "/*" {
698 register int c;
699
700 for ( ; ; )
701 {
702 while ( (c = input()) != '*' &&
703 c != EOF )
704 ; /* eat up text of comment */
705
706 if ( c == '*' )
707 {
708 while ( (c = input()) == '*' )
709 ;
710 if ( c == '/' )
711 break; /* found the end */
712 }
713
714 if ( c == EOF )
715 {
716 error( "EOF in comment" );
717 break;
718 }
719 }
720 }
721
722.fi
723(Note that if the scanner is compiled using
724.B C++,
725then
726.B input()
727is instead referred to as
728.B yyinput(),
729in order to avoid a name clash with the
730.B C++
731stream by the name of
732.I input.)
733.IP -
734.B yyterminate()
735can be used in lieu of a return statement in an action. It terminates
736the scanner and returns a 0 to the scanner's caller, indicating "all done".
737Subsequent calls to the scanner will immediately return unless preceded
738by a call to
739.B yyrestart()
740(see below).
741By default,
742.B yyterminate()
743is also called when an end-of-file is encountered. It is a macro and
744may be redefined.
745.SH THE GENERATED SCANNER
746The output of
747.I flex
748is the file
749.B lex.yy.c,
750which contains the scanning routine
751.B yylex(),
752a number of tables used by it for matching tokens, and a number
753of auxiliary routines and macros. By default,
754.B yylex()
755is declared as follows:
756.nf
757
758 int yylex()
759 {
760 ... various definitions and the actions in here ...
761 }
762
763.fi
764(If your environment supports function prototypes, then it will
765be "int yylex( void )".) This definition may be changed by redefining
766the "YY_DECL" macro. For example, you could use:
767.nf
768
769 #undef YY_DECL
770 #define YY_DECL float lexscan( a, b ) float a, b;
771
772.fi
773to give the scanning routine the name
774.I lexscan,
775returning a float, and taking two floats as arguments. Note that
776if you give arguments to the scanning routine using a
777K&R-style/non-prototyped function declaration, you must terminate
778the definition with a semi-colon (;).
779.LP
780Whenever
781.B yylex()
782is called, it scans tokens from the global input file
783.I yyin
784(which defaults to stdin). It continues until it either reaches
785an end-of-file (at which point it returns the value 0) or
786one of its actions executes a
787.I return
788statement.
789In the former case, when called again the scanner will immediately
790return unless
791.B yyrestart()
792is called to point
793.I yyin
794at the new input file. (
795.B yyrestart()
796takes one argument, a
797.B FILE *
798pointer.)
799In the latter case (i.e., when an action
800executes a return), the scanner may then be called again and it
801will resume scanning where it left off.
802.LP
803By default (and for purposes of efficiency), the scanner uses
804block-reads rather than simple
805.I getc()
806calls to read characters from
807.I yyin.
808The nature of how it gets its input can be controlled by redefining the
809.B YY_INPUT
810macro.
811YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its
812action is to place up to
813.I max_size
814characters in the character array
815.I buf
816and return in the integer variable
817.I result
818either the
819number of characters read or the constant YY_NULL (0 on Unix systems)
820to indicate EOF. The default YY_INPUT reads from the
821global file-pointer "yyin".
822.LP
823A sample redefinition of YY_INPUT (in the definitions
824section of the input file):
825.nf
826
827 %{
828 #undef YY_INPUT
829 #define YY_INPUT(buf,result,max_size) \\
830 result = ((buf[0] = getchar()) == EOF) ? YY_NULL : 1;
831 %}
832
833.fi
834This definition will change the input processing to occur
835one character at a time.
836.LP
837You also can add in things like keeping track of the
838input line number this way; but don't expect your scanner to
839go very fast.
840.LP
841When the scanner receives an end-of-file indication from YY_INPUT,
842it then checks the
843.B yywrap()
844function. If
845.B yywrap()
846returns false (zero), then it is assumed that the
847function has gone ahead and set up
848.I yyin
849to point to another input file, and scanning continues. If it returns
850true (non-zero), then the scanner terminates, returning 0 to its
851caller.
852.LP
853The default
854.B yywrap()
855always returns 1. Presently, to redefine it you must first
856"#undef yywrap", as it is currently implemented as a macro. As indicated
857by the hedging in the previous sentence, it may be changed to
858a true function in the near future.
859.LP
860The scanner writes its
861.B ECHO
862output to the
863.I yyout
864global (default, stdout), which may be redefined by the user simply
865by assigning it to some other
866.B FILE
867pointer.
868.SH START CONDITIONS
869.I flex
870provides a mechanism for conditionally activating rules. Any rule
871whose pattern is prefixed with "<sc>" will only be active when
872the scanner is in the start condition named "sc". For example,
873.nf
874
875 <STRING>[^"]* { /* eat up the string body ... */
876 ...
877 }
878
879.fi
880will be active only when the scanner is in the "STRING" start
881condition, and
882.nf
883
884 <INITIAL,STRING,QUOTE>\\. { /* handle an escape ... */
885 ...
886 }
887
888.fi
889will be active only when the current start condition is
890either "INITIAL", "STRING", or "QUOTE".
891.LP
892Start conditions
893are declared in the definitions (first) section of the input
894using unindented lines beginning with either
895.B %s
896or
897.B %x
898followed by a list of names.
899The former declares
900.I inclusive
901start conditions, the latter
902.I exclusive
903start conditions. A start condition is activated using the
904.B BEGIN
905action. Until the next
906.B BEGIN
907action is executed, rules with the given start
908condition will be active and
909rules with other start conditions will be inactive.
910If the start condition is
911.I inclusive,
912then rules with no start conditions at all will also be active.
913If it is
914.I exclusive,
915then
916.I only
917rules qualified with the start condition will be active.
918A set of rules contingent on the same exclusive start condition
919describe a scanner which is independent of any of the other rules in the
920.I flex
921input. Because of this,
922exclusive start conditions make it easy to specify "mini-scanners"
923which scan portions of the input that are syntactically different
924from the rest (e.g., comments).
925.LP
926If the distinction between inclusive and exclusive start conditions
927is still a little vague, here's a simple example illustrating the
928connection between the two. The set of rules:
929.nf
930
931 %s example
932 %%
933 <example>foo /* do something */
934
935.fi
936is equivalent to
937.nf
938
939 %x example
940 %%
941 <INITIAL,example>foo /* do something */
942
943.fi
944.LP
945The default rule (to
946.B ECHO
947any unmatched character) remains active in start conditions.
948.LP
949.B BEGIN(0)
950returns to the original state where only the rules with
951no start conditions are active. This state can also be
952referred to as the start-condition "INITIAL", so
953.B BEGIN(INITIAL)
954is equivalent to
955.B BEGIN(0).
956(The parentheses around the start condition name are not required but
957are considered good style.)
958.LP
959.B BEGIN
960actions can also be given as indented code at the beginning
961of the rules section. For example, the following will cause
962the scanner to enter the "SPECIAL" start condition whenever
963.I yylex()
964is called and the global variable
965.I enter_special
966is true:
967.nf
968
969 int enter_special;
970
971 %x SPECIAL
972 %%
973 if ( enter_special )
974 BEGIN(SPECIAL);
975
976 <SPECIAL>blahblahblah
977 ...more rules follow...
978
979.fi
980.LP
981To illustrate the uses of start conditions,
982here is a scanner which provides two different interpretations
983of a string like "123.456". By default it will treat it as
984as three tokens, the integer "123", a dot ('.'), and the integer "456".
985But if the string is preceded earlier in the line by the string
986"expect-floats"
987it will treat it as a single token, the floating-point number
988123.456:
989.nf
990
991 %{
992 #include <math.h>
993 %}
994 %s expect
995
996 %%
997 expect-floats BEGIN(expect);
998
999 <expect>[0-9]+"."[0-9]+ {
1000 printf( "found a float, = %f\\n",
1001 atof( yytext ) );
1002 }
1003 <expect>\\n {
1004 /* that's the end of the line, so
1005 * we need another "expect-number"
1006 * before we'll recognize any more
1007 * numbers
1008 */
1009 BEGIN(INITIAL);
1010 }
1011
1012 [0-9]+ {
1013 printf( "found an integer, = %d\\n",
1014 atoi( yytext ) );
1015 }
1016
1017 "." printf( "found a dot\\n" );
1018
1019.fi
1020Here is a scanner which recognizes (and discards) C comments while
1021maintaining a count of the current input line.
1022.nf
1023
1024 %x comment
1025 %%
1026 int line_num = 1;
1027
1028 "/*" BEGIN(comment);
1029
1030 <comment>[^*\\n]* /* eat anything that's not a '*' */
1031 <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */
1032 <comment>\\n ++line_num;
1033 <comment>"*"+"/" BEGIN(INITIAL);
1034
1035.fi
1036Note that start-conditions names are really integer values and
1037can be stored as such. Thus, the above could be extended in the
1038following fashion:
1039.nf
1040
1041 %x comment foo
1042 %%
1043 int line_num = 1;
1044 int comment_caller;
1045
1046 "/*" {
1047 comment_caller = INITIAL;
1048 BEGIN(comment);
1049 }
1050
1051 ...
1052
1053 <foo>"/*" {
1054 comment_caller = foo;
1055 BEGIN(comment);
1056 }
1057
1058 <comment>[^*\\n]* /* eat anything that's not a '*' */
1059 <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */
1060 <comment>\\n ++line_num;
1061 <comment>"*"+"/" BEGIN(comment_caller);
1062
1063.fi
1064One can then implement a "stack" of start conditions using an
1065array of integers. (It is likely that such stacks will become
1066a full-fledged
1067.I flex
1068feature in the future.) Note, though, that
1069start conditions do not have their own name-space; %s's and %x's
1070declare names in the same fashion as #define's.
1071.SH MULTIPLE INPUT BUFFERS
1072Some scanners (such as those which support "include" files)
1073require reading from several input streams. As
1074.I flex
1075scanners do a large amount of buffering, one cannot control
1076where the next input will be read from by simply writing a
1077.B YY_INPUT
1078which is sensitive to the scanning context.
1079.B YY_INPUT
1080is only called when the scanner reaches the end of its buffer, which
1081may be a long time after scanning a statement such as an "include"
1082which requires switching the input source.
1083.LP
1084To negotiate these sorts of problems,
1085.I flex
1086provides a mechanism for creating and switching between multiple
1087input buffers. An input buffer is created by using:
1088.nf
1089
1090 YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )
1091
1092.fi
1093which takes a
1094.I FILE
1095pointer and a size and creates a buffer associated with the given
1096file and large enough to hold
1097.I size
1098characters (when in doubt, use
1099.B YY_BUF_SIZE
1100for the size). It returns a
1101.B YY_BUFFER_STATE
1102handle, which may then be passed to other routines:
1103.nf
1104
1105 void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer )
1106
1107.fi
1108switches the scanner's input buffer so subsequent tokens will
1109come from
1110.I new_buffer.
1111Note that
1112.B yy_switch_to_buffer()
1113may be used by yywrap() to sets things up for continued scanning, instead
1114of opening a new file and pointing
1115.I yyin
1116at it.
1117.nf
1118
1119 void yy_delete_buffer( YY_BUFFER_STATE buffer )
1120
1121.fi
1122is used to reclaim the storage associated with a buffer.
1123.LP
1124.B yy_new_buffer()
1125is an alias for
1126.B yy_create_buffer(),
1127provided for compatibility with the C++ use of
1128.I new
1129and
1130.I delete
1131for creating and destroying dynamic objects.
1132.LP
1133Finally, the
1134.B YY_CURRENT_BUFFER
1135macro returns a
1136.B YY_BUFFER_STATE
1137handle to the current buffer.
1138.LP
1139Here is an example of using these features for writing a scanner
1140which expands include files (the
1141.B <<EOF>>
1142feature is discussed below):
1143.nf
1144
1145 /* the "incl" state is used for picking up the name
1146 * of an include file
1147 */
1148 %x incl
1149
1150 %{
1151 #define MAX_INCLUDE_DEPTH 10
1152 YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
1153 int include_stack_ptr = 0;
1154 %}
1155
1156 %%
1157 include BEGIN(incl);
1158
1159 [a-z]+ ECHO;
1160 [^a-z\\n]*\\n? ECHO;
1161
1162 <incl>[ \\t]* /* eat the whitespace */
1163 <incl>[^ \\t\\n]+ { /* got the include file name */
1164 if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
1165 {
1166 fprintf( stderr, "Includes nested too deeply" );
1167 exit( 1 );
1168 }
1169
1170 include_stack[include_stack_ptr++] =
1171 YY_CURRENT_BUFFER;
1172
1173 yyin = fopen( yytext, "r" );
1174
1175 if ( ! yyin )
1176 error( ... );
1177
1178 yy_switch_to_buffer(
1179 yy_create_buffer( yyin, YY_BUF_SIZE ) );
1180
1181 BEGIN(INITIAL);
1182 }
1183
1184 <<EOF>> {
1185 if ( --include_stack_ptr < 0 )
1186 {
1187 yyterminate();
1188 }
1189
1190 else
1191 yy_switch_to_buffer(
1192 include_stack[include_stack_ptr] );
1193 }
1194
1195.fi
1196.SH END-OF-FILE RULES
1197The special rule "<<EOF>>" indicates
1198actions which are to be taken when an end-of-file is
1199encountered and yywrap() returns non-zero (i.e., indicates
1200no further files to process). The action must finish
1201by doing one of four things:
1202.IP -
1203the special
1204.B YY_NEW_FILE
1205action, if
1206.I yyin
1207has been pointed at a new file to process;
1208.IP -
1209a
1210.I return
1211statement;
1212.IP -
1213the special
1214.B yyterminate()
1215action;
1216.IP -
1217or, switching to a new buffer using
1218.B yy_switch_to_buffer()
1219as shown in the example above.
1220.LP
1221<<EOF>> rules may not be used with other
1222patterns; they may only be qualified with a list of start
1223conditions. If an unqualified <<EOF>> rule is given, it
1224applies to
1225.I all
1226start conditions which do not already have <<EOF>> actions. To
1227specify an <<EOF>> rule for only the initial start condition, use
1228.nf
1229
1230 <INITIAL><<EOF>>
1231
1232.fi
1233.LP
1234These rules are useful for catching things like unclosed comments.
1235An example:
1236.nf
1237
1238 %x quote
1239 %%
1240
1241 ...other rules for dealing with quotes...
1242
1243 <quote><<EOF>> {
1244 error( "unterminated quote" );
1245 yyterminate();
1246 }
1247 <<EOF>> {
1248 if ( *++filelist )
1249 {
1250 yyin = fopen( *filelist, "r" );
1251 YY_NEW_FILE;
1252 }
1253 else
1254 yyterminate();
1255 }
1256
1257.fi
1258.SH MISCELLANEOUS MACROS
1259The macro
1260.bd
1261YY_USER_ACTION
1262can be redefined to provide an action
1263which is always executed prior to the matched rule's action. For example,
1264it could be #define'd to call a routine to convert yytext to lower-case.
1265.LP
1266The macro
1267.B YY_USER_INIT
1268may be redefined to provide an action which is always executed before
1269the first scan (and before the scanner's internal initializations are done).
1270For example, it could be used to call a routine to read
1271in a data table or open a logging file.
1272.LP
1273In the generated scanner, the actions are all gathered in one large
1274switch statement and separated using
1275.B YY_BREAK,
1276which may be redefined. By default, it is simply a "break", to separate
1277each rule's action from the following rule's.
1278Redefining
1279.B YY_BREAK
1280allows, for example, C++ users to
1281#define YY_BREAK to do nothing (while being very careful that every
1282rule ends with a "break" or a "return"!) to avoid suffering from
1283unreachable statement warnings where because a rule's action ends with
1284"return", the
1285.B YY_BREAK
1286is inaccessible.
1287.SH INTERFACING WITH YACC
1288One of the main uses of
1289.I flex
1290is as a companion to the
1291.I yacc
1292parser-generator.
1293.I yacc
1294parsers expect to call a routine named
1295.B yylex()
1296to find the next input token. The routine is supposed to
1297return the type of the next token as well as putting any associated
1298value in the global
1299.B yylval.
1300To use
1301.I flex
1302with
1303.I yacc,
1304one specifies the
1305.B -d
1306option to
1307.I yacc
1308to instruct it to generate the file
1309.B y.tab.h
1310containing definitions of all the
1311.B %tokens
1312appearing in the
1313.I yacc
1314input. This file is then included in the
1315.I flex
1316scanner. For example, if one of the tokens is "TOK_NUMBER",
1317part of the scanner might look like:
1318.nf
1319
1320 %{
1321 #include "y.tab.h"
1322 %}
1323
1324 %%
1325
1326 [0-9]+ yylval = atoi( yytext ); return TOK_NUMBER;
1327
1328.fi
1329.SH TRANSLATION TABLE
1330In the name of POSIX compliance,
1331.I flex
1332supports a
1333.I translation table
1334for mapping input characters into groups.
1335The table is specified in the first section, and its format looks like:
1336.nf
1337
1338 %t
1339 1 abcd
1340 2 ABCDEFGHIJKLMNOPQRSTUVWXYZ
1341 52 0123456789
1342 6 \\t\\ \\n
1343 %t
1344
1345.fi
1346This example specifies that the characters 'a', 'b', 'c', and 'd'
1347are to all be lumped into group #1, upper-case letters
1348in group #2, digits in group #52, tabs, blanks, and newlines into
1349group #6, and
1350.I
1351no other characters will appear in the patterns.
1352The group numbers are actually disregarded by
1353.I flex;
1354.B %t
1355serves, though, to lump characters together. Given the above
1356table, for example, the pattern "a(AA)*5" is equivalent to "d(ZQ)*0".
1357They both say, "match any character in group #1, followed by
1358zero-or-more pairs of characters
1359from group #2, followed by a character from group #52." Thus
1360.B %t
1361provides a crude way for introducing equivalence classes into
1362the scanner specification.
1363.LP
1364Note that the
1365.B -i
1366option (see below) coupled with the equivalence classes which
1367.I flex
1368automatically generates take care of virtually all the instances
1369when one might consider using
1370.B %t.
1371But what the hell, it's there if you want it.
1372.SH OPTIONS
1373.I flex
1374has the following options:
1375.TP
1376.B -b
1377Generate backtracking information to
1378.I lex.backtrack.
1379This is a list of scanner states which require backtracking
1380and the input characters on which they do so. By adding rules one
1381can remove backtracking states. If all backtracking states
1382are eliminated and
1383.B -f
1384or
1385.B -F
1386is used, the generated scanner will run faster (see the
1387.B -p
1388flag). Only users who wish to squeeze every last cycle out of their
1389scanners need worry about this option. (See the section on PERFORMANCE
1390CONSIDERATIONS below.)
1391.TP
1392.B -c
1393is a do-nothing, deprecated option included for POSIX compliance.
1394.IP
1395.B NOTE:
1396in previous releases of
1397.I flex
1398.B -c
1399specified table-compression options. This functionality is
1400now given by the
1401.B -C
1402flag. To ease the the impact of this change, when
1403.I flex
1404encounters
1405.B -c,
1406it currently issues a warning message and assumes that
1407.B -C
1408was desired instead. In the future this "promotion" of
1409.B -c
1410to
1411.B -C
1412will go away in the name of full POSIX compliance (unless
1413the POSIX meaning is removed first).
1414.TP
1415.B -d
1416makes the generated scanner run in
1417.I debug
1418mode. Whenever a pattern is recognized and the global
1419.B yy_flex_debug
1420is non-zero (which is the default),
1421the scanner will write to
1422.I stderr
1423a line of the form:
1424.nf
1425
1426 --accepting rule at line 53 ("the matched text")
1427
1428.fi
1429The line number refers to the location of the rule in the file
1430defining the scanner (i.e., the file that was fed to flex). Messages
1431are also generated when the scanner backtracks, accepts the
1432default rule, reaches the end of its input buffer (or encounters
1433a NUL; at this point, the two look the same as far as the scanner's concerned),
1434or reaches an end-of-file.
1435.TP
1436.B -f
1437specifies (take your pick)
1438.I full table
1439or
1440.I fast scanner.
1441No table compression is done. The result is large but fast.
1442This option is equivalent to
1443.B -Cf
1444(see below).
1445.TP
1446.B -i
1447instructs
1448.I flex
1449to generate a
1450.I case-insensitive
1451scanner. The case of letters given in the
1452.I flex
1453input patterns will
1454be ignored, and tokens in the input will be matched regardless of case. The
1455matched text given in
1456.I yytext
1457will have the preserved case (i.e., it will not be folded).
1458.TP
1459.B -n
1460is another do-nothing, deprecated option included only for
1461POSIX compliance.
1462.TP
1463.B -p
1464generates a performance report to stderr. The report
1465consists of comments regarding features of the
1466.I flex
1467input file which will cause a loss of performance in the resulting scanner.
1468Note that the use of
1469.I REJECT
1470and variable trailing context (see the BUGS section in flex(1))
1471entails a substantial performance penalty; use of
1472.I yymore(),
1473the
1474.B ^
1475operator,
1476and the
1477.B -I
1478flag entail minor performance penalties.
1479.TP
1480.B -s
1481causes the
1482.I default rule
1483(that unmatched scanner input is echoed to
1484.I stdout)
1485to be suppressed. If the scanner encounters input that does not
1486match any of its rules, it aborts with an error. This option is
1487useful for finding holes in a scanner's rule set.
1488.TP
1489.B -t
1490instructs
1491.I flex
1492to write the scanner it generates to standard output instead
1493of
1494.B lex.yy.c.
1495.TP
1496.B -v
1497specifies that
1498.I flex
1499should write to
1500.I stderr
1501a summary of statistics regarding the scanner it generates.
1502Most of the statistics are meaningless to the casual
1503.I flex
1504user, but the
1505first line identifies the version of
1506.I flex,
1507which is useful for figuring
1508out where you stand with respect to patches and new releases,
1509and the next two lines give the date when the scanner was created
1510and a summary of the flags which were in effect.
1511.TP
1512.B -F
1513specifies that the
1514.ul
1515fast
1516scanner table representation should be used. This representation is
1517about as fast as the full table representation
1518.ul
1519(-f),
1520and for some sets of patterns will be considerably smaller (and for
1521others, larger). In general, if the pattern set contains both "keywords"
1522and a catch-all, "identifier" rule, such as in the set:
1523.nf
1524
1525 "case" return TOK_CASE;
1526 "switch" return TOK_SWITCH;
1527 ...
1528 "default" return TOK_DEFAULT;
1529 [a-z]+ return TOK_ID;
1530
1531.fi
1532then you're better off using the full table representation. If only
1533the "identifier" rule is present and you then use a hash table or some such
1534to detect the keywords, you're better off using
1535.ul
1536-F.
1537.IP
1538This option is equivalent to
1539.B -CF
1540(see below).
1541.TP
1542.B -I
1543instructs
1544.I flex
1545to generate an
1546.I interactive
1547scanner. Normally, scanners generated by
1548.I flex
1549always look ahead one
1550character before deciding that a rule has been matched. At the cost of
1551some scanning overhead,
1552.I flex
1553will generate a scanner which only looks ahead
1554when needed. Such scanners are called
1555.I interactive
1556because if you want to write a scanner for an interactive system such as a
1557command shell, you will probably want the user's input to be terminated
1558with a newline, and without
1559.B -I
1560the user will have to type a character in addition to the newline in order
1561to have the newline recognized. This leads to dreadful interactive
1562performance.
1563.IP
1564If all this seems to confusing, here's the general rule: if a human will
1565be typing in input to your scanner, use
1566.B -I,
1567otherwise don't; if you don't care about squeezing the utmost performance
1568from your scanner and you
1569don't want to make any assumptions about the input to your scanner,
1570use
1571.B -I.
1572.IP
1573Note,
1574.B -I
1575cannot be used in conjunction with
1576.I full
1577or
1578.I fast tables,
1579i.e., the
1580.B -f, -F, -Cf,
1581or
1582.B -CF
1583flags.
1584.TP
1585.B -L
1586instructs
1587.I flex
1588not to generate
1589.B #line
1590directives. Without this option,
1591.I flex
1592peppers the generated scanner
1593with #line directives so error messages in the actions will be correctly
1594located with respect to the original
1595.I flex
1596input file, and not to
1597the fairly meaningless line numbers of
1598.B lex.yy.c.
1599(Unfortunately
1600.I flex
1601does not presently generate the necessary directives
1602to "retarget" the line numbers for those parts of
1603.B lex.yy.c
1604which it generated. So if there is an error in the generated code,
1605a meaningless line number is reported.)
1606.TP
1607.B -T
1608makes
1609.I flex
1610run in
1611.I trace
1612mode. It will generate a lot of messages to
1613.I stdout
1614concerning
1615the form of the input and the resultant non-deterministic and deterministic
1616finite automata. This option is mostly for use in maintaining
1617.I flex.
1618.TP
1619.B -8
1620instructs
1621.I flex
1622to generate an 8-bit scanner, i.e., one which can recognize 8-bit
1623characters. On some sites,
1624.I flex
1625is installed with this option as the default. On others, the default
1626is 7-bit characters. To see which is the case, check the verbose
1627.B (-v)
1628output for "equivalence classes created". If the denominator of
1629the number shown is 128, then by default
1630.I flex
1631is generating 7-bit characters. If it is 256, then the default is
16328-bit characters and the
1633.B -8
1634flag is not required (but may be a good idea to keep the scanner
1635specification portable). Feeding a 7-bit scanner 8-bit characters
1636will result in infinite loops, bus errors, or other such fireworks,
1637so when in doubt, use the flag. Note that if equivalence classes
1638are used, 8-bit scanners take only slightly more table space than
16397-bit scanners (128 bytes, to be exact); if equivalence classes are
1640not used, however, then the tables may grow up to twice their
16417-bit size.
1642.TP
1643.B -C[efmF]
1644controls the degree of table compression.
1645.IP
1646.B -Ce
1647directs
1648.I flex
1649to construct
1650.I equivalence classes,
1651i.e., sets of characters
1652which have identical lexical properties (for example, if the only
1653appearance of digits in the
1654.I flex
1655input is in the character class
1656"[0-9]" then the digits '0', '1', ..., '9' will all be put
1657in the same equivalence class). Equivalence classes usually give
1658dramatic reductions in the final table/object file sizes (typically
1659a factor of 2-5) and are pretty cheap performance-wise (one array
1660look-up per character scanned).
1661.IP
1662.B -Cf
1663specifies that the
1664.I full
1665scanner tables should be generated -
1666.I flex
1667should not compress the
1668tables by taking advantages of similar transition functions for
1669different states.
1670.IP
1671.B -CF
1672specifies that the alternate fast scanner representation (described
1673above under the
1674.B -F
1675flag)
1676should be used.
1677.IP
1678.B -Cm
1679directs
1680.I flex
1681to construct
1682.I meta-equivalence classes,
1683which are sets of equivalence classes (or characters, if equivalence
1684classes are not being used) that are commonly used together. Meta-equivalence
1685classes are often a big win when using compressed tables, but they
1686have a moderate performance impact (one or two "if" tests and one
1687array look-up per character scanned).
1688.IP
1689A lone
1690.B -C
1691specifies that the scanner tables should be compressed but neither
1692equivalence classes nor meta-equivalence classes should be used.
1693.IP
1694The options
1695.B -Cf
1696or
1697.B -CF
1698and
1699.B -Cm
1700do not make sense together - there is no opportunity for meta-equivalence
1701classes if the table is not being compressed. Otherwise the options
1702may be freely mixed.
1703.IP
1704The default setting is
1705.B -Cem,
1706which specifies that
1707.I flex
1708should generate equivalence classes
1709and meta-equivalence classes. This setting provides the highest
1710degree of table compression. You can trade off
1711faster-executing scanners at the cost of larger tables with
1712the following generally being true:
1713.nf
1714
1715 slowest & smallest
1716 -Cem
1717 -Cm
1718 -Ce
1719 -C
1720 -C{f,F}e
1721 -C{f,F}
1722 fastest & largest
1723
1724.fi
1725Note that scanners with the smallest tables are usually generated and
1726compiled the quickest, so
1727during development you will usually want to use the default, maximal
1728compression.
1729.IP
1730.B -Cfe
1731is often a good compromise between speed and size for production
1732scanners.
1733.IP
1734.B -C
1735options are not cumulative; whenever the flag is encountered, the
1736previous -C settings are forgotten.
1737.TP
1738.B -Sskeleton_file
1739overrides the default skeleton file from which
1740.I flex
1741constructs its scanners. You'll never need this option unless you are doing
1742.I flex
1743maintenance or development.
1744.SH PERFORMANCE CONSIDERATIONS
1745The main design goal of
1746.I flex
1747is that it generate high-performance scanners. It has been optimized
1748for dealing well with large sets of rules. Aside from the effects
1749of table compression on scanner speed outlined above,
1750there are a number of options/actions which degrade performance. These
1751are, from most expensive to least:
1752.nf
1753
1754 REJECT
1755
1756 pattern sets that require backtracking
1757 arbitrary trailing context
1758
1759 '^' beginning-of-line operator
1760 yymore()
1761
1762.fi
1763with the first three all being quite expensive and the last two
1764being quite cheap.
1765.LP
1766.B REJECT
1767should be avoided at all costs when performance is important.
1768It is a particularly expensive option.
1769.LP
1770Getting rid of backtracking is messy and often may be an enormous
1771amount of work for a complicated scanner. In principal, one begins
1772by using the
1773.B -b
1774flag to generate a
1775.I lex.backtrack
1776file. For example, on the input
1777.nf
1778
1779 %%
1780 foo return TOK_KEYWORD;
1781 foobar return TOK_KEYWORD;
1782
1783.fi
1784the file looks like:
1785.nf
1786
1787 State #6 is non-accepting -
1788 associated rule line numbers:
1789 2 3
1790 out-transitions: [ o ]
1791 jam-transitions: EOF [ \\001-n p-\\177 ]
1792
1793 State #8 is non-accepting -
1794 associated rule line numbers:
1795 3
1796 out-transitions: [ a ]
1797 jam-transitions: EOF [ \\001-` b-\\177 ]
1798
1799 State #9 is non-accepting -
1800 associated rule line numbers:
1801 3
1802 out-transitions: [ r ]
1803 jam-transitions: EOF [ \\001-q s-\\177 ]
1804
1805 Compressed tables always backtrack.
1806
1807.fi
1808The first few lines tell us that there's a scanner state in
1809which it can make a transition on an 'o' but not on any other
1810character, and that in that state the currently scanned text does not match
1811any rule. The state occurs when trying to match the rules found
1812at lines 2 and 3 in the input file.
1813If the scanner is in that state and then reads
1814something other than an 'o', it will have to backtrack to find
1815a rule which is matched. With
1816a bit of headscratching one can see that this must be the
1817state it's in when it has seen "fo". When this has happened,
1818if anything other than another 'o' is seen, the scanner will
1819have to back up to simply match the 'f' (by the default rule).
1820.LP
1821The comment regarding State #8 indicates there's a problem
1822when "foob" has been scanned. Indeed, on any character other
1823than a 'b', the scanner will have to back up to accept "foo".
1824Similarly, the comment for State #9 concerns when "fooba" has
1825been scanned.
1826.LP
1827The final comment reminds us that there's no point going to
1828all the trouble of removing backtracking from the rules unless
1829we're using
1830.B -f
1831or
1832.B -F,
1833since there's no performance gain doing so with compressed scanners.
1834.LP
1835The way to remove the backtracking is to add "error" rules:
1836.nf
1837
1838 %%
1839 foo return TOK_KEYWORD;
1840 foobar return TOK_KEYWORD;
1841
1842 fooba |
1843 foob |
1844 fo {
1845 /* false alarm, not really a keyword */
1846 return TOK_ID;
1847 }
1848
1849.fi
1850.LP
1851Eliminating backtracking among a list of keywords can also be
1852done using a "catch-all" rule:
1853.nf
1854
1855 %%
1856 foo return TOK_KEYWORD;
1857 foobar return TOK_KEYWORD;
1858
1859 [a-z]+ return TOK_ID;
1860
1861.fi
1862This is usually the best solution when appropriate.
1863.LP
1864Backtracking messages tend to cascade.
1865With a complicated set of rules it's not uncommon to get hundreds
1866of messages. If one can decipher them, though, it often
1867only takes a dozen or so rules to eliminate the backtracking (though
1868it's easy to make a mistake and have an error rule accidentally match
1869a valid token. A possible future
1870.I flex
1871feature will be to automatically add rules to eliminate backtracking).
1872.LP
1873.I Variable
1874trailing context (where both the leading and trailing parts do not have
1875a fixed length) entails almost the same performance loss as
1876.I REJECT
1877(i.e., substantial). So when possible a rule like:
1878.nf
1879
1880 %%
1881 mouse|rat/(cat|dog) run();
1882
1883.fi
1884is better written:
1885.nf
1886
1887 %%
1888 mouse/cat|dog run();
1889 rat/cat|dog run();
1890
1891.fi
1892or as
1893.nf
1894
1895 %%
1896 mouse|rat/cat run();
1897 mouse|rat/dog run();
1898
1899.fi
1900Note that here the special '|' action does
1901.I not
1902provide any savings, and can even make things worse (see
1903.B BUGS
1904in flex(1)).
1905.LP
1906Another area where the user can increase a scanner's performance
1907(and one that's easier to implement) arises from the fact that
1908the longer the tokens matched, the faster the scanner will run.
1909This is because with long tokens the processing of most input
1910characters takes place in the (short) inner scanning loop, and
1911does not often have to go through the additional work of setting up
1912the scanning environment (e.g.,
1913.B yytext)
1914for the action. Recall the scanner for C comments:
1915.nf
1916
1917 %x comment
1918 %%
1919 int line_num = 1;
1920
1921 "/*" BEGIN(comment);
1922
1923 <comment>[^*\\n]*
1924 <comment>"*"+[^*/\\n]*
1925 <comment>\\n ++line_num;
1926 <comment>"*"+"/" BEGIN(INITIAL);
1927
1928.fi
1929This could be sped up by writing it as:
1930.nf
1931
1932 %x comment
1933 %%
1934 int line_num = 1;
1935
1936 "/*" BEGIN(comment);
1937
1938 <comment>[^*\\n]*
1939 <comment>[^*\\n]*\\n ++line_num;
1940 <comment>"*"+[^*/\\n]*
1941 <comment>"*"+[^*/\\n]*\\n ++line_num;
1942 <comment>"*"+"/" BEGIN(INITIAL);
1943
1944.fi
1945Now instead of each newline requiring the processing of another
1946action, recognizing the newlines is "distributed" over the other rules
1947to keep the matched text as long as possible. Note that
1948.I adding
1949rules does
1950.I not
1951slow down the scanner! The speed of the scanner is independent
1952of the number of rules or (modulo the considerations given at the
1953beginning of this section) how complicated the rules are with
1954regard to operators such as '*' and '|'.
1955.LP
1956A final example in speeding up a scanner: suppose you want to scan
1957through a file containing identifiers and keywords, one per line
1958and with no other extraneous characters, and recognize all the
1959keywords. A natural first approach is:
1960.nf
1961
1962 %%
1963 asm |
1964 auto |
1965 break |
1966 ... etc ...
1967 volatile |
1968 while /* it's a keyword */
1969
1970 .|\\n /* it's not a keyword */
1971
1972.fi
1973To eliminate the back-tracking, introduce a catch-all rule:
1974.nf
1975
1976 %%
1977 asm |
1978 auto |
1979 break |
1980 ... etc ...
1981 volatile |
1982 while /* it's a keyword */
1983
1984 [a-z]+ |
1985 .|\\n /* it's not a keyword */
1986
1987.fi
1988Now, if it's guaranteed that there's exactly one word per line,
1989then we can reduce the total number of matches by a half by
1990merging in the recognition of newlines with that of the other
1991tokens:
1992.nf
1993
1994 %%
1995 asm\\n |
1996 auto\\n |
1997 break\\n |
1998 ... etc ...
1999 volatile\\n |
2000 while\\n /* it's a keyword */
2001
2002 [a-z]+\\n |
2003 .|\\n /* it's not a keyword */
2004
2005.fi
2006One has to be careful here, as we have now reintroduced backtracking
2007into the scanner. In particular, while
2008.I we
2009know that there will never be any characters in the input stream
2010other than letters or newlines,
2011.I flex
2012can't figure this out, and it will plan for possibly needing backtracking
2013when it has scanned a token like "auto" and then the next character
2014is something other than a newline or a letter. Previously it would
2015then just match the "auto" rule and be done, but now it has no "auto"
2016rule, only a "auto\\n" rule. To eliminate the possibility of backtracking,
2017we could either duplicate all rules but without final newlines, or,
2018since we never expect to encounter such an input and therefore don't
2019how it's classified, we can introduce one more catch-all rule, this
2020one which doesn't include a newline:
2021.nf
2022
2023 %%
2024 asm\\n |
2025 auto\\n |
2026 break\\n |
2027 ... etc ...
2028 volatile\\n |
2029 while\\n /* it's a keyword */
2030
2031 [a-z]+\\n |
2032 [a-z]+ |
2033 .|\\n /* it's not a keyword */
2034
2035.fi
2036Compiled with
2037.B -Cf,
2038this is about as fast as one can get a
2039.I flex
2040scanner to go for this particular problem.
2041.LP
2042A final note:
2043.I flex
2044is slow when matching NUL's, particularly when a token contains
2045multiple NUL's.
2046It's best to write rules which match
2047.I short
2048amounts of text if it's anticipated that the text will often include NUL's.
2049.SH INCOMPATIBILITIES WITH LEX AND POSIX
2050.I flex
2051is a rewrite of the Unix
2052.I lex
2053tool (the two implementations do not share any code, though),
2054with some extensions and incompatibilities, both of which
2055are of concern to those who wish to write scanners acceptable
2056to either implementation. At present, the POSIX
2057.I lex
2058draft is
2059very close to the original
2060.I lex
2061implementation, so some of these
2062incompatibilities are also in conflict with the POSIX draft. But
2063the intent is that except as noted below,
2064.I flex
2065as it presently stands will
2066ultimately be POSIX conformant (i.e., that those areas of conflict with
2067the POSIX draft will be resolved in
2068.I flex's
2069favor). Please bear in
2070mind that all the comments which follow are with regard to the POSIX
2071.I draft
2072standard of Summer 1989, and not the final document (or subsequent
2073drafts); they are included so
2074.I flex
2075users can be aware of the standardization issues and those areas where
2076.I flex
2077may in the near future undergo changes incompatible with
2078its current definition.
2079.LP
2080.I flex
2081is fully compatible with
2082.I lex
2083with the following exceptions:
2084.IP -
2085.I lex
2086does not support exclusive start conditions (%x), though they
2087are in the current POSIX draft.
2088.IP -
2089When definitions are expanded,
2090.I flex
2091encloses them in parentheses.
2092With lex, the following:
2093.nf
2094
2095 NAME [A-Z][A-Z0-9]*
2096 %%
2097 foo{NAME}? printf( "Found it\\n" );
2098 %%
2099
2100.fi
2101will not match the string "foo" because when the macro
2102is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?"
2103and the precedence is such that the '?' is associated with
2104"[A-Z0-9]*". With
2105.I flex,
2106the rule will be expanded to
2107"foo([A-Z][A-Z0-9]*)?" and so the string "foo" will match.
2108Note that because of this, the
2109.B ^, $, <s>, /,
2110and
2111.B <<EOF>>
2112operators cannot be used in a
2113.I flex
2114definition.
2115.IP
2116The POSIX draft interpretation is the same as
2117.I flex's.
2118.IP -
2119To specify a character class which matches anything but a left bracket (']'),
2120in
2121.I lex
2122one can use "[^]]" but with
2123.I flex
2124one must use "[^\\]]". The latter works with
2125.I lex,
2126too.
2127.IP -
2128The undocumented
2129.I lex
2130scanner internal variable
2131.B yylineno
2132is not supported. (The variable is not part of the POSIX draft.)
2133.IP -
2134The
2135.B input()
2136routine is not redefinable, though it may be called to read characters
2137following whatever has been matched by a rule. If
2138.B input()
2139encounters an end-of-file the normal
2140.B yywrap()
2141processing is done. A ``real'' end-of-file is returned by
2142.B input()
2143as
2144.I EOF.
2145.IP
2146Input is instead controlled by redefining the
2147.B YY_INPUT
2148macro.
2149.IP
2150The
2151.I flex
2152restriction that
2153.B input()
2154cannot be redefined is in accordance with the POSIX draft, but
2155.B YY_INPUT
2156has not yet been accepted into the draft.
2157.IP -
2158.B output()
2159is not supported.
2160Output from the
2161.B ECHO
2162macro is done to the file-pointer
2163.I yyout
2164(default
2165.I stdout).
2166.IP
2167The POSIX draft mentions that an
2168.B output()
2169routine exists but currently gives no details as to what it does.
2170.IP -
2171The
2172.I lex
2173.B %r
2174(generate a Ratfor scanner) option is not supported. It is not part
2175of the POSIX draft.
2176.IP -
2177If you are providing your own yywrap() routine, you must include a
2178"#undef yywrap" in the definitions section (section 1). Note that
2179the "#undef" will have to be enclosed in %{}'s.
2180.IP
2181The POSIX draft
2182specifies that yywrap() is a function and this is unlikely to change; so
2183.I flex users are warned
2184that
2185.B yywrap()
2186is likely to be changed to a function in the near future.
2187.IP -
2188After a call to
2189.B unput(),
2190.I yytext
2191and
2192.I yyleng
2193are undefined until the next token is matched. This is not the case with
2194.I lex
2195or the present POSIX draft.
2196.IP -
2197The precedence of the
2198.B {}
2199(numeric range) operator is different.
2200.I lex
2201interprets "abc{1,3}" as "match one, two, or
2202three occurrences of 'abc'", whereas
2203.I flex
2204interprets it as "match 'ab'
2205followed by one, two, or three occurrences of 'c'". The latter is
2206in agreement with the current POSIX draft.
2207.IP -
2208The precedence of the
2209.B ^
2210operator is different.
2211.I lex
2212interprets "^foo|bar" as "match either 'foo' at the beginning of a line,
2213or 'bar' anywhere", whereas
2214.I flex
2215interprets it as "match either 'foo' or 'bar' if they come at the beginning
2216of a line". The latter is in agreement with the current POSIX draft.
2217.IP -
2218To refer to yytext outside of the scanner source file,
2219the correct definition with
2220.I flex
2221is "extern char *yytext" rather than "extern char yytext[]".
2222This is contrary to the current POSIX draft but a point on which
2223.I flex
2224will not be changing, as the array representation entails a
2225serious performance penalty. It is hoped that the POSIX draft will
2226be emended to support the
2227.I flex
2228variety of declaration (as this is a fairly painless change to
2229require of
2230.I lex
2231users).
2232.IP -
2233.I yyin
2234is
2235.I initialized
2236by
2237.I lex
2238to be
2239.I stdin;
2240.I flex,
2241on the other hand,
2242initializes
2243.I yyin
2244to NULL
2245and then
2246.I assigns
2247it to
2248.I stdin
2249the first time the scanner is called, providing
2250.I yyin
2251has not already been assigned to a non-NULL value. The difference is
2252subtle, but the net effect is that with
2253.I flex
2254scanners,
2255.I yyin
2256does not have a valid value until the scanner has been called.
2257.IP -
2258The special table-size declarations such as
2259.B %a
2260supported by
2261.I lex
2262are not required by
2263.I flex
2264scanners;
2265.I flex
2266ignores them.
2267.IP -
2268The name
2269.bd
2270FLEX_SCANNER
2271is #define'd so scanners may be written for use with either
2272.I flex
2273or
2274.I lex.
2275.LP
2276The following
2277.I flex
2278features are not included in
2279.I lex
2280or the POSIX draft standard:
2281.nf
2282
2283 yyterminate()
2284 <<EOF>>
2285 YY_DECL
2286 #line directives
2287 %{}'s around actions
2288 yyrestart()
2289 comments beginning with '#' (deprecated)
2290 multiple actions on a line
2291
2292.fi
2293This last feature refers to the fact that with
2294.I flex
2295you can put multiple actions on the same line, separated with
2296semi-colons, while with
2297.I lex,
2298the following
2299.nf
2300
2301 foo handle_foo(); ++num_foos_seen;
2302
2303.fi
2304is (rather surprisingly) truncated to
2305.nf
2306
2307 foo handle_foo();
2308
2309.fi
2310.I flex
2311does not truncate the action. Actions that are not enclosed in
2312braces are simply terminated at the end of the line.
2313.SH DIAGNOSTICS
2314.I reject_used_but_not_detected undefined
2315or
2316.I yymore_used_but_not_detected undefined -
2317These errors can occur at compile time. They indicate that the
2318scanner uses
2319.B REJECT
2320or
2321.B yymore()
2322but that
2323.I flex
2324failed to notice the fact, meaning that
2325.I flex
2326scanned the first two sections looking for occurrences of these actions
2327and failed to find any, but somehow you snuck some in (via a #include
2328file, for example). Make an explicit reference to the action in your
2329.I flex
2330input file. (Note that previously
2331.I flex
2332supported a
2333.B %used/%unused
2334mechanism for dealing with this problem; this feature is still supported
2335but now deprecated, and will go away soon unless the author hears from
2336people who can argue compellingly that they need it.)
2337.LP
2338.I flex scanner jammed -
2339a scanner compiled with
2340.B -s
2341has encountered an input string which wasn't matched by
2342any of its rules.
2343.LP
2344.I flex input buffer overflowed -
2345a scanner rule matched a string long enough to overflow the
2346scanner's internal input buffer (16K bytes by default - controlled by
2347.B YY_BUF_SIZE
2348in "flex.skel". Note that to redefine this macro, you must first
2349.B #undefine
2350it).
2351.LP
2352.I scanner requires -8 flag -
2353Your scanner specification includes recognizing 8-bit characters and
2354you did not specify the -8 flag (and your site has not installed flex
2355with -8 as the default).
2356.LP
2357.I too many %t classes! -
2358You managed to put every single character into its own %t class.
2359.I flex
2360requires that at least one of the classes share characters.
2361.SH DEFICIENCIES / BUGS
2362See flex(1).
2363.SH "SEE ALSO"
2364.LP
2365flex(1), lex(1), yacc(1), sed(1), awk(1).
2366.LP
2367M. E. Lesk and E. Schmidt,
2368.I LEX - Lexical Analyzer Generator
2369.SH AUTHOR
2370Vern Paxson, with the help of many ideas and much inspiration from
2371Van Jacobson. Original version by Jef Poskanzer. The fast table
2372representation is a partial implementation of a design done by Van
2373Jacobson. The implementation was done by Kevin Gong and Vern Paxson.
2374.LP
2375Thanks to the many
2376.I flex
2377beta-testers, feedbackers, and contributors, especially Casey
2378Leedom, benson@odi.com,
2379Frederic Brehm, Nick Christopher, Jason Coughlin,
2380Scott David Daniels, Leo Eskin,
2381Chris Faylor, Eric Goldman, Eric
2382Hughes, Jeffrey R. Jones, Kevin B. Kenny, Ronald Lamprecht,
2383Greg Lee, Craig Leres, Mohamed el Lozy, Jim Meyering, Marc Nozell, Esmond Pitt,
2384Jef Poskanzer, Jim Roskind,
2385Dave Tallman, Frank Whaley, Ken Yap, and those whose names
2386have slipped my marginal mail-archiving skills but whose contributions
2387are appreciated all the same.
2388.LP
2389Thanks to Keith Bostic, John Gilmore, Craig Leres, Bob
2390Mulcahy, Rich Salz, and Richard Stallman for help with various distribution
2391headaches.
2392.LP
2393Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
2394to Benson Margulies and Fred
2395Burke for C++ support; to Ove Ewerlid for the basics of support for
2396NUL's; and to Eric Hughes for the basics of support for multiple buffers.
2397.LP
2398Work is being done on extending
2399.I flex
2400to generate scanners in which the
2401state machine is directly represented in C code rather than tables.
2402These scanners may well be substantially faster than those generated
2403using -f or -F. If you are working in this area and are interested
2404in comparing notes and seeing whether redundant work can be avoided,
2405contact Ove Ewerlid (ewerlid@mizar.DoCS.UU.SE).
2406.LP
2407This work was primarily done when I was at the Real Time Systems Group
2408at the Lawrence Berkeley Laboratory in Berkeley, CA. Many thanks to all there
2409for the support I received.
2410.LP
2411Send comments to:
2412.nf
2413
2414 Vern Paxson
2415 Computer Science Department
2416 4126 Upson Hall
2417 Cornell University
2418 Ithaca, NY 14853-7501
2419
2420 vern@cs.cornell.edu
2421 decvax!cornell!vern
2422
2423.fi