Commit | Line | Data |
---|---|---|
78ed81a3 | 1 | .TH FLEX 1 "26 May 1990" "Version 2.3" |
15637ed4 RG |
2 | .SH NAME |
3 | flex - fast lexical analyzer generator | |
4 | .SH SYNOPSIS | |
5 | .B flex | |
6 | .B [-bcdfinpstvFILT8 -C[efmF] -Sskeleton] | |
7 | .I [filename ...] | |
8 | .SH DESCRIPTION | |
9 | .I flex | |
10 | is a tool for generating | |
11 | .I scanners: | |
12 | programs which recognized lexical patterns in text. | |
13 | .I flex | |
14 | reads | |
15 | the given input files, or its standard input if no file names are given, | |
16 | for a description of a scanner to generate. The description is in | |
17 | the form of pairs | |
18 | of regular expressions and C code, called | |
19 | .I rules. flex | |
20 | generates as output a C source file, | |
21 | .B lex.yy.c, | |
22 | which defines a routine | |
23 | .B yylex(). | |
24 | This file is compiled and linked with the | |
25 | .B -lfl | |
26 | library to produce an executable. When the executable is run, | |
27 | it analyzes its input for occurrences | |
28 | of the regular expressions. Whenever it finds one, it executes | |
29 | the corresponding C code. | |
30 | .SH SOME SIMPLE EXAMPLES | |
31 | .LP | |
32 | First some simple examples to get the flavor of how one uses | |
33 | .I flex. | |
34 | The following | |
35 | .I flex | |
36 | input specifies a scanner which whenever it encounters the string | |
37 | "username" will replace it with the user's login name: | |
38 | .nf | |
39 | ||
40 | %% | |
41 | username printf( "%s", getlogin() ); | |
42 | ||
43 | .fi | |
44 | By default, any text not matched by a | |
45 | .I flex | |
46 | scanner | |
47 | is copied to the output, so the net effect of this scanner is | |
48 | to copy its input file to its output with each occurrence | |
49 | of "username" expanded. | |
50 | In this input, there is just one rule. "username" is the | |
51 | .I pattern | |
52 | and the "printf" is the | |
53 | .I action. | |
54 | The "%%" marks the beginning of the rules. | |
55 | .LP | |
56 | Here's another simple example: | |
57 | .nf | |
58 | ||
59 | int num_lines = 0, num_chars = 0; | |
60 | ||
61 | %% | |
62 | \\n ++num_lines; ++num_chars; | |
63 | . ++num_chars; | |
64 | ||
65 | %% | |
66 | main() | |
67 | { | |
68 | yylex(); | |
69 | printf( "# of lines = %d, # of chars = %d\\n", | |
70 | num_lines, num_chars ); | |
71 | } | |
72 | ||
73 | .fi | |
74 | This scanner counts the number of characters and the number | |
75 | of lines in its input (it produces no output other than the | |
76 | final report on the counts). The first line | |
77 | declares two globals, "num_lines" and "num_chars", which are accessible | |
78 | both inside | |
79 | .B yylex() | |
80 | and in the | |
81 | .B main() | |
82 | routine declared after the second "%%". There are two rules, one | |
83 | which matches a newline ("\\n") and increments both the line count and | |
84 | the character count, and one which matches any character other than | |
85 | a newline (indicated by the "." regular expression). | |
86 | .LP | |
87 | A somewhat more complicated example: | |
88 | .nf | |
89 | ||
90 | /* scanner for a toy Pascal-like language */ | |
91 | ||
92 | %{ | |
93 | /* need this for the call to atof() below */ | |
94 | #include <math.h> | |
95 | %} | |
96 | ||
97 | DIGIT [0-9] | |
98 | ID [a-z][a-z0-9]* | |
99 | ||
100 | %% | |
101 | ||
102 | {DIGIT}+ { | |
103 | printf( "An integer: %s (%d)\\n", yytext, | |
104 | atoi( yytext ) ); | |
105 | } | |
106 | ||
107 | {DIGIT}+"."{DIGIT}* { | |
78ed81a3 | 108 | printf( "A float: %s (%g)\\n", yytext, |
15637ed4 RG |
109 | atof( yytext ) ); |
110 | } | |
111 | ||
112 | if|then|begin|end|procedure|function { | |
113 | printf( "A keyword: %s\\n", yytext ); | |
114 | } | |
115 | ||
116 | {ID} printf( "An identifier: %s\\n", yytext ); | |
117 | ||
118 | "+"|"-"|"*"|"/" printf( "An operator: %s\\n", yytext ); | |
119 | ||
120 | "{"[^}\\n]*"}" /* eat up one-line comments */ | |
121 | ||
122 | [ \\t\\n]+ /* eat up whitespace */ | |
123 | ||
124 | . printf( "Unrecognized character: %s\\n", yytext ); | |
125 | ||
126 | %% | |
127 | ||
128 | main( argc, argv ) | |
129 | int argc; | |
130 | char **argv; | |
131 | { | |
132 | ++argv, --argc; /* skip over program name */ | |
133 | if ( argc > 0 ) | |
134 | yyin = fopen( argv[0], "r" ); | |
135 | else | |
136 | yyin = stdin; | |
137 | ||
138 | yylex(); | |
139 | } | |
140 | ||
141 | .fi | |
142 | This is the beginnings of a simple scanner for a language like | |
143 | Pascal. It identifies different types of | |
144 | .I tokens | |
145 | and reports on what it has seen. | |
146 | .LP | |
147 | The details of this example will be explained in the following | |
148 | sections. | |
149 | .SH FORMAT OF THE INPUT FILE | |
150 | The | |
151 | .I flex | |
152 | input file consists of three sections, separated by a line with just | |
153 | .B %% | |
154 | in it: | |
155 | .nf | |
156 | ||
157 | definitions | |
158 | %% | |
159 | rules | |
160 | %% | |
161 | user code | |
162 | ||
163 | .fi | |
164 | The | |
165 | .I definitions | |
166 | section contains declarations of simple | |
167 | .I name | |
168 | definitions to simplify the scanner specification, and declarations of | |
169 | .I start conditions, | |
170 | which are explained in a later section. | |
171 | .LP | |
172 | Name definitions have the form: | |
173 | .nf | |
174 | ||
175 | name definition | |
176 | ||
177 | .fi | |
178 | The "name" is a word beginning with a letter or an underscore ('_') | |
179 | followed by zero or more letters, digits, '_', or '-' (dash). | |
180 | The definition is taken to begin at the first non-white-space character | |
181 | following the name and continuing to the end of the line. | |
182 | The definition can subsequently be referred to using "{name}", which | |
183 | will expand to "(definition)". For example, | |
184 | .nf | |
185 | ||
186 | DIGIT [0-9] | |
187 | ID [a-z][a-z0-9]* | |
188 | ||
189 | .fi | |
190 | defines "DIGIT" to be a regular expression which matches a | |
191 | single digit, and | |
192 | "ID" to be a regular expression which matches a letter | |
193 | followed by zero-or-more letters-or-digits. | |
194 | A subsequent reference to | |
195 | .nf | |
196 | ||
197 | {DIGIT}+"."{DIGIT}* | |
198 | ||
199 | .fi | |
200 | is identical to | |
201 | .nf | |
202 | ||
203 | ([0-9])+"."([0-9])* | |
204 | ||
205 | .fi | |
206 | and matches one-or-more digits followed by a '.' followed | |
207 | by zero-or-more digits. | |
208 | .LP | |
209 | The | |
210 | .I rules | |
211 | section of the | |
212 | .I flex | |
213 | input contains a series of rules of the form: | |
214 | .nf | |
215 | ||
216 | pattern action | |
217 | ||
218 | .fi | |
219 | where the pattern must be unindented and the action must begin | |
220 | on the same line. | |
221 | .LP | |
222 | See below for a further description of patterns and actions. | |
223 | .LP | |
224 | Finally, the user code section is simply copied to | |
225 | .B lex.yy.c | |
226 | verbatim. | |
227 | It is used for companion routines which call or are called | |
228 | by the scanner. The presence of this section is optional; | |
229 | if it is missing, the second | |
230 | .B %% | |
231 | in the input file may be skipped, too. | |
232 | .LP | |
233 | In the definitions and rules sections, any | |
234 | .I indented | |
235 | text or text enclosed in | |
236 | .B %{ | |
237 | and | |
238 | .B %} | |
239 | is copied verbatim to the output (with the %{}'s removed). | |
240 | The %{}'s must appear unindented on lines by themselves. | |
241 | .LP | |
242 | In the rules section, | |
243 | any indented or %{} text appearing before the | |
244 | first rule may be used to declare variables | |
245 | which are local to the scanning routine and (after the declarations) | |
246 | code which is to be executed whenever the scanning routine is entered. | |
247 | Other indented or %{} text in the rule section is still copied to the output, | |
248 | but its meaning is not well-defined and it may well cause compile-time | |
249 | errors (this feature is present for | |
250 | .I POSIX | |
251 | compliance; see below for other such features). | |
252 | .LP | |
253 | In the definitions section, an unindented comment (i.e., a line | |
254 | beginning with "/*") is also copied verbatim to the output up | |
255 | to the next "*/". Also, any line in the definitions section | |
256 | beginning with '#' is ignored, though this style of comment is | |
257 | deprecated and may go away in the future. | |
258 | .SH PATTERNS | |
259 | The patterns in the input are written using an extended set of regular | |
260 | expressions. These are: | |
261 | .nf | |
262 | ||
263 | x match the character 'x' | |
264 | . any character except newline | |
265 | [xyz] a "character class"; in this case, the pattern | |
266 | matches either an 'x', a 'y', or a 'z' | |
267 | [abj-oZ] a "character class" with a range in it; matches | |
268 | an 'a', a 'b', any letter from 'j' through 'o', | |
269 | or a 'Z' | |
270 | [^A-Z] a "negated character class", i.e., any character | |
271 | but those in the class. In this case, any | |
272 | character EXCEPT an uppercase letter. | |
273 | [^A-Z\\n] any character EXCEPT an uppercase letter or | |
274 | a newline | |
275 | r* zero or more r's, where r is any regular expression | |
276 | r+ one or more r's | |
277 | r? zero or one r's (that is, "an optional r") | |
278 | r{2,5} anywhere from two to five r's | |
279 | r{2,} two or more r's | |
280 | r{4} exactly 4 r's | |
281 | {name} the expansion of the "name" definition | |
282 | (see above) | |
283 | "[xyz]\\"foo" | |
284 | the literal string: [xyz]"foo | |
285 | \\X if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', | |
286 | then the ANSI-C interpretation of \\x. | |
287 | Otherwise, a literal 'X' (used to escape | |
288 | operators such as '*') | |
289 | \\123 the character with octal value 123 | |
290 | \\x2a the character with hexadecimal value 2a | |
291 | (r) match an r; parentheses are used to override | |
292 | precedence (see below) | |
293 | ||
294 | ||
295 | rs the regular expression r followed by the | |
296 | regular expression s; called "concatenation" | |
297 | ||
298 | ||
299 | r|s either an r or an s | |
300 | ||
301 | ||
302 | r/s an r but only if it is followed by an s. The | |
303 | s is not part of the matched text. This type | |
304 | of pattern is called as "trailing context". | |
305 | ^r an r, but only at the beginning of a line | |
306 | r$ an r, but only at the end of a line. Equivalent | |
307 | to "r/\\n". | |
308 | ||
309 | ||
310 | <s>r an r, but only in start condition s (see | |
311 | below for discussion of start conditions) | |
312 | <s1,s2,s3>r | |
313 | same, but in any of start conditions s1, | |
314 | s2, or s3 | |
315 | ||
316 | ||
317 | <<EOF>> an end-of-file | |
318 | <s1,s2><<EOF>> | |
319 | an end-of-file when in start condition s1 or s2 | |
320 | ||
321 | .fi | |
322 | The regular expressions listed above are grouped according to | |
323 | precedence, from highest precedence at the top to lowest at the bottom. | |
324 | Those grouped together have equal precedence. For example, | |
325 | .nf | |
326 | ||
327 | foo|bar* | |
328 | ||
329 | .fi | |
330 | is the same as | |
331 | .nf | |
332 | ||
333 | (foo)|(ba(r*)) | |
334 | ||
335 | .fi | |
336 | since the '*' operator has higher precedence than concatenation, | |
337 | and concatenation higher than alternation ('|'). This pattern | |
338 | therefore matches | |
339 | .I either | |
340 | the string "foo" | |
341 | .I or | |
342 | the string "ba" followed by zero-or-more r's. | |
343 | To match "foo" or zero-or-more "bar"'s, use: | |
344 | .nf | |
345 | ||
346 | foo|(bar)* | |
347 | ||
348 | .fi | |
349 | and to match zero-or-more "foo"'s-or-"bar"'s: | |
350 | .nf | |
351 | ||
352 | (foo|bar)* | |
353 | ||
354 | .fi | |
355 | .LP | |
356 | Some notes on patterns: | |
357 | .IP - | |
358 | A negated character class such as the example "[^A-Z]" | |
359 | above | |
360 | .I will match a newline | |
361 | unless "\\n" (or an equivalent escape sequence) is one of the | |
362 | characters explicitly present in the negated character class | |
363 | (e.g., "[^A-Z\\n]"). This is unlike how many other regular | |
364 | expression tools treat negated character classes, but unfortunately | |
365 | the inconsistency is historically entrenched. | |
366 | Matching newlines means that a pattern like [^"]* can match an entire | |
367 | input (overflowing the scanner's input buffer) unless there's another | |
368 | quote in the input. | |
369 | .IP - | |
370 | A rule can have at most one instance of trailing context (the '/' operator | |
371 | or the '$' operator). The start condition, '^', and "<<EOF>>" patterns | |
372 | can only occur at the beginning of a pattern, and, as well as with '/' and '$', | |
373 | cannot be grouped inside parentheses. A '^' which does not occur at | |
374 | the beginning of a rule or a '$' which does not occur at the end of | |
375 | a rule loses its special properties and is treated as a normal character. | |
376 | .IP | |
377 | The following are illegal: | |
378 | .nf | |
379 | ||
380 | foo/bar$ | |
381 | <sc1>foo<sc2>bar | |
382 | ||
383 | .fi | |
384 | Note that the first of these, can be written "foo/bar\\n". | |
385 | .IP | |
386 | The following will result in '$' or '^' being treated as a normal character: | |
387 | .nf | |
388 | ||
389 | foo|(bar$) | |
390 | foo|^bar | |
391 | ||
392 | .fi | |
393 | If what's wanted is a "foo" or a bar-followed-by-a-newline, the following | |
394 | could be used (the special '|' action is explained below): | |
395 | .nf | |
396 | ||
397 | foo | | |
398 | bar$ /* action goes here */ | |
399 | ||
400 | .fi | |
401 | A similar trick will work for matching a foo or a | |
402 | bar-at-the-beginning-of-a-line. | |
403 | .SH HOW THE INPUT IS MATCHED | |
404 | When the generated scanner is run, it analyzes its input looking | |
405 | for strings which match any of its patterns. If it finds more than | |
406 | one match, it takes the one matching the most text (for trailing | |
407 | context rules, this includes the length of the trailing part, even | |
408 | though it will then be returned to the input). If it finds two | |
409 | or more matches of the same length, the | |
410 | rule listed first in the | |
411 | .I flex | |
412 | input file is chosen. | |
413 | .LP | |
414 | Once the match is determined, the text corresponding to the match | |
415 | (called the | |
416 | .I token) | |
417 | is made available in the global character pointer | |
418 | .B yytext, | |
419 | and its length in the global integer | |
420 | .B yyleng. | |
421 | The | |
422 | .I action | |
423 | corresponding to the matched pattern is then executed (a more | |
424 | detailed description of actions follows), and then the remaining | |
425 | input is scanned for another match. | |
426 | .LP | |
427 | If no match is found, then the | |
428 | .I default rule | |
429 | is executed: the next character in the input is considered matched and | |
430 | copied to the standard output. Thus, the simplest legal | |
431 | .I flex | |
432 | input is: | |
433 | .nf | |
434 | ||
435 | %% | |
436 | ||
437 | .fi | |
438 | which generates a scanner that simply copies its input (one character | |
439 | at a time) to its output. | |
440 | .SH ACTIONS | |
441 | Each pattern in a rule has a corresponding action, which can be any | |
442 | arbitrary C statement. The pattern ends at the first non-escaped | |
443 | whitespace character; the remainder of the line is its action. If the | |
444 | action is empty, then when the pattern is matched the input token | |
445 | is simply discarded. For example, here is the specification for a program | |
446 | which deletes all occurrences of "zap me" from its input: | |
447 | .nf | |
448 | ||
449 | %% | |
450 | "zap me" | |
451 | ||
452 | .fi | |
453 | (It will copy all other characters in the input to the output since | |
454 | they will be matched by the default rule.) | |
455 | .LP | |
456 | Here is a program which compresses multiple blanks and tabs down to | |
457 | a single blank, and throws away whitespace found at the end of a line: | |
458 | .nf | |
459 | ||
460 | %% | |
461 | [ \\t]+ putchar( ' ' ); | |
462 | [ \\t]+$ /* ignore this token */ | |
463 | ||
464 | .fi | |
465 | .LP | |
466 | If the action contains a '{', then the action spans till the balancing '}' | |
467 | is found, and the action may cross multiple lines. | |
468 | .I flex | |
469 | knows about C strings and comments and won't be fooled by braces found | |
470 | within them, but also allows actions to begin with | |
471 | .B %{ | |
472 | and will consider the action to be all the text up to the next | |
473 | .B %} | |
474 | (regardless of ordinary braces inside the action). | |
475 | .LP | |
476 | An action consisting solely of a vertical bar ('|') means "same as | |
477 | the action for the next rule." See below for an illustration. | |
478 | .LP | |
479 | Actions can include arbitrary C code, including | |
480 | .B return | |
481 | statements to return a value to whatever routine called | |
482 | .B yylex(). | |
483 | Each time | |
484 | .B yylex() | |
485 | is called it continues processing tokens from where it last left | |
486 | off until it either reaches | |
487 | the end of the file or executes a return. Once it reaches an end-of-file, | |
488 | however, then any subsequent call to | |
489 | .B yylex() | |
490 | will simply immediately return, unless | |
491 | .B yyrestart() | |
492 | is first called (see below). | |
493 | .LP | |
494 | Actions are not allowed to modify yytext or yyleng. | |
495 | .LP | |
496 | There are a number of special directives which can be included within | |
497 | an action: | |
498 | .IP - | |
499 | .B ECHO | |
500 | copies yytext to the scanner's output. | |
501 | .IP - | |
502 | .B BEGIN | |
503 | followed by the name of a start condition places the scanner in the | |
504 | corresponding start condition (see below). | |
505 | .IP - | |
506 | .B REJECT | |
507 | directs the scanner to proceed on to the "second best" rule which matched the | |
508 | input (or a prefix of the input). The rule is chosen as described | |
509 | above in "How the Input is Matched", and | |
510 | .B yytext | |
511 | and | |
512 | .B yyleng | |
513 | set up appropriately. | |
514 | It may either be one which matched as much text | |
515 | as the originally chosen rule but came later in the | |
516 | .I flex | |
517 | input file, or one which matched less text. | |
518 | For example, the following will both count the | |
519 | words in the input and call the routine special() whenever "frob" is seen: | |
520 | .nf | |
521 | ||
522 | int word_count = 0; | |
523 | %% | |
524 | ||
525 | frob special(); REJECT; | |
526 | [^ \\t\\n]+ ++word_count; | |
527 | ||
528 | .fi | |
529 | Without the | |
530 | .B REJECT, | |
531 | any "frob"'s in the input would not be counted as words, since the | |
532 | scanner normally executes only one action per token. | |
533 | Multiple | |
534 | .B REJECT's | |
535 | are allowed, each one finding the next best choice to the currently | |
536 | active rule. For example, when the following scanner scans the token | |
537 | "abcd", it will write "abcdabcaba" to the output: | |
538 | .nf | |
539 | ||
540 | %% | |
541 | a | | |
542 | ab | | |
543 | abc | | |
544 | abcd ECHO; REJECT; | |
545 | .|\\n /* eat up any unmatched character */ | |
546 | ||
547 | .fi | |
548 | (The first three rules share the fourth's action since they use | |
549 | the special '|' action.) | |
550 | .B REJECT | |
551 | is a particularly expensive feature in terms scanner performance; | |
552 | if it is used in | |
553 | .I any | |
554 | of the scanner's actions it will slow down | |
555 | .I all | |
556 | of the scanner's matching. Furthermore, | |
557 | .B REJECT | |
558 | cannot be used with the | |
559 | .I -f | |
560 | or | |
561 | .I -F | |
562 | options (see below). | |
563 | .IP | |
564 | Note also that unlike the other special actions, | |
565 | .B REJECT | |
566 | is a | |
567 | .I branch; | |
568 | code immediately following it in the action will | |
569 | .I not | |
570 | be executed. | |
571 | .IP - | |
572 | .B yymore() | |
573 | tells the scanner that the next time it matches a rule, the corresponding | |
574 | token should be | |
575 | .I appended | |
576 | onto the current value of | |
577 | .B yytext | |
578 | rather than replacing it. For example, given the input "mega-kludge" | |
579 | the following will write "mega-mega-kludge" to the output: | |
580 | .nf | |
581 | ||
582 | %% | |
583 | mega- ECHO; yymore(); | |
584 | kludge ECHO; | |
585 | ||
586 | .fi | |
587 | First "mega-" is matched and echoed to the output. Then "kludge" | |
588 | is matched, but the previous "mega-" is still hanging around at the | |
589 | beginning of | |
590 | .B yytext | |
591 | so the | |
592 | .B ECHO | |
593 | for the "kludge" rule will actually write "mega-kludge". | |
594 | The presence of | |
595 | .B yymore() | |
596 | in the scanner's action entails a minor performance penalty in the | |
597 | scanner's matching speed. | |
598 | .IP - | |
599 | .B yyless(n) | |
600 | returns all but the first | |
601 | .I n | |
602 | characters of the current token back to the input stream, where they | |
603 | will be rescanned when the scanner looks for the next match. | |
604 | .B yytext | |
605 | and | |
606 | .B yyleng | |
607 | are adjusted appropriately (e.g., | |
608 | .B yyleng | |
609 | will now be equal to | |
610 | .I n | |
611 | ). For example, on the input "foobar" the following will write out | |
612 | "foobarbar": | |
613 | .nf | |
614 | ||
615 | %% | |
616 | foobar ECHO; yyless(3); | |
617 | [a-z]+ ECHO; | |
618 | ||
619 | .fi | |
620 | An argument of 0 to | |
621 | .B yyless | |
622 | will cause the entire current input string to be scanned again. Unless you've | |
623 | changed how the scanner will subsequently process its input (using | |
624 | .B BEGIN, | |
625 | for example), this will result in an endless loop. | |
626 | .IP - | |
627 | .B unput(c) | |
628 | puts the character | |
629 | .I c | |
630 | back onto the input stream. It will be the next character scanned. | |
631 | The following action will take the current token and cause it | |
632 | to be rescanned enclosed in parentheses. | |
633 | .nf | |
634 | ||
635 | { | |
636 | int i; | |
637 | unput( ')' ); | |
638 | for ( i = yyleng - 1; i >= 0; --i ) | |
639 | unput( yytext[i] ); | |
640 | unput( '(' ); | |
641 | } | |
642 | ||
643 | .fi | |
644 | Note that since each | |
645 | .B unput() | |
646 | puts the given character back at the | |
647 | .I beginning | |
648 | of the input stream, pushing back strings must be done back-to-front. | |
649 | .IP - | |
650 | .B input() | |
651 | reads the next character from the input stream. For example, | |
652 | the following is one way to eat up C comments: | |
653 | .nf | |
654 | ||
655 | %% | |
656 | "/*" { | |
657 | register int c; | |
658 | ||
659 | for ( ; ; ) | |
660 | { | |
661 | while ( (c = input()) != '*' && | |
662 | c != EOF ) | |
663 | ; /* eat up text of comment */ | |
664 | ||
665 | if ( c == '*' ) | |
666 | { | |
667 | while ( (c = input()) == '*' ) | |
668 | ; | |
669 | if ( c == '/' ) | |
670 | break; /* found the end */ | |
671 | } | |
672 | ||
673 | if ( c == EOF ) | |
674 | { | |
675 | error( "EOF in comment" ); | |
676 | break; | |
677 | } | |
678 | } | |
679 | } | |
680 | ||
681 | .fi | |
682 | (Note that if the scanner is compiled using | |
683 | .B C++, | |
684 | then | |
685 | .B input() | |
686 | is instead referred to as | |
687 | .B yyinput(), | |
688 | in order to avoid a name clash with the | |
689 | .B C++ | |
690 | stream by the name of | |
691 | .I input.) | |
692 | .IP - | |
693 | .B yyterminate() | |
694 | can be used in lieu of a return statement in an action. It terminates | |
695 | the scanner and returns a 0 to the scanner's caller, indicating "all done". | |
696 | Subsequent calls to the scanner will immediately return unless preceded | |
697 | by a call to | |
698 | .B yyrestart() | |
699 | (see below). | |
700 | By default, | |
701 | .B yyterminate() | |
702 | is also called when an end-of-file is encountered. It is a macro and | |
703 | may be redefined. | |
704 | .SH THE GENERATED SCANNER | |
705 | The output of | |
706 | .I flex | |
707 | is the file | |
708 | .B lex.yy.c, | |
709 | which contains the scanning routine | |
710 | .B yylex(), | |
711 | a number of tables used by it for matching tokens, and a number | |
712 | of auxiliary routines and macros. By default, | |
713 | .B yylex() | |
714 | is declared as follows: | |
715 | .nf | |
716 | ||
717 | int yylex() | |
718 | { | |
719 | ... various definitions and the actions in here ... | |
720 | } | |
721 | ||
722 | .fi | |
723 | (If your environment supports function prototypes, then it will | |
724 | be "int yylex( void )".) This definition may be changed by redefining | |
725 | the "YY_DECL" macro. For example, you could use: | |
726 | .nf | |
727 | ||
728 | #undef YY_DECL | |
729 | #define YY_DECL float lexscan( a, b ) float a, b; | |
730 | ||
731 | .fi | |
732 | to give the scanning routine the name | |
733 | .I lexscan, | |
734 | returning a float, and taking two floats as arguments. Note that | |
735 | if you give arguments to the scanning routine using a | |
736 | K&R-style/non-prototyped function declaration, you must terminate | |
737 | the definition with a semi-colon (;). | |
738 | .LP | |
739 | Whenever | |
740 | .B yylex() | |
741 | is called, it scans tokens from the global input file | |
742 | .I yyin | |
743 | (which defaults to stdin). It continues until it either reaches | |
744 | an end-of-file (at which point it returns the value 0) or | |
745 | one of its actions executes a | |
746 | .I return | |
747 | statement. | |
748 | In the former case, when called again the scanner will immediately | |
749 | return unless | |
750 | .B yyrestart() | |
751 | is called to point | |
752 | .I yyin | |
753 | at the new input file. ( | |
754 | .B yyrestart() | |
755 | takes one argument, a | |
756 | .B FILE * | |
757 | pointer.) | |
758 | In the latter case (i.e., when an action | |
759 | executes a return), the scanner may then be called again and it | |
760 | will resume scanning where it left off. | |
761 | .LP | |
762 | By default (and for purposes of efficiency), the scanner uses | |
763 | block-reads rather than simple | |
764 | .I getc() | |
765 | calls to read characters from | |
766 | .I yyin. | |
767 | The nature of how it gets its input can be controlled by redefining the | |
768 | .B YY_INPUT | |
769 | macro. | |
770 | YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its | |
771 | action is to place up to | |
772 | .I max_size | |
773 | characters in the character array | |
774 | .I buf | |
775 | and return in the integer variable | |
776 | .I result | |
777 | either the | |
778 | number of characters read or the constant YY_NULL (0 on Unix systems) | |
779 | to indicate EOF. The default YY_INPUT reads from the | |
780 | global file-pointer "yyin". | |
781 | .LP | |
782 | A sample redefinition of YY_INPUT (in the definitions | |
783 | section of the input file): | |
784 | .nf | |
785 | ||
786 | %{ | |
787 | #undef YY_INPUT | |
788 | #define YY_INPUT(buf,result,max_size) \\ | |
78ed81a3 | 789 | { \\ |
790 | int c = getchar(); \\ | |
791 | result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \\ | |
792 | } | |
15637ed4 RG |
793 | %} |
794 | ||
795 | .fi | |
796 | This definition will change the input processing to occur | |
797 | one character at a time. | |
798 | .LP | |
799 | You also can add in things like keeping track of the | |
800 | input line number this way; but don't expect your scanner to | |
801 | go very fast. | |
802 | .LP | |
803 | When the scanner receives an end-of-file indication from YY_INPUT, | |
804 | it then checks the | |
805 | .B yywrap() | |
806 | function. If | |
807 | .B yywrap() | |
808 | returns false (zero), then it is assumed that the | |
809 | function has gone ahead and set up | |
810 | .I yyin | |
811 | to point to another input file, and scanning continues. If it returns | |
812 | true (non-zero), then the scanner terminates, returning 0 to its | |
813 | caller. | |
814 | .LP | |
815 | The default | |
816 | .B yywrap() | |
817 | always returns 1. Presently, to redefine it you must first | |
818 | "#undef yywrap", as it is currently implemented as a macro. As indicated | |
819 | by the hedging in the previous sentence, it may be changed to | |
820 | a true function in the near future. | |
821 | .LP | |
822 | The scanner writes its | |
823 | .B ECHO | |
824 | output to the | |
825 | .I yyout | |
826 | global (default, stdout), which may be redefined by the user simply | |
827 | by assigning it to some other | |
828 | .B FILE | |
829 | pointer. | |
830 | .SH START CONDITIONS | |
831 | .I flex | |
832 | provides a mechanism for conditionally activating rules. Any rule | |
833 | whose pattern is prefixed with "<sc>" will only be active when | |
834 | the scanner is in the start condition named "sc". For example, | |
835 | .nf | |
836 | ||
837 | <STRING>[^"]* { /* eat up the string body ... */ | |
838 | ... | |
839 | } | |
840 | ||
841 | .fi | |
842 | will be active only when the scanner is in the "STRING" start | |
843 | condition, and | |
844 | .nf | |
845 | ||
846 | <INITIAL,STRING,QUOTE>\\. { /* handle an escape ... */ | |
847 | ... | |
848 | } | |
849 | ||
850 | .fi | |
851 | will be active only when the current start condition is | |
852 | either "INITIAL", "STRING", or "QUOTE". | |
853 | .LP | |
854 | Start conditions | |
855 | are declared in the definitions (first) section of the input | |
856 | using unindented lines beginning with either | |
857 | .B %s | |
858 | or | |
859 | .B %x | |
860 | followed by a list of names. | |
861 | The former declares | |
862 | .I inclusive | |
863 | start conditions, the latter | |
864 | .I exclusive | |
865 | start conditions. A start condition is activated using the | |
866 | .B BEGIN | |
867 | action. Until the next | |
868 | .B BEGIN | |
869 | action is executed, rules with the given start | |
870 | condition will be active and | |
871 | rules with other start conditions will be inactive. | |
872 | If the start condition is | |
873 | .I inclusive, | |
874 | then rules with no start conditions at all will also be active. | |
875 | If it is | |
876 | .I exclusive, | |
877 | then | |
878 | .I only | |
879 | rules qualified with the start condition will be active. | |
880 | A set of rules contingent on the same exclusive start condition | |
881 | describe a scanner which is independent of any of the other rules in the | |
882 | .I flex | |
883 | input. Because of this, | |
884 | exclusive start conditions make it easy to specify "mini-scanners" | |
885 | which scan portions of the input that are syntactically different | |
886 | from the rest (e.g., comments). | |
887 | .LP | |
888 | If the distinction between inclusive and exclusive start conditions | |
889 | is still a little vague, here's a simple example illustrating the | |
890 | connection between the two. The set of rules: | |
891 | .nf | |
892 | ||
893 | %s example | |
894 | %% | |
895 | <example>foo /* do something */ | |
896 | ||
897 | .fi | |
898 | is equivalent to | |
899 | .nf | |
900 | ||
901 | %x example | |
902 | %% | |
903 | <INITIAL,example>foo /* do something */ | |
904 | ||
905 | .fi | |
906 | .LP | |
907 | The default rule (to | |
908 | .B ECHO | |
909 | any unmatched character) remains active in start conditions. | |
910 | .LP | |
911 | .B BEGIN(0) | |
912 | returns to the original state where only the rules with | |
913 | no start conditions are active. This state can also be | |
914 | referred to as the start-condition "INITIAL", so | |
915 | .B BEGIN(INITIAL) | |
916 | is equivalent to | |
917 | .B BEGIN(0). | |
918 | (The parentheses around the start condition name are not required but | |
919 | are considered good style.) | |
920 | .LP | |
921 | .B BEGIN | |
922 | actions can also be given as indented code at the beginning | |
923 | of the rules section. For example, the following will cause | |
924 | the scanner to enter the "SPECIAL" start condition whenever | |
925 | .I yylex() | |
926 | is called and the global variable | |
927 | .I enter_special | |
928 | is true: | |
929 | .nf | |
930 | ||
931 | int enter_special; | |
932 | ||
933 | %x SPECIAL | |
934 | %% | |
935 | if ( enter_special ) | |
936 | BEGIN(SPECIAL); | |
937 | ||
938 | <SPECIAL>blahblahblah | |
939 | ...more rules follow... | |
940 | ||
941 | .fi | |
942 | .LP | |
943 | To illustrate the uses of start conditions, | |
944 | here is a scanner which provides two different interpretations | |
945 | of a string like "123.456". By default it will treat it as | |
946 | as three tokens, the integer "123", a dot ('.'), and the integer "456". | |
947 | But if the string is preceded earlier in the line by the string | |
948 | "expect-floats" | |
949 | it will treat it as a single token, the floating-point number | |
950 | 123.456: | |
951 | .nf | |
952 | ||
953 | %{ | |
954 | #include <math.h> | |
955 | %} | |
956 | %s expect | |
957 | ||
958 | %% | |
959 | expect-floats BEGIN(expect); | |
960 | ||
961 | <expect>[0-9]+"."[0-9]+ { | |
962 | printf( "found a float, = %f\\n", | |
963 | atof( yytext ) ); | |
964 | } | |
965 | <expect>\\n { | |
966 | /* that's the end of the line, so | |
967 | * we need another "expect-number" | |
968 | * before we'll recognize any more | |
969 | * numbers | |
970 | */ | |
971 | BEGIN(INITIAL); | |
972 | } | |
973 | ||
974 | [0-9]+ { | |
975 | printf( "found an integer, = %d\\n", | |
976 | atoi( yytext ) ); | |
977 | } | |
978 | ||
979 | "." printf( "found a dot\\n" ); | |
980 | ||
981 | .fi | |
982 | Here is a scanner which recognizes (and discards) C comments while | |
983 | maintaining a count of the current input line. | |
984 | .nf | |
985 | ||
986 | %x comment | |
987 | %% | |
988 | int line_num = 1; | |
989 | ||
990 | "/*" BEGIN(comment); | |
991 | ||
992 | <comment>[^*\\n]* /* eat anything that's not a '*' */ | |
993 | <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */ | |
994 | <comment>\\n ++line_num; | |
995 | <comment>"*"+"/" BEGIN(INITIAL); | |
996 | ||
997 | .fi | |
998 | Note that start-conditions names are really integer values and | |
999 | can be stored as such. Thus, the above could be extended in the | |
1000 | following fashion: | |
1001 | .nf | |
1002 | ||
1003 | %x comment foo | |
1004 | %% | |
1005 | int line_num = 1; | |
1006 | int comment_caller; | |
1007 | ||
1008 | "/*" { | |
1009 | comment_caller = INITIAL; | |
1010 | BEGIN(comment); | |
1011 | } | |
1012 | ||
1013 | ... | |
1014 | ||
1015 | <foo>"/*" { | |
1016 | comment_caller = foo; | |
1017 | BEGIN(comment); | |
1018 | } | |
1019 | ||
1020 | <comment>[^*\\n]* /* eat anything that's not a '*' */ | |
1021 | <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */ | |
1022 | <comment>\\n ++line_num; | |
1023 | <comment>"*"+"/" BEGIN(comment_caller); | |
1024 | ||
1025 | .fi | |
1026 | One can then implement a "stack" of start conditions using an | |
1027 | array of integers. (It is likely that such stacks will become | |
1028 | a full-fledged | |
1029 | .I flex | |
1030 | feature in the future.) Note, though, that | |
1031 | start conditions do not have their own name-space; %s's and %x's | |
1032 | declare names in the same fashion as #define's. | |
1033 | .SH MULTIPLE INPUT BUFFERS | |
1034 | Some scanners (such as those which support "include" files) | |
1035 | require reading from several input streams. As | |
1036 | .I flex | |
1037 | scanners do a large amount of buffering, one cannot control | |
1038 | where the next input will be read from by simply writing a | |
1039 | .B YY_INPUT | |
1040 | which is sensitive to the scanning context. | |
1041 | .B YY_INPUT | |
1042 | is only called when the scanner reaches the end of its buffer, which | |
1043 | may be a long time after scanning a statement such as an "include" | |
1044 | which requires switching the input source. | |
1045 | .LP | |
1046 | To negotiate these sorts of problems, | |
1047 | .I flex | |
1048 | provides a mechanism for creating and switching between multiple | |
1049 | input buffers. An input buffer is created by using: | |
1050 | .nf | |
1051 | ||
1052 | YY_BUFFER_STATE yy_create_buffer( FILE *file, int size ) | |
1053 | ||
1054 | .fi | |
1055 | which takes a | |
1056 | .I FILE | |
1057 | pointer and a size and creates a buffer associated with the given | |
1058 | file and large enough to hold | |
1059 | .I size | |
1060 | characters (when in doubt, use | |
1061 | .B YY_BUF_SIZE | |
1062 | for the size). It returns a | |
1063 | .B YY_BUFFER_STATE | |
1064 | handle, which may then be passed to other routines: | |
1065 | .nf | |
1066 | ||
1067 | void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer ) | |
1068 | ||
1069 | .fi | |
1070 | switches the scanner's input buffer so subsequent tokens will | |
1071 | come from | |
1072 | .I new_buffer. | |
1073 | Note that | |
1074 | .B yy_switch_to_buffer() | |
1075 | may be used by yywrap() to sets things up for continued scanning, instead | |
1076 | of opening a new file and pointing | |
1077 | .I yyin | |
1078 | at it. | |
1079 | .nf | |
1080 | ||
1081 | void yy_delete_buffer( YY_BUFFER_STATE buffer ) | |
1082 | ||
1083 | .fi | |
1084 | is used to reclaim the storage associated with a buffer. | |
1085 | .LP | |
1086 | .B yy_new_buffer() | |
1087 | is an alias for | |
1088 | .B yy_create_buffer(), | |
1089 | provided for compatibility with the C++ use of | |
1090 | .I new | |
1091 | and | |
1092 | .I delete | |
1093 | for creating and destroying dynamic objects. | |
1094 | .LP | |
1095 | Finally, the | |
1096 | .B YY_CURRENT_BUFFER | |
1097 | macro returns a | |
1098 | .B YY_BUFFER_STATE | |
1099 | handle to the current buffer. | |
1100 | .LP | |
1101 | Here is an example of using these features for writing a scanner | |
1102 | which expands include files (the | |
1103 | .B <<EOF>> | |
1104 | feature is discussed below): | |
1105 | .nf | |
1106 | ||
1107 | /* the "incl" state is used for picking up the name | |
1108 | * of an include file | |
1109 | */ | |
1110 | %x incl | |
1111 | ||
1112 | %{ | |
1113 | #define MAX_INCLUDE_DEPTH 10 | |
1114 | YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH]; | |
1115 | int include_stack_ptr = 0; | |
1116 | %} | |
1117 | ||
1118 | %% | |
1119 | include BEGIN(incl); | |
1120 | ||
1121 | [a-z]+ ECHO; | |
1122 | [^a-z\\n]*\\n? ECHO; | |
1123 | ||
1124 | <incl>[ \\t]* /* eat the whitespace */ | |
1125 | <incl>[^ \\t\\n]+ { /* got the include file name */ | |
1126 | if ( include_stack_ptr >= MAX_INCLUDE_DEPTH ) | |
1127 | { | |
1128 | fprintf( stderr, "Includes nested too deeply" ); | |
1129 | exit( 1 ); | |
1130 | } | |
1131 | ||
1132 | include_stack[include_stack_ptr++] = | |
1133 | YY_CURRENT_BUFFER; | |
1134 | ||
1135 | yyin = fopen( yytext, "r" ); | |
1136 | ||
1137 | if ( ! yyin ) | |
1138 | error( ... ); | |
1139 | ||
1140 | yy_switch_to_buffer( | |
1141 | yy_create_buffer( yyin, YY_BUF_SIZE ) ); | |
1142 | ||
1143 | BEGIN(INITIAL); | |
1144 | } | |
1145 | ||
1146 | <<EOF>> { | |
1147 | if ( --include_stack_ptr < 0 ) | |
1148 | { | |
1149 | yyterminate(); | |
1150 | } | |
1151 | ||
1152 | else | |
1153 | yy_switch_to_buffer( | |
1154 | include_stack[include_stack_ptr] ); | |
1155 | } | |
1156 | ||
1157 | .fi | |
1158 | .SH END-OF-FILE RULES | |
1159 | The special rule "<<EOF>>" indicates | |
1160 | actions which are to be taken when an end-of-file is | |
1161 | encountered and yywrap() returns non-zero (i.e., indicates | |
1162 | no further files to process). The action must finish | |
1163 | by doing one of four things: | |
1164 | .IP - | |
1165 | the special | |
1166 | .B YY_NEW_FILE | |
1167 | action, if | |
1168 | .I yyin | |
1169 | has been pointed at a new file to process; | |
1170 | .IP - | |
1171 | a | |
1172 | .I return | |
1173 | statement; | |
1174 | .IP - | |
1175 | the special | |
1176 | .B yyterminate() | |
1177 | action; | |
1178 | .IP - | |
1179 | or, switching to a new buffer using | |
1180 | .B yy_switch_to_buffer() | |
1181 | as shown in the example above. | |
1182 | .LP | |
1183 | <<EOF>> rules may not be used with other | |
1184 | patterns; they may only be qualified with a list of start | |
1185 | conditions. If an unqualified <<EOF>> rule is given, it | |
1186 | applies to | |
1187 | .I all | |
1188 | start conditions which do not already have <<EOF>> actions. To | |
1189 | specify an <<EOF>> rule for only the initial start condition, use | |
1190 | .nf | |
1191 | ||
1192 | <INITIAL><<EOF>> | |
1193 | ||
1194 | .fi | |
1195 | .LP | |
1196 | These rules are useful for catching things like unclosed comments. | |
1197 | An example: | |
1198 | .nf | |
1199 | ||
1200 | %x quote | |
1201 | %% | |
1202 | ||
1203 | ...other rules for dealing with quotes... | |
1204 | ||
1205 | <quote><<EOF>> { | |
1206 | error( "unterminated quote" ); | |
1207 | yyterminate(); | |
1208 | } | |
1209 | <<EOF>> { | |
1210 | if ( *++filelist ) | |
1211 | { | |
1212 | yyin = fopen( *filelist, "r" ); | |
1213 | YY_NEW_FILE; | |
1214 | } | |
1215 | else | |
1216 | yyterminate(); | |
1217 | } | |
1218 | ||
1219 | .fi | |
1220 | .SH MISCELLANEOUS MACROS | |
1221 | The macro | |
1222 | .bd | |
1223 | YY_USER_ACTION | |
1224 | can be redefined to provide an action | |
1225 | which is always executed prior to the matched rule's action. For example, | |
1226 | it could be #define'd to call a routine to convert yytext to lower-case. | |
1227 | .LP | |
1228 | The macro | |
1229 | .B YY_USER_INIT | |
1230 | may be redefined to provide an action which is always executed before | |
1231 | the first scan (and before the scanner's internal initializations are done). | |
1232 | For example, it could be used to call a routine to read | |
1233 | in a data table or open a logging file. | |
1234 | .LP | |
1235 | In the generated scanner, the actions are all gathered in one large | |
1236 | switch statement and separated using | |
1237 | .B YY_BREAK, | |
1238 | which may be redefined. By default, it is simply a "break", to separate | |
1239 | each rule's action from the following rule's. | |
1240 | Redefining | |
1241 | .B YY_BREAK | |
1242 | allows, for example, C++ users to | |
1243 | #define YY_BREAK to do nothing (while being very careful that every | |
1244 | rule ends with a "break" or a "return"!) to avoid suffering from | |
1245 | unreachable statement warnings where because a rule's action ends with | |
1246 | "return", the | |
1247 | .B YY_BREAK | |
1248 | is inaccessible. | |
1249 | .SH INTERFACING WITH YACC | |
1250 | One of the main uses of | |
1251 | .I flex | |
1252 | is as a companion to the | |
1253 | .I yacc | |
1254 | parser-generator. | |
1255 | .I yacc | |
1256 | parsers expect to call a routine named | |
1257 | .B yylex() | |
1258 | to find the next input token. The routine is supposed to | |
1259 | return the type of the next token as well as putting any associated | |
1260 | value in the global | |
1261 | .B yylval. | |
1262 | To use | |
1263 | .I flex | |
1264 | with | |
1265 | .I yacc, | |
1266 | one specifies the | |
1267 | .B -d | |
1268 | option to | |
1269 | .I yacc | |
1270 | to instruct it to generate the file | |
1271 | .B y.tab.h | |
1272 | containing definitions of all the | |
1273 | .B %tokens | |
1274 | appearing in the | |
1275 | .I yacc | |
1276 | input. This file is then included in the | |
1277 | .I flex | |
1278 | scanner. For example, if one of the tokens is "TOK_NUMBER", | |
1279 | part of the scanner might look like: | |
1280 | .nf | |
1281 | ||
1282 | %{ | |
1283 | #include "y.tab.h" | |
1284 | %} | |
1285 | ||
1286 | %% | |
1287 | ||
1288 | [0-9]+ yylval = atoi( yytext ); return TOK_NUMBER; | |
1289 | ||
1290 | .fi | |
1291 | .SH TRANSLATION TABLE | |
1292 | In the name of POSIX compliance, | |
1293 | .I flex | |
1294 | supports a | |
1295 | .I translation table | |
1296 | for mapping input characters into groups. | |
1297 | The table is specified in the first section, and its format looks like: | |
1298 | .nf | |
1299 | ||
1300 | %t | |
1301 | 1 abcd | |
1302 | 2 ABCDEFGHIJKLMNOPQRSTUVWXYZ | |
1303 | 52 0123456789 | |
1304 | 6 \\t\\ \\n | |
1305 | %t | |
1306 | ||
1307 | .fi | |
1308 | This example specifies that the characters 'a', 'b', 'c', and 'd' | |
1309 | are to all be lumped into group #1, upper-case letters | |
1310 | in group #2, digits in group #52, tabs, blanks, and newlines into | |
1311 | group #6, and | |
1312 | .I | |
1313 | no other characters will appear in the patterns. | |
1314 | The group numbers are actually disregarded by | |
1315 | .I flex; | |
1316 | .B %t | |
1317 | serves, though, to lump characters together. Given the above | |
1318 | table, for example, the pattern "a(AA)*5" is equivalent to "d(ZQ)*0". | |
1319 | They both say, "match any character in group #1, followed by | |
1320 | zero-or-more pairs of characters | |
1321 | from group #2, followed by a character from group #52." Thus | |
1322 | .B %t | |
1323 | provides a crude way for introducing equivalence classes into | |
1324 | the scanner specification. | |
1325 | .LP | |
1326 | Note that the | |
1327 | .B -i | |
1328 | option (see below) coupled with the equivalence classes which | |
1329 | .I flex | |
1330 | automatically generates take care of virtually all the instances | |
1331 | when one might consider using | |
1332 | .B %t. | |
1333 | But what the hell, it's there if you want it. | |
1334 | .SH OPTIONS | |
1335 | .I flex | |
1336 | has the following options: | |
1337 | .TP | |
1338 | .B -b | |
1339 | Generate backtracking information to | |
1340 | .I lex.backtrack. | |
1341 | This is a list of scanner states which require backtracking | |
1342 | and the input characters on which they do so. By adding rules one | |
1343 | can remove backtracking states. If all backtracking states | |
1344 | are eliminated and | |
1345 | .B -f | |
1346 | or | |
1347 | .B -F | |
1348 | is used, the generated scanner will run faster (see the | |
1349 | .B -p | |
1350 | flag). Only users who wish to squeeze every last cycle out of their | |
1351 | scanners need worry about this option. (See the section on PERFORMANCE | |
1352 | CONSIDERATIONS below.) | |
1353 | .TP | |
1354 | .B -c | |
1355 | is a do-nothing, deprecated option included for POSIX compliance. | |
1356 | .IP | |
1357 | .B NOTE: | |
1358 | in previous releases of | |
1359 | .I flex | |
1360 | .B -c | |
1361 | specified table-compression options. This functionality is | |
1362 | now given by the | |
1363 | .B -C | |
1364 | flag. To ease the the impact of this change, when | |
1365 | .I flex | |
1366 | encounters | |
1367 | .B -c, | |
1368 | it currently issues a warning message and assumes that | |
1369 | .B -C | |
1370 | was desired instead. In the future this "promotion" of | |
1371 | .B -c | |
1372 | to | |
1373 | .B -C | |
1374 | will go away in the name of full POSIX compliance (unless | |
1375 | the POSIX meaning is removed first). | |
1376 | .TP | |
1377 | .B -d | |
1378 | makes the generated scanner run in | |
1379 | .I debug | |
1380 | mode. Whenever a pattern is recognized and the global | |
1381 | .B yy_flex_debug | |
1382 | is non-zero (which is the default), | |
1383 | the scanner will write to | |
1384 | .I stderr | |
1385 | a line of the form: | |
1386 | .nf | |
1387 | ||
1388 | --accepting rule at line 53 ("the matched text") | |
1389 | ||
1390 | .fi | |
1391 | The line number refers to the location of the rule in the file | |
1392 | defining the scanner (i.e., the file that was fed to flex). Messages | |
1393 | are also generated when the scanner backtracks, accepts the | |
1394 | default rule, reaches the end of its input buffer (or encounters | |
1395 | a NUL; at this point, the two look the same as far as the scanner's concerned), | |
1396 | or reaches an end-of-file. | |
1397 | .TP | |
1398 | .B -f | |
1399 | specifies (take your pick) | |
1400 | .I full table | |
1401 | or | |
1402 | .I fast scanner. | |
1403 | No table compression is done. The result is large but fast. | |
1404 | This option is equivalent to | |
1405 | .B -Cf | |
1406 | (see below). | |
1407 | .TP | |
1408 | .B -i | |
1409 | instructs | |
1410 | .I flex | |
1411 | to generate a | |
1412 | .I case-insensitive | |
1413 | scanner. The case of letters given in the | |
1414 | .I flex | |
1415 | input patterns will | |
1416 | be ignored, and tokens in the input will be matched regardless of case. The | |
1417 | matched text given in | |
1418 | .I yytext | |
1419 | will have the preserved case (i.e., it will not be folded). | |
1420 | .TP | |
1421 | .B -n | |
1422 | is another do-nothing, deprecated option included only for | |
1423 | POSIX compliance. | |
1424 | .TP | |
1425 | .B -p | |
1426 | generates a performance report to stderr. The report | |
1427 | consists of comments regarding features of the | |
1428 | .I flex | |
1429 | input file which will cause a loss of performance in the resulting scanner. | |
1430 | Note that the use of | |
1431 | .I REJECT | |
1432 | and variable trailing context (see the BUGS section in flex(1)) | |
1433 | entails a substantial performance penalty; use of | |
1434 | .I yymore(), | |
1435 | the | |
1436 | .B ^ | |
1437 | operator, | |
1438 | and the | |
1439 | .B -I | |
1440 | flag entail minor performance penalties. | |
1441 | .TP | |
1442 | .B -s | |
1443 | causes the | |
1444 | .I default rule | |
1445 | (that unmatched scanner input is echoed to | |
1446 | .I stdout) | |
1447 | to be suppressed. If the scanner encounters input that does not | |
1448 | match any of its rules, it aborts with an error. This option is | |
1449 | useful for finding holes in a scanner's rule set. | |
1450 | .TP | |
1451 | .B -t | |
1452 | instructs | |
1453 | .I flex | |
1454 | to write the scanner it generates to standard output instead | |
1455 | of | |
1456 | .B lex.yy.c. | |
1457 | .TP | |
1458 | .B -v | |
1459 | specifies that | |
1460 | .I flex | |
1461 | should write to | |
1462 | .I stderr | |
1463 | a summary of statistics regarding the scanner it generates. | |
1464 | Most of the statistics are meaningless to the casual | |
1465 | .I flex | |
1466 | user, but the | |
1467 | first line identifies the version of | |
1468 | .I flex, | |
1469 | which is useful for figuring | |
1470 | out where you stand with respect to patches and new releases, | |
1471 | and the next two lines give the date when the scanner was created | |
1472 | and a summary of the flags which were in effect. | |
1473 | .TP | |
1474 | .B -F | |
1475 | specifies that the | |
1476 | .ul | |
1477 | fast | |
1478 | scanner table representation should be used. This representation is | |
1479 | about as fast as the full table representation | |
1480 | .ul | |
1481 | (-f), | |
1482 | and for some sets of patterns will be considerably smaller (and for | |
1483 | others, larger). In general, if the pattern set contains both "keywords" | |
1484 | and a catch-all, "identifier" rule, such as in the set: | |
1485 | .nf | |
1486 | ||
1487 | "case" return TOK_CASE; | |
1488 | "switch" return TOK_SWITCH; | |
1489 | ... | |
1490 | "default" return TOK_DEFAULT; | |
1491 | [a-z]+ return TOK_ID; | |
1492 | ||
1493 | .fi | |
1494 | then you're better off using the full table representation. If only | |
1495 | the "identifier" rule is present and you then use a hash table or some such | |
1496 | to detect the keywords, you're better off using | |
1497 | .ul | |
1498 | -F. | |
1499 | .IP | |
1500 | This option is equivalent to | |
1501 | .B -CF | |
1502 | (see below). | |
1503 | .TP | |
1504 | .B -I | |
1505 | instructs | |
1506 | .I flex | |
1507 | to generate an | |
1508 | .I interactive | |
1509 | scanner. Normally, scanners generated by | |
1510 | .I flex | |
1511 | always look ahead one | |
1512 | character before deciding that a rule has been matched. At the cost of | |
1513 | some scanning overhead, | |
1514 | .I flex | |
1515 | will generate a scanner which only looks ahead | |
1516 | when needed. Such scanners are called | |
1517 | .I interactive | |
1518 | because if you want to write a scanner for an interactive system such as a | |
1519 | command shell, you will probably want the user's input to be terminated | |
1520 | with a newline, and without | |
1521 | .B -I | |
1522 | the user will have to type a character in addition to the newline in order | |
1523 | to have the newline recognized. This leads to dreadful interactive | |
1524 | performance. | |
1525 | .IP | |
1526 | If all this seems to confusing, here's the general rule: if a human will | |
1527 | be typing in input to your scanner, use | |
1528 | .B -I, | |
1529 | otherwise don't; if you don't care about squeezing the utmost performance | |
1530 | from your scanner and you | |
1531 | don't want to make any assumptions about the input to your scanner, | |
1532 | use | |
1533 | .B -I. | |
1534 | .IP | |
1535 | Note, | |
1536 | .B -I | |
1537 | cannot be used in conjunction with | |
1538 | .I full | |
1539 | or | |
1540 | .I fast tables, | |
1541 | i.e., the | |
1542 | .B -f, -F, -Cf, | |
1543 | or | |
1544 | .B -CF | |
1545 | flags. | |
1546 | .TP | |
1547 | .B -L | |
1548 | instructs | |
1549 | .I flex | |
1550 | not to generate | |
1551 | .B #line | |
1552 | directives. Without this option, | |
1553 | .I flex | |
1554 | peppers the generated scanner | |
1555 | with #line directives so error messages in the actions will be correctly | |
1556 | located with respect to the original | |
1557 | .I flex | |
1558 | input file, and not to | |
1559 | the fairly meaningless line numbers of | |
1560 | .B lex.yy.c. | |
1561 | (Unfortunately | |
1562 | .I flex | |
1563 | does not presently generate the necessary directives | |
1564 | to "retarget" the line numbers for those parts of | |
1565 | .B lex.yy.c | |
1566 | which it generated. So if there is an error in the generated code, | |
1567 | a meaningless line number is reported.) | |
1568 | .TP | |
1569 | .B -T | |
1570 | makes | |
1571 | .I flex | |
1572 | run in | |
1573 | .I trace | |
1574 | mode. It will generate a lot of messages to | |
1575 | .I stdout | |
1576 | concerning | |
1577 | the form of the input and the resultant non-deterministic and deterministic | |
1578 | finite automata. This option is mostly for use in maintaining | |
1579 | .I flex. | |
1580 | .TP | |
1581 | .B -8 | |
1582 | instructs | |
1583 | .I flex | |
1584 | to generate an 8-bit scanner, i.e., one which can recognize 8-bit | |
1585 | characters. On some sites, | |
1586 | .I flex | |
1587 | is installed with this option as the default. On others, the default | |
1588 | is 7-bit characters. To see which is the case, check the verbose | |
1589 | .B (-v) | |
1590 | output for "equivalence classes created". If the denominator of | |
1591 | the number shown is 128, then by default | |
1592 | .I flex | |
1593 | is generating 7-bit characters. If it is 256, then the default is | |
1594 | 8-bit characters and the | |
1595 | .B -8 | |
1596 | flag is not required (but may be a good idea to keep the scanner | |
1597 | specification portable). Feeding a 7-bit scanner 8-bit characters | |
1598 | will result in infinite loops, bus errors, or other such fireworks, | |
1599 | so when in doubt, use the flag. Note that if equivalence classes | |
1600 | are used, 8-bit scanners take only slightly more table space than | |
1601 | 7-bit scanners (128 bytes, to be exact); if equivalence classes are | |
1602 | not used, however, then the tables may grow up to twice their | |
1603 | 7-bit size. | |
1604 | .TP | |
1605 | .B -C[efmF] | |
1606 | controls the degree of table compression. | |
1607 | .IP | |
1608 | .B -Ce | |
1609 | directs | |
1610 | .I flex | |
1611 | to construct | |
1612 | .I equivalence classes, | |
1613 | i.e., sets of characters | |
1614 | which have identical lexical properties (for example, if the only | |
1615 | appearance of digits in the | |
1616 | .I flex | |
1617 | input is in the character class | |
1618 | "[0-9]" then the digits '0', '1', ..., '9' will all be put | |
1619 | in the same equivalence class). Equivalence classes usually give | |
1620 | dramatic reductions in the final table/object file sizes (typically | |
1621 | a factor of 2-5) and are pretty cheap performance-wise (one array | |
1622 | look-up per character scanned). | |
1623 | .IP | |
1624 | .B -Cf | |
1625 | specifies that the | |
1626 | .I full | |
1627 | scanner tables should be generated - | |
1628 | .I flex | |
1629 | should not compress the | |
1630 | tables by taking advantages of similar transition functions for | |
1631 | different states. | |
1632 | .IP | |
1633 | .B -CF | |
1634 | specifies that the alternate fast scanner representation (described | |
1635 | above under the | |
1636 | .B -F | |
1637 | flag) | |
1638 | should be used. | |
1639 | .IP | |
1640 | .B -Cm | |
1641 | directs | |
1642 | .I flex | |
1643 | to construct | |
1644 | .I meta-equivalence classes, | |
1645 | which are sets of equivalence classes (or characters, if equivalence | |
1646 | classes are not being used) that are commonly used together. Meta-equivalence | |
1647 | classes are often a big win when using compressed tables, but they | |
1648 | have a moderate performance impact (one or two "if" tests and one | |
1649 | array look-up per character scanned). | |
1650 | .IP | |
1651 | A lone | |
1652 | .B -C | |
1653 | specifies that the scanner tables should be compressed but neither | |
1654 | equivalence classes nor meta-equivalence classes should be used. | |
1655 | .IP | |
1656 | The options | |
1657 | .B -Cf | |
1658 | or | |
1659 | .B -CF | |
1660 | and | |
1661 | .B -Cm | |
1662 | do not make sense together - there is no opportunity for meta-equivalence | |
1663 | classes if the table is not being compressed. Otherwise the options | |
1664 | may be freely mixed. | |
1665 | .IP | |
1666 | The default setting is | |
1667 | .B -Cem, | |
1668 | which specifies that | |
1669 | .I flex | |
1670 | should generate equivalence classes | |
1671 | and meta-equivalence classes. This setting provides the highest | |
1672 | degree of table compression. You can trade off | |
1673 | faster-executing scanners at the cost of larger tables with | |
1674 | the following generally being true: | |
1675 | .nf | |
1676 | ||
1677 | slowest & smallest | |
1678 | -Cem | |
1679 | -Cm | |
1680 | -Ce | |
1681 | -C | |
1682 | -C{f,F}e | |
1683 | -C{f,F} | |
1684 | fastest & largest | |
1685 | ||
1686 | .fi | |
1687 | Note that scanners with the smallest tables are usually generated and | |
1688 | compiled the quickest, so | |
1689 | during development you will usually want to use the default, maximal | |
1690 | compression. | |
1691 | .IP | |
1692 | .B -Cfe | |
1693 | is often a good compromise between speed and size for production | |
1694 | scanners. | |
1695 | .IP | |
1696 | .B -C | |
1697 | options are not cumulative; whenever the flag is encountered, the | |
1698 | previous -C settings are forgotten. | |
1699 | .TP | |
1700 | .B -Sskeleton_file | |
1701 | overrides the default skeleton file from which | |
1702 | .I flex | |
1703 | constructs its scanners. You'll never need this option unless you are doing | |
1704 | .I flex | |
1705 | maintenance or development. | |
1706 | .SH PERFORMANCE CONSIDERATIONS | |
1707 | The main design goal of | |
1708 | .I flex | |
1709 | is that it generate high-performance scanners. It has been optimized | |
1710 | for dealing well with large sets of rules. Aside from the effects | |
1711 | of table compression on scanner speed outlined above, | |
1712 | there are a number of options/actions which degrade performance. These | |
1713 | are, from most expensive to least: | |
1714 | .nf | |
1715 | ||
1716 | REJECT | |
1717 | ||
1718 | pattern sets that require backtracking | |
1719 | arbitrary trailing context | |
1720 | ||
1721 | '^' beginning-of-line operator | |
1722 | yymore() | |
1723 | ||
1724 | .fi | |
1725 | with the first three all being quite expensive and the last two | |
1726 | being quite cheap. | |
1727 | .LP | |
1728 | .B REJECT | |
1729 | should be avoided at all costs when performance is important. | |
1730 | It is a particularly expensive option. | |
1731 | .LP | |
1732 | Getting rid of backtracking is messy and often may be an enormous | |
1733 | amount of work for a complicated scanner. In principal, one begins | |
1734 | by using the | |
1735 | .B -b | |
1736 | flag to generate a | |
1737 | .I lex.backtrack | |
1738 | file. For example, on the input | |
1739 | .nf | |
1740 | ||
1741 | %% | |
1742 | foo return TOK_KEYWORD; | |
1743 | foobar return TOK_KEYWORD; | |
1744 | ||
1745 | .fi | |
1746 | the file looks like: | |
1747 | .nf | |
1748 | ||
1749 | State #6 is non-accepting - | |
1750 | associated rule line numbers: | |
1751 | 2 3 | |
1752 | out-transitions: [ o ] | |
1753 | jam-transitions: EOF [ \\001-n p-\\177 ] | |
1754 | ||
1755 | State #8 is non-accepting - | |
1756 | associated rule line numbers: | |
1757 | 3 | |
1758 | out-transitions: [ a ] | |
1759 | jam-transitions: EOF [ \\001-` b-\\177 ] | |
1760 | ||
1761 | State #9 is non-accepting - | |
1762 | associated rule line numbers: | |
1763 | 3 | |
1764 | out-transitions: [ r ] | |
1765 | jam-transitions: EOF [ \\001-q s-\\177 ] | |
1766 | ||
1767 | Compressed tables always backtrack. | |
1768 | ||
1769 | .fi | |
1770 | The first few lines tell us that there's a scanner state in | |
1771 | which it can make a transition on an 'o' but not on any other | |
1772 | character, and that in that state the currently scanned text does not match | |
1773 | any rule. The state occurs when trying to match the rules found | |
1774 | at lines 2 and 3 in the input file. | |
1775 | If the scanner is in that state and then reads | |
1776 | something other than an 'o', it will have to backtrack to find | |
1777 | a rule which is matched. With | |
1778 | a bit of headscratching one can see that this must be the | |
1779 | state it's in when it has seen "fo". When this has happened, | |
1780 | if anything other than another 'o' is seen, the scanner will | |
1781 | have to back up to simply match the 'f' (by the default rule). | |
1782 | .LP | |
1783 | The comment regarding State #8 indicates there's a problem | |
1784 | when "foob" has been scanned. Indeed, on any character other | |
1785 | than a 'b', the scanner will have to back up to accept "foo". | |
1786 | Similarly, the comment for State #9 concerns when "fooba" has | |
1787 | been scanned. | |
1788 | .LP | |
1789 | The final comment reminds us that there's no point going to | |
1790 | all the trouble of removing backtracking from the rules unless | |
1791 | we're using | |
1792 | .B -f | |
1793 | or | |
1794 | .B -F, | |
1795 | since there's no performance gain doing so with compressed scanners. | |
1796 | .LP | |
1797 | The way to remove the backtracking is to add "error" rules: | |
1798 | .nf | |
1799 | ||
1800 | %% | |
1801 | foo return TOK_KEYWORD; | |
1802 | foobar return TOK_KEYWORD; | |
1803 | ||
1804 | fooba | | |
1805 | foob | | |
1806 | fo { | |
1807 | /* false alarm, not really a keyword */ | |
1808 | return TOK_ID; | |
1809 | } | |
1810 | ||
1811 | .fi | |
1812 | .LP | |
1813 | Eliminating backtracking among a list of keywords can also be | |
1814 | done using a "catch-all" rule: | |
1815 | .nf | |
1816 | ||
1817 | %% | |
1818 | foo return TOK_KEYWORD; | |
1819 | foobar return TOK_KEYWORD; | |
1820 | ||
1821 | [a-z]+ return TOK_ID; | |
1822 | ||
1823 | .fi | |
1824 | This is usually the best solution when appropriate. | |
1825 | .LP | |
1826 | Backtracking messages tend to cascade. | |
1827 | With a complicated set of rules it's not uncommon to get hundreds | |
1828 | of messages. If one can decipher them, though, it often | |
1829 | only takes a dozen or so rules to eliminate the backtracking (though | |
1830 | it's easy to make a mistake and have an error rule accidentally match | |
1831 | a valid token. A possible future | |
1832 | .I flex | |
1833 | feature will be to automatically add rules to eliminate backtracking). | |
1834 | .LP | |
1835 | .I Variable | |
1836 | trailing context (where both the leading and trailing parts do not have | |
1837 | a fixed length) entails almost the same performance loss as | |
1838 | .I REJECT | |
1839 | (i.e., substantial). So when possible a rule like: | |
1840 | .nf | |
1841 | ||
1842 | %% | |
1843 | mouse|rat/(cat|dog) run(); | |
1844 | ||
1845 | .fi | |
1846 | is better written: | |
1847 | .nf | |
1848 | ||
1849 | %% | |
1850 | mouse/cat|dog run(); | |
1851 | rat/cat|dog run(); | |
1852 | ||
1853 | .fi | |
1854 | or as | |
1855 | .nf | |
1856 | ||
1857 | %% | |
1858 | mouse|rat/cat run(); | |
1859 | mouse|rat/dog run(); | |
1860 | ||
1861 | .fi | |
1862 | Note that here the special '|' action does | |
1863 | .I not | |
1864 | provide any savings, and can even make things worse (see | |
1865 | .B BUGS | |
1866 | in flex(1)). | |
1867 | .LP | |
1868 | Another area where the user can increase a scanner's performance | |
1869 | (and one that's easier to implement) arises from the fact that | |
1870 | the longer the tokens matched, the faster the scanner will run. | |
1871 | This is because with long tokens the processing of most input | |
1872 | characters takes place in the (short) inner scanning loop, and | |
1873 | does not often have to go through the additional work of setting up | |
1874 | the scanning environment (e.g., | |
1875 | .B yytext) | |
1876 | for the action. Recall the scanner for C comments: | |
1877 | .nf | |
1878 | ||
1879 | %x comment | |
1880 | %% | |
1881 | int line_num = 1; | |
1882 | ||
1883 | "/*" BEGIN(comment); | |
1884 | ||
1885 | <comment>[^*\\n]* | |
1886 | <comment>"*"+[^*/\\n]* | |
1887 | <comment>\\n ++line_num; | |
1888 | <comment>"*"+"/" BEGIN(INITIAL); | |
1889 | ||
1890 | .fi | |
1891 | This could be sped up by writing it as: | |
1892 | .nf | |
1893 | ||
1894 | %x comment | |
1895 | %% | |
1896 | int line_num = 1; | |
1897 | ||
1898 | "/*" BEGIN(comment); | |
1899 | ||
1900 | <comment>[^*\\n]* | |
1901 | <comment>[^*\\n]*\\n ++line_num; | |
1902 | <comment>"*"+[^*/\\n]* | |
1903 | <comment>"*"+[^*/\\n]*\\n ++line_num; | |
1904 | <comment>"*"+"/" BEGIN(INITIAL); | |
1905 | ||
1906 | .fi | |
1907 | Now instead of each newline requiring the processing of another | |
1908 | action, recognizing the newlines is "distributed" over the other rules | |
1909 | to keep the matched text as long as possible. Note that | |
1910 | .I adding | |
1911 | rules does | |
1912 | .I not | |
1913 | slow down the scanner! The speed of the scanner is independent | |
1914 | of the number of rules or (modulo the considerations given at the | |
1915 | beginning of this section) how complicated the rules are with | |
1916 | regard to operators such as '*' and '|'. | |
1917 | .LP | |
1918 | A final example in speeding up a scanner: suppose you want to scan | |
1919 | through a file containing identifiers and keywords, one per line | |
1920 | and with no other extraneous characters, and recognize all the | |
1921 | keywords. A natural first approach is: | |
1922 | .nf | |
1923 | ||
1924 | %% | |
1925 | asm | | |
1926 | auto | | |
1927 | break | | |
1928 | ... etc ... | |
1929 | volatile | | |
1930 | while /* it's a keyword */ | |
1931 | ||
1932 | .|\\n /* it's not a keyword */ | |
1933 | ||
1934 | .fi | |
1935 | To eliminate the back-tracking, introduce a catch-all rule: | |
1936 | .nf | |
1937 | ||
1938 | %% | |
1939 | asm | | |
1940 | auto | | |
1941 | break | | |
1942 | ... etc ... | |
1943 | volatile | | |
1944 | while /* it's a keyword */ | |
1945 | ||
1946 | [a-z]+ | | |
1947 | .|\\n /* it's not a keyword */ | |
1948 | ||
1949 | .fi | |
1950 | Now, if it's guaranteed that there's exactly one word per line, | |
1951 | then we can reduce the total number of matches by a half by | |
1952 | merging in the recognition of newlines with that of the other | |
1953 | tokens: | |
1954 | .nf | |
1955 | ||
1956 | %% | |
1957 | asm\\n | | |
1958 | auto\\n | | |
1959 | break\\n | | |
1960 | ... etc ... | |
1961 | volatile\\n | | |
1962 | while\\n /* it's a keyword */ | |
1963 | ||
1964 | [a-z]+\\n | | |
1965 | .|\\n /* it's not a keyword */ | |
1966 | ||
1967 | .fi | |
1968 | One has to be careful here, as we have now reintroduced backtracking | |
1969 | into the scanner. In particular, while | |
1970 | .I we | |
1971 | know that there will never be any characters in the input stream | |
1972 | other than letters or newlines, | |
1973 | .I flex | |
1974 | can't figure this out, and it will plan for possibly needing backtracking | |
1975 | when it has scanned a token like "auto" and then the next character | |
1976 | is something other than a newline or a letter. Previously it would | |
1977 | then just match the "auto" rule and be done, but now it has no "auto" | |
1978 | rule, only a "auto\\n" rule. To eliminate the possibility of backtracking, | |
1979 | we could either duplicate all rules but without final newlines, or, | |
1980 | since we never expect to encounter such an input and therefore don't | |
1981 | how it's classified, we can introduce one more catch-all rule, this | |
1982 | one which doesn't include a newline: | |
1983 | .nf | |
1984 | ||
1985 | %% | |
1986 | asm\\n | | |
1987 | auto\\n | | |
1988 | break\\n | | |
1989 | ... etc ... | |
1990 | volatile\\n | | |
1991 | while\\n /* it's a keyword */ | |
1992 | ||
1993 | [a-z]+\\n | | |
1994 | [a-z]+ | | |
1995 | .|\\n /* it's not a keyword */ | |
1996 | ||
1997 | .fi | |
1998 | Compiled with | |
1999 | .B -Cf, | |
2000 | this is about as fast as one can get a | |
2001 | .I flex | |
2002 | scanner to go for this particular problem. | |
2003 | .LP | |
2004 | A final note: | |
2005 | .I flex | |
2006 | is slow when matching NUL's, particularly when a token contains | |
2007 | multiple NUL's. | |
2008 | It's best to write rules which match | |
2009 | .I short | |
2010 | amounts of text if it's anticipated that the text will often include NUL's. | |
2011 | .SH INCOMPATIBILITIES WITH LEX AND POSIX | |
2012 | .I flex | |
2013 | is a rewrite of the Unix | |
2014 | .I lex | |
2015 | tool (the two implementations do not share any code, though), | |
2016 | with some extensions and incompatibilities, both of which | |
2017 | are of concern to those who wish to write scanners acceptable | |
2018 | to either implementation. At present, the POSIX | |
2019 | .I lex | |
2020 | draft is | |
2021 | very close to the original | |
2022 | .I lex | |
2023 | implementation, so some of these | |
2024 | incompatibilities are also in conflict with the POSIX draft. But | |
2025 | the intent is that except as noted below, | |
2026 | .I flex | |
2027 | as it presently stands will | |
2028 | ultimately be POSIX conformant (i.e., that those areas of conflict with | |
2029 | the POSIX draft will be resolved in | |
2030 | .I flex's | |
2031 | favor). Please bear in | |
2032 | mind that all the comments which follow are with regard to the POSIX | |
2033 | .I draft | |
2034 | standard of Summer 1989, and not the final document (or subsequent | |
2035 | drafts); they are included so | |
2036 | .I flex | |
2037 | users can be aware of the standardization issues and those areas where | |
2038 | .I flex | |
2039 | may in the near future undergo changes incompatible with | |
2040 | its current definition. | |
2041 | .LP | |
2042 | .I flex | |
2043 | is fully compatible with | |
2044 | .I lex | |
2045 | with the following exceptions: | |
2046 | .IP - | |
78ed81a3 | 2047 | The undocumented |
2048 | .I lex | |
2049 | scanner internal variable | |
2050 | .B yylineno | |
2051 | is not supported. It is difficult to support this option efficiently, | |
2052 | since it requires examining every character scanned and reexamining | |
2053 | the characters when the scanner backs up. | |
2054 | Things get more complicated when the end of buffer or file is reached or a | |
2055 | NUL is scanned (since the scan must then be restarted with the proper line | |
2056 | number count), or the user uses the yyless(), unput(), or REJECT actions, | |
2057 | or the multiple input buffer functions. | |
2058 | .IP | |
2059 | The fix is to add rules which, upon seeing a newline, increment | |
2060 | yylineno. This is usually an easy process, though it can be a drag if some | |
2061 | of the patterns can match multiple newlines along with other characters. | |
2062 | .IP | |
2063 | yylineno is not part of the POSIX draft. | |
2064 | .IP - | |
2065 | The | |
2066 | .B input() | |
2067 | routine is not redefinable, though it may be called to read characters | |
2068 | following whatever has been matched by a rule. If | |
2069 | .B input() | |
2070 | encounters an end-of-file the normal | |
2071 | .B yywrap() | |
2072 | processing is done. A ``real'' end-of-file is returned by | |
2073 | .B input() | |
2074 | as | |
2075 | .I EOF. | |
2076 | .IP | |
2077 | Input is instead controlled by redefining the | |
2078 | .B YY_INPUT | |
2079 | macro. | |
2080 | .IP | |
2081 | The | |
2082 | .I flex | |
2083 | restriction that | |
2084 | .B input() | |
2085 | cannot be redefined is in accordance with the POSIX draft, but | |
2086 | .B YY_INPUT | |
2087 | has not yet been accepted into the draft (and probably won't; it looks | |
2088 | like the draft will simply not specify any way of controlling the | |
2089 | scanner's input other than by making an initial assignment to | |
2090 | .I yyin). | |
2091 | .IP - | |
2092 | .I flex | |
2093 | scanners do not use stdio for input. Because of this, when writing an | |
2094 | interactive scanner one must explicitly call fflush() on the | |
2095 | stream associated with the terminal after writing out a prompt. | |
2096 | With | |
2097 | .I lex | |
2098 | such writes are automatically flushed since | |
2099 | .I lex | |
2100 | scanners use | |
2101 | .B getchar() | |
2102 | for their input. Also, when writing interactive scanners with | |
2103 | .I flex, | |
2104 | the | |
2105 | .B -I | |
2106 | flag must be used. | |
2107 | .IP - | |
2108 | .I flex | |
2109 | scanners are not as reentrant as | |
2110 | .I lex | |
2111 | scanners. In particular, if you have an interactive scanner and | |
2112 | an interrupt handler which long-jumps out of the scanner, and | |
2113 | the scanner is subsequently called again, you may get the following | |
2114 | message: | |
2115 | .nf | |
2116 | ||
2117 | fatal flex scanner internal error--end of buffer missed | |
2118 | ||
2119 | .fi | |
2120 | To reenter the scanner, first use | |
2121 | .nf | |
2122 | ||
2123 | yyrestart( yyin ); | |
2124 | ||
2125 | .fi | |
2126 | .IP - | |
2127 | .B output() | |
2128 | is not supported. | |
2129 | Output from the | |
2130 | .B ECHO | |
2131 | macro is done to the file-pointer | |
2132 | .I yyout | |
2133 | (default | |
2134 | .I stdout). | |
2135 | .IP | |
2136 | The POSIX draft mentions that an | |
2137 | .B output() | |
2138 | routine exists but currently gives no details as to what it does. | |
2139 | .IP - | |
15637ed4 RG |
2140 | .I lex |
2141 | does not support exclusive start conditions (%x), though they | |
2142 | are in the current POSIX draft. | |
2143 | .IP - | |
2144 | When definitions are expanded, | |
2145 | .I flex | |
2146 | encloses them in parentheses. | |
2147 | With lex, the following: | |
2148 | .nf | |
2149 | ||
2150 | NAME [A-Z][A-Z0-9]* | |
2151 | %% | |
2152 | foo{NAME}? printf( "Found it\\n" ); | |
2153 | %% | |
2154 | ||
2155 | .fi | |
2156 | will not match the string "foo" because when the macro | |
2157 | is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?" | |
2158 | and the precedence is such that the '?' is associated with | |
2159 | "[A-Z0-9]*". With | |
2160 | .I flex, | |
2161 | the rule will be expanded to | |
2162 | "foo([A-Z][A-Z0-9]*)?" and so the string "foo" will match. | |
2163 | Note that because of this, the | |
2164 | .B ^, $, <s>, /, | |
2165 | and | |
2166 | .B <<EOF>> | |
2167 | operators cannot be used in a | |
2168 | .I flex | |
2169 | definition. | |
2170 | .IP | |
2171 | The POSIX draft interpretation is the same as | |
2172 | .I flex's. | |
2173 | .IP - | |
2174 | To specify a character class which matches anything but a left bracket (']'), | |
2175 | in | |
2176 | .I lex | |
2177 | one can use "[^]]" but with | |
2178 | .I flex | |
2179 | one must use "[^\\]]". The latter works with | |
2180 | .I lex, | |
2181 | too. | |
2182 | .IP - | |
15637ed4 RG |
2183 | The |
2184 | .I lex | |
2185 | .B %r | |
2186 | (generate a Ratfor scanner) option is not supported. It is not part | |
2187 | of the POSIX draft. | |
2188 | .IP - | |
2189 | If you are providing your own yywrap() routine, you must include a | |
2190 | "#undef yywrap" in the definitions section (section 1). Note that | |
2191 | the "#undef" will have to be enclosed in %{}'s. | |
2192 | .IP | |
2193 | The POSIX draft | |
78ed81a3 | 2194 | specifies that yywrap() is a function and this is very unlikely to change; so |
15637ed4 RG |
2195 | .I flex users are warned |
2196 | that | |
2197 | .B yywrap() | |
2198 | is likely to be changed to a function in the near future. | |
2199 | .IP - | |
2200 | After a call to | |
2201 | .B unput(), | |
2202 | .I yytext | |
2203 | and | |
2204 | .I yyleng | |
2205 | are undefined until the next token is matched. This is not the case with | |
2206 | .I lex | |
2207 | or the present POSIX draft. | |
2208 | .IP - | |
2209 | The precedence of the | |
2210 | .B {} | |
2211 | (numeric range) operator is different. | |
2212 | .I lex | |
2213 | interprets "abc{1,3}" as "match one, two, or | |
2214 | three occurrences of 'abc'", whereas | |
2215 | .I flex | |
2216 | interprets it as "match 'ab' | |
2217 | followed by one, two, or three occurrences of 'c'". The latter is | |
2218 | in agreement with the current POSIX draft. | |
2219 | .IP - | |
2220 | The precedence of the | |
2221 | .B ^ | |
2222 | operator is different. | |
2223 | .I lex | |
2224 | interprets "^foo|bar" as "match either 'foo' at the beginning of a line, | |
2225 | or 'bar' anywhere", whereas | |
2226 | .I flex | |
2227 | interprets it as "match either 'foo' or 'bar' if they come at the beginning | |
2228 | of a line". The latter is in agreement with the current POSIX draft. | |
2229 | .IP - | |
2230 | To refer to yytext outside of the scanner source file, | |
2231 | the correct definition with | |
2232 | .I flex | |
2233 | is "extern char *yytext" rather than "extern char yytext[]". | |
2234 | This is contrary to the current POSIX draft but a point on which | |
2235 | .I flex | |
2236 | will not be changing, as the array representation entails a | |
2237 | serious performance penalty. It is hoped that the POSIX draft will | |
2238 | be emended to support the | |
2239 | .I flex | |
2240 | variety of declaration (as this is a fairly painless change to | |
2241 | require of | |
2242 | .I lex | |
2243 | users). | |
2244 | .IP - | |
2245 | .I yyin | |
2246 | is | |
2247 | .I initialized | |
2248 | by | |
2249 | .I lex | |
2250 | to be | |
2251 | .I stdin; | |
2252 | .I flex, | |
2253 | on the other hand, | |
2254 | initializes | |
2255 | .I yyin | |
2256 | to NULL | |
2257 | and then | |
2258 | .I assigns | |
2259 | it to | |
2260 | .I stdin | |
2261 | the first time the scanner is called, providing | |
2262 | .I yyin | |
2263 | has not already been assigned to a non-NULL value. The difference is | |
2264 | subtle, but the net effect is that with | |
2265 | .I flex | |
2266 | scanners, | |
2267 | .I yyin | |
2268 | does not have a valid value until the scanner has been called. | |
2269 | .IP - | |
2270 | The special table-size declarations such as | |
2271 | .B %a | |
2272 | supported by | |
2273 | .I lex | |
2274 | are not required by | |
2275 | .I flex | |
2276 | scanners; | |
2277 | .I flex | |
2278 | ignores them. | |
2279 | .IP - | |
2280 | The name | |
2281 | .bd | |
2282 | FLEX_SCANNER | |
2283 | is #define'd so scanners may be written for use with either | |
2284 | .I flex | |
2285 | or | |
2286 | .I lex. | |
2287 | .LP | |
2288 | The following | |
2289 | .I flex | |
2290 | features are not included in | |
2291 | .I lex | |
2292 | or the POSIX draft standard: | |
2293 | .nf | |
2294 | ||
2295 | yyterminate() | |
2296 | <<EOF>> | |
2297 | YY_DECL | |
2298 | #line directives | |
2299 | %{}'s around actions | |
2300 | yyrestart() | |
2301 | comments beginning with '#' (deprecated) | |
2302 | multiple actions on a line | |
2303 | ||
2304 | .fi | |
2305 | This last feature refers to the fact that with | |
2306 | .I flex | |
2307 | you can put multiple actions on the same line, separated with | |
2308 | semi-colons, while with | |
2309 | .I lex, | |
2310 | the following | |
2311 | .nf | |
2312 | ||
2313 | foo handle_foo(); ++num_foos_seen; | |
2314 | ||
2315 | .fi | |
2316 | is (rather surprisingly) truncated to | |
2317 | .nf | |
2318 | ||
2319 | foo handle_foo(); | |
2320 | ||
2321 | .fi | |
2322 | .I flex | |
2323 | does not truncate the action. Actions that are not enclosed in | |
2324 | braces are simply terminated at the end of the line. | |
2325 | .SH DIAGNOSTICS | |
2326 | .I reject_used_but_not_detected undefined | |
2327 | or | |
2328 | .I yymore_used_but_not_detected undefined - | |
2329 | These errors can occur at compile time. They indicate that the | |
2330 | scanner uses | |
2331 | .B REJECT | |
2332 | or | |
2333 | .B yymore() | |
2334 | but that | |
2335 | .I flex | |
2336 | failed to notice the fact, meaning that | |
2337 | .I flex | |
2338 | scanned the first two sections looking for occurrences of these actions | |
2339 | and failed to find any, but somehow you snuck some in (via a #include | |
2340 | file, for example). Make an explicit reference to the action in your | |
2341 | .I flex | |
2342 | input file. (Note that previously | |
2343 | .I flex | |
2344 | supported a | |
2345 | .B %used/%unused | |
2346 | mechanism for dealing with this problem; this feature is still supported | |
2347 | but now deprecated, and will go away soon unless the author hears from | |
2348 | people who can argue compellingly that they need it.) | |
2349 | .LP | |
2350 | .I flex scanner jammed - | |
2351 | a scanner compiled with | |
2352 | .B -s | |
2353 | has encountered an input string which wasn't matched by | |
2354 | any of its rules. | |
2355 | .LP | |
2356 | .I flex input buffer overflowed - | |
2357 | a scanner rule matched a string long enough to overflow the | |
2358 | scanner's internal input buffer (16K bytes by default - controlled by | |
2359 | .B YY_BUF_SIZE | |
78ed81a3 | 2360 | in "lex.skel". Note that to redefine this macro, you must first |
15637ed4 RG |
2361 | .B #undefine |
2362 | it). | |
2363 | .LP | |
2364 | .I scanner requires -8 flag - | |
2365 | Your scanner specification includes recognizing 8-bit characters and | |
2366 | you did not specify the -8 flag (and your site has not installed flex | |
2367 | with -8 as the default). | |
2368 | .LP | |
78ed81a3 | 2369 | .I |
2370 | fatal flex scanner internal error--end of buffer missed - | |
2371 | This can occur in an scanner which is reentered after a long-jump | |
2372 | has jumped out (or over) the scanner's activation frame. Before | |
2373 | reentering the scanner, use: | |
2374 | .nf | |
2375 | ||
2376 | yyrestart( yyin ); | |
2377 | ||
2378 | .fi | |
2379 | .LP | |
15637ed4 RG |
2380 | .I too many %t classes! - |
2381 | You managed to put every single character into its own %t class. | |
2382 | .I flex | |
2383 | requires that at least one of the classes share characters. | |
2384 | .SH DEFICIENCIES / BUGS | |
2385 | See flex(1). | |
2386 | .SH "SEE ALSO" | |
2387 | .LP | |
2388 | flex(1), lex(1), yacc(1), sed(1), awk(1). | |
2389 | .LP | |
2390 | M. E. Lesk and E. Schmidt, | |
2391 | .I LEX - Lexical Analyzer Generator | |
2392 | .SH AUTHOR | |
2393 | Vern Paxson, with the help of many ideas and much inspiration from | |
2394 | Van Jacobson. Original version by Jef Poskanzer. The fast table | |
2395 | representation is a partial implementation of a design done by Van | |
2396 | Jacobson. The implementation was done by Kevin Gong and Vern Paxson. | |
2397 | .LP | |
2398 | Thanks to the many | |
2399 | .I flex | |
2400 | beta-testers, feedbackers, and contributors, especially Casey | |
78ed81a3 | 2401 | Leedom, benson@odi.com, Keith Bostic, |
15637ed4 RG |
2402 | Frederic Brehm, Nick Christopher, Jason Coughlin, |
2403 | Scott David Daniels, Leo Eskin, | |
2404 | Chris Faylor, Eric Goldman, Eric | |
2405 | Hughes, Jeffrey R. Jones, Kevin B. Kenny, Ronald Lamprecht, | |
2406 | Greg Lee, Craig Leres, Mohamed el Lozy, Jim Meyering, Marc Nozell, Esmond Pitt, | |
2407 | Jef Poskanzer, Jim Roskind, | |
2408 | Dave Tallman, Frank Whaley, Ken Yap, and those whose names | |
2409 | have slipped my marginal mail-archiving skills but whose contributions | |
2410 | are appreciated all the same. | |
2411 | .LP | |
2412 | Thanks to Keith Bostic, John Gilmore, Craig Leres, Bob | |
2413 | Mulcahy, Rich Salz, and Richard Stallman for help with various distribution | |
2414 | headaches. | |
2415 | .LP | |
2416 | Thanks to Esmond Pitt and Earle Horton for 8-bit character support; | |
2417 | to Benson Margulies and Fred | |
2418 | Burke for C++ support; to Ove Ewerlid for the basics of support for | |
2419 | NUL's; and to Eric Hughes for the basics of support for multiple buffers. | |
2420 | .LP | |
2421 | Work is being done on extending | |
2422 | .I flex | |
2423 | to generate scanners in which the | |
2424 | state machine is directly represented in C code rather than tables. | |
2425 | These scanners may well be substantially faster than those generated | |
2426 | using -f or -F. If you are working in this area and are interested | |
2427 | in comparing notes and seeing whether redundant work can be avoided, | |
2428 | contact Ove Ewerlid (ewerlid@mizar.DoCS.UU.SE). | |
2429 | .LP | |
2430 | This work was primarily done when I was at the Real Time Systems Group | |
2431 | at the Lawrence Berkeley Laboratory in Berkeley, CA. Many thanks to all there | |
2432 | for the support I received. | |
2433 | .LP | |
2434 | Send comments to: | |
2435 | .nf | |
2436 | ||
2437 | Vern Paxson | |
78ed81a3 | 2438 | Computer Systems Engineering |
2439 | Bldg. 46A, Room 1123 | |
2440 | Lawrence Berkeley Laboratory | |
2441 | University of California | |
2442 | Berkeley, CA 94720 | |
2443 | ||
2444 | vern@ee.lbl.gov | |
2445 | ucbvax!ee.lbl.gov!vern | |
15637ed4 RG |
2446 | |
2447 | .fi |