Commit | Line | Data |
---|---|---|
810a4854 C |
1 | .TH FLEX 1 "26 May 1990" "Version 2.3" |
2 | .SH NAME | |
3 | flex - fast lexical analyzer generator | |
4 | .SH SYNOPSIS | |
5 | .B flex | |
6 | .B [-bcdfinpstvFILT8 -C[efmF] -Sskeleton] | |
7 | .I [filename ...] | |
8 | .SH DESCRIPTION | |
9 | .I flex | |
10 | is a tool for generating | |
11 | .I scanners: | |
12 | programs which recognize lexical patterns in text. | |
13 | .I flex | |
14 | reads | |
15 | the given input files, or its standard input if no file names are given, | |
16 | for a description of a scanner to generate. The description is in | |
17 | the form of pairs | |
18 | of regular expressions and C code, called | |
19 | .I rules. flex | |
20 | generates as output a C source file, | |
21 | .B lex.yy.c, | |
22 | which defines a routine | |
23 | .B yylex(). | |
24 | This file is compiled and linked with the | |
25 | .B -lfl | |
26 | library to produce an executable. When the executable is run, | |
27 | it analyzes its input for occurrences | |
28 | of the regular expressions. Whenever it finds one, it executes | |
29 | the corresponding C code. | |
30 | .SH SOME SIMPLE EXAMPLES | |
31 | .LP | |
32 | First some simple examples to get the flavor of how one uses | |
33 | .I flex. | |
34 | The following | |
35 | .I flex | |
36 | input specifies a scanner which, whenever it encounters the string | |
37 | "username", replaces it with the user's login name: | |
38 | .nf | |
39 | ||
40 | %% | |
41 | username printf( "%s", getlogin() ); | |
42 | ||
43 | .fi | |
44 | By default, any text not matched by a | |
45 | .I flex | |
46 | scanner | |
47 | is copied to the output, so the net effect of this scanner is | |
48 | to copy its input file to its output with each occurrence | |
49 | of "username" expanded. | |
50 | In this input, there is just one rule. "username" is the | |
51 | .I pattern | |
52 | and the "printf" is the | |
53 | .I action. | |
54 | The "%%" marks the beginning of the rules. | |
55 | .LP | |
56 | Here's another simple example: | |
57 | .nf | |
58 | ||
59 | int num_lines = 0, num_chars = 0; | |
60 | ||
61 | %% | |
62 | \\n ++num_lines; ++num_chars; | |
63 | . ++num_chars; | |
64 | ||
65 | %% | |
66 | main() | |
67 | { | |
68 | yylex(); | |
69 | printf( "# of lines = %d, # of chars = %d\\n", | |
70 | num_lines, num_chars ); | |
71 | } | |
72 | ||
73 | .fi | |
74 | This scanner counts the number of characters and the number | |
75 | of lines in its input (it produces no output other than the | |
76 | final report on the counts). The first line | |
77 | declares two globals, "num_lines" and "num_chars", which are accessible | |
78 | both inside | |
79 | .B yylex() | |
80 | and in the | |
81 | .B main() | |
82 | routine declared after the second "%%". There are two rules, one | |
83 | which matches a newline ("\\n") and increments both the line count and | |
84 | the character count, and one which matches any character other than | |
85 | a newline (indicated by the "." regular expression). | |
86 | .LP | |
87 | A somewhat more complicated example: | |
88 | .nf | |
89 | ||
90 | /* scanner for a toy Pascal-like language */ | |
91 | ||
92 | %{ | |
93 | /* need this for the call to atof() below */ | |
94 | #include <math.h> | |
95 | %} | |
96 | ||
97 | DIGIT [0-9] | |
98 | ID [a-z][a-z0-9]* | |
99 | ||
100 | %% | |
101 | ||
102 | {DIGIT}+ { | |
103 | printf( "An integer: %s (%d)\\n", yytext, | |
104 | atoi( yytext ) ); | |
105 | } | |
106 | ||
107 | {DIGIT}+"."{DIGIT}* { | |
108 | printf( "A float: %s (%g)\\n", yytext, | |
109 | atof( yytext ) ); | |
110 | } | |
111 | ||
112 | if|then|begin|end|procedure|function { | |
113 | printf( "A keyword: %s\\n", yytext ); | |
114 | } | |
115 | ||
116 | {ID} printf( "An identifier: %s\\n", yytext ); | |
117 | ||
118 | "+"|"-"|"*"|"/" printf( "An operator: %s\\n", yytext ); | |
119 | ||
120 | "{"[^}\\n]*"}" /* eat up one-line comments */ | |
121 | ||
122 | [ \\t\\n]+ /* eat up whitespace */ | |
123 | ||
124 | . printf( "Unrecognized character: %s\\n", yytext ); | |
125 | ||
126 | %% | |
127 | ||
128 | main( argc, argv ) | |
129 | int argc; | |
130 | char **argv; | |
131 | { | |
132 | ++argv, --argc; /* skip over program name */ | |
133 | if ( argc > 0 ) | |
134 | yyin = fopen( argv[0], "r" ); | |
135 | else | |
136 | yyin = stdin; | |
137 | ||
138 | yylex(); | |
139 | } | |
140 | ||
141 | .fi | |
142 | This is the beginning of a simple scanner for a language like | |
143 | Pascal. It identifies different types of | |
144 | .I tokens | |
145 | and reports on what it has seen. | |
146 | .LP | |
147 | The details of this example will be explained in the following | |
148 | sections. | |
149 | .SH FORMAT OF THE INPUT FILE | |
150 | The | |
151 | .I flex | |
152 | input file consists of three sections, separated by a line with just | |
153 | .B %% | |
154 | in it: | |
155 | .nf | |
156 | ||
157 | definitions | |
158 | %% | |
159 | rules | |
160 | %% | |
161 | user code | |
162 | ||
163 | .fi | |
164 | The | |
165 | .I definitions | |
166 | section contains declarations of simple | |
167 | .I name | |
168 | definitions to simplify the scanner specification, and declarations of | |
169 | .I start conditions, | |
170 | which are explained in a later section. | |
171 | .LP | |
172 | Name definitions have the form: | |
173 | .nf | |
174 | ||
175 | name definition | |
176 | ||
177 | .fi | |
178 | The "name" is a word beginning with a letter or an underscore ('_') | |
179 | followed by zero or more letters, digits, '_', or '-' (dash). | |
180 | The definition is taken to begin at the first non-white-space character | |
181 | following the name and continuing to the end of the line. | |
182 | The definition can subsequently be referred to using "{name}", which | |
183 | will expand to "(definition)". For example, | |
184 | .nf | |
185 | ||
186 | DIGIT [0-9] | |
187 | ID [a-z][a-z0-9]* | |
188 | ||
189 | .fi | |
190 | defines "DIGIT" to be a regular expression which matches a | |
191 | single digit, and | |
192 | "ID" to be a regular expression which matches a letter | |
193 | followed by zero-or-more letters-or-digits. | |
194 | A subsequent reference to | |
195 | .nf | |
196 | ||
197 | {DIGIT}+"."{DIGIT}* | |
198 | ||
199 | .fi | |
200 | is identical to | |
201 | .nf | |
202 | ||
203 | ([0-9])+"."([0-9])* | |
204 | ||
205 | .fi | |
206 | and matches one-or-more digits followed by a '.' followed | |
207 | by zero-or-more digits. | |
208 | .LP | |
209 | The | |
210 | .I rules | |
211 | section of the | |
212 | .I flex | |
213 | input contains a series of rules of the form: | |
214 | .nf | |
215 | ||
216 | pattern action | |
217 | ||
218 | .fi | |
219 | where the pattern must be unindented and the action must begin | |
220 | on the same line. | |
221 | .LP | |
222 | See below for a further description of patterns and actions. | |
223 | .LP | |
224 | Finally, the user code section is simply copied to | |
225 | .B lex.yy.c | |
226 | verbatim. | |
227 | It is used for companion routines which call or are called | |
228 | by the scanner. The presence of this section is optional; | |
229 | if it is missing, the second | |
230 | .B %% | |
231 | in the input file may be skipped, too. | |
232 | .LP | |
233 | In the definitions and rules sections, any | |
234 | .I indented | |
235 | text or text enclosed in | |
236 | .B %{ | |
237 | and | |
238 | .B %} | |
239 | is copied verbatim to the output (with the %{}'s removed). | |
240 | The %{}'s must appear unindented on lines by themselves. | |
241 | .LP | |
242 | In the rules section, | |
243 | any indented or %{} text appearing before the | |
244 | first rule may be used to declare variables | |
245 | which are local to the scanning routine and (after the declarations) | |
246 | code which is to be executed whenever the scanning routine is entered. | |
247 | Other indented or %{} text in the rule section is still copied to the output, | |
248 | but its meaning is not well-defined and it may well cause compile-time | |
249 | errors (this feature is present for | |
250 | .I POSIX | |
251 | compliance; see below for other such features). | |
252 | .LP | |
253 | In the definitions section, an unindented comment (i.e., a line | |
254 | beginning with "/*") is also copied verbatim to the output up | |
255 | to the next "*/". Also, any line in the definitions section | |
256 | beginning with '#' is ignored, though this style of comment is | |
257 | deprecated and may go away in the future. | |
258 | .SH PATTERNS | |
259 | The patterns in the input are written using an extended set of regular | |
260 | expressions. These are: | |
261 | .nf | |
262 | ||
263 | x match the character 'x' | |
264 | . any character except newline | |
265 | [xyz] a "character class"; in this case, the pattern | |
266 | matches either an 'x', a 'y', or a 'z' | |
267 | [abj-oZ] a "character class" with a range in it; matches | |
268 | an 'a', a 'b', any letter from 'j' through 'o', | |
269 | or a 'Z' | |
270 | [^A-Z] a "negated character class", i.e., any character | |
271 | but those in the class. In this case, any | |
272 | character EXCEPT an uppercase letter. | |
273 | [^A-Z\\n] any character EXCEPT an uppercase letter or | |
274 | a newline | |
275 | r* zero or more r's, where r is any regular expression | |
276 | r+ one or more r's | |
277 | r? zero or one r's (that is, "an optional r") | |
278 | r{2,5} anywhere from two to five r's | |
279 | r{2,} two or more r's | |
280 | r{4} exactly 4 r's | |
281 | {name} the expansion of the "name" definition | |
282 | (see above) | |
283 | "[xyz]\\"foo" | |
284 | the literal string: [xyz]"foo | |
285 | \\X if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', | |
286 | then the ANSI-C interpretation of \\x. | |
287 | Otherwise, a literal 'X' (used to escape | |
288 | operators such as '*') | |
289 | \\123 the character with octal value 123 | |
290 | \\x2a the character with hexadecimal value 2a | |
291 | (r) match an r; parentheses are used to override | |
292 | precedence (see below) | |
293 | ||
294 | ||
295 | rs the regular expression r followed by the | |
296 | regular expression s; called "concatenation" | |
297 | ||
298 | ||
299 | r|s either an r or an s | |
300 | ||
301 | ||
302 | r/s an r but only if it is followed by an s. The | |
303 | s is not part of the matched text. This type | |
304 | of pattern is called "trailing context". | |
305 | ^r an r, but only at the beginning of a line | |
306 | r$ an r, but only at the end of a line. Equivalent | |
307 | to "r/\\n". | |
308 | ||
309 | ||
310 | <s>r an r, but only in start condition s (see | |
311 | below for discussion of start conditions) | |
312 | <s1,s2,s3>r | |
313 | same, but in any of start conditions s1, | |
314 | s2, or s3 | |
315 | ||
316 | ||
317 | <<EOF>> an end-of-file | |
318 | <s1,s2><<EOF>> | |
319 | an end-of-file when in start condition s1 or s2 | |
320 | ||
321 | .fi | |
322 | The regular expressions listed above are grouped according to | |
323 | precedence, from highest precedence at the top to lowest at the bottom. | |
324 | Those grouped together have equal precedence. For example, | |
325 | .nf | |
326 | ||
327 | foo|bar* | |
328 | ||
329 | .fi | |
330 | is the same as | |
331 | .nf | |
332 | ||
333 | (foo)|(ba(r*)) | |
334 | ||
335 | .fi | |
336 | since the '*' operator has higher precedence than concatenation, | |
337 | and concatenation higher than alternation ('|'). This pattern | |
338 | therefore matches | |
339 | .I either | |
340 | the string "foo" | |
341 | .I or | |
342 | the string "ba" followed by zero-or-more r's. | |
343 | To match "foo" or zero-or-more "bar"'s, use: | |
344 | .nf | |
345 | ||
346 | foo|(bar)* | |
347 | ||
348 | .fi | |
349 | and to match zero-or-more "foo"'s-or-"bar"'s: | |
350 | .nf | |
351 | ||
352 | (foo|bar)* | |
353 | ||
354 | .fi | |
355 | .LP | |
356 | Some notes on patterns: | |
357 | .IP - | |
358 | A negated character class such as the example "[^A-Z]" | |
359 | above | |
360 | .I will match a newline | |
361 | unless "\\n" (or an equivalent escape sequence) is one of the | |
362 | characters explicitly present in the negated character class | |
363 | (e.g., "[^A-Z\\n]"). This is unlike how many other regular | |
364 | expression tools treat negated character classes, but unfortunately | |
365 | the inconsistency is historically entrenched. | |
366 | Matching newlines means that a pattern like [^"]* can match an entire | |
367 | input (overflowing the scanner's input buffer) unless there's another | |
368 | quote in the input. | |
369 | .IP - | |
370 | A rule can have at most one instance of trailing context (the '/' operator | |
371 | or the '$' operator). The start condition, '^', and "<<EOF>>" patterns | |
372 | can only occur at the beginning of a pattern and, like '/' and '$', | |
373 | cannot be grouped inside parentheses. A '^' which does not occur at | |
374 | the beginning of a rule or a '$' which does not occur at the end of | |
375 | a rule loses its special properties and is treated as a normal character. | |
376 | .IP | |
377 | The following are illegal: | |
378 | .nf | |
379 | ||
380 | foo/bar$ | |
381 | <sc1>foo<sc2>bar | |
382 | ||
383 | .fi | |
384 | Note that the first of these can be written "foo/bar\\n". | |
385 | .IP | |
386 | The following will result in '$' or '^' being treated as a normal character: | |
387 | .nf | |
388 | ||
389 | foo|(bar$) | |
390 | foo|^bar | |
391 | ||
392 | .fi | |
393 | If what's wanted is a "foo" or a bar-followed-by-a-newline, the following | |
394 | could be used (the special '|' action is explained below): | |
395 | .nf | |
396 | ||
397 | foo | | |
398 | bar$ /* action goes here */ | |
399 | ||
400 | .fi | |
401 | A similar trick will work for matching a foo or a | |
402 | bar-at-the-beginning-of-a-line. | |
403 | .SH HOW THE INPUT IS MATCHED | |
404 | When the generated scanner is run, it analyzes its input looking | |
405 | for strings which match any of its patterns. If it finds more than | |
406 | one match, it takes the one matching the most text (for trailing | |
407 | context rules, this includes the length of the trailing part, even | |
408 | though it will then be returned to the input). If it finds two | |
409 | or more matches of the same length, the | |
410 | rule listed first in the | |
411 | .I flex | |
412 | input file is chosen. | |
413 | .LP | |
414 | Once the match is determined, the text corresponding to the match | |
415 | (called the | |
416 | .I token) | |
417 | is made available in the global character pointer | |
418 | .B yytext, | |
419 | and its length in the global integer | |
420 | .B yyleng. | |
421 | The | |
422 | .I action | |
423 | corresponding to the matched pattern is then executed (a more | |
424 | detailed description of actions follows), and then the remaining | |
425 | input is scanned for another match. | |
426 | .LP | |
427 | If no match is found, then the | |
428 | .I default rule | |
429 | is executed: the next character in the input is considered matched and | |
430 | copied to the standard output. Thus, the simplest legal | |
431 | .I flex | |
432 | input is: | |
433 | .nf | |
434 | ||
435 | %% | |
436 | ||
437 | .fi | |
438 | which generates a scanner that simply copies its input (one character | |
439 | at a time) to its output. | |
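The selection rule described above (longest match wins, and ties go to the rule listed first) can be sketched in C. This is only an illustration, not how flex is implemented: the generated scanner works from tables rather than trying each rule in turn. The function name pick_rule and the convention that a length of 0 means "this rule did not match" are assumptions made for the example.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of flex: match_len[i] is the amount of
 * text rule i matched at the current input position (0 is taken to mean
 * "no match").  Returns the index of the rule to run, or -1 if no rule
 * matched, in which case the default rule would apply.
 */
int pick_rule(const size_t *match_len, int n_rules)
{
    int best = -1;
    size_t best_len = 0;
    int i;

    for (i = 0; i < n_rules; ++i) {
        /* Strictly greater: on a tie the earlier rule keeps the win. */
        if (match_len[i] > best_len) {
            best = i;
            best_len = match_len[i];
        }
    }
    return best;
}
```

With match lengths of 3, 5, and 5 for three rules, the sketch picks the second rule: it matches more text than the first and is listed before the third.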
440 | .SH ACTIONS | |
441 | Each pattern in a rule has a corresponding action, which can be any | |
442 | arbitrary C statement. The pattern ends at the first non-escaped | |
443 | whitespace character; the remainder of the line is its action. If the | |
444 | action is empty, then when the pattern is matched the input token | |
445 | is simply discarded. For example, here is the specification for a program | |
446 | which deletes all occurrences of "zap me" from its input: | |
447 | .nf | |
448 | ||
449 | %% | |
450 | "zap me" | |
451 | ||
452 | .fi | |
453 | (It will copy all other characters in the input to the output since | |
454 | they will be matched by the default rule.) | |
455 | .LP | |
456 | Here is a program which compresses multiple blanks and tabs down to | |
457 | a single blank, and throws away whitespace found at the end of a line: | |
458 | .nf | |
459 | ||
460 | %% | |
461 | [ \\t]+ putchar( ' ' ); | |
462 | [ \\t]+$ /* ignore this token */ | |
463 | ||
464 | .fi | |
465 | .LP | |
466 | If the action contains a '{', then the action spans till the balancing '}' | |
467 | is found, and the action may cross multiple lines. | |
468 | .I flex | |
469 | knows about C strings and comments and won't be fooled by braces found | |
470 | within them, but also allows actions to begin with | |
471 | .B %{ | |
472 | and will consider the action to be all the text up to the next | |
473 | .B %} | |
474 | (regardless of ordinary braces inside the action). | |
475 | .LP | |
476 | An action consisting solely of a vertical bar ('|') means "same as | |
477 | the action for the next rule." See below for an illustration. | |
478 | .LP | |
479 | Actions can include arbitrary C code, including | |
480 | .B return | |
481 | statements to return a value to whatever routine called | |
482 | .B yylex(). | |
483 | Each time | |
484 | .B yylex() | |
485 | is called it continues processing tokens from where it last left | |
486 | off until it either reaches | |
487 | the end of the file or executes a return. Once it reaches an end-of-file, | |
488 | however, then any subsequent call to | |
489 | .B yylex() | |
490 | will simply immediately return, unless | |
491 | .B yyrestart() | |
492 | is first called (see below). | |
493 | .LP | |
494 | Actions are not allowed to modify yytext or yyleng. | |
495 | .LP | |
496 | There are a number of special directives which can be included within | |
497 | an action: | |
498 | .IP - | |
499 | .B ECHO | |
500 | copies yytext to the scanner's output. | |
501 | .IP - | |
502 | .B BEGIN | |
503 | followed by the name of a start condition places the scanner in the | |
504 | corresponding start condition (see below). | |
505 | .IP - | |
506 | .B REJECT | |
507 | directs the scanner to proceed on to the "second best" rule which matched the | |
508 | input (or a prefix of the input). The rule is chosen as described | |
509 | above in "How the Input is Matched", and | |
510 | .B yytext | |
511 | and | |
512 | .B yyleng | |
513 | set up appropriately. | |
514 | It may either be one which matched as much text | |
515 | as the originally chosen rule but came later in the | |
516 | .I flex | |
517 | input file, or one which matched less text. | |
518 | For example, the following will both count the | |
519 | words in the input and call the routine special() whenever "frob" is seen: | |
520 | .nf | |
521 | ||
522 | int word_count = 0; | |
523 | %% | |
524 | ||
525 | frob special(); REJECT; | |
526 | [^ \\t\\n]+ ++word_count; | |
527 | ||
528 | .fi | |
529 | Without the | |
530 | .B REJECT, | |
531 | any "frob"'s in the input would not be counted as words, since the | |
532 | scanner normally executes only one action per token. | |
533 | Multiple | |
534 | .B REJECT's | |
535 | are allowed, each one finding the next best choice to the currently | |
536 | active rule. For example, when the following scanner scans the token | |
537 | "abcd", it will write "abcdabcaba" to the output: | |
538 | .nf | |
539 | ||
540 | %% | |
541 | a | | |
542 | ab | | |
543 | abc | | |
544 | abcd ECHO; REJECT; | |
545 | .|\\n /* eat up any unmatched character */ | |
546 | ||
547 | .fi | |
548 | (The first three rules share the fourth's action since they use | |
549 | the special '|' action.) | |
550 | .B REJECT | |
551 | is a particularly expensive feature in terms of scanner performance; | |
552 | if it is used in | |
553 | .I any | |
554 | of the scanner's actions it will slow down | |
555 | .I all | |
556 | of the scanner's matching. Furthermore, | |
557 | .B REJECT | |
558 | cannot be used with the | |
559 | .I -f | |
560 | or | |
561 | .I -F | |
562 | options (see below). | |
563 | .IP | |
564 | Note also that unlike the other special actions, | |
565 | .B REJECT | |
566 | is a | |
567 | .I branch; | |
568 | code immediately following it in the action will | |
569 | .I not | |
570 | be executed. | |
571 | .IP - | |
572 | .B yymore() | |
573 | tells the scanner that the next time it matches a rule, the corresponding | |
574 | token should be | |
575 | .I appended | |
576 | onto the current value of | |
577 | .B yytext | |
578 | rather than replacing it. For example, given the input "mega-kludge" | |
579 | the following will write "mega-mega-kludge" to the output: | |
580 | .nf | |
581 | ||
582 | %% | |
583 | mega- ECHO; yymore(); | |
584 | kludge ECHO; | |
585 | ||
586 | .fi | |
587 | First "mega-" is matched and echoed to the output. Then "kludge" | |
588 | is matched, but the previous "mega-" is still hanging around at the | |
589 | beginning of | |
590 | .B yytext | |
591 | so the | |
592 | .B ECHO | |
593 | for the "kludge" rule will actually write "mega-kludge". | |
594 | The presence of | |
595 | .B yymore() | |
596 | in the scanner's action entails a minor performance penalty in the | |
597 | scanner's matching speed. | |
598 | .IP - | |
599 | .B yyless(n) | |
600 | returns all but the first | |
601 | .I n | |
602 | characters of the current token back to the input stream, where they | |
603 | will be rescanned when the scanner looks for the next match. | |
604 | .B yytext | |
605 | and | |
606 | .B yyleng | |
607 | are adjusted appropriately (e.g., | |
608 | .B yyleng | |
609 | will now be equal to | |
610 | .I n | |
611 | ). For example, on the input "foobar" the following will write out | |
612 | "foobarbar": | |
613 | .nf | |
614 | ||
615 | %% | |
616 | foobar ECHO; yyless(3); | |
617 | [a-z]+ ECHO; | |
618 | ||
619 | .fi | |
620 | An argument of 0 to | |
621 | .B yyless | |
622 | will cause the entire current input string to be scanned again. Unless you've | |
623 | changed how the scanner will subsequently process its input (using | |
624 | .B BEGIN, | |
625 | for example), this will result in an endless loop. | |
626 | .IP - | |
627 | .B unput(c) | |
628 | puts the character | |
629 | .I c | |
630 | back onto the input stream. It will be the next character scanned. | |
631 | The following action will take the current token and cause it | |
632 | to be rescanned enclosed in parentheses. | |
633 | .nf | |
634 | ||
635 | { | |
636 | int i; | |
637 | unput( ')' ); | |
638 | for ( i = yyleng - 1; i >= 0; --i ) | |
639 | unput( yytext[i] ); | |
640 | unput( '(' ); | |
641 | } | |
642 | ||
643 | .fi | |
644 | Note that since each | |
645 | .B unput() | |
646 | puts the given character back at the | |
647 | .I beginning | |
648 | of the input stream, pushing back strings must be done back-to-front. | |
649 | .IP - | |
650 | .B input() | |
651 | reads the next character from the input stream. For example, | |
652 | the following is one way to eat up C comments: | |
653 | .nf | |
654 | ||
655 | %% | |
656 | "/*" { | |
657 | register int c; | |
658 | ||
659 | for ( ; ; ) | |
660 | { | |
661 | while ( (c = input()) != '*' && | |
662 | c != EOF ) | |
663 | ; /* eat up text of comment */ | |
664 | ||
665 | if ( c == '*' ) | |
666 | { | |
667 | while ( (c = input()) == '*' ) | |
668 | ; | |
669 | if ( c == '/' ) | |
670 | break; /* found the end */ | |
671 | } | |
672 | ||
673 | if ( c == EOF ) | |
674 | { | |
675 | error( "EOF in comment" ); | |
676 | break; | |
677 | } | |
678 | } | |
679 | } | |
680 | ||
681 | .fi | |
682 | (Note that if the scanner is compiled using | |
683 | .B C++, | |
684 | then | |
685 | .B input() | |
686 | is instead referred to as | |
687 | .B yyinput(), | |
688 | in order to avoid a name clash with the | |
689 | .B C++ | |
690 | stream by the name of | |
691 | .I input.) | |
692 | .IP - | |
693 | .B yyterminate() | |
694 | can be used in lieu of a return statement in an action. It terminates | |
695 | the scanner and returns a 0 to the scanner's caller, indicating "all done". | |
696 | Subsequent calls to the scanner will immediately return unless preceded | |
697 | by a call to | |
698 | .B yyrestart() | |
699 | (see below). | |
700 | By default, | |
701 | .B yyterminate() | |
702 | is also called when an end-of-file is encountered. It is a macro and | |
703 | may be redefined. | |
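The back-to-front requirement noted for unput() above can be illustrated with a small stand-alone pushback stack. This is a sketch under stated assumptions, not flex's buffer code: the names push_back_char, push_back_string, and next_pending_char, and the fixed capacity of 64, are invented for the example.

```c
#include <stddef.h>
#include <string.h>

#define PUSHBACK_MAX 64          /* arbitrary capacity for the sketch */

static char pushback[PUSHBACK_MAX];
static size_t pushback_len = 0;

/* Put one character at the front of the pending input.
 * Returns 0 on success, -1 if the stack is full. */
int push_back_char(char c)
{
    if (pushback_len >= PUSHBACK_MAX)
        return -1;
    pushback[pushback_len++] = c;
    return 0;
}

/* Push a whole string, last character first, so that it is
 * read back in its original order. */
int push_back_string(const char *s)
{
    size_t i = strlen(s);

    while (i > 0)
        if (push_back_char(s[--i]) != 0)
            return -1;
    return 0;
}

/* Next pending character, or -1 when nothing is pending. */
int next_pending_char(void)
{
    if (pushback_len == 0)
        return -1;
    return (unsigned char) pushback[--pushback_len];
}
```

Because each pushed character becomes the next one read, pushing a string front-to-back would return it reversed; walking the string from its last character, as push_back_string does, preserves the order.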
704 | .SH THE GENERATED SCANNER | |
705 | The output of | |
706 | .I flex | |
707 | is the file | |
708 | .B lex.yy.c, | |
709 | which contains the scanning routine | |
710 | .B yylex(), | |
711 | a number of tables used by it for matching tokens, and a number | |
712 | of auxiliary routines and macros. By default, | |
713 | .B yylex() | |
714 | is declared as follows: | |
715 | .nf | |
716 | ||
717 | int yylex() | |
718 | { | |
719 | ... various definitions and the actions in here ... | |
720 | } | |
721 | ||
722 | .fi | |
723 | (If your environment supports function prototypes, then it will | |
724 | be "int yylex( void )".) This definition may be changed by redefining | |
725 | the "YY_DECL" macro. For example, you could use: | |
726 | .nf | |
727 | ||
728 | #undef YY_DECL | |
729 | #define YY_DECL float lexscan( a, b ) float a, b; | |
730 | ||
731 | .fi | |
732 | to give the scanning routine the name | |
733 | .I lexscan, | |
734 | returning a float, and taking two floats as arguments. Note that | |
735 | if you give arguments to the scanning routine using a | |
736 | K&R-style/non-prototyped function declaration, you must terminate | |
737 | the definition with a semi-colon (;). | |
738 | .LP | |
739 | Whenever | |
740 | .B yylex() | |
741 | is called, it scans tokens from the global input file | |
742 | .I yyin | |
743 | (which defaults to stdin). It continues until it either reaches | |
744 | an end-of-file (at which point it returns the value 0) or | |
745 | one of its actions executes a | |
746 | .I return | |
747 | statement. | |
748 | In the former case, when called again the scanner will immediately | |
749 | return unless | |
750 | .B yyrestart() | |
751 | is called to point | |
752 | .I yyin | |
753 | at the new input file. ( | |
754 | .B yyrestart() | |
755 | takes one argument, a | |
756 | .B FILE * | |
757 | pointer.) | |
758 | In the latter case (i.e., when an action | |
759 | executes a return), the scanner may then be called again and it | |
760 | will resume scanning where it left off. | |
761 | .LP | |
762 | By default (and for purposes of efficiency), the scanner uses | |
763 | block-reads rather than simple | |
764 | .I getc() | |
765 | calls to read characters from | |
766 | .I yyin. | |
767 | The nature of how it gets its input can be controlled by redefining the | |
768 | .B YY_INPUT | |
769 | macro. | |
770 | YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its | |
771 | action is to place up to | |
772 | .I max_size | |
773 | characters in the character array | |
774 | .I buf | |
775 | and return in the integer variable | |
776 | .I result | |
777 | either the | |
778 | number of characters read or the constant YY_NULL (0 on Unix systems) | |
779 | to indicate EOF. The default YY_INPUT reads from the | |
780 | global file-pointer "yyin". | |
781 | .LP | |
782 | A sample redefinition of YY_INPUT (in the definitions | |
783 | section of the input file): | |
784 | .nf | |
785 | ||
786 | %{ | |
787 | #undef YY_INPUT | |
788 | #define YY_INPUT(buf,result,max_size) \\ | |
789 | { int c = getchar(); result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); } | |
790 | %} | |
791 | ||
792 | .fi | |
793 | This definition will change the input processing to occur | |
794 | one character at a time. | |
795 | .LP | |
796 | You also can add in things like keeping track of the | |
797 | input line number this way; but don't expect your scanner to | |
798 | go very fast. | |
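The YY_INPUT contract described above (place up to max_size characters in buf and report how many were stored, or 0 at end of input) can also be met by a reader that takes its input from memory. The following is a minimal sketch, assuming a hypothetical struct string_source and function read_from_string; neither is provided by flex.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory source; not part of flex. */
struct string_source {
    const char *text;   /* NUL-terminated input */
    size_t pos;         /* offset of the next unread character */
};

/*
 * Copy up to max_size characters into buf and return how many were
 * copied.  A return of 0 plays the role of YY_NULL (end of input).
 */
size_t read_from_string(struct string_source *src, char *buf, size_t max_size)
{
    size_t left = strlen(src->text) - src->pos;
    size_t n = left < max_size ? left : max_size;

    memcpy(buf, src->text + src->pos, n);
    src->pos += n;
    return n;
}
```

A YY_INPUT redefinition could then assign the function's return value to result, on the assumption that YY_NULL is 0 as stated above for Unix systems.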
799 | .LP | |
800 | When the scanner receives an end-of-file indication from YY_INPUT, | |
801 | it then checks the | |
802 | .B yywrap() | |
803 | function. If | |
804 | .B yywrap() | |
805 | returns false (zero), then it is assumed that the | |
806 | function has gone ahead and set up | |
807 | .I yyin | |
808 | to point to another input file, and scanning continues. If it returns | |
809 | true (non-zero), then the scanner terminates, returning 0 to its | |
810 | caller. | |
811 | .LP | |
812 | The default | |
813 | .B yywrap() | |
814 | always returns 1. Presently, to redefine it you must first | |
815 | "#undef yywrap", as it is currently implemented as a macro. As indicated | |
816 | by the hedging in the previous sentence, it may be changed to | |
817 | a true function in the near future. | |
818 | .LP | |
819 | The scanner writes its | |
820 | .B ECHO | |
821 | output to the | |
822 | .I yyout | |
823 | global (default, stdout), which may be redefined by the user simply | |
824 | by assigning it to some other | |
825 | .B FILE | |
826 | pointer. | |
827 | .SH START CONDITIONS | |
828 | .I flex | |
829 | provides a mechanism for conditionally activating rules. Any rule | |
830 | whose pattern is prefixed with "<sc>" will only be active when | |
831 | the scanner is in the start condition named "sc". For example, | |
832 | .nf | |
833 | ||
834 | <STRING>[^"]* { /* eat up the string body ... */ | |
835 | ... | |
836 | } | |
837 | ||
838 | .fi | |
839 | will be active only when the scanner is in the "STRING" start | |
840 | condition, and | |
841 | .nf | |
842 | ||
843 | <INITIAL,STRING,QUOTE>\\. { /* handle an escape ... */ | |
844 | ... | |
845 | } | |
846 | ||
847 | .fi | |
848 | will be active only when the current start condition is | |
849 | either "INITIAL", "STRING", or "QUOTE". | |
850 | .LP | |
851 | Start conditions | |
852 | are declared in the definitions (first) section of the input | |
853 | using unindented lines beginning with either | |
854 | .B %s | |
855 | or | |
856 | .B %x | |
857 | followed by a list of names. | |
858 | The former declares | |
859 | .I inclusive | |
860 | start conditions, the latter | |
861 | .I exclusive | |
862 | start conditions. A start condition is activated using the | |
863 | .B BEGIN | |
864 | action. Until the next | |
865 | .B BEGIN | |
866 | action is executed, rules with the given start | |
867 | condition will be active and | |
868 | rules with other start conditions will be inactive. | |
869 | If the start condition is | |
870 | .I inclusive, | |
871 | then rules with no start conditions at all will also be active. | |
872 | If it is | |
873 | .I exclusive, | |
874 | then | |
875 | .I only | |
876 | rules qualified with the start condition will be active. | |
877 | A set of rules contingent on the same exclusive start condition | |
878 | describe a scanner which is independent of any of the other rules in the | |
879 | .I flex | |
880 | input. Because of this, | |
881 | exclusive start conditions make it easy to specify "mini-scanners" | |
882 | which scan portions of the input that are syntactically different | |
883 | from the rest (e.g., comments). | |
884 | .LP | |
885 | If the distinction between inclusive and exclusive start conditions | |
886 | is still a little vague, here's a simple example illustrating the | |
887 | connection between the two. The set of rules: | |
888 | .nf | |
889 | ||
890 | %s example | |
891 | %% | |
892 | <example>foo /* do something */ | |
893 | bar /* do something else */ | |
894 | .fi | |
895 | is equivalent to | |
896 | .nf | |
897 | ||
898 | %x example | |
899 | %% | |
900 | <example>foo /* do something */ | |
901 | <INITIAL,example>bar /* do something else */ | |
902 | .fi | |
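The activation rules for inclusive and exclusive start conditions can be written down as a small predicate. This is a hypothetical model for illustration only and says nothing about flex's internals; the function name rule_is_active and its parameters are assumptions.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model, not flex's internals.  `current` names the active
 * start condition ("INITIAL" if none has been entered),
 * `current_is_exclusive` says whether it was declared with %x, and
 * `rule_conds` lists the conditions in the rule's <...> prefix
 * (n_conds == 0 for an unqualified rule).
 */
int rule_is_active(const char *current, int current_is_exclusive,
                   const char *const *rule_conds, size_t n_conds)
{
    size_t i;

    if (n_conds == 0)
        /* Unqualified rules run in INITIAL and in any inclusive condition. */
        return strcmp(current, "INITIAL") == 0 || !current_is_exclusive;

    /* Qualified rules run only when the current condition is listed. */
    for (i = 0; i < n_conds; ++i)
        if (strcmp(current, rule_conds[i]) == 0)
            return 1;
    return 0;
}
```

Under this model an unqualified rule is inactive only while an exclusive condition is in effect, which is what makes exclusive conditions suitable for the "mini-scanners" mentioned earlier.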
903 | .LP | |
904 | The default rule (to | |
905 | .B ECHO | |
906 | any unmatched character) remains active in start conditions. | |
907 | .LP | |
908 | .B BEGIN(0) | |
909 | returns to the original state where only the rules with | |
910 | no start conditions are active. This state can also be | |
911 | referred to as the start-condition "INITIAL", so | |
912 | .B BEGIN(INITIAL) | |
913 | is equivalent to | |
914 | .B BEGIN(0). | |
915 | (The parentheses around the start condition name are not required but | |
916 | are considered good style.) | |
917 | .LP | |
918 | .B BEGIN | |
919 | actions can also be given as indented code at the beginning | |
920 | of the rules section. For example, the following will cause | |
921 | the scanner to enter the "SPECIAL" start condition whenever | |
922 | .I yylex() | |
923 | is called and the global variable | |
924 | .I enter_special | |
925 | is true: | |
926 | .nf | |
927 | ||
928 | int enter_special; | |
929 | ||
930 | %x SPECIAL | |
931 | %% | |
932 | if ( enter_special ) | |
933 | BEGIN(SPECIAL); | |
934 | ||
935 | <SPECIAL>blahblahblah | |
936 | ...more rules follow... | |
937 | ||
938 | .fi | |
939 | .LP | |
940 | To illustrate the uses of start conditions, | |
941 | here is a scanner which provides two different interpretations | |
942 | of a string like "123.456". By default it will treat it as | |
943 | three tokens: the integer "123", a dot ('.'), and the integer "456". | |
944 | But if the string is preceded earlier in the line by the string | |
945 | "expect-floats" | |
946 | it will treat it as a single token, the floating-point number | |
947 | 123.456: | |
948 | .nf | |
949 | ||
950 | %{ | |
951 | #include <stdlib.h> | |
952 | %} | |
953 | %s expect | |
954 | ||
955 | %% | |
956 | expect-floats BEGIN(expect); | |
957 | ||
958 | <expect>[0-9]+"."[0-9]+ { | |
959 | printf( "found a float, = %f\\n", | |
960 | atof( yytext ) ); | |
961 | } | |
962 | <expect>\\n { | |
963 | /* that's the end of the line, so | |
964 | * we need another "expect-floats" | |
965 | * before we'll recognize any more | |
966 | * numbers | |
967 | */ | |
968 | BEGIN(INITIAL); | |
969 | } | |
970 | ||
971 | [0-9]+ { | |
972 | printf( "found an integer, = %d\\n", | |
973 | atoi( yytext ) ); | |
974 | } | |
975 | ||
976 | "." printf( "found a dot\\n" ); | |
977 | ||
978 | .fi | |
979 | Here is a scanner which recognizes (and discards) C comments while | |
980 | maintaining a count of the current input line. | |
981 | .nf | |
982 | ||
983 | %x comment | |
984 | %% | |
985 | int line_num = 1; | |
986 | ||
987 | "/*" BEGIN(comment); | |
988 | ||
989 | <comment>[^*\\n]* /* eat anything that's not a '*' */ | |
990 | <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */ | |
991 | <comment>\\n ++line_num; | |
992 | <comment>"*"+"/" BEGIN(INITIAL); | |
993 | ||
994 | .fi | |
995 | Note that start condition names are really integer values and | |
996 | can be stored as such. Thus, the above could be extended in the | |
997 | following fashion: | |
998 | .nf | |
999 | ||
1000 | %x comment foo | |
1001 | %% | |
1002 | int line_num = 1; | |
1003 | int comment_caller; | |
1004 | ||
1005 | "/*" { | |
1006 | comment_caller = INITIAL; | |
1007 | BEGIN(comment); | |
1008 | } | |
1009 | ||
1010 | ... | |
1011 | ||
1012 | <foo>"/*" { | |
1013 | comment_caller = foo; | |
1014 | BEGIN(comment); | |
1015 | } | |
1016 | ||
1017 | <comment>[^*\\n]* /* eat anything that's not a '*' */ | |
1018 | <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */ | |
1019 | <comment>\\n ++line_num; | |
1020 | <comment>"*"+"/" BEGIN(comment_caller); | |
1021 | ||
1022 | .fi | |
1023 | One can then implement a "stack" of start conditions using an | |
1024 | array of integers. (It is likely that such stacks will become | |
1025 | a full-fledged | |
1026 | .I flex | |
1027 | feature in the future.) Note, though, that | |
1028 | start conditions do not have their own name-space; %s's and %x's | |
1029 | declare names in the same fashion as #define's. | |
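.LP
As a rough sketch of the "stack" idea, the following C fragment keeps
saved start conditions in an array of integers. The names
SC_STACK_MAX, sc_push() and sc_pop() are hypothetical and are not
provided by flex; the fragment assumes only that start condition
names are integer values, as noted above.
.nf

```c
#include <assert.h>

#define SC_STACK_MAX 16          /* hypothetical depth limit */

static int sc_stack[SC_STACK_MAX];
static int sc_depth = 0;         /* number of saved start conditions */

/* Save a start condition value.
 * Returns 1 on success, 0 if the stack is full.
 */
static int sc_push(int start_condition)
{
    if (sc_depth >= SC_STACK_MAX)
        return 0;
    sc_stack[sc_depth++] = start_condition;
    return 1;
}

/* Return the most recently saved value,
 * or fallback if nothing has been saved.
 */
static int sc_pop(int fallback)
{
    assert(sc_depth >= 0);
    if (sc_depth == 0)
        return fallback;
    return sc_stack[--sc_depth];
}
```

.fi
In a scanner, the action for <foo>"/*" could call sc_push( foo )
before BEGIN(comment), and the rule which ends the comment could
then use BEGIN(sc_pop( INITIAL )).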
1030 | .SH MULTIPLE INPUT BUFFERS | |
1031 | Some scanners (such as those which support "include" files) | |
1032 | require reading from several input streams. As | |
1033 | .I flex | |
1034 | scanners do a large amount of buffering, one cannot control | |
1035 | where the next input will be read from by simply writing a | |
1036 | .B YY_INPUT | |
1037 | which is sensitive to the scanning context. | |
1038 | .B YY_INPUT | |
1039 | is only called when the scanner reaches the end of its buffer, which | |
1040 | may be a long time after scanning a statement such as an "include" | |
1041 | which requires switching the input source. | |
1042 | .LP | |
1043 | To negotiate these sorts of problems, | |
1044 | .I flex | |
1045 | provides a mechanism for creating and switching between multiple | |
1046 | input buffers. An input buffer is created by using: | |
1047 | .nf | |
1048 | ||
1049 | YY_BUFFER_STATE yy_create_buffer( FILE *file, int size ) | |
1050 | ||
1051 | .fi | |
1052 | which takes a | |
1053 | .I FILE | |
1054 | pointer and a size and creates a buffer associated with the given | |
1055 | file and large enough to hold | |
1056 | .I size | |
1057 | characters (when in doubt, use | |
1058 | .B YY_BUF_SIZE | |
1059 | for the size). It returns a | |
1060 | .B YY_BUFFER_STATE | |
1061 | handle, which may then be passed to other routines: | |
1062 | .nf | |
1063 | ||
1064 | void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer ) | |
1065 | ||
1066 | .fi | |
1067 | switches the scanner's input buffer so subsequent tokens will | |
1068 | come from | |
1069 | .I new_buffer. | |
1070 | Note that | |
1071 | .B yy_switch_to_buffer() | |
1072 | may be used by yywrap() to set things up for continued scanning, instead | |
1073 | of opening a new file and pointing | |
1074 | .I yyin | |
1075 | at it. | |
1076 | .nf | |
1077 | ||
1078 | void yy_delete_buffer( YY_BUFFER_STATE buffer ) | |
1079 | ||
1080 | .fi | |
1081 | is used to reclaim the storage associated with a buffer. | |
1082 | .LP | |
1083 | .B yy_new_buffer() | |
1084 | is an alias for | |
1085 | .B yy_create_buffer(), | |
1086 | provided for compatibility with the C++ use of | |
1087 | .I new | |
1088 | and | |
1089 | .I delete | |
1090 | for creating and destroying dynamic objects. | |
1091 | .LP | |
1092 | Finally, the | |
1093 | .B YY_CURRENT_BUFFER | |
1094 | macro returns a | |
1095 | .B YY_BUFFER_STATE | |
1096 | handle to the current buffer. | |
1097 | .LP | |
1098 | Here is an example of using these features for writing a scanner | |
1099 | which expands include files (the | |
1100 | .B <<EOF>> | |
1101 | feature is discussed below): | |
1102 | .nf | |
1103 | ||
1104 | /* the "incl" state is used for picking up the name | |
1105 | * of an include file | |
1106 | */ | |
1107 | %x incl | |
1108 | ||
1109 | %{ | |
1110 | #define MAX_INCLUDE_DEPTH 10 | |
1111 | YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH]; | |
1112 | int include_stack_ptr = 0; | |
1113 | %} | |
1114 | ||
1115 | %% | |
1116 | include BEGIN(incl); | |
1117 | ||
1118 | [a-z]+ ECHO; | |
1119 | [^a-z\\n]*\\n? ECHO; | |
1120 | ||
1121 | <incl>[ \\t]* /* eat the whitespace */ | |
1122 | <incl>[^ \\t\\n]+ { /* got the include file name */ | |
1123 | if ( include_stack_ptr >= MAX_INCLUDE_DEPTH ) | |
1124 | { | |
1125 | fprintf( stderr, "Includes nested too deeply" ); | |
1126 | exit( 1 ); | |
1127 | } | |
1128 | ||
1129 | include_stack[include_stack_ptr++] = | |
1130 | YY_CURRENT_BUFFER; | |
1131 | ||
1132 | yyin = fopen( yytext, "r" ); | |
1133 | ||
1134 | if ( ! yyin ) | |
1135 | error( ... ); | |
1136 | ||
1137 | yy_switch_to_buffer( | |
1138 | yy_create_buffer( yyin, YY_BUF_SIZE ) ); | |
1139 | ||
1140 | BEGIN(INITIAL); | |
1141 | } | |
1142 | ||
1143 | <<EOF>> { | |
1144 | if ( --include_stack_ptr < 0 ) | |
1145 | { | |
1146 | yyterminate(); | |
1147 | } | |
1148 | ||
1149 | else | |
1150 | yy_switch_to_buffer( | |
1151 | include_stack[include_stack_ptr] ); | |
1152 | } | |
1153 | ||
1154 | .fi | |
1155 | .SH END-OF-FILE RULES | |
1156 | The special rule "<<EOF>>" indicates | |
1157 | actions which are to be taken when an end-of-file is | |
1158 | encountered and yywrap() returns non-zero (i.e., indicates | |
1159 | no further files to process). The action must finish | |
1160 | by doing one of four things: | |
1161 | .IP - | |
1162 | the special | |
1163 | .B YY_NEW_FILE | |
1164 | action, if | |
1165 | .I yyin | |
1166 | has been pointed at a new file to process; | |
1167 | .IP - | |
1168 | a | |
1169 | .I return | |
1170 | statement; | |
1171 | .IP - | |
1172 | the special | |
1173 | .B yyterminate() | |
1174 | action; | |
1175 | .IP - | |
1176 | or, switching to a new buffer using | |
1177 | .B yy_switch_to_buffer() | |
1178 | as shown in the example above. | |
1179 | .LP | |
1180 | <<EOF>> rules may not be used with other | |
1181 | patterns; they may only be qualified with a list of start | |
1182 | conditions. If an unqualified <<EOF>> rule is given, it | |
1183 | applies to | |
1184 | .I all | |
1185 | start conditions which do not already have <<EOF>> actions. To | |
1186 | specify an <<EOF>> rule for only the initial start condition, use | |
1187 | .nf | |
1188 | ||
1189 | <INITIAL><<EOF>> | |
1190 | ||
1191 | .fi | |
1192 | .LP | |
1193 | These rules are useful for catching things like unclosed comments. | |
1194 | An example: | |
1195 | .nf | |
1196 | ||
1197 | %x quote | |
1198 | %% | |
1199 | ||
1200 | ...other rules for dealing with quotes... | |
1201 | ||
1202 | <quote><<EOF>> { | |
1203 | error( "unterminated quote" ); | |
1204 | yyterminate(); | |
1205 | } | |
1206 | <<EOF>> { | |
1207 | if ( *++filelist ) | |
1208 | { | |
1209 | yyin = fopen( *filelist, "r" ); | |
1210 | YY_NEW_FILE; | |
1211 | } | |
1212 | else | |
1213 | yyterminate(); | |
1214 | } | |
1215 | ||
1216 | .fi | |
1217 | .SH MISCELLANEOUS MACROS | |
1218 | The macro | |
1219 | .B | |
1220 | YY_USER_ACTION | |
1221 | can be redefined to provide an action | |
1222 | which is always executed prior to the matched rule's action. For example, | |
1223 | it could be #define'd to call a routine to convert yytext to lower-case. | |
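.LP
A minimal sketch of such a routine is shown here. The name
fold_to_lower() is hypothetical; the sketch assumes that the matched
text and its length are passed in explicitly.
.nf

```c
#include <assert.h>
#include <ctype.h>

/* Fold the first len characters of text to lower case, in place. */
static void fold_to_lower(char *text, int len)
{
    int i;

    assert(text != 0);
    for (i = 0; i < len; ++i)
        text[i] = (char) tolower((unsigned char) text[i]);
}
```

.fi
With this routine declared in the first section, one possible
definition would be "#define YY_USER_ACTION fold_to_lower( yytext, yyleng );",
assuming yytext and yyleng are visible at the point where the macro
is expanded.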
1224 | .LP | |
1225 | The macro | |
1226 | .B YY_USER_INIT | |
1227 | may be redefined to provide an action which is always executed before | |
1228 | the first scan (and before the scanner's internal initializations are done). | |
1229 | For example, it could be used to call a routine to read | |
1230 | in a data table or open a logging file. | |
1231 | .LP | |
1232 | In the generated scanner, the actions are all gathered in one large | |
1233 | switch statement and separated using | |
1234 | .B YY_BREAK, | |
1235 | which may be redefined. By default, it is simply a "break", to separate | |
1236 | each rule's action from the following rule's. | |
1237 | Redefining | |
1238 | .B YY_BREAK | |
1239 | allows, for example, C++ users to | |
1240 | #define YY_BREAK to do nothing (while being very careful that every | |
1241 | rule ends with a "break" or a "return"!) to avoid suffering from | |
1242 | unreachable statement warnings where, because a rule's action ends with | |
1243 | "return", the | |
1244 | .B YY_BREAK | |
1245 | is inaccessible. | |
1246 | .SH INTERFACING WITH YACC | |
1247 | One of the main uses of | |
1248 | .I flex | |
1249 | is as a companion to the | |
1250 | .I yacc | |
1251 | parser-generator. | |
1252 | .I yacc | |
1253 | parsers expect to call a routine named | |
1254 | .B yylex() | |
1255 | to find the next input token. The routine is supposed to | |
1256 | return the type of the next token as well as putting any associated | |
1257 | value in the global | |
1258 | .B yylval. | |
1259 | To use | |
1260 | .I flex | |
1261 | with | |
1262 | .I yacc, | |
1263 | one specifies the | |
1264 | .B -d | |
1265 | option to | |
1266 | .I yacc | |
1267 | to instruct it to generate the file | |
1268 | .B y.tab.h | |
1269 | containing definitions of all the | |
1270 | .B %tokens | |
1271 | appearing in the | |
1272 | .I yacc | |
1273 | input. This file is then included in the | |
1274 | .I flex | |
1275 | scanner. For example, if one of the tokens is "TOK_NUMBER", | |
1276 | part of the scanner might look like: | |
1277 | .nf | |
1278 | ||
1279 | %{ | |
1280 | #include "y.tab.h" | |
1281 | %} | |
1282 | ||
1283 | %% | |
1284 | ||
1285 | [0-9]+ yylval = atoi( yytext ); return TOK_NUMBER; | |
1286 | ||
1287 | .fi | |
1288 | .SH TRANSLATION TABLE | |
1289 | In the name of POSIX compliance, | |
1290 | .I flex | |
1291 | supports a | |
1292 | .I translation table | |
1293 | for mapping input characters into groups. | |
1294 | The table is specified in the first section, and its format looks like: | |
1295 | .nf | |
1296 | ||
1297 | %t | |
1298 | 1 abcd | |
1299 | 2 ABCDEFGHIJKLMNOPQRSTUVWXYZ | |
1300 | 52 0123456789 | |
1301 | 6 \\t\\ \\n | |
1302 | %t | |
1303 | ||
1304 | .fi | |
1305 | This example specifies that the characters 'a', 'b', 'c', and 'd' | |
1306 | are to all be lumped into group #1, upper-case letters | |
1307 | in group #2, digits in group #52, tabs, blanks, and newlines into | |
1308 | group #6, and | |
1309 | .I | |
1310 | no other characters will appear in the patterns. | |
1311 | The group numbers are actually disregarded by | |
1312 | .I flex; | |
1313 | .B %t | |
1314 | serves, though, to lump characters together. Given the above | |
1315 | table, for example, the pattern "a(AA)*5" is equivalent to "d(ZQ)*0". | |
1316 | They both say, "match any character in group #1, followed by | |
1317 | zero-or-more pairs of characters | |
1318 | from group #2, followed by a character from group #52." Thus | |
1319 | .B %t | |
1320 | provides a crude way for introducing equivalence classes into | |
1321 | the scanner specification. | |
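.LP
The grouping can be pictured as a table lookup. The following C
fragment is only an illustration of the idea and does not show how
.I flex
implements
.B %t;
all of the names in it are hypothetical. It treats two pattern
strings as equivalent when grouped characters in the same position
share a group and all other characters (here, the operators) are
identical. Tab and newline are entered by their ASCII codes, which
is an assumption about the character set.
.nf

```c
#include <assert.h>

static unsigned char group_of[256];   /* 0 means "not in any group" */

/* Put every character of the string members into the given group. */
static void add_group(int group, const char *members)
{
    assert(group > 0 && group < 256);
    for (; *members; ++members)
        group_of[(unsigned char) *members] = (unsigned char) group;
}

/* Load the groups from the example table above. */
static void load_example_groups(void)
{
    add_group(1, "abcd");
    add_group(2, "ABCDEFGHIJKLMNOPQRSTUVWXYZ");
    add_group(52, "0123456789");
    add_group(6, " ");
    group_of[9] = 6;     /* ASCII tab */
    group_of[10] = 6;    /* ASCII newline */
}

/* Return 1 if p and q agree position by position: grouped
 * characters must share a group, ungrouped ones must be identical.
 */
static int same_under_groups(const char *p, const char *q)
{
    for (; *p && *q; ++p, ++q) {
        unsigned char gp = group_of[(unsigned char) *p];
        unsigned char gq = group_of[(unsigned char) *q];

        if (gp != gq || (gp == 0 && *p != *q))
            return 0;
    }
    return *p == *q;     /* both strings must end together */
}
```

.fi
After load_example_groups() has been called, same_under_groups()
reports the two patterns "a(AA)*5" and "d(ZQ)*0" as equivalent.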
1322 | .LP | |
1323 | Note that the | |
1324 | .B -i | |
1325 | option (see below) coupled with the equivalence classes which | |
1326 | .I flex | |
1327 | automatically generates take care of virtually all the instances | |
1328 | when one might consider using | |
1329 | .B %t. | |
1330 | But what the hell, it's there if you want it. | |
1331 | .SH OPTIONS | |
1332 | .I flex | |
1333 | has the following options: | |
1334 | .TP | |
1335 | .B -b | |
1336 | Generate backtracking information to | |
1337 | .I lex.backtrack. | |
1338 | This is a list of scanner states which require backtracking | |
1339 | and the input characters on which they do so. By adding rules one | |
1340 | can remove backtracking states. If all backtracking states | |
1341 | are eliminated and | |
1342 | .B -f | |
1343 | or | |
1344 | .B -F | |
1345 | is used, the generated scanner will run faster (see the | |
1346 | .B -p | |
1347 | flag). Only users who wish to squeeze every last cycle out of their | |
1348 | scanners need worry about this option. (See the section on PERFORMANCE | |
1349 | CONSIDERATIONS below.) | |
1350 | .TP | |
1351 | .B -c | |
1352 | is a do-nothing, deprecated option included for POSIX compliance. | |
1353 | .IP | |
1354 | .B NOTE: | |
1355 | in previous releases of | |
1356 | .I flex | |
1357 | .B -c | |
1358 | specified table-compression options. This functionality is | |
1359 | now given by the | |
1360 | .B -C | |
1361 | flag. To ease the impact of this change, when | |
1362 | .I flex | |
1363 | encounters | |
1364 | .B -c, | |
1365 | it currently issues a warning message and assumes that | |
1366 | .B -C | |
1367 | was desired instead. In the future this "promotion" of | |
1368 | .B -c | |
1369 | to | |
1370 | .B -C | |
1371 | will go away in the name of full POSIX compliance (unless | |
1372 | the POSIX meaning is removed first). | |
1373 | .TP | |
1374 | .B -d | |
1375 | makes the generated scanner run in | |
1376 | .I debug | |
1377 | mode. Whenever a pattern is recognized and the global | |
1378 | .B yy_flex_debug | |
1379 | is non-zero (which is the default), | |
1380 | the scanner will write to | |
1381 | .I stderr | |
1382 | a line of the form: | |
1383 | .nf | |
1384 | ||
1385 | --accepting rule at line 53 ("the matched text") | |
1386 | ||
1387 | .fi | |
1388 | The line number refers to the location of the rule in the file | |
1389 | defining the scanner (i.e., the file that was fed to flex). Messages | |
1390 | are also generated when the scanner backtracks, accepts the | |
1391 | default rule, reaches the end of its input buffer (or encounters | |
1392 | a NUL; at this point, the two look the same as far as the scanner's concerned), | |
1393 | or reaches an end-of-file. | |
1394 | .TP | |
1395 | .B -f | |
1396 | specifies (take your pick) | |
1397 | .I full table | |
1398 | or | |
1399 | .I fast scanner. | |
1400 | No table compression is done. The result is large but fast. | |
1401 | This option is equivalent to | |
1402 | .B -Cf | |
1403 | (see below). | |
1404 | .TP | |
1405 | .B -i | |
1406 | instructs | |
1407 | .I flex | |
1408 | to generate a | |
1409 | .I case-insensitive | |
1410 | scanner. The case of letters given in the | |
1411 | .I flex | |
1412 | input patterns will | |
1413 | be ignored, and tokens in the input will be matched regardless of case. The | |
1414 | matched text given in | |
1415 | .I yytext | |
1416 | will have the preserved case (i.e., it will not be folded). | |
1417 | .TP | |
1418 | .B -n | |
1419 | is another do-nothing, deprecated option included only for | |
1420 | POSIX compliance. | |
1421 | .TP | |
1422 | .B -p | |
1423 | generates a performance report to stderr. The report | |
1424 | consists of comments regarding features of the | |
1425 | .I flex | |
1426 | input file which will cause a loss of performance in the resulting scanner. | |
1427 | Note that the use of | |
1428 | .I REJECT | |
1429 | and variable trailing context (see the BUGS section in flex(1)) | |
1430 | entails a substantial performance penalty; use of | |
1431 | .I yymore(), | |
1432 | the | |
1433 | .B ^ | |
1434 | operator, | |
1435 | and the | |
1436 | .B -I | |
1437 | flag entail minor performance penalties. | |
1438 | .TP | |
1439 | .B -s | |
1440 | causes the | |
1441 | .I default rule | |
1442 | (that unmatched scanner input is echoed to | |
1443 | .I stdout) | |
1444 | to be suppressed. If the scanner encounters input that does not | |
1445 | match any of its rules, it aborts with an error. This option is | |
1446 | useful for finding holes in a scanner's rule set. | |
1447 | .TP | |
1448 | .B -t | |
1449 | instructs | |
1450 | .I flex | |
1451 | to write the scanner it generates to standard output instead | |
1452 | of | |
1453 | .B lex.yy.c. | |
1454 | .TP | |
1455 | .B -v | |
1456 | specifies that | |
1457 | .I flex | |
1458 | should write to | |
1459 | .I stderr | |
1460 | a summary of statistics regarding the scanner it generates. | |
1461 | Most of the statistics are meaningless to the casual | |
1462 | .I flex | |
1463 | user, but the | |
1464 | first line identifies the version of | |
1465 | .I flex, | |
1466 | which is useful for figuring | |
1467 | out where you stand with respect to patches and new releases, | |
1468 | and the next two lines give the date when the scanner was created | |
1469 | and a summary of the flags which were in effect. | |
1470 | .TP | |
1471 | .B -F | |
1472 | specifies that the | |
1473 | .ul | |
1474 | fast | |
1475 | scanner table representation should be used. This representation is | |
1476 | about as fast as the full table representation | |
1477 | .ul | |
1478 | (-f), | |
1479 | and for some sets of patterns will be considerably smaller (and for | |
1480 | others, larger). In general, if the pattern set contains both "keywords" | |
1481 | and a catch-all, "identifier" rule, such as in the set: | |
1482 | .nf | |
1483 | ||
1484 | "case" return TOK_CASE; | |
1485 | "switch" return TOK_SWITCH; | |
1486 | ... | |
1487 | "default" return TOK_DEFAULT; | |
1488 | [a-z]+ return TOK_ID; | |
1489 | ||
1490 | .fi | |
1491 | then you're better off using the full table representation. If only | |
1492 | the "identifier" rule is present and you then use a hash table or some such | |
1493 | to detect the keywords, you're better off using | |
1494 | .ul | |
1495 | -F. | |
1496 | .IP | |
1497 | This option is equivalent to | |
1498 | .B -CF | |
1499 | (see below). | |
1500 | .TP | |
1501 | .B -I | |
1502 | instructs | |
1503 | .I flex | |
1504 | to generate an | |
1505 | .I interactive | |
1506 | scanner. Normally, scanners generated by | |
1507 | .I flex | |
1508 | always look ahead one | |
1509 | character before deciding that a rule has been matched. At the cost of | |
1510 | some scanning overhead, | |
1511 | .I flex | |
1512 | will generate a scanner which only looks ahead | |
1513 | when needed. Such scanners are called | |
1514 | .I interactive | |
1515 | because if you want to write a scanner for an interactive system such as a | |
1516 | command shell, you will probably want the user's input to be terminated | |
1517 | with a newline, and without | |
1518 | .B -I | |
1519 | the user will have to type a character in addition to the newline in order | |
1520 | to have the newline recognized. This leads to dreadful interactive | |
1521 | performance. | |
1522 | .IP | |
1523 | If all this seems too confusing, here's the general rule: if a human will | |
1524 | be typing in input to your scanner, use | |
1525 | .B -I, | |
1526 | otherwise don't; if you don't care about squeezing the utmost performance | |
1527 | from your scanner and you | |
1528 | don't want to make any assumptions about the input to your scanner, | |
1529 | use | |
1530 | .B -I. | |
1531 | .IP | |
1532 | Note, | |
1533 | .B -I | |
1534 | cannot be used in conjunction with | |
1535 | .I full | |
1536 | or | |
1537 | .I fast tables, | |
1538 | i.e., the | |
1539 | .B -f, -F, -Cf, | |
1540 | or | |
1541 | .B -CF | |
1542 | flags. | |
1543 | .TP | |
1544 | .B -L | |
1545 | instructs | |
1546 | .I flex | |
1547 | not to generate | |
1548 | .B #line | |
1549 | directives. Without this option, | |
1550 | .I flex | |
1551 | peppers the generated scanner | |
1552 | with #line directives so error messages in the actions will be correctly | |
1553 | located with respect to the original | |
1554 | .I flex | |
1555 | input file, and not to | |
1556 | the fairly meaningless line numbers of | |
1557 | .B lex.yy.c. | |
1558 | (Unfortunately | |
1559 | .I flex | |
1560 | does not presently generate the necessary directives | |
1561 | to "retarget" the line numbers for those parts of | |
1562 | .B lex.yy.c | |
1563 | which it generated. So if there is an error in the generated code, | |
1564 | a meaningless line number is reported.) | |
1565 | .TP | |
1566 | .B -T | |
1567 | makes | |
1568 | .I flex | |
1569 | run in | |
1570 | .I trace | |
1571 | mode. It will generate a lot of messages to | |
1572 | .I stdout | |
1573 | concerning | |
1574 | the form of the input and the resultant non-deterministic and deterministic | |
1575 | finite automata. This option is mostly for use in maintaining | |
1576 | .I flex. | |
1577 | .TP | |
1578 | .B -8 | |
1579 | instructs | |
1580 | .I flex | |
1581 | to generate an 8-bit scanner, i.e., one which can recognize 8-bit | |
1582 | characters. On some sites, | |
1583 | .I flex | |
1584 | is installed with this option as the default. On others, the default | |
1585 | is 7-bit characters. To see which is the case, check the verbose | |
1586 | .B (-v) | |
1587 | output for "equivalence classes created". If the denominator of | |
1588 | the number shown is 128, then by default | |
1589 | .I flex | |
1590 | is generating 7-bit characters. If it is 256, then the default is | |
1591 | 8-bit characters and the | |
1592 | .B -8 | |
1593 | flag is not required (but may be a good idea to keep the scanner | |
1594 | specification portable). Feeding a 7-bit scanner 8-bit characters | |
1595 | will result in infinite loops, bus errors, or other such fireworks, | |
1596 | so when in doubt, use the flag. Note that if equivalence classes | |
1597 | are used, 8-bit scanners take only slightly more table space than | |
1598 | 7-bit scanners (128 bytes, to be exact); if equivalence classes are | |
1599 | not used, however, then the tables may grow up to twice their | |
1600 | 7-bit size. | |
1601 | .TP | |
1602 | .B -C[efmF] | |
1603 | controls the degree of table compression. | |
1604 | .IP | |
1605 | .B -Ce | |
1606 | directs | |
1607 | .I flex | |
1608 | to construct | |
1609 | .I equivalence classes, | |
1610 | i.e., sets of characters | |
1611 | which have identical lexical properties (for example, if the only | |
1612 | appearance of digits in the | |
1613 | .I flex | |
1614 | input is in the character class | |
1615 | "[0-9]" then the digits '0', '1', ..., '9' will all be put | |
1616 | in the same equivalence class). Equivalence classes usually give | |
1617 | dramatic reductions in the final table/object file sizes (typically | |
1618 | a factor of 2-5) and are pretty cheap performance-wise (one array | |
1619 | look-up per character scanned). | |
1620 | .IP | |
1621 | .B -Cf | |
1622 | specifies that the | |
1623 | .I full | |
1624 | scanner tables should be generated - | |
1625 | .I flex | |
1626 | should not compress the | |
1627 | tables by taking advantage of similar transition functions for | |
1628 | different states. | |
1629 | .IP | |
1630 | .B -CF | |
1631 | specifies that the alternate fast scanner representation (described | |
1632 | above under the | |
1633 | .B -F | |
1634 | flag) | |
1635 | should be used. | |
1636 | .IP | |
1637 | .B -Cm | |
1638 | directs | |
1639 | .I flex | |
1640 | to construct | |
1641 | .I meta-equivalence classes, | |
1642 | which are sets of equivalence classes (or characters, if equivalence | |
1643 | classes are not being used) that are commonly used together. Meta-equivalence | |
1644 | classes are often a big win when using compressed tables, but they | |
1645 | have a moderate performance impact (one or two "if" tests and one | |
1646 | array look-up per character scanned). | |
1647 | .IP | |
1648 | A lone | |
1649 | .B -C | |
1650 | specifies that the scanner tables should be compressed but neither | |
1651 | equivalence classes nor meta-equivalence classes should be used. | |
1652 | .IP | |
1653 | The options | |
1654 | .B -Cf | |
1655 | or | |
1656 | .B -CF | |
1657 | and | |
1658 | .B -Cm | |
1659 | do not make sense together - there is no opportunity for meta-equivalence | |
1660 | classes if the table is not being compressed. Otherwise the options | |
1661 | may be freely mixed. | |
1662 | .IP | |
1663 | The default setting is | |
1664 | .B -Cem, | |
1665 | which specifies that | |
1666 | .I flex | |
1667 | should generate equivalence classes | |
1668 | and meta-equivalence classes. This setting provides the highest | |
1669 | degree of table compression. You can obtain | |
1670 | faster-executing scanners at the cost of larger tables, with | |
1671 | the following generally being true: | |
1672 | .nf | |
1673 | ||
1674 | slowest & smallest | |
1675 | -Cem | |
1676 | -Cm | |
1677 | -Ce | |
1678 | -C | |
1679 | -C{f,F}e | |
1680 | -C{f,F} | |
1681 | fastest & largest | |
1682 | ||
1683 | .fi | |
1684 | Note that scanners with the smallest tables are usually generated and | |
1685 | compiled the quickest, so | |
1686 | during development you will usually want to use the default, maximal | |
1687 | compression. | |
1688 | .IP | |
1689 | .B -Cfe | |
1690 | is often a good compromise between speed and size for production | |
1691 | scanners. | |
1692 | .IP | |
1693 | .B -C | |
1694 | options are not cumulative; whenever the flag is encountered, the | |
1695 | previous -C settings are forgotten. | |
1696 | .TP | |
1697 | .B -Sskeleton_file | |
1698 | overrides the default skeleton file from which | |
1699 | .I flex | |
1700 | constructs its scanners. You'll never need this option unless you are doing | |
1701 | .I flex | |
1702 | maintenance or development. | |
1703 | .SH PERFORMANCE CONSIDERATIONS | |
1704 | The main design goal of | |
1705 | .I flex | |
1706 | is that it generate high-performance scanners. It has been optimized | |
1707 | for dealing well with large sets of rules. Aside from the effects | |
1708 | of table compression on scanner speed outlined above, | |
1709 | there are a number of options/actions which degrade performance. These | |
1710 | are, from most expensive to least: | |
1711 | .nf | |
1712 | ||
1713 | REJECT | |
1714 | ||
1715 | pattern sets that require backtracking | |
1716 | arbitrary trailing context | |
1717 | ||
1718 | '^' beginning-of-line operator | |
1719 | yymore() | |
1720 | ||
1721 | .fi | |
1722 | with the first three all being quite expensive and the last two | |
1723 | being quite cheap. | |
1724 | .LP | |
1725 | .B REJECT | |
1726 | should be avoided at all costs when performance is important. | |
1727 | It is a particularly expensive option. | |
1728 | .LP | |
1729 | Getting rid of backtracking is messy and often may be an enormous | |
1730 | amount of work for a complicated scanner. In principle, one begins | |
1731 | by using the | |
1732 | .B -b | |
1733 | flag to generate a | |
1734 | .I lex.backtrack | |
1735 | file. For example, on the input | |
1736 | .nf | |
1737 | ||
1738 | %% | |
1739 | foo return TOK_KEYWORD; | |
1740 | foobar return TOK_KEYWORD; | |
1741 | ||
1742 | .fi | |
1743 | the file looks like: | |
1744 | .nf | |
1745 | ||
1746 | State #6 is non-accepting - | |
1747 | associated rule line numbers: | |
1748 | 2 3 | |
1749 | out-transitions: [ o ] | |
1750 | jam-transitions: EOF [ \\001-n p-\\177 ] | |
1751 | ||
1752 | State #8 is non-accepting - | |
1753 | associated rule line numbers: | |
1754 | 3 | |
1755 | out-transitions: [ a ] | |
1756 | jam-transitions: EOF [ \\001-` b-\\177 ] | |
1757 | ||
1758 | State #9 is non-accepting - | |
1759 | associated rule line numbers: | |
1760 | 3 | |
1761 | out-transitions: [ r ] | |
1762 | jam-transitions: EOF [ \\001-q s-\\177 ] | |
1763 | ||
1764 | Compressed tables always backtrack. | |
1765 | ||
1766 | .fi | |
1767 | The first few lines tell us that there's a scanner state in | |
1768 | which it can make a transition on an 'o' but not on any other | |
1769 | character, and that in that state the currently scanned text does not match | |
1770 | any rule. The state occurs when trying to match the rules found | |
1771 | at lines 2 and 3 in the input file. | |
1772 | If the scanner is in that state and then reads | |
1773 | something other than an 'o', it will have to backtrack to find | |
1774 | a rule which is matched. With | |
1775 | a bit of headscratching one can see that this must be the | |
1776 | state it's in when it has seen "fo". When this has happened, | |
1777 | if anything other than another 'o' is seen, the scanner will | |
1778 | have to back up to simply match the 'f' (by the default rule). | |
1779 | .LP | |
1780 | The comment regarding State #8 indicates there's a problem | |
1781 | when "foob" has been scanned. Indeed, on any character other | |
1782 | than an 'a', the scanner will have to back up to accept "foo". | |
1783 | Similarly, the comment for State #9 concerns when "fooba" has | |
1784 | been scanned. | |
1785 | .LP | |
1786 | The final comment reminds us that there's no point going to | |
1787 | all the trouble of removing backtracking from the rules unless | |
1788 | we're using | |
1789 | .B -f | |
1790 | or | |
1791 | .B -F, | |
1792 | since there's no performance gain doing so with compressed scanners. | |
1793 | .LP | |
1794 | The way to remove the backtracking is to add "error" rules: | |
1795 | .nf | |
1796 | ||
1797 | %% | |
1798 | foo return TOK_KEYWORD; | |
1799 | foobar return TOK_KEYWORD; | |
1800 | ||
1801 | fooba | | |
1802 | foob | | |
1803 | fo { | |
1804 | /* false alarm, not really a keyword */ | |
1805 | return TOK_ID; | |
1806 | } | |
1807 | ||
1808 | .fi | |
1809 | .LP | |
1810 | Eliminating backtracking among a list of keywords can also be | |
1811 | done using a "catch-all" rule: | |
1812 | .nf | |
1813 | ||
1814 | %% | |
1815 | foo return TOK_KEYWORD; | |
1816 | foobar return TOK_KEYWORD; | |
1817 | ||
1818 | [a-z]+ return TOK_ID; | |
1819 | ||
1820 | .fi | |
1821 | This is usually the best solution when appropriate. | |
1822 | .LP | |
1823 | Backtracking messages tend to cascade. | |
1824 | With a complicated set of rules it's not uncommon to get hundreds | |
1825 | of messages. If one can decipher them, though, it often | |
1826 | only takes a dozen or so rules to eliminate the backtracking (though | |
1827 | it's easy to make a mistake and have an error rule accidentally match | |
1828 | a valid token. A possible future | |
1829 | .I flex | |
1830 | feature will be to automatically add rules to eliminate backtracking). | |
1831 | .LP | |
1832 | .I Variable | |
1833 | trailing context (where neither the leading nor the trailing part has | |
1834 | a fixed length) entails almost the same performance loss as | |
1835 | .I REJECT | |
1836 | (i.e., substantial). So when possible a rule like: | |
1837 | .nf | |
1838 | ||
1839 | %% | |
1840 | mouse|rat/(cat|dog) run(); | |
1841 | ||
1842 | .fi | |
1843 | is better written: | |
1844 | .nf | |
1845 | ||
1846 | %% | |
1847 | mouse/cat|dog run(); | |
1848 | rat/cat|dog run(); | |
1849 | ||
1850 | .fi | |
1851 | or as | |
1852 | .nf | |
1853 | ||
1854 | %% | |
1855 | mouse|rat/cat run(); | |
1856 | mouse|rat/dog run(); | |
1857 | ||
1858 | .fi | |
1859 | Note that here the special '|' action does | |
1860 | .I not | |
1861 | provide any savings, and can even make things worse (see | |
1862 | .B BUGS | |
1863 | in flex(1)). | |
1864 | .LP | |
1865 | Another area where the user can increase a scanner's performance | |
1866 | (and one that's easier to implement) arises from the fact that | |
1867 | the longer the tokens matched, the faster the scanner will run. | |
1868 | This is because with long tokens the processing of most input | |
1869 | characters takes place in the (short) inner scanning loop, and | |
1870 | does not often have to go through the additional work of setting up | |
1871 | the scanning environment (e.g., | |
1872 | .B yytext) | |
1873 | for the action. Recall the scanner for C comments: | |
1874 | .nf | |
1875 | ||
1876 | %x comment | |
1877 | %% | |
1878 | int line_num = 1; | |
1879 | ||
1880 | "/*" BEGIN(comment); | |
1881 | ||
1882 | <comment>[^*\\n]* | |
1883 | <comment>"*"+[^*/\\n]* | |
1884 | <comment>\\n ++line_num; | |
1885 | <comment>"*"+"/" BEGIN(INITIAL); | |
1886 | ||
1887 | .fi | |
1888 | This could be sped up by writing it as: | |
1889 | .nf | |
1890 | ||
1891 | %x comment | |
1892 | %% | |
1893 | int line_num = 1; | |
1894 | ||
1895 | "/*" BEGIN(comment); | |
1896 | ||
1897 | <comment>[^*\\n]* | |
1898 | <comment>[^*\\n]*\\n ++line_num; | |
1899 | <comment>"*"+[^*/\\n]* | |
1900 | <comment>"*"+[^*/\\n]*\\n ++line_num; | |
1901 | <comment>"*"+"/" BEGIN(INITIAL); | |
1902 | ||
1903 | .fi | |
1904 | Now instead of each newline requiring the processing of another | |
1905 | action, recognizing the newlines is "distributed" over the other rules | |
1906 | to keep the matched text as long as possible. Note that | |
1907 | .I adding | |
1908 | rules does | |
1909 | .I not | |
1910 | slow down the scanner! The speed of the scanner is independent | |
1911 | of the number of rules or (modulo the considerations given at the | |
1912 | beginning of this section) how complicated the rules are with | |
1913 | regard to operators such as '*' and '|'. | |
1914 | .LP | |
1915 | A final example in speeding up a scanner: suppose you want to scan | |
1916 | through a file containing identifiers and keywords, one per line | |
1917 | and with no other extraneous characters, and recognize all the | |
1918 | keywords. A natural first approach is: | |
1919 | .nf | |
1920 | ||
1921 | %% | |
1922 | asm | | |
1923 | auto | | |
1924 | break | | |
1925 | ... etc ... | |
1926 | volatile | | |
1927 | while /* it's a keyword */ | |
1928 | ||
1929 | .|\\n /* it's not a keyword */ | |
1930 | ||
1931 | .fi | |
1932 | To eliminate the backtracking, introduce a catch-all rule: | |
1933 | .nf | |
1934 | ||
1935 | %% | |
1936 | asm | | |
1937 | auto | | |
1938 | break | | |
1939 | ... etc ... | |
1940 | volatile | | |
1941 | while /* it's a keyword */ | |
1942 | ||
1943 | [a-z]+ | | |
1944 | .|\\n /* it's not a keyword */ | |
1945 | ||
1946 | .fi | |
1947 | Now, if it's guaranteed that there's exactly one word per line, | |
1948 | then we can reduce the total number of matches by half by | |
1949 | merging in the recognition of newlines with that of the other | |
1950 | tokens: | |
1951 | .nf | |
1952 | ||
1953 | %% | |
1954 | asm\\n | | |
1955 | auto\\n | | |
1956 | break\\n | | |
1957 | ... etc ... | |
1958 | volatile\\n | | |
1959 | while\\n /* it's a keyword */ | |
1960 | ||
1961 | [a-z]+\\n | | |
1962 | .|\\n /* it's not a keyword */ | |
1963 | ||
1964 | .fi | |
1965 | One has to be careful here, as we have now reintroduced backtracking | |
1966 | into the scanner. In particular, while | |
1967 | .I we | |
1968 | know that there will never be any characters in the input stream | |
1969 | other than letters or newlines, | |
1970 | .I flex | |
1971 | can't figure this out, and it will plan for possibly needing backtracking | |
1972 | when it has scanned a token like "auto" and then the next character | |
1973 | is something other than a newline or a letter. Previously it would | |
1974 | then just match the "auto" rule and be done, but now it has no "auto" | |
1975 | rule, only an "auto\\n" rule. To eliminate the possibility of backtracking, | |
1976 | we could either duplicate all rules but without final newlines, or, | |
1977 | since we never expect to encounter such an input and therefore don't | |
1978 | care how it's classified, we can introduce one more catch-all rule, this | |
1979 | one which doesn't include a newline: | |
1980 | .nf | |
1981 | ||
1982 | %% | |
1983 | asm\\n | | |
1984 | auto\\n | | |
1985 | break\\n | | |
1986 | ... etc ... | |
1987 | volatile\\n | | |
1988 | while\\n /* it's a keyword */ | |
1989 | ||
1990 | [a-z]+\\n | | |
1991 | [a-z]+ | | |
1992 | .|\\n /* it's not a keyword */ | |
1993 | ||
1994 | .fi | |
1995 | Compiled with | |
1996 | .B -Cf, | |
1997 | this is about as fast as one can get a | |
1998 | .I flex | |
1999 | scanner to go for this particular problem. | |
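The halving of the match count described above can be checked with a small hand-written model. This is only a sketch in plain C, not flex output: the function names matches_separate() and matches_merged() are invented for illustration, and the code assumes an ASCII character set so that 'a' through 'z' are contiguous.

```c
#include <stddef.h>

/* Hypothetical model of the rule set "[a-z]+ | .|\n":
 * one match per maximal run of lowercase letters, and one match
 * for every other character (including each newline). */
static size_t matches_separate(const char *s)
{
    size_t n = 0;
    while (*s) {
        if (*s >= 'a' && *s <= 'z') {
            while (*s >= 'a' && *s <= 'z')
                ++s;
        } else {
            ++s;
        }
        ++n;
    }
    return n;
}

/* Hypothetical model of the rule set "[a-z]+\n | [a-z]+ | .|\n":
 * a newline that directly follows a word is folded into that word's
 * match, so a well-formed line costs one match instead of two. */
static size_t matches_merged(const char *s)
{
    size_t n = 0;
    while (*s) {
        if (*s >= 'a' && *s <= 'z') {
            while (*s >= 'a' && *s <= 'z')
                ++s;
            if (*s == '\n')
                ++s;
        } else {
            ++s;
        }
        ++n;
    }
    return n;
}
```

For an input with exactly one word per line, the second function reports half as many matches as the first, which is the saving the merged rules aim for.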
2000 | .LP | |
2001 | A final note: | |
2002 | .I flex | |
2003 | is slow when matching NUL's, particularly when a token contains | |
2004 | multiple NUL's. | |
2005 | It's best to write rules which match | |
2006 | .I short | |
2007 | amounts of text if it's anticipated that the text will often include NUL's. | |
2008 | .SH INCOMPATIBILITIES WITH LEX AND POSIX | |
2009 | .I flex | |
2010 | is a rewrite of the Unix | |
2011 | .I lex | |
2012 | tool (the two implementations do not share any code, though), | |
2013 | with some extensions and incompatibilities, both of which | |
2014 | are of concern to those who wish to write scanners acceptable | |
2015 | to either implementation. At present, the POSIX | |
2016 | .I lex | |
2017 | draft is | |
2018 | very close to the original | |
2019 | .I lex | |
2020 | implementation, so some of these | |
2021 | incompatibilities are also in conflict with the POSIX draft. But | |
2022 | the intent is that except as noted below, | |
2023 | .I flex | |
2024 | as it presently stands will | |
2025 | ultimately be POSIX conformant (i.e., that those areas of conflict with | |
2026 | the POSIX draft will be resolved in | |
2027 | .I flex's | |
2028 | favor). Please bear in | |
2029 | mind that all the comments which follow are with regard to the POSIX | |
2030 | .I draft | |
2031 | standard of Summer 1989, and not the final document (or subsequent | |
2032 | drafts); they are included so | |
2033 | .I flex | |
2034 | users can be aware of the standardization issues and those areas where | |
2035 | .I flex | |
2036 | may in the near future undergo changes incompatible with | |
2037 | its current definition. | |
2038 | .LP | |
2039 | .I flex | |
2040 | is fully compatible with | |
2041 | .I lex | |
2042 | with the following exceptions: | |
2043 | .IP - | |
2044 | .I lex | |
2045 | does not support exclusive start conditions (%x), though they | |
2046 | are in the current POSIX draft. | |
2047 | .IP - | |
2048 | When definitions are expanded, | |
2049 | .I flex | |
2050 | encloses them in parentheses. | |
2051 | With lex, the following: | |
2052 | .nf | |
2053 | ||
2054 | NAME [A-Z][A-Z0-9]* | |
2055 | %% | |
2056 | foo{NAME}? printf( "Found it\\n" ); | |
2057 | %% | |
2058 | ||
2059 | .fi | |
2060 | will not match the string "foo" because when the macro | |
2061 | is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?" | |
2062 | and the precedence is such that the '?' is associated with | |
2063 | "[A-Z0-9]*". With | |
2064 | .I flex, | |
2065 | the rule will be expanded to | |
2066 | "foo([A-Z][A-Z0-9]*)?" and so the string "foo" will match. | |
2067 | Note that because of this, the | |
2068 | .B ^, $, <s>, /, | |
2069 | and | |
2070 | .B <<EOF>> | |
2071 | operators cannot be used in a | |
2072 | .I flex | |
2073 | definition. | |
2074 | .IP | |
2075 | The POSIX draft interpretation is the same as | |
2076 | .I flex's. | |
2077 | .IP - | |
2078 | To specify a character class which matches anything but a right bracket (']'), | |
2079 | in | |
2080 | .I lex | |
2081 | one can use "[^]]" but with | |
2082 | .I flex | |
2083 | one must use "[^\\]]". The latter works with | |
2084 | .I lex, | |
2085 | too. | |
2086 | .IP - | |
2087 | The undocumented | |
2088 | .I lex | |
2089 | scanner internal variable | |
2090 | .B yylineno | |
2091 | is not supported. (The variable is not part of the POSIX draft.) | |
2092 | .IP - | |
2093 | The | |
2094 | .B input() | |
2095 | routine is not redefinable, though it may be called to read characters | |
2096 | following whatever has been matched by a rule. If | |
2097 | .B input() | |
2098 | encounters an end-of-file the normal | |
2099 | .B yywrap() | |
2100 | processing is done. A ``real'' end-of-file is returned by | |
2101 | .B input() | |
2102 | as | |
2103 | .I EOF. | |
2104 | .IP | |
2105 | Input is instead controlled by redefining the | |
2106 | .B YY_INPUT | |
2107 | macro. | |
2108 | .IP | |
2109 | The | |
2110 | .I flex | |
2111 | restriction that | |
2112 | .B input() | |
2113 | cannot be redefined is in accordance with the POSIX draft, but | |
2114 | .B YY_INPUT | |
2115 | has not yet been accepted into the draft. | |
2116 | .IP - | |
2117 | .B output() | |
2118 | is not supported. | |
2119 | Output from the | |
2120 | .B ECHO | |
2121 | macro is done to the file-pointer | |
2122 | .I yyout | |
2123 | (default | |
2124 | .I stdout). | |
2125 | .IP | |
2126 | The POSIX draft mentions that an | |
2127 | .B output() | |
2128 | routine exists but currently gives no details as to what it does. | |
2129 | .IP - | |
2130 | The | |
2131 | .I lex | |
2132 | .B %r | |
2133 | (generate a Ratfor scanner) option is not supported. It is not part | |
2134 | of the POSIX draft. | |
2135 | .IP - | |
2136 | If you are providing your own yywrap() routine, you must include a | |
2137 | "#undef yywrap" in the definitions section (section 1). Note that | |
2138 | the "#undef" will have to be enclosed in %{}'s. | |
2139 | .IP | |
2140 | The POSIX draft | |
2141 | specifies that yywrap() is a function and this is unlikely to change; so | |
2142 | .I flex | |
2143 | users are warned that | |
2144 | .B yywrap() | |
2145 | is likely to be changed to a function in the near future. | |
2146 | .IP - | |
2147 | After a call to | |
2148 | .B unput(), | |
2149 | .I yytext | |
2150 | and | |
2151 | .I yyleng | |
2152 | are undefined until the next token is matched. This is not the case with | |
2153 | .I lex | |
2154 | or the present POSIX draft. | |
2155 | .IP - | |
2156 | The precedence of the | |
2157 | .B {} | |
2158 | (numeric range) operator is different. | |
2159 | .I lex | |
2160 | interprets "abc{1,3}" as "match one, two, or | |
2161 | three occurrences of 'abc'", whereas | |
2162 | .I flex | |
2163 | interprets it as "match 'ab' | |
2164 | followed by one, two, or three occurrences of 'c'". The latter is | |
2165 | in agreement with the current POSIX draft. | |
2166 | .IP - | |
2167 | The precedence of the | |
2168 | .B ^ | |
2169 | operator is different. | |
2170 | .I lex | |
2171 | interprets "^foo|bar" as "match either 'foo' at the beginning of a line, | |
2172 | or 'bar' anywhere", whereas | |
2173 | .I flex | |
2174 | interprets it as "match either 'foo' or 'bar' if they come at the beginning | |
2175 | of a line". The latter is in agreement with the current POSIX draft. | |
2176 | .IP - | |
2177 | To refer to yytext outside of the scanner source file, | |
2178 | the correct definition with | |
2179 | .I flex | |
2180 | is "extern char *yytext" rather than "extern char yytext[]". | |
2181 | This is contrary to the current POSIX draft but a point on which | |
2182 | .I flex | |
2183 | will not be changing, as the array representation entails a | |
2184 | serious performance penalty. It is hoped that the POSIX draft will | |
2185 | be emended to support the | |
2186 | .I flex | |
2187 | variety of declaration (as this is a fairly painless change to | |
2188 | require of | |
2189 | .I lex | |
2190 | users). | |
2191 | .IP - | |
2192 | .I yyin | |
2193 | is | |
2194 | .I initialized | |
2195 | by | |
2196 | .I lex | |
2197 | to be | |
2198 | .I stdin; | |
2199 | .I flex, | |
2200 | on the other hand, | |
2201 | initializes | |
2202 | .I yyin | |
2203 | to NULL | |
2204 | and then | |
2205 | .I assigns | |
2206 | it to | |
2207 | .I stdin | |
2208 | the first time the scanner is called, providing | |
2209 | .I yyin | |
2210 | has not already been assigned to a non-NULL value. The difference is | |
2211 | subtle, but the net effect is that with | |
2212 | .I flex | |
2213 | scanners, | |
2214 | .I yyin | |
2215 | does not have a valid value until the scanner has been called. | |
2216 | .IP - | |
2217 | The special table-size declarations such as | |
2218 | .B %a | |
2219 | supported by | |
2220 | .I lex | |
2221 | are not required by | |
2222 | .I flex | |
2223 | scanners; | |
2224 | .I flex | |
2225 | ignores them. | |
2226 | .IP - | |
2227 | The name | |
2228 | .B | |
2229 | FLEX_SCANNER | |
2230 | is #define'd so scanners may be written for use with either | |
2231 | .I flex | |
2232 | or | |
2233 | .I lex. | |
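As a rough illustration of the last point, code shared between the two generators can test FLEX_SCANNER with the preprocessor. The sketch below is hypothetical: the helper scanner_generator() is invented here and is not part of flex or lex. Compiled on its own, outside a flex-generated file, FLEX_SCANNER is undefined and the lex branch is taken.

```c
#include <stdio.h>

/* Hypothetical helper (not part of flex or lex): reports which
 * generator produced the enclosing scanner.  flex #define's
 * FLEX_SCANNER; lex does not. */
static const char *scanner_generator(void)
{
#ifdef FLEX_SCANNER
    return "flex";
#else
    return "lex";
#endif
}

/* The same test can select the yytext declaration discussed in the
 * list above when yytext is referenced from another source file. */
#ifdef FLEX_SCANNER
extern char *yytext;    /* flex: yytext is a pointer */
#else
extern char yytext[];   /* lex: yytext is an array */
#endif
```

The declarations are only referenced, never defined, so the fragment links cleanly as long as nothing in the file actually uses yytext without a scanner being linked in.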
2234 | .LP | |
2235 | The following | |
2236 | .I flex | |
2237 | features are not included in | |
2238 | .I lex | |
2239 | or the POSIX draft standard: | |
2240 | .nf | |
2241 | ||
2242 | yyterminate() | |
2243 | <<EOF>> | |
2244 | YY_DECL | |
2245 | #line directives | |
2246 | %{}'s around actions | |
2247 | yyrestart() | |
2248 | comments beginning with '#' (deprecated) | |
2249 | multiple actions on a line | |
2250 | ||
2251 | .fi | |
2252 | This last feature refers to the fact that with | |
2253 | .I flex | |
2254 | you can put multiple actions on the same line, separated with | |
2255 | semi-colons, while with | |
2256 | .I lex, | |
2257 | the following | |
2258 | .nf | |
2259 | ||
2260 | foo handle_foo(); ++num_foos_seen; | |
2261 | ||
2262 | .fi | |
2263 | is (rather surprisingly) truncated to | |
2264 | .nf | |
2265 | ||
2266 | foo handle_foo(); | |
2267 | ||
2268 | .fi | |
2269 | .I flex | |
2270 | does not truncate the action. Actions that are not enclosed in | |
2271 | braces are simply terminated at the end of the line. | |
2272 | .SH DIAGNOSTICS | |
2273 | .I reject_used_but_not_detected undefined | |
2274 | or | |
2275 | .I yymore_used_but_not_detected undefined - | |
2276 | These errors can occur at compile time. They indicate that the | |
2277 | scanner uses | |
2278 | .B REJECT | |
2279 | or | |
2280 | .B yymore() | |
2281 | but that | |
2282 | .I flex | |
2283 | failed to notice the fact, meaning that | |
2284 | .I flex | |
2285 | scanned the first two sections looking for occurrences of these actions | |
2286 | and failed to find any, but somehow you snuck some in (via a #include | |
2287 | file, for example). Make an explicit reference to the action in your | |
2288 | .I flex | |
2289 | input file. (Note that previously | |
2290 | .I flex | |
2291 | supported a | |
2292 | .B %used/%unused | |
2293 | mechanism for dealing with this problem; this feature is still supported | |
2294 | but now deprecated, and will go away soon unless the author hears from | |
2295 | people who can argue compellingly that they need it.) | |
2296 | .LP | |
2297 | .I flex scanner jammed - | |
2298 | a scanner compiled with | |
2299 | .B -s | |
2300 | has encountered an input string which wasn't matched by | |
2301 | any of its rules. | |
2302 | .LP | |
2303 | .I flex input buffer overflowed - | |
2304 | a scanner rule matched a string long enough to overflow the | |
2305 | scanner's internal input buffer (16K bytes by default - controlled by | |
2306 | .B YY_BUF_SIZE | |
2307 | in "flex.skel". Note that to redefine this macro, you must first | |
2308 | .B #undef | |
2309 | it). | |
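A minimal sketch of the redefinition, using the C preprocessor's #undef directive. The first #define below merely stands in for the default that the skeleton supplies, the value 32768 is an arbitrary example, and scanner_buffer_size() is a hypothetical helper used only to make the result visible.

```c
#include <stddef.h>

/* Stand-in for the skeleton's default; in a real scanner this
 * definition comes from flex.skel, not from user code. */
#define YY_BUF_SIZE 16384

/* User code: remove the existing definition first, then supply the
 * new one.  Redefining the macro to a different value without the
 * #undef is not valid C. */
#undef YY_BUF_SIZE
#define YY_BUF_SIZE 32768

/* Hypothetical helper: exposes the value the scanner would see. */
static size_t scanner_buffer_size(void)
{
    return (size_t) YY_BUF_SIZE;
}
```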
2310 | .LP | |
2311 | .I scanner requires -8 flag - | |
2312 | Your scanner specification includes recognizing 8-bit characters and | |
2313 | you did not specify the -8 flag (and your site has not installed flex | |
2314 | with -8 as the default). | |
2315 | .LP | |
2316 | .I too many %t classes! - | |
2317 | You managed to put every single character into its own %t class. | |
2318 | .I flex | |
2319 | requires that at least one of the classes share characters. | |
2320 | .SH DEFICIENCIES / BUGS | |
2321 | See flex(1). | |
2322 | .SH "SEE ALSO" | |
2323 | .LP | |
2324 | flex(1), lex(1), yacc(1), sed(1), awk(1). | |
2325 | .LP | |
2326 | M. E. Lesk and E. Schmidt, | |
2327 | .I LEX - Lexical Analyzer Generator | |
2328 | .SH AUTHOR | |
2329 | Vern Paxson, with the help of many ideas and much inspiration from | |
2330 | Van Jacobson. Original version by Jef Poskanzer. The fast table | |
2331 | representation is a partial implementation of a design done by Van | |
2332 | Jacobson. The implementation was done by Kevin Gong and Vern Paxson. | |
2333 | .LP | |
2334 | Thanks to the many | |
2335 | .I flex | |
2336 | beta-testers, feedbackers, and contributors, especially Casey | |
2337 | Leedom, benson@odi.com, | |
2338 | Frederic Brehm, Nick Christopher, Jason Coughlin, | |
2339 | Scott David Daniels, Leo Eskin, | |
2340 | Chris Faylor, Eric Goldman, Eric | |
2341 | Hughes, Jeffrey R. Jones, Kevin B. Kenny, Ronald Lamprecht, | |
2342 | Greg Lee, Craig Leres, Mohamed el Lozy, Jim Meyering, Marc Nozell, Esmond Pitt, | |
2343 | Jef Poskanzer, Jim Roskind, | |
2344 | Dave Tallman, Frank Whaley, Ken Yap, and those whose names | |
2345 | have slipped my marginal mail-archiving skills but whose contributions | |
2346 | are appreciated all the same. | |
2347 | .LP | |
2348 | Thanks to Keith Bostic, John Gilmore, Craig Leres, Bob | |
2349 | Mulcahy, Rich Salz, and Richard Stallman for help with various distribution | |
2350 | headaches. | |
2351 | .LP | |
2352 | Thanks to Esmond Pitt and Earle Horton for 8-bit character support; | |
2353 | to Benson Margulies and Fred | |
2354 | Burke for C++ support; to Ove Ewerlid for the basics of support for | |
2355 | NUL's; and to Eric Hughes for the basics of support for multiple buffers. | |
2356 | .LP | |
2357 | Work is being done on extending | |
2358 | .I flex | |
2359 | to generate scanners in which the | |
2360 | state machine is directly represented in C code rather than tables. | |
2361 | These scanners may well be substantially faster than those generated | |
2362 | using -f or -F. If you are working in this area and are interested | |
2363 | in comparing notes and seeing whether redundant work can be avoided, | |
2364 | contact Ove Ewerlid (ewerlid@mizar.DoCS.UU.SE). | |
2365 | .LP | |
2366 | This work was primarily done when I was at the Real Time Systems Group | |
2367 | at the Lawrence Berkeley Laboratory in Berkeley, CA. Many thanks to all there | |
2368 | for the support I received. | |
2369 | .LP | |
2370 | Send comments to: | |
2371 | .nf | |
2372 | ||
2373 | Vern Paxson | |
2374 | Computer Science Department | |
2375 | 4126 Upson Hall | |
2376 | Cornell University | |
2377 | Ithaca, NY 14853-7501 | |
2378 | ||
2379 | vern@cs.cornell.edu | |
2380 | decvax!cornell!vern | |
2381 | ||
2382 | .fi |