Commit | Line | Data |
---|---|---|
fc272959 WJ |
1 | .\" Copyright (c) 1991 The Regents of the University of California. |
2 | .\" All rights reserved. | |
3 | .\" | |
4 | .\" new copyright; att/bsd/shared | |
5 | .\" This code is derived from software contributed to Berkeley by | |
6 | .\" Vern Paxson of Lawrence Berkeley Laboratory. | |
7 | .\" | |
8 | .\" The United States Government has rights in this work pursuant | |
9 | .\" to contract no. DE-AC03-76SF00098 between the United States | |
10 | .\" Department of Energy and the University of California. | |
11 | .\" | |
12 | .\" Redistribution and use in source and binary forms, with or without | |
13 | .\" modification, are permitted provided that the following conditions | |
14 | .\" are met: | |
15 | .\" 1. Redistributions of source code must retain the above copyright | |
16 | .\" notice, this list of conditions and the following disclaimer. | |
17 | .\" 2. Redistributions in binary form must reproduce the above copyright | |
18 | .\" notice, this list of conditions and the following disclaimer in the | |
19 | .\" documentation and/or other materials provided with the distribution. | |
20 | .\" 3. All advertising materials mentioning features or use of this software | |
21 | .\" must display the following acknowledgement: | |
22 | .\" This product includes software developed by the University of | |
23 | .\" California, Berkeley and its contributors. | |
24 | .\" 4. Neither the name of the University nor the names of its contributors | |
25 | .\" may be used to endorse or promote products derived from this software | |
26 | .\" without specific prior written permission. | |
27 | .\" | |
28 | .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND | |
29 | .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE | |
30 | .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE | |
31 | .\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE | |
32 | .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL | |
33 | .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS | |
34 | .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) | |
35 | .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT | |
36 | .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY | |
37 | .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF | |
38 | .\" SUCH DAMAGE. | |
39 | .\" | |
40 | .\" @(#)flexdoc.1 5.2 (Berkeley) 4/12/91 | |
41 | .\" | |
42 | .TH FLEX 1 "April 12, 1991" | |
43 | .SH NAME | |
44 | flex - fast lexical analyzer generator | |
45 | .SH SYNOPSIS | |
46 | .B flex | |
47 | .B [-bcdfinpstvFILT8 -C[efmF] -Sskeleton] | |
48 | .I [filename ...] | |
49 | .SH DESCRIPTION | |
50 | .I flex | |
51 | is a tool for generating | |
52 | .I scanners: | |
53 | programs which recognized lexical patterns in text. | |
54 | .I flex | |
55 | reads | |
56 | the given input files, or its standard input if no file names are given, | |
57 | for a description of a scanner to generate. The description is in | |
58 | the form of pairs | |
59 | of regular expressions and C code, called | |
60 | .I rules. flex | |
61 | generates as output a C source file, | |
62 | .B lex.yy.c, | |
63 | which defines a routine | |
64 | .B yylex(). | |
65 | This file is compiled and linked with the | |
66 | .B -lfl | |
67 | library to produce an executable. When the executable is run, | |
68 | it analyzes its input for occurrences | |
69 | of the regular expressions. Whenever it finds one, it executes | |
70 | the corresponding C code. | |
71 | .SH SOME SIMPLE EXAMPLES | |
72 | .LP | |
73 | First some simple examples to get the flavor of how one uses | |
74 | .I flex. | |
75 | The following | |
76 | .I flex | |
77 | input specifies a scanner which whenever it encounters the string | |
78 | "username" will replace it with the user's login name: | |
79 | .nf | |
80 | ||
81 | %% | |
82 | username printf( "%s", getlogin() ); | |
83 | ||
84 | .fi | |
85 | By default, any text not matched by a | |
86 | .I flex | |
87 | scanner | |
88 | is copied to the output, so the net effect of this scanner is | |
89 | to copy its input file to its output with each occurrence | |
90 | of "username" expanded. | |
91 | In this input, there is just one rule. "username" is the | |
92 | .I pattern | |
93 | and the "printf" is the | |
94 | .I action. | |
95 | The "%%" marks the beginning of the rules. | |
96 | .LP | |
97 | Here's another simple example: | |
98 | .nf | |
99 | ||
100 | int num_lines = 0, num_chars = 0; | |
101 | ||
102 | %% | |
103 | \\n ++num_lines; ++num_chars; | |
104 | . ++num_chars; | |
105 | ||
106 | %% | |
107 | main() | |
108 | { | |
109 | yylex(); | |
110 | printf( "# of lines = %d, # of chars = %d\\n", | |
111 | num_lines, num_chars ); | |
112 | } | |
113 | ||
114 | .fi | |
115 | This scanner counts the number of characters and the number | |
116 | of lines in its input (it produces no output other than the | |
117 | final report on the counts). The first line | |
118 | declares two globals, "num_lines" and "num_chars", which are accessible | |
119 | both inside | |
120 | .B yylex() | |
121 | and in the | |
122 | .B main() | |
123 | routine declared after the second "%%". There are two rules, one | |
124 | which matches a newline ("\\n") and increments both the line count and | |
125 | the character count, and one which matches any character other than | |
126 | a newline (indicated by the "." regular expression). | |
127 | .LP | |
128 | A somewhat more complicated example: | |
129 | .nf | |
130 | ||
131 | /* scanner for a toy Pascal-like language */ | |
132 | ||
133 | %{ | |
134 | /* need this for the call to atof() below */ | |
135 | #include <math.h> | |
136 | %} | |
137 | ||
138 | DIGIT [0-9] | |
139 | ID [a-z][a-z0-9]* | |
140 | ||
141 | %% | |
142 | ||
143 | {DIGIT}+ { | |
144 | printf( "An integer: %s (%d)\\n", yytext, | |
145 | atoi( yytext ) ); | |
146 | } | |
147 | ||
148 | {DIGIT}+"."{DIGIT}* { | |
149 | printf( "A float: %s (%d)\\n", yytext, | |
150 | atof( yytext ) ); | |
151 | } | |
152 | ||
153 | if|then|begin|end|procedure|function { | |
154 | printf( "A keyword: %s\\n", yytext ); | |
155 | } | |
156 | ||
157 | {ID} printf( "An identifier: %s\\n", yytext ); | |
158 | ||
159 | "+"|"-"|"*"|"/" printf( "An operator: %s\\n", yytext ); | |
160 | ||
161 | "{"[^}\\n]*"}" /* eat up one-line comments */ | |
162 | ||
163 | [ \\t\\n]+ /* eat up whitespace */ | |
164 | ||
165 | . printf( "Unrecognized character: %s\\n", yytext ); | |
166 | ||
167 | %% | |
168 | ||
169 | main( argc, argv ) | |
170 | int argc; | |
171 | char **argv; | |
172 | { | |
173 | ++argv, --argc; /* skip over program name */ | |
174 | if ( argc > 0 ) | |
175 | yyin = fopen( argv[0], "r" ); | |
176 | else | |
177 | yyin = stdin; | |
178 | ||
179 | yylex(); | |
180 | } | |
181 | ||
182 | .fi | |
183 | This is the beginnings of a simple scanner for a language like | |
184 | Pascal. It identifies different types of | |
185 | .I tokens | |
186 | and reports on what it has seen. | |
187 | .LP | |
188 | The details of this example will be explained in the following | |
189 | sections. | |
190 | .SH FORMAT OF THE INPUT FILE | |
191 | The | |
192 | .I flex | |
193 | input file consists of three sections, separated by a line with just | |
194 | .B %% | |
195 | in it: | |
196 | .nf | |
197 | ||
198 | definitions | |
199 | %% | |
200 | rules | |
201 | %% | |
202 | user code | |
203 | ||
204 | .fi | |
205 | The | |
206 | .I definitions | |
207 | section contains declarations of simple | |
208 | .I name | |
209 | definitions to simplify the scanner specification, and declarations of | |
210 | .I start conditions, | |
211 | which are explained in a later section. | |
212 | .LP | |
213 | Name definitions have the form: | |
214 | .nf | |
215 | ||
216 | name definition | |
217 | ||
218 | .fi | |
219 | The "name" is a word beginning with a letter or an underscore ('_') | |
220 | followed by zero or more letters, digits, '_', or '-' (dash). | |
221 | The definition is taken to begin at the first non-white-space character | |
222 | following the name and continuing to the end of the line. | |
223 | The definition can subsequently be referred to using "{name}", which | |
224 | will expand to "(definition)". For example, | |
225 | .nf | |
226 | ||
227 | DIGIT [0-9] | |
228 | ID [a-z][a-z0-9]* | |
229 | ||
230 | .fi | |
231 | defines "DIGIT" to be a regular expression which matches a | |
232 | single digit, and | |
233 | "ID" to be a regular expression which matches a letter | |
234 | followed by zero-or-more letters-or-digits. | |
235 | A subsequent reference to | |
236 | .nf | |
237 | ||
238 | {DIGIT}+"."{DIGIT}* | |
239 | ||
240 | .fi | |
241 | is identical to | |
242 | .nf | |
243 | ||
244 | ([0-9])+"."([0-9])* | |
245 | ||
246 | .fi | |
247 | and matches one-or-more digits followed by a '.' followed | |
248 | by zero-or-more digits. | |
249 | .LP | |
250 | The | |
251 | .I rules | |
252 | section of the | |
253 | .I flex | |
254 | input contains a series of rules of the form: | |
255 | .nf | |
256 | ||
257 | pattern action | |
258 | ||
259 | .fi | |
260 | where the pattern must be unindented and the action must begin | |
261 | on the same line. | |
262 | .LP | |
263 | See below for a further description of patterns and actions. | |
264 | .LP | |
265 | Finally, the user code section is simply copied to | |
266 | .B lex.yy.c | |
267 | verbatim. | |
268 | It is used for companion routines which call or are called | |
269 | by the scanner. The presence of this section is optional; | |
270 | if it is missing, the second | |
271 | .B %% | |
272 | in the input file may be skipped, too. | |
273 | .LP | |
274 | In the definitions and rules sections, any | |
275 | .I indented | |
276 | text or text enclosed in | |
277 | .B %{ | |
278 | and | |
279 | .B %} | |
280 | is copied verbatim to the output (with the %{}'s removed). | |
281 | The %{}'s must appear unindented on lines by themselves. | |
282 | .LP | |
283 | In the rules section, | |
284 | any indented or %{} text appearing before the | |
285 | first rule may be used to declare variables | |
286 | which are local to the scanning routine and (after the declarations) | |
287 | code which is to be executed whenever the scanning routine is entered. | |
288 | Other indented or %{} text in the rule section is still copied to the output, | |
289 | but its meaning is not well-defined and it may well cause compile-time | |
290 | errors (this feature is present for | |
291 | .I POSIX | |
292 | compliance; see below for other such features). | |
293 | .LP | |
294 | In the definitions section, an unindented comment (i.e., a line | |
295 | beginning with "/*") is also copied verbatim to the output up | |
296 | to the next "*/". Also, any line in the definitions section | |
297 | beginning with '#' is ignored, though this style of comment is | |
298 | deprecated and may go away in the future. | |
299 | .SH PATTERNS | |
300 | The patterns in the input are written using an extended set of regular | |
301 | expressions. These are: | |
302 | .nf | |
303 | ||
304 | x match the character 'x' | |
305 | . any character except newline | |
306 | [xyz] a "character class"; in this case, the pattern | |
307 | matches either an 'x', a 'y', or a 'z' | |
308 | [abj-oZ] a "character class" with a range in it; matches | |
309 | an 'a', a 'b', any letter from 'j' through 'o', | |
310 | or a 'Z' | |
311 | [^A-Z] a "negated character class", i.e., any character | |
312 | but those in the class. In this case, any | |
313 | character EXCEPT an uppercase letter. | |
314 | [^A-Z\\n] any character EXCEPT an uppercase letter or | |
315 | a newline | |
316 | r* zero or more r's, where r is any regular expression | |
317 | r+ one or more r's | |
318 | r? zero or one r's (that is, "an optional r") | |
319 | r{2,5} anywhere from two to five r's | |
320 | r{2,} two or more r's | |
321 | r{4} exactly 4 r's | |
322 | {name} the expansion of the "name" definition | |
323 | (see above) | |
324 | "[xyz]\\"foo" | |
325 | the literal string: [xyz]"foo | |
326 | \\X if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', | |
327 | then the ANSI-C interpretation of \\x. | |
328 | Otherwise, a literal 'X' (used to escape | |
329 | operators such as '*') | |
330 | \\123 the character with octal value 123 | |
331 | \\x2a the character with hexadecimal value 2a | |
332 | (r) match an r; parentheses are used to override | |
333 | precedence (see below) | |
334 | ||
335 | ||
336 | rs the regular expression r followed by the | |
337 | regular expression s; called "concatenation" | |
338 | ||
339 | ||
340 | r|s either an r or an s | |
341 | ||
342 | ||
343 | r/s an r but only if it is followed by an s. The | |
344 | s is not part of the matched text. This type | |
345 | of pattern is called as "trailing context". | |
346 | ^r an r, but only at the beginning of a line | |
347 | r$ an r, but only at the end of a line. Equivalent | |
348 | to "r/\\n". | |
349 | ||
350 | ||
351 | <s>r an r, but only in start condition s (see | |
352 | below for discussion of start conditions) | |
353 | <s1,s2,s3>r | |
354 | same, but in any of start conditions s1, | |
355 | s2, or s3 | |
356 | ||
357 | ||
358 | <<EOF>> an end-of-file | |
359 | <s1,s2><<EOF>> | |
360 | an end-of-file when in start condition s1 or s2 | |
361 | ||
362 | .fi | |
363 | The regular expressions listed above are grouped according to | |
364 | precedence, from highest precedence at the top to lowest at the bottom. | |
365 | Those grouped together have equal precedence. For example, | |
366 | .nf | |
367 | ||
368 | foo|bar* | |
369 | ||
370 | .fi | |
371 | is the same as | |
372 | .nf | |
373 | ||
374 | (foo)|(ba(r*)) | |
375 | ||
376 | .fi | |
377 | since the '*' operator has higher precedence than concatenation, | |
378 | and concatenation higher than alternation ('|'). This pattern | |
379 | therefore matches | |
380 | .I either | |
381 | the string "foo" | |
382 | .I or | |
383 | the string "ba" followed by zero-or-more r's. | |
384 | To match "foo" or zero-or-more "bar"'s, use: | |
385 | .nf | |
386 | ||
387 | foo|(bar)* | |
388 | ||
389 | .fi | |
390 | and to match zero-or-more "foo"'s-or-"bar"'s: | |
391 | .nf | |
392 | ||
393 | (foo|bar)* | |
394 | ||
395 | .fi | |
396 | .LP | |
397 | Some notes on patterns: | |
398 | .IP - | |
399 | A negated character class such as the example "[^A-Z]" | |
400 | above | |
401 | .I will match a newline | |
402 | unless "\\n" (or an equivalent escape sequence) is one of the | |
403 | characters explicitly present in the negated character class | |
404 | (e.g., "[^A-Z\\n]"). This is unlike how many other regular | |
405 | expression tools treat negated character classes, but unfortunately | |
406 | the inconsistency is historically entrenched. | |
407 | Matching newlines means that a pattern like [^"]* can match an entire | |
408 | input (overflowing the scanner's input buffer) unless there's another | |
409 | quote in the input. | |
410 | .IP - | |
411 | A rule can have at most one instance of trailing context (the '/' operator | |
412 | or the '$' operator). The start condition, '^', and "<<EOF>>" patterns | |
413 | can only occur at the beginning of a pattern, and, as well as with '/' and '$', | |
414 | cannot be grouped inside parentheses. A '^' which does not occur at | |
415 | the beginning of a rule or a '$' which does not occur at the end of | |
416 | a rule loses its special properties and is treated as a normal character. | |
417 | .IP | |
418 | The following are illegal: | |
419 | .nf | |
420 | ||
421 | foo/bar$ | |
422 | <sc1>foo<sc2>bar | |
423 | ||
424 | .fi | |
425 | Note that the first of these, can be written "foo/bar\\n". | |
426 | .IP | |
427 | The following will result in '$' or '^' being treated as a normal character: | |
428 | .nf | |
429 | ||
430 | foo|(bar$) | |
431 | foo|^bar | |
432 | ||
433 | .fi | |
434 | If what's wanted is a "foo" or a bar-followed-by-a-newline, the following | |
435 | could be used (the special '|' action is explained below): | |
436 | .nf | |
437 | ||
438 | foo | | |
439 | bar$ /* action goes here */ | |
440 | ||
441 | .fi | |
442 | A similar trick will work for matching a foo or a | |
443 | bar-at-the-beginning-of-a-line. | |
444 | .SH HOW THE INPUT IS MATCHED | |
445 | When the generated scanner is run, it analyzes its input looking | |
446 | for strings which match any of its patterns. If it finds more than | |
447 | one match, it takes the one matching the most text (for trailing | |
448 | context rules, this includes the length of the trailing part, even | |
449 | though it will then be returned to the input). If it finds two | |
450 | or more matches of the same length, the | |
451 | rule listed first in the | |
452 | .I flex | |
453 | input file is chosen. | |
454 | .LP | |
455 | Once the match is determined, the text corresponding to the match | |
456 | (called the | |
457 | .I token) | |
458 | is made available in the global character pointer | |
459 | .B yytext, | |
460 | and its length in the global integer | |
461 | .B yyleng. | |
462 | The | |
463 | .I action | |
464 | corresponding to the matched pattern is then executed (a more | |
465 | detailed description of actions follows), and then the remaining | |
466 | input is scanned for another match. | |
467 | .LP | |
468 | If no match is found, then the | |
469 | .I default rule | |
470 | is executed: the next character in the input is considered matched and | |
471 | copied to the standard output. Thus, the simplest legal | |
472 | .I flex | |
473 | input is: | |
474 | .nf | |
475 | ||
476 | %% | |
477 | ||
478 | .fi | |
479 | which generates a scanner that simply copies its input (one character | |
480 | at a time) to its output. | |
481 | .SH ACTIONS | |
482 | Each pattern in a rule has a corresponding action, which can be any | |
483 | arbitrary C statement. The pattern ends at the first non-escaped | |
484 | whitespace character; the remainder of the line is its action. If the | |
485 | action is empty, then when the pattern is matched the input token | |
486 | is simply discarded. For example, here is the specification for a program | |
487 | which deletes all occurrences of "zap me" from its input: | |
488 | .nf | |
489 | ||
490 | %% | |
491 | "zap me" | |
492 | ||
493 | .fi | |
494 | (It will copy all other characters in the input to the output since | |
495 | they will be matched by the default rule.) | |
496 | .LP | |
497 | Here is a program which compresses multiple blanks and tabs down to | |
498 | a single blank, and throws away whitespace found at the end of a line: | |
499 | .nf | |
500 | ||
501 | %% | |
502 | [ \\t]+ putchar( ' ' ); | |
503 | [ \\t]+$ /* ignore this token */ | |
504 | ||
505 | .fi | |
506 | .LP | |
507 | If the action contains a '{', then the action spans till the balancing '}' | |
508 | is found, and the action may cross multiple lines. | |
509 | .I flex | |
510 | knows about C strings and comments and won't be fooled by braces found | |
511 | within them, but also allows actions to begin with | |
512 | .B %{ | |
513 | and will consider the action to be all the text up to the next | |
514 | .B %} | |
515 | (regardless of ordinary braces inside the action). | |
516 | .LP | |
517 | An action consisting solely of a vertical bar ('|') means "same as | |
518 | the action for the next rule." See below for an illustration. | |
519 | .LP | |
520 | Actions can include arbitrary C code, including | |
521 | .B return | |
522 | statements to return a value to whatever routine called | |
523 | .B yylex(). | |
524 | Each time | |
525 | .B yylex() | |
526 | is called it continues processing tokens from where it last left | |
527 | off until it either reaches | |
528 | the end of the file or executes a return. Once it reaches an end-of-file, | |
529 | however, then any subsequent call to | |
530 | .B yylex() | |
531 | will simply immediately return, unless | |
532 | .B yyrestart() | |
533 | is first called (see below). | |
534 | .LP | |
535 | Actions are not allowed to modify yytext or yyleng. | |
536 | .LP | |
537 | There are a number of special directives which can be included within | |
538 | an action: | |
539 | .IP - | |
540 | .B ECHO | |
541 | copies yytext to the scanner's output. | |
542 | .IP - | |
543 | .B BEGIN | |
544 | followed by the name of a start condition places the scanner in the | |
545 | corresponding start condition (see below). | |
546 | .IP - | |
547 | .B REJECT | |
548 | directs the scanner to proceed on to the "second best" rule which matched the | |
549 | input (or a prefix of the input). The rule is chosen as described | |
550 | above in "How the Input is Matched", and | |
551 | .B yytext | |
552 | and | |
553 | .B yyleng | |
554 | set up appropriately. | |
555 | It may either be one which matched as much text | |
556 | as the originally chosen rule but came later in the | |
557 | .I flex | |
558 | input file, or one which matched less text. | |
559 | For example, the following will both count the | |
560 | words in the input and call the routine special() whenever "frob" is seen: | |
561 | .nf | |
562 | ||
563 | int word_count = 0; | |
564 | %% | |
565 | ||
566 | frob special(); REJECT; | |
567 | [^ \\t\\n]+ ++word_count; | |
568 | ||
569 | .fi | |
570 | Without the | |
571 | .B REJECT, | |
572 | any "frob"'s in the input would not be counted as words, since the | |
573 | scanner normally executes only one action per token. | |
574 | Multiple | |
575 | .B REJECT's | |
576 | are allowed, each one finding the next best choice to the currently | |
577 | active rule. For example, when the following scanner scans the token | |
578 | "abcd", it will write "abcdabcaba" to the output: | |
579 | .nf | |
580 | ||
581 | %% | |
582 | a | | |
583 | ab | | |
584 | abc | | |
585 | abcd ECHO; REJECT; | |
586 | .|\\n /* eat up any unmatched character */ | |
587 | ||
588 | .fi | |
589 | (The first three rules share the fourth's action since they use | |
590 | the special '|' action.) | |
591 | .B REJECT | |
592 | is a particularly expensive feature in terms scanner performance; | |
593 | if it is used in | |
594 | .I any | |
595 | of the scanner's actions it will slow down | |
596 | .I all | |
597 | of the scanner's matching. Furthermore, | |
598 | .B REJECT | |
599 | cannot be used with the | |
600 | .I -f | |
601 | or | |
602 | .I -F | |
603 | options (see below). | |
604 | .IP | |
605 | Note also that unlike the other special actions, | |
606 | .B REJECT | |
607 | is a | |
608 | .I branch; | |
609 | code immediately following it in the action will | |
610 | .I not | |
611 | be executed. | |
612 | .IP - | |
613 | .B yymore() | |
614 | tells the scanner that the next time it matches a rule, the corresponding | |
615 | token should be | |
616 | .I appended | |
617 | onto the current value of | |
618 | .B yytext | |
619 | rather than replacing it. For example, given the input "mega-kludge" | |
620 | the following will write "mega-mega-kludge" to the output: | |
621 | .nf | |
622 | ||
623 | %% | |
624 | mega- ECHO; yymore(); | |
625 | kludge ECHO; | |
626 | ||
627 | .fi | |
628 | First "mega-" is matched and echoed to the output. Then "kludge" | |
629 | is matched, but the previous "mega-" is still hanging around at the | |
630 | beginning of | |
631 | .B yytext | |
632 | so the | |
633 | .B ECHO | |
634 | for the "kludge" rule will actually write "mega-kludge". | |
635 | The presence of | |
636 | .B yymore() | |
637 | in the scanner's action entails a minor performance penalty in the | |
638 | scanner's matching speed. | |
639 | .IP - | |
640 | .B yyless(n) | |
641 | returns all but the first | |
642 | .I n | |
643 | characters of the current token back to the input stream, where they | |
644 | will be rescanned when the scanner looks for the next match. | |
645 | .B yytext | |
646 | and | |
647 | .B yyleng | |
648 | are adjusted appropriately (e.g., | |
649 | .B yyleng | |
650 | will now be equal to | |
651 | .I n | |
652 | ). For example, on the input "foobar" the following will write out | |
653 | "foobarbar": | |
654 | .nf | |
655 | ||
656 | %% | |
657 | foobar ECHO; yyless(3); | |
658 | [a-z]+ ECHO; | |
659 | ||
660 | .fi | |
661 | An argument of 0 to | |
662 | .B yyless | |
663 | will cause the entire current input string to be scanned again. Unless you've | |
664 | changed how the scanner will subsequently process its input (using | |
665 | .B BEGIN, | |
666 | for example), this will result in an endless loop. | |
667 | .IP - | |
668 | .B unput(c) | |
669 | puts the character | |
670 | .I c | |
671 | back onto the input stream. It will be the next character scanned. | |
672 | The following action will take the current token and cause it | |
673 | to be rescanned enclosed in parentheses. | |
674 | .nf | |
675 | ||
676 | { | |
677 | int i; | |
678 | unput( ')' ); | |
679 | for ( i = yyleng - 1; i >= 0; --i ) | |
680 | unput( yytext[i] ); | |
681 | unput( '(' ); | |
682 | } | |
683 | ||
684 | .fi | |
685 | Note that since each | |
686 | .B unput() | |
687 | puts the given character back at the | |
688 | .I beginning | |
689 | of the input stream, pushing back strings must be done back-to-front. | |
690 | .IP - | |
691 | .B input() | |
692 | reads the next character from the input stream. For example, | |
693 | the following is one way to eat up C comments: | |
694 | .nf | |
695 | ||
696 | %% | |
697 | "/*" { | |
698 | register int c; | |
699 | ||
700 | for ( ; ; ) | |
701 | { | |
702 | while ( (c = input()) != '*' && | |
703 | c != EOF ) | |
704 | ; /* eat up text of comment */ | |
705 | ||
706 | if ( c == '*' ) | |
707 | { | |
708 | while ( (c = input()) == '*' ) | |
709 | ; | |
710 | if ( c == '/' ) | |
711 | break; /* found the end */ | |
712 | } | |
713 | ||
714 | if ( c == EOF ) | |
715 | { | |
716 | error( "EOF in comment" ); | |
717 | break; | |
718 | } | |
719 | } | |
720 | } | |
721 | ||
722 | .fi | |
723 | (Note that if the scanner is compiled using | |
724 | .B C++, | |
725 | then | |
726 | .B input() | |
727 | is instead referred to as | |
728 | .B yyinput(), | |
729 | in order to avoid a name clash with the | |
730 | .B C++ | |
731 | stream by the name of | |
732 | .I input.) | |
733 | .IP - | |
734 | .B yyterminate() | |
735 | can be used in lieu of a return statement in an action. It terminates | |
736 | the scanner and returns a 0 to the scanner's caller, indicating "all done". | |
737 | Subsequent calls to the scanner will immediately return unless preceded | |
738 | by a call to | |
739 | .B yyrestart() | |
740 | (see below). | |
741 | By default, | |
742 | .B yyterminate() | |
743 | is also called when an end-of-file is encountered. It is a macro and | |
744 | may be redefined. | |
745 | .SH THE GENERATED SCANNER | |
746 | The output of | |
747 | .I flex | |
748 | is the file | |
749 | .B lex.yy.c, | |
750 | which contains the scanning routine | |
751 | .B yylex(), | |
752 | a number of tables used by it for matching tokens, and a number | |
753 | of auxiliary routines and macros. By default, | |
754 | .B yylex() | |
755 | is declared as follows: | |
756 | .nf | |
757 | ||
758 | int yylex() | |
759 | { | |
760 | ... various definitions and the actions in here ... | |
761 | } | |
762 | ||
763 | .fi | |
764 | (If your environment supports function prototypes, then it will | |
765 | be "int yylex( void )".) This definition may be changed by redefining | |
766 | the "YY_DECL" macro. For example, you could use: | |
767 | .nf | |
768 | ||
769 | #undef YY_DECL | |
770 | #define YY_DECL float lexscan( a, b ) float a, b; | |
771 | ||
772 | .fi | |
773 | to give the scanning routine the name | |
774 | .I lexscan, | |
775 | returning a float, and taking two floats as arguments. Note that | |
776 | if you give arguments to the scanning routine using a | |
777 | K&R-style/non-prototyped function declaration, you must terminate | |
778 | the definition with a semi-colon (;). | |
779 | .LP | |
780 | Whenever | |
781 | .B yylex() | |
782 | is called, it scans tokens from the global input file | |
783 | .I yyin | |
784 | (which defaults to stdin). It continues until it either reaches | |
785 | an end-of-file (at which point it returns the value 0) or | |
786 | one of its actions executes a | |
787 | .I return | |
788 | statement. | |
789 | In the former case, when called again the scanner will immediately | |
790 | return unless | |
791 | .B yyrestart() | |
792 | is called to point | |
793 | .I yyin | |
794 | at the new input file. ( | |
795 | .B yyrestart() | |
796 | takes one argument, a | |
797 | .B FILE * | |
798 | pointer.) | |
799 | In the latter case (i.e., when an action | |
800 | executes a return), the scanner may then be called again and it | |
801 | will resume scanning where it left off. | |
802 | .LP | |
803 | By default (and for purposes of efficiency), the scanner uses | |
804 | block-reads rather than simple | |
805 | .I getc() | |
806 | calls to read characters from | |
807 | .I yyin. | |
808 | The nature of how it gets its input can be controlled by redefining the | |
809 | .B YY_INPUT | |
810 | macro. | |
811 | YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its | |
812 | action is to place up to | |
813 | .I max_size | |
814 | characters in the character array | |
815 | .I buf | |
816 | and return in the integer variable | |
817 | .I result | |
818 | either the | |
819 | number of characters read or the constant YY_NULL (0 on Unix systems) | |
820 | to indicate EOF. The default YY_INPUT reads from the | |
821 | global file-pointer "yyin". | |
822 | .LP | |
823 | A sample redefinition of YY_INPUT (in the definitions | |
824 | section of the input file): | |
825 | .nf | |
826 | ||
827 | %{ | |
828 | #undef YY_INPUT | |
829 | #define YY_INPUT(buf,result,max_size) \\ | |
830 | result = ((buf[0] = getchar()) == EOF) ? YY_NULL : 1; | |
831 | %} | |
832 | ||
833 | .fi | |
834 | This definition will change the input processing to occur | |
835 | one character at a time. | |
836 | .LP | |
837 | You also can add in things like keeping track of the | |
838 | input line number this way; but don't expect your scanner to | |
839 | go very fast. | |
840 | .LP | |
841 | When the scanner receives an end-of-file indication from YY_INPUT, | |
842 | it then checks the | |
843 | .B yywrap() | |
844 | function. If | |
845 | .B yywrap() | |
846 | returns false (zero), then it is assumed that the | |
847 | function has gone ahead and set up | |
848 | .I yyin | |
849 | to point to another input file, and scanning continues. If it returns | |
850 | true (non-zero), then the scanner terminates, returning 0 to its | |
851 | caller. | |
852 | .LP | |
853 | The default | |
854 | .B yywrap() | |
855 | always returns 1. Presently, to redefine it you must first | |
856 | "#undef yywrap", as it is currently implemented as a macro. As indicated | |
857 | by the hedging in the previous sentence, it may be changed to | |
858 | a true function in the near future. | |
859 | .LP | |
860 | The scanner writes its | |
861 | .B ECHO | |
862 | output to the | |
863 | .I yyout | |
864 | global (default, stdout), which may be redefined by the user simply | |
865 | by assigning it to some other | |
866 | .B FILE | |
867 | pointer. | |
868 | .SH START CONDITIONS | |
869 | .I flex | |
870 | provides a mechanism for conditionally activating rules. Any rule | |
871 | whose pattern is prefixed with "<sc>" will only be active when | |
872 | the scanner is in the start condition named "sc". For example, | |
873 | .nf | |
874 | ||
875 | <STRING>[^"]* { /* eat up the string body ... */ | |
876 | ... | |
877 | } | |
878 | ||
879 | .fi | |
880 | will be active only when the scanner is in the "STRING" start | |
881 | condition, and | |
882 | .nf | |
883 | ||
884 | <INITIAL,STRING,QUOTE>\\. { /* handle an escape ... */ | |
885 | ... | |
886 | } | |
887 | ||
888 | .fi | |
889 | will be active only when the current start condition is | |
890 | either "INITIAL", "STRING", or "QUOTE". | |
891 | .LP | |
892 | Start conditions | |
893 | are declared in the definitions (first) section of the input | |
894 | using unindented lines beginning with either | |
895 | .B %s | |
896 | or | |
897 | .B %x | |
898 | followed by a list of names. | |
899 | The former declares | |
900 | .I inclusive | |
901 | start conditions, the latter | |
902 | .I exclusive | |
903 | start conditions. A start condition is activated using the | |
904 | .B BEGIN | |
905 | action. Until the next | |
906 | .B BEGIN | |
907 | action is executed, rules with the given start | |
908 | condition will be active and | |
909 | rules with other start conditions will be inactive. | |
910 | If the start condition is | |
911 | .I inclusive, | |
912 | then rules with no start conditions at all will also be active. | |
913 | If it is | |
914 | .I exclusive, | |
915 | then | |
916 | .I only | |
917 | rules qualified with the start condition will be active. | |
918 | A set of rules contingent on the same exclusive start condition | |
919 | describe a scanner which is independent of any of the other rules in the | |
920 | .I flex | |
921 | input. Because of this, | |
922 | exclusive start conditions make it easy to specify "mini-scanners" | |
923 | which scan portions of the input that are syntactically different | |
924 | from the rest (e.g., comments). | |
925 | .LP | |
926 | If the distinction between inclusive and exclusive start conditions | |
927 | is still a little vague, here's a simple example illustrating the | |
928 | connection between the two. The set of rules: | |
929 | .nf | |
930 | ||
931 | %s example | |
932 | %% | |
933 | <example>foo /* do something */ | |
934 | ||
935 | .fi | |
936 | is equivalent to | |
937 | .nf | |
938 | ||
939 | %x example | |
940 | %% | |
941 | <INITIAL,example>foo /* do something */ | |
942 | ||
943 | .fi | |
944 | .LP | |
945 | The default rule (to | |
946 | .B ECHO | |
947 | any unmatched character) remains active in start conditions. | |
948 | .LP | |
949 | .B BEGIN(0) | |
950 | returns to the original state where only the rules with | |
951 | no start conditions are active. This state can also be | |
952 | referred to as the start-condition "INITIAL", so | |
953 | .B BEGIN(INITIAL) | |
954 | is equivalent to | |
955 | .B BEGIN(0). | |
956 | (The parentheses around the start condition name are not required but | |
957 | are considered good style.) | |
958 | .LP | |
959 | .B BEGIN | |
960 | actions can also be given as indented code at the beginning | |
961 | of the rules section. For example, the following will cause | |
962 | the scanner to enter the "SPECIAL" start condition whenever | |
963 | .I yylex() | |
964 | is called and the global variable | |
965 | .I enter_special | |
966 | is true: | |
967 | .nf | |
968 | ||
969 | int enter_special; | |
970 | ||
971 | %x SPECIAL | |
972 | %% | |
973 | if ( enter_special ) | |
974 | BEGIN(SPECIAL); | |
975 | ||
976 | <SPECIAL>blahblahblah | |
977 | ...more rules follow... | |
978 | ||
979 | .fi | |
980 | .LP | |
981 | To illustrate the uses of start conditions, | |
982 | here is a scanner which provides two different interpretations | |
983 | of a string like "123.456". By default it will treat it as | |
984 | as three tokens, the integer "123", a dot ('.'), and the integer "456". | |
985 | But if the string is preceded earlier in the line by the string | |
986 | "expect-floats" | |
987 | it will treat it as a single token, the floating-point number | |
988 | 123.456: | |
989 | .nf | |
990 | ||
991 | %{ | |
992 | #include <math.h> | |
993 | %} | |
994 | %s expect | |
995 | ||
996 | %% | |
997 | expect-floats BEGIN(expect); | |
998 | ||
999 | <expect>[0-9]+"."[0-9]+ { | |
1000 | printf( "found a float, = %f\\n", | |
1001 | atof( yytext ) ); | |
1002 | } | |
1003 | <expect>\\n { | |
1004 | /* that's the end of the line, so | |
1005 | * we need another "expect-number" | |
1006 | * before we'll recognize any more | |
1007 | * numbers | |
1008 | */ | |
1009 | BEGIN(INITIAL); | |
1010 | } | |
1011 | ||
1012 | [0-9]+ { | |
1013 | printf( "found an integer, = %d\\n", | |
1014 | atoi( yytext ) ); | |
1015 | } | |
1016 | ||
1017 | "." printf( "found a dot\\n" ); | |
1018 | ||
1019 | .fi | |
1020 | Here is a scanner which recognizes (and discards) C comments while | |
1021 | maintaining a count of the current input line. | |
1022 | .nf | |
1023 | ||
1024 | %x comment | |
1025 | %% | |
1026 | int line_num = 1; | |
1027 | ||
1028 | "/*" BEGIN(comment); | |
1029 | ||
1030 | <comment>[^*\\n]* /* eat anything that's not a '*' */ | |
1031 | <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */ | |
1032 | <comment>\\n ++line_num; | |
1033 | <comment>"*"+"/" BEGIN(INITIAL); | |
1034 | ||
1035 | .fi | |
1036 | Note that start-conditions names are really integer values and | |
1037 | can be stored as such. Thus, the above could be extended in the | |
1038 | following fashion: | |
1039 | .nf | |
1040 | ||
1041 | %x comment foo | |
1042 | %% | |
1043 | int line_num = 1; | |
1044 | int comment_caller; | |
1045 | ||
1046 | "/*" { | |
1047 | comment_caller = INITIAL; | |
1048 | BEGIN(comment); | |
1049 | } | |
1050 | ||
1051 | ... | |
1052 | ||
1053 | <foo>"/*" { | |
1054 | comment_caller = foo; | |
1055 | BEGIN(comment); | |
1056 | } | |
1057 | ||
1058 | <comment>[^*\\n]* /* eat anything that's not a '*' */ | |
1059 | <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */ | |
1060 | <comment>\\n ++line_num; | |
1061 | <comment>"*"+"/" BEGIN(comment_caller); | |
1062 | ||
1063 | .fi | |
1064 | One can then implement a "stack" of start conditions using an | |
1065 | array of integers. (It is likely that such stacks will become | |
1066 | a full-fledged | |
1067 | .I flex | |
1068 | feature in the future.) Note, though, that | |
1069 | start conditions do not have their own name-space; %s's and %x's | |
1070 | declare names in the same fashion as #define's. | |
1071 | .SH MULTIPLE INPUT BUFFERS | |
1072 | Some scanners (such as those which support "include" files) | |
1073 | require reading from several input streams. As | |
1074 | .I flex | |
1075 | scanners do a large amount of buffering, one cannot control | |
1076 | where the next input will be read from by simply writing a | |
1077 | .B YY_INPUT | |
1078 | which is sensitive to the scanning context. | |
1079 | .B YY_INPUT | |
1080 | is only called when the scanner reaches the end of its buffer, which | |
1081 | may be a long time after scanning a statement such as an "include" | |
1082 | which requires switching the input source. | |
1083 | .LP | |
1084 | To negotiate these sorts of problems, | |
1085 | .I flex | |
1086 | provides a mechanism for creating and switching between multiple | |
1087 | input buffers. An input buffer is created by using: | |
1088 | .nf | |
1089 | ||
1090 | YY_BUFFER_STATE yy_create_buffer( FILE *file, int size ) | |
1091 | ||
1092 | .fi | |
1093 | which takes a | |
1094 | .I FILE | |
1095 | pointer and a size and creates a buffer associated with the given | |
1096 | file and large enough to hold | |
1097 | .I size | |
1098 | characters (when in doubt, use | |
1099 | .B YY_BUF_SIZE | |
1100 | for the size). It returns a | |
1101 | .B YY_BUFFER_STATE | |
1102 | handle, which may then be passed to other routines: | |
1103 | .nf | |
1104 | ||
1105 | void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer ) | |
1106 | ||
1107 | .fi | |
1108 | switches the scanner's input buffer so subsequent tokens will | |
1109 | come from | |
1110 | .I new_buffer. | |
1111 | Note that | |
1112 | .B yy_switch_to_buffer() | |
1113 | may be used by yywrap() to sets things up for continued scanning, instead | |
1114 | of opening a new file and pointing | |
1115 | .I yyin | |
1116 | at it. | |
1117 | .nf | |
1118 | ||
1119 | void yy_delete_buffer( YY_BUFFER_STATE buffer ) | |
1120 | ||
1121 | .fi | |
1122 | is used to reclaim the storage associated with a buffer. | |
1123 | .LP | |
1124 | .B yy_new_buffer() | |
1125 | is an alias for | |
1126 | .B yy_create_buffer(), | |
1127 | provided for compatibility with the C++ use of | |
1128 | .I new | |
1129 | and | |
1130 | .I delete | |
1131 | for creating and destroying dynamic objects. | |
1132 | .LP | |
1133 | Finally, the | |
1134 | .B YY_CURRENT_BUFFER | |
1135 | macro returns a | |
1136 | .B YY_BUFFER_STATE | |
1137 | handle to the current buffer. | |
1138 | .LP | |
1139 | Here is an example of using these features for writing a scanner | |
1140 | which expands include files (the | |
1141 | .B <<EOF>> | |
1142 | feature is discussed below): | |
1143 | .nf | |
1144 | ||
1145 | /* the "incl" state is used for picking up the name | |
1146 | * of an include file | |
1147 | */ | |
1148 | %x incl | |
1149 | ||
1150 | %{ | |
1151 | #define MAX_INCLUDE_DEPTH 10 | |
1152 | YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH]; | |
1153 | int include_stack_ptr = 0; | |
1154 | %} | |
1155 | ||
1156 | %% | |
1157 | include BEGIN(incl); | |
1158 | ||
1159 | [a-z]+ ECHO; | |
1160 | [^a-z\\n]*\\n? ECHO; | |
1161 | ||
1162 | <incl>[ \\t]* /* eat the whitespace */ | |
1163 | <incl>[^ \\t\\n]+ { /* got the include file name */ | |
1164 | if ( include_stack_ptr >= MAX_INCLUDE_DEPTH ) | |
1165 | { | |
1166 | fprintf( stderr, "Includes nested too deeply" ); | |
1167 | exit( 1 ); | |
1168 | } | |
1169 | ||
1170 | include_stack[include_stack_ptr++] = | |
1171 | YY_CURRENT_BUFFER; | |
1172 | ||
1173 | yyin = fopen( yytext, "r" ); | |
1174 | ||
1175 | if ( ! yyin ) | |
1176 | error( ... ); | |
1177 | ||
1178 | yy_switch_to_buffer( | |
1179 | yy_create_buffer( yyin, YY_BUF_SIZE ) ); | |
1180 | ||
1181 | BEGIN(INITIAL); | |
1182 | } | |
1183 | ||
1184 | <<EOF>> { | |
1185 | if ( --include_stack_ptr < 0 ) | |
1186 | { | |
1187 | yyterminate(); | |
1188 | } | |
1189 | ||
1190 | else | |
1191 | yy_switch_to_buffer( | |
1192 | include_stack[include_stack_ptr] ); | |
1193 | } | |
1194 | ||
1195 | .fi | |
1196 | .SH END-OF-FILE RULES | |
1197 | The special rule "<<EOF>>" indicates | |
1198 | actions which are to be taken when an end-of-file is | |
1199 | encountered and yywrap() returns non-zero (i.e., indicates | |
1200 | no further files to process). The action must finish | |
1201 | by doing one of four things: | |
1202 | .IP - | |
1203 | the special | |
1204 | .B YY_NEW_FILE | |
1205 | action, if | |
1206 | .I yyin | |
1207 | has been pointed at a new file to process; | |
1208 | .IP - | |
1209 | a | |
1210 | .I return | |
1211 | statement; | |
1212 | .IP - | |
1213 | the special | |
1214 | .B yyterminate() | |
1215 | action; | |
1216 | .IP - | |
1217 | or, switching to a new buffer using | |
1218 | .B yy_switch_to_buffer() | |
1219 | as shown in the example above. | |
1220 | .LP | |
1221 | <<EOF>> rules may not be used with other | |
1222 | patterns; they may only be qualified with a list of start | |
1223 | conditions. If an unqualified <<EOF>> rule is given, it | |
1224 | applies to | |
1225 | .I all | |
1226 | start conditions which do not already have <<EOF>> actions. To | |
1227 | specify an <<EOF>> rule for only the initial start condition, use | |
1228 | .nf | |
1229 | ||
1230 | <INITIAL><<EOF>> | |
1231 | ||
1232 | .fi | |
1233 | .LP | |
1234 | These rules are useful for catching things like unclosed comments. | |
1235 | An example: | |
1236 | .nf | |
1237 | ||
1238 | %x quote | |
1239 | %% | |
1240 | ||
1241 | ...other rules for dealing with quotes... | |
1242 | ||
1243 | <quote><<EOF>> { | |
1244 | error( "unterminated quote" ); | |
1245 | yyterminate(); | |
1246 | } | |
1247 | <<EOF>> { | |
1248 | if ( *++filelist ) | |
1249 | { | |
1250 | yyin = fopen( *filelist, "r" ); | |
1251 | YY_NEW_FILE; | |
1252 | } | |
1253 | else | |
1254 | yyterminate(); | |
1255 | } | |
1256 | ||
1257 | .fi | |
1258 | .SH MISCELLANEOUS MACROS | |
1259 | The macro | |
1260 | .bd | |
1261 | YY_USER_ACTION | |
1262 | can be redefined to provide an action | |
1263 | which is always executed prior to the matched rule's action. For example, | |
1264 | it could be #define'd to call a routine to convert yytext to lower-case. | |
1265 | .LP | |
1266 | The macro | |
1267 | .B YY_USER_INIT | |
1268 | may be redefined to provide an action which is always executed before | |
1269 | the first scan (and before the scanner's internal initializations are done). | |
1270 | For example, it could be used to call a routine to read | |
1271 | in a data table or open a logging file. | |
1272 | .LP | |
1273 | In the generated scanner, the actions are all gathered in one large | |
1274 | switch statement and separated using | |
1275 | .B YY_BREAK, | |
1276 | which may be redefined. By default, it is simply a "break", to separate | |
1277 | each rule's action from the following rule's. | |
1278 | Redefining | |
1279 | .B YY_BREAK | |
1280 | allows, for example, C++ users to | |
1281 | #define YY_BREAK to do nothing (while being very careful that every | |
1282 | rule ends with a "break" or a "return"!) to avoid suffering from | |
1283 | unreachable statement warnings where because a rule's action ends with | |
1284 | "return", the | |
1285 | .B YY_BREAK | |
1286 | is inaccessible. | |
1287 | .SH INTERFACING WITH YACC | |
1288 | One of the main uses of | |
1289 | .I flex | |
1290 | is as a companion to the | |
1291 | .I yacc | |
1292 | parser-generator. | |
1293 | .I yacc | |
1294 | parsers expect to call a routine named | |
1295 | .B yylex() | |
1296 | to find the next input token. The routine is supposed to | |
1297 | return the type of the next token as well as putting any associated | |
1298 | value in the global | |
1299 | .B yylval. | |
1300 | To use | |
1301 | .I flex | |
1302 | with | |
1303 | .I yacc, | |
1304 | one specifies the | |
1305 | .B -d | |
1306 | option to | |
1307 | .I yacc | |
1308 | to instruct it to generate the file | |
1309 | .B y.tab.h | |
1310 | containing definitions of all the | |
1311 | .B %tokens | |
1312 | appearing in the | |
1313 | .I yacc | |
1314 | input. This file is then included in the | |
1315 | .I flex | |
1316 | scanner. For example, if one of the tokens is "TOK_NUMBER", | |
1317 | part of the scanner might look like: | |
1318 | .nf | |
1319 | ||
1320 | %{ | |
1321 | #include "y.tab.h" | |
1322 | %} | |
1323 | ||
1324 | %% | |
1325 | ||
1326 | [0-9]+ yylval = atoi( yytext ); return TOK_NUMBER; | |
1327 | ||
1328 | .fi | |
1329 | .SH TRANSLATION TABLE | |
1330 | In the name of POSIX compliance, | |
1331 | .I flex | |
1332 | supports a | |
1333 | .I translation table | |
1334 | for mapping input characters into groups. | |
1335 | The table is specified in the first section, and its format looks like: | |
1336 | .nf | |
1337 | ||
1338 | %t | |
1339 | 1 abcd | |
1340 | 2 ABCDEFGHIJKLMNOPQRSTUVWXYZ | |
1341 | 52 0123456789 | |
1342 | 6 \\t\\ \\n | |
1343 | %t | |
1344 | ||
1345 | .fi | |
1346 | This example specifies that the characters 'a', 'b', 'c', and 'd' | |
1347 | are to all be lumped into group #1, upper-case letters | |
1348 | in group #2, digits in group #52, tabs, blanks, and newlines into | |
1349 | group #6, and | |
1350 | .I | |
1351 | no other characters will appear in the patterns. | |
1352 | The group numbers are actually disregarded by | |
1353 | .I flex; | |
1354 | .B %t | |
1355 | serves, though, to lump characters together. Given the above | |
1356 | table, for example, the pattern "a(AA)*5" is equivalent to "d(ZQ)*0". | |
1357 | They both say, "match any character in group #1, followed by | |
1358 | zero-or-more pairs of characters | |
1359 | from group #2, followed by a character from group #52." Thus | |
1360 | .B %t | |
1361 | provides a crude way for introducing equivalence classes into | |
1362 | the scanner specification. | |
1363 | .LP | |
1364 | Note that the | |
1365 | .B -i | |
1366 | option (see below) coupled with the equivalence classes which | |
1367 | .I flex | |
1368 | automatically generates take care of virtually all the instances | |
1369 | when one might consider using | |
1370 | .B %t. | |
1371 | But what the hell, it's there if you want it. | |
1372 | .SH OPTIONS | |
1373 | .I flex | |
1374 | has the following options: | |
1375 | .TP | |
1376 | .B -b | |
1377 | Generate backtracking information to | |
1378 | .I lex.backtrack. | |
1379 | This is a list of scanner states which require backtracking | |
1380 | and the input characters on which they do so. By adding rules one | |
1381 | can remove backtracking states. If all backtracking states | |
1382 | are eliminated and | |
1383 | .B -f | |
1384 | or | |
1385 | .B -F | |
1386 | is used, the generated scanner will run faster (see the | |
1387 | .B -p | |
1388 | flag). Only users who wish to squeeze every last cycle out of their | |
1389 | scanners need worry about this option. (See the section on PERFORMANCE | |
1390 | CONSIDERATIONS below.) | |
1391 | .TP | |
1392 | .B -c | |
1393 | is a do-nothing, deprecated option included for POSIX compliance. | |
1394 | .IP | |
1395 | .B NOTE: | |
1396 | in previous releases of | |
1397 | .I flex | |
1398 | .B -c | |
1399 | specified table-compression options. This functionality is | |
1400 | now given by the | |
1401 | .B -C | |
1402 | flag. To ease the the impact of this change, when | |
1403 | .I flex | |
1404 | encounters | |
1405 | .B -c, | |
1406 | it currently issues a warning message and assumes that | |
1407 | .B -C | |
1408 | was desired instead. In the future this "promotion" of | |
1409 | .B -c | |
1410 | to | |
1411 | .B -C | |
1412 | will go away in the name of full POSIX compliance (unless | |
1413 | the POSIX meaning is removed first). | |
1414 | .TP | |
1415 | .B -d | |
1416 | makes the generated scanner run in | |
1417 | .I debug | |
1418 | mode. Whenever a pattern is recognized and the global | |
1419 | .B yy_flex_debug | |
1420 | is non-zero (which is the default), | |
1421 | the scanner will write to | |
1422 | .I stderr | |
1423 | a line of the form: | |
1424 | .nf | |
1425 | ||
1426 | --accepting rule at line 53 ("the matched text") | |
1427 | ||
1428 | .fi | |
1429 | The line number refers to the location of the rule in the file | |
1430 | defining the scanner (i.e., the file that was fed to flex). Messages | |
1431 | are also generated when the scanner backtracks, accepts the | |
1432 | default rule, reaches the end of its input buffer (or encounters | |
1433 | a NUL; at this point, the two look the same as far as the scanner's concerned), | |
1434 | or reaches an end-of-file. | |
1435 | .TP | |
1436 | .B -f | |
1437 | specifies (take your pick) | |
1438 | .I full table | |
1439 | or | |
1440 | .I fast scanner. | |
1441 | No table compression is done. The result is large but fast. | |
1442 | This option is equivalent to | |
1443 | .B -Cf | |
1444 | (see below). | |
1445 | .TP | |
1446 | .B -i | |
1447 | instructs | |
1448 | .I flex | |
1449 | to generate a | |
1450 | .I case-insensitive | |
1451 | scanner. The case of letters given in the | |
1452 | .I flex | |
1453 | input patterns will | |
1454 | be ignored, and tokens in the input will be matched regardless of case. The | |
1455 | matched text given in | |
1456 | .I yytext | |
1457 | will have the preserved case (i.e., it will not be folded). | |
1458 | .TP | |
1459 | .B -n | |
1460 | is another do-nothing, deprecated option included only for | |
1461 | POSIX compliance. | |
1462 | .TP | |
1463 | .B -p | |
1464 | generates a performance report to stderr. The report | |
1465 | consists of comments regarding features of the | |
1466 | .I flex | |
1467 | input file which will cause a loss of performance in the resulting scanner. | |
1468 | Note that the use of | |
1469 | .I REJECT | |
1470 | and variable trailing context (see the BUGS section in flex(1)) | |
1471 | entails a substantial performance penalty; use of | |
1472 | .I yymore(), | |
1473 | the | |
1474 | .B ^ | |
1475 | operator, | |
1476 | and the | |
1477 | .B -I | |
1478 | flag entail minor performance penalties. | |
1479 | .TP | |
1480 | .B -s | |
1481 | causes the | |
1482 | .I default rule | |
1483 | (that unmatched scanner input is echoed to | |
1484 | .I stdout) | |
1485 | to be suppressed. If the scanner encounters input that does not | |
1486 | match any of its rules, it aborts with an error. This option is | |
1487 | useful for finding holes in a scanner's rule set. | |
1488 | .TP | |
1489 | .B -t | |
1490 | instructs | |
1491 | .I flex | |
1492 | to write the scanner it generates to standard output instead | |
1493 | of | |
1494 | .B lex.yy.c. | |
1495 | .TP | |
1496 | .B -v | |
1497 | specifies that | |
1498 | .I flex | |
1499 | should write to | |
1500 | .I stderr | |
1501 | a summary of statistics regarding the scanner it generates. | |
1502 | Most of the statistics are meaningless to the casual | |
1503 | .I flex | |
1504 | user, but the | |
1505 | first line identifies the version of | |
1506 | .I flex, | |
1507 | which is useful for figuring | |
1508 | out where you stand with respect to patches and new releases, | |
1509 | and the next two lines give the date when the scanner was created | |
1510 | and a summary of the flags which were in effect. | |
1511 | .TP | |
1512 | .B -F | |
1513 | specifies that the | |
1514 | .ul | |
1515 | fast | |
1516 | scanner table representation should be used. This representation is | |
1517 | about as fast as the full table representation | |
1518 | .ul | |
1519 | (-f), | |
1520 | and for some sets of patterns will be considerably smaller (and for | |
1521 | others, larger). In general, if the pattern set contains both "keywords" | |
1522 | and a catch-all, "identifier" rule, such as in the set: | |
1523 | .nf | |
1524 | ||
1525 | "case" return TOK_CASE; | |
1526 | "switch" return TOK_SWITCH; | |
1527 | ... | |
1528 | "default" return TOK_DEFAULT; | |
1529 | [a-z]+ return TOK_ID; | |
1530 | ||
1531 | .fi | |
1532 | then you're better off using the full table representation. If only | |
1533 | the "identifier" rule is present and you then use a hash table or some such | |
1534 | to detect the keywords, you're better off using | |
1535 | .ul | |
1536 | -F. | |
1537 | .IP | |
1538 | This option is equivalent to | |
1539 | .B -CF | |
1540 | (see below). | |
1541 | .TP | |
1542 | .B -I | |
1543 | instructs | |
1544 | .I flex | |
1545 | to generate an | |
1546 | .I interactive | |
1547 | scanner. Normally, scanners generated by | |
1548 | .I flex | |
1549 | always look ahead one | |
1550 | character before deciding that a rule has been matched. At the cost of | |
1551 | some scanning overhead, | |
1552 | .I flex | |
1553 | will generate a scanner which only looks ahead | |
1554 | when needed. Such scanners are called | |
1555 | .I interactive | |
1556 | because if you want to write a scanner for an interactive system such as a | |
1557 | command shell, you will probably want the user's input to be terminated | |
1558 | with a newline, and without | |
1559 | .B -I | |
1560 | the user will have to type a character in addition to the newline in order | |
1561 | to have the newline recognized. This leads to dreadful interactive | |
1562 | performance. | |
1563 | .IP | |
1564 | If all this seems to confusing, here's the general rule: if a human will | |
1565 | be typing in input to your scanner, use | |
1566 | .B -I, | |
1567 | otherwise don't; if you don't care about squeezing the utmost performance | |
1568 | from your scanner and you | |
1569 | don't want to make any assumptions about the input to your scanner, | |
1570 | use | |
1571 | .B -I. | |
1572 | .IP | |
1573 | Note, | |
1574 | .B -I | |
1575 | cannot be used in conjunction with | |
1576 | .I full | |
1577 | or | |
1578 | .I fast tables, | |
1579 | i.e., the | |
1580 | .B -f, -F, -Cf, | |
1581 | or | |
1582 | .B -CF | |
1583 | flags. | |
1584 | .TP | |
1585 | .B -L | |
1586 | instructs | |
1587 | .I flex | |
1588 | not to generate | |
1589 | .B #line | |
1590 | directives. Without this option, | |
1591 | .I flex | |
1592 | peppers the generated scanner | |
1593 | with #line directives so error messages in the actions will be correctly | |
1594 | located with respect to the original | |
1595 | .I flex | |
1596 | input file, and not to | |
1597 | the fairly meaningless line numbers of | |
1598 | .B lex.yy.c. | |
1599 | (Unfortunately | |
1600 | .I flex | |
1601 | does not presently generate the necessary directives | |
1602 | to "retarget" the line numbers for those parts of | |
1603 | .B lex.yy.c | |
1604 | which it generated. So if there is an error in the generated code, | |
1605 | a meaningless line number is reported.) | |
1606 | .TP | |
1607 | .B -T | |
1608 | makes | |
1609 | .I flex | |
1610 | run in | |
1611 | .I trace | |
1612 | mode. It will generate a lot of messages to | |
1613 | .I stdout | |
1614 | concerning | |
1615 | the form of the input and the resultant non-deterministic and deterministic | |
1616 | finite automata. This option is mostly for use in maintaining | |
1617 | .I flex. | |
1618 | .TP | |
1619 | .B -8 | |
1620 | instructs | |
1621 | .I flex | |
1622 | to generate an 8-bit scanner, i.e., one which can recognize 8-bit | |
1623 | characters. On some sites, | |
1624 | .I flex | |
1625 | is installed with this option as the default. On others, the default | |
1626 | is 7-bit characters. To see which is the case, check the verbose | |
1627 | .B (-v) | |
1628 | output for "equivalence classes created". If the denominator of | |
1629 | the number shown is 128, then by default | |
1630 | .I flex | |
1631 | is generating 7-bit characters. If it is 256, then the default is | |
1632 | 8-bit characters and the | |
1633 | .B -8 | |
1634 | flag is not required (but may be a good idea to keep the scanner | |
1635 | specification portable). Feeding a 7-bit scanner 8-bit characters | |
1636 | will result in infinite loops, bus errors, or other such fireworks, | |
1637 | so when in doubt, use the flag. Note that if equivalence classes | |
1638 | are used, 8-bit scanners take only slightly more table space than | |
1639 | 7-bit scanners (128 bytes, to be exact); if equivalence classes are | |
1640 | not used, however, then the tables may grow up to twice their | |
1641 | 7-bit size. | |
1642 | .TP | |
1643 | .B -C[efmF] | |
1644 | controls the degree of table compression. | |
1645 | .IP | |
1646 | .B -Ce | |
1647 | directs | |
1648 | .I flex | |
1649 | to construct | |
1650 | .I equivalence classes, | |
1651 | i.e., sets of characters | |
1652 | which have identical lexical properties (for example, if the only | |
1653 | appearance of digits in the | |
1654 | .I flex | |
1655 | input is in the character class | |
1656 | "[0-9]" then the digits '0', '1', ..., '9' will all be put | |
1657 | in the same equivalence class). Equivalence classes usually give | |
1658 | dramatic reductions in the final table/object file sizes (typically | |
1659 | a factor of 2-5) and are pretty cheap performance-wise (one array | |
1660 | look-up per character scanned). | |
1661 | .IP | |
1662 | .B -Cf | |
1663 | specifies that the | |
1664 | .I full | |
1665 | scanner tables should be generated - | |
1666 | .I flex | |
1667 | should not compress the | |
1668 | tables by taking advantages of similar transition functions for | |
1669 | different states. | |
1670 | .IP | |
1671 | .B -CF | |
1672 | specifies that the alternate fast scanner representation (described | |
1673 | above under the | |
1674 | .B -F | |
1675 | flag) | |
1676 | should be used. | |
1677 | .IP | |
1678 | .B -Cm | |
1679 | directs | |
1680 | .I flex | |
1681 | to construct | |
1682 | .I meta-equivalence classes, | |
1683 | which are sets of equivalence classes (or characters, if equivalence | |
1684 | classes are not being used) that are commonly used together. Meta-equivalence | |
1685 | classes are often a big win when using compressed tables, but they | |
1686 | have a moderate performance impact (one or two "if" tests and one | |
1687 | array look-up per character scanned). | |
1688 | .IP | |
1689 | A lone | |
1690 | .B -C | |
1691 | specifies that the scanner tables should be compressed but neither | |
1692 | equivalence classes nor meta-equivalence classes should be used. | |
1693 | .IP | |
1694 | The options | |
1695 | .B -Cf | |
1696 | or | |
1697 | .B -CF | |
1698 | and | |
1699 | .B -Cm | |
1700 | do not make sense together - there is no opportunity for meta-equivalence | |
1701 | classes if the table is not being compressed. Otherwise the options | |
1702 | may be freely mixed. | |
1703 | .IP | |
1704 | The default setting is | |
1705 | .B -Cem, | |
1706 | which specifies that | |
1707 | .I flex | |
1708 | should generate equivalence classes | |
1709 | and meta-equivalence classes. This setting provides the highest | |
1710 | degree of table compression. You can trade off | |
1711 | faster-executing scanners at the cost of larger tables with | |
1712 | the following generally being true: | |
1713 | .nf | |
1714 | ||
1715 | slowest & smallest | |
1716 | -Cem | |
1717 | -Cm | |
1718 | -Ce | |
1719 | -C | |
1720 | -C{f,F}e | |
1721 | -C{f,F} | |
1722 | fastest & largest | |
1723 | ||
1724 | .fi | |
1725 | Note that scanners with the smallest tables are usually generated and | |
1726 | compiled the quickest, so | |
1727 | during development you will usually want to use the default, maximal | |
1728 | compression. | |
1729 | .IP | |
1730 | .B -Cfe | |
1731 | is often a good compromise between speed and size for production | |
1732 | scanners. | |
1733 | .IP | |
1734 | .B -C | |
1735 | options are not cumulative; whenever the flag is encountered, the | |
1736 | previous -C settings are forgotten. | |
1737 | .TP | |
1738 | .B -Sskeleton_file | |
1739 | overrides the default skeleton file from which | |
1740 | .I flex | |
1741 | constructs its scanners. You'll never need this option unless you are doing | |
1742 | .I flex | |
1743 | maintenance or development. | |
1744 | .SH PERFORMANCE CONSIDERATIONS | |
1745 | The main design goal of | |
1746 | .I flex | |
1747 | is that it generate high-performance scanners. It has been optimized | |
1748 | for dealing well with large sets of rules. Aside from the effects | |
1749 | of table compression on scanner speed outlined above, | |
1750 | there are a number of options/actions which degrade performance. These | |
1751 | are, from most expensive to least: | |
1752 | .nf | |
1753 | ||
1754 | REJECT | |
1755 | ||
1756 | pattern sets that require backtracking | |
1757 | arbitrary trailing context | |
1758 | ||
1759 | '^' beginning-of-line operator | |
1760 | yymore() | |
1761 | ||
1762 | .fi | |
1763 | with the first three all being quite expensive and the last two | |
1764 | being quite cheap. | |
1765 | .LP | |
1766 | .B REJECT | |
1767 | should be avoided at all costs when performance is important. | |
1768 | It is a particularly expensive option. | |
1769 | .LP | |
1770 | Getting rid of backtracking is messy and often may be an enormous | |
1771 | amount of work for a complicated scanner. In principal, one begins | |
1772 | by using the | |
1773 | .B -b | |
1774 | flag to generate a | |
1775 | .I lex.backtrack | |
1776 | file. For example, on the input | |
1777 | .nf | |
1778 | ||
1779 | %% | |
1780 | foo return TOK_KEYWORD; | |
1781 | foobar return TOK_KEYWORD; | |
1782 | ||
1783 | .fi | |
1784 | the file looks like: | |
1785 | .nf | |
1786 | ||
1787 | State #6 is non-accepting - | |
1788 | associated rule line numbers: | |
1789 | 2 3 | |
1790 | out-transitions: [ o ] | |
1791 | jam-transitions: EOF [ \\001-n p-\\177 ] | |
1792 | ||
1793 | State #8 is non-accepting - | |
1794 | associated rule line numbers: | |
1795 | 3 | |
1796 | out-transitions: [ a ] | |
1797 | jam-transitions: EOF [ \\001-` b-\\177 ] | |
1798 | ||
1799 | State #9 is non-accepting - | |
1800 | associated rule line numbers: | |
1801 | 3 | |
1802 | out-transitions: [ r ] | |
1803 | jam-transitions: EOF [ \\001-q s-\\177 ] | |
1804 | ||
1805 | Compressed tables always backtrack. | |
1806 | ||
1807 | .fi | |
1808 | The first few lines tell us that there's a scanner state in | |
1809 | which it can make a transition on an 'o' but not on any other | |
1810 | character, and that in that state the currently scanned text does not match | |
1811 | any rule. The state occurs when trying to match the rules found | |
1812 | at lines 2 and 3 in the input file. | |
1813 | If the scanner is in that state and then reads | |
1814 | something other than an 'o', it will have to backtrack to find | |
1815 | a rule which is matched. With | |
1816 | a bit of headscratching one can see that this must be the | |
1817 | state it's in when it has seen "fo". When this has happened, | |
1818 | if anything other than another 'o' is seen, the scanner will | |
1819 | have to back up to simply match the 'f' (by the default rule). | |
1820 | .LP | |
1821 | The comment regarding State #8 indicates there's a problem | |
1822 | when "foob" has been scanned. Indeed, on any character other | |
1823 | than a 'b', the scanner will have to back up to accept "foo". | |
1824 | Similarly, the comment for State #9 concerns when "fooba" has | |
1825 | been scanned. | |
1826 | .LP | |
1827 | The final comment reminds us that there's no point going to | |
1828 | all the trouble of removing backtracking from the rules unless | |
1829 | we're using | |
1830 | .B -f | |
1831 | or | |
1832 | .B -F, | |
1833 | since there's no performance gain doing so with compressed scanners. | |
1834 | .LP | |
1835 | The way to remove the backtracking is to add "error" rules: | |
1836 | .nf | |
1837 | ||
1838 | %% | |
1839 | foo return TOK_KEYWORD; | |
1840 | foobar return TOK_KEYWORD; | |
1841 | ||
1842 | fooba | | |
1843 | foob | | |
1844 | fo { | |
1845 | /* false alarm, not really a keyword */ | |
1846 | return TOK_ID; | |
1847 | } | |
1848 | ||
1849 | .fi | |
1850 | .LP | |
1851 | Eliminating backtracking among a list of keywords can also be | |
1852 | done using a "catch-all" rule: | |
1853 | .nf | |
1854 | ||
1855 | %% | |
1856 | foo return TOK_KEYWORD; | |
1857 | foobar return TOK_KEYWORD; | |
1858 | ||
1859 | [a-z]+ return TOK_ID; | |
1860 | ||
1861 | .fi | |
1862 | This is usually the best solution when appropriate. | |
1863 | .LP | |
1864 | Backtracking messages tend to cascade. | |
1865 | With a complicated set of rules it's not uncommon to get hundreds | |
1866 | of messages. If one can decipher them, though, it often | |
1867 | only takes a dozen or so rules to eliminate the backtracking (though | |
1868 | it's easy to make a mistake and have an error rule accidentally match | |
1869 | a valid token. A possible future | |
1870 | .I flex | |
1871 | feature will be to automatically add rules to eliminate backtracking). | |
1872 | .LP | |
1873 | .I Variable | |
1874 | trailing context (where both the leading and trailing parts do not have | |
1875 | a fixed length) entails almost the same performance loss as | |
1876 | .I REJECT | |
1877 | (i.e., substantial). So when possible a rule like: | |
1878 | .nf | |
1879 | ||
1880 | %% | |
1881 | mouse|rat/(cat|dog) run(); | |
1882 | ||
1883 | .fi | |
1884 | is better written: | |
1885 | .nf | |
1886 | ||
1887 | %% | |
1888 | mouse/cat|dog run(); | |
1889 | rat/cat|dog run(); | |
1890 | ||
1891 | .fi | |
1892 | or as | |
1893 | .nf | |
1894 | ||
1895 | %% | |
1896 | mouse|rat/cat run(); | |
1897 | mouse|rat/dog run(); | |
1898 | ||
1899 | .fi | |
1900 | Note that here the special '|' action does | |
1901 | .I not | |
1902 | provide any savings, and can even make things worse (see | |
1903 | .B BUGS | |
1904 | in flex(1)). | |
1905 | .LP | |
1906 | Another area where the user can increase a scanner's performance | |
1907 | (and one that's easier to implement) arises from the fact that | |
1908 | the longer the tokens matched, the faster the scanner will run. | |
1909 | This is because with long tokens the processing of most input | |
1910 | characters takes place in the (short) inner scanning loop, and | |
1911 | does not often have to go through the additional work of setting up | |
1912 | the scanning environment (e.g., | |
1913 | .B yytext) | |
1914 | for the action. Recall the scanner for C comments: | |
1915 | .nf | |
1916 | ||
1917 | %x comment | |
1918 | %% | |
1919 | int line_num = 1; | |
1920 | ||
1921 | "/*" BEGIN(comment); | |
1922 | ||
1923 | <comment>[^*\\n]* | |
1924 | <comment>"*"+[^*/\\n]* | |
1925 | <comment>\\n ++line_num; | |
1926 | <comment>"*"+"/" BEGIN(INITIAL); | |
1927 | ||
1928 | .fi | |
1929 | This could be sped up by writing it as: | |
1930 | .nf | |
1931 | ||
1932 | %x comment | |
1933 | %% | |
1934 | int line_num = 1; | |
1935 | ||
1936 | "/*" BEGIN(comment); | |
1937 | ||
1938 | <comment>[^*\\n]* | |
1939 | <comment>[^*\\n]*\\n ++line_num; | |
1940 | <comment>"*"+[^*/\\n]* | |
1941 | <comment>"*"+[^*/\\n]*\\n ++line_num; | |
1942 | <comment>"*"+"/" BEGIN(INITIAL); | |
1943 | ||
1944 | .fi | |
1945 | Now instead of each newline requiring the processing of another | |
1946 | action, recognizing the newlines is "distributed" over the other rules | |
1947 | to keep the matched text as long as possible. Note that | |
1948 | .I adding | |
1949 | rules does | |
1950 | .I not | |
1951 | slow down the scanner! The speed of the scanner is independent | |
1952 | of the number of rules or (modulo the considerations given at the | |
1953 | beginning of this section) how complicated the rules are with | |
1954 | regard to operators such as '*' and '|'. | |
1955 | .LP | |
1956 | A final example in speeding up a scanner: suppose you want to scan | |
1957 | through a file containing identifiers and keywords, one per line | |
1958 | and with no other extraneous characters, and recognize all the | |
1959 | keywords. A natural first approach is: | |
1960 | .nf | |
1961 | ||
1962 | %% | |
1963 | asm | | |
1964 | auto | | |
1965 | break | | |
1966 | ... etc ... | |
1967 | volatile | | |
1968 | while /* it's a keyword */ | |
1969 | ||
1970 | .|\\n /* it's not a keyword */ | |
1971 | ||
1972 | .fi | |
1973 | To eliminate the back-tracking, introduce a catch-all rule: | |
1974 | .nf | |
1975 | ||
1976 | %% | |
1977 | asm | | |
1978 | auto | | |
1979 | break | | |
1980 | ... etc ... | |
1981 | volatile | | |
1982 | while /* it's a keyword */ | |
1983 | ||
1984 | [a-z]+ | | |
1985 | .|\\n /* it's not a keyword */ | |
1986 | ||
1987 | .fi | |
1988 | Now, if it's guaranteed that there's exactly one word per line, | |
1989 | then we can reduce the total number of matches by a half by | |
1990 | merging in the recognition of newlines with that of the other | |
1991 | tokens: | |
1992 | .nf | |
1993 | ||
1994 | %% | |
1995 | asm\\n | | |
1996 | auto\\n | | |
1997 | break\\n | | |
1998 | ... etc ... | |
1999 | volatile\\n | | |
2000 | while\\n /* it's a keyword */ | |
2001 | ||
2002 | [a-z]+\\n | | |
2003 | .|\\n /* it's not a keyword */ | |
2004 | ||
2005 | .fi | |
2006 | One has to be careful here, as we have now reintroduced backtracking | |
2007 | into the scanner. In particular, while | |
2008 | .I we | |
2009 | know that there will never be any characters in the input stream | |
2010 | other than letters or newlines, | |
2011 | .I flex | |
2012 | can't figure this out, and it will plan for possibly needing backtracking | |
2013 | when it has scanned a token like "auto" and then the next character | |
2014 | is something other than a newline or a letter. Previously it would | |
2015 | then just match the "auto" rule and be done, but now it has no "auto" | |
2016 | rule, only a "auto\\n" rule. To eliminate the possibility of backtracking, | |
2017 | we could either duplicate all rules but without final newlines, or, | |
2018 | since we never expect to encounter such an input and therefore don't | |
2019 | how it's classified, we can introduce one more catch-all rule, this | |
2020 | one which doesn't include a newline: | |
2021 | .nf | |
2022 | ||
2023 | %% | |
2024 | asm\\n | | |
2025 | auto\\n | | |
2026 | break\\n | | |
2027 | ... etc ... | |
2028 | volatile\\n | | |
2029 | while\\n /* it's a keyword */ | |
2030 | ||
2031 | [a-z]+\\n | | |
2032 | [a-z]+ | | |
2033 | .|\\n /* it's not a keyword */ | |
2034 | ||
2035 | .fi | |
2036 | Compiled with | |
2037 | .B -Cf, | |
2038 | this is about as fast as one can get a | |
2039 | .I flex | |
2040 | scanner to go for this particular problem. | |
2041 | .LP | |
2042 | A final note: | |
2043 | .I flex | |
2044 | is slow when matching NUL's, particularly when a token contains | |
2045 | multiple NUL's. | |
2046 | It's best to write rules which match | |
2047 | .I short | |
2048 | amounts of text if it's anticipated that the text will often include NUL's. | |
2049 | .SH INCOMPATIBILITIES WITH LEX AND POSIX | |
2050 | .I flex | |
2051 | is a rewrite of the Unix | |
2052 | .I lex | |
2053 | tool (the two implementations do not share any code, though), | |
2054 | with some extensions and incompatibilities, both of which | |
2055 | are of concern to those who wish to write scanners acceptable | |
2056 | to either implementation. At present, the POSIX | |
2057 | .I lex | |
2058 | draft is | |
2059 | very close to the original | |
2060 | .I lex | |
2061 | implementation, so some of these | |
2062 | incompatibilities are also in conflict with the POSIX draft. But | |
2063 | the intent is that except as noted below, | |
2064 | .I flex | |
2065 | as it presently stands will | |
2066 | ultimately be POSIX conformant (i.e., that those areas of conflict with | |
2067 | the POSIX draft will be resolved in | |
2068 | .I flex's | |
2069 | favor). Please bear in | |
2070 | mind that all the comments which follow are with regard to the POSIX | |
2071 | .I draft | |
2072 | standard of Summer 1989, and not the final document (or subsequent | |
2073 | drafts); they are included so | |
2074 | .I flex | |
2075 | users can be aware of the standardization issues and those areas where | |
2076 | .I flex | |
2077 | may in the near future undergo changes incompatible with | |
2078 | its current definition. | |
2079 | .LP | |
2080 | .I flex | |
2081 | is fully compatible with | |
2082 | .I lex | |
2083 | with the following exceptions: | |
2084 | .IP - | |
2085 | .I lex | |
2086 | does not support exclusive start conditions (%x), though they | |
2087 | are in the current POSIX draft. | |
2088 | .IP - | |
2089 | When definitions are expanded, | |
2090 | .I flex | |
2091 | encloses them in parentheses. | |
2092 | With lex, the following: | |
2093 | .nf | |
2094 | ||
2095 | NAME [A-Z][A-Z0-9]* | |
2096 | %% | |
2097 | foo{NAME}? printf( "Found it\\n" ); | |
2098 | %% | |
2099 | ||
2100 | .fi | |
2101 | will not match the string "foo" because when the macro | |
2102 | is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?" | |
2103 | and the precedence is such that the '?' is associated with | |
2104 | "[A-Z0-9]*". With | |
2105 | .I flex, | |
2106 | the rule will be expanded to | |
2107 | "foo([A-Z][A-Z0-9]*)?" and so the string "foo" will match. | |
2108 | Note that because of this, the | |
2109 | .B ^, $, <s>, /, | |
2110 | and | |
2111 | .B <<EOF>> | |
2112 | operators cannot be used in a | |
2113 | .I flex | |
2114 | definition. | |
2115 | .IP | |
2116 | The POSIX draft interpretation is the same as | |
2117 | .I flex's. | |
2118 | .IP - | |
2119 | To specify a character class which matches anything but a left bracket (']'), | |
2120 | in | |
2121 | .I lex | |
2122 | one can use "[^]]" but with | |
2123 | .I flex | |
2124 | one must use "[^\\]]". The latter works with | |
2125 | .I lex, | |
2126 | too. | |
2127 | .IP - | |
2128 | The undocumented | |
2129 | .I lex | |
2130 | scanner internal variable | |
2131 | .B yylineno | |
2132 | is not supported. (The variable is not part of the POSIX draft.) | |
2133 | .IP - | |
2134 | The | |
2135 | .B input() | |
2136 | routine is not redefinable, though it may be called to read characters | |
2137 | following whatever has been matched by a rule. If | |
2138 | .B input() | |
2139 | encounters an end-of-file the normal | |
2140 | .B yywrap() | |
2141 | processing is done. A ``real'' end-of-file is returned by | |
2142 | .B input() | |
2143 | as | |
2144 | .I EOF. | |
2145 | .IP | |
2146 | Input is instead controlled by redefining the | |
2147 | .B YY_INPUT | |
2148 | macro. | |
2149 | .IP | |
2150 | The | |
2151 | .I flex | |
2152 | restriction that | |
2153 | .B input() | |
2154 | cannot be redefined is in accordance with the POSIX draft, but | |
2155 | .B YY_INPUT | |
2156 | has not yet been accepted into the draft. | |
2157 | .IP - | |
2158 | .B output() | |
2159 | is not supported. | |
2160 | Output from the | |
2161 | .B ECHO | |
2162 | macro is done to the file-pointer | |
2163 | .I yyout | |
2164 | (default | |
2165 | .I stdout). | |
2166 | .IP | |
2167 | The POSIX draft mentions that an | |
2168 | .B output() | |
2169 | routine exists but currently gives no details as to what it does. | |
2170 | .IP - | |
2171 | The | |
2172 | .I lex | |
2173 | .B %r | |
2174 | (generate a Ratfor scanner) option is not supported. It is not part | |
2175 | of the POSIX draft. | |
2176 | .IP - | |
2177 | If you are providing your own yywrap() routine, you must include a | |
2178 | "#undef yywrap" in the definitions section (section 1). Note that | |
2179 | the "#undef" will have to be enclosed in %{}'s. | |
2180 | .IP | |
2181 | The POSIX draft | |
2182 | specifies that yywrap() is a function and this is unlikely to change; so | |
2183 | .I flex users are warned | |
2184 | that | |
2185 | .B yywrap() | |
2186 | is likely to be changed to a function in the near future. | |
2187 | .IP - | |
2188 | After a call to | |
2189 | .B unput(), | |
2190 | .I yytext | |
2191 | and | |
2192 | .I yyleng | |
2193 | are undefined until the next token is matched. This is not the case with | |
2194 | .I lex | |
2195 | or the present POSIX draft. | |
2196 | .IP - | |
2197 | The precedence of the | |
2198 | .B {} | |
2199 | (numeric range) operator is different. | |
2200 | .I lex | |
2201 | interprets "abc{1,3}" as "match one, two, or | |
2202 | three occurrences of 'abc'", whereas | |
2203 | .I flex | |
2204 | interprets it as "match 'ab' | |
2205 | followed by one, two, or three occurrences of 'c'". The latter is | |
2206 | in agreement with the current POSIX draft. | |
2207 | .IP - | |
2208 | The precedence of the | |
2209 | .B ^ | |
2210 | operator is different. | |
2211 | .I lex | |
2212 | interprets "^foo|bar" as "match either 'foo' at the beginning of a line, | |
2213 | or 'bar' anywhere", whereas | |
2214 | .I flex | |
2215 | interprets it as "match either 'foo' or 'bar' if they come at the beginning | |
2216 | of a line". The latter is in agreement with the current POSIX draft. | |
2217 | .IP - | |
2218 | To refer to yytext outside of the scanner source file, | |
2219 | the correct definition with | |
2220 | .I flex | |
2221 | is "extern char *yytext" rather than "extern char yytext[]". | |
2222 | This is contrary to the current POSIX draft but a point on which | |
2223 | .I flex | |
2224 | will not be changing, as the array representation entails a | |
2225 | serious performance penalty. It is hoped that the POSIX draft will | |
2226 | be emended to support the | |
2227 | .I flex | |
2228 | variety of declaration (as this is a fairly painless change to | |
2229 | require of | |
2230 | .I lex | |
2231 | users). | |
2232 | .IP - | |
2233 | .I yyin | |
2234 | is | |
2235 | .I initialized | |
2236 | by | |
2237 | .I lex | |
2238 | to be | |
2239 | .I stdin; | |
2240 | .I flex, | |
2241 | on the other hand, | |
2242 | initializes | |
2243 | .I yyin | |
2244 | to NULL | |
2245 | and then | |
2246 | .I assigns | |
2247 | it to | |
2248 | .I stdin | |
2249 | the first time the scanner is called, providing | |
2250 | .I yyin | |
2251 | has not already been assigned to a non-NULL value. The difference is | |
2252 | subtle, but the net effect is that with | |
2253 | .I flex | |
2254 | scanners, | |
2255 | .I yyin | |
2256 | does not have a valid value until the scanner has been called. | |
2257 | .IP - | |
2258 | The special table-size declarations such as | |
2259 | .B %a | |
2260 | supported by | |
2261 | .I lex | |
2262 | are not required by | |
2263 | .I flex | |
2264 | scanners; | |
2265 | .I flex | |
2266 | ignores them. | |
2267 | .IP - | |
2268 | The name | |
2269 | .bd | |
2270 | FLEX_SCANNER | |
2271 | is #define'd so scanners may be written for use with either | |
2272 | .I flex | |
2273 | or | |
2274 | .I lex. | |
2275 | .LP | |
2276 | The following | |
2277 | .I flex | |
2278 | features are not included in | |
2279 | .I lex | |
2280 | or the POSIX draft standard: | |
2281 | .nf | |
2282 | ||
2283 | yyterminate() | |
2284 | <<EOF>> | |
2285 | YY_DECL | |
2286 | #line directives | |
2287 | %{}'s around actions | |
2288 | yyrestart() | |
2289 | comments beginning with '#' (deprecated) | |
2290 | multiple actions on a line | |
2291 | ||
2292 | .fi | |
2293 | This last feature refers to the fact that with | |
2294 | .I flex | |
2295 | you can put multiple actions on the same line, separated with | |
2296 | semi-colons, while with | |
2297 | .I lex, | |
2298 | the following | |
2299 | .nf | |
2300 | ||
2301 | foo handle_foo(); ++num_foos_seen; | |
2302 | ||
2303 | .fi | |
2304 | is (rather surprisingly) truncated to | |
2305 | .nf | |
2306 | ||
2307 | foo handle_foo(); | |
2308 | ||
2309 | .fi | |
2310 | .I flex | |
2311 | does not truncate the action. Actions that are not enclosed in | |
2312 | braces are simply terminated at the end of the line. | |
2313 | .SH DIAGNOSTICS | |
2314 | .I reject_used_but_not_detected undefined | |
2315 | or | |
2316 | .I yymore_used_but_not_detected undefined - | |
2317 | These errors can occur at compile time. They indicate that the | |
2318 | scanner uses | |
2319 | .B REJECT | |
2320 | or | |
2321 | .B yymore() | |
2322 | but that | |
2323 | .I flex | |
2324 | failed to notice the fact, meaning that | |
2325 | .I flex | |
2326 | scanned the first two sections looking for occurrences of these actions | |
2327 | and failed to find any, but somehow you snuck some in (via a #include | |
2328 | file, for example). Make an explicit reference to the action in your | |
2329 | .I flex | |
2330 | input file. (Note that previously | |
2331 | .I flex | |
2332 | supported a | |
2333 | .B %used/%unused | |
2334 | mechanism for dealing with this problem; this feature is still supported | |
2335 | but now deprecated, and will go away soon unless the author hears from | |
2336 | people who can argue compellingly that they need it.) | |
2337 | .LP | |
2338 | .I flex scanner jammed - | |
2339 | a scanner compiled with | |
2340 | .B -s | |
2341 | has encountered an input string which wasn't matched by | |
2342 | any of its rules. | |
2343 | .LP | |
2344 | .I flex input buffer overflowed - | |
2345 | a scanner rule matched a string long enough to overflow the | |
2346 | scanner's internal input buffer (16K bytes by default - controlled by | |
2347 | .B YY_BUF_SIZE | |
2348 | in "flex.skel". Note that to redefine this macro, you must first | |
2349 | .B #undefine | |
2350 | it). | |
2351 | .LP | |
2352 | .I scanner requires -8 flag - | |
2353 | Your scanner specification includes recognizing 8-bit characters and | |
2354 | you did not specify the -8 flag (and your site has not installed flex | |
2355 | with -8 as the default). | |
2356 | .LP | |
2357 | .I too many %t classes! - | |
2358 | You managed to put every single character into its own %t class. | |
2359 | .I flex | |
2360 | requires that at least one of the classes share characters. | |
2361 | .SH DEFICIENCIES / BUGS | |
2362 | See flex(1). | |
2363 | .SH "SEE ALSO" | |
2364 | .LP | |
2365 | flex(1), lex(1), yacc(1), sed(1), awk(1). | |
2366 | .LP | |
2367 | M. E. Lesk and E. Schmidt, | |
2368 | .I LEX - Lexical Analyzer Generator | |
2369 | .SH AUTHOR | |
2370 | Vern Paxson, with the help of many ideas and much inspiration from | |
2371 | Van Jacobson. Original version by Jef Poskanzer. The fast table | |
2372 | representation is a partial implementation of a design done by Van | |
2373 | Jacobson. The implementation was done by Kevin Gong and Vern Paxson. | |
2374 | .LP | |
2375 | Thanks to the many | |
2376 | .I flex | |
2377 | beta-testers, feedbackers, and contributors, especially Casey | |
2378 | Leedom, benson@odi.com, | |
2379 | Frederic Brehm, Nick Christopher, Jason Coughlin, | |
2380 | Scott David Daniels, Leo Eskin, | |
2381 | Chris Faylor, Eric Goldman, Eric | |
2382 | Hughes, Jeffrey R. Jones, Kevin B. Kenny, Ronald Lamprecht, | |
2383 | Greg Lee, Craig Leres, Mohamed el Lozy, Jim Meyering, Marc Nozell, Esmond Pitt, | |
2384 | Jef Poskanzer, Jim Roskind, | |
2385 | Dave Tallman, Frank Whaley, Ken Yap, and those whose names | |
2386 | have slipped my marginal mail-archiving skills but whose contributions | |
2387 | are appreciated all the same. | |
2388 | .LP | |
2389 | Thanks to Keith Bostic, John Gilmore, Craig Leres, Bob | |
2390 | Mulcahy, Rich Salz, and Richard Stallman for help with various distribution | |
2391 | headaches. | |
2392 | .LP | |
2393 | Thanks to Esmond Pitt and Earle Horton for 8-bit character support; | |
2394 | to Benson Margulies and Fred | |
2395 | Burke for C++ support; to Ove Ewerlid for the basics of support for | |
2396 | NUL's; and to Eric Hughes for the basics of support for multiple buffers. | |
2397 | .LP | |
2398 | Work is being done on extending | |
2399 | .I flex | |
2400 | to generate scanners in which the | |
2401 | state machine is directly represented in C code rather than tables. | |
2402 | These scanners may well be substantially faster than those generated | |
2403 | using -f or -F. If you are working in this area and are interested | |
2404 | in comparing notes and seeing whether redundant work can be avoided, | |
2405 | contact Ove Ewerlid (ewerlid@mizar.DoCS.UU.SE). | |
2406 | .LP | |
2407 | This work was primarily done when I was at the Real Time Systems Group | |
2408 | at the Lawrence Berkeley Laboratory in Berkeley, CA. Many thanks to all there | |
2409 | for the support I received. | |
2410 | .LP | |
2411 | Send comments to: | |
2412 | .nf | |
2413 | ||
2414 | Vern Paxson | |
2415 | Computer Science Department | |
2416 | 4126 Upson Hall | |
2417 | Cornell University | |
2418 | Ithaca, NY 14853-7501 | |
2419 | ||
2420 | vern@cs.cornell.edu | |
2421 | decvax!cornell!vern | |
2422 | ||
2423 | .fi |