Commit | Line | Data |
---|---|---|
5652ef93 | 1 | .\" @(#)lex.ms 6.1 (Berkeley) %G% |
b1716fed | 2 | .\" |
5652ef93 KM |
3 | .EH 'PS1:16-%''Lex \- A Lexical Analyzer Generator' |
4 | .OH 'Lex \- A Lexical Analyzer Generator''PS1:16-%' | |
b1716fed KM |
5 | .hc ~ |
6 | .bd I 2 | |
7 | .de TS | |
8 | .br | |
9 | .nf | |
10 | .SP 1v | |
11 | .ul 0 | |
12 | .. | |
13 | .de TE | |
14 | .SP 1v | |
15 | .fi | |
16 | .. | |
5652ef93 KM |
17 | .\".de PT |
18 | .\".if \\n%>1 'tl ''\s7LEX\s0\s9\(mi%\s0'' | |
19 | .\".if \\n%>1 'sp | |
20 | .\".. | |
b1716fed | 21 | .ND July 21, 1975 |
5652ef93 KM |
22 | .\".RP |
23 | .\".TM 75-1274-15 39199 39199-11 | |
b1716fed KM |
24 | .TL |
25 | Lex \- A Lexical Analyzer ~Generator~ | |
26 | .AU ``MH 2C-569'' 6377 | |
27 | M. E. Lesk and E. Schmidt | |
28 | .AI | |
29 | .MH | |
30 | .AB | |
31 | .sp | |
32 | .bd I 2 | |
5652ef93 KM |
33 | .\".nr PS 8 |
34 | .\".nr VS 9 | |
35 | .\".ps 8 | |
36 | .\".vs 9p | |
b1716fed KM |
37 | Lex helps write programs whose control flow |
38 | is directed by instances of regular | |
39 | expressions in the input stream. | |
40 | It is well suited for editor-script type transformations and | |
41 | for segmenting input in preparation for | |
42 | a parsing routine. | |
43 | .PP | |
44 | Lex source is a table of regular expressions and corresponding program fragments. | |
45 | The table is translated to a program | |
46 | which reads an input stream, copying it to an output stream | |
47 | and partitioning the input | |
48 | into strings which match the given expressions. | |
49 | As each such string is recognized the corresponding | |
50 | program fragment is executed. | |
51 | The recognition of the expressions | |
52 | is performed by a deterministic finite automaton | |
53 | generated by Lex. | |
54 | The program fragments written by the user are executed in the order in which the | |
55 | corresponding regular expressions occur in the input stream. | |
56 | .if n .if \n(tm .ig | |
57 | .PP | |
58 | The lexical analysis | |
59 | programs written with Lex accept ambiguous specifications | |
60 | and choose the longest | |
61 | match possible at each input point. | |
62 | If necessary, substantial look~ahead | |
63 | is performed on the input, but the | |
64 | input stream will be backed up to the | |
65 | end of the current partition, so that the user | |
66 | has general freedom to manipulate it. | |
67 | .PP | |
68 | Lex can generate analyzers in either C or Ratfor, a language | |
69 | which can be translated automatically to portable Fortran. | |
70 | It is available on the PDP-11 UNIX, Honeywell GCOS, | |
71 | and IBM OS systems. | |
72 | This manual, however, will only discuss generating analyzers | |
73 | in C on the UNIX system, which is the only supported | |
74 | form of Lex under UNIX Version 7. | |
75 | Lex is designed to simplify | |
76 | interfacing with Yacc, for those | |
77 | with access to this compiler-compiler system. | |
78 | .. | |
5652ef93 KM |
79 | .\".nr PS 9 |
80 | .\".nr VS 11 | |
b1716fed | 81 | .AE |
b1716fed KM |
82 | .2C |
83 | .NH | |
84 | Introduction. | |
85 | .PP | |
86 | Lex is a program generator designed for | |
87 | lexical processing of character input streams. | |
88 | It accepts a high-level, problem oriented specification | |
89 | for character string matching, | |
90 | and | |
91 | produces a program in a general purpose language which recognizes | |
92 | regular expressions. | |
93 | The regular expressions are specified by the user in the | |
94 | source specifications given to Lex. | |
95 | The Lex written code recognizes these expressions | |
96 | in an input stream and partitions the input stream into | |
97 | strings matching the expressions. At the bound~aries | |
98 | between strings | |
99 | program sections | |
100 | provided by the user are executed. | |
101 | The Lex source file associates the regular expressions and the | |
102 | program fragments. | |
103 | As each expression appears in the input to the program written by Lex, | |
104 | the corresponding fragment is executed. | |
105 | .PP | |
106 | .de MH | |
107 | Bell Laboratories, Murray Hill, NJ 07974. | |
108 | .. | |
109 | The user supplies the additional code | |
110 | beyond expression matching | |
111 | needed to complete his tasks, possibly | |
112 | including code written by other generators. | |
113 | The program that recognizes the expressions is generated in the | |
114 | general purpose programming language employed for the | |
115 | user's program fragments. | |
116 | Thus, a high level expression | |
117 | language is provided to write the string expressions to be | |
118 | matched while the user's freedom to write actions | |
119 | is unimpaired. | |
120 | This avoids forcing the user who wishes to use a string manipulation | |
121 | language for input analysis to write processing programs in the same | |
122 | and often inappropriate string handling language. | |
123 | .PP | |
124 | Lex is not a complete language, but rather a generator representing | |
125 | a new language feature which can be added to | |
126 | different programming languages, called ``host languages.'' | |
127 | Just as general purpose languages | |
128 | can produce code to run on different computer hardware, | |
129 | Lex can write code in different host languages. | |
130 | The host language is used for the output code generated by Lex | |
131 | and also for the program fragments added by the user. | |
132 | Compatible run-time libraries for the different host languages | |
133 | are also provided. | |
134 | This makes Lex adaptable to different environments and | |
135 | different users. | |
136 | Each application | |
137 | may be directed to the combination of hardware and host language appropriate | |
138 | to the task, the user's background, and the properties of local | |
139 | implementations. | |
140 | At present, the only supported host language is C, | |
141 | although Fortran (in the form of Ratfor [2] has been available | |
142 | in the past. | |
143 | Lex itself exists on UNIX, GCOS, and OS/370; but the | |
144 | code generated by Lex may be taken anywhere the appropriate | |
145 | compilers exist. | |
146 | .PP | |
147 | Lex turns the user's expressions and actions | |
148 | (called | |
149 | .ul | |
150 | source | |
151 | in this memo) into the host general-purpose language; | |
152 | the generated program is named | |
153 | .ul | |
154 | yylex. | |
155 | The | |
156 | .ul | |
157 | yylex | |
158 | program | |
159 | will recognize expressions | |
160 | in a stream | |
161 | (called | |
162 | .ul | |
163 | input | |
164 | in this memo) | |
165 | and perform the specified actions for each expression as it is detected. | |
166 | See Figure 1. | |
167 | .GS | |
168 | .TS | |
169 | center; | |
170 | l _ r | |
171 | l|c|r | |
172 | l _ r | |
173 | l _ r | |
174 | l|c|r | |
175 | l _ r | |
176 | c s s | |
177 | c s s. | |
178 | ||
179 | Source \(-> Lex \(-> yylex | |
180 | ||
181 | .sp 2 | |
182 | ||
183 | Input \(-> yylex \(-> Output | |
184 | ||
185 | .sp | |
186 | An overview of Lex | |
b1716fed KM |
187 | Figure 1 |
188 | .TE | |
189 | .GE | |
190 | .PP | |
191 | For a trivial example, consider a program to delete | |
192 | from the input | |
193 | all blanks or tabs at the ends of lines. | |
194 | .TS | |
195 | center; | |
196 | l l. | |
197 | %% | |
198 | [ \et]+$ ; | |
199 | .TE | |
200 | is all that is required. | |
201 | The program | |
202 | contains a %% delimiter to mark the beginning of the rules, and | |
203 | one rule. | |
204 | This rule contains a regular expression | |
205 | which matches one or more | |
206 | instances of the characters blank or tab | |
207 | (written \et for visibility, in accordance with the C language convention) | |
208 | just prior to the end of a line. | |
209 | The brackets indicate the character | |
210 | class made of blank and tab; the + indicates ``one or more ...''; | |
211 | and the $ indicates ``end of line,'' as in QED. | |
212 | No action is specified, | |
213 | so the program generated by Lex (yylex) will ignore these characters. | |
214 | Everything else will be copied. | |
215 | To change any remaining | |
216 | string of blanks or tabs to a single blank, | |
217 | add another rule: | |
218 | .TS | |
219 | center; | |
220 | l l. | |
221 | %% | |
222 | [ \et]+$ ; | |
223 | [ \et]+ printf(" "); | |
224 | .TE | |
225 | The finite automaton generated for this | |
226 | source will scan for both rules at once, | |
227 | observing at | |
228 | the termination of the string of blanks or tabs | |
229 | whether or not there is a newline character, and executing | |
230 | the desired rule action. | |
231 | The first rule matches all strings of blanks or tabs | |
232 | at the end of lines, and the second | |
233 | rule all remaining strings of blanks or tabs. | |
234 | .PP | |
235 | Lex can be used alone for simple transformations, or | |
236 | for analysis and statistics gathering on a lexical level. | |
237 | Lex can also be used with a parser generator | |
238 | to perform the lexical analysis phase; it is particularly | |
239 | easy to interface Lex and Yacc [3]. | |
240 | Lex programs recognize only regular expressions; | |
241 | Yacc writes parsers that accept a large class of context free grammars, | |
242 | but require a lower level analyzer to recognize input tokens. | |
243 | Thus, a combination of Lex and Yacc is often appropriate. | |
244 | When used as a preprocessor for a later parser generator, | |
245 | Lex is used to partition the input stream, | |
246 | and the parser generator assigns structure to | |
247 | the resulting pieces. | |
248 | The flow of control | |
249 | in such a case (which might be the first half of a compiler, | |
250 | for example) is shown in Figure 2. | |
251 | Additional programs, | |
252 | written by other generators | |
253 | or by hand, can | |
254 | be added easily to programs written by Lex. | |
255 | .BS 2 | |
5652ef93 KM |
256 | .ps 9 |
257 | .vs 11 | |
b1716fed KM |
258 | .TS |
259 | center; | |
260 | l c c c l | |
261 | l c c c l | |
262 | l c c c l | |
263 | l _ c _ l | |
264 | l|c|c|c|l | |
265 | l _ c _ l | |
266 | l c c c l | |
267 | l _ c _ l | |
268 | l|c|c|c|l | |
269 | l _ c _ l | |
270 | l c s s l | |
271 | l c s s l. | |
272 | lexical grammar | |
273 | rules rules | |
274 | \(da \(da | |
275 | ||
276 | Lex Yacc | |
277 | ||
278 | \(da \(da | |
279 | ||
280 | Input \(-> yylex \(-> yyparse \(-> Parsed input | |
281 | ||
282 | .sp | |
283 | Lex with Yacc | |
b1716fed KM |
284 | Figure 2 |
285 | .TE | |
5652ef93 KM |
286 | .ps 10 |
287 | .vs 12 | |
b1716fed KM |
288 | .BE |
289 | Yacc users | |
290 | will realize that the name | |
291 | .ul | |
292 | yylex | |
293 | is what Yacc expects its lexical analyzer to be named, | |
294 | so that the use of this name by Lex simplifies | |
295 | interfacing. | |
296 | .PP | |
297 | Lex generates a deterministic finite automaton from the regular expressions | |
298 | in the source [4]. | |
299 | The automaton is interpreted, rather than compiled, in order | |
300 | to save space. | |
301 | The result is still a fast analyzer. | |
302 | In particular, the time taken by a Lex program | |
303 | to recognize and partition an input stream is | |
304 | proportional to the length of the input. | |
305 | The number of Lex rules or | |
306 | the complexity of the rules is | |
307 | not important in determining speed, | |
308 | unless rules which include | |
309 | forward context require a significant amount of re~scanning. | |
310 | What does increase with the number and complexity of rules | |
311 | is the size of the finite | |
312 | automaton, and therefore the size of the program | |
313 | generated by Lex. | |
314 | .PP | |
315 | In the program written by Lex, the user's fragments | |
316 | (representing the | |
317 | .ul | |
318 | actions | |
319 | to be performed as each regular expression | |
320 | is found) | |
321 | are gathered | |
322 | as cases of a switch. | |
323 | The automaton interpreter directs the control flow. | |
324 | Opportunity is provided for the user to insert either | |
325 | declarations or additional statements in the routine containing | |
326 | the actions, or to | |
327 | add subroutines outside this action routine. | |
328 | .PP | |
329 | Lex is not limited to source which can | |
330 | be interpreted on the basis of one character | |
331 | look~ahead. | |
332 | For example, | |
333 | if there are two rules, one looking for | |
334 | .I ab | |
335 | and another for | |
336 | .I abcdefg , | |
337 | and the input stream is | |
338 | .I abcdefh , | |
339 | Lex will recognize | |
340 | .I ab | |
341 | and leave | |
342 | the input pointer just before | |
343 | .I "cd. . ." | |
344 | Such backup is more costly | |
345 | than the processing of simpler languages. | |
346 | .2C | |
347 | .NH | |
348 | Lex Source. | |
349 | .PP | |
350 | The general format of Lex source is: | |
351 | .TS | |
352 | center; | |
353 | l. | |
354 | {definitions} | |
355 | %% | |
356 | {rules} | |
357 | %% | |
358 | {user subroutines} | |
359 | .TE | |
360 | where the definitions and the user subroutines | |
361 | are often omitted. | |
362 | The second | |
363 | .I %% | |
364 | is optional, but the first is required | |
365 | to mark the beginning of the rules. | |
366 | The absolute minimum Lex program is thus | |
367 | .TS | |
368 | center; | |
369 | l. | |
370 | %% | |
371 | .TE | |
372 | (no definitions, no rules) which translates into a program | |
373 | which copies the input to the output unchanged. | |
374 | .PP | |
375 | In the outline of Lex programs shown above, the | |
376 | .I | |
377 | rules | |
378 | .R | |
379 | represent the user's control | |
380 | decisions; they are a table, in which the left column | |
381 | contains | |
382 | .I | |
383 | regular expressions | |
384 | .R | |
385 | (see section 3) | |
386 | and the right column contains | |
387 | .I | |
388 | actions, | |
389 | .R | |
390 | program fragments to be executed when the expressions | |
391 | are recognized. | |
392 | Thus an individual rule might appear | |
393 | .TS | |
394 | center; | |
395 | l l. | |
396 | integer printf("found keyword INT"); | |
397 | .TE | |
398 | to look for the string | |
399 | .I integer | |
400 | in the input stream and | |
401 | print the message ``found keyword INT'' whenever it appears. | |
402 | In this example the host procedural language is C and | |
403 | the C library function | |
404 | .I | |
405 | printf | |
406 | .R | |
407 | is used to print the string. | |
408 | The end | |
409 | of the expression is indicated by the first blank or tab character. | |
410 | If the action is merely a single C expression, | |
411 | it can just be given on the right side of the line; if it is | |
412 | compound, or takes more than a line, it should be enclosed in | |
413 | braces. | |
414 | As a slightly more useful example, suppose it is desired to | |
415 | change a number of words from British to American spelling. | |
416 | Lex rules such as | |
417 | .TS | |
418 | center; | |
419 | l l. | |
420 | colour printf("color"); | |
421 | mechanise printf("mechanize"); | |
422 | petrol printf("gas"); | |
423 | .TE | |
424 | would be a start. These rules are not quite enough, | |
425 | since | |
426 | the word | |
427 | .I petroleum | |
428 | would become | |
429 | .I gaseum ; | |
430 | a way of dealing | |
431 | with this will be described later. | |
432 | .2C | |
433 | .NH | |
434 | Lex Regular Expressions. | |
435 | .PP | |
436 | The definitions of regular expressions are very similar to those | |
437 | in QED [5]. | |
438 | A regular | |
439 | expression specifies a set of strings to be matched. | |
440 | It contains text characters (which match the corresponding | |
441 | characters in the strings being compared) | |
442 | and operator characters (which specify | |
443 | repetitions, choices, and other features). | |
444 | The letters of the alphabet and the digits are | |
445 | always text characters; thus the regular expression | |
446 | .TS | |
447 | center; | |
448 | l l. | |
449 | integer | |
450 | .TE | |
451 | matches the string | |
452 | .ul | |
453 | integer | |
454 | wherever it appears | |
455 | and the expression | |
456 | .TS | |
457 | center; | |
458 | l. | |
459 | a57D | |
460 | .TE | |
461 | looks for the string | |
462 | .ul | |
463 | a57D. | |
464 | .PP | |
465 | .I | |
466 | Operators. | |
467 | .R | |
468 | The operator characters are | |
469 | .TS | |
470 | center; | |
471 | l. | |
472 | " \e [ ] ^ \- ? . \(** + | ( ) $ / { } % < > | |
473 | .TE | |
474 | and if they are to be used as text characters, an escape | |
475 | should be used. | |
476 | The quotation mark operator (") | |
477 | indicates that whatever is contained between a pair of quotes | |
478 | is to be taken as text characters. | |
479 | Thus | |
480 | .TS | |
481 | center; | |
482 | l. | |
483 | xyz"++" | |
484 | .TE | |
485 | matches the string | |
486 | .I xyz++ | |
487 | when it appears. Note that a part of a string may be quoted. | |
488 | It is harmless but unnecessary to quote an ordinary | |
489 | text character; the expression | |
490 | .TS | |
491 | center; | |
492 | l. | |
493 | "xyz++" | |
494 | .TE | |
495 | is the same as the one above. | |
496 | Thus by quoting every non-alphanumeric character | |
497 | being used as a text character, the user can avoid remembering | |
498 | the list above of current | |
499 | operator characters, and is safe should further extensions to Lex | |
500 | lengthen the list. | |
501 | .PP | |
502 | An operator character may also be turned into a text character | |
503 | by preceding it with \e as in | |
504 | .TS | |
505 | center; | |
506 | l. | |
507 | xyz\e+\e+ | |
508 | .TE | |
509 | which | |
510 | is another, less readable, equivalent of the above expressions. | |
511 | Another use of the quoting mechanism is to get a blank into | |
512 | an expression; normally, as explained above, blanks or tabs end | |
513 | a rule. | |
514 | Any blank character not contained within [\|] (see below) must | |
515 | be quoted. | |
516 | Several normal C escapes with \e | |
517 | are recognized: \en is newline, \et is tab, and \eb is backspace. | |
518 | To enter \e itself, use \e\e. | |
519 | Since newline is illegal in an expression, \en must be used; | |
520 | it is not | |
521 | required to escape tab and backspace. | |
522 | Every character but blank, tab, newline and the list above is always | |
523 | a text character. | |
524 | .PP | |
525 | .I | |
526 | Character classes. | |
527 | .R | |
528 | Classes of characters can be specified using the operator pair [\|]. | |
529 | The construction | |
530 | .I [abc] | |
531 | matches a | |
532 | single character, which may be | |
533 | .I a , | |
534 | .I b , | |
535 | or | |
536 | .I c . | |
537 | Within square brackets, | |
538 | most operator meanings are ignored. | |
539 | Only three characters are special: | |
540 | these are \e \(mi and ^. The \(mi character | |
541 | indicates ranges. For example, | |
542 | .TS | |
543 | center; | |
544 | l. | |
545 | [a\(miz0\(mi9<>_] | |
546 | .TE | |
547 | indicates the character class containing all the lower case letters, | |
548 | the digits, | |
549 | the angle brackets, and underline. | |
550 | Ranges may be given in either order. | |
551 | Using \(mi between any pair of characters which are | |
552 | not both upper case letters, both lower case letters, or both digits | |
553 | is implementation dependent and will get a warning message. | |
554 | (E.g., [0\-z] in ASCII is many more characters | |
555 | than it is in EBCDIC). | |
556 | If it is desired to include the | |
557 | character \(mi in a character class, it should be first or | |
558 | last; thus | |
559 | .TS | |
560 | center; | |
561 | l. | |
562 | [\(mi+0\(mi9] | |
563 | .TE | |
564 | matches all the digits and the two signs. | |
565 | .PP | |
566 | In character classes, | |
567 | the ^ operator must appear as the first character | |
568 | after the left bracket; it indicates that the resulting string | |
569 | is to be complemented with respect to the computer character set. | |
570 | Thus | |
571 | .TS | |
572 | center; | |
573 | l. | |
574 | [^abc] | |
575 | .TE | |
576 | matches all characters except a, b, or c, including | |
577 | all special or control characters; or | |
578 | .TS | |
579 | center; | |
580 | l. | |
581 | [^a\-zA\-Z] | |
582 | .TE | |
583 | is any character which is not a letter. | |
584 | The \e character provides the usual escapes within | |
585 | character class brackets. | |
586 | .PP | |
587 | .I | |
588 | Arbitrary character. | |
589 | .R | |
590 | To match almost any character, the operator character | |
591 | .TS | |
592 | center; | |
593 | l. | |
594 | \&. | |
595 | .TE | |
596 | is the class of all characters except newline. | |
597 | Escaping into octal is possible although non-portable: | |
598 | .TS | |
599 | center; | |
600 | l. | |
601 | [\e40\-\e176] | |
602 | .TE | |
603 | matches all printable characters in the ASCII character set, from octal | |
604 | 40 (blank) to octal 176 (tilde). | |
605 | .PP | |
606 | .I | |
607 | Optional expressions. | |
608 | .R | |
609 | The operator | |
610 | .I ? | |
611 | indicates | |
612 | an optional element of an expression. | |
613 | Thus | |
614 | .TS | |
615 | center; | |
616 | l. | |
617 | ab?c | |
618 | .TE | |
619 | matches either | |
620 | .I ac | |
621 | or | |
622 | .I abc . | |
623 | .PP | |
624 | .I | |
625 | Repeated expressions. | |
626 | .R | |
627 | Repetitions of classes are indicated by the operators | |
628 | .I \(** | |
629 | and | |
630 | .I + . | |
631 | .TS | |
632 | center; | |
633 | l. | |
634 | \f2a\(**\f1 | |
635 | .TE | |
636 | is any number of consecutive | |
637 | .I a | |
638 | characters, including zero; while | |
639 | .TS | |
640 | center; | |
641 | l. | |
642 | a+ | |
643 | .TE | |
644 | is one or more instances of | |
645 | .I a. | |
646 | For example, | |
647 | .TS | |
648 | center; | |
649 | l. | |
650 | [a\-z]+ | |
651 | .TE | |
652 | is all strings of lower case letters. | |
653 | And | |
654 | .TS | |
655 | center; | |
656 | l. | |
657 | [A\(miZa\(miz][A\(miZa\(miz0\(mi9]\(** | |
658 | .TE | |
659 | indicates all alphanumeric strings with a leading | |
660 | alphabetic character. | |
661 | This is a typical expression for recognizing identifiers in | |
662 | computer languages. | |
663 | .PP | |
664 | .I | |
665 | Alternation and Grouping. | |
666 | .R | |
667 | The operator | | |
668 | indicates alternation: | |
669 | .TS | |
670 | center; | |
671 | l. | |
672 | (ab\||\|cd) | |
673 | .TE | |
674 | matches either | |
675 | .ul | |
676 | ab | |
677 | or | |
678 | .ul | |
679 | cd. | |
680 | Note that parentheses are used for grouping, although | |
681 | they are | |
682 | not necessary on the outside level; | |
683 | .TS | |
684 | center; | |
685 | l. | |
686 | ab\||\|cd | |
687 | .TE | |
688 | would have sufficed. | |
689 | Parentheses | |
690 | can be used for more complex expressions: | |
691 | .TS | |
692 | center; | |
693 | l. | |
694 | (ab\||\|cd+)?(ef)\(** | |
695 | .TE | |
696 | matches such strings as | |
697 | .I abefef , | |
698 | .I efefef , | |
699 | .I cdef , | |
700 | or | |
701 | .I cddd\| ; | |
702 | but not | |
703 | .I abc , | |
704 | .I abcd , | |
705 | or | |
706 | .I abcdef . | |
707 | .PP | |
708 | .I | |
709 | Context sensitivity. | |
710 | .R | |
711 | Lex will recognize a small amount of surrounding | |
712 | context. The two simplest operators for this are | |
713 | .I ^ | |
714 | and | |
715 | .I $ . | |
716 | If the first character of an expression is | |
717 | .I ^ , | |
718 | the expression will only be matched at the beginning | |
719 | of a line (after a newline character, or at the beginning of | |
720 | the input stream). | |
721 | This can never conflict with the other meaning of | |
722 | .I ^ , | |
723 | complementation | |
724 | of character classes, since that only applies within | |
725 | the [\|] operators. | |
726 | If the very last character is | |
727 | .I $ , | |
728 | the expression will only be matched at the end of a line (when | |
729 | immediately followed by newline). | |
730 | The latter operator is a special case of the | |
731 | .I / | |
732 | operator character, | |
733 | which indicates trailing context. | |
734 | The expression | |
735 | .TS | |
736 | center; | |
737 | l. | |
738 | ab/cd | |
739 | .TE | |
740 | matches the string | |
741 | .I ab , | |
742 | but only if followed by | |
743 | .ul | |
744 | cd. | |
745 | Thus | |
746 | .TS | |
747 | center; | |
748 | l. | |
749 | ab$ | |
750 | .TE | |
751 | is the same as | |
752 | .TS | |
753 | center; | |
754 | l. | |
755 | ab/\en | |
756 | .TE | |
757 | Left context is handled in Lex by | |
758 | .I | |
759 | start conditions | |
760 | .R | |
761 | as explained in section 10. If a rule is only to be executed | |
762 | when the Lex automaton interpreter is in start condition | |
763 | .I | |
764 | x, | |
765 | .R | |
766 | the rule should be prefixed by | |
767 | .TS | |
768 | center; | |
769 | l. | |
770 | <x> | |
771 | .TE | |
772 | using the angle bracket operator characters. | |
773 | If we considered ``being at the beginning of a line'' to be | |
774 | start condition | |
775 | .I ONE , | |
776 | then the ^ operator | |
777 | would be equivalent to | |
778 | .TS | |
779 | center; | |
780 | l. | |
781 | <ONE> | |
782 | .TE | |
783 | Start conditions are explained more fully later. | |
784 | .PP | |
785 | .I | |
786 | Repetitions and Definitions. | |
787 | .R | |
788 | The operators {} specify | |
789 | either repetitions (if they enclose numbers) | |
790 | or | |
791 | definition expansion (if they enclose a name). For example | |
792 | .TS | |
793 | center; | |
794 | l. | |
795 | {digit} | |
796 | .TE | |
797 | looks for a predefined string named | |
798 | .I digit | |
799 | and inserts it | |
800 | at that point in the expression. | |
801 | The definitions are given in the first part of the Lex | |
802 | input, before the rules. | |
803 | In contrast, | |
804 | .TS | |
805 | center; | |
806 | l. | |
807 | a{1,5} | |
808 | .TE | |
809 | looks for 1 to 5 occurrences of | |
810 | .I a . | |
811 | .PP | |
812 | Finally, initial | |
813 | .I % | |
814 | is special, being the separator | |
815 | for Lex source segments. | |
816 | .2C | |
817 | .NH | |
818 | Lex Actions. | |
819 | .PP | |
820 | When an expression written as above is matched, Lex | |
821 | executes the corresponding action. This section describes | |
822 | some features of Lex which aid in writing actions. Note | |
823 | that there is a default action, which | |
824 | consists of copying the input to the output. This | |
825 | is performed on all strings not otherwise matched. Thus | |
826 | the Lex user who wishes to absorb the entire input, without | |
827 | producing any output, must provide rules to match everything. | |
828 | When Lex is being used with Yacc, this is the normal | |
829 | situation. | |
830 | One may consider that actions are what is done instead of | |
831 | copying the input to the output; thus, in general, | |
832 | a rule which merely copies can be omitted. | |
833 | Also, a character combination | |
834 | which is omitted from the rules | |
835 | and which appears as input | |
836 | is likely to be printed on the output, thus calling | |
837 | attention to the gap in the rules. | |
838 | .PP | |
839 | One of the simplest things that can be done is to ignore | |
840 | the input. Specifying a C null statement, \fI;\fR as an action | |
841 | causes this result. A frequent rule is | |
842 | .TS | |
843 | center; | |
844 | l l. | |
845 | [ \et\en] ; | |
846 | .TE | |
847 | which causes the three spacing characters (blank, tab, and newline) | |
848 | to be ignored. | |
849 | .PP | |
850 | Another easy way to avoid writing actions is the action character | |
851 | |, which indicates that the action for this rule is the action | |
852 | for the next rule. | |
853 | The previous example could also have been written | |
854 | .TS | |
855 | center; | |
856 | l l. | |
857 | " " | | |
858 | "\et" | | |
859 | "\en" ; | |
860 | .TE | |
861 | with the same result, although in different style. | |
862 | The quotes around \en and \et are not required. | |
863 | .PP | |
864 | In more complex actions, the user | |
865 | will | |
866 | often want to know the actual text that matched some expression | |
867 | like | |
868 | .I [a\(miz]+ . | |
869 | Lex leaves this text in an external character | |
870 | array named | |
871 | .I | |
872 | yytext. | |
873 | .R | |
874 | Thus, to print the name found, | |
875 | a rule like | |
876 | .TS | |
877 | center; | |
878 | l l. | |
879 | [a\-z]+ printf("%s", yytext); | |
880 | .TE | |
881 | will print | |
882 | the string in | |
883 | .I | |
884 | yytext. | |
885 | .R | |
886 | The C function | |
887 | .I | |
888 | printf | |
889 | .R | |
890 | accepts a format argument and data to be printed; | |
891 | in this case, the format is ``print string'' (% indicating | |
892 | data conversion, and | |
893 | .I s | |
894 | indicating string type), | |
895 | and the data are the characters | |
896 | in | |
897 | .I | |
898 | yytext. | |
899 | .R | |
900 | So this just places | |
901 | the matched string | |
902 | on the output. | |
903 | This action | |
904 | is so common that | |
905 | it may be written as ECHO: | |
906 | .TS | |
907 | center; | |
908 | l l. | |
909 | [a\-z]+ ECHO; | |
910 | .TE | |
911 | is the same as the above. | |
912 | Since the default action is just to | |
913 | print the characters found, one might ask why | |
914 | give a rule, like this one, which merely specifies | |
915 | the default action? | |
916 | Such rules are often required | |
917 | to avoid matching some other rule | |
918 | which is not desired. For example, if there is a rule | |
919 | which matches | |
920 | .I read | |
921 | it will normally match the instances of | |
922 | .I read | |
923 | contained in | |
924 | .I bread | |
925 | or | |
926 | .I readjust ; | |
927 | to avoid | |
928 | this, | |
929 | a rule | |
930 | of the form | |
931 | .I [a\(miz]+ | |
932 | is needed. | |
933 | This is explained further below. | |
934 | .PP | |
935 | Sometimes it is more convenient to know the end of what | |
936 | has been found; hence Lex also provides a count | |
937 | .I | |
938 | yyleng | |
939 | .R | |
940 | of the number of characters matched. | |
941 | To count both the number | |
942 | of words and the number of characters in words in the input, the user might write | |
943 | .TS | |
944 | center; | |
945 | l l. | |
946 | [a\-zA\-Z]+ {words++; chars += yyleng;} | |
947 | .TE | |
948 | which accumulates in | |
949 | .ul | |
950 | chars | |
951 | the number | |
952 | of characters in the words recognized. | |
953 | The last character in the string matched can | |
954 | be accessed by | |
955 | .TS | |
956 | center; | |
957 | l. | |
958 | yytext[yyleng\-1] | |
959 | .TE | |
960 | .PP | |
961 | Occasionally, a Lex | |
962 | action may decide that a rule has not recognized the correct | |
963 | span of characters. | |
964 | Two routines are provided to aid with this situation. | |
965 | First, | |
966 | .I | |
967 | yymore() | |
968 | .R | |
969 | can be called to indicate that the next input expression recognized is to be | |
970 | tacked on to the end of this input. Normally, | |
971 | the next input string would overwrite the current | |
972 | entry in | |
973 | .I | |
974 | yytext. | |
975 | .R | |
976 | Second, | |
977 | .I | |
978 | yyless (n) | |
979 | .R | |
980 | may be called to indicate that not all the characters matched | |
981 | by the currently successful expression are wanted right now. | |
982 | The argument | |
983 | .I | |
984 | n | |
985 | .R | |
986 | indicates the number of characters | |
987 | in | |
988 | .I | |
989 | yytext | |
990 | .R | |
991 | to be retained. | |
992 | Further characters previously matched | |
993 | are | |
994 | returned to the input. This provides the same sort of | |
995 | look~ahead offered by the / operator, | |
996 | but in a different form. | |
997 | .PP | |
998 | .I | |
999 | Example: | |
1000 | .R | |
1001 | Consider a language which defines | |
1002 | a string as a set of characters between quotation (") marks, and provides that | |
1003 | to include a " in a string it must be preceded by a \e. The | |
1004 | regular expression which matches that is somewhat confusing, | |
1005 | so that it might be preferable to write | |
1006 | .TS | |
1007 | center; | |
1008 | l l. | |
1009 | \e"[^"]\(** { | |
1010 | if (yytext[yyleng\-1] == \(fm\e\e\(fm) | |
1011 | yymore(); | |
1012 | else | |
1013 | ... normal user processing | |
1014 | } | |
1015 | .TE | |
1016 | which will, when faced with a string such as | |
1017 | .I | |
1018 | "abc\e"def\|" | |
1019 | .R | |
1020 | first match | |
1021 | the five characters | |
1022 | \fI"abc\e\|\fR; | |
1023 | then | |
1024 | the call to | |
1025 | .I yymore() | |
1026 | will | |
1027 | cause the next part of the string, | |
1028 | \fI"def\|\fR, | |
1029 | to be tacked on the end. | |
1030 | Note that the final quote terminating the string should be picked | |
1031 | up in the code labeled ``normal processing''. | |
1032 | .PP | |
1033 | The function | |
1034 | .I | |
1035 | yyless() | |
1036 | .R | |
1037 | might be used to reprocess | |
1038 | text in various circumstances. Consider the C problem of distinguishing | |
1039 | the ambiguity of ``=\(mia''. | |
1040 | Suppose it is desired to treat this as ``=\(mi a'' | |
1041 | but print a message. A rule might be | |
5652ef93 KM |
1042 | .ps 9 |
1043 | .vs 11 | |
b1716fed KM |
1044 | .TS |
1045 | center; | |
1046 | l l. | |
1047 | =\(mi[a\-zA\-Z] { | |
5652ef93 | 1048 | printf("Op (=\(mi) ambiguous\en"); |
b1716fed KM |
1049 | yyless(yyleng\-1); |
1050 | ... action for =\(mi ... | |
1051 | } | |
1052 | .TE | |
5652ef93 KM |
1053 | .ps 10 |
1054 | .vs 12 | |
b1716fed KM |
1055 | which prints a message, returns the letter after the |
1056 | operator to the input stream, and treats the operator as ``=\(mi''. | |
1057 | Alternatively it might be desired to treat this as ``= \(mia''. | |
1058 | To do this, just return the minus | |
1059 | sign as well as the letter to the input: | |
5652ef93 KM |
1060 | .ps 9 |
1061 | .vs 11 | |
b1716fed KM |
1062 | .TS |
1063 | center; | |
1064 | l l. | |
1065 | =\(mi[a\-zA\-Z] { | |
5652ef93 | 1066 | printf("Op (=\(mi) ambiguous\en"); |
b1716fed KM |
1067 | yyless(yyleng\-2); |
1068 | ... action for = ... | |
1069 | } | |
1070 | .TE | |
5652ef93 KM |
1071 | .ps 10 |
1072 | .vs 12 | |
b1716fed KM |
1073 | will perform the other interpretation. |
1074 | Note that the expressions for the two cases might more easily | |
1075 | be written | |
1076 | .TS | |
1077 | center; | |
1078 | l l. | |
1079 | =\(mi/[A\-Za\-z] | |
1080 | .TE | |
1081 | in the first case and | |
1082 | .TS | |
1083 | center; | |
1084 | l. | |
1085 | =/\-[A\-Za\-z] | |
1086 | .TE | |
1087 | in the second; | |
1088 | no backup would be required in the rule action. | |
1089 | It is not necessary to recognize the whole identifier | |
1090 | to observe the ambiguity. | |
1091 | The | |
1092 | possibility of ``=\(mi3'', however, makes | |
1093 | .TS | |
1094 | center; | |
1095 | l. | |
1096 | =\(mi/[^ \et\en] | |
1097 | .TE | |
1098 | a still better rule. | |
1099 | .PP | |
1100 | In addition to these routines, Lex also permits | |
1101 | access to the I/O routines | |
1102 | it uses. | |
1103 | They are: | |
1104 | .IP 1) | |
1105 | .I | |
1106 | input() | |
1107 | .R | |
1108 | which returns the next input character; | |
1109 | .IP 2) | |
1110 | .I | |
1111 | output(c) | |
1112 | .R | |
1113 | which writes the character | |
1114 | .I | |
1115 | c | |
1116 | .R | |
1117 | on the output; and | |
1118 | .IP 3) | |
1119 | .I | |
1120 | unput(c) | |
1121 | .R | |
1122 | pushes the character | |
1123 | .I | |
1124 | c | |
1125 | .R | |
1126 | back onto the input stream to be read later by | |
1127 | .I | |
1128 | input(). | |
1129 | .R | |
1130 | .LP | |
1131 | By default these routines are provided as macro definitions, | |
1132 | but the user can override them and supply private versions. | |
1133 | These routines | |
1134 | define the relationship between external files and | |
1135 | internal characters, and must all be retained | |
1136 | or modified consistently. | |
1137 | They may be redefined, to | |
1138 | cause input or output to be transmitted to or from strange | |
1139 | places, including other programs or internal memory; | |
1140 | but the character set used must be consistent in all routines; | |
1141 | a value of zero returned by | |
1142 | .I | |
1143 | input | |
1144 | .R | |
1145 | must mean end of file; and | |
1146 | the relationship between | |
1147 | .I | |
1148 | unput | |
1149 | .R | |
1150 | and | |
1151 | .I | |
1152 | input | |
1153 | .R | |
1154 | must be retained | |
1155 | or the Lex look~ahead will not work. | |
1156 | Lex does not look ahead at all if it does not have to, | |
1157 | but every rule ending in | |
1158 | .ft I | |
1159 | + \(** ? | |
1160 | .ft R | |
1161 | or | |
1162 | .ft I | |
1163 | $ | |
1164 | .ft R | |
1165 | or containing | |
1166 | .ft I | |
1167 | / | |
1168 | .ft R | |
1169 | implies look~ahead. | |
1170 | Look~ahead is also necessary to match an expression that is a prefix | |
1171 | of another expression. | |
1172 | See below for a discussion of the character set used by Lex. | |
1173 | The standard Lex library imposes | |
1174 | a 100 character limit on backup. | |
1175 | .PP | |
1176 | Another Lex library routine that the user will sometimes want | |
1177 | to redefine is | |
1178 | .I | |
1179 | yywrap() | |
1180 | .R | |
1181 | which is called whenever Lex reaches an end-of-file. | |
1182 | If | |
1183 | .I | |
1184 | yywrap | |
1185 | .R | |
1186 | returns a 1, Lex continues with the normal wrapup on end of input. | |
1187 | Sometimes, however, it is convenient to arrange for more | |
1188 | input to arrive | |
1189 | from a new source. | |
1190 | In this case, the user should provide | |
1191 | a | |
1192 | .I | |
1193 | yywrap | |
1194 | .R | |
1195 | which | |
1196 | arranges for new input and | |
1197 | returns 0. This instructs Lex to continue processing. | |
1198 | The default | |
1199 | .I | |
1200 | yywrap | |
1201 | .R | |
1202 | always returns 1. | |
1203 | .PP | |
1204 | This routine is also a convenient place | |
1205 | to print tables, summaries, etc. at the end | |
1206 | of a program. Note that it is not | |
1207 | possible to write a normal rule which recognizes | |
1208 | end-of-file; the only access to this condition is | |
1209 | through | |
1210 | .I | |
1211 | yywrap. | |
1212 | .R | |
1213 | In fact, unless a private version of | |
1214 | .I | |
1215 | input() | |
1216 | .R | |
1217 | is supplied | |
1218 | a file containing nulls | |
1219 | cannot be handled, | |
1220 | since a value of 0 returned by | |
1221 | .I | |
1222 | input | |
1223 | .R | |
1224 | is taken to be end-of-file. | |
1225 | .PP | |
1226 | .2C | |
1227 | .NH | |
1228 | Ambiguous Source Rules. | |
1229 | .PP | |
1230 | Lex can handle ambiguous specifications. | |
1231 | When more than one expression can match the | |
1232 | current input, Lex chooses as follows: | |
1233 | .IP 1) | |
1234 | The longest match is preferred. | |
1235 | .IP 2) | |
1236 | Among rules which matched the same number of characters, | |
1237 | the rule given first is preferred. | |
1238 | .LP | |
1239 | Thus, suppose the rules | |
1240 | .TS | |
1241 | center; | |
1242 | l l. | |
1243 | integer keyword action ...; | |
1244 | [a\-z]+ identifier action ...; | |
1245 | .TE | |
1246 | to be given in that order. If the input is | |
1247 | .I integers , | |
1248 | it is taken as an identifier, because | |
1249 | .I [a\-z]+ | |
1250 | matches 8 characters while | |
1251 | .I integer | |
1252 | matches only 7. | |
1253 | If the input is | |
1254 | .I integer , | |
1255 | both rules match 7 characters, and | |
1256 | the keyword rule is selected because it was given first. | |
1257 | Anything shorter (e.g. \fIint\fR\|) will | |
1258 | not match the expression | |
1259 | .I integer | |
1260 | and so the identifier interpretation is used. | |
1261 | .PP | |
1262 | The principle of preferring the longest | |
1263 | match makes rules containing | |
1264 | expressions like | |
1265 | .I \&.\(** | |
1266 | dangerous. | |
1267 | For example, | |
1268 | .TS | |
1269 | center; | |
1270 | l. | |
1271 | \&\(fm.\(**\(fm | |
1272 | .TE | |
1273 | might seem a good way of recognizing | |
1274 | a string in single quotes. | |
1275 | But it is an invitation for the program to read far | |
1276 | ahead, looking for a distant | |
1277 | single quote. | |
1278 | Presented with the input | |
1279 | .TS | |
1280 | center; | |
1281 | l l. | |
1282 | \&\(fmfirst\(fm quoted string here, \(fmsecond\(fm here | |
1283 | .TE | |
1284 | the above expression will match | |
1285 | .TS | |
1286 | center; | |
1287 | l l. | |
1288 | \&\(fmfirst\(fm quoted string here, \(fmsecond\(fm | |
1289 | .TE | |
1290 | which is probably not what was wanted. | |
1291 | A better rule is of the form | |
1292 | .TS | |
1293 | center; | |
1294 | l. | |
1295 | \&\(fm[^\(fm\en]\(**\(fm | |
1296 | .TE | |
1297 | which, on the above input, will stop | |
1298 | after | |
1299 | .I \(fmfirst\(fm . | |
1300 | The consequences | |
1301 | of errors like this are mitigated by the fact | |
1302 | that the | |
1303 | .I \&. | |
1304 | operator will not match newline. | |
1305 | Thus expressions like | |
1306 | .I \&.\(** | |
1307 | stop on the | |
1308 | current line. | |
1309 | Don't try to defeat this with expressions like | |
1310 | .I [.\en]+ | |
1311 | or | |
1312 | equivalents; | |
1313 | the Lex generated program will try to read | |
1314 | the entire input file, causing | |
1315 | internal buffer overflows. | |
1316 | .PP | |
1317 | Note that Lex is normally partitioning | |
1318 | the input stream, not searching for all possible matches | |
1319 | of each expression. | |
1320 | This means that each character is accounted for | |
1321 | once and only once. | |
1322 | For example, suppose it is desired to | |
1323 | count occurrences of both \fIshe\fR and \fIhe\fR in an input text. | |
1324 | Some Lex rules to do this might be | |
1325 | .TS | |
1326 | center; | |
1327 | l l. | |
1328 | she s++; | |
1329 | he h++; | |
1330 | \en | | |
1331 | \&. ; | |
1332 | .TE | |
1333 | where the last two rules ignore everything besides \fIhe\fR and \fIshe\fR. | |
1334 | Remember that . does not include newline. | |
1335 | Since \fIshe\fR includes \fIhe\fR, Lex will normally | |
1336 | .I | |
1337 | not | |
1338 | .R | |
1339 | recognize | |
1340 | the instances of \fIhe\fR included in \fIshe\fR, | |
1341 | since once it has passed a \fIshe\fR those characters are gone. | |
1342 | .PP | |
1343 | Sometimes the user would like to override this choice. The action | |
1344 | REJECT | |
1345 | means ``go do the next alternative.'' | |
1346 | It causes whatever rule was second choice after the current | |
1347 | rule to be executed. | |
1348 | The position of the input pointer is adjusted accordingly. | |
1349 | Suppose the user really wants to count the included instances of \fIhe\fR: | |
1350 | .TS | |
1351 | center; | |
1352 | l l. | |
1353 | she {s++; REJECT;} | |
1354 | he {h++; REJECT;} | |
1355 | \en | | |
1356 | \&. ; | |
1357 | .TE | |
1358 | these rules are one way of changing the previous example | |
1359 | to do just that. | |
1360 | After counting each expression, it is rejected; whenever appropriate, | |
1361 | the other expression will then be counted. In this example, of course, | |
1362 | the user could note that \fIshe\fR includes \fIhe\fR but not | |
1363 | vice versa, and omit the REJECT action on \fIhe\fR; | |
1364 | in other cases, however, it | |
1365 | would not be possible a priori to tell | |
1366 | which input characters | |
1367 | were in both classes. | |
1368 | .PP | |
1369 | Consider the two rules | |
1370 | .TS | |
1371 | center; | |
1372 | l l. | |
1373 | a[bc]+ { ... ; REJECT;} | |
1374 | a[cd]+ { ... ; REJECT;} | |
1375 | .TE | |
1376 | If the input is | |
1377 | .I ab , | |
1378 | only the first rule matches, | |
1379 | and on | |
1380 | .I ad | |
1381 | only the second matches. | |
1382 | The input string | |
1383 | .I accb | |
1384 | matches the first rule for four characters | |
1385 | and then the second rule for three characters. | |
1386 | In contrast, the input | |
1387 | .I accd | |
1388 | agrees with | |
1389 | the second rule for four characters and then the first | |
1390 | rule for three. | |
1391 | .PP | |
1392 | In general, REJECT is useful whenever | |
1393 | the purpose of Lex is not to partition the input | |
1394 | stream but to detect all examples of some items | |
1395 | in the input, and the instances of these items | |
1396 | may overlap or include each other. | |
1397 | Suppose a digram table of the input is desired; | |
1398 | normally the digrams overlap, that is the word | |
1399 | .I the | |
1400 | is considered to contain | |
1401 | both | |
1402 | .I th | |
1403 | and | |
1404 | .I he . | |
1405 | Assuming a two-dimensional array named | |
1406 | .ul | |
1407 | digram | |
1408 | to be incremented, the appropriate | |
1409 | source is | |
1410 | .TS | |
1411 | center; | |
1412 | l l. | |
1413 | %% | |
5652ef93 KM |
1414 | [a\-z][a\-z] { |
1415 | digram[yytext[0]][yytext[1]]++; | |
1416 | REJECT; | |
1417 | } | |
1418 | \. ; | |
b1716fed KM |
1419 | \en ; |
1420 | .TE | |
1421 | where the REJECT is necessary to pick up | |
1422 | a letter pair beginning at every character, rather than at every | |
1423 | other character. | |
1424 | .2C | |
1425 | .NH | |
1426 | Lex Source Definitions. | |
1427 | .PP | |
1428 | Remember the format of the Lex | |
1429 | source: | |
1430 | .TS | |
1431 | center; | |
1432 | l. | |
1433 | {definitions} | |
1434 | %% | |
1435 | {rules} | |
1436 | %% | |
1437 | {user routines} | |
1438 | .TE | |
1439 | So far only the rules have been described. The user needs | |
1440 | additional options, | |
1441 | though, to define variables for use in his program and for use | |
1442 | by Lex. | |
1443 | These can go either in the definitions section | |
1444 | or in the rules section. | |
1445 | .PP | |
1446 | Remember that Lex is turning the rules into a program. | |
1447 | Any source not intercepted by Lex is copied | |
1448 | into the generated program. There are three classes | |
1449 | of such things. | |
1450 | .IP 1) | |
1451 | Any line which is not part of a Lex rule or action | |
1452 | which begins with a blank or tab is copied into | |
1453 | the Lex generated program. | |
1454 | Such source input prior to the first %% delimiter will be external | |
1455 | to any function in the code; if it appears immediately after the first | |
1456 | %%, | |
1457 | it appears in an appropriate place for declarations | |
1458 | in the function written by Lex which contains the actions. | |
1459 | This material must look like program fragments, | |
1460 | and should precede the first Lex rule. | |
1461 | .IP | |
1462 | As a side effect of the above, lines which begin with a blank | |
1463 | or tab, and which contain a comment, | |
1464 | are passed through to the generated program. | |
1465 | This can be used to include comments in either the Lex source or | |
1466 | the generated code. The comments should follow the host | |
1467 | language convention. | |
1468 | .IP 2) | |
1469 | Anything included between lines containing | |
1470 | only | |
1471 | .I %{ | |
1472 | and | |
1473 | .I %} | |
1474 | is | |
1475 | copied out as above. The delimiters are discarded. | |
1476 | This format permits entering text like preprocessor statements that | |
1477 | must begin in column 1, | |
1478 | or copying lines that do not look like programs. | |
1479 | .IP 3) | |
1480 | Anything after the third %% delimiter, regardless of formats, etc., | |
1481 | is copied out after the Lex output. | |
1482 | .PP | |
1483 | Definitions intended for Lex are given | |
1484 | before the first %% delimiter. Any line in this section | |
1485 | not contained between %{ and %}, and begining | |
1486 | in column 1, is assumed to define Lex substitution strings. | |
1487 | The format of such lines is | |
1488 | .TS | |
1489 | center; | |
1490 | l l. | |
1491 | name translation | |
1492 | .TE | |
1493 | and it | |
1494 | causes the string given as a translation to | |
1495 | be associated with the name. | |
1496 | The name and translation | |
1497 | must be separated by at least one blank or tab, and the name must begin with a letter. | |
1498 | The translation can then be called out | |
1499 | by the {name} syntax in a rule. | |
1500 | Using {D} for the digits and {E} for an exponent field, | |
1501 | for example, might abbreviate rules to recognize numbers: | |
1502 | .TS | |
1503 | center; | |
1504 | l l. | |
1505 | D [0\-9] | |
1506 | E [DEde][\-+]?{D}+ | |
1507 | %% | |
1508 | {D}+ printf("integer"); | |
1509 | {D}+"."{D}\(**({E})? | | |
1510 | {D}\(**"."{D}+({E})? | | |
1511 | {D}+{E} printf("real"); | |
1512 | .TE | |
1513 | Note the first two rules for real numbers; | |
1514 | both require a decimal point and contain | |
1515 | an optional exponent field, | |
1516 | but the first requires at least one digit before the | |
1517 | decimal point and the second requires at least one | |
1518 | digit after the decimal point. | |
1519 | To correctly handle the problem | |
1520 | posed by a Fortran expression such as | |
1521 | .I 35.EQ.I , | |
1522 | which does not contain a real number, a context-sensitive | |
1523 | rule such as | |
1524 | .TS | |
1525 | center; | |
1526 | l l. | |
1527 | [0\-9]+/"."EQ printf("integer"); | |
1528 | .TE | |
1529 | could be used in addition to the normal rule for integers. | |
1530 | .PP | |
1531 | The definitions | |
1532 | section may also contain other commands, including the | |
1533 | selection of a host language, a character set table, | |
1534 | a list of start conditions, or adjustments to the default | |
1535 | size of arrays within Lex itself for larger source programs. | |
1536 | These possibilities | |
1537 | are discussed below under ``Summary of Source Format,'' | |
1538 | section 12. | |
1539 | .2C | |
1540 | .NH | |
1541 | Usage. | |
1542 | .PP | |
1543 | There are two steps in | |
1544 | compiling a Lex source program. | |
1545 | First, the Lex source must be turned into a generated program | |
1546 | in the host general purpose language. | |
1547 | Then this program must be compiled and loaded, usually with | |
1548 | a library of Lex subroutines. | |
1549 | The generated program | |
1550 | is on a file named | |
1551 | .I lex.yy.c . | |
1552 | The I/O library is defined in terms of the C standard | |
1553 | library [6]. | |
1554 | .PP | |
1555 | The C programs generated by Lex are slightly different | |
1556 | on OS/370, because the | |
1557 | OS compiler is less powerful than the UNIX or GCOS compilers, | |
1558 | and does less at compile time. | |
1559 | C programs generated on GCOS and UNIX are the same. | |
1560 | .PP | |
1561 | .I | |
1562 | UNIX. | |
1563 | .R | |
1564 | The library is accessed by the loader flag | |
1565 | .I \-ll . | |
1566 | So an appropriate | |
1567 | set of commands is | |
1568 | .KS | |
1569 | .in 5 | |
1570 | lex source | |
1571 | cc lex.yy.c \-ll | |
1572 | .in 0 | |
1573 | .KE | |
1574 | The resulting program is placed on the usual file | |
1575 | .I | |
1576 | a.out | |
1577 | .R | |
1578 | for later execution. | |
1579 | To use Lex with Yacc see below. | |
1580 | Although the default Lex I/O routines use the C standard library, | |
1581 | the Lex automata themselves do not do so; | |
1582 | if private versions of | |
1583 | .I | |
1584 | input, | |
1585 | output | |
1586 | .R | |
1587 | and | |
1588 | .I unput | |
1589 | are given, the library can be avoided. | |
1590 | .PP | |
1591 | .2C | |
1592 | .NH | |
1593 | Lex and Yacc. | |
1594 | .PP | |
1595 | If you want to use Lex with Yacc, note that what Lex writes is a program | |
1596 | named | |
1597 | .I | |
1598 | yylex(), | |
1599 | .R | |
1600 | the name required by Yacc for its analyzer. | |
1601 | Normally, the default main program on the Lex library | |
1602 | calls this routine, but if Yacc is loaded, and its main | |
1603 | program is used, Yacc will call | |
1604 | .I | |
1605 | yylex(). | |
1606 | .R | |
1607 | In this case each Lex rule should end with | |
1608 | .TS | |
1609 | center; | |
1610 | l. | |
1611 | return(token); | |
1612 | .TE | |
1613 | where the appropriate token value is returned. | |
1614 | An easy way to get access | |
1615 | to Yacc's names for tokens is to | |
1616 | compile the Lex output file as part of | |
1617 | the Yacc output file by placing the line | |
1618 | .TS | |
1619 | center; | |
1620 | l. | |
1621 | # include "lex.yy.c" | |
1622 | .TE | |
1623 | in the last section of Yacc input. | |
1624 | Supposing the grammar to be | |
1625 | named ``good'' and the lexical rules to be named ``better'' | |
1626 | the UNIX command sequence can just be: | |
1627 | .TS | |
1628 | center; | |
1629 | l. | |
1630 | yacc good | |
1631 | lex better | |
1632 | cc y.tab.c \-ly \-ll | |
1633 | .TE | |
1634 | The Yacc library (\-ly) should be loaded before the Lex library, | |
1635 | to obtain a main program which invokes the Yacc parser. | |
1636 | The generations of Lex and Yacc programs can be done in | |
1637 | either order. | |
1638 | .2C | |
1639 | .NH | |
1640 | Examples. | |
1641 | .PP | |
1642 | As a trivial problem, consider copying an input file while | |
1643 | adding 3 to every positive number divisible by 7. | |
1644 | Here is a suitable Lex source program | |
1645 | .TS | |
1646 | center; | |
1647 | l l. | |
1648 | %% | |
1649 | int k; | |
1650 | [0\-9]+ { | |
1651 | k = atoi(yytext); | |
1652 | if (k%7 == 0) | |
1653 | printf("%d", k+3); | |
1654 | else | |
1655 | printf("%d",k); | |
1656 | } | |
1657 | .TE | |
1658 | to do just that. | |
1659 | The rule [0\-9]+ recognizes strings of digits; | |
1660 | .I | |
1661 | atoi | |
1662 | .R | |
1663 | converts the digits to binary | |
1664 | and stores the result in | |
1665 | .ul | |
1666 | k. | |
1667 | The operator % (remainder) is used to check whether | |
1668 | .ul | |
1669 | k | |
1670 | is divisible by 7; if it is, | |
1671 | it is incremented by 3 as it is written out. | |
1672 | It may be objected that this program will alter such | |
1673 | input items as | |
1674 | .I 49.63 | |
1675 | or | |
1676 | .I X7 . | |
1677 | Furthermore, it increments the absolute value | |
1678 | of all negative numbers divisible by 7. | |
1679 | To avoid this, just add a few more rules after the active one, | |
1680 | as here: | |
1681 | .TS | |
1682 | center; | |
1683 | l l. | |
1684 | %% | |
1685 | int k; | |
1686 | \-?[0\-9]+ { | |
1687 | k = atoi(yytext); | |
5652ef93 KM |
1688 | printf("%d", |
1689 | k%7 == 0 ? k+3 : k); | |
b1716fed KM |
1690 | } |
1691 | \-?[0\-9.]+ ECHO; | |
1692 | [A-Za-z][A-Za-z0-9]+ ECHO; | |
1693 | .TE | |
1694 | Numerical strings containing | |
1695 | a ``.'' or preceded by a letter will be picked up by | |
1696 | one of the last two rules, and not changed. | |
1697 | The | |
1698 | .I if\-else | |
1699 | has been replaced by | |
1700 | a C conditional expression to save space; | |
1701 | the form | |
1702 | .ul | |
1703 | a?b:c | |
1704 | means ``if | |
1705 | .I a | |
1706 | then | |
1707 | .I b | |
1708 | else | |
1709 | .I c ''. | |
1710 | .PP | |
1711 | For an example of statistics gathering, here | |
1712 | is a program which histograms the lengths | |
1713 | of words, where a word is defined as a string of letters. | |
1714 | .TS | |
1715 | center; | |
1716 | l l. | |
1717 | int lengs[100]; | |
1718 | %% | |
1719 | [a\-z]+ lengs[yyleng]++; | |
1720 | \&. | | |
1721 | \en ; | |
1722 | %% | |
1723 | .T& | |
1724 | l s. | |
1725 | yywrap() | |
1726 | { | |
1727 | int i; | |
1728 | printf("Length No. words\en"); | |
1729 | for(i=0; i<100; i++) | |
1730 | if (lengs[i] > 0) | |
1731 | printf("%5d%10d\en",i,lengs[i]); | |
1732 | return(1); | |
1733 | } | |
1734 | .TE | |
1735 | This program | |
1736 | accumulates the histogram, while producing no output. At the end | |
1737 | of the input it prints the table. | |
1738 | The final statement | |
1739 | .I | |
1740 | return(1); | |
1741 | .R | |
1742 | indicates that Lex is to perform wrapup. If | |
1743 | .I | |
1744 | yywrap | |
1745 | .R | |
1746 | returns zero (false) | |
1747 | it implies that further input is available | |
1748 | and the program is | |
1749 | to continue reading and processing. | |
1750 | To provide a | |
1751 | .I | |
1752 | yywrap | |
1753 | .R | |
1754 | that never | |
1755 | returns true causes an infinite loop. | |
1756 | .PP | |
1757 | As a larger example, | |
1758 | here are some parts of a program written by N. L. Schryer | |
1759 | to convert double precision Fortran to single precision Fortran. | |
1760 | Because Fortran does not distinguish upper and lower case letters, | |
1761 | this routine begins by defining a set of classes including | |
1762 | both cases of each letter: | |
1763 | .TS | |
1764 | center; | |
1765 | l l. | |
1766 | a [aA] | |
1767 | b [bB] | |
1768 | c [cC] | |
1769 | \&... | |
1770 | z [zZ] | |
1771 | .TE | |
1772 | An additional class recognizes white space: | |
1773 | .TS | |
1774 | center; | |
1775 | l l. | |
1776 | W [ \et]\(** | |
1777 | .TE | |
1778 | The first rule changes | |
1779 | ``double precision'' to ``real'', or ``DOUBLE PRECISION'' to ``REAL''. | |
1780 | .TS | |
1781 | center; | |
1782 | l. | |
1783 | {d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} { | |
1784 | printf(yytext[0]==\(fmd\(fm? "real" : "REAL"); | |
1785 | } | |
1786 | .TE | |
1787 | Care is taken throughout this program to preserve the case | |
1788 | (upper or lower) | |
1789 | of the original program. | |
1790 | The conditional operator is used to | |
1791 | select the proper form of the keyword. | |
1792 | The next rule copies continuation card indications to | |
1793 | avoid confusing them with constants: | |
1794 | .TS | |
1795 | center; | |
1796 | l l. | |
1797 | ^" "[^ 0] ECHO; | |
1798 | .TE | |
1799 | In the regular expression, the quotes surround the | |
1800 | blanks. | |
1801 | It is interpreted as | |
1802 | ``beginning of line, then five blanks, then | |
1803 | anything but blank or zero.'' | |
1804 | Note the two different meanings of | |
1805 | .I ^ . | |
1806 | There follow some rules to change double precision | |
1807 | constants to ordinary floating constants. | |
1808 | .TS | |
1809 | center; | |
1810 | l. | |
1811 | [0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+ | | |
1812 | [0\-9]+{W}"."{W}{d}{W}[+\-]?{W}[0\-9]+ | | |
1813 | "."{W}[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+ { | |
1814 | /\(** convert constants \(**/ | |
1815 | for(p=yytext; \(**p != 0; p++) | |
1816 | { | |
1817 | if (\(**p == \(fmd\(fm || \(**p == \(fmD\(fm) | |
1818 | \(**p=+ \(fme\(fm\- \(fmd\(fm; | |
1819 | ECHO; | |
1820 | } | |
1821 | .TE | |
1822 | After the floating point constant is recognized, it is | |
1823 | scanned by the | |
1824 | .ul | |
1825 | for | |
1826 | loop | |
1827 | to find the letter | |
1828 | .I d | |
1829 | or | |
1830 | .I D . | |
1831 | The program than adds | |
1832 | .c | |
1833 | .I \(fme\(fm\-\(fmd\(fm , | |
1834 | which converts | |
1835 | it to the next letter of the alphabet. | |
1836 | The modified constant, now single-precision, | |
1837 | is written out again. | |
1838 | There follow a series of names which must be respelled to remove | |
1839 | their initial \fId\fR. | |
1840 | By using the | |
1841 | array | |
1842 | .I | |
1843 | yytext | |
1844 | .R | |
1845 | the same action suffices for all the names (only a sample of | |
1846 | a rather long list is given here). | |
1847 | .TS | |
1848 | center; | |
1849 | l l. | |
1850 | {d}{s}{i}{n} | | |
1851 | {d}{c}{o}{s} | | |
1852 | {d}{s}{q}{r}{t} | | |
1853 | {d}{a}{t}{a}{n} | | |
1854 | \&... | |
1855 | {d}{f}{l}{o}{a}{t} printf("%s",yytext+1); | |
1856 | .TE | |
1857 | Another list of names must have initial \fId\fR changed to initial \fIa\fR: | |
1858 | .TS | |
1859 | center; | |
1860 | l l. | |
1861 | {d}{l}{o}{g} | | |
1862 | {d}{l}{o}{g}10 | | |
1863 | {d}{m}{i}{n}1 | | |
1864 | {d}{m}{a}{x}1 { | |
1865 | yytext[0] =+ \(fma\(fm \- \(fmd\(fm; | |
1866 | ECHO; | |
1867 | } | |
1868 | .TE | |
1869 | And one routine | |
1870 | must have initial \fId\fR changed to initial \fIr\fR: | |
1871 | .TS | |
1872 | center; | |
1873 | l l. | |
1874 | {d}1{m}{a}{c}{h} {yytext[0] =+ \(fmr\(fm \- \(fmd\(fm; | |
1875 | ECHO; | |
1876 | } | |
1877 | .TE | |
1878 | To avoid such names as \fIdsinx\fR being detected as instances | |
1879 | of \fIdsin\fR, some final rules pick up longer words as identifiers | |
1880 | and copy some surviving characters: | |
1881 | .TS | |
1882 | center; | |
1883 | l l. | |
1884 | [A\-Za\-z][A\-Za\-z0\-9]\(** | | |
1885 | [0\-9]+ | | |
1886 | \en | | |
1887 | \&. ECHO; | |
1888 | .TE | |
1889 | Note that this program is not complete; it | |
1890 | does not deal with the spacing problems in Fortran or | |
1891 | with the use of keywords as identifiers. | |
1892 | .br | |
1893 | .2C | |
1894 | .NH | |
1895 | Left Context Sensitivity. | |
1896 | .PP | |
1897 | Sometimes | |
1898 | it is desirable to have several sets of lexical rules | |
1899 | to be applied at different times in the input. | |
1900 | For example, a compiler preprocessor might distinguish | |
1901 | preprocessor statements and analyze them differently | |
1902 | from ordinary statements. | |
1903 | This requires | |
1904 | sensitivity | |
1905 | to prior context, and there are several ways of handling | |
1906 | such problems. | |
1907 | The \fI^\fR operator, for example, is a prior context operator, | |
1908 | recognizing immediately preceding left context just as \fI$\fR recognizes | |
1909 | immediately following right context. | |
1910 | Adjacent left context could be extended, to produce a facility similar to | |
1911 | that for adjacent right context, but it is unlikely | |
1912 | to be as useful, since often the relevant left context | |
1913 | appeared some time earlier, such as at the beginning of a line. | |
1914 | .PP | |
1915 | This section describes three means of dealing | |
1916 | with different environments: a simple use of flags, | |
1917 | when only a few rules change from one environment to another, | |
1918 | the use of | |
1919 | .I | |
1920 | start conditions | |
1921 | .R | |
1922 | on rules, | |
1923 | and the possibility of making multiple lexical analyzers all run | |
1924 | together. | |
1925 | In each case, there are rules which recognize the need to change the | |
1926 | environment in which the | |
1927 | following input text is analyzed, and set some parameter | |
1928 | to reflect the change. This may be a flag explicitly tested by | |
1929 | the user's action code; such a flag is the simplest way of dealing | |
1930 | with the problem, since Lex is not involved at all. | |
1931 | It may be more convenient, | |
1932 | however, | |
1933 | to have Lex remember the flags as initial conditions on the rules. | |
1934 | Any rule may be associated with a start condition. It will only | |
1935 | be recognized when Lex is in | |
1936 | that start condition. | |
1937 | The current start condition may be changed at any time. | |
1938 | Finally, if the sets of rules for the different environments | |
1939 | are very dissimilar, | |
1940 | clarity may be best achieved by writing several distinct lexical | |
1941 | analyzers, and switching from one to another as desired. | |
1942 | .PP | |
1943 | Consider the following problem: copy the input to the output, | |
1944 | changing the word \fImagic\fR to \fIfirst\fR on every line which began | |
1945 | with the letter \fIa\fR, changing \fImagic\fR to \fIsecond\fR on every line | |
1946 | which began with the letter \fIb\fR, and changing | |
1947 | \fImagic\fR to \fIthird\fR on every line which began | |
1948 | with the letter \fIc\fR. All other words and all other lines | |
1949 | are left unchanged. | |
1950 | .PP | |
1951 | These rules are so simple that the easiest way | |
1952 | to do this job is with a flag: | |
1953 | .TS | |
1954 | center; | |
1955 | l l. | |
1956 | int flag; | |
1957 | %% | |
1958 | ^a {flag = \(fma\(fm; ECHO;} | |
1959 | ^b {flag = \(fmb\(fm; ECHO;} | |
1960 | ^c {flag = \(fmc\(fm; ECHO;} | |
1961 | \en {flag = 0 ; ECHO;} | |
1962 | magic { | |
1963 | switch (flag) | |
1964 | { | |
1965 | case \(fma\(fm: printf("first"); break; | |
1966 | case \(fmb\(fm: printf("second"); break; | |
1967 | \11\11 case \(fmc\(fm: printf("third"); break; | |
1968 | default: ECHO; break; | |
1969 | } | |
1970 | } | |
1971 | .TE | |
1972 | should be adequate. | |
1973 | .PP | |
1974 | To handle the same problem with start conditions, each | |
1975 | start condition must be introduced to Lex in the definitions section | |
1976 | with a line reading | |
1977 | .TS | |
1978 | center; | |
1979 | l l. | |
1980 | %Start name1 name2 ... | |
1981 | .TE | |
1982 | where the conditions may be named in any order. | |
1983 | The word \fIStart\fR may be abbreviated to \fIs\fR or \fIS\fR. | |
1984 | The conditions may be referenced at the | |
1985 | head of a rule with the <> brackets: | |
1986 | .TS | |
1987 | center; | |
1988 | l. | |
1989 | <name1>expression | |
1990 | .TE | |
1991 | is a rule which is only recognized when Lex is in the | |
1992 | start condition \fIname1\fR. | |
1993 | To enter a start condition, | |
1994 | execute the action statement | |
1995 | .TS | |
1996 | center; | |
1997 | l. | |
1998 | BEGIN name1; | |
1999 | .TE | |
2000 | which changes the start condition to \fIname1\fR. | |
2001 | To resume the normal state, | |
2002 | .TS | |
2003 | center; | |
2004 | l. | |
2005 | BEGIN 0; | |
2006 | .TE | |
2007 | resets the initial condition | |
2008 | of the Lex automaton interpreter. | |
2009 | A rule may be active in several | |
2010 | start conditions: | |
2011 | .TS | |
2012 | center; | |
2013 | l. | |
2014 | <name1,name2,name3> | |
2015 | .TE | |
2016 | is a legal prefix. Any rule not beginning with the | |
2017 | <> prefix operator is always active. | |
2018 | .PP | |
2019 | The same example as before can be written: | |
2020 | .TS | |
2021 | center; | |
2022 | l l. | |
2023 | %START AA BB CC | |
2024 | %% | |
2025 | ^a {ECHO; BEGIN AA;} | |
2026 | ^b {ECHO; BEGIN BB;} | |
2027 | ^c {ECHO; BEGIN CC;} | |
2028 | \en {ECHO; BEGIN 0;} | |
2029 | <AA>magic printf("first"); | |
2030 | <BB>magic printf("second"); | |
2031 | <CC>magic printf("third"); | |
2032 | .TE | |
2033 | where the logic is exactly the same as in the previous | |
2034 | method of handling the problem, but Lex does the work | |
2035 | rather than the user's code. | |
2036 | .2C | |
2037 | .NH | |
2038 | Character Set. | |
2039 | .PP | |
2040 | The programs generated by Lex handle | |
2041 | character I/O only through the routines | |
2042 | .I | |
2043 | input, | |
2044 | output, | |
2045 | .R | |
2046 | and | |
2047 | .I | |
2048 | unput. | |
2049 | .R | |
2050 | Thus the character representation | |
2051 | provided in these routines | |
2052 | is accepted by Lex and employed to return | |
2053 | values in | |
2054 | .I | |
2055 | yytext. | |
2056 | .R | |
2057 | For internal use | |
2058 | a character is represented as a small integer | |
2059 | which, if the standard library is used, | |
2060 | has a value equal to the integer value of the bit | |
2061 | pattern representing the character on the host computer. | |
2062 | Normally, the letter | |
2063 | .I a | |
2064 | is represented as the same form as the character constant | |
2065 | .I \(fma\(fm . | |
2066 | If this interpretation is changed, by providing I/O | |
2067 | routines which translate the characters, | |
2068 | Lex must be told about | |
2069 | it, by giving a translation table. | |
2070 | This table must be in the definitions section, | |
2071 | and must be bracketed by lines containing only | |
2072 | ``%T''. | |
2073 | The table contains lines of the form | |
2074 | .TS | |
2075 | center; | |
2076 | l. | |
2077 | {integer} {character string} | |
2078 | .TE | |
2079 | which indicate the value associated with each character. | |
2080 | Thus the next example | |
2081 | .GS 2 | |
2082 | .TS | |
2083 | center; | |
2084 | l l. | |
2085 | %T | |
2086 | 1 Aa | |
2087 | 2 Bb | |
2088 | \&... | |
2089 | 26 Zz | |
2090 | 27 \en | |
2091 | 28 + | |
2092 | 29 \- | |
2093 | 30 0 | |
2094 | 31 1 | |
2095 | \&... | |
2096 | 39 9 | |
2097 | %T | |
2098 | .TE | |
2099 | .sp | |
2100 | .ce 1 | |
2101 | Sample character table. | |
2102 | .GE | |
2103 | maps the lower and upper case letters together into the integers 1 through 26, | |
2104 | newline into 27, + and \- into 28 and 29, and the | |
2105 | digits into 30 through 39. | |
2106 | Note the escape for newline. | |
2107 | If a table is supplied, every character that is to appear either | |
2108 | in the rules or in any valid input must be included | |
2109 | in the table. | |
2110 | No character | |
2111 | may be assigned the number 0, and no character may be | |
2112 | assigned a bigger number than the size of the hardware character set. | |
2113 | .2C | |
2114 | .NH | |
2115 | Summary of Source Format. | |
2116 | .PP | |
2117 | The general form of a Lex source file is: | |
2118 | .TS | |
2119 | center; | |
2120 | l. | |
2121 | {definitions} | |
2122 | %% | |
2123 | {rules} | |
2124 | %% | |
2125 | {user subroutines} | |
2126 | .TE | |
2127 | The definitions section contains | |
2128 | a combination of | |
2129 | .IP 1) | |
2130 | Definitions, in the form ``name space translation''. | |
2131 | .IP 2) | |
2132 | Included code, in the form ``space code''. | |
2133 | .IP 3) | |
2134 | Included code, in the form | |
2135 | .TS | |
2136 | center; | |
2137 | l. | |
2138 | %{ | |
2139 | code | |
2140 | %} | |
2141 | .TE | |
2142 | .ns | |
2143 | .IP 4) | |
2144 | Start conditions, given in the form | |
2145 | .TS | |
2146 | center; | |
2147 | l. | |
2148 | %S name1 name2 ... | |
2149 | .TE | |
2150 | .ns | |
2151 | .IP 5) | |
2152 | Character set tables, in the form | |
2153 | .TS | |
2154 | center; | |
2155 | l. | |
2156 | %T | |
2157 | number space character-string | |
2158 | \&... | |
2159 | %T | |
2160 | .TE | |
2161 | .ns | |
2162 | .IP 6) | |
2163 | Changes to internal array sizes, in the form | |
2164 | .TS | |
2165 | center; | |
2166 | l. | |
2167 | %\fIx\fR\0\0\fInnn\fR | |
2168 | .TE | |
2169 | where \fInnn\fR is a decimal integer representing an array size | |
2170 | and \fIx\fR selects the parameter as follows: | |
2171 | .TS | |
2172 | center; | |
2173 | c c | |
2174 | c l. | |
2175 | Letter Parameter | |
2176 | p positions | |
2177 | n states | |
2178 | e tree nodes | |
2179 | a transitions | |
2180 | k packed character classes | |
2181 | o output array size | |
2182 | .TE | |
2183 | .LP | |
2184 | Lines in the rules section have the form ``expression action'' | |
2185 | where the action may be continued on succeeding | |
2186 | lines by using braces to delimit it. | |
2187 | .PP | |
2188 | Regular expressions in Lex use the following | |
2189 | operators: | |
2190 | .br | |
2191 | .TS | |
2192 | center; | |
2193 | l l. | |
2194 | x the character "x" | |
2195 | "x" an "x", even if x is an operator. | |
2196 | \ex an "x", even if x is an operator. | |
2197 | [xy] the character x or y. | |
2198 | [x\-z] the characters x, y or z. | |
2199 | [^x] any character but x. | |
2200 | \&. any character but newline. | |
2201 | ^x an x at the beginning of a line. | |
2202 | <y>x an x when Lex is in start condition y. | |
2203 | x$ an x at the end of a line. | |
2204 | x? an optional x. | |
2205 | x\(** 0,1,2, ... instances of x. | |
2206 | x+ 1,2,3, ... instances of x. | |
2207 | x|y an x or a y. | |
2208 | (x) an x. | |
2209 | x/y an x but only if followed by y. | |
5652ef93 KM |
2210 | {xx} the translation of xx from the |
2211 | definitions section. | |
b1716fed KM |
2212 | x{m,n} \fIm\fR through \fIn\fR occurrences of x |
2213 | .TE | |
2214 | .NH | |
2215 | Caveats and Bugs. | |
2216 | .PP | |
2217 | There are pathological expressions which | |
2218 | produce exponential growth of the tables when | |
2219 | converted to deterministic machines; | |
2220 | fortunately, they are rare. | |
2221 | .PP | |
2222 | REJECT does not rescan the input; instead it remembers the results of the previous | |
2223 | scan. This means that if a rule with trailing context is found, and | |
2224 | REJECT executed, the user | |
2225 | must not have used | |
2226 | .ul | |
2227 | unput | |
2228 | to change the characters forthcoming | |
2229 | from the input stream. | |
2230 | This is the only restriction on the user's ability to manipulate | |
2231 | the not-yet-processed input. | |
2232 | .PP | |
2233 | .2C | |
2234 | .NH | |
2235 | Acknowledgments. | |
2236 | .PP | |
2237 | As should | |
2238 | be obvious from the above, the outside of Lex | |
2239 | is patterned | |
2240 | on Yacc and the inside on Aho's string matching routines. | |
2241 | Therefore, both S. C. Johnson and A. V. Aho | |
2242 | are really originators | |
2243 | of much of Lex, | |
2244 | as well as debuggers of it. | |
2245 | Many thanks are due to both. | |
2246 | .PP | |
2247 | The code of the current version of Lex was designed, written, | |
2248 | and debugged by Eric Schmidt. | |
2249 | .SG MH-1274-MEL-unix | |
2250 | .sp 1 | |
2251 | .2C | |
2252 | .NH | |
2253 | References. | |
2254 | .SP 1v | |
2255 | .IP 1. | |
2256 | B. W. Kernighan and D. M. Ritchie, | |
2257 | .I | |
2258 | The C Programming Language, | |
2259 | .R | |
2260 | Prentice-Hall, N. J. (1978). | |
2261 | .IP 2. | |
2262 | B. W. Kernighan, | |
2263 | .I | |
2264 | Ratfor: A Preprocessor for a Rational Fortran, | |
2265 | .R | |
2266 | Software \- Practice and Experience, | |
2267 | \fB5\fR, pp. 395-496 (1975). | |
2268 | .IP 3. | |
2269 | S. C. Johnson, | |
2270 | .I | |
2271 | Yacc: Yet Another Compiler Compiler, | |
2272 | .R | |
2273 | Computing Science Technical Report No. 32, | |
2274 | 1975, | |
2275 | .MH | |
2276 | .if \n(tm (also TM 75-1273-6) | |
2277 | .IP 4. | |
2278 | A. V. Aho and M. J. Corasick, | |
2279 | .I | |
2280 | Efficient String Matching: An Aid to Bibliographic Search, | |
2281 | .R | |
2282 | Comm. ACM | |
2283 | .B | |
2284 | 18, | |
2285 | .R | |
2286 | 333-340 (1975). | |
2287 | .IP 5. | |
2288 | B. W. Kernighan, D. M. Ritchie and K. L. Thompson, | |
2289 | .I | |
2290 | QED Text Editor, | |
2291 | .R | |
2292 | Computing Science Technical Report No. 5, | |
2293 | 1972, | |
2294 | .MH | |
2295 | .IP 6. | |
2296 | D. M. Ritchie, | |
2297 | private communication. | |
2298 | See also | |
2299 | M. E. Lesk, | |
2300 | .I | |
2301 | The Portable C Library, | |
2302 | .R | |
2303 | Computing Science Technical Report No. 31, | |
2304 | .MH | |
2305 | .if \n(tm (also TM 75-1274-11) |