| 1 | .hc ~ |
| 2 | .bd I 2 |
| 3 | .de TS |
| 4 | .br |
| 5 | .nf |
| 6 | .SP 1v |
| 7 | .ul 0 |
| 8 | .. |
| 9 | .de TE |
| 10 | .SP 1v |
| 11 | .fi |
| 12 | .. |
| 13 | .de PT |
| 14 | .if \\n%>1 'tl ''\s7LEX\s0\s9\(mi%\s0'' |
| 15 | .if \\n%>1 'sp |
| 16 | .. |
| 17 | .ND July 21, 1975 |
| 18 | .RP |
| 19 | .TM 75-1274-15 39199 39199-11 |
| 20 | .TL |
| 21 | Lex \- A Lexical Analyzer ~Generator~ |
| 22 | .AU ``MH 2C-569'' 6377 |
| 23 | M. E. Lesk and E. Schmidt |
| 24 | .AI |
| 25 | .MH |
| 26 | .AB |
| 27 | .sp |
| 28 | .bd I 2 |
| 29 | .nr PS 8 |
| 30 | .nr VS 9 |
| 31 | .ps 8 |
| 32 | .vs 9p |
| 33 | Lex helps write programs whose control flow |
| 34 | is directed by instances of regular |
| 35 | expressions in the input stream. |
| 36 | It is well suited for editor-script type transformations and |
| 37 | for segmenting input in preparation for |
| 38 | a parsing routine. |
| 39 | .PP |
| 40 | Lex source is a table of regular expressions and corresponding program fragments. |
| 41 | The table is translated to a program |
| 42 | which reads an input stream, copying it to an output stream |
| 43 | and partitioning the input |
| 44 | into strings which match the given expressions. |
| 45 | As each such string is recognized the corresponding |
| 46 | program fragment is executed. |
| 47 | The recognition of the expressions |
| 48 | is performed by a deterministic finite automaton |
| 49 | generated by Lex. |
| 50 | The program fragments written by the user are executed in the order in which the |
| 51 | corresponding regular expressions occur in the input stream. |
| 52 | .if n .if \n(tm .ig |
| 53 | .PP |
| 54 | The lexical analysis |
| 55 | programs written with Lex accept ambiguous specifications |
| 56 | and choose the longest |
| 57 | match possible at each input point. |
| 58 | If necessary, substantial look~ahead |
| 59 | is performed on the input, but the |
| 60 | input stream will be backed up to the |
| 61 | end of the current partition, so that the user |
| 62 | has general freedom to manipulate it. |
| 63 | .PP |
| 64 | Lex can generate analyzers in either C or Ratfor, a language |
| 65 | which can be translated automatically to portable Fortran. |
| 66 | It is available on the PDP-11 UNIX, Honeywell GCOS, |
| 67 | and IBM OS systems. |
| 68 | This manual, however, will only discuss generating analyzers |
| 69 | in C on the UNIX system, which is the only supported |
| 70 | form of Lex under UNIX Version 7. |
| 71 | Lex is designed to simplify |
| 72 | interfacing with Yacc, for those |
| 73 | with access to this compiler-compiler system. |
| 74 | .. |
| 75 | .nr PS 9 |
| 76 | .nr VS 11 |
| 77 | .AE |
| 78 | .SH |
| 79 | .ce 1 |
| 80 | Table of Contents |
| 81 | .LP |
| 82 | .ce 100 |
| 83 | .TS |
| 84 | r 1l 2r . |
| 85 | 1. Introduction. 1 |
| 86 | 2. Lex Source. 3 |
| 87 | 3. Lex Regular Expressions. 3 |
| 88 | 4. Lex Actions. 5 |
| 89 | 5. Ambiguous Source Rules. 7 |
| 90 | 6. Lex Source Definitions. 8 |
| 91 | 7. Usage. 8 |
| 92 | 8. Lex and Yacc. 9 |
| 93 | 9. Examples. 10 |
| 94 | 10. Left Context Sensitivity. 11 |
| 95 | 11. Character Set. 12 |
| 96 | 12. Summary of Source Format. 12 |
| 97 | 13. Caveats and Bugs. 13 |
| 98 | 14. Acknowledgments. 13 |
| 99 | 15. References. 13 |
| 100 | .TE |
| 101 | .ce 0 |
| 102 | .2C |
| 103 | .NH |
| 104 | Introduction. |
| 105 | .PP |
| 106 | Lex is a program generator designed for |
| 107 | lexical processing of character input streams. |
| 108 | It accepts a high-level, problem oriented specification |
| 109 | for character string matching, |
| 110 | and |
| 111 | produces a program in a general purpose language which recognizes |
| 112 | regular expressions. |
| 113 | The regular expressions are specified by the user in the |
| 114 | source specifications given to Lex. |
| 115 | The Lex written code recognizes these expressions |
| 116 | in an input stream and partitions the input stream into |
| 117 | strings matching the expressions. At the bound~aries |
| 118 | between strings |
| 119 | program sections |
| 120 | provided by the user are executed. |
| 121 | The Lex source file associates the regular expressions and the |
| 122 | program fragments. |
| 123 | As each expression appears in the input to the program written by Lex, |
| 124 | the corresponding fragment is executed. |
| 125 | .PP |
| 126 | .de MH |
| 127 | Bell Laboratories, Murray Hill, NJ 07974. |
| 128 | .. |
| 129 | The user supplies the additional code |
| 130 | beyond expression matching |
| 131 | needed to complete his tasks, possibly |
| 132 | including code written by other generators. |
| 133 | The program that recognizes the expressions is generated in the |
| 134 | general purpose programming language employed for the |
| 135 | user's program fragments. |
| 136 | Thus, a high level expression |
| 137 | language is provided to write the string expressions to be |
| 138 | matched while the user's freedom to write actions |
| 139 | is unimpaired. |
| 140 | This avoids forcing the user who wishes to use a string manipulation |
| 141 | language for input analysis to write processing programs in the same |
| 142 | and often inappropriate string handling language. |
| 143 | .PP |
| 144 | Lex is not a complete language, but rather a generator representing |
| 145 | a new language feature which can be added to |
| 146 | different programming languages, called ``host languages.'' |
| 147 | Just as general purpose languages |
| 148 | can produce code to run on different computer hardware, |
| 149 | Lex can write code in different host languages. |
| 150 | The host language is used for the output code generated by Lex |
| 151 | and also for the program fragments added by the user. |
| 152 | Compatible run-time libraries for the different host languages |
| 153 | are also provided. |
| 154 | This makes Lex adaptable to different environments and |
| 155 | different users. |
| 156 | Each application |
| 157 | may be directed to the combination of hardware and host language appropriate |
| 158 | to the task, the user's background, and the properties of local |
| 159 | implementations. |
| 160 | At present, the only supported host language is C, |
| 161 | although Fortran (in the form of Ratfor [2] has been available |
| 162 | in the past. |
| 163 | Lex itself exists on UNIX, GCOS, and OS/370; but the |
| 164 | code generated by Lex may be taken anywhere the appropriate |
| 165 | compilers exist. |
| 166 | .PP |
| 167 | Lex turns the user's expressions and actions |
| 168 | (called |
| 169 | .ul |
| 170 | source |
| 171 | in this memo) into the host general-purpose language; |
| 172 | the generated program is named |
| 173 | .ul |
| 174 | yylex. |
| 175 | The |
| 176 | .ul |
| 177 | yylex |
| 178 | program |
| 179 | will recognize expressions |
| 180 | in a stream |
| 181 | (called |
| 182 | .ul |
| 183 | input |
| 184 | in this memo) |
| 185 | and perform the specified actions for each expression as it is detected. |
| 186 | See Figure 1. |
| 187 | .GS |
| 188 | .TS |
| 189 | center; |
| 190 | l _ r |
| 191 | l|c|r |
| 192 | l _ r |
| 193 | l _ r |
| 194 | l|c|r |
| 195 | l _ r |
| 196 | c s s |
| 197 | c s s. |
| 198 | |
| 199 | Source \(-> Lex \(-> yylex |
| 200 | |
| 201 | .sp 2 |
| 202 | |
| 203 | Input \(-> yylex \(-> Output |
| 204 | |
| 205 | .sp |
| 206 | An overview of Lex |
| 207 | .sp |
| 208 | Figure 1 |
| 209 | .TE |
| 210 | .GE |
| 211 | .PP |
| 212 | For a trivial example, consider a program to delete |
| 213 | from the input |
| 214 | all blanks or tabs at the ends of lines. |
| 215 | .TS |
| 216 | center; |
| 217 | l l. |
| 218 | %% |
| 219 | [ \et]+$ ; |
| 220 | .TE |
| 221 | is all that is required. |
| 222 | The program |
| 223 | contains a %% delimiter to mark the beginning of the rules, and |
| 224 | one rule. |
| 225 | This rule contains a regular expression |
| 226 | which matches one or more |
| 227 | instances of the characters blank or tab |
| 228 | (written \et for visibility, in accordance with the C language convention) |
| 229 | just prior to the end of a line. |
| 230 | The brackets indicate the character |
| 231 | class made of blank and tab; the + indicates ``one or more ...''; |
| 232 | and the $ indicates ``end of line,'' as in QED. |
| 233 | No action is specified, |
| 234 | so the program generated by Lex (yylex) will ignore these characters. |
| 235 | Everything else will be copied. |
| 236 | To change any remaining |
| 237 | string of blanks or tabs to a single blank, |
| 238 | add another rule: |
| 239 | .TS |
| 240 | center; |
| 241 | l l. |
| 242 | %% |
| 243 | [ \et]+$ ; |
| 244 | [ \et]+ printf(" "); |
| 245 | .TE |
| 246 | The finite automaton generated for this |
| 247 | source will scan for both rules at once, |
| 248 | observing at |
| 249 | the termination of the string of blanks or tabs |
| 250 | whether or not there is a newline character, and executing |
| 251 | the desired rule action. |
| 252 | The first rule matches all strings of blanks or tabs |
| 253 | at the end of lines, and the second |
| 254 | rule all remaining strings of blanks or tabs. |
| 255 | .PP |
| 256 | Lex can be used alone for simple transformations, or |
| 257 | for analysis and statistics gathering on a lexical level. |
| 258 | Lex can also be used with a parser generator |
| 259 | to perform the lexical analysis phase; it is particularly |
| 260 | easy to interface Lex and Yacc [3]. |
| 261 | Lex programs recognize only regular expressions; |
| 262 | Yacc writes parsers that accept a large class of context free grammars, |
| 263 | but require a lower level analyzer to recognize input tokens. |
| 264 | Thus, a combination of Lex and Yacc is often appropriate. |
| 265 | When used as a preprocessor for a later parser generator, |
| 266 | Lex is used to partition the input stream, |
| 267 | and the parser generator assigns structure to |
| 268 | the resulting pieces. |
| 269 | The flow of control |
| 270 | in such a case (which might be the first half of a compiler, |
| 271 | for example) is shown in Figure 2. |
| 272 | Additional programs, |
| 273 | written by other generators |
| 274 | or by hand, can |
| 275 | be added easily to programs written by Lex. |
| 276 | .BS 2 |
| 277 | .TS |
| 278 | center; |
| 279 | l c c c l |
| 280 | l c c c l |
| 281 | l c c c l |
| 282 | l _ c _ l |
| 283 | l|c|c|c|l |
| 284 | l _ c _ l |
| 285 | l c c c l |
| 286 | l _ c _ l |
| 287 | l|c|c|c|l |
| 288 | l _ c _ l |
| 289 | l c s s l |
| 290 | l c s s l. |
| 291 | lexical grammar |
| 292 | rules rules |
| 293 | \(da \(da |
| 294 | |
| 295 | Lex Yacc |
| 296 | |
| 297 | \(da \(da |
| 298 | |
| 299 | Input \(-> yylex \(-> yyparse \(-> Parsed input |
| 300 | |
| 301 | .sp |
| 302 | Lex with Yacc |
| 303 | .sp |
| 304 | Figure 2 |
| 305 | .TE |
| 306 | .BE |
| 307 | Yacc users |
| 308 | will realize that the name |
| 309 | .ul |
| 310 | yylex |
| 311 | is what Yacc expects its lexical analyzer to be named, |
| 312 | so that the use of this name by Lex simplifies |
| 313 | interfacing. |
| 314 | .PP |
| 315 | Lex generates a deterministic finite automaton from the regular expressions |
| 316 | in the source [4]. |
| 317 | The automaton is interpreted, rather than compiled, in order |
| 318 | to save space. |
| 319 | The result is still a fast analyzer. |
| 320 | In particular, the time taken by a Lex program |
| 321 | to recognize and partition an input stream is |
| 322 | proportional to the length of the input. |
| 323 | The number of Lex rules or |
| 324 | the complexity of the rules is |
| 325 | not important in determining speed, |
| 326 | unless rules which include |
| 327 | forward context require a significant amount of re~scanning. |
| 328 | What does increase with the number and complexity of rules |
| 329 | is the size of the finite |
| 330 | automaton, and therefore the size of the program |
| 331 | generated by Lex. |
| 332 | .PP |
| 333 | In the program written by Lex, the user's fragments |
| 334 | (representing the |
| 335 | .ul |
| 336 | actions |
| 337 | to be performed as each regular expression |
| 338 | is found) |
| 339 | are gathered |
| 340 | as cases of a switch. |
| 341 | The automaton interpreter directs the control flow. |
| 342 | Opportunity is provided for the user to insert either |
| 343 | declarations or additional statements in the routine containing |
| 344 | the actions, or to |
| 345 | add subroutines outside this action routine. |
| 346 | .PP |
| 347 | Lex is not limited to source which can |
| 348 | be interpreted on the basis of one character |
| 349 | look~ahead. |
| 350 | For example, |
| 351 | if there are two rules, one looking for |
| 352 | .I ab |
| 353 | and another for |
| 354 | .I abcdefg , |
| 355 | and the input stream is |
| 356 | .I abcdefh , |
| 357 | Lex will recognize |
| 358 | .I ab |
| 359 | and leave |
| 360 | the input pointer just before |
| 361 | .I "cd. . ." |
| 362 | Such backup is more costly |
| 363 | than the processing of simpler languages. |
| 364 | .2C |
| 365 | .NH |
| 366 | Lex Source. |
| 367 | .PP |
| 368 | The general format of Lex source is: |
| 369 | .TS |
| 370 | center; |
| 371 | l. |
| 372 | {definitions} |
| 373 | %% |
| 374 | {rules} |
| 375 | %% |
| 376 | {user subroutines} |
| 377 | .TE |
| 378 | where the definitions and the user subroutines |
| 379 | are often omitted. |
| 380 | The second |
| 381 | .I %% |
| 382 | is optional, but the first is required |
| 383 | to mark the beginning of the rules. |
| 384 | The absolute minimum Lex program is thus |
| 385 | .TS |
| 386 | center; |
| 387 | l. |
| 388 | %% |
| 389 | .TE |
| 390 | (no definitions, no rules) which translates into a program |
| 391 | which copies the input to the output unchanged. |
| 392 | .PP |
| 393 | In the outline of Lex programs shown above, the |
| 394 | .I |
| 395 | rules |
| 396 | .R |
| 397 | represent the user's control |
| 398 | decisions; they are a table, in which the left column |
| 399 | contains |
| 400 | .I |
| 401 | regular expressions |
| 402 | .R |
| 403 | (see section 3) |
| 404 | and the right column contains |
| 405 | .I |
| 406 | actions, |
| 407 | .R |
| 408 | program fragments to be executed when the expressions |
| 409 | are recognized. |
| 410 | Thus an individual rule might appear |
| 411 | .TS |
| 412 | center; |
| 413 | l l. |
| 414 | integer printf("found keyword INT"); |
| 415 | .TE |
| 416 | to look for the string |
| 417 | .I integer |
| 418 | in the input stream and |
| 419 | print the message ``found keyword INT'' whenever it appears. |
| 420 | In this example the host procedural language is C and |
| 421 | the C library function |
| 422 | .I |
| 423 | printf |
| 424 | .R |
| 425 | is used to print the string. |
| 426 | The end |
| 427 | of the expression is indicated by the first blank or tab character. |
| 428 | If the action is merely a single C expression, |
| 429 | it can just be given on the right side of the line; if it is |
| 430 | compound, or takes more than a line, it should be enclosed in |
| 431 | braces. |
| 432 | As a slightly more useful example, suppose it is desired to |
| 433 | change a number of words from British to American spelling. |
| 434 | Lex rules such as |
| 435 | .TS |
| 436 | center; |
| 437 | l l. |
| 438 | colour printf("color"); |
| 439 | mechanise printf("mechanize"); |
| 440 | petrol printf("gas"); |
| 441 | .TE |
| 442 | would be a start. These rules are not quite enough, |
| 443 | since |
| 444 | the word |
| 445 | .I petroleum |
| 446 | would become |
| 447 | .I gaseum ; |
| 448 | a way of dealing |
| 449 | with this will be described later. |
| 450 | .2C |
| 451 | .NH |
| 452 | Lex Regular Expressions. |
| 453 | .PP |
| 454 | The definitions of regular expressions are very similar to those |
| 455 | in QED [5]. |
| 456 | A regular |
| 457 | expression specifies a set of strings to be matched. |
| 458 | It contains text characters (which match the corresponding |
| 459 | characters in the strings being compared) |
| 460 | and operator characters (which specify |
| 461 | repetitions, choices, and other features). |
| 462 | The letters of the alphabet and the digits are |
| 463 | always text characters; thus the regular expression |
| 464 | .TS |
| 465 | center; |
| 466 | l l. |
| 467 | integer |
| 468 | .TE |
| 469 | matches the string |
| 470 | .ul |
| 471 | integer |
| 472 | wherever it appears |
| 473 | and the expression |
| 474 | .TS |
| 475 | center; |
| 476 | l. |
| 477 | a57D |
| 478 | .TE |
| 479 | looks for the string |
| 480 | .ul |
| 481 | a57D. |
| 482 | .PP |
| 483 | .I |
| 484 | Operators. |
| 485 | .R |
| 486 | The operator characters are |
| 487 | .TS |
| 488 | center; |
| 489 | l. |
| 490 | " \e [ ] ^ \- ? . \(** + | ( ) $ / { } % < > |
| 491 | .TE |
| 492 | and if they are to be used as text characters, an escape |
| 493 | should be used. |
| 494 | The quotation mark operator (") |
| 495 | indicates that whatever is contained between a pair of quotes |
| 496 | is to be taken as text characters. |
| 497 | Thus |
| 498 | .TS |
| 499 | center; |
| 500 | l. |
| 501 | xyz"++" |
| 502 | .TE |
| 503 | matches the string |
| 504 | .I xyz++ |
| 505 | when it appears. Note that a part of a string may be quoted. |
| 506 | It is harmless but unnecessary to quote an ordinary |
| 507 | text character; the expression |
| 508 | .TS |
| 509 | center; |
| 510 | l. |
| 511 | "xyz++" |
| 512 | .TE |
| 513 | is the same as the one above. |
| 514 | Thus by quoting every non-alphanumeric character |
| 515 | being used as a text character, the user can avoid remembering |
| 516 | the list above of current |
| 517 | operator characters, and is safe should further extensions to Lex |
| 518 | lengthen the list. |
| 519 | .PP |
| 520 | An operator character may also be turned into a text character |
| 521 | by preceding it with \e as in |
| 522 | .TS |
| 523 | center; |
| 524 | l. |
| 525 | xyz\e+\e+ |
| 526 | .TE |
| 527 | which |
| 528 | is another, less readable, equivalent of the above expressions. |
| 529 | Another use of the quoting mechanism is to get a blank into |
| 530 | an expression; normally, as explained above, blanks or tabs end |
| 531 | a rule. |
| 532 | Any blank character not contained within [\|] (see below) must |
| 533 | be quoted. |
| 534 | Several normal C escapes with \e |
| 535 | are recognized: \en is newline, \et is tab, and \eb is backspace. |
| 536 | To enter \e itself, use \e\e. |
| 537 | Since newline is illegal in an expression, \en must be used; |
| 538 | it is not |
| 539 | required to escape tab and backspace. |
| 540 | Every character but blank, tab, newline and the list above is always |
| 541 | a text character. |
| 542 | .PP |
| 543 | .I |
| 544 | Character classes. |
| 545 | .R |
| 546 | Classes of characters can be specified using the operator pair [\|]. |
| 547 | The construction |
| 548 | .I [abc] |
| 549 | matches a |
| 550 | single character, which may be |
| 551 | .I a , |
| 552 | .I b , |
| 553 | or |
| 554 | .I c . |
| 555 | Within square brackets, |
| 556 | most operator meanings are ignored. |
| 557 | Only three characters are special: |
| 558 | these are \e \(mi and ^. The \(mi character |
| 559 | indicates ranges. For example, |
| 560 | .TS |
| 561 | center; |
| 562 | l. |
| 563 | [a\(miz0\(mi9<>_] |
| 564 | .TE |
| 565 | indicates the character class containing all the lower case letters, |
| 566 | the digits, |
| 567 | the angle brackets, and underline. |
| 568 | Ranges may be given in either order. |
| 569 | Using \(mi between any pair of characters which are |
| 570 | not both upper case letters, both lower case letters, or both digits |
| 571 | is implementation dependent and will get a warning message. |
| 572 | (E.g., [0\-z] in ASCII is many more characters |
| 573 | than it is in EBCDIC). |
| 574 | If it is desired to include the |
| 575 | character \(mi in a character class, it should be first or |
| 576 | last; thus |
| 577 | .TS |
| 578 | center; |
| 579 | l. |
| 580 | [\(mi+0\(mi9] |
| 581 | .TE |
| 582 | matches all the digits and the two signs. |
| 583 | .PP |
| 584 | In character classes, |
| 585 | the ^ operator must appear as the first character |
| 586 | after the left bracket; it indicates that the resulting string |
| 587 | is to be complemented with respect to the computer character set. |
| 588 | Thus |
| 589 | .TS |
| 590 | center; |
| 591 | l. |
| 592 | [^abc] |
| 593 | .TE |
| 594 | matches all characters except a, b, or c, including |
| 595 | all special or control characters; or |
| 596 | .TS |
| 597 | center; |
| 598 | l. |
| 599 | [^a\-zA\-Z] |
| 600 | .TE |
| 601 | is any character which is not a letter. |
| 602 | The \e character provides the usual escapes within |
| 603 | character class brackets. |
| 604 | .PP |
| 605 | .I |
| 606 | Arbitrary character. |
| 607 | .R |
| 608 | To match almost any character, the operator character |
| 609 | .TS |
| 610 | center; |
| 611 | l. |
| 612 | \&. |
| 613 | .TE |
| 614 | is the class of all characters except newline. |
| 615 | Escaping into octal is possible although non-portable: |
| 616 | .TS |
| 617 | center; |
| 618 | l. |
| 619 | [\e40\-\e176] |
| 620 | .TE |
| 621 | matches all printable characters in the ASCII character set, from octal |
| 622 | 40 (blank) to octal 176 (tilde). |
| 623 | .PP |
| 624 | .I |
| 625 | Optional expressions. |
| 626 | .R |
| 627 | The operator |
| 628 | .I ? |
| 629 | indicates |
| 630 | an optional element of an expression. |
| 631 | Thus |
| 632 | .TS |
| 633 | center; |
| 634 | l. |
| 635 | ab?c |
| 636 | .TE |
| 637 | matches either |
| 638 | .I ac |
| 639 | or |
| 640 | .I abc . |
| 641 | .PP |
| 642 | .I |
| 643 | Repeated expressions. |
| 644 | .R |
| 645 | Repetitions of classes are indicated by the operators |
| 646 | .I \(** |
| 647 | and |
| 648 | .I + . |
| 649 | .TS |
| 650 | center; |
| 651 | l. |
| 652 | \f2a\(**\f1 |
| 653 | .TE |
| 654 | is any number of consecutive |
| 655 | .I a |
| 656 | characters, including zero; while |
| 657 | .TS |
| 658 | center; |
| 659 | l. |
| 660 | a+ |
| 661 | .TE |
| 662 | is one or more instances of |
| 663 | .I a. |
| 664 | For example, |
| 665 | .TS |
| 666 | center; |
| 667 | l. |
| 668 | [a\-z]+ |
| 669 | .TE |
| 670 | is all strings of lower case letters. |
| 671 | And |
| 672 | .TS |
| 673 | center; |
| 674 | l. |
| 675 | [A\(miZa\(miz][A\(miZa\(miz0\(mi9]\(** |
| 676 | .TE |
| 677 | indicates all alphanumeric strings with a leading |
| 678 | alphabetic character. |
| 679 | This is a typical expression for recognizing identifiers in |
| 680 | computer languages. |
| 681 | .PP |
| 682 | .I |
| 683 | Alternation and Grouping. |
| 684 | .R |
| 685 | The operator | |
| 686 | indicates alternation: |
| 687 | .TS |
| 688 | center; |
| 689 | l. |
| 690 | (ab\||\|cd) |
| 691 | .TE |
| 692 | matches either |
| 693 | .ul |
| 694 | ab |
| 695 | or |
| 696 | .ul |
| 697 | cd. |
| 698 | Note that parentheses are used for grouping, although |
| 699 | they are |
| 700 | not necessary on the outside level; |
| 701 | .TS |
| 702 | center; |
| 703 | l. |
| 704 | ab\||\|cd |
| 705 | .TE |
| 706 | would have sufficed. |
| 707 | Parentheses |
| 708 | can be used for more complex expressions: |
| 709 | .TS |
| 710 | center; |
| 711 | l. |
| 712 | (ab\||\|cd+)?(ef)\(** |
| 713 | .TE |
| 714 | matches such strings as |
| 715 | .I abefef , |
| 716 | .I efefef , |
| 717 | .I cdef , |
| 718 | or |
| 719 | .I cddd\| ; |
| 720 | but not |
| 721 | .I abc , |
| 722 | .I abcd , |
| 723 | or |
| 724 | .I abcdef . |
| 725 | .PP |
| 726 | .I |
| 727 | Context sensitivity. |
| 728 | .R |
| 729 | Lex will recognize a small amount of surrounding |
| 730 | context. The two simplest operators for this are |
| 731 | .I ^ |
| 732 | and |
| 733 | .I $ . |
| 734 | If the first character of an expression is |
| 735 | .I ^ , |
| 736 | the expression will only be matched at the beginning |
| 737 | of a line (after a newline character, or at the beginning of |
| 738 | the input stream). |
| 739 | This can never conflict with the other meaning of |
| 740 | .I ^ , |
| 741 | complementation |
| 742 | of character classes, since that only applies within |
| 743 | the [\|] operators. |
| 744 | If the very last character is |
| 745 | .I $ , |
| 746 | the expression will only be matched at the end of a line (when |
| 747 | immediately followed by newline). |
| 748 | The latter operator is a special case of the |
| 749 | .I / |
| 750 | operator character, |
| 751 | which indicates trailing context. |
| 752 | The expression |
| 753 | .TS |
| 754 | center; |
| 755 | l. |
| 756 | ab/cd |
| 757 | .TE |
| 758 | matches the string |
| 759 | .I ab , |
| 760 | but only if followed by |
| 761 | .ul |
| 762 | cd. |
| 763 | Thus |
| 764 | .TS |
| 765 | center; |
| 766 | l. |
| 767 | ab$ |
| 768 | .TE |
| 769 | is the same as |
| 770 | .TS |
| 771 | center; |
| 772 | l. |
| 773 | ab/\en |
| 774 | .TE |
| 775 | Left context is handled in Lex by |
| 776 | .I |
| 777 | start conditions |
| 778 | .R |
| 779 | as explained in section 10. If a rule is only to be executed |
| 780 | when the Lex automaton interpreter is in start condition |
| 781 | .I |
| 782 | x, |
| 783 | .R |
| 784 | the rule should be prefixed by |
| 785 | .TS |
| 786 | center; |
| 787 | l. |
| 788 | <x> |
| 789 | .TE |
| 790 | using the angle bracket operator characters. |
| 791 | If we considered ``being at the beginning of a line'' to be |
| 792 | start condition |
| 793 | .I ONE , |
| 794 | then the ^ operator |
| 795 | would be equivalent to |
| 796 | .TS |
| 797 | center; |
| 798 | l. |
| 799 | <ONE> |
| 800 | .TE |
| 801 | Start conditions are explained more fully later. |
| 802 | .PP |
| 803 | .I |
| 804 | Repetitions and Definitions. |
| 805 | .R |
| 806 | The operators {} specify |
| 807 | either repetitions (if they enclose numbers) |
| 808 | or |
| 809 | definition expansion (if they enclose a name). For example |
| 810 | .TS |
| 811 | center; |
| 812 | l. |
| 813 | {digit} |
| 814 | .TE |
| 815 | looks for a predefined string named |
| 816 | .I digit |
| 817 | and inserts it |
| 818 | at that point in the expression. |
| 819 | The definitions are given in the first part of the Lex |
| 820 | input, before the rules. |
| 821 | In contrast, |
| 822 | .TS |
| 823 | center; |
| 824 | l. |
| 825 | a{1,5} |
| 826 | .TE |
| 827 | looks for 1 to 5 occurrences of |
| 828 | .I a . |
| 829 | .PP |
| 830 | Finally, initial |
| 831 | .I % |
| 832 | is special, being the separator |
| 833 | for Lex source segments. |
| 834 | .2C |
| 835 | .NH |
| 836 | Lex Actions. |
| 837 | .PP |
| 838 | When an expression written as above is matched, Lex |
| 839 | executes the corresponding action. This section describes |
| 840 | some features of Lex which aid in writing actions. Note |
| 841 | that there is a default action, which |
| 842 | consists of copying the input to the output. This |
| 843 | is performed on all strings not otherwise matched. Thus |
| 844 | the Lex user who wishes to absorb the entire input, without |
| 845 | producing any output, must provide rules to match everything. |
| 846 | When Lex is being used with Yacc, this is the normal |
| 847 | situation. |
| 848 | One may consider that actions are what is done instead of |
| 849 | copying the input to the output; thus, in general, |
| 850 | a rule which merely copies can be omitted. |
| 851 | Also, a character combination |
| 852 | which is omitted from the rules |
| 853 | and which appears as input |
| 854 | is likely to be printed on the output, thus calling |
| 855 | attention to the gap in the rules. |
| 856 | .PP |
| 857 | One of the simplest things that can be done is to ignore |
| 858 | the input. Specifying a C null statement, \fI;\fR as an action |
| 859 | causes this result. A frequent rule is |
| 860 | .TS |
| 861 | center; |
| 862 | l l. |
| 863 | [ \et\en] ; |
| 864 | .TE |
| 865 | which causes the three spacing characters (blank, tab, and newline) |
| 866 | to be ignored. |
| 867 | .PP |
| 868 | Another easy way to avoid writing actions is the action character |
| 869 | |, which indicates that the action for this rule is the action |
| 870 | for the next rule. |
| 871 | The previous example could also have been written |
| 872 | .TS |
| 873 | center; |
| 874 | l l. |
| 875 | " " | |
| 876 | "\et" | |
| 877 | "\en" ; |
| 878 | .TE |
| 879 | with the same result, although in different style. |
| 880 | The quotes around \en and \et are not required. |
| 881 | .PP |
| 882 | In more complex actions, the user |
| 883 | will |
| 884 | often want to know the actual text that matched some expression |
| 885 | like |
| 886 | .I [a\(miz]+ . |
| 887 | Lex leaves this text in an external character |
| 888 | array named |
| 889 | .I |
| 890 | yytext. |
| 891 | .R |
| 892 | Thus, to print the name found, |
| 893 | a rule like |
| 894 | .TS |
| 895 | center; |
| 896 | l l. |
| 897 | [a\-z]+ printf("%s", yytext); |
| 898 | .TE |
| 899 | will print |
| 900 | the string in |
| 901 | .I |
| 902 | yytext. |
| 903 | .R |
| 904 | The C function |
| 905 | .I |
| 906 | printf |
| 907 | .R |
| 908 | accepts a format argument and data to be printed; |
| 909 | in this case, the format is ``print string'' (% indicating |
| 910 | data conversion, and |
| 911 | .I s |
| 912 | indicating string type), |
| 913 | and the data are the characters |
| 914 | in |
| 915 | .I |
| 916 | yytext. |
| 917 | .R |
| 918 | So this just places |
| 919 | the matched string |
| 920 | on the output. |
| 921 | This action |
| 922 | is so common that |
| 923 | it may be written as ECHO: |
| 924 | .TS |
| 925 | center; |
| 926 | l l. |
| 927 | [a\-z]+ ECHO; |
| 928 | .TE |
| 929 | is the same as the above. |
| 930 | Since the default action is just to |
| 931 | print the characters found, one might ask why |
| 932 | give a rule, like this one, which merely specifies |
| 933 | the default action? |
| 934 | Such rules are often required |
| 935 | to avoid matching some other rule |
| 936 | which is not desired. For example, if there is a rule |
| 937 | which matches |
| 938 | .I read |
| 939 | it will normally match the instances of |
| 940 | .I read |
| 941 | contained in |
| 942 | .I bread |
| 943 | or |
| 944 | .I readjust ; |
| 945 | to avoid |
| 946 | this, |
| 947 | a rule |
| 948 | of the form |
| 949 | .I [a\(miz]+ |
| 950 | is needed. |
| 951 | This is explained further below. |
| 952 | .PP |
| 953 | Sometimes it is more convenient to know the end of what |
| 954 | has been found; hence Lex also provides a count |
| 955 | .I |
| 956 | yyleng |
| 957 | .R |
| 958 | of the number of characters matched. |
| 959 | To count both the number |
| 960 | of words and the number of characters in words in the input, the user might write |
| 961 | .TS |
| 962 | center; |
| 963 | l l. |
| 964 | [a\-zA\-Z]+ {words++; chars += yyleng;} |
| 965 | .TE |
| 966 | which accumulates in |
| 967 | .ul |
| 968 | chars |
| 969 | the number |
| 970 | of characters in the words recognized. |
| 971 | The last character in the string matched can |
| 972 | be accessed by |
| 973 | .TS |
| 974 | center; |
| 975 | l. |
| 976 | yytext[yyleng\-1] |
| 977 | .TE |
| 978 | .PP |
| 979 | Occasionally, a Lex |
| 980 | action may decide that a rule has not recognized the correct |
| 981 | span of characters. |
| 982 | Two routines are provided to aid with this situation. |
| 983 | First, |
| 984 | .I |
| 985 | yymore() |
| 986 | .R |
| 987 | can be called to indicate that the next input expression recognized is to be |
| 988 | tacked on to the end of this input. Normally, |
| 989 | the next input string would overwrite the current |
| 990 | entry in |
| 991 | .I |
| 992 | yytext. |
| 993 | .R |
| 994 | Second, |
| 995 | .I |
| 996 | yyless (n) |
| 997 | .R |
| 998 | may be called to indicate that not all the characters matched |
| 999 | by the currently successful expression are wanted right now. |
| 1000 | The argument |
| 1001 | .I |
| 1002 | n |
| 1003 | .R |
| 1004 | indicates the number of characters |
| 1005 | in |
| 1006 | .I |
| 1007 | yytext |
| 1008 | .R |
| 1009 | to be retained. |
| 1010 | Further characters previously matched |
| 1011 | are |
| 1012 | returned to the input. This provides the same sort of |
| 1013 | look~ahead offered by the / operator, |
| 1014 | but in a different form. |
| 1015 | .PP |
| 1016 | .I |
| 1017 | Example: |
| 1018 | .R |
| 1019 | Consider a language which defines |
| 1020 | a string as a set of characters between quotation (") marks, and provides that |
| 1021 | to include a " in a string it must be preceded by a \e. The |
| 1022 | regular expression which matches that is somewhat confusing, |
| 1023 | so that it might be preferable to write |
| 1024 | .TS |
| 1025 | center; |
| 1026 | l l. |
| 1027 | \e"[^"]\(** { |
| 1028 | if (yytext[yyleng\-1] == \(fm\e\e\(fm) |
| 1029 | yymore(); |
| 1030 | else |
| 1031 | ... normal user processing |
| 1032 | } |
| 1033 | .TE |
| 1034 | which will, when faced with a string such as |
| 1035 | .I |
| 1036 | "abc\e"def\|" |
| 1037 | .R |
| 1038 | first match |
| 1039 | the five characters |
| 1040 | \fI"abc\e\|\fR; |
| 1041 | then |
| 1042 | the call to |
| 1043 | .I yymore() |
| 1044 | will |
| 1045 | cause the next part of the string, |
| 1046 | \fI"def\|\fR, |
| 1047 | to be tacked on the end. |
| 1048 | Note that the final quote terminating the string should be picked |
| 1049 | up in the code labeled ``normal processing''. |
| 1050 | .PP |
| 1051 | The function |
| 1052 | .I |
| 1053 | yyless() |
| 1054 | .R |
| 1055 | might be used to reprocess |
| 1056 | text in various circumstances. Consider the C problem of distinguishing |
| 1057 | the ambiguity of ``=\(mia''. |
| 1058 | Suppose it is desired to treat this as ``=\(mi a'' |
| 1059 | but print a message. A rule might be |
| 1060 | .TS |
| 1061 | center; |
| 1062 | l l. |
| 1063 | =\(mi[a\-zA\-Z] { |
| 1064 | printf("Operator (=\(mi) ambiguous\en"); |
| 1065 | yyless(yyleng\-1); |
| 1066 | ... action for =\(mi ... |
| 1067 | } |
| 1068 | .TE |
| 1069 | which prints a message, returns the letter after the |
| 1070 | operator to the input stream, and treats the operator as ``=\(mi''. |
| 1071 | Alternatively it might be desired to treat this as ``= \(mia''. |
| 1072 | To do this, just return the minus |
| 1073 | sign as well as the letter to the input: |
| 1074 | .TS |
| 1075 | center; |
| 1076 | l l. |
| 1077 | =\(mi[a\-zA\-Z] { |
| 1078 | printf("Operator (=\(mi) ambiguous\en"); |
| 1079 | yyless(yyleng\-2); |
| 1080 | ... action for = ... |
| 1081 | } |
| 1082 | .TE |
| 1083 | will perform the other interpretation. |
| 1084 | Note that the expressions for the two cases might more easily |
| 1085 | be written |
| 1086 | .TS |
| 1087 | center; |
| 1088 | l l. |
| 1089 | =\(mi/[A\-Za\-z] |
| 1090 | .TE |
| 1091 | in the first case and |
| 1092 | .TS |
| 1093 | center; |
| 1094 | l. |
| 1095 | =/\-[A\-Za\-z] |
| 1096 | .TE |
| 1097 | in the second; |
| 1098 | no backup would be required in the rule action. |
| 1099 | It is not necessary to recognize the whole identifier |
| 1100 | to observe the ambiguity. |
| 1101 | The |
| 1102 | possibility of ``=\(mi3'', however, makes |
| 1103 | .TS |
| 1104 | center; |
| 1105 | l. |
| 1106 | =\(mi/[^ \et\en] |
| 1107 | .TE |
| 1108 | a still better rule. |
| 1109 | .PP |
| 1110 | In addition to these routines, Lex also permits |
| 1111 | access to the I/O routines |
| 1112 | it uses. |
| 1113 | They are: |
| 1114 | .IP 1) |
| 1115 | .I |
| 1116 | input() |
| 1117 | .R |
| 1118 | which returns the next input character; |
| 1119 | .IP 2) |
| 1120 | .I |
| 1121 | output(c) |
| 1122 | .R |
| 1123 | which writes the character |
| 1124 | .I |
| 1125 | c |
| 1126 | .R |
| 1127 | on the output; and |
| 1128 | .IP 3) |
| 1129 | .I |
| 1130 | unput(c) |
| 1131 | .R |
| 1132 | pushes the character |
| 1133 | .I |
| 1134 | c |
| 1135 | .R |
| 1136 | back onto the input stream to be read later by |
| 1137 | .I |
| 1138 | input(). |
| 1139 | .R |
| 1140 | .LP |
| 1141 | By default these routines are provided as macro definitions, |
| 1142 | but the user can override them and supply private versions. |
| 1143 | These routines |
| 1144 | define the relationship between external files and |
| 1145 | internal characters, and must all be retained |
| 1146 | or modified consistently. |
| 1147 | They may be redefined, to |
| 1148 | cause input or output to be transmitted to or from strange |
| 1149 | places, including other programs or internal memory; |
| 1150 | but the character set used must be consistent in all routines; |
| 1151 | a value of zero returned by |
| 1152 | .I |
| 1153 | input |
| 1154 | .R |
| 1155 | must mean end of file; and |
| 1156 | the relationship between |
| 1157 | .I |
| 1158 | unput |
| 1159 | .R |
| 1160 | and |
| 1161 | .I |
| 1162 | input |
| 1163 | .R |
| 1164 | must be retained |
| 1165 | or the Lex look~ahead will not work. |
| 1166 | Lex does not look ahead at all if it does not have to, |
| 1167 | but every rule ending in |
| 1168 | .ft I |
| 1169 | + \(** ? |
| 1170 | .ft R |
| 1171 | or |
| 1172 | .ft I |
| 1173 | $ |
| 1174 | .ft R |
| 1175 | or containing |
| 1176 | .ft I |
| 1177 | / |
| 1178 | .ft R |
| 1179 | implies look~ahead. |
| 1180 | Look~ahead is also necessary to match an expression that is a prefix |
| 1181 | of another expression. |
| 1182 | See below for a discussion of the character set used by Lex. |
| 1183 | The standard Lex library imposes |
| 1184 | a 100 character limit on backup. |
| 1185 | .PP |
| 1186 | Another Lex library routine that the user will sometimes want |
| 1187 | to redefine is |
| 1188 | .I |
| 1189 | yywrap() |
| 1190 | .R |
| 1191 | which is called whenever Lex reaches an end-of-file. |
| 1192 | If |
| 1193 | .I |
| 1194 | yywrap |
| 1195 | .R |
| 1196 | returns a 1, Lex continues with the normal wrapup on end of input. |
| 1197 | Sometimes, however, it is convenient to arrange for more |
| 1198 | input to arrive |
| 1199 | from a new source. |
| 1200 | In this case, the user should provide |
| 1201 | a |
| 1202 | .I |
| 1203 | yywrap |
| 1204 | .R |
| 1205 | which |
| 1206 | arranges for new input and |
| 1207 | returns 0. This instructs Lex to continue processing. |
| 1208 | The default |
| 1209 | .I |
| 1210 | yywrap |
| 1211 | .R |
| 1212 | always returns 1. |
| 1213 | .PP |
| 1214 | This routine is also a convenient place |
| 1215 | to print tables, summaries, etc. at the end |
| 1216 | of a program. Note that it is not |
| 1217 | possible to write a normal rule which recognizes |
| 1218 | end-of-file; the only access to this condition is |
| 1219 | through |
| 1220 | .I |
| 1221 | yywrap. |
| 1222 | .R |
| 1223 | In fact, unless a private version of |
| 1224 | .I |
| 1225 | input() |
| 1226 | .R |
| 1227 | is supplied |
| 1228 | a file containing nulls |
| 1229 | cannot be handled, |
| 1230 | since a value of 0 returned by |
| 1231 | .I |
| 1232 | input |
| 1233 | .R |
| 1234 | is taken to be end-of-file. |
| 1235 | .PP |
| 1236 | .2C |
| 1237 | .NH |
| 1238 | Ambiguous Source Rules. |
| 1239 | .PP |
| 1240 | Lex can handle ambiguous specifications. |
| 1241 | When more than one expression can match the |
| 1242 | current input, Lex chooses as follows: |
| 1243 | .IP 1) |
| 1244 | The longest match is preferred. |
| 1245 | .IP 2) |
| 1246 | Among rules which matched the same number of characters, |
| 1247 | the rule given first is preferred. |
| 1248 | .LP |
| 1249 | Thus, suppose the rules |
| 1250 | .TS |
| 1251 | center; |
| 1252 | l l. |
| 1253 | integer keyword action ...; |
| 1254 | [a\-z]+ identifier action ...; |
| 1255 | .TE |
| 1256 | to be given in that order. If the input is |
| 1257 | .I integers , |
| 1258 | it is taken as an identifier, because |
| 1259 | .I [a\-z]+ |
| 1260 | matches 8 characters while |
| 1261 | .I integer |
| 1262 | matches only 7. |
| 1263 | If the input is |
| 1264 | .I integer , |
| 1265 | both rules match 7 characters, and |
| 1266 | the keyword rule is selected because it was given first. |
| 1267 | Anything shorter (e.g. \fIint\fR\|) will |
| 1268 | not match the expression |
| 1269 | .I integer |
| 1270 | and so the identifier interpretation is used. |
| 1271 | .PP |
| 1272 | The principle of preferring the longest |
| 1273 | match makes rules containing |
| 1274 | expressions like |
| 1275 | .I \&.\(** |
| 1276 | dangerous. |
| 1277 | For example, |
| 1278 | .TS |
| 1279 | center; |
| 1280 | l. |
| 1281 | \&\(fm.\(**\(fm |
| 1282 | .TE |
| 1283 | might seem a good way of recognizing |
| 1284 | a string in single quotes. |
| 1285 | But it is an invitation for the program to read far |
| 1286 | ahead, looking for a distant |
| 1287 | single quote. |
| 1288 | Presented with the input |
| 1289 | .TS |
| 1290 | center; |
| 1291 | l l. |
| 1292 | \&\(fmfirst\(fm quoted string here, \(fmsecond\(fm here |
| 1293 | .TE |
| 1294 | the above expression will match |
| 1295 | .TS |
| 1296 | center; |
| 1297 | l l. |
| 1298 | \&\(fmfirst\(fm quoted string here, \(fmsecond\(fm |
| 1299 | .TE |
| 1300 | which is probably not what was wanted. |
| 1301 | A better rule is of the form |
| 1302 | .TS |
| 1303 | center; |
| 1304 | l. |
| 1305 | \&\(fm[^\(fm\en]\(**\(fm |
| 1306 | .TE |
| 1307 | which, on the above input, will stop |
| 1308 | after |
| 1309 | .I \(fmfirst\(fm . |
| 1310 | The consequences |
| 1311 | of errors like this are mitigated by the fact |
| 1312 | that the |
| 1313 | .I \&. |
| 1314 | operator will not match newline. |
| 1315 | Thus expressions like |
| 1316 | .I \&.\(** |
| 1317 | stop on the |
| 1318 | current line. |
| 1319 | Don't try to defeat this with expressions like |
| 1320 | .I [.\en]+ |
| 1321 | or |
| 1322 | equivalents; |
| 1323 | the Lex generated program will try to read |
| 1324 | the entire input file, causing |
| 1325 | internal buffer overflows. |
| 1326 | .PP |
| 1327 | Note that Lex is normally partitioning |
| 1328 | the input stream, not searching for all possible matches |
| 1329 | of each expression. |
| 1330 | This means that each character is accounted for |
| 1331 | once and only once. |
| 1332 | For example, suppose it is desired to |
| 1333 | count occurrences of both \fIshe\fR and \fIhe\fR in an input text. |
| 1334 | Some Lex rules to do this might be |
| 1335 | .TS |
| 1336 | center; |
| 1337 | l l. |
| 1338 | she s++; |
| 1339 | he h++; |
| 1340 | \en | |
| 1341 | \&. ; |
| 1342 | .TE |
| 1343 | where the last two rules ignore everything besides \fIhe\fR and \fIshe\fR. |
| 1344 | Remember that . does not include newline. |
| 1345 | Since \fIshe\fR includes \fIhe\fR, Lex will normally |
| 1346 | .I |
| 1347 | not |
| 1348 | .R |
| 1349 | recognize |
| 1350 | the instances of \fIhe\fR included in \fIshe\fR, |
| 1351 | since once it has passed a \fIshe\fR those characters are gone. |
| 1352 | .PP |
| 1353 | Sometimes the user would like to override this choice. The action |
| 1354 | REJECT |
| 1355 | means ``go do the next alternative.'' |
| 1356 | It causes whatever rule was second choice after the current |
| 1357 | rule to be executed. |
| 1358 | The position of the input pointer is adjusted accordingly. |
| 1359 | Suppose the user really wants to count the included instances of \fIhe\fR: |
| 1360 | .TS |
| 1361 | center; |
| 1362 | l l. |
| 1363 | she {s++; REJECT;} |
| 1364 | he {h++; REJECT;} |
| 1365 | \en | |
| 1366 | \&. ; |
| 1367 | .TE |
| 1368 | these rules are one way of changing the previous example |
| 1369 | to do just that. |
| 1370 | After counting each expression, it is rejected; whenever appropriate, |
| 1371 | the other expression will then be counted. In this example, of course, |
| 1372 | the user could note that \fIshe\fR includes \fIhe\fR but not |
| 1373 | vice versa, and omit the REJECT action on \fIhe\fR; |
| 1374 | in other cases, however, it |
| 1375 | would not be possible a priori to tell |
| 1376 | which input characters |
| 1377 | were in both classes. |
| 1378 | .PP |
| 1379 | Consider the two rules |
| 1380 | .TS |
| 1381 | center; |
| 1382 | l l. |
| 1383 | a[bc]+ { ... ; REJECT;} |
| 1384 | a[cd]+ { ... ; REJECT;} |
| 1385 | .TE |
| 1386 | If the input is |
| 1387 | .I ab , |
| 1388 | only the first rule matches, |
| 1389 | and on |
| 1390 | .I ad |
| 1391 | only the second matches. |
| 1392 | The input string |
| 1393 | .I accb |
| 1394 | matches the first rule for four characters |
| 1395 | and then the second rule for three characters. |
| 1396 | In contrast, the input |
| 1397 | .I accd |
| 1398 | agrees with |
| 1399 | the second rule for four characters and then the first |
| 1400 | rule for three. |
| 1401 | .PP |
| 1402 | In general, REJECT is useful whenever |
| 1403 | the purpose of Lex is not to partition the input |
| 1404 | stream but to detect all examples of some items |
| 1405 | in the input, and the instances of these items |
| 1406 | may overlap or include each other. |
| 1407 | Suppose a digram table of the input is desired; |
| 1408 | normally the digrams overlap, that is the word |
| 1409 | .I the |
| 1410 | is considered to contain |
| 1411 | both |
| 1412 | .I th |
| 1413 | and |
| 1414 | .I he . |
| 1415 | Assuming a two-dimensional array named |
| 1416 | .ul |
| 1417 | digram |
| 1418 | to be incremented, the appropriate |
| 1419 | source is |
| 1420 | .TS |
| 1421 | center; |
| 1422 | l l. |
| 1423 | %% |
| 1424 | [a\-z][a\-z] {digram[yytext[0]][yytext[1]]++; REJECT;} |
| 1425 | . ; |
| 1426 | \en ; |
| 1427 | .TE |
| 1428 | where the REJECT is necessary to pick up |
| 1429 | a letter pair beginning at every character, rather than at every |
| 1430 | other character. |
| 1431 | .2C |
| 1432 | .NH |
| 1433 | Lex Source Definitions. |
| 1434 | .PP |
| 1435 | Remember the format of the Lex |
| 1436 | source: |
| 1437 | .TS |
| 1438 | center; |
| 1439 | l. |
| 1440 | {definitions} |
| 1441 | %% |
| 1442 | {rules} |
| 1443 | %% |
| 1444 | {user routines} |
| 1445 | .TE |
| 1446 | So far only the rules have been described. The user needs |
| 1447 | additional options, |
| 1448 | though, to define variables for use in his program and for use |
| 1449 | by Lex. |
| 1450 | These can go either in the definitions section |
| 1451 | or in the rules section. |
| 1452 | .PP |
| 1453 | Remember that Lex is turning the rules into a program. |
| 1454 | Any source not intercepted by Lex is copied |
| 1455 | into the generated program. There are three classes |
| 1456 | of such things. |
| 1457 | .IP 1) |
| 1458 | Any line which is not part of a Lex rule or action |
| 1459 | which begins with a blank or tab is copied into |
| 1460 | the Lex generated program. |
| 1461 | Such source input prior to the first %% delimiter will be external |
| 1462 | to any function in the code; if it appears immediately after the first |
| 1463 | %%, |
| 1464 | it appears in an appropriate place for declarations |
| 1465 | in the function written by Lex which contains the actions. |
| 1466 | This material must look like program fragments, |
| 1467 | and should precede the first Lex rule. |
| 1468 | .IP |
| 1469 | As a side effect of the above, lines which begin with a blank |
| 1470 | or tab, and which contain a comment, |
| 1471 | are passed through to the generated program. |
| 1472 | This can be used to include comments in either the Lex source or |
| 1473 | the generated code. The comments should follow the host |
| 1474 | language convention. |
| 1475 | .IP 2) |
| 1476 | Anything included between lines containing |
| 1477 | only |
| 1478 | .I %{ |
| 1479 | and |
| 1480 | .I %} |
| 1481 | is |
| 1482 | copied out as above. The delimiters are discarded. |
| 1483 | This format permits entering text like preprocessor statements that |
| 1484 | must begin in column 1, |
| 1485 | or copying lines that do not look like programs. |
| 1486 | .IP 3) |
| 1487 | Anything after the third %% delimiter, regardless of formats, etc., |
| 1488 | is copied out after the Lex output. |
| 1489 | .PP |
| 1490 | Definitions intended for Lex are given |
| 1491 | before the first %% delimiter. Any line in this section |
| 1492 | not contained between %{ and %}, and begining |
| 1493 | in column 1, is assumed to define Lex substitution strings. |
| 1494 | The format of such lines is |
| 1495 | .TS |
| 1496 | center; |
| 1497 | l l. |
| 1498 | name translation |
| 1499 | .TE |
| 1500 | and it |
| 1501 | causes the string given as a translation to |
| 1502 | be associated with the name. |
| 1503 | The name and translation |
| 1504 | must be separated by at least one blank or tab, and the name must begin with a letter. |
| 1505 | The translation can then be called out |
| 1506 | by the {name} syntax in a rule. |
| 1507 | Using {D} for the digits and {E} for an exponent field, |
| 1508 | for example, might abbreviate rules to recognize numbers: |
| 1509 | .TS |
| 1510 | center; |
| 1511 | l l. |
| 1512 | D [0\-9] |
| 1513 | E [DEde][\-+]?{D}+ |
| 1514 | %% |
| 1515 | {D}+ printf("integer"); |
| 1516 | {D}+"."{D}\(**({E})? | |
| 1517 | {D}\(**"."{D}+({E})? | |
| 1518 | {D}+{E} printf("real"); |
| 1519 | .TE |
| 1520 | Note the first two rules for real numbers; |
| 1521 | both require a decimal point and contain |
| 1522 | an optional exponent field, |
| 1523 | but the first requires at least one digit before the |
| 1524 | decimal point and the second requires at least one |
| 1525 | digit after the decimal point. |
| 1526 | To correctly handle the problem |
| 1527 | posed by a Fortran expression such as |
| 1528 | .I 35.EQ.I , |
| 1529 | which does not contain a real number, a context-sensitive |
| 1530 | rule such as |
| 1531 | .TS |
| 1532 | center; |
| 1533 | l l. |
| 1534 | [0\-9]+/"."EQ printf("integer"); |
| 1535 | .TE |
| 1536 | could be used in addition to the normal rule for integers. |
| 1537 | .PP |
| 1538 | The definitions |
| 1539 | section may also contain other commands, including the |
| 1540 | selection of a host language, a character set table, |
| 1541 | a list of start conditions, or adjustments to the default |
| 1542 | size of arrays within Lex itself for larger source programs. |
| 1543 | These possibilities |
| 1544 | are discussed below under ``Summary of Source Format,'' |
| 1545 | section 12. |
| 1546 | .2C |
| 1547 | .NH |
| 1548 | Usage. |
| 1549 | .PP |
| 1550 | There are two steps in |
| 1551 | compiling a Lex source program. |
| 1552 | First, the Lex source must be turned into a generated program |
| 1553 | in the host general purpose language. |
| 1554 | Then this program must be compiled and loaded, usually with |
| 1555 | a library of Lex subroutines. |
| 1556 | The generated program |
| 1557 | is on a file named |
| 1558 | .I lex.yy.c . |
| 1559 | The I/O library is defined in terms of the C standard |
| 1560 | library [6]. |
| 1561 | .PP |
| 1562 | The C programs generated by Lex are slightly different |
| 1563 | on OS/370, because the |
| 1564 | OS compiler is less powerful than the UNIX or GCOS compilers, |
| 1565 | and does less at compile time. |
| 1566 | C programs generated on GCOS and UNIX are the same. |
| 1567 | .PP |
| 1568 | .I |
| 1569 | UNIX. |
| 1570 | .R |
| 1571 | The library is accessed by the loader flag |
| 1572 | .I \-ll . |
| 1573 | So an appropriate |
| 1574 | set of commands is |
| 1575 | .KS |
| 1576 | .in 5 |
| 1577 | lex source |
| 1578 | cc lex.yy.c \-ll |
| 1579 | .in 0 |
| 1580 | .KE |
| 1581 | The resulting program is placed on the usual file |
| 1582 | .I |
| 1583 | a.out |
| 1584 | .R |
| 1585 | for later execution. |
| 1586 | To use Lex with Yacc see below. |
| 1587 | Although the default Lex I/O routines use the C standard library, |
| 1588 | the Lex automata themselves do not do so; |
| 1589 | if private versions of |
| 1590 | .I |
| 1591 | input, |
| 1592 | output |
| 1593 | .R |
| 1594 | and |
| 1595 | .I unput |
| 1596 | are given, the library can be avoided. |
| 1597 | .PP |
| 1598 | .2C |
| 1599 | .NH |
| 1600 | Lex and Yacc. |
| 1601 | .PP |
| 1602 | If you want to use Lex with Yacc, note that what Lex writes is a program |
| 1603 | named |
| 1604 | .I |
| 1605 | yylex(), |
| 1606 | .R |
| 1607 | the name required by Yacc for its analyzer. |
| 1608 | Normally, the default main program on the Lex library |
| 1609 | calls this routine, but if Yacc is loaded, and its main |
| 1610 | program is used, Yacc will call |
| 1611 | .I |
| 1612 | yylex(). |
| 1613 | .R |
| 1614 | In this case each Lex rule should end with |
| 1615 | .TS |
| 1616 | center; |
| 1617 | l. |
| 1618 | return(token); |
| 1619 | .TE |
| 1620 | where the appropriate token value is returned. |
| 1621 | An easy way to get access |
| 1622 | to Yacc's names for tokens is to |
| 1623 | compile the Lex output file as part of |
| 1624 | the Yacc output file by placing the line |
| 1625 | .TS |
| 1626 | center; |
| 1627 | l. |
| 1628 | # include "lex.yy.c" |
| 1629 | .TE |
| 1630 | in the last section of Yacc input. |
| 1631 | Supposing the grammar to be |
| 1632 | named ``good'' and the lexical rules to be named ``better'' |
| 1633 | the UNIX command sequence can just be: |
| 1634 | .TS |
| 1635 | center; |
| 1636 | l. |
| 1637 | yacc good |
| 1638 | lex better |
| 1639 | cc y.tab.c \-ly \-ll |
| 1640 | .TE |
| 1641 | The Yacc library (\-ly) should be loaded before the Lex library, |
| 1642 | to obtain a main program which invokes the Yacc parser. |
| 1643 | The generations of Lex and Yacc programs can be done in |
| 1644 | either order. |
| 1645 | .2C |
| 1646 | .NH |
| 1647 | Examples. |
| 1648 | .PP |
| 1649 | As a trivial problem, consider copying an input file while |
| 1650 | adding 3 to every positive number divisible by 7. |
| 1651 | Here is a suitable Lex source program |
| 1652 | .TS |
| 1653 | center; |
| 1654 | l l. |
| 1655 | %% |
| 1656 | int k; |
| 1657 | [0\-9]+ { |
| 1658 | k = atoi(yytext); |
| 1659 | if (k%7 == 0) |
| 1660 | printf("%d", k+3); |
| 1661 | else |
| 1662 | printf("%d",k); |
| 1663 | } |
| 1664 | .TE |
| 1665 | to do just that. |
| 1666 | The rule [0\-9]+ recognizes strings of digits; |
| 1667 | .I |
| 1668 | atoi |
| 1669 | .R |
| 1670 | converts the digits to binary |
| 1671 | and stores the result in |
| 1672 | .ul |
| 1673 | k. |
| 1674 | The operator % (remainder) is used to check whether |
| 1675 | .ul |
| 1676 | k |
| 1677 | is divisible by 7; if it is, |
| 1678 | it is incremented by 3 as it is written out. |
| 1679 | It may be objected that this program will alter such |
| 1680 | input items as |
| 1681 | .I 49.63 |
| 1682 | or |
| 1683 | .I X7 . |
| 1684 | Furthermore, it increments the absolute value |
| 1685 | of all negative numbers divisible by 7. |
| 1686 | To avoid this, just add a few more rules after the active one, |
| 1687 | as here: |
| 1688 | .TS |
| 1689 | center; |
| 1690 | l l. |
| 1691 | %% |
| 1692 | int k; |
| 1693 | \-?[0\-9]+ { |
| 1694 | k = atoi(yytext); |
| 1695 | printf("%d", k%7 == 0 ? k+3 : k); |
| 1696 | } |
| 1697 | \-?[0\-9.]+ ECHO; |
| 1698 | [A-Za-z][A-Za-z0-9]+ ECHO; |
| 1699 | .TE |
| 1700 | Numerical strings containing |
| 1701 | a ``.'' or preceded by a letter will be picked up by |
| 1702 | one of the last two rules, and not changed. |
| 1703 | The |
| 1704 | .I if\-else |
| 1705 | has been replaced by |
| 1706 | a C conditional expression to save space; |
| 1707 | the form |
| 1708 | .ul |
| 1709 | a?b:c |
| 1710 | means ``if |
| 1711 | .I a |
| 1712 | then |
| 1713 | .I b |
| 1714 | else |
| 1715 | .I c ''. |
| 1716 | .PP |
| 1717 | For an example of statistics gathering, here |
| 1718 | is a program which histograms the lengths |
| 1719 | of words, where a word is defined as a string of letters. |
| 1720 | .TS |
| 1721 | center; |
| 1722 | l l. |
| 1723 | int lengs[100]; |
| 1724 | %% |
| 1725 | [a\-z]+ lengs[yyleng]++; |
| 1726 | \&. | |
| 1727 | \en ; |
| 1728 | %% |
| 1729 | .T& |
| 1730 | l s. |
| 1731 | yywrap() |
| 1732 | { |
| 1733 | int i; |
| 1734 | printf("Length No. words\en"); |
| 1735 | for(i=0; i<100; i++) |
| 1736 | if (lengs[i] > 0) |
| 1737 | printf("%5d%10d\en",i,lengs[i]); |
| 1738 | return(1); |
| 1739 | } |
| 1740 | .TE |
| 1741 | This program |
| 1742 | accumulates the histogram, while producing no output. At the end |
| 1743 | of the input it prints the table. |
| 1744 | The final statement |
| 1745 | .I |
| 1746 | return(1); |
| 1747 | .R |
| 1748 | indicates that Lex is to perform wrapup. If |
| 1749 | .I |
| 1750 | yywrap |
| 1751 | .R |
| 1752 | returns zero (false) |
| 1753 | it implies that further input is available |
| 1754 | and the program is |
| 1755 | to continue reading and processing. |
| 1756 | To provide a |
| 1757 | .I |
| 1758 | yywrap |
| 1759 | .R |
| 1760 | that never |
| 1761 | returns true causes an infinite loop. |
| 1762 | .PP |
| 1763 | As a larger example, |
| 1764 | here are some parts of a program written by N. L. Schryer |
| 1765 | to convert double precision Fortran to single precision Fortran. |
| 1766 | Because Fortran does not distinguish upper and lower case letters, |
| 1767 | this routine begins by defining a set of classes including |
| 1768 | both cases of each letter: |
| 1769 | .TS |
| 1770 | center; |
| 1771 | l l. |
| 1772 | a [aA] |
| 1773 | b [bB] |
| 1774 | c [cC] |
| 1775 | \&... |
| 1776 | z [zZ] |
| 1777 | .TE |
| 1778 | An additional class recognizes white space: |
| 1779 | .TS |
| 1780 | center; |
| 1781 | l l. |
| 1782 | W [ \et]\(** |
| 1783 | .TE |
| 1784 | The first rule changes |
| 1785 | ``double precision'' to ``real'', or ``DOUBLE PRECISION'' to ``REAL''. |
| 1786 | .TS |
| 1787 | center; |
| 1788 | l. |
| 1789 | {d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} { |
| 1790 | printf(yytext[0]==\(fmd\(fm? "real" : "REAL"); |
| 1791 | } |
| 1792 | .TE |
| 1793 | Care is taken throughout this program to preserve the case |
| 1794 | (upper or lower) |
| 1795 | of the original program. |
| 1796 | The conditional operator is used to |
| 1797 | select the proper form of the keyword. |
| 1798 | The next rule copies continuation card indications to |
| 1799 | avoid confusing them with constants: |
| 1800 | .TS |
| 1801 | center; |
| 1802 | l l. |
| 1803 | ^" "[^ 0] ECHO; |
| 1804 | .TE |
| 1805 | In the regular expression, the quotes surround the |
| 1806 | blanks. |
| 1807 | It is interpreted as |
| 1808 | ``beginning of line, then five blanks, then |
| 1809 | anything but blank or zero.'' |
| 1810 | Note the two different meanings of |
| 1811 | .I ^ . |
| 1812 | There follow some rules to change double precision |
| 1813 | constants to ordinary floating constants. |
| 1814 | .TS |
| 1815 | center; |
| 1816 | l. |
| 1817 | [0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+ | |
| 1818 | [0\-9]+{W}"."{W}{d}{W}[+\-]?{W}[0\-9]+ | |
| 1819 | "."{W}[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+ { |
| 1820 | /\(** convert constants \(**/ |
| 1821 | for(p=yytext; \(**p != 0; p++) |
| 1822 | { |
| 1823 | if (\(**p == \(fmd\(fm || \(**p == \(fmD\(fm) |
| 1824 | \(**p=+ \(fme\(fm\- \(fmd\(fm; |
| 1825 | ECHO; |
| 1826 | } |
| 1827 | .TE |
| 1828 | After the floating point constant is recognized, it is |
| 1829 | scanned by the |
| 1830 | .ul |
| 1831 | for |
| 1832 | loop |
| 1833 | to find the letter |
| 1834 | .I d |
| 1835 | or |
| 1836 | .I D . |
| 1837 | The program than adds |
| 1838 | .c |
| 1839 | .I \(fme\(fm\-\(fmd\(fm , |
| 1840 | which converts |
| 1841 | it to the next letter of the alphabet. |
| 1842 | The modified constant, now single-precision, |
| 1843 | is written out again. |
| 1844 | There follow a series of names which must be respelled to remove |
| 1845 | their initial \fId\fR. |
| 1846 | By using the |
| 1847 | array |
| 1848 | .I |
| 1849 | yytext |
| 1850 | .R |
| 1851 | the same action suffices for all the names (only a sample of |
| 1852 | a rather long list is given here). |
| 1853 | .TS |
| 1854 | center; |
| 1855 | l l. |
| 1856 | {d}{s}{i}{n} | |
| 1857 | {d}{c}{o}{s} | |
| 1858 | {d}{s}{q}{r}{t} | |
| 1859 | {d}{a}{t}{a}{n} | |
| 1860 | \&... |
| 1861 | {d}{f}{l}{o}{a}{t} printf("%s",yytext+1); |
| 1862 | .TE |
| 1863 | Another list of names must have initial \fId\fR changed to initial \fIa\fR: |
| 1864 | .TS |
| 1865 | center; |
| 1866 | l l. |
| 1867 | {d}{l}{o}{g} | |
| 1868 | {d}{l}{o}{g}10 | |
| 1869 | {d}{m}{i}{n}1 | |
| 1870 | {d}{m}{a}{x}1 { |
| 1871 | yytext[0] =+ \(fma\(fm \- \(fmd\(fm; |
| 1872 | ECHO; |
| 1873 | } |
| 1874 | .TE |
| 1875 | And one routine |
| 1876 | must have initial \fId\fR changed to initial \fIr\fR: |
| 1877 | .TS |
| 1878 | center; |
| 1879 | l l. |
| 1880 | {d}1{m}{a}{c}{h} {yytext[0] =+ \(fmr\(fm \- \(fmd\(fm; |
| 1881 | ECHO; |
| 1882 | } |
| 1883 | .TE |
| 1884 | To avoid such names as \fIdsinx\fR being detected as instances |
| 1885 | of \fIdsin\fR, some final rules pick up longer words as identifiers |
| 1886 | and copy some surviving characters: |
| 1887 | .TS |
| 1888 | center; |
| 1889 | l l. |
| 1890 | [A\-Za\-z][A\-Za\-z0\-9]\(** | |
| 1891 | [0\-9]+ | |
| 1892 | \en | |
| 1893 | \&. ECHO; |
| 1894 | .TE |
| 1895 | Note that this program is not complete; it |
| 1896 | does not deal with the spacing problems in Fortran or |
| 1897 | with the use of keywords as identifiers. |
| 1898 | .br |
| 1899 | .2C |
| 1900 | .NH |
| 1901 | Left Context Sensitivity. |
| 1902 | .PP |
| 1903 | Sometimes |
| 1904 | it is desirable to have several sets of lexical rules |
| 1905 | to be applied at different times in the input. |
| 1906 | For example, a compiler preprocessor might distinguish |
| 1907 | preprocessor statements and analyze them differently |
| 1908 | from ordinary statements. |
| 1909 | This requires |
| 1910 | sensitivity |
| 1911 | to prior context, and there are several ways of handling |
| 1912 | such problems. |
| 1913 | The \fI^\fR operator, for example, is a prior context operator, |
| 1914 | recognizing immediately preceding left context just as \fI$\fR recognizes |
| 1915 | immediately following right context. |
| 1916 | Adjacent left context could be extended, to produce a facility similar to |
| 1917 | that for adjacent right context, but it is unlikely |
| 1918 | to be as useful, since often the relevant left context |
| 1919 | appeared some time earlier, such as at the beginning of a line. |
| 1920 | .PP |
| 1921 | This section describes three means of dealing |
| 1922 | with different environments: a simple use of flags, |
| 1923 | when only a few rules change from one environment to another, |
| 1924 | the use of |
| 1925 | .I |
| 1926 | start conditions |
| 1927 | .R |
| 1928 | on rules, |
| 1929 | and the possibility of making multiple lexical analyzers all run |
| 1930 | together. |
| 1931 | In each case, there are rules which recognize the need to change the |
| 1932 | environment in which the |
| 1933 | following input text is analyzed, and set some parameter |
| 1934 | to reflect the change. This may be a flag explicitly tested by |
| 1935 | the user's action code; such a flag is the simplest way of dealing |
| 1936 | with the problem, since Lex is not involved at all. |
| 1937 | It may be more convenient, |
| 1938 | however, |
| 1939 | to have Lex remember the flags as initial conditions on the rules. |
| 1940 | Any rule may be associated with a start condition. It will only |
| 1941 | be recognized when Lex is in |
| 1942 | that start condition. |
| 1943 | The current start condition may be changed at any time. |
| 1944 | Finally, if the sets of rules for the different environments |
| 1945 | are very dissimilar, |
| 1946 | clarity may be best achieved by writing several distinct lexical |
| 1947 | analyzers, and switching from one to another as desired. |
| 1948 | .PP |
| 1949 | Consider the following problem: copy the input to the output, |
| 1950 | changing the word \fImagic\fR to \fIfirst\fR on every line which began |
| 1951 | with the letter \fIa\fR, changing \fImagic\fR to \fIsecond\fR on every line |
| 1952 | which began with the letter \fIb\fR, and changing |
| 1953 | \fImagic\fR to \fIthird\fR on every line which began |
| 1954 | with the letter \fIc\fR. All other words and all other lines |
| 1955 | are left unchanged. |
| 1956 | .PP |
| 1957 | These rules are so simple that the easiest way |
| 1958 | to do this job is with a flag: |
| 1959 | .TS |
| 1960 | center; |
| 1961 | l l. |
| 1962 | int flag; |
| 1963 | %% |
| 1964 | ^a {flag = \(fma\(fm; ECHO;} |
| 1965 | ^b {flag = \(fmb\(fm; ECHO;} |
| 1966 | ^c {flag = \(fmc\(fm; ECHO;} |
| 1967 | \en {flag = 0 ; ECHO;} |
| 1968 | magic { |
| 1969 | switch (flag) |
| 1970 | { |
| 1971 | case \(fma\(fm: printf("first"); break; |
| 1972 | case \(fmb\(fm: printf("second"); break; |
| 1973 | \11\11 case \(fmc\(fm: printf("third"); break; |
| 1974 | default: ECHO; break; |
| 1975 | } |
| 1976 | } |
| 1977 | .TE |
| 1978 | should be adequate. |
| 1979 | .PP |
| 1980 | To handle the same problem with start conditions, each |
| 1981 | start condition must be introduced to Lex in the definitions section |
| 1982 | with a line reading |
| 1983 | .TS |
| 1984 | center; |
| 1985 | l l. |
| 1986 | %Start name1 name2 ... |
| 1987 | .TE |
| 1988 | where the conditions may be named in any order. |
| 1989 | The word \fIStart\fR may be abbreviated to \fIs\fR or \fIS\fR. |
| 1990 | The conditions may be referenced at the |
| 1991 | head of a rule with the <> brackets: |
| 1992 | .TS |
| 1993 | center; |
| 1994 | l. |
| 1995 | <name1>expression |
| 1996 | .TE |
| 1997 | is a rule which is only recognized when Lex is in the |
| 1998 | start condition \fIname1\fR. |
| 1999 | To enter a start condition, |
| 2000 | execute the action statement |
| 2001 | .TS |
| 2002 | center; |
| 2003 | l. |
| 2004 | BEGIN name1; |
| 2005 | .TE |
| 2006 | which changes the start condition to \fIname1\fR. |
| 2007 | To resume the normal state, |
| 2008 | .TS |
| 2009 | center; |
| 2010 | l. |
| 2011 | BEGIN 0; |
| 2012 | .TE |
| 2013 | resets the initial condition |
| 2014 | of the Lex automaton interpreter. |
| 2015 | A rule may be active in several |
| 2016 | start conditions: |
| 2017 | .TS |
| 2018 | center; |
| 2019 | l. |
| 2020 | <name1,name2,name3> |
| 2021 | .TE |
| 2022 | is a legal prefix. Any rule not beginning with the |
| 2023 | <> prefix operator is always active. |
| 2024 | .PP |
| 2025 | The same example as before can be written: |
| 2026 | .TS |
| 2027 | center; |
| 2028 | l l. |
| 2029 | %START AA BB CC |
| 2030 | %% |
| 2031 | ^a {ECHO; BEGIN AA;} |
| 2032 | ^b {ECHO; BEGIN BB;} |
| 2033 | ^c {ECHO; BEGIN CC;} |
| 2034 | \en {ECHO; BEGIN 0;} |
| 2035 | <AA>magic printf("first"); |
| 2036 | <BB>magic printf("second"); |
| 2037 | <CC>magic printf("third"); |
| 2038 | .TE |
| 2039 | where the logic is exactly the same as in the previous |
| 2040 | method of handling the problem, but Lex does the work |
| 2041 | rather than the user's code. |
| 2042 | .2C |
| 2043 | .NH |
| 2044 | Character Set. |
| 2045 | .PP |
| 2046 | The programs generated by Lex handle |
| 2047 | character I/O only through the routines |
| 2048 | .I |
| 2049 | input, |
| 2050 | output, |
| 2051 | .R |
| 2052 | and |
| 2053 | .I |
| 2054 | unput. |
| 2055 | .R |
| 2056 | Thus the character representation |
| 2057 | provided in these routines |
| 2058 | is accepted by Lex and employed to return |
| 2059 | values in |
| 2060 | .I |
| 2061 | yytext. |
| 2062 | .R |
| 2063 | For internal use |
| 2064 | a character is represented as a small integer |
| 2065 | which, if the standard library is used, |
| 2066 | has a value equal to the integer value of the bit |
| 2067 | pattern representing the character on the host computer. |
| 2068 | Normally, the letter |
| 2069 | .I a |
| 2070 | is represented as the same form as the character constant |
| 2071 | .I \(fma\(fm . |
| 2072 | If this interpretation is changed, by providing I/O |
| 2073 | routines which translate the characters, |
| 2074 | Lex must be told about |
| 2075 | it, by giving a translation table. |
| 2076 | This table must be in the definitions section, |
| 2077 | and must be bracketed by lines containing only |
| 2078 | ``%T''. |
| 2079 | The table contains lines of the form |
| 2080 | .TS |
| 2081 | center; |
| 2082 | l. |
| 2083 | {integer} {character string} |
| 2084 | .TE |
| 2085 | which indicate the value associated with each character. |
| 2086 | Thus the next example |
| 2087 | .GS 2 |
| 2088 | .TS |
| 2089 | center; |
| 2090 | l l. |
| 2091 | %T |
| 2092 | 1 Aa |
| 2093 | 2 Bb |
| 2094 | \&... |
| 2095 | 26 Zz |
| 2096 | 27 \en |
| 2097 | 28 + |
| 2098 | 29 \- |
| 2099 | 30 0 |
| 2100 | 31 1 |
| 2101 | \&... |
| 2102 | 39 9 |
| 2103 | %T |
| 2104 | .TE |
| 2105 | .sp |
| 2106 | .ce 1 |
| 2107 | Sample character table. |
| 2108 | .GE |
| 2109 | maps the lower and upper case letters together into the integers 1 through 26, |
| 2110 | newline into 27, + and \- into 28 and 29, and the |
| 2111 | digits into 30 through 39. |
| 2112 | Note the escape for newline. |
| 2113 | If a table is supplied, every character that is to appear either |
| 2114 | in the rules or in any valid input must be included |
| 2115 | in the table. |
| 2116 | No character |
| 2117 | may be assigned the number 0, and no character may be |
| 2118 | assigned a bigger number than the size of the hardware character set. |
| 2119 | .2C |
| 2120 | .NH |
| 2121 | Summary of Source Format. |
| 2122 | .PP |
| 2123 | The general form of a Lex source file is: |
| 2124 | .TS |
| 2125 | center; |
| 2126 | l. |
| 2127 | {definitions} |
| 2128 | %% |
| 2129 | {rules} |
| 2130 | %% |
| 2131 | {user subroutines} |
| 2132 | .TE |
| 2133 | The definitions section contains |
| 2134 | a combination of |
| 2135 | .IP 1) |
| 2136 | Definitions, in the form ``name space translation''. |
| 2137 | .IP 2) |
| 2138 | Included code, in the form ``space code''. |
| 2139 | .IP 3) |
| 2140 | Included code, in the form |
| 2141 | .TS |
| 2142 | center; |
| 2143 | l. |
| 2144 | %{ |
| 2145 | code |
| 2146 | %} |
| 2147 | .TE |
| 2148 | .ns |
| 2149 | .IP 4) |
| 2150 | Start conditions, given in the form |
| 2151 | .TS |
| 2152 | center; |
| 2153 | l. |
| 2154 | %S name1 name2 ... |
| 2155 | .TE |
| 2156 | .ns |
| 2157 | .IP 5) |
| 2158 | Character set tables, in the form |
| 2159 | .TS |
| 2160 | center; |
| 2161 | l. |
| 2162 | %T |
| 2163 | number space character-string |
| 2164 | \&... |
| 2165 | %T |
| 2166 | .TE |
| 2167 | .ns |
| 2168 | .IP 6) |
| 2169 | Changes to internal array sizes, in the form |
| 2170 | .TS |
| 2171 | center; |
| 2172 | l. |
| 2173 | %\fIx\fR\0\0\fInnn\fR |
| 2174 | .TE |
| 2175 | where \fInnn\fR is a decimal integer representing an array size |
| 2176 | and \fIx\fR selects the parameter as follows: |
| 2177 | .TS |
| 2178 | center; |
| 2179 | c c |
| 2180 | c l. |
| 2181 | Letter Parameter |
| 2182 | p positions |
| 2183 | n states |
| 2184 | e tree nodes |
| 2185 | a transitions |
| 2186 | k packed character classes |
| 2187 | o output array size |
| 2188 | .TE |
| 2189 | .LP |
| 2190 | Lines in the rules section have the form ``expression action'' |
| 2191 | where the action may be continued on succeeding |
| 2192 | lines by using braces to delimit it. |
| 2193 | .PP |
| 2194 | Regular expressions in Lex use the following |
| 2195 | operators: |
| 2196 | .br |
| 2197 | .TS |
| 2198 | center; |
| 2199 | l l. |
| 2200 | x the character "x" |
| 2201 | "x" an "x", even if x is an operator. |
| 2202 | \ex an "x", even if x is an operator. |
| 2203 | [xy] the character x or y. |
| 2204 | [x\-z] the characters x, y or z. |
| 2205 | [^x] any character but x. |
| 2206 | \&. any character but newline. |
| 2207 | ^x an x at the beginning of a line. |
| 2208 | <y>x an x when Lex is in start condition y. |
| 2209 | x$ an x at the end of a line. |
| 2210 | x? an optional x. |
| 2211 | x\(** 0,1,2, ... instances of x. |
| 2212 | x+ 1,2,3, ... instances of x. |
| 2213 | x|y an x or a y. |
| 2214 | (x) an x. |
| 2215 | x/y an x but only if followed by y. |
| 2216 | {xx} the translation of xx from the definitions section. |
| 2217 | x{m,n} \fIm\fR through \fIn\fR occurrences of x |
| 2218 | .TE |
| 2219 | .NH |
| 2220 | Caveats and Bugs. |
| 2221 | .PP |
| 2222 | There are pathological expressions which |
| 2223 | produce exponential growth of the tables when |
| 2224 | converted to deterministic machines; |
| 2225 | fortunately, they are rare. |
| 2226 | .PP |
| 2227 | REJECT does not rescan the input; instead it remembers the results of the previous |
| 2228 | scan. This means that if a rule with trailing context is found, and |
| 2229 | REJECT executed, the user |
| 2230 | must not have used |
| 2231 | .ul |
| 2232 | unput |
| 2233 | to change the characters forthcoming |
| 2234 | from the input stream. |
| 2235 | This is the only restriction on the user's ability to manipulate |
| 2236 | the not-yet-processed input. |
| 2237 | .PP |
| 2238 | .2C |
| 2239 | .NH |
| 2240 | Acknowledgments. |
| 2241 | .PP |
| 2242 | As should |
| 2243 | be obvious from the above, the outside of Lex |
| 2244 | is patterned |
| 2245 | on Yacc and the inside on Aho's string matching routines. |
| 2246 | Therefore, both S. C. Johnson and A. V. Aho |
| 2247 | are really originators |
| 2248 | of much of Lex, |
| 2249 | as well as debuggers of it. |
| 2250 | Many thanks are due to both. |
| 2251 | .PP |
| 2252 | The code of the current version of Lex was designed, written, |
| 2253 | and debugged by Eric Schmidt. |
| 2254 | .SG MH-1274-MEL-unix |
| 2255 | .sp 1 |
| 2256 | .2C |
| 2257 | .NH |
| 2258 | References. |
| 2259 | .SP 1v |
| 2260 | .IP 1. |
| 2261 | B. W. Kernighan and D. M. Ritchie, |
| 2262 | .I |
| 2263 | The C Programming Language, |
| 2264 | .R |
| 2265 | Prentice-Hall, N. J. (1978). |
| 2266 | .IP 2. |
| 2267 | B. W. Kernighan, |
| 2268 | .I |
| 2269 | Ratfor: A Preprocessor for a Rational Fortran, |
| 2270 | .R |
| 2271 | Software \- Practice and Experience, |
| 2272 | \fB5\fR, pp. 395-496 (1975). |
| 2273 | .IP 3. |
| 2274 | S. C. Johnson, |
| 2275 | .I |
| 2276 | Yacc: Yet Another Compiler Compiler, |
| 2277 | .R |
| 2278 | Computing Science Technical Report No. 32, |
| 2279 | 1975, |
| 2280 | .MH |
| 2281 | .if \n(tm (also TM 75-1273-6) |
| 2282 | .IP 4. |
| 2283 | A. V. Aho and M. J. Corasick, |
| 2284 | .I |
| 2285 | Efficient String Matching: An Aid to Bibliographic Search, |
| 2286 | .R |
| 2287 | Comm. ACM |
| 2288 | .B |
| 2289 | 18, |
| 2290 | .R |
| 2291 | 333-340 (1975). |
| 2292 | .IP 5. |
| 2293 | B. W. Kernighan, D. M. Ritchie and K. L. Thompson, |
| 2294 | .I |
| 2295 | QED Text Editor, |
| 2296 | .R |
| 2297 | Computing Science Technical Report No. 5, |
| 2298 | 1972, |
| 2299 | .MH |
| 2300 | .IP 6. |
| 2301 | D. M. Ritchie, |
| 2302 | private communication. |
| 2303 | See also |
| 2304 | M. E. Lesk, |
| 2305 | .I |
| 2306 | The Portable C Library, |
| 2307 | .R |
| 2308 | Computing Science Technical Report No. 31, |
| 2309 | .MH |
| 2310 | .if \n(tm (also TM 75-1274-11) |