Commit | Line | Data |
---|---|---|
2240a03d TL |
1 | .SH |
2 | 3: Lexical Analysis | |
3 | .PP | |
4 | The user must supply a lexical analyzer to read the input stream and communicate tokens | |
5 | (with values, if desired) to the parser. | |
6 | The lexical analyzer is an integer-valued function called | |
7 | .I yylex . | |
8 | The function returns an integer, the | |
9 | .I "token number" , | |
10 | representing the kind of token read. | |
11 | If there is a value associated with that token, it should be assigned | |
12 | to the external variable | |
13 | .I yylval . | |
14 | .PP | |
15 | The parser and the lexical analyzer must agree on these token numbers in order for | |
16 | communication between them to take place. | |
17 | The numbers may be chosen by Yacc, or chosen by the user. | |
18 | In either case, the ``# define'' mechanism of C is used to allow the lexical analyzer | |
19 | to return these numbers symbolically. | |
20 | For example, suppose that the token name DIGIT has been defined in the declarations section of the | |
21 | Yacc specification file. | |
22 | The relevant portion of the lexical analyzer might look like: | |
23 | .DS | |
24 | yylex(){ | |
25 | extern int yylval; | |
26 | int c; | |
27 | . . . | |
28 | c = getchar(); | |
29 | . . . | |
30 | switch( c ) { | |
31 | . . . | |
32 | case \'0\': | |
33 | case \'1\': | |
34 | . . . | |
35 | case \'9\': | |
36 | yylval = c\-\'0\'; | |
37 | return( DIGIT ); | |
38 | . . . | |
39 | } | |
40 | . . . | |
41 | .DE | |
42 | .PP | |
43 | The intent is to return a token number of DIGIT, and a value equal to the numerical value of the | |
44 | digit. | |
45 | Provided that the lexical analyzer code is placed in the programs section of the specification file, | |
46 | the identifier DIGIT will be defined as the token number associated | |
47 | with the token DIGIT. | |
48 | .PP | |
49 | This mechanism leads to clear, | |
50 | easily modified lexical analyzers; the only pitfall is the need | |
51 | to avoid using any token names in the grammar that are reserved | |
52 | or significant in C or the parser; for example, the use of | |
53 | token names | |
54 | .I if | |
55 | or | |
56 | .I while | |
57 | will almost certainly cause severe | |
58 | difficulties when the lexical analyzer is compiled. | |
59 | The token name | |
60 | .I error | |
61 | is reserved for error handling, and should not be used naively | |
62 | (see Section 7). | |
63 | .PP | |
64 | As mentioned above, the token numbers may be chosen by Yacc or by the user. | |
65 | In the default situation, the numbers are chosen by Yacc. | |
66 | The default token number for a literal | |
67 | character is the numerical value of the character in the local character set. | |
68 | Other names are assigned token numbers | |
69 | starting at 257. | |
70 | .PP | |
71 | To assign a token number to a token (including literals), | |
72 | the first appearance of the token name or literal | |
73 | .I | |
74 | in the declarations section | |
75 | .R | |
76 | can be immediately followed by | |
77 | a nonnegative integer. | |
78 | This integer is taken to be the token number of the name or literal. | |
79 | Names and literals not defined by this mechanism retain their default definition. | |
80 | It is important that all token numbers be distinct. | |
81 | .PP | |
82 | For historical reasons, the endmarker must have token | |
83 | number 0 or negative. | |
84 | This token number cannot be redefined by the user; thus, all | |
85 | lexical analyzers should be prepared to return 0 or negative as a token number | |
86 | upon reaching the end of their input. | |
87 | .PP | |
88 | A very useful tool for constructing lexical analyzers is | |
89 | the | |
90 | .I Lex | |
91 | program developed by Mike Lesk. | |
92 | .[ | |
93 | Lesk Lex | |
94 | .] | |
95 | These lexical analyzers are designed to work in close | |
96 | harmony with Yacc parsers. | |
97 | The specifications for these lexical analyzers | |
98 | use regular expressions instead of grammar rules. | |
99 | Lex can be easily used to produce quite complicated lexical analyzers, | |
100 | but there remain some languages (such as FORTRAN) which do not | |
101 | fit any theoretical framework, and whose lexical analyzers | |
102 | must be crafted by hand. |