.SH
3: Lexical Analysis
.PP
The user must supply a lexical analyzer to read the input stream and communicate tokens
(with values, if desired) to the parser.
The lexical analyzer is an integer-valued function called
.I yylex .
The function returns an integer, the
.I "token number" ,
representing the kind of token read.
If there is a value associated with that token, it should be assigned
to the external variable
.I yylval .
.PP
The parser and the lexical analyzer must agree on these token numbers in order for
communication between them to take place.
The numbers may be chosen by Yacc, or chosen by the user.
In either case, the ``# define'' mechanism of C is used to allow the lexical analyzer
to return these numbers symbolically.
For example, suppose that the token name DIGIT has been defined in the declarations section of the
Yacc specification file.
The relevant portion of the lexical analyzer might look like:
.DS
yylex(){
        extern int yylval;
        int c;
        . . .
        c = getchar();
        . . .
        switch( c ) {
        . . .
        case \'0\':
        case \'1\':
        . . .
        case \'9\':
                yylval = c\-\'0\';
                return( DIGIT );
        . . .
                }
        . . .
.DE
.PP
The intent is to return a token number of DIGIT, and a value equal to the numerical value of the
digit.
Provided that the lexical analyzer code is placed in the programs section of the specification file,
the identifier DIGIT will be defined as the token number associated
with the token DIGIT.
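.PP
As a sketch of how these pieces might fit together (the grammar rule shown here is invented purely for illustration, and is not part of the example above), the specification file could be laid out with the token declared in the declarations section and the lexical analyzer given in the programs section:
.DS
%token  DIGIT
%%
line    :       DIGIT
        ;
%%
yylex(){
        . . .
        }
.DE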
.PP
This mechanism leads to clear,
easily modified lexical analyzers; the only pitfall is the need
to avoid using any token names in the grammar that are reserved
or significant in C or the parser; for example, the use of
token names
.I if
or
.I while
will almost certainly cause severe
difficulties when the lexical analyzer is compiled.
The token name
.I error
is reserved for error handling, and should not be used naively
(see Section 7).
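.PP
The difficulty comes directly from the ``# define'' mechanism: a declaration such as
.DS
%token  if
.DE
would lead Yacc to produce a definition of the form
.DS
# define if 257
.DE
(the exact number is unimportant), and once this definition is seen by the C compiler, every later use of the keyword
.I if
in the lexical analyzer becomes illegal.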
.PP
As mentioned above, the token numbers may be chosen by Yacc or by the user.
In the default situation, the numbers are chosen by Yacc.
The default token number for a literal
character is the numerical value of the character in the local character set.
Other names are assigned token numbers
starting at 257.
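.PP
For instance, if DIGIT were the first token name declared in the specification above, the definition supplied by Yacc would amount to
.DS
# define DIGIT 257
.DE
while a literal character token such as \'+\' would receive the numerical value of + in the local character set (43 in ASCII).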
.PP
To assign a token number to a token (including literals),
the first appearance of the token name or literal
.I
in the declarations section
.R
can be immediately followed by
a nonnegative integer.
This integer is taken to be the token number of the name or literal.
Names and literals not defined by this mechanism retain their default definition.
It is important that all token numbers be distinct.
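.PP
For example, the declaration
.DS
%token  DIGIT  300
.DE
(the particular value 300 is chosen arbitrarily here) gives DIGIT token number 300 instead of a number of Yacc's choosing; other names and literals in the same specification keep their default numbers unless they too are followed by integers.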
.PP
For historical reasons, the endmarker must have token
number 0 or negative.
This token number cannot be redefined by the user; thus, all
lexical analyzers should be prepared to return 0 or negative as a token number
upon reaching the end of their input.
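.PP
In the style of the earlier fragment, the end of input might be recognized with a test like the following (this sketch assumes that input is read with getchar, which returns the standard I/O value EOF when the input is exhausted):
.DS
c = getchar();
if( c == EOF )
        return( 0 );    /* endmarker */
.DE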
.PP
A very useful tool for constructing lexical analyzers is
the
.I Lex
program developed by Mike Lesk.
.[
Lesk Lex
.]
These lexical analyzers are designed to work in close
harmony with Yacc parsers.
The specifications for these lexical analyzers
use regular expressions instead of grammar rules.
Lex can be easily used to produce quite complicated lexical analyzers,
but there remain some languages (such as FORTRAN) which do not
fit any theoretical framework, and whose lexical analyzers
must be crafted by hand.
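.PP
For comparison with the hand-written fragment above, a Lex specification covering the same single-digit token might look like the following sketch (how the definition of DIGIT is made visible to the Lex-generated analyzer depends on how the two programs are put together, and is not shown here):
.DS
%{
extern int yylval;
%}
%%
[0\-9]  { yylval = yytext[0] \- \'0\';  return( DIGIT ); }
.DE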