.SH
3: Lexical Analysis
.PP
The user must supply a lexical analyzer to read the input stream and communicate tokens
(with values, if desired) to the parser.
The lexical analyzer is an integer-valued function called
.I yylex .
The function returns an integer, the
.I "token number" ,
representing the kind of token read.
If there is a value associated with that token, it should be assigned
to the external variable
.I yylval .
.PP
The parser and the lexical analyzer must agree on these token numbers in order for
communication between them to take place.
The numbers may be chosen by Yacc, or chosen by the user.
In either case, the ``# define'' mechanism of C is used to allow the lexical analyzer
to return these numbers symbolically.
For example, suppose that the token name DIGIT has been defined in the declarations section of the
Yacc specification file.
The relevant portion of the lexical analyzer might look like:
.DS
yylex(){
        extern int yylval;
        int c;
        . . .
        c = getchar();
        . . .
        switch( c ) {
        . . .
        case \'0\':
        case \'1\':
        . . .
        case \'9\':
                yylval = c\-\'0\';
                return( DIGIT );
        . . .
                }
        . . .
.DE
.PP
The intent is to return a token number of DIGIT, and a value equal to the numerical value of the
digit.
Provided that the lexical analyzer code is placed in the programs section of the specification file,
the identifier DIGIT will be defined as the token number associated
with the token DIGIT.
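.PP
As a sketch of how these pieces might fit together (the grammar rule shown here is invented purely for illustration, and is not part of the example above), the specification file could be laid out with the token declared in the declarations section and the lexical analyzer given in the programs section:
.DS
%token  DIGIT
%%
line    :       DIGIT
        ;
%%
yylex(){
        . . .
        }
.DE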
.PP
This mechanism leads to clear,
easily modified lexical analyzers; the only pitfall is the need
to avoid using any token names in the grammar that are reserved
or significant in C or the parser; for example, the use of
token names
.I if
or
.I while
will almost certainly cause severe
difficulties when the lexical analyzer is compiled.
The token name
.I error
is reserved for error handling, and should not be used naively
(see Section 7).
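.PP
The difficulty comes directly from the ``# define'' mechanism: a declaration such as
.DS
%token  if
.DE
would lead Yacc to produce a definition of the form
.DS
# define if 257
.DE
(the exact number is unimportant), and once this definition is seen by the C compiler, every later use of the keyword
.I if
in the lexical analyzer becomes illegal.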
.PP
As mentioned above, the token numbers may be chosen by Yacc or by the user.
In the default situation, the numbers are chosen by Yacc.
The default token number for a literal
character is the numerical value of the character in the local character set.
Other names are assigned token numbers
starting at 257.
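.PP
For instance, if DIGIT were the first token name declared in the specification above, the definition supplied by Yacc would amount to
.DS
# define DIGIT 257
.DE
while a literal character token such as \'+\' would receive the numerical value of + in the local character set (43 in ASCII).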
.PP
To assign a token number to a token (including literals),
the first appearance of the token name or literal
.I
in the declarations section
.R
can be immediately followed by
a nonnegative integer.
This integer is taken to be the token number of the name or literal.
Names and literals not defined by this mechanism retain their default definition.
It is important that all token numbers be distinct.
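.PP
For example, the declaration
.DS
%token  DIGIT  300
.DE
(the particular value 300 is chosen arbitrarily here) gives DIGIT token number 300 instead of a number of Yacc's choosing; other names and literals in the same specification keep their default numbers unless they too are followed by integers.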
.PP
For historical reasons, the endmarker must have token
number 0 or negative.
This token number cannot be redefined by the user; thus, all
lexical analyzers should be prepared to return 0 or negative as a token number
upon reaching the end of their input.
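.PP
In the style of the earlier fragment, the end of input might be recognized with a test like the following (this sketch assumes that input is read with getchar, which returns the standard I/O value EOF when the input is exhausted):
.DS
c = getchar();
if( c == EOF )
        return( 0 );    /* endmarker */
.DE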
.PP
A very useful tool for constructing lexical analyzers is
the
.I Lex
program developed by Mike Lesk.
.[
Lesk Lex
.]
These lexical analyzers are designed to work in close
harmony with Yacc parsers.
The specifications for these lexical analyzers
use regular expressions instead of grammar rules.
Lex can be easily used to produce quite complicated lexical analyzers,
but there remain some languages (such as FORTRAN) which do not
fit any theoretical framework, and whose lexical analyzers
must be crafted by hand.
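.PP
For comparison with the hand-written fragment above, a Lex specification covering the same single-digit token might look like the following sketch (how the definition of DIGIT is made visible to the Lex-generated analyzer depends on how the two programs are put together, and is not shown here):
.DS
%{
extern int yylval;
%}
%%
[0\-9]  { yylval = yytext[0] \- \'0\';  return( DIGIT ); }
.DE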