[unix-history] / usr / doc / yacc / ss4

.SH
4: How the Parser Works
.PP
Yacc turns the specification file into a C program, which
parses the input according to the specification given.
The algorithm used to go from the
specification to the parser is complex, and will not be discussed
here (see
the references for more information).
The parser itself, however, is relatively simple,
and understanding how it works, while
not strictly necessary, will nevertheless make
treatment of error recovery and ambiguities much more
comprehensible.
.PP
The parser produced by Yacc consists
of a finite state machine with a stack.
The parser is also capable of reading and remembering the next
input token (called the
.I lookahead
token).
The
.I "current state"
is always the one on the top of the stack.
The states of the finite state machine are given
small integer labels; initially, the machine is in state 0,
the stack contains only state 0, and no lookahead token has been read.
.PP
The machine has only four actions available to it, called
.I shift ,
.I reduce ,
.I accept ,
and
.I error .
A move of the parser is done as follows:
.IP 1.
Based on its current state, the parser decides
whether it needs a lookahead token to decide
what action should be done; if it needs one, and does
not have one, it calls
.I yylex
to obtain the next token.
.IP 2.
Using the current state, and the lookahead token
if needed, the parser decides on its next action, and
carries it out.
This may result in states being pushed onto the stack, or popped off of
the stack, and in the lookahead token being processed
or left alone.
.PP
The
.I shift
action is the most common action the parser takes.
Whenever a shift action is taken, there is always
a lookahead token.
For example, in state 56 there may be
an action:
.DS
	IF	shift 34
.DE
which says, in state 56, if the lookahead token is IF,
the current state (56) is pushed down on the stack,
and state 34 becomes the current state (on the
top of the stack).
The lookahead token is cleared.
.PP
The
.I reduce
action keeps the stack from growing without
bounds.
Reduce actions are appropriate when the parser has seen
the right hand side of a grammar rule,
and is prepared to announce that it has seen
an instance of the rule, replacing the right hand side
by the left hand side.
It may be necessary to consult the lookahead token
to decide whether to reduce, but
usually it is not; in fact, the default
action (represented by a ``.'') is often a reduce action.
.PP
Reduce actions are associated with individual grammar rules.
Grammar rules are also given small integer
numbers, leading to some confusion.
The action
.DS
	\fB.\fR	reduce 18
.DE
refers to
.I "grammar rule"
18, while the action
.DS
	IF	shift 34
.DE
refers to
.I state
34.
.PP
Suppose the rule being reduced is
.DS
A	\fB:\fR	x  y  z    ;
.DE
The reduce action depends on the
left hand symbol (A in this case), and the number of
symbols on the right hand side (three in this case).
To reduce, first pop off the top three states
from the stack
(In general, the number of states popped equals the number of symbols on the
right side of the rule).
In effect, these states were the ones
put on the stack while recognizing
.I x ,
.I y ,
and
.I z ,
and no longer serve any useful purpose.
After popping these states, a state is uncovered
which was the state the parser was in before beginning to
process the rule.
Using this uncovered state, and the symbol
on the left side of the rule, perform what is in
effect a shift of A.
A new state is obtained, pushed onto the stack, and parsing continues.
There are significant differences between the processing of
the left hand symbol and an ordinary shift of a token,
however, so this action is called a
.I goto
action.
In particular, the lookahead token is cleared by a shift, and
is not affected by a goto.
In any case, the uncovered state contains an entry such as:
.DS
	A	goto 20
.DE
causing state 20 to be pushed
onto the stack, and become the current state.
.PP
In effect, the reduce action ``turns back the clock'' in the parse,
popping the states off the stack to go back to the
state where the right hand side of the rule was first seen.
The parser then behaves as if it had seen the left side at that time.
If the right hand side of the rule is empty,
no states are popped off of the stack: the uncovered state
is in fact the current state.
.PP
The reduce action is also important in the treatment of user-supplied
actions and values.
When a rule is reduced, the code supplied with the rule is executed
before the stack is adjusted.
In addition to the stack holding the states, another stack,
running in parallel with it, holds the values returned
from the lexical analyzer and the actions.
When a shift takes place, the external variable
.I yylval
is copied onto the value stack.
After the return from the user code, the reduction is carried out.
When the
.I goto
action is done, the external variable
.I yyval
is copied onto the value stack.
The pseudo-variables $1, $2, etc., refer to the value stack.
.PP
The other two parser actions are conceptually much simpler.
The
.I accept
action indicates that the entire input has been seen and
that it matches the specification.
This action appears only when the lookahead token is 
the endmarker, and indicates that the parser has successfully
done its job.
The
.I error
action, on the other hand, represents a place where the parser
can no longer continue parsing according to the specification.
The input tokens it has seen, together with the lookahead token,
cannot be followed by anything that would result
in a legal input.
The parser reports an error, and attempts to recover the situation and
resume parsing: the error recovery (as opposed to the detection of error)
will be covered in Section 7.
.PP
It is time for an example!
Consider the specification
.DS
%token  DING  DONG  DELL
%%
rhyme	:	sound  place
	;
sound	:	DING  DONG
	;
place	:	DELL
	;
.DE
.PP
When Yacc is invoked with the
.B \-v
option, a file called
.I y.output
is produced, with a human-readable description of the parser.
The
.I y.output
file corresponding to the above grammar (with some statistics
stripped off the end) is:
.DS
state 0
	$accept  :  \_rhyme  $end 

	DING  shift 3
	.  error

	rhyme  goto 1
	sound  goto 2

state 1
	$accept  :   rhyme\_$end 

	$end  accept
	.  error

state 2
	rhyme  :   sound\_place 

	DELL  shift 5
	.  error

	place   goto 4

state 3
	sound   :   DING\_DONG 

	DONG  shift 6
	.  error

state 4
	rhyme  :   sound  place\_    (1)

	.   reduce  1

state 5
	place  :   DELL\_    (3)

	.   reduce  3

state 6
	sound   :   DING  DONG\_    (2)

	.   reduce  2
.DE
Notice that, in addition to the actions for each state, there is a
description of the parsing rules being processed in each
state.  The \_ character is used to indicate
what has been seen, and what is yet to come, in each rule.
Suppose the input is
.DS
DING  DONG  DELL
.DE
It is instructive to follow the steps of the parser while
processing this input.
.PP
Initially, the current state is state 0.
The parser needs to refer to the input in order to decide
between the actions available in state 0, so
the first token,
.I DING ,
is read, becoming the lookahead token.
The action in state 0 on
.I DING
is
is ``shift 3'', so state 3 is pushed onto the stack,
and the lookahead token is cleared.
State 3 becomes the current state.
The next token,
.I DONG ,
is read, becoming the lookahead token.
The action in state 3 on the token
.I DONG
is ``shift 6'',
so state 6 is pushed onto the stack, and the lookahead is cleared.
The stack now contains 0, 3, and 6.
In state 6, without even consulting the lookahead,
the parser reduces by rule 2.
.DS
	sound  :   DING  DONG
.DE
This rule has two symbols on the right hand side, so
two states, 6 and 3, are popped off of the stack, uncovering state 0.
Consulting the description of state 0, looking for a goto on 
.I sound ,
.DS
	sound	goto 2
.DE
is obtained; thus state 2 is pushed onto the stack,
becoming the current state.
.PP
In state 2, the next token,
.I DELL ,
must be read.
The action is ``shift 5'', so state 5 is pushed onto the stack, which now has
0, 2, and 5 on it, and the lookahead token is cleared.
In state 5, the only action is to reduce by rule 3.
This has one symbol on the right hand side, so one state, 5,
is popped off, and state 2 is uncovered.
The goto in state 2 on
.I place ,
the left side of rule 3,
is state 4.
Now, the stack contains 0, 2, and 4.
In state 4, the only action is to reduce by rule 1.
There are two symbols on the right, so the top two states are popped off,
uncovering state 0 again.
In state 0, there is a goto on
.I rhyme
causing the parser to enter state 1.
In state 1, the input is read; the endmarker is obtained,
indicated by ``$end'' in the
.I y.output
file.
The action in state 1 when the endmarker is seen is to accept,
successfully ending the parse.
.PP
The reader is urged to consider how the parser works
when confronted with such incorrect strings as
.I "DING DONG DONG" ,
.I "DING DONG" ,
.I "DING DONG DELL DELL" ,
etc.
A few minutes spend with this and other simple examples will
probably be repaid when problems arise in more complicated contexts.
Commit	Line	Data
2240a03d TL	1	.SH
	2	4: How the Parser Works
	3	.PP
	4	Yacc turns the specification file into a C program, which
	5	parses the input according to the specification given.
	6	The algorithm used to go from the
	7	specification to the parser is complex, and will not be discussed
	8	here (see
	9	the references for more information).
	10	The parser itself, however, is relatively simple,
	11	and understanding how it works, while
	12	not strictly necessary, will nevertheless make
	13	treatment of error recovery and ambiguities much more
	14	comprehensible.
	15	.PP
	16	The parser produced by Yacc consists
	17	of a finite state machine with a stack.
	18	The parser is also capable of reading and remembering the next
	19	input token (called the
	20	.I lookahead
	21	token).
	22	The
	23	.I "current state"
	24	is always the one on the top of the stack.
	25	The states of the finite state machine are given
	26	small integer labels; initially, the machine is in state 0,
	27	the stack contains only state 0, and no lookahead token has been read.
	28	.PP
	29	The machine has only four actions available to it, called
	30	.I shift ,
	31	.I reduce ,
	32	.I accept ,
	33	and
	34	.I error .
	35	A move of the parser is done as follows:
	36	.IP 1.
	37	Based on its current state, the parser decides
	38	whether it needs a lookahead token to decide
	39	what action should be done; if it needs one, and does
	40	not have one, it calls
	41	.I yylex
	42	to obtain the next token.
	43	.IP 2.
	44	Using the current state, and the lookahead token
	45	if needed, the parser decides on its next action, and
	46	carries it out.
	47	This may result in states being pushed onto the stack, or popped off of
	48	the stack, and in the lookahead token being processed
	49	or left alone.
	50	.PP
	51	The
	52	.I shift
	53	action is the most common action the parser takes.
	54	Whenever a shift action is taken, there is always
	55	a lookahead token.
	56	For example, in state 56 there may be
	57	an action:
	58	.DS
	59	IF shift 34
	60	.DE
	61	which says, in state 56, if the lookahead token is IF,
	62	the current state (56) is pushed down on the stack,
	63	and state 34 becomes the current state (on the
	64	top of the stack).
65	The lookahead token is cleared.
66	.PP
67	The
68	.I reduce
69	action keeps the stack from growing without
70	bounds.
71	Reduce actions are appropriate when the parser has seen
72	the right hand side of a grammar rule,
73	and is prepared to announce that it has seen
74	an instance of the rule, replacing the right hand side
75	by the left hand side.
76	It may be necessary to consult the lookahead token
77	to decide whether to reduce, but
78	usually it is not; in fact, the default
79	action (represented by a ``.'') is often a reduce action.
80	.PP
81	Reduce actions are associated with individual grammar rules.
82	Grammar rules are also given small integer
83	numbers, leading to some confusion.
84	The action
85	.DS
86	\fB.\fR reduce 18
87	.DE
88	refers to
89	.I "grammar rule"
90	18, while the action
91	.DS
92	IF shift 34
93	.DE
94	refers to
95	.I state
96	34.
97	.PP
98	Suppose the rule being reduced is
99	.DS
100	A \fB:\fR x y z ;
101	.DE
102	The reduce action depends on the
103	left hand symbol (A in this case), and the number of
104	symbols on the right hand side (three in this case).
105	To reduce, first pop off the top three states
106	from the stack
107	(In general, the number of states popped equals the number of symbols on the
108	right side of the rule).
109	In effect, these states were the ones
110	put on the stack while recognizing
111	.I x ,
112	.I y ,
113	and
114	.I z ,
115	and no longer serve any useful purpose.
116	After popping these states, a state is uncovered
117	which was the state the parser was in before beginning to
118	process the rule.
119	Using this uncovered state, and the symbol
120	on the left side of the rule, perform what is in
121	effect a shift of A.
122	A new state is obtained, pushed onto the stack, and parsing continues.
123	There are significant differences between the processing of
124	the left hand symbol and an ordinary shift of a token,
125	however, so this action is called a
126	.I goto
127	action.
128	In particular, the lookahead token is cleared by a shift, and
129	is not affected by a goto.
130	In any case, the uncovered state contains an entry such as:
131	.DS
132	A goto 20
133	.DE
134	causing state 20 to be pushed
135	onto the stack, and become the current state.
136	.PP
137	In effect, the reduce action ``turns back the clock'' in the parse,
138	popping the states off the stack to go back to the
139	state where the right hand side of the rule was first seen.
140	The parser then behaves as if it had seen the left side at that time.
141	If the right hand side of the rule is empty,
142	no states are popped off of the stack: the uncovered state
143	is in fact the current state.
144	.PP
145	The reduce action is also important in the treatment of user-supplied
146	actions and values.
147	When a rule is reduced, the code supplied with the rule is executed
148	before the stack is adjusted.
149	In addition to the stack holding the states, another stack,
150	running in parallel with it, holds the values returned
151	from the lexical analyzer and the actions.
152	When a shift takes place, the external variable
153	.I yylval
154	is copied onto the value stack.
155	After the return from the user code, the reduction is carried out.
156	When the
157	.I goto
158	action is done, the external variable
159	.I yyval
160	is copied onto the value stack.
161	The pseudo-variables $1, $2, etc., refer to the value stack.
162	.PP
163	The other two parser actions are conceptually much simpler.
164	The
165	.I accept
166	action indicates that the entire input has been seen and
167	that it matches the specification.
168	This action appears only when the lookahead token is
169	the endmarker, and indicates that the parser has successfully
170	done its job.
171	The
172	.I error
173	action, on the other hand, represents a place where the parser
174	can no longer continue parsing according to the specification.
175	The input tokens it has seen, together with the lookahead token,
176	cannot be followed by anything that would result
177	in a legal input.
178	The parser reports an error, and attempts to recover the situation and
179	resume parsing: the error recovery (as opposed to the detection of error)
180	will be covered in Section 7.
181	.PP
182	It is time for an example!
183	Consider the specification
184	.DS
185	%token DING DONG DELL
186	%%
187	rhyme : sound place
188	;
189	sound : DING DONG
190	;
191	place : DELL
192	;
193	.DE
194	.PP
195	When Yacc is invoked with the
196	.B \-v
197	option, a file called
198	.I y.output
199	is produced, with a human-readable description of the parser.
200	The
201	.I y.output
202	file corresponding to the above grammar (with some statistics
203	stripped off the end) is:
204	.DS
205	state 0
206	$accept : \_rhyme $end
207
208	DING shift 3
209	. error
210
211	rhyme goto 1
212	sound goto 2
213
214	state 1
215	$accept : rhyme\_$end
216
217	$end accept
218	. error
219
220	state 2
221	rhyme : sound\_place
222
223	DELL shift 5
224	. error
225
226	place goto 4
227
228	state 3
229	sound : DING\_DONG
230
231	DONG shift 6
232	. error
233
234	state 4
235	rhyme : sound place\_ (1)
236
237	. reduce 1
238
239	state 5
240	place : DELL\_ (3)
241
242	. reduce 3
243
244	state 6
245	sound : DING DONG\_ (2)
246
247	. reduce 2
248	.DE
249	Notice that, in addition to the actions for each state, there is a
250	description of the parsing rules being processed in each
251	state. The \_ character is used to indicate
252	what has been seen, and what is yet to come, in each rule.
253	Suppose the input is
254	.DS
255	DING DONG DELL
256	.DE
257	It is instructive to follow the steps of the parser while
258	processing this input.
259	.PP
260	Initially, the current state is state 0.
261	The parser needs to refer to the input in order to decide
262	between the actions available in state 0, so
263	the first token,
264	.I DING ,
265	is read, becoming the lookahead token.
266	The action in state 0 on
267	.I DING
268	is
269	is ``shift 3'', so state 3 is pushed onto the stack,
270	and the lookahead token is cleared.
271	State 3 becomes the current state.
272	The next token,
273	.I DONG ,
274	is read, becoming the lookahead token.
275	The action in state 3 on the token
276	.I DONG
277	is ``shift 6'',
278	so state 6 is pushed onto the stack, and the lookahead is cleared.
279	The stack now contains 0, 3, and 6.
280	In state 6, without even consulting the lookahead,
281	the parser reduces by rule 2.
282	.DS
283	sound : DING DONG
284	.DE
285	This rule has two symbols on the right hand side, so
286	two states, 6 and 3, are popped off of the stack, uncovering state 0.
287	Consulting the description of state 0, looking for a goto on
288	.I sound ,
289	.DS
290	sound goto 2
291	.DE
292	is obtained; thus state 2 is pushed onto the stack,
293	becoming the current state.
294	.PP
295	In state 2, the next token,
296	.I DELL ,
297	must be read.
298	The action is ``shift 5'', so state 5 is pushed onto the stack, which now has
299	0, 2, and 5 on it, and the lookahead token is cleared.
300	In state 5, the only action is to reduce by rule 3.
301	This has one symbol on the right hand side, so one state, 5,
302	is popped off, and state 2 is uncovered.
303	The goto in state 2 on
304	.I place ,
305	the left side of rule 3,
306	is state 4.
307	Now, the stack contains 0, 2, and 4.
308	In state 4, the only action is to reduce by rule 1.
309	There are two symbols on the right, so the top two states are popped off,
310	uncovering state 0 again.
311	In state 0, there is a goto on
312	.I rhyme
313	causing the parser to enter state 1.
314	In state 1, the input is read; the endmarker is obtained,
315	indicated by ``$end'' in the
316	.I y.output
317	file.
318	The action in state 1 when the endmarker is seen is to accept,
319	successfully ending the parse.
320	.PP
321	The reader is urged to consider how the parser works
322	when confronted with such incorrect strings as
323	.I "DING DONG DONG" ,
324	.I "DING DONG" ,
325	.I "DING DONG DELL DELL" ,
326	etc.
327	A few minutes spend with this and other simple examples will
328	probably be repaid when problems arise in more complicated contexts.