Bell 32V development
[unix-history] / usr / doc / yacc / ss4
CommitLineData
2240a03d
TL
1.SH
24: How the Parser Works
3.PP
4Yacc turns the specification file into a C program, which
5parses the input according to the specification given.
6The algorithm used to go from the
7specification to the parser is complex, and will not be discussed
8here (see
9the references for more information).
10The parser itself, however, is relatively simple,
11and understanding how it works, while
12not strictly necessary, will nevertheless make
13treatment of error recovery and ambiguities much more
14comprehensible.
15.PP
16The parser produced by Yacc consists
17of a finite state machine with a stack.
18The parser is also capable of reading and remembering the next
19input token (called the
20.I lookahead
21token).
22The
23.I "current state"
24is always the one on the top of the stack.
25The states of the finite state machine are given
26small integer labels; initially, the machine is in state 0,
27the stack contains only state 0, and no lookahead token has been read.
28.PP
29The machine has only four actions available to it, called
30.I shift ,
31.I reduce ,
32.I accept ,
33and
34.I error .
35A move of the parser is done as follows:
36.IP 1.
37Based on its current state, the parser decides
38whether it needs a lookahead token to decide
39what action should be done; if it needs one, and does
40not have one, it calls
41.I yylex
42to obtain the next token.
43.IP 2.
44Using the current state, and the lookahead token
45if needed, the parser decides on its next action, and
46carries it out.
47This may result in states being pushed onto the stack, or popped off of
48the stack, and in the lookahead token being processed
49or left alone.
50.PP
51The
52.I shift
53action is the most common action the parser takes.
54Whenever a shift action is taken, there is always
55a lookahead token.
56For example, in state 56 there may be
57an action:
58.DS
59 IF shift 34
60.DE
61which says, in state 56, if the lookahead token is IF,
62the current state (56) is pushed down on the stack,
63and state 34 becomes the current state (on the
64top of the stack).
65The lookahead token is cleared.
66.PP
67The
68.I reduce
69action keeps the stack from growing without
70bounds.
71Reduce actions are appropriate when the parser has seen
72the right hand side of a grammar rule,
73and is prepared to announce that it has seen
74an instance of the rule, replacing the right hand side
75by the left hand side.
76It may be necessary to consult the lookahead token
77to decide whether to reduce, but
78usually it is not; in fact, the default
79action (represented by a ``.'') is often a reduce action.
80.PP
81Reduce actions are associated with individual grammar rules.
82Grammar rules are also given small integer
83numbers, leading to some confusion.
84The action
85.DS
86 \fB.\fR reduce 18
87.DE
88refers to
89.I "grammar rule"
9018, while the action
91.DS
92 IF shift 34
93.DE
94refers to
95.I state
9634.
97.PP
98Suppose the rule being reduced is
99.DS
100A \fB:\fR x y z ;
101.DE
102The reduce action depends on the
103left hand symbol (A in this case), and the number of
104symbols on the right hand side (three in this case).
105To reduce, first pop off the top three states
106from the stack
107(In general, the number of states popped equals the number of symbols on the
108right side of the rule).
109In effect, these states were the ones
110put on the stack while recognizing
111.I x ,
112.I y ,
113and
114.I z ,
115and no longer serve any useful purpose.
116After popping these states, a state is uncovered
117which was the state the parser was in before beginning to
118process the rule.
119Using this uncovered state, and the symbol
120on the left side of the rule, perform what is in
121effect a shift of A.
122A new state is obtained, pushed onto the stack, and parsing continues.
123There are significant differences between the processing of
124the left hand symbol and an ordinary shift of a token,
125however, so this action is called a
126.I goto
127action.
128In particular, the lookahead token is cleared by a shift, and
129is not affected by a goto.
130In any case, the uncovered state contains an entry such as:
131.DS
132 A goto 20
133.DE
134causing state 20 to be pushed
135onto the stack, and become the current state.
136.PP
137In effect, the reduce action ``turns back the clock'' in the parse,
138popping the states off the stack to go back to the
139state where the right hand side of the rule was first seen.
140The parser then behaves as if it had seen the left side at that time.
141If the right hand side of the rule is empty,
142no states are popped off of the stack: the uncovered state
143is in fact the current state.
144.PP
145The reduce action is also important in the treatment of user-supplied
146actions and values.
147When a rule is reduced, the code supplied with the rule is executed
148before the stack is adjusted.
149In addition to the stack holding the states, another stack,
150running in parallel with it, holds the values returned
151from the lexical analyzer and the actions.
152When a shift takes place, the external variable
153.I yylval
154is copied onto the value stack.
155After the return from the user code, the reduction is carried out.
156When the
157.I goto
158action is done, the external variable
159.I yyval
160is copied onto the value stack.
161The pseudo-variables $1, $2, etc., refer to the value stack.
162.PP
163The other two parser actions are conceptually much simpler.
164The
165.I accept
166action indicates that the entire input has been seen and
167that it matches the specification.
168This action appears only when the lookahead token is
169the endmarker, and indicates that the parser has successfully
170done its job.
171The
172.I error
173action, on the other hand, represents a place where the parser
174can no longer continue parsing according to the specification.
175The input tokens it has seen, together with the lookahead token,
176cannot be followed by anything that would result
177in a legal input.
178The parser reports an error, and attempts to recover the situation and
179resume parsing: the error recovery (as opposed to the detection of error)
180will be covered in Section 7.
181.PP
182It is time for an example!
183Consider the specification
184.DS
185%token DING DONG DELL
186%%
187rhyme : sound place
188 ;
189sound : DING DONG
190 ;
191place : DELL
192 ;
193.DE
194.PP
195When Yacc is invoked with the
196.B \-v
197option, a file called
198.I y.output
199is produced, with a human-readable description of the parser.
200The
201.I y.output
202file corresponding to the above grammar (with some statistics
203stripped off the end) is:
204.DS
205state 0
206 $accept : \_rhyme $end
207
208 DING shift 3
209 . error
210
211 rhyme goto 1
212 sound goto 2
213
214state 1
215 $accept : rhyme\_$end
216
217 $end accept
218 . error
219
220state 2
221 rhyme : sound\_place
222
223 DELL shift 5
224 . error
225
226 place goto 4
227
228state 3
229 sound : DING\_DONG
230
231 DONG shift 6
232 . error
233
234state 4
235 rhyme : sound place\_ (1)
236
237 . reduce 1
238
239state 5
240 place : DELL\_ (3)
241
242 . reduce 3
243
244state 6
245 sound : DING DONG\_ (2)
246
247 . reduce 2
248.DE
249Notice that, in addition to the actions for each state, there is a
250description of the parsing rules being processed in each
251state. The \_ character is used to indicate
252what has been seen, and what is yet to come, in each rule.
253Suppose the input is
254.DS
255DING DONG DELL
256.DE
257It is instructive to follow the steps of the parser while
258processing this input.
259.PP
260Initially, the current state is state 0.
261The parser needs to refer to the input in order to decide
262between the actions available in state 0, so
263the first token,
264.I DING ,
265is read, becoming the lookahead token.
266The action in state 0 on
267.I DING
268is
269is ``shift 3'', so state 3 is pushed onto the stack,
270and the lookahead token is cleared.
271State 3 becomes the current state.
272The next token,
273.I DONG ,
274is read, becoming the lookahead token.
275The action in state 3 on the token
276.I DONG
277is ``shift 6'',
278so state 6 is pushed onto the stack, and the lookahead is cleared.
279The stack now contains 0, 3, and 6.
280In state 6, without even consulting the lookahead,
281the parser reduces by rule 2.
282.DS
283 sound : DING DONG
284.DE
285This rule has two symbols on the right hand side, so
286two states, 6 and 3, are popped off of the stack, uncovering state 0.
287Consulting the description of state 0, looking for a goto on
288.I sound ,
289.DS
290 sound goto 2
291.DE
292is obtained; thus state 2 is pushed onto the stack,
293becoming the current state.
294.PP
295In state 2, the next token,
296.I DELL ,
297must be read.
298The action is ``shift 5'', so state 5 is pushed onto the stack, which now has
2990, 2, and 5 on it, and the lookahead token is cleared.
300In state 5, the only action is to reduce by rule 3.
301This has one symbol on the right hand side, so one state, 5,
302is popped off, and state 2 is uncovered.
303The goto in state 2 on
304.I place ,
305the left side of rule 3,
306is state 4.
307Now, the stack contains 0, 2, and 4.
308In state 4, the only action is to reduce by rule 1.
309There are two symbols on the right, so the top two states are popped off,
310uncovering state 0 again.
311In state 0, there is a goto on
312.I rhyme
313causing the parser to enter state 1.
314In state 1, the input is read; the endmarker is obtained,
315indicated by ``$end'' in the
316.I y.output
317file.
318The action in state 1 when the endmarker is seen is to accept,
319successfully ending the parse.
320.PP
321The reader is urged to consider how the parser works
322when confronted with such incorrect strings as
323.I "DING DONG DONG" ,
324.I "DING DONG" ,
325.I "DING DONG DELL DELL" ,
326etc.
327A few minutes spend with this and other simple examples will
328probably be repaid when problems arise in more complicated contexts.