Commit | Line | Data |
---|---|---|
2240a03d TL |
1 | .SH |
2 | 4: How the Parser Works | |
3 | .PP | |
4 | Yacc turns the specification file into a C program, which | |
5 | parses the input according to the specification given. | |
6 | The algorithm used to go from the | |
7 | specification to the parser is complex, and will not be discussed | |
8 | here (see | |
9 | the references for more information). | |
10 | The parser itself, however, is relatively simple, | |
11 | and understanding how it works, while | |
12 | not strictly necessary, will nevertheless make | |
13 | treatment of error recovery and ambiguities much more | |
14 | comprehensible. | |
15 | .PP | |
16 | The parser produced by Yacc consists | |
17 | of a finite state machine with a stack. | |
18 | The parser is also capable of reading and remembering the next | |
19 | input token (called the | |
20 | .I lookahead | |
21 | token). | |
22 | The | |
23 | .I "current state" | |
24 | is always the one on the top of the stack. | |
25 | The states of the finite state machine are given | |
26 | small integer labels; initially, the machine is in state 0, | |
27 | the stack contains only state 0, and no lookahead token has been read. | |
28 | .PP | |
29 | The machine has only four actions available to it, called | |
30 | .I shift , | |
31 | .I reduce , | |
32 | .I accept , | |
33 | and | |
34 | .I error . | |
35 | A move of the parser is done as follows: | |
36 | .IP 1. | |
37 | Based on its current state, the parser decides | |
38 | whether it needs a lookahead token to decide | |
39 | what action should be done; if it needs one, and does | |
40 | not have one, it calls | |
41 | .I yylex | |
42 | to obtain the next token. | |
43 | .IP 2. | |
44 | Using the current state, and the lookahead token | |
45 | if needed, the parser decides on its next action, and | |
46 | carries it out. | |
47 | This may result in states being pushed onto the stack, or popped off of | |
48 | the stack, and in the lookahead token being processed | |
49 | or left alone. | |
50 | .PP | |
51 | The | |
52 | .I shift | |
53 | action is the most common action the parser takes. | |
54 | Whenever a shift action is taken, there is always | |
55 | a lookahead token. | |
56 | For example, in state 56 there may be | |
57 | an action: | |
58 | .DS | |
59 | IF shift 34 | |
60 | .DE | |
61 | which says, in state 56, if the lookahead token is IF, | |
62 | the current state (56) is pushed down on the stack, | |
63 | and state 34 becomes the current state (on the | |
64 | top of the stack). | |
65 | The lookahead token is cleared. | |
66 | .PP | |
67 | The | |
68 | .I reduce | |
69 | action keeps the stack from growing without | |
70 | bounds. | |
71 | Reduce actions are appropriate when the parser has seen | |
72 | the right hand side of a grammar rule, | |
73 | and is prepared to announce that it has seen | |
74 | an instance of the rule, replacing the right hand side | |
75 | by the left hand side. | |
76 | It may be necessary to consult the lookahead token | |
77 | to decide whether to reduce, but | |
78 | usually it is not; in fact, the default | |
79 | action (represented by a ``.'') is often a reduce action. | |
80 | .PP | |
81 | Reduce actions are associated with individual grammar rules. | |
82 | Grammar rules are also given small integer | |
83 | numbers, leading to some confusion. | |
84 | The action | |
85 | .DS | |
86 | \fB.\fR reduce 18 | |
87 | .DE | |
88 | refers to | |
89 | .I "grammar rule" | |
90 | 18, while the action | |
91 | .DS | |
92 | IF shift 34 | |
93 | .DE | |
94 | refers to | |
95 | .I state | |
96 | 34. | |
97 | .PP | |
98 | Suppose the rule being reduced is | |
99 | .DS | |
100 | A \fB:\fR x y z ; | |
101 | .DE | |
102 | The reduce action depends on the | |
103 | left hand symbol (A in this case), and the number of | |
104 | symbols on the right hand side (three in this case). | |
105 | To reduce, first pop off the top three states | |
106 | from the stack | |
107 | (In general, the number of states popped equals the number of symbols on the | |
108 | right side of the rule). | |
109 | In effect, these states were the ones | |
110 | put on the stack while recognizing | |
111 | .I x , | |
112 | .I y , | |
113 | and | |
114 | .I z , | |
115 | and no longer serve any useful purpose. | |
116 | After popping these states, a state is uncovered | |
117 | which was the state the parser was in before beginning to | |
118 | process the rule. | |
119 | Using this uncovered state, and the symbol | |
120 | on the left side of the rule, perform what is in | |
121 | effect a shift of A. | |
122 | A new state is obtained, pushed onto the stack, and parsing continues. | |
123 | There are significant differences between the processing of | |
124 | the left hand symbol and an ordinary shift of a token, | |
125 | however, so this action is called a | |
126 | .I goto | |
127 | action. | |
128 | In particular, the lookahead token is cleared by a shift, and | |
129 | is not affected by a goto. | |
130 | In any case, the uncovered state contains an entry such as: | |
131 | .DS | |
132 | A goto 20 | |
133 | .DE | |
134 | causing state 20 to be pushed | |
135 | onto the stack, and become the current state. | |
136 | .PP | |
137 | In effect, the reduce action ``turns back the clock'' in the parse, | |
138 | popping the states off the stack to go back to the | |
139 | state where the right hand side of the rule was first seen. | |
140 | The parser then behaves as if it had seen the left side at that time. | |
141 | If the right hand side of the rule is empty, | |
142 | no states are popped off of the stack: the uncovered state | |
143 | is in fact the current state. | |
144 | .PP | |
145 | The reduce action is also important in the treatment of user-supplied | |
146 | actions and values. | |
147 | When a rule is reduced, the code supplied with the rule is executed | |
148 | before the stack is adjusted. | |
149 | In addition to the stack holding the states, another stack, | |
150 | running in parallel with it, holds the values returned | |
151 | from the lexical analyzer and the actions. | |
152 | When a shift takes place, the external variable | |
153 | .I yylval | |
154 | is copied onto the value stack. | |
155 | After the return from the user code, the reduction is carried out. | |
156 | When the | |
157 | .I goto | |
158 | action is done, the external variable | |
159 | .I yyval | |
160 | is copied onto the value stack. | |
161 | The pseudo-variables $1, $2, etc., refer to the value stack. | |
162 | .PP | |
163 | The other two parser actions are conceptually much simpler. | |
164 | The | |
165 | .I accept | |
166 | action indicates that the entire input has been seen and | |
167 | that it matches the specification. | |
168 | This action appears only when the lookahead token is | |
169 | the endmarker, and indicates that the parser has successfully | |
170 | done its job. | |
171 | The | |
172 | .I error | |
173 | action, on the other hand, represents a place where the parser | |
174 | can no longer continue parsing according to the specification. | |
175 | The input tokens it has seen, together with the lookahead token, | |
176 | cannot be followed by anything that would result | |
177 | in a legal input. | |
178 | The parser reports an error, and attempts to recover the situation and | |
179 | resume parsing: the error recovery (as opposed to the detection of error) | |
180 | will be covered in Section 7. | |
181 | .PP | |
182 | It is time for an example! | |
183 | Consider the specification | |
184 | .DS | |
185 | %token DING DONG DELL | |
186 | %% | |
187 | rhyme : sound place | |
188 | ; | |
189 | sound : DING DONG | |
190 | ; | |
191 | place : DELL | |
192 | ; | |
193 | .DE | |
194 | .PP | |
195 | When Yacc is invoked with the | |
196 | .B \-v | |
197 | option, a file called | |
198 | .I y.output | |
199 | is produced, with a human-readable description of the parser. | |
200 | The | |
201 | .I y.output | |
202 | file corresponding to the above grammar (with some statistics | |
203 | stripped off the end) is: | |
204 | .DS | |
205 | state 0 | |
206 | $accept : \_rhyme $end | |
207 | ||
208 | DING shift 3 | |
209 | . error | |
210 | ||
211 | rhyme goto 1 | |
212 | sound goto 2 | |
213 | ||
214 | state 1 | |
215 | $accept : rhyme\_$end | |
216 | ||
217 | $end accept | |
218 | . error | |
219 | ||
220 | state 2 | |
221 | rhyme : sound\_place | |
222 | ||
223 | DELL shift 5 | |
224 | . error | |
225 | ||
226 | place goto 4 | |
227 | ||
228 | state 3 | |
229 | sound : DING\_DONG | |
230 | ||
231 | DONG shift 6 | |
232 | . error | |
233 | ||
234 | state 4 | |
235 | rhyme : sound place\_ (1) | |
236 | ||
237 | . reduce 1 | |
238 | ||
239 | state 5 | |
240 | place : DELL\_ (3) | |
241 | ||
242 | . reduce 3 | |
243 | ||
244 | state 6 | |
245 | sound : DING DONG\_ (2) | |
246 | ||
247 | . reduce 2 | |
248 | .DE | |
249 | Notice that, in addition to the actions for each state, there is a | |
250 | description of the parsing rules being processed in each | |
251 | state. The \_ character is used to indicate | |
252 | what has been seen, and what is yet to come, in each rule. | |
253 | Suppose the input is | |
254 | .DS | |
255 | DING DONG DELL | |
256 | .DE | |
257 | It is instructive to follow the steps of the parser while | |
258 | processing this input. | |
259 | .PP | |
260 | Initially, the current state is state 0. | |
261 | The parser needs to refer to the input in order to decide | |
262 | between the actions available in state 0, so | |
263 | the first token, | |
264 | .I DING , | |
265 | is read, becoming the lookahead token. | |
266 | The action in state 0 on | |
267 | .I DING | |
268 | is | |
269 | is ``shift 3'', so state 3 is pushed onto the stack, | |
270 | and the lookahead token is cleared. | |
271 | State 3 becomes the current state. | |
272 | The next token, | |
273 | .I DONG , | |
274 | is read, becoming the lookahead token. | |
275 | The action in state 3 on the token | |
276 | .I DONG | |
277 | is ``shift 6'', | |
278 | so state 6 is pushed onto the stack, and the lookahead is cleared. | |
279 | The stack now contains 0, 3, and 6. | |
280 | In state 6, without even consulting the lookahead, | |
281 | the parser reduces by rule 2. | |
282 | .DS | |
283 | sound : DING DONG | |
284 | .DE | |
285 | This rule has two symbols on the right hand side, so | |
286 | two states, 6 and 3, are popped off of the stack, uncovering state 0. | |
287 | Consulting the description of state 0, looking for a goto on | |
288 | .I sound , | |
289 | .DS | |
290 | sound goto 2 | |
291 | .DE | |
292 | is obtained; thus state 2 is pushed onto the stack, | |
293 | becoming the current state. | |
294 | .PP | |
295 | In state 2, the next token, | |
296 | .I DELL , | |
297 | must be read. | |
298 | The action is ``shift 5'', so state 5 is pushed onto the stack, which now has | |
299 | 0, 2, and 5 on it, and the lookahead token is cleared. | |
300 | In state 5, the only action is to reduce by rule 3. | |
301 | This has one symbol on the right hand side, so one state, 5, | |
302 | is popped off, and state 2 is uncovered. | |
303 | The goto in state 2 on | |
304 | .I place , | |
305 | the left side of rule 3, | |
306 | is state 4. | |
307 | Now, the stack contains 0, 2, and 4. | |
308 | In state 4, the only action is to reduce by rule 1. | |
309 | There are two symbols on the right, so the top two states are popped off, | |
310 | uncovering state 0 again. | |
311 | In state 0, there is a goto on | |
312 | .I rhyme | |
313 | causing the parser to enter state 1. | |
314 | In state 1, the input is read; the endmarker is obtained, | |
315 | indicated by ``$end'' in the | |
316 | .I y.output | |
317 | file. | |
318 | The action in state 1 when the endmarker is seen is to accept, | |
319 | successfully ending the parse. | |
320 | .PP | |
321 | The reader is urged to consider how the parser works | |
322 | when confronted with such incorrect strings as | |
323 | .I "DING DONG DONG" , | |
324 | .I "DING DONG" , | |
325 | .I "DING DONG DELL DELL" , | |
326 | etc. | |
327 | A few minutes spend with this and other simple examples will | |
328 | probably be repaid when problems arise in more complicated contexts. |