Commit | Line | Data |
---|---|---|
86530b38 AT |
1 | =head1 NAME |
2 | ||
3 | Parse::RecDescent - Generate Recursive-Descent Parsers | |
4 | ||
5 | =head1 VERSION | |
6 | ||
7 | This document describes version 1.79 of Parse::RecDescent, | |
8 | released August 21, 2000. | |
9 | ||
10 | =head1 SYNOPSIS | |
11 | ||
12 | use Parse::RecDescent; | |
13 | ||
14 | # Generate a parser from the specification in $grammar: | |
15 | ||
16 | $parser = new Parse::RecDescent ($grammar); | |
17 | ||
18 | # Generate a parser from the specification in $othergrammar | |
19 | ||
20 | $anotherparser = new Parse::RecDescent ($othergrammar); | |
21 | ||
22 | ||
23 | # Parse $text using rule 'startrule' (which must be | |
24 | # defined in $grammar): | |
25 | ||
26 | $parser->startrule($text); | |
27 | ||
28 | ||
29 | # Parse $text using rule 'otherrule' (which must also | |
30 | # be defined in $grammar): | |
31 | ||
32 | $parser->otherrule($text); | |
33 | ||
34 | ||
35 | # Change the universal token prefix pattern | |
36 | # (the default is: '\s*'): | |
37 | ||
38 | $Parse::RecDescent::skip = '[ \t]+'; | |
39 | ||
40 | ||
41 | # Replace productions of existing rules (or create new ones) | |
42 | # with the productions defined in $newgrammar: | |
43 | ||
44 | $parser->Replace($newgrammar); | |
45 | ||
46 | ||
47 | # Extend existing rules (or create new ones) | |
48 | # by adding extra productions defined in $moregrammar: | |
49 | ||
50 | $parser->Extend($moregrammar); | |
51 | ||
52 | ||
53 | # Global flags (useful as command line arguments under -s): | |
54 | ||
55 | $::RD_ERRORS # unless undefined, report fatal errors | |
56 | $::RD_WARN # unless undefined, also report non-fatal problems | |
57 | $::RD_HINT # if defined, also suggest remedies | |
58 | $::RD_TRACE # if defined, also trace parsers' behaviour | |
59 | $::RD_AUTOSTUB # if defined, generates "stubs" for undefined rules | |
60 | $::RD_AUTOACTION # if defined, appends specified action to productions | |
61 | ||
62 | ||
63 | =head1 DESCRIPTION | |
64 | ||
65 | =head2 Overview | |
66 | ||
67 | Parse::RecDescent incrementally generates top-down recursive-descent text | |
68 | parsers from simple I<yacc>-like grammar specifications. It provides: | |
69 | ||
70 | =over 4 | |
71 | ||
72 | =item * | |
73 | ||
74 | Regular expressions or literal strings as terminals (tokens), | |
75 | ||
76 | =item * | |
77 | ||
78 | Multiple (non-contiguous) productions for any rule, | |
79 | ||
80 | =item * | |
81 | ||
82 | Repeated and optional subrules within productions, | |
83 | ||
84 | =item * | |
85 | ||
86 | Full access to Perl within actions specified as part of the grammar, | |
87 | ||
88 | =item * | |
89 | ||
90 | Simple automated error reporting during parser generation and parsing, | |
91 | ||
92 | =item * | |
93 | ||
94 | The ability to commit to, uncommit to, or reject particular | |
95 | productions during a parse, | |
96 | ||
97 | =item * | |
98 | ||
99 | The ability to pass data up and down the parse tree ("down" via subrule | |
100 | argument lists, "up" via subrule return values), | |
101 | ||
102 | =item * | |
103 | ||
104 | Incremental extension of the parsing grammar (even during a parse), | |
105 | ||
106 | =item * | |
107 | ||
108 | Precompilation of parser objects, | |
109 | ||
110 | =item * | |
111 | ||
112 | User-definable reduce-reduce conflict resolution via | |
113 | "scoring" of matching productions. | |
114 | ||
115 | =back | |
116 | ||
117 | =head2 Using C<Parse::RecDescent> | |
118 | ||
119 | Parser objects are created by calling C<Parse::RecDescent::new>, passing in a | |
120 | grammar specification (see the following subsections). If the grammar is | |
121 | correct, C<new> returns a blessed reference which can then be used to initiate | |
122 | parsing through any rule specified in the original grammar. A typical sequence | |
123 | looks like this: | |
124 | ||
125 | $grammar = q { | |
126 | # GRAMMAR SPECIFICATION HERE | |
127 | }; | |
128 | ||
129 | $parser = new Parse::RecDescent ($grammar) or die "Bad grammar!\n"; | |
130 | ||
131 | # acquire $text | |
132 | ||
133 | defined $parser->startrule($text) or print "Bad text!\n"; | |
134 | ||
135 | The rule through which parsing is initiated must be explicitly defined | |
136 | in the grammar (i.e. for the above example, the grammar must include a | |
137 | rule of the form: "startrule: <subrules>"). | |
138 | ||
139 | If the starting rule succeeds, its value (see below) | |
140 | is returned. Failure to generate the original parser or failure to match a text | |
141 | is indicated by returning C<undef>. Note that it's easy to set up grammars | |
142 | that can succeed, but which return a value of 0, "0", or "". So don't be | |
143 | tempted to write: | |
144 | ||
145 | $parser->startrule($text) or print "Bad text!\n"; | |
146 | ||
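The difference between the two tests can be sketched in plain Perl. In the
sketch below, C<parse_digit> is a hypothetical stand-in for a start rule such as
C<$parser-E<gt>startrule>. It is not part of C<Parse::RecDescent>; it only
follows the same convention of returning the matched value or C<undef>:

```perl
use strict;
use warnings;

# Hypothetical stand-in for a start rule: returns the matched digit
# on success, or undef on failure (the convention described above).
sub parse_digit {
    my ($text) = @_;
    return $text =~ /\A\s*(\d)/ ? $1 : undef;
}

my $result = parse_digit("0");

# Truthiness test: wrongly treats the successful result "0" as a failure.
my $truthy_ok = $result ? 1 : 0;

# Definedness test: correctly distinguishes success from failure.
my $defined_ok = defined $result ? 1 : 0;

print "truthy: $truthy_ok, defined: $defined_ok\n";
```

Here the match succeeds, yet only the C<defined> test reports it as a success.
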
147 | Normally, the parser has no effect on the original text. So in the | |
148 | previous example the value of $text would be unchanged after having | |
149 | been parsed. | |
150 | ||
151 | If, however, the text to be matched is passed by reference: | |
152 | ||
153 | $parser->startrule(\$text) | |
154 | ||
155 | then any text which was consumed during the match will be removed from the | |
156 | start of $text. | |
157 | ||
158 | ||
159 | =head2 Rules | |
160 | ||
161 | In the grammar from which the parser is built, rules are specified by | |
162 | giving an identifier (which must satisfy /[A-Za-z]\w*/), followed by a | |
163 | colon I<on the same line>, followed by one or more productions, | |
164 | separated by single vertical bars. The layout of the productions | |
165 | is entirely free-format: | |
166 | ||
167 | rule1: production1 | |
168 | | production2 | | |
169 | production3 | production4 | |
170 | ||
171 | At any point in the grammar previously defined rules may be extended with | |
172 | additional productions. This is achieved by redeclaring the rule with the new | |
173 | productions. Thus: | |
174 | ||
175 | rule1: a | b | c | |
176 | rule2: d | e | f | |
177 | rule1: g | h | |
178 | ||
179 | is exactly equivalent to: | |
180 | ||
181 | rule1: a | b | c | g | h | |
182 | rule2: d | e | f | |
183 | ||
184 | Each production in a rule consists of zero or more items, each of which | |
185 | may be either: the name of another rule to be matched (a "subrule"), | |
186 | a pattern or string literal to be matched directly (a "token"), a | |
187 | block of Perl code to be executed (an "action"), a special instruction | |
188 | to the parser (a "directive"), or a standard Perl comment (which is | |
189 | ignored). | |
190 | ||
191 | A rule matches a text if one of its productions matches. A production | |
192 | matches if each of its items match consecutive substrings of the | |
193 | text. The productions of a rule being matched are tried in the same | |
194 | order that they appear in the original grammar, and the first matching | |
195 | production terminates the match attempt (successfully). If all | |
196 | productions are tried and none matches, the match attempt fails. | |
197 | ||
198 | Note that this behaviour is quite different from the "prefer the longer match" | |
199 | behaviour of I<yacc>. For example, if I<yacc> were parsing the rule: | |
200 | ||
201 | seq : 'A' 'B' | |
202 | | 'A' 'B' 'C' | |
203 | ||
204 | upon matching "AB" it would look ahead to see if a 'C' is next and, if | |
205 | so, would match the second production in preference to the first. In | |
206 | other words, I<yacc> effectively tries all the productions of a rule | |
207 | breadth-first in parallel, and selects the "best" match, where "best" | |
208 | means longest (note that this is a gross simplification of the true | |
209 | behaviour of I<yacc> but it will do for our purposes). | |
210 | ||
211 | In contrast, C<Parse::RecDescent> tries each production depth-first in | |
212 | sequence, and selects the "best" match, where "best" means first. This is | |
213 | the fundamental difference between "bottom-up" and "recursive descent" | |
214 | parsing. | |
215 | ||
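The first-match behaviour described above can be sketched in plain Perl. This
is only an illustration of the strategy, not the code that
C<Parse::RecDescent> generates; the helper C<first_match> and its
representation of productions as arrays of literal tokens are assumptions made
for the example:

```perl
use strict;
use warnings;

# Try each production (an array of literal tokens) in order against the
# start of $text. Returns the first production that matches, or undef if
# none does. No attempt is made to find a longer match.
sub first_match {
    my ($text, @productions) = @_;
    PRODUCTION:
    for my $production (@productions) {
        my $rest = $text;
        for my $token (@$production) {
            # Skip optional whitespace, then require the literal token.
            next PRODUCTION unless $rest =~ s/\A\s*\Q$token\E//;
        }
        return $production;    # the first success ends the attempt
    }
    return undef;
}

my @seq = ( ['A', 'B'], ['A', 'B', 'C'] );
my $chosen = first_match("A B C", @seq);
print "matched: @$chosen\n";    # matched: A B
```

Listing the longer production first makes the same call select C<A B C>, which
is why production order matters in a recursive-descent grammar.
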
216 | Each successfully matched item in a production is assigned a value, | |
217 | which can be accessed in subsequent actions within the same | |
218 | production (or, in some cases, as the return value of a successful | |
219 | subrule call). Unsuccessful items don't have an associated value, | |
220 | since the failure of an item causes the entire surrounding production | |
221 | to immediately fail. The following sections describe the various types | |
222 | of items and their success values. | |
223 | ||
224 | ||
225 | =head2 Subrules | |
226 | ||
227 | A subrule which appears in a production is an instruction to the parser to | |
228 | attempt to match the named rule at that point in the text being | |
229 | parsed. If the named subrule is not defined when requested the | |
230 | production containing it immediately fails (unless it was "autostubbed" - see | |
231 | L<Autostubbing>). | |
232 | ||
233 | A rule may (recursively) call itself as a subrule, but I<not> as the | |
234 | left-most item in any of its productions (since such recursions are usually | |
235 | non-terminating). | |
236 | ||
237 | The value associated with a subrule is the value associated with its | |
238 | C<$return> variable (see L<"Actions"> below), or with the last successfully | |
239 | matched item in the subrule match. | |
240 | ||
241 | Subrules may also be specified with a trailing repetition specifier, | |
242 | indicating that they are to be (greedily) matched the specified number | |
243 | of times. The available specifiers are: | |
244 | ||
245 | subrule(?) # Match one-or-zero times | |
246 | subrule(s) # Match one-or-more times | |
247 | subrule(s?) # Match zero-or-more times | |
248 | subrule(N) # Match exactly N times for integer N > 0 | |
249 | subrule(N..M) # Match between N and M times | |
250 | subrule(..M) # Match between 1 and M times | |
251 | subrule(N..) # Match at least N times | |
252 | ||
253 | Repeated subrules keep matching until either the subrule fails to | |
254 | match, or it has matched the minimal number of times but fails to | |
255 | consume any of the parsed text (this second condition prevents the | |
256 | subrule matching forever in some cases). | |
257 | ||
258 | Since a repeated subrule may match many instances of the subrule itself, the | |
259 | value associated with it is not a simple scalar, but rather a reference to a | |
260 | list of scalars, each of which is the value associated with one of the | |
261 | individual subrule matches. In other words in the rule: | |
262 | ||
263 | program: statement(s) | |
264 | ||
265 | the value associated with the repeated subrule "statement(s)" is a reference | |
266 | to an array containing the values matched by each call to the individual | |
267 | subrule "statement". | |
268 | ||
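The stopping conditions and the array-reference result described above can be
sketched as follows. The helper C<match_repeated> is hypothetical and stands in
for a repeated subrule whose body is a single pattern; it is a simplification,
not the module's implementation:

```perl
use strict;
use warnings;

# Hypothetical helper: match $pattern repeatedly at the front of $$textref,
# requiring at least $min matches. Returns a reference to an array of the
# matched values, or undef if fewer than $min matches were found.
sub match_repeated {
    my ($textref, $pattern, $min) = @_;
    my @values;
    while (1) {
        my $before = length $$textref;
        # Stop when the pattern no longer matches.
        last unless $$textref =~ s/\A\s*($pattern)//;
        push @values, $1;
        # Stop once the minimum is met and the match consumed nothing,
        # which would otherwise loop forever.
        last if @values >= $min && length($$textref) == $before;
    }
    return @values >= $min ? \@values : undef;
}

my $text   = "a b c";
my $values = match_repeated(\$text, qr/\w+/, 1);
print "got: @$values\n";    # got: a b c
```
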
269 | Repetition modifiers may include a separator pattern: | |
270 | ||
271 | program: statement(s /;/) | |
272 | ||
273 | specifying some sequence of characters to be skipped between each repetition. | |
274 | This is really just a shorthand for the E<lt>leftop:...E<gt> directive | |
275 | (see below). | |
276 | ||
277 | =head2 Tokens | |
278 | ||
279 | If a quote-delimited string or a Perl regex appears in a production, | |
280 | the parser attempts to match that string or pattern at that point in | |
281 | the text. For example: | |
282 | ||
283 | typedef: "typedef" typename identifier ';' | |
284 | ||
285 | identifier: /[A-Za-z_][A-Za-z0-9_]*/ | |
286 | ||
287 | As in regular Perl, a single quoted string is uninterpolated, whilst | |
288 | a double-quoted string or a pattern is interpolated (at the time | |
289 | of matching, I<not> when the parser is constructed). Hence, it is | |
290 | possible to define rules in which tokens can be set at run-time: | |
291 | ||
292 | typedef: "$::typedefkeyword" typename identifier ';' | |
293 | ||
294 | identifier: /$::identpat/ | |
295 | ||
296 | Note that, since each rule is implemented inside a special namespace | |
297 | belonging to its parser, it is necessary to explicitly qualify | |
298 | variables from the main package. | |
299 | ||
300 | Regex tokens can be specified using just slashes as delimiters | |
301 | or with the explicit C<mE<lt>delimiterE<gt>......E<lt>delimiterE<gt>> syntax: | |
302 | ||
303 | typedef: "typedef" typename identifier ';' | |
304 | ||
305 | typename: /[A-Za-z_][A-Za-z0-9_]*/ | |
306 | ||
307 | identifier: m{[A-Za-z_][A-Za-z0-9_]*} | |
308 | ||
309 | A regex of either type can also have any valid trailing parameter(s) | |
310 | (that is, any of [cgimsox]): | |
311 | ||
312 | typedef: "typedef" typename identifier ';' | |
313 | ||
314 | identifier: / [a-z_] # LEADING ALPHA OR UNDERSCORE | |
315 | [a-z0-9_]* # THEN DIGITS ALSO ALLOWED | |
316 | /ix # CASE/SPACE/COMMENT INSENSITIVE | |
317 | ||
318 | The value associated with any successfully matched token is a string | |
319 | containing the actual text which was matched by the token. | |
320 | ||
321 | It is important to remember that, since each grammar is specified in a | |
322 | Perl string, all instances of the universal escape character '\' within | |
323 | a grammar must be "doubled", so that they interpolate to single '\'s when | |
324 | the string is compiled. For example, to use the grammar: | |
325 | ||
326 | word: /\S+/ | backslash | |
327 | line: prefix word(s) "\n" | |
328 | backslash: '\\' | |
329 | ||
330 | the following code is required: | |
331 | ||
332 | $parser = new Parse::RecDescent (q{ | |
333 | ||
334 | word: /\\S+/ | backslash | |
335 | line: prefix word(s) "\\n" | |
336 | backslash: '\\\\' | |
337 | ||
338 | }); | |
339 | ||
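What the non-interpolating C<q{...}> operator does to each backslash can be
checked in plain Perl, without building a parser. The sketch below assumes the
grammar is quoted with C<q{...}> as in the example above; an interpolating
quote such as C<qq{...}> treats backslashes differently:

```perl
use strict;
use warnings;

# Inside q{...}, a doubled backslash collapses to a single one...
my $doubled = q{word: /\\S+/};

# ...whereas a single backslash before an ordinary character is kept as-is.
my $single = q{word: /\S+/};

print "$doubled\n";    # word: /\S+/
print "$single\n";     # word: /\S+/

# The literal needs four backslashes in the source to leave two in the
# grammar text, which the grammar's own quoting then reads as one.
my $literal = q{backslash: '\\\\'};
print "$literal\n";    # backslash: '\\'
```

Doubling every backslash is therefore always safe under C<q{...}>, and it is
required wherever the grammar itself must contain a backslash pair.
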
340 | ||
341 | =head2 Terminal Separators | |
342 | ||
343 | For the purpose of matching, each terminal in a production is considered | |
344 | to be preceded by a "prefix" - a pattern which must be | |
345 | matched before a token match is attempted. By default, the | |
346 | prefix is optional whitespace (which always matches, at | |
347 | least trivially), but this default may be reset in any production. | |
348 | ||
349 | The variable C<$Parse::RecDescent::skip> stores the universal | |
350 | prefix, which is the default for all terminal matches in all parsers | |
351 | built with C<Parse::RecDescent>. | |
352 | ||
353 | The prefix for an individual production can be altered | |
354 | by using the C<E<lt>skip:...E<gt>> directive (see below). | |
355 | ||
356 | ||
357 | =head2 Actions | |
358 | ||
359 | An action is a block of Perl code which is to be executed (as the | |
360 | block of a C<do> statement) when the parser reaches that point in a | |
361 | production. The action executes within a special namespace belonging to | |
362 | the active parser, so care must be taken in correctly qualifying variable | |
363 | names (see also L<Start-up Actions> below). | |
364 | ||
365 | The action is considered to succeed if the final value of the block | |
366 | is defined (that is, if the implied C<do> statement evaluates to a | |
367 | defined value - I<even one which would be treated as "false">). Note | |
368 | that the value associated with a successful action is also the final | |
369 | value in the block. | |
370 | ||
371 | An action will I<fail> if its last evaluated value is C<undef>. This is | |
372 | surprisingly easy to accomplish by accident. For instance, here's an | |
373 | infuriating case of an action that makes its production fail, but only | |
374 | when debugging I<isn't> activated: | |
375 | ||
376 | description: name rank serial_number | |
377 | { print "Got $item[2] $item[1] ($item[3])\n" | |
378 | if $::debugging | |
379 | } | |
380 | ||
381 | If C<$debugging> is false, no statement in the block is executed, so | |
382 | the final value is C<undef>, and the entire production fails. The solution is: | |
383 | ||
384 | description: name rank serial_number | |
385 | { print "Got $item[2] $item[1] ($item[3])\n" | |
386 | if $::debugging; | |
387 | 1; | |
388 | } | |
389 | ||
390 | Within an action, a number of useful parse-time variables are | |
391 | available in the special parser namespace (there are other variables | |
392 | also accessible, but meddling with them will probably just break your | |
393 | parser. As a general rule, if you avoid referring to unqualified | |
394 | variables - especially those starting with an underscore - inside an action, | |
395 | things should be okay): | |
396 | ||
397 | =over 4 | |
398 | ||
399 | =item C<@item> and C<%item> | |
400 | ||
401 | The array slice C<@item[1..$#item]> stores the value associated with each item | |
402 | (that is, each subrule, token, or action) in the current production. The | |
403 | analogy is to C<$1>, C<$2>, etc. in a I<yacc> grammar. | |
404 | Note that, for obvious reasons, C<@item> only contains the | |
405 | values of items I<before> the current point in the production. | |
406 | ||
407 | The first element (C<$item[0]>) stores the name of the current rule | |
408 | being matched. | |
409 | ||
410 | C<@item> is a standard Perl array, so it can also be indexed with negative | |
411 | numbers, representing the number of items I<back> from the current position in | |
412 | the parse: | |
413 | ||
414 | stuff: /various/ bits 'and' pieces "then" data 'end' | |
415 | { print $item[-2] } # PRINTS data | |
416 | # (EASIER THAN: $item[6]) | |
417 | ||
418 | The C<%item> hash complements the C<@item> array, providing named | |
419 | access to the same item values: | |
420 | ||
421 | stuff: /various/ bits 'and' pieces "then" data 'end' | |
422 | { print $item{data} # PRINTS data | |
423 | # (EVEN EASIER THAN USING @item) | |
424 | } | |
425 | ||
426 | The results of named subrules are stored in the hash under each | |
427 | subrule's name, whilst all other items are stored under a "named | |
428 | positional" key that indicates their ordinal position within their item | |
429 | type: __STRINGI<n>__, __PATTERNI<n>__, __DIRECTIVEI<n>__, __ACTIONI<n>__: | |
430 | ||
431 | stuff: /various/ bits 'and' pieces "then" data 'end' { save } | |
432 | { print $item{__PATTERN1__}, # PRINTS 'various' | |
433 | $item{__STRING2__}, # PRINTS 'then' | |
434 | $item{__ACTION1__}, # PRINTS RETURN | |
435 | # VALUE OF save | |
436 | } | |
437 | ||
438 | ||
439 | If you want proper I<named> access to patterns or literals, you need to turn | |
440 | them into separate rules: | |
441 | ||
442 | stuff: various bits 'and' pieces "then" data 'end' | |
443 | { print $item{various} # PRINTS various | |
444 | } | |
445 | ||
446 | various: /various/ | |
447 | ||
448 | ||
449 | The special entry C<$item{__RULE__}> stores the name of the current | |
450 | rule (i.e. the same value as C<$item[0]>). | |
451 | ||
452 | The advantage of using C<%item>, instead of C<@item>, is that it | |
453 | removes the need to track item positions that may change as a grammar | |
454 | evolves. For example, adding an interim C<E<lt>skipE<gt>> directive | |
455 | or action can silently ruin a trailing action, by moving an C<@item> | |
456 | element "down" the array one place. In contrast, the named entry | |
457 | of C<%item> is unaffected by such an insertion. | |
458 | ||
459 | A limitation of the C<%item> hash is that it only records the I<last> | |
460 | value of a particular subrule. For example: | |
461 | ||
462 | range: '(' number '..' number ')' | |
463 | { $return = $item{number} } | |
464 | ||
465 | will return only the value corresponding to the I<second> match of the | |
466 | C<number> subrule. In other words, successive calls to a subrule | |
467 | overwrite the corresponding entry in C<%item>. Once again, the | |
468 | solution is to rename each subrule in its own rule: | |
469 | ||
470 | range: '(' from_num '..' to_num ')' | |
471 | { $return = $item{from_num} } | |
472 | ||
473 | from_num: number | |
474 | to_num: number | |
475 | ||
476 | ||
477 | ||
478 | =item C<@arg> and C<%arg> | |
479 | ||
480 | The array C<@arg> and the hash C<%arg> store any arguments passed to | |
481 | the rule from some other rule (see L<"Subrule argument lists">). Changes | |
482 | to the elements of either variable do not propagate back to the calling | |
483 | rule (data can be passed back from a subrule via the C<$return> | |
484 | variable - see next item). | |
485 | ||
486 | ||
487 | =item C<$return> | |
488 | ||
489 | If a value is assigned to C<$return> within an action, that value is | |
490 | returned if the production containing the action eventually matches | |
491 | successfully. Note that setting C<$return> I<doesn't> cause the current | |
492 | production to succeed. It merely tells it what to return if it I<does> succeed. | |
493 | Hence C<$return> is analogous to C<$$> in a I<yacc> grammar. | |
494 | ||
495 | If C<$return> is not assigned within a production, the value of the | |
496 | last component of the production (namely: C<$item[$#item]>) is | |
497 | returned if the production succeeds. | |
498 | ||
499 | ||
500 | =item C<$commit> | |
501 | ||
502 | The current state of commitment to the current production (see L<"Directives"> | |
503 | below). | |
504 | ||
505 | =item C<$skip> | |
506 | ||
507 | The current terminal prefix (see L<"Directives"> below). | |
508 | ||
509 | =item C<$text> | |
510 | ||
511 | The remaining (unparsed) text. Changes to C<$text> I<do not | |
512 | propagate> out of unsuccessful productions, but I<do> survive | |
513 | successful productions. Hence it is possible to dynamically alter the | |
514 | text being parsed - for example, to provide a C<#include>-like facility: | |
515 | ||
516 | hash_include: '#include' filename | |
517 | { $text = ::loadfile($item[2]) . $text } | |
518 | ||
519 | filename: '<' /[a-z0-9._-]+/i '>' { $return = $item[2] } | |
520 | | '"' /[a-z0-9._-]+/i '"' { $return = $item[2] } | |
521 | ||
522 | ||
523 | =item C<$thisline> and C<$prevline> | |
524 | ||
525 | C<$thisline> stores the current line number within the current parse | |
526 | (starting from 1). C<$prevline> stores the line number for the last | |
527 | character which was already successfully parsed (this will be different from | |
528 | C<$thisline> at the end of each line). | |
529 | ||
530 | For efficiency, C<$thisline> and C<$prevline> are actually tied | |
531 | scalars, and only recompute the required line number when the variable's | |
532 | value is used. | |
533 | ||
534 | Assignment to C<$thisline> adjusts the line number calculator, so that | |
535 | it believes that the current line number is the value being assigned. Note | |
536 | that this adjustment will be reflected in all subsequent line number | |
537 | calculations. | |
538 | ||
539 | Modifying the value of the variable C<$text> (as in the previous | |
540 | C<hash_include> example, for instance) will confuse the line | |
541 | counting mechanism. To prevent this, you should call | |
542 | C<Parse::RecDescent::LineCounter::resync($thisline)> I<immediately> | |
543 | after any assignment to the variable C<$text> (or, at least, before the | |
544 | next attempt to use C<$thisline>). | |
545 | ||
546 | Note that if a production fails after assigning to or | |
547 | resync'ing C<$thisline>, the parser's line counter mechanism will | |
548 | usually be corrupted. | |
549 | ||
550 | Also see the entry for C<@itempos>. | |
551 | ||
552 | The line number can be set to values other than 1, by calling the start | |
553 | rule with a second argument. For example: | |
554 | ||
555 | $parser = new Parse::RecDescent ($grammar); | |
556 | ||
557 | $parser->input($text, 10); # START LINE NUMBERS AT 10 | |
558 | ||
559 | ||
560 | =item C<$thiscolumn> and C<$prevcolumn> | |
561 | ||
562 | C<$thiscolumn> stores the current column number within the current line | |
563 | being parsed (starting from 1). C<$prevcolumn> stores the column number | |
564 | of the last character which was actually successfully parsed. Usually | |
565 | C<$prevcolumn == $thiscolumn-1>, but not at the end of lines. | |
566 | ||
567 | For efficiency, C<$thiscolumn> and C<$prevcolumn> are | |
568 | actually tied scalars, and only recompute the required column number | |
569 | when the variable's value is used. | |
570 | ||
571 | Assignment to C<$thiscolumn> or C<$prevcolumn> is a fatal error. | |
572 | ||
573 | Modifying the value of the variable C<$text> (as in the previous | |
574 | C<hash_include> example, for instance) may confuse the column | |
575 | counting mechanism. | |
576 | ||
577 | Note that C<$thiscolumn> reports the column number I<before> any | |
578 | whitespace that might be skipped before reading a token. Hence | |
579 | if you wish to know where a token started (and ended) use something like this: | |
580 | ||
581 | rule: token1 token2 startcol token3 endcol token4 | |
582 | { print "token3: columns $item[3] to $item[5]"; } | |
583 | ||
584 | startcol: // { $thiscolumn } # NEED THE // TO STEP PAST TOKEN SEP | |
585 | endcol: { $prevcolumn } | |
586 | ||
587 | Also see the entry for C<@itempos>. | |
588 | ||
589 | =item C<$thisoffset> and C<$prevoffset> | |
590 | ||
591 | C<$thisoffset> stores the offset of the current parsing position | |
592 | within the complete text | |
593 | being parsed (starting from 0). C<$prevoffset> stores the offset | |
594 | of the last character which was actually successfully parsed. In all | |
595 | cases C<$prevoffset == $thisoffset-1>. | |
596 | ||
597 | For efficiency, C<$thisoffset> and C<$prevoffset> are | |
598 | actually tied scalars, and only recompute the required offset | |
599 | when the variable's value is used. | |
600 | ||
601 | Assignment to C<$thisoffset> or C<$prevoffset> is a fatal error. | |
602 | ||
603 | Modifying the value of the variable C<$text> will I<not> affect the | |
604 | offset counting mechanism. | |
605 | ||
606 | Also see the entry for C<@itempos>. | |
607 | ||
608 | =item C<@itempos> | |
609 | ||
610 | The array C<@itempos> stores a hash reference corresponding to | |
611 | each element of C<@item>. The elements of the hash provide the | |
612 | following: | |
613 | ||
614 | $itempos[$n]{offset}{from} # VALUE OF $thisoffset BEFORE $item[$n] | |
615 | $itempos[$n]{offset}{to} # VALUE OF $prevoffset AFTER $item[$n] | |
616 | $itempos[$n]{line}{from} # VALUE OF $thisline BEFORE $item[$n] | |
617 | $itempos[$n]{line}{to} # VALUE OF $prevline AFTER $item[$n] | |
618 | $itempos[$n]{column}{from} # VALUE OF $thiscolumn BEFORE $item[$n] | |
619 | $itempos[$n]{column}{to} # VALUE OF $prevcolumn AFTER $item[$n] | |
620 | ||
621 | Note that the various C<$itempos[$n]...{from}> values record the | |
622 | appropriate value I<after> any token prefix has been skipped. | |
623 | ||
624 | Hence, instead of the somewhat tedious and error-prone: | |
625 | ||
626 | rule: startcol token1 endcol | |
627 | startcol token2 endcol | |
628 | startcol token3 endcol | |
629 | { print "token1: columns $item[1] | |
630 | to $item[3] | |
631 | token2: columns $item[4] | |
632 | to $item[6] | |
633 | token3: columns $item[7] | |
634 | to $item[9]" } | |
635 | ||
636 | startcol: // { $thiscolumn } # NEED THE // TO STEP PAST TOKEN SEP | |
637 | endcol: { $prevcolumn } | |
638 | ||
639 | it is possible to write: | |
640 | ||
641 | rule: token1 token2 token3 | |
642 | { print "token1: columns $itempos[1]{column}{from} | |
643 | to $itempos[1]{column}{to} | |
644 | token2: columns $itempos[2]{column}{from} | |
645 | to $itempos[2]{column}{to} | |
646 | token3: columns $itempos[3]{column}{from} | |
647 | to $itempos[3]{column}{to}" } | |
648 | ||
649 | Note however that (in the current implementation) the use of C<@itempos> | |
650 | anywhere in a grammar implies that item positioning information is | |
651 | collected I<everywhere> during the parse. Depending on the grammar | |
652 | and the size of the text to be parsed, this may be prohibitively | |
653 | expensive and the explicit use of C<$thisline>, C<$thiscolumn>, etc. may | |
654 | be a better choice. | |
655 | ||
656 | ||
657 | =item C<$thisparser> | |
658 | ||
659 | A reference to the S<C<Parse::RecDescent>> object through which | |
660 | parsing was initiated. | |
661 | ||
662 | The value of C<$thisparser> propagates down the subrules of a parse | |
663 | but not back up. Hence, you can invoke subrules from another parser | |
664 | for the scope of the current rule as follows: | |
665 | ||
666 | rule: subrule1 subrule2 | |
667 | | { $thisparser = $::otherparser } <reject> | |
668 | | subrule3 subrule4 | |
669 | | subrule5 | |
670 | ||
671 | The result is that the first production calls "subrule1" and "subrule2" of | |
672 | the current parser, and the remaining productions call the named subrules | |
673 | from C<$::otherparser>. Note, however, that "Bad Things" will happen if | |
674 | C<$::otherparser> isn't a blessed reference and/or doesn't have methods | |
675 | with the same names as the required subrules! | |
676 | ||
677 | =item C<$thisrule> | |
678 | ||
679 | A reference to the S<C<Parse::RecDescent::Rule>> object corresponding to the | |
680 | rule currently being matched. | |
681 | ||
682 | =item C<$thisprod> | |
683 | ||
684 | A reference to the S<C<Parse::RecDescent::Production>> object | |
685 | corresponding to the production currently being matched. | |
686 | ||
687 | =item C<$score> and C<$score_return> | |
688 | ||
689 | C<$score> stores the best production score to date, as specified by | |
690 | an earlier C<E<lt>score:...E<gt>> directive. C<$score_return> stores | |
691 | the corresponding return value for the successful production. | |
692 | ||
693 | See L<Scored productions>. | |
694 | ||
695 | =back | |
696 | ||
697 | B<Warning:> the parser relies on the information in the various C<this...> | |
698 | objects in some non-obvious ways. Tinkering with the other members of | |
699 | these objects will probably cause Bad Things to happen, unless you | |
700 | I<really> know what you're doing. The only exception to this advice is | |
701 | that the use of C<$this...-E<gt>{local}> is always safe. | |
702 | ||
703 | ||
704 | =head2 Start-up Actions | |
705 | ||
706 | Any actions which appear I<before> the first rule definition in a | |
707 | grammar are treated as "start-up" actions. Each such action is | |
708 | stripped of its outermost brackets and then evaluated (in the parser's | |
709 | special namespace) just before the rules of the grammar are first | |
710 | compiled. | |
711 | ||
712 | The main use of start-up actions is to declare local variables within the | |
713 | parser's special namespace: | |
714 | ||
715 | { my $lastitem = '???'; } | |
716 | ||
717 | list: item(s) { $return = $lastitem } | |
718 | ||
719 | item: book { $lastitem = 'book'; } | |
720 | | bell { $lastitem = 'bell'; } | |
721 | | candle { $lastitem = 'candle'; } | |
722 | ||
723 | but start-up actions can be used to execute I<any> valid Perl code | |
724 | within a parser's special namespace. | |
725 | ||
726 | Start-up actions can appear within a grammar extension or replacement | |
727 | (that is, a partial grammar installed via C<Parse::RecDescent::Extend()> or | |
728 | C<Parse::RecDescent::Replace()> - see L<Incremental Parsing>), and will be | |
729 | executed before the new grammar is installed. Note, however, that a | |
730 | particular start-up action is only ever executed once. | |
731 | ||
732 | ||
733 | =head2 Autoactions | |
734 | ||
735 | It is sometimes desirable to be able to specify a default action to be | |
736 | taken at the end of every production (for example, in order to easily | |
737 | build a parse tree). If the variable C<$::RD_AUTOACTION> is defined | |
738 | when C<Parse::RecDescent::new()> is called, the contents of that | |
739 | variable are treated as a specification of an action which is to be appended | |
740 | to each production in the corresponding grammar. So, for example, to construct | |
741 | a simple parse tree: | |
742 | ||
743 | $::RD_AUTOACTION = q { [@item] }; | |
744 | ||
745 | $parser = new Parse::RecDescent (q{ | |
746 | expression: and_expr '||' expression | and_expr | |
747 | and_expr: not_expr '&&' and_expr | not_expr | |
748 | not_expr: '!' brack_expr | brack_expr | |
749 | brack_expr: '(' expression ')' | identifier | |
750 | identifier: /[a-z]+/i | |
751 | }); | |
752 | ||
753 | which is equivalent to: | |
754 | ||
755 | $parser = new Parse::RecDescent (q{ | |
756 | expression: and_expr '||' expression | |
757 | { [@item] } | |
758 | | and_expr | |
759 | { [@item] } | |
760 | ||
761 | and_expr: not_expr '&&' and_expr | |
762 | { [@item] } | |
763 | | not_expr | |
764 | { [@item] } | |
765 | ||
766 | not_expr: '!' brack_expr | |
767 | { [@item] } | |
768 | | brack_expr | |
769 | { [@item] } | |
770 | ||
771 | brack_expr: '(' expression ')' | |
772 | { [@item] } | |
773 | | identifier | |
774 | { [@item] } | |
775 | ||
776 | identifier: /[a-z]+/i | |
777 | { [@item] } | |
778 | }); | |
779 | ||
780 | Alternatively, we could take an object-oriented approach, using a different | |
781 | class for each node (and also eliminating redundant intermediate nodes): | |
782 | ||
783 | $::RD_AUTOACTION = q | |
784 | { $#item==1 ? $item[1] : "$item[0]_node"->new(@item[1..$#item]) }; | |
785 | ||
786 | $parser = new Parse::RecDescent (q{ | |
787 | expression: and_expr '||' expression | and_expr | |
788 | and_expr: not_expr '&&' and_expr | not_expr | |
789 | not_expr: '!' brack_expr | brack_expr | |
790 | brack_expr: '(' expression ')' | identifier | |
791 | identifier: /[a-z]+/i | |
792 | }); | |
793 | ||
794 | which is equivalent to: | |
795 | ||
796 | $parser = new Parse::RecDescent (q{ | |
797 | expression: and_expr '||' expression | |
798 | { new expression_node (@item[1..3]) } | |
799 | | and_expr | |
800 | ||
801 | and_expr: not_expr '&&' and_expr | |
802 | { new and_expr_node (@item[1..3]) } | |
803 | | not_expr | |
804 | ||
805 | not_expr: '!' brack_expr | |
806 | { new not_expr_node (@item[1..2]) } | |
807 | | brack_expr | |
808 | ||
809 | brack_expr: '(' expression ')' | |
810 | { new brack_expr_node (@item[1..3]) } | |
811 | | identifier | |
812 | ||
813 | identifier: /[a-z]+/i | |
814 | { $item[1] } | |
815 | }); | |
816 | ||
817 | Note that, if a production already ends in an action, no autoaction is appended | |
818 | to it. For example, in this version: | |
819 | ||
820 | $::RD_AUTOACTION = q | |
821 | { $#item==1 ? $item[1] : "$item[0]_node"->new(@item[1..$#item]) }; | |
822 | ||
823 | $parser = new Parse::RecDescent (q{ | |
824 | expression: and_expr '||' expression | and_expr | |
825 | and_expr: not_expr '&&' and_expr | not_expr | |
826 | not_expr: '!' brack_expr | brack_expr | |
827 | brack_expr: '(' expression ')' | identifier | |
828 | identifier: /[a-z]+/i | |
829 | { new terminal_node($item[1]) } | |
830 | }); | |
831 | ||
832 | each C<identifier> match produces a C<terminal_node> object, I<not> the | |
833 | plain matched string that the autoaction would otherwise return. | |
834 | ||
835 | A level 1 warning is issued each time an "autoaction" is added to | |
836 | some production. | |
837 | ||
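The following is a small sketch of the first autoaction above in use. The input string and the use of C<Data::Dumper> to display the result are assumptions made for illustration:

```perl
use strict;
use warnings;
use Parse::RecDescent;
use Data::Dumper;

# The autoaction must be defined before the parser is constructed.
$::RD_AUTOACTION = q{ [@item] };

my $parser = Parse::RecDescent->new(q{
    expression: and_expr '||' expression | and_expr
    and_expr:   not_expr '&&' and_expr   | not_expr
    not_expr:   '!' brack_expr           | brack_expr
    brack_expr: '(' expression ')'       | identifier
    identifier: /[a-z]+/i
}) or die "Bad grammar\n";

# Every production now returns an array ref: [ rulename, item values... ]
my $tree = $parser->expression("a && !b || c");
die "Parse failed\n" unless defined $tree;

local $Data::Dumper::Indent = 1;
print Dumper($tree);
```

Each node of the dumped tree begins with the name of the rule that produced it, followed by the values of that production's items.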
838 | ||
839 | =head2 Autotrees | |
840 | ||
841 | A commonly needed autoaction is one that builds a parse-tree. It is moderately | |
842 | tricky to set up such an action (which must treat terminals differently from | |
843 | non-terminals), so Parse::RecDescent simplifies the process by providing the | |
844 | C<E<lt>autotreeE<gt>> directive. | |
845 | ||
846 | If this directive appears at the start of a grammar, it causes | |
847 | Parse::RecDescent to insert autoactions at the end of any production except | |
848 | those which already end in an action. The action inserted depends on whether | |
849 | the production is an intermediate rule (any production other than a lone | |
850 | terminal), or a terminal of the grammar (i.e. a single pattern or string item). | |
851 | ||
852 | So, for example, the following grammar: | |
853 | ||
854 | <autotree> | |
855 | ||
856 | file : command(s) | |
857 | command : get | set | vet | |
858 | get : 'get' ident ';' | |
859 | set : 'set' ident 'to' value ';' | |
860 | vet : 'check' ident 'is' value ';' | |
861 | ident : /\w+/ | |
862 | value : /\d+/ | |
863 | ||
864 | is equivalent to: | |
865 | ||
866 | file : command(s) { bless \%item, $item[0] } | |
867 | command : get { bless \%item, $item[0] } | |
868 | | set { bless \%item, $item[0] } | |
869 | | vet { bless \%item, $item[0] } | |
870 | get : 'get' ident ';' { bless \%item, $item[0] } | |
871 | set : 'set' ident 'to' value ';' { bless \%item, $item[0] } | |
872 | vet : 'check' ident 'is' value ';' { bless \%item, $item[0] } | |
873 | ||
874 | ident : /\w+/ { bless {__VALUE__=>$item[1]}, $item[0] } | |
875 | value : /\d+/ { bless {__VALUE__=>$item[1]}, $item[0] } | |
876 | ||
877 | Note that each node in the tree is blessed into a class of the same name | |
878 | as the rule itself. This makes it easy to build object-oriented | |
879 | processors for the parse-trees that the grammar produces. Note too that | |
880 | the last two rules produce special objects with the single attribute | |
881 | '__VALUE__'. This is because they consist solely of a single terminal. | |
882 | ||
883 | This autoaction-ed grammar would then produce a parse tree in a data | |
884 | structure like this: | |
885 | ||
886 | { | |
887 | file => { | |
888 | 'command(s)' => [ | |
889 | { get => { | |
890 | ident => { __VALUE__ => 'a' }, | |
891 | } }, | |
892 | { set => { | |
893 | ident => { __VALUE__ => 'b' }, | |
894 | value => { __VALUE__ => '7' }, | |
895 | } }, | |
896 | { vet => { | |
897 | ident => { __VALUE__ => 'b' }, | |
898 | value => { __VALUE__ => '7' }, | |
899 | } }, | |
900 | ], | |
901 | }, | |
902 | } | |
904 | ||
905 | (except, of course, that each nested hash would also be blessed into | |
906 | the appropriate class). | |
907 | ||
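A minimal sketch of walking such a tree follows. The input text is invented for illustration, and the C<'command(s)'> hash key is an assumption about how a repeated subrule is named in C<%item>; the code falls back to an empty list if that key is absent:

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    <autotree>

    file    : command(s)
    command : get | set | vet
    get     : 'get' ident ';'
    set     : 'set' ident 'to' value ';'
    vet     : 'check' ident 'is' value ';'
    ident   : /\w+/
    value   : /\d+/
}) or die "Bad grammar\n";

my $tree = $parser->file("get a; set b to 7; check b is 7;");
die "Parse failed\n" unless defined $tree;

# Each node is blessed into a class named after the rule that built it.
print "root class: ", ref($tree), "\n";

# Assumed key name for the repeated subrule; empty list if it is absent.
my $commands = $tree->{'command(s)'} || [];
for my $command (@$commands) {
    my ($kind) = grep { exists $command->{$_} } qw(get set vet);
    next unless defined $kind;
    print "$kind command on ", $command->{$kind}{ident}{__VALUE__}, "\n";
}
```

Since every node carries its rule name as its class, a processor for the tree can be written as one method per rule name and dispatched on C<ref($node)>.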
908 | ||
909 | =head2 Autostubbing | |
910 | ||
911 | Normally, if a subrule appears in some production, but no rule of that | |
912 | name is ever defined in the grammar, the production which refers to the | |
913 | non-existent subrule fails immediately. This typically occurs as a | |
914 | result of misspellings, and is a sufficiently common occurrence that a | |
915 | warning is generated for such situations. | |
916 | ||
917 | However, when prototyping a grammar it is sometimes useful to be | |
918 | able to use subrules before a proper specification of them is | |
919 | really possible. For example, a grammar might include a section like: | |
920 | ||
921 | function_call: identifier '(' arg(s?) ')' | |
922 | ||
923 | identifier: /[a-z]\w*/i | |
924 | ||
925 | where the possible format of an argument is sufficiently complex that | |
926 | it is not worth specifying in full until the general function call | |
927 | syntax has been debugged. In this situation it is convenient to leave | |
928 | the real rule C<arg> undefined and just slip in a placeholder (or | |
929 | "stub"): | |
930 | ||
931 | arg: 'arg' | |
932 | ||
933 | so that the function call syntax can be tested with dummy input such as: | |
934 | ||
935 | f0() | |
936 | f1(arg) | |
937 | f2(arg arg) | |
938 | f3(arg arg arg) | |
939 | ||
940 | et cetera. | |
941 | ||
942 | Early in prototyping, many such "stubs" may be required, so | |
943 | C<Parse::RecDescent> provides a means of automating their definition. | |
944 | If the variable C<$::RD_AUTOSTUB> is defined when a parser is built, | |
945 | a subrule reference to any non-existent rule (say, C<sr>), | |
946 | causes a "stub" rule of the form: | |
947 | ||
948 | sr: 'sr' | |
949 | ||
950 | to be automatically defined in the generated parser. | |
951 | A level 1 warning is issued for each such "autostubbed" rule. | |
952 | ||
953 | Hence, with C<$::RD_AUTOSTUB> defined, it is possible to only partially | |
954 | specify a grammar, and then "fake" matches of the unspecified | |
955 | (sub)rules by just typing in their name. | |
956 | ||
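A short sketch of autostubbing, reusing the C<function_call> fragment from above. The test inputs and the C<%matched> hash are illustrative, and setting C<$::RD_AUTOSTUB> to 1 is simply one way of defining it:

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Must be defined before the parser is built.
$::RD_AUTOSTUB = 1;

# The 'arg' rule is deliberately left undefined; a stub that matches
# the literal text 'arg' is expected to be generated for it.
my $parser = Parse::RecDescent->new(q{
    function_call: identifier '(' arg(s?) ')'
    identifier:    /[a-z]\w*/i
}) or die "Bad grammar\n";

my %matched;
for my $input ('f0()', 'f1(arg)', 'f2(arg arg)') {
    $matched{$input} = defined $parser->function_call($input);
    print "$input: ", ($matched{$input} ? "matched" : "failed"), "\n";
}
```

Once the real C<arg> rule is written, it simply takes the place of the stub and the rest of the grammar is unchanged.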
957 | ||
958 | ||
959 | =head2 Look-ahead | |
960 | ||
961 | If a subrule, token, or action is prefixed by "...", then it is | |
962 | treated as a "look-ahead" request. That means that the current production can | |
963 | (as usual) only succeed if the specified item is matched, but that the matching | |
964 | I<does not consume any of the text being parsed>. This is very similar to the | |
965 | C</(?=...)/> look-ahead construct in Perl patterns. Thus, the rule: | |
966 | ||
967 | inner_word: word ...word | |
968 | ||
969 | will match whatever the subrule "word" matches, provided that match is followed | |
970 | by some more text which subrule "word" would also match (although this | |
971 | second substring is not actually consumed by "inner_word"). | |
972 | ||
973 | Likewise, a "...!" prefix causes the following item to succeed (without | |
974 | consuming any text) if and only if it would normally fail. Hence, a | |
975 | rule such as: | |
976 | ||
977 | identifier: ...!keyword ...!'_' /[A-Za-z_]\w*/ | |
978 | ||
979 | matches a string of characters which satisfies the pattern | |
980 | C</[A-Za-z_]\w*/>, but only if the same sequence of characters would | |
981 | not match either subrule "keyword" or the literal token '_'. | |
982 | ||
983 | Sequences of look-ahead prefixes accumulate, multiplying their positive and/or | |
984 | negative senses. Hence: | |
985 | ||
986 | inner_word: word ...!......!word | |
987 | ||
988 | is exactly equivalent to the original example above (a warning is issued in | |
989 | cases like these, since they often indicate something left out, or | |
990 | misunderstood). | |
991 | ||
992 | Note that actions can also be treated as look-aheads. In such cases, | |
993 | the state of the parser text (in the local variable C<$text>) | |
994 | I<after> the look-ahead action is guaranteed to be identical to its | |
995 | state I<before> the action, regardless of how it's changed I<within> | |
996 | the action (unless you actually undefine C<$text>, in which case you | |
997 | get the disaster you deserve :-). | |
998 | ||
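The C<inner_word> rule can be tried out with a sketch like the following, in which the C<word> rule, the sample inputs and the C<%result> hash are assumptions added for illustration:

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    inner_word: word ...word
    word:       /\w+/
}) or die "Bad grammar\n";

my %result;
for my $input ('alpha beta', 'alpha') {
    # The rule succeeds only if a second word follows the first;
    # the look-ahead checks for it without consuming it.
    $result{$input} = defined $parser->inner_word($input);
    print "'$input': ", ($result{$input} ? "matched" : "failed"), "\n";
}
```

The second input fails because nothing follows the first word, even though C<word> on its own would have matched it.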
999 | ||
1000 | =head2 Directives | |
1001 | ||
1002 | Directives are special pre-defined actions which may be used to alter | |
1003 | the behaviour of the parser. There are currently eighteen directives: | |
1004 | C<E<lt>commitE<gt>>, | |
1005 | C<E<lt>uncommitE<gt>>, | |
1006 | C<E<lt>rejectE<gt>>, | |
1007 | C<E<lt>scoreE<gt>>, | |
1008 | C<E<lt>autoscoreE<gt>>, | |
1009 | C<E<lt>skipE<gt>>, | |
1010 | C<E<lt>resyncE<gt>>, | |
1011 | C<E<lt>errorE<gt>>, | |
1012 | C<E<lt>rulevarE<gt>>, | |
1013 | C<E<lt>matchruleE<gt>>, | |
1014 | C<E<lt>leftopE<gt>>, | |
1015 | C<E<lt>rightopE<gt>>, | |
1016 | C<E<lt>deferE<gt>>, | |
1017 | C<E<lt>nocheckE<gt>>, | |
1018 | C<E<lt>perl_quotelikeE<gt>>, | |
1019 | C<E<lt>perl_codeblockE<gt>>, | |
1020 | C<E<lt>perl_variableE<gt>>, | |
1021 | and C<E<lt>tokenE<gt>>. | |
1022 | ||
1023 | =over 4 | |
1024 | ||
1025 | =item Committing and uncommitting | |
1026 | ||
1027 | The C<E<lt>commitE<gt>> and C<E<lt>uncommitE<gt>> directives permit the recursive | |
1028 | descent of the parse tree to be pruned (or "cut") for efficiency. | |
1029 | Within a rule, a C<E<lt>commitE<gt>> directive instructs the rule to ignore subsequent | |
1030 | productions if the current production fails. For example: | |
1031 | ||
1032 | command: 'find' <commit> filename | |
1033 | | 'open' <commit> filename | |
1034 | | 'move' filename filename | |
1035 | ||
1036 | Clearly, if the leading token 'find' is matched in the first production but that | |
1037 | production fails for some other reason, then the remaining | |
1038 | productions cannot possibly match. The presence of the | |
1039 | C<E<lt>commitE<gt>> causes the "command" rule to fail immediately if | |
1040 | an invalid "find" command is found, and likewise if an invalid "open" | |
1041 | command is encountered. | |
1042 | ||
1043 | It is also possible to revoke a previous commitment. For example: | |
1044 | ||
1045 | if_statement: 'if' <commit> condition | |
1046 | 'then' block <uncommit> | |
1047 | 'else' block | |
1048 | | 'if' <commit> condition | |
1049 | 'then' block | |
1050 | ||
1051 | In this case, a failure to find an "else" block in the first | |
1052 | production shouldn't preclude trying the second production, but a | |
1053 | failure to find a "condition" certainly should. | |
1054 | ||
1055 | As a special case, any production in which the I<first> item is an | |
1056 | C<E<lt>uncommitE<gt>> immediately revokes a preceding C<E<lt>commitE<gt>> | |
1057 | (even though the production would not otherwise have been tried). For | |
1058 | example, in the rule: | |
1059 | ||
1060 | request: 'explain' expression | |
1061 | | 'explain' <commit> keyword | |
1062 | | 'save' | |
1063 | | 'quit' | |
1064 | | <uncommit> term '?' | |
1065 | ||
1066 | if the text being matched was "explain?", and the first two | |
1067 | productions failed, then the C<E<lt>commitE<gt>> in production two would cause | |
1068 | productions three and four to be skipped, but the leading | |
1069 | C<E<lt>uncommitE<gt>> in production five would allow that production to | |
1070 | attempt a match. | |
1071 | ||
1072 | Note that, in the preceding example, the C<E<lt>commitE<gt>> was only placed | |
1073 | in production two. If production one had been: | |
1074 | ||
1075 | request: 'explain' <commit> expression | |
1076 | ||
1077 | then production two would be (inappropriately) skipped if a leading | |
1078 | "explain..." was encountered. | |
1079 | ||
1080 | Both C<E<lt>commitE<gt>> and C<E<lt>uncommitE<gt>> directives always succeed, and their value | |
1081 | is always 1. | |
1082 | ||
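The effect of a commitment can be observed with a sketch such as the one below. The catch-all second production, the C<filename> pattern and the inputs are hypothetical, chosen only so that an invalid "find" command I<could> otherwise match a later production:

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    command:  'find' <commit> filename
           |  /\w+/ /.*/
    filename: /[\w.]+/
}) or die "Bad grammar\n";

my %outcome;
for my $input ('find notes.txt', 'find !!!', 'list everything') {
    # 'find !!!' matches 'find' and commits, then fails on filename,
    # so the catch-all second production is never tried.
    $outcome{$input} = defined $parser->command($input) ? 'matched' : 'failed';
    print "$input: $outcome{$input}\n";
}
```

Removing the C<E<lt>commitE<gt>> would let C<'find !!!'> fall through to the catch-all production and match there instead.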
1083 | ||
1084 | =item Rejecting a production | |
1085 | ||
1086 | The C<E<lt>rejectE<gt>> directive immediately causes the current production | |
1087 | to fail (it is exactly equivalent to, but more obvious than, the | |
1088 | action C<{undef}>). A C<E<lt>rejectE<gt>> is useful when it is desirable to get | |
1089 | the side effects of the actions in one production, without prejudicing a match | |
1090 | by some other production later in the rule. For example, to insert | |
1091 | tracing code into the parse: | |
1092 | ||
1093 | complex_rule: { print "In complex rule...\n"; } <reject> | |
1094 | ||
1095 | complex_rule: simple_rule '+' 'i' '*' simple_rule | |
1096 | | 'i' '*' simple_rule | |
1097 | | simple_rule | |
1098 | ||
1099 | ||
1100 | It is also possible to specify a conditional rejection, using the | |
1101 | form C<E<lt>reject:I<condition>E<gt>>, which only rejects if the | |
1102 | specified condition is true. This form of rejection is exactly | |
1103 | equivalent to the action C<{(I<condition>)?undef:1}>. | |
1104 | For example: | |
1105 | ||
1106 | command: save_command | |
1107 | | restore_command | |
1108 | | <reject: defined $::tolerant> { exit } | |
1109 | | <error: Unknown command. Ignored.> | |
1110 | ||
1111 | A C<E<lt>rejectE<gt>> directive never succeeds (and hence has no | |
1112 | associated value). A conditional rejection may succeed (if its | |
1113 | condition is not satisfied), in which case its value is 1. | |
1114 | ||
1115 | As an extra optimization, C<Parse::RecDescent> ignores any production | |
1116 | which I<begins> with an unconditional C<E<lt>rejectE<gt>> directive, | |
1117 | since any such production can never successfully match or have any | |
1118 | useful side-effects. A level 1 warning is issued in all such cases. | |
1119 | ||
1120 | Note that productions beginning with conditional | |
1121 | C<E<lt>reject:...E<gt>> directives are I<never> "optimized away" in | |
1122 | this manner, even if they are always guaranteed to fail (for example: | |
1123 | C<E<lt>reject:1E<gt>>). | |
1124 | ||
1125 | Due to the way grammars are parsed, there is a minor restriction on the | |
1126 | condition of a conditional C<E<lt>reject:...E<gt>>: it cannot | |
1127 | contain any raw '<' or '>' characters. For example: | |
1128 | ||
1129 | line: cmd <reject: $thiscolumn > max> data | |
1130 | ||
1131 | results in an error when a parser is built from this grammar (since the | |
1132 | grammar parser has no way of knowing whether the first > is a "greater than" | |
1133 | or the end of the C<E<lt>reject:...E<gt>>). | |
1134 | ||
1135 | To overcome this problem, put the condition inside a do{} block: | |
1136 | ||
1137 | line: cmd <reject: do{$thiscolumn > max}> data | |
1138 | ||
1139 | Note that the same problem may occur in other directives that take | |
1140 | arguments. The same solution will work in all cases. | |
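
As a sketch of the C<do{}> workaround in a complete parser (the C<octet> rule, its limit of 255 and the inputs are illustrative assumptions):

```perl
use strict;
use warnings;
use Parse::RecDescent;

# The comparison contains a raw '>', so it is wrapped in do{...}.
# The trailing action restores the number as the production's value,
# since a successful conditional reject has the value 1.
my $parser = Parse::RecDescent->new(q{
    octet: /\d+/ <reject: do{ $item[1] > 255 }> { $item[1] }
}) or die "Bad grammar\n";

my %value;
for my $input (qw(200 300)) {
    $value{$input} = $parser->octet($input);
    print "$input: ", (defined $value{$input} ? "accepted" : "rejected"), "\n";
}
```

Here the conditional rejection acts as a semantic check on a token that has already matched syntactically.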
1141 | ||
1142 | =item Skipping between terminals | |
1143 | ||
1144 | The C<E<lt>skipE<gt>> directive enables the terminal prefix used in | |
1145 | a production to be changed. For example: | |
1146 | ||
1147 | OneLiner: Command <skip:'[ \t]*'> Arg(s) /;/ | |
1148 | ||
1149 | causes only blanks and tabs to be skipped before terminals in the C<Arg> | |
1150 | subrule (and any of I<its> subrules), and also before the final C</;/> terminal. | |
1151 | Once the production is complete, the previous terminal prefix is | |
1152 | reinstated. Note that this implies that distinct productions of a rule | |
1153 | must reset their terminal prefixes individually. | |
1154 | ||
1155 | The C<E<lt>skipE<gt>> directive evaluates to the I<previous> terminal prefix, | |
1156 | so it's easy to reinstate a prefix later in a production: | |
1157 | ||
1158 | Command: <skip:","> CSV(s) <skip:$item[1]> Modifier | |
1159 | ||
1160 | The value specified after the colon is interpolated into a pattern, so all of | |
1161 | the following are equivalent (though their efficiency increases down the list): | |
1162 | ||
1163 | <skip: "$colon|$comma"> # ASSUMING THE VARS HOLD THE OBVIOUS VALUES | |
1164 | ||
1165 | <skip: ':|,'> | |
1166 | ||
1167 | <skip: q{[:,]}> | |
1168 | ||
1169 | <skip: qr/[:,]/> | |
1170 | ||
1171 | There is no way of directly setting the prefix for | |
1172 | an entire rule, except as follows: | |
1173 | ||
1174 | Rule: <skip: '[ \t]*'> Prod1 | |
1175 | | <skip: '[ \t]*'> Prod2a Prod2b | |
1176 | | <skip: '[ \t]*'> Prod3 | |
1177 | ||
1178 | or, better: | |
1179 | ||
1180 | Rule: <skip: '[ \t]*'> | |
1181 | ( | |
1182 | Prod1 | |
1183 | | Prod2a Prod2b | |
1184 | | Prod3 | |
1185 | ) | |
1186 | ||
1187 | ||
1188 | B<Note: Up to release 1.51 of Parse::RecDescent, an entirely different | |
1189 | mechanism was used for specifying terminal prefixes. The current method | |
1190 | is not backwards-compatible with that early approach. The current approach | |
1191 | is stable and will not change again.> | |
1192 | ||
1193 | ||
1194 | =item Resynchronization | |
1195 | ||
1196 | The C<E<lt>resyncE<gt>> directive provides a visually distinctive | |
1197 | means of consuming some of the text being parsed, usually to skip an | |
1198 | erroneous input. In its simplest form C<E<lt>resyncE<gt>> simply | |
1199 | consumes text up to and including the next newline (C<"\n">) | |
1200 | character, succeeding only if the newline is found, in which case it | |
1201 | causes its surrounding rule to return zero on success. | |
1202 | ||
1203 | In other words, a C<E<lt>resyncE<gt>> is exactly equivalent to the token | |
1204 | C</[^\n]*\n/> followed by the action S<C<{ $return = 0 }>> (except that | |
1205 | productions beginning with a C<E<lt>resyncE<gt>> are ignored when generating | |
1206 | error messages). A typical use might be: | |
1207 | ||
1208 | script : command(s) | |
1209 | ||
1210 | command: save_command | |
1211 | | restore_command | |
1212 | | <resync> # TRY NEXT LINE, IF POSSIBLE | |
1213 | ||
1214 | It is also possible to explicitly specify a resynchronization | |
1215 | pattern, using the C<E<lt>resync:I<pattern>E<gt>> variant. This version | |
1216 | succeeds only if the specified pattern matches (and consumes) the | |
1217 | parsed text. In other words, C<E<lt>resync:I<pattern>E<gt>> is exactly | |
1218 | equivalent to the token C</I<pattern>/> (followed by a S<C<{ $return = 0 }>> | |
1219 | action). For example, if commands were terminated by newlines or semi-colons: | |
1220 | ||
1221 | command: save_command | |
1222 | | restore_command | |
1223 | | <resync:[^;\n]*[;\n]> | |
1224 | ||
1225 | The value of a successfully matched C<E<lt>resyncE<gt>> directive (of either | |
1226 | type) is the text that it consumed. Note, however, that since the | |
1227 | directive also sets C<$return>, a production consisting of a lone | |
1228 | C<E<lt>resyncE<gt>> succeeds but returns the value zero (which a calling rule | |
1229 | may find useful to distinguish between "true" matches and "tolerant" matches). | |
1230 | Remember that returning a zero value indicates that the rule I<succeeded> (since | |
1231 | only an C<undef> denotes failure within C<Parse::RecDescent> parsers). | |
1232 | ||
1233 | ||
1234 | =item Error handling | |
1235 | ||
1236 | The C<E<lt>errorE<gt>> directive provides automatic or user-defined | |
1237 | generation of error messages during a parse. In its simplest form | |
1238 | C<E<lt>errorE<gt>> prepares an error message based on | |
1239 | the mismatch between the last item expected and the text which caused | |
1240 | it to fail. For example, given the rule: | |
1241 | ||
1242 | McCoy: curse ',' name ", I'm a doctor, not a" a_profession '!' | |
1243 | | pronoun 'dead,' name '!' | |
1244 | | <error> | |
1245 | ||
1246 | the following strings would produce the following messages: | |
1247 | ||
1248 | =over 4 | |
1249 | ||
1250 | =item "Amen, Jim!" | |
1251 | ||
1252 | ERROR (line 1): Invalid McCoy: Expected curse or pronoun | |
1253 | not found | |
1254 | ||
1255 | =item "Dammit, Jim, I'm a doctor!" | |
1256 | ||
1257 | ERROR (line 1): Invalid McCoy: Expected ", I'm a doctor, not a" | |
1258 | but found ", I'm a doctor!" instead | |
1259 | ||
1260 | =item "He's dead,\n" | |
1261 | ||
1262 | ERROR (line 2): Invalid McCoy: Expected name not found | |
1263 | ||
1264 | =item "He's alive!" | |
1265 | ||
1266 | ERROR (line 1): Invalid McCoy: Expected 'dead,' but found | |
1267 | "alive!" instead | |
1268 | ||
1269 | =item "Dammit, Jim, I'm a doctor, not a pointy-eared Vulcan!" | |
1270 | ||
1271 | ERROR (line 1): Invalid McCoy: Expected a profession but found | |
1272 | "pointy-eared Vulcan!" instead | |
1273 | ||
1274 | ||
1275 | =back | |
1276 | ||
1277 | Note that, when autogenerating error messages, all underscores in any | |
1278 | rule name used in a message are replaced by single spaces (for example | |
1279 | "a_production" becomes "a production"). Judicious choice of rule | |
1280 | names can therefore considerably improve the readability of automatic | |
1281 | error messages (as well as the maintainability of the original | |
1282 | grammar). | |
1283 | ||
1284 | If the automatically generated error is not sufficient, it is possible to | |
1285 | provide an explicit message as part of the error directive. For example: | |
1286 | ||
1287 | Spock: "Fascinating" ',' (name | 'Captain') '.' | |
1288 | | "Highly illogical, doctor." | |
1289 | | <error: He never said that!> | |
1290 | ||
1291 | which would result in I<all> failures to parse a "Spock" subrule printing the | |
1292 | following message: | |
1293 | ||
1294 | ERROR (line <N>): Invalid Spock: He never said that! | |
1295 | ||
1296 | The error message is treated as a "qq{...}" string and interpolated | |
1297 | when the error is generated (I<not> when the directive is specified!). | |
1298 | Hence: | |
1299 | ||
1300 | <error: Mystical error near "$text"> | |
1301 | ||
1302 | would correctly insert the ambient text string which caused the error. | |
1303 | ||
1304 | There are two other forms of error directive: C<E<lt>error?E<gt>> and | |
1305 | S<C<E<lt>error?: msgE<gt>>>. These behave just like C<E<lt>errorE<gt>> | |
1306 | and S<C<E<lt>error: msgE<gt>>> respectively, except that they are | |
1307 | only triggered if the rule is "committed" at the time they are | |
1308 | encountered. For example: | |
1309 | ||
1310 | Scotty: "Ya kenna change the Laws o' Phusics," <commit> name | |
1311 | | name <commit> ',' "she's goanta blaw!" | |
1312 | | <error?> | |
1313 | ||
1314 | will only generate an error for a string beginning with "Ya kenna | |
1315 | change the Laws o' Phusics," or a valid name, but which still fails to match the | |
1316 | corresponding production. That is, C<$parser-E<gt>Scotty("Aye, Cap'ain")> will | |
1317 | fail silently (since neither production will "commit" the rule on that | |
1318 | input), whereas S<C<$parser-E<gt>Scotty("Mr Spock, ah jest kenna do'ut!")>> | |
1319 | will fail with the error message: | |
1320 | ||
1321 | ERROR (line 1): Invalid Scotty: expected "she's goanta blaw!" | |
1322 | but found "ah jest kenna do'ut!" instead. | |
1323 | ||
1324 | since in that case the second production would commit after matching | |
1325 | the leading name. | |
1326 | ||
1327 | Note that to allow this behaviour, all C<E<lt>errorE<gt>> directives which are | |
1328 | the first item in a production automatically uncommit the rule just | |
1329 | long enough to allow their production to be attempted (that is, when | |
1330 | their production fails, the commitment is reinstated so that | |
1331 | subsequent productions are skipped). | |
1332 | ||
1333 | In order to I<permanently> uncommit the rule before an error message, | |
1334 | it is necessary to put an explicit C<E<lt>uncommitE<gt>> before the | |
1335 | C<E<lt>errorE<gt>>. For example: | |
1336 | ||
1337 | line: 'Kirk:' <commit> Kirk | |
1338 | | 'Spock:' <commit> Spock | |
1339 | | 'McCoy:' <commit> McCoy | |
1340 | | <uncommit> <error?> <reject> | |
1341 | | <resync> | |
1342 | ||
1343 | ||
1344 | Error messages generated by the various C<E<lt>error...E<gt>> directives | |
1345 | are not displayed immediately. Instead, they are "queued" in a buffer and | |
1346 | are only displayed once parsing ultimately fails. Moreover, | |
1347 | C<E<lt>error...E<gt>> directives that cause one production of a rule | |
1348 | to fail are automatically removed from the message queue | |
1349 | if another production subsequently causes the entire rule to succeed. | |
1350 | This means that you can put | |
1351 | C<E<lt>error...E<gt>> directives wherever useful diagnosis can be done, | |
1352 | and only those associated with actual parser failure will ever be | |
1353 | displayed. Also see L<"Gotchas">. | |
1354 | ||
1355 | As a general rule, the most useful diagnostics are usually generated | |
1356 | either at the very lowest level within the grammar, or at the very | |
1357 | highest. A good rule of thumb is to identify those subrules which | |
1358 | consist mainly (or entirely) of terminals, and then put an | |
1359 | C<E<lt>error...E<gt>> directive at the end of any other rule which calls | |
1360 | one or more of those subrules. | |
1361 | ||
1362 | There is one other situation in which the output of the various types of | |
1363 | error directive is suppressed; namely, when the rule containing them | |
1364 | is being parsed as part of a "look-ahead" (see L<"Look-ahead">). In this | |
1365 | case, the error directive will still cause the rule to fail, but will do | |
1366 | so silently. | |
1367 | ||
1368 | An unconditional C<E<lt>errorE<gt>> directive always fails (and hence has no | |
1369 | associated value). This means that encountering such a directive | |
1370 | always causes the production containing it to fail. Hence an | |
1371 | C<E<lt>errorE<gt>> directive will inevitably be the last (useful) item of a | |
1372 | rule (a level 3 warning is issued if a production contains items after an unconditional | |
1373 | C<E<lt>errorE<gt>> directive). | |
1374 | ||
1375 | An C<E<lt>error?E<gt>> directive will I<succeed> (that is: fail to fail :-), if | |
1376 | the current rule is uncommitted when the directive is encountered. In | |
1377 | that case the directive's associated value is zero. Hence, this type | |
1378 | of error directive I<can> be used before the end of a | |
1379 | production. For example: | |
1380 | ||
1381 | command: 'do' <commit> something | |
1382 | | 'report' <commit> something | |
1383 | | <error?: Syntax error> <error: Unknown command> | |
1384 | ||
1385 | ||
1386 | B<Warning:> The C<E<lt>error?E<gt>> directive does I<not> mean "always fail (but | |
1387 | do so silently unless committed)". It actually means "only fail (and report) if | |
1388 | committed, otherwise I<succeed>". To achieve the "fail silently if uncommitted" | |
1389 | semantics, it is necessary to use: | |
1390 | ||
1391 | rule: item <commit> item(s) | |
1392 | | <error?> <reject> # FAIL SILENTLY UNLESS COMMITTED | |
1393 | ||
1394 | However, because people seem to expect a lone C<E<lt>error?E<gt>> directive | |
1395 | to work like this: | |
1396 | ||
1397 | rule: item <commit> item(s) | |
1398 | | <error?: Error message if committed> | |
1399 | | <error: Error message if uncommitted> | |
1400 | ||
1401 | Parse::RecDescent automatically appends a | |
1402 | C<E<lt>rejectE<gt>> directive if the C<E<lt>error?E<gt>> directive | |
1403 | is the only item in a production. A level 2 warning (see below) | |
1404 | is issued when this happens. | |
1405 | ||
1406 | The level of error reporting during both parser construction and | |
1407 | parsing is controlled by the presence or absence of four global | |
1408 | variables: C<$::RD_ERRORS>, C<$::RD_WARN>, C<$::RD_HINT>, and | |
1409 | C<$::RD_TRACE>. If C<$::RD_ERRORS> is defined (and, by default, it is) | |
1410 | then fatal errors are reported. | |
1411 | ||
1412 | Whenever C<$::RD_WARN> is defined, certain non-fatal problems are also reported. | |
1413 | Warnings have an associated "level": 1, 2, or 3. The higher the level, | |
1414 | the more serious the warning. The value of the corresponding global | |
1415 | variable (C<$::RD_WARN>) determines the I<lowest> level of warning to | |
1416 | be displayed. Hence, to see I<all> warnings, set C<$::RD_WARN> to 1. | |
1417 | To see only the most serious warnings set C<$::RD_WARN> to 3. | |
1418 | By default C<$::RD_WARN> is initialized to 3, ensuring that serious but | |
1419 | non-fatal errors are automatically reported. | |
1420 | ||
1421 | See L<"DIAGNOSTICS"> for a list of the various error and warning messages | |
1422 | that Parse::RecDescent generates when these two variables are defined. | |
1423 | ||
1424 | Defining any of the remaining variables (which are not defined by | |
1425 | default) further increases the amount of information reported. | |
1426 | Defining C<$::RD_HINT> causes the parser generator to offer | |
1427 | more detailed analyses and hints on both errors and warnings. | |
1428 | Note that setting C<$::RD_HINT> at any point automagically | |
1429 | sets C<$::RD_WARN> to 1. | |
1430 | ||
1431 | Defining C<$::RD_TRACE> causes the parser generator and the parser to | |
1432 | report their progress to STDERR in excruciating detail (although, without hints | |
1433 | unless C<$::RD_HINT> is separately defined). This detail | |
1434 | can be moderated in only one respect: if C<$::RD_TRACE> has an | |
1435 | integer value (I<N>) greater than 1, only I<N> characters of | |
1436 | the "current parsing context" (that is, where in the input string we | |
1437 | are at any point in the parse) are reported at any time. | |
1438 | ||
1439 | C<$::RD_TRACE> is mainly useful for debugging a grammar that isn't | |
1440 | behaving as you expected it to. To this end, if C<$::RD_TRACE> is | |
1441 | defined when a parser is built, any actual parser code which is | |
1442 | generated is also written to a file named "RD_TRACE" in the local | |
1443 | directory. | |
1444 | ||
1445 | Note that the four variables belong to the "main" package, which | |
1446 | makes them easier to refer to in the code controlling the parser, and | |
1447 | also makes it easy to turn them into command line flags ("-RD_ERRORS", | |
1448 | "-RD_WARN", "-RD_HINT", "-RD_TRACE") under B<perl -s>. | |
1449 | ||
1450 | =item Specifying local variables | |
1451 | ||
1452 | It is occasionally convenient to specify variables which are local | |
1453 | to a single rule. This may be achieved by including a | |
1454 | C<E<lt>rulevar:...E<gt>> directive anywhere in the rule. For example: | |
1455 | ||
1456 | markup: <rulevar: $tag> | |
1457 | ||
1458 | markup: tag {($tag=$item[1]) =~ s/^<|>$//g} body[$tag] | |
1459 | ||
1460 | The example C<E<lt>rulevar: $tagE<gt>> directive causes a "my" variable named | |
1461 | C<$tag> to be declared at the start of the subroutine implementing the | |
1462 | C<markup> rule (that is, I<before> the first production, regardless of | |
1463 | where in the rule it is specified). | |
1464 | ||
1465 | Specifically, any directive of the form: | |
1466 | C<E<lt>rulevar:I<text>E<gt>> causes a line of the form C<my I<text>;> | |
1467 | to be added at the beginning of the rule subroutine, immediately after | |
1468 | the definitions of the following local variables: | |
1469 | ||
1470 | $thisparser $commit | |
1471 | $thisrule @item | |
1472 | $thisline @arg | |
1473 | $text %arg | |
1474 | ||
1475 | This means that the following C<E<lt>rulevarE<gt>> directives work | |
1476 | as expected: | |
1477 | ||
1478 | <rulevar: $count = 0 > | |
1479 | ||
1480 | <rulevar: $firstarg = $arg[0] || '' > | |
1481 | ||
1482 | <rulevar: $myItems = \@item > | |
1483 | ||
1484 | <rulevar: @context = ( $thisline, $text, @arg ) > | |
1485 | ||
1486 | <rulevar: ($name,$age) = @arg{"name","age"} > | |
1487 | ||
1488 | ||
1489 | Note however that, because all such variables are "my" variables, their | |
1490 | values I<do not persist> between match attempts on a given rule. To | |
1491 | preserve values between match attempts, values can be stored within the | |
1492 | "local" member of the C<$thisrule> object: | |
1493 | ||
1494 | countedrule: { $thisrule->{"local"}{"count"}++ } | |
1495 | <reject> | |
1496 | | subrule1 | |
1497 | | subrule2 | |
1498 | | <reject: $thisrule->{"local"}{"count"} == 1> | |
1499 | subrule3 | |
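
The difference between a per-attempt "my" variable and a value stored under the "local" member can be sketched in plain Perl. This only illustrates the scoping involved and is not code generated by Parse::RecDescent; the hash C<%rule> and the subroutine C<attempt> are hypothetical stand-ins for C<$thisrule> and the rule subroutine:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the $thisrule object of a single rule.
my %rule = ( "local" => {} );

# Hypothetical stand-in for the subroutine implementing a rule.
sub attempt {
    my $count = 0;                  # what <rulevar: $count = 0> would declare
    $count++;                       # re-initialized on every call, so always 1
    $rule{"local"}{"count"}++;      # kept in the rule object, so it accumulates
    return ($count, $rule{"local"}{"count"});
}

my @first  = attempt();
my @second = attempt();

print "my variable: $first[0], $second[0]\n";
print "stored:      $first[1], $second[1]\n";
```

The "my" variable reports 1 on both attempts, while the stored counter moves from 1 to 2.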
1500 | ||
1501 | ||
1502 | When matching a rule, each C<E<lt>rulevarE<gt>> directive is matched as | |
1503 | if it were an unconditional C<E<lt>rejectE<gt>> directive (that is, it | |
1504 | causes any production in which it appears to immediately fail to match). | |
1505 | For this reason (and to improve readability) it is usual to specify any | |
1506 | C<E<lt>rulevarE<gt>> directive in a separate production at the start of | |
1507 | the rule (this has the added advantage that it enables | |
1508 | C<Parse::RecDescent> to optimize away such productions, just as it does | |
1509 | for the C<E<lt>rejectE<gt>> directive). | |
1510 | ||
1511 | ||
1512 | =item Dynamically matched rules | |
1513 | ||
1514 | Because regexes and double-quoted strings are interpolated, it is relatively | |
1515 | easy to specify productions with "context sensitive" tokens. For example: | |
1516 | ||
1517 | command: keyword body "end $item[1]" | |
1518 | ||
1519 | which ensures that a command block is bounded by a | |
1520 | "I<E<lt>keywordE<gt>>...end I<E<lt>same keywordE<gt>>" pair. | |
1521 | ||
1522 | Building productions in which subrules are context sensitive is also possible, | |
1523 | via the C<E<lt>matchrule:...E<gt>> directive. This directive behaves | |
1524 | identically to a subrule item, except that the rule which is invoked to match | |
1525 | it is determined by the string specified after the colon. For example, we could | |
1526 | rewrite the C<command> rule like this: | |
1527 | ||
1528 | command: keyword <matchrule:body> "end $item[1]" | |
1529 | ||
1530 | Whatever appears after the colon in the directive is treated as an interpolated | |
1531 | string (that is, as if it appeared in a C<qq{...}> operator) and the value of | |
1532 | that interpolated string is the name of the subrule to be matched. | |
1533 | ||
1534 | Of course, just putting a constant string like C<body> in a | |
1535 | C<E<lt>matchrule:...E<gt>> directive is of little interest or benefit. | |
1536 | The power of the directive is seen when we use a string that interpolates | |
1537 | to something interesting. For example: | |
1538 | ||
1539 | command: keyword <matchrule:$item[1]_body> "end $item[1]" | |
1540 | ||
1541 | keyword: 'while' | 'if' | 'function' | |
1542 | ||
1543 | while_body: condition block | |
1544 | ||
1545 | if_body: condition block ('else' block)(?) | |
1546 | ||
1547 | function_body: arglist block | |
1548 | ||
1549 | Now the C<command> rule selects how to proceed on the basis of the keyword | |
1550 | that is found. It is as if C<command> were declared: | |
1551 | ||
1552 | command: 'while' while_body "end while" | |
1553 | | 'if' if_body "end if" | |
1554 | | 'function' function_body "end function" | |
1555 | ||
1556 | ||
1557 | When a C<E<lt>matchrule:...E<gt>> directive is used as a repeated | |
1558 | subrule, the rule name expression is "late-bound". That is, the name of | |
1559 | the rule to be called is re-evaluated I<each time> a match attempt is | |
1560 | made. Hence, the following grammar: | |
1561 | ||
1562 | { $::species = 'dogs' } | |
1563 | ||
1564 | pair: 'two' <matchrule:$::species>(s) | |
1565 | ||
1566 | dogs: /dogs/ { $::species = 'cats' } | |
1567 | ||
1568 | cats: /cats/ | |
1569 | ||
1570 | will match the string "two dogs cats cats" completely, whereas it will | |
1571 | only match the string "two dogs dogs dogs" up to the eighth letter. If | |
1572 | the rule name were "early bound" (that is, evaluated only the first | |
1573 | time the directive is encountered in a production), the reverse | |
1574 | behaviour would be expected. | |
1575 | ||
1576 | =item Deferred actions | |
1577 | ||
1578 | The C<E<lt>defer:...E<gt>> directive is used to specify an action to be | |
1579 | performed when (and only if!) the current production ultimately succeeds. | |
1580 | ||
1581 | Whenever a C<E<lt>defer:...E<gt>> directive appears, the code it specifies | |
1582 | is converted to a closure (an anonymous subroutine reference) which is | |
1583 | queued within the active parser object. Note that, | |
1584 | because the deferred code is converted to a closure, the values of any | |
1585 | "local" variables (such as C<$text>, C<@item>, etc.) are preserved | |
1586 | until the deferred code is actually executed. | |
1587 | ||
1588 | If the parse ultimately succeeds | |
1589 | I<and> the production in which the C<E<lt>defer:...E<gt>> directive was | |
1590 | evaluated formed part of the successful parse, then the deferred code is | |
1591 | executed immediately before the parse returns. If however the production | |
1592 | which queued a deferred action fails, or one of the higher-level | |
1593 | rules which called that production fails, then the deferred action is | |
1594 | removed from the queue, and hence is never executed. | |
1595 | ||
1596 | For example, given the grammar: | |
1597 | ||
1598 | sentence: noun trans noun | |
1599 | | noun intrans | |
1600 | ||
1601 | noun: 'the dog' | |
1602 | { print "$item[1]\t(noun)\n" } | |
1603 | | 'the meat' | |
1604 | { print "$item[1]\t(noun)\n" } | |
1605 | ||
1606 | trans: 'ate' | |
1607 | { print "$item[1]\t(transitive)\n" } | |
1608 | ||
1609 | intrans: 'ate' | |
1610 | { print "$item[1]\t(intransitive)\n" } | |
1611 | | 'barked' | |
1612 | { print "$item[1]\t(intransitive)\n" } | |
1613 | ||
1614 | then parsing the sentence C<"the dog ate"> would produce the output: | |
1615 | ||
1616 | the dog (noun) | |
1617 | ate (transitive) | |
1618 | the dog (noun) | |
1619 | ate (intransitive) | |
1620 | ||
1621 | This is because, even though the first production of C<sentence> | |
1622 | ultimately fails, its initial subrules C<noun> and C<trans> do match, | |
1623 | and hence they execute their associated actions. | |
1624 | Then the second production of C<sentence> succeeds, causing the | |
1625 | actions of the subrules C<noun> and C<intrans> to be executed as well. | |
1626 | ||
1627 | On the other hand, if the actions were replaced by C<E<lt>defer:...E<gt>> | |
1628 | directives: | |
1629 | ||
1630 | sentence: noun trans noun | |
1631 | | noun intrans | |
1632 | ||
1633 | noun: 'the dog' | |
1634 | <defer: print "$item[1]\t(noun)\n" > | |
1635 | | 'the meat' | |
1636 | <defer: print "$item[1]\t(noun)\n" > | |
1637 | ||
1638 | trans: 'ate' | |
1639 | <defer: print "$item[1]\t(transitive)\n" > | |
1640 | ||
1641 | intrans: 'ate' | |
1642 | <defer: print "$item[1]\t(intransitive)\n" > | |
1643 | | 'barked' | |
1644 | <defer: print "$item[1]\t(intransitive)\n" > | |
1645 | ||
1646 | the output would be: | |
1647 | ||
1648 | the dog (noun) | |
1649 | ate (intransitive) | |
1650 | ||
1651 | since deferred actions are only executed if they were evaluated in | |
1652 | a production which ultimately contributes to the successful parse. | |
1653 | ||
1654 | In this case, even though the first production of C<sentence> caused | |
1655 | the subrules C<noun> and C<trans> to match, that production ultimately | |
1656 | failed and so the deferred actions queued by those subrules were subsequently | |
1657 | discarded. The second production then succeeded, causing the entire | |
1658 | parse to succeed, and so the deferred actions queued by the (second) match of | |
1659 | the C<noun> subrule and the subsequent match of C<intrans> I<are> preserved and | |
1660 | eventually executed. | |
1661 | ||
1662 | Deferred actions provide a means of improving the performance of a parser, | |
1663 | by only executing those actions which are part of the final parse-tree | |
1664 | for the input data. | |
1665 | ||
1666 | Alternatively, deferred actions can be viewed as a mechanism for building | |
1667 | (and executing) a | |
1668 | customized subroutine corresponding to the given input data, much in the | |
1669 | same way that autoactions (see L<"Autoactions">) can be used to build a | |
1670 | customized data structure for specific input. | |
1671 | ||
1672 | Whether or not the action it specifies is ever executed, | |
1673 | a C<E<lt>defer:...E<gt>> directive always succeeds, returning the | |
1674 | number of deferred actions currently queued at that point. | |
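
The queue-and-discard behaviour described above can be approximated in plain Perl. This is a sketch of the idea only, not the module's actual implementation; the names C<@deferred>, C<defer>, and C<try_production> are hypothetical:

```perl
use strict;
use warnings;

my @deferred;    # hypothetical queue of pending closures

# Queue an action. Like <defer:...>, this always "succeeds" and
# returns the number of actions currently queued.
sub defer {
    my ($action) = @_;
    push @deferred, $action;
    return scalar @deferred;
}

# Run a production (a code ref returning true on success). If it
# fails, drop every action queued since the production started.
sub try_production {
    my ($production) = @_;
    my $mark = @deferred;
    return 1 if $production->();
    splice @deferred, $mark;
    return 0;
}

my @output;

# First production: two subrules match and queue actions, then it fails.
try_production(sub {
    defer(sub { push @output, "the dog\t(noun)" });
    defer(sub { push @output, "ate\t(transitive)" });
    return 0;    # the final 'noun' subrule does not match
})
# Second production: both subrules match, so its actions survive.
or try_production(sub {
    defer(sub { push @output, "the dog\t(noun)" });
    defer(sub { push @output, "ate\t(intransitive)" });
    return 1;
});

# The parse succeeded, so run whatever is still queued, in order.
$_->() for @deferred;
print "$_\n" for @output;
```

Only the two actions queued by the successful production are run, which mirrors the two-line output shown for the deferred version of the C<sentence> grammar.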
1675 | ||
1676 | ||
1677 | =item Parsing Perl | |
1678 | ||
1679 | Parse::RecDescent provides limited support for parsing subsets of Perl, | |
1680 | namely: quote-like operators, Perl variables, and complete code blocks. | |
1681 | ||
1682 | The C<E<lt>perl_quotelikeE<gt>> directive can be used to parse any Perl | |
1683 | quote-like operator: C<'a string'>, C<m/a pattern/>, C<tr{ans}{lation}>, | |
1684 | etc. It does this by calling Text::Balanced::extract_quotelike(). | |
1685 | ||
1686 | If a quote-like operator is found, a reference to an array of eight elements | |
1687 | is returned. Those elements are identical to the last eight elements returned | |
1688 | by Text::Balanced::extract_quotelike() in an array context, namely: | |
1689 | ||
1690 | =over 4 | |
1691 | ||
1692 | =item [0] | |
1693 | ||
1694 | the name of the quotelike operator -- 'q', 'qq', 'm', 's', 'tr' -- if the | |
1695 | operator was named; otherwise C<undef>, | |
1696 | ||
1697 | =item [1] | |
1698 | ||
1699 | the left delimiter of the first block of the operation, | |
1700 | ||
1701 | =item [2] | |
1702 | ||
1703 | the text of the first block of the operation | |
1704 | (that is, the contents of | |
1705 | a quote, the regex of a match or substitution, or the target list of a | |
1706 | translation), | |
1707 | ||
1708 | =item [3] | |
1709 | ||
1710 | the right delimiter of the first block of the operation, | |
1711 | ||
1712 | =item [4] | |
1713 | ||
1714 | the left delimiter of the second block of the operation if there is one | |
1715 | (that is, if it is a C<s>, C<tr>, or C<y>); otherwise C<undef>, | |
1716 | ||
1717 | =item [5] | |
1718 | ||
1719 | the text of the second block of the operation if there is one | |
1720 | (that is, the replacement of a substitution or the translation list | |
1721 | of a translation); otherwise C<undef>, | |
1722 | ||
1723 | =item [6] | |
1724 | ||
1725 | the right delimiter of the second block of the operation (if any); | |
1726 | otherwise C<undef>, | |
1727 | ||
1728 | =item [7] | |
1729 | ||
1730 | the trailing modifiers on the operation (if any); otherwise C<undef>. | |
1731 | ||
1732 | =back | |
1733 | ||
1734 | If a quote-like expression is not found, the directive fails with the usual | |
1735 | C<undef> value. | |
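
To see the shape of that array, C<Text::Balanced::extract_quotelike()> can be called directly. The sketch below uses a made-up input string and assumes the eleven-element list-context return documented for Text::Balanced; the slice C<[3 .. 10]> picks out the last eight values, which correspond to elements [0] to [7] above:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_quotelike);

# Made-up sample input: a substitution followed by other text.
my $text = 's{foo}{bar}gi and more';

# List context: (extracted, remainder, prefix, then eight fields).
my @all    = extract_quotelike($text);
my @fields = @all[3 .. 10];

my @labels = ('operator', 'left delim 1', 'text 1', 'right delim 1',
              'left delim 2', 'text 2', 'right delim 2', 'modifiers');

for my $i (0 .. $#fields) {
    my $value = defined $fields[$i] ? $fields[$i] : 'undef';
    printf "[%d] %-13s %s\n", $i, $labels[$i], $value;
}
```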
1736 | ||
1737 | The C<E<lt>perl_variableE<gt>> directive can be used to parse any Perl | |
1738 | variable: $scalar, @array, %hash, $ref->{field}[$index], etc. | |
1739 | It does this by calling Text::Balanced::extract_variable(). | |
1740 | ||
1741 | If the directive matches text representing a valid Perl variable | |
1742 | specification, it returns that text. Otherwise it fails with the usual | |
1743 | C<undef> value. | |
1744 | ||
1745 | The C<E<lt>perl_codeblockE<gt>> directive can be used to parse any curly-brace-delimited block of Perl code, such as: C<{ $a = 1; f() =~ m/pat/; }>. | |
1746 | It does this by calling Text::Balanced::extract_codeblock(). | |
1747 | ||
1748 | If the directive matches text representing a valid Perl code block, | |
1749 | it returns that text. Otherwise it fails with the usual C<undef> value. | |
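
The two underlying Text::Balanced functions can also be tried on their own. This is a minimal sketch with made-up input strings; in list context each function returns the extracted text first, followed by the remainder:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_variable extract_codeblock);

# Made-up inputs, single-quoted so nothing is interpolated.
my $assignment = '$ref->{field}[$index] = 1';
my $code       = '{ $a = 1; f() =~ m/pat/; } rest';

# In list context each call returns (extracted, remainder, prefix).
my ($var,   $after_var)   = extract_variable($assignment);
my ($block, $after_block) = extract_codeblock($code);

print "variable:   $var\n";
print "code block: $block\n";
```

The inputs are held in variables rather than passed as literals because the functions record a match position on the string they are given.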
1750 | ||
1751 | ||
1752 | =item Constructing tokens | |
1753 | ||
1754 | Eventually, Parse::RecDescent will be able to parse tokenized input, as | |
1755 | well as ordinary strings. In preparation for this joyous day, the | |
1756 | C<E<lt>token:...E<gt>> directive has been provided. | |
1757 | This directive creates a token which will be suitable for | |
1758 | input to a Parse::RecDescent parser (when it eventually supports | |
1759 | tokenized input). | |
1760 | ||
1761 | The text of the token is the value of the | |
1762 | immediately preceding item in the production. A | |
1763 | C<E<lt>token:...E<gt>> directive always succeeds with a return | |
1764 | value which is the hash reference that is the new token. It also | |
1765 | sets the return value for the production to that hash ref. | |
1766 | ||
1767 | The C<E<lt>token:...E<gt>> directive makes it easy to build | |
1768 | a Parse::RecDescent-compatible lexer in Parse::RecDescent: | |
1769 | ||
1770 | my $lexer = new Parse::RecDescent q | |
1771 | { | |
1772 | lex: token(s) | |
1773 | ||
1774 | token: /a\b/ <token:INDEF> | |
1775 | | /the\b/ <token:DEF> | |
1776 | | /fly\b/ <token:NOUN,VERB> | |
1777 | | /[a-z]+/i { lc $item[1] } <token:ALPHA> | |
1778 | | <error: Unknown token> | |
1779 | ||
1780 | }; | |
1781 | ||
1782 | which will eventually be able to be used with a regular Parse::RecDescent | |
1783 | grammar: | |
1784 | ||
1785 | my $parser = new Parse::RecDescent q | |
1786 | { | |
1787 | startrule: subrule1 subrule2 | |
1788 | ||
1789 | # ETC... | |
1790 | }; | |
1791 | ||
1792 | either with a pre-lexing phase: | |
1793 | ||
1794 | $parser->startrule( $lexer->lex($data) ); | |
1795 | ||
1796 | or with a lex-on-demand approach: | |
1797 | ||
1798 | $parser->startrule( sub{$lexer->token(\$data)} ); | |
1799 | ||
1800 | But at present, only the C<E<lt>token:...E<gt>> directive is | |
1801 | actually implemented. The rest is vapourware. | |
1802 | ||
1803 | =item Specifying operations | |
1804 | ||
1805 | One of the commonest requirements when building a parser is to specify | |
1806 | binary operators. Unfortunately, in a normal grammar, the rules for | |
1807 | such things are awkward: | |
1808 | ||
1809 | disjunction: conjunction ('or' conjunction)(s?) | |
1810 | { $return = [ $item[1], @{$item[2]} ] } | |
1811 | ||
1812 | conjunction: atom ('and' atom)(s?) | |
1813 | { $return = [ $item[1], @{$item[2]} ] } | |
1814 | ||
1815 | or inefficient: | |
1816 | ||
1817 | disjunction: conjunction 'or' disjunction | |
1818 | { $return = [ $item[1], @{$item[3]} ] } | |
1819 | | conjunction | |
1820 | { $return = [ $item[1] ] } | |
1821 | ||
1822 | conjunction: atom 'and' conjunction | |
1823 | { $return = [ $item[1], @{$item[3]} ] } | |
1824 | | atom | |
1825 | { $return = [ $item[1] ] } | |
1826 | ||
1827 | and either way is ugly and hard to get right. | |
1828 | ||
1829 | The C<E<lt>leftop:...E<gt>> and C<E<lt>rightop:...E<gt>> directives provide an | |
1830 | easier way of specifying such operations. Using C<E<lt>leftop:...E<gt>> the | |
1831 | above examples become: | |
1832 | ||
1833 | disjunction: <leftop: conjunction 'or' conjunction> | |
1834 | conjunction: <leftop: atom 'and' atom> | |
1835 | ||
1836 | The C<E<lt>leftop:...E<gt>> directive specifies a left-associative binary operator. | |
1837 | It is specified around three other grammar elements | |
1838 | (typically subrules or terminals), which match the left operand, | |
1839 | the operator itself, and the right operand respectively. | |
1840 | ||
1841 | A C<E<lt>leftop:...E<gt>> directive such as: | |
1842 | ||
1843 | disjunction: <leftop: conjunction 'or' conjunction> | |
1844 | ||
1845 | is converted to the following: | |
1846 | ||
1847 | disjunction: ( conjunction ('or' conjunction)(s?) | |
1848 | { $return = [ $item[1], @{$item[2]} ] } ) | |
1849 | ||
1850 | In other words, a C<E<lt>leftop:...E<gt>> directive matches the left operand followed by zero | |
1851 | or more repetitions of both the operator and the right operand. It then | |
1852 | flattens the matched items into an anonymous array which becomes the | |
1853 | (single) value of the entire C<E<lt>leftop:...E<gt>> directive. | |
1854 | ||
1855 | For example, a C<E<lt>leftop:...E<gt>> directive such as: | |
1856 | ||
1857 | output: <leftop: ident '<<' expr > | |
1858 | ||
1859 | when given a string such as: | |
1860 | ||
1861 | cout << var << "str" << 3 | |
1862 | ||
1863 | would match, and C<$item[1]> would be set to: | |
1864 | ||
1865 | [ 'cout', 'var', '"str"', '3' ] | |
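
A complete script that exercises this directive might look like the following. It is a sketch that assumes Parse::RecDescent is installed; the C<ident> and C<expr> rules are invented here for illustration and are not part of the example above:

```perl
use strict;
use warnings;
use Parse::RecDescent;

# The ident and expr rules below are illustrative assumptions.
my $parser = Parse::RecDescent->new(q{
    output: <leftop: ident '<<' expr>

    ident:  /[a-zA-Z_]\w*/

    expr:   /"[^"]*"/ | /\d+/ | ident
}) or die "Bad grammar";

# The value of the rule is the array reference built by <leftop:...>.
my $operands = $parser->output('cout << var << "str" << 3');
defined $operands or die "No match";

print join(' | ', @$operands), "\n";
```

If the grammar behaves as described, the script prints the four operands and none of the '<<' operators.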
1866 | ||
1867 | In other words: | |
1868 | ||
1869 | output: <leftop: ident '<<' expr > | |
1870 | ||
1871 | is equivalent to a left-associative operator: | |
1872 | ||
1873 | output: ident { $return = [$item[1]] } | |
1874 | | ident '<<' expr { $return = [@item[1,3]] } | |
1875 | | ident '<<' expr '<<' expr { $return = [@item[1,3,5]] } | |
1876 | | ident '<<' expr '<<' expr '<<' expr { $return = [@item[1,3,5,7]] } | |
1877 | # ...etc... | |
1878 | ||
1879 | ||
1880 | Similarly, the C<E<lt>rightop:...E<gt>> directive takes a left operand, an operator, and a right operand: | |
1881 | ||
1882 | assign: <rightop: var '=' expr > | |
1883 | ||
1884 | and converts them to: | |
1885 | ||
1886 | assign: ( (var '=' {$return=$item[1]})(s?) expr | |
1887 | { $return = [ @{$item[1]}, $item[2] ] } ) | |
1888 | ||
1889 | which is equivalent to a right-associative operator: | |
1890 | ||
1891 | assign: var { $return = [$item[1]] } | |
1892 | | var '=' expr { $return = [@item[1,3]] } | |
1893 | | var '=' var '=' expr { $return = [@item[1,3,5]] } | |
1894 | | var '=' var '=' var '=' expr { $return = [@item[1,3,5,7]] } | |
1895 | # ...etc... | |
1896 | ||
1897 | ||
1898 | Note that for both the C<E<lt>leftop:...E<gt>> and C<E<lt>rightop:...E<gt>> directives, the directive does not normally | |
1899 | return the operator itself, just a list of the operands involved. This is | |
1900 | particularly handy for specifying lists: | |
1901 | ||
1902 | list: '(' <leftop: list_item ',' list_item> ')' | |
1903 | { $return = $item[2] } | |
1904 | ||
1905 | There is, however, a problem: sometimes the operator is itself significant. | |
1906 | For example, in a Perl list a comma and a C<=E<gt>> are both | |
1907 | valid separators, but the C<=E<gt>> has additional stringification semantics. | |
1908 | Hence it's important to know which was used in each case. | |
1909 | ||
1910 | To solve this problem the | |
1911 | C<E<lt>leftop:...E<gt>> and C<E<lt>rightop:...E<gt>> directives | |
1912 | I<do> return the operator(s) as well, under two circumstances. | |
1913 | The first case is where the operator is specified as a subrule. In that instance, | |
1914 | whatever the operator matches is returned (on the assumption that if the operator | |
1915 | is important enough to have its own subrule, then it's important enough to return). | |
1916 | ||
1917 | The second case is where the operator is specified as a regular | |
1918 | expression. In that case, if the first bracketed subpattern of the | |
1919 | regular expression matches, that matching value is returned (this is analogous to | |
1920 | the behaviour of the Perl C<split> function, except that only the first subpattern | |
1921 | is returned). | |
1922 | ||
1923 | In other words, given the input: | |
1924 | ||
1925 | ( a=>1, b=>2 ) | |
1926 | ||
1927 | the specifications: | |
1928 | ||
1929 | list: '(' <leftop: list_item separator list_item> ')' | |
1930 | ||
1931 | separator: ',' | '=>' | |
1932 | ||
1933 | or: | |
1934 | ||
1935 | list: '(' <leftop: list_item /(,|=>)/ list_item> ')' | |
1936 | ||
1937 | cause the list separators to be interleaved with the operands in the | |
1938 | anonymous array in C<$item[2]>: | |
1939 | ||
1940 | [ 'a', '=>', '1', ',', 'b', '=>', '2' ] | |
1941 | ||
1942 | ||
1943 | But the following version: | |
1944 | ||
1945 | list: '(' <leftop: list_item /,|=>/ list_item> ')' | |
1946 | ||
1947 | returns only the operands: | |
1948 | ||
1949 | [ 'a', '1', 'b', '2' ] | |
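
The analogy with Perl's built-in C<split> can be checked without Parse::RecDescent at all. In this short sketch the C<\s*> around each separator is an addition that stands in for the parser's whitespace skipping:

```perl
use strict;
use warnings;

my $input = 'a=>1, b=>2';

# With a capturing group, split keeps the separators in its result.
my @with_seps = split /\s*(,|=>)\s*/, $input;
print join(' ', @with_seps), "\n";      # a => 1 , b => 2

# Without a capturing group, only the fields are returned.
my @without = split /\s*(?:,|=>)\s*/, $input;
print join(' ', @without), "\n";        # a 1 b 2
```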
1950 | ||
1951 | Of course, none of the above specifications handle the case of an empty | |
1952 | list, since the C<E<lt>leftop:...E<gt>> and C<E<lt>rightop:...E<gt>> directives | |
1953 | require at least a single right or left operand to match. To specify | |
1954 | that the operator can match "trivially", | |
1955 | it's necessary to add a C<(?)> qualifier to the directive: | |
1956 | ||
1957 | list: '(' <leftop: list_item /(,|=>)/ list_item>(?) ')' | |
1958 | ||
1959 | Note that in almost all the above examples, the first and third arguments | |
1960 | of the C<E<lt>leftop:...E<gt>> directive were the same subrule. That is because | |
1961 | C<E<lt>leftop:...E<gt>> directives are frequently used to specify "separated" lists of the | |
1962 | same type of item. To make such lists easier to specify, the following | |
1963 | syntax: | |
1964 | ||
1965 | list: element(s /,/) | |
1966 | ||
1967 | is exactly equivalent to: | |
1968 | ||
1969 | list: <leftop: element /,/ element> | |
1970 | ||
1971 | Note that the separator must be specified as a raw pattern (i.e. | |
1972 | not a string or subrule). | |
1973 | ||
1974 | ||
1975 | =item Scored productions | |
1976 | ||
1977 | By default, Parse::RecDescent grammar rules always accept the first | |
1978 | production that matches the input. But if two or more productions may | |
1979 | potentially match the same input, choosing the first that does so may | |
1980 | not be optimal. | |
1981 | ||
1982 | For example, if you were parsing the sentence "time flies like an arrow", | |
1983 | you might use a rule like this: | |
1984 | ||
1985 | sentence: verb noun preposition article noun { [@item] } | |
1986 | | adjective noun verb article noun { [@item] } | |
1987 | | noun verb preposition article noun { [@item] } | |
1988 | ||
1989 | Each of these productions matches the sentence, but the third one | |
1990 | is the most likely interpretation. However, if the sentence had been | |
1991 | "fruit flies like a banana", then the second production is probably | |
1992 | the right match. | |
1993 | ||
1994 | To cater for such situations, the C<E<lt>score:...E<gt>> directive can be used. | |
1995 | The directive is equivalent to an unconditional C<E<lt>rejectE<gt>>, | |
1996 | except that it allows you to specify a "score" for the current | |
1997 | production. If that score is numerically greater than the best | |
1998 | score of any preceding production, the current production is cached for later | |
1999 | consideration. If no later production matches, then the cached | |
2000 | production is treated as having matched, and the value of the | |
2001 | item immediately before its C<E<lt>score:...E<gt>> directive is returned as the | |
2002 | result. | |
2003 | ||
2004 | In other words, by putting a C<E<lt>score:...E<gt>> directive at the end of | |
2005 | each production, you can select which production matches using | |
2006 | criteria other than specification order. For example: | |
2007 | ||
2008 | sentence: verb noun preposition article noun { [@item] } <score: sensible(@item)> | |
2009 | | adjective noun verb article noun { [@item] } <score: sensible(@item)> | |
2010 | | noun verb preposition article noun { [@item] } <score: sensible(@item)> | |
2011 | ||
2012 | Now, when each production reaches its respective C<E<lt>score:...E<gt>> | |
2013 | directive, the subroutine C<sensible> will be called to evaluate the | |
2014 | matched items (somehow). Once all productions have been tried, the | |
2015 | one which C<sensible> scored most highly will be the one that is | |
2016 | accepted as a match for the rule. | |
2017 | ||
2018 | The variable C<$score> always holds the current best score of any production, | |
2019 | and the variable C<$score_return> holds the corresponding return value. | |
2020 | ||
2021 | As another example, the following grammar matches lines that may be | |
2022 | separated by commas, colons, or spaces. This can be tricky if | |
2023 | a colon-separated line also contains commas, or vice versa. The grammar | |
2024 | resolves the ambiguity by selecting the rule that results in the | |
2025 | fewest fields: | |
2026 | ||
2027 | line: seplist[sep=>','] <score: -@{$item[1]}> | |
2028 | | seplist[sep=>':'] <score: -@{$item[1]}> | |
2029 | | seplist[sep=>" "] <score: -@{$item[1]}> | |
2030 | ||
2031 | seplist: <skip:""> <leftop: /[^$arg{sep}]*/ "$arg{sep}" /[^$arg{sep}]*/> | |
2032 | ||
2033 | Note the use of negation within the C<E<lt>score:...E<gt>> directive | |
2034 | to ensure that the seplist with the most items gets the lowest score. | |
2035 | ||
2036 | As the above examples indicate, it is often the case that all productions | |
2037 | in a rule use exactly the same C<E<lt>score:...E<gt>> directive. It is | |
2038 | tedious to have to repeat this identical directive in every production, so | |
2039 | Parse::RecDescent also provides the C<E<lt>autoscore:...E<gt>> directive. | |
2040 | ||
2041 | If an C<E<lt>autoscore:...E<gt>> directive appears in any | |
2042 | production of a rule, the code it specifies is used as the scoring | |
2043 | code for every production of that rule, except productions that already | |
2044 | end with an explicit C<E<lt>score:...E<gt>> directive. Thus the rules above could | |
2045 | be rewritten: | |
2046 | ||
2047 | line: <autoscore: -@{$item[1]}> | |
2048 | line: seplist[sep=>','] | |
2049 | | seplist[sep=>':'] | |
2050 | | seplist[sep=>" "] | |
2051 | ||
2052 | ||
2053 | sentence: <autoscore: sensible(@item)> | |
2054 | | verb noun preposition article noun { [@item] } | |
2055 | | adjective noun verb article noun { [@item] } | |
2056 | | noun verb preposition article noun { [@item] } | |
2057 | ||
2058 | Note that the C<E<lt>autoscore:...E<gt>> directive itself acts as an | |
2059 | unconditional C<E<lt>rejectE<gt>>, and (like the C<E<lt>rulevar:...E<gt>> | |
2060 | directive) is pruned at compile-time wherever possible. | |
2061 | ||
2062 | ||
2063 | =item Dispensing with grammar checks | |
2064 | ||
2065 | During the compilation phase of parser construction, Parse::RecDescent performs | |
2066 | a small number of checks on the grammar it's given. Specifically it checks that | |
2067 | the grammar is not left-recursive, that there are no "insatiable" constructs of | |
2068 | the form: | |
2069 | ||
2070 | rule: subrule(s) subrule | |
2071 | ||
2072 | and that there are no rules missing (i.e. referred to, but never defined). | |
2073 | ||
2074 | These checks are important during development, but can slow down parser | |
2075 | construction in stable code. So Parse::RecDescent provides the | |
2076 | C<E<lt>nocheckE<gt>> directive to turn them off. The directive can only appear | |
2077 | before the first rule definition, and switches off checking throughout the rest | |
2078 | of the current grammar. | |
2079 | ||
2080 | Typically, this directive would be added when a parser has been thoroughly | |
2081 | tested and is ready for release. | |
2082 | ||
2083 | =back | |
2084 | ||
2085 | ||
2086 | =head2 Subrule argument lists | |
2087 | ||
2088 | It is occasionally useful to pass data to a subrule which is being invoked. For | |
2089 | example, consider the following grammar fragment: | |
2090 | ||
2091 | classdecl: keyword decl | |
2092 | ||
2093 | keyword: 'struct' | 'class' | |
2094 | ||
2095 | decl: # WHATEVER | |
2096 | ||
2097 | The C<decl> rule might wish to know which of the two keywords was used | |
2098 | (since it may affect some aspect of the way the subsequent declaration | |
2099 | is interpreted). C<Parse::RecDescent> allows the grammar designer to | |
2100 | pass data into a rule, by placing that data in an I<argument list> | |
2101 | (that is, in square brackets) immediately after any subrule item in a | |
2102 | production. Hence, we could pass the keyword to C<decl> as follows: | |
2103 | ||
2104 | classdecl: keyword decl[ $item[1] ] | |
2105 | ||
2106 | keyword: 'struct' | 'class' | |
2107 | ||
2108 | decl: # WHATEVER | |
2109 | ||
2110 | The argument list can consist of any number (including zero!) of comma-separated | |
2111 | Perl expressions. In other words, it looks exactly like a Perl anonymous | |
2112 | array reference. For example, we could pass the keyword, the name of the | |
2113 | surrounding rule, and the literal 'keyword' to C<decl> like so: | |
2114 | ||
2115 | classdecl: keyword decl[$item[1],$item[0],'keyword'] | |
2116 | ||
2117 | keyword: 'struct' | 'class' | |
2118 | ||
2119 | decl: # WHATEVER | |
2120 | ||
2121 | Within the rule to which the data is passed (C<decl> in the above examples) | |
2122 | that data is available as the elements of a local variable C<@arg>. Hence | |
2123 | C<decl> might report its intentions as follows: | |
2124 | ||
2125 | classdecl: keyword decl[$item[1],$item[0],'keyword'] | |
2126 | ||
2127 | keyword: 'struct' | 'class' | |
2128 | ||
2129 | decl: { print "Declaring $arg[0] (a $arg[2])\n"; | |
2130 | print "(this rule called by $arg[1])" } | |
2131 | ||
2132 | Subrule argument lists can also be interpreted as hashes, simply by using | |
2133 | the local variable C<%arg> instead of C<@arg>. Hence we could rewrite the | |
2134 | previous example: | |
2135 | ||
2136 | classdecl: keyword decl[keyword => $item[1], | |
2137 | caller => $item[0], | |
2138 | type => 'keyword'] | |
2139 | ||
2140 | keyword: 'struct' | 'class' | |
2141 | ||
2142 | decl: { print "Declaring $arg{keyword} (a $arg{type})\n"; | |
2143 | print "(this rule called by $arg{caller})" } | |
2144 | ||
2145 | Both C<@arg> and C<%arg> are always available, so the grammar designer may | |
2146 | choose whichever convention (or combination of conventions) suits best. | |
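
As a runnable illustration of both conventions side by side, the sketch below assumes Parse::RecDescent is installed and replaces the placeholder C<decl> rule with a made-up one that matches a single identifier:

```perl
use strict;
use warnings;
use Parse::RecDescent;

# The decl rule here is a made-up placeholder that matches one identifier.
my $parser = Parse::RecDescent->new(q{
    classdecl: keyword decl[ keyword => $item[1], caller => $item[0] ]

    keyword:   'struct' | 'class'

    decl:      /\w+/
               { $return = "$arg{keyword} $item[1] / $arg[1] via $arg{caller}" }
}) or die "Bad grammar";

my $result = $parser->classdecl('struct Point');
defined $result or die "No match";

print "$result\n";
```

Here C<$arg{keyword}> and C<$arg[1]> refer to the same value, since the hash is built from the same list that fills C<@arg>.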
2147 | ||
2148 | Subrule argument lists are also useful for creating "rule templates" | |
2149 | (especially when used in conjunction with the C<E<lt>matchrule:...E<gt>> | |
2150 | directive). For example, the subrule: | |
2151 | ||
2152 | list: <matchrule:$arg{rule}> /$arg{sep}/ list[%arg] | |
2153 | { $return = [ $item[1], @{$item[3]} ] } | |
2154 | | <matchrule:$arg{rule}> | |
2155 | { $return = [ $item[1]] } | |
2156 | ||
2157 | is a handy template for the common problem of matching a separated list. | |
2158 | For example: | |
2159 | ||
2160 | function: 'func' name '(' list[rule=>'param',sep=>';'] ')' | |
2161 | ||
2162 | param: list[rule=>'name',sep=>','] ':' typename | |
2163 | ||
2164 | name: /\w+/ | |
2165 | ||
2166 | typename: name | |
2167 | ||
2168 | ||
2169 | When a subrule argument list is used with a repeated subrule, the argument list | |
2170 | goes I<before> the repetition specifier: | |
2171 | ||
2172 | list: /some|many/ thing[ $item[1] ](s) | |
2173 | ||
2174 | The argument list is "late bound". That is, it is re-evaluated for every | |
2175 | repetition of the repeated subrule. | |
2176 | This means that each repeated attempt to match the subrule may be | |
2177 | passed a completely different set of arguments if the value of the | |
2178 | expression in the argument list changes between attempts. So, for | |
2179 | example, the grammar: | |
2180 | ||
2181 | { $::species = 'dogs' } | |
2182 | ||
2183 | pair: 'two' animal[$::species](s) | |
2184 | ||
2185 | animal: /$arg[0]/ { $::species = 'cats' } | |
2186 | ||
2187 | will match the string "two dogs cats cats" completely, whereas | |
2188 | it will only match the string "two dogs dogs dogs" up to the | |
2189 | eighth character. If the value of the argument list were "early bound" | |
2190 | (that is, evaluated only the first time a repeated subrule match is | |
2191 | attempted), one would expect the matching behaviours to be reversed. | |
2192 | ||
2193 | Of course, it is possible to effectively "early bind" such argument lists | |
2194 | by passing them a value which does not change on each repetition. For example: | |
2195 | ||
2196 | { $::species = 'dogs' } | |
2197 | ||
2198 | pair: 'two' { $::species } animal[$item[2]](s) | |
2199 | ||
2200 | animal: /$arg[0]/ { $::species = 'cats' } | |
2201 | ||
2202 | ||
2203 | Arguments can also be passed to the start rule, simply by appending them | |
2204 | to the argument list with which the start rule is called (I<after> the | |
2205 | "line number" parameter). For example, given: | |
2206 | ||
2207 | $parser = new Parse::RecDescent ( $grammar ); | |
2208 | ||
2209 | $parser->data($text, 1, "str", 2, \@arr); | |
2210 | ||
2211 | # ^^^^^ ^ ^^^^^^^^^^^^^^^ | |
2212 | # | | | | |
2213 | # TEXT TO BE PARSED | | | |
2214 | # STARTING LINE NUMBER | | |
2215 | # ELEMENTS OF @arg WHICH IS PASSED TO RULE data | |
2216 | ||
2217 | then within the productions of the rule C<data>, the array C<@arg> will contain | |
2218 | C<("str", 2, \@arr)>. | |
2219 | ||
2220 | ||
2221 | =head2 Alternations | |
2222 | ||
2223 | Alternations are implicit (unnamed) rules defined as part of a production. An | |
2224 | alternation is defined as a series of '|'-separated productions inside a | |
2225 | pair of round brackets. For example: | |
2226 | ||
2227 | character: 'the' ( good | bad | ugly ) /dude/ | |
2228 | ||
2229 | Every alternation implicitly defines a new subrule, whose | |
2230 | automatically-generated name indicates its origin: | |
2231 | "_alternation_<I>_of_production_<P>_of_rule_<R>" for the appropriate | |
2232 | values of <I>, <P>, and <R>. A call to this implicit subrule is then | |
2233 | inserted in place of the brackets. Hence the above example is merely a | |
2234 | convenient short-hand for: | |
2235 | ||
2236 | character: 'the' | |
2237 | _alternation_1_of_production_1_of_rule_character | |
2238 | /dude/ | |
2239 | ||
2240 | _alternation_1_of_production_1_of_rule_character: | |
2241 | good | bad | ugly | |
2242 | ||
2243 | Since alternations are parsed by recursively calling the parser generator, | |
2244 | any type(s) of item can appear in an alternation. For example: | |
2245 | ||
2246 | character: 'the' ( 'high' "plains" # Silent, with poncho | |
2247 | | /no[- ]name/ # Silent, no poncho | |
2248 | | vengeance_seeking # Poncho-optional | |
2249 | | <error> | |
2250 | ) drifter | |
2251 | ||
2252 | In this case, if an error occurred, the automatically generated | |
2253 | message would be: | |
2254 | ||
2255 | ERROR (line <N>): Invalid implicit subrule: Expected | |
2256 | 'high' or /no[- ]name/ or generic, | |
2257 | but found "pacifist" instead | |
2258 | ||
2259 | Since every alternation actually has a name, it's even possible | |
2260 | to extend or replace them: | |
2261 | ||
2262 | $parser->Replace( | |
2263 | "_alternation_1_of_production_1_of_rule_character: | |
2264 | 'generic Eastwood'" | |
2265 | ); | |
2266 | ||
2267 | More importantly, since alternations are a form of subrule, they can be given | |
2268 | repetition specifiers: | |
2269 | ||
2270 | character: 'the' ( good | bad | ugly )(?) /dude/ | |
2271 | ||
2272 | ||
2273 | =head2 Incremental Parsing | |
2274 | ||
2275 | C<Parse::RecDescent> provides two methods - C<Extend> and C<Replace> - which | |
2276 | can be used to alter the grammar matched by a parser. Both methods | |
2277 | take the same argument as C<Parse::RecDescent::new>, namely a | |
2278 | grammar specification string. | |
2279 | ||
2280 | C<Parse::RecDescent::Extend> interprets the grammar specification and adds any | |
2281 | productions it finds to the end of the rules for which they are specified. For | |
2282 | example: | |
2283 | ||
2284 | $add = "name: 'Jimmy-Bob' | 'Bobby-Jim'\ndesc: colour /necks?/"; | |
2285 | $parser->Extend($add); | |
2286 | ||
2287 | adds two productions to the rule "name" (creating it if necessary) and one | |
2288 | production to the rule "desc". | |
2289 | ||
2290 | C<Parse::RecDescent::Replace> is identical, except that it first resets any | |
2291 | rule specified in the additional grammar, removing any existing productions. | |
2292 | Hence after: | |
2293 | ||
2294 | $add = "name: 'Jimmy-Bob' | 'Bobby-Jim'\ndesc: colour /necks?/"; | |
2295 | $parser->Replace($add); | |
2296 | ||
2297 | there are I<only> two valid "name"s and the one possible description. | |
2298 | ||
2299 | A more interesting use of the C<Extend> and C<Replace> methods is to call them | |
2300 | inside the action of an executing parser. For example: | |
2301 | ||
2302 | typedef: 'typedef' type_name identifier ';' | |
2303 | { $thisparser->Extend("type_name: '$item[3]'") } | |
2304 | | <error> | |
2305 | ||
2306 | identifier: ...!type_name /[A-Za-z_]\w*/ | |
2307 | ||
2308 | which automatically prevents type names from being typedef'd, or: | |
2309 | ||
2310 | command: 'map' key_name 'to' abort_key | |
2311 | { $thisparser->Replace("abort_key: '$item[2]'") } | |
2312 | | 'map' key_name 'to' key_name | |
2313 | { map_key($item[2],$item[4]) } | |
2314 | | abort_key | |
2315 | { exit if confirm("abort?") } | |
2316 | ||
2317 | abort_key: 'q' | |
2318 | ||
2319 | key_name: ...!abort_key /[A-Za-z]/ | |
2320 | ||
2321 | which allows the user to change the abort key binding, but not to unbind it. | |
2322 | ||
2323 | The careful use of such constructs makes it possible to reconfigure a | |
2324 | running parser, eliminating the need for semantic feedback by | |
2325 | providing syntactic feedback instead. However, as currently implemented, | |
2326 | C<Replace()> and C<Extend()> have to regenerate and re-C<eval> the | |
2327 | entire parser whenever they are called. This makes them quite slow for | |
2328 | large grammars. | |
2329 | ||
2330 | In such cases, the judicious use of an interpolated regex is likely to | |
2331 | be far more efficient: | |
2332 | ||
2333 | typedef: 'typedef' type_name identifier ';' | |
2334 | { $thisparser->{local}{type_name} .= "|$item[3]" } | |
2335 | | <error> | |
2336 | ||
2337 | identifier: ...!type_name /[A-Za-z_]\w*/ | |
2338 | ||
2339 | type_name: /$thisparser->{local}{type_name}/ | |
2340 | ||
2341 | ||
2342 | =head2 Precompiling parsers | |
2343 | ||
2344 | Normally Parse::RecDescent builds a parser from a grammar at run-time. | |
2345 | That approach simplifies the design and implementation of parsing code, | |
2346 | but has the disadvantage that it slows the parsing process down - you | |
2347 | have to wait for Parse::RecDescent to build the parser every time the | |
2348 | program runs. Long or complex grammars can be particularly slow to | |
2349 | build, leading to unacceptable delays at start-up. | |
2350 | ||
2351 | To overcome this, the module provides a way of "pre-building" a parser | |
2352 | object and saving it in a separate module. That module can then be used | |
2353 | to create clones of the original parser. | |
2354 | ||
2355 | A grammar may be precompiled using the C<Precompile> class method. | |
2356 | For example, to precompile a grammar stored in the scalar $grammar, | |
2357 | and produce a class named PreGrammar in a module file named PreGrammar.pm, | |
2358 | you could use: | |
2359 | ||
2360 | use Parse::RecDescent; | |
2361 | ||
2362 | Parse::RecDescent->Precompile($grammar, "PreGrammar"); | |
2363 | ||
2364 | The first argument is the grammar string, the second is the name of the class | |
2365 | to be built. The name of the module file is generated automatically by | |
2366 | appending ".pm" to the last element of the class name. Thus | |
2367 | ||
2368 | Parse::RecDescent->Precompile($grammar, "My::New::Parser"); | |
2369 | ||
2370 | would produce a module file named Parser.pm. | |
2371 | ||
2372 | It is somewhat tedious to have to write a small Perl program just to | |
2373 | generate a precompiled grammar class, so Parse::RecDescent has some special | |
2374 | magic that allows you to do the job directly from the command-line. | |
2375 | ||
2376 | If your grammar is specified in a file named F<grammar>, you can generate | |
2377 | a class named Yet::Another::Grammar like so: | |
2378 | ||
2379 | > perl -MParse::RecDescent - grammar Yet::Another::Grammar | |
2380 | ||
2381 | This would produce a file named F<Grammar.pm> containing the full | |
2382 | definition of a class called Yet::Another::Grammar. Of course, to use | |
2383 | that class, you would need to put the F<Grammar.pm> file in a | |
2384 | directory named F<Yet/Another>, somewhere in your Perl include path. | |
2385 | ||
2386 | Having created the new class, it's very easy to use it to build | |
2387 | a parser. You simply C<use> the new module, and then call its | |
2388 | C<new> method to create a parser object. For example: | |
2389 | ||
2390 | use Yet::Another::Grammar; | |
2391 | my $parser = Yet::Another::Grammar->new(); | |
2392 | ||
2393 | The effect of these two lines is exactly the same as: | |
2394 | ||
2395 | use Parse::RecDescent; | |
2396 | ||
2397 | open GRAMMAR_FILE, "grammar" or die; | |
2398 | local $/; | |
2399 | my $grammar = <GRAMMAR_FILE>; | |
2400 | ||
2401 | my $parser = Parse::RecDescent->new($grammar); | |
2402 | ||
2403 | only considerably faster. | |
2404 | ||
2405 | Note however that the parsers produced by either approach are exactly | |
2406 | the same, so whilst precompilation has an effect on I<set-up> speed, | |
2407 | it has no effect on I<parsing> speed. RecDescent 2.0 will address that | |
2408 | problem. | |
2409 | ||
2410 | ||
2411 | =head2 A Metagrammar for C<Parse::RecDescent> | |
2412 | ||
2413 | The following is a specification of the grammar format accepted by | |
2414 | C<Parse::RecDescent::new> (specified in the C<Parse::RecDescent> grammar format!): | |
2415 | ||
2416 | grammar : component(s) | |
2417 | ||
2418 | component : rule | comment | |
2419 | ||
2420 | rule : "\n" identifier ":" production(s?) | |
2421 | ||
2422 | production : item(s) | |
2423 | ||
2424 | item : lookahead(?) simpleitem | |
2425 | | directive | |
2426 | | comment | |
2427 | ||
2428 | lookahead : '...' | '...!' # +'ve or -'ve lookahead | |
2429 | ||
2430 | simpleitem : subrule args(?) # match another rule | |
2431 | | repetition # match repeated subrules | |
2432 | | terminal # match the next input | |
2433 | | bracket args(?) # match alternative items | |
2434 | | action # do something | |
2435 | ||
2436 | subrule : identifier # the name of the rule | |
2437 | ||
2438 | args : {extract_codeblock($text,'[]')} # just like a [...] array ref | |
2439 | ||
2440 | repetition : subrule args(?) howoften | |
2441 | ||
2442 | howoften : '(?)' # 0 or 1 times | |
2443 | | '(s?)' # 0 or more times | |
2444 | | '(s)' # 1 or more times | |
2445 | | /(\d+)[.][.](\d+)/ # $1 to $2 times | |
2446 | | /[.][.](\d*)/ # at most $1 times | |
2447 | | /(\d*)[.][.]/ # at least $1 times | |
2448 | ||
2449 | terminal : /[/]([\\][/]|[^/])*[/]/ # interpolated pattern | |
2450 | | /"([\\]"|[^"])*"/ # interpolated literal | |
2451 | | /'([\\]'|[^'])*'/ # uninterpolated literal | |
2452 | ||
2453 | action : { extract_codeblock($text) } # embedded Perl code | |
2454 | ||
2455 | bracket : '(' item(s) production(s?) ')' # alternative subrules | |
2456 | ||
2457 | directive : '<commit>' # commit to production | |
2458 | | '<uncommit>' # cancel commitment | |
2459 | | '<resync>' # skip to newline | |
2460 | | '<resync:' pattern '>' # skip <pattern> | |
2461 | | '<reject>' # fail this production | |
2462 | | '<reject:' condition '>' # fail if <condition> | |
2463 | | '<error>' # report an error | |
2464 | | '<error:' string '>' # report error as "<string>" | |
2465 | | '<error?>' # error only if committed | |
2466 | | '<error?:' string '>' # " " " " | |
2467 | | '<rulevar:' /[^>]+/ '>' # define rule-local variable | |
2468 | | '<matchrule:' string '>' # invoke rule named in string | |
2469 | ||
2470 | identifier : /[a-z]\w*/i # must start with alpha | |
2471 | ||
2472 | comment : /#[^\n]*/ # same as Perl | |
2473 | ||
2474 | pattern : {extract_bracketed($text,'<')} # allow embedded "<..>" | |
2475 | ||
2476 | condition : {extract_codeblock($text,'{<')} # full Perl expression | |
2477 | ||
2478 | string : {extract_variable($text)} # any Perl variable | |
2479 | | {extract_quotelike($text)} # or quotelike string | |
2480 | | {extract_bracketed($text,'<')} # or balanced brackets | |
2481 | ||
2482 | ||
2483 | =head1 GOTCHAS | |
2484 | ||
2485 | This section describes common mistakes that grammar writers seem to | |
2486 | make on a regular basis. | |
2487 | ||
2488 | =head2 1. Expecting an error to always invalidate a parse | |
2489 | ||
2490 | A common mistake when using error messages is to write the grammar like this: | |
2491 | ||
2492 | file: line(s) | |
2493 | ||
2494 | line: line_type_1 | |
2495 | | line_type_2 | |
2496 | | line_type_3 | |
2497 | | <error> | |
2498 | ||
2499 | The expectation seems to be that any line that is not of type 1, 2 or 3 will | |
2500 | invoke the C<E<lt>errorE<gt>> directive and thereby cause the parse to fail. | |
2501 | ||
2502 | Unfortunately, that only happens if the error occurs in the very first line. | |
2503 | The first rule states that a C<file> is matched by one or more lines, so if | |
2504 | even a single line succeeds, the first rule is completely satisfied and the | |
2505 | parse as a whole succeeds. That means that any error messages generated by | |
2506 | subsequent failures in the C<line> rule are quietly ignored. | |
2507 | ||
2508 | Typically what's really needed is this: | |
2509 | ||
2510 | file: line(s) eofile { $return = $item[1] } | |
2511 | ||
2512 | line: line_type_1 | |
2513 | | line_type_2 | |
2514 | | line_type_3 | |
2515 | | <error> | |
2516 | ||
2517 | eofile: /^\Z/ | |
2518 | ||
2519 | The addition of the C<eofile> subrule to the first production means that | |
2520 | a file only matches a series of successful C<line> matches I<that consume the | |
2521 | complete input text>. If any input text remains after the lines are matched, | |
2522 | there must have been an error in the last C<line>. In that case the C<eofile> | |
2523 | rule will fail, causing the entire C<file> rule to fail too. | |
2524 | ||
2525 | Note too that C<eofile> must match C</^\Z/> (end-of-text), I<not> | |
2526 | C</^\cZ/> or C</^\cD/> (end-of-file). | |
2527 | ||
2528 | And don't forget the action at the end of the production. If you just | |
2529 | write: | |
2530 | ||
2531 | file: line(s) eofile | |
2532 | ||
2533 | then the value returned by the C<file> rule will be the value of its | |
2534 | last item: C<eofile>. Since C<eofile> always returns an empty string | |
2535 | on success, that will cause the C<file> rule to return that empty | |
2536 | string. Apart from returning the wrong value, returning an empty string | |
2537 | will trip up code such as: | |
2538 | ||
2539 | $parser->file($filetext) || die; | |
2540 | ||
2541 | (since "" is false). | |
2542 | ||
2543 | Remember that Parse::RecDescent returns undef on failure, | |
2544 | so the only safe test for failure is: | |
2545 | ||
2546 | defined($parser->file($filetext)) || die; | |
2547 | ||
2548 | ||
2549 | =head1 DIAGNOSTICS | |
2550 | ||
2551 | Diagnostics are intended to be self-explanatory (particularly if you | |
2552 | use B<-RD_HINT> (under B<perl -s>) or define C<$::RD_HINT> inside the program). | |
2553 | ||
2554 | C<Parse::RecDescent> currently diagnoses the following: | |
2555 | ||
2556 | =over 4 | |
2557 | ||
2558 | =item * | |
2559 | ||
2560 | Invalid regular expressions used as pattern terminals (fatal error). | |
2561 | ||
2562 | =item * | |
2563 | ||
2564 | Invalid Perl code in code blocks (fatal error). | |
2565 | ||
2566 | =item * | |
2567 | ||
2568 | Lookahead used in the wrong place or in a nonsensical way (fatal error). | |
2569 | ||
2570 | =item * | |
2571 | ||
2572 | "Obvious" cases of left-recursion (fatal error). | |
2573 | ||
2574 | =item * | |
2575 | ||
2576 | Missing or extra components in a C<E<lt>leftopE<gt>> or C<E<lt>rightopE<gt>> | |
2577 | directive. | |
2578 | ||
2579 | =item * | |
2580 | ||
2581 | Unrecognisable components in the grammar specification (fatal error). | |
2582 | ||
2583 | =item * | |
2584 | ||
2585 | "Orphaned" rule components specified before the first rule (fatal error) | |
2586 | or after an C<E<lt>errorE<gt>> directive (level 3 warning). | |
2587 | ||
2588 | =item * | |
2589 | ||
2590 | Missing rule definitions (this only generates a level 3 warning, since you | |
2591 | may be providing them later via C<Parse::RecDescent::Extend()>). | |
2592 | ||
2593 | =item * | |
2594 | ||
2595 | Instances where greedy repetition behaviour will almost certainly | |
2596 | cause the failure of a production (a level 3 warning - see | |
2597 | L<"ON-GOING ISSUES AND FUTURE DIRECTIONS"> below). | |
2598 | ||
2599 | =item * | |
2600 | ||
2601 | Attempts to define rules named 'Replace' or 'Extend', which cannot be | |
2602 | called directly through the parser object because of the predefined | |
2603 | meaning of C<Parse::RecDescent::Replace> and | |
2604 | C<Parse::RecDescent::Extend>. (Only a level 2 warning is generated, since | |
2605 | such rules I<can> still be used as subrules). | |
2606 | ||
2607 | =item * | |
2608 | ||
2609 | Productions which consist of a single C<E<lt>error?E<gt>> | |
2610 | directive, and which therefore may succeed unexpectedly | |
2611 | (a level 2 warning, since this might conceivably be the desired effect). | |
2612 | ||
2613 | =item * | |
2614 | ||
2615 | Multiple consecutive lookahead specifiers (a level 1 warning only, since their | |
2616 | effects simply accumulate). | |
2617 | ||
2618 | =item * | |
2619 | ||
2620 | Productions which start with a C<E<lt>rejectE<gt>> or C<E<lt>rulevar:...E<gt>> | |
2621 | directive. Such productions are optimized away (a level 1 warning). | |
2622 | ||
2623 | =item * | |
2624 | ||
2625 | Rules which are autogenerated under C<$::RD_AUTOSTUB> (a level 1 warning). | |
2626 | ||
2627 | =back | |
2628 | ||
2629 | =head1 AUTHOR | |
2630 | ||
2631 | Damian Conway (damian@conway.org) | |
2632 | ||
2633 | =head1 BUGS AND IRRITATIONS | |
2634 | ||
2635 | There are undoubtedly serious bugs lurking somewhere in this much code :-) | |
2636 | Bug reports and other feedback are most welcome. | |
2637 | ||
2638 | Ongoing annoyances include: | |
2639 | ||
2640 | =over 4 | |
2641 | ||
2642 | =item * | |
2643 | ||
2644 | There's no support for parsing directly from an input stream. | |
2645 | If and when the Perl Gods give us regular expressions on streams, | |
2646 | this should be trivial (ahem!) to implement. | |
2647 | ||
2648 | =item * | |
2649 | ||
2650 | The parser generator can get confused if actions aren't properly | |
2651 | closed or if they contain particularly nasty Perl syntax errors | |
2652 | (especially unmatched curly brackets). | |
2653 | ||
2654 | =item * | |
2655 | ||
2656 | The generator only detects the most obvious form of left recursion | |
2657 | (potential recursion on the first subrule in a rule). More subtle | |
2658 | forms of left recursion (for example, through the second item in a | |
2659 | rule after a "zero" match of a preceding "zero-or-more" repetition, | |
2660 | or after a match of a subrule with an empty production) are not found. | |
2661 | ||
2662 | =item * | |
2663 | ||
2664 | Instead of complaining about left-recursion, the generator should | |
2665 | silently transform the grammar to remove it. Don't expect this | |
2666 | feature any time soon as it would require a more sophisticated | |
2667 | approach to parser generation than is currently used. | |
2668 | ||
2669 | =item * | |
2670 | ||
2671 | The generated parsers don't always run as fast as might be wished. | |
2672 | ||
2673 | =item * | |
2674 | ||
2675 | The meta-parser should be bootstrapped using C<Parse::RecDescent> :-) | |
2676 | ||
2677 | =back | |
2678 | ||
2679 | =head1 ON-GOING ISSUES AND FUTURE DIRECTIONS | |
2680 | ||
2681 | =over 4 | |
2682 | ||
2683 | =item 1. | |
2684 | ||
2685 | Repetitions are "incorrigibly greedy" in that they will eat everything they can | |
2686 | and won't backtrack if that behaviour causes a production to fail needlessly. | |
2687 | So, for example: | |
2688 | ||
2689 | rule: subrule(s) subrule | |
2690 | ||
2691 | will I<never> succeed, because the repetition will eat all the | |
2692 | subrules it finds, leaving none to match the second item. Such | |
2693 | constructions are relatively rare (and C<Parse::RecDescent::new> generates a | |
2694 | warning whenever they occur) so this may not be a problem, especially | |
2695 | since the insatiable behaviour can be overcome "manually" by writing: | |
2696 | ||
2697 | rule: penultimate_subrule(s) subrule | |
2698 | ||
2699 | penultimate_subrule: subrule ...subrule | |
2700 | ||
2701 | The issue is that this construction is exactly twice as expensive as the | |
2702 | original, whereas backtracking would add only 1/I<N> to the cost (for | |
2703 | matching I<N> repetitions of C<subrule>). I would welcome feedback on | |
2704 | the need for backtracking; particularly on cases where the lack of it | |
2705 | makes parsing performance problematical. | |
2706 | ||
2707 | =item 2. | |
2708 | ||
2709 | Having opened that can of worms, it's also necessary to consider whether there | |
2710 | is a need for non-greedy repetition specifiers. Again, it's possible (at some | |
2711 | cost) to manually provide the required functionality: | |
2712 | ||
2713 | rule: nongreedy_subrule(s) othersubrule | |
2714 | ||
2715 | nongreedy_subrule: subrule ...!othersubrule | |
2716 | ||
2717 | Overall, the issue is whether the benefit of this extra functionality | |
2718 | outweighs the drawbacks of further complicating the (currently | |
2719 | minimalist) grammar specification syntax, and (worse) introducing more overhead | |
2720 | into the generated parsers. | |
2721 | ||
2722 | =item 3. | |
2723 | ||
2724 | An C<E<lt>autocommitE<gt>> directive would be nice. That is, it would be useful to be | |
2725 | able to say: | |
2726 | ||
2727 | command: <autocommit> | |
2728 | command: 'find' name | |
2729 | | 'find' address | |
2730 | | 'do' command 'at' time 'if' condition | |
2731 | | 'do' command 'at' time | |
2732 | | 'do' command | |
2733 | | unusual_command | |
2734 | ||
2735 | and have the generator work out that this should be "pruned" thus: | |
2736 | ||
2737 | command: 'find' name | |
2738 | | 'find' <commit> address | |
2739 | | 'do' <commit> command <uncommit> | |
2740 | 'at' time | |
2741 | 'if' <commit> condition | |
2742 | | 'do' <commit> command <uncommit> | |
2743 | 'at' <commit> time | |
2744 | | 'do' <commit> command | |
2745 | | unusual_command | |
2746 | ||
2747 | There are several issues here. Firstly, should the | |
2748 | C<E<lt>autocommitE<gt>> automatically install an C<E<lt>uncommitE<gt>> | |
2749 | at the start of the last production (on the grounds that the "command" | |
2750 | rule doesn't know whether an "unusual_command" might start with "find" | |
2751 | or "do") or should the "unusual_command" subgraph be analysed (to see | |
2752 | if it I<might> be viable after a "find" or "do")? | |
2753 | ||
2754 | The second issue is how regular expressions should be treated. The simplest | |
2755 | approach would be simply to uncommit before them (on the grounds that they | |
2756 | I<might> match). Better efficiency would be obtained by analyzing all preceding | |
2757 | literal tokens to determine whether the pattern would match them. | |
2758 | ||
2759 | Overall, the issues are: can such automated "pruning" approach a hand-tuned | |
2760 | version sufficiently closely to warrant the extra set-up expense, and (more | |
2761 | importantly) is the problem important enough to even warrant the non-trivial | |
2762 | effort of building an automated solution? | |
2763 | ||
2764 | =back | |
2765 | ||
2766 | =head1 COPYRIGHT | |
2767 | ||
2768 | Copyright (c) 1997-2000, Damian Conway. All Rights Reserved. | |
2769 | This module is free software. It may be used, redistributed | |
2770 | and/or modified under the terms of the Perl Artistic License | |
2771 | (see http://www.perl.com/perl/misc/Artistic.html) |