1=head1 NAME
2
3Parse::RecDescent - Generate Recursive-Descent Parsers
4
5=head1 VERSION
6
7This document describes version 1.79 of Parse::RecDescent,
8released August 21, 2000.
9
10=head1 SYNOPSIS
11
12 use Parse::RecDescent;
13
14 # Generate a parser from the specification in $grammar:
15
16 $parser = new Parse::RecDescent ($grammar);
17
18 # Generate a parser from the specification in $othergrammar
19
20 $anotherparser = new Parse::RecDescent ($othergrammar);
21
22
23 # Parse $text using rule 'startrule' (which must be
24 # defined in $grammar):
25
26 $parser->startrule($text);
27
28
29 # Parse $text using rule 'otherrule' (which must also
30 # be defined in $grammar):
31
32 $parser->otherrule($text);
33
34
35 # Change the universal token prefix pattern
36 # (the default is: '\s*'):
37
38 $Parse::RecDescent::skip = '[ \t]+';
39
40
41 # Replace productions of existing rules (or create new ones)
42 # with the productions defined in $newgrammar:
43
44 $parser->Replace($newgrammar);
45
46
47 # Extend existing rules (or create new ones)
48 # by adding extra productions defined in $moregrammar:
49
50 $parser->Extend($moregrammar);
51
52
 # Global flags (useful as command line arguments under -s):

 $::RD_ERRORS       # unless undefined, report fatal errors
 $::RD_WARN         # unless undefined, also report non-fatal problems
 $::RD_HINT         # if defined, also suggest remedies
 $::RD_TRACE        # if defined, also trace parsers' behaviour
 $::RD_AUTOSTUB     # if defined, generates "stubs" for undefined rules
 $::RD_AUTOACTION   # if defined, appends specified action to productions
61
62
63=head1 DESCRIPTION
64
65=head2 Overview
66
67Parse::RecDescent incrementally generates top-down recursive-descent text
68parsers from simple I<yacc>-like grammar specifications. It provides:
69
70=over 4
71
72=item *
73
74Regular expressions or literal strings as terminals (tokens),
75
76=item *
77
78Multiple (non-contiguous) productions for any rule,
79
80=item *
81
82Repeated and optional subrules within productions,
83
84=item *
85
86Full access to Perl within actions specified as part of the grammar,
87
88=item *
89
90Simple automated error reporting during parser generation and parsing,
91
92=item *
93
94The ability to commit to, uncommit to, or reject particular
95productions during a parse,
96
97=item *
98
The ability to pass data up and down the parse tree ("down" via subrule
argument lists, "up" via subrule return values),
101
102=item *
103
104Incremental extension of the parsing grammar (even during a parse),
105
106=item *
107
108Precompilation of parser objects,
109
110=item *
111
112User-definable reduce-reduce conflict resolution via
113"scoring" of matching productions.
114
115=back
116
117=head2 Using C<Parse::RecDescent>
118
119Parser objects are created by calling C<Parse::RecDescent::new>, passing in a
120grammar specification (see the following subsections). If the grammar is
121correct, C<new> returns a blessed reference which can then be used to initiate
122parsing through any rule specified in the original grammar. A typical sequence
123looks like this:
124
125 $grammar = q {
126 # GRAMMAR SPECIFICATION HERE
127 };
128
129 $parser = new Parse::RecDescent ($grammar) or die "Bad grammar!\n";
130
131 # acquire $text
132
133 defined $parser->startrule($text) or print "Bad text!\n";
134
The rule through which parsing is initiated must be explicitly defined
in the grammar (i.e. for the above example, the grammar must include a
rule of the form: "startrule: <subrules>").
138
139If the starting rule succeeds, its value (see below)
140is returned. Failure to generate the original parser or failure to match a text
141is indicated by returning C<undef>. Note that it's easy to set up grammars
142that can succeed, but which return a value of 0, "0", or "". So don't be
143tempted to write:
144
145 $parser->startrule($text) or print "Bad text!\n";
146
147Normally, the parser has no effect on the original text. So in the
148previous example the value of $text would be unchanged after having
149been parsed.
150
151If, however, the text to be matched is passed by reference:
152
153 $parser->startrule(\$text)
154
155then any text which was consumed during the match will be removed from the
156start of $text.
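
The two points above can be sketched as follows. The one-rule grammar and
the rule name C<number> are invented for this illustration; the sketch
assumes only the by-value and by-reference calling conventions described
in this section.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Hypothetical one-rule grammar: 'number' matches a run of digits.
my $parser = Parse::RecDescent->new(q{
    number: /\d+/
}) or die "Bad grammar!\n";

my $text = "0 and the rest";

# Passing the string by value leaves $text untouched. The match succeeds
# even though the returned value ("0") is false, hence the defined() test.
my $result = $parser->number($text);
print "matched, although the value is false\n"
    if defined $result && !$result;

# Passing a reference removes the consumed text from the front of $text.
$parser->number(\$text);
print "remaining text: '$text'\n";
```

Testing the result with C<defined> rather than for truth is what keeps the
"0" case from being reported as a failure.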
157
158
159=head2 Rules
160
161In the grammar from which the parser is built, rules are specified by
162giving an identifier (which must satisfy /[A-Za-z]\w*/), followed by a
163colon I<on the same line>, followed by one or more productions,
164separated by single vertical bars. The layout of the productions
165is entirely free-format:
166
167 rule1: production1
168 | production2 |
169 production3 | production4
170
171At any point in the grammar previously defined rules may be extended with
172additional productions. This is achieved by redeclaring the rule with the new
173productions. Thus:
174
175 rule1: a | b | c
176 rule2: d | e | f
177 rule1: g | h
178
179is exactly equivalent to:
180
181 rule1: a | b | c | g | h
182 rule2: d | e | f
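
As a small sketch of this equivalence (the single-letter literals are
placeholders chosen for the illustration), a rule that is declared twice
accepts input through the productions of either declaration:

```perl
use strict;
use warnings;
use Parse::RecDescent;

# rule1 is declared twice; the second declaration adds productions.
my $parser = Parse::RecDescent->new(q{
    rule1: 'a' | 'b' | 'c'
    rule2: 'd' | 'e' | 'f'
    rule1: 'g' | 'h'
}) or die "Bad grammar!\n";

my $from_first  = $parser->rule1("a");   # production from first declaration
my $from_second = $parser->rule1("h");   # production from second declaration
my $no_match    = $parser->rule1("d");   # 'd' belongs to rule2 only

print "first:  $from_first\n";
print "second: $from_second\n";
print "rule1 does not match 'd'\n" unless defined $no_match;
```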
183
184Each production in a rule consists of zero or more items, each of which
185may be either: the name of another rule to be matched (a "subrule"),
186a pattern or string literal to be matched directly (a "token"), a
187block of Perl code to be executed (an "action"), a special instruction
188to the parser (a "directive"), or a standard Perl comment (which is
189ignored).
190
191A rule matches a text if one of its productions matches. A production
192matches if each of its items match consecutive substrings of the
193text. The productions of a rule being matched are tried in the same
194order that they appear in the original grammar, and the first matching
195production terminates the match attempt (successfully). If all
196productions are tried and none matches, the match attempt fails.
197
198Note that this behaviour is quite different from the "prefer the longer match"
199behaviour of I<yacc>. For example, if I<yacc> were parsing the rule:
200
201 seq : 'A' 'B'
202 | 'A' 'B' 'C'
203
upon matching "AB" it would look ahead to see if a 'C' is next and, if
so, would match the second production in preference to the first. In
other words, I<yacc> effectively tries all the productions of a rule
breadth-first in parallel, and selects the "best" match, where "best"
means longest (note that this is a gross simplification of the true
behaviour of I<yacc> but it will do for our purposes).
210
211In contrast, C<Parse::RecDescent> tries each production depth-first in
212sequence, and selects the "best" match, where "best" means first. This is
213the fundamental difference between "bottom-up" and "recursive descent"
214parsing.
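
The "first match wins" behaviour can be observed with a sketch like the
following. The rule names C<short_first> and C<long_first> and the strings
returned by the actions are invented for the illustration.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Two rules with the same productions in opposite order. Each action
# returns a label so we can see which production matched.
my $parser = Parse::RecDescent->new(q{
    short_first: 'A' 'B'      { 'AB production' }
               | 'A' 'B' 'C'  { 'ABC production' }

    long_first:  'A' 'B' 'C'  { 'ABC production' }
               | 'A' 'B'      { 'AB production' }
}) or die "Bad grammar!\n";

my $short = $parser->short_first("ABC");  # stops at the first match
my $long  = $parser->long_first("ABC");   # longer production is tried first

print "short_first chose: $short\n";
print "long_first chose:  $long\n";
```

In practice this means that longer alternatives usually need to be listed
before their own prefixes.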
215
216Each successfully matched item in a production is assigned a value,
217which can be accessed in subsequent actions within the same
218production (or, in some cases, as the return value of a successful
219subrule call). Unsuccessful items don't have an associated value,
220since the failure of an item causes the entire surrounding production
221to immediately fail. The following sections describe the various types
222of items and their success values.
223
224
225=head2 Subrules
226
227A subrule which appears in a production is an instruction to the parser to
228attempt to match the named rule at that point in the text being
229parsed. If the named subrule is not defined when requested the
230production containing it immediately fails (unless it was "autostubbed" - see
231L<Autostubbing>).
232
233A rule may (recursively) call itself as a subrule, but I<not> as the
234left-most item in any of its productions (since such recursions are usually
235non-terminating).
236
237The value associated with a subrule is the value associated with its
238C<$return> variable (see L<"Actions"> below), or with the last successfully
239matched item in the subrule match.
240
241Subrules may also be specified with a trailing repetition specifier,
242indicating that they are to be (greedily) matched the specified number
243of times. The available specifiers are:
244
245 subrule(?) # Match one-or-zero times
246 subrule(s) # Match one-or-more times
247 subrule(s?) # Match zero-or-more times
248 subrule(N) # Match exactly N times for integer N > 0
249 subrule(N..M) # Match between N and M times
250 subrule(..M) # Match between 1 and M times
251 subrule(N..) # Match at least N times
252
253Repeated subrules keep matching until either the subrule fails to
254match, or it has matched the minimal number of times but fails to
255consume any of the parsed text (this second condition prevents the
256subrule matching forever in some cases).
257
258Since a repeated subrule may match many instances of the subrule itself, the
259value associated with it is not a simple scalar, but rather a reference to a
260list of scalars, each of which is the value associated with one of the
261individual subrule matches. In other words in the rule:
262
263 program: statement(s)
264
265the value associated with the repeated subrule "statement(s)" is a reference
266to an array containing the values matched by each call to the individual
267subrule "statement".
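
A minimal sketch of this, assuming an invented C<statement> rule that
matches a word followed by a semicolon and returns the word:

```perl
use strict;
use warnings;
use Parse::RecDescent;

# 'statement' is a placeholder rule: a word followed by a semicolon.
# Its action returns the word, so that is the value of each repetition.
my $parser = Parse::RecDescent->new(q{
    program:   statement(s)
    statement: /\w+/ ';'   { $item[1] }
}) or die "Bad grammar!\n";

# The value of "statement(s)" is an array reference, and because it is
# the last item in the production it is also the value of 'program'.
my $statements = $parser->program("a; b; c;");

print "type:   ", ref($statements), "\n";
print "values: @$statements\n";
```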
268
Repetition modifiers may include a separator pattern:
270
271 program: statement(s /;/)
272
273specifying some sequence of characters to be skipped between each repetition.
274This is really just a shorthand for the E<lt>leftop:...E<gt> directive
275(see below).
276
277=head2 Tokens
278
279If a quote-delimited string or a Perl regex appears in a production,
280the parser attempts to match that string or pattern at that point in
281the text. For example:
282
283 typedef: "typedef" typename identifier ';'
284
285 identifier: /[A-Za-z_][A-Za-z0-9_]*/
286
287As in regular Perl, a single quoted string is uninterpolated, whilst
288a double-quoted string or a pattern is interpolated (at the time
289of matching, I<not> when the parser is constructed). Hence, it is
290possible to define rules in which tokens can be set at run-time:
291
292 typedef: "$::typedefkeyword" typename identifier ';'
293
294 identifier: /$::identpat/
295
Note that, since each rule is implemented inside a special namespace
belonging to its parser, it is necessary to explicitly qualify
variables from the main package.
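
The following sketch illustrates a token that is set at run-time. The
package variable C<$identpat> and the sample inputs are assumptions made
for the example; the grammar itself is the C<identifier> rule shown above.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# A package variable in main, referred to as $::identpat in the grammar.
our $identpat = '[a-z]+';

# q{} does not interpolate, so the parser sees the text '$::identpat'
# and interpolates it each time the token is matched.
my $parser = Parse::RecDescent->new(q{
    identifier: /$::identpat/
}) or die "Bad grammar!\n";

my $word = $parser->identifier("abc123");    # matches letters

$identpat = '\d+';                            # change the token pattern
my $digits = $parser->identifier("123abc");  # same rule now matches digits

print "word:   $word\n";
print "digits: $digits\n";
```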
299
300Regex tokens can be specified using just slashes as delimiters
301or with the explicit C<mE<lt>delimiterE<gt>......E<lt>delimiterE<gt>> syntax:
302
303 typedef: "typedef" typename identifier ';'
304
305 typename: /[A-Za-z_][A-Za-z0-9_]*/
306
307 identifier: m{[A-Za-z_][A-Za-z0-9_]*}
308
309A regex of either type can also have any valid trailing parameter(s)
310(that is, any of [cgimsox]):
311
312 typedef: "typedef" typename identifier ';'
313
314 identifier: / [a-z_] # LEADING ALPHA OR UNDERSCORE
315 [a-z0-9_]* # THEN DIGITS ALSO ALLOWED
316 /ix # CASE/SPACE/COMMENT INSENSITIVE
317
318The value associated with any successfully matched token is a string
319containing the actual text which was matched by the token.
320
321It is important to remember that, since each grammar is specified in a
322Perl string, all instances of the universal escape character '\' within
323a grammar must be "doubled", so that they interpolate to single '\'s when
324the string is compiled. For example, to use the grammar:
325
326 word: /\S+/ | backslash
327 line: prefix word(s) "\n"
328 backslash: '\\'
329
330the following code is required:
331
332 $parser = new Parse::RecDescent (q{
333
334 word: /\\S+/ | backslash
335 line: prefix word(s) "\\n"
336 backslash: '\\\\'
337
338 });
339
340
341=head2 Terminal Separators
342
343For the purpose of matching, each terminal in a production is considered
344to be preceded by a "prefix" - a pattern which must be
345matched before a token match is attempted. By default, the
346prefix is optional whitespace (which always matches, at
347least trivially), but this default may be reset in any production.
348
349The variable C<$Parse::RecDescent::skip> stores the universal
350prefix, which is the default for all terminal matches in all parsers
351built with C<Parse::RecDescent>.
352
353The prefix for an individual production can be altered
354by using the C<E<lt>skip:...E<gt>> directive (see below).
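
As a hedged sketch of the difference a prefix makes, the two rules below
(C<loose> and C<tight> are invented names) match the same pair of words,
but C<tight> only skips spaces and tabs, so it refuses to step over a
newline:

```perl
use strict;
use warnings;
use Parse::RecDescent;

# 'loose' uses the default prefix, which also skips newlines.
# 'tight' resets the prefix for its production to spaces and tabs only.
my $parser = Parse::RecDescent->new(q{
    loose: /\w+/ /\w+/
    tight: <skip: '[ \t]*'> /\w+/ /\w+/
}) or die "Bad grammar!\n";

my $loose     = $parser->loose("hello\nworld");   # newline is skipped
my $tight     = $parser->tight("hello\nworld");   # newline blocks the match
my $same_line = $parser->tight("hello world");    # a space is still skipped

print "loose matched across lines\n"      if defined $loose;
print "tight failed across lines\n"       unless defined $tight;
print "tight matched on a single line\n"  if defined $same_line;
```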
355
356
357=head2 Actions
358
359An action is a block of Perl code which is to be executed (as the
360block of a C<do> statement) when the parser reaches that point in a
361production. The action executes within a special namespace belonging to
362the active parser, so care must be taken in correctly qualifying variable
363names (see also L<Start-up Actions> below).
364
365The action is considered to succeed if the final value of the block
366is defined (that is, if the implied C<do> statement evaluates to a
367defined value - I<even one which would be treated as "false">). Note
368that the value associated with a successful action is also the final
369value in the block.
370
371An action will I<fail> if its last evaluated value is C<undef>. This is
372surprisingly easy to accomplish by accident. For instance, here's an
373infuriating case of an action that makes its production fail, but only
374when debugging I<isn't> activated:
375
376 description: name rank serial_number
377 { print "Got $item[2] $item[1] ($item[3])\n"
378 if $::debugging
379 }
380
If C<$::debugging> is false, no statement in the block is executed, so
the final value is C<undef>, and the entire production fails. The solution is:
383
384 description: name rank serial_number
385 { print "Got $item[2] $item[1] ($item[3])\n"
386 if $::debugging;
387 1;
388 }
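
The success rule for actions can be checked with a sketch like this one.
The rule names C<zero> and C<nothing> are invented; the first action ends
in a defined but false value, the second in C<undef>.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Both rules match the same token; only the final value of the action
# differs.
my $parser = Parse::RecDescent->new(q{
    zero:    /\d+/ { 0 }
    nothing: /\d+/ { undef }
}) or die "Bad grammar!\n";

my $zero    = $parser->zero("42");     # action value 0 is defined: success
my $nothing = $parser->nothing("42");  # action value undef: production fails

print "zero succeeded with a false value\n" if defined $zero;
print "nothing failed\n"                    unless defined $nothing;
```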
389
390Within an action, a number of useful parse-time variables are
391available in the special parser namespace (there are other variables
392also accessible, but meddling with them will probably just break your
393parser. As a general rule, if you avoid referring to unqualified
394variables - especially those starting with an underscore - inside an action,
395things should be okay):
396
397=over 4
398
399=item C<@item> and C<%item>
400
401The array slice C<@item[1..$#item]> stores the value associated with each item
402(that is, each subrule, token, or action) in the current production. The
403analogy is to C<$1>, C<$2>, etc. in a I<yacc> grammar.
404Note that, for obvious reasons, C<@item> only contains the
405values of items I<before> the current point in the production.
406
407The first element (C<$item[0]>) stores the name of the current rule
408being matched.
409
410C<@item> is a standard Perl array, so it can also be indexed with negative
411numbers, representing the number of items I<back> from the current position in
412the parse:
413
414 stuff: /various/ bits 'and' pieces "then" data 'end'
415 { print $item[-2] } # PRINTS data
416 # (EASIER THAN: $item[6])
417
The C<%item> hash complements the C<@item> array, providing named
access to the same item values:

 stuff: /various/ bits 'and' pieces "then" data 'end'
        { print $item{data} }   # PRINTS data
                                # (EVEN EASIER THAN USING @item)
424
425
The results of named subrules are stored in the hash under each
subrule's name, whilst all other items are stored under a "named
positional" key that indicates their ordinal position within their item
type: __STRINGI<n>__, __PATTERNI<n>__, __DIRECTIVEI<n>__, __ACTIONI<n>__:
430
431 stuff: /various/ bits 'and' pieces "then" data 'end' { save }
432 { print $item{__PATTERN1__}, # PRINTS 'various'
433 $item{__STRING2__}, # PRINTS 'then'
434 $item{__ACTION1__}, # PRINTS RETURN
435 # VALUE OF save
436 }
437
438
439If you want proper I<named> access to patterns or literals, you need to turn
440them into separate rules:
441
442 stuff: various bits 'and' pieces "then" data 'end'
443 { print $item{various} # PRINTS various
444 }
445
446 various: /various/
447
448
The special entry C<$item{__RULE__}> stores the name of the current
rule (i.e. the same value as C<$item[0]>).

The advantage of using C<%item> instead of C<@item> is that it
removes the need to track item positions that may change as a grammar
evolves. For example, adding an interim C<E<lt>skipE<gt>> directive
or action can silently ruin a trailing action, by moving an C<@item>
element "down" the array one place. In contrast, the named entry
of C<%item> is unaffected by such an insertion.
458
459A limitation of the C<%item> hash is that it only records the I<last>
460value of a particular subrule. For example:
461
 range: '(' number '..' number ')'
463 { $return = $item{number} }
464
465will return only the value corresponding to the I<second> match of the
466C<number> subrule. In other words, successive calls to a subrule
467overwrite the corresponding entry in C<%item>. Once again, the
468solution is to rename each subrule in its own rule:
469
 range: '(' from_num '..' to_num ')'
471 { $return = $item{from_num} }
472
473 from_num: number
474 to_num: number
475
476
477
478=item C<@arg> and C<%arg>
479
The array C<@arg> and the hash C<%arg> store any arguments passed to
the rule from some other rule (see L<"Subrule argument lists">). Changes
to the elements of either variable do not propagate back to the calling
rule (data can be passed back from a subrule via the C<$return>
variable - see next item).
485
486
487=item C<$return>
488
489If a value is assigned to C<$return> within an action, that value is
490returned if the production containing the action eventually matches
491successfully. Note that setting C<$return> I<doesn't> cause the current
492production to succeed. It merely tells it what to return if it I<does> succeed.
493Hence C<$return> is analogous to C<$$> in a I<yacc> grammar.
494
495If C<$return> is not assigned within a production, the value of the
496last component of the production (namely: C<$item[$#item]>) is
497returned if the production succeeds.
498
499
500=item C<$commit>
501
502The current state of commitment to the current production (see L<"Directives">
503below).
504
505=item C<$skip>
506
507The current terminal prefix (see L<"Directives"> below).
508
509=item C<$text>
510
511The remaining (unparsed) text. Changes to C<$text> I<do not
512propagate> out of unsuccessful productions, but I<do> survive
513successful productions. Hence it is possible to dynamically alter the
514text being parsed - for example, to provide a C<#include>-like facility:
515
516 hash_include: '#include' filename
517 { $text = ::loadfile($item[2]) . $text }
518
519 filename: '<' /[a-z0-9._-]+/i '>' { $return = $item[2] }
520 | '"' /[a-z0-9._-]+/i '"' { $return = $item[2] }
521
522
523=item C<$thisline> and C<$prevline>
524
525C<$thisline> stores the current line number within the current parse
526(starting from 1). C<$prevline> stores the line number for the last
527character which was already successfully parsed (this will be different from
528C<$thisline> at the end of each line).
529
For efficiency, C<$thisline> and C<$prevline> are actually tied
scalars, and only recompute the required line number when the variable's
value is used.
533
Assignment to C<$thisline> adjusts the line number calculator, so that
it believes that the current line number is the value being assigned. Note
that this adjustment will be reflected in all subsequent line number
calculations.
538
539Modifying the value of the variable C<$text> (as in the previous
540C<hash_include> example, for instance) will confuse the line
541counting mechanism. To prevent this, you should call
542C<Parse::RecDescent::LineCounter::resync($thisline)> I<immediately>
543after any assignment to the variable C<$text> (or, at least, before the
544next attempt to use C<$thisline>).
545
546Note that if a production fails after assigning to or
547resync'ing C<$thisline>, the parser's line counter mechanism will
548usually be corrupted.
549
550Also see the entry for C<@itempos>.
551
552The line number can be set to values other than 1, by calling the start
553rule with a second argument. For example:
554
555 $parser = new Parse::RecDescent ($grammar);
556
557 $parser->input($text, 10); # START LINE NUMBERS AT 10
558
559
560=item C<$thiscolumn> and C<$prevcolumn>
561
562C<$thiscolumn> stores the current column number within the current line
563being parsed (starting from 1). C<$prevcolumn> stores the column number
564of the last character which was actually successfully parsed. Usually
565C<$prevcolumn == $thiscolumn-1>, but not at the end of lines.
566
For efficiency, C<$thiscolumn> and C<$prevcolumn> are
actually tied scalars, and only recompute the required column number
when the variable's value is used.
570
571Assignment to C<$thiscolumn> or C<$prevcolumn> is a fatal error.
572
573Modifying the value of the variable C<$text> (as in the previous
574C<hash_include> example, for instance) may confuse the column
575counting mechanism.
576
577Note that C<$thiscolumn> reports the column number I<before> any
578whitespace that might be skipped before reading a token. Hence
579if you wish to know where a token started (and ended) use something like this:
580
581 rule: token1 token2 startcol token3 endcol token4
582 { print "token3: columns $item[3] to $item[5]"; }
583
584 startcol: // { $thiscolumn } # NEED THE // TO STEP PAST TOKEN SEP
585 endcol: { $prevcolumn }
586
587Also see the entry for C<@itempos>.
588
589=item C<$thisoffset> and C<$prevoffset>
590
591C<$thisoffset> stores the offset of the current parsing position
592within the complete text
593being parsed (starting from 0). C<$prevoffset> stores the offset
594of the last character which was actually successfully parsed. In all
595cases C<$prevoffset == $thisoffset-1>.
596
For efficiency, C<$thisoffset> and C<$prevoffset> are
actually tied scalars, and only recompute the required offset
when the variable's value is used.

Assignment to C<$thisoffset> or C<$prevoffset> is a fatal error.
602
603Modifying the value of the variable C<$text> will I<not> affect the
604offset counting mechanism.
605
606Also see the entry for C<@itempos>.
607
608=item C<@itempos>
609
610The array C<@itempos> stores a hash reference corresponding to
611each element of C<@item>. The elements of the hash provide the
612following:
613
614 $itempos[$n]{offset}{from} # VALUE OF $thisoffset BEFORE $item[$n]
615 $itempos[$n]{offset}{to} # VALUE OF $prevoffset AFTER $item[$n]
616 $itempos[$n]{line}{from} # VALUE OF $thisline BEFORE $item[$n]
617 $itempos[$n]{line}{to} # VALUE OF $prevline AFTER $item[$n]
618 $itempos[$n]{column}{from} # VALUE OF $thiscolumn BEFORE $item[$n]
619 $itempos[$n]{column}{to} # VALUE OF $prevcolumn AFTER $item[$n]
620
621Note that the various C<$itempos[$n]...{from}> values record the
622appropriate value I<after> any token prefix has been skipped.
623
624Hence, instead of the somewhat tedious and error-prone:
625
626 rule: startcol token1 endcol
627 startcol token2 endcol
628 startcol token3 endcol
629 { print "token1: columns $item[1]
630 to $item[3]
631 token2: columns $item[4]
632 to $item[6]
633 token3: columns $item[7]
634 to $item[9]" }
635
636 startcol: // { $thiscolumn } # NEED THE // TO STEP PAST TOKEN SEP
637 endcol: { $prevcolumn }
638
639it is possible to write:
640
641 rule: token1 token2 token3
642 { print "token1: columns $itempos[1]{column}{from}
643 to $itempos[1]{column}{to}
644 token2: columns $itempos[2]{column}{from}
645 to $itempos[2]{column}{to}
646 token3: columns $itempos[3]{column}{from}
647 to $itempos[3]{column}{to}" }
648
649Note however that (in the current implementation) the use of C<@itempos>
650anywhere in a grammar implies that item positioning information is
651collected I<everywhere> during the parse. Depending on the grammar
652and the size of the text to be parsed, this may be prohibitively
653expensive and the explicit use of C<$thisline>, C<$thiscolumn>, etc. may
654be a better choice.
655
656
657=item C<$thisparser>
658
659A reference to the S<C<Parse::RecDescent>> object through which
660parsing was initiated.
661
662The value of C<$thisparser> propagates down the subrules of a parse
663but not back up. Hence, you can invoke subrules from another parser
664for the scope of the current rule as follows:
665
666 rule: subrule1 subrule2
667 | { $thisparser = $::otherparser } <reject>
668 | subrule3 subrule4
669 | subrule5
670
The result is that the first production calls "subrule1" and "subrule2" of
the current parser, and the remaining productions call the named subrules
from C<$::otherparser>. Note, however, that "Bad Things" will happen if
C<$::otherparser> isn't a blessed reference and/or doesn't have methods
with the same names as the required subrules!
676
677=item C<$thisrule>
678
679A reference to the S<C<Parse::RecDescent::Rule>> object corresponding to the
680rule currently being matched.
681
682=item C<$thisprod>
683
684A reference to the S<C<Parse::RecDescent::Production>> object
685corresponding to the production currently being matched.
686
687=item C<$score> and C<$score_return>
688
C<$score> stores the best production score to date, as specified by
an earlier C<E<lt>score:...E<gt>> directive. C<$score_return> stores
the corresponding return value for the successful production.
692
693See L<Scored productions>.
694
695=back
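
As a brief sketch of C<%item> and C<$return> working together, the grammar
below uses the C<from_num>/C<to_num> renaming described under C<%item>.
The C<number> rule and the sample input are assumptions made for the
example.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Renaming the two uses of 'number' gives each its own %item entry.
# $return sets the value of 'range' if the production succeeds.
my $parser = Parse::RecDescent->new(q{
    range:    '(' from_num '..' to_num ')'
              { $return = [ $item{from_num}, $item{to_num} ] }

    from_num: number
    to_num:   number
    number:   /\d+/
}) or die "Bad grammar!\n";

my $bounds = $parser->range("(3..17)");

print "from $bounds->[0] to $bounds->[1]\n" if defined $bounds;
```

Without the assignment to C<$return>, the value of C<range> would be the
value of its last item, which here is the action's own value.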
696
697B<Warning:> the parser relies on the information in the various C<this...>
698objects in some non-obvious ways. Tinkering with the other members of
699these objects will probably cause Bad Things to happen, unless you
700I<really> know what you're doing. The only exception to this advice is
701that the use of C<$this...-E<gt>{local}> is always safe.
702
703
704=head2 Start-up Actions
705
706Any actions which appear I<before> the first rule definition in a
707grammar are treated as "start-up" actions. Each such action is
708stripped of its outermost brackets and then evaluated (in the parser's
709special namespace) just before the rules of the grammar are first
710compiled.
711
712The main use of start-up actions is to declare local variables within the
713parser's special namespace:
714
 { my $lastitem = '???'; }

 list: item(s)   { $return = $lastitem }

 item: book      { $lastitem = 'book'; }
     | bell      { $lastitem = 'bell'; }
     | candle    { $lastitem = 'candle'; }
722
723but start-up actions can be used to execute I<any> valid Perl code
724within a parser's special namespace.
725
726Start-up actions can appear within a grammar extension or replacement
727(that is, a partial grammar installed via C<Parse::RecDescent::Extend()> or
728C<Parse::RecDescent::Replace()> - see L<Incremental Parsing>), and will be
729executed before the new grammar is installed. Note, however, that a
730particular start-up action is only ever executed once.
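
A runnable sketch along the same lines, assuming an invented counter
variable C<$count> that is declared in a start-up action and then updated
by the rules of the grammar:

```perl
use strict;
use warnings;
use Parse::RecDescent;

# The first block is a start-up action: it appears before any rule and
# declares a variable that the actions below can see.
my $parser = Parse::RecDescent->new(q{
    { my $count = 0; }

    list: item(s)   { $return = $count }
    item: /\w+/     { $count++; 1 }
}) or die "Bad grammar!\n";

my $seen = $parser->list("book bell candle");

print "items seen: $seen\n" if defined $seen;
```

Note that the counter belongs to the parser, not to a single parse, so in
this sketch it would keep growing if C<list> were called again.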
731
732
733=head2 Autoactions
734
It is sometimes desirable to be able to specify a default action to be
taken at the end of every production (for example, in order to easily
build a parse tree). If the variable C<$::RD_AUTOACTION> is defined
when C<Parse::RecDescent::new()> is called, the contents of that
variable are treated as a specification of an action which is to be appended
to each production in the corresponding grammar. So, for example, to construct
a simple parse tree:
742
 $::RD_AUTOACTION = q { [@item] };

 $parser = new Parse::RecDescent (q{
     expression: and_expr '||' expression | and_expr
     and_expr:   not_expr '&&' and_expr   | not_expr
     not_expr:   '!' brack_expr           | brack_expr
     brack_expr: '(' expression ')'       | identifier
     identifier: /[a-z]+/i
 });
752
753which is equivalent to:
754
 $parser = new Parse::RecDescent (q{
     expression: and_expr '||' expression
                 { [@item] }
               | and_expr
                 { [@item] }

     and_expr:   not_expr '&&' and_expr
                 { [@item] }
               | not_expr
                 { [@item] }

     not_expr:   '!' brack_expr
                 { [@item] }
               | brack_expr
                 { [@item] }

     brack_expr: '(' expression ')'
                 { [@item] }
               | identifier
                 { [@item] }

     identifier: /[a-z]+/i
                 { [@item] }
 });
779
Alternatively, we could take an object-oriented approach, using a different
class for each node (and also eliminating redundant intermediate nodes):

 $::RD_AUTOACTION = q
   { $#item==1 ? $item[1] : "$item[0]_node"->new(@item[1..$#item]) };

 $parser = new Parse::RecDescent (q{
     expression: and_expr '||' expression | and_expr
     and_expr:   not_expr '&&' and_expr   | not_expr
     not_expr:   '!' brack_expr           | brack_expr
     brack_expr: '(' expression ')'       | identifier
     identifier: /[a-z]+/i
 });
793
794which is equivalent to:
795
 $parser = new Parse::RecDescent (q{
     expression: and_expr '||' expression
                 { new expression_node (@item[1..3]) }
               | and_expr

     and_expr:   not_expr '&&' and_expr
                 { new and_expr_node (@item[1..3]) }
               | not_expr

     not_expr:   '!' brack_expr
                 { new not_expr_node (@item[1..2]) }
               | brack_expr

     brack_expr: '(' expression ')'
                 { new brack_expr_node (@item[1..3]) }
               | identifier

     identifier: /[a-z]+/i
 });
816
817Note that, if a production already ends in an action, no autoaction is appended
818to it. For example, in this version:
819
 $::RD_AUTOACTION = q
   { $#item==1 ? $item[1] : "$item[0]_node"->new(@item[1..$#item]) };

 $parser = new Parse::RecDescent (q{
     expression: and_expr '||' expression | and_expr
     and_expr:   not_expr '&&' and_expr   | not_expr
     not_expr:   '!' brack_expr           | brack_expr
     brack_expr: '(' expression ')'       | identifier
     identifier: /[a-z]+/i
                 { new terminal_node($item[1]) }
 });
831
each C<identifier> match produces a C<terminal_node> object, I<not> the
plain matched string that the autoaction would otherwise return.
834
835A level 1 warning is issued each time an "autoaction" is added to
836some production.
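
A small runnable sketch of the first autoaction, assuming an invented
three-rule grammar for C<key = value> pairs. Every production receives the
action C<{ [@item] }>, so each match returns an array reference whose first
element is the rule name.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# The autoaction must be defined before the parser is built.
$::RD_AUTOACTION = q{ [@item] };

# None of these productions ends in an action, so each gets the autoaction.
my $parser = Parse::RecDescent->new(q{
    pair:  key '=' value
    key:   /[a-z]+/
    value: /\d+/
}) or die "Bad grammar!\n";

# Avoid affecting parsers that are built later in the same program.
undef $::RD_AUTOACTION;

my $tree = $parser->pair("x = 1");

# $tree has the shape: [ 'pair', [ 'key', 'x' ], '=', [ 'value', '1' ] ]
print "rule:  $tree->[0]\n";
print "key:   $tree->[1][1]\n";
print "value: $tree->[3][1]\n";
```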
837
838
839=head2 Autotrees
840
841A commonly needed autoaction is one that builds a parse-tree. It is moderately
842tricky to set up such an action (which must treat terminals differently from
843non-terminals), so Parse::RecDescent simplifies the process by providing the
844C<E<lt>autotreeE<gt>> directive.
845
If this directive appears at the start of a grammar, it causes
Parse::RecDescent to insert autoactions at the end of any rule except
those which already end in an action. The action inserted depends on whether
the production is an intermediate rule (two or more items), or a terminal
of the grammar (i.e. a single pattern or string item).
851
852So, for example, the following grammar:
853
854 <autotree>
855
856 file : command(s)
857 command : get | set | vet
858 get : 'get' ident ';'
859 set : 'set' ident 'to' value ';'
860 vet : 'check' ident 'is' value ';'
861 ident : /\w+/
862 value : /\d+/
863
864is equivalent to:
865
866 file : command(s) { bless \%item, $item[0] }
867 command : get { bless \%item, $item[0] }
868 | set { bless \%item, $item[0] }
869 | vet { bless \%item, $item[0] }
870 get : 'get' ident ';' { bless \%item, $item[0] }
871 set : 'set' ident 'to' value ';' { bless \%item, $item[0] }
872 vet : 'check' ident 'is' value ';' { bless \%item, $item[0] }
873
874 ident : /\w+/ { bless {__VALUE__=>$item[1]}, $item[0] }
875 value : /\d+/ { bless {__VALUE__=>$item[1]}, $item[0] }
876
877Note that each node in the tree is blessed into a class of the same name
878as the rule itself. This makes it easy to build object-oriented
879processors for the parse-trees that the grammar produces. Note too that
880the last two rules produce special objects with the single attribute
881'__VALUE__'. This is because they consist solely of a single terminal.
882
883This autoaction-ed grammar would then produce a parse tree in a data
884structure like this:
885
 {
   file => {
     command => [
       { get => {
           ident => { __VALUE__ => 'a' },
         },
       },
       { set => {
           ident => { __VALUE__ => 'b' },
           value => { __VALUE__ => '7' },
         },
       },
       { vet => {
           ident => { __VALUE__ => 'b' },
           value => { __VALUE__ => '7' },
         },
       },
     ],
   },
 }
904
905(except, of course, that each nested hash would also be blessed into
906the appropriate class).
907
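As a rough illustration of how such a tree might be consumed, the following
sketch parses a short, made-up input with the grammar above and walks the
result. It assumes that a repeated subrule is stored under a key which includes
its repetition specifier (C<'command(s)'>), and that nodes built from C<%item>
also carry bookkeeping keys beginning with C<__> (such as C<__RULE__>). The
input string and the variable names are illustrative only.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $grammar = q{
    <autotree>

    file    : command(s)
    command : get | set | vet
    get     : 'get' ident ';'
    set     : 'set' ident 'to' value ';'
    vet     : 'check' ident 'is' value ';'
    ident   : /\w+/
    value   : /\d+/
};

my $parser = Parse::RecDescent->new($grammar) or die "Bad grammar\n";
my $tree   = $parser->file("get a; set b to 7;") or die "Parse failed\n";

my @summary;
for my $command (@{ $tree->{'command(s)'} }) {
    # Each command node holds one subrule result plus "__"-prefixed keys
    my ($kind) = grep { !/^__/ } keys %$command;
    my $node   = $command->{$kind};

    # ref() gives the class the node was blessed into, i.e. the rule name
    push @summary, ref($node) . ' ' . $node->{ident}{__VALUE__};
}
print "$_\n" for @summary;
```

Because every node is blessed into a class named after its rule, the same walk
could instead be written as methods in packages called C<get>, C<set>, and so on.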
908
909=head2 Autostubbing
910
Normally, if a subrule appears in some production, but no rule of that
name is ever defined in the grammar, the production which refers to the
non-existent subrule fails immediately. This typically occurs as a
result of misspellings, and is a sufficiently common occurrence that a
warning is generated for such situations.
916
917However, when prototyping a grammar it is sometimes useful to be
918able to use subrules before a proper specification of them is
919really possible. For example, a grammar might include a section like:
920
921 function_call: identifier '(' arg(s?) ')'
922
923 identifier: /[a-z]\w*/i
924
925where the possible format of an argument is sufficiently complex that
926it is not worth specifying in full until the general function call
927syntax has been debugged. In this situation it is convenient to leave
928the real rule C<arg> undefined and just slip in a placeholder (or
929"stub"):
930
931 arg: 'arg'
932
933so that the function call syntax can be tested with dummy input such as:
934
935 f0()
936 f1(arg)
937 f2(arg arg)
938 f3(arg arg arg)
939
940et cetera.
941
942Early in prototyping, many such "stubs" may be required, so
943C<Parse::RecDescent> provides a means of automating their definition.
944If the variable C<$::RD_AUTOSTUB> is defined when a parser is built,
945a subrule reference to any non-existent rule (say, C<sr>),
946causes a "stub" rule of the form:
947
948 sr: 'sr'
949
950to be automatically defined in the generated parser.
951A level 1 warning is issued for each such "autostubbed" rule.
952
Hence, with C<$::RD_AUTOSTUB> defined, it is possible to only partially
specify a grammar, and then "fake" matches of the unspecified
(sub)rules by just typing in their name.
956
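A minimal sketch of autostubbing in use, assuming the function-call grammar
shown earlier in this section. The test inputs are invented for illustration,
and the variable is set to 1 simply to make it defined.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Must be defined before the parser is built
$::RD_AUTOSTUB = 1;

my $parser = Parse::RecDescent->new(q{
    function_call: identifier '(' arg(s?) ')'

    identifier:    /[a-z]\w*/i
}) or die "Bad grammar\n";

# "arg" was never defined, so the stub  arg: 'arg'  is expected to stand in
my @results = map { defined $parser->function_call($_) ? 1 : 0 }
              ('f0()', 'f2(arg arg)', 'f1(42)');
print "@results\n";
```

The last input is there to show that the stub matches only its own name: a
"real" argument such as C<42> still fails until C<arg> is properly specified.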
957
958
959=head2 Look-ahead
960
961If a subrule, token, or action is prefixed by "...", then it is
962treated as a "look-ahead" request. That means that the current production can
963(as usual) only succeed if the specified item is matched, but that the matching
964I<does not consume any of the text being parsed>. This is very similar to the
965C</(?=...)/> look-ahead construct in Perl patterns. Thus, the rule:
966
967 inner_word: word ...word
968
will match whatever the subrule "word" matches, provided that match is followed
by some more text which subrule "word" would also match (although this
second substring is not actually consumed by "inner_word").

Likewise, a "...!" prefix causes the following item to succeed (without
consuming any text) if and only if it would normally fail. Hence, a
rule such as:
976
977 identifier: ...!keyword ...!'_' /[A-Za-z_]\w*/
978
979matches a string of characters which satisfies the pattern
980C</[A-Za-z_]\w*/>, but only if the same sequence of characters would
981not match either subrule "keyword" or the literal token '_'.
982
983Sequences of look-ahead prefixes accumulate, multiplying their positive and/or
984negative senses. Hence:
985
986 inner_word: word ...!......!word
987
is exactly equivalent to the original example above (a warning is issued in
cases like these, since they often indicate something left out, or
misunderstood).
991
992Note that actions can also be treated as look-aheads. In such cases,
993the state of the parser text (in the local variable C<$text>)
994I<after> the look-ahead action is guaranteed to be identical to its
995state I<before> the action, regardless of how it's changed I<within>
996the action (unless you actually undefine C<$text>, in which case you
997get the disaster you deserve :-).
998
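The following sketch shows a negative look-ahead at work. The C<keyword> rule
and the three test strings are assumptions made for the example; they are not
part of any grammar defined elsewhere in this document.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    identifier: ...!keyword /[A-Za-z_]\w*/

    keyword:    /(?:if|else|while)\b/
}) or die "Bad grammar\n";

my %outcome;
for my $input ('while', 'while_loop', 'total') {
    # The look-ahead consumes nothing, so the pattern sees the whole input
    my $match = $parser->identifier($input);
    $outcome{$input} = defined $match ? "identifier '$match'" : 'rejected';
    printf "%-10s => %s\n", $input, $outcome{$input};
}
```

Only the bare keyword should be rejected. The C<\b> in the C<keyword> pattern
keeps C<while_loop> from being mistaken for a keyword followed by more text.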
999
1000=head2 Directives
1001
1002Directives are special pre-defined actions which may be used to alter
1003the behaviour of the parser. There are currently eighteen directives:
1004C<E<lt>commitE<gt>>,
1005C<E<lt>uncommitE<gt>>,
1006C<E<lt>rejectE<gt>>,
1007C<E<lt>scoreE<gt>>,
1008C<E<lt>autoscoreE<gt>>,
1009C<E<lt>skipE<gt>>,
1010C<E<lt>resyncE<gt>>,
1011C<E<lt>errorE<gt>>,
1012C<E<lt>rulevarE<gt>>,
1013C<E<lt>matchruleE<gt>>,
1014C<E<lt>leftopE<gt>>,
1015C<E<lt>rightopE<gt>>,
1016C<E<lt>deferE<gt>>,
1017C<E<lt>nocheckE<gt>>,
1018C<E<lt>perl_quotelikeE<gt>>,
1019C<E<lt>perl_codeblockE<gt>>,
1020C<E<lt>perl_variableE<gt>>,
1021and C<E<lt>tokenE<gt>>.
1022
1023=over 4
1024
1025=item Committing and uncommitting
1026
1027The C<E<lt>commitE<gt>> and C<E<lt>uncommitE<gt>> directives permit the recursive
1028descent of the parse tree to be pruned (or "cut") for efficiency.
1029Within a rule, a C<E<lt>commitE<gt>> directive instructs the rule to ignore subsequent
1030productions if the current production fails. For example:
1031
1032 command: 'find' <commit> filename
1033 | 'open' <commit> filename
1034 | 'move' filename filename
1035
1036Clearly, if the leading token 'find' is matched in the first production but that
1037production fails for some other reason, then the remaining
1038productions cannot possibly match. The presence of the
1039C<E<lt>commitE<gt>> causes the "command" rule to fail immediately if
1040an invalid "find" command is found, and likewise if an invalid "open"
1041command is encountered.
1042
1043It is also possible to revoke a previous commitment. For example:
1044
1045 if_statement: 'if' <commit> condition
1046 'then' block <uncommit>
1047 'else' block
1048 | 'if' <commit> condition
1049 'then' block
1050
1051In this case, a failure to find an "else" block in the first
1052production shouldn't preclude trying the second production, but a
1053failure to find a "condition" certainly should.
1054
1055As a special case, any production in which the I<first> item is an
1056C<E<lt>uncommitE<gt>> immediately revokes a preceding C<E<lt>commitE<gt>>
1057(even though the production would not otherwise have been tried). For
1058example, in the rule:
1059
1060 request: 'explain' expression
1061 | 'explain' <commit> keyword
1062 | 'save'
1063 | 'quit'
1064 | <uncommit> term '?'
1065
if the text being matched was "explain?", and the first two
productions failed, then the C<E<lt>commitE<gt>> in production two would cause
productions three and four to be skipped, but the leading
C<E<lt>uncommitE<gt>> in production five would allow that production to
attempt a match.
1071
1072Note in the preceding example, that the C<E<lt>commitE<gt>> was only placed
1073in production two. If production one had been:
1074
1075 request: 'explain' <commit> expression
1076
1077then production two would be (inappropriately) skipped if a leading
1078"explain..." was encountered.
1079
1080Both C<E<lt>commitE<gt>> and C<E<lt>uncommitE<gt>> directives always succeed, and their value
1081is always 1.
1082
1083
1084=item Rejecting a production
1085
1086The C<E<lt>rejectE<gt>> directive immediately causes the current production
1087to fail (it is exactly equivalent to, but more obvious than, the
1088action C<{undef}>). A C<E<lt>rejectE<gt>> is useful when it is desirable to get
1089the side effects of the actions in one production, without prejudicing a match
1090by some other production later in the rule. For example, to insert
1091tracing code into the parse:
1092
1093 complex_rule: { print "In complex rule...\n"; } <reject>
1094
1095 complex_rule: simple_rule '+' 'i' '*' simple_rule
1096 | 'i' '*' simple_rule
1097 | simple_rule
1098
1099
It is also possible to specify a conditional rejection, using the
form C<E<lt>reject:I<condition>E<gt>>, which only rejects if the
specified condition is true. This form of rejection is exactly
equivalent to the action C<{(I<condition>)?undef:1}>.
For example:
1105
1106 command: save_command
1107 | restore_command
1108 | <reject: defined $::tolerant> { exit }
1109 | <error: Unknown command. Ignored.>
1110
1111A C<E<lt>rejectE<gt>> directive never succeeds (and hence has no
1112associated value). A conditional rejection may succeed (if its
1113condition is not satisfied), in which case its value is 1.
1114
1115As an extra optimization, C<Parse::RecDescent> ignores any production
1116which I<begins> with an unconditional C<E<lt>rejectE<gt>> directive,
1117since any such production can never successfully match or have any
1118useful side-effects. A level 1 warning is issued in all such cases.
1119
Note that productions beginning with conditional
C<E<lt>reject:...E<gt>> directives are I<never> "optimized away" in
this manner, even if they are always guaranteed to fail (for example:
C<E<lt>reject:1E<gt>>).
1124
1125Due to the way grammars are parsed, there is a minor restriction on the
1126condition of a conditional C<E<lt>reject:...E<gt>>: it cannot
1127contain any raw '<' or '>' characters. For example:
1128
1129 line: cmd <reject: $thiscolumn > max> data
1130
results in an error when a parser is built from this grammar (since the
grammar parser has no way of knowing whether the first > is a "greater than"
or the end of the C<E<lt>reject:...E<gt>>).
1134
1135To overcome this problem, put the condition inside a do{} block:
1136
1137 line: cmd <reject: do{$thiscolumn > max}> data
1138
1139Note that the same problem may occur in other directives that take
1140arguments. The same solution will work in all cases.
1141
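A small sketch of a conditional rejection whose condition needs a ">"
comparison, and which therefore wraps the condition in a C<do{}> block. The
C<octet> rule and its limit of 255 are invented for the example.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    octet: /\d+/ <reject: do{ $item[1] > 255 }> { $item[1] }
}) or die "Bad grammar\n";

my %verdict;
for my $input ('200', '300') {
    # The rule returns undef when the <reject:...> condition is true
    my $value = $parser->octet($input);
    $verdict{$input} = defined $value ? 'accepted' : 'rejected';
    print "$input: $verdict{$input}\n";
}
```

The trailing action is there because a conditional rejection which does not
reject has the value 1, so without it the rule would return 1 rather than the
matched digits.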
1142=item Skipping between terminals
1143
1144The C<E<lt>skipE<gt>> directive enables the terminal prefix used in
1145a production to be changed. For example:
1146
1147 OneLiner: Command <skip:'[ \t]*'> Arg(s) /;/
1148
causes only blanks and tabs to be skipped before terminals in the C<Arg>
subrule (and any of I<its> subrules), and also before the final C</;/> terminal.
Once the production is complete, the previous terminal prefix is
reinstated. Note that this implies that distinct productions of a rule
must reset their terminal prefixes individually.
1154
1155The C<E<lt>skipE<gt>> directive evaluates to the I<previous> terminal prefix,
1156so it's easy to reinstate a prefix later in a production:
1157
1158 Command: <skip:","> CSV(s) <skip:$item[1]> Modifier
1159
1160The value specified after the colon is interpolated into a pattern, so all of
1161the following are equivalent (though their efficiency increases down the list):
1162
1163 <skip: "$colon|$comma"> # ASSUMING THE VARS HOLD THE OBVIOUS VALUES
1164
1165 <skip: ':|,'>
1166
1167 <skip: q{[:,]}>
1168
1169 <skip: qr/[:,]/>
1170
1171There is no way of directly setting the prefix for
1172an entire rule, except as follows:
1173
1174 Rule: <skip: '[ \t]*'> Prod1
1175 | <skip: '[ \t]*'> Prod2a Prod2b
1176 | <skip: '[ \t]*'> Prod3
1177
1178or, better:
1179
1180 Rule: <skip: '[ \t]*'>
1181 (
1182 Prod1
1183 | Prod2a Prod2b
1184 | Prod3
1185 )
1186
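To see why a production might want a narrower prefix, consider this sketch of a
line-oriented rule in which the newline is itself a terminal. The C<setting>
rule and both inputs are hypothetical.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    setting: <skip: '[ \t]*'> /\w+/ '=' /\w+/ /\n/
             { [ @item[2, 4] ] }
}) or die "Bad grammar\n";

# Key and value on one line: only blanks need skipping, so this matches
my $same_line  = $parser->setting("name = value\n");

# Value on the next line: the newline is no longer skipped, so this fails
my $split_line = $parser->setting("name =\nvalue\n");

print "same line:  ", ($same_line  ? "@$same_line" : 'no match'), "\n";
print "split line: ", ($split_line ? "@$split_line" : 'no match'), "\n";
```

Note that the directive occupies C<$item[1]>, which is why the action picks out
items 2 and 4.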
1187
B<Note: Up to release 1.51 of Parse::RecDescent, an entirely different
mechanism was used for specifying terminal prefixes. The current method
is not backwards-compatible with that early approach. The current approach
is stable and will not change again.>
1192
1193
1194=item Resynchronization
1195
1196The C<E<lt>resyncE<gt>> directive provides a visually distinctive
1197means of consuming some of the text being parsed, usually to skip an
1198erroneous input. In its simplest form C<E<lt>resyncE<gt>> simply
1199consumes text up to and including the next newline (C<"\n">)
1200character, succeeding only if the newline is found, in which case it
1201causes its surrounding rule to return zero on success.
1202
1203In other words, a C<E<lt>resyncE<gt>> is exactly equivalent to the token
1204C</[^\n]*\n/> followed by the action S<C<{ $return = 0 }>> (except that
1205productions beginning with a C<E<lt>resyncE<gt>> are ignored when generating
1206error messages). A typical use might be:
1207
1208 script : command(s)
1209
1210 command: save_command
1211 | restore_command
1212 | <resync> # TRY NEXT LINE, IF POSSIBLE
1213
1214It is also possible to explicitly specify a resynchronization
1215pattern, using the C<E<lt>resync:I<pattern>E<gt>> variant. This version
1216succeeds only if the specified pattern matches (and consumes) the
1217parsed text. In other words, C<E<lt>resync:I<pattern>E<gt>> is exactly
1218equivalent to the token C</I<pattern>/> (followed by a S<C<{ $return = 0 }>>
1219action). For example, if commands were terminated by newlines or semi-colons:
1220
1221 command: save_command
1222 | restore_command
1223 | <resync:[^;\n]*[;\n]>
1224
The value of a successfully matched C<E<lt>resyncE<gt>> directive (of either
type) is the text that it consumed. Note, however, that since the
directive also sets C<$return>, a production consisting of a lone
C<E<lt>resyncE<gt>> succeeds but returns the value zero (which a calling rule
may find useful to distinguish between "true" matches and "tolerant" matches).
Remember that returning a zero value indicates that the rule I<succeeded> (since
only an C<undef> denotes failure within C<Parse::RecDescent> parsers).
1232
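The distinction between "true" and "tolerant" matches might be used as in the
following sketch. The one-command language and the input text are invented;
the example relies only on the behaviour described above, namely that a
production consisting of a lone C<E<lt>resyncE<gt>> returns zero.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    script : command(s)

    command: 'save' /\w+/ ';'   { "save $item[2]" }
           | <resync>           # TRY NEXT LINE, IF POSSIBLE
}) or die "Bad grammar\n";

my $results = $parser->script("save a;\nbogus line\nsave b;\n")
    or die "Parse failed\n";

# Resynchronized lines come back as 0, genuine commands as true strings
my @good    = grep { $_ } @$results;
my $skipped = grep { !$_ } @$results;

print "commands: @good\n";
print "resynchronizations: $skipped\n";
```

The number of resynchronizations depends on how the trailing newlines are
consumed, so only the list of genuine commands should be relied upon here.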
1233
1234=item Error handling
1235
The C<E<lt>errorE<gt>> directive provides automatic or user-defined
generation of error messages during a parse. In its simplest form
C<E<lt>errorE<gt>> prepares an error message based on
the mismatch between the last item expected and the text which caused
it to fail. For example, given the rule:

    McCoy: curse ',' name ", I'm a doctor, not a" a_profession '!'
         | pronoun 'dead,' name '!'
         | <error>
1245
1246the following strings would produce the following messages:
1247
1248=over 4
1249
1250=item "Amen, Jim!"
1251
1252 ERROR (line 1): Invalid McCoy: Expected curse or pronoun
1253 not found
1254
1255=item "Dammit, Jim, I'm a doctor!"
1256
1257 ERROR (line 1): Invalid McCoy: Expected ", I'm a doctor, not a"
1258 but found ", I'm a doctor!" instead
1259
1260=item "He's dead,\n"
1261
1262 ERROR (line 2): Invalid McCoy: Expected name not found
1263
1264=item "He's alive!"
1265
1266 ERROR (line 1): Invalid McCoy: Expected 'dead,' but found
1267 "alive!" instead
1268
1269=item "Dammit, Jim, I'm a doctor, not a pointy-eared Vulcan!"
1270
1271 ERROR (line 1): Invalid McCoy: Expected a profession but found
1272 "pointy-eared Vulcan!" instead
1273
1274
1275=back
1276
1277Note that, when autogenerating error messages, all underscores in any
1278rule name used in a message are replaced by single spaces (for example
1279"a_production" becomes "a production"). Judicious choice of rule
1280names can therefore considerably improve the readability of automatic
1281error messages (as well as the maintainability of the original
1282grammar).
1283
1284If the automatically generated error is not sufficient, it is possible to
1285provide an explicit message as part of the error directive. For example:
1286
    Spock: "Fascinating" ',' (name | 'Captain') '.'
         | "Highly illogical, doctor."
         | <error: He never said that!>
1290
1291which would result in I<all> failures to parse a "Spock" subrule printing the
1292following message:
1293
1294 ERROR (line <N>): Invalid Spock: He never said that!
1295
1296The error message is treated as a "qq{...}" string and interpolated
1297when the error is generated (I<not> when the directive is specified!).
1298Hence:
1299
1300 <error: Mystical error near "$text">
1301
1302would correctly insert the ambient text string which caused the error.
1303
1304There are two other forms of error directive: C<E<lt>error?E<gt>> and
1305S<C<E<lt>error?: msgE<gt>>>. These behave just like C<E<lt>errorE<gt>>
1306and S<C<E<lt>error: msgE<gt>>> respectively, except that they are
1307only triggered if the rule is "committed" at the time they are
1308encountered. For example:
1309
    Scotty: "Ya kenna change the Laws o' Phusics," <commit> name
          | name <commit> ',' "she's goanta blaw!"
          | <error?>

will only generate an error for a string beginning with "Ya kenna
change the Laws o' Phusics," or a valid name, but which still fails to match the
corresponding production. That is, C<$parser-E<gt>Scotty("Aye, Cap'ain")> will
fail silently (since neither production will "commit" the rule on that
input), whereas S<C<$parser-E<gt>Scotty("Mr Spock, ah jest kenna do'ut!")>>
will fail with the error message:

    ERROR (line 1): Invalid Scotty: expected "she's goanta blaw!"
                    but found "ah jest kenna do'ut!" instead.
1323
1324since in that case the second production would commit after matching
1325the leading name.
1326
1327Note that to allow this behaviour, all C<E<lt>errorE<gt>> directives which are
1328the first item in a production automatically uncommit the rule just
1329long enough to allow their production to be attempted (that is, when
1330their production fails, the commitment is reinstated so that
1331subsequent productions are skipped).
1332
1333In order to I<permanently> uncommit the rule before an error message,
1334it is necessary to put an explicit C<E<lt>uncommitE<gt>> before the
1335C<E<lt>errorE<gt>>. For example:
1336
1337 line: 'Kirk:' <commit> Kirk
1338 | 'Spock:' <commit> Spock
1339 | 'McCoy:' <commit> McCoy
1340 | <uncommit> <error?> <reject>
1341 | <resync>
1342
1343
1344Error messages generated by the various C<E<lt>error...E<gt>> directives
1345are not displayed immediately. Instead, they are "queued" in a buffer and
1346are only displayed once parsing ultimately fails. Moreover,
1347C<E<lt>error...E<gt>> directives that cause one production of a rule
1348to fail are automatically removed from the message queue
1349if another production subsequently causes the entire rule to succeed.
1350This means that you can put
1351C<E<lt>error...E<gt>> directives wherever useful diagnosis can be done,
1352and only those associated with actual parser failure will ever be
1353displayed. Also see L<"Gotchas">.
1354
1355As a general rule, the most useful diagnostics are usually generated
1356either at the very lowest level within the grammar, or at the very
1357highest. A good rule of thumb is to identify those subrules which
1358consist mainly (or entirely) of terminals, and then put an
1359C<E<lt>error...E<gt>> directive at the end of any other rule which calls
1360one or more of those subrules.
1361
1362There is one other situation in which the output of the various types of
1363error directive is suppressed; namely, when the rule containing them
1364is being parsed as part of a "look-ahead" (see L<"Look-ahead">). In this
1365case, the error directive will still cause the rule to fail, but will do
1366so silently.
1367
1368An unconditional C<E<lt>errorE<gt>> directive always fails (and hence has no
1369associated value). This means that encountering such a directive
1370always causes the production containing it to fail. Hence an
1371C<E<lt>errorE<gt>> directive will inevitably be the last (useful) item of a
1372rule (a level 3 warning is issued if a production contains items after an unconditional
1373C<E<lt>errorE<gt>> directive).
1374
1375An C<E<lt>error?E<gt>> directive will I<succeed> (that is: fail to fail :-), if
1376the current rule is uncommitted when the directive is encountered. In
1377that case the directive's associated value is zero. Hence, this type
1378of error directive I<can> be used before the end of a
1379production. For example:
1380
1381 command: 'do' <commit> something
1382 | 'report' <commit> something
1383 | <error?: Syntax error> <error: Unknown command>
1384
1385
1386B<Warning:> The C<E<lt>error?E<gt>> directive does I<not> mean "always fail (but
1387do so silently unless committed)". It actually means "only fail (and report) if
1388committed, otherwise I<succeed>". To achieve the "fail silently if uncommitted"
1389semantics, it is necessary to use:
1390
1391 rule: item <commit> item(s)
1392 | <error?> <reject> # FAIL SILENTLY UNLESS COMMITTED
1393
1394However, because people seem to expect a lone C<E<lt>error?E<gt>> directive
1395to work like this:
1396
1397 rule: item <commit> item(s)
1398 | <error?: Error message if committed>
1399 | <error: Error message if uncommitted>
1400
1401Parse::RecDescent automatically appends a
1402C<E<lt>rejectE<gt>> directive if the C<E<lt>error?E<gt>> directive
1403is the only item in a production. A level 2 warning (see below)
1404is issued when this happens.
1405
The level of error reporting during both parser construction and
parsing is controlled by the presence or absence of four global
variables: C<$::RD_ERRORS>, C<$::RD_WARN>, C<$::RD_HINT>, and
C<$::RD_TRACE>. If C<$::RD_ERRORS> is defined (and, by default, it is)
then fatal errors are reported.
1411
1412Whenever C<$::RD_WARN> is defined, certain non-fatal problems are also reported.
1413Warnings have an associated "level": 1, 2, or 3. The higher the level,
1414the more serious the warning. The value of the corresponding global
1415variable (C<$::RD_WARN>) determines the I<lowest> level of warning to
1416be displayed. Hence, to see I<all> warnings, set C<$::RD_WARN> to 1.
1417To see only the most serious warnings set C<$::RD_WARN> to 3.
1418By default C<$::RD_WARN> is initialized to 3, ensuring that serious but
1419non-fatal errors are automatically reported.
1420
See L<"DIAGNOSTICS"> for a list of the various error and warning messages
that Parse::RecDescent generates when these two variables are defined.
1423
1424Defining any of the remaining variables (which are not defined by
1425default) further increases the amount of information reported.
1426Defining C<$::RD_HINT> causes the parser generator to offer
1427more detailed analyses and hints on both errors and warnings.
1428Note that setting C<$::RD_HINT> at any point automagically
1429sets C<$::RD_WARN> to 1.
1430
Defining C<$::RD_TRACE> causes the parser generator and the parser to
report their progress to STDERR in excruciating detail (although without hints,
unless C<$::RD_HINT> is separately defined). This detail
can be moderated in only one respect: if C<$::RD_TRACE> has an
integer value (I<N>) greater than 1, only I<N> characters of
the "current parsing context" (that is, where in the input string we
are at any point in the parse) are reported at any time.
1438
1439C<$::RD_TRACE> is mainly useful for debugging a grammar that isn't
1440behaving as you expected it to. To this end, if C<$::RD_TRACE> is
1441defined when a parser is built, any actual parser code which is
1442generated is also written to a file named "RD_TRACE" in the local
1443directory.
1444
1445Note that the four variables belong to the "main" package, which
1446makes them easier to refer to in the code controlling the parser, and
1447also makes it easy to turn them into command line flags ("-RD_ERRORS",
1448"-RD_WARN", "-RD_HINT", "-RD_TRACE") under B<perl -s>.
1449
1450=item Specifying local variables
1451
1452It is occasionally convenient to specify variables which are local
1453to a single rule. This may be achieved by including a
1454C<E<lt>rulevar:...E<gt>> directive anywhere in the rule. For example:
1455
1456 markup: <rulevar: $tag>
1457
1458 markup: tag {($tag=$item[1]) =~ s/^<|>$//g} body[$tag]
1459
1460The example C<E<lt>rulevar: $tagE<gt>> directive causes a "my" variable named
1461C<$tag> to be declared at the start of the subroutine implementing the
1462C<markup> rule (that is, I<before> the first production, regardless of
1463where in the rule it is specified).
1464
1465Specifically, any directive of the form:
1466C<E<lt>rulevar:I<text>E<gt>> causes a line of the form C<my I<text>;>
1467to be added at the beginning of the rule subroutine, immediately after
1468the definitions of the following local variables:
1469
1470 $thisparser $commit
1471 $thisrule @item
1472 $thisline @arg
1473 $text %arg
1474
1475This means that the following C<E<lt>rulevarE<gt>> directives work
1476as expected:
1477
    <rulevar: $count = 0 >

    <rulevar: $firstarg = $arg[0] || '' >

    <rulevar: $myItems = \@item >

    <rulevar: @context = ( $thisline, $text, @arg ) >

    <rulevar: ($name,$age) = @arg{"name","age"} >
1487
1488
1489Note however that, because all such variables are "my" variables, their
1490values I<do not persist> between match attempts on a given rule. To
1491preserve values between match attempts, values can be stored within the
1492"local" member of the C<$thisrule> object:
1493
1494 countedrule: { $thisrule->{"local"}{"count"}++ }
1495 <reject>
1496 | subrule1
1497 | subrule2
1498 | <reject: $thisrule->{"local"}{"count"} == 1>
1499 subrule3
1500
1501
1502When matching a rule, each C<E<lt>rulevarE<gt>> directive is matched as
1503if it were an unconditional C<E<lt>rejectE<gt>> directive (that is, it
1504causes any production in which it appears to immediately fail to match).
1505For this reason (and to improve readability) it is usual to specify any
1506C<E<lt>rulevarE<gt>> directive in a separate production at the start of
1507the rule (this has the added advantage that it enables
1508C<Parse::RecDescent> to optimize away such productions, just as it does
1509for the C<E<lt>rejectE<gt>> directive).
1510
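A short sketch of a rule-local variable in use, with the
C<E<lt>rulevarE<gt>> directive placed in its own production at the start of the
rule as suggested above. The C<block> and C<body> rules and the inputs are
hypothetical; the example assumes that an interpolated string in a production
can see the rule's "my" variables, since both live in the same subroutine.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    block: <rulevar: $name>

    block: 'begin' /\w+/ { $name = $item[2] } body "end $name"
           { $name }

    body:  /\d+/
}) or die "Bad grammar\n";

# The closing name must repeat the opening one
my $balanced   = $parser->block("begin alpha 42 end alpha");
my $mismatched = $parser->block("begin alpha 42 end beta");

print "balanced:   ", (defined $balanced   ? $balanced   : 'no match'), "\n";
print "mismatched: ", (defined $mismatched ? $mismatched : 'no match'), "\n";
```

Because C<$name> is a "my" variable, it starts out undefined on every attempt
to match C<block>, so nothing leaks from one call to the next.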
1511
1512=item Dynamically matched rules
1513
1514Because regexes and double-quoted strings are interpolated, it is relatively
1515easy to specify productions with "context sensitive" tokens. For example:
1516
1517 command: keyword body "end $item[1]"
1518
1519which ensures that a command block is bounded by a
1520"I<E<lt>keywordE<gt>>...end I<E<lt>same keywordE<gt>>" pair.
1521
1522Building productions in which subrules are context sensitive is also possible,
1523via the C<E<lt>matchrule:...E<gt>> directive. This directive behaves
1524identically to a subrule item, except that the rule which is invoked to match
1525it is determined by the string specified after the colon. For example, we could
1526rewrite the C<command> rule like this:
1527
1528 command: keyword <matchrule:body> "end $item[1]"
1529
Whatever appears after the colon in the directive is treated as an interpolated
string (that is, as if it appeared in a C<qq{...}> operator) and the value of
that interpolated string is the name of the subrule to be matched.

Of course, just putting a constant string like C<body> in a
C<E<lt>matchrule:...E<gt>> directive is of little interest or benefit.
The power of the directive is seen when we use a string that interpolates
to something interesting. For example:
1538
1539 command: keyword <matchrule:$item[1]_body> "end $item[1]"
1540
1541 keyword: 'while' | 'if' | 'function'
1542
1543 while_body: condition block
1544
1545 if_body: condition block ('else' block)(?)
1546
1547 function_body: arglist block
1548
1549Now the C<command> rule selects how to proceed on the basis of the keyword
1550that is found. It is as if C<command> were declared:
1551
1552 command: 'while' while_body "end while"
1553 | 'if' if_body "end if"
1554 | 'function' function_body "end function"
1555
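The keyword-driven dispatch can be tried out with a cut-down version of the
grammar. In this sketch the two body rules are deliberately trivial stand-ins
(a number for C<while>, a lower-case word for C<if>), and the inputs are made up.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    command:    keyword <matchrule:$item[1]_body> "end $item[1]"
                { [ @item[1, 2] ] }

    keyword:    'while' | 'if'

    while_body: /\d+/

    if_body:    /[a-z]+/
}) or die "Bad grammar\n";

my %result;
for my $input ('while 3 end while', 'if ready end if', 'while ready end while') {
    # The body rule that is tried depends on the keyword just matched
    my $match = $parser->command($input);
    $result{$input} = $match ? "@$match" : 'no match';
    print "$input => $result{$input}\n";
}
```

The third input fails because C<while_body> is selected and does not accept a
word, even though C<if_body> would have.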
1556
1557When a C<E<lt>matchrule:...E<gt>> directive is used as a repeated
1558subrule, the rule name expression is "late-bound". That is, the name of
1559the rule to be called is re-evaluated I<each time> a match attempt is
1560made. Hence, the following grammar:
1561
1562 { $::species = 'dogs' }
1563
1564 pair: 'two' <matchrule:$::species>(s)
1565
1566 dogs: /dogs/ { $::species = 'cats' }
1567
1568 cats: /cats/
1569
will match the string "two dogs cats cats" completely, whereas it will
only match the string "two dogs dogs dogs" up to the eighth character. If
the rule name were "early bound" (that is, evaluated only the first
time the directive is encountered in a production), the reverse
behaviour would be expected.
1575
1576=item Deferred actions
1577
1578The C<E<lt>defer:...E<gt>> directive is used to specify an action to be
1579performed when (and only if!) the current production ultimately succeeds.
1580
Whenever a C<E<lt>defer:...E<gt>> directive appears, the code it specifies
is converted to a closure (an anonymous subroutine reference) which is
queued within the active parser object. Note that,
because the deferred code is converted to a closure, the values of any
"local" variables (such as C<$text>, C<@item>, etc.) are preserved
until the deferred code is actually executed.
1587
1588If the parse ultimately succeeds
1589I<and> the production in which the C<E<lt>defer:...E<gt>> directive was
1590evaluated formed part of the successful parse, then the deferred code is
1591executed immediately before the parse returns. If however the production
1592which queued a deferred action fails, or one of the higher-level
1593rules which called that production fails, then the deferred action is
1594removed from the queue, and hence is never executed.
1595
1596For example, given the grammar:
1597
1598 sentence: noun trans noun
1599 | noun intrans
1600
1601 noun: 'the dog'
1602 { print "$item[1]\t(noun)\n" }
1603 | 'the meat'
1604 { print "$item[1]\t(noun)\n" }
1605
1606 trans: 'ate'
1607 { print "$item[1]\t(transitive)\n" }
1608
1609 intrans: 'ate'
1610 { print "$item[1]\t(intransitive)\n" }
1611 | 'barked'
1612 { print "$item[1]\t(intransitive)\n" }
1613
1614then parsing the sentence C<"the dog ate"> would produce the output:
1615
1616 the dog (noun)
1617 ate (transitive)
1618 the dog (noun)
1619 ate (intransitive)
1620
1621This is because, even though the first production of C<sentence>
1622ultimately fails, its initial subrules C<noun> and C<trans> do match,
1623and hence they execute their associated actions.
1624Then the second production of C<sentence> succeeds, causing the
1625actions of the subrules C<noun> and C<intrans> to be executed as well.
1626
1627On the other hand, if the actions were replaced by C<E<lt>defer:...E<gt>>
1628directives:
1629
1630 sentence: noun trans noun
1631 | noun intrans
1632
1633 noun: 'the dog'
1634 <defer: print "$item[1]\t(noun)\n" >
1635 | 'the meat'
1636 <defer: print "$item[1]\t(noun)\n" >
1637
1638 trans: 'ate'
1639 <defer: print "$item[1]\t(transitive)\n" >
1640
1641 intrans: 'ate'
1642 <defer: print "$item[1]\t(intransitive)\n" >
1643 | 'barked'
1644 <defer: print "$item[1]\t(intransitive)\n" >
1645
1646the output would be:
1647
1648 the dog (noun)
1649 ate (intransitive)
1650
1651since deferred actions are only executed if they were evaluated in
1652a production which ultimately contributes to the successful parse.
1653
In this case, even though the first production of C<sentence> caused
the subrules C<noun> and C<trans> to match, that production ultimately
failed and so the deferred actions queued by those subrules were subsequently
discarded. The second production then succeeded, causing the entire
parse to succeed, and so the deferred actions queued by the (second) match of
the C<noun> subrule and the subsequent match of C<intrans> I<are> preserved and
eventually executed.
1661
1662Deferred actions provide a means of improving the performance of a parser,
1663by only executing those actions which are part of the final parse-tree
1664for the input data.
1665
1666Alternatively, deferred actions can be viewed as a mechanism for building
1667(and executing) a
1668customized subroutine corresponding to the given input data, much in the
1669same way that autoactions (see L<"Autoactions">) can be used to build a
1670customized data structure for specific input.
1671
1672Whether or not the action it specifies is ever executed,
1673a C<E<lt>defer:...E<gt>> directive always succeeds, returning the
1674number of deferred actions currently queued at that point.
1675
1676
1677=item Parsing Perl
1678
1679Parse::RecDescent provides limited support for parsing subsets of Perl,
1680namely: quote-like operators, Perl variables, and complete code blocks.
1681
The C<E<lt>perl_quotelikeE<gt>> directive can be used to parse any Perl
quote-like operator: C<'a string'>, C<m/a pattern/>, C<tr{ans}{lation}>,
etc. It does this by calling Text::Balanced::extract_quotelike().
1685
1686If a quote-like operator is found, a reference to an array of eight elements
1687is returned. Those elements are identical to the last eight elements returned
1688by Text::Balanced::extract_quotelike() in an array context, namely:
1689
1690=over 4
1691
1692=item [0]
1693
1694the name of the quotelike operator -- 'q', 'qq', 'm', 's', 'tr' -- if the
1695operator was named; otherwise C<undef>,
1696
1697=item [1]
1698
1699the left delimiter of the first block of the operation,
1700
1701=item [2]
1702
the text of the first block of the operation
(that is, the contents of
a quote, the regex of a match or substitution, or the target list of a
translation),
1707
1708=item [3]
1709
1710the right delimiter of the first block of the operation,
1711
1712=item [4]
1713
1714the left delimiter of the second block of the operation if there is one
1715(that is, if it is a C<s>, C<tr>, or C<y>); otherwise C<undef>,
1716
1717=item [5]
1718
1719the text of the second block of the operation if there is one
1720(that is, the replacement of a substitution or the translation list
1721of a translation); otherwise C<undef>,
1722
1723=item [6]
1724
1725the right delimiter of the second block of the operation (if any);
1726otherwise C<undef>,
1727
1728=item [7]
1729
1730the trailing modifiers on the operation (if any); otherwise C<undef>.
1731
1732=back
1733
1734If a quote-like expression is not found, the directive fails with the usual
1735C<undef> value.
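
The following sketch prints the eight elements for a substitution. The rule
name C<quoted> and the labels are invented for illustration; only the
directive itself comes from the module:

    use Parse::RecDescent;

    my $parser = Parse::RecDescent->new(q{
        quoted: <perl_quotelike>
    }) or die "Bad grammar";

    my $parts = $parser->quoted('s{foo}{bar}gi');
    die "No quote-like operator found" unless defined $parts;

    # Labels follow the order of the list above
    my @label = ('name',   'left 1', 'text 1', 'right 1',
                 'left 2', 'text 2', 'right 2', 'modifiers');

    for my $i (0 .. 7) {
        my $value = defined $parts->[$i] ? $parts->[$i] : 'undef';
        print "[$i] $label[$i]: $value\n";
    }

For this input, element [0] should be C<s>, elements [2] and [5] should be
C<foo> and C<bar>, and element [7] should be C<gi>.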
1736
1737The C<E<lt>perl_variableE<gt>> directive can be used to parse any Perl
1738variable: $scalar, @array, %hash, $ref->{field}[$index], etc.
1739It does this by calling Text::Balanced::extract_variable().
1740
1741If the directive matches text representing a valid Perl variable
1742specification, it returns that text. Otherwise it fails with the usual
1743C<undef> value.
1744
The C<E<lt>perl_codeblockE<gt>> directive can be used to parse any
curly-brace-delimited block of Perl code, such as: { $a = 1; f() =~ m/pat/; }.
It does this by calling Text::Balanced::extract_codeblock().
1747
1748If the directive matches text representing a valid Perl code block,
1749it returns that text. Otherwise it fails with the usual C<undef> value.
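
As a rough sketch, the two directives can be combined in one production.
The rule name C<assignment> and the hash keys C<var> and C<block> are
assumptions made for this example:

    use Parse::RecDescent;

    my $parser = Parse::RecDescent->new(q{
        assignment: <perl_variable> '=' <perl_codeblock>
                        { $return = { var => $item[1], block => $item[3] } }
    }) or die "Bad grammar";

    my $result = $parser->assignment(
        '$ref->{field}[0] = { $a = 1; f() =~ m/pat/; }'
    );
    die "No match" unless defined $result;

    print "variable: $result->{var}\n";
    print "block:    $result->{block}\n";

Each directive counts as one item of the production, so the variable text is
in C<$item[1]> and the code block text is in C<$item[3]>.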
1750
1751
1752=item Constructing tokens
1753
1754Eventually, Parse::RecDescent will be able to parse tokenized input, as
1755well as ordinary strings. In preparation for this joyous day, the
1756C<E<lt>token:...E<gt>> directive has been provided.
1757This directive creates a token which will be suitable for
1758input to a Parse::RecDescent parser (when it eventually supports
1759tokenized input).
1760
1761The text of the token is the value of the
1762immediately preceding item in the production. A
1763C<E<lt>token:...E<gt>> directive always succeeds with a return
1764value which is the hash reference that is the new token. It also
1765sets the return value for the production to that hash ref.
1766
1767The C<E<lt>token:...E<gt>> directive makes it easy to build
1768a Parse::RecDescent-compatible lexer in Parse::RecDescent:
1769
1770 my $lexer = new Parse::RecDescent q
1771 {
1772 lex: token(s)
1773
1774 token: /a\b/ <token:INDEF>
1775 | /the\b/ <token:DEF>
1776 | /fly\b/ <token:NOUN,VERB>
1777 | /[a-z]+/i { lc $item[1] } <token:ALPHA>
1778 | <error: Unknown token>
1779
1780 };
1781
1782which will eventually be able to be used with a regular Parse::RecDescent
1783grammar:
1784
    my $parser = new Parse::RecDescent q
    {
        startrule: subrule1 subrule2

        # ETC...
    };
1791
1792either with a pre-lexing phase:
1793
1794 $parser->startrule( $lexer->lex($data) );
1795
1796or with a lex-on-demand approach:
1797
1798 $parser->startrule( sub{$lexer->token(\$data)} );
1799
1800But at present, only the C<E<lt>token:...E<gt>> directive is
1801actually implemented. The rest is vapourware.
1802
1803=item Specifying operations
1804
1805One of the commonest requirements when building a parser is to specify
1806binary operators. Unfortunately, in a normal grammar, the rules for
1807such things are awkward:
1808
1809 disjunction: conjunction ('or' conjunction)(s?)
1810 { $return = [ $item[1], @{$item[2]} ] }
1811
1812 conjunction: atom ('and' atom)(s?)
1813 { $return = [ $item[1], @{$item[2]} ] }
1814
1815or inefficient:
1816
    disjunction: conjunction 'or' disjunction
                     { $return = [ $item[1], @{$item[3]} ] }
               | conjunction
                     { $return = [ $item[1] ] }

    conjunction: atom 'and' conjunction
                     { $return = [ $item[1], @{$item[3]} ] }
               | atom
                     { $return = [ $item[1] ] }
1826
1827and either way is ugly and hard to get right.
1828
1829The C<E<lt>leftop:...E<gt>> and C<E<lt>rightop:...E<gt>> directives provide an
1830easier way of specifying such operations. Using C<E<lt>leftop:...E<gt>> the
1831above examples become:
1832
1833 disjunction: <leftop: conjunction 'or' conjunction>
1834 conjunction: <leftop: atom 'and' atom>
1835
1836The C<E<lt>leftop:...E<gt>> directive specifies a left-associative binary operator.
1837It is specified around three other grammar elements
1838(typically subrules or terminals), which match the left operand,
1839the operator itself, and the right operand respectively.
1840
1841A C<E<lt>leftop:...E<gt>> directive such as:
1842
1843 disjunction: <leftop: conjunction 'or' conjunction>
1844
1845is converted to the following:
1846
1847 disjunction: ( conjunction ('or' conjunction)(s?)
1848 { $return = [ $item[1], @{$item[2]} ] } )
1849
1850In other words, a C<E<lt>leftop:...E<gt>> directive matches the left operand followed by zero
1851or more repetitions of both the operator and the right operand. It then
1852flattens the matched items into an anonymous array which becomes the
1853(single) value of the entire C<E<lt>leftop:...E<gt>> directive.
1854
For example, a C<E<lt>leftop:...E<gt>> directive such as:
1856
1857 output: <leftop: ident '<<' expr >
1858
1859when given a string such as:
1860
1861 cout << var << "str" << 3
1862
1863would match, and C<$item[1]> would be set to:
1864
1865 [ 'cout', 'var', '"str"', '3' ]
1866
1867In other words:
1868
1869 output: <leftop: ident '<<' expr >
1870
1871is equivalent to a left-associative operator:
1872
1873 output: ident { $return = [$item[1]] }
1874 | ident '<<' expr { $return = [@item[1,3]] }
1875 | ident '<<' expr '<<' expr { $return = [@item[1,3,5]] }
1876 | ident '<<' expr '<<' expr '<<' expr { $return = [@item[1,3,5,7]] }
1877 # ...etc...
1878
1879
1880Similarly, the C<E<lt>rightop:...E<gt>> directive takes a left operand, an operator, and a right operand:
1881
1882 assign: <rightop: var '=' expr >
1883
1884and converts them to:
1885
1886 assign: ( (var '=' {$return=$item[1]})(s?) expr
1887 { $return = [ @{$item[1]}, $item[2] ] } )
1888
1889which is equivalent to a right-associative operator:
1890
1891 assign: var { $return = [$item[1]] }
1892 | var '=' expr { $return = [@item[1,3]] }
1893 | var '=' var '=' expr { $return = [@item[1,3,5]] }
1894 | var '=' var '=' var '=' expr { $return = [@item[1,3,5,7]] }
1895 # ...etc...
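
A small runnable sketch of the C<assign> rule follows. The definitions of
C<var> and C<expr> are simplified assumptions, chosen only to keep the
example short:

    use Parse::RecDescent;

    my $parser = Parse::RecDescent->new(q{
        assign: <rightop: var '=' expr>
        var:    /[a-z]\w*/
        expr:   /\d+/
    }) or die "Bad grammar";

    my $operands = $parser->assign('a = b = 42');
    die "No match" unless defined $operands;

    print join(', ', @$operands), "\n";

With these definitions the result should be a reference to the array
C<('a', 'b', '42')>: two C<var> operands followed by the final C<expr>.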
1896
1897
1898Note that for both the C<E<lt>leftop:...E<gt>> and C<E<lt>rightop:...E<gt>> directives, the directive does not normally
1899return the operator itself, just a list of the operands involved. This is
1900particularly handy for specifying lists:
1901
1902 list: '(' <leftop: list_item ',' list_item> ')'
1903 { $return = $item[2] }
1904
1905There is, however, a problem: sometimes the operator is itself significant.
1906For example, in a Perl list a comma and a C<=E<gt>> are both
1907valid separators, but the C<=E<gt>> has additional stringification semantics.
1908Hence it's important to know which was used in each case.
1909
1910To solve this problem the
1911C<E<lt>leftop:...E<gt>> and C<E<lt>rightop:...E<gt>> directives
1912I<do> return the operator(s) as well, under two circumstances.
1913The first case is where the operator is specified as a subrule. In that instance,
1914whatever the operator matches is returned (on the assumption that if the operator
1915is important enough to have its own subrule, then it's important enough to return).
1916
1917The second case is where the operator is specified as a regular
1918expression. In that case, if the first bracketed subpattern of the
1919regular expression matches, that matching value is returned (this is analogous to
1920the behaviour of the Perl C<split> function, except that only the first subpattern
1921is returned).
1922
1923In other words, given the input:
1924
1925 ( a=>1, b=>2 )
1926
1927the specifications:
1928
1929 list: '(' <leftop: list_item separator list_item> ')'
1930
1931 separator: ',' | '=>'
1932
1933or:
1934
1935 list: '(' <leftop: list_item /(,|=>)/ list_item> ')'
1936
1937cause the list separators to be interleaved with the operands in the
1938anonymous array in C<$item[2]>:
1939
1940 [ 'a', '=>', '1', ',', 'b', '=>', '2' ]
1941
1942
1943But the following version:
1944
1945 list: '(' <leftop: list_item /,|=>/ list_item> ')'
1946
returns only the operands:
1948
1949 [ 'a', '1', 'b', '2' ]
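
The effect of the capturing brackets can be checked with a smaller grammar.
This is only a sketch: the rule names C<with_ops>, C<without_ops> and
C<number> are invented for the example:

    use Parse::RecDescent;

    my $parser = Parse::RecDescent->new(q{
        with_ops:    <leftop: number /([-+])/ number>
        without_ops: <leftop: number /[-+]/ number>
        number:      /\d+/
    }) or die "Bad grammar";

    my $interleaved = $parser->with_ops('1 + 2 - 3');
    my $operands    = $parser->without_ops('1 + 2 - 3');

    print "with brackets:    @$interleaved\n";
    print "without brackets: @$operands\n";

The first rule should return the operators interleaved with the operands,
and the second should return the three numbers alone.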
1950
1951Of course, none of the above specifications handle the case of an empty
1952list, since the C<E<lt>leftop:...E<gt>> and C<E<lt>rightop:...E<gt>> directives
1953require at least a single right or left operand to match. To specify
1954that the operator can match "trivially",
1955it's necessary to add a C<(?)> qualifier to the directive:
1956
1957 list: '(' <leftop: list_item /(,|=>)/ list_item>(?) ')'
1958
Note that in almost all the above examples, the first and third arguments
of the C<E<lt>leftop:...E<gt>> directive were the same subrule. That is because
C<E<lt>leftop:...E<gt>>'s are frequently used to specify "separated" lists of the
same type of item. To make such lists easier to specify, the following
syntax:
1964
1965 list: element(s /,/)
1966
1967is exactly equivalent to:
1968
1969 list: <leftop: element /,/ element>
1970
1971Note that the separator must be specified as a raw pattern (i.e.
1972not a string or subrule).
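
For instance, a bracketed, comma-separated list might be sketched like this
(the C<element> pattern is an assumption made for the example):

    use Parse::RecDescent;

    my $parser = Parse::RecDescent->new(q{
        list:    '(' element(s /,/) ')'
                     { $return = $item[2] }
        element: /\w+/
    }) or die "Bad grammar";

    my $elements = $parser->list('(a, b, c)');
    die "No match" unless defined $elements;

    print scalar(@$elements), " elements: @$elements\n";

Because the separator pattern has no capturing brackets, only the elements
themselves should appear in the resulting array.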
1973
1974
1975=item Scored productions
1976
1977By default, Parse::RecDescent grammar rules always accept the first
1978production that matches the input. But if two or more productions may
1979potentially match the same input, choosing the first that does so may
1980not be optimal.
1981
1982For example, if you were parsing the sentence "time flies like an arrow",
1983you might use a rule like this:
1984
1985 sentence: verb noun preposition article noun { [@item] }
1986 | adjective noun verb article noun { [@item] }
1987 | noun verb preposition article noun { [@item] }
1988
1989Each of these productions matches the sentence, but the third one
1990is the most likely interpretation. However, if the sentence had been
1991"fruit flies like a banana", then the second production is probably
1992the right match.
1993
To cater for such situations, the C<E<lt>score:...E<gt>> directive can be used.
The directive is equivalent to an unconditional C<E<lt>rejectE<gt>>,
except that it allows you to specify a "score" for the current
production. If that score is numerically greater than the best
score of any preceding production, the current production is cached for later
consideration. If no later production matches, then the cached
production is treated as having matched, and the value of the
item immediately before its C<E<lt>score:...E<gt>> directive is returned as the
result.
2003
2004In other words, by putting a C<E<lt>score:...E<gt>> directive at the end of
2005each production, you can select which production matches using
2006criteria other than specification order. For example:
2007
2008 sentence: verb noun preposition article noun { [@item] } <score: sensible(@item)>
2009 | adjective noun verb article noun { [@item] } <score: sensible(@item)>
2010 | noun verb preposition article noun { [@item] } <score: sensible(@item)>
2011
2012Now, when each production reaches its respective C<E<lt>score:...E<gt>>
2013directive, the subroutine C<sensible> will be called to evaluate the
2014matched items (somehow). Once all productions have been tried, the
2015one which C<sensible> scored most highly will be the one that is
2016accepted as a match for the rule.
2017
2018The variable $score always holds the current best score of any production,
2019and the variable $score_return holds the corresponding return value.
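
As a simpler sketch of the same idea, the two rules below differ only in
their use of C<E<lt>score:...E<gt>>. The rule names C<first_match> and
C<longest_match> are hypothetical, and the score used is just the length of
the matched text:

    use Parse::RecDescent;

    my $parser = Parse::RecDescent->new(q{
        first_match:   /\d+/
                     | /\d+\.\d+/

        longest_match: /\d+/         <score: length($item[1])>
                     | /\d+\.\d+/    <score: length($item[1])>
    }) or die "Bad grammar";

    my $first   = $parser->first_match('3.14');
    my $longest = $parser->longest_match('3.14');

    print "first production wins: $first\n";
    print "best score wins:       $longest\n";

The unscored rule should stop at the first production and return C<3>,
while the scored rule should prefer the second production and return
C<3.14>.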
2020
As another example, the following grammar matches lines whose fields may be
separated by commas, colons, or spaces. This can be tricky if
a colon-separated line also contains commas, or vice versa. The grammar
resolves the ambiguity by selecting the rule that results in the
fewest fields:
2026
2027 line: seplist[sep=>','] <score: -@{$item[1]}>
2028 | seplist[sep=>':'] <score: -@{$item[1]}>
2029 | seplist[sep=>" "] <score: -@{$item[1]}>
2030
2031 seplist: <skip:""> <leftop: /[^$arg{sep}]*/ "$arg{sep}" /[^$arg{sep}]*/>
2032
2033Note the use of negation within the C<E<lt>score:...E<gt>> directive
2034to ensure that the seplist with the most items gets the lowest score.
2035
2036As the above examples indicate, it is often the case that all productions
2037in a rule use exactly the same C<E<lt>score:...E<gt>> directive. It is
2038tedious to have to repeat this identical directive in every production, so
2039Parse::RecDescent also provides the C<E<lt>autoscore:...E<gt>> directive.
2040
2041If an C<E<lt>autoscore:...E<gt>> directive appears in any
2042production of a rule, the code it specifies is used as the scoring
2043code for every production of that rule, except productions that already
2044end with an explicit C<E<lt>score:...E<gt>> directive. Thus the rules above could
2045be rewritten:
2046
2047 line: <autoscore: -@{$item[1]}>
2048 line: seplist[sep=>',']
2049 | seplist[sep=>':']
2050 | seplist[sep=>" "]
2051
2052
2053 sentence: <autoscore: sensible(@item)>
2054 | verb noun preposition article noun { [@item] }
2055 | adjective noun verb article noun { [@item] }
2056 | noun verb preposition article noun { [@item] }
2057
2058Note that the C<E<lt>autoscore:...E<gt>> directive itself acts as an
2059unconditional C<E<lt>rejectE<gt>>, and (like the C<E<lt>rulevar:...E<gt>>
2060directive) is pruned at compile-time wherever possible.
2061
2062
2063=item Dispensing with grammar checks
2064
2065During the compilation phase of parser construction, Parse::RecDescent performs
2066a small number of checks on the grammar it's given. Specifically it checks that
2067the grammar is not left-recursive, that there are no "insatiable" constructs of
2068the form:
2069
2070 rule: subrule(s) subrule
2071
2072and that there are no rules missing (i.e. referred to, but never defined).
2073
These checks are important during development, but can slow down parser
construction in stable code. So Parse::RecDescent provides the
C<E<lt>nocheckE<gt>> directive to turn them off. The directive can only appear
before the first rule definition, and switches off checking throughout the rest
of the current grammar.
2079
2080Typically, this directive would be added when a parser has been thoroughly
2081tested and is ready for release.
2082
2083=back
2084
2085
2086=head2 Subrule argument lists
2087
2088It is occasionally useful to pass data to a subrule which is being invoked. For
2089example, consider the following grammar fragment:
2090
2091 classdecl: keyword decl
2092
2093 keyword: 'struct' | 'class';
2094
2095 decl: # WHATEVER
2096
2097The C<decl> rule might wish to know which of the two keywords was used
2098(since it may affect some aspect of the way the subsequent declaration
2099is interpreted). C<Parse::RecDescent> allows the grammar designer to
2100pass data into a rule, by placing that data in an I<argument list>
2101(that is, in square brackets) immediately after any subrule item in a
2102production. Hence, we could pass the keyword to C<decl> as follows:
2103
2104 classdecl: keyword decl[ $item[1] ]
2105
2106 keyword: 'struct' | 'class';
2107
2108 decl: # WHATEVER
2109
2110The argument list can consist of any number (including zero!) of comma-separated
2111Perl expressions. In other words, it looks exactly like a Perl anonymous
2112array reference. For example, we could pass the keyword, the name of the
2113surrounding rule, and the literal 'keyword' to C<decl> like so:
2114
2115 classdecl: keyword decl[$item[1],$item[0],'keyword']
2116
2117 keyword: 'struct' | 'class';
2118
2119 decl: # WHATEVER
2120
2121Within the rule to which the data is passed (C<decl> in the above examples)
2122that data is available as the elements of a local variable C<@arg>. Hence
2123C<decl> might report its intentions as follows:
2124
2125 classdecl: keyword decl[$item[1],$item[0],'keyword']
2126
2127 keyword: 'struct' | 'class';
2128
2129 decl: { print "Declaring $arg[0] (a $arg[2])\n";
2130 print "(this rule called by $arg[1])" }
2131
2132Subrule argument lists can also be interpreted as hashes, simply by using
2133the local variable C<%arg> instead of C<@arg>. Hence we could rewrite the
2134previous example:
2135
2136 classdecl: keyword decl[keyword => $item[1],
2137 caller => $item[0],
2138 type => 'keyword']
2139
2140 keyword: 'struct' | 'class';
2141
2142 decl: { print "Declaring $arg{keyword} (a $arg{type})\n";
2143 print "(this rule called by $arg{caller})" }
2144
2145Both C<@arg> and C<%arg> are always available, so the grammar designer may
2146choose whichever convention (or combination of conventions) suits best.
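
A minimal runnable version of the hash-style example might look like this.
The C<decl> rule here simply matches one word, which is an assumption made
to keep the sketch short:

    use Parse::RecDescent;

    my $parser = Parse::RecDescent->new(q{
        classdecl: keyword decl[ keyword => $item[1], caller => $item[0] ]

        keyword:   'struct' | 'class'

        decl:      /\w+/
                       { $return = "$arg{keyword} $item[1] (called by $arg{caller})" }
    }) or die "Bad grammar";

    my $description = $parser->classdecl('class Widget');
    die "No match" unless defined $description;

    print "$description\n";

Since C<$item[0]> holds the name of the current rule, C<$arg{caller}> should
be C<classdecl> inside C<decl>.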
2147
2148Subrule argument lists are also useful for creating "rule templates"
2149(especially when used in conjunction with the C<E<lt>matchrule:...E<gt>>
2150directive). For example, the subrule:
2151
2152 list: <matchrule:$arg{rule}> /$arg{sep}/ list[%arg]
2153 { $return = [ $item[1], @{$item[3]} ] }
2154 | <matchrule:$arg{rule}>
2155 { $return = [ $item[1]] }
2156
2157is a handy template for the common problem of matching a separated list.
2158For example:
2159
2160 function: 'func' name '(' list[rule=>'param',sep=>';'] ')'
2161
2162 param: list[rule=>'name',sep=>','] ':' typename
2163
2164 name: /\w+/
2165
2166 typename: name
2167
2168
2169When a subrule argument list is used with a repeated subrule, the argument list
2170goes I<before> the repetition specifier:
2171
2172 list: /some|many/ thing[ $item[1] ](s)
2173
2174The argument list is "late bound". That is, it is re-evaluated for every
2175repetition of the repeated subrule.
2176This means that each repeated attempt to match the subrule may be
2177passed a completely different set of arguments if the value of the
2178expression in the argument list changes between attempts. So, for
2179example, the grammar:
2180
2181 { $::species = 'dogs' }
2182
2183 pair: 'two' animal[$::species](s)
2184
2185 animal: /$arg[0]/ { $::species = 'cats' }
2186
2187will match the string "two dogs cats cats" completely, whereas
2188it will only match the string "two dogs dogs dogs" up to the
2189eighth letter. If the value of the argument list were "early bound"
2190(that is, evaluated only the first time a repeated subrule match is
2191attempted), one would expect the matching behaviours to be reversed.
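
The late binding can be observed by returning the matched animals. In this
sketch the action of C<animal> also returns the matched text, and
C<$::species> is reset by the calling code before each parse (both are
changes made for the example):

    use Parse::RecDescent;

    my $parser = Parse::RecDescent->new(q{
        pair:   'two' animal[$::species](s)

        animal: /$arg[0]/
                    { $::species = 'cats'; $return = $item[1] }
    }) or die "Bad grammar";

    my %matched;
    for my $text ('two dogs cats cats', 'two dogs dogs dogs') {
        $::species = 'dogs';              # start every parse with dogs
        my $animals = $parser->pair($text);
        $matched{$text} = "@$animals";
        print "$text => $matched{$text}\n";
    }

The first string should yield three animals, the second only one, because
the argument has already changed to C<cats> by the second repetition.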
2192
2193Of course, it is possible to effectively "early bind" such argument lists
2194by passing them a value which does not change on each repetition. For example:
2195
2196 { $::species = 'dogs' }
2197
2198 pair: 'two' { $::species } animal[$item[2]](s)
2199
2200 animal: /$arg[0]/ { $::species = 'cats' }
2201
2202
2203Arguments can also be passed to the start rule, simply by appending them
2204to the argument list with which the start rule is called (I<after> the
2205"line number" parameter). For example, given:
2206
2207 $parser = new Parse::RecDescent ( $grammar );
2208
2209 $parser->data($text, 1, "str", 2, \@arr);
2210
2211 # ^^^^^ ^ ^^^^^^^^^^^^^^^
2212 # | | |
2213 # TEXT TO BE PARSED | |
2214 # STARTING LINE NUMBER |
2215 # ELEMENTS OF @arg WHICH IS PASSED TO RULE data
2216
2217then within the productions of the rule C<data>, the array C<@arg> will contain
2218C<("str", 2, \@arr)>.
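
A short sketch that makes the start rule return its own arguments (the
one-word C<data> rule is an assumption for the example):

    use Parse::RecDescent;

    my $parser = Parse::RecDescent->new(q{
        data: /\w+/ { $return = [ $item[1], @arg ] }
    }) or die "Bad grammar";

    # 'payload' is the text, 1 is the starting line number,
    # and the remaining values become @arg inside the rule
    my $result = $parser->data('payload', 1, 'str', 2);
    die "No match" unless defined $result;

    print join(', ', @$result), "\n";

Note that the line number itself is not part of C<@arg>.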
2219
2220
2221=head2 Alternations
2222
2223Alternations are implicit (unnamed) rules defined as part of a production. An
2224alternation is defined as a series of '|'-separated productions inside a
2225pair of round brackets. For example:
2226
2227 character: 'the' ( good | bad | ugly ) /dude/
2228
Every alternation implicitly defines a new subrule, whose
automatically-generated name indicates its origin:
"_alternation_<I>_of_production_<P>_of_rule_<R>" for the appropriate
values of <I>, <P>, and <R>. A call to this implicit subrule is then
inserted in place of the brackets. Hence the above example is merely a
convenient short-hand for:
2235
2236 character: 'the'
2237 _alternation_1_of_production_1_of_rule_character
2238 /dude/
2239
2240 _alternation_1_of_production_1_of_rule_character:
2241 good | bad | ugly
2242
2243Since alternations are parsed by recursively calling the parser generator,
2244any type(s) of item can appear in an alternation. For example:
2245
2246 character: 'the' ( 'high' "plains" # Silent, with poncho
2247 | /no[- ]name/ # Silent, no poncho
2248 | vengeance_seeking # Poncho-optional
2249 | <error>
2250 ) drifter
2251
2252In this case, if an error occurred, the automatically generated
2253message would be:
2254
2255 ERROR (line <N>): Invalid implicit subrule: Expected
2256 'high' or /no[- ]name/ or generic,
2257 but found "pacifist" instead
2258
2259Since every alternation actually has a name, it's even possible
2260to extend or replace them:
2261
    $parser->Replace(
        "_alternation_1_of_production_1_of_rule_character:
            'generic Eastwood'"
    );
2266
2267More importantly, since alternations are a form of subrule, they can be given
2268repetition specifiers:
2269
2270 character: 'the' ( good | bad | ugly )(?) /dude/
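
As a sketch of how the value of such an optional alternation can be used,
the version below replaces the subrules with literal strings and adds an
action. The fallback value C<unremarkable> is invented for the example:

    use Parse::RecDescent;

    my $parser = Parse::RecDescent->new(q{
        character: 'the' ( 'good' | 'bad' | 'ugly' )(?) /dude/
                       { $return = @{$item[2]} ? $item[2][0] : 'unremarkable' }
    }) or die "Bad grammar";

    my $with    = $parser->character('the bad dude');
    my $without = $parser->character('the dude');

    print "with adjective:    $with\n";
    print "without adjective: $without\n";

Like any repeated subrule, the alternation yields an array reference, which
is empty when the optional part is absent.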
2271
2272
2273=head2 Incremental Parsing
2274
C<Parse::RecDescent> provides two methods - C<Extend> and C<Replace> - which
can be used to alter the grammar matched by a parser. Both methods
take the same argument as C<Parse::RecDescent::new>, namely a
grammar specification string.
2279
2280C<Parse::RecDescent::Extend> interprets the grammar specification and adds any
2281productions it finds to the end of the rules for which they are specified. For
2282example:
2283
    $add = "name: 'Jimmy-Bob' | 'Bobby-Jim'\ndesc: colour /necks?/";
    $parser->Extend($add);
2286
2287adds two productions to the rule "name" (creating it if necessary) and one
2288production to the rule "desc".
2289
C<Parse::RecDescent::Replace> is identical, except that it first resets any
rule specified in the additional grammar, removing any existing productions.
Hence after:
2293
    $add = "name: 'Jimmy-Bob' | 'Bobby-Jim'\ndesc: colour /necks?/";
    $parser->Replace($add);

there are I<only> two valid "name"s and the one possible description.
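
The difference between the two methods can be sketched with a one-rule
grammar. The extra name C<Billy-Joe> is made up for the example:

    use Parse::RecDescent;

    my $parser = Parse::RecDescent->new(q{
        name: 'Jimmy-Bob'
    }) or die "Bad grammar";

    # Extend keeps the existing production and adds another
    $parser->Extend(q{ name: 'Bobby-Jim' });
    my $old_after_extend = $parser->name('Jimmy-Bob');
    my $new_after_extend = $parser->name('Bobby-Jim');

    # Replace discards the existing productions of the rule first
    $parser->Replace(q{ name: 'Billy-Joe' });
    my $old_after_replace = $parser->name('Jimmy-Bob');
    my $new_after_replace = $parser->name('Billy-Joe');

    print "after Extend:  both names match\n"
        if defined $old_after_extend and defined $new_after_extend;
    print "after Replace: only the new name matches\n"
        if !defined $old_after_replace and defined $new_after_replace;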
2298
2299A more interesting use of the C<Extend> and C<Replace> methods is to call them
2300inside the action of an executing parser. For example:
2301
    typedef: 'typedef' type_name identifier ';'
                 { $thisparser->Extend("type_name: '$item[3]'") }
           | <error>

    identifier: ...!type_name /[A-Za-z_]\w*/
2307
2308which automatically prevents type names from being typedef'd, or:
2309
2310 command: 'map' key_name 'to' abort_key
2311 { $thisparser->Replace("abort_key: '$item[2]'") }
2312 | 'map' key_name 'to' key_name
2313 { map_key($item[2],$item[4]) }
2314 | abort_key
2315 { exit if confirm("abort?") }
2316
2317 abort_key: 'q'
2318
2319 key_name: ...!abort_key /[A-Za-z]/
2320
2321which allows the user to change the abort key binding, but not to unbind it.
2322
The careful use of such constructs makes it possible to reconfigure
a running parser, eliminating the need for semantic feedback by
providing syntactic feedback instead. However, as currently implemented,
C<Replace()> and C<Extend()> have to regenerate and re-C<eval> the
entire parser whenever they are called. This makes them quite slow for
large grammars.
2329
2330In such cases, the judicious use of an interpolated regex is likely to
2331be far more efficient:
2332
    typedef: 'typedef' type_name identifier ';'
                 { $thisparser->{local}{type_name} .= "|$item[3]" }
           | <error>

    identifier: ...!type_name /[A-Za-z_]\w*/

    type_name: /$thisparser->{local}{type_name}/
2340
2341
2342=head2 Precompiling parsers
2343
2344Normally Parse::RecDescent builds a parser from a grammar at run-time.
2345That approach simplifies the design and implementation of parsing code,
2346but has the disadvantage that it slows the parsing process down - you
2347have to wait for Parse::RecDescent to build the parser every time the
2348program runs. Long or complex grammars can be particularly slow to
2349build, leading to unacceptable delays at start-up.
2350
2351To overcome this, the module provides a way of "pre-building" a parser
2352object and saving it in a separate module. That module can then be used
2353to create clones of the original parser.
2354
2355A grammar may be precompiled using the C<Precompile> class method.
2356For example, to precompile a grammar stored in the scalar $grammar,
2357and produce a class named PreGrammar in a module file named PreGrammar.pm,
2358you could use:
2359
2360 use Parse::RecDescent;
2361
2362 Parse::RecDescent->Precompile($grammar, "PreGrammar");
2363
2364The first argument is the grammar string, the second is the name of the class
2365to be built. The name of the module file is generated automatically by
2366appending ".pm" to the last element of the class name. Thus
2367
2368 Parse::RecDescent->Precompile($grammar, "My::New::Parser");
2369
2370would produce a module file named Parser.pm.
2371
2372It is somewhat tedious to have to write a small Perl program just to
2373generate a precompiled grammar class, so Parse::RecDescent has some special
2374magic that allows you to do the job directly from the command-line.
2375
2376If your grammar is specified in a file named F<grammar>, you can generate
2377a class named Yet::Another::Grammar like so:
2378
2379 > perl -MParse::RecDescent - grammar Yet::Another::Grammar
2380
2381This would produce a file named F<Grammar.pm> containing the full
2382definition of a class called Yet::Another::Grammar. Of course, to use
2383that class, you would need to put the F<Grammar.pm> file in a
2384directory named F<Yet/Another>, somewhere in your Perl include path.
2385
2386Having created the new class, it's very easy to use it to build
2387a parser. You simply C<use> the new module, and then call its
2388C<new> method to create a parser object. For example:
2389
2390 use Yet::Another::Grammar;
2391 my $parser = Yet::Another::Grammar->new();
2392
2393The effect of these two lines is exactly the same as:
2394
2395 use Parse::RecDescent;
2396
2397 open GRAMMAR_FILE, "grammar" or die;
2398 local $/;
2399 my $grammar = <GRAMMAR_FILE>;
2400
2401 my $parser = Parse::RecDescent->new($grammar);
2402
2403only considerably faster.
2404
2405Note however that the parsers produced by either approach are exactly
2406the same, so whilst precompilation has an effect on I<set-up> speed,
2407it has no effect on I<parsing> speed. RecDescent 2.0 will address that
2408problem.
2409
2410
2411=head2 A Metagrammar for C<Parse::RecDescent>
2412
2413The following is a specification of grammar format accepted by
2414C<Parse::RecDescent::new> (specified in the C<Parse::RecDescent> grammar format!):
2415
    grammar    : component(s)

    component  : rule | comment

    rule       : "\n" identifier ":" production(s?)

    production : item(s)
2423
2424 item : lookahead(?) simpleitem
2425 | directive
2426 | comment
2427
2428 lookahead : '...' | '...!' # +'ve or -'ve lookahead
2429
2430 simpleitem : subrule args(?) # match another rule
2431 | repetition # match repeated subrules
2432 | terminal # match the next input
2433 | bracket args(?) # match alternative items
2434 | action # do something
2435
2436 subrule : identifier # the name of the rule
2437
2438 args : {extract_codeblock($text,'[]')} # just like a [...] array ref
2439
2440 repetition : subrule args(?) howoften
2441
    howoften   : '(?)'                      # 0 or 1 times
               | '(s?)'                     # 0 or more times
               | '(s)'                      # 1 or more times
               | /[(](\d+)[.][.](\d+)[)]/   # $1 to $2 times
               | /[(][.][.](\d+)[)]/        # at most $1 times
               | /[(](\d+)[.][.][)]/        # at least $1 times
2448
2449 terminal : /[/]([\][/]|[^/])*[/]/ # interpolated pattern
2450 | /"([\]"|[^"])*"/ # interpolated literal
2451 | /'([\]'|[^'])*'/ # uninterpolated literal
2452
2453 action : { extract_codeblock($text) } # embedded Perl code
2454
    bracket    : '(' item(s) production(s?) ')'   # alternative subrules
2456
2457 directive : '<commit>' # commit to production
2458 | '<uncommit>' # cancel commitment
2459 | '<resync>' # skip to newline
2460 | '<resync:' pattern '>' # skip <pattern>
2461 | '<reject>' # fail this production
2462 | '<reject:' condition '>' # fail if <condition>
2463 | '<error>' # report an error
2464 | '<error:' string '>' # report error as "<string>"
2465 | '<error?>' # error only if committed
2466 | '<error?:' string '>' # " " " "
2467 | '<rulevar:' /[^>]+/ '>' # define rule-local variable
2468 | '<matchrule:' string '>' # invoke rule named in string
2469
2470 identifier : /[a-z]\w*/i # must start with alpha
2471
2472 comment : /#[^\n]*/ # same as Perl
2473
2474 pattern : {extract_bracketed($text,'<')} # allow embedded "<..>"
2475
2476 condition : {extract_codeblock($text,'{<')} # full Perl expression
2477
2478 string : {extract_variable($text)} # any Perl variable
2479 | {extract_quotelike($text)} # or quotelike string
2480 | {extract_bracketed($text,'<')} # or balanced brackets
2481
2482
2483=head1 GOTCHAS
2484
2485This section describes common mistakes that grammar writers seem to
2486make on a regular basis.
2487
2488=head2 1. Expecting an error to always invalidate a parse
2489
2490A common mistake when using error messages is to write the grammar like this:
2491
2492 file: line(s)
2493
2494 line: line_type_1
2495 | line_type_2
2496 | line_type_3
2497 | <error>
2498
2499The expectation seems to be that any line that is not of type 1, 2 or 3 will
2500invoke the C<E<lt>errorE<gt>> directive and thereby cause the parse to fail.
2501
2502Unfortunately, that only happens if the error occurs in the very first line.
2503The first rule states that a C<file> is matched by one or more lines, so if
2504even a single line succeeds, the first rule is completely satisfied and the
2505parse as a whole succeeds. That means that any error messages generated by
2506subsequent failures in the C<line> rule are quietly ignored.
2507
2508Typically what's really needed is this:
2509
2510 file: line(s) eofile { $return = $item[1] }
2511
2512 line: line_type_1
2513 | line_type_2
2514 | line_type_3
2515 | <error>
2516
2517 eofile: /^\Z/
2518
2519The addition of the C<eofile> subrule to the first production means that
2520a file only matches a series of successful C<line> matches I<that consume the
2521complete input text>. If any input text remains after the lines are matched,
2522there must have been an error in the last C<line>. In that case the C<eofile>
2523rule will fail, causing the entire C<file> rule to fail too.
2524
2525Note too that C<eofile> must match C</^\Z/> (end-of-text), I<not>
2526C</^\cZ/> or C</^\cD/> (end-of-file).

And don't forget the action at the end of the production. If you just
write:

    file: line(s) eofile

then the value returned by the C<file> rule will be the value of its
last item: C<eofile>. Since C<eofile> always returns an empty string
on success, that will cause the C<file> rule to return that empty
string. Apart from returning the wrong value, returning an empty string
will trip up code such as:

    $parser->file($filetext) || die;

(since "" is false).

Remember that Parse::RecDescent returns undef on failure,
so the only safe test for failure is:

    defined($parser->file($filetext)) || die;
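
The difference between the two tests can be seen without building a parser.
In the sketch below, C<fake_file_rule> is a hypothetical stand-in for
C<$parser-E<gt>file(...)>: it returns an empty string to mimic a successful
match by a C<file> rule that lacks the final action, and C<undef> to mimic a
failed parse:

    use strict;
    use warnings;

    # Hypothetical stand-in for $parser->file($filetext).
    sub fake_file_rule {
        my ($matched) = @_;
        return $matched ? "" : undef;
    }

    my $result = fake_file_rule(1);     # a "successful" parse

    # The truth test misreports the success as a failure...
    print "truth test:   ", ($result ? "ok" : "failed"), "\n";

    # ...whereas the defined() test reports it correctly.
    print "defined test: ", (defined $result ? "ok" : "failed"), "\n";

Under these assumptions the truth test reports a failure and the C<defined>
test reports success.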


=head1 DIAGNOSTICS

Diagnostics are intended to be self-explanatory (particularly if you
use B<-RD_HINT> (under B<perl -s>) or define C<$::RD_HINT> inside the program).
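
As a rough illustration, either of the following should turn the hints on.
The script name F<myparser.pl> is hypothetical, and the sketch assumes that
the flag is set before the parser object is created, so that it is already
in effect when the grammar is compiled:

    # From the command line (note the -s switch):
    #
    #     perl -s myparser.pl -RD_HINT
    #
    # Or from inside the program, before the parser is constructed:
    $::RD_HINT = 1;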

C<Parse::RecDescent> currently diagnoses the following:

=over 4

=item *

Invalid regular expressions used as pattern terminals (fatal error).

=item *

Invalid Perl code in code blocks (fatal error).

=item *

Lookahead used in the wrong place or in a nonsensical way (fatal error).

=item *

"Obvious" cases of left-recursion (fatal error).

=item *

Missing or extra components in a C<E<lt>leftopE<gt>> or C<E<lt>rightopE<gt>>
directive.

=item *

Unrecognisable components in the grammar specification (fatal error).

=item *

"Orphaned" rule components specified before the first rule (fatal error)
or after an C<E<lt>errorE<gt>> directive (level 3 warning).

=item *

Missing rule definitions (this only generates a level 3 warning, since you
may be providing them later via C<Parse::RecDescent::Extend()>).

=item *

Instances where greedy repetition behaviour will almost certainly
cause the failure of a production (a level 3 warning - see
L<"ON-GOING ISSUES AND FUTURE DIRECTIONS"> below).

=item *

Attempts to define rules named 'Replace' or 'Extend', which cannot be
called directly through the parser object because of the predefined
meaning of C<Parse::RecDescent::Replace> and
C<Parse::RecDescent::Extend>. (Only a level 2 warning is generated, since
such rules I<can> still be used as subrules).

=item *

Productions which consist of a single C<E<lt>error?E<gt>>
directive, and which therefore may succeed unexpectedly
(a level 2 warning, since this might conceivably be the desired effect).

=item *

Multiple consecutive lookahead specifiers (a level 1 warning only, since their
effects simply accumulate).

=item *

Productions which start with a C<E<lt>rejectE<gt>> or C<E<lt>rulevar:...E<gt>>
directive. Such productions are optimized away (a level 1 warning).

=item *

Rules which are autogenerated under C<$::RD_AUTOSTUB> (a level 1 warning).

=back

=head1 AUTHOR

Damian Conway (damian@conway.org)

=head1 BUGS AND IRRITATIONS

There are undoubtedly serious bugs lurking somewhere in this much code :-)
Bug reports and other feedback are most welcome.

Ongoing annoyances include:

=over 4

=item *

There's no support for parsing directly from an input stream.
If and when the Perl Gods give us regular expressions on streams,
this should be trivial (ahem!) to implement.

=item *

The parser generator can get confused if actions aren't properly
closed or if they contain particularly nasty Perl syntax errors
(especially unmatched curly brackets).

=item *

The generator only detects the most obvious form of left recursion
(potential recursion on the first subrule in a rule). More subtle
forms of left recursion (for example, through the second item in a
rule after a "zero" match of a preceding "zero-or-more" repetition,
or after a match of a subrule with an empty production) are not found.

=item *

Instead of complaining about left-recursion, the generator should
silently transform the grammar to remove it. Don't expect this
feature any time soon as it would require a more sophisticated
approach to parser generation than is currently used.

=item *

The generated parsers don't always run as fast as might be wished.

=item *

The meta-parser should be bootstrapped using C<Parse::RecDescent> :-)

=back

=head1 ON-GOING ISSUES AND FUTURE DIRECTIONS

=over 4

=item 1.

Repetitions are "incorrigibly greedy" in that they will eat everything they can
and won't backtrack if that behaviour causes a production to fail needlessly.
So, for example:

    rule: subrule(s) subrule

will I<never> succeed, because the repetition will eat all the
subrules it finds, leaving none to match the second item. Such
constructions are relatively rare (and C<Parse::RecDescent::new> generates a
warning whenever they occur) so this may not be a problem, especially
since the insatiable behaviour can be overcome "manually" by writing:

    rule: penultimate_subrule(s) subrule

    penultimate_subrule: subrule ...subrule

The issue is that this construction is exactly twice as expensive as the
original, whereas backtracking would add only 1/I<N> to the cost (for
matching I<N> repetitions of C<subrule>). I would welcome feedback on
the need for backtracking; particularly on cases where the lack of it
makes parsing performance problematical.

=item 2.

Having opened that can of worms, it's also necessary to consider whether there
is a need for non-greedy repetition specifiers. Again, it's possible (at some
cost) to manually provide the required functionality:

    rule: nongreedy_subrule(s) othersubrule

    nongreedy_subrule: subrule ...!othersubrule

Overall, the issue is whether the benefit of this extra functionality
outweighs the drawbacks of further complicating the (currently
minimalist) grammar specification syntax, and (worse) introducing more overhead
into the generated parsers.

=item 3.

An C<E<lt>autocommitE<gt>> directive would be nice. That is, it would be useful to be
able to say:

    command: <autocommit>
    command: 'find' name
           | 'find' address
           | 'do' command 'at' time 'if' condition
           | 'do' command 'at' time
           | 'do' command
           | unusual_command

and have the generator work out that this should be "pruned" thus:

    command: 'find' name
           | 'find' <commit> address
           | 'do' <commit> command <uncommit>
                'at' time
                'if' <commit> condition
           | 'do' <commit> command <uncommit>
                'at' <commit> time
           | 'do' <commit> command
           | unusual_command

There are several issues here. Firstly, should the
C<E<lt>autocommitE<gt>> automatically install an C<E<lt>uncommitE<gt>>
at the start of the last production (on the grounds that the "command"
rule doesn't know whether an "unusual_command" might start with "find"
or "do") or should the "unusual_command" subgraph be analysed (to see
if it I<might> be viable after a "find" or "do")?

The second issue is how regular expressions should be treated. The simplest
approach would be simply to uncommit before them (on the grounds that they
I<might> match). Better efficiency would be obtained by analyzing all preceding
literal tokens to determine whether the pattern would match them.

Overall, the issues are: can such automated "pruning" approach a hand-tuned
version sufficiently closely to warrant the extra set-up expense, and (more
importantly) is the problem important enough to even warrant the non-trivial
effort of building an automated solution?

=back

=head1 COPYRIGHT

Copyright (c) 1997-2000, Damian Conway. All Rights Reserved.
This module is free software. It may be used, redistributed
and/or modified under the terms of the Perl Artistic License
  (see http://www.perl.com/perl/misc/Artistic.html)