.tr *\(**
.tr |\(or
.SH
1: Basic Specifications
.PP
Names refer to either tokens or nonterminal symbols.
Yacc requires
token names to be declared as such.
In addition, for reasons discussed in Section 3, it is often desirable
to include the lexical analyzer as part of the specification file;
it may be useful to include other programs as well.
Thus, every specification file consists of three sections:
the
.I declarations ,
.I "(grammar) rules" ,
and
.I programs .
The sections are separated by double percent ``%%'' marks.
(The percent ``%'' is generally used in Yacc specifications as an escape character.)
.PP
In other words, a full specification file looks like
.DS
declarations
%%
rules
%%
programs
.DE
.PP
The declarations section may be empty.
Moreover, if the programs section is omitted, the second %% mark may be omitted also;
thus, the smallest legal Yacc specification is
.DS
%%
rules
.DE
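.PP
As a concrete sketch of this layout, a small specification with all three sections might look as follows;
the names WORD and sentence are hypothetical, chosen only for illustration,
and the %token line declares WORD to be a token, as explained below.
.DS
%token WORD
%%
sentence : WORD WORD ;
%%
/* user-supplied C programs, such as the lexical analyzer, go here */
.DE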
.PP
Blanks, tabs, and newlines are ignored except
that they may not appear in names or multi-character reserved symbols.
Comments may appear wherever a name is legal; they are enclosed
in /* . . . */, as in C and PL/I.
.PP
The rules section is made up of one or more grammar rules.
A grammar rule has the form:
.DS
A : BODY ;
.DE
A represents a nonterminal name, and BODY represents a sequence of zero or more names and literals.
The colon and the semicolon are Yacc punctuation.
.PP
Names may be of arbitrary length, and may be made up of letters, dot ``.'', underscore ``\_'', and
non-initial digits.
Upper and lower case letters are distinct.
The names used in the body of a grammar rule may represent tokens or nonterminal symbols.
.PP
A literal consists of a character enclosed in single quotes ``\'''.
As in C, the backslash ``\e'' is an escape character within literals, and all the C escapes
are recognized.
Thus
.DS
\'\en\'	newline
\'\er\'	return
\'\e\'\'	single quote ``\'''
\'\e\e\'	backslash ``\e''
\'\et\'	tab
\'\eb\'	backspace
\'\ef\'	form feed
\'\exxx\'	``xxx'' in octal
.DE
For a number of technical reasons, the
\s-2NUL\s0
character (\'\e0\' or 0) should never
be used in grammar rules.
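.PP
For instance, a rule might mix names and literals as sketched here;
the names line and NUMBER are hypothetical, and NUMBER is assumed to be a token
declared with %token, as described later in this section.
.DS
%token NUMBER
%%
line : NUMBER \'+\' NUMBER \'\en\' ;
.DE
The body of this rule contains two literals, a plus sign and a newline,
each written as a single character enclosed in single quotes.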
.PP
If there are several grammar rules with the same left hand side, the vertical bar ``|''
can be used to avoid rewriting the left hand side.
In addition,
the semicolon at the end of a rule can be dropped before a vertical bar.
Thus the grammar rules
.DS
A : B C D ;
A : E F ;
A : G ;
.DE
can be given to Yacc as
.DS
A : B C D
  | E F
  | G
  ;
.DE
It is not necessary that all grammar rules with the same left hand side appear together in the grammar rules section,
although grouping them makes the input much more readable and easier to change.
.PP
If a nonterminal symbol matches the empty string, this can be indicated in the obvious way:
.DS
empty : ;
.DE
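.PP
An empty rule is commonly used to make part of a construction optional.
The following is a minimal sketch, with hypothetical names number, opt_sign, SIGN, and NUMBER;
the comment in the first alternative is not required, and merely marks the empty body for the reader.
.DS
%token SIGN NUMBER
%%
number : opt_sign NUMBER ;
opt_sign : /* empty */
  | SIGN
  ;
.DE
Here opt_sign matches either a single SIGN token or nothing at all.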
.PP
Names representing tokens must be declared; this is most simply done by writing
.DS
%token name1 name2 . . .
.DE
in the declarations section.
(See Sections 3, 5, and 6 for much more discussion.)
Every name not defined in the declarations section is assumed to represent a nonterminal symbol.
Every nonterminal symbol must appear on the left side of at least one rule.
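.PP
These conventions can be seen in the following sketch, in which all of the names are hypothetical.
IDENT and NUMBER are declared, and so are tokens;
assignment and value are not declared, and so are taken to be nonterminal symbols,
each of which appears on the left side of a rule.
.DS
%token IDENT NUMBER
%%
assignment : IDENT \'=\' value ;
value : IDENT
  | NUMBER
  ;
.DE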
.PP
Of all the nonterminal symbols, one, called the
.I "start symbol" ,
has particular importance.
The parser is designed to recognize the start symbol; thus,
this symbol represents the largest,
most general structure described by the grammar rules.
By default,
the start symbol is taken to be the left hand side of the first
grammar rule in the rules section.
It is possible, and in fact desirable, to declare the start
symbol explicitly in the declarations section using the %start keyword:
.DS
%start symbol
.DE
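.PP
The effect of an explicit declaration can be sketched as follows, using the hypothetical names list, item, and ITEM.
Without the %start line, the start symbol would be item, the left hand side of the first rule;
with it, the start symbol is list, regardless of the order of the rules.
.DS
%token ITEM
%start list
%%
item : ITEM ;
list : /* empty */
  | list item
  ;
.DE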
.PP
The end of the input to the parser is signaled by a special token, called the
.I endmarker .
If the tokens up to, but not including, the endmarker form a structure
which matches the start symbol, the parser function returns to its caller
after the endmarker is seen; it
.I accepts
the input.
If the endmarker is seen in any other context, it is an error.
.PP
It is the job of the user-supplied lexical analyzer
to return the endmarker when appropriate; see Section 3, below.
Usually the endmarker represents some reasonably obvious
I/O status, such as ``end-of-file'' or ``end-of-record''.
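.PP
As a rough sketch of how a lexical analyzer might signal the endmarker,
the following specification treats end-of-file on the standard input as the end of the parser's input.
It assumes that the endmarker is returned as token number zero, as discussed in Section 3;
the names WORD and words, the simple definition of a word as a run of non-blank characters,
and the yyerror and main routines are all hypothetical and supplied only to make the example complete.
The C declarations between %{ and %} are copied into the generated parser.
.DS
%{
#include <stdio.h>
#include <ctype.h>
int yylex(void);
void yyerror(const char *s);
%}
%token WORD
%%
words : /* empty */
  | words WORD
  ;
%%
int yylex(void)
{
	int c;

	/* skip blanks, tabs, and newlines between words */
	while ((c = getchar()) != EOF && isspace(c))
		;
	if (c == EOF)
		return 0;	/* endmarker: end-of-file on standard input */
	/* consume the remaining characters of the word */
	while ((c = getchar()) != EOF && !isspace(c))
		;
	return WORD;
}

void yyerror(const char *s)
{
	fprintf(stderr, "%s\en", s);
}

int main(void)
{
	return yyparse();
}
.DE
With this analyzer, any sequence of words followed by end-of-file matches the start symbol words,
so the parser accepts the input once the endmarker is seen.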