[unix-history] / usr / doc / eqn / e5

.NH
Language Theory
.PP
The basic structure of the language is
not a particularly original one.
Equations are pictured as a set of ``boxes,''
pieced together in various ways.
For example, something with a subscript is
just a box followed by another box moved downward
and shrunk
by an appropriate amount.
A fraction is just a box centered above another box,
at the right altitude,
with a line of correct length drawn between them.
.PP
The grammar for the language is shown below.
For purposes of exposition, we have collapsed
some productions. In the original grammar, there
are about 70 productions, but many of these
are simple ones used only to guarantee
that some keyword is recognized early enough in the parsing process.
Symbols in
capital letters
are terminal symbols; 
lower case
symbols are non-terminals, i.e., syntactic categories.
The vertical bar \(bv indicates an alternative;
the brackets [ ] indicate optional material.
A
.UC TEXT
is a string of non-blank characters or
any string inside double quotes;
the other terminal symbols represent literal occurrences
of the corresponding keyword.
.P1
.ce 0
.ta .3i
.ps 9
.ne 17
.in 1
eqn	: box | eqn box
.sp 5p
box	: text
	| { eqn }
	| box OVER box
	| SQRT box
	| box SUB box | box SUP box
	| [ L | C | R ]PILE { list }
	| LEFT text eqn [ RIGHT text ]
	| box [ FROM box ] [ TO box ]
	| SIZE text box
	| [ROMAN | BOLD | ITALIC] box
	| box [HAT | BAR | DOT | DOTDOT | TILDE]
	| DEFINE text text
.sp 5p
list	: eqn | list ABOVE eqn
.sp 5p
text	: TEXT
.ps 10
.in 0
.P2
.PP
The grammar makes it obvious why there are few exceptions.
For example, the observation that something can be replaced by a more complicated something 
in braces is implicit in the productions:
.P1
.ce 0
   eqn	: box | eqn box
   box	: text | { eqn }
.P2
Anywhere a single character could be used,
.ul
any
legal construction can be used.
.PP
Clearly, our grammar is highly ambiguous.
What, for instance, do we do with the input
.P1
a over b over c  ?
.P2
Is it
.P1
{a over b} over c 
.P2
or is it
.P1
a over {b over c}  ?
.P2
.PP
To answer questions like this, the grammar
is supplemented with a small set of rules that describe the precedence 
and associativity
of operators.
In particular, we specify (more or less arbitrarily)
that
.ul
over
associates to the left,
so the first alternative above is the one chosen.
On the other hand, 
.ul
sub
and
.ul
sup
bind to the right,
because this is closer to standard mathematical practice.
That is, we assume $x sup a sup b$ is $x sup {(a sup b )}$,
not  $(x sup a ) sup b$.
.PP
The precedence rules resolve the ambiguity in a construction like
.P1
a sup 2 over b
.P2
We define
.ul
sup
to have a higher precedence than
.ul
over,
so this construction is parsed as
$a sup 2 over b$ instead of $a sup {2 over b}$.
.PP
Naturally, a user can always
force a particular parsing
by placing braces around expressions.
.PP
The ambiguous grammar approach seems to be quite useful.
The grammar we use is small enough to be easily understood,
for it contains none of the productions that would be
normally used for resolving ambiguity.
Instead the supplemental information about
precedence and associativity (also small enough to be understood)
provides the compiler-compiler
with the information it needs
to make a fast, deterministic parser for
the specific language we want.
When the language is supplemented by the disambiguating rules,
it is in fact
.UC LR(1)
and thus easy to parse[5].
.PP
The output code is generated as the input is scanned.
Any time a production
of the grammar is recognized,
(potentially) some
.UC TROFF
commands are output.
For example, when the lexical analyzer 
reports that it has found a
.UC TEXT
(i.e., a string of contiguous characters),
we have recognized the production:
.P1
text    : TEXT
.P2
The translation of this is simple.
We generate a local name for the string,
then hand the name and the string to
.UC TROFF,
and let
.UC TROFF
perform the storage management.
All we save is the name of the string,
its height, and its baseline.
.PP
As another example,
the translation associated with the production
.P1
box    : box OVER box
.P2
is:
.P1
.ce 0
.in 1
.ne 14
Width of output box =
  slightly more than largest input width
Height of output box =
  slightly more than sum of input heights
Base of output box =
  slightly more than height of bottom input box
String describing output box =
  move down;
  move right enough to center bottom box;
  draw bottom box (i.e., copy string for bottom box);
  move up; move left enough to center top box;
  draw top box (i.e., copy string for top box);
  move down and left; draw line full width;
  return to proper base line.
.in 0
.P2
Most of the other productions have 
equally simple semantic actions.
Picturing the output as a set of properly placed boxes
makes the right sequence of positioning commands
quite obvious.
The main difficulty is in finding the right numbers to use
for esthetically pleasing positioning.
.PP
With a grammar, it is usually clear how to extend the language.
For instance, one of our users
suggested a
.UC TENSOR
operator, to make constructions like
.EQ
~ sub size 7 m sup size 7 l
{bold T from n to k} sub size 7 i sup size 7 j
.EN
Grammatically, this is easy:
it is sufficient to add a production like
.P1
box    : TENSOR { list }
.P2
Semantically, we need only juggle the boxes to the right places.
Commit	Line	Data
755c0b0d TL	1	.NH
	2	Language Theory
	3	.PP
	4	The basic structure of the language is
	5	not a particularly original one.
	6	Equations are pictured as a set of ``boxes,''
	7	pieced together in various ways.
	8	For example, something with a subscript is
	9	just a box followed by another box moved downward
	10	and shrunk
	11	by an appropriate amount.
	12	A fraction is just a box centered above another box,
	13	at the right altitude,
	14	with a line of correct length drawn between them.
	15	.PP
	16	The grammar for the language is shown below.
	17	For purposes of exposition, we have collapsed
	18	some productions. In the original grammar, there
	19	are about 70 productions, but many of these
	20	are simple ones used only to guarantee
	21	that some keyword is recognized early enough in the parsing process.
	22	Symbols in
	23	capital letters
	24	are terminal symbols;
	25	lower case
	26	symbols are non-terminals, i.e., syntactic categories.
	27	The vertical bar \(bv indicates an alternative;
	28	the brackets [ ] indicate optional material.
	29	A
	30	.UC TEXT
	31	is a string of non-blank characters or
	32	any string inside double quotes;
	33	the other terminal symbols represent literal occurrences
	34	of the corresponding keyword.
	35	.P1
	36	.ce 0
	37	.ta .3i
	38	.ps 9
	39	.ne 17
	40	.in 1
	41	eqn : box \| eqn box
	42	.sp 5p
	43	box : text
	44	\| { eqn }
	45	\| box OVER box
	46	\| SQRT box
	47	\| box SUB box \| box SUP box
	48	\| [ L \| C \| R ]PILE { list }
	49	\| LEFT text eqn [ RIGHT text ]
	50	\| box [ FROM box ] [ TO box ]
	51	\| SIZE text box
	52	\| [ROMAN \| BOLD \| ITALIC] box
	53	\| box [HAT \| BAR \| DOT \| DOTDOT \| TILDE]
	54	\| DEFINE text text
	55	.sp 5p
	56	list : eqn \| list ABOVE eqn
	57	.sp 5p
	58	text : TEXT
	59	.ps 10
	60	.in 0
	61	.P2
	62	.PP
	63	The grammar makes it obvious why there are few exceptions.
	64	For example, the observation that something can be replaced by a more complicated something
65	in braces is implicit in the productions:
66	.P1
67	.ce 0
68	eqn : box \| eqn box
69	box : text \| { eqn }
70	.P2
71	Anywhere a single character could be used,
72	.ul
73	any
74	legal construction can be used.
75	.PP
76	Clearly, our grammar is highly ambiguous.
77	What, for instance, do we do with the input
78	.P1
79	a over b over c ?
80	.P2
81	Is it
82	.P1
83	{a over b} over c
84	.P2
85	or is it
86	.P1
87	a over {b over c} ?
88	.P2
89	.PP
90	To answer questions like this, the grammar
91	is supplemented with a small set of rules that describe the precedence
92	and associativity
93	of operators.
94	In particular, we specify (more or less arbitrarily)
95	that
96	.ul
97	over
98	associates to the left,
99	so the first alternative above is the one chosen.
100	On the other hand,
101	.ul
102	sub
103	and
104	.ul
105	sup
106	bind to the right,
107	because this is closer to standard mathematical practice.
108	That is, we assume $x sup a sup b$ is $x sup {(a sup b )}$,
109	not $(x sup a ) sup b$.
110	.PP
111	The precedence rules resolve the ambiguity in a construction like
112	.P1
113	a sup 2 over b
114	.P2
115	We define
116	.ul
117	sup
118	to have a higher precedence than
119	.ul
120	over,
121	so this construction is parsed as
122	$a sup 2 over b$ instead of $a sup {2 over b}$.
123	.PP
124	Naturally, a user can always
125	force a particular parsing
126	by placing braces around expressions.
127	.PP
128	The ambiguous grammar approach seems to be quite useful.
129	The grammar we use is small enough to be easily understood,
130	for it contains none of the productions that would be
131	normally used for resolving ambiguity.
132	Instead the supplemental information about
133	precedence and associativity (also small enough to be understood)
134	provides the compiler-compiler
135	with the information it needs
136	to make a fast, deterministic parser for
137	the specific language we want.
138	When the language is supplemented by the disambiguating rules,
139	it is in fact
140	.UC LR(1)
141	and thus easy to parse[5].
142	.PP
143	The output code is generated as the input is scanned.
144	Any time a production
145	of the grammar is recognized,
146	(potentially) some
147	.UC TROFF
148	commands are output.
149	For example, when the lexical analyzer
150	reports that it has found a
151	.UC TEXT
152	(i.e., a string of contiguous characters),
153	we have recognized the production:
154	.P1
155	text : TEXT
156	.P2
157	The translation of this is simple.
158	We generate a local name for the string,
159	then hand the name and the string to
160	.UC TROFF,
161	and let
162	.UC TROFF
163	perform the storage management.
164	All we save is the name of the string,
165	its height, and its baseline.
166	.PP
167	As another example,
168	the translation associated with the production
169	.P1
170	box : box OVER box
171	.P2
172	is:
173	.P1
174	.ce 0
175	.in 1
176	.ne 14
177	Width of output box =
178	slightly more than largest input width
179	Height of output box =
180	slightly more than sum of input heights
181	Base of output box =
182	slightly more than height of bottom input box
183	String describing output box =
184	move down;
185	move right enough to center bottom box;
186	draw bottom box (i.e., copy string for bottom box);
187	move up; move left enough to center top box;
188	draw top box (i.e., copy string for top box);
189	move down and left; draw line full width;
190	return to proper base line.
191	.in 0
192	.P2
193	Most of the other productions have
194	equally simple semantic actions.
195	Picturing the output as a set of properly placed boxes
196	makes the right sequence of positioning commands
197	quite obvious.
198	The main difficulty is in finding the right numbers to use
199	for esthetically pleasing positioning.
200	.PP
201	With a grammar, it is usually clear how to extend the language.
202	For instance, one of our users
203	suggested a
204	.UC TENSOR
205	operator, to make constructions like
206	.EQ
207	~ sub size 7 m sup size 7 l
208	{bold T from n to k} sub size 7 i sup size 7 j
209	.EN
210	Grammatically, this is easy:
211	it is sufficient to add a production like
212	.P1
213	box : TENSOR { list }
214	.P2
215	Semantically, we need only juggle the boxes to the right places.