[unix-history] / usr / doc / ctour / cdoc1

.SH
The Intermediate Language
.PP
.FS
\(dgUNIX is a Trademark of Bell Laboratories.
.FE
Communication between the two phases of the compiler proper
is carried out by means of a pair of intermediate files.
These files are treated as having identical structure,
although the second file contains only the code generated for strings.
It is convenient to write strings out separately to reduce the
need for multiple location counters in a later assembly
phase.
.PP
The intermediate language is not machine-independent;
its structure in a number of ways reflects
the fact that C was originally a one-pass compiler
chopped in two to reduce the maximum memory
requirement.
In fact, only the latest version
of the compiler has a complete
intermediate language at all.
Until recently, the first phase of the compiler generated
assembly code for those constructions it could deal with,
and passed expression parse trees, in absolute binary
form,
to the second phase for code generation.
Now, at least, all inter-phase information
is passed in a describable form, and there are
no absolute pointers involved, so the coupling between
the phases is not so strong.
.PP
The areas in which the machine
(and system) dependencies are most noticeable are
.IP 1.
Storage allocation for automatic variables and arguments
has already been performed,
and nodes for such variables refer to them by offset
from a display pointer.
Type conversion
(for example, from integer to pointer)
has already occurred using the assumption of
byte addressing and 2-byte words.
.IP 2.
Data representations suitable to the PDP-11 are assumed;
in particular, floating point constants are passed as
four words in the machine representation.
.PP
As it happens, each intermediate file is represented as a sequence
of binary numbers without any explicit demarcations.
It consists of a sequence of
conceptual lines, each headed by an operator, and possibly containing
various operands.
The operators are small numbers;
to assist in recognizing failure in synchronization,
the high-order byte of each operator word is always the
octal number 376.
Operands are
either 16-bit binary numbers or strings of characters representing names.
Each name is terminated by a null character.
There is no alignment requirement for numerical
operands and so there is no padding
after a name string.
.PP
The binary representation was chosen to avoid the necessity
of converting to and from character form
and to minimize the size of the files.
It would be very easy to make
each operator-operand `line' in the file be
a genuine, printable line, with the numbers in octal or decimal;
this in fact was the representation originally used.
.PP
The operators fall naturally into two classes:
those which represent part of an expression, and all others.
Expressions are transmitted in a reverse-Polish notation;
as they are being read, a tree is built which is isomorphic
to the tree constructed in the first phase.
Expressions are passed as a whole, with no non-expression operators
intervening.
The reader maintains a stack; each leaf of the expression tree (name, constant)
is pushed on the stack;
each unary operator replaces the top of the stack by a node whose
operand is the old top-of-stack;
each binary operator replaces the top pair on the stack with
a single entry.
When the expression is complete there is exactly one item on the
stack.
Following each expression
is a special operator which passes the unique previous expression
to the `optimizer' described below and then to the code
generator.
.PP
Here is the list of operators not themselves part of expressions.
.LP
.Op EOF
marks the end of an input file.
.Op BDATA "flag data ..."
specifies a sequence of bytes to be assembled
as static data.
It is followed by pairs of words; the first member
of the pair is non-zero to indicate that the data continue;
a zero flag is not followed by data and terminates
the operator.
The data bytes occupy the low-order part of a word.
.Op WDATA "flag data ..."
specifies a sequence of words to be assembled as
static data; it is identical to the BDATA operator
except that entire words, not just bytes, are passed.
.Op PROG
means that subsequent information is to be compiled as program text.
.Op DATA
means that subsequent information is to be compiled as static data.
.Op BSS
means that subsequent information is to be compiled as unitialized
static data.
.Op SYMDEF name
means that
the symbol
.I
name
.R
is an external name defined in the current program.
It is produced for each external data or function definition.
.Op CSPACE "name size"
indicates that the name refers to a data area whose size is the
specified number of bytes.
It is produced for external data definitions without explicit initialization.
.Op SSPACE size
indicates that
.I
size
.R
bytes should be set aside for data storage.
It is used to pad out short initializations of external data
and to reserve space for static (internal) data.
It will be preceded by an appropriate label.
.Op EVEN
is produced after each
external data definition whose size is not
an integral number of words.
It is not produced after strings except when they initialize
a character array.
.Op NLABEL name
is produced just before a BDATA or WDATA initializing
external data, and serves as a label for the data.
.Op RLABEL name
is produced just before each function definition,
and labels its entry point.
.Op SNAME "name number"
is produced at the start of each function for each static variable
or label
declared therein.
Subsequent uses of the variable will be in terms of the given number.
The code generator uses this only to produce a debugging symbol table.
.Op ANAME "name number"
Likewise, each automatic variable's name and stack offset
is specified by this operator.
Arguments count as automatics.
.Op RNAME "name number"
Each register variable is similarly named, with its register number.
.Op SAVE number
produces a register-save sequence at the start of each function,
just after its label (RLABEL).
.Op SETREG number
is used to indicate the number of registers used
for register variables.
It actually gives the register number of the lowest
free register; it is redundant because the RNAME operators could be
counted instead.
.Op PROFIL
is produced before the save sequence for functions
when the profile option is turned on.
It produces code to count the number
of times the function is called.
.Op SWIT "deflab line label value ..."
is produced for switches.
When control flows into it,
the value being switched on is in the register
forced by RFORCE (below).
The switch statement occurred on the indicated line
of the source, and the label number of the default location
is
.I
deflab.
.R
Then the operator is followed by a sequence of label-number and value pairs;
the list is terminated by a 0 label.
.Op LABEL number
generates an internal label.
It is referred to elsewhere using the given number.
.Op BRANCH number
indicates an unconditional transfer to the internal label number
given.
.Op RETRN
produces the return sequence for a function.
It occurs only once, at the end of each function.
.Op EXPR line
causes the expression just preceding to be compiled.
The argument is the line number in the source where the
expression occurred.
.Op NAME "class type name"
.Op NAME "class type number"
indicates a name occurring in an expression.
The first form is used when the name is external;
the second when the name is automatic, static, or a register.
Then the number indicates the stack offset, the label number,
or the register number as appropriate.
Class and type encoding is described elsewhere.
.Op CON "type value"
transmits an integer constant.
This and the next two operators occur as part of expressions.
.Op FCON "type 4-word-value"
transmits a floating constant as
four words in PDP-11 notation.
.Op SFCON "type value"
transmits a floating-point constant
whose value is correctly represented by its high-order word
in PDP-11 notation.
.Op NULL
indicates a null argument list of a function call in an expression;
call is a binary operator whose second operand is the argument list.
.Op CBRANCH "label cond"
produces a conditional branch.
It is an expression operator, and will be followed
by an EXPR.
The branch to the label number takes place if the expression's
truth value is the same as that of
.I
cond.
.R
That is, if
.I
cond=1
.R
and the expression evaluates to true, the branch is taken.
.Op binary-operator type
There are binary operators corresponding
to each such source-language operator;
the type of the result of each is passed as well.
Some perhaps-unexpected ones are:
COMMA, which is a right-associative operator designed
to simplify right-to-left evaluation
of function arguments;
prefix and postfix ++ and \-\-, whose second operand
is the increment amount, as a CON;
QUEST and COLON, to express the conditional
expression as `a?(b:c)';
and a sequence of special operators for expressing
relations between pointers, in case pointer comparison
is different from integer comparison
(e.g. unsigned).
.Op unary-operator type
There are also numerous unary operators.
These include ITOF, FTOI, FTOL, LTOF, ITOL, LTOI
which convert among floating, long, and integer;
JUMP which branches indirectly through a label expression;
INIT, which compiles the value of a constant expression
used as an initializer;
RFORCE, which is used before a return sequence or
a switch to place a value in an agreed-upon register.
Commit	Line	Data
c9528a00 C	1	.SH
	2	The Intermediate Language
	3	.PP
	4	.FS
	5	\(dgUNIX is a Trademark of Bell Laboratories.
	6	.FE
	7	Communication between the two phases of the compiler proper
	8	is carried out by means of a pair of intermediate files.
	9	These files are treated as having identical structure,
	10	although the second file contains only the code generated for strings.
	11	It is convenient to write strings out separately to reduce the
	12	need for multiple location counters in a later assembly
	13	phase.
	14	.PP
	15	The intermediate language is not machine-independent;
	16	its structure in a number of ways reflects
	17	the fact that C was originally a one-pass compiler
	18	chopped in two to reduce the maximum memory
	19	requirement.
	20	In fact, only the latest version
	21	of the compiler has a complete
	22	intermediate language at all.
	23	Until recently, the first phase of the compiler generated
	24	assembly code for those constructions it could deal with,
	25	and passed expression parse trees, in absolute binary
	26	form,
	27	to the second phase for code generation.
	28	Now, at least, all inter-phase information
	29	is passed in a describable form, and there are
	30	no absolute pointers involved, so the coupling between
	31	the phases is not so strong.
	32	.PP
	33	The areas in which the machine
	34	(and system) dependencies are most noticeable are
	35	.IP 1.
	36	Storage allocation for automatic variables and arguments
	37	has already been performed,
	38	and nodes for such variables refer to them by offset
	39	from a display pointer.
	40	Type conversion
	41	(for example, from integer to pointer)
	42	has already occurred using the assumption of
	43	byte addressing and 2-byte words.
	44	.IP 2.
	45	Data representations suitable to the PDP-11 are assumed;
	46	in particular, floating point constants are passed as
	47	four words in the machine representation.
	48	.PP
	49	As it happens, each intermediate file is represented as a sequence
	50	of binary numbers without any explicit demarcations.
	51	It consists of a sequence of
	52	conceptual lines, each headed by an operator, and possibly containing
	53	various operands.
	54	The operators are small numbers;
	55	to assist in recognizing failure in synchronization,
	56	the high-order byte of each operator word is always the
	57	octal number 376.
	58	Operands are
	59	either 16-bit binary numbers or strings of characters representing names.
	60	Each name is terminated by a null character.
	61	There is no alignment requirement for numerical
	62	operands and so there is no padding
	63	after a name string.
	64	.PP
65	The binary representation was chosen to avoid the necessity
66	of converting to and from character form
67	and to minimize the size of the files.
68	It would be very easy to make
69	each operator-operand `line' in the file be
70	a genuine, printable line, with the numbers in octal or decimal;
71	this in fact was the representation originally used.
72	.PP
73	The operators fall naturally into two classes:
74	those which represent part of an expression, and all others.
75	Expressions are transmitted in a reverse-Polish notation;
76	as they are being read, a tree is built which is isomorphic
77	to the tree constructed in the first phase.
78	Expressions are passed as a whole, with no non-expression operators
79	intervening.
80	The reader maintains a stack; each leaf of the expression tree (name, constant)
81	is pushed on the stack;
82	each unary operator replaces the top of the stack by a node whose
83	operand is the old top-of-stack;
84	each binary operator replaces the top pair on the stack with
85	a single entry.
86	When the expression is complete there is exactly one item on the
87	stack.
88	Following each expression
89	is a special operator which passes the unique previous expression
90	to the `optimizer' described below and then to the code
91	generator.
92	.PP
93	Here is the list of operators not themselves part of expressions.
94	.LP
95	.Op EOF
96	marks the end of an input file.
97	.Op BDATA "flag data ..."
98	specifies a sequence of bytes to be assembled
99	as static data.
100	It is followed by pairs of words; the first member
101	of the pair is non-zero to indicate that the data continue;
102	a zero flag is not followed by data and terminates
103	the operator.
104	The data bytes occupy the low-order part of a word.
105	.Op WDATA "flag data ..."
106	specifies a sequence of words to be assembled as
107	static data; it is identical to the BDATA operator
108	except that entire words, not just bytes, are passed.
109	.Op PROG
110	means that subsequent information is to be compiled as program text.
111	.Op DATA
112	means that subsequent information is to be compiled as static data.
113	.Op BSS
114	means that subsequent information is to be compiled as unitialized
115	static data.
116	.Op SYMDEF name
117	means that
118	the symbol
119	.I
120	name
121	.R
122	is an external name defined in the current program.
123	It is produced for each external data or function definition.
124	.Op CSPACE "name size"
125	indicates that the name refers to a data area whose size is the
126	specified number of bytes.
127	It is produced for external data definitions without explicit initialization.
128	.Op SSPACE size
129	indicates that
130	.I
131	size
132	.R
133	bytes should be set aside for data storage.
134	It is used to pad out short initializations of external data
135	and to reserve space for static (internal) data.
136	It will be preceded by an appropriate label.
137	.Op EVEN
138	is produced after each
139	external data definition whose size is not
140	an integral number of words.
141	It is not produced after strings except when they initialize
142	a character array.
143	.Op NLABEL name
144	is produced just before a BDATA or WDATA initializing
145	external data, and serves as a label for the data.
146	.Op RLABEL name
147	is produced just before each function definition,
148	and labels its entry point.
149	.Op SNAME "name number"
150	is produced at the start of each function for each static variable
151	or label
152	declared therein.
153	Subsequent uses of the variable will be in terms of the given number.
154	The code generator uses this only to produce a debugging symbol table.
155	.Op ANAME "name number"
156	Likewise, each automatic variable's name and stack offset
157	is specified by this operator.
158	Arguments count as automatics.
159	.Op RNAME "name number"
160	Each register variable is similarly named, with its register number.
161	.Op SAVE number
162	produces a register-save sequence at the start of each function,
163	just after its label (RLABEL).
164	.Op SETREG number
165	is used to indicate the number of registers used
166	for register variables.
167	It actually gives the register number of the lowest
168	free register; it is redundant because the RNAME operators could be
169	counted instead.
170	.Op PROFIL
171	is produced before the save sequence for functions
172	when the profile option is turned on.
173	It produces code to count the number
174	of times the function is called.
175	.Op SWIT "deflab line label value ..."
176	is produced for switches.
177	When control flows into it,
178	the value being switched on is in the register
179	forced by RFORCE (below).
180	The switch statement occurred on the indicated line
181	of the source, and the label number of the default location
182	is
183	.I
184	deflab.
185	.R
186	Then the operator is followed by a sequence of label-number and value pairs;
187	the list is terminated by a 0 label.
188	.Op LABEL number
189	generates an internal label.
190	It is referred to elsewhere using the given number.
191	.Op BRANCH number
192	indicates an unconditional transfer to the internal label number
193	given.
194	.Op RETRN
195	produces the return sequence for a function.
196	It occurs only once, at the end of each function.
197	.Op EXPR line
198	causes the expression just preceding to be compiled.
199	The argument is the line number in the source where the
200	expression occurred.
201	.Op NAME "class type name"
202	.Op NAME "class type number"
203	indicates a name occurring in an expression.
204	The first form is used when the name is external;
205	the second when the name is automatic, static, or a register.
206	Then the number indicates the stack offset, the label number,
207	or the register number as appropriate.
208	Class and type encoding is described elsewhere.
209	.Op CON "type value"
210	transmits an integer constant.
211	This and the next two operators occur as part of expressions.
212	.Op FCON "type 4-word-value"
213	transmits a floating constant as
214	four words in PDP-11 notation.
215	.Op SFCON "type value"
216	transmits a floating-point constant
217	whose value is correctly represented by its high-order word
218	in PDP-11 notation.
219	.Op NULL
220	indicates a null argument list of a function call in an expression;
221	call is a binary operator whose second operand is the argument list.
222	.Op CBRANCH "label cond"
223	produces a conditional branch.
224	It is an expression operator, and will be followed
225	by an EXPR.
226	The branch to the label number takes place if the expression's
227	truth value is the same as that of
228	.I
229	cond.
230	.R
231	That is, if
232	.I
233	cond=1
234	.R
235	and the expression evaluates to true, the branch is taken.
236	.Op binary-operator type
237	There are binary operators corresponding
238	to each such source-language operator;
239	the type of the result of each is passed as well.
240	Some perhaps-unexpected ones are:
241	COMMA, which is a right-associative operator designed
242	to simplify right-to-left evaluation
243	of function arguments;
244	prefix and postfix ++ and \-\-, whose second operand
245	is the increment amount, as a CON;
246	QUEST and COLON, to express the conditional
247	expression as `a?(b:c)';
248	and a sequence of special operators for expressing
249	relations between pointers, in case pointer comparison
250	is different from integer comparison
251	(e.g. unsigned).
252	.Op unary-operator type
253	There are also numerous unary operators.
254	These include ITOF, FTOI, FTOL, LTOF, ITOL, LTOI
255	which convert among floating, long, and integer;
256	JUMP which branches indirectly through a label expression;
257	INIT, which compiles the value of a constant expression
258	used as an initializer;
259	RFORCE, which is used before a return sequence or
260	a switch to place a value in an agreed-upon register.