.SH
The Intermediate Language
.PP
.FS
\(dgUNIX is a Trademark of Bell Laboratories.
.FE
Communication between the two phases of the compiler proper
is carried out by means of a pair of intermediate files.
These files are treated as having identical structure,
although the second file contains only the code generated for strings.
It is convenient to write strings out separately to reduce the
need for multiple location counters in a later assembly
phase.
.PP
The intermediate language is not machine-independent;
its structure in a number of ways reflects
the fact that C was originally a one-pass compiler
chopped in two to reduce the maximum memory
requirement.
In fact, only the latest version
of the compiler has a complete
intermediate language at all.
Until recently, the first phase of the compiler generated
assembly code for those constructions it could deal with,
and passed expression parse trees, in absolute binary
form,
to the second phase for code generation.
Now, at least, all inter-phase information
is passed in a describable form, and there are
no absolute pointers involved, so the coupling between
the phases is not so strong.
.PP
The areas in which the machine
(and system) dependencies are most noticeable are
.IP 1.
Storage allocation for automatic variables and arguments
has already been performed,
and nodes for such variables refer to them by offset
from a display pointer.
Type conversion
(for example, from integer to pointer)
has already occurred using the assumption of
byte addressing and 2-byte words.
.IP 2.
Data representations suitable to the PDP-11 are assumed;
in particular, floating point constants are passed as
four words in the machine representation.
.PP
As it happens, each intermediate file is represented as a sequence
of binary numbers without any explicit demarcations.
It consists of a sequence of
conceptual lines, each headed by an operator, and possibly containing
various operands.
The operators are small numbers;
to assist in recognizing failure in synchronization,
the high-order byte of each operator word is always the
octal number 376.
Operands are
either 16-bit binary numbers or strings of characters representing names.
Each name is terminated by a null character.
There is no alignment requirement for numerical
operands and so there is no padding
after a name string.
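.PP
The layout just described can be read back with a few small routines.
What follows is only a sketch, not the compiler's own reader:
it assumes the file has been copied into memory, that words are stored
low byte first as on the PDP-11, and the names
.I istream,
.I get_word,
.I get_op
and
.I get_name
are invented for the illustration.

```c
#include <stddef.h>

#define OP_MARK 0376	/* high-order byte of every operator word */

/* Cursor over an in-memory copy of an intermediate file (hypothetical). */
struct istream {
	const unsigned char *buf;
	size_t len;
	size_t pos;
};

/* Fetch one 16-bit operand, low byte first; -1 at end of input. */
static long get_word(struct istream *s)
{
	unsigned lo, hi;

	if (s->pos + 2 > s->len)
		return -1;
	lo = s->buf[s->pos];
	hi = s->buf[s->pos + 1];
	s->pos += 2;
	return (long)(lo | (hi << 8));
}

/*
 * Fetch an operator word and return the operator number.
 * Returns -1 at end of input or when the high byte is not 0376,
 * which means the reader has lost synchronization.
 */
static int get_op(struct istream *s)
{
	long w = get_word(s);

	if (w < 0 || ((w >> 8) & 0377) != OP_MARK)
		return -1;
	return (int)(w & 0377);
}

/*
 * Fetch a null-terminated name and return a pointer to it.
 * Nothing is skipped after the null, since names are not padded.
 * Returns NULL if the input ends before the terminator.
 */
static const char *get_name(struct istream *s)
{
	size_t start = s->pos;

	while (s->pos < s->len && s->buf[s->pos] != '\0')
		s->pos++;
	if (s->pos >= s->len)
		return NULL;
	s->pos++;			/* step over the null */
	return (const char *)s->buf + start;
}
```

Because a name may have odd length, the word that follows it may begin
at an odd offset; the sketch therefore fetches words byte by byte.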
.PP
The binary representation was chosen to avoid the necessity
of converting to and from character form
and to minimize the size of the files.
It would be very easy to make
each operator-operand `line' in the file be
a genuine, printable line, with the numbers in octal or decimal;
this in fact was the representation originally used.
.PP
The operators fall naturally into two classes:
those which represent part of an expression, and all others.
Expressions are transmitted in a reverse-Polish notation;
as they are being read, a tree is built which is isomorphic
to the tree constructed in the first phase.
Expressions are passed as a whole, with no non-expression operators
intervening.
The reader maintains a stack; each leaf of the expression tree (name, constant)
is pushed on the stack;
each unary operator replaces the top of the stack by a node whose
operand is the old top-of-stack;
each binary operator replaces the top pair on the stack with
a single entry.
When the expression is complete there is exactly one item on the
stack.
Following each expression
is a special operator which passes the unique previous expression
to the `optimizer' described below and then to the code
generator.
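.PP
The stack discipline above can be sketched as follows.
This is an illustration under stated assumptions rather than the code of the second phase:
the node layout, the fixed stack size, and the routines
.I feed
and
.I finish
are hypothetical, and ordinary characters stand in for the real operator numbers.

```c
#include <stddef.h>
#include <stdlib.h>

enum kind { LEAF, UNARY, BINARY };

/* Hypothetical tree node; the real compiler's node differs. */
struct node {
	enum kind kind;
	int op;				/* stand-in operator code */
	struct node *left, *right;
};

#define STKSIZ 64
static struct node *stack[STKSIZ];
static int sp;				/* number of entries on the stack */

static struct node *mknode(enum kind kind, int op,
	struct node *l, struct node *r)
{
	struct node *n = malloc(sizeof *n);

	if (n == NULL)
		abort();
	n->kind = kind;
	n->op = op;
	n->left = l;
	n->right = r;
	return n;
}

/*
 * Process one expression operator as it is read.
 * A leaf is pushed; a unary operator replaces the top of the stack;
 * a binary operator replaces the top pair by a single entry.
 * Returns 0 on success, -1 on stack underflow or overflow.
 */
static int feed(enum kind kind, int op)
{
	struct node *l = NULL, *r = NULL;
	int need = kind == LEAF ? 0 : kind == UNARY ? 1 : 2;

	if (sp < need)
		return -1;
	if (kind == LEAF && sp >= STKSIZ)
		return -1;
	if (kind == BINARY)
		r = stack[--sp];
	if (kind != LEAF)
		l = stack[--sp];
	stack[sp++] = mknode(kind, op, l, r);
	return 0;
}

/*
 * Called when the operator that ends an expression arrives:
 * exactly one item must remain, and it is the whole tree.
 * Returns NULL if the stack does not hold exactly one item.
 */
static struct node *finish(void)
{
	struct node *n;

	if (sp != 1) {
		sp = 0;
		return NULL;
	}
	n = stack[0];
	sp = 0;
	return n;
}
```

Feeding the leaves
.I a
and
.I b,
then a unary minus, then a binary plus yields the tree for `a + (\-b)',
with the plus node left alone on the stack.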
.PP
Here is the list of operators not themselves part of expressions.
.LP
.Op EOF
marks the end of an input file.
.Op BDATA "flag data ..."
specifies a sequence of bytes to be assembled
as static data.
It is followed by pairs of words; the first member
of the pair is non-zero to indicate that the data continue;
a zero flag is not followed by data and terminates
the operator.
The data bytes occupy the low-order part of a word.
.Op WDATA "flag data ..."
specifies a sequence of words to be assembled as
static data; it is identical to the BDATA operator
except that entire words, not just bytes, are passed.
.Op PROG
means that subsequent information is to be compiled as program text.
.Op DATA
means that subsequent information is to be compiled as static data.
.Op BSS
means that subsequent information is to be compiled as uninitialized
static data.
.Op SYMDEF name
means that
the symbol
.I
name
.R
is an external name defined in the current program.
It is produced for each external data or function definition.
.Op CSPACE "name size"
indicates that the name refers to a data area whose size is the
specified number of bytes.
It is produced for external data definitions without explicit initialization.
.Op SSPACE size
indicates that
.I
size
.R
bytes should be set aside for data storage.
It is used to pad out short initializations of external data
and to reserve space for static (internal) data.
It will be preceded by an appropriate label.
.Op EVEN
is produced after each
external data definition whose size is not
an integral number of words.
It is not produced after strings except when they initialize
a character array.
.Op NLABEL name
is produced just before a BDATA or WDATA initializing
external data, and serves as a label for the data.
.Op RLABEL name
is produced just before each function definition,
and labels its entry point.
.Op SNAME "name number"
is produced at the start of each function for each static variable
or label
declared therein.
Subsequent uses of the variable will be in terms of the given number.
The code generator uses this only to produce a debugging symbol table.
.Op ANAME "name number"
Likewise, each automatic variable's name and stack offset
are specified by this operator.
Arguments count as automatics.
.Op RNAME "name number"
Each register variable is similarly named, with its register number.
.Op SAVE number
produces a register-save sequence at the start of each function,
just after its label (RLABEL).
.Op SETREG number
is used to indicate the number of registers used
for register variables.
It actually gives the register number of the lowest
free register; it is redundant because the RNAME operators could be
counted instead.
.Op PROFIL
is produced before the save sequence for functions
when the profile option is turned on.
It produces code to count the number
of times the function is called.
.Op SWIT "deflab line label value ..."
is produced for switches.
When control flows into it,
the value being switched on is in the register
forced by RFORCE (below).
The switch statement occurred on the indicated line
of the source, and the label number of the default location
is
.I
deflab.
.R
Then the operator is followed by a sequence of label-number and value pairs;
the list is terminated by a 0 label.
.Op LABEL number
generates an internal label.
It is referred to elsewhere using the given number.
.Op BRANCH number
indicates an unconditional transfer to the internal label number
given.
.Op RETRN
produces the return sequence for a function.
It occurs only once, at the end of each function.
.Op EXPR line
causes the expression just preceding to be compiled.
The argument is the line number in the source where the
expression occurred.
.Op NAME "class type name"
.Op NAME "class type number"
indicates a name occurring in an expression.
The first form is used when the name is external;
the second when the name is automatic, static, or a register.
Then the number indicates the stack offset, the label number,
or the register number as appropriate.
Class and type encoding is described elsewhere.
.Op CON "type value"
transmits an integer constant.
This and the next two operators occur as part of expressions.
.Op FCON "type 4-word-value"
transmits a floating constant as
four words in PDP-11 notation.
.Op SFCON "type value"
transmits a floating-point constant
whose value is correctly represented by its high-order word
in PDP-11 notation.
.Op NULL
indicates a null argument list of a function call in an expression;
call is a binary operator whose second operand is the argument list.
.Op CBRANCH "label cond"
produces a conditional branch.
It is an expression operator, and will be followed
by an EXPR.
The branch to the label number takes place if the expression's
truth value is the same as that of
.I
cond.
.R
That is, if
.I
cond=1
.R
and the expression evaluates to true, the branch is taken.
.Op binary-operator type
There are binary operators corresponding
to each such source-language operator;
the type of the result of each is passed as well.
Some perhaps-unexpected ones are:
COMMA, which is a right-associative operator designed
to simplify right-to-left evaluation
of function arguments;
prefix and postfix ++ and \-\-, whose second operand
is the increment amount, as a CON;
QUEST and COLON, to express the conditional
expression as `a?(b:c)';
and a sequence of special operators for expressing
relations between pointers, in case pointer comparison
is different from integer comparison
(e.g. unsigned).
.Op unary-operator type
There are also numerous unary operators.
These include ITOF, FTOI, FTOL, LTOF, ITOL, LTOI
which convert among floating, long, and integer;
JUMP which branches indirectly through a label expression;
INIT, which compiles the value of a constant expression
used as an initializer;
RFORCE, which is used before a return sequence or
a switch to place a value in an agreed-upon register.
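.PP
As an illustration of how one of the non-expression operators in the list above is consumed,
the flag and data pairs that follow BDATA might be decoded as sketched here.
The routine
.I bdata_decode
and its arguments are hypothetical;
the sketch assumes the operand words have already been read into an array.

```c
#include <stddef.h>

/*
 * Decode the operand words of a BDATA operator.
 * `words' points at the first flag word and `nwords' is the number of
 * words available.  Each non-zero flag is followed by one data word
 * whose low-order byte is the datum; a zero flag ends the list and is
 * not followed by data.
 * Returns the number of bytes stored in `out', or -1 if the list is
 * not terminated by a zero flag or does not fit in `outmax' bytes.
 */
static long bdata_decode(const unsigned short *words, size_t nwords,
	unsigned char *out, size_t outmax)
{
	size_t i = 0, n = 0;

	while (i < nwords && words[i] != 0) {
		if (i + 1 >= nwords || n >= outmax)
			return -1;
		out[n++] = (unsigned char)(words[i + 1] & 0377);
		i += 2;
	}
	if (i >= nwords)		/* ran out before the zero flag */
		return -1;
	return (long)n;
}
```

By the description of WDATA, the same loop would serve for it
if the whole data word were stored instead of its low-order byte.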