.SH
Pass Two
.PP
Code generation is far less well understood than
parsing or lexical analysis, and for this reason
the second pass is far harder to discuss in a file by file manner.
A great deal of the difficulty is in understanding the issues
and the strategies employed to meet them.
Any particular function is likely to be reasonably straightforward.
.PP
Thus, this part of the paper will concentrate a good deal on the broader
aspects of strategy in the code generator,
and will not get too intimate with the details.
.SH
Overview.
.PP
It is difficult to organize a code generator to be flexible enough to
generate code for a large number of machines,
and still be efficient for any one of them.
Flexibility is also important when it comes time to tune
the code generator to improve the output code quality.
On the other hand, too much flexibility can lead to semantically
incorrect code, and potentially a combinatorial explosion in the
number of cases to be considered in the compiler.
.PP
One goal of the code generator is to have a high degree of correctness.
It is very desirable to have the compiler detect its own inability to
generate correct code, rather than to produce incorrect code.
This goal is achieved by having a simple model of the job
to be done (e.g., an expression tree)
and a simple model of the machine state
(e.g., which registers are free).
The act of generating an instruction performs a transformation
on the tree and the machine state;
hopefully, the tree eventually gets
reduced to a single node.
If each of these instruction/transformation pairs is correct,
and if the machine state model really represents the actual machine,
and if the transformations reduce the input tree to the desired single node, then the
output code will be correct.
.PP
For most real machines, there is no definitive theory of code generation that
encompasses all the C operators.
Thus the selection of which instruction/transformations to generate, and in what
order, will have a heuristic flavor.
If, for some expression tree, no transformation applies, or, more
seriously, if the heuristics select a sequence of instruction/transformations
that do not in fact reduce the tree, the compiler
will report its inability to generate code, and abort.
.PP
A major part of the code generator is concerned with the model and the transformations,
\(em most of this is machine independent, or depends only on simple tables.
The flexibility comes from the heuristics that guide the transformations
of the trees, the selection of subgoals, and the ordering of the computation.
.SH
The Machine Model
.PP
The machine is assumed to have a number of registers, of at most two different
types:
.I A
and
.I B .
Within each register class, there may be scratch (temporary) registers and dedicated registers
(e.g., register variables, the stack pointer, etc.).
Requests to allocate and free registers involve only the temporary registers.
.PP
Each of the registers in the machine is given a name and a number
in the
.I mac2defs
file; the numbers are used as indices into various tables
that describe the registers, so they should be kept small.
One such table is the
.I rstatus
table on file
.I local2.c .
This table is indexed by register number, and
contains expressions made up from manifest constants describing the register types:
SAREG for dedicated AREG's, SAREG\(orSTAREG for scratch AREG's,
and SBREG and SBREG\(orSTBREG similarly for BREG's.
There are macros that access this information:
.I isbreg(r)
returns true if register number
.I r
is a BREG, and
.I istreg(r)
returns true if register number
.I r
is a temporary AREG or BREG.
Another table,
.I rnames ,
contains the register names; this is used when
putting out assembler code and diagnostics.
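.PP
As a small illustration, the table and macros for a hypothetical machine
with two dedicated and two scratch A registers and two scratch B registers
might look as sketched below; the register numbers and bit values here are
illustrative only, and the real definitions for any particular machine are
those in
.I mac2defs
and
.I local2.c .
.DS
/* a sketch; register numbers and bit values are hypothetical */
#define SAREG 01      /* an A register */
#define STAREG 02     /* a scratch (temporary) A register */
#define SBREG 04      /* a B register */
#define STBREG 010    /* a scratch B register */

int rstatus[] = {
    SAREG, SAREG,                    /* 0-1: dedicated AREG's */
    SAREG\(orSTAREG, SAREG\(orSTAREG,    /* 2-3: scratch AREG's */
    SBREG\(orSTBREG, SBREG\(orSTBREG,    /* 4-5: scratch BREG's */
};

#define isbreg(r) (rstatus[r] & SBREG)
#define istreg(r) (rstatus[r] & (STAREG\(orSTBREG))
.DE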
.PP
The usage of registers is kept track of by an array called
.I busy .
.I Busy[r]
is the number of uses of register
.I r
in the current tree being processed.
The allocation and freeing of registers will be discussed later as part of the code generation
algorithm.
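.PP
A minimal sketch of the sort of bookkeeping this implies is given below;
the routine names
.I getscratch
and
.I relscratch
are hypothetical, and the real work is done by
.I allo ,
.I freereg ,
and
.I reclaim ,
discussed later.
.DS
/* a sketch, assuming the rstatus table and macros above */
#define NUMREG 6    /* hypothetical number of registers */
int busy[NUMREG];   /* uses of each register in the current tree */

int getscratch() {  /* find a free scratch register */
    int r;
    for (r = 0; r < NUMREG; ++r)
        if (istreg(r) && busy[r] == 0) {
            ++busy[r];
            return r;
        }
    return -1;      /* none free: a subtree must be stored */
}

relscratch(r) {     /* release one use of register r */
    --busy[r];
}
.DE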
.SH
General Organization
.PP
As mentioned above, the second pass reads lines from
the intermediate file, copying through to the output unchanged
any lines that begin with a `)', and making note of the
information about stack usage and register allocation contained on
lines beginning with `]' and `['.
The expression trees, whose beginning is indicated by a line beginning with `.',
are read and rebuilt into trees.
If the compiler is loaded as one pass, the expression trees are
immediately available to the code generator.
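.PP
The input loop, in outline, might look like the sketch below; the routine
names
.I noteline ,
.I rdtree ,
and
.I gencode
are illustrative only, and the real reader also handles the details of the
line formats in the intermediate file.
.DS
/* a sketch of the second pass input loop */
#include <stdio.h>

char line[BUFSIZ];

while (fgets(line, sizeof line, stdin) != NULL) {
    switch (line[0]) {
    case ')':  /* copy through to the output unchanged */
        fputs(line, stdout);
        break;
    case ']':
    case '[':  /* note stack usage and register allocation info */
        noteline(line);
        break;
    case '.':  /* an expression tree follows: rebuild and compile */
        gencode(rdtree());
        break;
    }
}
.DE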
.PP
The actual code generation is done by a hierarchy of routines.
The routine
.I delay
is first given the tree; it attempts to delay some postfix
++ and \-\- computations that might reasonably be done after the
smoke clears.
It also attempts to handle comma (,) operators by computing the
left side expression first, and then rewriting the tree
to eliminate the operator.
.I Delay
calls
.I codgen
to control the actual code generation process.
.I Codgen
takes as arguments a pointer to the expression tree,
and a second argument that, for socio-historical reasons, is called a
.I cookie .
The cookie describes a set of goals that would be acceptable
for the code generation: these are assigned to individual bits,
so they may be logically or'ed together to form a large number of possible goals.
Among the possible goals are FOREFF (compute for side effects only;
don't worry about the value), INTEMP (compute and store value into a temporary location
in memory), INAREG (compute into an A register), INTAREG (compute into a scratch
A register), INBREG and INTBREG similarly, FORCC (compute for condition codes),
and FORARG (compute it as a function argument; e.g., stack it if appropriate).
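.PP
Because the goals are single bits, a caller can offer several at once; a
sketch of the flavor of these definitions follows (the actual bit
assignments in the pass two headers differ).
.DS
/* a sketch; the real values live in the pass two header files */
#define FOREFF 01    /* compute for side effects only */
#define INTEMP 02    /* compute into a temporary memory location */
#define INAREG 04    /* compute into an A register */
#define INTAREG 010  /* compute into a scratch A register */
#define INBREG 020   /* compute into a B register */
#define INTBREG 040  /* compute into a scratch B register */
#define FORCC 0100   /* compute for condition codes */
#define FORARG 0200  /* compute as a function argument */

/* "any scratch register, or just the side effects, will do": */
codgen(p, FOREFF\(orINTAREG\(orINTBREG);
.DE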
.PP
.I Codgen
first canonicalizes the tree by calling
.I canon .
This routine looks for certain transformations that might now be
applicable to the tree.
One, which is very common and very powerful, is to
fold together an indirection operator (UNARY MUL)
and a register (REG); in most machines, this combination is
addressable directly, and so is similar to a NAME in its
behavior.
The UNARY MUL and REG are folded together to make
another node type called OREG.
In fact, in many machines it is possible to directly address not just the
cell pointed to by a register, but also cells differing by a constant
offset from the cell pointed to by the register.
.I Canon
also looks for such cases, calling the
machine dependent routine
.I notoff
to decide if the offset is acceptable (for example, in the IBM 370 the offset
must be between 0 and 4095 bytes).
Another optimization is to replace bit field operations
by shifts and masks if the operation involves extracting the field.
Finally, a machine dependent routine,
.I sucomp ,
is called that computes the Sethi-Ullman numbers
for the tree (see below).
.PP
After the tree is canonicalized,
.I codgen
calls the routine
.I store
whose job is to select a subtree of the tree to be computed
and (usually) stored before beginning the
computation of the full tree.
.I Store
must return a tree that can be computed without need
for any temporary storage locations.
In effect, the only store operations generated while processing the subtree must be as a response
to explicit assignment operators in the tree.
This division of the job marks one of the more significant, and successful,
departures from most other compilers.
It means that the code generator can operate
under the assumption that there are enough
registers to do its job, without worrying about
temporary storage.
If a store into a temporary appears in the output, it is always
as a direct result of logic in the
.I store
routine; this makes debugging easier.
.PP
One consequence of this organization is that
code is not generated by a treewalk.
There are theoretical results that support this decision.
.[
Aho Johnson Optimal Code Trees jacm
.]
It may be desirable to compute several subtrees and store
them before tackling the whole tree;
if a subtree is to be stored, this is known before the code
generation for the subtree is begun, and the subtree is computed
when all scratch registers are available.
.PP
The
.I store
routine decides what subtrees, if any, should be
stored by making use of numbers,
called
.I "Sethi-Ullman numbers" ,
that give, for each subtree of an expression tree,
the minimum number of scratch registers required to
compile the subtree, without any stores into temporaries.
.[
Sethi Ullman jacm 1970
.]
These numbers are computed by the machine-dependent
routine
.I sucomp ,
called by
.I canon .
The basic notion is that, knowing the Sethi-Ullman numbers for
the descendants of a node, and knowing the operator of the
node and some information about the machine, the
Sethi-Ullman number of the node itself can be computed.
If the Sethi-Ullman number for a tree exceeds the number of scratch
registers available, some subtree must be stored.
Unfortunately, the theory behind the Sethi-Ullman numbers
applies only to uselessly simple machines and operators.
For the rich set of C operators, and for machines with
asymmetric registers, register pairs, different kinds of registers,
and exceptional forms of addressing,
the theory cannot be applied directly.
The basic idea of estimation is a good one, however,
and well worth applying; the application, especially
when the compiler comes to be tuned for high code
quality, goes beyond the park of theory into the
swamp of heuristics.
This topic will be taken up again later, when more of the compiler
structure has been described.
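.PP
The classical recurrence, for a machine with interchangeable registers and
leaves that must be loaded, is easy to state; a sketch is given below,
with hypothetical node field names.
The real
.I sucomp
must bend this rule for register pairs, types, condition codes, and
operands that are directly addressable.
.DS
/* a sketch of the classical Sethi-Ullman recurrence */
typedef struct node {
    struct node *left, *right;  /* 0 if absent */
    int su;                     /* Sethi-Ullman number of subtree */
} NODE;

sucomp(p)
NODE *p;
{
    int l, r;

    if (p->left == 0) {  /* a leaf needs one register to load */
        p->su = 1;
        return;
    }
    sucomp(p->left);
    l = p->left->su;
    if (p->right == 0) {  /* unary operator: no extra register */
        p->su = l;
        return;
    }
    sucomp(p->right);
    r = p->right->su;
    /* binary: equal halves need one extra; otherwise the max */
    p->su = (l == r) ? l + 1 : (l > r ? l : r);
}
.DE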
.PP
After examining the Sethi-Ullman numbers,
.I store
selects a subtree, if any, to be stored, and returns the subtree and the associated cookie in
the external variables
.I stotree
and
.I stocook .
If a subtree has been selected, or if
the whole tree is ready to be processed, the
routine
.I order
is called, with a tree and cookie.
.I Order
generates code for trees that
do not require temporary locations.
.I Order
may make recursive calls on itself, and,
in some cases, on
.I codgen ;
for example, when processing the operators &&, \(or\(or, and comma (`,'), that have
a left to right evaluation, it is
incorrect for
.I store
to examine the right operand for subtrees to be stored.
In these cases,
.I order
will call
.I codgen
recursively when it is permissible to work on the right operand.
A similar issue arises with the ? : operator.
.PP
The
.I order
routine works by matching the current tree with
a set of code templates.
If a template is discovered that will
match the current tree and cookie, the associated assembly language
statement or statements are generated.
The tree is then rewritten,
as specified by the template, to represent the effect of the output instruction(s).
If no template match is found, first an attempt is made to find a match with a
different cookie; for example, in order to compute
an expression with cookie INTEMP (store into a temporary storage location),
it is usually necessary to compute the expression into a scratch register
first.
If all attempts to match the tree fail, the heuristic part of
the algorithm becomes dominant.
Control is typically given to one of a number of machine-dependent routines
that may in turn recursively call
.I order
to achieve a subgoal of the computation (for example, one of the
arguments may be computed into a temporary register).
After this subgoal has been achieved, the process begins again with the
modified tree.
If the machine-dependent heuristics are unable to reduce the tree further,
a number of default rewriting rules may be considered appropriate.
For example, if the left operand of a + is a scratch
register, the + can be replaced by a += operator;
the tree may then match a template.
.PP
To close this introduction, we will discuss the steps in compiling
code for the expression
.DS
\fIa\fR += \fIb\fR
.DE
where
.I a
and
.I b
are static variables.
.PP
To begin with, the whole expression tree is examined with cookie FOREFF, and
no match is found. Search with other cookies is equally fruitless, so an
attempt at rewriting is made.
Suppose we are dealing with the Interdata 8/32 for the moment.
It is recognized that the left hand and right hand sides of the += operator
are addressable, and in particular the left hand side has no
side effects, so it is permissible to rewrite this as
.DS
\fIa\fR = \fIa\fR + \fIb\fR
.DE
and this is done.
No match is found on this tree either, so a machine dependent rewrite is done; it is recognized
that the left hand side of the assignment is addressable, but the right hand side is not
in a register, so
.I order
is called recursively, being asked to put the right
hand side of the assignment into a register.
This invocation of
.I order
searches the tree for a match, and fails.
The machine dependent rule for +
notices that the right hand operand is addressable;
it decides to put the left operand into a scratch register.
Another recursive call to
.I order
is made, with the tree
consisting solely of the leaf
.I a ,
and the cookie asking that the value be placed into a scratch register.
This now matches a template, and a load instruction is emitted.
The node consisting of
.I a
is rewritten in place to represent the register into which
.I a
is loaded,
and this third call to
.I order
returns.
The second call to
.I order
now finds that it has the tree
.DS
\fBreg\fR + \fIb\fR
.DE
to consider.
Once again, there is no match, but the default rewriting rule rewrites
the + as a += operator, since the left operand is a scratch register.
When this is done, there is a match: in fact,
.DS
\fBreg\fR += \fIb\fR
.DE
simply describes the effect of the add instruction
on a typical machine.
After the add is emitted, the tree is rewritten
to consist merely of the register node, since the result of the add
is now in the register.
This agrees with the cookie passed to the second invocation of
.I order ,
so this invocation
terminates, returning to the first level.
The original tree has now
become
.DS
\fIa\fR = \fBreg\fR
.DE
which matches a template for the store instruction.
The store is output, and the tree rewritten to become
just a single register node.
At this point, since the top level call to
.I order
was
interested only in side effects, the call to
.I order
returns, and the code generation is completed;
we have generated a load, add, and store, as might have been expected.
.PP
The effect of machine architecture on this is considerable.
For example, on the Honeywell 6000, the machine dependent heuristics recognize that there is an ``add to storage''
instruction, so the strategy is quite different;
.I b
is loaded into
a register, and then an add to storage instruction is generated
to add this register into
.I a .
The transformations, involving as they do the semantics of C,
are largely machine independent.
The decisions as to when to use them, however, are
almost totally machine dependent.
.PP
Having given a broad outline of
the code generation process, we shall next consider the
heart of it: the templates.
This leads naturally into discussions of template matching and register allocation,
and finally a discussion of the machine dependent interfaces and strategies.
.SH
The Templates
.PP
The templates describe the effect of the target machine instructions
on the model of computation around which the compiler is organized.
In effect, each template has five logical sections, and represents an assertion
of the form:
.IP
.B If
we have a subtree of a given shape (1), and we have a goal (cookie) or goals to
achieve (2), and we have sufficient free resources (3),
.B then
we may emit an instruction or instructions (4), and
rewrite the subtree in a particular manner (5),
and the rewritten tree will achieve the desired goals.
.PP
These five sections will be discussed in more
detail later. First, we give an example of a
template:
.DS
.ta 1i 2i 3i 4i 5i
ASG PLUS, INAREG,
	SAREG,	TINT,
	SNAME,	TINT,
	0,	RLEFT,
	"	add	AL,AR\en",
.DE
The top line specifies the operator (+=) and the cookie (compute the
value of the subtree into an AREG).
The second and third lines specify the left and right descendants,
respectively,
of the += operator.
The left descendant must be a REG node, representing an
A register, and have integer type, while the right side must be a NAME node,
and also have integer type.
The fourth line contains the resource requirements (no scratch registers
or temporaries needed), and the rewriting rule (replace the subtree by the left descendant).
Finally, the quoted string on the last line represents the output to the assembler:
lower case letters, tabs, spaces, etc. are copied
.I verbatim
to the output; upper case letters trigger various macro-like expansions.
Thus,
.B AL
would expand into the \fBA\fRddress form of the \fBL\fReft operand \(em
presumably the register number.
Similarly,
.B AR
would expand into the name of the right operand.
The
.I add
instruction of the last section might well be
emitted by this template.
.PP
In principle, it would be possible to make separate templates
for all legal combinations of operators, cookies, types, and shapes.
In practice, the number of combinations is very large.
Thus, a considerable amount of mechanism is present to
permit a large number of subtrees to be matched
by a single template.
Most of the shape and type specifiers are individual bits, and can
be logically
or'ed
together.
There are a number of special descriptors for matching classes of
operators.
The cookies can also be combined.
As an example of the kind of template
that really arises in practice, the
actual template for the Interdata 8/32
that subsumes the above example is:
.DS
.ta 1i 2i 3i 4i 5i
ASG OPSIMP, INAREG\(orFORCC,
	SAREG,	TINT\(orTUNSIGNED\(orTPOINT,
	SAREG\(orSNAME\(orSOREG\(orSCON,	TINT\(orTUNSIGNED\(orTPOINT,
	0,	RLEFT\(orRESCC,
	"	OI	AL,AR\en",
.DE
Here, OPSIMP represents the operators
+, \-, \(or, &, and ^.
The
.B OI
macro in the output string expands into the
appropriate \fBI\fRnteger \fBO\fRpcode for the operator.
The left and right sides can be integers, unsigned, or pointer types.
The right side can be, in addition to a name, a register,
a memory location whose address is given by a register and displacement (OREG),
or a constant.
Finally, these instructions set the condition codes,
and so can be used in condition contexts:
the cookie and rewriting rules reflect this.
.SH
The Template Matching Algorithm.
.PP
The heart of the second pass is the template matching
algorithm, in the routine
.I match .
.I Match
is called with a tree and a cookie; it attempts to match
the given tree against some template that will transform it
according to one of the goals given in the cookie.
If a match is successful, the transformation is
applied;
.I expand
is called to generate the assembly code, and then
.I reclaim
rewrites the tree, and reclaims the resources, such
as registers, that might have become free as a result
of the generated code.
.PP
This part of the compiler is among the most time critical.
There is a spectrum of implementation techniques available
for doing this matching.
The most naive algorithm simply looks at the templates one by one.
This can be considerably improved upon by restricting the search
for an acceptable template.
It would be possible to do better than this if the templates were given
to a separate program that ate them and generated a template
matching subroutine.
This would make maintenance of the compiler much more
complicated, however, so this has not been done.
.PP
The matching algorithm is actually carried out by restricting
the range in the table that must be searched for each opcode.
This introduces a number of complications, however, and needs a
bit of sympathetic help by the person constructing the
compiler in order to obtain best results.
The exact tuning of this algorithm continues; it
is best to consult the code and comments in
.I match
for the latest version.
.PP
In order to match a template to a tree,
it is necessary to match not only the cookie and the
op of the root, but also the types and shapes of the
left and right descendants (if any) of the tree.
A convention is established here that is carried out throughout
the second pass of the compiler.
If a node represents a unary operator, the single descendant
is always the ``left'' descendant.
If a node represents a unary operator or a leaf node (no descendants),
the ``right'' descendant is taken by convention to be the node itself.
This enables templates to easily match leaves and conversion operators, for example,
without any additional mechanism in the matching program.
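.PP
A sketch of the convention, with hypothetical field names, shows how little
mechanism it requires:
.DS
/* a sketch; the real access is by macros on the NODE type */
NODE *
getright(p)
NODE *p;
{
    /* for unary operators and leaves the ``right'' side is the
     * node itself, so one shape test serves for all node kinds */
    return p->right ? p->right : p;
}
.DE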
.PP
The type matching is straightforward; it is possible to specify any combination
of basic types, general pointers, and pointers to one or more of
the basic types.
The shape matching is somewhat more complicated, but still pretty simple.
Templates have a collection of possible operand shapes
on which the opcode might match.
In the simplest case, an
.I add
operation might be able to add to either a register variable
or a scratch register, and might be able (with appropriate
help from the assembler) to add an integer constant (ICON), a static
memory cell (NAME), or a stack location (OREG).
.PP
It is usually attractive to specify a number of such shapes,
and distinguish between them when the assembler output is produced.
It is possible to describe the union of many elementary
shapes such as ICON, NAME, OREG,
AREG or BREG
(both scratch and register forms), etc.
To handle at least the simple forms of indirection, one can also
match some more complicated forms of trees; STARNM and STARREG
can match more complicated trees headed by an indirection operator,
and SFLD can match certain trees headed by a FLD operator: these
patterns call machine dependent routines that match the
patterns of interest on a given machine.
The shape SWADD may be used to recognize NAME or OREG
nodes that lie on word boundaries: this may be of some importance
on word\-addressed machines.
Finally, there are some special shapes: these may not
be used in conjunction with the other shapes, but may be
defined and extended in machine dependent ways.
The special shapes SZERO, SONE, and SMONE are predefined and match
constants 0, 1, and \-1, respectively; others are easy to add
and match by using the machine dependent routine
.I special .
.PP
When a template has been found that matches the root of the tree,
the cookie, and the shapes and types of the descendants,
there is still one bar to a total match: the template may
call for some resources (for example, a scratch register).
The routine
.I allo
is called, and it attempts to allocate the resources.
If it cannot, the match fails; no resources are
allocated.
If successful, the allocated resources are given numbers
1, 2, etc. for later reference when the assembly code is
generated.
The routines
.I expand
and
.I reclaim
are then called.
The
.I match
routine then returns a special value, MDONE.
If no match was found, the value MNOPE is returned;
this is a signal to the caller to try more cookie
values, or attempt a rewriting rule.
.I Match
is also used to select rewriting rules, although
the way of doing this is pretty straightforward.
A special cookie, FORREW, is used to ask
.I match
to search for a rewriting rule.
The rewriting rules are keyed to various opcodes; most
are carried out in
.I order .
Since the question of when to rewrite is one of the key issues in
code generation, it will be taken up again later.
.SH
Register Allocation.
.PP
The register allocation routines, and the allocation strategy,
play a central role in the correctness of the code generation algorithm.
If there are bugs in the Sethi-Ullman computation that cause the
number of needed registers to be underestimated,
the compiler may run out of scratch registers;
it is essential that the allocator keep track of those registers that
are free and busy, in order to detect such conditions.
.PP
Allocation of registers takes place as the result of a template
match; the routine
.I allo
is called with a word describing the number of A registers,
B registers, and temporary locations needed.
The allocation of temporary locations on the stack is relatively
straightforward, and will not be further covered; the
bookkeeping is a bit tricky, but conceptually trivial, and requests
for temporary space on the stack will never fail.
.PP
Register allocation is less straightforward.
The two major complications are
.I pairing
and
.I sharing .
In many machines, some operations (such as multiplication
and division), and/or some types (such as longs or double precision)
require even/odd pairs of registers.
Operations of the first type are exceptionally difficult to
deal with in the compiler; in fact, their theoretical
properties are rather bad as well.
.[
Aho Johnson Ullman Multiregister
.]
The second issue is dealt with rather more successfully;
a machine dependent function called
.I szty(t)
is called that returns 1 or 2, depending on the
number of A registers required to hold an object of type
.I t .
If
.I szty
returns 2, an even/odd pair of A registers is allocated
for each request.
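.PP
On a machine where longs and doubles occupy register pairs, the routine can
be a one-liner; the sketch below assumes the usual manifest type constants
from the pass two headers, and is illustrative rather than the code for any
particular machine.
.DS
/* a sketch of szty for a machine with even/odd pairs for
 * longs and doubles */
szty(t)
{
    return (t == LONG \(or\(or t == DOUBLE) ? 2 : 1;
}
.DE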
.PP
The other issue, sharing, is more subtle, but
important for good code quality.
When registers are allocated, it
is possible to reuse registers that hold address
information, and use them to contain the values
computed or accessed.
For example, on the IBM 360, if register 2 has
a pointer to an integer in it, we may load the
integer into register 2 itself by saying:
.DS
L	2,0(2)
.DE
If register 2 had a byte pointer, however, the sequence for
loading a character involves clearing the target
register first, and then inserting the desired character:
.DS
SR	3,3
IC	3,0(2)
.DE
In the first case, if register 3 were used as the target,
it would lead to a larger number of registers
used for the expression than were required; the compiler would
generate inefficient code.
On the other hand, if register 2 were used as the target in the second
case, the code would simply be wrong.
In the first case, register 2 can be
.I shared
while in the second, it cannot.
.PP
In the specification of the register needs in the templates,
it is possible to indicate whether required scratch registers
may be shared with possible registers on the left or the right of the input tree.
In order that a register be shared, it must be scratch, and it must
be used only once, on the appropriate side of the tree being compiled.
.PP
The
.I allo
routine thus has a bit more to do than meets the eye;
it calls
.I freereg
to obtain a free register for each A and B register request.
.I Freereg
makes multiple calls on the routine
.I usable
to decide if a given register can be used to satisfy
a given need.
.I Usable
calls
.I shareit
if the register is busy, but might be shared.
Finally,
.I shareit
calls
.I ushare
to decide if the desired register is actually in the appropriate
subtree, and can be shared.
.PP
Just to add additional complexity, on some machines (such as the IBM 370) it
is possible to have ``double indexing'' forms of
addressing; these are represented by OREG's
with the base and index registers encoded into the register field.
While the register allocation and deallocation
.I "per se"
is not made more difficult by this phenomenon, the code itself
is somewhat more complex.
.PP
Having allocated the registers and expanded the assembly language,
it is time to reclaim the resources; the routine
.I reclaim
does this.
Many operations produce more than one result.
For example, many arithmetic operations may produce
a value in a register, and also set the condition
codes.
Assignment operations may leave results both in a register and in memory.
.I Reclaim
is passed three parameters; the tree and cookie
that were matched, and the rewriting field of the template.
The rewriting field allows the specification of possible results;
the tree is rewritten to reflect the results of the operation.
If the tree was computed for side effects only (FOREFF),
the tree is freed, and all resources in it reclaimed.
If the tree was computed for condition codes, the resources
are also freed, and the tree replaced by a special
node type, FORCC.
Otherwise, the value may be found in the left
argument of the root, the right argument of the root,
or one of the temporary resources allocated.
In these cases, first the resources of the tree,
and the newly allocated resources,
are
freed; then the resources needed by the result
are made busy again.
The final result must always match the shape of the input cookie;
otherwise, the compiler error
``cannot reclaim''
is generated.
There are some machine dependent ways of
preferring results in registers or memory when
there are multiple results matching multiple goals in the cookie.
.SH
The Machine Dependent Interface
.PP
The files
.I order.c ,
.I local2.c ,
and
.I table.c ,
as well as the header file
.I mac2defs ,
represent the machine dependent portion of the second pass.
The machine dependent portion can be roughly divided into
two: the easy portion and the hard portion.
The easy portion
tells the compiler the names of the registers, and arranges that
the compiler generate the proper assembler formats, opcode names, location counters, etc.
The hard portion involves the Sethi\-Ullman computation, the
rewriting rules, and, to some extent, the templates.
It is hard because there are no real algorithms that apply;
most of this portion is based on heuristics.
This section discusses the easy portion; the next several
sections will discuss the hard portion.
.PP
If the compiler is adapted from a compiler for a machine
of similar architecture, the easy part is indeed easy.
In
.I mac2defs ,
the register numbers are defined, as well as various parameters for
the stack frame, and various macros that describe the machine architecture.
If double indexing is to be permitted, for example, the symbol
R2REGS is defined.
Also, a number of macros that are involved in function call processing,
especially for unusual function call mechanisms, are defined here.
.PP
In
.I local2.c ,
a large number of simple functions are defined.
These do things such as write out opcodes, register names,
and address forms for the assembler.
Part of the function call code is defined here; that is nontrivial
to design, but typically rather straightforward to implement.
Among the easy routines in
.I order.c
are routines for generating a created label,
defining a label, and generating the arguments
of a function call.
These routines tend to have a local effect, and depend in a fairly straightforward way
on the target assembler and the design decisions already made about
the compiler.
Thus they will not be further treated here.
.SH
The Rewriting Rules
.PP
When a tree fails to match any template, it becomes
a candidate for rewriting.
Before the tree is rewritten,
the machine dependent routine
.I nextcook
is called with the tree and the cookie; it suggests
another cookie that might be a better candidate for the
matching of the tree.
If all else fails, the templates are searched with the cookie
FORREW, to look for a rewriting rule.
The rewriting rules are of two kinds;
for most of the common operators, there are
machine dependent rewriting rules that may be applied;
these are handled by machine dependent functions
that are called and given the tree to be computed.
These routines may recursively call
.I order
or
.I codgen
to cause certain subgoals to be achieved;
if they actually call for some alteration of the tree,
they return 1, and the
code generation algorithm recanonicalizes and tries again.
If these routines choose not to deal with the tree, the
default rewriting rules are applied.
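.PP
The control flow just described can be sketched as follows; the routine
name
.I trymatch
is hypothetical, and the real loop in
.I order
also guards against runaway rewriting with the
.I nrecur
counter described later.
.DS
/* an illustrative sketch of the matching/rewriting cycle */
trymatch(p, cookie)
NODE *p;
{
    for (;;) {
        if (match(p, cookie) == MDONE)
            return;  /* a template handled the tree */
        /* ask the machine dependent code for a better goal */
        cookie = nextcook(p, cookie);
        if (match(p, cookie) == MDONE)
            return;
        /* last resort: search for a rewriting rule */
        if (match(p, FORREW) == MNOPE)
            cerror("no match for op");
        canon(p);  /* the tree changed: recanonicalize, retry */
    }
}
.DE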
.PP
The assignment ops, when rewritten, call the routine
.I setasg .
This is assumed to rewrite the tree at least to the point where there are
no side effects in the left hand side.
If there is still no template match,
a default rewriting is done that causes
an expression such as
.DS
.I "a += b"
.DE
to be rewritten as
.DS
.I "a = a + b"
.DE
This is a useful default for certain mixtures of strange types
(for example, when
.I a
is a bit field and
.I b
a character) that
otherwise might need separate table entries.
.PP
Simple assignment, structure assignment, and all forms of calls
are handled completely by the machine dependent routines.
For historical reasons, the routines generating the calls return
1 on failure, 0 on success, unlike the other routines.
.PP
The machine dependent routine
.I setbin
handles binary operators; it too must do most of the job.
In particular, when it returns 0, it must do so with
the left hand side in a temporary register.
The default rewriting rule in this case is to convert the
binary operator into the associated assignment operator;
since the left hand side is assumed to be a temporary register,
this preserves the semantics and often allows a considerable
saving in the template table.
.PP
The increment and decrement operators may be dealt with by
the machine dependent routine
.I setincr .
If this routine chooses not to deal with the tree, the rewriting rule replaces
.DS
.I "x ++"
.DE
by
.DS
.I "( (x += 1) \- 1)"
.DE
which preserves the semantics.
Once again, this is not too attractive for the most common
cases, but can generate close to optimal code when the
type of x is unusual.
.PP
Finally, the indirection (UNARY MUL) operator is also handled
in a special way.
The machine dependent routine
.I offstar
is extremely important for the efficient generation of code.
.I Offstar
is called with a tree that is the direct descendant of a UNARY MUL node;
its job is to transform this tree so that the combination of
UNARY MUL with the transformed tree becomes addressable.
On most machines,
.I offstar
can simply compute the tree into an A or B register,
depending on the architecture, and then
.I canon
will make the resulting tree into an OREG.
On many machines,
.I offstar
can profitably choose to do less work than computing
its entire argument into a register.
For example, if the target machine supports OREGS
with a constant offset from a register, and
.I offstar
is called
with a tree of the form
.DS
.I "expr + const"
.DE
where
.I const
is a constant, then
.I offstar
need only compute
.I expr
into the appropriate form of register.
On machines that support double indexing,
.I offstar
may have even more choice as to how to proceed.
The proper tuning of
.I offstar ,
which is not typically too difficult, should be one of the
first tries at optimization attempted by the
compiler writer.
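.PP
An
.I offstar
for such a machine might look like the sketch below; the node field names
and the form of the
.I notoff
call are schematic rather than the real ones.
.DS
/* a sketch of offstar for a machine with register+constant
 * addressing; p is the direct descendant of a UNARY MUL */
offstar(p)
NODE *p;
{
    if (p->op == PLUS && p->right->op == ICON &&
        !notoff(p->type, p->right->lval)) {
        /* *(expr + const): only expr need reach a register;
         * canon will then fold the result into an OREG */
        order(p->left, INTAREG);
        return;
    }
    /* otherwise compute the whole address into a scratch register */
    order(p, INTAREG);
}
.DE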
.SH
The Sethi-Ullman Computation
.PP
The heart of the heuristics is the computation of the Sethi-Ullman numbers.
This computation is closely linked with the rewriting rules and the
templates.
As mentioned before, the Sethi-Ullman numbers are expected to
estimate the number of scratch registers needed to compute
the subtrees without using any stores.
However, the original theory does not apply to real machines.
For one thing, the theory assumes that all registers
are interchangeable.
Real machines have general purpose, floating point, and index registers,
register pairs, etc.
The theory also does not account for side effects;
this rules out various forms of pathology that arise
from assignment and assignment ops.
Condition codes are also undreamed of.
Finally, the influence of types, conversions, and the
various addressability restrictions and extensions of real
machines are also ignored.
.PP
Nevertheless, for a ``useless'' theory,
the basic insight of Sethi and Ullman is amazingly
useful in a real compiler.
The notion that one should attempt to estimate the
resource needs of trees before starting the
code generation provides a natural means of splitting the
code generation problem, and provides a bit of redundancy
and self checking in the compiler.
Moreover, if writing the
Sethi-Ullman routines is hard, describing, writing, and debugging the
alternative (routines that attempt to free up registers by stores into
temporaries ``on the fly'') is even worse.
Nevertheless, it should be clearly understood that these routines exist in a realm
where there is no ``right'' way to write them;
it is an art, the realm of heuristics, and, consequently, a major
source of bugs in the compiler.
Often, the early, crude versions of these routines give little trouble;
only after
the compiler is actually working
and the
code quality is being improved do serious problems have to be faced.
Having a simple, regular machine architecture is worth quite
a lot at this time.
.PP
The major problems arise from asymmetries in the registers: register pairs,
having different kinds of registers, and the related problem of
needing more than one register (frequently a pair) to store certain
data
types (such as longs or doubles).
There appears to be no general way of treating this problem;
solutions have to be fudged for each machine where the problem arises.
On the Honeywell 66, for example, there are only two general purpose registers,
so a need for a pair is the same as the need for two registers.
On the IBM 370, the register pair (0,1) is used to do multiplications and divisions;
registers 0 and 1 are not generally considered part of the scratch registers, and so
do not require allocation explicitly.
On the Interdata 8/32, after much consideration, the
decision was made not to try to deal with the register pair issue;
operations such as multiplication and division that required pairs
were simply assumed to take all of the scratch registers.
Several weeks of effort had failed to produce
an algorithm that seemed to have much chance of running successfully
without inordinate debugging effort.
The difficulty of this issue should not be minimized; it represents one of the
main intellectual efforts in porting the compiler.
Nevertheless, this problem has been fudged with a degree of
success on nearly a dozen machines, so the compiler writer should
not abandon hope.
.PP
The Sethi-Ullman computations interact with the
rest of the compiler in a number of rather subtle ways.
As already discussed, the
.I store
routine uses the Sethi-Ullman numbers to decide which subtrees are too difficult
to compute in registers, and must be stored.
There are also subtle interactions between the
rewriting routines and the Sethi-Ullman numbers.
Suppose we have a tree such as
.DS
.I "A \- B"
.DE
where
.I A
and
.I B
are expressions; suppose further that
.I B
takes two registers, and
.I A
one.
It is possible to compute the full expression in two registers by
first computing
.I B ,
and then, using the scratch register
used by
.I B ,
but not containing the answer, compute
.I A .
The subtraction can then be done, computing the expression.
(Note that this assumes a number of things, not the least of which
are register-to-register subtraction operators and symmetric
registers.)
If the machine dependent routine
.I setbin ,
however, is not prepared to recognize this case
and compute the more difficult side of the expression
first, the
Sethi-Ullman number must be set to three.
Thus, the
Sethi-Ullman number for a tree should represent the code that
the machine dependent routines are actually willing to generate.
.PP
The interaction can go the other way.
If we take an expression such as
.DS
.I "* ( p + i )"
.DE
where
.I p
is a pointer and
.I i
an integer,
this can probably be done in one register on most machines.
Thus, its Sethi-Ullman number would probably be set to one.
If double indexing is possible in the machine, a possible way
of computing the expression is to load both
.I p
and
.I i
into registers, and then use double indexing.
This would use two scratch registers; in such a case,
it is possible that the scratch registers might be unobtainable,
or might make some other part of the computation run out of
registers.
The usual solution is to cause
.I offstar
to ignore opportunities for double indexing that would tie up more scratch
registers than the Sethi-Ullman number had reserved.
.PP
In summary, the Sethi-Ullman computation represents much of the craftsmanship and artistry in any application
of the portable compiler.
It is also a frequent source of bugs.
Algorithms are available that will produce nearly optimal code
for specialized machines, but unfortunately most existing machines
are far removed from these ideals.
The best way of proceeding in practice is to start with a compiler
for a similar machine to the target, and proceed very
carefully.
.SH
Register Allocation
.PP
After the Sethi-Ullman numbers are computed,
.I order
calls a routine,
.I rallo ,
that does register allocation, if appropriate.
This routine does relatively little, in general;
this is especially true if the target machine
is fairly regular.
There are a few cases where it is assumed that
the result of a computation takes place in a particular register;
switch and function return are the two major places.
The expression tree has a field,
.I rall ,
that may be filled with a register number; this is taken
to be a preferred register, and the first temporary
register allocated by a template match will be this preferred one, if
it is free.
If not, no particular action is taken; this is just a heuristic.
If no register preference is present, the field contains NOPREF.
In some cases, the result must be placed in
a given register, no matter what.
The register number is placed in
.I rall ,
and the mask MUSTDO is logically or'ed in with it.
In this case, if the subtree is requested in a register, and comes
back in a register other than the demanded one, it is moved
by calling the routine
.I rmove .
If the target register for this move is busy, it is a compiler
error.
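.PP
A sketch of the check just described, using the
.I rall
conventions from the text but a hypothetical routine name and illustrative
argument conventions, is:
.DS
/* a sketch; chkmust is a hypothetical name */
chkmust(p, reg)
NODE *p;  /* subtree just computed */
{         /* reg: the register the value actually came back in */
    int want;

    if (p->rall == NOPREF)
        return;  /* no preference recorded */
    want = p->rall & ~MUSTDO;
    if ((p->rall & MUSTDO) && reg != want) {
        if (busy[want])
            cerror("mustdo register busy");
        rmove(reg, want, p->type);  /* register-to-register move */
    }
}
.DE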
.PP
Note that this mechanism is the only one that will ever cause a register-to-register
move between scratch registers (unless such a move is buried in the depths of
some template).
This simplifies debugging.
In some cases, there is a rather strange interaction between
the register allocation and the Sethi-Ullman number;
if there is an operator or situation requiring a
particular register, the allocator and the Sethi-Ullman
computation must conspire to ensure that the target
register is not being used by some intermediate result of some far-removed computation.
This is most easily done by making the special operation take
all of the free registers, preventing any other partially-computed
results from cluttering up the works.
.SH
Compiler Bugs
.PP
The portable compiler has an excellent record of generating correct code.
The requirement for reasonable cooperation between the register allocation,
Sethi-Ullman computation, rewriting rules, and templates builds quite a bit
of redundancy into the compiling process.
The effect of this is that, in a surprisingly short time, the compiler will
start generating correct code for those
programs that it can compile.
The hard part of the job then becomes finding and
eliminating those situations where the compiler refuses to
compile a program because it knows it cannot do it right.
For example, a template may simply be missing; this may either
give a compiler error of the form ``no match for op ...'', or cause
the compiler to go into an infinite loop applying various rewriting rules.
The compiler has a variable,
.I nrecur ,
that is set to 0 at the beginning of an expression, and
incremented at key spots in the
compilation process; if this parameter gets too large, the
compiler decides that it is in a loop, and aborts.
Loops are also characteristic of botches in the machine-dependent rewriting rules.
Bad Sethi-Ullman computations usually cause the scratch registers
to run out; this often means
that the Sethi-Ullman number was underestimated, so
.I store
did not store something it should have; alternatively,
it can mean that the rewriting rules were not smart enough to
find the sequence that
.I sucomp
assumed would be used.
.PP
The best approach when a compiler error is detected involves several stages.
First, try to get a small example program that steps on the bug.
Second, turn on various debugging flags in the code generator, and follow the
tree through the process of being matched and rewritten.
Some flags of interest are
\-e, which prints the expression tree,
\-r, which gives information about the allocation of registers,
\-a, which gives information about the performance of
.I rallo ,
and \-o, which gives information about the behavior of
.I order .
This technique should allow most bugs to be found relatively quickly.
.PP
Unfortunately, finding the bug is usually not enough; it must also
be fixed!
The difficulty arises because a fix to the particular bug of interest tends
to break other code that already works.
Regression tests, tests that compare the performance of
a new compiler against the performance of an older one, are very
valuable in preventing major catastrophes.
.SH
Summary and Conclusion
.PP
The portable compiler has been a useful tool for providing C
capability on a large number of diverse machines,
and for testing a number of theoretical
constructs in a practical setting.
It has many blemishes, both in style and functionality.
It has been applied to many more machines than first
anticipated, of a much wider range than originally dreamed of.
Its use has also spread much faster than expected, leaving parts of
the compiler still somewhat raw in shape.
.PP
On the theoretical side, there is some hope that the
skeleton of the
.I sucomp
routine could be generated for many machines directly from the
templates; this would give a considerable boost
to the portability and correctness of the compiler,
but might affect tunability and code quality.
There is also room for more optimization, both within
.I optim
and in the form of a portable ``peephole'' optimizer.
.PP
On the practical, development side,
the compiler could probably be sped up and made smaller
without doing too much violence to its basic structure.
Parts of the compiler deserve to be rewritten;
the initialization code, register allocation, and
parser are prime candidates.
It might be that doing some or all of the parsing
with a recursive descent parser might
save enough space and time to be worthwhile;
it would certainly ease the problem of moving the
compiler to an environment where
.I Yacc
is not already present.
.PP
Finally, I would like to thank the many people who have
sympathetically, and even enthusiastically, helped me grapple
with what has been a frustrating program to write, test, and install.
D. M. Ritchie and E. N. Pinson provided needed early
encouragement and philosophical guidance;
M. E. Lesk,
R. Muha, T. G. Peterson,
G. Riddle, L. Rosler,
R. W. Mitze,
B. R. Rowland,
S. I. Feldman,
and
T. B. London
have all contributed ideas, gripes, and all, at one time or another,
climbed ``into the pits'' with me to help debug.
Without their help this effort would not have been possible;
with it, it was often kind of fun.
.sp 100
.LP
.[
$LIST$
.]
.LP
.sp 100
.LP