.SH
Pass Two
.PP
Code generation is far less well understood than
parsing or lexical analysis, and for this reason
the second pass is far harder to discuss in a file by file manner.
A great deal of the difficulty is in understanding the issues
and the strategies employed to meet them.
Any particular function is likely to be reasonably straightforward.
.PP
Thus, this part of the paper will concentrate a good deal on the broader
aspects of strategy in the code generator,
and will not get too intimate with the details.
.SH
Overview.
.PP
It is difficult to organize a code generator to be flexible enough to
generate code for a large number of machines,
and still be efficient for any one of them.
Flexibility is also important when it comes time to tune
the code generator to improve the output code quality.
On the other hand, too much flexibility can lead to semantically
incorrect code, and potentially a combinatorial explosion in the
number of cases to be considered in the compiler.
.PP
One goal of the code generator is to have a high degree of correctness.
It is very desirable to have the compiler detect its own inability to
generate correct code, rather than to produce incorrect code.
This goal is achieved by having a simple model of the job
to be done (e.g., an expression tree)
and a simple model of the machine state
(e.g., which registers are free).
The act of generating an instruction performs a transformation
on the tree and the machine state;
hopefully, the tree eventually gets
reduced to a single node.
If each of these instruction/transformation pairs is correct,
and if the machine state model really represents the actual machine,
and if the transformations reduce the input tree to the desired single node, then the
output code will be correct.
.PP
For most real machines, there is no definitive theory of code generation that
encompasses all the C operators.
Thus the selection of which instruction/transformations to generate, and in what
order, will have a heuristic flavor.
If, for some expression tree, no transformation applies, or, more
seriously, if the heuristics select a sequence of instruction/transformations
that do not in fact reduce the tree, the compiler
will report its inability to generate code, and abort.
.PP
A major part of the code generator is concerned with the model and the transformations,
\(em most of this is machine independent, or depends only on simple tables.
The flexibility comes from the heuristics that guide the transformations
of the trees, the selection of subgoals, and the ordering of the computation.
.SH
The Machine Model
.PP
The machine is assumed to have a number of registers, of at most two different
types:
.I A
and
.I B .
Within each register class, there may be scratch (temporary) registers and dedicated registers
(e.g., register variables, the stack pointer, etc.).
Requests to allocate and free registers involve only the temporary registers.
.PP
Each of the registers in the machine is given a name and a number
in the
.I mac2defs
file; the numbers are used as indices into various tables
that describe the registers, so they should be kept small.
One such table is the
.I rstatus
table on file
.I local2.c .
This table is indexed by register number, and
contains expressions made up from manifest constants describing the register types:
SAREG for dedicated AREG's, SAREG\(orSTAREG for scratch AREG's,
and SBREG and SBREG\(orSTBREG similarly for BREG's.
There are macros that access this information:
.I isbreg(r)
returns true if register number
.I r
is a BREG, and
.I istreg(r)
returns true if register number
.I r
is a temporary AREG or BREG.
Another table,
.I rnames ,
contains the register names; this is used when
putting out assembler code and diagnostics.
.PP
The usage of registers is kept track of by an array called
.I busy .
.I Busy[r]
is the number of uses of register
.I r
in the current tree being processed.
The allocation and freeing of registers will be discussed later as part of the code generation
algorithm.
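.PP
The bookkeeping just described is easy to sketch in C; the helper names and the register count here are illustrative, not the compiler's own:

```c
#include <assert.h>

#define NREGS 16

/* busy[r] is the number of uses of register r in the current tree;
 * a register is free exactly when its count is zero. */
static int busy[NREGS];

/* Record one more use of register r. */
static void reg_use(int r) { busy[r]++; }

/* Release one use; freeing more times than used is a compiler bug. */
static void reg_free(int r) {
    assert(busy[r] > 0);
    busy[r]--;
}

/* Grab any free register in [lo, hi]; return -1 if all are busy. */
static int reg_alloc(int lo, int hi) {
    for (int r = lo; r <= hi; r++)
        if (busy[r] == 0) {
            reg_use(r);
            return r;
        }
    return -1;
}
```

Counting uses, rather than keeping a single busy bit, is what allows a register to appear more than once in a tree and be released only when its last use is reclaimed.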
.SH
General Organization
.PP
As mentioned above, the second pass reads lines from
the intermediate file, copying through to the output unchanged
any lines that begin with a `)', and making note of the
information about stack usage and register allocation contained on
lines beginning with `]' and `['.
The expression trees, whose beginning is indicated by a line beginning with `.',
are read and rebuilt into trees.
If the compiler is loaded as one pass, the expression trees are
immediately available to the code generator.
.PP
The actual code generation is done by a hierarchy of routines.
The routine
.I delay
is first given the tree; it attempts to delay some postfix
++ and \-\- computations that might reasonably be done after the
smoke clears.
It also attempts to handle comma (,) operators by computing the
left side expression first, and then rewriting the tree
to eliminate the operator.
.I Delay
calls
.I codgen
to control the actual code generation process.
.I Codgen
takes as arguments a pointer to the expression tree,
and a second argument that, for socio-historical reasons, is called a
.I cookie .
The cookie describes a set of goals that would be acceptable
for the code generation: these are assigned to individual bits,
so they may be logically or'ed together to form a large number of possible goals.
Among the possible goals are FOREFF (compute for side effects only;
don't worry about the value), INTEMP (compute and store value into a temporary location
in memory), INAREG (compute into an A register), INTAREG (compute into a scratch
A register), INBREG and INTBREG similarly, FORCC (compute for condition codes),
and FORARG (compute it as a function argument; e.g., stack it if appropriate).
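.PP
The cookie mechanism can be illustrated with a small sketch; the names follow the text, but the bit assignments are invented for the example:

```c
#include <assert.h>

/* The goal bits, OR-able into a cookie as described above.
 * The names follow the text; the particular bit values are
 * illustrative, not the compiler's actual definitions. */
#define FOREFF  0001  /* compute for side effects only */
#define INTEMP  0002  /* compute into a temporary memory location */
#define INAREG  0004  /* compute into an A register */
#define INTAREG 0010  /* compute into a scratch A register */
#define INBREG  0020  /* compute into a B register */
#define INTBREG 0040  /* compute into a scratch B register */
#define FORCC   0100  /* compute for the condition codes */
#define FORARG  0200  /* compute as a function argument */

/* Does what a template offers satisfy any of the requested goals? */
static int goal_ok(int request, int offered) {
    return (request & offered) != 0;
}
```

A caller content with either a scratch A register or set condition codes would pass the cookie INTAREG\(orFORCC, and a template achieving either bit would match.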
.PP
.I Codgen
first canonicalizes the tree by calling
.I canon .
This routine looks for certain transformations that might now be
applicable to the tree.
One, which is very common and very powerful, is to
fold together an indirection operator (UNARY MUL)
and a register (REG); in most machines, this combination is
addressable directly, and so is similar to a NAME in its
behavior.
The UNARY MUL and REG are folded together to make
another node type called OREG.
In fact, in many machines it is possible to directly address not just the
cell pointed to by a register, but also cells differing by a constant
offset from the cell pointed to by the register.
.I Canon
also looks for such cases, calling the
machine dependent routine
.I notoff
to decide if the offset is acceptable (for example, in the IBM 370 the offset
must be between 0 and 4095 bytes).
Another optimization is to replace bit field operations
by shifts and masks if the operation involves extracting the field.
Finally, a machine dependent routine,
.I sucomp ,
is called that computes the Sethi-Ullman numbers
for the tree (see below).
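.PP
The OREG folding might be sketched as follows; the node structure is illustrative, and the 0\-4095 displacement check mimics the IBM 370 case mentioned above:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative node and opcode definitions; the real compiler's
 * structures differ. */
enum op { REG, NAME, OREG, UMUL, PLUS, ICON };

struct node {
    enum op op;
    long lval;                  /* constant value or offset */
    int rval;                   /* register number */
    struct node *left, *right;
};

/* Return nonzero if the offset is NOT directly addressable
 * (an IBM 370-style base+displacement check). */
static int notoff(long off) { return off < 0 || off > 4095; }

/* Fold *(reg) or *(reg + const) into a single OREG node, in place. */
static void canon_oreg(struct node *p) {
    if (p->op != UMUL)
        return;
    struct node *l = p->left;
    if (l->op == REG) {                     /* *(reg) -> OREG 0(reg) */
        p->op = OREG;
        p->rval = l->rval;
        p->lval = 0;
    } else if (l->op == PLUS && l->left->op == REG &&
               l->right->op == ICON && !notoff(l->right->lval)) {
        p->op = OREG;                       /* *(reg+c) -> OREG c(reg) */
        p->rval = l->left->rval;
        p->lval = l->right->lval;
    }
}
```

When the displacement fails the notoff test, the tree is simply left alone; the indirection will then have to be matched (or rewritten) by the templates in the ordinary way.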
.PP
After the tree is canonicalized,
.I codgen
calls the routine
.I store
whose job is to select a subtree of the tree to be computed
and (usually) stored before beginning the
computation of the full tree.
.I Store
must return a tree that can be computed without need
for any temporary storage locations.
In effect, the only store operations generated while processing the subtree must be as a response
to explicit assignment operators in the tree.
This division of the job marks one of the more significant, and successful,
departures from most other compilers.
It means that the code generator can operate
under the assumption that there are enough
registers to do its job, without worrying about
temporary storage.
If a store into a temporary appears in the output, it is always
as a direct result of logic in the
.I store
routine; this makes debugging easier.
.PP
One consequence of this organization is that
code is not generated by a treewalk.
There are theoretical results that support this decision.
.[
Aho Johnson Optimal Code Trees jacm
.]
It may be desirable to compute several subtrees and store
them before tackling the whole tree;
if a subtree is to be stored, this is known before the code
generation for the subtree is begun, and the subtree is computed
when all scratch registers are available.
.PP
The
.I store
routine decides what subtrees, if any, should be
stored by making use of numbers,
called
.I "Sethi-Ullman numbers" ,
that give, for each subtree of an expression tree,
the minimum number of scratch registers required to
compile the subtree, without any stores into temporaries.
.[
Sethi Ullman jacm 1970
.]
These numbers are computed by the machine-dependent
routine
.I sucomp ,
called by
.I canon .
The basic notion is that, knowing the Sethi-Ullman numbers for
the descendants of a node, and knowing the operator of the
node and some information about the machine, the
Sethi-Ullman number of the node itself can be computed.
If the Sethi-Ullman number for a tree exceeds the number of scratch
registers available, some subtree must be stored.
Unfortunately, the theory behind the Sethi-Ullman numbers
applies only to uselessly simple machines and operators.
For the rich set of C operators, and for machines with
asymmetric registers, register pairs, different kinds of registers,
and exceptional forms of addressing,
the theory cannot be applied directly.
The basic idea of estimation is a good one, however,
and well worth applying; the application, especially
when the compiler comes to be tuned for high code
quality, goes beyond the park of theory into the
swamp of heuristics.
This topic will be taken up again later, when more of the compiler
structure has been described.
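.PP
For the uselessly simple machine of the theory, the computation is easy to sketch; a real sucomp must handle far more, as noted above:

```c
#include <assert.h>

/* Textbook Sethi-Ullman sketch for a machine with interchangeable
 * registers and binary register-to-register operators only. */
struct tnode {
    struct tnode *left, *right;  /* both null for a leaf */
    int su;                      /* computed Sethi-Ullman number */
};

static int sucomp(struct tnode *p) {
    if (p->left == 0 && p->right == 0)
        return p->su = 1;            /* a leaf needs one register */
    int l = sucomp(p->left);
    int r = sucomp(p->right);
    /* Unequal needs: do the bigger side first and reuse its spare
     * registers for the smaller side.  Equal needs: one extra
     * register is required to hold the first result. */
    return p->su = (l == r) ? l + 1 : (l > r ? l : r);
}
```

For the tree (a+b)+(c+d), each inner node gets 2 and the root gets 3: computing either sum first leaves its result tying up a register while the other sum needs two more.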
.PP
After examining the Sethi-Ullman numbers,
.I store
selects a subtree, if any, to be stored, and returns the subtree and the associated cookie in
the external variables
.I stotree
and
.I stocook .
If a subtree has been selected, or if
the whole tree is ready to be processed, the
routine
.I order
is called, with a tree and cookie.
.I Order
generates code for trees that
do not require temporary locations.
.I Order
may make recursive calls on itself, and,
in some cases, on
.I codgen ;
for example, when processing the operators &&, \(or\(or, and comma (`,'), that have
a left to right evaluation, it is
incorrect for
.I store
to examine the right operand for subtrees to be stored.
In these cases,
.I order
will call
.I codgen
recursively when it is permissible to work on the right operand.
A similar issue arises with the ? : operator.
.PP
The
.I order
routine works by matching the current tree with
a set of code templates.
If a template is discovered that will
match the current tree and cookie, the associated assembly language
statement or statements are generated.
The tree is then rewritten,
as specified by the template, to represent the effect of the output instruction(s).
If no template match is found, first an attempt is made to find a match with a
different cookie; for example, in order to compute
an expression with cookie INTEMP (store into a temporary storage location),
it is usually necessary to compute the expression into a scratch register
first.
If all attempts to match the tree fail, the heuristic part of
the algorithm becomes dominant.
Control is typically given to one of a number of machine-dependent routines
that may in turn recursively call
.I order
to achieve a subgoal of the computation (for example, one of the
arguments may be computed into a temporary register).
After this subgoal has been achieved, the process begins again with the
modified tree.
If the machine-dependent heuristics are unable to reduce the tree further,
a number of default rewriting rules may be considered appropriate.
For example, if the left operand of a + is a scratch
register, the + can be replaced by a += operator;
the tree may then match a template.
.PP
To close this introduction, we will discuss the steps in compiling
code for the expression
.DS
\fIa\fR += \fIb\fR
.DE
where
.I a
and
.I b
are static variables.
.PP
To begin with, the whole expression tree is examined with cookie FOREFF, and
no match is found. Search with other cookies is equally fruitless, so an
attempt at rewriting is made.
Suppose we are dealing with the Interdata 8/32 for the moment.
It is recognized that the left hand and right hand sides of the += operator
are addressable, and in particular the left hand side has no
side effects, so it is permissible to rewrite this as
.DS
\fIa\fR = \fIa\fR + \fIb\fR
.DE
and this is done.
No match is found on this tree either, so a machine dependent rewrite is done; it is recognized
that the left hand side of the assignment is addressable, but the right hand side is not
in a register, so
.I order
is called recursively, being asked to put the right
hand side of the assignment into a register.
This invocation of
.I order
searches the tree for a match, and fails.
The machine dependent rule for +
notices that the right hand operand is addressable;
it decides to put the left operand into a scratch register.
Another recursive call to
.I order
is made, with the tree
consisting solely of the leaf
.I a ,
and the cookie asking that the value be placed into a scratch register.
This now matches a template, and a load instruction is emitted.
The node consisting of
.I a
is rewritten in place to represent the register into which
.I a
is loaded,
and this third call to
.I order
returns.
The second call to
.I order
now finds that it has the tree
.DS
\fBreg\fR + \fIb\fR
.DE
to consider.
Once again, there is no match, but the default rewriting rule rewrites
the + as a += operator, since the left operand is a scratch register.
When this is done, there is a match: in fact,
.DS
\fBreg\fR += \fIb\fR
.DE
simply describes the effect of the add instruction
on a typical machine.
After the add is emitted, the tree is rewritten
to consist merely of the register node, since the result of the add
is now in the register.
This agrees with the cookie passed to the second invocation of
.I order ,
so this invocation
terminates, returning to the first level.
The original tree has now
become
.DS
\fIa\fR = \fBreg\fR
.DE
which matches a template for the store instruction.
The store is output, and the tree rewritten to become
just a single register node.
At this point, since the top level call to
.I order
was
interested only in side effects, the call to
.I order
returns, and the code generation is completed;
we have generated a load, add, and store, as might have been expected.
.PP
The effect of machine architecture on this is considerable.
For example, on the Honeywell 6000, the machine dependent heuristics recognize that there is an ``add to storage''
instruction, so the strategy is quite different;
.I b
is loaded into
a register, and then an add to storage instruction generated
to add this register into
.I a .
The transformations, involving as they do the semantics of C,
are largely machine independent.
The decisions as to when to use them, however, are
almost totally machine dependent.
.PP
Having given a broad outline of
the code generation process, we shall next consider the
heart of it: the templates.
This leads naturally into discussions of template matching and register allocation,
and finally a discussion of the machine dependent interfaces and strategies.
.SH
The Templates
.PP
The templates describe the effect of the target machine instructions
on the model of computation around which the compiler is organized.
In effect, each template has five logical sections, and represents an assertion
of the form:
.IP
.B If
we have a subtree of a given shape (1), and we have a goal (cookie) or goals to
achieve (2), and we have sufficient free resources (3),
.B then
we may emit an instruction or instructions (4), and
rewrite the subtree in a particular manner (5),
and the rewritten tree will achieve the desired goals.
.PP
These five sections will be discussed in more
detail later. First, we give an example of a
template:
.DS
.ta 1i 2i 3i 4i 5i
ASG PLUS, INAREG,
	SAREG, TINT,
	SNAME, TINT,
	0, RLEFT,
	"	add	AL,AR\en",
.DE
The top line specifies the operator (+=) and the cookie (compute the
value of the subtree into an AREG).
The second and third lines specify the left and right descendants,
respectively,
of the += operator.
The left descendant must be a REG node, representing an
A register, and have integer type, while the right side must be a NAME node,
and also have integer type.
The fourth line contains the resource requirements (no scratch registers
or temporaries needed), and the rewriting rule (replace the subtree by the left descendant).
Finally, the quoted string on the last line represents the output to the assembler:
lower case letters, tabs, spaces, etc. are copied
.I verbatim
to the output; upper case letters trigger various macro-like expansions.
Thus,
.B AL
would expand into the \fBA\fRddress form of the \fBL\fReft operand \(em
presumably the register number.
Similarly,
.B AR
would expand into the name of the right operand.
The
.I add
instruction of the last section might well be
emitted by this template.
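.PP
The five sections might be rendered as a C structure; the field and constant names here are assumptions made for the sake of illustration, and the final entry transcribes the example template above:

```c
#include <assert.h>

/* Hypothetical encodings; the compiler's actual values differ. */
enum { ASG_PLUS = 1 };            /* the += operator */
enum { INAREG = 04 };             /* cookie: value into an A register */
enum { SAREG = 01, SNAME = 02 };  /* shapes */
enum { TINT = 01 };               /* types */
enum { RLEFT = 01 };              /* rewrite: replace by left child */

struct template {
    int op;              /* (1) operator at the root of the subtree */
    int cookie;          /* (2) goals this template can achieve */
    int lshape, ltype;   /* (1) shape and type of the left descendant */
    int rshape, rtype;   /*     ... and of the right descendant */
    int needs;           /* (3) scratch resources required */
    int rewrite;         /* (5) how to rewrite the matched subtree */
    const char *cstring; /* (4) assembler output, with AL/AR macros */
};

/* The example template from the text, transcribed into this form. */
static const struct template t_add = {
    ASG_PLUS, INAREG,
    SAREG, TINT,
    SNAME, TINT,
    0, RLEFT,
    "	add	AL,AR\n",
};
```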
.PP
In principle, it would be possible to make separate templates
for all legal combinations of operators, cookies, types, and shapes.
In practice, the number of combinations is very large.
Thus, a considerable amount of mechanism is present to
permit a large number of subtrees to be matched
by a single template.
Most of the shape and type specifiers are individual bits, and can
be logically
or'ed
together.
There are a number of special descriptors for matching classes of
operators.
The cookies can also be combined.
As an example of the kind of template
that really arises in practice, the
actual template for the Interdata 8/32
that subsumes the above example is:
.DS
.ta 1i 2i 3i 4i 5i
ASG OPSIMP, INAREG\(orFORCC,
	SAREG, TINT\(orTUNSIGNED\(orTPOINT,
	SAREG\(orSNAME\(orSOREG\(orSCON, TINT\(orTUNSIGNED\(orTPOINT,
	0, RLEFT\(orRESCC,
	"	OI	AL,AR\en",
.DE
Here, OPSIMP represents the operators
+, \-, \(or, &, and ^.
The
.B OI
macro in the output string expands into the
appropriate \fBI\fRnteger \fBO\fRpcode for the operator.
The left and right sides can be integers, unsigned, or pointer types.
The right side can be, in addition to a name, a register,
a memory location whose address is given by a register and displacement (OREG),
or a constant.
Finally, these instructions set the condition codes,
and so can be used in condition contexts:
the cookie and rewriting rules reflect this.
.SH
The Template Matching Algorithm.
.PP
The heart of the second pass is the template matching
algorithm, in the routine
.I match .
.I Match
is called with a tree and a cookie; it attempts to match
the given tree against some template that will transform it
according to one of the goals given in the cookie.
If a match is successful, the transformation is
applied;
.I expand
is called to generate the assembly code, and then
.I reclaim
rewrites the tree, and reclaims the resources, such
as registers, that might have become free as a result
of the generated code.
.PP
This part of the compiler is among the most time critical.
There is a spectrum of implementation techniques available
for doing this matching.
The most naive algorithm simply looks at the templates one by one.
This can be considerably improved upon by restricting the search
for an acceptable template.
It would be possible to do better than this if the templates were given
to a separate program that ate them and generated a template
matching subroutine.
This would make maintenance of the compiler much more
complicated, however, so this has not been done.
.PP
The matching algorithm is actually carried out by restricting
the range in the table that must be searched for each opcode.
This introduces a number of complications, however, and needs a
bit of sympathetic help by the person constructing the
compiler in order to obtain best results.
The exact tuning of this algorithm continues; it
is best to consult the code and comments in
.I match
for the latest version.
.PP
In order to match a template to a tree,
it is necessary to match not only the cookie and the
op of the root, but also the types and shapes of the
left and right descendants (if any) of the tree.
A convention is established here that is carried out throughout
the second pass of the compiler.
If a node represents a unary operator, the single descendant
is always the ``left'' descendant.
If a node represents a unary operator or a leaf node (no descendants),
the ``right'' descendant is taken by convention to be the node itself.
This enables templates to easily match leaves and conversion operators, for example,
without any additional mechanism in the matching program.
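.PP
The convention can be sketched with a pair of accessor functions; the names and node layout are illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the descendant convention described above: a unary
 * operator's single child is its "left" descendant, and for unary
 * operators and leaves the "right" descendant is the node itself. */
enum nkind { LEAF, UNARY, BINARY };

struct mnode {
    enum nkind kind;
    struct mnode *left, *right;
};

/* The left descendant; for a leaf we return the node itself
 * (an assumption made here so the matcher never sees a null). */
static struct mnode *getleft(struct mnode *p) {
    return p->kind == LEAF ? p : p->left;
}

/* The right descendant; unary operators and leaves yield themselves. */
static struct mnode *getright(struct mnode *p) {
    return p->kind == BINARY ? p->right : p;
}
```

A template's right-side shape and type fields can thus be checked against a conversion operator or a leaf directly, with no special cases in the matching loop.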
.PP
The type matching is straightforward; it is possible to specify any combination
of basic types, general pointers, and pointers to one or more of
the basic types.
The shape matching is somewhat more complicated, but still pretty simple.
Templates have a collection of possible operand shapes
on which the opcode might match.
In the simplest case, an
.I add
operation might be able to add to either a register variable
or a scratch register, and might be able (with appropriate
help from the assembler) to add an integer constant (ICON), a static
memory cell (NAME), or a stack location (OREG).
.PP
It is usually attractive to specify a number of such shapes,
and distinguish between them when the assembler output is produced.
It is possible to describe the union of many elementary
shapes such as ICON, NAME, OREG,
AREG or BREG
(both scratch and register forms), etc.
To handle at least the simple forms of indirection, one can also
match some more complicated forms of trees; STARNM and STARREG
can match more complicated trees headed by an indirection operator,
and SFLD can match certain trees headed by a FLD operator: these
patterns call machine dependent routines that match the
patterns of interest on a given machine.
The shape SWADD may be used to recognize NAME or OREG
nodes that lie on word boundaries: this may be of some importance
on word\-addressed machines.
Finally, there are some special shapes: these may not
be used in conjunction with the other shapes, but may be
defined and extended in machine dependent ways.
The special shapes SZERO, SONE, and SMONE are predefined and match
constants 0, 1, and \-1, respectively; others are easy to add
and match by using the machine dependent routine
.I special .
.PP
When a template has been found that matches the root of the tree,
the cookie, and the shapes and types of the descendants,
there is still one bar to a total match: the template may
call for some resources (for example, a scratch register).
The routine
.I allo
is called, and it attempts to allocate the resources.
If it cannot, the match fails; no resources are
allocated.
If successful, the allocated resources are given numbers
1, 2, etc. for later reference when the assembly code is
generated.
The routines
.I expand
and
.I reclaim
are then called.
The
.I match
routine then returns a special value, MDONE.
If no match was found, the value MNOPE is returned;
this is a signal to the caller to try more cookie
values, or attempt a rewriting rule.
.I Match
is also used to select rewriting rules, although
the way of doing this is pretty straightforward.
A special cookie, FORREW, is used to ask
.I match
to search for a rewriting rule.
The rewriting rules are keyed to various opcodes; most
are carried out in
.I order .
Since the question of when to rewrite is one of the key issues in
code generation, it will be taken up again later.
.SH
Register Allocation.
.PP
The register allocation routines, and the allocation strategy,
play a central role in the correctness of the code generation algorithm.
If there are bugs in the Sethi-Ullman computation that cause the
number of needed registers to be underestimated,
the compiler may run out of scratch registers;
it is essential that the allocator keep track of those registers that
are free and busy, in order to detect such conditions.
.PP
Allocation of registers takes place as the result of a template
match; the routine
.I allo
is called with a word describing the number of A registers,
B registers, and temporary locations needed.
The allocation of temporary locations on the stack is relatively
straightforward, and will not be further covered; the
bookkeeping is a bit tricky, but conceptually trivial, and requests
for temporary space on the stack will never fail.
.PP
Register allocation is less straightforward.
The two major complications are
.I pairing
and
.I sharing .
In many machines, some operations (such as multiplication
and division), and/or some types (such as longs or double precision)
require even/odd pairs of registers.
Operations of the first type are exceptionally difficult to
deal with in the compiler; in fact, their theoretical
properties are rather bad as well.
.[
Aho Johnson Ullman Multiregister
.]
The second issue is dealt with rather more successfully;
a machine dependent function called
.I szty(t)
is called that returns 1 or 2, depending on the
number of A registers required to hold an object of type
.I t .
If
.I szty
returns 2, an even/odd pair of A registers is allocated
for each request.
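.PP
The even/odd pairing rule is easy to sketch; the register file size and the free set here are illustrative:

```c
#include <assert.h>

/* Illustrative A-register free set: 1 means the register is free. */
#define NAREG 8
static int regfree[NAREG] = {1, 1, 1, 1, 1, 1, 1, 1};

/* A szty()-style size function: longs need a register pair, say. */
static int szty_long(void) { return 2; }

/* Allocate either a single register or an even/odd pair;
 * return the (even) base register number, or -1 on failure. */
static int alloc_areg(int size) {
    if (size == 1) {
        for (int r = 0; r < NAREG; r++)
            if (regfree[r]) {
                regfree[r] = 0;
                return r;
            }
    } else {
        for (int r = 0; r + 1 < NAREG; r += 2)   /* even bases only */
            if (regfree[r] && regfree[r + 1]) {
                regfree[r] = regfree[r + 1] = 0;
                return r;
            }
    }
    return -1;
}
```

Note that a pair request can fail even when two registers are free, if no free pair is aligned on an even boundary; this fragmentation is part of what makes pairing awkward for the allocator.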
.PP
The other issue, sharing, is more subtle, but
important for good code quality.
When registers are allocated, it
is possible to reuse registers that hold address
information, and use them to contain the values
computed or accessed.
For example, on the IBM 360, if register 2 has
a pointer to an integer in it, we may load the
integer into register 2 itself by saying:
.DS
L	2,0(2)
.DE
If register 2 had a byte pointer, however, the sequence for
loading a character involves clearing the target
register first, and then inserting the desired character:
.DS
SR	3,3
IC	3,0(2)
.DE
In the first case, if register 3 were used as the target,
it would lead to a larger number of registers
used for the expression than were required; the compiler would
generate inefficient code.
On the other hand, if register 2 were used as the target in the second
case, the code would simply be wrong.
In the first case, register 2 can be
.I shared
while in the second, it cannot.
.PP
In the specification of the register needs in the templates,
it is possible to indicate whether required scratch registers
may be shared with possible registers on the left or the right of the input tree.
In order that a register be shared, it must be scratch, and it must
be used only once, on the appropriate side of the tree being compiled.
| 701 | .PP |
| 702 | The |
| 703 | .I allo |
| 704 | routine thus has a bit more to do than meets the eye; |
| 705 | it calls |
| 706 | .I freereg |
| 707 | to obtain a free register for each A and B register request. |
| 708 | .I Freereg |
| 709 | makes multiple calls on the routine |
| 710 | .I usable |
| 711 | to decide if a given register can be used to satisfy |
| 712 | a given need. |
| 713 | .I Usable |
| 714 | calls |
| 715 | .I shareit |
| 716 | if the register is busy, but might be shared. |
| 717 | Finally, |
| 718 | .I shareit |
| 719 | calls |
| 720 | .I ushare |
| 721 | to decide if the desired register is actually in the appropriate |
| 722 | subtree, and can be shared. |
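The decision at the bottom of this chain reduces to a small predicate; the parameterization below is an assumption (the real routines consult the tree and the machine state directly rather than taking flags):

```c
/* Toy version of the sharing decision: a busy register may satisfy
 * a request only if it is a scratch register used exactly once on
 * the appropriate side of the tree being compiled. */
int shareit(int isscratch, int uses)
{
	return isscratch && uses == 1;
}

int usable(int isbusy, int isscratch, int uses)
{
	if (!isbusy)
		return 1;			/* free: trivially usable */
	return shareit(isscratch, uses);	/* busy: only if sharable */
}
```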
| 723 | .PP |
| 724 | Just to add additional complexity, on some machines (such as the IBM 370) it |
| 725 | is possible to have ``double indexing'' forms of |
addressing; these are represented by OREGs
| 727 | with the base and index registers encoded into the register field. |
| 728 | While the register allocation and deallocation |
| 729 | .I "per se" |
| 730 | is not made more difficult by this phenomenon, the code itself |
| 731 | is somewhat more complex. |
| 732 | .PP |
| 733 | Having allocated the registers and expanded the assembly language, |
| 734 | it is time to reclaim the resources; the routine |
| 735 | .I reclaim |
| 736 | does this. |
| 737 | Many operations produce more than one result. |
| 738 | For example, many arithmetic operations may produce |
| 739 | a value in a register, and also set the condition |
| 740 | codes. |
| 741 | Assignment operations may leave results both in a register and in memory. |
| 742 | .I Reclaim |
| 743 | is passed three parameters; the tree and cookie |
| 744 | that were matched, and the rewriting field of the template. |
| 745 | The rewriting field allows the specification of possible results; |
| 746 | the tree is rewritten to reflect the results of the operation. |
| 747 | If the tree was computed for side effects only (FOREFF), |
| 748 | the tree is freed, and all resources in it reclaimed. |
| 749 | If the tree was computed for condition codes, the resources |
| 750 | are also freed, and the tree replaced by a special |
| 751 | node type, FORCC. |
| 752 | Otherwise, the value may be found in the left |
| 753 | argument of the root, the right argument of the root, |
| 754 | or one of the temporary resources allocated. |
| 755 | In these cases, first the resources of the tree, |
| 756 | and the newly allocated resources, |
| 757 | are |
| 758 | freed; then the resources needed by the result |
| 759 | are made busy again. |
| 760 | The final result must always match the shape of the input cookie; |
| 761 | otherwise, the compiler error |
| 762 | ``cannot reclaim'' |
| 763 | is generated. |
| 764 | There are some machine dependent ways of |
| 765 | preferring results in registers or memory when |
| 766 | there are multiple results matching multiple goals in the cookie. |
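The dispatch just described can be caricatured in a few lines of C; the cookie names follow the text, but their values, and the reduction of "result shape" to a single flag, are assumptions:

```c
/* Toy dispatch in the shape of reclaim(): the cookie records what
 * the caller wanted, and the tree is freed or rewritten to match.
 * Cookie values and the return convention are assumptions. */
enum { FOREFF = 1, FORCC = 2, INAREG = 4 };

/* returns what the tree was rewritten to; -1 models the
 * ``cannot reclaim'' compiler error */
int reclaim(int cookie, int result_in_reg)
{
	if (cookie & FOREFF)
		return 0;		/* side effects only: free everything */
	if (cookie & FORCC)
		return FORCC;		/* replace tree by a FORCC node */
	if ((cookie & INAREG) && result_in_reg)
		return INAREG;		/* result shape matches the cookie */
	return -1;			/* no result matches the goal */
}
```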
| 767 | .SH |
| 768 | The Machine Dependent Interface |
| 769 | .PP |
| 770 | The files |
| 771 | .I order.c , |
| 772 | .I local2.c , |
| 773 | and |
| 774 | .I table.c , |
| 775 | as well as the header file |
| 776 | .I mac2defs , |
| 777 | represent the machine dependent portion of the second pass. |
| 778 | The machine dependent portion can be roughly divided into |
two parts: the easy portion and the hard portion.
| 780 | The easy portion |
| 781 | tells the compiler the names of the registers, and arranges that |
| 782 | the compiler generate the proper assembler formats, opcode names, location counters, etc. |
| 783 | The hard portion involves the Sethi\-Ullman computation, the |
| 784 | rewriting rules, and, to some extent, the templates. |
| 785 | It is hard because there are no real algorithms that apply; |
| 786 | most of this portion is based on heuristics. |
| 787 | This section discusses the easy portion; the next several |
| 788 | sections will discuss the hard portion. |
| 789 | .PP |
| 790 | If the compiler is adapted from a compiler for a machine |
| 791 | of similar architecture, the easy part is indeed easy. |
| 792 | In |
| 793 | .I mac2defs , |
| 794 | the register numbers are defined, as well as various parameters for |
| 795 | the stack frame, and various macros that describe the machine architecture. |
| 796 | If double indexing is to be permitted, for example, the symbol |
| 797 | R2REGS is defined. |
| 798 | Also, a number of macros that are involved in function call processing, |
| 799 | especially for unusual function call mechanisms, are defined here. |
| 800 | .PP |
| 801 | In |
| 802 | .I local2.c , |
| 803 | a large number of simple functions are defined. |
| 804 | These do things such as write out opcodes, register names, |
| 805 | and address forms for the assembler. |
| 806 | Part of the function call code is defined here; that is nontrivial |
| 807 | to design, but typically rather straightforward to implement. |
| 808 | Among the easy routines in |
| 809 | .I order.c |
| 810 | are routines for generating a created label, |
| 811 | defining a label, and generating the arguments |
| 812 | of a function call. |
| 813 | .PP |
These routines tend to have a local effect, and depend in a fairly straightforward way
| 815 | on the target assembler and the design decisions already made about |
| 816 | the compiler. |
| 817 | Thus they will not be further treated here. |
| 818 | .SH |
| 819 | The Rewriting Rules |
| 820 | .PP |
| 821 | When a tree fails to match any template, it becomes |
| 822 | a candidate for rewriting. |
| 823 | Before the tree is rewritten, |
| 824 | the machine dependent routine |
| 825 | .I nextcook |
| 826 | is called with the tree and the cookie; it suggests |
| 827 | another cookie that might be a better candidate for the |
| 828 | matching of the tree. |
| 829 | If all else fails, the templates are searched with the cookie |
| 830 | FORREW, to look for a rewriting rule. |
| 831 | The rewriting rules are of two kinds; |
| 832 | for most of the common operators, there are |
| 833 | machine dependent rewriting rules that may be applied; |
| 834 | these are handled by machine dependent functions |
| 835 | that are called and given the tree to be computed. |
| 836 | These routines may recursively call |
| 837 | .I order |
| 838 | or |
| 839 | .I codgen |
| 840 | to cause certain subgoals to be achieved; |
| 841 | if they actually call for some alteration of the tree, |
| 842 | they return 1, and the |
| 843 | code generation algorithm recanonicalizes and tries again. |
| 844 | If these routines choose not to deal with the tree, the |
| 845 | default rewriting rules are applied. |
| 846 | .PP |
| 847 | The assignment ops, when rewritten, call the routine |
| 848 | .I setasg . |
| 849 | This is assumed to rewrite the tree at least to the point where there are |
| 850 | no side effects in the left hand side. |
| 851 | If there is still no template match, |
| 852 | a default rewriting is done that causes |
| 853 | an expression such as |
| 854 | .DS |
| 855 | .I "a += b" |
| 856 | .DE |
| 857 | to be rewritten as |
| 858 | .DS |
| 859 | .I "a = a + b" |
| 860 | .DE |
| 861 | This is a useful default for certain mixtures of strange types |
| 862 | (for example, when |
| 863 | .I a |
| 864 | is a bit field and |
| 865 | .I b |
a character) that
| 867 | otherwise might need separate table entries. |
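The default rewriting can be sketched as a tree transformation in C; the node layout and opcode names are invented, and copying the left side is legitimate only because setasg has already removed its side effects:

```c
#include <stdlib.h>

/* Invented node layout and opcodes for illustration. */
enum { NAME, PLUS, ASG, ASGPLUS };

struct node {
	int op;
	struct node *left, *right;
};

struct node *mknode(int op, struct node *l, struct node *r)
{
	struct node *p = malloc(sizeof *p);

	p->op = op;
	p->left = l;
	p->right = r;
	return p;
}

/* the left side must already be free of side effects */
struct node *copy(struct node *p)
{
	return p ? mknode(p->op, copy(p->left), copy(p->right)) : NULL;
}

/* rewrite "a += b" as "a = a + b" */
struct node *rwasg(struct node *p)
{
	if (p->op != ASGPLUS)
		return p;
	return mknode(ASG, p->left, mknode(PLUS, copy(p->left), p->right));
}
```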
| 868 | .PP |
| 869 | Simple assignment, structure assignment, and all forms of calls |
| 870 | are handled completely by the machine dependent routines. |
| 871 | For historical reasons, the routines generating the calls return |
| 872 | 1 on failure, 0 on success, unlike the other routines. |
| 873 | .PP |
| 874 | The machine dependent routine |
| 875 | .I setbin |
| 876 | handles binary operators; it too must do most of the job. |
| 877 | In particular, when it returns 0, it must do so with |
| 878 | the left hand side in a temporary register. |
| 879 | The default rewriting rule in this case is to convert the |
| 880 | binary operator into the associated assignment operator; |
| 881 | since the left hand side is assumed to be a temporary register, |
| 882 | this preserves the semantics and often allows a considerable |
| 883 | saving in the template table. |
| 884 | .PP |
The increment and decrement operators may be dealt with by
| 886 | the machine dependent routine |
| 887 | .I setincr . |
| 888 | If this routine chooses not to deal with the tree, the rewriting rule replaces |
| 889 | .DS |
| 890 | .I "x ++" |
| 891 | .DE |
| 892 | by |
| 893 | .DS |
| 894 | .I "( (x += 1) \- 1)" |
| 895 | .DE |
| 896 | which preserves the semantics. |
| 897 | Once again, this is not too attractive for the most common |
| 898 | cases, but can generate close to optimal code when the |
type of
.I x
is unusual.
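That the rewriting preserves the semantics is easy to check directly in C itself:

```c
/* ((x += 1) - 1) yields the value x had before the increment,
 * exactly as x++ does; the rewriting is value-preserving. */
int postinc(int *x)
{
	return ((*x += 1) - 1);
}
```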
| 900 | .PP |
| 901 | Finally, the indirection (UNARY MUL) operator is also handled |
| 902 | in a special way. |
| 903 | The machine dependent routine |
| 904 | .I offstar |
| 905 | is extremely important for the efficient generation of code. |
| 906 | .I Offstar |
| 907 | is called with a tree that is the direct descendant of a UNARY MUL node; |
| 908 | its job is to transform this tree so that the combination of |
| 909 | UNARY MUL with the transformed tree becomes addressable. |
| 910 | On most machines, |
| 911 | .I offstar |
| 912 | can simply compute the tree into an A or B register, |
| 913 | depending on the architecture, and then |
| 914 | .I canon |
| 915 | will make the resulting tree into an OREG. |
| 916 | On many machines, |
| 917 | .I offstar |
| 918 | can profitably choose to do less work than computing |
| 919 | its entire argument into a register. |
For example, if the target machine supports OREGs
| 921 | with a constant offset from a register, and |
| 922 | .I offstar |
| 923 | is called |
| 924 | with a tree of the form |
| 925 | .DS |
| 926 | .I "expr + const" |
| 927 | .DE |
| 928 | where |
| 929 | .I const |
| 930 | is a constant, then |
| 931 | .I offstar |
| 932 | need only compute |
| 933 | .I expr |
| 934 | into the appropriate form of register. |
| 935 | On machines that support double indexing, |
| 936 | .I offstar |
| 937 | may have even more choice as to how to proceed. |
| 938 | The proper tuning of |
| 939 | .I offstar , |
which is typically not too difficult, should be one of the
first optimizations attempted by the
compiler writer.
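The opportunity offstar exploits can be sketched in C; the node layout, opcode names, and the reduction of "addressable" to a single offset are assumptions made for illustration:

```c
/* Under a UNARY MUL, a tree of the form "expr + const" needs only
 * expr in a register if the machine has register-plus-offset
 * addressing: the constant folds into the OREG offset.
 * Node layout and opcodes here are invented. */
enum { REG, ICON, PLUS };

struct node {
	int op, val;
	struct node *left, *right;
};

/* returns 1, filling *off, if the tree can become an OREG once its
 * non-constant part reaches a register; 0 means the whole tree
 * must be computed into a register first */
int oregok(struct node *p, int *off)
{
	if (p->op == REG) {
		*off = 0;			/* *(reg): offset zero */
		return 1;
	}
	if (p->op == PLUS && p->right->op == ICON) {
		*off = p->right->val;		/* const becomes the offset */
		return 1;
	}
	return 0;
}
```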
| 943 | .SH |
| 944 | The Sethi-Ullman Computation |
| 945 | .PP |
| 946 | The heart of the heuristics is the computation of the Sethi-Ullman numbers. |
| 947 | This computation is closely linked with the rewriting rules and the |
| 948 | templates. |
| 949 | As mentioned before, the Sethi-Ullman numbers are expected to |
| 950 | estimate the number of scratch registers needed to compute |
| 951 | the subtrees without using any stores. |
| 952 | However, the original theory does not apply to real machines. |
| 953 | For one thing, the theory assumes that all registers |
| 954 | are interchangeable. |
| 955 | Real machines have general purpose, floating point, and index registers, |
| 956 | register pairs, etc. |
| 957 | The theory also does not account for side effects; |
| 958 | this rules out various forms of pathology that arise |
| 959 | from assignment and assignment ops. |
| 960 | Condition codes are also undreamed of. |
| 961 | Finally, the influence of types, conversions, and the |
| 962 | various addressability restrictions and extensions of real |
| 963 | machines are also ignored. |
| 964 | .PP |
| 965 | Nevertheless, for a ``useless'' theory, |
| 966 | the basic insight of Sethi and Ullman is amazingly |
| 967 | useful in a real compiler. |
| 968 | The notion that one should attempt to estimate the |
| 969 | resource needs of trees before starting the |
| 970 | code generation provides a natural means of splitting the |
| 971 | code generation problem, and provides a bit of redundancy |
| 972 | and self checking in the compiler. |
| 973 | Moreover, if writing the |
| 974 | Sethi-Ullman routines is hard, describing, writing, and debugging the |
| 975 | alternative (routines that attempt to free up registers by stores into |
| 976 | temporaries ``on the fly'') is even worse. |
| 977 | Nevertheless, it should be clearly understood that these routines exist in a realm |
| 978 | where there is no ``right'' way to write them; |
| 979 | it is an art, the realm of heuristics, and, consequently, a major |
| 980 | source of bugs in the compiler. |
| 981 | Often, the early, crude versions of these routines give little trouble; |
| 982 | only after |
| 983 | the compiler is actually working |
| 984 | and the |
code quality is being improved do serious problems have to be faced.
| 986 | Having a simple, regular machine architecture is worth quite |
| 987 | a lot at this time. |
| 988 | .PP |
| 989 | The major problems arise from asymmetries in the registers: register pairs, |
| 990 | having different kinds of registers, and the related problem of |
| 991 | needing more than one register (frequently a pair) to store certain |
| 992 | data |
| 993 | types (such as longs or doubles). |
| 994 | There appears to be no general way of treating this problem; |
| 995 | solutions have to be fudged for each machine where the problem arises. |
| 996 | On the Honeywell 66, for example, there are only two general purpose registers, |
| 997 | so a need for a pair is the same as the need for two registers. |
| 998 | On the IBM 370, the register pair (0,1) is used to do multiplications and divisions; |
| 999 | registers 0 and 1 are not generally considered part of the scratch registers, and so |
| 1000 | do not require allocation explicitly. |
| 1001 | On the Interdata 8/32, after much consideration, the |
| 1002 | decision was made not to try to deal with the register pair issue; |
| 1003 | operations such as multiplication and division that required pairs |
| 1004 | were simply assumed to take all of the scratch registers. |
| 1005 | Several weeks of effort had failed to produce |
| 1006 | an algorithm that seemed to have much chance of running successfully |
| 1007 | without inordinate debugging effort. |
| 1008 | The difficulty of this issue should not be minimized; it represents one of the |
| 1009 | main intellectual efforts in porting the compiler. |
| 1010 | Nevertheless, this problem has been fudged with a degree of |
| 1011 | success on nearly a dozen machines, so the compiler writer should |
| 1012 | not abandon hope. |
| 1013 | .PP |
| 1014 | The Sethi-Ullman computations interact with the |
| 1015 | rest of the compiler in a number of rather subtle ways. |
| 1016 | As already discussed, the |
| 1017 | .I store |
| 1018 | routine uses the Sethi-Ullman numbers to decide which subtrees are too difficult |
| 1019 | to compute in registers, and must be stored. |
| 1020 | There are also subtle interactions between the |
| 1021 | rewriting routines and the Sethi-Ullman numbers. |
| 1022 | Suppose we have a tree such as |
| 1023 | .DS |
| 1024 | .I "A \- B" |
| 1025 | .DE |
| 1026 | where |
| 1027 | .I A |
| 1028 | and |
| 1029 | .I B |
| 1030 | are expressions; suppose further that |
| 1031 | .I B |
| 1032 | takes two registers, and |
| 1033 | .I A |
| 1034 | one. |
| 1035 | It is possible to compute the full expression in two registers by |
| 1036 | first computing |
| 1037 | .I B , |
| 1038 | and then, using the scratch register |
| 1039 | used by |
| 1040 | .I B , |
| 1041 | but not containing the answer, compute |
| 1042 | .I A . |
| 1043 | The subtraction can then be done, computing the expression. |
| 1044 | (Note that this assumes a number of things, not the least of which |
| 1045 | are register-to-register subtraction operators and symmetric |
| 1046 | registers.) |
| 1047 | If the machine dependent routine |
| 1048 | .I setbin , |
| 1049 | however, is not prepared to recognize this case |
| 1050 | and compute the more difficult side of the expression |
| 1051 | first, the |
| 1052 | Sethi-Ullman number must be set to three. |
| 1053 | Thus, the |
| 1054 | Sethi-Ullman number for a tree should represent the code that |
| 1055 | the machine dependent routines are actually willing to generate. |
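The register count in the A \- B discussion can be written down as a tiny function, under the classical assumptions the text lists (interchangeable registers, no side effects); the canswap flag stands in for what setbin is actually willing to do:

```c
static int max(int a, int b)
{
	return a > b ? a : b;
}

/* Registers needed by a binary node whose subtrees need l and r.
 * If the machine dependent routines will compute the harder side
 * first (canswap), the classical rule applies; otherwise the left
 * side is computed first and held while the right is done. */
int needs(int l, int r, int canswap)
{
	if (l == r)
		return l + 1;		/* both sides equally hard */
	if (canswap)
		return max(l, r);	/* harder side first, reuse its scratch */
	return max(l, r + 1);	/* left held while the right is computed */
}
```

With A needing one register and B two, needs(1, 2, 1) is two, but needs(1, 2, 0) is three, as in the text.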
| 1056 | .PP |
| 1057 | The interaction can go the other way. |
| 1058 | If we take an expression such as |
| 1059 | .DS |
| 1060 | .I "* ( p + i )" |
| 1061 | .DE |
| 1062 | where |
| 1063 | .I p |
| 1064 | is a pointer and |
| 1065 | .I i |
| 1066 | an integer, |
| 1067 | this can probably be done in one register on most machines. |
| 1068 | Thus, its Sethi-Ullman number would probably be set to one. |
| 1069 | If double indexing is possible in the machine, a possible way |
| 1070 | of computing the expression is to load both |
| 1071 | .I p |
| 1072 | and |
| 1073 | .I i |
| 1074 | into registers, and then use double indexing. |
| 1075 | This would use two scratch registers; in such a case, |
| 1076 | it is possible that the scratch registers might be unobtainable, |
| 1077 | or might make some other part of the computation run out of |
| 1078 | registers. |
| 1079 | The usual solution is to cause |
| 1080 | .I offstar |
| 1081 | to ignore opportunities for double indexing that would tie up more scratch |
| 1082 | registers than the Sethi-Ullman number had reserved. |
| 1083 | .PP |
| 1084 | In summary, the Sethi-Ullman computation represents much of the craftsmanship and artistry in any application |
| 1085 | of the portable compiler. |
| 1086 | It is also a frequent source of bugs. |
| 1087 | Algorithms are available that will produce nearly optimal code |
| 1088 | for specialized machines, but unfortunately most existing machines |
| 1089 | are far removed from these ideals. |
| 1090 | The best way of proceeding in practice is to start with a compiler |
for a machine similar to the target, and proceed very
| 1092 | carefully. |
| 1093 | .SH |
| 1094 | Register Allocation |
| 1095 | .PP |
| 1096 | After the Sethi-Ullman numbers are computed, |
| 1097 | .I order |
| 1098 | calls a routine, |
| 1099 | .I rallo , |
| 1100 | that does register allocation, if appropriate. |
| 1101 | This routine does relatively little, in general; |
| 1102 | this is especially true if the target machine |
| 1103 | is fairly regular. |
| 1104 | There are a few cases where it is assumed that |
| 1105 | the result of a computation takes place in a particular register; |
| 1106 | switch and function return are the two major places. |
| 1107 | The expression tree has a field, |
| 1108 | .I rall , |
| 1109 | that may be filled with a register number; this is taken |
| 1110 | to be a preferred register, and the first temporary |
| 1111 | register allocated by a template match will be this preferred one, if |
| 1112 | it is free. |
| 1113 | If not, no particular action is taken; this is just a heuristic. |
| 1114 | If no register preference is present, the field contains NOPREF. |
| 1115 | In some cases, the result must be placed in |
| 1116 | a given register, no matter what. |
| 1117 | The register number is placed in |
| 1118 | .I rall , |
| 1119 | and the mask MUSTDO is logically or'ed in with it. |
| 1120 | In this case, if the subtree is requested in a register, and comes |
| 1121 | back in a register other than the demanded one, it is moved |
| 1122 | by calling the routine |
| 1123 | .I rmove . |
| 1124 | If the target register for this move is busy, it is a compiler |
| 1125 | error. |
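The rall encoding can be modeled with a pair of macros; the constant values here are pure invention:

```c
/* Model of the rall field: NOPREF means no preference; a register
 * number with MUSTDO or'ed in is a demand, not a hint.  The values
 * of both constants are assumptions. */
#define NOPREF	0777
#define MUSTDO	010000

#define REGNUM(rall)	((rall) & ~MUSTDO)	/* the register itself */
#define ISMUST(rall)	(((rall) & MUSTDO) != 0)	/* demanded? */
```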
| 1126 | .PP |
| 1127 | Note that this mechanism is the only one that will ever cause a register-to-register |
| 1128 | move between scratch registers (unless such a move is buried in the depths of |
| 1129 | some template). |
| 1130 | This simplifies debugging. |
| 1131 | In some cases, there is a rather strange interaction between |
| 1132 | the register allocation and the Sethi-Ullman number; |
| 1133 | if there is an operator or situation requiring a |
| 1134 | particular register, the allocator and the Sethi-Ullman |
| 1135 | computation must conspire to ensure that the target |
| 1136 | register is not being used by some intermediate result of some far-removed computation. |
| 1137 | This is most easily done by making the special operation take |
| 1138 | all of the free registers, preventing any other partially-computed |
| 1139 | results from cluttering up the works. |
| 1140 | .SH |
| 1141 | Compiler Bugs |
| 1142 | .PP |
| 1143 | The portable compiler has an excellent record of generating correct code. |
| 1144 | The requirement for reasonable cooperation between the register allocation, |
| 1145 | Sethi-Ullman computation, rewriting rules, and templates builds quite a bit |
| 1146 | of redundancy into the compiling process. |
| 1147 | The effect of this is that, in a surprisingly short time, the compiler will |
| 1148 | start generating correct code for those |
| 1149 | programs that it can compile. |
| 1150 | The hard part of the job then becomes finding and |
| 1151 | eliminating those situations where the compiler refuses to |
| 1152 | compile a program because it knows it cannot do it right. |
| 1153 | For example, a template may simply be missing; this may either |
give a compiler error of the form ``no match for op ...'', or cause
| 1155 | the compiler to go into an infinite loop applying various rewriting rules. |
| 1156 | The compiler has a variable, |
| 1157 | .I nrecur , |
that is set to 0 at the beginning of each expression, and
| 1159 | incremented at key spots in the |
| 1160 | compilation process; if this parameter gets too large, the |
| 1161 | compiler decides that it is in a loop, and aborts. |
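The guard is trivial to model; the threshold value here is an assumption:

```c
/* nrecur is zeroed at the start of each expression and bumped at
 * key spots in compilation; past a threshold the compiler concludes
 * it is looping and aborts.  NRECUR's value is an assumption. */
#define NRECUR	1000

int nrecur;

int inloop(void)
{
	return ++nrecur > NRECUR;
}
```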
| 1162 | Loops are also characteristic of botches in the machine-dependent rewriting rules. |
| 1163 | Bad Sethi-Ullman computations usually cause the scratch registers |
| 1164 | to run out; this often means |
| 1165 | that the Sethi-Ullman number was underestimated, so |
| 1166 | .I store |
| 1167 | did not store something it should have; alternatively, |
| 1168 | it can mean that the rewriting rules were not smart enough to |
| 1169 | find the sequence that |
| 1170 | .I sucomp |
| 1171 | assumed would be used. |
| 1172 | .PP |
| 1173 | The best approach when a compiler error is detected involves several stages. |
| 1174 | First, try to get a small example program that steps on the bug. |
| 1175 | Second, turn on various debugging flags in the code generator, and follow the |
| 1176 | tree through the process of being matched and rewritten. |
| 1177 | Some flags of interest are |
| 1178 | \-e, which prints the expression tree, |
| 1179 | \-r, which gives information about the allocation of registers, |
| 1180 | \-a, which gives information about the performance of |
| 1181 | .I rallo , |
| 1182 | and \-o, which gives information about the behavior of |
| 1183 | .I order . |
| 1184 | This technique should allow most bugs to be found relatively quickly. |
| 1185 | .PP |
| 1186 | Unfortunately, finding the bug is usually not enough; it must also |
| 1187 | be fixed! |
| 1188 | The difficulty arises because a fix to the particular bug of interest tends |
| 1189 | to break other code that already works. |
| 1190 | Regression tests, tests that compare the performance of |
| 1191 | a new compiler against the performance of an older one, are very |
| 1192 | valuable in preventing major catastrophes. |
| 1193 | .SH |
| 1194 | Summary and Conclusion |
| 1195 | .PP |
| 1196 | The portable compiler has been a useful tool for providing C |
| 1197 | capability on a large number of diverse machines, |
| 1198 | and for testing a number of theoretical |
| 1199 | constructs in a practical setting. |
| 1200 | It has many blemishes, both in style and functionality. |
| 1201 | It has been applied to many more machines than first |
| 1202 | anticipated, of a much wider range than originally dreamed of. |
| 1203 | Its use has also spread much faster than expected, leaving parts of |
| 1204 | the compiler still somewhat raw in shape. |
| 1205 | .PP |
| 1206 | On the theoretical side, there is some hope that the |
| 1207 | skeleton of the |
| 1208 | .I sucomp |
| 1209 | routine could be generated for many machines directly from the |
| 1210 | templates; this would give a considerable boost |
| 1211 | to the portability and correctness of the compiler, |
| 1212 | but might affect tunability and code quality. |
| 1213 | There is also room for more optimization, both within |
| 1214 | .I optim |
| 1215 | and in the form of a portable ``peephole'' optimizer. |
| 1216 | .PP |
| 1217 | On the practical, development side, |
| 1218 | the compiler could probably be sped up and made smaller |
| 1219 | without doing too much violence to its basic structure. |
| 1220 | Parts of the compiler deserve to be rewritten; |
| 1221 | the initialization code, register allocation, and |
| 1222 | parser are prime candidates. |
| 1223 | It might be that doing some or all of the parsing |
| 1224 | with a recursive descent parser might |
| 1225 | save enough space and time to be worthwhile; |
| 1226 | it would certainly ease the problem of moving the |
| 1227 | compiler to an environment where |
| 1228 | .I Yacc |
| 1229 | is not already present. |
| 1230 | .PP |
| 1231 | Finally, I would like to thank the many people who have |
| 1232 | sympathetically, and even enthusiastically, helped me grapple |
| 1233 | with what has been a frustrating program to write, test, and install. |
| 1234 | D. M. Ritchie and E. N. Pinson provided needed early |
| 1235 | encouragement and philosophical guidance; |
| 1236 | M. E. Lesk, |
| 1237 | R. Muha, T. G. Peterson, |
| 1238 | G. Riddle, L. Rosler, |
| 1239 | R. W. Mitze, |
| 1240 | B. R. Rowland, |
| 1241 | S. I. Feldman, |
| 1242 | and |
| 1243 | T. B. London |
| 1244 | have all contributed ideas, gripes, and all, at one time or another, |
| 1245 | climbed ``into the pits'' with me to help debug. |
| 1246 | Without their help this effort would have not been possible; |
| 1247 | with it, it was often kind of fun. |
| 1248 | .sp 100 |
| 1249 | .LP |
| 1250 | .[ |
| 1251 | $LIST$ |
| 1252 | .] |
| 1253 | .LP |
| 1254 | .sp 100 |
| 1255 | .LP |