[unix-history] / usr / src / usr.bin / gprof / PSD.doc / gathering.me

\"	@(#)gathering.me	1.7 %G%
.sh 1 "Gathering Profile Data"
.pp
Routine calls or statement executions can be measured by having a
compiler augment the code at strategic points.
The additions can be inline increments to counters [Knuth71]
[Satterthwaite72] [Joy79] or calls to
monitoring routines [Unix].
The counter increment overhead is low, and is suitable for
profiling statements.
A call of the monitoring routine has an overhead comparable with a
call of a regular routine, and is therefore only suited
to profiling on a routine by routine basis.
However, the monitoring routine solution has certain advantages.
Whatever counters are needed by the monitoring routine can be
managed by the monitoring routine itself, rather than being
distributed around the code.
In particular, a monitoring routine can easily be called from separately
compiled programs.
In addition, different monitoring routines can be linked into the
program
being measured
to assemble different profiling data without having to
change the compiler or recompile the program.
We have exploited this approach;
our compilers for C, Fortran77, and Pascal can insert calls to a
monitoring routine in the prologue for each routine.
Use of the monitoring routine requires no planning on part of a
programmer other than to request that augmented routine
prologues be produced during compilation.
.pp
We are interested in gathering three pieces of information during
program execution: call counts and execution times for 
each profiled routine, and the arcs of the dynamic call graph
traversed by this execution of the program.
By post-processing of this data we can build the dynamic call
graph for this execution of the program and propagate times along
the edges of this graph to attribute times for routines to the
routines that invoke them.
.pp
Gathering of the profiling information should not greatly
interfere with the running of the program.
Thus, the monitoring routine must not produce trace output each
time it is invoked.
The volume of data thus produced would be unmanageably large,
and the time required to record it would overwhelm the running
time of most programs.
Similarly, the monitoring routine can not do the analysis of
the profiling data (e.g. assembling the call graph, propagating
times around it, discovering cycles, etc.) during program
execution.
Our solution is to gather profiling data in memory during program
execution and to condense it to a file as the profiled
program exits.
This file is then processed by a separate program to produce the
listing of the profile data.
An advantage of this approach is that the profile data for
several executions of a program can be combined by the
post-processing to provide a profile of many
executions.
.pp
The execution time monitoring consists of three parts.
The first part allocates and initializes the runtime monitoring data
structures before the program begins execution.
The second part is the monitoring routine invoked from the
prologue of each profiled routine.
The third part condenses the data structures and writes them
to a file as the program terminates.
The monitoring routine is discussed in detail in the following sections.
.sh 2 "Execution Counts"
.pp
The \fBgprof\fP monitoring routine counts the number of times
each profiled routine is called.
The monitoring routine also records the arc in the call graph
that activated the profiled routine.
The count is associated with the arc in the call graph
rather than with the routine.
Call counts for routines can then be determined by summing the counts
on arcs directed into that routine.
In a machine-dependent fashion, the monitoring routine notes its
own return address.
This address is in the prologue of some profiled routine that is
the destination of an arc in the dynamic call graph.
The monitoring routine also discovers the return address for that
routine, thus identifying the call site, or source of the arc.
The source of the arc is in the \fIcaller\fP, and the destination is in
the \fIcallee\fP.
For example, if a routine A calls a routine B, A is the caller, 
and B is the callee.
The prologue of B will include a call to the monitoring routine
that will note the arc from A to B and either initialize or
increment a counter for that arc.
.pp
One can not afford to have the monitoring routine output tracing
information as each arc is identified.
Therefore, the monitoring routine maintains a table of all the
arcs discovered, with counts of the numbers of times each is
traversed during execution.
This table is accessed once per routine call.
Access to it
must be as fast as possible so as not to overwhelm the time
required to execute the program.
.pp
Our solution is to access the table through a hash table.
We use the call site as the primary key with the callee
address being the secondary key.
Since each call site typically calls only one callee, we can
reduce (usually to one) the number of minor lookups based on the callee.
Another alternative would use the callee as the primary key and the
call site as the secondary key.
Such an organization has the advantage of associating callers with
callees, at the expense of longer lookups in the monitoring
routine.
We are fortunate to be running in a virtual memory environment,
and (for the sake of speed) were able to allocate enough space
for the primary hash table to allow a one-to-one mapping from
call site addresses to the primary hash table.
Thus our hash function is trivial to calculate and collisions
occur only for call sites that call multiple
destinations (e.g. functional parameters and functional variables).
A one level hash function using both call site and callee would
result in an unreasonably large hash table.
Further, the number of dynamic call sites and callees is not known during
execution of the profiled program.
.pp
Not all callers and callees can be identified by the monitoring
routine.
Routines that were compiled without the profiling augmentations
will not call the monitoring routine as part of their prologue,
and thus no arcs will be recorded whose destinations are in these
routines.
One need not profile all the routines in a program.
Routines that are not profiled run at full speed.
Certain routines, notably exception handlers, are invoked by
non-standard calling sequences.
Thus the monitoring routine may know the destination of an arc
(the callee),
but find it difficult or
impossible to determine the source of the arc (the caller).
Often in these cases the apparent source of the arc is not a call
site at all.
Such anomalous invocations are declared ``spontaneous''.
.sh 2 "Execution Times"
.pp
The execution times for routines can be gathered in at least two
ways.
One method measures the execution time of a routine by measuring
the elapsed time from routine entry to routine exit.
Unfortunately, time measurement is complicated on time-sharing
systems by the time-slicing of the program.
A second method samples the value of the program counter at some
interval, and infers execution time from the distribution of the
samples within the program.
This technique is particularly suited to time-sharing systems,
where the time-slicing can serve as the basis for sampling
the program counter.
Notice that, whereas the first method could provide exact timings,
the second is inherently a statistical approximation.
.pp
The sampling method need not require support from the operating
system:  all that is needed is the ability to set and respond to
``alarm clock'' interrupts that run relative to program time.
It is imperative that the intervals be uniform since the
sampling of the program counter rather than the duration of the
interval is the basis of the distribution.
If sampling is done too often, the interruptions to sample the
program counter will overwhelm the running of the profiled program.
On the other hand, the program must run for enough sampled
intervals that the distribution of the samples accurately
represents the distribution of time for the execution of the
program.
As with routine call tracing, the monitoring routine can not
afford to output information for each program counter
sample.
In our computing environment, the operating system can provide a
histogram of the location of the program counter at the end of
each clock tick (1/60th of a second) in which a program runs.
The histogram is assembled in memory as the program runs.
This facility is enabled by our monitoring routine.
We have adjusted the granularity of the histogram so that
program counter values map one-to-one onto the histogram.
We make the simplifying assumption that all calls to a specific
routine require the same amount of time to execute.
This assumption may disguise that some calls
(or worse, some call sites) always invoke a routine
such that its execution is faster (or slower)
than the average time for that routine.
.pp
When the profiled program terminates, 
the arc table and the histogram of
program counter samples is written to a file.
The arc table is condensed to consist of the source and destination
addresses of the arc and the count of the number of times the arc
was traversed by this execution of the program.
The recorded histogram consists of counters of the number of
times the program counter was found to be in each of the ranges covered
by the histogram.
The ranges themselves are summarized as a
lower and upper bound and a step size.
Commit	Line	Data
607a525f	1	\" @(#)gathering.me 1.7 %G%
9107ec01 PK	2	.sh 1 "Gathering Profile Data"
	3	.pp
	4	Routine calls or statement executions can be measured by having a
	5	compiler augment the code at strategic points.
	6	The additions can be inline increments to counters [Knuth71]
	7	[Satterthwaite72] [Joy79] or calls to
	8	monitoring routines [Unix].
	9	The counter increment overhead is low, and is suitable for
602c58f5	10	profiling statements.
ed1b79b0 PK	11	A call of the monitoring routine has an overhead comparable with a
ed1b79b0 PK	12	call of a regular routine, and is therefore only suited
602c58f5	13	to profiling on a routine by routine basis.
9107ec01 PK	14	However, the monitoring routine solution has certain advantages.
	15	Whatever counters are needed by the monitoring routine can be
	16	managed by the monitoring routine itself, rather than being
	17	distributed around the code.
ed1b79b0 PK	18	In particular, a monitoring routine can easily be called from separately
ed1b79b0 PK	19	compiled programs.
9107ec01	20	In addition, different monitoring routines can be linked into the
607a525f PK	21	program
	22	being measured
	23	to assemble different profiling data without having to
9107ec01 PK	24	change the compiler or recompile the program.
	25	We have exploited this approach;
	26	our compilers for C, Fortran77, and Pascal can insert calls to a
	27	monitoring routine in the prologue for each routine.
130ed122 PK	28	Use of the monitoring routine requires no planning on part of a
	29	programmer other than to request that augmented routine
	30	prologues be produced during compilation.
9107ec01 PK	31	.pp
	32	We are interested in gathering three pieces of information during
	33	program execution: call counts and execution times for
	34	each profiled routine, and the arcs of the dynamic call graph
	35	traversed by this execution of the program.
bded3f9b	36	By post-processing of this data we can build the dynamic call
9107ec01 PK	37	graph for this execution of the program and propagate times along
	38	the edges of this graph to attribute times for routines to the
	39	routines that invoke them.
	40	.pp
602c58f5	41	Gathering of the profiling information should not greatly
9107ec01 PK	42	interfere with the running of the program.
	43	Thus, the monitoring routine must not produce trace output each
	44	time it is invoked.
	45	The volume of data thus produced would be unmanageably large,
	46	and the time required to record it would overwhelm the running
	47	time of most programs.
bded3f9b	48	Similarly, the monitoring routine can not do the analysis of
9107ec01 PK	49	the profiling data (e.g. assembling the call graph, propagating
	50	times around it, discovering cycles, etc.) during program
	51	execution.
	52	Our solution is to gather profiling data in memory during program
	53	execution and to condense it to a file as the profiled
	54	program exits.
	55	This file is then processed by a separate program to produce the
	56	listing of the profile data.
	57	An advantage of this approach is that the profile data for
	58	several executions of a program can be combined by the
bded3f9b	59	post-processing to provide a profile of many
ed1b79b0	60	executions.
9107ec01 PK	61	.pp
	62	The execution time monitoring consists of three parts.
	63	The first part allocates and initializes the runtime monitoring data
	64	structures before the program begins execution.
9107ec01 PK	65	The second part is the monitoring routine invoked from the
9107ec01 PK	66	prologue of each profiled routine.
ed1b79b0 PK	67	The third part condenses the data structures and writes them
ed1b79b0 PK	68	to a file as the program terminates.
9107ec01 PK	69	The monitoring routine is discussed in detail in the following sections.
	70	.sh 2 "Execution Counts"
	71	.pp
602c58f5 PK	72	The \fBgprof\fP monitoring routine counts the number of times
602c58f5 PK	73	each profiled routine is called.
9107ec01 PK	74	The monitoring routine also records the arc in the call graph
9107ec01 PK	75	that activated the profiled routine.
bded3f9b	76	The count is associated with the arc in the call graph
9107ec01 PK	77	rather than with the routine.
	78	Call counts for routines can then be determined by summing the counts
	79	on arcs directed into that routine.
	80	In a machine-dependent fashion, the monitoring routine notes its
	81	own return address.
	82	This address is in the prologue of some profiled routine that is
602c58f5	83	the destination of an arc in the dynamic call graph.
9107ec01	84	The monitoring routine also discovers the return address for that
602c58f5 PK	85	routine, thus identifying the call site, or source of the arc.
602c58f5 PK	86	The source of the arc is in the \fIcaller\fP, and the destination is in
9107ec01 PK	87	the \fIcallee\fP.
	88	For example, if a routine A calls a routine B, A is the caller,
	89	and B is the callee.
	90	The prologue of B will include a call to the monitoring routine
	91	that will note the arc from A to B and either initialize or
	92	increment a counter for that arc.
	93	.pp
	94	One can not afford to have the monitoring routine output tracing
	95	information as each arc is identified.
	96	Therefore, the monitoring routine maintains a table of all the
	97	arcs discovered, with counts of the numbers of times each is
	98	traversed during execution.
607a525f PK	99	This table is accessed once per routine call.
607a525f PK	100	Access to it
9107ec01 PK	101	must be as fast as possible so as not to overwhelm the time
	102	required to execute the program.
	103	.pp
	104	Our solution is to access the table through a hash table.
	105	We use the call site as the primary key with the callee
	106	address being the secondary key.
	107	Since each call site typically calls only one callee, we can
	108	reduce (usually to one) the number of minor lookups based on the callee.
ed1b79b0	109	Another alternative would use the callee as the primary key and the
9107ec01 PK	110	call site as the secondary key.
	111	Such an organization has the advantage of associating callers with
	112	callees, at the expense of longer lookups in the monitoring
	113	routine.
	114	We are fortunate to be running in a virtual memory environment,
	115	and (for the sake of speed) were able to allocate enough space
	116	for the primary hash table to allow a one-to-one mapping from
	117	call site addresses to the primary hash table.
607a525f PK	118	Thus our hash function is trivial to calculate and collisions
	119	occur only for call sites that call multiple
	120	destinations (e.g. functional parameters and functional variables).
9107ec01 PK	121	A one level hash function using both call site and callee would
9107ec01 PK	122	result in an unreasonably large hash table.
0020feb2	123	Further, the number of dynamic call sites and callees is not known during
9107ec01 PK	124	execution of the profiled program.
	125	.pp
	126	Not all callers and callees can be identified by the monitoring
	127	routine.
	128	Routines that were compiled without the profiling augmentations
	129	will not call the monitoring routine as part of their prologue,
607a525f	130	and thus no arcs will be recorded whose destinations are in these
9107ec01 PK	131	routines.
	132	One need not profile all the routines in a program.
	133	Routines that are not profiled run at full speed.
	134	Certain routines, notably exception handlers, are invoked by
	135	non-standard calling sequences.
607a525f PK	136	Thus the monitoring routine may know the destination of an arc
607a525f PK	137	(the callee),
ed1b79b0	138	but find it difficult or
607a525f PK	139	impossible to determine the source of the arc (the caller).
607a525f PK	140	Often in these cases the apparent source of the arc is not a call
9107ec01	141	site at all.
ed1b79b0	142	Such anomalous invocations are declared ``spontaneous''.
9107ec01 PK	143	.sh 2 "Execution Times"
	144	.pp
	145	The execution times for routines can be gathered in at least two
	146	ways.
	147	One method measures the execution time of a routine by measuring
	148	the elapsed time from routine entry to routine exit.
ed1b79b0	149	Unfortunately, time measurement is complicated on time-sharing
602c58f5	150	systems by the time-slicing of the program.
0020feb2	151	A second method samples the value of the program counter at some
9107ec01 PK	152	interval, and infers execution time from the distribution of the
9107ec01 PK	153	samples within the program.
ed1b79b0	154	This technique is particularly suited to time-sharing systems,
602c58f5	155	where the time-slicing can serve as the basis for sampling
9107ec01 PK	156	the program counter.
	157	Notice that, whereas the first method could provide exact timings,
	158	the second is inherently a statistical approximation.
	159	.pp
	160	The sampling method need not require support from the operating
	161	system: all that is needed is the ability to set and respond to
	162	``alarm clock'' interrupts that run relative to program time.
	163	It is imperative that the intervals be uniform since the
	164	sampling of the program counter rather than the duration of the
	165	interval is the basis of the distribution.
	166	If sampling is done too often, the interruptions to sample the
	167	program counter will overwhelm the running of the profiled program.
	168	On the other hand, the program must run for enough sampled
	169	intervals that the distribution of the samples accurately
	170	represents the distribution of time for the execution of the
	171	program.
	172	As with routine call tracing, the monitoring routine can not
	173	afford to output information for each program counter
	174	sample.
	175	In our computing environment, the operating system can provide a
	176	histogram of the location of the program counter at the end of
	177	each clock tick (1/60th of a second) in which a program runs.
	178	The histogram is assembled in memory as the program runs.
	179	This facility is enabled by our monitoring routine.
	180	We have adjusted the granularity of the histogram so that
	181	program counter values map one-to-one onto the histogram.
602c58f5 PK	182	We make the simplifying assumption that all calls to a specific
602c58f5 PK	183	routine require the same amount of time to execute.
bded3f9b	184	This assumption may disguise that some calls
602c58f5 PK	185	(or worse, some call sites) always invoke a routine
	186	such that its execution is faster (or slower)
	187	than the average time for that routine.
9107ec01	188	.pp
607a525f	189	When the profiled program terminates,
9107ec01 PK	190	the arc table and the histogram of
9107ec01 PK	191	program counter samples is written to a file.
607a525f	192	The arc table is condensed to consist of the source and destination
9107ec01 PK	193	addresses of the arc and the count of the number of times the arc
	194	was traversed by this execution of the program.
	195	The recorded histogram consists of counters of the number of
	196	times the program counter was found to be in each of the ranges covered
	197	by the histogram.
ed1b79b0 PK	198	The ranges themselves are summarized as a
ed1b79b0 PK	199	lower and upper bound and a step size.