[unix-history] / usr / src / usr.bin / gprof / PSD.doc / present.me

.\"	@(#)present.me	1.12	(Berkeley)	%G%
.\"
.sh 1 "Data Presentation"
.pp
The data is presented to the user in two different formats.
The first presentation simply lists the routines
without regard to the amount of time their descendants use.
The second presentation incorporates the call graph of the
program.
.sh 2 "The Flat Profile
.pp
The flat profile consists of a list of all the routines 
that are called during execution of the program,
with the count of the number of times they are called
and the number of seconds of execution time for which they
are themselves accountable.
The routines are listed in decreasing order of execution time.
A list of the routines that are never called during execution of
the program is also available
to verify that nothing important is omitted by
this execution.
The flat profile gives a quick overview of the routines that are used,
and shows the routines that are themselves responsible
for large fractions of the execution time.
In practice,
this profile usually shows that no single function
is overwhelmingly responsible for 
the total time of the program.
Notice that for this profile,
the individual times sum to the total execution time.
.sh 2 "The Call Graph Profile"
.sz 10
.(z
.TS
box center;
c c c c c l l
c c c c c l l
c c c c c l l
l n n n c l l.
				called/total	\ \ parents
index	%time	self	descendants	called+self	name	index
				called/total	\ \ children
_
		0.20	1.20	4/10	\ \ \s-1CALLER1\s+1	[7]
		0.30	1.80	6/10	\ \ \s-1CALLER2\s+1	[1]
[2]	41.5	0.50	3.00	10+4	\s-1EXAMPLE\s+1	[2]
		1.50	1.00	20/40	\ \ \s-1SUB1\s+1 <cycle1>	[4]
		0.00	0.50	1/5	\ \ \s-1SUB2\s+1 	[9]
		0.00	0.00	0/5	\ \ \s-1SUB3\s+1 	[11]
.TE
.ce 2
Profile entry for \s-1EXAMPLE\s+1.
Figure 4.
.)z
.pp
Ideally, we would like to print the call graph of the program,
but we are limited by the two-dimensional nature of our output
devices.
We cannot assume that a call graph is planar,
and even if it is, that we can print a planar version of it.
Instead, we choose to list each routine,
together with information about
the routines that are its direct parents and children.
This listing presents a window into the call graph.
Based on our experience,
both parent information and child information
is important,
and should be available without searching
through the output.
.pp
The major entries of the call graph profile are the entries from the
flat profile, augmented by the time propagated to each 
routine from its descendants.
This profile is sorted by the sum of the time for the routine
itself plus the time inherited from its descendants.
The profile shows which of the higher level routines 
spend large portions of the total execution time 
in the routines that they call.
For each routine, we show the amount of time passed by each child
to the routine, which includes time for the child itself
and for the descendants of the child
(and thus the descendants of the routine).
We also show the percentage these times represent of the total time
accounted to the child.
Similarly, the parents of each routine are listed, 
along with time,
and percentage of total routine time,
propagated to each one.
.pp
Cycles are handled as single entities.
The cycle as a whole is shown as though it were a single routine,
except that members of the cycle are listed in place of the children.
Although the number of calls of each member
from within the cycle are shown,
they do not affect time propagation.
When a child is a member of a cycle,
the time shown is the appropriate fraction of the time
for the whole cycle.
Self-recursive routines have their calls broken
down into calls from the outside and self-recursive calls.
Only the outside calls affect the propagation of time.
.pp
The following example is a typical fragment of a call graph.
.(b
.so pres1.pic
.)b
The entry in the call graph profile listing for this example is
shown in Figure 4.
.pp
The entry is for routine \s-1EXAMPLE\s+1, which has
the Caller routines as its parents,
and the Sub routines as its children.
The reader should keep in mind that all information
is given \fIwith respect to \s-1EXAMPLE\s+1\fP.
The index in the first column shows that \s-1EXAMPLE\s+1
is the second entry in the profile listing.
The \s-1EXAMPLE\s+1 routine is called ten times, four times by \s-1CALLER1\s+1,
and six times by \s-1CALLER2\s+1.
Consequently 40% of \s-1EXAMPLE\s+1's time is propagated to \s-1CALLER1\s+1,
and 60% of \s-1EXAMPLE\s+1's time is propagated to \s-1CALLER2\s+1.
The self and descendant fields of the parents
show the amount of self and descendant time \s-1EXAMPLE\s+1
propagates to them (but not the time used by
the parents directly).
Note that \s-1EXAMPLE\s+1 calls itself recursively four times.
The routine \s-1EXAMPLE\s+1 calls routine \s-1SUB1\s+1 twenty times, \s-1SUB2\s+1 once,
and never calls \s-1SUB3\s+1.
Since \s-1SUB2\s+1 is called a total of five times,
20% of its self and descendant time is propagated to \s-1EXAMPLE\s+1's
descendant time field.
Because \s-1SUB1\s+1 is a member of \fIcycle 1\fR,
the self and descendant times
and call count fraction
are those for the cycle as a whole.
Since cycle 1 is called a total of forty times
(not counting calls among members of the cycle),
it propagates 50% of the cycle's self and descendant
time to \s-1EXAMPLE\s+1's descendant time field.
Finally each name is followed by an index that shows
where on the listing to find the entry for that routine.
.sh 1 "Using the Profiles"
.pp
The profiler is a useful tool for improving
a set of routines that implement an abstraction.
It can be helpful in identifying poorly coded routines,
and in evaluating the new algorithms and code that replace them.
Taking full advantage of the profiler 
requires a careful examination of the call graph profile,
and a thorough knowledge of the abstractions underlying
the program.
.pp
The easiest optimization that can be performed
is a small change
to a control construct or data structure that improves the
running time of the program.
An obvious starting point
is a routine that is called many times.
For example, suppose an output 
routine is the only parent
of a routine that formats the data.
If this format routine is expanded inline in the
output routine, the overhead of a function call and
return can be saved for each datum that needs to be formatted.
.pp
The drawback to inline expansion is that the data abstractions
in the program may become less parameterized,
hence less clearly defined.
The profiling will also become less useful since the loss of 
routines will make its output more granular.
For example,
if the symbol table functions ``lookup'', ``insert'', and ``delete''
are all merged into a single parameterized routine,
it will be impossible to determine the costs
of any one of these individual functions from the profile.
.pp
Further potential for optimization lies in routines that
implement data abstractions whose total execution
time is long.
For example, a lookup routine might be called only a few
times, but use an inefficient linear search algorithm,
that might be replaced with a binary search.
Alternately, the discovery that a rehashing function is being
called excessively, can lead to a different
hash function or a larger hash table.
If the data abstraction function cannot easily be speeded up,
it may be advantageous to cache its results,
and eliminate the need to rerun
it for identical inputs.
These and other ideas for program improvement are discussed in
[Bentley81].
.pp
This tool is best used in an iterative approach:
profiling the program,
eliminating one bottleneck,
then finding some other part of the program
that begins to dominate execution time.
For instance, we have used \fBgprof\fR on itself;
eliminating, rewriting, and inline expanding routines,
until reading
data files (hardly a target for optimization!)
represents the dominating factor in its execution time.
.pp
Certain types of programs are not easily analyzed by \fBgprof\fR.
They are typified by programs that exhibit a large degree of 
recursion, such as recursive descent compilers.
The problem is that most of the major routines are grouped
into a single monolithic cycle.
As in the symbol table abstraction that is placed
in one routine,
it is impossible to distinguish which members of the cycle are
responsible for the execution time.
Unfortunately there are no easy modifications to these programs that
make them amenable to analysis.
.pp
A completely different use of the profiler is to analyze the control
flow of an unfamiliar program.
If you receive a program from another user that you need to modify
in some small way,
it is often unclear where the changes need to be made.
By running the program on an example and then using \fBgprof\fR,
you can get a view of the structure of the program.
.pp
Consider an example in which you need to change the output format
of the program.
For purposes of this example suppose that the call graph
of the output portion of the program has the following structure:
.(b
.so pres2.pic
.)b
Initially you look through the \fBgprof\fR
output for the system call ``\s-1WRITE\s+1''.
The format routine you will need to change is probably
among the parents of the ``\s-1WRITE\s+1'' procedure.
The next step is to look at the profile entry for each
of parents of ``\s-1WRITE\s+1'',
in this example either ``\s-1FORMAT1\s+1'' or ``\s-1FORMAT2\s+1'',
to determine which one to change.
Each format routine will have one or more parents,
in this example ``\s-1CALC1\s+1'', ``\s-1CALC2\s+1'', and ``\s-1CALC3\s+1''.
By inspecting the source code for each of these routines
you can determine which format routine generates the output that
you wish to modify.
Since the \fBgprof\fR entry shows all the
potential calls to the format routine you intend to change,
you can determine if your modifications will affect output that
should be left alone.
If you desire to change the output of ``\s-1CALC2\s+1'', but not ``\s-1CALC3\s+1'',
then formatting routine ``\s-1FORMAT2\s+1'' needs to be split
into two separate routines,
one of which implements the new format.
You can then retarget just the call by ``\s-1CALC2\s+1''
that needs the new format.
It should be noted that the static call information is particularly
useful here since the test case you run probably will not
exercise the entire program.
.sh 1 "Conclusions"
.pp
We have created a profiler that aids in the evaluation
of modular programs.
For each routine in the program,
the profile shows the extent to which that routine
helps support various abstractions,
and how that routine uses other abstractions.
The profile accurately assesses the cost of routines
at all levels of the program decomposition.
The profiler is easily used,
and can be compiled into the program without any prior planning by
the programmer.
It adds only five to thirty percent execution overhead to the program
being profiled,
produces no additional output until after the program finishes,
and allows the program to be measured in its actual environment.
Finally, the profiler runs on a time-sharing system 
using only the normal services provided by the operating system
and compilers.
Commit	Line	Data
73772756 KM	1	.\" @(#)present.me 1.12 (Berkeley) %G%
73772756 KM	2	.\"
443c8968 PK	3	.sh 1 "Data Presentation"
	4	.pp
	5	The data is presented to the user in two different formats.
	6	The first presentation simply lists the routines
	7	without regard to the amount of time their descendants use.
	8	The second presentation incorporates the call graph of the
	9	program.
	10	.sh 2 "The Flat Profile
	11	.pp
	12	The flat profile consists of a list of all the routines
	13	that are called during execution of the program,
	14	with the count of the number of times they are called
	15	and the number of seconds of execution time for which they
	16	are themselves accountable.
	17	The routines are listed in decreasing order of execution time.
	18	A list of the routines that are never called during execution of
	19	the program is also available
	20	to verify that nothing important is omitted by
	21	this execution.
	22	The flat profile gives a quick overview of the routines that are used,
	23	and shows the routines that are themselves responsible
	24	for large fractions of the execution time.
	25	In practice,
cde52728	26	this profile usually shows that no single function
443c8968 PK	27	is overwhelmingly responsible for
	28	the total time of the program.
	29	Notice that for this profile,
	30	the individual times sum to the total execution time.
	31	.sh 2 "The Call Graph Profile"
627116d0 PK	32	.sz 10
	33	.(z
	34	.TS
	35	box center;
	36	c c c c c l l
	37	c c c c c l l
	38	c c c c c l l
	39	l n n n c l l.
	40	called/total \ \ parents
	41	index %time self descendants called+self name index
	42	called/total \ \ children
	43	_
	44	0.20 1.20 4/10 \ \ \s-1CALLER1\s+1 [7]
	45	0.30 1.80 6/10 \ \ \s-1CALLER2\s+1 [1]
	46	[2] 41.5 0.50 3.00 10+4 \s-1EXAMPLE\s+1 [2]
	47	1.50 1.00 20/40 \ \ \s-1SUB1\s+1 <cycle1> [4]
	48	0.00 0.50 1/5 \ \ \s-1SUB2\s+1 [9]
	49	0.00 0.00 0/5 \ \ \s-1SUB3\s+1 [11]
	50	.TE
	51	.ce 2
	52	Profile entry for \s-1EXAMPLE\s+1.
	53	Figure 4.
	54	.)z
443c8968 PK	55	.pp
	56	Ideally, we would like to print the call graph of the program,
	57	but we are limited by the two-dimensional nature of our output
	58	devices.
	59	We cannot assume that a call graph is planar,
	60	and even if it is, that we can print a planar version of it.
4fc97ef0	61	Instead, we choose to list each routine,
443c8968 PK	62	together with information about
	63	the routines that are its direct parents and children.
	64	This listing presents a window into the call graph.
	65	Based on our experience,
	66	both parent information and child information
	67	is important,
	68	and should be available without searching
	69	through the output.
	70	.pp
	71	The major entries of the call graph profile are the entries from the
	72	flat profile, augmented by the time propagated to each
	73	routine from its descendants.
	74	This profile is sorted by the sum of the time for the routine
	75	itself plus the time inherited from its descendants.
	76	The profile shows which of the higher level routines
	77	spend large portions of the total execution time
	78	in the routines that they call.
	79	For each routine, we show the amount of time passed by each child
	80	to the routine, which includes time for the child itself
	81	and for the descendants of the child
	82	(and thus the descendants of the routine).
	83	We also show the percentage these times represent of the total time
	84	accounted to the child.
	85	Similarly, the parents of each routine are listed,
	86	along with time,
	87	and percentage of total routine time,
	88	propagated to each one.
	89	.pp
	90	Cycles are handled as single entities.
	91	The cycle as a whole is shown as though it were a single routine,
	92	except that members of the cycle are listed in place of the children.
	93	Although the number of calls of each member
	94	from within the cycle are shown,
	95	they do not affect time propagation.
	96	When a child is a member of a cycle,
	97	the time shown is the appropriate fraction of the time
	98	for the whole cycle.
	99	Self-recursive routines have their calls broken
c962f264	100	down into calls from the outside and self-recursive calls.
443c8968 PK	101	Only the outside calls affect the propagation of time.
443c8968 PK	102	.pp
ceba7594	103	The following example is a typical fragment of a call graph.
443c8968	104	.(b
73772756	105	.so pres1.pic
443c8968	106	.)b
19435cbf	107	The entry in the call graph profile listing for this example is
627116d0	108	shown in Figure 4.
443c8968	109	.pp
627116d0	110	The entry is for routine \s-1EXAMPLE\s+1, which has
443c8968 PK	111	the Caller routines as its parents,
	112	and the Sub routines as its children.
	113	The reader should keep in mind that all information
627116d0 PK	114	is given \fIwith respect to \s-1EXAMPLE\s+1\fP.
627116d0 PK	115	The index in the first column shows that \s-1EXAMPLE\s+1
443c8968	116	is the second entry in the profile listing.
627116d0 PK	117	The \s-1EXAMPLE\s+1 routine is called ten times, four times by \s-1CALLER1\s+1,
	118	and six times by \s-1CALLER2\s+1.
	119	Consequently 40% of \s-1EXAMPLE\s+1's time is propagated to \s-1CALLER1\s+1,
	120	and 60% of \s-1EXAMPLE\s+1's time is propagated to \s-1CALLER2\s+1.
443c8968	121	The self and descendant fields of the parents
627116d0	122	show the amount of self and descendant time \s-1EXAMPLE\s+1
443c8968 PK	123	propagates to them (but not the time used by
443c8968 PK	124	the parents directly).
627116d0 PK	125	Note that \s-1EXAMPLE\s+1 calls itself recursively four times.
	126	The routine \s-1EXAMPLE\s+1 calls routine \s-1SUB1\s+1 twenty times, \s-1SUB2\s+1 once,
	127	and never calls \s-1SUB3\s+1.
	128	Since \s-1SUB2\s+1 is called a total of five times,
	129	20% of its self and descendant time is propagated to \s-1EXAMPLE\s+1's
443c8968	130	descendant time field.
627116d0	131	Because \s-1SUB1\s+1 is a member of \fIcycle 1\fR,
4fc97ef0 PK	132	the self and descendant times
	133	and call count fraction
	134	are those for the cycle as a whole.
	135	Since cycle 1 is called a total of forty times
	136	(not counting calls among members of the cycle),
	137	it propagates 50% of the cycle's self and descendant
627116d0	138	time to \s-1EXAMPLE\s+1's descendant time field.
cde52728	139	Finally each name is followed by an index that shows
443c8968 PK	140	where on the listing to find the entry for that routine.
	141	.sh 1 "Using the Profiles"
	142	.pp
	143	The profiler is a useful tool for improving
	144	a set of routines that implement an abstraction.
	145	It can be helpful in identifying poorly coded routines,
	146	and in evaluating the new algorithms and code that replace them.
	147	Taking full advantage of the profiler
	148	requires a careful examination of the call graph profile,
	149	and a thorough knowledge of the abstractions underlying
	150	the program.
	151	.pp
cde52728	152	The easiest optimization that can be performed
443c8968 PK	153	is a small change
	154	to a control construct or data structure that improves the
	155	running time of the program.
	156	An obvious starting point
	157	is a routine that is called many times.
	158	For example, suppose an output
	159	routine is the only parent
	160	of a routine that formats the data.
	161	If this format routine is expanded inline in the
	162	output routine, the overhead of a function call and
	163	return can be saved for each datum that needs to be formatted.
	164	.pp
	165	The drawback to inline expansion is that the data abstractions
	166	in the program may become less parameterized,
	167	hence less clearly defined.
	168	The profiling will also become less useful since the loss of
	169	routines will make its output more granular.
	170	For example,
	171	if the symbol table functions ``lookup'', ``insert'', and ``delete''
	172	are all merged into a single parameterized routine,
	173	it will be impossible to determine the costs
	174	of any one of these individual functions from the profile.
	175	.pp
	176	Further potential for optimization lies in routines that
	177	implement data abstractions whose total execution
	178	time is long.
	179	For example, a lookup routine might be called only a few
	180	times, but use an inefficient linear search algorithm,
	181	that might be replaced with a binary search.
	182	Alternately, the discovery that a rehashing function is being
	183	called excessively, can lead to a different
	184	hash function or a larger hash table.
	185	If the data abstraction function cannot easily be speeded up,
	186	it may be advantageous to cache its results,
	187	and eliminate the need to rerun
	188	it for identical inputs.
dc2d6ac1 PK	189	These and other ideas for program improvement are discussed in
dc2d6ac1 PK	190	[Bentley81].
443c8968	191	.pp
c962f264	192	This tool is best used in an iterative approach:
443c8968 PK	193	profiling the program,
	194	eliminating one bottleneck,
	195	then finding some other part of the program
	196	that begins to dominate execution time.
	197	For instance, we have used \fBgprof\fR on itself;
	198	eliminating, rewriting, and inline expanding routines,
	199	until reading
	200	data files (hardly a target for optimization!)
	201	represents the dominating factor in its execution time.
	202	.pp
	203	Certain types of programs are not easily analyzed by \fBgprof\fR.
	204	They are typified by programs that exhibit a large degree of
	205	recursion, such as recursive descent compilers.
	206	The problem is that most of the major routines are grouped
	207	into a single monolithic cycle.
cde52728	208	As in the symbol table abstraction that is placed
443c8968 PK	209	in one routine,
	210	it is impossible to distinguish which members of the cycle are
	211	responsible for the execution time.
c962f264	212	Unfortunately there are no easy modifications to these programs that
443c8968 PK	213	make them amenable to analysis.
	214	.pp
	215	A completely different use of the profiler is to analyze the control
	216	flow of an unfamiliar program.
	217	If you receive a program from another user that you need to modify
	218	in some small way,
	219	it is often unclear where the changes need to be made.
	220	By running the program on an example and then using \fBgprof\fR,
	221	you can get a view of the structure of the program.
ceba7594	222	.pp
443c8968 PK	223	Consider an example in which you need to change the output format
	224	of the program.
	225	For purposes of this example suppose that the call graph
	226	of the output portion of the program has the following structure:
	227	.(b
73772756	228	.so pres2.pic
443c8968 PK	229	.)b
443c8968 PK	230	Initially you look through the \fBgprof\fR
627116d0	231	output for the system call ``\s-1WRITE\s+1''.
443c8968	232	The format routine you will need to change is probably
627116d0	233	among the parents of the ``\s-1WRITE\s+1'' procedure.
443c8968	234	The next step is to look at the profile entry for each
627116d0 PK	235	of parents of ``\s-1WRITE\s+1'',
627116d0 PK	236	in this example either ``\s-1FORMAT1\s+1'' or ``\s-1FORMAT2\s+1'',
443c8968 PK	237	to determine which one to change.
443c8968 PK	238	Each format routine will have one or more parents,
627116d0	239	in this example ``\s-1CALC1\s+1'', ``\s-1CALC2\s+1'', and ``\s-1CALC3\s+1''.
443c8968 PK	240	By inspecting the source code for each of these routines
	241	you can determine which format routine generates the output that
	242	you wish to modify.
cde52728	243	Since the \fBgprof\fR entry shows all the
443c8968 PK	244	potential calls to the format routine you intend to change,
	245	you can determine if your modifications will affect output that
	246	should be left alone.
627116d0 PK	247	If you desire to change the output of ``\s-1CALC2\s+1'', but not ``\s-1CALC3\s+1'',
627116d0 PK	248	then formatting routine ``\s-1FORMAT2\s+1'' needs to be split
443c8968 PK	249	into two separate routines,
443c8968 PK	250	one of which implements the new format.
627116d0	251	You can then retarget just the call by ``\s-1CALC2\s+1''
443c8968 PK	252	that needs the new format.
443c8968 PK	253	It should be noted that the static call information is particularly
cde52728	254	useful here since the test case you run probably will not
443c8968 PK	255	exercise the entire program.
	256	.sh 1 "Conclusions"
	257	.pp
	258	We have created a profiler that aids in the evaluation
	259	of modular programs.
	260	For each routine in the program,
	261	the profile shows the extent to which that routine
	262	helps support various abstractions,
	263	and how that routine uses other abstractions.
	264	The profile accurately assesses the cost of routines
	265	at all levels of the program decomposition.
	266	The profiler is easily used,
cde52728 PK	267	and can be compiled into the program without any prior planning by
cde52728 PK	268	the programmer.
bfb7a486	269	It adds only five to thirty percent execution overhead to the program
ceba7594	270	being profiled,
bfb7a486 KM	271	produces no additional output until after the program finishes,
bfb7a486 KM	272	and allows the program to be measured in its actual environment.
443c8968	273	Finally, the profiler runs on a time-sharing system
6b7f5073 KM	274	using only the normal services provided by the operating system
6b7f5073 KM	275	and compilers.