% Master File: mprof.tex
% Document type: LaTeX
\documentstyle[11pt,captions,doublespace,twoside]{article}

\input{defs}
\setstretch{1}

\fullpagestyle

\begin{document}

\title{A Memory Allocation Profiler for C and Lisp Programs
\footnotetext{This research
was funded by DARPA contract number N00039-85-C-0269 as part of the
SPUR research project.}}


\author{Benjamin Zorn \\
        Department of Computer Science \\
        University of Colorado at Boulder \\
        \\
        Paul Hilfinger \\
        Computer Science Division, Dept. of EECS \\
        University of California at Berkeley \\
       }

\date{}

\maketitle

\begin{singlespace}
\begin{abstract}
This paper describes {\tt mprof}, a tool used to study the dynamic
memory allocation behavior of programs. {\tt Mprof} records the
amount of memory that a function allocates, breaks down allocation
information by type and size, and displays a program's dynamic call
graph so that functions indirectly responsible for memory allocation
are easy to identify. {\tt Mprof} is a two-phase tool. The monitor
phase is linked into executing programs and records information each
time memory is allocated. The display phase reduces the data
generated by the monitor and presents the information to a user in
several tables. {\tt Mprof} has been implemented for C and Kyoto
Common Lisp. Measurements of these implementations are presented.
\end{abstract}
\end{singlespace}


\section{Introduction}
\setlength{\parskip}{6pt plus 2pt}

Dynamic memory allocation, hereafter referred to simply as memory
allocation, is an important part of many programs. By dynamic
allocation, we mean memory allocated from the heap. Unnecessary
allocation can decrease program locality and increase execution time,
both for the allocation itself and for any subsequent memory
reclamation. If reclamation is not performed, or if some objects are
accidentally not reclaimed (a ``memory leak''), programs can fail when
they reach the memory size limit. Programmers often write their own
versions of memory allocation routines to measure and reduce
allocation overhead. It is estimated that Mesa programmers spent 40\%
of their time solving storage-management related problems before
automatic storage reclamation techniques were introduced in
Cedar~\cite{rovner85:gc}. Even with automatic storage management, in
which reclamation occurs transparently, memory allocation has a strong
effect on the performance of programs~\cite{moon84:gc}. Although
memory allocation is important, few software tools exist to help
programmers understand the memory allocation behavior of their
programs.

{\tt Mprof} is a tool that allows programmers to identify where and
why dynamic memory is allocated in a program. It records which
functions are directly responsible for memory allocation and also
records the dynamic call chain at an allocation to show which
functions were indirectly responsible for the allocation. The design
of {\tt mprof} was inspired by the tool {\tt gprof}, a dynamic
execution profiler~\cite{graham83:gprof}. {\tt gprof} is a very
useful tool for understanding the execution behavior of programs; {\tt
mprof} extends the ideas of {\tt gprof} to give programmers
information about the dynamic memory allocation behavior of their
programs. {\tt Mprof} is a two-phase tool. A monitor phase is linked
into an executing application and collects data as the application
executes. A reduction and display phase takes the data collected by
the monitor and presents it to the user in concise tables.

A profiling program such as {\tt mprof} should satisfy several
criteria. First, a program monitor should not significantly alter the
behavior of the program being monitored. In particular, a monitor
should not impose so much overhead on the program being monitored that
large programs cannot be profiled. Second, a monitor should be easy
to integrate into existing applications. To use {\tt mprof},
programmers simply have to relink their applications with a special
version of the system library. No source code modifications are
required. Finally, a monitor should provide a programmer with
information that he can understand and use to reduce the memory
allocation overhead of his programs. We will present an example that
illustrates such a use of {\tt mprof}.

In Section \ref{using mprof}, we present a simple program and
describe the use of {\tt mprof} with respect to the example. In
Section \ref{implementing mprof}, we discuss techniques for the
effective implementation of {\tt mprof}. Section \ref{measurements}
presents some measurements of {\tt mprof}. Section \ref{related work}
describes other memory profiling tools and previous work on which {\tt
mprof} is based, while Section \ref{conclusions} contains our
conclusions.

\section{Using {\tt mprof}}
\label{using mprof}

\subsection{A Producer/Consumer Example}

To illustrate how {\tt mprof} helps a programmer understand the memory
allocation in his program, consider the C program in Figure~\ref{C
example figure}. In this program, a simplified producer/consumer
simulation, objects are randomly allocated by two producers and freed
by the consumer. The function {\tt random\_flip}, which is not shown,
randomly returns 1 or 0 with equal probability. The function {\tt
consume\_widget}, which is responsible for freeing the memory
allocated, contains a bug and does not free red widgets. If the
simulation ran for a long time, memory would eventually be exhausted,
and the program would fail.

\begin{figure}[htbp]
\begin{singlespace}
{\footnotesize
\include{expr}
}
\end{singlespace}
\caption{A Simple Producer/Consumer Simulation Program}
\label{C example figure}
\end{figure}

In the example, {\tt make\_widget} is the only function that allocates
memory directly. To fully understand the allocation behavior of the
program, we must know which functions called {\tt make\_widget} and
hence were indirectly responsible for memory allocation. {\tt Mprof}
provides this information.
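The figure itself is included from a separate source file. For readers following along, a minimal sketch consistent with the description above might look like the following; the widget layout, payload size, and iteration count are our assumptions, and only the function names and the bug in {\tt consume\_widget} come from the text.

```c
#include <stdlib.h>

/* Sketch of the producer/consumer simulation described in the text.
   The widget structure and payload size are illustrative assumptions. */

#define RED  0
#define BLUE 1

struct widget {
    int  color;
    char payload[200];          /* assumed: makes widgets "medium" sized */
};

static long widgets_allocated, widgets_freed;

static int random_flip(void)    /* returns 1 or 0 with equal probability */
{
    return rand() & 1;
}

static struct widget *make_widget(int color)  /* the only direct allocator */
{
    struct widget *w = malloc(sizeof *w);
    w->color = color;
    widgets_allocated++;
    return w;
}

static struct widget *make_red_widget(void)  { return make_widget(RED); }
static struct widget *make_blue_widget(void) { return make_widget(BLUE); }

static void consume_widget(struct widget *w)
{
    if (w->color != RED) {      /* the bug: red widgets are never freed */
        free(w);
        widgets_freed++;
    }
}

long run_simulation(int iterations)
{
    for (int i = 0; i < iterations; i++)
        consume_widget(random_flip() ? make_red_widget()
                                     : make_blue_widget());
    return widgets_allocated - widgets_freed;   /* leaked widgets */
}
```

Run long enough, the unfreed red widgets accumulate until the heap is exhausted, which is exactly the failure mode described above.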

To use {\tt mprof}, programmers link in special versions of the system
functions {\tt malloc} and {\tt free}, which are called each time
memory is allocated and freed, respectively. The application is then
run normally. The {\tt mprof} monitor function, linked in with {\tt
malloc}, gathers statistics as the program runs and writes this
information to a file when the application exits. The programmer then
runs a display program over the data file, and four tables are
printed: a list of memory leaks, an allocation bin table, a direct
allocation table, and a dynamic call graph. Each table presents the
allocation behavior of the program from a different perspective. The
rest of this section describes each of the tables for the C program in
Figure~\ref{C example figure}.

% Fields in the tables are described in detail in the appendix.

\subsection{The Memory Leak Table}

C programmers must explicitly free memory objects when they are done
using them. Memory leaks arise when programmers accidentally forget to
release memory. Because Lisp reclaims memory automatically, the
memory leak table is not necessary in the Lisp version of {\tt mprof}.

The memory leak table tells a programmer which functions allocated the
memory associated with memory leaks. The table contains a list of
partial call paths that resulted in memory that was allocated and not
subsequently freed. The paths are partial because complete path
information is not recorded; only the last five callers on the call
stack are listed in the memory leak table. In our simple example,
there is only one such path, and it tells us immediately what objects
are not freed. Figure~\ref{memory leak figure} contains the memory
leak table for our example.

\begin{figure}[htbp]
\begin{singlespace}
{
\include{extable0a}
}
\end{singlespace}
\caption{Memory Leak Table for Producer/Consumer Example}
\label{memory leak figure}
\end{figure}

In larger examples, more than one path through a particular function
is possible. We provide an option that distinguishes individual call
sites within the same function in the memory leak table if such a
distinction is needed.

\subsection{The Allocation Bin Table}

A major part of understanding the memory allocation behavior of a
program is knowing what objects were allocated. In C, memory
allocation is done by object size; the type of object being allocated
is not known at allocation time. The allocation bin table provides
information about what sizes of objects were allocated and what
program types correspond to the sizes listed. This knowledge helps a
programmer recognize which data structures consume the most memory and
allows him to concentrate any space optimizations on them.

The allocation bin table breaks down object allocation by the size, in
bytes, of allocated objects. Figure~\ref{allocation bin figure} shows
the allocation bin table for the program in Figure~\ref{C example
figure}.

\begin{figure}[htbp]
\begin{singlespace}
{
\include{extable0}
}
\end{singlespace}
\caption{Allocation Bin Table for Producer/Consumer Example}
\label{allocation bin figure}
\end{figure}

The allocation bin table contains information about objects of each
byte size from 0 to 1024 bytes and groups objects larger than 1024
bytes into a single bin. For each byte size in which memory was
allocated, the allocation bin table shows the number of allocations of
that size ({\tt allocs}), the total number of bytes allocated to
objects of that size ({\tt bytes}), the number of frees of objects of
that size ({\tt frees}), the number of bytes not freed that were
allocated to objects of that size ({\tt kept}\footnote{The label {\tt
kept} is used throughout the paper to refer to objects that were never
freed.}), and user types whose size is the same as the bin size ({\tt
types}). From the example, we can see that 10,000 widgets were
allocated by the program, but only 4,981 of these were eventually
freed, resulting in 1,023,876 bytes of memory lost to the memory leak.
The percentages show what fraction of all bins a particular bin
contributed. This information is provided to allow a user to rapidly
determine which bins are of interest (i.e., contribute a substantial
percentage). 99\% is the largest percentage possible because we chose
to use a two-character field width.

\subsection{The Direct Allocation Table}

Another facet of understanding memory allocation is knowing which
functions allocated memory and how much they allocated. In C, memory
allocation is performed explicitly by calling {\tt malloc}, and so
programmers are often aware of the functions that allocate memory.
Even in C, however, knowing how much memory was allocated can point
out functions that do unnecessary allocation and guide the programmer
when he attempts to optimize the space consumption of his program. In
Lisp, memory allocation happens implicitly in many primitive routines
such as {\tt mapcar}, {\tt *}, and {\tt intern}. The direct allocation
table can reveal unsuspected sources of allocation to Lisp
programmers.

Figure~\ref{direct allocation figure} contains the direct allocation
table for our example. The direct allocation table corresponds to the
flat profile generated by {\tt gprof}.

\begin{figure}[htbp]
\begin{singlespace}
{\scriptsize
\include{extable1}
}
\end{singlespace}

\caption{Direct Allocation Table for Producer/Consumer Example}
\label{direct allocation figure}
\end{figure}
The first line of the direct allocation table contains the totals for
all functions allocating memory. In this example, only one function,
{\tt make\_widget}, allocates memory. The direct allocation table
prints the percent of total allocation that took place in each
function ({\tt \%~mem}), the number of bytes allocated by each
function ({\tt bytes}), the number of bytes allocated by the function
and never freed ({\tt bytes kept}), and the number of calls made to
the function that resulted in allocation ({\tt calls}). The {\tt
\%~mem(size)} fields contain a size breakdown\footnote{Both the direct
allocation table and the dynamic call graph break down object
allocation into four categories of object size: small (s), from 0--32
bytes; medium (m), from 33--256 bytes; large (l), from 257--2048
bytes; and extra large (x), larger than 2048 bytes. For Lisp,
categorization is by type rather than size: cons cell (c), floating
point number (f), structure or vector (s), and other (o).} of the
memory allocated by each function as a fraction of the memory
allocated by all functions. In this example, 99\% of the memory
allocated by the program was allocated in {\tt make\_widget} for
medium-sized objects. Blank columns indicate values less than 1\%.
The other size breakdown given in the direct allocation table is for
the memory that was allocated and never freed. The {\tt \%~all~kept}
field contains a size breakdown of the memory not freed by a
particular function as a fraction of all the memory not freed. In the
example, 99\% of the unfreed memory was of medium-sized objects
allocated by {\tt make\_widget}.
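The size categories from the footnote are simple to encode as a helper; the following sketch (the function name is ours) just transcribes the ranges given above.

```c
/* Map an allocation size in bytes to the size category letter used in
   the direct allocation table and the allocation call graph:
   small (s) 0-32, medium (m) 33-256, large (l) 257-2048,
   extra large (x) above 2048 bytes. */
char size_category(unsigned long bytes)
{
    if (bytes <= 32)   return 's';
    if (bytes <= 256)  return 'm';
    if (bytes <= 2048) return 'l';
    return 'x';
}
```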

\subsection{The Allocation Call Graph}

Understanding the memory allocation behavior of a program sometimes
requires more information than just knowing the functions that are
directly responsible for memory allocation. Sometimes, as happens in
Figure~\ref{C example figure}, the same allocation function is called
by several different functions for different purposes. The allocation
call graph shows all the functions that were indirect callers of
functions that allocated memory.

Because the dynamic caller/callee relations of a program are numerous,
the dynamic call graph is a complex table with many entries. Often,
the information provided by the first three tables is enough to allow
programmers to understand the memory allocation of their program.
Nevertheless, for a full understanding of the allocation behavior of
programs, the allocation call graph is useful. Figure~\ref{call graph
figure} contains the allocation call graph for the producer/consumer
example and corresponds to the call graph profile generated by {\tt
gprof}.

\begin{figure}[ht]
\begin{singlespace}
{\scriptsize
\include{extable2}
}
\end{singlespace}

\caption{Allocation Call Graph for Producer/Consumer Example}
\label{call graph figure}
\end{figure}

The allocation call graph is a large table with an entry for each
function that was on a call chain when memory was allocated. Each
table entry is divided into three parts: the line for the function
itself (called the {\em entry function\/}); lines above that line,
each of which represents a caller of the entry function (the
ancestors); and lines below that line, each of which represents a
function called by the entry function (the descendents). The entry
function is easy to identify in each table entry because a large rule
appears in the {\tt frac} column on that row. In the first entry of
Figure~\ref{call graph figure}, {\tt main} is the entry function;
there are no ancestors and two descendents.

The entry function line of the allocation call graph contains
information about the function itself. The {\tt index} field provides
a unique index to help users navigate through the call graph. The
{\tt self~$+$~desc} field contains the percent of total memory
allocated that was allocated in this function and its descendents.
The call graph is sorted by decreasing values in this field. The {\tt
self} field contains the number of bytes that were allocated directly
in the entry function. The {\tt size-func} fields contain a size
breakdown of the memory allocated in the function itself. Some
functions, like {\tt main} (index 0), allocated no memory directly, so
the {\tt size-func} fields are all blank. The {\tt called} field
shows the number of times this function was called during a memory
allocation, with the number of recursive calls recorded in the
adjacent field.

Each caller of the entry function is listed on a separate line above
it. A summary of all callers is given on the top line of the entry if
there is more than one ancestor. The {\tt self} field of ancestors
lists the number of bytes that the entry function and its descendents
allocated on behalf of the ancestor. The {\tt size-ances} field
breaks down those bytes into size categories, while the {\tt
frac-ances} field shows the size breakdown of the bytes requested by
this ancestor as a fraction of bytes allocated at the request of all
ancestors. For example, in the entry for function {\tt make\_widget}
(index 1), the ancestor {\tt make\_red\_widget} can be seen to have
requested 1,023,876 bytes of data from {\tt make\_widget}, 99\% of which
was of medium-sized objects. Furthermore, calls from {\tt
make\_red\_widget} accounted for 50\% of the total memory allocated by
{\tt make\_widget} and its descendents. Other fields show how many
calls the ancestor made to the entry function and how many calls the
ancestor made in total. In a similar fashion, information about the
function's descendents appears below the entry function.

Had the memory leak table not already told us what objects were not
being freed, we could use the allocation call graph for the same
purpose. The direct allocation table told us that {\tt make\_widget}
allocated 1,023,876 bytes of unfreed memory, all for medium-sized
objects. From the allocation call graph, we can see that the function
{\tt make\_red\_widget} was the function calling {\tt make\_widget}
that requested 1,023,876 bytes of medium-sized objects.

Cycles in the call graph are not illustrated in Figure~\ref{call graph
figure}. As described in the next section, cycles obscure allocation
information among functions that are members of a cycle. When the
parent/child relationships that appear in the graph are between
members of the same cycle, most of the fields in the graph must be
omitted.


\section{Implementation}
\label{implementing mprof}

We have implemented {\tt mprof} for use with C and Common Lisp
programs. Since the implementations are quite similar, the C
implementation will be described in detail, and the minor differences
in the Lisp implementation will be noted at the end of the section.

\subsection{The Monitor}

The first phase of {\tt mprof} is a monitor that is linked into the
executing application. The monitor includes modified versions of {\tt
malloc} and {\tt free} that record information each time they are
invoked. Along with {\tt malloc} and {\tt free}, {\tt mprof} provides
its own {\tt exit} function, so that when the application program
exits, the data collected by the monitor is written to a file. The
monitor maintains several data structures needed to construct the
tables.

To construct the leak table, the monitor associates a list of the last
five callers in the call chain, the {\em partial call chain\/}, with
the object allocated. {\tt Mprof} augments every object allocated
with two items: an integer which is the object size as requested by
the user (since the allocator may allocate an object of a different
size for convenience), and a pointer to a structure that contains the
object's partial call chain and a count of allocations and frees of
objects with that call chain. A hash table is used to map a partial
call chain to the structure containing the counters. When an object
is allocated, its partial call chain is used as a hash key to retrieve
the structure containing the counters. A pointer to the structure is
placed in the allocated object and the allocation counter is
incremented. When the object is later freed, the pointer is followed
and the counter of frees is incremented. Any partial call chain in
which the number of allocations does not match the number of frees
indicates a memory leak and is printed in the leak table.
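This bookkeeping can be sketched as follows. The header layout, table size, and hash function are our assumptions, but the structure (a per-object header pointing at a per-chain counter record, found through a hash table keyed on the partial call chain) follows the description above.

```c
#include <stdlib.h>
#include <string.h>

/* Leak-table bookkeeping sketch: every allocated object is preceded
   by a header holding the requested size and a pointer to the record
   for its partial call chain.  Sizes and chaining are assumptions. */

#define CHAIN_DEPTH 5
#define TABLE_SIZE  1024

struct chain_info {
    void  *chain[CHAIN_DEPTH];  /* partial call chain (the hash key) */
    long   allocs, frees;       /* a mismatch indicates a memory leak */
    struct chain_info *next;    /* hash-bucket chaining */
};

struct header {                 /* prepended to every object */
    size_t requested_size;      /* size as requested by the user */
    struct chain_info *info;
};

static struct chain_info *table[TABLE_SIZE];

static unsigned hash_chain(void *const chain[CHAIN_DEPTH])
{
    unsigned long h = 0;
    for (int i = 0; i < CHAIN_DEPTH; i++)
        h = h * 31 + (unsigned long)chain[i];
    return (unsigned)(h % TABLE_SIZE);
}

struct chain_info *lookup_chain(void *const chain[CHAIN_DEPTH])
{
    unsigned b = hash_chain(chain);
    for (struct chain_info *p = table[b]; p; p = p->next)
        if (memcmp(p->chain, chain, sizeof p->chain) == 0)
            return p;
    struct chain_info *p = calloc(1, sizeof *p);   /* first sighting */
    memcpy(p->chain, chain, sizeof p->chain);
    p->next = table[b];
    table[b] = p;
    return p;
}

void *leak_malloc(size_t size, void *const chain[CHAIN_DEPTH])
{
    struct header *h = malloc(sizeof *h + size);
    h->requested_size = size;
    h->info = lookup_chain(chain);
    h->info->allocs++;
    return h + 1;               /* the user sees memory past the header */
}

void leak_free(void *ptr)
{
    struct header *h = (struct header *)ptr - 1;
    h->info->frees++;           /* follow the pointer back to the counters */
    free(h);
}
```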

To construct the allocation bin table, the monitor has a 1026-element
array of integers to count allocations and another 1026-element array
to count frees. When objects of a particular size from 0--1024 bytes
are allocated or freed, the appropriate bin is incremented. Objects
larger than 1024 bytes are grouped into the same bin.
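A sketch of those counters (the array and function names are ours):

```c
/* Allocation-bin counters: one bin per byte size from 0 to 1024, plus
   a single overflow bin for larger objects, for a total of 1026
   entries as in the text. */
#define NBINS   1026
#define BIG_BIN 1025

static long alloc_bins[NBINS];
static long free_bins[NBINS];

static int bin_index(unsigned long size)
{
    return size <= 1024 ? (int)size : BIG_BIN;
}

void bin_record_alloc(unsigned long size) { alloc_bins[bin_index(size)]++; }
void bin_record_free(unsigned long size)  { free_bins[bin_index(size)]++; }
```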

The construction of the direct allocation table falls out directly
from maintaining the allocation call graph information, which is
described in the next section.

\subsection{Constructing the Allocation Call Graph}

To construct the allocation call graph, the monitor must associate the
number of bytes allocated with every function on the current dynamic
call chain, each time {\tt malloc} is called. Consider the sample
call chain in Figure~\ref{cchain:fig}, which we abbreviate:
{\tt main}-$>${\tt foo}-$>${\tt bar}(24).

\begin{figure}[htbp]
\begin{verbatim}
    CALL STACK:              MPROF RECORDS:
    main calls foo           24 bytes over main -> foo
    foo calls bar            24 bytes over foo -> bar
    bar calls malloc(24)     24 bytes allocated in bar
\end{verbatim}
\caption{Example of a Dynamic Call Chain}
\label{cchain:fig}
\end{figure}

In {\tt mprof}, the monitor traverses the entire call chain by
following return addresses. This differs from {\tt gprof}, where only
the immediate caller of the current function is recorded. {\tt gprof}
makes the assumption that each call takes an equal amount of time and
uses this assumption to reconstruct the complete dynamic call graph
from information only about the immediate callers. In {\tt mprof}, we
actually traverse the entire dynamic call chain and need to make no
assumptions.
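{\tt Mprof} itself follows frame pointers, which is machine-dependent. As a rough modern stand-in, the same chain of return addresses can be captured with the {\tt backtrace} routine from {\tt <execinfo.h>} (available in glibc and several other Unix C libraries); this is our substitution, not the original mechanism.

```c
#include <execinfo.h>

/* Capture the chain of return addresses at the current point of
   execution.  chain[0] is the innermost frame; the return value is
   the number of frames recorded.  backtrace() stands in here for
   mprof's hand-written frame-pointer walk. */
#define MAX_CHAIN 128

int capture_chain(void *chain[], int max_depth)
{
    return backtrace(chain, max_depth);
}
```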

In choosing to traverse the entire call chain, we have elected to
perform an operation that is potentially expensive both in time and
space. One implementation would simply record every function in every
chain and write the information to a file (i.e., in the example we
would output [{\tt main}-$>${\tt foo}-$>${\tt bar}, 24]). Considering that
many programs execute millions of calls to {\tt malloc} and that the
depth of a call chain can be hundreds of functions, the amount of
information could be prohibitive.

An alternative to recording the entire chain of callers is to break
the call chain into a set of caller/callee pairs, and associate the
bytes allocated with each pair in the chain. For the call in the
example, we could maintain the pairs [{\tt main}, {\tt foo}] and [{\tt foo},
{\tt bar}], and associate 24 bytes with each pair. Conceptually, the
data structure our monitor maintains is an association between
caller/callee pairs and the cumulative bytes allocated over the pair,
which we denote ([{\tt main}, {\tt foo}], 24). To continue with the
example, if the next allocation were {\tt main}-$>${\tt foo}-$>${\tt
otherbar}(10), where this is the first call to {\tt otherbar}, we
would update the byte count associated with the [{\tt main}, {\tt foo}] pair
from 24 to 34. Furthermore, we would create a new association between
[{\tt foo}, {\tt otherbar}] and the byte count, 10. A disadvantage
of this implementation is that the exact call chains are no longer
available. However, from the pairs we can construct the correct
dynamic call graph of the program, which is the information that we
need for the allocation call graph.

For the overhead imposed by the monitor to be reasonable, we have to
make the association between caller/callee pairs and cumulative byte
counts fast. We use a hash table in which the hash function is a
simple byte-swap XOR of the callee address. Each callee has a list of
its callers and the number of allocated bytes associated with each
pair. In an effort to decrease the number of hash lookups, we noted
that from allocation to allocation, most of the call chain remains the
same. Our measurements show that on the average, 60--75\% of the call
chain remains the same between allocations. This observation allows
us to cache the pairs associated with the current caller chain and to
use most of these pairs the next time a caller chain is recorded.
Thus, on any particular allocation, only a few addresses need to be
hashed. Here are the events that take place when a call to {\tt
malloc} is monitored:

\begin{enumerate}
  \item The chain of return addresses is stored in a vector.
  \item The new chain is compared with the previous chain, and the point
at which they differ is noted.
  \item For the addresses in the chain that have not changed, the
caller/callee byte count for each pair is already available and is
incremented.
  \item For new addresses in the chain, each caller/callee byte count is
looked up and updated.
  \item For the tail of the chain (i.e., the function that called {\tt
malloc} directly), the direct allocation information is recorded.
\end{enumerate}

Maintaining allocation call graph information requires a byte count
for every distinct caller/callee pair in every call chain that
allocates memory. Our experience is that there are a limited number
of such pairs, even in very large C programs, so that the memory
requirements of the {\tt mprof} monitor are not large (see
Section~\ref{mmeas:sec}).
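The pair accounting can be sketched as below. The table size, collision handling, and the exact byte-swap are our assumptions; the hash is keyed on the callee address as described, and {\tt record\_allocation} charges the allocated bytes to every caller/callee pair on the chain.

```c
#include <stdlib.h>

/* Sketch of the caller/callee pair accounting described in the text. */

#define PAIR_TABLE 4096

struct pair {
    void *caller, *callee;
    unsigned long bytes;        /* cumulative bytes allocated over pair */
    struct pair *next;          /* callers sharing this hash bucket */
};

static struct pair *pairs[PAIR_TABLE];

/* One plausible reading of a "byte-swap XOR" of the callee address. */
static unsigned hash_callee(void *callee)
{
    unsigned long a = (unsigned long)callee;
    unsigned long swapped = ((a & 0xff) << 8) | ((a >> 8) & 0xff);
    return (unsigned)((a ^ swapped) % PAIR_TABLE);
}

static void add_pair_bytes(void *caller, void *callee, unsigned long bytes)
{
    unsigned b = hash_callee(callee);
    struct pair *p;
    for (p = pairs[b]; p; p = p->next)
        if (p->caller == caller && p->callee == callee)
            break;
    if (!p) {                           /* first time this pair is seen */
        p = calloc(1, sizeof *p);
        p->caller = caller;
        p->callee = callee;
        p->next = pairs[b];
        pairs[b] = p;
    }
    p->bytes += bytes;
}

unsigned long pair_bytes(void *caller, void *callee)
{
    for (struct pair *p = pairs[hash_callee(callee)]; p; p = p->next)
        if (p->caller == caller && p->callee == callee)
            return p->bytes;
    return 0;
}

/* Charge one allocation to every caller/callee pair on the chain;
   chain[0] is the outermost caller, chain[n-1] called malloc directly. */
void record_allocation(void *chain[], int n, unsigned long bytes)
{
    for (int i = 0; i + 1 < n; i++)
        add_pair_bytes(chain[i], chain[i + 1], bytes);
}
```

The chain-caching optimization described above would sit in front of {\tt add\_pair\_bytes}, skipping the lookups for the unchanged prefix of the chain. Replaying the worked example from the text (24 bytes over {\tt main}-$>${\tt foo}-$>${\tt bar}, then 10 over {\tt main}-$>${\tt foo}-$>${\tt otherbar}) leaves [{\tt main}, {\tt foo}] holding 34 bytes, [{\tt foo}, {\tt bar}] 24, and [{\tt foo}, {\tt otherbar}] 10.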

\subsection{Reduction and Display}

The second phase of {\tt mprof} reads the output of the monitor,
reduces the data to create a dynamic call graph, and displays the data
in four tables. The first part of the data reduction is to map the
caller/callee address pairs to actual function names. A program {\tt
mpfilt} reads the executable file that created the monitor trace
(compiled so that symbol table information is retained), and outputs a
new set of function caller/callee relations. These relations are then
used to construct the subset of the program's dynamic call graph that
involved memory allocation.

The call graph initially can contain cycles due to recursion in the
program's execution. Cycles in the call graph introduce spurious
allocation relations, as is illustrated in Figure~\ref{recursive call
figure}. In this example, {\tt main} is credited as being indirectly
responsible for 10 bytes, but because we only keep track of
caller/callee pairs, {\tt F} appears to have requested 20 bytes from
{\tt G}, even though only 10 bytes were allocated.

\begin{figure}[htbp]
\begin{verbatim}
    CALL STACK:              MPROF RECORDS:
    main calls F             (10 bytes over main -> F)
    F calls G                (10 bytes over F -> G)
    G calls F                (10 bytes over G -> F)
    F calls G                (10 MORE bytes over F -> G)
    G calls malloc(10)       (10 bytes allocated in G)
\end{verbatim}
\caption{Problems Caused by Recursive Calls}
\label{recursive call figure}
\end{figure}

We considered several solutions to the problems caused by cycles and
adopted the most conservative solution. One way to avoid recording
spurious allocation caused by recursion is for the monitor to identify
the cycles before recording the allocation. For example, in
Figure~\ref{recursive call figure}, the monitor could realize that it
had already credited {\tt F} with the 10 bytes when it encountered
{\tt F} calling {\tt G} the second time. This solution adds overhead
to the monitor and conflicts with our goal of making the monitor as
unobtrusive as possible.

The solution that we adopted was to merge functions that are in a
cycle into a single node in the reduction phase. Thus, each strongly
connected component in the dynamic call graph is merged into a single
node. The result is a call graph with no cycles. This process is
also used by {\tt gprof} and is described in detail
elsewhere~\cite{graham83:gprof}. Such an approach works well in {\tt
gprof} because C programs, for which {\tt gprof} was primarily
intended, tend to have limited amounts of recursion. Lisp programs,
for which {\tt mprof} is also intended, intuitively contain much more
recursion. We have experience profiling a number of large Common Lisp
programs. We observe several recursive cycles in most programs, but
the cycles generally contain a small percentage of the total
functions, and {\tt mprof} is quite effective.
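The merging step needs the strongly connected components of the call graph. A compact sketch using Tarjan's algorithm on a small adjacency-matrix graph follows; the published {\tt gprof} scheme differs in its details, and the fixed-size arrays are an assumption for brevity.

```c
/* Find strongly connected components of a small call graph so that
   mutually recursive functions can be merged into one node.
   comp[v] gives the component number of node v after the pass. */
#define MAXN 64

static int nnodes;
static int adj[MAXN][MAXN];              /* adjacency matrix: caller -> callee */
static int idx[MAXN], low[MAXN], comp[MAXN], on_stack[MAXN];
static int stack_[MAXN], sp, counter, ncomp;

static void strongconnect(int v)
{
    idx[v] = low[v] = ++counter;
    stack_[sp++] = v;
    on_stack[v] = 1;
    for (int w = 0; w < nnodes; w++) {
        if (!adj[v][w]) continue;
        if (idx[w] == 0) {               /* unvisited callee: recurse */
            strongconnect(w);
            if (low[w] < low[v]) low[v] = low[w];
        } else if (on_stack[w] && idx[w] < low[v]) {
            low[v] = idx[w];             /* back edge into current stack */
        }
    }
    if (low[v] == idx[v]) {              /* v is the root of a component */
        int w;
        do {
            w = stack_[--sp];
            on_stack[w] = 0;
            comp[w] = ncomp;
        } while (w != v);
        ncomp++;
    }
}

void find_components(int n)
{
    nnodes = n;
    counter = sp = ncomp = 0;
    for (int v = 0; v < n; v++)
        idx[v] = low[v] = on_stack[v] = 0;
    for (int v = 0; v < n; v++)
        if (idx[v] == 0)
            strongconnect(v);
}
```

On the recursive example above ({\tt main} calls {\tt F}, {\tt F} calls {\tt G}, {\tt G} calls {\tt F}), {\tt F} and {\tt G} land in one component and {\tt main} in another, so the reduced graph is acyclic.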

\subsection{Lisp Implementation}

So far, we have described the implementation of {\tt mprof} for C.
The Lisp implementation is quite similar, and here we describe the
major differences. C has a single function, {\tt malloc}, that is
called to allocate memory explicitly. Lisp has a large number of
primitives that allocate memory implicitly (e.g., {\tt cons}, {\tt *},
{\tt intern}, etc.). To make {\tt mprof} work, these primitives must
be modified so that every allocation is recorded. Fortunately, at the
Lisp implementation level, all memory allocations may be channeled
through a single routine. We worked with KCL (Kyoto Common Lisp),
which is implemented in C. In KCL, all Lisp memory allocations are
handled by a single function, {\tt alloc\_object}. Just as we had
modified {\tt malloc} in C, we were able to simply patch {\tt
alloc\_object} to monitor memory allocation in KCL.

The other major difference in monitoring Lisp is that the addresses
recorded by the monitor must be translated into Lisp function names.
Again, KCL makes this quite easy because Lisp functions are defined in
a central place in KCL and the names of the functions are known when
they are defined. Many other Lisp systems are designed to allow
return addresses to be mapped to symbolic function names so that the
call stack can be printed at a breakpoint. In this case, the monitor
can use the same mechanism to map return addresses to function names.
Therefore, in Lisp systems in which addresses can be quickly mapped to
function names, memory profiling in the style of {\tt mprof} is not a
difficult problem. In systems in which symbolic names are not
available in compiled code, profiling is more difficult. Furthermore,
many systems open-code important allocation functions, like {\tt
cons}. Because open-coded allocation functions will not necessarily
call a central allocation function (like {\tt alloc\_object}), such
allocations will not be observed by {\tt mprof}. To avoid such a loss
of information, {\tt mprof} should be used in conjunction with program
declarations that will force allocation functions such as {\tt cons}
to be coded out-of-line.

\section{Measurements}
\label{measurements}

We have measured the C implementation of {\tt mprof} by instrumenting
four programs using {\tt mprof}. The first program, {\tt example}, is
our example program with the number of widgets allocated increased to
100,000 to increase program execution time. The second program, {\tt
fidilrt}, is the runtime library of FIDIL, a programming language for
finite difference computations~\cite{hilfinger88:fidil}. The third
program, {\tt epoxy}, is an electrical and physical layout optimizer
written by Fred Obermeier~\cite{fwo87:epoxy}. The fourth program,
{\tt crystal}, is a VLSI timing analysis
program~\cite{ouster85:crystal}. These tests represent a small
program ({\tt example}, 100 lines); a medium-sized program ({\tt
fidilrt}, 7,100 lines); and two large programs ({\tt epoxy}, 11,000
lines and {\tt crystal}, 10,500 lines). In the remainder of this
section, we will look at the resource consumption of {\tt mprof} from
two perspectives: execution time overhead and space consumption.
624
\subsection{Execution Time Overhead}

There are two sources of execution time overhead associated with {\tt
mprof}: additional time spent monitoring an application and the time
to reduce and print the data produced by the monitor. The largest
source of monitor overhead is the time required to traverse the
complete call chain and associate allocations with caller/callee
pairs. We implemented a version of {\tt mprof}, called {\tt mprof-},
which does not create the allocation call graph. With this version,
we can see the relative cost of the allocation call graph. The ratio
of the time spent with profiling to the time spent without profiling
is called the {\em slowdown factor\/}. Table \ref{cpu:tab} summarizes
the execution time overheads for our four applications. Measurements
were gathered running the test programs on a VAX 8800 with 80
megabytes of physical memory.

\begin{table}[htbp]
\begin{singlespace}
\begin{center}
\begin{tabular}{|l|r|r|r|r|} \hline
\multicolumn{1}{|c|}{} & \multicolumn{4}{|c|}{Cost} \\ \cline{2-5}
\multicolumn{1}{|c|}{Resource Description} &
\multicolumn{1}{|c|}{\tt example} &
\multicolumn{1}{|c|}{\tt fidilrt} &
\multicolumn{1}{|c|}{\tt epoxy} &
\multicolumn{1}{|c|}{\tt crystal} \\ \hline
Number of allocations & 100000 & 77376 & 306295 & 31158 \\
Execution time with {\tt mprof} (seconds) & 62.7 & 132.7 & 188.8 & 134.1 \\
Execution time with {\tt mprof-} (seconds) & 44.1 & 116.0 & 149.7 & 25.5 \\
Execution time without {\tt mprof} (seconds) & 17.9 & 107.1 & 52.1 & 13.2 \\
\hline
Slowdown using {\tt mprof} & 3.5 & 1.2 & 3.6 & 10.1 \\
Slowdown using {\tt mprof-} & 2.5 & 1.1 & 2.9 & 1.9 \\
\hline
Reduction and display time (seconds) & 10.3 & 28.8 & 28.3 & 36.8 \\
\hline
\end{tabular}
\caption{Execution Time Overhead of {\tt mprof}}
\label{cpu:tab}
\end{center}
\end{singlespace}
\end{table}

The slowdown associated with {\tt mprof} varies widely, from 1.2 to
10.1. {\tt crystal} suffered the worst degradation from profiling
because {\tt crystal} uses a depth-first algorithm that results in
long call chains. Programs without long call chains appear to slow
down by a factor of 2--4. If the allocation call graph is not
generated and long call chains are not traversed, the slowdown is
significantly less, especially in the extreme cases. Since {\tt
mprof} is a prototype and has not been carefully optimized, this
overhead seems acceptable. From the table, we see that the reduction
and display time is typically less than a minute.

\subsection{Storage Consumption}
\label{mmeas:sec}

The storage consumption of {\tt mprof} is divided into the additional
memory needed by the monitor as an application executes, and the
external storage required by the profile data file. The most
significant source of memory used by the monitor is the data stored
with each object allocated: an object size and a pointer needed to
construct the memory leak table. The monitor also uses memory to
record the memory bins and caller/callee byte counts and must write
this information to a file when the application is finished. We
measured how many bytes of memory and disk space are needed to store
this information. Table~\ref{mem:tab} summarizes the measurements of
storage consumption associated with {\tt mprof}.

\begin{table}[htbp]
\begin{singlespace}
\begin{center}
\begin{tabular}{|l|r|r|r|r|} \hline
\multicolumn{1}{|c|}{} & \multicolumn{4}{|c|}{Cost} \\ \cline{2-5}
\multicolumn{1}{|c|}{Resource Description} &
\multicolumn{1}{|c|}{\tt example} &
\multicolumn{1}{|c|}{\tt fidilrt} &
\multicolumn{1}{|c|}{\tt epoxy} &
\multicolumn{1}{|c|}{\tt crystal} \\ \hline
Number of allocations & 100000 & 61163 & 306295 & 31158 \\
User memory allocated (Kbytes) & 20000 & 2425 & 6418 & 21464 \\
\hline
Per object memory (Kbytes) & 781 & 477 & 2393 & 168 \\
Other monitor memory (Kbytes) & 8.7 & 23.3 & 52.3 & 17.5 \\
Total monitor memory (Kbytes) & 790 & 500 & 2445 & 186 \\
Monitor fraction (\% memory allocated) & 4 & 17 & 28 & 1 \\
\hline
Data file size (Kbytes) & 4.5 & 8.1 & 28.6 & 9.6 \\
\hline
\end{tabular}
\caption{Storage Consumption of {\tt mprof}}
\label{mem:tab}
\end{center}
\end{singlespace}
\end{table}

The memory overhead of {\tt mprof} is small, except that an additional
8 bytes of storage are allocated with every object. In programs in
which many small objects are allocated, like {\tt epoxy}, {\tt mprof}
can contribute significantly to the total memory allocated.
Nevertheless, even in the worst case, {\tt mprof} increases the memory
allocated by an application by only 1/3, and since {\tt mprof} is a
development tool, this overhead seems acceptable. From the table we
also see that the data files created by {\tt mprof} are quite small
($<$ 30 Kbytes).

\section{Related Work}
\label{related work}

{\tt Mprof} is similar to the tool {\tt gprof}~\cite{graham83:gprof},
a dynamic execution profiler. Because some of the problems of
interpreting the dynamic call graph are the same, we have borrowed
{\tt gprof}'s solutions to these problems. We have also used ideas
from the user interface of {\tt gprof}, for two reasons: the
information being displayed is quite similar, and users familiar with
{\tt gprof} would probably also be interested in {\tt mprof} and would
benefit from a similar presentation.

Barach, Taenzer, and Wells developed a tool for finding storage
allocation errors in C programs~\cite{barach82:memprof}. Their
approach concentrated on finding two specific storage allocation
errors: memory leaks and duplicate frees. They modified {\tt malloc}
and {\tt free} so that every time these functions were called,
information about the memory block being manipulated was recorded in a
file. A program that examines this file, {\tt prleak}, prints out
which memory blocks were never freed or were freed twice. This
approach differs from {\tt mprof} in two ways. First, {\tt mprof}
provides more information about the memory allocation of programs than
{\tt prleak}, which reports only on storage errors. Second, {\tt
prleak} generates extremely large intermediate files that are
comparable in size to the total amount of memory allocated by the
program (often megabytes of data). Although {\tt mprof} records more
useful information, the data files it generates are of modest size
(see Table~\ref{mem:tab}).

\section{Conclusions}
\label{conclusions}

We have implemented a memory allocation profiling program for both C
and Common Lisp. Our example has shown that {\tt mprof} can be
effective in elucidating the allocation behavior of a program so that
a programmer can detect memory leaks and identify major sources of
allocation.

Unlike {\tt gprof}, {\tt mprof} records every caller in the call chain
every time an object is allocated. The overhead for this recording is
large but not impractically so, because we take advantage of the fact
that a call chain changes little between allocations. Moreover,
recording this information does not require large amounts of memory
because there are relatively few unique caller/callee address pairs on
call chains in which allocation takes place, even in very large
programs. We have measured the overhead of {\tt mprof}, and find that
it typically slows applications by a factor of 2--4 and adds up to
33\% to the memory allocated by the application. Because {\tt mprof}
is intended as a development tool, these costs are acceptable.

Because {\tt mprof} merges cycles caused by recursive function calls,
{\tt mprof} may be ineffective for programs with large cycles in their
call graph. Only with more data will we be able to decide if many
programs (especially those written in Lisp) contain so many recursive
calls that cycle merging makes {\tt mprof} ineffective. Nevertheless,
{\tt mprof} has already been effective in detecting KCL system
functions that allocate memory extraneously.\footnote{Using {\tt
mprof}, we noted that for a large object-oriented program written in
KCL, the system function {\tt every} accounted for 13\% of the memory
allocated. We rewrote {\tt every} so it would not allocate any
memory, and decreased the memory consumption of the program by 13\%.}

As a final note, we have received feedback from C application
programmers who have used {\tt mprof}. They report that the memory
leak table and the allocation bin table are both extremely useful,
while the direct allocation table and the allocation call graph are
harder to understand and also less useful. Considering the execution
overhead associated with the allocation call graph and the complexity
of the table, it is questionable whether the allocation call graph
will ever be as helpful to C programmers as the memory leak table. On
the other hand, with automatic storage reclamation, the memory leak
table becomes unnecessary. Yet in memory-intensive languages, such as
Lisp, effective use of memory is more important, and tools such as the
allocation call graph might prove very useful. Because we have
limited feedback from Lisp programmers using {\tt mprof}, we cannot
report their response to this tool.

\bibliography{/spur2/larus/bib/abbrev,lispm,gc,/spur2/larus/bib/optim,spur,misc,lisp}
\bibliographystyle{plain}

\end{document}
%