% Master File: mprof.tex
% Document type: LaTeX
\documentstyle[11pt,captions,doublespace,twoside]{article}

\input{defs}
\setstretch{1}

\fullpagestyle

\begin{document}

\title{A Memory Allocation Profiler for C and Lisp Programs
\footnotetext{This research
was funded by DARPA contract number N00039-85-C-0269 as part of the
SPUR research project.}}


\author{Benjamin Zorn \\
Department of Computer Science \\
University of Colorado at Boulder \\
\\
Paul Hilfinger \\
Computer Science Division, Dept. of EECS \\
University of California at Berkeley \\
}

\date{}

\maketitle

\begin{singlespace}
\begin{abstract}
This paper describes {\tt mprof}, a tool used to study the dynamic
memory allocation behavior of programs. {\tt Mprof} records the
amount of memory that a function allocates, breaks down allocation
information by type and size, and displays a program's dynamic call
graph so that functions indirectly responsible for memory allocation
are easy to identify. {\tt Mprof} is a two-phase tool. The monitor
phase is linked into executing programs and records information each
time memory is allocated. The display phase reduces the data
generated by the monitor and presents the information to a user in
several tables. {\tt Mprof} has been implemented for C and Kyoto
Common Lisp. Measurements of these implementations are presented.
\end{abstract}
\end{singlespace}


\section{Introduction}
\setlength{\parskip}{6pt plus 2pt}

Dynamic memory allocation, hereafter referred to simply as memory
allocation, is an important part of many programs. By dynamic
allocation, we mean the memory allocated from the heap. Unnecessary
allocation can decrease program locality and increase execution time
for the allocation itself and for possible memory reclamation. If
reclamation is not performed, or if some objects are accidentally not
reclaimed (a ``memory leak''), programs can fail when they reach the
memory size limit. Programmers often write their own versions of
memory allocation routines to measure and reduce allocation overhead.
It is estimated that Mesa programmers spent 40\% of their time solving
storage-management related problems before automatic storage
reclamation techniques were introduced in Cedar~\cite{rovner85:gc}.
Even with automatic storage management, in which reclamation occurs
transparently, memory allocation has a strong effect on the
performance of programs~\cite{moon84:gc}. Although memory allocation
is important, few software tools exist to help programmers understand
the memory allocation behavior of their programs.

{\tt Mprof} is a tool that allows programmers to identify where and
why dynamic memory is allocated in a program. It records which
functions are directly responsible for memory allocation and also
records the dynamic call chain at an allocation to show which
functions were indirectly responsible for the allocation. The design
of {\tt mprof} was inspired by the tool {\tt gprof}, a dynamic
execution profiler~\cite{graham83:gprof}. {\tt Gprof} is a very
useful tool for understanding the execution behavior of programs; {\tt
mprof} extends the ideas of {\tt gprof} to give programmers
information about the dynamic memory allocation behavior of their
programs. {\tt Mprof} is a two-phase tool. A monitor phase is linked
into an executing application and collects data as the application
executes. A reduction and display phase takes the data collected by
the monitor and presents it to the user in concise tables.

A profiling program such as {\tt mprof} should satisfy several
criteria. First, a program monitor should not significantly alter the
behavior of the program being monitored. In particular, a monitor
should not impose so much overhead on the program being monitored that
large programs cannot be profiled. Second, a monitor should be easy
to integrate into existing applications. To use {\tt mprof},
programmers simply have to relink their applications with a special
version of the system library. No source code modifications are
required. Finally, a monitor should provide a programmer with
information that he can understand and use to reduce the memory
allocation overhead of his programs. We will present an example that
illustrates such a use of {\tt mprof}.

In Section \ref{using mprof}, we present a simple program and
describe the use of {\tt mprof} with respect to the example. In
Section \ref{implementing mprof} we discuss techniques for the
effective implementation of {\tt mprof}. Section \ref{measurements}
presents some measurements of {\tt mprof}. Section \ref{related work}
describes other memory profiling tools and previous work on which {\tt
mprof} is based, while Section \ref{conclusions} contains our
conclusions.

\section{Using {\tt mprof}}
\label{using mprof}

\subsection{A Producer/Consumer Example}

To illustrate how {\tt mprof} helps a programmer understand the memory
allocation in his program, consider the C program in Figure~\ref{C
example figure}. In this program, a simplified producer/consumer
simulation, objects are randomly allocated by two producers and freed
by the consumer. The function {\tt random\_flip}, which is not shown,
randomly returns 1 or 0 with equal probability. The function {\tt
consume\_widget}, which is responsible for freeing the memory
allocated, contains a bug and does not free red widgets. If the
simulation ran for a long time, memory would eventually be exhausted,
and the program would fail.

\begin{figure}[htbp]
\begin{singlespace}
{\footnotesize
\include{expr}
}
\end{singlespace}
\caption{A Simple Producer/Consumer Simulation Program}
\label{C example figure}
\end{figure}
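The program itself is included from a separate source file and is not
reproduced in this excerpt. Purely as an illustration, a program
matching the description above might be sketched as follows; the
widget layout, the producer names, and the 204-byte object size
(chosen to be consistent with the totals quoted later) are our
assumptions, not the authors' code:

```c
#include <stdlib.h>

/* Hypothetical reconstruction of the example in the figure; names
   follow the text, but the widget layout is an assumption. */

#define RED  0
#define BLUE 1

typedef struct widget {
    int  color;
    char body[200];   /* sizeof(struct widget) is 204 bytes on typical
                         platforms, consistent with the tables below */
} widget;

/* The only function that calls malloc directly. */
widget *make_widget(int color)
{
    widget *w = malloc(sizeof(widget));
    w->color = color;
    return w;
}

/* The two producers. */
widget *make_red_widget(void)  { return make_widget(RED); }
widget *make_blue_widget(void) { return make_widget(BLUE); }

/* The consumer, with the bug described in the text: red widgets
   are never freed.  Returns 1 if the widget was freed, else 0. */
int consume_widget(widget *w)
{
    if (w->color == BLUE) {
        free(w);
        return 1;
    }
    return 0;
}
```

In the real program, a driver loop would use {\tt random\_flip} to
choose one of the two producers on each of its 10,000 iterations.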

In the example, {\tt make\_widget} is the only function that allocates
memory directly. To fully understand the allocation behavior of the
program, we must know which functions called {\tt make\_widget} and
hence were indirectly responsible for memory allocation. {\tt Mprof}
provides this information.

To use {\tt mprof}, programmers link in special versions of the system
functions {\tt malloc} and {\tt free}, which are called each time
memory is allocated and freed, respectively. The application is then
run normally. The {\tt mprof} monitor function, linked in with {\tt
malloc}, gathers statistics as the program runs and writes this
information to a file when the application exits. The programmer then
runs a display program over the data file, and four tables are
printed: a list of memory leaks, an allocation bin table, a direct
allocation table, and a dynamic call graph. Each table presents the
allocation behavior of the program from a different perspective. The
rest of this section describes each of the tables for the C program in
Figure~\ref{C example figure}.

% Fields in the tables are described in detail in the appendix.

\subsection{The Memory Leak Table}

C programmers must explicitly free memory objects when they are done
using them. Memory leaks arise when programmers accidentally forget to
release memory. Because Lisp reclaims memory automatically, the
memory leak table is not necessary in the Lisp version of {\tt mprof}.

The memory leak table tells a programmer which functions allocated the
memory associated with memory leaks. The table contains a list of
partial call paths that resulted in memory that was allocated and not
subsequently freed. The paths are partial because complete path
information is not recorded; only the last five callers on the call
stack are listed in the memory leak table. In our simple example,
there is only one such path, and it tells us immediately what objects
are not freed. Figure~\ref{memory leak figure} contains the memory
leak table for our example.

\begin{figure}[htbp]
\begin{singlespace}
{
\include{extable0a}
}
\end{singlespace}
\caption{Memory Leak Table for Producer/Consumer Example}
\label{memory leak figure}
\end{figure}

In larger examples, more than one path through a particular function
is possible. We provide an option that distinguishes individual call
sites within the same function in the memory leak table if such a
distinction is needed.

\subsection{The Allocation Bin Table}

A major part of understanding the memory allocation behavior of a
program is knowing what objects were allocated. In C, memory
allocation is done by object size; the type of object being allocated
is not known at allocation time. The allocation bin table provides
information about what sizes of objects were allocated and what
program types correspond to the sizes listed. This knowledge helps a
programmer recognize which data structures consume the most memory and
allows him to concentrate any space optimizations on them.

The allocation bin table breaks down object allocation by the size, in
bytes, of allocated objects. Figure~\ref{allocation bin figure} shows
the allocation bin table for the program in Figure~\ref{C example
figure}.

\begin{figure}[htbp]
\begin{singlespace}
{
\include{extable0}
}
\end{singlespace}
\caption{Allocation Bin Table for Producer/Consumer Example}
\label{allocation bin figure}
\end{figure}

The allocation bin table contains information about objects of each
byte size from 0 to 1024 bytes and groups objects larger than 1024
bytes into a single bin. For each byte size in which memory was
allocated, the allocation bin table shows the number of allocations of
that size ({\tt allocs}), the total number of bytes allocated to
objects of that size ({\tt bytes}), the number of frees of objects of
that size ({\tt frees}), the number of bytes not freed that were
allocated to objects of that size ({\tt kept}\footnote{The label {\tt
kept} is used throughout the paper to refer to objects that were never
freed.}), and user types whose size is the same as the bin size ({\tt
types}). From the example, we can see that 10,000 widgets were
allocated by the program, but only 4,981 of these were eventually
freed, resulting in 1,023,876 bytes of memory lost to the memory leak.
The percentages show what fraction of all bins a particular bin
contributed. This information is provided to allow a user to rapidly
determine which bins are of interest (i.e., contribute a substantial
percentage). 99\% is the largest percentage possible because we chose
to use a two-character field width.

\subsection{The Direct Allocation Table}

Another facet of understanding memory allocation is knowing which
functions allocated memory and how much they allocated. In C, memory
allocation is performed explicitly by calling {\tt malloc}, and so
programmers are often aware of the functions that allocate memory.
Even in C, however, knowing how much memory was allocated can point
out functions that do unnecessary allocation and guide the programmer
when he attempts to optimize the space consumption of his program. In
Lisp, memory allocation happens implicitly in many primitive routines
such as {\tt mapcar}, {\tt *}, and {\tt intern}. The direct allocation
table can reveal unsuspected sources of allocation to Lisp
programmers.

Figure~\ref{direct allocation figure} contains the direct allocation
table for our example. The direct allocation table corresponds to the
flat profile generated by {\tt gprof}.

\begin{figure}[htbp]
\begin{singlespace}
{\scriptsize
\include{extable1}
}
\end{singlespace}

\caption{Direct Allocation Table for Producer/Consumer Example}
\label{direct allocation figure}
\end{figure}

The first line of the direct allocation table contains the totals for
all functions allocating memory. In this example, only one function,
{\tt make\_widget}, allocates memory. The direct allocation table
prints the percent of total allocation that took place in each function
({\tt \%~mem}), the number of bytes allocated by each function ({\tt
bytes}), the number of bytes allocated by the function and never freed
({\tt bytes kept}), and the number of calls made to the function that
resulted in allocation ({\tt calls}). The {\tt \%~mem(size)} fields
contain a size breakdown\footnote{Both the direct allocation table
and the dynamic call graph break down object allocation into four
categories of object size: small (s), from 0--32 bytes; medium (m),
from 33--256 bytes; large (l), from 257--2048 bytes; and extra large
(x), larger than 2048 bytes. For Lisp, categorization is by type
rather than size: cons cell (c), floating point number (f), structure
or vector (s), and other (o).} of the memory allocated by each
function as a fraction of the memory allocated by all functions. In
this example, 99\% of the memory allocated by the program was
allocated in {\tt make\_widget} for medium-sized objects. Blank
columns indicate values less than 1\%. The other size breakdown given
in the direct allocation table is for the memory that was allocated
and never freed. The {\tt \%~all~kept} field contains a size
breakdown of the memory not freed by a particular function as a
fraction of all the memory not freed. In the example, 99\% of the
unfreed memory was of medium-sized objects allocated by {\tt
make\_widget}.

\subsection{The Allocation Call Graph}

Understanding the memory allocation behavior of a program sometimes
requires more information than just knowing the functions that are
directly responsible for memory allocation. Sometimes, as happens in
Figure~\ref{C example figure}, the same allocation function is called
by several different functions for different purposes. The allocation
call graph shows all the functions that were indirect callers of
functions that allocated memory.

Because the dynamic caller/callee relations of a program are numerous,
the dynamic call graph is a complex table with many entries. Often,
the information provided by the first three tables is enough to allow
programmers to understand the memory allocation of their program.
Nevertheless, for a full understanding of the allocation behavior of
programs, the allocation call graph is useful. Figure \ref{call graph
figure} contains the allocation call graph for the producer/consumer
example and corresponds to the call graph profile generated by {\tt
gprof}.

\begin{figure}[ht]
\begin{singlespace}
{\scriptsize
\include{extable2}
}
\end{singlespace}

\caption{Allocation Call Graph for Producer/Consumer Example}
\label{call graph figure}
\end{figure}

The allocation call graph is a large table with an entry for each
function that was on a call chain when memory was allocated. Each
table entry is divided into three parts: the line for the function
itself (called the {\em entry function\/}); lines above that line,
each of which represents a caller of the entry function (the
ancestors); and lines below that line, each of which represents a
function called by the entry function (the descendents). The entry
function is easy to identify in each table entry because a large rule
appears in the {\tt frac} column on that row. In the first entry of
Figure \ref{call graph figure}, {\tt main} is the entry function;
there are no ancestors and two descendents.

The entry function line of the allocation call graph contains
information about the function itself. The {\tt index} field provides
a unique index to help users navigate through the call graph. The
{\tt self~$+$~desc} field contains the percent of total memory
allocated that was allocated in this function and its descendents.
The call graph is sorted by decreasing values in this field. The {\tt
self} field contains the number of bytes that were allocated directly
in the entry function. The {\tt size-func} fields contain a size
breakdown of the memory allocated in the function itself. Some
functions, like {\tt main} (index 0), allocated no memory directly, so
the {\tt size-func} fields are all blank. The {\tt called} field
shows the number of times this function was called during a memory
allocation, with the number of recursive calls recorded in the
adjacent field.

Each caller of the entry function is listed on a separate line above
it. A summary of all callers is given on the top line of the entry if
there is more than one ancestor. The {\tt self} field of ancestors
lists the number of bytes that the entry function and its descendents
allocated on behalf of the ancestor. The {\tt size-ances} field
breaks down those bytes into size categories, while the {\tt
frac-ances} field shows the size breakdown of the bytes requested by
this ancestor as a fraction of bytes allocated at the request of all
ancestors. For example, in the entry for function {\tt make\_widget}
(index 1), the ancestor {\tt make\_red\_widget} can be seen to have
requested 1,023,876 bytes of data from {\tt make\_widget}, 99\% of which
was of medium-sized objects. Furthermore, calls from {\tt
make\_red\_widget} accounted for 50\% of the total memory allocated by
{\tt make\_widget} and its descendents. Other fields show how many
calls the ancestor made to the entry function and how many calls the
ancestor made in total. In a similar fashion, information about the
function's descendents appears below the entry function.

Had the memory leak table not already told us what objects were not
being freed, we could use the allocation call graph for the same
purpose. The direct allocation table told us that {\tt make\_widget}
allocated 1,023,876 bytes of unfreed memory, all for medium-sized
objects. From the allocation call graph, we can see that {\tt
make\_red\_widget} was the caller of {\tt make\_widget} that
requested 1,023,876 bytes of medium-sized objects.

Cycles in the call graph are not illustrated in Figure~\ref{call graph
figure}. As described in the next section, cycles obscure allocation
information among functions that are members of a cycle. When the
parent/child relationships that appear in the graph are between
members of the same cycle, most of the fields in the graph must be
omitted.


\section{Implementation}
\label{implementing mprof}

We have implemented {\tt mprof} for use with C and Common Lisp
programs. Since the implementations are quite similar, the C
implementation will be described in detail, and the minor differences
in the Lisp implementation will be noted at the end of the section.

\subsection{The Monitor}

The first phase of {\tt mprof} is a monitor that is linked into the
executing application. The monitor includes modified versions of {\tt
malloc} and {\tt free} that record information each time they are
invoked. Along with {\tt malloc} and {\tt free}, {\tt mprof} provides
its own {\tt exit} function, so that when the application program
exits, the data collected by the monitor is written to a file. The
monitor maintains several data structures needed to construct the
tables.

To construct the leak table, the monitor associates a list of the last
five callers in the call chain, the {\em partial call chain\/}, with
the object allocated. {\tt Mprof} augments every allocated object
with two items: an integer which is the object size as requested by
the user (since the allocator may allocate an object of a different
size for convenience), and a pointer to a structure that contains the
object's partial call chain and a count of allocations and frees of
objects with that call chain. A hash table is used to map a partial
call chain to the structure containing the counters. When an object
is allocated, its partial call chain is used as a hash key to retrieve
the structure containing the counters. A pointer to the structure is
placed in the allocated object and the allocation counter is
incremented. When the object is later freed, the pointer is followed
and the counter of frees is incremented. Any partial call chain in
which the number of allocations does not match the number of frees
indicates a memory leak and is printed in the leak table.
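As a concrete sketch of this scheme (the structure layout, hash
function, and table size below are our assumptions, not the actual
{\tt mprof} source):

```c
#include <stdlib.h>
#include <string.h>

#define CHAIN_DEPTH 5     /* last five callers, as in the text */
#define TABLE_SIZE  1024  /* hash-table size is an assumption */

typedef struct chain_rec {
    void  *chain[CHAIN_DEPTH]; /* partial call chain (return addresses) */
    long   allocs, frees;      /* a mismatch indicates a memory leak */
    struct chain_rec *next;    /* hash-bucket chaining */
} chain_rec;

static chain_rec *table[TABLE_SIZE];

static unsigned hash_chain(void **chain)
{
    unsigned h = 0;
    for (int i = 0; i < CHAIN_DEPTH; i++)
        h ^= (unsigned)((unsigned long)chain[i] >> 2);
    return h % TABLE_SIZE;
}

/* Look up (or create) the counter record for a partial call chain. */
chain_rec *chain_lookup(void **chain)
{
    unsigned h = hash_chain(chain);
    for (chain_rec *r = table[h]; r; r = r->next)
        if (memcmp(r->chain, chain, sizeof r->chain) == 0)
            return r;
    chain_rec *r = calloc(1, sizeof *r);
    memcpy(r->chain, chain, sizeof r->chain);
    r->next = table[h];
    table[h] = r;
    return r;
}

/* Header prepended to each object: the requested size plus a pointer
   back to the chain record, as described in the text. */
typedef struct { size_t req_size; chain_rec *rec; } obj_header;

void record_alloc(obj_header *hdr, size_t size, void **chain)
{
    hdr->req_size = size;
    hdr->rec = chain_lookup(chain);
    hdr->rec->allocs++;
}

void record_free(obj_header *hdr)
{
    hdr->rec->frees++;   /* follow the pointer, count the free */
}
```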

To construct the allocation bin table, the monitor has a 1026-element
array of integers to count allocations and another 1026-element array
to count frees. When objects of a particular size from 0--1024 bytes
are allocated or freed, the appropriate bin is incremented. Objects
larger than 1024 bytes are grouped into the same bin.
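The bin counters might be sketched as follows; the function names are
ours, but the array sizing (indices 0--1024 plus one shared overflow
bin) follows the text:

```c
/* Allocation bins: indices 0..1024 count objects of that exact
   requested size; index 1025 is shared by all larger objects. */

#define NBINS 1026

static long alloc_bins[NBINS];
static long free_bins[NBINS];

static int bin_index(unsigned long size)
{
    return size > 1024 ? 1025 : (int)size;
}

void count_alloc(unsigned long size) { alloc_bins[bin_index(size)]++; }
void count_free(unsigned long size)  { free_bins[bin_index(size)]++; }
```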

The construction of the direct allocation table falls out directly
from maintaining the allocation call graph information, which is
described in the next section.

\subsection{Constructing the Allocation Call Graph}

To construct the allocation call graph, the monitor must associate the
number of bytes allocated with every function on the current dynamic
call chain, each time {\tt malloc} is called. Consider the sample
call chain in Figure~\ref{cchain:fig}, which we abbreviate:
{\tt main}-$>${\tt foo}-$>${\tt bar}(24).

\begin{figure}[htbp]
\begin{verbatim}
CALL STACK:              MPROF RECORDS:
main calls foo           24 bytes over main -> foo
foo calls bar            24 bytes over foo -> bar
bar calls malloc(24)     24 bytes allocated in bar
\end{verbatim}
\caption{Example of a Dynamic Call Chain}
\label{cchain:fig}
\end{figure}

In {\tt mprof}, the monitor traverses the entire call chain by
following return addresses. This differs from {\tt gprof}, where only
the immediate caller of the current function is recorded. {\tt Gprof}
makes the assumption that each call takes an equal amount of time and
uses this assumption to reconstruct the complete dynamic call graph
from information only about the immediate callers. In {\tt mprof}, we
actually traverse the entire dynamic call chain and need to make no
assumptions.

In choosing to traverse the entire call chain, we have elected to
perform an operation that is potentially expensive both in time and
space. One implementation would simply record every function in every
chain and write the information to a file (i.e., in the example we
would output [{\tt main}-$>${\tt foo}-$>${\tt bar}, 24]). Considering that
many programs execute millions of calls to {\tt malloc} and that the
depth of a call chain can be hundreds of functions, the amount of
information could be prohibitive.

An alternative to recording the entire chain of callers is to break
the call chain into a set of caller/callee pairs, and associate the
bytes allocated with each pair in the chain. For the call in the
example, we could maintain the pairs [{\tt main}, {\tt foo}] and [{\tt foo},
{\tt bar}], and associate 24 bytes with each pair. Conceptually, the
data structure our monitor maintains is an association between
caller/callee pairs and the cumulative bytes allocated over the pair,
which we denote ([{\tt main}, {\tt foo}], 24). To continue with the
example, if the next allocation was {\tt main}-$>${\tt foo}-$>${\tt
otherbar}(10), where this is the first call to {\tt otherbar}, we
would update the byte count associated with the [{\tt main}, {\tt foo}] pair
from 24 to 34. Furthermore, we would create a new association between
[{\tt foo}, {\tt otherbar}] and the byte count, 10. A disadvantage
of this implementation is that the exact call chains are no longer
available. However, from the pairs we can construct the correct
dynamic call graph of the program, which is the information that we
need for the allocation call graph.

For the overhead imposed by the monitor to be reasonable, we have to
make the association between caller/callee pairs and cumulative byte
counts fast. We use a hash table in which the hash function is a
simple byte-swap XOR of the callee address. Each callee has a list of
its callers and the number of allocated bytes associated with each
pair. In an effort to decrease the number of hash lookups, we noted
that from allocation to allocation, most of the call chain remains the
same. Our measurements show that on the average, 60--75\% of the call
chain remains the same between allocations. This observation allows
us to cache the pairs associated with the current caller chain and to
use most of these pairs the next time a caller chain is recorded.
Thus, on any particular allocation, only a few addresses need to be
hashed. Here are the events that take place when a call to {\tt
malloc} is monitored:

\begin{enumerate}
\item The chain of return addresses is stored in a vector.
\item The new chain is compared with the previous chain, and the point
at which they differ is noted.
\item For the addresses in the chain that have not changed, the
caller/callee byte count for each pair is already available and is
incremented.
\item For new addresses in the chain, each caller/callee byte count is
looked up and updated.
\item For the tail of the chain (i.e., the function that called {\tt
malloc} directly), the direct allocation information is recorded.
\end{enumerate}

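The steps above might be sketched as follows. The hash-table layout,
sizes, and names are our assumptions; the cache keeps the pair record
for each caller/callee edge of the previous chain, so that edges whose
endpoints are unchanged skip the hash lookup entirely:

```c
#include <stdlib.h>
#include <string.h>

#define MAX_DEPTH 256
#define HSIZE     512   /* hash-table size is an assumption */

typedef struct pair {
    void *caller, *callee;
    long  bytes;             /* cumulative bytes allocated over the pair */
    struct pair *next;
} pair;

static pair *ptable[HSIZE];

/* Byte-swap XOR of the callee address, as described in the text. */
static unsigned hash_callee(void *callee)
{
    unsigned long a = (unsigned long)callee;
    return (unsigned)((a ^ (a >> 8) ^ (a >> 16) ^ (a >> 24)) % HSIZE);
}

static pair *pair_lookup(void *caller, void *callee)
{
    unsigned h = hash_callee(callee);
    for (pair *p = ptable[h]; p; p = p->next)
        if (p->caller == caller && p->callee == callee)
            return p;
    pair *p = calloc(1, sizeof *p);
    p->caller = caller; p->callee = callee;
    p->next = ptable[h]; ptable[h] = p;
    return p;
}

/* Cache of the previous chain and the pair records it touched. */
static void *prev_chain[MAX_DEPTH];
static pair *prev_pairs[MAX_DEPTH];
static int   prev_len;

long direct_bytes;   /* stands in for the per-function direct table */

/* chain[0] is the outermost caller (e.g. main); chain[len-1] is the
   function that called malloc directly. */
void record_allocation(void **chain, int len, long bytes)
{
    int same = 0;
    /* Step 2: find the common prefix with the previous chain. */
    while (same < len && same < prev_len && chain[same] == prev_chain[same])
        same++;
    for (int i = 0; i + 1 < len; i++) {
        /* Steps 3-4: pair i is cached only if both endpoints match. */
        pair *p = (i + 1 < same) ? prev_pairs[i]
                                 : pair_lookup(chain[i], chain[i + 1]);
        p->bytes += bytes;
        prev_pairs[i] = p;
    }
    memcpy(prev_chain, chain, len * sizeof *chain);
    prev_len = len;
    direct_bytes += bytes;   /* step 5: credit the direct caller */
}
```

Replaying the example from the text, recording
{\tt main}-$>${\tt foo}-$>${\tt bar}(24) and then
{\tt main}-$>${\tt foo}-$>${\tt otherbar}(10) leaves the
[{\tt main}, {\tt foo}] pair at 34 bytes, with only the changed tail
rehashed on the second call.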
Maintaining allocation call graph information requires a byte count
for every distinct caller/callee pair in every call chain that
allocates memory. Our experience is that there are a limited number
of such pairs, even in very large C programs, so that the memory
requirements of the {\tt mprof} monitor are not large (see
Section~\ref{mmeas:sec}).

\subsection{Reduction and Display}

The second phase of {\tt mprof} reads the output of the monitor,
reduces the data to create a dynamic call graph, and displays the data
in four tables. The first part of the data reduction is to map the
caller/callee address pairs to actual function names. A program, {\tt
mpfilt}, reads the executable file that created the monitor trace
(compiled so that symbol table information is retained) and outputs a
new set of function caller/callee relations. These relations are then
used to construct the subset of the program's dynamic call graph that
involved memory allocation.

The call graph initially can contain cycles due to recursion in the
program's execution. Cycles in the call graph introduce spurious
allocation relations, as is illustrated in Figure~\ref{recursive call
figure}. In this example, {\tt main} is credited as being indirectly
responsible for 10 bytes, but because we only keep track of
caller/callee pairs, {\tt F} appears to have requested 20 bytes from
{\tt G}, even though only 10 bytes were allocated.

\begin{figure}[htbp]
\begin{verbatim}
CALL STACK:              MPROF RECORDS:
main calls F             (10 bytes over main -> F)
F calls G                (10 bytes over F -> G)
G calls F                (10 bytes over G -> F)
F calls G                (10 MORE bytes over F -> G)
G calls malloc(10)       (10 bytes allocated in G)
\end{verbatim}
\caption{Problems Caused by Recursive Calls}
\label{recursive call figure}
\end{figure}

546 | We considered several solutions to the problems caused by cycles and | |
547 | adopted the most conservative solution. One way to avoid recording | |
548 | spurious allocation caused by recursion is for the monitor to identify | |
549 | the cycles before recording the allocation. For example, in | |
550 | Figure~\ref{recursive call figure}, the monitor could realize that it | |
551 | had already credited {\tt F} with the 10 bytes when it encountered | |
552 | {\tt F} calling {\tt G} the second time. This solution adds overhead | |
553 | to the monitor and conflicts with our goal to make the monitor as | |
554 | unobtrusive as possible. | |
555 | ||
556 | The solution that we adopted was to merge functions that are in a | |
557 | cycle into a single node in the reduction phase. Thus, each strongly | |
558 | connected component in the dynamic call graph is merged into a single | |
559 | node. The result is a call graph with no cycles. This process is | |
560 | also used by {\tt gprof}, and described carefully | |
561 | elsewhere~\cite{graham83:gprof}. Such an approach works well in {\tt | |
562 | gprof} because C programs, for which {\tt gprof} was primarily | |
563 | intended, tend to have limited amounts of recursion. Lisp programs, | |
564 | for which {\tt mprof} is also intended, intuitively contain much more | |
565 | recursion. We have experience profiling a number of large Common Lisp | |
566 | programs. We observe several recursive cycles in most programs, but | |
567 | the cycles generally contain a small percentage of the total functions | |
568 | and {\tt mprof} is quite effective. | |
569 | ||
570 | \subsection{Lisp Implementation} | |
571 | ||
572 | So far, we have described the implementation of {\tt mprof} for C. | |
573 | The Lisp implementation is quite similar, and here we describe the | |
574 | major differences. C has a single function, {\tt malloc}, that is | |
575 | called to allocate memory explicitly. Lisp has a large number of | |
576 | primitives that allocate memory implicitly (i.e., {\tt cons}, {\tt *}, | |
577 | {\tt intern}, etc.). To make {\tt mprof} work, these primitives must | |
578 | be modified so that every allocation is recorded. Fortunately, at the | |
579 | Lisp implementation level, all memory allocations may be channeled | |
580 | through a single routine. We worked with KCL (Kyoto Common Lisp), | |
581 | which is implemented in C. In KCL, all Lisp memory allocations are | |
582 | handled by a single function, {\tt alloc\_object}. Just as we had | |
583 | modified {\tt malloc} in C, we were able to simply patch {\tt | |
584 | alloc\_object} to monitor memory allocation in KCL. | |
585 | ||
586 | The other major difference in monitoring Lisp is that the addresses | |
587 | recorded by the monitor must be translated into Lisp function names. | |
588 | Again, KCL makes this quite easy because Lisp functions are defined in | |
589 | a central place in KCL and the names of the functions are known when | |
590 | they are defined. Many other Lisp systems are designed to allow | |
591 | return addresses to be mapped to symbolic function names so that the | |
592 | call stack can be printed at a breakpoint. In this case, the monitor | |
593 | can use the same mechanism to map return addresses to function names. | |
594 | Therefore, in Lisp systems in which addresses can be quickly mapped to | |
595 | function names, memory profiling in the style of {\tt mprof} is not a | |
596 | difficult problem. In systems in which symbolic names are not | |
597 | available in compiled code, profiling is more difficult. Furthermore, | |
598 | many systems open-code important allocation functions, like {\tt | |
599 | cons}. Because open-coded allocation functions will not necessarily | |
600 | call a central allocation function (like {\tt alloc\_object}), such | |
601 | allocations will not be observed by {\tt mprof}. To avoid such a loss | |
602 | of information, {\tt mprof} should be used in conjunction with program | |
603 | declarations that will force allocation functions such as {\tt cons} | |
604 | to be coded out-of-line. | |
605 | ||
606 | \section{Measurements} | |
607 | \label{measurements} | |
608 | ||
609 | We have measured the C implementation of {\tt mprof} by instrumenting | |
610 | four programs using {\tt mprof}. The first program, {\tt example}, is | |
611 | our example program with the number of widgets allocated increased to | |
612 | 100,000 to increase program execution time. The second program, {\tt | |
613 | fidilrt}, is the runtime library of FIDIL, a programming language for | |
614 | finite difference computations~\cite{hilfinger88:fidil}. The third | |
615 | program, {\tt epoxy}, is an electrical and physical layout optimizer | |
616 | written by Fred Obermeier~\cite{fwo87:epoxy}. The fourth program, | |
617 | {\tt crystal}, is a VLSI timing analysis | |
618 | program~\cite{ouster85:crystal}. These tests represent a small | |
619 | program ({\tt example}, 100 lines); a medium-sized program ({\tt | |
620 | fidilrt}, 7,100 lines); and two large programs ({\tt epoxy}, 11,000 | |
621 | lines and {\tt crystal}, 10,500 lines). In the remainder of this | |
622 | section, we will look at the resource consumption of {\tt mprof} from | |
623 | two perspectives: execution time overhead and space consumption. | |
624 | ||
625 | \subsection{Execution Time Overhead} | |
626 | ||
627 | There are two sources of execution time overhead associated with {\tt | |
628 | mprof}: additional time spent monitoring an application and the time | |
629 | to reduce and print the data produced by the monitor. The largest | |
630 | source of monitor overhead is the time required to traverse the | |
631 | complete call chain and associate allocations with caller/callee | |
632 | pairs. We implemented a version of {\tt mprof}, called {\tt mprof-}, | |
633 | which does not create the allocation call graph. With this version, | |
634 | we can see the relative cost of the allocation call graph. The ratio | |
635 | of the time spent with profiling to the time spent without profiling | |
636 | is called the {\em slowdown factor\/}. Table \ref{cpu:tab} summarizes | |
637 | the execution time overheads for our four applications. Measurements | |
638 | were gathered running the test programs on a VAX 8800 with 80 | |
639 | megabytes of physical memory. | |
640 | ||
641 | \begin{table}[htbp] | |
642 | \begin{singlespace} | |
643 | \begin{center} | |
644 | \begin{tabular}{|l|r|r|r|r|} \hline | |
645 | \multicolumn{1}{|c|}{} & \multicolumn{4}{|c|}{Cost} \\ \cline{2-5} | |
646 | \multicolumn{1}{|c|}{Resource Description} & | |
647 | \multicolumn{1}{|c|}{\tt example} & | |
648 | \multicolumn{1}{|c|}{\tt fidilrt} & | |
649 | \multicolumn{1}{|c|}{\tt epoxy} & | |
650 | \multicolumn{1}{|c|}{\tt crystal} \\ \hline | |
651 | Number of allocations & 100000 & 77376 & 306295 & 31158 \\ | |
652 | Execution time with {\tt mprof} (seconds) & 62.7 & 132.7 & 188.8 & 134.1 \\ | |
653 | Execution time with {\tt mprof-} (seconds) & 44.1 & 116.0 & 149.7 & 25.5 \\ | |
654 | Execution time without {\tt mprof} (seconds) & 17.9 & 107.1 & 52.1 & 13.2 \\ | |
655 | \hline | |
656 | Slowdown using {\tt mprof} & 3.5 & 1.2 & 3.6 & 10.1 \\ | |
657 | Slowdown using {\tt mprof-} & 2.5 & 1.1 & 2.9 & 1.9 \\ | |
658 | \hline | |
659 | Reduction and display time (seconds) & 10.3 & 28.8 & 28.3 & 36.8 \\ | |
660 | \hline | |
661 | \end{tabular} | |
662 | \caption{Execution Time Overhead of {\tt mprof}} | |
663 | \label{cpu:tab} | |
664 | \end{center} | |
665 | \end{singlespace} | |
666 | \end{table} | |
667 | ||
668 | The slowdown associated with {\tt mprof} varies widely, from 1.5 to | |
669 | 10. {\tt crystal} suffered the worst degradation from profiling | |
670 | because {\tt crystal} uses a depth-first algorithm that results in | |
671 | long call chains. Programs without long call chains appear to slow | |
672 | down by a factor of 2--4. If the allocation call graph is not | |
673 | generated and long call chains are not traversed, the slowdown is | |
674 | significantly less, especially in the extreme cases. Since {\tt | |
675 | mprof} is a prototype and has not been carefully optimized, this | |
676 | overhead seems acceptable. From the table, we see the reduction and | |
677 | display time is typically less than a minute. | |
678 | ||
679 | \subsection{Storage Consumption} | |
680 | \label{mmeas:sec} | |
681 | ||
682 | The storage consumption of {\tt mprof} is divided into the additional | |
683 | memory needed by the monitor as an application executes, and the | |
684 | external storage required by the profile data file. The most | |
685 | significant source of memory used by the monitor is the data stored | |
686 | with each object allocated: an object size and a pointer needed to | |
687 | construct the memory leak table. The monitor also uses memory to | |
688 | record the memory bins and caller/callee byte counts and must write | |
689 | this information to a file when the application is finished. We | |
690 | measured how many bytes of memory and disk space are needed to store | |
691 | this information. Table~\ref{mem:tab} summarizes the measurements of | |
692 | storage consumption associated with {\tt mprof}. | |
693 | ||
694 | \begin{table}[htbp] | |
695 | \begin{singlespace} | |
696 | \begin{center} | |
697 | \begin{tabular}{|l|r|r|r|r|} \hline | |
698 | \multicolumn{1}{|c|}{} & \multicolumn{4}{|c|}{Cost} \\ \cline{2-5} | |
699 | \multicolumn{1}{|c|}{Resource Description} & | |
700 | \multicolumn{1}{|c|}{\tt example} & | |
701 | \multicolumn{1}{|c|}{\tt fidilrt} & | |
702 | \multicolumn{1}{|c|}{\tt epoxy} & | |
703 | \multicolumn{1}{|c|}{\tt crystal} \\ \hline | |
704 | Number of allocations & 100000 & 61163 & 306295 & 31158 \\ | |
705 | User memory allocated (Kbytes) & 20000 & 2425 & 6418 & 21464 \\ | |
706 | \hline | |
707 | Per object memory (Kbytes) & 781 & 477 & 2393 & 168 \\ | |
708 | Other monitor memory (Kbytes) & 8.7 & 23.3 & 52.3 & 17.5 \\ | |
709 | Total monitor memory (Kbytes) & 790 & 500 & 2445 & 186 \\ | |
710 | Monitor fraction (\% memory allocated) & 4 & 17 & 28 & 1 \\ | |
711 | \hline | |
712 | Data file size (Kbytes) & 4.5 & 8.1 & 28.6 & 9.6 \\ | |
713 | \hline | |
714 | \end{tabular} | |
715 | \caption{Storage Consumption of {\tt mprof}} | |
716 | \label{mem:tab} | |
717 | \end{center} | |
718 | \end{singlespace} | |
719 | \end{table} | |
720 | ||
721 | The memory overhead of {\tt mprof} is small, except that an additional | |
722 | 8 bytes of storage are allocated with every object. In programs in | |
723 | which many small objects are allocated, like {\tt epoxy}, {\tt mprof} | |
724 | can contribute significantly to the total memory allocated. | |
725 | Nevertheless, in the worst case, {\tt mprof} increases application | |
726 | size by 1/3, and since {\tt mprof} is a development tool, this | |
727 | overhead seems acceptable. From the table we also see that the data | |
728 | files created by {\tt mprof} are quite small ($<$ 30 Kbytes). | |
729 | ||
730 | \section{Related Work} | |
731 | \label{related work} | |
732 | ||
733 | {\tt Mprof} is similar to the tool {\tt gprof}~\cite{graham83:gprof}, | |
734 | a dynamic execution profiler. Because some of the problems of | |
735 | interpreting the dynamic call graph are the same, we have borrowed | |
736 | these ideas from {\tt gprof}. Also, we have used ideas from the user | |
737 | interface of {\tt gprof} for two reasons: because the information | |
738 | being displayed is quite similar and because users familiar with {\tt | |
739 | gprof} would probably also be interested in {\tt mprof} and would | |
740 | benefit from a similar presentation. | |
741 | ||
742 | Barach, Taenzer, and Wells developed a tool for finding storage | |
743 | allocation errors in C programs~\cite{barach82:memprof}. Their | |
744 | approach concentrated on finding two specific storage allocation | |
745 | errors: memory leaks and duplicate frees. They modified {\tt malloc} | |
746 | and {\tt free} so that every time that these functions were called, | |
747 | information about the memory block being manipulated was recorded in a | |
748 | file. A program that examines this file, {\tt prleak}, prints out | |
749 | which memory blocks were never freed or were freed twice. This | |
750 | approach differs from {\tt mprof} in two ways. First, {\tt mprof} | |
751 | provides more information about the memory allocation of programs than | |
752 | {\tt prleak}, which just reports on storage errors. Second, {\tt | |
753 | prleak} generates extremely large intermediate files that are | |
754 | comparable in size to the total amount of memory allocated by the | |
755 | program (often megabytes of data). Although {\tt mprof} records more | |
756 | useful information, the data files it generates are of modest size | |
757 | (see the table above). | |
758 | ||
759 | \section{Conclusions} | |
760 | \label{conclusions} | |
761 | ||
762 | We have implemented a memory allocation profiling program for both C | |
763 | and Common Lisp. Our example has shown that {\tt mprof} can be | |
764 | effective in elucidating the allocation behavior of a program so that | |
765 | a programmer can detect memory leaks and identify major sources of | |
766 | allocation. | |
767 | ||
768 | Unlike {\tt gprof}, {\tt mprof} records every caller in the call chain | |
769 | every time an object is allocated. The overhead for this recording is | |
770 | large but not impractically so, because we take advantage of the fact | |
771 | that a call chain changes little between allocations. Moreover, | |
772 | recording this information does not require large amounts of memory | |
773 | because there are relatively few unique caller/callee address pairs on | |
774 | call chains in which allocation takes place, even in very large | |
775 | programs. We have measured the overhead of {\tt mprof}, and find that | |
776 | typically it slows applications by a factor of 2--4 times, and adds up | |
777 | to 33\% to the memory allocated by the application. Because {\tt | |
778 | mprof} is intended as a development tool, these costs are acceptable. | |
779 | ||
780 | Because {\tt mprof} merges cycles caused by recursive function calls, | |
781 | {\tt mprof} may be ineffective for programs with large cycles in their | |
782 | call graph. Only with more data will we be able to decide if many | |
783 | programs (especially those written in Lisp) contain so many recursive | |
784 | calls that cycle merging makes {\tt mprof} ineffective. Nevertheless, | |
785 | {\tt mprof} has already been effective in detecting KCL system | |
786 | functions that allocate memory extraneously.\footnote{Using {\tt | |
787 | mprof}, we noted that for a large object-oriented program written in | |
788 | KCL, the system function {\tt every} accounted for 13\% of the memory | |
789 | allocated. We rewrote {\tt every} so it would not allocate any | |
790 | memory, and decreased the memory consumption of the program by 13\%.} | |
791 | ||
792 | As a final note, we have received feedback from C application | |
793 | programmers who have used {\tt mprof}. They report that the memory | |
794 | leak table and the allocation bin table are both extremely useful, | |
795 | while the direct allocation table and the allocation call graph are | |
796 | harder to understand and also less useful. Considering the execution | |
797 | overhead associated with the allocation call graph and the complexity | |
798 | of the table, it is questionable whether the allocation call graph | |
799 | will ever be as helpful C programmers as the memory leak table. On | |
800 | the other hand, with automatic storage reclamation, the memory leak | |
801 | table becomes unnecessary. Yet for memory intensive languages, such | |
802 | as Lisp, the need for effective use of the memory is more important, | |
803 | and tools such as the allocation call graph might prove very useful. | |
804 | Because we have limited feedback from Lisp programmers using {\tt | |
805 | mprof}, we cannot report their response to this tool. | |
806 | ||
807 | \bibliography{/spur2/larus/bib/abbrev,lispm,gc,/spur2/larus/bib/optim,spur,misc,lisp} | |
808 | \bibliographystyle{plain} | |
809 | ||
810 | \end{document} | |
811 | % |