<?xml version="1.0"?> <!-- -*- sgml -*- -->
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd"
[ <!ENTITY % vg-entities SYSTEM "../../docs/xml/vg-entities.xml"> %vg-entities; ]>


<chapter id="dh-manual"
         xreflabel="DHAT: a dynamic heap analysis tool">
  <title>DHAT: a dynamic heap analysis tool</title>

<para>To use this tool, you must specify
<option>--tool=exp-dhat</option> on the Valgrind
command line.</para>



<sect1 id="dh-manual.overview" xreflabel="Overview">
<title>Overview</title>

<para>DHAT is a tool for examining how programs use their heap
allocations.</para>

<para>It tracks the allocated blocks, and inspects every memory access
to find which block, if any, it touches.  The following data is
collected and presented per allocation point (allocation
stack):</para>

<itemizedlist>
  <listitem><para>total allocation (number of bytes and
  blocks)</para></listitem>

  <listitem><para>maximum live volume (number of bytes and
  blocks)</para></listitem>

  <listitem><para>average block lifetime (number of instructions
   between allocation and freeing)</para></listitem>

  <listitem><para>average number of reads and writes to each byte in
   the block ("access ratios")</para></listitem>

  <listitem><para>for allocation points which always allocate blocks
   only of one size, and that size is 4096 bytes or less: counts
   showing how often each byte offset inside the block is
   accessed.</para></listitem>
</itemizedlist>

<para>Using these statistics it is possible to identify allocation
points with the following characteristics:</para>

<itemizedlist>

  <listitem><para>potential process-lifetime leaks: blocks allocated
   by the point just accumulate, and are freed only at the end of the
   run.</para></listitem>

  <listitem><para>excessive turnover: points which chew through a lot
   of
   heap, even if it is not held onto for very long</para></listitem>

  <listitem><para>excessively transient: points which allocate very
   short-lived blocks</para></listitem>

  <listitem><para>useless or underused allocations: blocks which are
   allocated but not completely filled in, or are filled in but not
   subsequently read.</para></listitem>

  <listitem><para>blocks with inefficient layout -- areas never
   accessed, or with hot fields scattered throughout the
   block.</para></listitem>
</itemizedlist>

<para>As with the Massif heap profiler, DHAT measures program progress
by counting instructions, and so presents all age/time related figures
as instruction counts.  This sounds a little odd at first, but it
makes runs repeatable in a way which is not possible if CPU time is
used.</para>

</sect1>




<sect1 id="dh-manual.understanding" xreflabel="Understanding DHAT's output">
<title>Understanding DHAT's output</title>


<para>DHAT provides a lot of useful information on dynamic heap usage.
Most of the art of using it is in interpretation of the resulting
numbers.  That is best illustrated via a set of examples.</para>


<sect2>
<title>Interpreting the max-live, tot-alloc and deaths fields</title>

<sect3><title>A simple example</title></sect3>

<screen><![CDATA[
   ======== SUMMARY STATISTICS ========

   guest_insns:  1,045,339,534
   [...]
   max-live:    63,490 in 984 blocks
   tot-alloc:   1,904,700 in 29,520 blocks (avg size 64.52)
   deaths:      29,520, at avg age 22,227,424
   acc-ratios:  6.37 rd, 1.14 wr  (12,141,526 b-read, 2,174,460 b-written)
      at 0x4C275B8: malloc (vg_replace_malloc.c:236)
      by 0x40350E: tcc_malloc (tinycc.c:6712)
      by 0x404580: tok_alloc_new (tinycc.c:7151)
      by 0x40870A: next_nomacro1 (tinycc.c:9305)
]]></screen>

<para>Over the entire run of the program, this stack (allocation
point) allocated 29,520 blocks in total, containing 1,904,700 bytes in
total.  By looking at the max-live data, we see that not many blocks
were simultaneously live, though: at the peak, there were 63,490
allocated bytes in 984 blocks.  This tells us that the program is
steadily freeing such blocks as it runs, rather than hanging on to all
of them until the end and freeing them all.</para>

<para>The deaths entry tells us that 29,520 blocks allocated by this stack
died (were freed) during the run of the program.  Since 29,520 is
also the number of blocks allocated in total, that tells us that
all allocated blocks were freed by the end of the program.</para>

<para>It also tells us that the average age at death was 22,227,424
instructions.  From the summary statistics we see that the program ran
for 1,045,339,534 instructions, and so the average age at death is
about 2% of the program's total run time.</para>

<sect3><title>Example of a potential process-lifetime leak</title></sect3>

<para>This next example (from a different program than the above)
shows a potential process lifetime leak.  A process lifetime leak
occurs when a program keeps allocating data, but only frees the
data just before it exits.  Hence the program's heap grows constantly
in size, yet Memcheck reports no leak, because the program has
freed up everything at exit.
This is particularly a hazard for
long-running programs.</para>

<screen><![CDATA[
   ======== SUMMARY STATISTICS ========

   guest_insns:  418,901,537
   [...]
   max-live:    32,512 in 254 blocks
   tot-alloc:   32,512 in 254 blocks (avg size 128.00)
   deaths:      254, at avg age 300,467,389
   acc-ratios:  0.26 rd, 0.20 wr  (8,756 b-read, 6,604 b-written)
      at 0x4C275B8: malloc (vg_replace_malloc.c:236)
      by 0x4C27632: realloc (vg_replace_malloc.c:525)
      by 0x56FF41D: QtFontStyle::pixelSize(unsigned short, bool) (qfontdatabase.cpp:269)
      by 0x5700D69: loadFontConfig() (qfontdatabase_x11.cpp:1146)
]]></screen>

<para>There are two tell-tale signs that this might be a
process-lifetime leak.  Firstly, the max-live and tot-alloc numbers
are identical.  The only way that can happen is if these blocks are
all allocated and then all deallocated.</para>

<para>Secondly, the average age at death (300 million insns) is 71% of
the total program lifetime (419 million insns), hence this is not a
transient allocation-free spike -- rather, it is spread out over a
large part of the entire run.
One interpretation is, roughly, that
all 254 blocks were allocated in the first half of the run, held onto
for the second half, and then freed just before exit.</para>

</sect2>


<sect2>
<title>Interpreting the acc-ratios fields</title>


<sect3><title>A fairly harmless allocation point record</title></sect3>

<screen><![CDATA[
   max-live:    49,398 in 808 blocks
   tot-alloc:   1,481,940 in 24,240 blocks (avg size 61.13)
   deaths:      24,240, at avg age 34,611,026
   acc-ratios:  2.13 rd, 0.91 wr  (3,166,650 b-read, 1,358,820 b-written)
      at 0x4C275B8: malloc (vg_replace_malloc.c:236)
      by 0x40350E: tcc_malloc (tinycc.c:6712)
      by 0x404580: tok_alloc_new (tinycc.c:7151)
      by 0x4046C4: tok_alloc (tinycc.c:7190)
]]></screen>

<para>The acc-ratios field tells us that each byte in the blocks
allocated here is read an average of 2.13 times before the block is
deallocated.  Given that the blocks have an average age at death of
34,611,026 instructions, that's roughly one read of each byte every
16 million instructions.  So from that standpoint the blocks aren't
"working" very hard.</para>

<para>More interesting is the write ratio: each byte is written an
average of 0.91 times.  This tells us that some parts of the allocated
blocks are never written, at least 9% on average.  To completely
initialise the block would require writing each byte at least once,
and that would give a write ratio of 1.0.  The fact that some block
areas are evidently unused might point to data alignment holes or
other layout inefficiencies.</para>

<para>Well, at least all the blocks are freed (24,240 allocations,
24,240 deaths).</para>

<para>If all the blocks had been the same size, DHAT would also show
the access counts by block offset, so we could see where exactly these
unused areas are.
However, that isn't the case: the blocks have
varying sizes, so DHAT can't perform such an analysis.  We can see
that they must have varying sizes since the average block size, 61.13,
isn't a whole number.</para>


<sect3><title>A more suspicious looking example</title></sect3>

<screen><![CDATA[
   max-live:    180,224 in 22 blocks
   tot-alloc:   180,224 in 22 blocks (avg size 8192.00)
   deaths:      none (none of these blocks were freed)
   acc-ratios:  0.00 rd, 0.00 wr  (0 b-read, 0 b-written)
      at 0x4C275B8: malloc (vg_replace_malloc.c:236)
      by 0x40350E: tcc_malloc (tinycc.c:6712)
      by 0x40369C: __sym_malloc (tinycc.c:6787)
      by 0x403711: sym_malloc (tinycc.c:6805)
]]></screen>

<para>Here, both the read and write access ratios are zero.  Hence
this point is allocating blocks which are never used, neither read nor
written.  Indeed, they are also not freed ("deaths: none") and are
simply leaked.  So, here is 180k of completely useless allocation that
could be removed.</para>

<para>Re-running with Memcheck does indeed report the same leak.  What
DHAT can tell us, that Memcheck can't, is that not only are the blocks
leaked, they are also never used.</para>

<sect3><title>Another suspicious example</title></sect3>

<para>Here's one where blocks are allocated, written to,
but never read from.  We see this immediately from the zero read
access ratio.
They do get freed, though:</para>

<screen><![CDATA[
   max-live:    54 in 3 blocks
   tot-alloc:   1,620 in 90 blocks (avg size 18.00)
   deaths:      90, at avg age 34,558,236
   acc-ratios:  0.00 rd, 1.11 wr  (0 b-read, 1,800 b-written)
      at 0x4C275B8: malloc (vg_replace_malloc.c:236)
      by 0x40350E: tcc_malloc (tinycc.c:6712)
      by 0x4035BD: tcc_strdup (tinycc.c:6750)
      by 0x41FEBB: tcc_add_sysinclude_path (tinycc.c:20931)
]]></screen>

<para>In the previous two examples, it is easy to see blocks that are
never written to, or never read from, or some combination of both.
Unfortunately, in C++ code, the situation is less clear.  That's
because an object's constructor will write to the underlying block,
and its destructor will read from it.  So the block's read and write
ratios will be non-zero even if the object, once constructed, is never
used, but only eventually destructed.</para>

<para>Really, what we want is to measure only memory accesses in
between the end of an object's construction and the start of its
destruction.  Unfortunately I do not know of a reliable way to
determine when those transitions are made.</para>


</sect2>

<sect2>
<title>Interpreting "Aggregated access counts by offset" data</title>

<para>For allocation points that always allocate blocks of the same
size, and which are 4096 bytes or smaller, DHAT counts accesses
per offset, for example:</para>

<screen><![CDATA[
   max-live:    317,408 in 5,668 blocks
   tot-alloc:   317,408 in 5,668 blocks (avg size 56.00)
   deaths:      5,668, at avg age 622,890,597
   acc-ratios:  1.03 rd, 1.28 wr  (327,642 b-read, 408,172 b-written)
      at 0x4C275B8: malloc (vg_replace_malloc.c:236)
      by 0x5440C16: QDesignerPropertySheetPrivate::ensureInfo (qhash.h:515)
      by 0x544350B: QDesignerPropertySheet::setVisible (qdesigner_propertysh...)
      by 0x5446232: QDesignerPropertySheet::QDesignerPropertySheet (qdesigne...)

   Aggregated access counts by offset:

   [   0]  28782 28782 28782 28782 28782 28782 28782 28782
   [   8]  20638 20638 20638 20638 0 0 0 0
   [  16]  22738 22738 22738 22738 22738 22738 22738 22738
   [  24]  6013 6013 6013 6013 6013 6013 6013 6013
   [  32]  18883 18883 18883 37422 0 0 0 0
   [  40]  5668 11915 5668 5668 11336 11336 11336 11336
   [  48]  6166 6166 6166 6166 0 0 0 0
]]></screen>

<para>This is fairly typical, for C++ code running on a 64-bit
platform.  Here, we have aggregated access statistics for 5668 blocks,
all of size 56 bytes.  Each byte has been accessed at least 5668
times, except for offsets 12--15, 36--39 and 52--55.  These are likely
to be alignment holes.</para>

<para>Careful interpretation of the numbers reveals useful information.
Groups of N consecutive identical numbers that begin at an N-aligned
offset, for N being 2, 4 or 8, are likely to indicate an N-byte object
in the structure at that point.  For example, the first 32 bytes of
this object are likely to have the layout</para>

<screen><![CDATA[
   [0 ]  64-bit type
   [8 ]  32-bit type
   [12]  32-bit alignment hole
   [16]  64-bit type
   [24]  64-bit type
]]></screen>

<para>As a counterexample, it's also clear that, whatever is at offset 32,
it is not a 32-bit value.  That's because the last number of the group
(37422) is not the same as the first three (18883 18883 18883).</para>

<para>This example leads one to enquire (by reading the source code)
whether the zeroes at 12--15 and 52--55 are alignment holes, and
whether 48--51 is indeed a 32-bit type.  If so, it might be possible
to place what's at 48--51 at 12--15 instead, which would reduce
the object size from 56 to 48 bytes.</para>

<para>Bear in mind that the above inferences are all only "maybes".
That's
because they are based on dynamic data, not static analysis of the
object layout.  For example, the zeroes might not be alignment
holes, but rather just parts of the structure which were not used
at all for this particular run.  Experience shows that's unlikely
to be the case, but it could happen.</para>

</sect2>

</sect1>







<sect1 id="dh-manual.options" xreflabel="DHAT Command-line Options">
<title>DHAT Command-line Options</title>

<para>DHAT-specific command-line options are:</para>

<!-- start of xi:include in the manpage -->
<variablelist id="dh.opts.list">

  <varlistentry id="opt.show-top-n" xreflabel="--show-top-n">
    <term>
      <option><![CDATA[--show-top-n=<number>
      [default: 10] ]]></option>
    </term>
    <listitem>
      <para>At the end of the run, DHAT sorts the accumulated
       allocation points according to some metric, and shows the
       highest scoring entries.  <varname>--show-top-n</varname>
       controls how many entries are shown.  The default of 10 is
       quite small.  For realistic applications you will probably need
       to set it much higher, at least several hundred.</para>
    </listitem>
  </varlistentry>

  <varlistentry id="opt.sort-by" xreflabel="--sort-by=string">
    <term>
      <option><![CDATA[--sort-by=<string> [default: max-bytes-live] ]]></option>
    </term>
    <listitem>
      <para>At the end of the run, DHAT sorts the accumulated
       allocation points according to some metric, and shows the
       highest scoring entries.
       <varname>--sort-by</varname>
       selects the metric used for sorting:</para>
      <para><varname>max-bytes-live </varname> maximum live bytes [default]</para>
      <para><varname>tot-bytes-allocd </varname> bytes allocated in total (turnover)</para>
      <para><varname>max-blocks-live </varname> maximum live blocks</para>
      <para><varname>tot-blocks-allocd </varname> blocks allocated in total (turnover)</para>
      <para>This controls the order in which allocation points are
       displayed.  You can choose to look at allocation points with
       the highest number of live bytes, the highest total byte turnover,
       the highest number of live blocks, or the highest total block
       turnover.  These give usefully different pictures of program behaviour.
       For example, sorting by maximum live blocks tends to show up allocation
       points creating large numbers of small objects.</para>
    </listitem>
  </varlistentry>

</variablelist>

<para>One important point to note is that each allocation stack counts
as a separate allocation point.  Because stacks by default have 12
frames, this tends to spread data out over multiple allocation points.
You may want to use the flag <option>--num-callers=4</option> or some
such small number, to reduce the spreading.</para>

<!-- end of xi:include in the manpage -->

</sect1>

</chapter>