<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>5. Cachegrind: a cache and branch-prediction profiler</title>
<link rel="stylesheet" href="vg_basic.css" type="text/css">
<meta name="generator" content="DocBook XSL Stylesheets V1.75.2">
<link rel="home" href="index.html" title="Valgrind Documentation">
<link rel="up" href="manual.html" title="Valgrind User Manual">
<link rel="prev" href="mc-manual.html" title="4.Memcheck: a memory error detector">
<link rel="next" href="cl-manual.html" title="6.Callgrind: a call-graph generating cache and branch prediction profiler">
</head>
<body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">
<div><table class="nav" width="100%" cellspacing="3" cellpadding="3" border="0" summary="Navigation header"><tr>
<td width="22px" align="center" valign="middle"><a accesskey="p" href="mc-manual.html"><img src="images/prev.png" width="18" height="21" border="0" alt="Prev"></a></td>
<td width="25px" align="center" valign="middle"><a accesskey="u" href="manual.html"><img src="images/up.png" width="21" height="18" border="0" alt="Up"></a></td>
<td width="31px" align="center" valign="middle"><a accesskey="h" href="index.html"><img src="images/home.png" width="27" height="20" border="0" alt="Up"></a></td>
<th align="center" valign="middle">Valgrind User Manual</th>
<td width="22px" align="center" valign="middle"><a accesskey="n" href="cl-manual.html"><img src="images/next.png" width="18" height="21" border="0" alt="Next"></a></td>
</tr></table></div>
<div class="chapter" title="5.Cachegrind: a cache and branch-prediction profiler">
<div class="titlepage"><div><div><h2 class="title">
<a name="cg-manual"></a>5. Cachegrind: a cache and branch-prediction profiler</h2></div></div></div>
<div class="toc">
<p><b>Table of Contents</b></p>
<dl>
<dt><span class="sect1"><a href="cg-manual.html#cg-manual.overview">5.1.
Overview</a></span></dt>
<dt><span class="sect1"><a href="cg-manual.html#cg-manual.profile">5.2. Using Cachegrind, cg_annotate and cg_merge</a></span></dt>
<dd><dl>
<dt><span class="sect2"><a href="cg-manual.html#cg-manual.running-cachegrind">5.2.1. Running Cachegrind</a></span></dt>
<dt><span class="sect2"><a href="cg-manual.html#cg-manual.outputfile">5.2.2. Output File</a></span></dt>
<dt><span class="sect2"><a href="cg-manual.html#cg-manual.running-cg_annotate">5.2.3. Running cg_annotate</a></span></dt>
<dt><span class="sect2"><a href="cg-manual.html#cg-manual.the-output-preamble">5.2.4. The Output Preamble</a></span></dt>
<dt><span class="sect2"><a href="cg-manual.html#cg-manual.the-global">5.2.5. The Global and Function-level Counts</a></span></dt>
<dt><span class="sect2"><a href="cg-manual.html#cg-manual.line-by-line">5.2.6. Line-by-line Counts</a></span></dt>
<dt><span class="sect2"><a href="cg-manual.html#cg-manual.assembler">5.2.7. Annotating Assembly Code Programs</a></span></dt>
<dt><span class="sect2"><a href="cg-manual.html#ms-manual.forkingprograms">5.2.8. Forking Programs</a></span></dt>
<dt><span class="sect2"><a href="cg-manual.html#cg-manual.annopts.warnings">5.2.9. cg_annotate Warnings</a></span></dt>
<dt><span class="sect2"><a href="cg-manual.html#cg-manual.annopts.things-to-watch-out-for">5.2.10. Unusual Annotation Cases</a></span></dt>
<dt><span class="sect2"><a href="cg-manual.html#cg-manual.cg_merge">5.2.11. Merging Profiles with cg_merge</a></span></dt>
<dt><span class="sect2"><a href="cg-manual.html#cg-manual.cg_diff">5.2.12. Differencing Profiles with cg_diff</a></span></dt>
</dl></dd>
<dt><span class="sect1"><a href="cg-manual.html#cg-manual.cgopts">5.3. Cachegrind Command-line Options</a></span></dt>
<dt><span class="sect1"><a href="cg-manual.html#cg-manual.annopts">5.4.
cg_annotate Command-line Options</a></span></dt>
<dt><span class="sect1"><a href="cg-manual.html#cg-manual.diffopts">5.5. cg_diff Command-line Options</a></span></dt>
<dt><span class="sect1"><a href="cg-manual.html#cg-manual.acting-on">5.6. Acting on Cachegrind's Information</a></span></dt>
<dt><span class="sect1"><a href="cg-manual.html#cg-manual.sim-details">5.7. Simulation Details</a></span></dt>
<dd><dl>
<dt><span class="sect2"><a href="cg-manual.html#cache-sim">5.7.1. Cache Simulation Specifics</a></span></dt>
<dt><span class="sect2"><a href="cg-manual.html#branch-sim">5.7.2. Branch Simulation Specifics</a></span></dt>
<dt><span class="sect2"><a href="cg-manual.html#cg-manual.annopts.accuracy">5.7.3. Accuracy</a></span></dt>
</dl></dd>
<dt><span class="sect1"><a href="cg-manual.html#cg-manual.impl-details">5.8. Implementation Details</a></span></dt>
<dd><dl>
<dt><span class="sect2"><a href="cg-manual.html#cg-manual.impl-details.how-cg-works">5.8.1. How Cachegrind Works</a></span></dt>
<dt><span class="sect2"><a href="cg-manual.html#cg-manual.impl-details.file-format">5.8.2. Cachegrind Output File Format</a></span></dt>
</dl></dd>
</dl>
</div>
<p>To use this tool, you must specify
<code class="option">--tool=cachegrind</code> on the
Valgrind command line.</p>
<div class="sect1" title="5.1.Overview">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="cg-manual.overview"></a>5.1. Overview</h2></div></div></div>
<p>Cachegrind simulates how your program interacts with a machine's cache
hierarchy and (optionally) branch predictor. It simulates a machine with
independent first-level instruction and data caches (I1 and D1), backed by a
unified second-level cache (L2). This exactly matches the configuration of
many modern machines.</p>
<p>However, some modern machines have three levels of cache.
For these
machines (in the cases where Cachegrind can auto-detect the cache
configuration) Cachegrind simulates the first-level and third-level caches.
The reason for this choice is that the L3 cache has the most influence on
runtime, as it masks accesses to main memory. Furthermore, the L1 caches
often have low associativity, so simulating them can detect cases where the
code interacts badly with this cache (e.g. traversing a matrix column-wise
with the row length being a power of 2).</p>
<p>Therefore, Cachegrind always refers to the I1, D1 and LL (last-level)
caches.</p>
<p>
Cachegrind gathers the following statistics (the abbreviation used for each
statistic is given in parentheses):</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p>I cache reads (<code class="computeroutput">Ir</code>,
which equals the number of instructions executed),
I1 cache read misses (<code class="computeroutput">I1mr</code>) and
LL cache instruction read misses (<code class="computeroutput">ILmr</code>).
</p></li>
<li class="listitem"><p>D cache reads (<code class="computeroutput">Dr</code>, which
equals the number of memory reads),
D1 cache read misses (<code class="computeroutput">D1mr</code>), and
LL cache data read misses (<code class="computeroutput">DLmr</code>).
</p></li>
<li class="listitem"><p>D cache writes (<code class="computeroutput">Dw</code>, which equals
the number of memory writes),
D1 cache write misses (<code class="computeroutput">D1mw</code>), and
LL cache data write misses (<code class="computeroutput">DLmw</code>).
</p></li>
<li class="listitem"><p>Conditional branches executed (<code class="computeroutput">Bc</code>) and
conditional branches mispredicted (<code class="computeroutput">Bcm</code>).
</p></li>
<li class="listitem"><p>Indirect branches executed (<code class="computeroutput">Bi</code>) and
indirect branches mispredicted (<code class="computeroutput">Bim</code>).
</p></li>
</ul></div>
<p>Note that the total number of D1 misses is given by
<code class="computeroutput">D1mr</code> +
<code class="computeroutput">D1mw</code>, and that the total number of LL
misses is given by <code class="computeroutput">ILmr</code> +
<code class="computeroutput">DLmr</code> +
<code class="computeroutput">DLmw</code>.
</p>
<p>These statistics are presented for the entire program and for each
function in the program. You can also annotate each line of source code in
the program with the counts that were caused directly by it.</p>
<p>On a modern machine, an L1 miss will typically cost
around 10 cycles, an LL miss can cost as much as 200
cycles, and a mispredicted branch costs in the region of 10
to 30 cycles. Detailed cache and branch profiling can be very useful
for understanding how your program interacts with the machine and thus how
to make it faster.</p>
<p>Also, since one instruction cache read is performed per
instruction executed, you can find out how many instructions are
executed per line, which can be useful for traditional profiling.</p>
</div>
<div class="sect1" title="5.2.Using Cachegrind, cg_annotate and cg_merge">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="cg-manual.profile"></a>5.2. Using Cachegrind, cg_annotate and cg_merge</h2></div></div></div>
<p>First off, as for normal Valgrind use, you probably want to
compile with debugging info (the
<code class="option">-g</code> option).
But by contrast with
normal Valgrind use, you probably do want to turn
optimisation on, since you should profile your program as it will
be normally run.</p>
<p>Then, you need to run Cachegrind itself to gather the profiling
information, and then run cg_annotate to get a detailed presentation of that
information. As an optional intermediate step, you can use cg_merge to sum
together the outputs of multiple Cachegrind runs into a single file which
you then use as the input for cg_annotate. Alternatively, you can use
cg_diff to difference the outputs of two Cachegrind runs into a single file
which you then use as the input for cg_annotate.</p>
<div class="sect2" title="5.2.1.Running Cachegrind">
<div class="titlepage"><div><div><h3 class="title">
<a name="cg-manual.running-cachegrind"></a>5.2.1. Running Cachegrind</h3></div></div></div>
<p>To run Cachegrind on a program <code class="filename">prog</code>, run:</p>
<pre class="screen">
valgrind --tool=cachegrind prog
</pre>
<p>The program will execute (slowly).
Upon completion,
summary statistics that look like this will be printed:</p>
<pre class="programlisting">
==31751== I   refs:      27,742,716
==31751== I1  misses:           276
==31751== LLi misses:           275
==31751== I1  miss rate:        0.0%
==31751== LLi miss rate:        0.0%
==31751==
==31751== D   refs:      15,430,290  (10,955,517 rd + 4,474,773 wr)
==31751== D1  misses:        41,185  (    21,905 rd +    19,280 wr)
==31751== LLd misses:        23,085  (     3,987 rd +    19,098 wr)
==31751== D1  miss rate:        0.2% (       0.1%   +       0.4%  )
==31751== LLd miss rate:        0.1% (       0.0%   +       0.4%  )
==31751==
==31751== LL misses:         23,360  (     4,262 rd +    19,098 wr)
==31751== LL miss rate:         0.0% (       0.0%   +       0.4%  )</pre>
<p>Cache accesses for instruction fetches are summarised
first, giving the number of fetches made (this is the number of
instructions executed, which can be useful to know in its own
right), the number of I1 misses, and the number of LL instruction
(<code class="computeroutput">LLi</code>) misses.</p>
<p>Cache accesses for data follow. The information is similar
to that of the instruction fetches, except that the values are
also shown split between reads and writes (note each row's
<code class="computeroutput">rd</code> and
<code class="computeroutput">wr</code> values add up to the row's
total).</p>
<p>Combined instruction and data figures for the LL cache
follow that. Note that the LL miss rate is computed relative to the total
number of memory accesses, not the number of L1 misses. I.e. it is
<code class="computeroutput">(ILmr + DLmr + DLmw) / (Ir + Dr + Dw)</code>
not
<code class="computeroutput">(ILmr + DLmr + DLmw) / (I1mr + D1mr + D1mw)</code>
</p>
<p>Branch prediction statistics are not collected by default.
To do so, add the option <code class="option">--branch-sim=yes</code>.</p>
</div>
<div class="sect2" title="5.2.2.Output File">
<div class="titlepage"><div><div><h3 class="title">
<a name="cg-manual.outputfile"></a>5.2.2. Output File</h3></div></div></div>
<p>As well as printing summary information, Cachegrind also writes
more detailed profiling information to a file. By default this file is named
<code class="filename">cachegrind.out.&lt;pid&gt;</code> (where
<code class="filename">&lt;pid&gt;</code> is the program's process ID), but its name
can be changed with the <code class="option">--cachegrind-out-file</code> option. This
file is human-readable, but is intended to be interpreted by the
accompanying program cg_annotate, described in the next section.</p>
<p>The default <code class="computeroutput">.&lt;pid&gt;</code> suffix
on the output file name serves two purposes. Firstly, it means you
don't have to rename old log files that you don't want to overwrite.
Secondly, and more importantly, it allows correct profiling with the
<code class="option">--trace-children=yes</code> option of
programs that spawn child processes.</p>
<p>The output file can be big, many megabytes for large applications
built with full debugging information.</p>
</div>
<div class="sect2" title="5.2.3.Running cg_annotate">
<div class="titlepage"><div><div><h3 class="title">
<a name="cg-manual.running-cg_annotate"></a>5.2.3. Running cg_annotate</h3></div></div></div>
<p>Before using cg_annotate,
it is worth widening your window to be at least 120 characters
wide if possible, as the output lines can be quite long.</p>
<p>To get a function-by-function summary, run:</p>
<pre class="screen">cg_annotate &lt;filename&gt;</pre>
<p>on a Cachegrind output file.</p>
</div>
<div class="sect2" title="5.2.4.The Output Preamble">
<div class="titlepage"><div><div><h3 class="title">
<a name="cg-manual.the-output-preamble"></a>5.2.4. The Output Preamble</h3></div></div></div>
<p>The first part of the output looks like this:</p>
<pre class="programlisting">
--------------------------------------------------------------------------------
I1 cache:              65536 B, 64 B, 2-way associative
D1 cache:              65536 B, 64 B, 2-way associative
LL cache:              262144 B, 64 B, 8-way associative
Command:               concord vg_to_ucode.c
Events recorded:       Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown:          Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order:      Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Threshold:             99%
Chosen for annotation:
Auto-annotation:       off
</pre>
<p>This is a summary of the annotation options:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p>I1 cache, D1 cache, LL cache: cache configuration.
So
you know the configuration with which these results were
obtained.</p></li>
<li class="listitem"><p>Command: the command line invocation of the program
under examination.</p></li>
<li class="listitem"><p>Events recorded: which events were recorded.</p></li>
<li class="listitem"><p>Events shown: the events shown, which is a subset of the events
gathered. This can be adjusted with the
<code class="option">--show</code> option.</p></li>
<li class="listitem">
<p>Event sort order: the sort order in which functions are
shown. For example, in this case the functions are sorted
from highest <code class="computeroutput">Ir</code> counts to
lowest. If two functions have identical
<code class="computeroutput">Ir</code> counts, they will then be
sorted by <code class="computeroutput">I1mr</code> counts, and
so on. This order can be adjusted with the
<code class="option">--sort</code> option.</p>
<p>Note that this dictates the order in which the functions appear.
It is <span class="emphasis"><em>not</em></span> the order in which the columns
appear; that is dictated by the "events shown" line (and can
be changed with the <code class="option">--show</code>
option).</p>
</li>
<li class="listitem"><p>Threshold: cg_annotate
by default omits functions that cause very low counts
to avoid drowning you in information. In this case,
cg_annotate shows summaries of the functions that account for
99% of the <code class="computeroutput">Ir</code> counts;
<code class="computeroutput">Ir</code> is chosen as the
threshold event since it is the primary sort event.
The
threshold can be adjusted with the
<code class="option">--threshold</code>
option.</p></li>
<li class="listitem"><p>Chosen for annotation: names of files specified
manually for annotation; in this case none.</p></li>
<li class="listitem"><p>Auto-annotation: whether auto-annotation was requested
via the <code class="option">--auto=yes</code>
option. In this case no.</p></li>
</ul></div>
</div>
<div class="sect2" title="5.2.5.The Global and Function-level Counts">
<div class="titlepage"><div><div><h3 class="title">
<a name="cg-manual.the-global"></a>5.2.5. The Global and Function-level Counts</h3></div></div></div>
<p>Then follow summary statistics for the whole
program:</p>
<pre class="programlisting">
--------------------------------------------------------------------------------
        Ir I1mr ILmr         Dr   D1mr  DLmr        Dw   D1mw   DLmw
--------------------------------------------------------------------------------
27,742,716  276  275 10,955,517 21,905 3,987 4,474,773 19,280 19,098  PROGRAM TOTALS</pre>
<p>
These are similar to the summary provided when Cachegrind finishes running.
</p>
<p>Then come function-by-function statistics:</p>
<pre class="programlisting">
--------------------------------------------------------------------------------
        Ir I1mr ILmr         Dr   D1mr  DLmr        Dw   D1mw   DLmw  file:function
--------------------------------------------------------------------------------
 8,821,482    5    5  2,242,702  1,621    73 1,794,230      0      0  getc.c:_IO_getc
 5,222,023    4    4  2,276,334     16    12   875,959      1      1  concord.c:get_word
 2,649,248    2    2  1,344,810  7,326 1,385         .      .      .  vg_main.c:strcmp
 2,521,927    2    2    591,215      0     0   179,398      0      0  concord.c:hash
 2,242,740    2    2  1,046,612    568    22   448,548      0      0  ctype.c:tolower
 1,496,937    4    4    630,874  9,000 1,400   279,388      0      0  concord.c:insert
   897,991   51   51    897,831     95    30        62      1      1  ???:???
   598,068    1    1    299,034      0     0   149,517      0      0  ../sysdeps/generic/lockfile.c:__flockfile
   598,068    0    0    299,034      0     0   149,517      0      0  ../sysdeps/generic/lockfile.c:__funlockfile
   598,024    4    4    213,580     35    16   149,506      0      0  vg_clientmalloc.c:malloc
   446,587    1    1    215,973  2,167   430   129,948 14,057 13,957  concord.c:add_existing
   341,760    2    2    128,160      0     0   128,160      0      0  vg_clientmalloc.c:vg_trap_here_WRAPPER
   320,782    4    4    150,711    276     0    56,027     53     53  concord.c:init_hash_table
   298,998    1    1    106,785      0     0    64,071      1      1  concord.c:create
   149,518    0    0    149,516      0     0         1      0      0  ???:tolower@@GLIBC_2.0
   149,518    0    0    149,516      0     0         1      0      0  ???:fgetc@@GLIBC_2.0
    95,983    4    4     38,031      0     0    34,409  3,152  3,150  concord.c:new_word_node
    85,440    0    0     42,720      0     0    21,360      0      0  vg_clientmalloc.c:vg_bogus_epilogue</pre>
<p>Each function
is identified by a
<code class="computeroutput">file_name:function_name</code> pair. If
a column contains only a dot it means the function never performs
that event (e.g. the third row shows that
<code class="computeroutput">strcmp()</code> contains no
instructions that write to memory). The name
<code class="computeroutput">???</code> is used if the file name
and/or function name could not be determined from debugging
information. If most of the entries have the form
<code class="computeroutput">???:???</code> the program probably
wasn't compiled with <code class="option">-g</code>.</p>
<p>It is worth noting that functions will come both from
the profiled program (e.g. <code class="filename">concord.c</code>)
and from libraries (e.g.
<code class="filename">getc.c</code>).</p>
</div>
<div class="sect2" title="5.2.6.Line-by-line Counts">
<div class="titlepage"><div><div><h3 class="title">
<a name="cg-manual.line-by-line"></a>5.2.6. Line-by-line Counts</h3></div></div></div>
<p>There are two ways to annotate source files -- by specifying them
manually as arguments to cg_annotate, or with the
<code class="option">--auto=yes</code> option. For example, running
<code class="filename">cg_annotate &lt;filename&gt; concord.c</code> on our example
produces the same output as above, followed by an annotated version of
<code class="filename">concord.c</code>, a section of which looks like:</p>
<pre class="programlisting">
--------------------------------------------------------------------------------
-- User-annotated source: concord.c
--------------------------------------------------------------------------------
     Ir I1mr ILmr     Dr D1mr DLmr     Dw D1mw DLmw

      .    .    .      .    .    .      .    .    .  void init_hash_table(char *file_name, Word_Node *table[])
      3    1    1      .    .    .      1    0    0  {
      .    .    .      .    .    .      .    .    .      FILE *file_ptr;
      .    .    .      .    .    .      .    .    .      Word_Info *data;
      1    0    0      .    .    .      1    1    1      int line = 1, i;
      .    .    .      .    .    .      .    .    .
      5    0    0      .    .    .      3    0    0      data = (Word_Info *) create(sizeof(Word_Info));
      .    .    .      .    .    .      .    .    .
  4,991    0    0  1,995    0    0    998    0    0      for (i = 0; i &lt; TABLE_SIZE; i++)
  3,988    1    1  1,994    0    0    997   53   52          table[i] = NULL;
      .    .    .      .    .    .      .    .    .
      .    .    .      .    .    .      .    .    .      /* Open file, check it. */
      6    0    0      1    0    0      4    0    0      file_ptr = fopen(file_name, "r");
      2    0    0      1    0    0      .    .    .      if (!(file_ptr)) {
      .    .    .      .    .    .      .    .    .          fprintf(stderr, "Couldn't open '%s'.\n", file_name);
      1    1    1      .    .    .      .    .    .          exit(EXIT_FAILURE);
      .    .    .      .    .    .      .    .    .      }
      .    .    .      .    .    .      .    .    .
165,062    1    1 73,360    0    0 91,700    0    0      while ((line = get_word(data, line, file_ptr)) != EOF)
146,712    0    0 73,356    0    0 73,356    0    0          insert(data-&gt;word, data-&gt;line, table);
      .    .    .      .    .    .      .    .    .
      4    0    0      1    0    0      2    0    0      free(data);
      4    0    0      1    0    0      2    0    0      fclose(file_ptr);
      3    0    0      2    0    0      .    .    .  }</pre>
<p>(Although column widths are automatically minimised, a wide
terminal is clearly useful.)</p>
<p>Each source file is clearly marked
(<code class="computeroutput">User-annotated source</code>) as
having been chosen manually for annotation. If the file was
found in one of the directories specified with the
<code class="option">-I</code>/<code class="option">--include</code> option, the directory
and file are both given.</p>
<p>Each line is annotated with its event counts. Events not
applicable for a line are represented by a dot. This is useful
for distinguishing between an event which cannot happen, and one
which can but did not.</p>
<p>Sometimes only a small section of a source file is
executed. To minimise uninteresting output, cg_annotate only shows
annotated lines and lines within a small distance of annotated
lines. Gaps are marked with the line numbers so you know which
part of a file the shown code comes from, e.g.:</p>
<pre class="programlisting">
(figures and code for line 704)
-- line 704 ----------------------------------------
-- line 878 ----------------------------------------
(figures and code for line 878)</pre>
<p>The amount of context to show around annotated lines is
controlled by the <code class="option">--context</code>
option.</p>
<p>To get automatic annotation, use the <code class="option">--auto=yes</code> option.
cg_annotate will automatically annotate every source file it can
find that is mentioned in the function-by-function summary.
Therefore, the files chosen for auto-annotation are affected by
the <code class="option">--sort</code> and
<code class="option">--threshold</code> options.
Each
source file is clearly marked (<code class="computeroutput">Auto-annotated
source</code>) as being chosen automatically. Any
files that could not be found are mentioned at the end of the
output, e.g.:</p>
<pre class="programlisting">
------------------------------------------------------------------
The following files chosen for auto-annotation could not be found:
------------------------------------------------------------------
  getc.c
  ctype.c
  ../sysdeps/generic/lockfile.c</pre>
<p>This is quite common for library files, since libraries are
usually compiled with debugging information, but the source files
are often not present on a system. If a file is chosen for
annotation both manually and automatically, it
is marked as <code class="computeroutput">User-annotated
source</code>. Use the
<code class="option">-I</code>/<code class="option">--include</code> option to tell cg_annotate where
to look for source files if the filenames found from the debugging
information aren't specific enough.</p>
<p>Beware that cg_annotate can take some time to digest large
<code class="filename">cachegrind.out.&lt;pid&gt;</code> files,
e.g. 30 seconds or more. Also beware that auto-annotation can
produce a lot of output if your program is large!</p>
</div>
<div class="sect2" title="5.2.7.Annotating Assembly Code Programs">
<div class="titlepage"><div><div><h3 class="title">
<a name="cg-manual.assembler"></a>5.2.7. Annotating Assembly Code Programs</h3></div></div></div>
<p>Valgrind can annotate assembly code programs too, or annotate
the assembly code generated for your C program.
Sometimes this is
useful for understanding what is really happening when an
interesting line of C code is translated into multiple
instructions.</p>
<p>To do this, you just need to assemble your
<code class="computeroutput">.s</code> files with assembly-level debug
information. You can compile with the <code class="option">-S</code> option to translate C/C++
programs to assembly code, and then assemble the assembly code files with
<code class="option">-g</code> to achieve this. You can then profile and annotate the
assembly code source files in the same way as C/C++ source files.</p>
</div>
<div class="sect2" title="5.2.8.Forking Programs">
<div class="titlepage"><div><div><h3 class="title">
<a name="ms-manual.forkingprograms"></a>5.2.8. Forking Programs</h3></div></div></div>
<p>If your program forks, the child will inherit all the profiling data that
has been gathered for the parent.</p>
<p>If the output file format string (controlled by
<code class="option">--cachegrind-out-file</code>) does not contain <code class="option">%p</code>,
then the outputs from the parent and child will be intermingled in a single
output file, which will almost certainly make it unreadable by
cg_annotate.</p>
</div>
<div class="sect2" title="5.2.9.cg_annotate Warnings">
<div class="titlepage"><div><div><h3 class="title">
<a name="cg-manual.annopts.warnings"></a>5.2.9. cg_annotate Warnings</h3></div></div></div>
<p>There are a couple of situations in which
cg_annotate issues warnings.</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p>If a source file is more recent than the
<code class="filename">cachegrind.out.&lt;pid&gt;</code> file.
This is because the information in
<code class="filename">cachegrind.out.&lt;pid&gt;</code> is only
recorded with line numbers, so if the line numbers change at
all in the source (e.g.
lines added, deleted, swapped), any
annotations will be incorrect.</p></li>
<li class="listitem"><p>If information is recorded about line numbers past the
end of a file. This can be caused by the above problem,
i.e. shortening the source file while using an old
<code class="filename">cachegrind.out.&lt;pid&gt;</code> file. If
this happens, the figures for the bogus lines are printed
anyway (clearly marked as bogus) in case they are
important.</p></li>
</ul></div>
</div>
<div class="sect2" title="5.2.10.Unusual Annotation Cases">
<div class="titlepage"><div><div><h3 class="title">
<a name="cg-manual.annopts.things-to-watch-out-for"></a>5.2.10. Unusual Annotation Cases</h3></div></div></div>
<p>Some odd things that can occur during annotation:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem">
<p>If annotating at the assembler level, you might see
something like this:</p>
<pre class="programlisting">
1  0  0  .  .  .  .  .  .    leal -12(%ebp),%eax
1  0  0  .  .  .  1  0  0    movl %eax,84(%ebx)
2  0  0  0  0  0  1  0  0    movl $1,-20(%ebp)
.  .  .  .  .  .  .  .  .    .align 4,0x90
1  0  0  .  .  .  .  .  .    movl $.LnrB,%eax
1  0  0  .  .  .  1  0  0    movl %eax,-16(%ebp)</pre>
<p>How can the third instruction be executed twice when
the others are executed only once? As it turns out, it
isn't. Here's a dump of the executable, using
<code class="computeroutput">objdump -d</code>:</p>
<pre class="programlisting">
8048f25:  8d 45 f4              lea    0xfffffff4(%ebp),%eax
8048f28:  89 43 54              mov    %eax,0x54(%ebx)
8048f2b:  c7 45 ec 01 00 00 00  movl   $0x1,0xffffffec(%ebp)
8048f32:  89 f6                 mov    %esi,%esi
8048f34:  b8 08 8b 07 08        mov    $0x8078b08,%eax
8048f39:  89 45 f0              mov    %eax,0xfffffff0(%ebp)</pre>
<p>Notice the extra <code class="computeroutput">mov
%esi,%esi</code> instruction. Where did this come
from?
The GNU assembler inserted it to serve as the two
bytes of padding needed to align the <code class="computeroutput">movl
$.LnrB,%eax</code> instruction on a four-byte
boundary, but pretended it didn't exist when adding debug
information. Thus when Valgrind reads the debug info it
thinks that the <code class="computeroutput">movl
$0x1,0xffffffec(%ebp)</code> instruction covers the
address range 0x8048f2b--0x8048f33 by itself, and attributes
the counts for the <code class="computeroutput">mov
%esi,%esi</code> to it.</p>
</li>
<li class="listitem"><p>Sometimes, the same filename might be represented with
a relative name and with an absolute name in different parts
of the debug info, e.g.
<code class="filename">/home/user/proj/proj.h</code> and
<code class="filename">../proj.h</code>. In this case, if you use
auto-annotation, the file will be annotated twice with the
counts split between the two.</p></li>
<li class="listitem">
<p>Files with more than 65,535 lines cause difficulties
for the Stabs-format debug info reader. This is because the line
number in the <code class="computeroutput">struct nlist</code>
defined in <code class="filename">a.out.h</code> under Linux is only a
16-bit value. Valgrind can handle some files with more than
65,535 lines correctly by making some guesses to identify
line number overflows. But some cases are beyond it, in
which case you'll get a warning message explaining that
annotations for the file might be incorrect.</p>
<p>If you are using GCC 3.1 or later, this is most likely
irrelevant, since GCC switched to using the more modern DWARF2
format by default at version 3.1.
DWARF2 does not have any such
limitations on line numbers.</p>
</li>
<li class="listitem"><p>If you compile some files with
<code class="option">-g</code> and some without, some
events that take place in a file without debug info could be
attributed to the last line of a file with debug info
(whichever one gets placed before the non-debug-info file in
the executable).</p></li>
</ul></div>
<p>This list looks long, but these cases should be fairly
rare.</p>
</div>
<div class="sect2" title="5.2.11.Merging Profiles with cg_merge">
<div class="titlepage"><div><div><h3 class="title">
<a name="cg-manual.cg_merge"></a>5.2.11. Merging Profiles with cg_merge</h3></div></div></div>
<p>
cg_merge is a simple program which
reads multiple profile files, as created by Cachegrind, merges them
together, and writes the results into another file in the same format.
You can then examine the merged results using
<code class="computeroutput">cg_annotate &lt;filename&gt;</code>, as
described above. The merging functionality might be useful if you
want to aggregate costs over multiple runs of the same program, or
from a single parallel run with multiple instances of the same
program.</p>
<p>
cg_merge is invoked as follows:
</p>
<pre class="programlisting">
cg_merge -o outputfile file1 file2 file3 ...</pre>
<p>
It reads and checks <code class="computeroutput">file1</code>, then reads
and checks <code class="computeroutput">file2</code> and merges it into
the running totals, then the same with
<code class="computeroutput">file3</code>, etc. The final results are
written to <code class="computeroutput">outputfile</code>, or to standard
output if no output file is specified.</p>
<p>
Costs are summed on a per-function, per-line and per-instruction
basis.
Because of this, the order in which the input files are given does not
matter, although you should take care to mention each file only once,
since any file mentioned twice will be added in twice.</p>
<p>
cg_merge does not attempt to check
that the input files come from runs of the same executable. It will
happily merge together profile files from completely unrelated
programs. It does however check that the
<code class="computeroutput">Events:</code> lines of all the inputs are
identical, so as to ensure that the addition of costs makes sense.
For example, it would be nonsensical for it to add a number indicating
D1 read references to a number from a different file indicating LL
write misses.</p>
<p>
A number of other syntax and sanity checks are done whilst reading the
inputs. cg_merge will stop and
attempt to print a helpful error message if any of the input files
fail these checks.</p>
</div>
<div class="sect2" title="5.2.12.Differencing Profiles with cg_diff">
<div class="titlepage"><div><div><h3 class="title">
<a name="cg-manual.cg_diff"></a>5.2.12.Differencing Profiles with cg_diff</h3></div></div></div>
<p>
cg_diff is a simple program which
reads two profile files, as created by Cachegrind, finds the difference
between them, and writes the results into another file in the same format.
You can then examine the differences using
<code class="computeroutput">cg_annotate &lt;filename&gt;</code>, as
described above. This is very useful if you want to measure how a change to
a program affected its performance.
</p>
<p>
cg_diff is invoked as follows:
</p>
<pre class="programlisting">
cg_diff file1 file2</pre>
<p>
It reads and checks <code class="computeroutput">file1</code>, then reads
and checks <code class="computeroutput">file2</code>, then computes the
difference (effectively <code class="computeroutput">file2</code> -
<code class="computeroutput">file1</code>). The final results are written to
standard output.</p>
<p>
Costs are aggregated and differenced on a per-function basis. Per-line costs
are not differenced, because doing so is too difficult. For example, consider
differencing two profiles, one from a single-file program A, and one from the
same program A where a single blank line was inserted at the top of the file.
Every per-line count is now attributed to a different line number. By
contrast, the per-function counts have not
changed. The per-function count differences are still very useful for
determining differences between programs. Note that because the result is
the difference of two profiles, many of the counts will be negative; this
indicates that the counts for the relevant function are fewer in the second
version than those in the first version.</p>
<p>
cg_diff does not attempt to check
that the input files come from runs of the same executable. It will
happily difference profile files from completely unrelated
programs. It does however check that the
<code class="computeroutput">Events:</code> lines of both inputs are
identical, so as to ensure that the subtraction of costs makes sense.
For example, it would be nonsensical for it to subtract a number indicating
D1 read references from a number from a different file indicating LL
write misses.</p>
<p>
A number of other syntax and sanity checks are done whilst reading the
inputs.
cg_diff will stop and
attempt to print a helpful error message if either of the input files
fails these checks.</p>
<p>
Sometimes you will want to compare Cachegrind profiles of two versions of a
program that you have sitting side-by-side. For example, you might have
<code class="computeroutput">version1/prog.c</code> and
<code class="computeroutput">version2/prog.c</code>, where the second is
slightly different to the first. A straight comparison of the two will not
be useful: because functions are qualified with filenames, a function
<code class="function">f</code> will be listed as
<code class="computeroutput">version1/prog.c:f</code> for the first version but
<code class="computeroutput">version2/prog.c:f</code> for the second
version.</p>
<p>
When this happens, you can use the <code class="option">--mod-filename</code> option.
Its argument is a Perl search-and-replace expression that will be applied
to all the filenames in both Cachegrind output files. It can be used to
remove minor differences in filenames. For example, the option
<code class="option">--mod-filename='s/version[0-9]/versionN/'</code> will suffice for
this case.</p>
</div>
</div>
<div class="sect1" title="5.3.Cachegrind Command-line Options">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="cg-manual.cgopts"></a>5.3.Cachegrind Command-line Options</h2></div></div></div>
<p>Cachegrind-specific options are:</p>
<div class="variablelist">
<a name="cg.opts.list"></a><dl>
<dt>
<a name="opt.I1"></a><span class="term">
<code class="option">--I1=&lt;size&gt;,&lt;associativity&gt;,&lt;line size&gt; </code>
</span>
</dt>
<dd><p>Specify the size, associativity and line size of the level 1
instruction cache.
</p></dd>
<dt>
<a name="opt.D1"></a><span class="term">
<code class="option">--D1=&lt;size&gt;,&lt;associativity&gt;,&lt;line size&gt; </code>
</span>
</dt>
<dd><p>Specify the size, associativity and line size of the level 1
data cache.</p></dd>
<dt>
<a name="opt.LL"></a><span class="term">
<code class="option">--LL=&lt;size&gt;,&lt;associativity&gt;,&lt;line size&gt; </code>
</span>
</dt>
<dd><p>Specify the size, associativity and line size of the last-level
cache.</p></dd>
<dt>
<a name="opt.cache-sim"></a><span class="term">
<code class="option">--cache-sim=no|yes [yes] </code>
</span>
</dt>
<dd><p>Enables or disables collection of cache access and miss
counts.</p></dd>
<dt>
<a name="opt.branch-sim"></a><span class="term">
<code class="option">--branch-sim=no|yes [no] </code>
</span>
</dt>
<dd><p>Enables or disables collection of branch instruction and
misprediction counts. By default this is disabled as it
slows Cachegrind down by approximately 25%. Note that you
cannot specify <code class="option">--cache-sim=no</code>
and <code class="option">--branch-sim=no</code>
together, as that would leave Cachegrind with no
information to collect.</p></dd>
<dt>
<a name="opt.cachegrind-out-file"></a><span class="term">
<code class="option">--cachegrind-out-file=&lt;file&gt; </code>
</span>
</dt>
<dd><p>Write the profile data to
<code class="computeroutput">file</code> rather than to the default
output file,
<code class="filename">cachegrind.out.&lt;pid&gt;</code>. The
<code class="option">%p</code> and <code class="option">%q</code> format specifiers
can be used to embed the process ID and/or the contents of an
environment variable in the name, as is the case for the core
option <code class="option"><a class="xref" href="manual-core.html#opt.log-file">--log-file</a></code>.
</p></dd>
</dl>
</div>
</div>
<div class="sect1" title="5.4.cg_annotate Command-line Options">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="cg-manual.annopts"></a>5.4.cg_annotate Command-line Options</h2></div></div></div>
<div class="variablelist">
<a name="cg_annotate.opts.list"></a><dl>
<dt><span class="term">
<code class="option">-h --help </code>
</span></dt>
<dd><p>Show the help message.</p></dd>
<dt><span class="term">
<code class="option">--version </code>
</span></dt>
<dd><p>Show the version number.</p></dd>
<dt><span class="term">
<code class="option">--show=A,B,C [default: all, using order in
cachegrind.out.&lt;pid&gt;] </code>
</span></dt>
<dd><p>Specifies which events to show (and the column
order). Default is to use all present in the
<code class="filename">cachegrind.out.&lt;pid&gt;</code> file (and
use the order in the file). Useful if you want to concentrate on, for
example, I cache misses (<code class="option">--show=I1mr,ILmr</code>), or data
read misses (<code class="option">--show=D1mr,DLmr</code>), or LL data misses
(<code class="option">--show=DLmr,DLmw</code>). Best used in conjunction with
<code class="option">--sort</code>.</p></dd>
<dt><span class="term">
<code class="option">--sort=A,B,C [default: order in
cachegrind.out.&lt;pid&gt;] </code>
</span></dt>
<dd><p>Specifies the events upon which the sorting of the
function-by-function entries will be based.</p></dd>
<dt><span class="term">
<code class="option">--threshold=X [default: 0.1%] </code>
</span></dt>
<dd>
<p>Sets the threshold for the function-by-function
summary. A function is shown if it accounts for more than X%
of the counts for the primary sort event.
When auto-annotating, this also
affects which files are annotated.</p>
<p>Note: thresholds can be set for more than one of the
events by appending a colon and a number (no spaces, though) to any
event named in the <code class="option">--sort</code> option.
E.g. if you want to see
each function that covers more than 1% of LL read misses or 1% of LL
write misses, use this option:</p>
<p><code class="option">--sort=DLmr:1,DLmw:1</code></p>
</dd>
<dt><span class="term">
<code class="option">--auto=&lt;no|yes&gt; [default: no] </code>
</span></dt>
<dd><p>When enabled, automatically annotates every file that
is mentioned in the function-by-function summary and that can be
found. Also gives a list of those that couldn't be found.</p></dd>
<dt><span class="term">
<code class="option">--context=N [default: 8] </code>
</span></dt>
<dd><p>Print N lines of context before and after each
annotated line. Avoids printing large sections of source
files that were not executed. Use a large number
(e.g. 100000) to show all source lines.</p></dd>
<dt><span class="term">
<code class="option">-I&lt;dir&gt; --include=&lt;dir&gt; [default: none] </code>
</span></dt>
<dd><p>Adds a directory to the list in which to search for
files.
Multiple <code class="option">-I</code>/<code class="option">--include</code>
options can be given to add multiple directories.</p></dd>
</dl>
</div>
</div>
<div class="sect1" title="5.5.cg_diff Command-line Options">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="cg-manual.diffopts"></a>5.5.cg_diff Command-line Options</h2></div></div></div>
<div class="variablelist">
<a name="cg_diff.opts.list"></a><dl>
<dt><span class="term">
<code class="option">-h --help </code>
</span></dt>
<dd><p>Show the help message.</p></dd>
<dt><span class="term">
<code class="option">--version </code>
</span></dt>
<dd><p>Show the version number.</p></dd>
<dt><span class="term">
<code class="option">--mod-filename=&lt;expr&gt; [default: none]</code>
</span></dt>
<dd><p>Specifies a Perl search-and-replace expression that is applied
to all filenames. Useful for removing minor differences in paths
between two different versions of a program that are sitting in
different directories.</p></dd>
</dl>
</div>
</div>
<div class="sect1" title="5.6.Acting on Cachegrind's Information">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="cg-manual.acting-on"></a>5.6.Acting on Cachegrind's Information</h2></div></div></div>
<p>
Cachegrind gives you lots of information, but acting on that information
isn't always easy. Here are some rules of thumb that we have found to be
useful.</p>
<p>
First of all, the global hit/miss counts and miss rates are not that useful.
If you have multiple programs or multiple runs of a program, comparing the
numbers might identify if any are outliers and worthy of closer
investigation.
Otherwise, they're not enough to act on.</p>
<p>
The function-by-function counts are more useful to look at, as they pinpoint
which functions are causing large numbers of counts. However, beware that
inlining can make these counts misleading. If a function
<code class="function">f</code> is always inlined, counts will be attributed to the
functions it is inlined into, rather than itself. However, if you look at
the line-by-line annotations for <code class="function">f</code> you'll see the
counts that belong to <code class="function">f</code>. (This is hard to avoid; it's
how the debug info is structured.) So it's worth looking for large numbers
in the line-by-line annotations.</p>
<p>
The line-by-line source code annotations are much more useful. In our
experience, the best place to start is by looking at the
<code class="computeroutput">Ir</code> numbers. They simply measure how many
instructions were executed for each line, and don't include any cache
information, but they can still be very useful for identifying
bottlenecks.</p>
<p>
After that, we have found that LL misses are typically a much bigger source
of slow-downs than L1 misses. So it's worth looking for any snippets of
code with high <code class="computeroutput">DLmr</code> or
<code class="computeroutput">DLmw</code> counts. (You can use
<code class="option">--show=DLmr
--sort=DLmr</code> with cg_annotate to focus just on
<code class="literal">DLmr</code> counts, for example.) If you find any, it's still
not always easy to work out how to improve things. You need to have a
reasonable understanding of how caches work, the principles of locality, and
your program's data access patterns.
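</p>
<p>As a minimal sketch of how a data access pattern affects locality, consider
two C functions that compute the same sum over the same array. Everything here
is hypothetical and chosen only for illustration: the array
<code class="computeroutput">grid</code>, its dimensions, and both function
names. The first function walks memory in address order; the second strides
across rows.</p>

```c
#include <stddef.h>

#define ROWS 1024
#define COLS 1024

/* Hypothetical example data. C stores this array row by row. */
static int grid[ROWS][COLS];

/* Visits elements in address order, so consecutive accesses
   tend to fall in the same cache line. */
long sum_row_major(void)
{
    long total = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            total += grid[r][c];
    return total;
}

/* Computes the same sum, but each access is COLS * sizeof(int)
   bytes away from the previous one, so it usually lands in a
   different cache line. */
long sum_column_major(void)
{
    long total = 0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            total += grid[r][c];
    return total;
}
```

<p>Under these assumptions one would expect the annotated line doing the
addition in <code class="computeroutput">sum_column_major</code> to show more
data read misses than the corresponding line in
<code class="computeroutput">sum_row_major</code>. The actual counts depend on
the simulated cache configuration, and an optimising compiler may reorder the
loops.</p>
<p>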
Improving things may require
redesigning a data structure, for example.</p>
<p>
Looking at the <code class="computeroutput">Bcm</code> and
<code class="computeroutput">Bim</code> counts can also be helpful.
In particular, <code class="computeroutput">Bim</code> mispredictions are often caused
by <code class="literal">switch</code> statements, and in some cases these
<code class="literal">switch</code> statements can be replaced with table-driven code.
For example, you might replace code like this:</p>
<pre class="programlisting">
enum E { A, B, C };
enum E e;
int i;
...
switch (e)
{
    case A: i += 1; break;
    case B: i += 2; break;
    case C: i += 3; break;
}
</pre>
<p>with code like this:</p>
<pre class="programlisting">
enum E { A, B, C };
enum E e;
int table[] = { 1, 2, 3 };  /* indexed by the value of e */
int i;
...
i += table[e];
</pre>
<p>
This is obviously a contrived example, but the basic principle applies in a
wide variety of situations.</p>
<p>
In short, Cachegrind can tell you where some of the bottlenecks in your code
are, but it can't tell you how to fix them. You have to work that out for
yourself. But at least you have the information!
</p>
</div>
<div class="sect1" title="5.7.Simulation Details">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="cg-manual.sim-details"></a>5.7.Simulation Details</h2></div></div></div>
<p>
This section covers details that you don't need to know in order to
use Cachegrind, but which may be of interest to some people.
</p>
<div class="sect2" title="5.7.1.Cache Simulation Specifics">
<div class="titlepage"><div><div><h3 class="title">
<a name="cache-sim"></a>5.7.1.Cache Simulation Specifics</h3></div></div></div>
<p>Specific characteristics of the cache simulation are as
follows:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p>Write-allocate: when a write miss occurs, the block
written to is brought into the D1 cache. Most modern caches
have this property.</p></li>
<li class="listitem">
<p>Bit-selection hash function: the set of line(s) in the cache
to which a memory block maps is chosen by the middle bits
M--(M+N-1) of the byte address, where:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="circle">
<li class="listitem"><p>line size = 2^M bytes</p></li>
<li class="listitem"><p>(cache size / line size / associativity) = 2^N sets</p></li>
</ul></div>
</li>
<li class="listitem"><p>Inclusive LL cache: the LL cache typically replicates all
the entries of the L1 caches, because fetching into L1 involves
fetching into LL first (this does not guarantee strict inclusiveness,
as lines evicted from LL still could reside in L1). This is
standard on Pentium chips, but AMD Opterons, Athlons and Durons
use an exclusive LL cache that only holds
blocks evicted from L1. Ditto most modern VIA CPUs.</p></li>
</ul></div>
<p>The cache configuration simulated (cache size,
associativity and line size) is determined automatically using
the x86 CPUID instruction. If you have a machine that (a)
doesn't support the CPUID instruction, or (b) supports it in an
early incarnation that doesn't give any cache information, then
Cachegrind will fall back to using a default configuration (that
of a model 3/4 Athlon). Cachegrind will tell you if this
happens.
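</p>
<p>The bit-selection mapping described in the list above can be sketched in C
as follows. This is an illustration only, not Cachegrind's own code: the
function names <code class="computeroutput">num_sets</code> and
<code class="computeroutput">set_index</code> are hypothetical, and returning 0
for an unusable geometry is an assumption made for this sketch.</p>

```c
#include <stdint.h>

/* Returns 1 if x is a power of two, 0 otherwise. */
static int is_pow2(uint64_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Number of sets = cache size / line size / associativity.
   ASSUMPTION for this sketch: return 0 if the line size or the
   resulting number of sets is not a power of two. */
uint64_t num_sets(uint64_t cache_size, uint64_t assoc, uint64_t line_size)
{
    if (assoc == 0 || !is_pow2(line_size))
        return 0;
    if (cache_size % (line_size * assoc) != 0)
        return 0;
    uint64_t sets = cache_size / line_size / assoc;
    return is_pow2(sets) ? sets : 0;
}

/* With line size 2^M and 2^N sets, the set index is bits
   M..(M+N-1) of the address: drop the low M offset bits,
   then keep the next N bits. */
uint64_t set_index(uint64_t addr, uint64_t line_size, uint64_t sets)
{
    return (addr / line_size) & (sets - 1);
}
```

<p>For instance, a 32768-byte, 8-way cache with 64-byte lines gives 64 sets
in this sketch, so bits 6 to 11 of an address select the set.</p>
<p>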
You can manually specify one, two or all three levels
(I1/D1/LL) of the cache from the command line using the
<code class="option">--I1</code>,
<code class="option">--D1</code> and
<code class="option">--LL</code> options.
For cache parameters to be valid for simulation, the number
of sets (with associativity being the number of cache lines in
each set) has to be a power of two.</p>
<p>On PowerPC platforms
Cachegrind cannot automatically
determine the cache configuration, so you will
need to specify it with the
<code class="option">--I1</code>,
<code class="option">--D1</code> and
<code class="option">--LL</code> options.</p>
<p>Other noteworthy behaviour:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem">
<p>References that straddle two cache lines are treated as
follows:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="circle">
<li class="listitem"><p>If both blocks hit --&gt; counted as one hit</p></li>
<li class="listitem"><p>If one block hits, the other misses --&gt; counted
as one miss.</p></li>
<li class="listitem"><p>If both blocks miss --&gt; counted as one miss (not
two)</p></li>
</ul></div>
</li>
<li class="listitem">
<p>Instructions that modify a memory location
(e.g. <code class="computeroutput">inc</code> and
<code class="computeroutput">dec</code>) are counted as doing
just a read, i.e. a single data reference.
This may seem
strange, but since the write can never cause a miss (the read
guarantees the block is in the cache) it's not very
interesting.</p>
<p>Thus Cachegrind measures not the number of times the data cache
is accessed, but the number of times a data cache miss could
occur.</p>
</li>
</ul></div>
<p>If you are interested in simulating a cache with different
properties, it is not particularly hard to write your own cache
simulator, or to modify the existing ones in
<code class="computeroutput">cg_sim.c</code>. We'd be
interested to hear from anyone who does.</p>
</div>
<div class="sect2" title="5.7.2.Branch Simulation Specifics">
<div class="titlepage"><div><div><h3 class="title">
<a name="branch-sim"></a>5.7.2.Branch Simulation Specifics</h3></div></div></div>
<p>Cachegrind simulates branch predictors intended to be
typical of mainstream desktop/server processors of around 2004.</p>
<p>Conditional branches are predicted using an array of 16384 2-bit
saturating counters. The array index used for a branch instruction is
computed partly from the low-order bits of the branch instruction's
address and partly using the taken/not-taken behaviour of the last few
conditional branches. As a result the predictions for any specific
branch depend both on its own history and the behaviour of previous
branches. This is a standard technique for improving prediction
accuracy.</p>
<p>For indirect branches (that is, jumps to unknown destinations)
Cachegrind uses a simple branch target address predictor. Targets are
predicted using an array of 512 entries indexed by the low order 9
bits of the branch instruction's address. Each branch is predicted to
jump to the same address it did last time. Any other behaviour causes
a mispredict.</p>
<p>More recent processors have better branch predictors, in
particular better indirect branch predictors.
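</p>
<p>To make the two predictor descriptions above more concrete, here is a rough
C sketch of both. It is not Cachegrind's implementation. The table sizes come
from the text, but the function names are hypothetical, and the way the address
and branch history are combined (a simple XOR) is an assumption, since the text
only says that both contribute to the index.</p>

```c
#include <stdint.h>

#define N_COUNTERS 16384u  /* 2-bit saturating counters */
#define N_TARGETS  512u    /* indirect-branch target entries */

static uint8_t  counters[N_COUNTERS]; /* values 0..3; 2 and 3 predict "taken" */
static uint32_t history;              /* recent outcomes, one bit per branch */
static uint64_t targets[N_TARGETS];   /* last target seen for each entry */

/* Simulates one conditional branch; returns 1 on a mispredict.
   ASSUMPTION: the index is the address XORed with the history. */
int cond_branch(uint64_t instr_addr, int taken)
{
    uint32_t idx = ((uint32_t)instr_addr ^ history) % N_COUNTERS;
    int predicted_taken = counters[idx] >= 2;

    /* Move the counter towards the actual outcome, saturating at 0 and 3. */
    if (taken && counters[idx] < 3)
        counters[idx]++;
    if (!taken && counters[idx] > 0)
        counters[idx]--;

    /* Record the outcome so it influences later index computations. */
    history = (history << 1) | (taken ? 1u : 0u);

    return predicted_taken != (taken != 0);
}

/* Simulates one indirect branch; returns 1 on a mispredict.
   The entry is chosen by the low-order 9 bits of the address and
   predicts the target that was seen last time. */
int indirect_branch(uint64_t instr_addr, uint64_t actual_target)
{
    uint32_t idx = (uint32_t)(instr_addr & (N_TARGETS - 1));
    int mispredict = targets[idx] != actual_target;
    targets[idx] = actual_target;
    return mispredict;
}
```

<p>Note that in this sketch two indirect branches whose addresses share the
same low-order 9 bits use the same table entry, so they can disturb each
other's predictions.</p>
<p>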
Cachegrind's predictor
design is deliberately conservative so as to be representative of the
large installed base of processors which pre-date widespread
deployment of more sophisticated indirect branch predictors. In
particular, late model Pentium 4s (Prescott), Pentium M, Core and Core
2 have more sophisticated indirect branch predictors than modelled by
Cachegrind. </p>
<p>Cachegrind does not simulate a return stack predictor. It
assumes that processors perfectly predict function return addresses,
an assumption which is probably close to being true.</p>
<p>See Hennessy and Patterson's classic text "Computer
Architecture: A Quantitative Approach", 4th edition (2007), Section
2.3 (pages 80-89) for background on modern branch predictors.</p>
</div>
<div class="sect2" title="5.7.3.Accuracy">
<div class="titlepage"><div><div><h3 class="title">
<a name="cg-manual.annopts.accuracy"></a>5.7.3.Accuracy</h3></div></div></div>
<p>Valgrind's cache profiling has a number of
shortcomings:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p>It doesn't account for kernel activity: the effect of system
calls on the cache and branch predictor contents is ignored.</p></li>
<li class="listitem"><p>It doesn't account for other process activity.
This is probably desirable when considering a single
program.</p></li>
<li class="listitem"><p>It doesn't account for virtual-to-physical address
mappings. Hence the simulation is not a true
representation of what's happening in the
cache. Most caches and branch predictors are physically indexed, but
Cachegrind simulates caches using virtual addresses.</p></li>
<li class="listitem"><p>It doesn't account for cache misses not visible at the
instruction level, e.g.
those arising from TLB misses, or
speculative execution.</p></li>
<li class="listitem"><p>Valgrind will schedule
threads differently from how they would be when running natively.
This could warp the results for threaded programs.</p></li>
<li class="listitem">
<p>The x86/amd64 instructions <code class="computeroutput">bts</code>,
<code class="computeroutput">btr</code> and
<code class="computeroutput">btc</code> will incorrectly be
counted as doing a data read if both the arguments are
registers, e.g.:</p>
<pre class="programlisting">
    btsl %eax, %edx</pre>
<p>This should only happen rarely.</p>
</li>
<li class="listitem"><p>x86/amd64 FPU instructions with data sizes of 28 and 108 bytes
(e.g. <code class="computeroutput">fsave</code>) are treated as
though they only access 16 bytes. These instructions seem to
be rare so hopefully this won't affect accuracy much.</p></li>
</ul></div>
<p>Another thing worth noting is that results are very sensitive.
Changing the size of the executable being profiled, or the sizes
of any of the shared libraries it uses, or even the length of their
file names, can perturb the results. Variations will be small, but
don't expect perfectly repeatable results if your program changes at
all.</p>
<p>More recent GNU/Linux distributions do address space
randomisation, in which identical runs of the same program have their
shared libraries loaded at different locations, as a security measure.
This also perturbs the results.</p>
<p>While these factors mean you shouldn't trust the results to
be super-accurate, they should be close enough to be useful.</p>
</div>
</div>
<div class="sect1" title="5.8.Implementation Details">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="cg-manual.impl-details"></a>5.8.Implementation Details</h2></div></div></div>
<p>
This section talks about details you don't need to know about in order to
use Cachegrind, but may be of interest to some people.
</p>
<div class="sect2" title="5.8.1.How Cachegrind Works">
<div class="titlepage"><div><div><h3 class="title">
<a name="cg-manual.impl-details.how-cg-works"></a>5.8.1.How Cachegrind Works</h3></div></div></div>
<p>The best reference for understanding how Cachegrind works is chapter 3 of
"Dynamic Binary Analysis and Instrumentation", by Nicholas Nethercote. It
is available on the <a class="ulink" href="http://www.valgrind.org/docs/pubs.html" target="_top">Valgrind publications
page</a>.</p>
</div>
<div class="sect2" title="5.8.2.Cachegrind Output File Format">
<div class="titlepage"><div><div><h3 class="title">
<a name="cg-manual.impl-details.file-format"></a>5.8.2.Cachegrind Output File Format</h3></div></div></div>
<p>The file format is fairly straightforward, basically giving the
cost centre for every line, grouped by files and
functions. It's also totally generic and self-describing, in the sense that
it can be used for any events that can be counted on a line-by-line basis,
not just cache and branch predictor events. For example, earlier versions
of Cachegrind didn't have a branch predictor simulation. When this was
added, the file format didn't need to change at all.
So the format (and
consequently, cg_annotate) could be used by other tools.</p>
<p>The file format:</p>
<pre class="programlisting">
file         ::= desc_line* cmd_line events_line data_line+ summary_line
desc_line    ::= "desc:" ws? non_nl_string
cmd_line     ::= "cmd:" ws? cmd
events_line  ::= "events:" ws? (event ws)+
data_line    ::= file_line | fn_line | count_line
file_line    ::= "fl=" filename
fn_line      ::= "fn=" fn_name
count_line   ::= line_num ws? (count ws)+
summary_line ::= "summary:" ws? (count ws)+
count        ::= num | "."</pre>
<p>Where:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p><code class="computeroutput">non_nl_string</code> is any
string not containing a newline.</p></li>
<li class="listitem"><p><code class="computeroutput">cmd</code> is a string holding the
command line of the profiled program.</p></li>
<li class="listitem"><p><code class="computeroutput">event</code> is a string containing
no whitespace.</p></li>
<li class="listitem"><p><code class="computeroutput">filename</code> and
<code class="computeroutput">fn_name</code> are strings.</p></li>
<li class="listitem"><p><code class="computeroutput">num</code> and
<code class="computeroutput">line_num</code> are decimal
numbers.</p></li>
<li class="listitem"><p><code class="computeroutput">ws</code> is whitespace.</p></li>
</ul></div>
<p>The contents of the "desc:" lines are printed out at the top
of the summary. This is a generic way of providing simulation
specific information, e.g. for giving the cache configuration for
cache simulation.</p>
<p>More than one line of info can be presented for each file/fn/line number.
In such cases, the counts for the named events will be accumulated.</p>
<p>Counts can be "." to represent zero.
This makes the files easier for
humans to read.</p>
<p>The number of counts in each
<code class="computeroutput">count_line</code> and the
<code class="computeroutput">summary_line</code> should not exceed
the number of events in the
<code class="computeroutput">events_line</code>. If the number in
a <code class="computeroutput">count_line</code> is less, cg_annotate
treats those missing as though they were a "." entry. This saves space.
</p>
<p>A <code class="computeroutput">file_line</code> changes the
current file name. A <code class="computeroutput">fn_line</code>
changes the current function name. A
<code class="computeroutput">count_line</code> contains counts that
pertain to the current filename/fn_name. A "fl="
<code class="computeroutput">file_line</code> and a "fn="
<code class="computeroutput">fn_line</code> must appear before any
<code class="computeroutput">count_line</code>s to give the context
of the first <code class="computeroutput">count_line</code>s.</p>
<p>Each <code class="computeroutput">file_line</code> will normally be
immediately followed by a <code class="computeroutput">fn_line</code>. But it
doesn't have to be.</p>
<p>The summary line is redundant, because it just holds the total counts
for each event.
But this serves as a useful sanity check of the data; if
the totals for each event don't match the summary line, something has gone
wrong.</p>
</div>
</div>
</div>
<div>
<br><table class="nav" width="100%" cellspacing="3" cellpadding="2" border="0" summary="Navigation footer">
<tr>
<td rowspan="2" width="40%" align="left">
<a accesskey="p" href="mc-manual.html">&lt;&lt;4.Memcheck: a memory error detector</a></td>
<td width="20%" align="center"><a accesskey="u" href="manual.html">Up</a></td>
<td rowspan="2" width="40%" align="right"><a accesskey="n" href="cl-manual.html">6.Callgrind: a call-graph generating cache and branch prediction profiler&gt;&gt;</a>
</td>
</tr>
<tr><td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td></tr>
</table>
</div>
</body>
</html>