1 <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN"> 2 <HTML> 3 4 <HEAD> 5 <link rel="stylesheet" href="designstyle.css"> 6 <title>Gperftools CPU Profiler</title> 7 </HEAD> 8 9 <BODY> 10 11 <p align=right> 12 <i>Last modified 13 <script type=text/javascript> 14 var lm = new Date(document.lastModified); 15 document.write(lm.toDateString()); 16 </script></i> 17 </p> 18 19 <p>This is the CPU profiler we use at Google. There are three parts 20 to using it: linking the library into an application, running the 21 code, and analyzing the output.</p> 22 23 <p>On the off-chance that you should need to understand it, the CPU 24 profiler data file format is documented separately, 25 <a href="cpuprofile-fileformat.html">here</a>. 26 27 28 <H1>Linking in the Library</H1> 29 30 <p>To install the CPU profiler into your executable, add 31 <code>-lprofiler</code> to the link-time step for your executable. 32 (It's also probably possible to add in the profiler at run-time using 33 <code>LD_PRELOAD</code>, e.g. 34 <code>% env LD_PRELOAD="/usr/lib/libprofiler.so" <binary></code>, 35 but this isn't necessarily recommended.)</p> 36 37 <p>This does <i>not</i> turn on CPU profiling; it just inserts the 38 code. For that reason, it's practical to just always link 39 <code>-lprofiler</code> into a binary while developing; that's what we 40 do at Google. (However, since any user can turn on the profiler by 41 setting an environment variable, it's not necessarily recommended to 42 install profiler-linked binaries into a production, running 43 system.)</p> 44 45 46 <H1>Running the Code</H1> 47 48 <p>There are several alternatives to actually turn on CPU profiling 49 for a given run of an executable:</p> 50 51 <ol> 52 <li> <p>Define the environment variable CPUPROFILE to the filename 53 to dump the profile to. For instance, if you had a version of 54 <code>/bin/ls</code> that had been linked against libprofiler, 55 you could run:</p> 56 <pre>% env CPUPROFILE=ls.prof /bin/ls</pre> 57 58 <li> <p>In your code, bracket the code you want profiled in calls to 59 <code>ProfilerStart()</code> and <code>ProfilerStop()</code>. 60 (These functions are declared in <code><gperftools/profiler.h></code>.) 61 <code>ProfilerStart()</code> will take 62 the profile-filename as an argument.</p> 63 </ol> 64 65 <p>In Linux 2.6 and above, profiling works correctly with threads, 66 automatically profiling all threads. In Linux 2.4, profiling only 67 profiles the main thread (due to a kernel bug involving itimers and 68 threads). Profiling works correctly with sub-processes: each child 69 process gets its own profile with its own name (generated by combining 70 CPUPROFILE with the child's process id).</p> 71 72 <p>For security reasons, CPU profiling will not write to a file -- and 73 is thus not usable -- for setuid programs.</p> 74 75 <p>See the include-file <code>gperftools/profiler.h</code> for 76 advanced-use functions, including <code>ProfilerFlush()</code> and 77 <code>ProfilerStartWithOptions()</code>.</p> 78 79 80 <H2>Modifying Runtime Behavior</H2> 81 82 <p>You can more finely control the behavior of the CPU profiler via 83 environment variables.</p> 84 85 <table frame=box rules=sides cellpadding=5 width=100%> 86 87 <tr valign=top> 88 <td><code>CPUPROFILE_FREQUENCY=<i>x</i></code></td> 89 <td>default: 100</td> 90 <td> 91 How many interrupts/second the cpu-profiler samples. 92 </td> 93 </tr> 94 95 <tr valign=top> 96 <td><code>CPUPROFILE_REALTIME=1</code></td> 97 <td>default: [not set]</td> 98 <td> 99 If set to any value (including 0 or the empty string), use 100 ITIMER_REAL instead of ITIMER_PROF to gather profiles. In 101 general, ITIMER_REAL is not as accurate as ITIMER_PROF, and also 102 interacts badly with use of alarm(), so prefer ITIMER_PROF unless 103 you have a reason prefer ITIMER_REAL. 104 </td> 105 </tr> 106 107 </table> 108 109 110 <h1><a name="pprof">Analyzing the Output</a></h1> 111 112 <p><code>pprof</code> is the script used to analyze a profile. It has 113 many output modes, both textual and graphical. Some give just raw 114 numbers, much like the <code>-pg</code> output of <code>gcc</code>, 115 and others show the data in the form of a dependency graph.</p> 116 117 <p>pprof <b>requires</b> <code>perl5</code> to be installed to run. 118 It also requires <code>dot</code> to be installed for any of the 119 graphical output routines, and <code>gv</code> to be installed for 120 <code>--gv</code> mode (described below). 121 </p> 122 123 <p>Here are some ways to call pprof. These are described in more 124 detail below.</p> 125 126 <pre> 127 % pprof /bin/ls ls.prof 128 Enters "interactive" mode 129 % pprof --text /bin/ls ls.prof 130 Outputs one line per procedure 131 % pprof --gv /bin/ls ls.prof 132 Displays annotated call-graph via 'gv' 133 % pprof --gv --focus=Mutex /bin/ls ls.prof 134 Restricts to code paths including a .*Mutex.* entry 135 % pprof --gv --focus=Mutex --ignore=string /bin/ls ls.prof 136 Code paths including Mutex but not string 137 % pprof --list=getdir /bin/ls ls.prof 138 (Per-line) annotated source listing for getdir() 139 % pprof --disasm=getdir /bin/ls ls.prof 140 (Per-PC) annotated disassembly for getdir() 141 % pprof --text localhost:1234 142 Outputs one line per procedure for localhost:1234 143 % pprof --callgrind /bin/ls ls.prof 144 Outputs the call information in callgrind format 145 </pre> 146 147 148 <h3>Analyzing Text Output</h3> 149 150 <p>Text mode has lines of output that look like this:</p> 151 <pre> 152 14 2.1% 17.2% 58 8.7% std::_Rb_tree::find 153 </pre> 154 155 <p>Here is how to interpret the columns:</p> 156 <ol> 157 <li> Number of profiling samples in this function 158 <li> Percentage of profiling samples in this function 159 <li> Percentage of profiling samples in the functions printed so far 160 <li> Number of profiling samples in this function and its callees 161 <li> Percentage of profiling samples in this function and its callees 162 <li> Function name 163 </ol> 164 165 <h3>Analyzing Callgrind Output</h3> 166 167 <p>Use <a href="http://kcachegrind.sourceforge.net">kcachegrind</a> to 168 analyze your callgrind output:</p> 169 <pre> 170 % pprof --callgrind /bin/ls ls.prof > ls.callgrind 171 % kcachegrind ls.callgrind 172 </pre> 173 174 <p>The cost is specified in 'hits', i.e. how many times a function 175 appears in the recorded call stack information. The 'calls' from 176 function a to b record how many times function b was found in the 177 stack traces directly below function a.</p> 178 179 <p>Tip: if you use a debug build the output will include file and line 180 number information and kcachegrind will show an annotated source 181 code view.</p> 182 183 <h3>Node Information</h3> 184 185 <p>In the various graphical modes of pprof, the output is a call graph 186 annotated with timing information, like so:</p> 187 188 <A HREF="pprof-test-big.gif"> 189 <center><table><tr><td> 190 <img src="pprof-test.gif"> 191 </td></tr></table></center> 192 </A> 193 194 <p>Each node represents a procedure. The directed edges indicate 195 caller to callee relations. Each node is formatted as follows:</p> 196 197 <center><pre> 198 Class Name 199 Method Name 200 local (percentage) 201 <b>of</b> cumulative (percentage) 202 </pre></center> 203 204 <p>The last one or two lines contains the timing information. (The 205 profiling is done via a sampling method, where by default we take 100 206 samples a second. Therefor one unit of time in the output corresponds 207 to about 10 milliseconds of execution time.) The "local" time is the 208 time spent executing the instructions directly contained in the 209 procedure (and in any other procedures that were inlined into the 210 procedure). The "cumulative" time is the sum of the "local" time and 211 the time spent in any callees. If the cumulative time is the same as 212 the local time, it is not printed.</p> 213 214 <p>For instance, the timing information for test_main_thread() 215 indicates that 155 units (about 1.55 seconds) were spent executing the 216 code in <code>test_main_thread()</code> and 200 units were spent while 217 executing <code>test_main_thread()</code> and its callees such as 218 <code>snprintf()</code>.</p> 219 220 <p>The size of the node is proportional to the local count. The 221 percentage displayed in the node corresponds to the count divided by 222 the total run time of the program (that is, the cumulative count for 223 <code>main()</code>).</p> 224 225 <h3>Edge Information</h3> 226 227 <p>An edge from one node to another indicates a caller to callee 228 relationship. Each edge is labelled with the time spent by the callee 229 on behalf of the caller. E.g, the edge from 230 <code>test_main_thread()</code> to <code>snprintf()</code> indicates 231 that of the 200 samples in <code>test_main_thread()</code>, 37 are 232 because of calls to <code>snprintf()</code>.</p> 233 234 <p>Note that <code>test_main_thread()</code> has an edge to 235 <code>vsnprintf()</code>, even though <code>test_main_thread()</code> 236 doesn't call that function directly. This is because the code was 237 compiled with <code>-O2</code>; the profile reflects the optimized 238 control flow.</p> 239 240 <h3>Meta Information</h3> 241 242 <p>The top of the display should contain some meta information 243 like:</p> 244 <pre> 245 /tmp/profiler2_unittest 246 Total samples: 202 247 Focusing on: 202 248 Dropped nodes with <= 1 abs(samples) 249 Dropped edges with <= 0 samples 250 </pre> 251 252 <p>This section contains the name of the program, and the total 253 samples collected during the profiling run. If the 254 <code>--focus</code> option is on (see the <a href="#focus">Focus</a> 255 section below), the legend also contains the number of samples being 256 shown in the focused display. Furthermore, some unimportant nodes and 257 edges are dropped to reduce clutter. The characteristics of the 258 dropped nodes and edges are also displayed in the legend.</p> 259 260 <h3><a name=focus>Focus and Ignore</a></h3> 261 262 <p>You can ask pprof to generate a display focused on a particular 263 piece of the program. You specify a regular expression. Any portion 264 of the call-graph that is on a path which contains at least one node 265 matching the regular expression is preserved. The rest of the 266 call-graph is dropped on the floor. For example, you can focus on the 267 <code>vsnprintf()</code> libc call in <code>profiler2_unittest</code> 268 as follows:</p> 269 270 <pre> 271 % pprof --gv --focus=vsnprintf /tmp/profiler2_unittest test.prof 272 </pre> 273 <A HREF="pprof-vsnprintf-big.gif"> 274 <center><table><tr><td> 275 <img src="pprof-vsnprintf.gif"> 276 </td></tr></table></center> 277 </A> 278 279 <p>Similarly, you can supply the <code>--ignore</code> option to 280 ignore samples that match a specified regular expression. E.g., if 281 you are interested in everything except calls to 282 <code>snprintf()</code>, you can say:</p> 283 <pre> 284 % pprof --gv --ignore=snprintf /tmp/profiler2_unittest test.prof 285 </pre> 286 287 288 <h3>Interactive mode</a></h3> 289 290 <p>By default -- if you don't specify any flags to the contrary -- 291 pprof runs in interactive mode. At the <code>(pprof)</code> prompt, 292 you can run many of the commands described above. You can type 293 <code>help</code> for a list of what commands are available in 294 interactive mode.</p> 295 296 <h3><a name=options>pprof Options</a></h3> 297 298 For a complete list of pprof options, you can run <code>pprof 299 --help</code>. 300 301 <h4>Output Type</h4> 302 303 <p> 304 <center> 305 <table frame=box rules=sides cellpadding=5 width=100%> 306 <tr valign=top> 307 <td><code>--text</code></td> 308 <td> 309 Produces a textual listing. (Note: If you have an X display, and 310 <code>dot</code> and <code>gv</code> installed, you will probably 311 be happier with the <code>--gv</code> output.) 312 </td> 313 </tr> 314 <tr valign=top> 315 <td><code>--gv</code></td> 316 <td> 317 Generates annotated call-graph, converts to postscript, and 318 displays via gv (requres <code>dot</code> and <code>gv</code> be 319 installed). 320 </td> 321 </tr> 322 <tr valign=top> 323 <td><code>--dot</code></td> 324 <td> 325 Generates the annotated call-graph in dot format and 326 emits to stdout (requres <code>dot</code> be installed). 327 </td> 328 </tr> 329 <tr valign=top> 330 <td><code>--ps</code></td> 331 <td> 332 Generates the annotated call-graph in Postscript format and 333 emits to stdout (requres <code>dot</code> be installed). 334 </td> 335 </tr> 336 <tr valign=top> 337 <td><code>--pdf</code></td> 338 <td> 339 Generates the annotated call-graph in PDF format and emits to 340 stdout (requires <code>dot</code> and <code>ps2pdf</code> be 341 installed). 342 </td> 343 </tr> 344 <tr valign=top> 345 <td><code>--gif</code></td> 346 <td> 347 Generates the annotated call-graph in GIF format and 348 emits to stdout (requres <code>dot</code> be installed). 349 </td> 350 </tr> 351 <tr valign=top> 352 <td><code>--list=<<i>regexp</i>></code></td> 353 <td> 354 <p>Outputs source-code listing of routines whose 355 name matches <regexp>. Each line 356 in the listing is annotated with flat and cumulative 357 sample counts.</p> 358 359 <p>In the presence of inlined calls, the samples 360 associated with inlined code tend to get assigned 361 to a line that follows the location of the 362 inlined call. A more precise accounting can be 363 obtained by disassembling the routine using the 364 --disasm flag.</p> 365 </td> 366 </tr> 367 <tr valign=top> 368 <td><code>--disasm=<<i>regexp</i>></code></td> 369 <td> 370 Generates disassembly of routines that match 371 <regexp>, annotated with flat and 372 cumulative sample counts and emits to stdout. 373 </td> 374 </tr> 375 </table> 376 </center> 377 378 <h4>Reporting Granularity</h4> 379 380 <p>By default, pprof produces one entry per procedure. However you can 381 use one of the following options to change the granularity of the 382 output. The <code>--files</code> option seems to be particularly 383 useless, and may be removed eventually.</p> 384 385 <center> 386 <table frame=box rules=sides cellpadding=5 width=100%> 387 <tr valign=top> 388 <td><code>--addresses</code></td> 389 <td> 390 Produce one node per program address. 391 </td> 392 </tr> 393 <td><code>--lines</code></td> 394 <td> 395 Produce one node per source line. 396 </td> 397 </tr> 398 <td><code>--functions</code></td> 399 <td> 400 Produce one node per function (this is the default). 401 </td> 402 </tr> 403 <td><code>--files</code></td> 404 <td> 405 Produce one node per source file. 406 </td> 407 </tr> 408 </table> 409 </center> 410 411 <h4>Controlling the Call Graph Display</h4> 412 413 <p>Some nodes and edges are dropped to reduce clutter in the output 414 display. The following options control this effect:</p> 415 416 <center> 417 <table frame=box rules=sides cellpadding=5 width=100%> 418 <tr valign=top> 419 <td><code>--nodecount=<n></code></td> 420 <td> 421 This option controls the number of displayed nodes. The nodes 422 are first sorted by decreasing cumulative count, and then only 423 the top N nodes are kept. The default value is 80. 424 </td> 425 </tr> 426 <tr valign=top> 427 <td><code>--nodefraction=<f></code></td> 428 <td> 429 This option provides another mechanism for discarding nodes 430 from the display. If the cumulative count for a node is 431 less than this option's value multiplied by the total count 432 for the profile, the node is dropped. The default value 433 is 0.005; i.e. nodes that account for less than 434 half a percent of the total time are dropped. A node 435 is dropped if either this condition is satisfied, or the 436 --nodecount condition is satisfied. 437 </td> 438 </tr> 439 <tr valign=top> 440 <td><code>--edgefraction=<f></code></td> 441 <td> 442 This option controls the number of displayed edges. First of all, 443 an edge is dropped if either its source or destination node is 444 dropped. Otherwise, the edge is dropped if the sample 445 count along the edge is less than this option's value multiplied 446 by the total count for the profile. The default value is 447 0.001; i.e., edges that account for less than 448 0.1% of the total time are dropped. 449 </td> 450 </tr> 451 <tr valign=top> 452 <td><code>--focus=<re></code></td> 453 <td> 454 This option controls what region of the graph is displayed 455 based on the regular expression supplied with the option. 456 For any path in the callgraph, we check all nodes in the path 457 against the supplied regular expression. If none of the nodes 458 match, the path is dropped from the output. 459 </td> 460 </tr> 461 <tr valign=top> 462 <td><code>--ignore=<re></code></td> 463 <td> 464 This option controls what region of the graph is displayed 465 based on the regular expression supplied with the option. 466 For any path in the callgraph, we check all nodes in the path 467 against the supplied regular expression. If any of the nodes 468 match, the path is dropped from the output. 469 </td> 470 </tr> 471 </table> 472 </center> 473 474 <p>The dropped edges and nodes account for some count mismatches in 475 the display. For example, the cumulative count for 476 <code>snprintf()</code> in the first diagram above was 41. However 477 the local count (1) and the count along the outgoing edges (12+1+20+6) 478 add up to only 40.</p> 479 480 481 <h1>Caveats</h1> 482 483 <ul> 484 <li> If the program exits because of a signal, the generated profile 485 will be <font color=red>incomplete, and may perhaps be 486 completely empty</font>. 487 <li> The displayed graph may have disconnected regions because 488 of the edge-dropping heuristics described above. 489 <li> If the program linked in a library that was not compiled 490 with enough symbolic information, all samples associated 491 with the library may be charged to the last symbol found 492 in the program before the library. This will artificially 493 inflate the count for that symbol. 494 <li> If you run the program on one machine, and profile it on 495 another, and the shared libraries are different on the two 496 machines, the profiling output may be confusing: samples that 497 fall within shared libaries may be assigned to arbitrary 498 procedures. 499 <li> If your program forks, the children will also be profiled 500 (since they inherit the same CPUPROFILE setting). Each process 501 is profiled separately; to distinguish the child profiles from 502 the parent profile and from each other, all children will have 503 their process-id appended to the CPUPROFILE name. 504 <li> Due to a hack we make to work around a possible gcc bug, your 505 profiles may end up named strangely if the first character of 506 your CPUPROFILE variable has ascii value greater than 127. 507 This should be exceedingly rare, but if you need to use such a 508 name, just set prepend <code>./</code> to your filename: 509 <code>CPUPROFILE=./Ägypten</code>. 510 </ul> 511 512 513 <hr> 514 <address>Sanjay Ghemawat<br> 515 <!-- Created: Tue Dec 19 10:43:14 PST 2000 --> 516 <!-- hhmts start --> 517 Last modified: Fri May 9 14:41:29 PDT 2008 518 <!-- hhmts end --> 519 </address> 520 </BODY> 521 </HTML> 522