<?xml version="1.0" encoding='ISO-8859-1'?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">

<book id="oprofile-internals">
<bookinfo>
	<title>OProfile Internals</title>

	<authorgroup>
		<author>
			<firstname>John</firstname>
			<surname>Levon</surname>
			<affiliation>
				<address><email>levon@movementarian.org</email></address>
			</affiliation>
		</author>
	</authorgroup>

	<copyright>
		<year>2003</year>
		<holder>John Levon</holder>
	</copyright>
</bookinfo>

<toc></toc>

<chapter id="introduction">
<title>Introduction</title>

<para>
This document is current for OProfile version <oprofileversion />.
It provides some details on the internal workings of OProfile for the
interested hacker, and assumes strong C, working C++, plus some knowledge of
kernel internals and CPU hardware.
</para>
<note>
<para>
Only the "new" implementation associated with kernel 2.6 and above is covered here.
Kernel 2.4 uses a very different kernel module implementation and daemon to produce
the sample files.
</para>
</note>

<sect1 id="overview">
<title>Overview</title>
<para>
OProfile is a statistical continuous profiler. In other words, profiles are generated by
regularly sampling the current registers on each CPU (from an interrupt handler, the
saved PC value at the time of interrupt is stored), and converting that runtime PC
value into something meaningful to the programmer.
</para>
<para>
OProfile achieves this by taking the stream of sampled PC values, along with the detail
of which task was running at the time of the interrupt, and converting it into a file offset
against a particular binary file. Because applications <function>mmap()</function>
the code they run (be it <filename>/bin/bash</filename>, <filename>/lib/libfoo.so</filename>
or whatever), it's possible to find the relevant binary file and offset by walking
the task's list of mapped memory areas. Each PC value is thus converted into a tuple
of binary-image,offset. This is something that the userspace tools can use directly
to reconstruct where the code came from, including the particular assembly instructions,
symbol, and source line (via the binary's debug information if present).
</para>
<para>
Regularly sampling the PC value like this approximates what actually was executed and
how often - more often than not, this statistical approximation is good enough to
reflect reality. In common operation, the time between each sample interrupt is regulated
by a fixed number of clock cycles. This implies that the results will reflect where
the CPU is spending the most time; this is obviously a very useful information source
for performance analysis.
</para>
<para>
Sometimes though, an application programmer needs different kinds of information: for example,
"which of the source routines cause the most cache misses ?". The rise in importance of
such metrics in recent years has led many CPU manufacturers to provide hardware performance
counters capable of measuring these events at the hardware level. Typically, these counters
increment once per event, and generate an interrupt on reaching some pre-defined
number of events.
OProfile can use these interrupts to generate samples: then, the
profile results are a statistical approximation of how many of the given event
each area of code caused.
</para>
<para>
Consider a simplified system that only executes two functions A and B. A
takes one cycle to execute, whereas B takes 99 cycles. Imagine we run at
100 cycles a second, and we've set the performance counter to create an
interrupt after a set number of "events" (in this case an event is one
clock cycle). It should be clear that the chance of the interrupt
occurring in function A is 1/100, and 99/100 for function B. Thus, we
statistically approximate the actual relative performance features of
the two functions over time. This same analysis works for other types of
events, provided that the interrupt is tied to the number of events
occurring (that is, after N events, an interrupt is generated).
</para>
<para>
There is typically more than one of these counters, so it's possible to set up profiling
for several different event types. Using these counters gives us a powerful, low-overhead
way of gaining performance metrics. If OProfile, or the CPU, does not support performance
counters, then a simpler method is used: the kernel timer interrupt feeds samples
into OProfile itself.
</para>
<para>
The rest of this document concerns itself with how we get from receiving samples at
interrupt time to producing user-readable profile information.
</para>
</sect1>

<sect1 id="components">
<title>Components of the OProfile system</title>

<sect2 id="arch-specific-components">
<title>Architecture-specific components</title>
<para>
If OProfile supports the hardware performance counters found on
a particular architecture, code for handling the details of setting
up and managing these counters can be found in the kernel source
tree in the relevant <filename>arch/<emphasis>arch</emphasis>/oprofile/</filename>
directory. The architecture-specific implementation works by
filling in the <varname>oprofile_operations</varname> structure at init time. This
provides a set of operations such as <function>setup()</function>,
<function>start()</function>, <function>stop()</function>, etc.
that manage the hardware-specific details of fiddling with the
performance counter registers.
</para>
<para>
The other important facility available to the architecture code is
<function>oprofile_add_sample()</function>. This is where a particular sample
taken at interrupt time is fed into the generic OProfile driver code.
</para>
</sect2>

<sect2 id="filesystem">
<title>oprofilefs</title>
<para>
OProfile implements a pseudo-filesystem known as "oprofilefs", mounted from
userspace at <filename>/dev/oprofile</filename>. This consists of small
files for reporting configuration to, and receiving configuration from, userspace,
as well as the actual character device that the OProfile userspace receives samples
from. At <function>setup()</function> time, the architecture-specific code may
add further configuration files related to the details of the performance
counters. For example, on x86, one numbered directory for each hardware
performance counter is added, with files in each for the event type,
reset value, etc.
</para>
<para>
The filesystem also contains a <filename>stats</filename> directory with
a number of useful counters for various OProfile events.
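For illustration, a mounted oprofilefs on an x86 machine with two configured
counters might look roughly like this (an illustrative sketch only; the exact
set of files varies by architecture and kernel version):
<screen>
# ls /dev/oprofile
0  1  buffer  buffer_size  cpu_buffer_size  cpu_type  dump  enable  stats
# ls /dev/oprofile/0
count  enabled  event  kernel  unit_mask  user
</screen>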
</para>
</sect2>

<sect2 id="driver">
<title>Generic kernel driver</title>
<para>
This lives in <filename>drivers/oprofile/</filename>, and forms the core of
how OProfile works in the kernel. Its job is to take samples delivered
from the architecture-specific code (via <function>oprofile_add_sample()</function>),
and buffer this data, in a transformed form as described later, until releasing
the data to the userspace daemon via the <filename>/dev/oprofile/buffer</filename>
character device.
</para>
</sect2>

<sect2 id="daemon">
<title>The OProfile daemon</title>
<para>
The OProfile userspace daemon's job is to take the raw data provided by the
kernel and write it to disk. It takes the single data stream from the
kernel and logs sample data against a number of sample files (found in
<filename>$SESSION_DIR/samples/current/</filename>, by default located at
<filename>/var/lib/oprofile/samples/current/</filename>). For the benefit
of the "separate" functionality, the names/paths of these sample files
are mangled to reflect where the samples were from: this can include
thread IDs, the binary file path, the event type used, and more.
</para>
<para>
After this final step from interrupt to disk file, the data is now
persistent (that is, changes in the running of the system do not invalidate
stored data). So the post-profiling tools can run on this data at any
time (assuming the original binary files are still available and unchanged,
naturally).
</para>
</sect2>

<sect2 id="post-profiling">
<title>Post-profiling tools</title>
<para>
So far, we've collected data, but we've yet to present it in a useful form
to the user. This is the job of the post-profiling tools. In general form,
they collate a subset of the available sample files, load and process each one
correlated against the relevant binary file, and finally produce user-readable
information.
</para>
</sect2>

</sect1>

</chapter>

<chapter id="performance-counters">
<title>Performance counter management</title>

<sect1 id="performance-counters-ui">
<title>Providing a user interface</title>

<para>
The performance counter registers need programming in order to set the
type of event to count, etc. OProfile uses a standard model across all
CPUs for defining these events as follows:
</para>
<informaltable frame="all">
<tgroup cols='2'>
<tbody>
<row><entry><option>event</option></entry><entry>The event type, e.g.
DATA_MEM_REFS</entry></row>
<row><entry><option>unit mask</option></entry><entry>The sub-events to count (more detailed specification)</entry></row>
<row><entry><option>counter</option></entry><entry>The hardware counter(s) that can count this event</entry></row>
<row><entry><option>count</option></entry><entry>The reset value (how many events before an interrupt)</entry></row>
<row><entry><option>kernel</option></entry><entry>Whether the counter should increment when in kernel space</entry></row>
<row><entry><option>user</option></entry><entry>Whether the counter should increment when in user space</entry></row>
</tbody>
</tgroup>
</informaltable>
<para>
The term "unit mask" is borrowed from the Intel architectures, and can
further specify exactly when a counter is incremented (for example,
cache-related events can be restricted to particular state transitions
of the cache lines).
</para>
<para>
All of the available hardware events and their details are specified in
the textual files in the <filename>events</filename> directory. The
syntax of these files should be fairly obvious. The user specifies the
names and configuration details of the chosen counters via
<command>opcontrol</command>. These are then written to the kernel
module (in numerical form) via <filename>/dev/oprofile/N/</filename>,
where N is the physical hardware counter number (some events can only be used
on specific counters; OProfile hides these details from the user when
possible). On IA64, the perfmon-based interface behaves somewhat
differently, as described later.
</para>

</sect1>

<sect1 id="performance-counters-programming">
<title>Programming the performance counter registers</title>

<para>
We have described how the user interface fills in the desired
configuration of the counters and transmits the information to the
kernel. It is the job of the <function>->setup()</function> method
to actually program the performance counter registers. Clearly, the
details of how this is done are architecture-specific; it is also
model-specific on many architectures. For example, i386 provides methods
for each model type that program the counter registers correctly
(see the <filename>op_model_*</filename> files in
<filename>arch/i386/oprofile</filename> for the details). The method
reads the values stored in the virtual oprofilefs files and programs
the registers appropriately, ready for starting the actual profiling
session.
</para>
<para>
The architecture-specific drivers make sure to save the old register
settings before doing OProfile setup. They are restored when OProfile
shuts down. This is useful, for example, on i386, where the NMI watchdog
uses the same performance counter registers as OProfile; they cannot
run concurrently, but OProfile makes sure to restore the setup it found
before it was running.
</para>
<para>
In addition to programming the counter registers themselves, other setup
is often necessary. For example, on i386, the local APIC needs
programming in order to make the counter's overflow interrupt appear as
an NMI (non-maskable interrupt). This allows sampling (and therefore
profiling) of regions where "normal" interrupts are masked, enabling
more reliable profiles.
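To make the preceding description concrete, here is a minimal sketch of how an
architecture driver might fill in <varname>oprofile_operations</varname> and program
the counters at <function>setup()</function> time. This is illustrative only, not the
actual kernel code: the <varname>ctr_config</varname> array, the init function name and
the <function>write_*</function> helpers are invented for the example.
<screen>
/* Illustrative sketch only -- see arch/i386/oprofile/ for the real code. */
struct ctr {
        unsigned long enabled, event, unit_mask, count, kernel, user;
};
static struct ctr ctr_config[NR_CTRS];   /* filled in from the oprofilefs files */

static int my_setup(void)
{
        int i;
        for (i = 0; i &lt; NR_CTRS; ++i) {
                if (!ctr_config[i].enabled)
                        continue;
                /* Select the event, unit mask and kernel/user filtering ... */
                write_event_select(i, ctr_config[i].event, ctr_config[i].unit_mask);
                /* ... and write the negative reset value, so the counter
                 * overflows (and interrupts) after "count" further events. */
                write_counter(i, -(long)ctr_config[i].count);
        }
        return 0;
}

static int my_arch_init(struct oprofile_operations *ops)
{
        /* The real drivers also save the previous register contents here,
         * so they can be restored at shutdown (e.g. for the NMI watchdog). */
        ops->setup = my_setup;
        ops->start = my_start;
        ops->stop  = my_stop;
        return 0;
}
</screen>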
</para>

<sect2 id="performance-counters-start">
<title>Starting and stopping the counters</title>
<para>
Initiating a profiling session is done by writing an ASCII '1'
to the file <filename>/dev/oprofile/enable</filename>. This sets up the
core, and calls into the architecture-specific driver to actually
enable each configured counter. Again, the details of how this is
done are model-specific (for example, the Athlon models can disable
or enable on a per-counter basis, unlike the PPro models).
</para>
</sect2>

<sect2>
<title>IA64 and perfmon</title>
<para>
The IA64 architecture provides a different interface from the other
architectures, using the existing perfmon driver. Register programming
is handled entirely in user-space (see
<filename>daemon/opd_perfmon.c</filename> for the details). A process
is forked for each CPU, which creates a perfmon context and sets the
counter registers appropriately via the
<function>sys_perfmonctl</function> interface. In addition, the actual
initiation and termination of the profiling session is handled via the
same interface using <constant>PFM_START</constant> and
<constant>PFM_STOP</constant>. On IA64, then, there are no oprofilefs
files for the performance counters, as the kernel driver does not
program the registers itself.
</para>
<para>
Instead, the perfmon driver for OProfile simply registers with the
OProfile core with an OProfile-specific UUID. During a profiling
session, the perfmon core calls into the OProfile perfmon driver and
samples are registered with the OProfile core itself as usual (with
<function>oprofile_add_sample()</function>).
</para>
</sect2>

</sect1>

</chapter>

<chapter id="collecting-samples">
<title>Collecting and processing samples</title>

<sect1 id="receiving-interrupts">
<title>Receiving interrupts</title>
<para>
Naturally, how the overflow interrupts are received is specific
to the hardware architecture, unless we are in "timer" mode, where the
logging routine is called directly from the standard kernel timer
interrupt handler.
</para>
<para>
On the i386 architecture, the local APIC is programmed such that when a
counter overflows (that is, it receives an event that causes an integer
overflow of the register value to zero), an NMI is generated. This calls
into the general handler <function>do_nmi()</function>; because OProfile
has registered itself as capable of handling NMI interrupts, this will
call into the OProfile driver code in
<filename>arch/i386/oprofile</filename>. Here, the saved PC value (the
CPU saves the register set at the time of the interrupt on the stack,
available for inspection) is extracted, and the counters are examined to
find out which one generated the interrupt. Also determined is whether
the system was inside kernel or user space at the time of the interrupt.
These three pieces of information are then forwarded on to the OProfile
core via <function>oprofile_add_sample()</function>. Finally, the
counter values are reset to the chosen count value, to ensure another
interrupt happens after another N events have occurred. Other
architectures behave in a similar manner.
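In outline, the per-counter check performed from the NMI handler looks something like
the following sketch (simplified, with invented helper names; the real model-specific
code lives in the <filename>op_model_*</filename> files mentioned earlier):
<screen>
/* Simplified sketch of the overflow check run from the NMI handler. */
static int my_check_ctrs(struct pt_regs * const regs)
{
        int i;
        for (i = 0; i &lt; NR_CTRS; ++i) {
                if (!counter_overflowed(i))      /* illustrative helper */
                        continue;
                /* Hand the sample to the generic driver; the PC and the
                 * kernel/user flag are derived from the saved registers. */
                oprofile_add_sample(regs, i);
                /* Re-arm the counter so another NMI arrives after another
                 * "count" events. */
                write_counter(i, -(long)ctr_config[i].count);
        }
        return 1;
}
</screen>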
</para>
</sect1>

<sect1 id="core-structure">
<title>Core data structures</title>
<para>
Before considering what happens when we log a sample, we shall digress
for a moment and look at the general structure of the data collection
system.
</para>
<para>
OProfile maintains a small buffer for storing the logged samples for
each CPU on the system. Only this buffer is altered when we actually log
a sample (remember, we may still be in an NMI context, so no locking is
possible). The buffer is managed by a two-handed system; the "head"
iterator dictates where the next sample data should be placed in the
buffer. Of course, overflow of the buffer is possible, in which case
the sample is discarded.
</para>
<para>
It is critical to remember that at this point, the PC value is an
absolute value, and is therefore only meaningful in the context of which
task it was logged against. Thus, these per-CPU buffers also maintain
details of which task each logged sample is for, as described in the
next section. In addition, we store whether the sample was in kernel
space or user space (on some architectures and configurations, the address
space is not sub-divided neatly at a specific PC value, so we must store
this information).
</para>
<para>
As well as these small per-CPU buffers, we have a considerably larger
single buffer. This holds the data that is eventually copied out into
the OProfile daemon. On certain system events, the per-CPU buffers are
processed and entered (in mutated form) into the main buffer, known in
the source as the "event buffer". The "tail" iterator indicates the
point from which the CPU buffer may be read, up to the position of the "head"
iterator. This provides an entirely lock-free method for extracting data
from the CPU buffers. This process is described in detail later in this chapter.
</para>
<figure><title>The OProfile buffers</title>
<graphic fileref="buffers.png" />
</figure>
</sect1>

<sect1 id="logging-sample">
<title>Logging a sample</title>
<para>
As mentioned, the sample is logged into the buffer specific to the
current CPU. The CPU buffer is a simple array of pairs of unsigned long
values; for a sample, they hold the PC value and the counter for the
sample. (The counter value is later used to translate back into the relevant
event type the counter was programmed to count.)
</para>
<para>
In addition to logging the sample itself, we also log task switches.
This is simply done by storing the address of the last task to log a
sample on that CPU in a data structure, and writing a task switch entry
into the buffer if the new value of <function>current()</function> has
changed. Note that later we will directly de-reference this pointer;
this imposes certain restrictions on when and how the CPU buffers need
to be processed.
</para>
<para>
Finally, as mentioned, we log whether we have changed between kernel and
userspace using a similar method. Both of these variables
(<varname>last_task</varname> and <varname>last_is_kernel</varname>) are
reset when the CPU buffer is read.
</para>
</sect1>

<sect1 id="logging-stack">
<title>Logging stack traces</title>
<para>
OProfile can also provide statistical samples of call chains (on x86).
To do this, at sample time, the frame pointer chain is traversed, recording
the return address for each stack frame. This will only work if the code
was compiled with frame pointers, but we're careful to abort the
traversal if the frame pointer appears bad. We store the set of return
addresses straight into the CPU buffer. Note that, since this traversal
is keyed off the standard sample interrupt, the number of times a
function appears in a stack trace is not an indicator of how many times
the call site was executed: rather, it's related to the number of
samples we took where that call site was involved. Thus, the results for
stack traces are not necessarily proportional to the call counts:
typical programs will have many <function>main()</function> samples.
</para>
</sect1>

<sect1 id="synchronising-buffers">
<title>Synchronising the CPU buffers to the event buffer</title>
<!-- FIXME: update when percpu patch goes in -->
<para>
At some point, we have to process the data in each CPU buffer and enter
it into the main (event) buffer. The file
<filename>buffer_sync.c</filename> contains the relevant code. We
periodically (currently every <constant>HZ</constant>/4 jiffies) start
the synchronisation process. In addition, we process the buffers on
certain events, such as an application calling
<function>munmap()</function>. This is particularly important for
<function>exit()</function> - because the CPU buffers contain pointers
to the task structure, if we don't process all the buffers before the
task is actually destroyed and the task structure freed, then we could
end up trying to dereference a bogus pointer in one of the CPU buffers.
</para>
<para>
We also add a notification when a kernel module is loaded; this is so
that user-space can re-read <filename>/proc/modules</filename> to
determine the load addresses of kernel module text sections. Without
this notification, samples for a newly-loaded module could get lost or
be attributed to the wrong module.
</para>
<para>
The synchronisation itself works in the following manner: first, mutual
exclusion on the event buffer is taken. Remember, we do not need to do
that for each CPU buffer, as we only read from the tail iterator
(interrupts might be arriving at the same buffer, but they will write at
the position of the head iterator, leaving previously written entries
intact). Then, we process each CPU buffer in turn. A CPU switch
notification is added to the buffer first (for
<option>--separate=cpu</option> support). Then the processing of the
actual data starts.
</para>
<para>
As mentioned, the CPU buffer consists of task switch entries and the
actual samples. When the routine <function>sync_buffer()</function> sees
a task switch, the process ID and process group ID are recorded into the
event buffer, along with a dcookie (see below) identifying the
application binary (e.g. <filename>/bin/bash</filename>). The
<varname>mmap_sem</varname> for the task is then taken, to allow safe
iteration across the task's list of mapped areas. Each sample is then
processed as described in the next section.
</para>
<para>
After a buffer has been read, the tail iterator is updated to reflect
how much of the buffer was processed.
Note that when we determined how
much data there was to read in the CPU buffer, we also called
<function>cpu_buffer_reset()</function> to reset
<varname>last_task</varname> and <varname>last_is_kernel</varname>, as
we've already mentioned. During the processing, more samples may have
been arriving in the CPU buffer; this is OK because we are careful to
only update the tail iterator to reflect how much we actually read - on the next
buffer synchronisation, we will start again from that point.
</para>
</sect1>

<sect1 id="dentry-cookies">
<title>Identifying binary images</title>
<para>
In order to produce useful profiles, we need to be able to associate a
particular PC value sample with an actual ELF binary on the disk. This
leaves us with the problem of how to export this information to
user-space. We create unique IDs that identify a particular directory
entry (dentry), and write those IDs into the event buffer. Later on,
the user-space daemon can call the <function>lookup_dcookie</function>
system call, which looks up the ID and fills in the full path of
the binary image in the buffer user-space passes in. These IDs are
maintained by the code in <filename>fs/dcookies.c</filename>; the
cache lasts for as long as the daemon has the event buffer open.
</para>
</sect1>

<sect1 id="finding-dentry">
<title>Finding a sample's binary image and offset</title>
<para>
We haven't yet described how we process the absolute PC value into
something usable by the user-space daemon. When we find a sample entered
into the CPU buffer, we traverse the list of mappings for the task
(remember, we will have seen a task switch earlier, so we know which
task's lists to look at). When a mapping is found that contains the PC
value, we look up the mapped file's dentry in the dcookie cache. This
gives the dcookie ID that will uniquely identify the mapped file. Then
we alter the absolute value such that it is an offset from the start of
the file being mapped (the mapping need not start at the start of the
actual file, so we have to consider the offset value of the mapping). We
store this dcookie ID into the event buffer; this identifies which
binary the samples following it are against.
In this manner, we have converted a PC value, which has transitory
meaning only, into a static offset value for later processing by the
daemon.
</para>
<para>
We also attempt to avoid the relatively expensive lookup of the dentry
cookie value by storing the cookie value directly into the dentry
itself; then we can simply derive the cookie value immediately when we
find the correct mapping.
</para>
</sect1>

</chapter>

<chapter id="sample-files">
<title>Generating sample files</title>

<sect1 id="processing-buffer">
<title>Processing the buffer</title>

<para>
Now we can move on to user-space in our description of how raw interrupt
samples are processed into useful information. As we described in
previous sections, the kernel OProfile driver creates a large buffer of
sample data consisting of offset values, interspersed with
notification of changes in context. These context changes indicate how
following samples should be attributed, and include task switches, CPU
changes, and which dcookie the sample value is against.
By processing
this buffer entry-by-entry, we can determine where the samples should
be attributed to. This is particularly important when using the
<option>--separate</option> options.
</para>
<para>
The file <filename>daemon/opd_trans.c</filename> contains the basic routine
for the buffer processing. The <varname>struct transient</varname>
structure is used to hold changes in context. Its members are modified
as we process each entry; it is passed into the routines in
<filename>daemon/opd_sfile.c</filename> for actually logging the sample
to a particular sample file (which will be held in
<filename>$SESSION_DIR/samples/current</filename>).
</para>
<para>
The buffer format is designed for conciseness, as high sampling rates
can easily generate a lot of data. Thus, context changes are prefixed
by an escape code, identified by <function>is_escape_code()</function>.
If an escape code is found, the next entry in the buffer identifies
what type of context change is being read. These are handed off to
various handlers (see the <varname>handlers</varname> array), which
modify the transient structure as appropriate. If it's not an escape
code, then it must be a PC offset value, and the very next entry will
be the hardware counter number. These values are read and recorded
in the transient structure; we then do a lookup to find the correct
sample file, and log the sample, as described in the next section.
</para>

<sect2 id="handling-kernel-samples">
<title>Handling kernel samples</title>

<para>
Samples from kernel code require a little special handling. Because
the binary text which the sample is against does not correspond to
any file that the kernel directly knows about, the OProfile driver
stores the absolute PC value in the buffer, instead of the file offset.
Of course, we need an offset against some particular binary. To handle
this, we keep a list of loaded modules by parsing
<filename>/proc/modules</filename> as needed. When a module is loaded,
a notification is placed in the OProfile buffer, and this triggers a
re-read. We store the module name, and the loading address and size.
This is also done for the main kernel image, as specified by the user.
The absolute PC value is matched against each address range, and
modified into an offset when the matching module is found. See
<filename>daemon/opd_kernel.c</filename> for the details.
</para>

</sect2>

</sect1>

<sect1 id="sample-file-generation">
<title>Locating and creating sample files</title>

<para>
We have a sample value and its satellite data stored in a
<varname>struct transient</varname>, and we must locate an
actual sample file to store the sample in, using the context
information in the transient structure as a key. The transient data to
sample file lookup is handled in
<filename>daemon/opd_sfile.c</filename>. A hash is taken of the
transient values that are relevant (depending upon the setting of
<option>--separate</option>, some values might be irrelevant), and the
hash value is used to look up the list of currently open sample files.
Of course, the sample file might not be found, in which case we need
to create and open it.
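In outline, the lookup looks something like the following sketch (the helper names
are invented for illustration; the real code is in <filename>daemon/opd_sfile.c</filename>):
<screen>
/* Sketch of the transient-data to sample-file lookup. */
static struct sfile *find_sample_file(struct transient const *trans)
{
        /* Only the fields that matter for the current --separate setting
         * take part in the hash; the others are ignored. */
        unsigned long hash = sfile_hash_transient(trans);
        struct sfile *sf = lookup_open_sfile(hash, trans);

        if (!sf) {
                /* Not open yet: mangle a file name from the transient data,
                 * then create and open the sample file. */
                sf = open_new_sfile(trans);
                insert_open_sfile(hash, sf);
        }
        return sf;
}
</screen>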
</para>
<para>
OProfile uses a rather complex scheme for naming sample files, in order
to make selecting relevant sample files easier for the post-profiling
utilities. The exact details of the scheme are given in
<filename>oprofile-tests/pp_interface</filename>, but for now it will
suffice to remember that the filename will include only the information
relevant for the current settings, taken from the transient data. A
fully-specified filename looks something like:
</para>
<computeroutput>
/var/lib/oprofile/samples/current/{root}/usr/bin/xmms/{dep}/{root}/lib/tls/libc-2.3.2.so/CPU_CLK_UNHALTED.100000.0.28082.28089.0
</computeroutput>
<para>
It should be clear that this identifies such information as the
application binary, the dependent (library) binary, the hardware event,
and the process and thread ID. Typically, not all this information is
needed, in which case some values may be replaced with the token
<filename>all</filename>.
</para>
<para>
The code that generates this filename and opens the file is found in
<filename>daemon/opd_mangling.c</filename>. You may have realised that
at this point, we do not have the binary image file names, only the
dcookie values. In order to determine a file name, a dcookie value is
looked up in the dcookie cache. This is to be found in
<filename>daemon/opd_cookie.c</filename>. Since dcookies are both
persistent and unique during a sampling session, we can cache the
values. If the value is not found in the cache, then we ask the kernel
to do the lookup from value to file name for us by calling
<function>lookup_dcookie()</function>. This looks up the value in a
kernel-side cache (see <filename>fs/dcookies.c</filename>) and returns
the fully-qualified file name to userspace.
</para>

</sect1>

<sect1 id="sample-file-writing">
<title>Writing data to a sample file</title>

<para>
Each specific sample file is a hashed collection, where the key is
the PC offset from the transient data, and the value is the number of
samples recorded against that offset. The files are
<function>mmap()</function>ed into the daemon's memory space. The code
to actually log the write against the sample file can be found in
<filename>libdb/</filename>.
</para>
<para>
For recording stack traces, we have a more complicated sample filename
mangling scheme that allows us to identify cross-binary calls. We use
the same sample file format, where the key is a 64-bit value composed
from the from,to pair of offsets.
</para>

</sect1>

</chapter>

<chapter id="output">
<title>Generating useful output</title>

<para>
All of the tools used to generate human-readable output have to take
roughly the same steps to collect the data for processing. First, the
profile specification given by the user has to be parsed. Next, a list
of sample files matching the specification has to be obtained. Using this
list, we need to locate the binary file for each sample file, and then
use them to extract meaningful data, before a final collation and
presentation to the user.
</para>

<sect1 id="profile-specification">
<title>Handling the profile specification</title>

<para>
The profile specification presented by the user is parsed in
the function <function>profile_spec::create()</function>.
This
creates an object representing the specification. Then we
use <function>profile_spec::generate_file_list()</function>
to search for all sample files and match them against the
<varname>profile_spec</varname>.
</para>

<para>
To enable this matching process to work, the attributes of
each sample file are encoded in its filename. This is a low-tech
approach to matching specifications against candidate sample
files, but it works reasonably well. Typical sample files
might look like these:
</para>
<screen>
/var/lib/oprofile/samples/current/{root}/bin/ls/{dep}/{root}/bin/ls/{cg}/{root}/bin/ls/CPU_CLK_UNHALTED.100000.0.all.all.all
/var/lib/oprofile/samples/current/{root}/bin/ls/{dep}/{root}/bin/ls/CPU_CLK_UNHALTED.100000.0.all.all.all
/var/lib/oprofile/samples/current/{root}/bin/ls/{dep}/{root}/bin/ls/CPU_CLK_UNHALTED.100000.0.7423.7424.0
/var/lib/oprofile/samples/current/{kern}/r128/{dep}/{kern}/r128/CPU_CLK_UNHALTED.100000.0.all.all.all
</screen>
<para>
This looks unnecessarily complex, but it's actually fairly simple. First
we have the session of the sample, by default located at
<filename>/var/lib/oprofile/samples/current</filename>. This location
can be changed by specifying the <option>--session-dir</option> option on the command line.
This session could equally well be inside an archive from <command>oparchive</command>.
Next we have one of the tokens <filename>{root}</filename> or
<filename>{kern}</filename>. <filename>{root}</filename> indicates
that the binary is found on a file system, and we will encode its path
in the next part of the filename (e.g. <filename>/bin/ls</filename>).
<filename>{kern}</filename> indicates a kernel module - on 2.6 kernels
the path information is not available from the kernel, so we have to
special-case kernel modules like this; we encode merely the name of the
module as loaded.
</para>
<para>
Next there is a <filename>{dep}</filename> token, indicating another
token/path which identifies the dependent binary image. This is used even for
the "primary" binary (i.e. the one that was
<function>execve()</function>d), as it simplifies processing. Finally,
if this sample file is a normal flat profile, the actual file is next in
the path. If it's a call-graph sample file, we need one further
specification, to allow us to identify cross-binary arcs in the call
graph.
</para>
<para>
The actual sample file name is dot-separated, where the fields are, in
order: event name, event count, unit mask, task group ID, task ID, and
CPU number.
</para>
<para>
This sample file can be reliably parsed (with
<function>parse_filename()</function>) into a
<varname>filename_spec</varname>. Finally, we can check whether to
include the sample file in the final results by comparing this
<varname>filename_spec</varname> against the
<varname>profile_spec</varname> the user specified (for the interested,
see <function>valid_candidate()</function> and
<function>profile_spec::match</function>). Then comes the really
complicated bit...
</para>

</sect1>

<sect1 id="sample-file-collating">
<title>Collating the candidate sample files</title>

<para>
At this point we have a duplicate-free list of sample files we need
to process.
But first we need to do some further arrangement: we
need to classify each sample file, and we may also need to "invert"
the profiles.
</para>

<sect2 id="sample-file-classifying">
<title>Classifying sample files</title>

<para>
It's possible for utilities like <command>opreport</command> to show
data in columnar format: for example, we might want to show the results
of two threads within a process side-by-side. To do this, we need
to classify each sample file into classes - each class corresponds
to an <command>opreport</command> column. The function that handles
this is <function>arrange_profiles()</function>. Each sample file
is added to a particular class. If the sample file is the first in
its class, a template is generated from the sample file. Each template
describes a particular class (thus, in our example above, each template
will have a different thread ID, and this uniquely identifies each
class).
</para>

<para>
Each class has a list of "profile sets" matching that class's template.
A profile set is either a profile of the primary binary image, or any of
its dependent images. After all sample files have been listed in one of
the profile sets belonging to the classes, we have to name each class and
perform error-checking. This is done by
<function>identify_classes()</function>; each class is checked to ensure
that its "axis" is the same as all the others. This is needed because
<command>opreport</command> can't produce results in 3D format: we can
only differ in one aspect, such as thread ID or event name.
</para>

</sect2>

<sect2 id="sample-file-inverting">
<title>Creating inverted profile lists</title>

<para>
Remember that if we're using certain profile separation options, such as
<option>--separate=lib</option>, a single binary could be a dependent image
of many different binaries. For example, the C library image would be a
dependent image for most programs that have been profiled. As it
happens, this can cause severe performance problems: without some
re-arrangement, these dependent binary images would be opened each
time we need to process sample files for each program.
</para>

<para>
The solution is to "invert" the profiles via
<function>invert_profiles()</function>. We create a new data structure
where the dependent binary is first, and the primary binary images using
that dependent binary are listed as sub-images. This solves our
performance problem, as now we only need to open each dependent image
once, when we process the list of inverted profiles.
</para>

</sect2>

</sect1>

<sect1 id="generating-profile-data">
<title>Generating profile data</title>

<para>
Things don't get any simpler at this point, unfortunately. By now,
we've collected and classified the sample files into the set of inverted
profiles, as described in the previous section. Now we need to process
each inverted profile and make something of the data. The entry point
for this is <function>populate_for_image()</function>.
</para>

<sect2 id="bfd">
<title>Processing the binary image</title>
<para>
The first thing we do with an inverted profile is attempt to open the
binary image (remember each inverted profile set is only for one binary
image, but may have many sample files to process).
The
<varname>op_bfd</varname> class provides an abstracted interface to
this; internally it uses <filename>libbfd</filename>. The main purpose
of this class is to process the symbols for the binary image; this is
also where symbol filtering happens. This is actually quite tricky, but
should be clear from the source.
</para>
</sect2>

<sect2 id="processing-sample-files">
<title>Processing the sample files</title>
<para>
The class <varname>profile_container</varname> is a hold-all that
contains all the processed results. It is a container of
<varname>profile_t</varname> objects. The
<function>add_sample_files()</function> method uses
<filename>libdb</filename> to open the given sample file and add the
key/value pairs to the <varname>profile_t</varname>. Once this has been
done, <function>profile_container::add()</function> is passed the
<varname>profile_t</varname> plus the <varname>op_bfd</varname> for
processing.
</para>
<para>
<function>profile_container::add()</function> walks through the symbols
collected in the <varname>op_bfd</varname>.
<function>op_bfd::get_symbol_range()</function> gives us the start and
end of the symbol as offsets from the start of the binary image,
then we interrogate the <varname>profile_t</varname> for the relevant samples
for that offset range. We create a <varname>symbol_entry</varname>
object for this symbol and fill it in. If needed, here we also collect
debug information from the <varname>op_bfd</varname>, and possibly
record the detailed sample information (as used by <command>opreport
-d</command> and <command>opannotate</command>).
Finally the <varname>symbol_entry</varname> is added to
a private container of <varname>profile_container</varname> - this
<varname>symbol_container</varname> holds all such processed symbols.
</para>
</sect2>

</sect1>

<sect1 id="generating-output">
<title>Generating output</title>

<para>
After the processing described in the previous section, we've now got
full details of what we need to output stored in the
<varname>profile_container</varname> on a symbol-by-symbol basis. To
produce output, we need to replay that data and format it suitably.
</para>
<para>
<command>opreport</command> first asks the
<varname>profile_container</varname> for a
<varname>symbol_collection</varname> (this is also where thresholding
happens).
This is sorted, then an
<varname>opreport_formatter</varname> is initialised.
This object initialises a set of field formatters as requested. Then
<function>opreport_formatter::output()</function> is called. This
iterates through the (sorted) <varname>symbol_collection</varname>;
for each entry, the selected fields (as set by the
<varname>format_flags</varname> options) are output by calling the
field formatters, with the <varname>symbol_entry</varname> passed in.
</para>

</sect1>

</chapter>

<chapter id="ext">
<title>Extended Feature Interface</title>

<sect1 id="ext-intro">
<title>Introduction</title>

<para>
The Extended Feature Interface is a standard callback interface
designed to allow extension of the OProfile daemon's sample processing.
Each feature defines a set of callback handlers which can be enabled or
disabled through the OProfile daemon's command-line option.
This interface can be used to implement support for architecture-specific
features or features not commonly used by general OProfile users.
</para>

</sect1>

<sect1 id="ext-name-and-handlers">
<title>Feature Name and Handlers</title>

<para>
Each extended feature has an entry in the <varname>ext_feature_table</varname>
in <filename>opd_extended.cpp</filename>. Each entry contains a feature name,
and a corresponding set of handlers. The feature name is a unique string, which is
used to identify a feature in the table. Each feature provides a set
of handlers, which will be executed by the OProfile daemon from predetermined
locations to perform certain tasks. At runtime, the OProfile daemon calls a feature
handler wrapper from one of the predetermined locations to check whether
an extended feature is enabled, and whether a particular handler exists.
Only the handlers of the enabled feature will be executed.
</para>

</sect1>

<sect1 id="ext-enable">
<title>Enabling Features</title>

<para>
Each feature is enabled using the OProfile daemon (oprofiled) command-line
option "--ext-feature=&lt;extended-feature-name&gt;:[args]". The
"extended-feature-name" is used to determine the feature to be enabled.
The optional "args" is passed into the feature-specific initialization handler
(<function>ext_init</function>). Currently, only one extended feature can be
enabled at a time.
</para>

</sect1>

<sect1 id="ext-types-of-handlers">
<title>Types of Handlers</title>

<para>
Each feature is responsible for providing its own set of handlers.
The types of handler are:
</para>

<sect2 id="ext_init">
<title>ext_init Handler</title>

<para>
"ext_init" handles initialization of an extended feature. It takes an
"args" parameter, which is passed in through
"oprofiled --ext-feature=&lt;extended-feature-name&gt;:[args]". This handler is
executed in the function <function>opd_options()</function> in the file
<filename>daemon/oprofiled.c</filename>.
</para>

<note>
<para>
The ext_init handler is required for all features.
</para>
</note>

</sect2>

<sect2 id="ext_print_stats">
<title>ext_print_stats Handler</title>

<para>
"ext_print_stats" handles the extended feature statistics report. It adds
a new section in the OProfile daemon statistics report, which is normally
output to the file
<filename>/var/lib/oprofile/samples/oprofiled.log</filename>.
This handler is executed in the function <function>opd_print_stats()</function>
in the file <filename>daemon/opd_stats.c</filename>.
</para>

</sect2>

<sect2 id="ext_sfile_handlers">
<title>ext_sfile Handler</title>

<para>
"ext_sfile" contains a set of handlers related to operations on the extended
sample files (sample files for events related to an extended feature).
These operations include <function>create_sfile()</function>,
<function>sfile_dup()</function>, <function>close_sfile()</function>,
<function>sync_sfile()</function>, and <function>get_file()</function>
as defined in <filename>daemon/opd_sfile.c</filename>.
An additional field, <varname>odb_t * ext_file</varname>, is added to
<varname>struct sfile</varname> for storing extended sample file
information.
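Taken together, the handler structures and the feature table have roughly the
following shape. This is an abridged sketch for orientation only; consult
<filename>opd_extended.cpp</filename> for the real declarations.
<screen>
/* Abridged sketch of the handler structures and the feature table. */
struct opd_ext_sfile_handlers {
        int (*create)(struct sfile *sf);
        int (*dup)(struct sfile *to, struct sfile *from);
        int (*close)(struct sfile *sf);
        int (*sync)(struct sfile *sf);
        odb_t * (*get)(struct transient const *trans, int is_cg);
};

struct opd_ext_handlers {
        /* Required: parses the [args] part of --ext-feature. */
        int (*ext_init)(char const *args);
        /* Optional: adds a section to the daemon statistics report. */
        int (*ext_print_stats)(void);
        /* Optional: extended sample-file operations. */
        struct opd_ext_sfile_handlers *ext_sfile;
};

struct opd_ext_feature {
        const char *feature;                 /* e.g. "ibs" */
        struct opd_ext_handlers *handlers;
};

static struct opd_ext_feature ext_feature_table[] = {
        { "ibs", &amp;ibs_handlers },
        { NULL, NULL }
};
</screen>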
</para>

</sect2>

</sect1>

<sect1 id="ext-implementation">
<title>Extended Feature Reference Implementation</title>

<sect2 id="ext-ibs">
<title>Instruction-Based Sampling (IBS)</title>

<para>
An example of an extended feature implementation can be seen by
examining the AMD Instruction-Based Sampling support.
</para>

<sect3 id="ibs-init">
<title>IBS Initialization</title>

<para>
Instruction-Based Sampling (IBS) is a new performance measurement technique
available on AMD Family 10h processors. Enabling IBS profiling is done simply
by specifying IBS performance events through the "--event=" options.
</para>

<screen>
opcontrol --event=IBS_FETCH_XXX:&lt;count&gt;:&lt;um&gt;:&lt;kernel&gt;:&lt;user&gt;
opcontrol --event=IBS_OP_XXX:&lt;count&gt;:&lt;um&gt;:&lt;kernel&gt;:&lt;user&gt;

Note: * Count and unitmask for all IBS fetch events must be the same,
        as must those for IBS op.
</screen>

<para>
IBS performance events are listed by <command>opcontrol --list-events</command>.
When users specify these events, opcontrol verifies them using ophelp, which
checks for the <varname>ext:ibs_fetch</varname> or <varname>ext:ibs_op</varname>
tag in the <filename>events/x86-64/family10/events</filename> file.
Then, it configures the driver interface (/dev/oprofile/ibs_fetch/... and
/dev/oprofile/ibs_op/...) and starts the OProfile daemon as follows.
</para>

<screen>
oprofiled \
    --ext-feature=ibs:\
fetch:&lt;IBS_FETCH_EVENT1&gt;,&lt;IBS_FETCH_EVENT2&gt;,...,:&lt;IBS fetch count&gt;:&lt;IBS Fetch um&gt;|\
op:&lt;IBS_OP_EVENT1&gt;,&lt;IBS_OP_EVENT2&gt;,...,:&lt;IBS op count&gt;:&lt;IBS op um&gt;
</screen>

<para>
Here, the OProfile daemon parses the <varname>--ext-feature</varname>
option and checks the feature name ("ibs") before calling the
initialization function to handle the string
containing IBS events, counts, and unitmasks.
Then, it stores each event in the IBS virtual-counter table
(<varname>struct opd_event ibs_vc[OP_MAX_IBS_COUNTERS]</varname>) and
stores the event index in the IBS Virtual Counter Index (VCI) map
(<varname>ibs_vci_map[OP_MAX_IBS_COUNTERS]</varname>) with the IBS event value
as the map key.
</para>
</sect3>

<sect3 id="ibs-data-processing">
<title>IBS Data Processing</title>

<para>
During a profile session, the OProfile daemon identifies IBS samples in the
event buffer using the <varname>IBS_FETCH_CODE</varname> or
<varname>IBS_OP_CODE</varname> escape codes. These codes trigger the handlers
<function>code_ibs_fetch_sample()</function> or
<function>code_ibs_op_sample()</function> listed in the
<varname>handler_t handlers[]</varname> vector in
<filename>daemon/opd_trans.c</filename>. These handlers are responsible for
processing IBS samples and translating them into IBS performance events.
</para>

<para>
Unlike traditional performance events, multiple IBS performance events can
be derived from each IBS sample. For each event that the user specifies,
a combination of bits from the Model-Specific Registers (MSRs) is checked
against the bitmask defining the event. If the condition is met, the event
will then be recorded. The derivation logic is in the files
<filename>daemon/opd_ibs_macro.h</filename> and
<filename>daemon/opd_ibs_trans.[h,c]</filename>.
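The derivation step amounts to something like the following sketch (all names here are
illustrative, not the real identifiers; the actual logic lives in the files named above):
<screen>
/* Sketch: derive the user-selected IBS fetch events from one IBS sample. */
static void derive_ibs_fetch_events(struct transient *trans,
                                    unsigned long fetch_ctl_msr)
{
        unsigned int i;
        for (i = 0; i &lt; nr_selected_fetch_events; ++i) {
                unsigned long mask = selected_fetch_event_mask[i];
                /* A derived event "hits" when all of the MSR status bits
                 * defining it are set in this sample... */
                if ((fetch_ctl_msr &amp; mask) == mask)
                        /* ...in which case one sample is logged against the
                         * corresponding IBS virtual counter. */
                        record_ibs_derived_event(trans, selected_fetch_events[i]);
        }
}
</screen>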
</para>

</sect3>

<sect3 id="ibs-sample-file">
<title>IBS Sample File</title>

<para>
Traditionally, sample file information (<varname>odb_t</varname>) is stored
in the array <varname>struct sfile::odb_t file[OP_MAX_COUNTER]</varname>.
Currently, <varname>OP_MAX_COUNTER</varname> is 8 on non-alpha systems, and 20 on
alpha-based systems. The event index (the number of the counter on which the event
is configured) is used to access the corresponding entry in the array.
Unlike traditional performance events, IBS does not use the actual
counter registers (i.e. <filename>/dev/oprofile/0,1,2,3</filename>).
Also, the number of performance events generated by IBS can be larger than
<varname>OP_MAX_COUNTER</varname> (currently up to 13 IBS-fetch and 46 IBS-op
events). Therefore IBS requires a special data structure and sfile
handlers (<varname>struct opd_ext_sfile_handlers</varname>) for managing
IBS sample files. IBS-sample-file information is stored in memory
allocated by the <function>ibs_sfile_create()</function> handler, and can
be accessed through <varname>struct sfile::odb_t * ext_files</varname>.
</para>

</sect3>

</sect2>

</sect1>

</chapter>

<glossary id="glossary">
<title>Glossary of OProfile source concepts and types</title>

<glossentry><glossterm>application image</glossterm>
<glossdef><para>
The primary binary image used by an application. This is derived
from the kernel and corresponds to the binary started upon running
an application: for example, <filename>/bin/bash</filename>.
</para></glossdef></glossentry>

<glossentry><glossterm>binary image</glossterm>
<glossdef><para>
An ELF file containing executable code: this includes kernel modules,
the kernel itself (a.k.a. <filename>vmlinux</filename>), shared libraries,
and application binaries.
</para></glossdef></glossentry>

<glossentry><glossterm>dcookie</glossterm>
<glossdef><para>
Short for "dentry cookie". A unique ID that can be looked up to provide
the full path name of a binary image.
</para></glossdef></glossentry>

<glossentry><glossterm>dependent image</glossterm>
<glossdef><para>
A binary image that is dependent upon an application, used with
per-application separation. Most commonly, shared libraries. For example,
if <filename>/bin/bash</filename> is running and we take
some samples inside the C library itself due to <command>bash</command>
calling library code, then the image <filename>/lib/libc.so</filename>
would be dependent upon <filename>/bin/bash</filename>.
</para></glossdef></glossentry>

<glossentry><glossterm>merging</glossterm>
<glossdef><para>
This refers to the ability to merge several distinct sample files
into one set of data at runtime, in the post-profiling tools. For example,
per-thread sample files can be merged into one set of data, because
they are compatible (i.e. the aggregation of the data is meaningful),
but it's not possible to merge sample files for two different events,
because there would be no useful meaning to the results.
</para></glossdef></glossentry>

<glossentry><glossterm>profile class</glossterm>
<glossdef><para>
A collection of profile data that has been collected under the same
class template.
For example, if we're using <command>opreport</command>
to show results after profiling with two performance counters enabled,
counting <constant>DATA_MEM_REFS</constant> and <constant>CPU_CLK_UNHALTED</constant>,
there would be two profile classes, one for each event. Or if we're on
an SMP system and doing per-cpu profiling, and we request
<command>opreport</command> to show results for each CPU side-by-side,
there would be a profile class for each CPU.
</para></glossdef></glossentry>

<glossentry><glossterm>profile specification</glossterm>
<glossdef><para>
The parameters the user passes to the post-profiling tools that limit
what sample files are used. This specification is matched against
the available sample files to generate a selection of profile data.
</para></glossdef></glossentry>

<glossentry><glossterm>profile template</glossterm>
<glossdef><para>
The parameters that define what goes in a particular profile class.
This includes a symbolic name (e.g. "cpu:1") and the code-usable
equivalent.
</para></glossdef></glossentry>

</glossary>

</book>