<?xml version="1.0"?> <!-- -*- sgml -*- -->
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">

<chapter id="bbv-manual" xreflabel="BBV">
<title>BBV: an experimental basic block vector generation tool</title>

<para>To use this tool, you must specify
<option>--tool=exp-bbv</option> on the Valgrind
command line.</para>

<sect1 id="bbv-manual.overview" xreflabel="Overview">
<title>Overview</title>

<para>
   A basic block is a linear section of code with one entry point and one exit
   point. A <emphasis>basic block vector</emphasis> (BBV) is a list of all
   basic blocks entered during program execution, and a count of how many
   times each basic block was run.
</para>

<para>
   BBV is a tool that generates basic block vectors for use with the
   <ulink url="http://www.cse.ucsd.edu/~calder/simpoint/">SimPoint</ulink>
   analysis tool.
   The SimPoint methodology enables speeding up architectural
   simulations by only running a small portion of a program
   and then extrapolating total behavior from this
   small portion. Most programs exhibit phase-based behavior, which
   means that at various times during execution a program will encounter
   intervals of time where the code behaves similarly to a previous
   interval. If you can detect these intervals and group them together,
   an approximation of the total program behavior can be obtained
   by only simulating a bare minimum number of intervals, and then scaling
   the results.
</para>

<para>
   In computer architecture research, running a
   benchmark on a cycle-accurate simulator can cause slowdowns on the order
   of 1000 times, making it take days, weeks, or even longer to run full
   benchmarks. By utilizing SimPoint this can be reduced significantly,
   usually by 90-95%, while still retaining reasonable accuracy.
</para>

<para>
   A more complete introduction to how SimPoint works can be
   found in the paper "Automatically Characterizing Large Scale
   Program Behavior" by T. Sherwood, E. Perelman, G. Hamerly, and
   B. Calder.
</para>

</sect1>

<sect1 id="bbv-manual.quickstart" xreflabel="Quick Start">
<title>Using Basic Block Vectors to create SimPoints</title>

<para>
   To quickly create a basic block vector file, run Valgrind
   like this:

   <programlisting>valgrind --tool=exp-bbv /bin/ls</programlisting>

   In this case we are running on <filename>/bin/ls</filename>,
   but this can be any program. By default a file called
   <computeroutput>bb.out.PID</computeroutput> will be created,
   where PID is replaced by the process ID of the running process.
   This file contains the basic block vector. For long-running programs
   this file can be quite large, so it might be wise to compress
   it with gzip or some other compression program.
</para>

<para>
   To create actual SimPoint results, you will need the SimPoint utility,
   available from the
   <ulink url="http://www.cse.ucsd.edu/~calder/simpoint/">SimPoint webpage</ulink>.
   Assuming you have downloaded SimPoint 3.2 and compiled it,
   create SimPoint results with a command like the following:

   <programlisting><![CDATA[
./SimPoint.3.2/bin/simpoint -inputVectorsGzipped \
                            -loadFVFile bb.out.1234.gz \
                            -k 5 -saveSimpoints results.simpts \
                            -saveSimpointWeights results.weights]]></programlisting>

   where <filename>bb.out.1234.gz</filename> is your compressed basic
   block vector file generated by BBV.
</para>

<para>
   The SimPoint utility performs random linear projection using 15 dimensions,
   then uses k-means clustering to calculate which intervals are
   of interest. In this example we specify 5 intervals with the
   <option>-k 5</option> option.
</para>

<para>
   The outputs from the SimPoint run are the
   <computeroutput>results.simpts</computeroutput>
   and <computeroutput>results.weights</computeroutput> files.
   The first holds the 5 most relevant intervals of the program.
   The second holds the weight to scale each interval by when
   extrapolating full-program behavior. The intervals and the weights
   can be used in conjunction with a simulator that supports
   fast-forwarding; you fast-forward to the interval of interest,
   collect statistics for the desired interval length, then combine
   those statistics with the weights to
   calculate your results.
</para>

</sect1>

<sect1 id="bbv-manual.usage" xreflabel="BBV Command-line Options">
<title>BBV Command-line Options</title>

<para> BBV-specific command-line options are:</para>

<!-- start of xi:include in the manpage -->
<variablelist id="bbv.opts.list">

  <varlistentry id="opt.bb-out-file" xreflabel="--bb-out-file">
    <term>
      <option><![CDATA[--bb-out-file=<name> [default: bb.out.%p] ]]></option>
    </term>
    <listitem>
      <para>
      This option selects the name of the basic block vector file. The
      <option>%p</option> and <option>%q</option> format specifiers can be
      used to embed the process ID and/or the contents of an environment
      variable in the name, as is the case for the core option
      <option><xref linkend="opt.log-file"/></option>.
      </para>
    </listitem>
  </varlistentry>

  <varlistentry id="opt.pc-out-file" xreflabel="--pc-out-file">
    <term>
      <option><![CDATA[--pc-out-file=<name> [default: pc.out.%p] ]]></option>
    </term>
    <listitem>
      <para>
      This option selects the name of the PC file.
      This file holds program counter addresses
      and function name info for the various basic blocks.
      This can be used in conjunction
      with the basic block vector file to fast-forward via function names
      instead of just instruction counts. The
      <option>%p</option> and <option>%q</option> format specifiers can be
      used to embed the process ID and/or the contents of an environment
      variable in the name, as is the case for the core option
      <option><xref linkend="opt.log-file"/></option>.
      </para>
    </listitem>
  </varlistentry>

  <varlistentry id="opt.interval-size" xreflabel="--interval-size">
    <term>
      <option><![CDATA[--interval-size=<number> [default: 100000000] ]]></option>
    </term>
    <listitem>
      <para>
      This option selects the size of the interval to use.
      The default is 100
      million instructions, which is a commonly used value.
      Other sizes can be used; smaller intervals can help programs
      with finer-grained phases. However, smaller interval sizes
      can lead to accuracy issues due to warm-up effects.
      (When fast-forwarding, the various architectural features
      will be uninitialized, and it will take some number
      of instructions before they "warm up" to the state a
      full simulation would have reached without the fast-forwarding.
      Large interval sizes tend to mitigate this.)
      </para>
    </listitem>
  </varlistentry>

  <varlistentry id="opt.instr-count-only" xreflabel="--instr-count-only">
    <term>
      <option><![CDATA[--instr-count-only [default: no] ]]></option>
    </term>
    <listitem>
      <para>
      This option tells the tool to only display instruction count
      totals, and to not generate the actual basic block vector file.
      This is useful for debugging, and for gathering instruction count
      info without generating the large basic block vector files.
      </para>
    </listitem>
  </varlistentry>


</variablelist>
<!-- end of xi:include in the manpage -->

</sect1>

<sect1 id="bbv-manual.fileformat" xreflabel="BBV File Format">
<title>Basic Block Vector File Format</title>

<para>
   The Basic Block Vector is dumped at fixed intervals. This
   is commonly done every 100 million instructions; the
   <option>--interval-size</option> option can be
   used to change this.
</para>

<para>
   The output file looks like this:
</para>

<programlisting><![CDATA[
T:45:1024 :189:99343
T:11:78573 :15:1353 :56:1
T:18:45 :12:135353 :56:78 :314:4324263]]></programlisting>

<para>
   Each new interval starts with a T. This is followed on the same line
   by a series of basic block and frequency pairs, one for each
   basic block that was entered during the interval. The format for
   each block/frequency pair is a colon, followed by a number that
   uniquely identifies the basic block, another colon, and then
   the frequency (which is the number of times the block was entered,
   multiplied by the number of instructions in the block). The
   pairs are separated from each other by a space.
</para>

<para>
   The frequency count is multiplied by the number of instructions that are
   in the basic block, in order to weigh the count so that instructions in
   small basic blocks aren't counted as more important than instructions
   in large basic blocks.
</para>

<para>
   The SimPoint program only processes lines that start with a "T". All
   other lines are ignored. Traditionally, comments are indicated by
   starting a line with a "#" character. Some other BBV generation tools,
   such as PinPoints, generate lines beginning with letters other than "T"
   to indicate more information about the program being run. We do
   not generate these, as the SimPoint utility ignores them.
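   As a rough illustration of the format described above, the following
   Python sketch reads basic block vector text into one dictionary per
   interval. It is not part of BBV or SimPoint: the function name
   <literal>parse_bbv</literal> and the sample data are hypothetical, and
   the sketch assumes every pair has the colon-prefixed
   <literal>:id:count</literal> form.

<programlisting><![CDATA[
def parse_bbv(text):
    """Return a list with one {block_id: weighted_count} dict per interval."""
    intervals = []
    for line in text.splitlines():
        # Only lines starting with "T" carry interval data; comments and
        # lines written by other tools are skipped, as SimPoint does.
        if not line.startswith("T"):
            continue
        counts = {}
        # Drop the leading "T"; each remaining field looks like ":id:count".
        for pair in line[1:].split():
            _, block_id, count = pair.split(":")
            counts[int(block_id)] = int(count)
        intervals.append(counts)
    return intervals


# Hypothetical sample input in the format shown above.
sample = """# comment lines are ignored
T:45:1024 :189:99343
T:11:78573 :15:1353 :56:1
"""

intervals = parse_bbv(sample)
for number, counts in enumerate(intervals):
    # Print the interval index, its summed weighted counts, and its block IDs.
    print(number, sum(counts.values()), sorted(counts))
]]></programlisting>

   Because each frequency is already weighted by the number of
   instructions in its block, the per-interval sum printed here roughly
   corresponds to the number of instructions attributed to that interval.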
</para>

</sect1>

<sect1 id="bbv-manual.implementation" xreflabel="Implementation">
<title>Implementation</title>

<para>
   Valgrind provides all of the information necessary to create
   BBV files. In the current implementation, all instructions
   are instrumented. This is slower (by approximately a factor
   of two) than a method that instruments at the basic block level,
   but there are some complications (especially with rep prefix
   detection) that make that method more difficult.
</para>

<para>
   Valgrind actually provides instrumentation at a superblock level.
   A superblock has one entry point but, unlike a basic block, can
   have multiple exit points. Once a branch occurs into the middle
   of a block, it is split into a new basic block. Because
   Valgrind cannot produce "true" basic blocks, the generated
   basic block vectors will differ from those generated by other tools.
   In practice this does not seem to affect the accuracy of the
   SimPoint results. We internally force the
   <option>--vex-guest-chase-thresh=0</option>
   option to Valgrind, which forces a more basic-block-like
   behavior.
</para>

<para>
   When a superblock is run for the first time, it is instrumented
   with our BBV routine. A block info (bbInfo) structure is allocated
   which holds the various information and statistics for the block.
   A unique block ID is assigned to the block, and then the
   structure is placed into an ordered set.
   Then each native instruction in the block is instrumented to
   call an instruction counting routine with a pointer to the block
   info structure as an argument.
</para>

<para>
   At run-time, our instruction counting routines are called once
   per native instruction. The relevant block info structure is accessed
   and the block count and total instruction count are updated.
   If the total instruction count exceeds the interval size,
   we walk the ordered set, writing out the statistics for
   any block that was accessed in the interval, then resetting the
   block counters to zero.
</para>

<para>
   On the x86 and amd64 architectures the counting code has extra
   code to handle rep-prefixed string instructions. This is because
   actual hardware counts a rep-prefixed instruction
   as one instruction, while a naive Valgrind implementation
   would count it as many (possibly hundreds, thousands or even millions)
   of instructions. We handle rep-prefixed instructions specially,
   in order to make the results match those obtained with hardware performance
   counters.
</para>

<para>
   BBV also counts the fldcw instruction. This instruction is used on
   x86 machines in various ways; it is most commonly found when converting
   floating point values into integers.
   On Pentium 4 systems the retired instruction performance
   counter counts this instruction as two instructions (all other
   known processors only count it as one).
   This can affect results when using SimPoint on Pentium 4 systems.
   We provide the fldcw count so that users can evaluate whether it
   will impact their results enough to avoid using Pentium 4 machines
   for their experiments. It would be possible to add an option to
   this tool that mimics the double-counting so that the generated BBV
   files would be usable for experiments using hardware performance
   counters on Pentium 4 systems.
</para>

</sect1>

<sect1 id="bbv-manual.threadsupport" xreflabel="BBV Threaded Support">
<title>Threaded Executable Support</title>

<para>
   BBV supports threaded programs. When a program has multiple threads,
   an additional basic block vector file is created for each thread (each
   additional file is the specified filename with the thread number
   appended at the end).
</para>

<para>
   There is no official method of using SimPoint with
   threaded workloads. The most common method is to run
   SimPoint on each thread's results independently, and use
   some method of deterministic execution to try to match the
   original workload. This should be possible with the current
   BBV.
</para>

</sect1>

<sect1 id="bbv-manual.validation" xreflabel="BBV Validation">
<title>Validation</title>

<para>
   BBV has been tested on x86, amd64, and ppc32 platforms.
   An earlier version of BBV was tested in detail using
   hardware performance counters; this work is described in a paper
   from the HiPEAC'08 conference, "Using Dynamic Binary Instrumentation
   to Generate Multi-Platform SimPoints: Methodology and Accuracy" by
   V.M. Weaver and S.A. McKee.
</para>

</sect1>

<sect1 id="bbv-manual.performance" xreflabel="BBV Performance">
<title>Performance</title>

<para>
   Using this program slows down execution by roughly a factor of 40
   over native execution. This varies depending on the machine
   used and the benchmark being run.
   On the SPEC CPU 2000 benchmarks running on a 3.4GHz Pentium D
   processor, the slowdown ranges from 24x (mcf) to 340x (vortex.2).
</para>

</sect1>

</chapter>