<?xml version="1.0"?> <!-- -*- sgml -*- -->
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">

<chapter id="bbv-manual" xreflabel="BBV">
  <title>BBV: an experimental basic block vector generation tool</title>

<para>To use this tool, you must specify
<option>--tool=exp-bbv</option> on the Valgrind
command line.</para>

<sect1 id="bbv-manual.overview" xreflabel="Overview">
<title>Overview</title>

<para>
   A basic block is a linear section of code with one entry point and one exit
   point.  A <emphasis>basic block vector</emphasis> (BBV) is a list of all
   basic blocks entered during program execution, and a count of how many
   times each basic block was run.
</para>

<para>
   BBV is a tool that generates basic block vectors for use with the
   <ulink url="http://www.cse.ucsd.edu/~calder/simpoint/">SimPoint</ulink>
   analysis tool.
   The SimPoint methodology enables speeding up architectural
   simulations by only running a small portion of a program
   and then extrapolating total behavior from this
   small portion.  Most programs exhibit phase-based behavior, which
   means that at various times during execution a program will encounter
   intervals of time where the code behaves similarly to a previous
   interval.  If you can detect these intervals and group them together,
   an approximation of the total program behavior can be obtained
   by only simulating a bare minimum number of intervals, and then scaling
   the results.
</para>

<para>
  In computer architecture research, running a
  benchmark on a cycle-accurate simulator can cause slowdowns on the order
  of 1000 times, making it take days, weeks, or even longer to run full
  benchmarks.  By utilizing SimPoint this can be reduced significantly,
  usually by 90-95%, while still retaining reasonable accuracy.
</para>

<para>
   A more complete introduction to how SimPoint works can be
   found in the paper "Automatically Characterizing Large Scale
   Program Behavior" by T. Sherwood, E. Perelman, G. Hamerly, and
   B. Calder.
</para>

</sect1>

<sect1 id="bbv-manual.quickstart" xreflabel="Quick Start">
<title>Using Basic Block Vectors to create SimPoints</title>

<para>
   To quickly create a basic block vector file, you will call Valgrind
   like this:

   <programlisting>valgrind --tool=exp-bbv /bin/ls</programlisting>

   In this case we are running on <filename>/bin/ls</filename>,
   but this can be any program.  By default a file called
   <computeroutput>bb.out.PID</computeroutput> will be created,
   where PID is replaced by the process ID of the running process.
   This file contains the basic block vector.  For long-running programs
   this file can be quite large, so it might be wise to compress
   it with gzip or some other compression program.
</para>
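
<para>
   If the compression step needs to be scripted, it can be sketched in
   Python as shown here.  The helper name
   <computeroutput>compress_bbv</computeroutput> is hypothetical and is
   not part of BBV or SimPoint; the result is equivalent to running gzip
   on the file while keeping the original.
</para>

<programlisting><![CDATA[
```python
import gzip
import shutil
from pathlib import Path


def compress_bbv(path):
    """Gzip-compress a BBV file next to the original and return the new path."""
    src = Path(path)
    dst = src.with_name(src.name + ".gz")
    # Stream the copy so large vector files are never held in memory at once.
    with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```
]]></programlisting>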

<para>
   To create actual SimPoint results, you will need the SimPoint utility,
   available from the
   <ulink url="http://www.cse.ucsd.edu/~calder/simpoint/">SimPoint webpage</ulink>.
   Assuming you have downloaded SimPoint 3.2 and compiled it,
   create SimPoint results with a command like the following:

   <programlisting><![CDATA[
./SimPoint.3.2/bin/simpoint -inputVectorsGzipped \
    -loadFVFile bb.out.1234.gz \
    -k 5 -saveSimpoints results.simpts \
    -saveSimpointWeights results.weights]]></programlisting>

   where bb.out.1234.gz is your compressed basic block vector file
   generated by BBV.
</para>

<para>
   The SimPoint utility performs a random linear projection down to
   15 dimensions, then uses k-means clustering to calculate which
   intervals are of interest.  In this example we request 5 clusters,
   and therefore 5 representative intervals, with the
   <option>-k 5</option> option.
</para>
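
<para>
   To make the two steps concrete, here is a minimal Python sketch of
   random projection followed by k-means.  It is an illustration under
   simplifying assumptions, not SimPoint's actual implementation: the
   function names, the fixed iteration count, and the seeding are all
   invented for this example, and the real tool may apply refinements
   that are omitted here.
</para>

<programlisting><![CDATA[
```python
import random


def random_projection(vectors, dims=15, seed=0):
    """Project each vector onto `dims` random directions with entries in [-1, 1]."""
    rng = random.Random(seed)
    width = len(vectors[0])
    matrix = [[rng.uniform(-1.0, 1.0) for _ in range(dims)] for _ in range(width)]
    return [
        [sum(v[i] * matrix[i][d] for i in range(width)) for d in range(dims)]
        for v in vectors
    ]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # leave an empty cluster's center where it is
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```
]]></programlisting>

<para>
   Here each input vector stands for one interval, with one entry per
   basic block.  Intervals that receive the same label are treated as
   belonging to the same program phase.
</para>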

<para>
   The outputs from the SimPoint run are the
   <computeroutput>results.simpts</computeroutput>
   and <computeroutput>results.weights</computeroutput> files.
   The first holds the 5 most relevant intervals of the program.
   The second holds the weight to scale each interval by when
   extrapolating full-program behavior.  The intervals and the weights
   can be used in conjunction with a simulator that supports
   fast-forwarding; you fast-forward to the interval of interest,
   collect stats for the desired interval length, then use
   the statistics gathered in conjunction with the weights to
   calculate your results.
</para>
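
<para>
   The final weighting step can be sketched in Python as follows.  This
   sketch assumes that each line of the simpoints file holds an interval
   index followed by a cluster ID, and that each line of the weights
   file holds a weight followed by a cluster ID; check these layouts
   against the files your SimPoint version actually writes.  The
   function names and the per-interval statistic are hypothetical.
</para>

<programlisting><![CDATA[
```python
def parse_pairs(text, cast):
    """Parse lines of '<value> <cluster-id>' into a {cluster_id: value} dict."""
    pairs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            pairs[int(fields[1])] = cast(fields[0])
    return pairs


def extrapolate(simpts_text, weights_text, stat_for_interval):
    """Weighted average of a per-interval statistic over the chosen intervals.

    stat_for_interval maps an interval index to the value measured while
    simulating only that interval (for example, its CPI).
    """
    intervals = parse_pairs(simpts_text, int)    # assumed: "<interval> <cluster>"
    weights = parse_pairs(weights_text, float)   # assumed: "<weight> <cluster>"
    return sum(weights[c] * stat_for_interval[intervals[c]] for c in intervals)
```
]]></programlisting>

<para>
   For example, if interval 12 has weight 0.75 and a measured CPI of 2.0,
   and interval 40 has weight 0.25 and a measured CPI of 4.0, the
   estimated whole-program CPI is 0.75 * 2.0 + 0.25 * 4.0 = 2.5.
</para>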

</sect1>

<sect1 id="bbv-manual.usage" xreflabel="BBV Command-line Options">
<title>BBV Command-line Options</title>

<para> BBV-specific command-line options are:</para>

<!-- start of xi:include in the manpage -->
<variablelist id="bbv.opts.list">

  <varlistentry id="opt.bb-out-file" xreflabel="--bb-out-file">
     <term>
        <option><![CDATA[--bb-out-file=<name> [default: bb.out.%p] ]]></option>
     </term>
     <listitem>
        <para>
           This option selects the name of the basic block vector file.  The
           <option>%p</option> and <option>%q</option> format specifiers can be
           used to embed the process ID and/or the contents of an environment
           variable in the name, as is the case for the core option
           <option><xref linkend="opt.log-file"/></option>.
        </para>
     </listitem>
  </varlistentry>

  <varlistentry id="opt.pc-out-file" xreflabel="--pc-out-file">
     <term>
        <option><![CDATA[--pc-out-file=<name> [default: pc.out.%p] ]]></option>
     </term>
     <listitem>
        <para>
           This option selects the name of the PC file.
           This file holds program counter addresses
           and function name info for the various basic blocks.
           This can be used in conjunction
           with the basic block vector file to fast-forward via function names
           instead of just instruction counts.  The
           <option>%p</option> and <option>%q</option> format specifiers can be
           used to embed the process ID and/or the contents of an environment
           variable in the name, as is the case for the core option
           <option><xref linkend="opt.log-file"/></option>.
        </para>
     </listitem>
   </varlistentry>

   <varlistentry id="opt.interval-size" xreflabel="--interval-size">
      <term>
        <option><![CDATA[--interval-size=<number> [default: 100000000] ]]></option>
      </term>
      <listitem>
      <para>
         This option selects the size of the interval to use.
         The default is 100
         million instructions, which is a commonly used value.
         Other sizes can be used; smaller intervals can help programs
         with finer-grained phases.  However, smaller interval sizes
         can lead to accuracy issues due to warm-up effects.
         When fast-forwarding, the various architectural features
         will be uninitialized, and it will take some number
         of instructions before they "warm up" to the state a
         full simulation would be in without the fast-forwarding.
         Larger interval sizes tend to mitigate this.
      </para>
      </listitem>
  </varlistentry>

  <varlistentry id="opt.instr-count-only" xreflabel="--instr-count-only">
     <term>
        <option><![CDATA[--instr-count-only [default: no] ]]></option>
     </term>
     <listitem>
        <para>
           This option tells the tool to only display instruction count
           totals, and to not generate the actual basic block vector file.
           This is useful for debugging, and for gathering instruction count
           info without generating the large basic block vector files.
        </para>
     </listitem>
   </varlistentry>


</variablelist>
<!-- end of xi:include in the manpage -->
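
<para>
   As an illustration of how the file name specifiers behave, the
   following Python sketch expands a pattern the way the core
   <option>--log-file</option> option documents it: %p becomes the
   process ID, %q{VAR} becomes the contents of the environment variable
   VAR, and %% becomes a literal percent sign.  This is not Valgrind's
   code; the function name and the choice to raise an error for an unset
   variable are assumptions of this example.
</para>

<programlisting><![CDATA[
```python
import os
import re


def expand_out_file(pattern, pid=None, env=None):
    """Expand %p, %q{VAR} and %% in an output file name pattern."""
    pid = os.getpid() if pid is None else pid
    env = os.environ if env is None else env

    def replace(match):
        token = match.group(0)
        if token == "%p":
            return str(pid)
        if token == "%%":
            return "%"
        return env[match.group(1)]  # %q{VAR}; raises KeyError if VAR is unset

    return re.sub(r"%p|%%|%q\{([^}]*)\}", replace, pattern)
```
]]></programlisting>

<para>
   With this sketch, the default pattern bb.out.%p expands to
   bb.out.1234 for process 1234.
</para>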

</sect1>

<sect1 id="bbv-manual.fileformat" xreflabel="BBV File Format">
<title>Basic Block Vector File Format</title>

<para>
  The Basic Block Vector is dumped at fixed intervals.  This
  is commonly done every 100 million instructions; the
  <option>--interval-size</option> option can be
  used to change this.
</para>

<para>
  The output file looks like this:
</para>

<programlisting><![CDATA[
T:45:1024 :189:99343
T:11:78573 :15:1353  :56:1
T:18:45 :12:135353 :56:78 :314:4324263]]></programlisting>

<para>
  Each new interval starts with a T.   This is followed on the same line
  by a series of basic block and frequency pairs, one for each
  basic block that was entered during the interval.  The format for
  each block/frequency pair is a colon, followed by a number that
  uniquely identifies the basic block, another colon, and then
  the frequency (which is the number of times the block was entered,
  multiplied by the number of instructions in the block).  The
  pairs are separated from each other by a space.
</para>

<para>
  The frequency count is multiplied by the number of instructions that are
  in the basic block, in order to weight the count so that instructions in
  small basic blocks aren't counted as more important than instructions
  in large basic blocks.
</para>
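
<para>
   The weighting can be sketched in a few lines of Python.  The function
   below is a hypothetical helper, not part of BBV: it takes per-block
   entry counts and block sizes for one interval and builds a line in
   the format just described.
</para>

<programlisting><![CDATA[
```python
def format_interval(entries, block_sizes):
    """Build one 'T' line from per-block entry counts for a single interval.

    entries:     {block_id: times the block was entered in this interval}
    block_sizes: {block_id: number of instructions in the block}
    """
    pairs = [
        # Frequency is the entry count weighted by the block's size.
        ":%d:%d" % (block_id, count * block_sizes[block_id])
        for block_id, count in sorted(entries.items())
        if count > 0  # blocks that were never entered are left out
    ]
    return "T" + " ".join(pairs)
```
]]></programlisting>

<para>
   For instance, a block of 8 instructions entered 128 times contributes
   a frequency of 1024.
</para>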

<para>
  The SimPoint program only processes lines that start with a "T".  All
  other lines are ignored.  Traditionally comments are indicated by
  starting a line with a "#" character.  Some other BBV generation tools,
  such as PinPoints, generate lines beginning with letters other than "T"
  to indicate more information about the program being run.  We do
  not generate these, as the SimPoint utility ignores them.
</para>
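
<para>
   A reader for this format can be sketched in Python as follows.  It is
   a minimal example written for this manual, not code taken from BBV or
   SimPoint, and the name <computeroutput>parse_bbv</computeroutput> is
   hypothetical.  Like SimPoint, it skips every line that does not start
   with a "T".
</para>

<programlisting><![CDATA[
```python
def parse_bbv(lines):
    """Return one {block_id: weighted_frequency} dict per 'T' line."""
    intervals = []
    for line in lines:
        if not line.startswith("T"):
            continue  # comments and other record types are skipped
        vector = {}
        for pair in line[1:].split():
            # Each pair looks like ':<block id>:<frequency>'.
            _, block_id, frequency = pair.split(":")
            vector[int(block_id)] = int(frequency)
        intervals.append(vector)
    return intervals
```
]]></programlisting>

<para>
   The function accepts any iterable of lines, so an open file object
   (including one returned by Python's gzip module in text mode) can be
   passed directly.
</para>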

</sect1>

<sect1 id="bbv-manual.implementation" xreflabel="Implementation">
<title>Implementation</title>

<para>
   Valgrind provides all of the information necessary to create
   BBV files.  In the current implementation, all instructions
   are instrumented.  This is slower (by approximately a factor
   of two) than a method that instruments at the basic block level,
   but there are some complications (especially with rep prefix
   detection) that make that method more difficult.
</para>

<para>
   Valgrind actually provides instrumentation at a superblock level.
   A superblock has one entry point but unlike basic blocks can
   have multiple exit points.  Once a branch occurs into the middle
   of a block, it is split into a new basic block.  Because
   Valgrind cannot produce "true" basic blocks, the generated
   basic block vectors will differ from those generated by other tools.
   In practice this does not seem to affect the accuracy of the
   SimPoint results.  We internally force Valgrind's
   <option>--vex-guest-chase-thresh=0</option>
   option, which produces more basic-block-like
   behavior.
</para>

<para>
   When a superblock is run for the first time, it is instrumented
   with our BBV routine.  A block info (bbInfo) structure is allocated
   which holds the various information and statistics for the block.
   A unique block ID is assigned to the block, and then the
   structure is placed into an ordered set.
   Then each native instruction in the block is instrumented to
   call an instruction counting routine with a pointer to the block
   info structure as an argument.
</para>

<para>
   At run-time, our instruction counting routines are called once
   per native instruction.  The relevant block info structure is accessed
   and the block count and total instruction count are updated.
   If the total instruction count exceeds the interval size
   then we walk the ordered set, writing out the statistics for
   any block that was accessed in the interval, then resetting the
   block counters to zero.
</para>
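
<para>
   The counting scheme can be modelled in Python as shown below.  This
   is a simplified sketch, not the tool's C implementation: the class
   and method names are invented, a dictionary stands in for the ordered
   set, and the total is simply reset to zero when an interval is
   written out.  Because the hook runs once per instruction, the
   per-block count naturally ends up weighted by block size.
</para>

<programlisting><![CDATA[
```python
class BlockInfo:
    """Per-block record, loosely modelled on the bbInfo structure."""

    def __init__(self, block_id):
        self.block_id = block_id
        self.count = 0  # instructions executed in this block this interval


class Counter:
    def __init__(self, interval_size):
        self.interval_size = interval_size
        self.total = 0    # instructions executed in the current interval
        self.blocks = {}  # block address -> BlockInfo
        self.output = []  # one 'T' line per completed interval

    def block_for(self, address):
        """Look up a block, assigning the next unique ID on first sight."""
        if address not in self.blocks:
            self.blocks[address] = BlockInfo(len(self.blocks) + 1)
        return self.blocks[address]

    def count_instruction(self, info):
        """Called once per executed instruction, like the instrumented hook."""
        info.count += 1
        self.total += 1
        if self.total > self.interval_size:
            self.dump_interval()

    def dump_interval(self):
        """Write every block touched in the interval, then reset the counters."""
        touched = sorted(
            (b for b in self.blocks.values() if b.count > 0),
            key=lambda b: b.block_id,
        )
        pairs = [":%d:%d" % (b.block_id, b.count) for b in touched]
        self.output.append("T" + " ".join(pairs))
        for b in touched:
            b.count = 0
        self.total = 0
```
]]></programlisting>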

<para>
   On the x86 and amd64 architectures the counting code has extra
   code to handle rep-prefixed string instructions.  This is because
   actual hardware counts a rep-prefixed instruction
   as one instruction, while a naive Valgrind implementation
   would count it as many (possibly hundreds, thousands or even millions)
   of instructions.  We handle rep-prefixed instructions specially,
   in order to make the results match those obtained with hardware performance
   counters.
</para>

<para>
   BBV also counts the fldcw instruction.  This instruction is used on
   x86 machines in various ways; it is most commonly found when converting
   floating point values into integers.
   On Pentium 4 systems the retired instruction performance
   counter counts this instruction as two instructions (all other
   known processors only count it as one).
   This can affect results when using SimPoint on Pentium 4 systems.
   We provide the fldcw count so that users can evaluate whether it
   will impact their results enough to avoid using Pentium 4 machines
   for their experiments.  It would be possible to add an option to
   this tool that mimics the double-counting so that the generated BBV
   files would be usable for experiments using hardware performance
   counters on Pentium 4 systems.
</para>

</sect1>

<sect1 id="bbv-manual.threadsupport" xreflabel="BBV Threaded Support">
<title>Threaded Executable Support</title>

<para>
   BBV supports threaded programs.  When a program has multiple threads,
   an additional basic block vector file is created for each thread (each
   additional file is the specified filename with the thread number
   appended at the end).
</para>

<para>
   There is no official method of using SimPoint with
   threaded workloads.  The most common method is to run
   SimPoint on each thread's results independently, and use
   some method of deterministic execution to try to match the
   original workload.  This should be possible with the current
   BBV.
</para>

</sect1>

<sect1 id="bbv-manual.validation" xreflabel="BBV Validation">
<title>Validation</title>

<para>
   BBV has been tested on x86, amd64, and ppc32 platforms.
   An earlier version of BBV was tested in detail using
   hardware performance counters; this work is described in a paper
   from the HiPEAC'08 conference, "Using Dynamic Binary Instrumentation
   to Generate Multi-Platform SimPoints: Methodology and Accuracy" by
   V.M. Weaver and S.A. McKee.
</para>

</sect1>

<sect1 id="bbv-manual.performance" xreflabel="BBV Performance">
<title>Performance</title>

<para>
  Using this program slows down execution by roughly a factor of 40
  over native execution.  This varies depending on the machine
  used and the benchmark being run.
  On the SPEC CPU 2000 benchmarks running on a 3.4GHz Pentium D
  processor, the slowdown ranges from 24x (mcf) to 340x (vortex.2).
</para>

</sect1>

</chapter>