Home | History | Annotate | Download | only in HistoricalNotes
      1 Wed Jun 25 15:13:51 CDT 2003
      2 
      3 First-level instrumentation
      4 ---------------------------
      5 
      6 We use opt to do Bytecode-to-bytecode instrumentation. Look at
      7 back-edges and insert llvm_first_trigger() function call which takes
      8 no arguments and no return value. This instrumentation is designed to
      9 be easy to remove, for instance by writing a NOP over the function
     10 call instruction.
     11 
     12 Keep count of every call to llvm_first_trigger(), and maintain
     13 counters in a map indexed by return address. If the trigger count
     14 exceeds a threshold, we identify a hot loop and perform second-level
     15 instrumentation on the hot loop region (the instructions between the
     16 target of the back-edge and the branch that causes the back-edge).  We
     17 do not move code across basic-block boundaries.
     18 
     19 
     20 Second-level instrumentation
     21 ---------------------------
     22 
     23 We remove the first-level instrumentation by overwriting the CALL to
     24 llvm_first_trigger() with a NOP.
     25 
     26 The reoptimizer maintains a map between machine-code basic blocks and
     27 LLVM BasicBlock*s.  We only keep track of paths that start at the
     28 first machine-code basic block of the hot loop region.
     29 
     30 How do we keep track of which edges to instrument, and which edges are
     31 exits from the hot region? 3 step process.
     32 
     33 1) Do a DFS from the first machine-code basic block of the hot loop
     34 region and mark reachable edges.
     35 
     36 2) Do a DFS from the last machine-code basic block of the hot loop
     37 region IGNORING back edges, and mark the edges which are reachable in
     38 1) and also in 2) (i.e., must be reachable from both the start BB and
     39 the end BB of the hot region).
     40 
     41 3) Mark BBs which end in edges that exit the hot region; we need to
     42 instrument these differently.
     43 
     44 Assume that there is 1 free register. On SPARC we use %g1, which LLC
     45 has agreed not to use.  Shift a 1 into it at the beginning. At every
     46 edge which corresponds to a conditional branch, we shift 0 for not
     47 taken and 1 for taken into a register. This uniquely numbers the paths
     48 through the hot region. Silently fail if we need more than 64 bits.
     49 
     50 At the end BB we call countPath and increment the counter based on %g1
     51 and the return address of the countPath call.  We keep track of the
     52 number of iterations and the number of paths.  We only run this
     53 version 30 or 40 times.
     54 
     55 Find the BBs that total 90% or more of execution, and aggregate them
     56 together to form our trace. But we do not allow more than 5 paths; if
     57 we have more than 5 we take the ones that are executed the most.  We
     58 verify our assumption that we picked a hot back-edge in first-level
     59 instrumentation, by making sure that the number of times we took an
     60 exit edge from the hot trace is less than 10% of the number of
     61 iterations.
     62 
     63 LLC has been taught to recognize llvm_first_trigger() calls and NOT
     64 generate saves and restores of caller-saved registers around these
     65 calls.
     66 
     67 
     68 Phase behavior
     69 --------------
     70 
     71 We turn off llvm_first_trigger() calls with NOPs, but this would hide
     72 phase behavior from us (when some funcs/traces stop being hot and
     73 others become hot.)
     74 
     75 We have a SIGALRM timer that counts time for us. Every time we get a
     76 SIGALRM we look at our priority queue of locations where we have
     77 removed llvm_first_trigger() calls. Each location is inserted along
     78 with a time when we will next turn instrumentation back on for that
     79 call site. If the time has arrived for a particular call site, we pop
     80 that off the prio. queue and turn instrumentation back on for that
     81 call site.
     82 
     83 
     84 Generating traces
     85 -----------------
     86 
     87 When we finally generate an optimized trace we first copy the code
     88 into the trace cache. This leaves us with 3 copies of the code: the
     89 original code, the instrumented code, and the optimized trace. The
     90 optimized trace does not have instrumentation. The original code and
     91 the instrumented code are modified to have a branch to the trace
     92 cache, where the optimized traces are kept.
     93 
     94 We copy the code from the original to the instrumentation version
     95 by tracing the LLVM-to-Machine code basic block map and then copying
     96 each machine code basic block we think is in the hot region into the
     97 trace cache. Then we instrument that code. The process is similar for
     98 generating the final optimized trace; we copy the same basic blocks
     99 because we might need to put in fixup code for exit BBs.
    100 
    101 LLVM basic blocks are not typically used in the Reoptimizer except
    102 for the mapping information.
    103 
    104 We are restricted to using single instructions to branch between the
    105 original code, trace, and instrumented code. So we have to keep the
    106 code copies in memory near the original code (they can't be far enough
    107 away that a single pc-relative branch would not work.) Malloc() or
    108 data region space is too far away. this impacts the design of the 
    109 trace cache.
    110 
    111 We use a dummy function that is full of a bunch of for loops which we
    112 overwrite with trace-cache code. The trace manager keeps track of
    113 whether or not we have enough space in the trace cache, etc.
    114 
    115 The trace insertion routine takes an original start address, a vector
    116 of machine instructions representing the trace, index of branches and
    117 their corresponding absolute targets, and index of calls and their
    118 corresponding absolute targets.
    119 
    120 The trace insertion routine is responsible for inserting branches from
    121 the beginning of the original code to the beginning of the optimized
    122 trace. This is because at some point the trace cache may run out of
    123 space and it may have to evict a trace, at which point the branch to
    124 the trace would also have to be removed. It uses a round-robin
    125 replacement policy; we have found that this is almost as good as LRU
    126 and better than random (especially because of problems fitting the new
    127 trace in.)
    128 
    129 We cannot deal with discontiguous trace cache areas.  The trace cache
    130 is supposed to be cache-line-aligned, but it is not page-aligned.
    131 
    132 We generate instrumentation traces and optimized traces into separate
    133 trace caches. We keep the instrumented code around because you don't
    134 want to delete a trace when you still might have to return to it
    135 (i.e., return from a llvm_first_trigger() or countPath() call.)
    136 
    137 
    138