      1 <?xml version="1.0"?> <!-- -*- sgml -*- -->
      2 <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
      3           "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">
      4 
      5 
      6 <chapter id="mc-tech-docs" 
      7          xreflabel="The design and implementation of Valgrind">
      8 
      9 <title>The Design and Implementation of Valgrind</title>
     10 <subtitle>Detailed technical notes for hackers, maintainers and
     11           the overly-curious</subtitle>
     12 
     13 <sect1 id="mc-tech-docs.intro" xreflabel="Introduction">
     14 <title>Introduction</title>
     15 
     16 <para>This document contains a detailed, highly-technical description of
     17 the internals of Valgrind.  This is not the user manual; if you are an
     18 end-user of Valgrind, you do not want to read this.  Conversely, if you
     19 really are a hacker-type and want to know how it works, I assume that
     20 you have read the user manual thoroughly.</para>
     21 
     22 <para>You may need to read this document several times, and carefully.
     23 Some important things, I only say once.</para>
     24 
     25 <para>[Note: this document is now very old, and a lot of its contents
     26 are out of date, and misleading.]</para>
     27 
     28 
     29 <sect2 id="mc-tech-docs.history" xreflabel="History">
     30 <title>History</title>
     31 
     32 <para>Valgrind came into public view in late Feb 2002.  However, it has
     33 been under contemplation for a very long time, perhaps seriously for
     34 about five years.  Somewhat over two years ago, I started working on the
     35 x86 code generator for the Glasgow Haskell Compiler
     36 (http://www.haskell.org/ghc), gaining familiarity with x86 internals on
     37 the way.  I then did Cacheprof, gaining further x86 experience.  Some
     38 time around Feb 2000 I started experimenting with a user-space x86
     39 interpreter for x86-Linux.  This worked, but it was clear that a
     40 JIT-based scheme would be necessary to give reasonable performance for
     41 Valgrind.  Design work for the JITter started in earnest in Oct 2000,
     42 and by early 2001 I had an x86-to-x86 dynamic translator which could run
     43 quite large programs.  This translator was in a sense pointless, since
     44 it did not do any instrumentation or checking.</para>
     45 
     46 <para>Most of the rest of 2001 was taken up designing and implementing
     47 the instrumentation scheme.  The main difficulty, which consumed a lot
     48 of effort, was to design a scheme which did not generate large numbers
     49 of false uninitialised-value warnings.  By late 2001 a satisfactory
     50 scheme had been arrived at, and I started to test it on ever-larger
     51 programs, with an eventual eye to making it work well enough so that it
     52 was helpful to folks debugging the upcoming version 3 of KDE.  I've used
KDE since before version 1.0, and wanted Valgrind to be an indirect
     54 contribution to the KDE 3 development effort.  At the start of Feb 02
     55 the kde-core-devel crew started using it, and gave a huge amount of
     56 helpful feedback and patches in the space of three weeks.  Snapshot
     57 20020306 is the result.</para>
     58 
     59 <para>In the best Unix tradition, or perhaps in the spirit of Fred
Brooks' depressing-but-completely-accurate aphorism "build one to throw
     61 away; you will anyway", much of Valgrind is a second or third rendition
     62 of the initial idea.  The instrumentation machinery
     63 (<filename>vg_translate.c</filename>, <filename>vg_memory.c</filename>)
     64 and core CPU simulation (<filename>vg_to_ucode.c</filename>,
     65 <filename>vg_from_ucode.c</filename>) have had three redesigns and
     66 rewrites; the register allocator, low-level memory manager
     67 (<filename>vg_malloc2.c</filename>) and symbol table reader
     68 (<filename>vg_symtab2.c</filename>) are on the second rewrite.  In a
     69 sense, this document serves to record some of the knowledge gained as a
     70 result.</para>
     71 
     72 </sect2>
     73 
     74 
     75 <sect2 id="mc-tech-docs.overview" xreflabel="Design overview">
     76 <title>Design overview</title>
     77 
     78 <para>Valgrind is compiled into a Linux shared object,
     79 <filename>valgrind.so</filename>, and also a dummy one,
     80 <filename>valgrinq.so</filename>, of which more later.  The
     81 <filename>valgrind</filename> shell script adds
     82 <filename>valgrind.so</filename> to the
     83 <computeroutput>LD_PRELOAD</computeroutput> list of extra libraries to
be loaded with any dynamically linked executable.  This is a standard
     85 trick, one which I assume the
     86 <computeroutput>LD_PRELOAD</computeroutput> mechanism was developed to
     87 support.</para>
     88 
     89 <para><filename>valgrind.so</filename> is linked with the
     90 <option>-z initfirst</option> flag, which
     91 requests that its initialisation code is run before that of any
     92 other object in the executable image.  When this happens,
     93 valgrind gains control.  The real CPU becomes "trapped" in
     94 <filename>valgrind.so</filename> and the translations it
     95 generates.  The synthetic CPU provided by Valgrind does, however,
     96 return from this initialisation function.  So the normal startup
     97 actions, orchestrated by the dynamic linker
     98 <filename>ld.so</filename>, continue as usual, except on the
     99 synthetic CPU, not the real one.  Eventually
    100 <function>main</function> is run and returns, and
    101 then the finalisation code of the shared objects is run,
presumably in the reverse of the order in which they were initialised.
    103 Remember, this is still all happening on the simulated CPU.
    104 Eventually <filename>valgrind.so</filename>'s own finalisation
    105 code is called.  It spots this event, shuts down the simulated
    106 CPU, prints any error summaries and/or does leak detection, and
    107 returns from the initialisation code on the real CPU.  At this
    108 point, in effect the real and synthetic CPUs have merged back
    109 into one, Valgrind has lost control of the program, and the
program finally <function>exit()</function>s back to
    111 the kernel in the usual way.</para>
    112 
    113 <para>The normal course of activity, once Valgrind has started
    114 up, is as follows.  Valgrind never runs any part of your program
    115 (usually referred to as the "client"), not a single byte of it,
    116 directly.  Instead it uses function
    117 <function>VG_(translate)</function> to translate
    118 basic blocks (BBs, straight-line sequences of code) into
    119 instrumented translations, and those are run instead.  The
    120 translations are stored in the translation cache (TC),
    121 <computeroutput>vg_tc</computeroutput>, with the translation
    122 table (TT), <computeroutput>vg_tt</computeroutput> supplying the
    123 original-to-translation code address mapping.  Auxiliary array
    124 <computeroutput>VG_(tt_fast)</computeroutput> is used as a
    125 direct-map cache for fast lookups in TT; it usually achieves a
    126 hit rate of around 98% and facilitates an orig-to-trans lookup in
    127 4 x86 insns, which is not bad.</para>
    128 
    129 <para>Function <function>VG_(dispatch)</function> in
    130 <filename>vg_dispatch.S</filename> is the heart of the JIT
    131 dispatcher.  Once a translated code address has been found, it is
    132 executed simply by an x86 <computeroutput>call</computeroutput>
    133 to the translation.  At the end of the translation, the next
    134 original code addr is loaded into
    135 <computeroutput>%eax</computeroutput>, and the translation then
    136 does a <computeroutput>ret</computeroutput>, taking it back to
    137 the dispatch loop, with, interestingly, zero branch
    138 mispredictions.  The address requested in
    139 <computeroutput>%eax</computeroutput> is looked up first in
    140 <function>VG_(tt_fast)</function>, and, if not found,
    141 by calling C helper
    142 <function>VG_(search_transtab)</function>.  If there
    143 is still no translation available,
    144 <function>VG_(dispatch)</function> exits back to the
    145 top-level C dispatcher
    146 <function>VG_(toploop)</function>, which arranges for
    147 <function>VG_(translate)</function> to make a new
    148 translation.  All fairly unsurprising, really.  There are various
    149 complexities described below.</para>
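
<para>The lookup path described above can be sketched in a few lines.
This is a minimal illustration, not Valgrind's code: the class
<computeroutput>TransTab</computeroutput>, the cache size of 1024 slots,
the dictionary standing in for TT, and the modelling of a translation as
a callable that returns the next original address are all assumptions
made for the example.</para>

<programlisting><![CDATA[
FAST_SIZE = 1024  # hypothetical size; a power of two so the index is a mask


class TransTab:
    """Toy stand-in for TT plus its direct-map cache."""

    def __init__(self, translate):
        self.translate = translate      # plays the role of VG_(translate)
        self.table = {}                 # full orig -> translation map (the TT)
        self.fast = [None] * FAST_SIZE  # one (orig, translation) pair per slot
        self.fast_hits = 0
        self.slow_hits = 0
        self.translations = 0

    def lookup(self, orig_addr):
        slot = orig_addr & (FAST_SIZE - 1)
        entry = self.fast[slot]
        if entry is not None and entry[0] == orig_addr:
            self.fast_hits += 1         # fast path: direct-map cache hit
            return entry[1]
        trans = self.table.get(orig_addr)
        if trans is None:
            # No translation anywhere: make one and record it in the table.
            trans = self.translate(orig_addr)
            self.table[orig_addr] = trans
            self.translations += 1
        else:
            self.slow_hits += 1         # found by the slower table search
        self.fast[slot] = (orig_addr, trans)  # refill the direct-map slot
        return trans


def make_translator(program):
    """Toy translator: a 'translation' only reports the block's successor."""
    def translate(orig_addr):
        return lambda: program.get(orig_addr)
    return translate


def dispatch(tab, start_addr, max_blocks):
    """Run up to max_blocks blocks; return how many were executed."""
    addr = start_addr
    executed = 0
    while addr is not None and executed < max_blocks:
        translation = tab.lookup(addr)
        addr = translation()            # "run" it; yields next original address
        executed += 1
    return executed
]]></programlisting>

<para>In this sketch, two original addresses that share a slot evict
each other from the direct-map cache and fall back to the slower table
search on every visit, which is why a high hit rate in the fast cache
matters.</para>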
    150 
    151 <para>The translator, orchestrated by
    152 <function>VG_(translate)</function>, is complicated
    153 but entirely self-contained.  It is described in great detail in
    154 subsequent sections.  Translations are stored in TC, with TT
    155 tracking administrative information.  The translations are
    156 subject to an approximate LRU-based management scheme.  With the
    157 current settings, the TC can hold at most about 15MB of
    158 translations, and LRU passes prune it to about 13.5MB.  Given
    159 that the orig-to-translation expansion ratio is about 13:1 to
    160 14:1, this means TC holds translations for more or less a
    161 megabyte of original code, which generally comes to about 70000
    162 basic blocks for C++ compiled with optimisation on.  Generating
    163 new translations is expensive, so it is worth having a large TC
    164 to minimise the (capacity) miss rate.</para>
    165 
    166 <para>The dispatcher,
    167 <function>VG_(dispatch)</function>, receives hints
    168 from the translations which allow it to cheaply spot all control
    169 transfers corresponding to x86
    170 <computeroutput>call</computeroutput> and
    171 <computeroutput>ret</computeroutput> instructions.  It has to do
    172 this in order to spot some special events:</para>
    173 
    174 <itemizedlist>
    175   <listitem>
    176     <para>Calls to
    177     <function>VG_(shutdown)</function>.  This is
    178     Valgrind's cue to exit.  NOTE: actually this is done a
    179     different way; it should be cleaned up.</para>
    180   </listitem>
    181 
    182   <listitem>
    <para>Returns of signal handlers, to the return address
    184     <function>VG_(signalreturn_bogusRA)</function>.
    185     The signal simulator needs to know when a signal handler is
    186     returning, so we spot jumps (returns) to this address.</para>
    187   </listitem>
    188 
    189   <listitem>
    190     <para>Calls to <function>vg_trap_here</function>.
    191     All <function>malloc</function>,
    192     <function>free</function>, etc calls that the
    193     client program makes are eventually routed to a call to
    194     <function>vg_trap_here</function>, and Valgrind
    195     does its own special thing with these calls.  In effect this
    196     provides a trapdoor, by which Valgrind can intercept certain
    197     calls on the simulated CPU, run the call as it sees fit
    198     itself (on the real CPU), and return the result to the
    199     simulated CPU, quite transparently to the client
    200     program.</para>
    201   </listitem>
    202 
    203 </itemizedlist>
    204 
    205 <para>Valgrind intercepts the client's
    206 <function>malloc</function>,
    207 <function>free</function>, etc, calls, so that it can
    208 store additional information.  Each block
    209 <function>malloc</function>'d by the client gives
    210 rise to a shadow block in which Valgrind stores the call stack at
    211 the time of the <function>malloc</function> call.
    212 When the client calls <function>free</function>,
    213 Valgrind tries to find the shadow block corresponding to the
    214 address passed to <function>free</function>, and
    215 emits an error message if none can be found.  If it is found, the
    216 block is placed on the freed blocks queue
    217 <computeroutput>vg_freed_list</computeroutput>, it is marked as
    218 inaccessible, and its shadow block now records the call stack at
    219 the time of the <function>free</function> call.
    220 Keeping <computeroutput>free</computeroutput>'d blocks in this
    221 queue allows Valgrind to spot all (presumably invalid) accesses
    222 to them.  However, once the volume of blocks in the free queue
    223 exceeds <function>VG_(clo_freelist_vol)</function>,
    224 blocks are finally removed from the queue.</para>
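
<para>The bookkeeping described above might look roughly like the
following.  It is a sketch under stated assumptions: the class
<computeroutput>ClientHeap</computeroutput> and its method names are
hypothetical, call stacks are represented as plain strings, and
"marked as inaccessible" is modelled simply by asking whether an address
lies inside a block that is still on the freed queue.</para>

<programlisting><![CDATA[
from collections import deque


class ClientHeap:
    """Toy model of shadow blocks and the freed-blocks queue."""

    def __init__(self, freelist_vol):
        self.freelist_vol = freelist_vol  # role of VG_(clo_freelist_vol)
        self.live = {}                    # addr -> (size, stack at malloc)
        self.freed = deque()              # (addr, size, stack at free), oldest first
        self.freed_volume = 0
        self.errors = []

    def on_malloc(self, addr, size, stack):
        self.live[addr] = (size, stack)

    def on_free(self, addr, stack):
        shadow = self.live.pop(addr, None)
        if shadow is None:
            # No shadow block for this address: report an error.
            self.errors.append(("invalid free", addr, stack))
            return
        size = shadow[0]
        self.freed.append((addr, size, stack))
        self.freed_volume += size
        # Once the queue holds too much, release the oldest blocks for real.
        while self.freed_volume > self.freelist_vol:
            _, old_size, _ = self.freed.popleft()
            self.freed_volume -= old_size

    def is_in_freed_block(self, addr):
        """True if addr falls in a block still held on the freed queue."""
        return any(a <= addr < a + s for a, s, _ in self.freed)
]]></programlisting>

<para>Note that in this model a double free is reported in the same way
as a free of an address that was never allocated, because the shadow
block is no longer in the live set by the time of the second
call.</para>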
    225 
    226 <para>Keeping track of <literal>A</literal> and
    227 <literal>V</literal> bits (note: if you don't know what these
    228 are, you haven't read the user guide carefully enough) for memory
    229 is done in <filename>vg_memory.c</filename>.  This implements a
    230 sparse array structure which covers the entire 4G address space
    231 in a way which is reasonably fast and reasonably space efficient.
The 4G address space is divided up into 64K sections, each
covering 64KB of address space.  Given a 32-bit address, the top
16 bits are used to select one of the 65536 entries in
<function>VG_(primary_map)</function>.  The resulting
"secondary" (<computeroutput>SecMap</computeroutput>) holds A and
V bits for that 64KB chunk of address space, and is indexed by the
lower 16 bits of the address.</para>
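
<para>A two-level map of this shape can be sketched as follows.  The
sketch makes several simplifying assumptions that are not taken from the
sources: it keeps one opaque shadow byte per client byte instead of the
real A and V bit layout, it allocates secondaries lazily on first write,
and the class name <computeroutput>SparseShadow</computeroutput> is
hypothetical.</para>

<programlisting><![CDATA[
class SparseShadow:
    """Two-level sparse array covering a 32-bit address space."""

    SEC_BITS = 16
    SEC_SIZE = 1 << SEC_BITS            # 64KB of address space per secondary

    def __init__(self, default=0):
        self.default = default          # value read from untouched memory
        self.primary = [None] * (1 << 16)   # 65536 primary entries

    def _split(self, addr):
        assert 0 <= addr < (1 << 32), "address must fit in 32 bits"
        return addr >> self.SEC_BITS, addr & (self.SEC_SIZE - 1)

    def get(self, addr):
        hi, lo = self._split(addr)
        sec = self.primary[hi]
        return self.default if sec is None else sec[lo]

    def set(self, addr, value):
        hi, lo = self._split(addr)
        if self.primary[hi] is None:
            # First write into this 64KB section: allocate its secondary.
            self.primary[hi] = bytearray([self.default]) * self.SEC_SIZE
        self.primary[hi][lo] = value

    def secondaries_allocated(self):
        return sum(sec is not None for sec in self.primary)
]]></programlisting>

<para>A lookup costs two array indexings, and memory is spent only on
the 64KB sections that have actually been written to.</para>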
    239 
    240 </sect2>
    241 
    242 
    243 
    244 <sect2 id="mc-tech-docs.design" xreflabel="Design decisions">
    245 <title>Design decisions</title>
    246 
    247 <para>Some design decisions were motivated by the need to make
    248 Valgrind debuggable.  Imagine you are writing a CPU simulator.
    249 It works fairly well.  However, you run some large program, like
    250 Netscape, and after tens of millions of instructions, it crashes.
    251 How can you figure out where in your simulator the bug is?</para>
    252 
    253 <para>Valgrind's answer is: cheat.  Valgrind is designed so that
    254 it is possible to switch back to running the client program on
    255 the real CPU at any point.  Using the
    256 <option>--stop-after= </option> flag, you can ask
    257 Valgrind to run just some number of basic blocks, and then run
    258 the rest of the way on the real CPU.  If you are searching for a
    259 bug in the simulated CPU, you can use this to do a binary search,
    260 which quickly leads you to the specific basic block which is
    261 causing the problem.</para>
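
<para>The binary search mentioned above is ordinary bisection over the
basic block count.  In the sketch below,
<computeroutput>run_ok</computeroutput> is a hypothetical callback: it
is assumed to run the client with the given number of basic blocks on
the simulated CPU, finish on the real CPU, and report whether the client
behaved correctly.  The sketch also assumes the failure is reproducible
and monotonic, that is, every run that simulates at least as many blocks
as the faulty one fails.</para>

<programlisting><![CDATA[
def first_bad_block(run_ok, upper):
    """Return the smallest block count n for which run_ok(n) is False.

    Requires run_ok(0) to be True (nothing simulated, so nothing can go
    wrong in the simulator) and run_ok(upper) to be False.
    """
    lo, hi = 0, upper   # invariant: run_ok(lo) is True, run_ok(hi) is False
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if run_ok(mid):
            lo = mid    # bug not yet reached: search the upper half
        else:
            hi = mid    # bug already hit: search the lower half
    return hi
]]></programlisting>

<para>With an upper bound of ten million basic blocks, this needs at
most 24 runs of the client to isolate the offending block.</para>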
    262 
    263 <para>This is all very handy.  It does constrain the design in
    264 certain unimportant ways.  Firstly, the layout of memory, when
    265 viewed from the client's point of view, must be identical
    266 regardless of whether it is running on the real or simulated CPU.
    267 This means that Valgrind can't do pointer swizzling -- well, no
    268 great loss -- and it can't run on the same stack as the client --
    269 again, no great loss.  Valgrind operates on its own stack,
    270 <function>VG_(stack)</function>, which it switches to
    271 at startup, temporarily switching back to the client's stack when
    272 doing system calls for the client.</para>
    273 
    274 <para>Valgrind also receives signals on its own stack,
    275 <computeroutput>VG_(sigstack)</computeroutput>, but for different
    276 gruesome reasons discussed below.</para>
    277 
    278 <para>This nice clean
    279 switch-back-to-the-real-CPU-whenever-you-like story is muddied by
signals.  The problem is that signals arrive at arbitrary times and
    281 tend to slightly perturb the basic block count, with the result
    282 that you can get close to the basic block causing a problem but
    283 can't home in on it exactly.  My kludgey hack is to define
    284 <computeroutput>SIGNAL_SIMULATION</computeroutput> to 1 towards
    285 the bottom of <filename>vg_syscall_mem.c</filename>, so that
    286 signal handlers are run on the real CPU and don't change the BB
    287 counts.</para>
    288 
    289 <para>A second hole in the switch-back-to-real-CPU story is that
    290 Valgrind's way of delivering signals to the client is different
    291 from that of the kernel.  Specifically, the layout of the signal
    292 delivery frame, and the mechanism used to detect a sighandler
    293 returning, are different.  So you can't expect to make the
    294 transition inside a sighandler and still have things working, but
    295 in practice that's not much of a restriction.</para>
    296 
    297 <para>Valgrind's implementation of
    298 <function>malloc</function>,
    299 <function>free</function>, etc, (in
    300 <filename>vg_clientmalloc.c</filename>, not the low-level stuff
    301 in <filename>vg_malloc2.c</filename>) is somewhat complicated by
    302 the need to handle switching back at arbitrary points.  It does
work, though.</para>
    304 
    305 </sect2>
    306 
    307 
    308 
    309 <sect2 id="mc-tech-docs.correctness" xreflabel="Correctness">
    310 <title>Correctness</title>
    311 
    312 <para>There's only one of me, and I have a Real Life (tm) as well
    313 as hacking Valgrind [allegedly :-].  That means I don't have time
    314 to waste chasing endless bugs in Valgrind.  My emphasis is
    315 therefore on doing everything as simply as possible, with
    316 correctness, stability and robustness being the number one
    317 priority, more important than performance or functionality.  As a
    318 result:</para>
    319 
    320 <itemizedlist>
    321 
    322   <listitem>
    323     <para>The code is absolutely loaded with assertions, and
    these are <command>permanently enabled</command>.  I have no
    325     plan to remove or disable them later.  Over the past couple
    326     of months, as valgrind has become more widely used, they have
    327     shown their worth, pulling up various bugs which would
    328     otherwise have appeared as hard-to-find segmentation
    329     faults.</para>
    330 
    331     <para>I am of the view that it's acceptable to spend 5% of
    332     the total running time of your valgrindified program doing
    333     assertion checks and other internal sanity checks.</para>
    334   </listitem>
    335 
    336   <listitem>
    337     <para>Aside from the assertions, valgrind contains various
    338     sets of internal sanity checks, which get run at varying
    339     frequencies during normal operation.
    340     <function>VG_(do_sanity_checks)</function> runs
    341     every 1000 basic blocks, which means 500 to 2000 times/second
    342     for typical machines at present.  It checks that Valgrind
    343     hasn't overrun its private stack, and does some simple checks
    344     on the memory permissions maps.  Once every 25 calls it does
    345     some more extensive checks on those maps.  Etc, etc.</para>
    346     <para>The following components also have sanity check code,
    347     which can be enabled to aid debugging:</para>
    348     <itemizedlist>
    349       <listitem><para>The low-level memory-manager
    350         (<computeroutput>VG_(mallocSanityCheckArena)</computeroutput>).
    351         This does a complete check of all blocks and chains in an
    352         arena, which is very slow.  Is not engaged by default.</para>
    353       </listitem>
    354 
    355       <listitem>
    356         <para>The symbol table reader(s): various checks to
    357         ensure uniqueness of mappings; see
    358         <function>VG_(read_symbols)</function> for a
    359         start.  Is permanently engaged.</para>
    360       </listitem>
    361 
    362       <listitem>
    363         <para>The A and V bit tracking stuff in
    364         <filename>vg_memory.c</filename>.  This can be compiled
    365         with cpp symbol
    366         <computeroutput>VG_DEBUG_MEMORY</computeroutput> defined,
    367         which removes all the fast, optimised cases, and uses
    368         simple-but-slow fallbacks instead.  Not engaged by
    369         default.</para>
    370       </listitem>
    371 
    372       <listitem>
    373         <para>Ditto
    374         <computeroutput>VG_DEBUG_LEAKCHECK</computeroutput>.</para>
    375       </listitem>
    376 
    377       <listitem>
    378         <para>The JITter parses x86 basic blocks into sequences
    379         of UCode instructions.  It then sanity checks each one
    380         with <function>VG_(saneUInstr)</function> and
    381         sanity checks the sequence as a whole with
    382         <function>VG_(saneUCodeBlock)</function>.
    383         This stuff is engaged by default, and has caught some
    384         way-obscure bugs in the simulated CPU machinery in its
    385         time.</para>
    386       </listitem>
    387 
    388       <listitem>
    389         <para>The system call wrapper does
    390         <function>VG_(first_and_last_secondaries_look_plausible)</function>
    391         after every syscall; this is known to pick up bugs in the
    392         syscall wrappers.  Engaged by default.</para>
    393       </listitem>
    394 
    395       <listitem>
    396         <para>The main dispatch loop, in
    397         <function>VG_(dispatch)</function>, checks
    398         that translations do not set
    399         <computeroutput>%ebp</computeroutput> to any value
    400         different from
    401         <computeroutput>VG_EBP_DISPATCH_CHECKED</computeroutput>
    402         or <computeroutput>&amp; VG_(baseBlock)</computeroutput>.
    403         In effect this test is free, and is permanently
    404         engaged.</para>
    405       </listitem>
    406 
    407       <listitem>
    408         <para>There are a couple of ifdefed-out consistency
    409         checks I inserted whilst debugging the new register
        allocator,
    411         <computeroutput>vg_do_register_allocation</computeroutput>.</para>
    412       </listitem>
    413     </itemizedlist>
    414   </listitem>
    415 
    416   <listitem>
    417     <para>I try to avoid techniques, algorithms, mechanisms, etc,
    418     for which I can supply neither a convincing argument that
    419     they are correct, nor sanity-check code which might pick up
    420     bugs in my implementation.  I don't always succeed in this,
    421     but I try.  Basically the idea is: avoid techniques which
    422     are, in practice, unverifiable, in some sense.  When doing
    423     anything, always have in mind: "how can I verify that this is
    424     correct?"</para>
    425   </listitem>
    426 
    427 </itemizedlist>
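
<para>The two-tier schedule for the periodic sanity checks in the list
above (a cheap check every 1000 basic blocks, a more extensive one on
every 25th such call) can be sketched with a pair of counters.  The
class <computeroutput>SanityScheduler</computeroutput> and its callbacks
are hypothetical; only the two intervals come from the text.</para>

<programlisting><![CDATA[
class SanityScheduler:
    """Run a cheap check periodically and an extensive one less often."""

    def __init__(self, cheap, extensive, bb_interval=1000, extensive_every=25):
        self.cheap = cheap                  # e.g. stack overrun check
        self.extensive = extensive          # e.g. full permissions-map check
        self.bb_interval = bb_interval
        self.extensive_every = extensive_every
        self.blocks = 0                     # basic blocks executed so far
        self.calls = 0                      # cheap checks performed so far

    def on_basic_block(self):
        self.blocks += 1
        if self.blocks % self.bb_interval != 0:
            return
        self.calls += 1
        self.cheap()
        if self.calls % self.extensive_every == 0:
            self.extensive()
]]></programlisting>

<para>The per-block cost is one increment and one comparison, which is
what makes it reasonable to leave such checks permanently
enabled.</para>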
    428 
    429 
    430 <para>Some more specific things are:</para>
    431 <itemizedlist>
    432   <listitem>
    433     <para>Valgrind runs in the same namespace as the client, at
    434     least from <filename>ld.so</filename>'s point of view, and it
    435     therefore absolutely had better not export any symbol with a
    436     name which could clash with that of the client or any of its
    437     libraries.  Therefore, all globally visible symbols exported
    438     from <filename>valgrind.so</filename> are defined using the
    439     <computeroutput>VG_</computeroutput> CPP macro.  As you'll
    see from <filename>vg_constants.h</filename>, this prepends
    441     some arbitrary prefix to the symbol, in order that it be, we
    442     hope, globally unique.  Currently the prefix is
    443     <computeroutput>vgPlain_</computeroutput>.  For convenience
    444     there are also <computeroutput>VGM_</computeroutput>,
    445     <computeroutput>VGP_</computeroutput> and
    446     <computeroutput>VGOFF_</computeroutput>.  All locally defined
    447     symbols are declared <computeroutput>static</computeroutput>
    448     and do not appear in the final shared object.</para>
    449 
    450     <para>To check this, I periodically do <computeroutput>nm
    451     valgrind.so | grep " T "</computeroutput>, which shows you
    452     all the globally exported text symbols.  They should all have
    453     an approved prefix, except for those like
    454     <function>malloc</function>,
    455     <function>free</function>, etc, which we
    456     deliberately want to shadow and take precedence over the same
    457     names exported from <filename>glibc.so</filename>, so that
    458     valgrind can intercept those calls easily.  Similarly,
    459     <computeroutput>nm valgrind.so | grep " D "</computeroutput>
    460     allows you to find any rogue data-segment symbol
    461     names.</para>
    462   </listitem>
    463 
    464   <listitem>
    <para>Valgrind tries to be, and almost succeeds in being,
    466     completely independent of all other shared objects, in
    467     particular of <filename>glibc.so</filename>.  For example, we
    468     have our own low-level memory manager in
    469     <filename>vg_malloc2.c</filename>, which is a fairly standard
    470     malloc/free scheme augmented with arenas, and
    471     <filename>vg_mylibc.c</filename> exports reimplementations of
    472     various bits and pieces you'd normally get from the C
    473     library.</para>
    474 
    475     <para>Why all the hassle?  Because imagine the potential
    476     chaos of both the simulated and real CPUs executing in
    477     <filename>glibc.so</filename>.  It just seems simpler and
    478     cleaner to be completely self-contained, so that only the
    479     simulated CPU visits <filename>glibc.so</filename>.  In
    480     practice it's not much hassle anyway.  Also, valgrind starts
    481     up before glibc has a chance to initialise itself, and who
    482     knows what difficulties that could lead to.  Finally, glibc
    483     has definitions for some types, specifically
    <computeroutput>sigset_t</computeroutput>, which conflict with
    (are different from) the Linux kernel's idea of the same.  When
    486     Valgrind wants to fiddle around with signal stuff, it wants
    487     to use the kernel's definitions, not glibc's definitions.  So
    488     it's simplest just to keep glibc out of the picture
    489     entirely.</para>
    490 
    491     <para>To find out which glibc symbols are used by Valgrind,
    492     reinstate the link flags <option>-nostdlib
    493     -Wl,-no-undefined</option>.  This causes linking to
    494     fail, but will tell you what you depend on.  I have mostly,
    495     but not entirely, got rid of the glibc dependencies; what
    496     remains is, IMO, fairly harmless.  AFAIK the current
    497     dependencies are: <computeroutput>memset</computeroutput>,
    498     <computeroutput>memcmp</computeroutput>,
    499     <computeroutput>stat</computeroutput>,
    500     <computeroutput>system</computeroutput>,
    501     <computeroutput>sbrk</computeroutput>,
    502     <computeroutput>setjmp</computeroutput> and
    503     <computeroutput>longjmp</computeroutput>.</para>
    504   </listitem>
    505 
    506   <listitem>
    507     <para>Similarly, valgrind should not really import any
    508     headers other than the Linux kernel headers, since it knows
    509     of no API other than the kernel interface to talk to.  At the
    510     moment this is really not in a good state, and
    511     <computeroutput>vg_syscall_mem</computeroutput> imports, via
    512     <filename>vg_unsafe.h</filename>, a significant number of
    513     C-library headers so as to know the sizes of various structs
    514     passed across the kernel boundary.  This is of course
    515     completely bogus, since there is no guarantee that the C
    library's definitions of these structs match those of the
    517     kernel.  I have started to sort this out using
    518     <filename>vg_kerneliface.h</filename>, into which I had
    519     intended to copy all kernel definitions which valgrind could
    520     need, but this has not gotten very far.  At the moment it
    521     mostly contains definitions for
    522     <computeroutput>sigset_t</computeroutput> and
    523     <computeroutput>struct sigaction</computeroutput>, since the
    524     kernel's definition for these really does clash with glibc's.
    525     I plan to use a <computeroutput>vki_</computeroutput> prefix
    526     on all these types and constants, to denote the fact that
    527     they pertain to <command>V</command>algrind's
    528     <command>K</command>ernel
    529     <command>I</command>nterface.</para>
    530 
    531     <para>Another advantage of having a
    532     <filename>vg_kerneliface.h</filename> file is that it makes
    it simpler to interface to a different kernel.  One can, for
    534     example, easily imagine writing a new
    535     <filename>vg_kerneliface.h</filename> for FreeBSD, or x86
    536     NetBSD.</para>
    537   </listitem>
    538 
    539 </itemizedlist>
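
<para>The periodic <computeroutput>nm</computeroutput> check on exported
symbol names described in the list above is easy to automate.  The
sketch below assumes the usual three-column
<computeroutput>nm</computeroutput> output (address, type letter, name);
the function name <computeroutput>rogue_symbols</computeroutput> is
hypothetical, and the approved prefixes and the set of deliberately
intercepted names are passed in by the caller, since only
<computeroutput>vgPlain_</computeroutput> is given in the text.</para>

<programlisting><![CDATA[
def rogue_symbols(nm_output, prefixes, intercepted, kinds=("T", "D")):
    """Return (type, name) pairs for exported symbols lacking an approved prefix.

    nm_output   -- text as printed by `nm valgrind.so`
    prefixes    -- approved name prefixes, e.g. ["vgPlain_"]
    intercepted -- names exported on purpose, e.g. {"malloc", "free"}
    kinds       -- nm type letters to inspect (global text and data)
    """
    approved = tuple(prefixes)
    rogue = []
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue        # undefined symbols, for example, have no address
        _addr, kind, name = fields
        if kind not in kinds:
            continue        # local or otherwise uninteresting symbol
        if name in intercepted or name.startswith(approved):
            continue
        rogue.append((kind, name))
    return rogue
]]></programlisting>

<para>An empty result means every globally visible text and data symbol
either carries an approved prefix or is one of the names Valgrind
intends to shadow.</para>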
    540 
    541 </sect2>
    542 
    543 
    544 
    545 <sect2 id="mc-tech-docs.limits" xreflabel="Current limitations">
    546 <title>Current limitations</title>
    547 
    548 <para>Support for weird (non-POSIX) signal stuff is patchy.  Does
    549 anybody care?</para>
    550 
    551 </sect2>
    552 
    553 </sect1>
    554 
    555 
    556 
    557 
    558 
    559 <sect1 id="mc-tech-docs.jitter" xreflabel="The instrumenting JITter">
    560 <title>The instrumenting JITter</title>
    561 
    562 <para>This really is the heart of the matter.  We begin with
    563 various side issues.</para>
    564 
    565 
    566 <sect2 id="mc-tech-docs.storage" 
    567        xreflabel="Run-time storage, and the use of host registers">
    568 <title>Run-time storage, and the use of host registers</title>
    569 
    570 <para>Valgrind translates client (original) basic blocks into
    571 instrumented basic blocks, which live in the translation cache
    572 TC, until either the client finishes or the translations are
    573 ejected from TC to make room for newer ones.</para>
    574 
    575 <para>Since it generates x86 code in memory, Valgrind has
    576 complete control of the use of registers in the translations.
    577 Now pay attention.  I shall say this only once, and it is
    578 important you understand this.  In what follows I will refer to
registers in the host (real) CPU using their standard names,
    580 <computeroutput>%eax</computeroutput>,
    581 <computeroutput>%edi</computeroutput>, etc.  I refer to registers
    582 in the simulated CPU by capitalising them:
    583 <computeroutput>%EAX</computeroutput>,
    584 <computeroutput>%EDI</computeroutput>, etc.  These two sets of
    585 registers usually bear no direct relationship to each other;
    586 there is no fixed mapping between them.  This naming scheme is
    587 used fairly consistently in the comments in the sources.</para>
    588 
    589 <para>Host registers, once things are up and running, are used as
    590 follows:</para>
    591 
    592 <itemizedlist>
    593   <listitem>
    594     <para><computeroutput>%esp</computeroutput>, the real stack
    595     pointer, points somewhere in Valgrind's private stack area,
    596     <computeroutput>VG_(stack)</computeroutput> or, transiently,
    597     into its signal delivery stack,
    598     <computeroutput>VG_(sigstack)</computeroutput>.</para>
    599   </listitem>
    600 
    601   <listitem>
    602     <para><computeroutput>%edi</computeroutput> is used as a
    603     temporary in code generation; it is almost always dead,
    604     except when used for the
    605     <computeroutput>Left</computeroutput> value-tag operations.</para>
    606   </listitem>
    607 
    608   <listitem>
    609     <para><computeroutput>%eax</computeroutput>,
    610     <computeroutput>%ebx</computeroutput>,
    611     <computeroutput>%ecx</computeroutput>,
    612     <computeroutput>%edx</computeroutput> and
    613     <computeroutput>%esi</computeroutput> are available to
    614     Valgrind's register allocator.  They are dead (carry
    615     unimportant values) in between translations, and are live
    616     only in translations.  The one exception to this is
    617     <computeroutput>%eax</computeroutput>, which, as mentioned
    618     far above, has a special significance to the dispatch loop
    619     <computeroutput>VG_(dispatch)</computeroutput>: when a
    620     translation returns to the dispatch loop,
    621     <computeroutput>%eax</computeroutput> is expected to contain
    622     the original-code-address of the next translation to run.
    623     The register allocator is so good at minimising spill code
    624     that using five regs and not having to save/restore
    625     <computeroutput>%edi</computeroutput> actually gives better
    626     code than allocating to <computeroutput>%edi</computeroutput>
    627     as well, but then having to push/pop it around special
    628     uses.</para>
    629   </listitem>
    630 
    631   <listitem>
    632     <para><computeroutput>%ebp</computeroutput> points
    633     permanently at
    634     <computeroutput>VG_(baseBlock)</computeroutput>.  Valgrind's
    635     translations are position-independent, partly because this is
    636     convenient, but also because translations get moved around in
    637     TC as part of the LRUing activity.  <command>All</command>
    638     static entities which need to be referred to from generated
    639     code, whether data or helper functions, are stored starting
    640     at <computeroutput>VG_(baseBlock)</computeroutput> and are
    641     therefore reached by indexing from
    642     <computeroutput>%ebp</computeroutput>.  There is but one
    643     exception, which is that by placing the value
    644     <computeroutput>VG_EBP_DISPATCH_CHECKED</computeroutput> in
    645     <computeroutput>%ebp</computeroutput> just before a return to
    646     the dispatcher, the dispatcher is informed that the next
    647     address to run, in <computeroutput>%eax</computeroutput>,
    648     requires special treatment.</para>
    649   </listitem>
    650 
    651   <listitem>
    652     <para>The real machine's FPU state is pretty much
    653     unimportant, for reasons which will become obvious.  Ditto
    654     its <computeroutput>%eflags</computeroutput> register.</para>
    655   </listitem>
    656 
    657 </itemizedlist>
    658 
    659 <para>The state of the simulated CPU is stored in memory, in
    660 <computeroutput>VG_(baseBlock)</computeroutput>, which is a block
    661 of 200 words IIRC.  Recall that
    662 <computeroutput>%ebp</computeroutput> points permanently at the
    663 start of this block.  Function
    664 <computeroutput>vg_init_baseBlock</computeroutput> decides what
    665 the offsets of various entities in
    666 <computeroutput>VG_(baseBlock)</computeroutput> are to be, and
    667 allocates word offsets for them.  The code generator then emits
    668 <computeroutput>%ebp</computeroutput> relative addresses to get
    669 at those things.  The sequence in which entities are allocated
    670 has been carefully chosen so that the 32 most popular entities
    671 come first: a signed 8-bit displacement reaches only the first 32
    672 words (128 bytes), so these get short-form offsets in the generated code.</para>
    673 
    674 <para>If I were clever, I could make
    675 <computeroutput>%ebp</computeroutput> point 32 words along
    676 <computeroutput>VG_(baseBlock)</computeroutput>, so that I'd have
    677 another 32 words of short-form offsets available, but that's just
    678 complicated, and it's not important -- the first 32 words take
    679 99% (or whatever) of the traffic.</para>
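
        <para>The arithmetic behind the 32-word boundary can be checked with a
        small sketch.  This is illustrative C, not code from the Valgrind
        sources: the names <computeroutput>fits_disp8</computeroutput>,
        <computeroutput>word_is_short_form</computeroutput> and
        <computeroutput>count_short_form</computeroutput> are made up here, and
        the sketch assumes 4-byte words and the signed 8-bit displacement form
        of x86 addressing modes.</para>
        <programlisting><![CDATA[
        #include <assert.h>
        #include <stdbool.h>

        enum { WORD_BYTES = 4 };   /* x86-32 word size */

        /* True if a byte displacement fits the signed 8-bit form of an
           x86 addressing mode. */
        static bool fits_disp8(int byte_disp)
        {
            return byte_disp >= -128 && byte_disp <= 127;
        }

        /* True if word number `word` of the block is reachable with a short
           displacement when the base register points `bias_words` words past
           the start of the block (bias 0 is the scheme described above). */
        static bool word_is_short_form(int word, int bias_words)
        {
            assert(word >= 0 && bias_words >= 0);
            return fits_disp8((word - bias_words) * WORD_BYTES);
        }

        /* Count how many of the first `nwords` words are short-form reachable. */
        static int count_short_form(int nwords, int bias_words)
        {
            int n = 0;
            for (int w = 0; w < nwords; w++)
                if (word_is_short_form(w, bias_words))
                    n++;
            return n;
        }]]></programlisting>

        <para>Under these assumptions,
        <computeroutput>count_short_form(200, 0)</computeroutput> gives 32 and
        <computeroutput>count_short_form(200, 32)</computeroutput> gives 64,
        which is where the "another 32 words" comes from.</para>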
    680 
    681 <para>Currently, the sequence of stuff in
    682 <computeroutput>VG_(baseBlock)</computeroutput> is as
    683 follows:</para>
    684 
    685 <itemizedlist>
    686   <listitem>
    687     <para>9 words, holding the simulated integer registers,
    688     <computeroutput>%EAX</computeroutput>
    689     .. <computeroutput>%EDI</computeroutput>, and the simulated
    690     flags, <computeroutput>%EFLAGS</computeroutput>.</para>
    691   </listitem>
    692 
    693   <listitem>
    694     <para>Another 9 words, holding the V bit "shadows" for the
    695     above 9 regs.</para>
    696   </listitem>
    697 
    698   <listitem>
    699     <para>The <command>addresses</command> of various helper
    700     routines called from generated code:
    701     <computeroutput>VG_(helper_value_check4_fail)</computeroutput>,
    702     <computeroutput>VG_(helper_value_check0_fail)</computeroutput>,
    703     which register V-check failures,
    704     <computeroutput>VG_(helperc_STOREV4)</computeroutput>,
    705     <computeroutput>VG_(helperc_STOREV1)</computeroutput>,
    706     <computeroutput>VG_(helperc_LOADV4)</computeroutput>,
    707     <computeroutput>VG_(helperc_LOADV1)</computeroutput>, which
    708     do stores and loads of V bits to/from the sparse array which
    709     keeps track of V bits in memory, and
    710     <computeroutput>VGM_(handle_esp_assignment)</computeroutput>,
    711     which messes with memory addressability resulting from
    712     changes in <computeroutput>%ESP</computeroutput>.</para>
    713   </listitem>
    714 
    715   <listitem>
    716     <para>The simulated <computeroutput>%EIP</computeroutput>.</para>
    717   </listitem>
    718 
    719   <listitem>
    720     <para>24 spill words, for when the register allocator can't
    721     make it work with 5 measly registers.</para>
    722   </listitem>
    723 
    724   <listitem>
    725     <para>Addresses of helpers
    726     <computeroutput>VG_(helperc_STOREV2)</computeroutput>,
    727     <computeroutput>VG_(helperc_LOADV2)</computeroutput>.  These
    728     are here because 2-byte loads and stores are relatively rare,
    729     so are placed above the magic 32-word offset boundary.</para>
    730   </listitem>
    731 
    732   <listitem>
    733     <para>For similar reasons, addresses of helper functions
    734     <computeroutput>VGM_(fpu_write_check)</computeroutput> and
    735     <computeroutput>VGM_(fpu_read_check)</computeroutput>, which
    736     handle the A/V maps testing and changes required by FPU
    737     writes/reads.</para>
    738   </listitem>
    739 
    740   <listitem>
    741     <para>Some other boring helper addresses:
    742     <computeroutput>VG_(helper_value_check2_fail)</computeroutput>
    743     and
    744     <computeroutput>VG_(helper_value_check1_fail)</computeroutput>.
    745     These are probably never emitted now, and should be
    746     removed.</para>
    747   </listitem>
    748 
    749   <listitem>
    750     <para>The entire state of the simulated FPU, which I believe
    751     to be 108 bytes long.</para>
    752   </listitem>
    753 
    754   <listitem>
    755     <para>Finally, the addresses of various other helper
    756     functions in <filename>vg_helpers.S</filename>, which deal
    757     with rare situations which are tedious or difficult to
    758     generate code in-line for.</para>
    759   </listitem>
    760 
    761 </itemizedlist>
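
        <para>To make the layout concrete, here is a rough sketch of how word
        offsets could be handed out in the order listed above.  It is not the
        real <computeroutput>vg_init_baseBlock</computeroutput>: the names
        <computeroutput>alloc_words</computeroutput>,
        <computeroutput>struct layout</computeroutput> and
        <computeroutput>init_layout</computeroutput> are hypothetical, the
        count of seven "hot" helper addresses simply follows the list as
        written, and only the first five groups are shown.</para>
        <programlisting><![CDATA[
        #include <assert.h>

        enum { BASEBLOCK_WORDS = 200 };   /* size quoted in the text ("IIRC") */

        static int next_free_word = 0;

        /* Hand out `n` consecutive words; return the offset of the first. */
        static int alloc_words(int n)
        {
            int off = next_free_word;
            assert(n > 0 && off + n <= BASEBLOCK_WORDS);
            next_free_word += n;
            return off;
        }

        /* Hypothetical table of word offsets, one per group of entities. */
        struct layout {
            int regs;          /* %EAX .. %EDI and %EFLAGS: 9 words   */
            int shadow_regs;   /* V-bit shadows of the above: 9 words */
            int hot_helpers;   /* 7 helper addresses, per the list    */
            int eip;           /* simulated %EIP: 1 word              */
            int spill;         /* 24 spill words                      */
        };

        static struct layout init_layout(void)
        {
            struct layout l;
            next_free_word = 0;
            l.regs        = alloc_words(9);
            l.shadow_regs = alloc_words(9);
            l.hot_helpers = alloc_words(7);
            l.eip         = alloc_words(1);
            l.spill       = alloc_words(24);
            return l;
        }

        /* Byte displacement from %ebp for a given word offset. */
        static int ebp_disp(int word_off)
        {
            return word_off * 4;
        }]]></programlisting>

        <para>With these counts the simulated
        <computeroutput>%EIP</computeroutput> lands at word 25, and the spill
        area starts at word 26, so it straddles the 32-word boundary.  The
        real sources may well differ in the details.</para>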
    762 
    763 <para>As a general rule, the simulated machine's state lives
    764 permanently in memory at
    765 <computeroutput>VG_(baseBlock)</computeroutput>.  However, the
    766 JITter does some optimisations which allow the simulated integer
    767 registers to be cached in real registers over multiple simulated
    768 instructions within the same basic block.  These are always
    769 flushed back into memory at the end of every basic block, so that
    770 the in-memory state is up-to-date between basic blocks.  (This
    771 flushing is implied by the statement above that the real
    772 machine's allocatable registers are dead in between simulated
    773 blocks).</para>
    774 
    775 </sect2>
    776 
    777 
    778 
    779 <sect2 id="mc-tech-docs.startup" 
    780        xreflabel="Startup, shutdown, and system calls">
    781 <title>Startup, shutdown, and system calls</title>
    782 
    783 <para>Getting into Valgrind
    784 (<computeroutput>VG_(startup)</computeroutput>, called from
    785 <filename>valgrind.so</filename>'s initialisation section),
    786 really means copying the real CPU's state into
    787 <computeroutput>VG_(baseBlock)</computeroutput>, and then
    788 installing our own stack pointer, etc, into the real CPU, and
    789 then starting up the JITter.  Exiting Valgrind involves copying
    790 the simulated state back to the real state.</para>
    791 
    792 <para>Unfortunately, there's a complication at startup time.
    793 Problem is that at the point where we need to take a snapshot of
    794 the real CPU's state, the offsets in
    795 <computeroutput>VG_(baseBlock)</computeroutput> are not set up
    796 yet, because to do so would involve disrupting the real machine's
    797 state significantly.  The way round this is to dump the real
    798 machine's state into a temporary, static block of memory,
    799 <computeroutput>VG_(m_state_static)</computeroutput>.  We can
    800 then set up the <computeroutput>VG_(baseBlock)</computeroutput>
    801 offsets at our leisure, and copy into it from
    802 <computeroutput>VG_(m_state_static)</computeroutput> at some
    803 convenient later time.  This copying is done by
    804 <computeroutput>VG_(copy_m_state_static_to_baseBlock)</computeroutput>.</para>
    805 
    806 <para>On exit, the inverse transformation is (rather
    807 unnecessarily) used: stuff in
    808 <computeroutput>VG_(baseBlock)</computeroutput> is copied to
    809 <computeroutput>VG_(m_state_static)</computeroutput>, and the
    810 assembly stub then copies from
    811 <computeroutput>VG_(m_state_static)</computeroutput> into the
    812 real machine registers.</para>
    813 
    814 <para>Doing system calls on behalf of the client
    815 (<filename>vg_syscall.S</filename>) is something of a half-way
    816 house.  We have to make the world look sufficiently like that
    817 which the client would normally have to make the syscall actually
    818 work properly, but we can't afford to lose control.  So the trick
    819 is to copy all of the client's state, <command>except its program
    820 counter</command>, into the real CPU, do the system call, and
    821 copy the state back out.  Note that the client's state includes
    822 its stack pointer register, so one effect of this partial
    823 restoration is to cause the system call to be run on the client's
    824 stack, as it should be.</para>
    825 
    826 <para>As ever there are complications.  We have to save some of
    827 our own state somewhere when restoring the client's state into
    828 the CPU, so that we can keep going sensibly afterwards.  In fact
    829 the only thing which is important is our own stack pointer, but
    830 for paranoia reasons I save and restore our own FPU state as
    831 well, even though that's probably pointless.</para>
    832 
    833 <para>The complication on the above complication is that, for
    834 horrible reasons to do with signals, we may have to handle a
    835 second client system call whilst the client is blocked inside
    836 some other system call (unbelievable!).  That means there are two
    837 sets of places to dump Valgrind's stack pointer and FPU state
    838 across the syscall, and we decide which to use by consulting
    839 <computeroutput>VG_(syscall_depth)</computeroutput>, which is in
    840 turn maintained by
    841 <computeroutput>VG_(wrap_syscall)</computeroutput>.</para>
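
        <para>The idea of picking a save area by nesting depth can be sketched
        as follows.  This is only an illustration of the scheme described
        above, not the code in <filename>vg_syscall.S</filename>: the names
        <computeroutput>save_area</computeroutput>,
        <computeroutput>enter_syscall</computeroutput> and
        <computeroutput>leave_syscall</computeroutput> are invented, the local
        <computeroutput>syscall_depth</computeroutput> stands in for
        <computeroutput>VG_(syscall_depth)</computeroutput>, and the FPU image
        is copied nowhere (the real thing would save and restore it in
        assembly).</para>
        <programlisting><![CDATA[
        #include <assert.h>
        #include <stdint.h>

        enum { MAX_SYSCALL_DEPTH = 2 };   /* two sets of save places */

        /* What Valgrind must remember about itself across a client syscall. */
        struct saved_host_state {
            uint32_t esp;        /* Valgrind's own stack pointer         */
            uint8_t  fpu[108];   /* room for an x87 FSAVE image (unused) */
        };

        static struct saved_host_state save_area[MAX_SYSCALL_DEPTH];
        static int syscall_depth = 0;   /* stand-in for VG_(syscall_depth) */

        /* Record our stack pointer in the slot for the current depth,
           then go one level deeper. */
        static struct saved_host_state *enter_syscall(uint32_t host_esp)
        {
            assert(syscall_depth < MAX_SYSCALL_DEPTH);
            struct saved_host_state *slot = &save_area[syscall_depth++];
            slot->esp = host_esp;
            return slot;
        }

        /* Come back up one level and return the stack pointer saved there. */
        static uint32_t leave_syscall(void)
        {
            assert(syscall_depth > 0);
            return save_area[--syscall_depth].esp;
        }]]></programlisting>

        <para>Two nested calls to
        <computeroutput>enter_syscall</computeroutput> use different slots, so
        the inner syscall cannot clobber what the outer one saved.</para>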
    842 
    843 </sect2>
    844 
    845 
    846 
    847 <sect2 id="mc-tech-docs.ucode" xreflabel="Introduction to UCode">
    848 <title>Introduction to UCode</title>
    849 
    850 <para>UCode lies at the heart of the x86-to-x86 JITter.  The
    851 basic premise is that dealing with the x86 instruction set head-on
    852 is just too darn complicated, so we do the traditional
    853 compiler-writer's trick and translate it into a simpler,
    854 easier-to-deal-with form.</para>
    855 
    856 <para>In normal operation, translation proceeds through six
    857 stages, coordinated by
    858 <computeroutput>VG_(translate)</computeroutput>:</para>
    859 
    860 <orderedlist>
    861   <listitem>
    862     <para>Parsing of an x86 basic block into a sequence of UCode
    863     instructions (<computeroutput>VG_(disBB)</computeroutput>).</para>
    864   </listitem>
    865 
    866   <listitem>
    867     <para>UCode optimisation
    868     (<computeroutput>vg_improve</computeroutput>), with the aim
    869     of caching simulated registers in real registers over
    870     multiple simulated instructions, and removing redundant
    871     simulated <computeroutput>%EFLAGS</computeroutput>
    872     saving/restoring.</para>
    873   </listitem>
    874 
    875   <listitem>
    876     <para>UCode instrumentation
    877     (<computeroutput>vg_instrument</computeroutput>), which adds
    878     value and address checking code.</para>
    879   </listitem>
    880 
    881   <listitem>
    882     <para>Post-instrumentation cleanup
    883     (<computeroutput>vg_cleanup</computeroutput>), removing
    884     redundant value-check computations.</para>
    885   </listitem>
    886 
    887   <listitem>
    888     <para>Register allocation
    889     (<computeroutput>vg_do_register_allocation</computeroutput>),
    890     which, note, is done on UCode.</para>
    891   </listitem>
    892 
    893   <listitem>
    894     <para>Emission of final instrumented x86 code
    895     (<computeroutput>VG_(emit_code)</computeroutput>).</para>
    896   </listitem>
    897 
    898 </orderedlist>
    899 
    900 <para>Notice how steps 2, 3, 4 and 5 are simple UCode-to-UCode
    901 transformation passes, all on straight-line blocks of UCode (type
    902 <computeroutput>UCodeBlock</computeroutput>).  Steps 2 and 4 are
    903 optimisation passes and can be disabled for debugging purposes,
    904 with <option>--optimise=no</option> and
    905 <option>--cleanup=no</option> respectively.</para>
    906 
    907 <para>Valgrind can also run in a no-instrumentation mode, given
    908 <option>--instrument=no</option>.  This is useful
    909 for debugging the JITter quickly without having to deal with the
    910 complexity of the instrumentation mechanism too.  In this mode,
    911 steps 3 and 4 are omitted.</para>
    912 
    913 <para>These flags combine, so that
    914 <option>--instrument=no</option> together with
    915 <option>--optimise=no</option> means only steps
    916 1, 5 and 6 are used.
    917 <option>--single-step=yes</option> causes each
    918 x86 instruction to be treated as a single basic block.  The
    919 translations are terrible but this is sometimes instructive.</para>
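
        <para>How the flags select stages can be written down as a small
        sketch.  The enum and the function
        <computeroutput>stages_to_run</computeroutput> are hypothetical names
        used only for illustration; the real decisions are made inside
        <computeroutput>VG_(translate)</computeroutput>.  The three
        parameters stand for <option>--instrument</option>,
        <option>--optimise</option> and <option>--cleanup</option>.</para>
        <programlisting><![CDATA[
        #include <assert.h>
        #include <stdbool.h>

        /* One bit per stage of the six-stage pipeline. */
        enum {
            STAGE_PARSE      = 1u << 0,   /* 1: x86 -> UCode            */
            STAGE_IMPROVE    = 1u << 1,   /* 2: UCode optimisation      */
            STAGE_INSTRUMENT = 1u << 2,   /* 3: add checking code       */
            STAGE_CLEANUP    = 1u << 3,   /* 4: post-instrument cleanup */
            STAGE_REGALLOC   = 1u << 4,   /* 5: register allocation     */
            STAGE_EMIT       = 1u << 5    /* 6: final x86 emission      */
        };

        /* Stages 1, 5 and 6 always run; 2 depends on optimise; 3 depends on
           instrument; 4 needs both instrument and cleanup. */
        static unsigned stages_to_run(bool instrument, bool optimise, bool cleanup)
        {
            unsigned s = STAGE_PARSE | STAGE_REGALLOC | STAGE_EMIT;
            if (optimise)
                s |= STAGE_IMPROVE;
            if (instrument) {
                s |= STAGE_INSTRUMENT;
                if (cleanup)
                    s |= STAGE_CLEANUP;
            }
            return s;
        }

        /* True if the single stage bit `stage` is set in `mask`. */
        static bool stage_enabled(unsigned mask, unsigned stage)
        {
            assert(stage != 0 && (stage & (stage - 1)) == 0);  /* one bit only */
            return (mask & stage) != 0;
        }]]></programlisting>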
    920 
    921 <para>The <option>--stop-after=N</option> flag
    922 switches back to the real CPU after
    923 <computeroutput>N</computeroutput> basic blocks.  It also re-JITs
    924 the final basic block executed and prints the debugging info
    925 resulting, so this gives you a way to get a quick snapshot of how
    926 a basic block looks as it passes through the six stages mentioned
    927 above.  If you want to see full information for every block
    928 translated (probably not, but still ...) find, in
    929 <computeroutput>VG_(translate)</computeroutput>, the lines</para>
    930 <programlisting><![CDATA[
    931 dis = True;
    932 dis = debugging_translation;]]></programlisting>
    933 
    934 <para>and comment out the second line.  This will spew out
    935 debugging junk faster than you can possibly imagine.</para>
    936 
    937 </sect2>
    938 
    939 
    940 
    941 <sect2 id="mc-tech-docs.tags" xreflabel="UCode operand tags: type 'Tag'">
    942 <title>UCode operand tags: type <computeroutput>Tag</computeroutput></title>
    943 
    944 <para>UCode is, more or less, a simple two-address RISC-like
    945 code.  In keeping with the x86 AT&amp;T assembly syntax,
    946 generally speaking the first operand is the source operand, and
    947 the second is the destination operand, which is modified when the
    948 uinstr is notionally executed.</para>
    949 
    950 <para>UCode instructions have up to three operand fields, each of
    951 which has a corresponding <computeroutput>Tag</computeroutput>
    952 describing it.  Possible values for the tag are:</para>
    953 
    954 <itemizedlist>
    955 
    956   <listitem>
    957     <para><computeroutput>NoValue</computeroutput>: indicates
    958     that the field is not in use.</para>
    959   </listitem>
    960 
    961   <listitem>
    962     <para><computeroutput>Lit16</computeroutput>: the field
    963     contains a 16-bit literal.</para>
    964   </listitem>
    965 
    966   <listitem>
    967     <para><computeroutput>Literal</computeroutput>: the field
    968     denotes a 32-bit literal, whose value is stored in the
    969     <computeroutput>lit32</computeroutput> field of the uinstr
    970     itself.  Since there is only one
    971     <computeroutput>lit32</computeroutput> for the whole uinstr,
    972     only one operand field may contain this tag.</para>
    973   </listitem>
    974 
    975   <listitem>
    976     <para><computeroutput>SpillNo</computeroutput>: the field
    977     contains a spill slot number, in the range 0 to 23 inclusive,
    978     denoting one of the spill slots contained inside
    979     <computeroutput>VG_(baseBlock)</computeroutput>.  Such tags
    980     only exist after register allocation.</para>
    981   </listitem>
    982 
    983   <listitem>
    984     <para><computeroutput>RealReg</computeroutput>: the field
    985     contains a number in the range 0 to 7 denoting an integer x86
    986     ("real") register on the host.  The number is the Intel
    987     encoding for integer registers.  Such tags only exist after
    988     register allocation.</para>
    989   </listitem>
    990 
    991   <listitem>
    992     <para><computeroutput>ArchReg</computeroutput>: the field
    993     contains a number in the range 0 to 7 denoting an integer x86
    994     register on the simulated CPU.  In reality this means a
    995     reference to one of the first 8 words of
    996     <computeroutput>VG_(baseBlock)</computeroutput>.  Such tags
    997     can exist at any point in the translation process.</para>
    998   </listitem>
    999 
   1000   <listitem>
   1001     <para>Last, but not least,
   1002     <computeroutput>TempReg</computeroutput>.  The field contains
   1003     the number of one of an infinite set of virtual (integer)
   1004     registers. <computeroutput>TempReg</computeroutput>s are used
   1005     everywhere throughout the translation process; you can have
   1006     as many as you want.  The register allocator maps as many as
   1007     it can into <computeroutput>RealReg</computeroutput>s and
   1008     turns the rest into
   1009     <computeroutput>SpillNo</computeroutput>s, so
   1010     <computeroutput>TempReg</computeroutput>s should not exist
   1011     after the register allocation phase.</para>
   1012 
   1013     <para><computeroutput>TempReg</computeroutput>s are always 32
   1014     bits long, even if the data they hold is logically shorter.
   1015     In that case the upper unused bits are required, and, I
   1016     think, generally assumed, to be zero.
   1017     <computeroutput>TempReg</computeroutput>s holding V bits for
   1018     quantities shorter than 32 bits are expected to have ones in
   1019     the unused places, since a one denotes "undefined".</para>
   1020   </listitem>
   1021 
   1022 </itemizedlist>
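
        <para>The tag values and the padding conventions for
        <computeroutput>TempReg</computeroutput>s can be sketched like this.
        The order of the enumerators and the helper names
        <computeroutput>operand_in_range</computeroutput>,
        <computeroutput>tempreg_value</computeroutput> and
        <computeroutput>tempreg_vbits</computeroutput> are assumptions made
        for illustration, not taken from the sources; only the ranges and the
        zero/one padding rules come from the description above.</para>
        <programlisting><![CDATA[
        #include <assert.h>
        #include <stdbool.h>
        #include <stdint.h>

        /* Enumerator order is a guess; the names are those listed above. */
        typedef enum {
            NoValue, Lit16, Literal, SpillNo, RealReg, ArchReg, TempReg
        } Tag;

        /* Range checks implied by the descriptions of each tag. */
        static bool operand_in_range(Tag tag, unsigned val)
        {
            switch (tag) {
            case SpillNo: return val <= 23;    /* spill slots 0 .. 23   */
            case RealReg:                      /* Intel register number */
            case ArchReg: return val <= 7;
            case Lit16:   return val <= 0xFFFFu;
            default:      return true;         /* no range stated       */
            }
        }

        /* A value of `size` bytes held in a 32-bit TempReg: unused upper
           bits are zero. */
        static uint32_t tempreg_value(uint32_t v, int size)
        {
            assert(size == 1 || size == 2 || size == 4);
            if (size == 4)
                return v;
            uint32_t mask = (UINT32_C(1) << (8 * size)) - 1;
            return v & mask;
        }

        /* V bits for a `size`-byte quantity: unused upper bits are ones,
           since a one means "undefined". */
        static uint32_t tempreg_vbits(uint32_t vbits, int size)
        {
            assert(size == 1 || size == 2 || size == 4);
            if (size == 4)
                return vbits;
            uint32_t mask = (UINT32_C(1) << (8 * size)) - 1;
            return vbits | ~mask;
        }]]></programlisting>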
   1023 
   1024 </sect2>
   1025 
   1026 
   1027 
   1028 <sect2 id="mc-tech-docs.uinstr" 
   1029        xreflabel="UCode instructions: type 'UInstr'">
   1030 <title>UCode instructions: type <computeroutput>UInstr</computeroutput></title>
   1031 
   1032 <para>UCode was carefully designed to make it possible to do
   1033 register allocation on UCode and then translate the result into
   1034 x86 code without needing any extra registers ... well, that was
   1035 the original plan, anyway.  Things have gotten a little more
   1036 complicated since then.  In what follows, UCode instructions are
   1037 referred to as uinstrs, to distinguish them from x86
   1038 instructions.  Uinstrs of course have uopcodes which are
   1039 (naturally) different from x86 opcodes.</para>
   1040 
   1041 <para>A uinstr (type <computeroutput>UInstr</computeroutput>)
   1042 contains various fields, not all of which are used by any one
   1043 uopcode:</para>
   1044 
   1045 <itemizedlist>
   1046 
   1047   <listitem>
   1048     <para>Three 16-bit operand fields,
   1049     <computeroutput>val1</computeroutput>,
   1050     <computeroutput>val2</computeroutput> and
   1051     <computeroutput>val3</computeroutput>.</para>
   1052   </listitem>
   1053 
   1054   <listitem>
   1055     <para>Three tag fields,
   1056     <computeroutput>tag1</computeroutput>,
   1057     <computeroutput>tag2</computeroutput> and
   1058     <computeroutput>tag3</computeroutput>.  Each of these has a
   1059     value of type <computeroutput>Tag</computeroutput>, and they
   1060     describe what the <computeroutput>val1</computeroutput>,
   1061     <computeroutput>val2</computeroutput> and
   1062     <computeroutput>val3</computeroutput> fields contain.</para>
   1063   </listitem>
   1064 
   1065   <listitem>
   1066     <para>A 32-bit literal field.</para>
   1067   </listitem>
   1068 
   1069   <listitem>
   1070     <para>Two <computeroutput>FlagSet</computeroutput>s,
   1071     specifying which x86 condition codes are read and written by
   1072     the uinstr.</para>
   1073   </listitem>
   1074 
   1075   <listitem>
   1076     <para>An opcode byte, containing a value of type
   1077     <computeroutput>Opcode</computeroutput>.</para>
   1078   </listitem>
   1079 
   1080   <listitem>
   1081     <para>A size field, indicating the data transfer size
   1082     (1/2/4/8/10) in cases where this makes sense, or zero
   1083     otherwise.</para>
   1084   </listitem>
   1085 
   1086   <listitem>
   1087     <para>A condition-code field, which, for jumps, holds a value
   1088     of type <computeroutput>Condcode</computeroutput>, indicating
   1089     the condition which applies.  The encoding is as it is in the
   1090     x86 insn stream, except we add a 17th value
   1091     <computeroutput>CondAlways</computeroutput> to indicate an
   1092     unconditional transfer.</para>
   1093   </listitem>
   1094 
   1095   <listitem>
   1096     <para>Various 1-bit flags, indicating whether this insn
   1097     pertains to an x86 CALL or RET instruction, whether a
   1098     widening is signed or not, etc.</para>
   1099   </listitem>
   1100 
   1101 </itemizedlist>
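
        <para>Put together, the fields above suggest a record along the
        following lines.  Treat this as a sketch only: the field order, the
        widths of everything except the 16-bit operands and the 32-bit
        literal, the representation of
        <computeroutput>FlagSet</computeroutput>, the flag-bit names and the
        numeric value of <computeroutput>CondAlways</computeroutput> are all
        assumptions, and <computeroutput>literal_use_ok</computeroutput> is a
        made-up helper showing the "only one
        <computeroutput>Literal</computeroutput> operand" rule.</para>
        <programlisting><![CDATA[
        #include <assert.h>
        #include <stdbool.h>
        #include <stdint.h>

        typedef enum {
            NoValue, Lit16, Literal, SpillNo, RealReg, ArchReg, TempReg
        } Tag;

        typedef uint8_t FlagSet;       /* assumed: one bit per condition code */

        enum { CondAlways = 16 };      /* assumed: x86 encodes 0 .. 15 */

        typedef struct {
            uint32_t lit32;            /* the single 32-bit literal         */
            uint16_t val1, val2, val3; /* three 16-bit operand fields       */
            uint8_t  tag1, tag2, tag3; /* a Tag for each operand field      */
            FlagSet  flags_r, flags_w; /* condition codes read / written    */
            uint8_t  opcode;           /* a value of type Opcode            */
            uint8_t  size;             /* 1/2/4/8/10, or 0 if not relevant  */
            uint8_t  cond;             /* Condcode, meaningful for jumps    */
            unsigned signed_widen : 1; /* hypothetical names for the        */
            unsigned is_call      : 1; /* "various 1-bit flags"             */
            unsigned is_ret       : 1;
        } UInstr;

        /* There is only one lit32 per uinstr, so at most one operand field
           may carry the Literal tag. */
        static bool literal_use_ok(const UInstr *u)
        {
            assert(u != 0);
            int n = (u->tag1 == Literal) + (u->tag2 == Literal)
                  + (u->tag3 == Literal);
            return n <= 1;
        }]]></programlisting>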
   1102 
   1103 <para>UOpcodes (type <computeroutput>Opcode</computeroutput>) are
   1104 divided into two groups: those necessary merely to express the
   1105 functionality of the x86 code, and extra uopcodes needed to
   1106 express the instrumentation.  The former group contains:</para>
   1107 
   1108 <itemizedlist>
   1109 
   1110   <listitem>
   1111     <para><computeroutput>GET</computeroutput> and
   1112     <computeroutput>PUT</computeroutput>, which move values from
   1113     the simulated CPU's integer registers
   1114     (<computeroutput>ArchReg</computeroutput>s) into
   1115     <computeroutput>TempReg</computeroutput>s, and back.
   1116     <computeroutput>GETF</computeroutput> and
   1117     <computeroutput>PUTF</computeroutput> do the corresponding
   1118     thing for the simulated
   1119     <computeroutput>%EFLAGS</computeroutput>.  There are no
   1120     corresponding insns for the FPU register stack, since we
   1121     don't explicitly simulate its registers.</para>
   1122   </listitem>
   1123 
   1124   <listitem>
   1125     <para><computeroutput>LOAD</computeroutput> and
   1126     <computeroutput>STORE</computeroutput>, which, in RISC-like
   1127     fashion, are the only uinstrs able to interact with
   1128     memory.</para>
   1129   </listitem>
   1130 
   1131   <listitem>
   1132     <para><computeroutput>MOV</computeroutput> and
   1133     <computeroutput>CMOV</computeroutput> allow unconditional and
   1134     conditional moves of values between
   1135     <computeroutput>TempReg</computeroutput>s.</para>
   1136   </listitem>
   1137 
   1138   <listitem>
   1139     <para>ALU operations.  Again in RISC-like fashion, these only
   1140     operate on <computeroutput>TempReg</computeroutput>s (before
   1141     reg-alloc) or <computeroutput>RealReg</computeroutput>s
   1142     (after reg-alloc).  These are:
   1143     <computeroutput>ADD</computeroutput>,
   1144     <computeroutput>ADC</computeroutput>,
   1145     <computeroutput>AND</computeroutput>,
   1146     <computeroutput>OR</computeroutput>,
   1147     <computeroutput>XOR</computeroutput>,
   1148     <computeroutput>SUB</computeroutput>,
   1149     <computeroutput>SBB</computeroutput>,
   1150     <computeroutput>SHL</computeroutput>,
   1151     <computeroutput>SHR</computeroutput>,
   1152     <computeroutput>SAR</computeroutput>,
   1153     <computeroutput>ROL</computeroutput>,
   1154     <computeroutput>ROR</computeroutput>,
   1155     <computeroutput>RCL</computeroutput>,
   1156     <computeroutput>RCR</computeroutput>,
   1157     <computeroutput>NOT</computeroutput>,
   1158     <computeroutput>NEG</computeroutput>,
   1159     <computeroutput>INC</computeroutput>,
   1160     <computeroutput>DEC</computeroutput>,
   1161     <computeroutput>BSWAP</computeroutput>,
   1162     <computeroutput>CC2VAL</computeroutput> and
   1163     <computeroutput>WIDEN</computeroutput>.
   1164     <computeroutput>WIDEN</computeroutput> does signed or
   1165     unsigned value widening.
   1166     <computeroutput>CC2VAL</computeroutput> is used to convert
   1167     condition codes into a value, zero or one.  The rest are
   1168     obvious.</para>
   1169 
   1170     <para>To allow for more efficient code generation, we bend
   1171     slightly the restriction at the start of the previous para:
   1172     for <computeroutput>ADD</computeroutput>,
   1173     <computeroutput>ADC</computeroutput>,
   1174     <computeroutput>XOR</computeroutput>,
   1175     <computeroutput>SUB</computeroutput> and
   1176     <computeroutput>SBB</computeroutput>, we allow the first
   1177     (source) operand to also be an
   1178     <computeroutput>ArchReg</computeroutput>, that is, one of the
   1179     simulated machine's registers.  Also, many of these ALU ops
   1180     allow the source operand to be a literal.  See
   1181     <computeroutput>VG_(saneUInstr)</computeroutput> for the
   1182     final word on the allowable forms of uinstrs.</para>
   1183   </listitem>
   1184 
   1185   <listitem>
   1186     <para><computeroutput>LEA1</computeroutput> and
   1187     <computeroutput>LEA2</computeroutput> are not strictly
   1188     necessary, but facilitate better translations.  They
   1189     record the fancy x86 addressing modes in a direct way, which
   1190     allows those amodes to be emitted back into the final
   1191     instruction stream more or less verbatim.</para>
   1192   </listitem>
   1193 
   1194   <listitem>
   1195     <para><computeroutput>CALLM</computeroutput> calls a
   1196     machine-code helper, one of the methods whose address is
   1197     stored at some
   1198     <computeroutput>VG_(baseBlock)</computeroutput> offset.
   1199     <computeroutput>PUSH</computeroutput> and
   1200     <computeroutput>POP</computeroutput> move values between a
   1201     <computeroutput>TempReg</computeroutput> and the real
   1202     (Valgrind's) stack, and
   1203     <computeroutput>CLEAR</computeroutput> removes values from
   1204     the stack.  <computeroutput>CALLM_S</computeroutput> and
   1205     <computeroutput>CALLM_E</computeroutput> delimit the
   1206     boundaries of call setups and clearings, for the benefit of
   1207     the instrumentation passes.  Getting this right is critical,
   1208     and so <computeroutput>VG_(saneUCodeBlock)</computeroutput>
   1209     makes various checks on the use of these uopcodes.</para>
   1210 
   1211     <para>It is important to understand that these uopcodes have
   1212     nothing to do with the x86
   1213     <computeroutput>call</computeroutput>,
   1214     <computeroutput>ret</computeroutput>,
   1215     <computeroutput>push</computeroutput> or
   1216     <computeroutput>pop</computeroutput> instructions, and are
   1217     not used to implement them.  Those guys turn into
   1218     combinations of <computeroutput>GET</computeroutput>,
   1219     <computeroutput>PUT</computeroutput>,
   1220     <computeroutput>LOAD</computeroutput>,
   1221     <computeroutput>STORE</computeroutput>,
   1222     <computeroutput>ADD</computeroutput>,
   1223     <computeroutput>SUB</computeroutput>, and
   1224     <computeroutput>JMP</computeroutput>.  What these uopcodes
   1225     support is calling of helper functions such as
   1226     <computeroutput>VG_(helper_imul_32_64)</computeroutput>,
   1227     which do stuff which is too difficult or tedious to emit
   1228     inline.</para>
   1229   </listitem>
   1230 
   1231   <listitem>
   1232     <para><computeroutput>FPU</computeroutput>,
   1233     <computeroutput>FPU_R</computeroutput> and
   1234     <computeroutput>FPU_W</computeroutput>.  Valgrind doesn't
   1235     attempt to simulate the internal state of the FPU at all.
   1236     Consequently it only needs to be able to distinguish FPU ops
   1237     which read and write memory from those that don't, and for
   1238     those which do, it needs to know the effective address and
   1239     data transfer size.  This is made easier because the x86 FP
   1240     instruction encoding is very regular, basically consisting of
   1241     16 bits for a non-memory FPU insn and 11 (IIRC) bits + an
   1242     address mode for a memory FPU insn.  So our
   1243     <computeroutput>FPU</computeroutput> uinstr carries the 16
   1244     bits in its <computeroutput>val1</computeroutput> field.  And
   1245     <computeroutput>FPU_R</computeroutput> and
   1246     <computeroutput>FPU_W</computeroutput> carry 11 bits in that
   1247     field, together with the identity of a
   1248     <computeroutput>TempReg</computeroutput> or (later)
   1249     <computeroutput>RealReg</computeroutput> which contains the
   1250     address.</para>
   1251   </listitem>
   1252 
   1253   <listitem>
   1254     <para><computeroutput>JIFZ</computeroutput> is unique, in
   1255     that it allows a control-flow transfer which is not deemed to
   1256     end a basic block.  It causes a jump to a literal (original)
   1257     address if the specified argument is zero.</para>
   1258   </listitem>
   1259 
   1260   <listitem>
   1261     <para>Finally, <computeroutput>INCEIP</computeroutput>
   1262     advances the simulated <computeroutput>%EIP</computeroutput>
   1263     by the specified literal amount.  This supports lazy
   1264     <computeroutput>%EIP</computeroutput> updating, as described
   1265     below.</para>
   1266   </listitem>
   1267 
   1268 </itemizedlist>
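
        <para>For the <computeroutput>FPU</computeroutput>,
        <computeroutput>FPU_R</computeroutput> and
        <computeroutput>FPU_W</computeroutput> uinstrs described above, the
        16-bit and 11-bit figures can be illustrated with a sketch.  It
        relies on two facts about the x86 encoding: x87 instructions begin
        with an escape byte in the range 0xD8 to 0xDF, and a ModRM byte with
        a mod field of 3 names a register rather than memory.  How the bits
        are actually arranged inside <computeroutput>val1</computeroutput> is
        an assumption here, and the function names are invented.</para>
        <programlisting><![CDATA[
        #include <assert.h>
        #include <stdbool.h>
        #include <stdint.h>

        /* x87 instructions start with an escape byte 0xD8 .. 0xDF. */
        static bool is_x87_escape(uint8_t first)
        {
            return (first & 0xF8) == 0xD8;
        }

        /* ModRM mod field (top two bits) == 3 means a register operand,
           so the instruction does not touch memory. */
        static bool fpu_touches_memory(uint8_t modrm)
        {
            return (modrm >> 6) != 3;
        }

        /* Non-memory insn: both bytes matter, giving 16 bits. */
        static uint16_t pack_fpu_nomem(uint8_t first, uint8_t modrm)
        {
            assert(is_x87_escape(first) && !fpu_touches_memory(modrm));
            return (uint16_t)((first << 8) | modrm);
        }

        /* Memory insn: the first byte plus the 3-bit reg field of ModRM
           identify the operation (8 + 3 = 11 bits).  The mod and r/m bits
           only describe the address, which is computed separately into a
           TempReg, so they are dropped here. */
        static uint16_t pack_fpu_mem(uint8_t first, uint8_t modrm)
        {
            assert(is_x87_escape(first) && fpu_touches_memory(modrm));
            return (uint16_t)((first << 8) | (modrm & 0x38));
        }]]></programlisting>

        <para>For example, <computeroutput>fld1</computeroutput> (bytes D9
        E8) is a non-memory insn and packs to 0xD9E8 in this scheme, whereas
        a store such as DD 5D with a displacement keeps only 0xDD and the reg
        field, giving 0xDD18.</para>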
   1269 
   1270 <para>Stages 1 and 2 of the 6-stage translation process mentioned
   1271 above deal purely with these uopcodes, and no others.  They are
   1272 sufficient to express pretty much all the x86 32-bit
   1273 protected-mode instruction set, at least everything understood by
   1274 a pre-MMX original Pentium (P54C).</para>
   1275 
   1276 <para>Stages 3, 4, 5 and 6 also deal with the following extra
   1277 "instrumentation" uopcodes.  They are used to express all the
   1278 definedness-tracking and -checking machinery which Valgrind does.
   1279 In later sections we show how to create checking code for each of
   1280 the uopcodes above.  Note that these instrumentation uopcodes,
   1281 although some appear complicated, have been carefully chosen
   1282 so that efficient x86 code can be generated for them.  GNU
   1283 superopt v2.5 did a great job helping out here.  Anyways, the
   1284 uopcodes are as follows:</para>
   1285 
   1286 <itemizedlist>
   1287 
   1288   <listitem>
   1289     <para><computeroutput>GETV</computeroutput> and
   1290     <computeroutput>PUTV</computeroutput> are analogues to
   1291     <computeroutput>GET</computeroutput> and
   1292     <computeroutput>PUT</computeroutput> above.  They are
   1293     identical except that they move the V bits for the specified
   1294     values back and forth to
   1295     <computeroutput>TempRegs</computeroutput>, rather than moving
   1296     the values themselves.</para>
   1297   </listitem>
   1298 
   1299   <listitem>
   1300     <para>Similarly, <computeroutput>LOADV</computeroutput> and
   1301     <computeroutput>STOREV</computeroutput> read and write V bits
   1302     from the synthesised shadow memory that Valgrind maintains.
   1303     In fact they do more than that, since they also do
   1304     address-validity checks, and emit complaints if the
   1305     read/written addresses are unaddressable.</para>
   1306   </listitem>
   1307 
   1308   <listitem>
   1309     <para><computeroutput>TESTV</computeroutput>, whose
   1310     parameters are a <computeroutput>TempReg</computeroutput> and
   1311     a size, tests the V bits in the
   1312     <computeroutput>TempReg</computeroutput>, at the specified
   1313     operation size (0/1/2/4 byte) and emits an error if any of
   1314     them indicate undefinedness.  This is the only uopcode
   1315     capable of doing such tests.</para>
   1316   </listitem>
   1317 
   1318   <listitem>
   1319     <para><computeroutput>SETV</computeroutput>, whose parameters
   1320     are also <computeroutput>TempReg</computeroutput> and a size,
   1321     makes the V bits in the
   1322     <computeroutput>TempReg</computeroutput> indicate
   1323     definedness, at the specified operation size.  This is
   1324     usually used to generate the correct V bits for a literal
   1325     value, which is of course fully defined.</para>
   1326   </listitem>
   1327 
   1328   <listitem>
   1329     <para><computeroutput>GETVF</computeroutput> and
   1330     <computeroutput>PUTVF</computeroutput> are analogues to
   1331     <computeroutput>GETF</computeroutput> and
   1332     <computeroutput>PUTF</computeroutput>.  They move the single
   1333     V bit used to model definedness of
   1334     <computeroutput>%EFLAGS</computeroutput> between its home in
   1335     <computeroutput>VG_(baseBlock)</computeroutput> and the
   1336     specified <computeroutput>TempReg</computeroutput>.</para>
   1337   </listitem>
   1338 
   1339   <listitem>
   1340     <para><computeroutput>TAG1</computeroutput> denotes one of a
   1341     family of unary operations on
   1342     <computeroutput>TempReg</computeroutput>s containing V bits.
   1343     Similarly, <computeroutput>TAG2</computeroutput> denotes one
   1344     in a family of binary operations on V bits.</para>
   1345   </listitem>
   1346 
   1347 </itemizedlist>
   1348 
   1349 
   1350 <para>These 10 uopcodes are sufficient to express Valgrind's
   1351 entire definedness-checking semantics.  In fact most of the
   1352 interesting magic is done by the
   1353 <computeroutput>TAG1</computeroutput> and
   1354 <computeroutput>TAG2</computeroutput> suboperations.</para>
   1355 
   1356 <para>First, however, I need to explain about V-vector operation
   1357 sizes.  There are 4 sizes.  Sizes 1, 2 and 4 operate on groups of
   1358 8, 16 and 32 V bits at a time, supporting the usual 1, 2 and 4
   1359 byte x86 operations.  There is also the mysterious size
   1360 0, which really means a single V bit.  Single V bits are used in
   1361 various circumstances; in particular, the definedness of
   1362 <computeroutput>%EFLAGS</computeroutput> is modelled with a
   1363 single V bit.  Now might be a good time to also point out that
   1364 for V bits, 1 means "undefined" and 0 means "defined".
   1365 Similarly, for A bits, 1 means "invalid address" and 0 means
   1366 "valid address".  This seems counterintuitive (and so it is), but
   1367 testing against zero on x86s saves instructions compared to
   1368 testing against all 1s, because many ALU operations set the Z
   1369 flag for free, so to speak.</para>
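<para>To make the encoding concrete, here is a minimal C sketch of
size-dependent <computeroutput>TESTV</computeroutput>- and
<computeroutput>SETV</computeroutput>-like operations on a 32-bit V
vector.  The function names are hypothetical, and the treatment of
bits above the operation size (left untouched here) is an assumption
of this sketch rather than a description of the real code
generator.</para>
<programlisting><![CDATA[
#include <stdint.h>

/* Convention from the text: V bit 1 = undefined, 0 = defined. */

/* Mask selecting the V bits that matter for an operation size.
   Size 0 is the special single-V-bit case. */
static uint32_t vbits_mask(int size)
{
    switch (size) {
    case 0:  return 0x00000001u;
    case 1:  return 0x000000FFu;
    case 2:  return 0x0000FFFFu;
    default: return 0xFFFFFFFFu;   /* size 4 */
    }
}

/* TESTV-like check: returns 1 if any relevant V bit is undefined.
   Because "defined" is 0, this is just a test against zero. */
static int testv_has_undefined(uint32_t vbits, int size)
{
    return (vbits & vbits_mask(size)) != 0;
}

/* SETV-like operation: mark the relevant V bits as defined.
   Assumption: bits above the operation size are left alone. */
static uint32_t setv_defined(uint32_t vbits, int size)
{
    return vbits & ~vbits_mask(size);
}
]]></programlisting>
<para>With this encoding a fully defined value has an all-zero V
vector, so the check compiles down to a mask and a test of the Z
flag.</para>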
   1370 
   1371 <para>With that in mind, the tag ops are:</para>
   1372 
   1373 <itemizedlist>
   1374 
   1375   <listitem>
   1376     <formalpara>
   1377     <title>(UNARY) Pessimising casts:</title>
   1378     <para><computeroutput>VgT_PCast40</computeroutput>,
   1379     <computeroutput>VgT_PCast20</computeroutput>,
   1380     <computeroutput>VgT_PCast10</computeroutput>,
   1381     <computeroutput>VgT_PCast01</computeroutput>,
   1382     <computeroutput>VgT_PCast02</computeroutput> and
   1383     <computeroutput>VgT_PCast04</computeroutput>.  A "pessimising
   1384     cast" takes a V-bit vector at one size, and creates a new one
   1385     at another size, pessimised in the sense that if any of the
   1386     bits in the source vector indicate undefinedness, then all
   1387     the bits in the result indicate undefinedness.  In this case
   1388     the casts are all to or from a single V bit, so for example
   1389     <computeroutput>VgT_PCast40</computeroutput> is a pessimising
   1390     cast from 32 bits to 1, whereas
   1391     <computeroutput>VgT_PCast04</computeroutput> simply copies
   1392     the single source V bit into all 32 bit positions in the
   1393     result.  Surprisingly, these ops can all be implemented very
   1394     efficiently.</para>
   1395     </formalpara>
   1396 
   1397     <para>There are also the pessimising casts
   1398     <computeroutput>VgT_PCast14</computeroutput>, from 8 bits to
   1399     32, <computeroutput>VgT_PCast12</computeroutput>, from 8 bits
   1400     to 16, and <computeroutput>VgT_PCast11</computeroutput>, from
   1401     8 bits to 8.  This last one seems nonsensical, but in fact it
   1402     isn't a no-op because, as mentioned above, any undefined (1)
   1403     bits in the source infect the entire result.</para>
   1404   </listitem>
   1405 
   1406   <listitem>
   1407     <formalpara>
   1408     <title>(UNARY) Propagating undefinedness upwards in a
   1409     word:</title>
   1410     <para><computeroutput>VgT_Left4</computeroutput>,
   1411     <computeroutput>VgT_Left2</computeroutput> and
   1412     <computeroutput>VgT_Left1</computeroutput>.  These are used
   1413     to simulate the worst-case effects of carry propagation in
   1414     adds and subtracts.  They return a V vector identical to the
   1415     original, except that if the original contained any undefined
   1416     bits, then the lowest such bit and all bits above it are
   1417     marked as undefined.  Hence the Left bit in the names.</para></formalpara>
   1418   </listitem>
   1419 
   1420   <listitem>
   1421     <formalpara>
   1422     <title>(UNARY) Signed and unsigned value widening:</title> 
   1423     <para><computeroutput>VgT_SWiden14</computeroutput>,
   1424     <computeroutput>VgT_SWiden24</computeroutput>,
   1425     <computeroutput>VgT_SWiden12</computeroutput>,
   1426     <computeroutput>VgT_ZWiden14</computeroutput>,
   1427     <computeroutput>VgT_ZWiden24</computeroutput> and
   1428     <computeroutput>VgT_ZWiden12</computeroutput>.  These mimic
   1429     the definedness effects of standard signed and unsigned
   1430     integer widening.  Unsigned widening creates zero bits in the
   1431     new positions, so
   1432     <computeroutput>VgT_ZWiden*</computeroutput> accordingly
   1433     mark those parts of their argument as defined.  Signed
   1434     widening copies the sign bit into the new positions, so
   1435     <computeroutput>VgT_SWiden*</computeroutput> copies the
   1436     definedness of the sign bit into the new positions.  Because
   1437     1 means undefined and 0 means defined, these operations can
   1438     (fascinatingly) be done by the same operations which they
   1439     mimic.  Go figure.</para>
   1440     </formalpara>
   1441   </listitem>
   1442 
   1443   <listitem>
   1444     <formalpara>
   1445     <title>(BINARY) Undefined-if-either-Undefined,
   1446     Defined-if-either-Defined:</title>
   1447     <para><computeroutput>VgT_UifU4</computeroutput>,
   1448     <computeroutput>VgT_UifU2</computeroutput>,
   1449     <computeroutput>VgT_UifU1</computeroutput>,
   1450     <computeroutput>VgT_UifU0</computeroutput>,
   1451     <computeroutput>VgT_DifD4</computeroutput>,
   1452     <computeroutput>VgT_DifD2</computeroutput>,
   1453     <computeroutput>VgT_DifD1</computeroutput>.  These do simple
   1454     bitwise operations on pairs of V-bit vectors, with
   1455     <computeroutput>UifU</computeroutput> giving undefined if
   1456     either arg bit is undefined, and
   1457     <computeroutput>DifD</computeroutput> giving defined if
   1458     either arg bit is defined.  Abstract interpretation junkies,
   1459     if any make it this far, may like to think of them as meets
   1460     and joins (or is it joins and meets) in the definedness
   1461     lattices.</para>
   1462     </formalpara>
   1463   </listitem>
   1464 
   1465   <listitem>
   1466     <formalpara>
   1467     <title>(BINARY; one value, one V bits) Generate argument
   1468     improvement terms for AND and OR</title>
   1469     <para><computeroutput>VgT_ImproveAND4_TQ</computeroutput>,
   1470     <computeroutput>VgT_ImproveAND2_TQ</computeroutput>,
   1471     <computeroutput>VgT_ImproveAND1_TQ</computeroutput>,
   1472     <computeroutput>VgT_ImproveOR4_TQ</computeroutput>,
   1473     <computeroutput>VgT_ImproveOR2_TQ</computeroutput>,
   1474     <computeroutput>VgT_ImproveOR1_TQ</computeroutput>.  These
   1475     help out with AND and OR operations.  AND and OR have the
   1476     inconvenient property that the definedness of the result
   1477     depends on the actual values of the arguments as well as
   1478     their definedness.  At the bit level:</para></formalpara>
   1479 <programlisting><![CDATA[
   1480 1 AND undefined = undefined, but
   1481 0 AND undefined = 0, and
   1482 similarly 
   1483 0 OR undefined = undefined, but
   1484 1 OR undefined = 1.]]></programlisting>
   1485     
   1486     <para>It turns out that gcc (quite legitimately) generates
   1487     code which relies on this fact, so we have to model it
   1488     properly in order to avoid flooding users with spurious value
   1489     errors.  The ultimate definedness result of AND and OR is
   1490     calculated using <computeroutput>UifU</computeroutput> on the
   1491     definedness of the arguments, but we also
   1492     <computeroutput>DifD</computeroutput> in some "improvement"
   1493     terms which take into account the above phenomena.</para>
   1494 
   1495     <para><computeroutput>ImproveAND</computeroutput> takes as
   1496     its arguments the actual value of an argument to AND
   1497     (the T) and the definedness of that argument (the Q), and
   1498     returns a V-bit vector which is defined (0) for bits which
   1499     have value 0 and are defined.  This, when combined by
   1500     <computeroutput>DifD</computeroutput> into the final result,
   1501     causes those bits to be defined even if the corresponding bit
   1502     in the other argument is undefined.</para>
   1503 
   1504     <para>The <computeroutput>ImproveOR</computeroutput> ops do
   1505     the dual thing for OR arguments.  Note that XOR does not have
   1506     this property that one argument can make the other
   1507     irrelevant, so there is no need for such complexity for
   1508     XOR.</para>
   1509   </listitem>
   1510 
   1511 </itemizedlist>
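<para>The unary and simple binary tag ops in the list above can be
sketched in C as follows.  This is an illustration of the semantics
described in the text, using hypothetical lowercase function names and
32-bit unsigned values to hold V vectors; it is not the code Valgrind
actually emits.</para>
<programlisting><![CDATA[
#include <stdint.h>

/* Convention from the text: V bit 1 = undefined, 0 = defined. */

/* PCast40: 32 V bits down to 1; undefined if any source bit is. */
static uint32_t pcast40(uint32_t v) { return v != 0 ? 1u : 0u; }

/* PCast04: 1 V bit up to 32; copy it into every bit position. */
static uint32_t pcast04(uint32_t v) { return (v & 1u) ? 0xFFFFFFFFu : 0u; }

/* PCast11: 8 bits to 8 bits; any undefined bit infects all eight. */
static uint32_t pcast11(uint32_t v) { return (v & 0xFFu) ? 0xFFu : 0u; }

/* Left4: the lowest undefined bit and everything above it become
   undefined.  v | -v sets exactly those bits. */
static uint32_t left4(uint32_t v) { return v | (0u - v); }

/* ZWiden14: the 24 new upper positions are zeroes, hence defined. */
static uint32_t zwiden14(uint32_t v) { return v & 0xFFu; }

/* SWiden14: the definedness of bit 7 is copied into the new upper
   positions, exactly as sign extension would copy the sign bit. */
static uint32_t swiden14(uint32_t v)
{
    uint32_t low = v & 0xFFu;
    return (low & 0x80u) ? (low | 0xFFFFFF00u) : low;
}

/* UifU4: undefined if either is undefined.  With 1 = undefined
   this is a bitwise OR. */
static uint32_t uifu4(uint32_t a, uint32_t b) { return a | b; }

/* DifD4: defined if either is defined.  With 0 = defined this is
   a bitwise AND. */
static uint32_t difd4(uint32_t a, uint32_t b) { return a & b; }
]]></programlisting>
<para>Note how the widening ops are literally the value operations
they mimic, applied to V bits, which is the point made in the
widening item above.</para>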
   1512 
   1513 <para>That's all the tag ops.  If you stare at this long enough,
   1514 and then run Valgrind and stare at the pre- and post-instrumented
   1515 ucode, it should be fairly obvious how the instrumentation
   1516 machinery hangs together.</para>
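<para>As a worked instance of the AND/OR scheme described above, the
following C sketch computes result V bits for 32-bit AND and OR.  The
formulas for the improvement terms are my reading of the prose
(a term is defined exactly where the argument bit is itself defined
and has the value that forces the result); the function names are
hypothetical.</para>
<programlisting><![CDATA[
#include <stdint.h>

/* t = actual value of one argument, q = its V bits (1 = undefined). */

/* Defined (0) only where the argument bit is 0 and defined. */
static uint32_t improve_and4_tq(uint32_t t, uint32_t q) { return t | q; }

/* Dual: defined (0) only where the argument bit is 1 and defined. */
static uint32_t improve_or4_tq(uint32_t t, uint32_t q) { return ~t | q; }

/* V bits of (a AND b): start from UifU of the argument V bits,
   then DifD in one improvement term per argument. */
static uint32_t and4_vbits(uint32_t a, uint32_t qa,
                           uint32_t b, uint32_t qb)
{
    uint32_t naive = qa | qb;                       /* UifU */
    return naive & improve_and4_tq(a, qa)           /* DifD */
                 & improve_and4_tq(b, qb);          /* DifD */
}

/* V bits of (a OR b), using the dual improvement terms. */
static uint32_t or4_vbits(uint32_t a, uint32_t qa,
                          uint32_t b, uint32_t qb)
{
    uint32_t naive = qa | qb;                       /* UifU */
    return naive & improve_or4_tq(a, qa)            /* DifD */
                 & improve_or4_tq(b, qb);           /* DifD */
}
]]></programlisting>
<para>For example, ANDing a fully defined <computeroutput>0x0F</computeroutput>
with a value whose low byte is undefined leaves only the low four
bits undefined: the defined zero bits of the first argument force the
remaining result bits to a known 0.</para>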
   1517 
   1518 <para>One point, if you do this: in order to make it easy to
   1519 differentiate <computeroutput>TempReg</computeroutput>s carrying
   1520 values from <computeroutput>TempReg</computeroutput>s carrying V
   1521 bit vectors, Valgrind prints the former as (for example)
   1522 <computeroutput>t28</computeroutput> and the latter as
   1523 <computeroutput>q28</computeroutput>; the fact that they carry
   1524 the same number serves to indicate their relationship.  This is
   1525 purely for the convenience of the human reader; the register
   1526 allocator and code generator don't regard them as
   1527 different.</para>
   1528 
   1529 </sect2>
   1530 
   1531 
   1532 
   1533 <sect2 id="mc-tech-docs.trans" xreflabel="Translation into UCode">
   1534 <title>Translation into UCode</title>
   1535 
   1536 <para><computeroutput>VG_(disBB)</computeroutput> allocates a new
   1537 <computeroutput>UCodeBlock</computeroutput> and then uses
   1538 <computeroutput>disInstr</computeroutput> to translate x86
   1539 instructions one at a time into UCode, dumping the result in the
   1540 <computeroutput>UCodeBlock</computeroutput>.  This goes on until
   1541 a control-flow transfer instruction is encountered.</para>
   1542 
   1543 <para>Despite the large size of
   1544 <filename>vg_to_ucode.c</filename>, this translation is really
   1545 very simple.  Each x86 instruction is translated entirely
   1546 independently of its neighbours, merrily allocating new
   1547 <computeroutput>TempReg</computeroutput>s as it goes.  The idea
   1548 is to have a simple translator -- in reality, no more than a
   1549 macro-expander -- and the resulting bad UCode translation is
   1550 cleaned up by the UCode optimisation phase which follows.  To
   1551 give you an idea of some x86 instructions and their translations
   1552 (this is a complete basic block, as Valgrind sees it):</para>
   1553 <programlisting><![CDATA[
   1554 0x40435A50:  incl %edx
   1555      0: GETL      %EDX, t0
   1556      1: INCL      t0  (-wOSZAP)
   1557      2: PUTL      t0, %EDX
   1558 
   1559 0x40435A51:  movsbl (%edx),%eax
   1560      3: GETL      %EDX, t2
   1561      4: LDB       (t2), t2
   1562      5: WIDENL_Bs t2
   1563      6: PUTL      t2, %EAX
   1564 
   1565 0x40435A54:  testb $0x20, 1(%ecx,%eax,2)
   1566      7: GETL      %EAX, t6
   1567      8: GETL      %ECX, t8
   1568      9: LEA2L     1(t8,t6,2), t4
   1569     10: LDB       (t4), t10
   1570     11: MOVB      $0x20, t12
   1571     12: ANDB      t12, t10  (-wOSZACP)
   1572     13: INCEIPo   $9
   1573 
   1574 0x40435A59:  jnz-8 0x40435A50
   1575     14: Jnzo      $0x40435A50  (-rOSZACP)
   1576     15: JMPo      $0x40435A5B]]></programlisting>
   1577 
   1578 <para>Notice how the block always ends with an unconditional jump
   1579 to the next block.  This is a bit unnecessary, but makes many
   1580 things simpler.</para>
   1581 
   1582 <para>Most x86 instructions turn into sequences of
   1583 <computeroutput>GET</computeroutput>,
   1584 <computeroutput>PUT</computeroutput>,
   1585 <computeroutput>LEA1</computeroutput>,
   1586 <computeroutput>LEA2</computeroutput>,
   1587 <computeroutput>LOAD</computeroutput> and
   1588 <computeroutput>STORE</computeroutput>.  Some complicated ones
   1589 however rely on calling helper bits of code in
   1590 <filename>vg_helpers.S</filename>.  The ucode instructions
   1591 <computeroutput>PUSH</computeroutput>,
   1592 <computeroutput>POP</computeroutput>,
   1593 <computeroutput>CALL</computeroutput>,
   1594 <computeroutput>CALLM_S</computeroutput> and
   1595 <computeroutput>CALLM_E</computeroutput> support this.  The
   1596 calling convention is somewhat ad-hoc and is not the C calling
   1597 convention.  The helper routines must save all integer registers,
   1598 and the flags, that they use.  Args are passed on the stack
   1599 underneath the return address, as usual, and if result(s) are to
   1600 be returned, it (they) are either placed in dummy arg slots
   1601 created by the ucode <computeroutput>PUSH</computeroutput>
   1602 sequence, or just overwrite the incoming args.</para>
   1603 
   1604 <para>In order that the instrumentation mechanism can handle
   1605 calls to these helpers,
   1606 <computeroutput>VG_(saneUCodeBlock)</computeroutput> enforces the
   1607 following restrictions on calls to helpers:</para>
   1608 
   1609 <itemizedlist>
   1610 
   1611   <listitem>
   1612     <para>Each <computeroutput>CALL</computeroutput> uinstr must
   1613     be bracketed by a preceding
   1614     <computeroutput>CALLM_S</computeroutput> marker (dummy
   1615     uinstr) and a trailing
   1616     <computeroutput>CALLM_E</computeroutput> marker.  These
   1617     markers are used by the instrumentation mechanism later to
   1618     establish the boundaries of the
   1619     <computeroutput>PUSH</computeroutput>,
   1620     <computeroutput>POP</computeroutput> and
   1621     <computeroutput>CLEAR</computeroutput> sequences for the
   1622     call.</para>
   1623   </listitem>
   1624 
   1625   <listitem>
   1626     <para><computeroutput>PUSH</computeroutput>,
   1627     <computeroutput>POP</computeroutput> and
   1628     <computeroutput>CLEAR</computeroutput> may only appear inside
   1629     sections bracketed by
   1630     <computeroutput>CALLM_S</computeroutput> and
   1631     <computeroutput>CALLM_E</computeroutput>, and nowhere else.</para>
   1632   </listitem>
   1633 
   1634   <listitem>
   1635     <para>In any such bracketed section, no two
   1636     <computeroutput>PUSH</computeroutput> insns may push the same
   1637     <computeroutput>TempReg</computeroutput>.  Dually, no two
   1638     <computeroutput>POP</computeroutput>s may pop the same
   1639     <computeroutput>TempReg</computeroutput>.</para>
   1640   </listitem>
   1641 
   1642   <listitem>
   1643     <para>Finally, although this is not checked, args should be
   1644     removed from the stack with
   1645     <computeroutput>CLEAR</computeroutput>, rather than
   1646     <computeroutput>POP</computeroutput>s into a
   1647     <computeroutput>TempReg</computeroutput> which is not
   1648     subsequently used.  This is because the instrumentation
   1649     mechanism assumes that all values
   1650     <computeroutput>POP</computeroutput>ped from the stack are
   1651     actually used.</para>
   1652   </listitem>
   1653 
   1654 </itemizedlist>
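<para>The checkable restrictions in the list above amount to a simple
linear scan.  The following C sketch shows one way such a check could
look.  The types, opcode names and the fixed
<computeroutput>MAX_TEMPS</computeroutput> bound are hypothetical
simplifications; this is not the real
<computeroutput>VG_(saneUCodeBlock)</computeroutput>.</para>
<programlisting><![CDATA[
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily simplified uinstr representation. */
typedef enum {
    U_OTHER, U_CALLM_S, U_CALLM_E, U_PUSH, U_POP, U_CLEAR, U_CALL
} UOp;

typedef struct {
    UOp op;
    int temp;      /* TempReg number; only meaningful for PUSH/POP */
} UInstr;

enum { MAX_TEMPS = 64 };   /* sketch-only bound on TempReg numbers */

/* Returns 1 if the helper-call restrictions hold, 0 otherwise. */
static int helper_calls_sane(const UInstr *code, size_t n)
{
    int inside = 0;                      /* between CALLM_S and CALLM_E? */
    unsigned char pushed[MAX_TEMPS];
    unsigned char popped[MAX_TEMPS];
    size_t i;

    memset(pushed, 0, sizeof pushed);
    memset(popped, 0, sizeof popped);

    for (i = 0; i < n; i++) {
        const UInstr *u = &code[i];
        switch (u->op) {
        case U_CALLM_S:
            if (inside) return 0;        /* no nested sections */
            inside = 1;
            memset(pushed, 0, sizeof pushed);
            memset(popped, 0, sizeof popped);
            break;
        case U_CALLM_E:
            if (!inside) return 0;       /* unmatched end marker */
            inside = 0;
            break;
        case U_CALL:
        case U_CLEAR:
            if (!inside) return 0;       /* only legal when bracketed */
            break;
        case U_PUSH:
        case U_POP: {
            unsigned char *seen = (u->op == U_PUSH) ? pushed : popped;
            if (!inside) return 0;
            if (u->temp < 0 || u->temp >= MAX_TEMPS) return 0;
            if (seen[u->temp]) return 0; /* same TempReg used twice */
            seen[u->temp] = 1;
            break;
        }
        default:
            break;
        }
    }
    return !inside;                      /* every section must be closed */
}
]]></programlisting>
<para>The last restriction in the list (prefer
<computeroutput>CLEAR</computeroutput> over a
<computeroutput>POP</computeroutput> into an unused
<computeroutput>TempReg</computeroutput>) is, as the text says, not
checked, and the sketch does not check it either.</para>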
   1655 
   1656 <para>Some of the translations may appear to have redundant
   1657 <computeroutput>TempReg</computeroutput>-to-<computeroutput>TempReg</computeroutput>
   1658 moves.  This helps the next phase, UCode optimisation, to
   1659 generate better code.</para>
   1660 
   1661 </sect2>
   1662 
   1663 
   1664 
   1665 <sect2 id="mc-tech-docs.optim" xreflabel="UCode optimisation">
   1666 <title>UCode optimisation</title>
   1667 
   1668 <para>UCode is then subjected to an improvement pass
   1669 (<computeroutput>vg_improve()</computeroutput>), which blurs the
   1670 boundaries between the translations of the original x86
   1671 instructions.  It's pretty straightforward.  Three
   1672 transformations are done:</para>
   1673 
   1674 <itemizedlist>
   1675 
   1676   <listitem>
   1677     <para>Redundant <computeroutput>GET</computeroutput>
   1678     elimination.  Actually, more general than that -- eliminates
   1679     redundant fetches of ArchRegs.  In our running example,
   1680     uinstr 3 <computeroutput>GET</computeroutput>s
   1681     <computeroutput>%EDX</computeroutput> into
   1682     <computeroutput>t2</computeroutput> despite the fact that, by
   1683     looking at the previous uinstr, it is already in
   1684     <computeroutput>t0</computeroutput>.  The
   1685     <computeroutput>GET</computeroutput> is therefore removed,
   1686     and <computeroutput>t2</computeroutput> renamed to
   1687     <computeroutput>t0</computeroutput>.  Assuming
   1688     <computeroutput>t0</computeroutput> is allocated to a host
   1689     register, it means the simulated
   1690     <computeroutput>%EDX</computeroutput> will exist in a host
   1691     CPU register for more than one simulated x86 instruction,
   1692     which seems to me to be a highly desirable property.</para>
   1693 
   1694     <para>There is some mucking around to do with subregisters;
   1695     <computeroutput>%AL</computeroutput> vs
   1696     <computeroutput>%AH</computeroutput> vs
   1697     <computeroutput>%AX</computeroutput> vs
   1698     <computeroutput>%EAX</computeroutput> etc.  I can't remember
   1699     how it works, but in general we are very conservative, and
   1700     these tend to invalidate the caching.</para>
   1701   </listitem>
   1702 
   1703   <listitem>
   1704     <para>Redundant <computeroutput>PUT</computeroutput>
   1705     elimination.  This annuls
   1706     <computeroutput>PUT</computeroutput>s of values back to
   1707     simulated CPU registers if a later
   1708     <computeroutput>PUT</computeroutput> would overwrite the
   1709     earlier <computeroutput>PUT</computeroutput> value, and there
   1710     are no intervening reads of the simulated register
   1711     (<computeroutput>ArchReg</computeroutput>).</para>
   1712 
   1713     <para>As before, we are paranoid when faced with subregister
   1714     references.  Also, <computeroutput>PUT</computeroutput>s of
   1715     <computeroutput>%ESP</computeroutput> are never annulled,
   1716     because it is vital the instrumenter always has an up-to-date
   1717     <computeroutput>%ESP</computeroutput> value available, since
   1718     <computeroutput>%ESP</computeroutput> changes affect
   1719     addressability of the memory around the simulated stack
   1720     pointer.</para>
   1721 
   1722     <para>The implication of the above paragraph is that the
   1723     simulated machine's registers are only lazily updated once
   1724     the above two optimisation phases have run, with the
   1725     exception of <computeroutput>%ESP</computeroutput>.
   1726     <computeroutput>TempReg</computeroutput>s go dead at the end
   1727     of every basic block, from which it is inferrable that any
   1728     <computeroutput>TempReg</computeroutput> caching a simulated
   1729     CPU reg is flushed (back into the relevant
   1730     <computeroutput>VG_(baseBlock)</computeroutput> slot) at the
   1731     end of every basic block.  The further implication is that
   1732     the simulated registers are only up-to-date in between
   1733     basic blocks, and not at arbitrary points inside basic
   1734     blocks.  And the consequence of that is that we can only
   1735     deliver signals to the client in between basic blocks.  None
   1736     of this seems any problem in practice.</para>
   1737   </listitem>
   1738 
   1739   <listitem>
   1740     <para>Finally there is a simple def-use thing for condition
   1741     codes.  If an earlier uinstr writes the condition codes, and
   1742     the next uinsn along which actually cares about the condition
   1743     codes writes the same or larger set of them, but does not
   1744     read any, the earlier uinsn is marked as not writing any
   1745     condition codes.  This saves a lot of redundant cond-code
   1746     saving and restoring.</para>
   1747   </listitem>
   1748 
   1749 </itemizedlist>
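<para>Redundant <computeroutput>PUT</computeroutput> elimination, as
described in the list above, is naturally a backwards scan over the
block.  The following C sketch ignores subregisters entirely and
treats any jump as a potential block exit that needs all simulated
registers up to date; both are assumptions of the sketch, and all
type and function names are hypothetical.</para>
<programlisting><![CDATA[
#include <stddef.h>

typedef enum { U_OTHER, U_GET, U_PUT, U_JMP, U_NOP } UOp;

enum { R_EAX, R_ECX, R_EDX, R_EBX, R_ESP, R_EBP, R_ESI, R_EDI,
       N_ARCHREGS };

typedef struct {
    UOp op;
    int archreg;   /* only meaningful for GET and PUT */
} UInstr;

/* Turns redundant PUTs into NOPs; returns how many were annulled. */
static size_t annul_redundant_puts(UInstr *code, size_t n)
{
    /* overwritten[r] != 0 means: scanning forward from here, r is
       PUT again before anything reads it. */
    int overwritten[N_ARCHREGS] = { 0 };
    size_t annulled = 0;
    size_t i = n;
    int r;

    while (i-- > 0) {
        UInstr *u = &code[i];
        if (u->op == U_JMP) {
            /* Control may leave the block: every register is live. */
            for (r = 0; r < N_ARCHREGS; r++)
                overwritten[r] = 0;
        } else if (u->op == U_GET) {
            overwritten[u->archreg] = 0;      /* an intervening read */
        } else if (u->op == U_PUT) {
            if (overwritten[u->archreg] && u->archreg != R_ESP) {
                u->op = U_NOP;                /* a later PUT wins */
                annulled++;
            } else {
                overwritten[u->archreg] = 1;
            }
        }
    }
    return annulled;
}
]]></programlisting>
<para>Starting with every register marked live at the end of the
block is what guarantees the flush of cached registers at block
boundaries mentioned above.</para>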
   1750 
   1751 <para>The effect of these transformations on our short block is
   1752 rather unexciting, and shown below.  On longer basic blocks they
   1753 can dramatically improve code quality.</para>
   1754 
   1755 <programlisting><![CDATA[
   1756 at 3: delete GET, rename t2 to t0 in (4 .. 6)
   1757 at 7: delete GET, rename t6 to t0 in (8 .. 9)
   1758 at 1: annul flag write OSZAP due to later OSZACP
   1759 
   1760 Improved code:
   1761      0: GETL      %EDX, t0
   1762      1: INCL      t0
   1763      2: PUTL      t0, %EDX
   1764      4: LDB       (t0), t0
   1765      5: WIDENL_Bs t0
   1766      6: PUTL      t0, %EAX
   1767      8: GETL      %ECX, t8
   1768      9: LEA2L     1(t8,t0,2), t4
   1769     10: LDB       (t4), t10
   1770     11: MOVB      $0x20, t12
   1771     12: ANDB      t12, t10  (-wOSZACP)
   1772     13: INCEIPo   $9
   1773     14: Jnzo      $0x40435A50  (-rOSZACP)
   1774     15: JMPo      $0x40435A5B]]></programlisting>
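<para>The condition-code transformation reported in the first part of
the listing above ("annul flag write OSZAP due to later OSZACP") can
be sketched in C as follows.  Each uinstr is reduced to the set of
flags it reads and the set it writes; the representation and names
are hypothetical.</para>
<programlisting><![CDATA[
#include <stddef.h>

enum { F_O = 1, F_S = 2, F_Z = 4, F_A = 8, F_C = 16, F_P = 32 };

typedef struct {
    unsigned reads;    /* condition codes this uinstr reads  */
    unsigned writes;   /* condition codes this uinstr writes */
} FlagUse;

/* For each flag-writing uinstr, find the next uinstr that cares
   about the flags at all.  If that one reads none and writes at
   least the same set, the earlier write is dead and is annulled.
   Returns the number of uinstrs whose flag writes were removed. */
static size_t annul_dead_flag_writes(FlagUse *code, size_t n)
{
    size_t annulled = 0;
    size_t i, j;

    for (i = 0; i < n; i++) {
        if (code[i].writes == 0)
            continue;
        for (j = i + 1; j < n; j++)
            if (code[j].reads != 0 || code[j].writes != 0)
                break;
        if (j == n)
            continue;          /* flags stay live at the block end */
        if (code[j].reads == 0 &&
            (code[i].writes & ~code[j].writes) == 0) {
            code[i].writes = 0;
            annulled++;
        }
    }
    return annulled;
}
]]></programlisting>
<para>Run on the flag behaviour of the running example, this removes
the flag write of the <computeroutput>INCL</computeroutput> and keeps
that of the <computeroutput>ANDB</computeroutput>, whose result the
conditional jump reads.</para>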
   1775 
   1776 </sect2>
   1777 
   1778 
   1779 
   1780 <sect2 id="mc-tech-docs.instrum" xreflabel="UCode instrumentation">
   1781 <title>UCode instrumentation</title>
   1782 
   1783 <para>Once you understand the meaning of the instrumentation
   1784 uinstrs, discussed in detail above, the instrumentation scheme is
   1785 fairly straightforward.  Each uinstr is instrumented in
   1786 isolation, and the instrumentation uinstrs are placed before the
   1787 original uinstr.  Our running example continues below.  I have
   1788 placed a blank line after every original ucode, to make it easier
   1789 to see which instrumentation uinstrs correspond to which
   1790 originals.</para>
   1791 
   1792 <para>As mentioned somewhere above,
   1793 <computeroutput>TempReg</computeroutput>s carrying values have
   1794 names like <computeroutput>t28</computeroutput>, and each one has
   1795 a shadow carrying its V bits, with names like
   1796 <computeroutput>q28</computeroutput>.  This pairing aids in
   1797 reading instrumented ucode.</para>
   1798 
   1799 <para>One decision about all this is where to have "observation
   1800 points", that is, where to check that V bits are valid.  I use a
   1801 minimalistic scheme, only checking where a failure of validity
   1802 could cause the original program to (seg)fault.  So the use of
   1803 values as memory addresses causes a check, as do conditional
   1804 jumps (these cause a check on the definedness of the condition
   1805 codes).  And arguments <computeroutput>PUSH</computeroutput>ed
   1806 for helper calls are checked, hence the weird restrictions on
   1807 helper call preambles described above.</para>
   1808 
   1809 <para>Another decision is that once a value is tested, it is
   1810 thereafter regarded as defined, so that we do not emit multiple
   1811 undefined-value errors for the same undefined value.  That means
   1812 that <computeroutput>TESTV</computeroutput> uinstrs are always
   1813 followed by <computeroutput>SETV</computeroutput> on the same
   1814 (shadow) <computeroutput>TempReg</computeroutput>s.  Most of
   1815 these <computeroutput>SETV</computeroutput>s are redundant and
   1816 are removed by the post-instrumentation cleanup phase.</para>
   1817 
   1818 <para>The instrumentation for calling helper functions deserves
   1819 further comment.  The definedness of results from a helper is
   1820 modelled using just one V bit.  So, in short, we do pessimising
   1821 casts of the definedness of all the args, down to a single bit,
   1822 and then <computeroutput>UifU</computeroutput> these bits
   1823 together.  So this single V bit will say "undefined" if any part
   1824 of any arg is undefined.  This V bit is then pessimally cast back
   1825 up to the result(s) sizes, as needed.  If, by seeing that all the
   1826 args are got rid of with <computeroutput>CLEAR</computeroutput>
   1827 and none with <computeroutput>POP</computeroutput>, Valgrind sees
   1828 that the result of the call is not actually used, it immediately
   1829 examines the result V bit with a
   1830 <computeroutput>TESTV</computeroutput> --
   1831 <computeroutput>SETV</computeroutput> pair.  If it did not do
   1832 this, there would be no observation point to detect that some
   1833 of the args to the helper were undefined.  Of course, if the
   1834 helper's results are indeed used, we don't do this, since the
   1835 result usage will presumably cause the result definedness to be
   1836 checked at some suitable future point.</para>
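<para>The helper-call approximation described above can be sketched
in C as follows: each argument's V bits are pessimised to one bit,
the bits are combined with <computeroutput>UifU</computeroutput>, and
the summary bit is cast back up to the result size.  The
<computeroutput>ArgShadow</computeroutput> type and the function
names are hypothetical, and only the low
<computeroutput>size</computeroutput> bytes of each V vector are
assumed to be meaningful.</para>
<programlisting><![CDATA[
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t vbits;   /* V bits of the argument, 1 = undefined */
    int      size;    /* operand size in bytes: 1, 2 or 4      */
} ArgShadow;

static uint32_t size_mask(int size)
{
    switch (size) {
    case 1:  return 0x000000FFu;
    case 2:  return 0x0000FFFFu;
    default: return 0xFFFFFFFFu;   /* size 4 */
    }
}

/* Pessimising cast of every argument down to one V bit, then UifU
   of all those bits: 1 if any part of any argument is undefined. */
static uint32_t helper_args_vbit(const ArgShadow *args, size_t n)
{
    uint32_t summary = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        uint32_t one = (args[i].vbits & size_mask(args[i].size)) ? 1u : 0u;
        summary |= one;                               /* UifU */
    }
    return summary;
}

/* Pessimising cast of the summary bit back up to a result size. */
static uint32_t helper_result_vbits(uint32_t summary, int result_size)
{
    return (summary & 1u) ? size_mask(result_size) : 0u;
}
]]></programlisting>
<para>When the call's results are discarded, it is this summary bit
that would be examined by the
<computeroutput>TESTV</computeroutput> and
<computeroutput>SETV</computeroutput> pair mentioned above.</para>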
   1837 
   1838 <para>In general Valgrind tries to track definedness on a
   1839 bit-for-bit basis, but as the above para shows, for calls to
   1840 helpers we throw in the towel and approximate down to a single
   1841 bit.  This is because it's too complex and difficult to track
   1842 bit-level definedness through complex ops such as integer
   1843 multiply and divide, and in any case there are no reasonable code
   1844 fragments which attempt to (eg) multiply two partially-defined
   1845 values and end up with something meaningful, so there seems
   1846 little point in modelling multiplies, divides, etc, in that level
   1847 of detail.</para>
   1848 
   1849 <para>Integer loads and stores are instrumented with firstly a
   1850 test of the definedness of the address, followed by a
   1851 <computeroutput>LOADV</computeroutput> or
   1852 <computeroutput>STOREV</computeroutput> respectively.  These turn
   1853 into calls to (for example)
   1854 <computeroutput>VG_(helperc_LOADV4)</computeroutput>.  These
   1855 helpers do two things: they perform an address-valid check, and
   1856 they load or store V bits from/to the relevant address in the
   1857 (simulated V-bit) memory.</para>
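<para>To illustrate the two jobs these helpers do, here is a toy C
model of a shadow memory with one A bit and eight V bits per byte.
Everything here is hypothetical: the flat 256-byte layout, the names,
the error counter standing in for a real error report, and the choice
to hand back "defined" V bits for bytes at invalid addresses.  The
real helpers are organised quite differently.</para>
<programlisting><![CDATA[
#include <stddef.h>
#include <stdint.h>

enum { SHADOW_BYTES = 256 };           /* toy address space */

typedef struct {
    uint8_t  abit[SHADOW_BYTES];       /* 1 = invalid address, 0 = valid */
    uint8_t  vbyte[SHADOW_BYTES];      /* 8 V bits per byte, 1 = undefined */
    unsigned addr_errors;              /* stands in for error reporting */
} Shadow;

/* Start with everything unaddressable and undefined. */
static void shadow_init(Shadow *s)
{
    size_t i;
    for (i = 0; i < SHADOW_BYTES; i++) {
        s->abit[i]  = 1;
        s->vbyte[i] = 0xFF;
    }
    s->addr_errors = 0;
}

static int shadow_addressable(const Shadow *s, size_t a)
{
    return a < SHADOW_BYTES && s->abit[a] == 0;
}

/* LOADV-like helper: address-valid check, then fetch the V bits
   (little-endian, as on x86). */
static uint32_t loadv(Shadow *s, size_t addr, int size)
{
    uint32_t v = 0;
    int bad = 0, i;
    for (i = 0; i < size; i++) {
        if (!shadow_addressable(s, addr + i))
            bad = 1;   /* sketch choice: report, leave byte "defined" */
        else
            v |= (uint32_t)s->vbyte[addr + i] << (8 * i);
    }
    if (bad)
        s->addr_errors++;
    return v;
}

/* STOREV-like helper: address-valid check, then write the V bits. */
static void storev(Shadow *s, size_t addr, int size, uint32_t v)
{
    int bad = 0, i;
    for (i = 0; i < size; i++) {
        if (!shadow_addressable(s, addr + i))
            bad = 1;
        else
            s->vbyte[addr + i] = (uint8_t)(v >> (8 * i));
    }
    if (bad)
        s->addr_errors++;
}
]]></programlisting>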
   1858 
   1859 <para>FPU loads and stores are different.  As above the
   1860 definedness of the address is first tested.  However, the helper
   1861 routine for FPU loads
   1862 (<computeroutput>VGM_(fpu_read_check)</computeroutput>) emits an
   1863 error if either the address is invalid or the referenced area
   1864 contains undefined values.  It has to do this because we do not
   1865 simulate the FPU at all, and so cannot track definedness of
   1866 values loaded into it from memory, so we have to check them as
   1867 soon as they are loaded into the FPU, ie, at this point.  We
   1868 notionally assume that everything in the FPU is defined.</para>
   1869 
   1870 <para>It follows therefore that FPU writes first check the
   1871 definedness of the address, then the validity of the address, and
   1872 finally mark the written bytes as well-defined.</para>
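<para>The FPU read and write checks described above can be sketched
on a toy shadow memory in C.  As before, this is only an illustration
under stated assumptions: the flat layout, the error counters and the
names are hypothetical, and reporting an address error in preference
to a value error is a choice made by the sketch.</para>
<programlisting><![CDATA[
#include <stddef.h>
#include <stdint.h>

enum { SHADOW_BYTES = 256 };           /* toy address space */

typedef struct {
    uint8_t  abit[SHADOW_BYTES];       /* 1 = invalid address, 0 = valid */
    uint8_t  vbyte[SHADOW_BYTES];      /* 8 V bits per byte, 1 = undefined */
    unsigned addr_errors;
    unsigned value_errors;
} Shadow;

/* Start with everything unaddressable and undefined. */
static void shadow_init(Shadow *s)
{
    size_t i;
    for (i = 0; i < SHADOW_BYTES; i++) {
        s->abit[i]  = 1;
        s->vbyte[i] = 0xFF;
    }
    s->addr_errors = 0;
    s->value_errors = 0;
}

/* FPU load: since FPU contents are not tracked, the data must be
   checked for definedness right here, at the load. */
static void fpu_read_check(Shadow *s, size_t addr, size_t size)
{
    int bad_addr = 0, undefined = 0;
    size_t i;
    for (i = 0; i < size; i++) {
        size_t a = addr + i;
        if (a >= SHADOW_BYTES || s->abit[a])
            bad_addr = 1;
        else if (s->vbyte[a] != 0)
            undefined = 1;
    }
    if (bad_addr)
        s->addr_errors++;
    else if (undefined)
        s->value_errors++;
}

/* FPU store: check the address, then mark the written bytes as
   defined, because everything in the FPU is assumed defined. */
static void fpu_write_check(Shadow *s, size_t addr, size_t size)
{
    int bad_addr = 0;
    size_t i;
    for (i = 0; i < size; i++) {
        size_t a = addr + i;
        if (a >= SHADOW_BYTES || s->abit[a])
            bad_addr = 1;
        else
            s->vbyte[a] = 0;
    }
    if (bad_addr)
        s->addr_errors++;
}
]]></programlisting>
<para>The check on the definedness of the address itself happens
earlier, via a <computeroutput>TESTV</computeroutput> on the address
<computeroutput>TempReg</computeroutput>, and is not modelled
here.</para>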
   1873 
   1874 <para>If anyone is inspired to extend Valgrind to MMX/SSE insns,
   1875 I suggest you use the same trick.  It works provided that the
   1876 FPU/MMX unit is not used merely as a conduit to copy partially
   1877 undefined data from one place in memory to another.
   1878 Unfortunately the integer CPU is used like that (when copying C
   1879 structs with holes, for example) and this is the cause of much of
   1880 the elaborateness of the instrumentation here described.</para>
   1881 
   1882 <para><computeroutput>vg_instrument()</computeroutput> in
   1883 <filename>vg_translate.c</filename> actually does the
   1884 instrumentation.  There are comments explaining how each uinstr
   1885 is handled, so we do not repeat that here.  As explained already,
   1886 it is bit-accurate, except for calls to helper functions.
   1887 Unfortunately the x86 insns
   1888 <computeroutput>bt/bts/btc/btr</computeroutput> are done by
   1889 helper fns, so bit-level accuracy is lost there.  This should be
   1890 fixed by doing them inline; it will probably require adding a
   1891 couple of new uinstrs.  Also, left and right rotates through the
   1892 carry flag (x86 <computeroutput>rcl</computeroutput> and
   1893 <computeroutput>rcr</computeroutput>) are approximated via a
   1894 single V bit; so far this has not caused anyone to complain.  The
   1895 non-carry rotates, <computeroutput>rol</computeroutput> and
   1896 <computeroutput>ror</computeroutput>, are much more common and
   1897 are done exactly.  Re-visiting the instrumentation for AND and
   1898 OR, they seem rather verbose, and I wonder if it could be done
   1899 more concisely now.</para>
   1900 
   1901 <para>The lowercase <computeroutput>o</computeroutput> on many of
   1902 the uopcodes in the running example indicates that the size field
   1903 is zero, usually meaning a single-bit operation.</para>
   1904 
   1905 <para>Anyroads, the post-instrumented version of our running
   1906 example looks like this:</para>
   1907 
   1908 <programlisting><![CDATA[
   1909 Instrumented code:
   1910      0: GETVL     %EDX, q0
   1911      1: GETL      %EDX, t0
   1912 
   1913      2: TAG1o     q0 = Left4 ( q0 )
   1914      3: INCL      t0
   1915 
   1916      4: PUTVL     q0, %EDX
   1917      5: PUTL      t0, %EDX
   1918 
   1919      6: TESTVL    q0
   1920      7: SETVL     q0
   1921      8: LOADVB    (t0), q0
   1922      9: LDB       (t0), t0
   1923 
   1924     10: TAG1o     q0 = SWiden14 ( q0 )
   1925     11: WIDENL_Bs t0
   1926 
   1927     12: PUTVL     q0, %EAX
   1928     13: PUTL      t0, %EAX
   1929 
   1930     14: GETVL     %ECX, q8
   1931     15: GETL      %ECX, t8
   1932 
   1933     16: MOVL      q0, q4
   1934     17: SHLL      $0x1, q4
   1935     18: TAG2o     q4 = UifU4 ( q8, q4 )
   1936     19: TAG1o     q4 = Left4 ( q4 )
   1937     20: LEA2L     1(t8,t0,2), t4
   1938 
   1939     21: TESTVL    q4
   1940     22: SETVL     q4
   1941     23: LOADVB    (t4), q10
   1942     24: LDB       (t4), t10
   1943 
   1944     25: SETVB     q12
   1945     26: MOVB      $0x20, t12
   1946 
   1947     27: MOVL      q10, q14
   1948     28: TAG2o     q14 = ImproveAND1_TQ ( t10, q14 )
   1949     29: TAG2o     q10 = UifU1 ( q12, q10 )
   1950     30: TAG2o     q10 = DifD1 ( q14, q10 )
   1951     31: MOVL      q12, q14
   1952     32: TAG2o     q14 = ImproveAND1_TQ ( t12, q14 )
   1953     33: TAG2o     q10 = DifD1 ( q14, q10 )
   1954     34: MOVL      q10, q16
   1955     35: TAG1o     q16 = PCast10 ( q16 )
   1956     36: PUTVFo    q16
   1957     37: ANDB      t12, t10  (-wOSZACP)
   1958 
   1959     38: INCEIPo   $9
   1960 
   1961     39: GETVFo    q18
   1962     40: TESTVo    q18
   1963     41: SETVo     q18
   1964     42: Jnzo      $0x40435A50  (-rOSZACP)
   1965 
   1966     43: JMPo      $0x40435A5B]]></programlisting>
   1967 
   1968 </sect2>
   1969 
   1970 
   1971 
   1972 <sect2 id="mc-tech-docs.cleanup" 
   1973        xreflabel="UCode post-instrumentation cleanup">
   1974 <title>UCode post-instrumentation cleanup</title>
   1975 
   1976 <para>This pass, coordinated by
   1977 <computeroutput>vg_cleanup()</computeroutput>, removes redundant
   1978 definedness computation created by the simplistic instrumentation
   1979 pass.  It consists of two passes,
   1980 <computeroutput>vg_propagate_definedness()</computeroutput>
   1981 followed by
   1982 <computeroutput>vg_delete_redundant_SETVs</computeroutput>.</para>
   1983 
   1984 <para><computeroutput>vg_propagate_definedness()</computeroutput>
   1985 is a simple constant-propagation and constant-folding pass.  It
   1986 tries to determine which
   1987 <computeroutput>TempReg</computeroutput>s containing V bits will
   1988 always indicate "fully defined", and it propagates this
   1989 information as far as it can, and folds out as many operations as
   1990 possible.  For example, the instrumentation for an ADD of a
   1991 literal to a variable quantity will be reduced down so that the
   1992 definedness of the result is simply the definedness of the
   1993 variable quantity, since the literal is by definition fully
   1994 defined.</para>
   1995 
   1996 <para><computeroutput>vg_delete_redundant_SETVs</computeroutput>
   1997 removes <computeroutput>SETV</computeroutput>s on shadow
   1998 <computeroutput>TempReg</computeroutput>s for which the next
   1999 action is a write.  I don't think there's anything else worth
   2000 saying about this; it is simple.  Read the sources for
   2001 details.</para>
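<para>The idea behind the second pass can be sketched in a few lines of
C.  This is a toy, not the pass itself: the instruction representation
and all <computeroutput>Toy</computeroutput> names are hypothetical, and
a <computeroutput>SETV</computeroutput> whose register is never touched
again is conservatively kept, which may differ from what the real code
does.</para>

<programlisting><![CDATA[
#include <stddef.h>

typedef enum { TOY_SETV, TOY_READ, TOY_WRITE, TOY_NOP } ToyKind;

typedef struct {
    ToyKind kind;
    int     reg;    /* the shadow TempReg this instruction touches */
} ToyInstr;

/* Turn into TOY_NOP every SETV whose register is written again
   before it is read.  Returns the number of SETVs removed. */
static size_t toy_delete_redundant_setvs(ToyInstr *code, size_t n)
{
    size_t removed = 0;
    for (size_t i = 0; i < n; i++) {
        if (code[i].kind != TOY_SETV)
            continue;
        for (size_t j = i + 1; j < n; j++) {
            if (code[j].kind == TOY_NOP || code[j].reg != code[i].reg)
                continue;
            /* The first later use of the register decides.  A SETV
               counts as a write. */
            if (code[j].kind == TOY_WRITE || code[j].kind == TOY_SETV) {
                code[i].kind = TOY_NOP;
                removed++;
            }
            break;
        }
    }
    return removed;
}
]]></programlisting>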
   2002 
   2003 <para>So the cleaned-up running example looks like this.  As
   2004 above, I have inserted line breaks after every original
   2005 (non-instrumentation) uinstr to aid readability.  As with
   2006 straightforward ucode optimisation, the results in this block are
   2007 undramatic because it is so short; longer blocks benefit more
   2008 because they have more redundancy which gets eliminated.</para>
   2009 
   2010 <programlisting><![CDATA[
   2011 at 29: delete UifU1 due to defd arg1
   2012 at 32: change ImproveAND1_TQ to MOV due to defd arg2
   2013 at 41: delete SETV
   2014 at 31: delete MOV
   2015 at 25: delete SETV
   2016 at 22: delete SETV
   2017 at 7: delete SETV
   2018 
   2019      0: GETVL     %EDX, q0
   2020      1: GETL      %EDX, t0
   2021 
   2022      2: TAG1o     q0 = Left4 ( q0 )
   2023      3: INCL      t0
   2024 
   2025      4: PUTVL     q0, %EDX
   2026      5: PUTL      t0, %EDX
   2027 
   2028      6: TESTVL    q0
   2029      8: LOADVB    (t0), q0
   2030      9: LDB       (t0), t0
   2031 
   2032     10: TAG1o     q0 = SWiden14 ( q0 )
   2033     11: WIDENL_Bs t0
   2034 
   2035     12: PUTVL     q0, %EAX
   2036     13: PUTL      t0, %EAX
   2037 
   2038     14: GETVL     %ECX, q8
   2039     15: GETL      %ECX, t8
   2040 
   2041     16: MOVL      q0, q4
   2042     17: SHLL      $0x1, q4
   2043     18: TAG2o     q4 = UifU4 ( q8, q4 )
   2044     19: TAG1o     q4 = Left4 ( q4 )
   2045     20: LEA2L     1(t8,t0,2), t4
   2046 
   2047     21: TESTVL    q4
   2048     23: LOADVB    (t4), q10
   2049     24: LDB       (t4), t10
   2050 
   2051     26: MOVB      $0x20, t12
   2052 
   2053     27: MOVL      q10, q14
   2054     28: TAG2o     q14 = ImproveAND1_TQ ( t10, q14 )
   2055     30: TAG2o     q10 = DifD1 ( q14, q10 )
   2056     32: MOVL      t12, q14
   2057     33: TAG2o     q10 = DifD1 ( q14, q10 )
   2058     34: MOVL      q10, q16
   2059     35: TAG1o     q16 = PCast10 ( q16 )
   2060     36: PUTVFo    q16
   2061     37: ANDB      t12, t10  (-wOSZACP)
   2062 
   2063     38: INCEIPo   $9
   2064     39: GETVFo    q18
   2065     40: TESTVo    q18
   2066     42: Jnzo      $0x40435A50  (-rOSZACP)
   2067 
   2068     43: JMPo      $0x40435A5B]]></programlisting>
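
<para>The first two rewrites in the log above (the deleted
<computeroutput>UifU1</computeroutput> and the
<computeroutput>ImproveAND1_TQ</computeroutput> that became a
<computeroutput>MOV</computeroutput>) can be checked with a small C
model.  It is only a sketch, and it assumes the tag operators behave as
follows, with a V bit of 0 meaning "defined":
<computeroutput>UifU</computeroutput> is bitwise OR,
<computeroutput>DifD</computeroutput> is bitwise AND, and
<computeroutput>ImproveAND_TQ</computeroutput> is value OR tags.  The
function names are hypothetical.</para>

<programlisting><![CDATA[
#include <stdint.h>

/* Assumed convention: V bit 0 = defined, 1 = undefined. */
#define TOY_ALL_DEFINED 0x00u

/* Undefined if either input is undefined. */
static uint8_t toy_uifu1(uint8_t qa, uint8_t qb) { return (uint8_t)(qa | qb); }

/* Defined if either input is defined. */
static uint8_t toy_difd1(uint8_t qa, uint8_t qb) { return (uint8_t)(qa & qb); }

/* A defined 0 bit in the value t gives a defined result bit;
   every other combination is treated as undefined. */
static uint8_t toy_improve_and1_tq(uint8_t t, uint8_t q) { return (uint8_t)(t | q); }

/* V bits of (t & lit) as first instrumented, with the literal's
   V bits (all defined) carried around explicitly. */
static uint8_t toy_and_vbits_unfolded(uint8_t t, uint8_t qt, uint8_t lit)
{
    uint8_t qlit = TOY_ALL_DEFINED;
    uint8_t q    = toy_uifu1(qlit, qt);
    q = toy_difd1(toy_improve_and1_tq(t, qt), q);
    q = toy_difd1(toy_improve_and1_tq(lit, qlit), q);
    return q;
}

/* The same after folding: the UifU with a defined argument is gone,
   and ImproveAND_TQ with defined tags is just a copy of the literal. */
static uint8_t toy_and_vbits_folded(uint8_t t, uint8_t qt, uint8_t lit)
{
    uint8_t q = toy_difd1(toy_improve_and1_tq(t, qt), qt);
    return toy_difd1(lit, q);
}
]]></programlisting>

<para>Under these assumptions both functions agree for every input, so
the folded form loses no precision.</para>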
   2069 
   2070 </sect2>
   2071 
   2072 
   2073 
   2074 <sect2 id="mc-tech-docs.transfrom" xreflabel="Translation from UCode">
   2075 <title>Translation from UCode</title>
   2076 
   2077 <para>This is all very simple, even though
   2078 <filename>vg_from_ucode.c</filename> is a big file.
   2079 Position-independent x86 code is generated into a dynamically
   2080 allocated array <computeroutput>emitted_code</computeroutput>;
   2081 this is doubled in size when it overflows.  Eventually the array
   2082 is handed back to the caller of
   2083 <computeroutput>VG_(translate)</computeroutput>, who must copy
   2084 the result into TC and TT, and free the array.</para>
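
<para>A doubling byte buffer of this kind might look as follows.  This
is a minimal sketch, not the code in
<filename>vg_from_ucode.c</filename>: the struct, the function name and
the initial size of 16 are assumptions, and it uses the C library's
<computeroutput>realloc</computeroutput> where the real code would use
its own allocator.</para>

<programlisting><![CDATA[
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t *bytes;   /* the emitted code so far */
    size_t   used;    /* number of bytes emitted */
    size_t   size;    /* current capacity */
} ToyCodeBuf;

/* Append one byte, doubling the array when it is full.
   Returns 0 on success, -1 if memory could not be allocated. */
static int toy_emit_byte(ToyCodeBuf *cb, uint8_t b)
{
    if (cb->used == cb->size) {
        size_t   new_size = cb->size ? cb->size * 2 : 16;
        uint8_t *p        = realloc(cb->bytes, new_size);
        if (p == NULL)
            return -1;
        cb->bytes = p;
        cb->size  = new_size;
    }
    cb->bytes[cb->used++] = b;
    return 0;
}
]]></programlisting>

<para>Doubling keeps the total copying cost linear in the final code
size, so emitting a byte is cheap on average however long the
translation gets.</para>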
   2085 
   2086 <para>This file is structured into four layers of abstraction,
   2087 which, thankfully, are glued back together with extensive
   2088 <computeroutput>__inline__</computeroutput> directives.  From the
   2089 bottom upwards:</para>
   2090 
   2091 <itemizedlist>
   2092 
   2093   <listitem>
   2094     <para>Address-mode emitters,
   2095     <computeroutput>emit_amode_regmem_reg</computeroutput> et
   2096     al.</para>
   2097   </listitem>
   2098 
   2099   <listitem>
   2100     <para>Emitters for specific x86 instructions.  There are
   2101     quite a lot of these, with names such as
   2102     <computeroutput>emit_movv_offregmem_reg</computeroutput>.
   2103     The <computeroutput>v</computeroutput> suffix is Intel
   2104     parlance for a 16/32 bit insn; there are also
   2105     <computeroutput>b</computeroutput> suffixes for 8 bit
   2106     insns.</para>
   2107   </listitem>
   2108 
   2109   <listitem>
   2110     <para>The next level up are the
   2111     <computeroutput>synth_*</computeroutput> functions, which
   2112     synthesise a sequence of one or more raw x86 instructions to do
   2113     some simple task.  Some of these are quite complex because
   2114     they have to work around Intel's silly restrictions on
   2115     subregister naming.  See
   2116     <computeroutput>synth_nonshiftop_reg_reg</computeroutput> for
   2117     example.</para>
   2118   </listitem>
   2119 
   2120   <listitem>
   2121     <para>Finally, at the top of the heap, we have
   2122     <computeroutput>emitUInstr()</computeroutput>, which emits
   2123     code for a single uinstr.</para>
   2124   </listitem>
   2125 
   2126 </itemizedlist>
   2127 
   2128 <para>Some comments:</para>
   2129 
   2130 <itemizedlist>
   2131 
   2132   <listitem>
   2133     <para>The hack for FPU instructions becomes apparent here.
   2134     To do a <computeroutput>FPU</computeroutput> ucode
   2135     instruction, we load the simulated FPU's state from
   2136     <computeroutput>VG_(baseBlock)</computeroutput> into the real
   2137     FPU using an x86 <computeroutput>frstor</computeroutput>
   2138     insn, do the ucode <computeroutput>FPU</computeroutput> insn
   2139     on the real CPU, and write the updated FPU state back into
   2140     <computeroutput>VG_(baseBlock)</computeroutput> using an
   2141     <computeroutput>fnsave</computeroutput> instruction.  This is
   2142     pretty brutal, but is simple and it works, and even seems
   2143     tolerably efficient.  There is no attempt to cache the
   2144     simulated FPU state in the real FPU over multiple
   2145     back-to-back ucode FPU instructions.</para>
   2146 
   2147     <para><computeroutput>FPU_R</computeroutput> and
   2148     <computeroutput>FPU_W</computeroutput> are also done this
   2149     way, with the minor complication that we need to patch in
   2150     some addressing mode bits so the resulting insn knows the
   2151     effective address to use.  This is easy because of the
   2152     regularity of the x86 FPU instruction encodings.</para>
   2153   </listitem>
   2154 
   2155   <listitem>
   2156     <para>An analogous trick is done with ucode insns which
   2157     claim, in their <computeroutput>flags_r</computeroutput> and
   2158     <computeroutput>flags_w</computeroutput> fields, that they
   2159     read or write the simulated
   2160     <computeroutput>%EFLAGS</computeroutput>.  For such cases we
   2161     first copy the simulated
   2162     <computeroutput>%EFLAGS</computeroutput> into the real
   2163     <computeroutput>%eflags</computeroutput>, then do the insn,
   2164     then, if the insn says it writes the flags, copy back to
   2165     <computeroutput>%EFLAGS</computeroutput>.  This is a bit
   2166     expensive, which is why the ucode optimisation pass goes to
   2167     some effort to remove redundant flag-update annotations.</para>
   2168   </listitem>
   2169 
   2170 </itemizedlist>
   2171 
   2172 <para>And so ... that's the end of the documentation for the
   2173 instrumenting translator!  It's really not that complex,
   2174 because it's composed as a sequence of simple(ish) self-contained
   2175 transformations on straight-line blocks of code.</para>
   2176 
   2177 </sect2>
   2178 
   2179 
   2180 
   2181 <sect2 id="mc-tech-docs.dispatch" xreflabel="Top-level dispatch loop">
   2182 <title>Top-level dispatch loop</title>
   2183 
   2184 <para>Urk.  In <computeroutput>VG_(toploop)</computeroutput>.
   2185 This is basically boring and unsurprising, not to mention fiddly
   2186 and fragile.  It needs to be cleaned up.</para>
   2187 
   2188 <para>Perhaps the only surprise is that the whole thing is run on
   2189 top of a <computeroutput>setjmp</computeroutput>-installed
   2190 exception handler, because, supposing a translation got a
   2191 segfault, we have to bail out of the Valgrind-supplied exception
   2192 handler <computeroutput>VG_(oursignalhandler)</computeroutput>
   2193 and immediately start running the client's segfault handler, if
   2194 it has one.  In particular we can't finish the current basic
   2195 block and then deliver the signal at some convenient future
   2196 point, because signals like SIGILL, SIGSEGV and SIGBUS mean that
   2197 the faulting insn should not simply be re-tried.  (I'm sure there
   2198 is a clearer way to explain this).</para>
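
<para>The bail-out pattern itself is ordinary POSIX.  The sketch below
shows only the control flow, under the assumption of a POSIX system; it
is not the code in <computeroutput>VG_(toploop)</computeroutput>, and
every <computeroutput>toy_</computeroutput> name is invented.  The
handler jumps straight back to the saved context instead of returning
to the faulting instruction.</para>

<programlisting><![CDATA[
#define _POSIX_C_SOURCE 200809L   /* for sigaction and sigsetjmp */
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf            toy_toploop_env;
static volatile sig_atomic_t toy_caught_signal;

/* Stand-in for the Valgrind-side handler: leave at once instead of
   returning to the faulting instruction. */
static void toy_fault_handler(int signo)
{
    toy_caught_signal = signo;
    siglongjmp(toy_toploop_env, 1);
}

/* Run fn() under the handler.  Returns 0 if fn() finished, otherwise
   the number of the signal that cut it short. */
static int toy_run_guarded(void (*fn)(void))
{
    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = toy_fault_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old);

    toy_caught_signal = 0;
    if (sigsetjmp(toy_toploop_env, 1) == 0)   /* 1: also save the signal mask */
        fn();

    sigaction(SIGSEGV, &old, NULL);
    return toy_caught_signal;
}

/* Two toy "translations": one faults, one does not. */
static void toy_faulting_block(void) { raise(SIGSEGV); }
static void toy_clean_block(void)    { }
]]></programlisting>

<para>Saving the signal mask in
<computeroutput>sigsetjmp</computeroutput> matters here: without it the
signal would stay blocked after the jump, and a second fault would not
reach the handler.</para>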
   2199 
   2200 </sect2>
   2201 
   2202 
   2203 
   2204 <sect2 id="mc-tech-docs.lazy" 
   2205        xreflabel="Lazy updates of the simulated program counter">
   2206 <title>Lazy updates of the simulated program counter</title>
   2207 
   2208 <para>Simulated <computeroutput>%EIP</computeroutput> is not
   2209 updated after every simulated x86 insn as this was regarded as
   2210 too expensive.  Instead ucode
   2211 <computeroutput>INCEIP</computeroutput> insns move it along as
   2212 and when necessary.  Currently we don't allow it to fall more
   2213 than 4 bytes behind reality (see
   2214 <computeroutput>VG_(disBB)</computeroutput> for the way this
   2215 works).</para>
   2216 
   2217 <para>Note that <computeroutput>%EIP</computeroutput> is always
   2218 brought up to date by the inner dispatch loop in
   2219 <computeroutput>VG_(dispatch)</computeroutput>, so that if the
   2220 client takes a fault we know at least which basic block this
   2221 happened in.</para>
   2222 
   2223 </sect2>
   2224 
   2225 
   2226 
   2227 <sect2 id="mc-tech-docs.signals" xreflabel="Signals">
   2228 <title>Signals</title>
   2229 
   2230 <para>Horrible, horrible.  <filename>vg_signals.c</filename>.
   2231 Basically, since we have to intercept all system calls anyway, we
   2232 can see when the client tries to install a signal handler.  If it
   2233 does so, we make a note of what the client asked to happen, and
   2234 ask the kernel to route the signal to our own signal handler,
   2235 <computeroutput>VG_(oursignalhandler)</computeroutput>.  This
   2236 simply notes the delivery of signals, and returns.</para>
   2237 
   2238 <para>Every 1000 basic blocks, we see if more signals have
   2239 arrived.  If so,
   2240 <computeroutput>VG_(deliver_signals)</computeroutput> builds
   2241 signal delivery frames on the client's stack, and allows their
   2242 handlers to be run.  Valgrind places in these signal delivery
   2243 frames a bogus return address,
   2244 <computeroutput>VG_(signalreturn_bogusRA)</computeroutput>, and
   2245 checks all jumps to see if any jump to it.  Such a jump is a sign
   2246 that a signal handler is returning, in which case Valgrind removes
   2247 the relevant signal frame from the client's stack, restores from
   2248 the signal frame the simulated state as it was before the signal was
   2249 delivered, and allows the client to run onwards.  We have to do
   2250 it this way because some signal handlers never return, they just
   2251 <computeroutput>longjmp()</computeroutput>, which nukes the
   2252 signal delivery frame.</para>
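
<para>The note-now, deliver-later split can be sketched in portable C.
This is an illustration only: the names and the fixed table size are
assumptions, and where the sketch merely counts a pending signal, the
real <computeroutput>VG_(deliver_signals)</computeroutput> builds a
delivery frame on the client's stack.</para>

<programlisting><![CDATA[
#include <signal.h>

#define TOY_MAX_SIG 32

/* One flag per signal number, set by the handler and cleared on
   delivery. */
static volatile sig_atomic_t toy_pending[TOY_MAX_SIG];

/* Stand-in for the real handler: note the signal and return. */
static void toy_note_signal(int signo)
{
    if (signo > 0 && signo < TOY_MAX_SIG)
        toy_pending[signo] = 1;
}

/* Called by the dispatcher every so many basic blocks.  Clears the
   pending flags and returns how many signals were waiting. */
static int toy_deliver_signals(void)
{
    int delivered = 0;
    for (int s = 1; s < TOY_MAX_SIG; s++) {
        if (toy_pending[s]) {
            toy_pending[s] = 0;
            delivered++;
        }
    }
    return delivered;
}
]]></programlisting>

<para>A flag per signal means several arrivals of the same signal
between two polls collapse into one delivery in this model.</para>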
   2253 
   2254 <para>The Linux kernel has a different but equally horrible hack
   2255 for detecting signal handler returns.  Discovering it is left as
   2256 an exercise for the reader.</para>
   2257 
   2258 </sect2>
   2259 
   2260 
   2261 <sect2 id="mc-tech-docs.todo">
   2262 <title>To be written</title>
   2263 
   2264 <para>The following is a list of as-yet-not-written stuff. Apologies.</para>
   2265 <orderedlist>
   2266   <listitem>
   2267     <para>The translation cache and translation table</para>
   2268   </listitem>
   2269   <listitem>
   2270     <para>Exceptions, creating new translations</para>
   2271   </listitem>
   2272   <listitem>
   2273     <para>Self-modifying code</para>
   2274   </listitem>
   2275   <listitem>
   2276     <para>Errors, error contexts, error reporting, suppressions</para>
   2277   </listitem>
   2278   <listitem>
   2279     <para>Client malloc/free</para>
   2280   </listitem>
   2281   <listitem>
   2282     <para>Low-level memory management</para>
   2283   </listitem>
   2284   <listitem>
   2285     <para>A and V bitmaps</para>
   2286   </listitem>
   2287   <listitem>
   2288     <para>Symbol table management</para>
   2289   </listitem>
   2290   <listitem>
   2291     <para>Dealing with system calls</para>
   2292   </listitem>
   2293   <listitem>
   2294     <para>Namespace management</para>
   2295   </listitem>
   2296   <listitem>
   2297     <para>GDB attaching</para>
   2298   </listitem>
   2299   <listitem>
   2300     <para>Non-dependence on glibc or anything else</para>
   2301   </listitem>
   2302   <listitem>
   2303     <para>The leak detector</para>
   2304   </listitem>
   2305   <listitem>
   2306     <para>Performance problems</para>
   2307   </listitem>
   2308   <listitem>
   2309     <para>Continuous sanity checking</para>
   2310   </listitem>
   2311   <listitem>
   2312     <para>Tracing, or not tracing, child processes</para>
   2313   </listitem>
   2314   <listitem>
   2315     <para>Assembly glue for syscalls</para>
   2316   </listitem>
   2317 </orderedlist>
   2318 
   2319 </sect2>
   2320 
   2321 </sect1>
   2322 
   2323 
   2324 
   2325 
   2326 <sect1 id="mc-tech-docs.extensions" xreflabel="Extensions">
   2327 <title>Extensions</title>
   2328 
   2329 <para>Some comments about Stuff To Do.</para>
   2330 
   2331 <sect2 id="mc-tech-docs.bugs" xreflabel="Bugs">
   2332 <title>Bugs</title>
   2333 
   2334 <para>Stephan Kulow and Marc Mutz report problems with kmail in
   2335 KDE 3 CVS (RC2 ish) when run on Valgrind.  Stephan has it
   2336 deadlocking; Marc has it looping at startup.  I can't repro
   2337 either behaviour. Needs repro-ing and fixing.</para>
   2338 
   2339 </sect2>
   2340 
   2341 
   2342 <sect2 id="mc-tech-docs.threads" xreflabel="Threads">
   2343 <title>Threads</title>
   2344 
   2345 <para>Doing a good job of thread support strikes me as almost a
   2346 research-level problem.  The central issues are how to do fast
   2347 cheap locking of the
   2348 <computeroutput>VG_(primary_map)</computeroutput> structure,
   2349 whether or not accesses to the individual secondary maps need
   2350 locking, what race-condition issues result, and whether the
   2351 already-nasty mess that is the signal simulator needs further
   2352 hackery.</para>
   2353 
   2354 <para>I realise that threads are the most-frequently-requested
   2355 feature, and I am thinking about it all.  If you have guru-level
   2356 understanding of fast mutual exclusion mechanisms and race
   2357 conditions, I would be interested in hearing from you.</para>
   2358 
   2359 </sect2>
   2360 
   2361 
   2362 
   2363 <sect2 id="mc-tech-docs.verify" xreflabel="Verification suite">
   2364 <title>Verification suite</title>
   2365 
   2366 <para>Directory <computeroutput>tests/</computeroutput> contains
   2367 various ad-hoc tests for Valgrind.  However, there is no
   2368 systematic verification or regression suite, that, for example,
   2369 exercises all the stuff in <filename>vg_memory.c</filename>, to
   2370 ensure that illegal memory accesses and undefined value uses are
   2371 detected as they should be.  It would be good to have such a
   2372 suite.</para>
   2373 
   2374 </sect2>
   2375 
   2376 
   2377 <sect2 id="mc-tech-docs.porting" xreflabel="Porting to other platforms">
   2378 <title>Porting to other platforms</title>
   2379 
   2380 <para>It would be great if Valgrind were ported to FreeBSD and x86
   2381 NetBSD, and to x86 OpenBSD, if it's possible (doesn't OpenBSD use
   2382 a.out-style executables, not ELF?)</para>
   2383 
   2384 <para>The main difficulties, for an x86-ELF platform, seem to
   2385 be:</para>
   2386 
   2387 <itemizedlist>
   2388 
   2389   <listitem>
   2390     <para>You'd need to rewrite the
   2391     <computeroutput>/proc/self/maps</computeroutput> parser
   2392     (<filename>vg_procselfmaps.c</filename>).  Easy.</para>
   2393   </listitem>
   2394 
   2395   <listitem>
   2396     <para>You'd need to rewrite
   2397     <filename>vg_syscall_mem.c</filename>, or, more specifically,
   2398     provide one for your OS.  This is tedious, but you can
   2399     implement syscalls on demand, and the Linux kernel interface
   2400     is, for the most part, going to look very similar to the *BSD
   2401     interfaces, so it's really a copy-paste-and-modify-on-demand
   2402     job.  As part of this, you'd need to supply a new
   2403     <filename>vg_kerneliface.h</filename> file.</para>
   2404   </listitem>
   2405 
   2406   <listitem>
   2407     <para>You'd also need to change the syscall wrappers for
   2408     Valgrind's internal use, in
   2409     <filename>vg_mylibc.c</filename>.</para>
   2410   </listitem>
   2411 
   2412 </itemizedlist>
   2413 
   2414 <para>All in all, I think a port to x86-ELF *BSDs is not really
   2415 very difficult, and in some ways I would like to see it happen,
   2416 because that would force a clearer factoring of Valgrind into
   2417 platform dependent and independent pieces.  Not to mention, *BSD
   2418 folks also deserve to use Valgrind just as much as the Linux crew
   2419 do.</para>
   2420 
   2421 </sect2>
   2422 
   2423 </sect1>
   2424 
   2425 
   2426 
   2427 <sect1 id="mc-tech-docs.easystuff" 
   2428        xreflabel="Easy stuff which ought to be done">
   2429 <title>Easy stuff which ought to be done</title>
   2430 
   2431 
   2432 <sect2 id="mc-tech-docs.mmx" xreflabel="MMX Instructions">
   2433 <title>MMX Instructions</title>
   2434 
   2435 <para>MMX insns should be supported, using the same trick as for
   2436 FPU insns.  If the MMX registers are not used to copy
   2437 uninitialised junk from one place to another in memory, this
   2438 means we don't have to actually simulate the internal MMX unit
   2439 state, so the FPU hack applies.  This should be fairly
   2440 easy.</para>
   2441 
   2442 </sect2>
   2443 
   2444 
   2445 <sect2 id="mc-tech-docs.fixstabs" xreflabel="Fix stabs-info Reader">
   2446 <title>Fix stabs-info reader</title>
   2447 
   2448 <para>The machinery in <filename>vg_symtab2.c</filename> which
   2449 reads "stabs" style debugging info is pretty weak.  It usually
   2450 correctly translates simulated program counter values into line
   2451 numbers and procedure names, but the file name is often
   2452 completely wrong.  I think the logic used to parse "stabs"
   2453 entries is at fault.  It should be fixed.  The simplest solution,
   2454 IMO, is to copy either the logic or simply the code out of GNU
   2455 binutils which does this; since GDB can clearly get it right,
   2456 binutils (or GDB?) must have code to do this somewhere.</para>
   2457 
   2458 </sect2>
   2459 
   2460 
   2461 
   2462 <sect2 id="mc-tech-docs.x86instr" xreflabel="BT/BTC/BTS/BTR">
   2463 <title>BT/BTC/BTS/BTR</title>
   2464 
   2465 <para>These are x86 instructions which test, complement, set, or
   2466 reset, a single bit in a word.  At the moment they are both
   2467 incorrectly implemented and incorrectly instrumented.</para>
   2468 
   2469 <para>The incorrect instrumentation is due to use of helper
   2470 functions.  This means we lose bit-level definedness tracking,
   2471 which could wind up giving spurious uninitialised-value use
   2472 errors.  The Right Thing to do is to invent a couple of new
   2473 UOpcodes, I think <computeroutput>GET_BIT</computeroutput> and
   2474 <computeroutput>SET_BIT</computeroutput>, which can be used to
   2475 implement all 4 x86 insns, get rid of the helpers, and give
   2476 bit-accurate instrumentation rules for the two new
   2477 UOpcodes.</para>
   2478 
   2479 <para>I realised the other day that they are mis-implemented too.
   2480 The x86 insns take a bit-index and a register or memory location
   2481 to access.  For registers the bit index clearly can only be in
   2482 the range zero to register-width minus 1, and I assumed the same
   2483 applied to memory locations too.  But evidently not; for memory
   2484 locations the index can be arbitrary, and the processor will
   2485 index arbitrarily into memory as a result.  This too should be
   2486 fixed.  Sigh.  Presumably indexing outside the immediate word is
   2487 not actually used by any programs yet tested on Valgrind, for
   2488 otherwise they (presumably) would simply not work at all.  If you
   2489 plan to hack on this, first check the Intel docs to make sure my
   2490 understanding is really correct.</para>
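
<para>For the register-index, memory-operand form, the arithmetic would
be roughly as sketched below.  This follows one reading of the Intel
manuals (a signed bit index, divided by 32 rounding downwards, selects
the word), so treat it as an assumption to verify; the type and
function names are hypothetical.</para>

<programlisting><![CDATA[
#include <stdint.h>

typedef struct {
    int32_t  byte_delta;  /* add this to the operand's effective address */
    uint32_t bit;         /* bit number within that 32-bit word, 0..31 */
} ToyBitLoc;

/* Locate the bit named by a signed bit index relative to a 32-bit
   memory operand. */
static ToyBitLoc toy_bt_locate32(int32_t bit_index)
{
    /* Floor division by 32, written out because C's '/' truncates
       towards zero for negative operands. */
    int32_t word = bit_index / 32;
    if (bit_index % 32 < 0)
        word -= 1;

    ToyBitLoc loc;
    loc.byte_delta = word * 4;
    loc.bit        = (uint32_t)(bit_index - word * 32);
    return loc;
}
]]></programlisting>

<para>So an index of 35 lands on bit 3 of the following word, and an
index of -1 lands on bit 31 of the preceding one.</para>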
   2491 
   2492 </sect2>
   2493 
   2494 
   2495 <sect2 id="mc-tech-docs.prefetch" xreflabel="Using PREFETCH Instructions">
   2496 <title>Using PREFETCH Instructions</title>
   2497 
   2498 <para>Here's a small but potentially interesting project for
   2499 performance junkies.  Experiments with Valgrind's code generator
   2500 and optimiser(s) suggest that reducing the number of instructions
   2501 executed in the translations and mem-check helpers gives
   2502 disappointingly small performance improvements.  Perhaps this is
   2503 because performance of Valgrindified code is limited by cache
   2504 misses.  After all, each read in the original program now gives
   2505 rise to at least three reads, one for the
   2506 <computeroutput>VG_(primary_map)</computeroutput>, one for the
   2507 resulting secondary, and the original.  Not to mention, the
   2508 instrumented translations are 13 to 14 times larger than the
   2509 originals.  All in all one would expect the memory system to be
   2510 hammered to hell and then some.</para>
   2511 
   2512 <para>So here's an idea.  An x86 insn involving a read from
   2513 memory, after instrumentation, will turn into ucode of the
   2514 following form:</para>
   2515 <programlisting><![CDATA[
   2516 ... calculate effective addr, into ta and qa ...
   2517   TESTVL qa             -- is the addr defined?
   2518   LOADV (ta), qloaded   -- fetch V bits for the addr
   2519   LOAD  (ta), tloaded   -- do the original load]]></programlisting>
   2520 
   2521 <para>At the point where the
   2522 <computeroutput>LOADV</computeroutput> is done, we know the
   2523 actual address (<computeroutput>ta</computeroutput>) from which
   2524 the real <computeroutput>LOAD</computeroutput> will be done.  We
   2525 also know that the <computeroutput>LOADV</computeroutput> will
   2526 take around 20 x86 insns to do.  So it seems plausible that doing
   2527 a prefetch of <computeroutput>ta</computeroutput> just before the
   2528 <computeroutput>LOADV</computeroutput> might just avoid a miss at
   2529 the <computeroutput>LOAD</computeroutput> point, and that might
   2530 be a significant performance win.</para>
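
<para>At the C level the proposed ordering looks like the sketch below.
It assumes GCC or Clang, whose
<computeroutput>__builtin_prefetch</computeroutput> stands in for
whatever prefetch insn the code generator would emit;
<computeroutput>toy_loadv</computeroutput> is a hypothetical placeholder
for the shadow lookup, not the real helper.</para>

<programlisting><![CDATA[
#include <stdint.h>

/* Hypothetical stand-in for the shadow lookup done by LOADV. */
static uint8_t toy_shadow[256];

static uint8_t toy_loadv(const uint8_t *ta)
{
    return toy_shadow[(uintptr_t)ta & 0xFF];
}

/* Prefetch the client address, do the slow shadow lookup, then do
   the original load, which by now may hit in the cache. */
static uint8_t toy_checked_load(const uint8_t *ta, uint8_t *vbits_out)
{
    __builtin_prefetch(ta, 0, 3);   /* 0: for reading, 3: keep in cache */
    *vbits_out = toy_loadv(ta);     /* the LOADV */
    return *ta;                     /* the LOAD */
}
]]></programlisting>

<para>The prefetch is only a hint, so the function returns the same
values with or without it; any benefit shows up in timing alone.</para>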
   2531 
   2532 <para>Prefetch insns are notoriously temperamental, more often
   2533 than not making things worse rather than better, so this would
   2534 require considerable fiddling around.  It's complicated because
   2535 Intels and AMDs have different prefetch insns with different
   2536 semantics, so that too needs to be taken into account.  As a
   2537 general rule, even placing the prefetches before the
   2538 <computeroutput>LOADV</computeroutput> insn is too near the
   2539 <computeroutput>LOAD</computeroutput>; the ideal distance is
   2540 apparently circa 200 CPU cycles.  So it might be worth having
   2541 another analysis/transformation pass which pushes prefetches as
   2542 far back as possible, hopefully immediately after the effective
   2543 address becomes available.</para>
   2544 
   2545 <para>Doing too many prefetches is also bad because they soak up
   2546 bus bandwidth / cpu resources, so some cleverness in deciding
   2547 which loads to prefetch and which to not might be helpful.  One
   2548 can imagine not prefetching client-stack-relative
   2549 (<computeroutput>%EBP</computeroutput> or
   2550 <computeroutput>%ESP</computeroutput>) accesses, since the stack
   2551 in general tends to show good locality anyway.</para>
   2552 
   2553 <para>There's quite a lot of experimentation to do here, but I
   2554 think it might make an interesting week's work for
   2555 someone.</para>
   2556 
   2557 <para>As of 15-ish March 2002, I've started to experiment with
   2558 this, using the AMD
   2559 <computeroutput>prefetch/prefetchw</computeroutput> insns.</para>
   2560 
   2561 </sect2>
   2562 
   2563 
   2564 <sect2 id="mc-tech-docs.pranges" xreflabel="User-defined Permission Ranges">
   2565 <title>User-defined Permission Ranges</title>
   2566 
   2567 <para>This is quite a large project -- perhaps a month's hacking
   2568 for a capable hacker to do a good job -- but it's potentially
   2569 very interesting.  The outcome would be that Valgrind could
   2570 detect a whole class of bugs which it currently cannot.</para>
   2571 
   2572 <para>The presentation falls into two pieces.</para>
   2573 
   2574 <sect3 id="mc-tech-docs.psetting" 
   2575   xreflabel="Part 1: User-defined Address-range Permission Setting">
   2576 <title>Part 1: User-defined Address-range Permission Setting</title>
   2577 
   2578 <para>Valgrind intercepts the client's
   2579 <computeroutput>malloc</computeroutput>,
   2580 <computeroutput>free</computeroutput>, etc calls, watches system
   2581 calls, and watches the stack pointer move.  This is currently the
   2582 only way it knows about which addresses are valid and which not.
   2583 Sometimes the client program knows extra information about its
   2584 memory areas.  For example, the client could at some point know
   2585 that all elements of an array are out-of-date.  We would like to
   2586 be able to convey to Valgrind this information that the array is
   2587 now addressable-but-uninitialised, so that Valgrind can then warn
   2588 if elements are used before they get new values.</para>
   2589 
   2590 <para>What I would like are some macros like this:</para>
   2591 <programlisting><![CDATA[
   2592   VALGRIND_MAKE_NOACCESS(addr, len)
   2593   VALGRIND_MAKE_WRITABLE(addr, len)
   2594   VALGRIND_MAKE_READABLE(addr, len)]]></programlisting>
   2595 
   2596 <para>and also, to check that memory is
   2597 addressable/initialised,</para>
   2598 <programlisting><![CDATA[
   2599   VALGRIND_CHECK_ADDRESSABLE(addr, len)
   2600   VALGRIND_CHECK_INITIALISED(addr, len)]]></programlisting>
   2601 
   2602 <para>I then include in my sources a header defining these
   2603 macros, rebuild my app, run under Valgrind, and get user-defined
   2604 checks.</para>
   2605 
   2606 <para>Now here's a neat trick.  It's a nuisance to have to
   2607 re-link the app with some new library which implements the above
   2608 macros.  So the idea is to define the macros so that the
   2609 resulting executable is still completely stand-alone, and can be
   2610 run without Valgrind, in which case the macros do nothing, but
   2611 when run on Valgrind, the Right Thing happens.  How to do this?
   2612 The idea is for these macros to turn into a piece of inline
   2613 assembly code, which (1) has no effect when run on the real CPU,
   2614 (2) is easily spotted by Valgrind's JITter, and (3) no sane
   2615 person would ever write, which is important for avoiding false
   2616 matches in (2).  So here's a suggestion:</para>
   2617 <programlisting><![CDATA[
   2618   VALGRIND_MAKE_NOACCESS(addr, len)]]></programlisting>
   2619 
   2620 <para>becomes (roughly speaking)</para>
   2621 <programlisting><![CDATA[
   2622   movl addr, %eax
   2623   movl len,  %ebx
   2624   movl $1,   %ecx   -- 1 describes the action; MAKE_WRITABLE might be
   2625                     -- 2, etc
   2626   rorl $13, %ecx
   2627   rorl $19, %ecx
   2628   rorl $11, %eax
   2629   rorl $21, %eax]]></programlisting>
   2630 
   2631 <para>The rotate sequences have no effect, and it's unlikely they
   2632 would appear for any other reason, but they define a unique
   2633 byte-sequence which the JITter can easily spot.  Using the
   2634 operand constraints section at the end of a gcc inline-assembly
   2635 statement, we can tell gcc that the assembly fragment kills
   2636 <computeroutput>%eax</computeroutput>,
   2637 <computeroutput>%ebx</computeroutput>,
   2638 <computeroutput>%ecx</computeroutput> and the condition codes, so
   2639 the fragment is harmless when not running on Valgrind, executes
   2640 quickly in that case, and does not require any other library
   2641 support.</para>
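<para>The claim that the rotate pairs have no effect can be checked
with a small sketch.  It is only an illustration, not Valgrind's
actual macro: the names <computeroutput>ror32</computeroutput> and
<computeroutput>marker_rotates</computeroutput> are hypothetical, the
inline assembly is used only when building with a gcc-compatible
compiler for x86, and a plain C fallback stands in elsewhere.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit value right by n bits, for 0 < n < 32. */
static uint32_t ror32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x >> n) | (x << (32u - n));
}

/* Hypothetical stand-in for the marker sequence: applies the same
   rotate pairs (13+19 and 11+21) to one value.  Each pair rotates
   by 32 bits in total, so the value comes back unchanged. */
static uint32_t marker_rotates(uint32_t x)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* "+r" lets the compiler pick the register; "cc" tells it the
       condition codes are clobbered by the rotates. */
    __asm__ volatile ("rorl $13, %0\n\t"
                      "rorl $19, %0\n\t"
                      "rorl $11, %0\n\t"
                      "rorl $21, %0"
                      : "+r"(x)
                      :
                      : "cc");
    return x;
#else
    /* Portable fallback (assumption: non-x86 or non-gcc build). */
    x = ror32(ror32(x, 13), 19);
    x = ror32(ror32(x, 11), 21);
    return x;
#endif
}
```
]]></programlisting>
<para>Because 13 + 19 = 32 and 11 + 21 = 32, 
<computeroutput>marker_rotates(v)</computeroutput> returns 
<computeroutput>v</computeroutput> for any 32-bit value, which is
what makes the sequence safe to execute on the real CPU.</para>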
   2642 
   2643 
   2644 </sect3>
   2645 
   2646 
   2647 <sect3 id="mc-tech-docs.prange-detect" 
   2648   xreflabel="Part 2: Using it to detect Interference between Stack 
   2649 Variables">
   2650 <title>Part 2: Using it to detect Interference between Stack 
   2651 Variables</title>
   2652 
   2653 <para>Currently Valgrind cannot detect errors of the following
   2654 form:</para>
   2655 <programlisting><![CDATA[
   2656 void fooble ( void )
   2657 {
   2658   int a[10];
   2659   int b[10];
   2660   a[10] = 99;
   2661 }]]></programlisting>
   2662 
   2663 <para>Now imagine rewriting this as</para>
   2664 <programlisting><![CDATA[
   2665 void fooble ( void )
   2666 {
   2667   int spacer0;
   2668   int a[10];
   2669   int spacer1;
   2670   int b[10];
   2671   int spacer2;
   2672   VALGRIND_MAKE_NOACCESS(&spacer0, sizeof(int));
   2673   VALGRIND_MAKE_NOACCESS(&spacer1, sizeof(int));
   2674   VALGRIND_MAKE_NOACCESS(&spacer2, sizeof(int));
   2675   a[10] = 99;
   2676 }]]></programlisting>
   2677 
   2678 <para>Now the invalid write will hit
   2679 <computeroutput>spacer0</computeroutput> or
   2680 <computeroutput>spacer1</computeroutput> (assuming the compiler keeps
   2681 the declared stack layout), so Valgrind will spot the error.</para>
   2682 
   2683 <para>There are two complications.</para>
   2684 
   2685 <orderedlist>
   2686 
   2687   <listitem>
   2688     <para>The first is that we don't want to annotate sources by
   2689     hand, so the Right Thing to do is to write a C/C++ parser,
   2690     annotator, prettyprinter which does this automatically, and
   2691     run it on post-CPP'd C/C++ source.  The parser/prettyprinter 
   2692     is probably not as hard as it sounds; I would write it in Haskell, 
   2693     a powerful functional language well suited to doing symbolic
   2694     computation, with which I am intimately familiar.  There is
   2695     already a C parser written in Haskell by someone in the
   2696     Haskell community, and that would probably be a good starting
   2697     point.</para>
   2698   </listitem>
   2699 
   2700 
   2701   <listitem>
   2702     <para>The second complication is how to get rid of these
   2703     <computeroutput>NOACCESS</computeroutput> records inside
   2704     Valgrind when the instrumented function exits; after all,
   2705     these refer to stack addresses and will make no sense
   2706     whatever when some other function happens to re-use the same
   2707     stack address range, probably shortly afterwards.  I think I
   2708     would be inclined to define a special stack-specific
   2709     macro:</para>
   2710 <programlisting><![CDATA[
   2711   VALGRIND_MAKE_NOACCESS_STACK(addr, len)]]></programlisting>
   2712     <para>which causes Valgrind to record the client's
   2713     <computeroutput>%ESP</computeroutput> at the time it is
   2714     executed.  Valgrind will then watch for changes in
   2715     <computeroutput>%ESP</computeroutput> and discard such
   2716     records as soon as the protected area is uncovered by an
   2717     increase in <computeroutput>%ESP</computeroutput>.  I
   2718     hesitate with this scheme only because it is potentially
   2719     expensive, if there are hundreds of such records, and
   2720     considering that changes in
   2721     <computeroutput>%ESP</computeroutput> already require
   2722     expensive messing with stack access permissions.</para>
   2723   </listitem>
   2724 </orderedlist>
   2725 
   2726 <para>This is probably easier and more robust than for the
   2727 instrumenter program to try and spot all exit points for the
   2728 procedure and place suitable deallocation annotations there.
   2729 Plus C++ procedures can bomb out at any point if they get an
   2730 exception, so spotting return points at the source level just
   2731 won't work at all.</para>
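<para>The <computeroutput>%ESP</computeroutput>-watching scheme from
the second complication could be sketched roughly as follows.  This
is a minimal illustration under stated assumptions, not Valgrind's
implementation: the record table, its fixed size, and the names
<computeroutput>StackRecord</computeroutput>,
<computeroutput>add_stack_record</computeroutput> and
<computeroutput>on_sp_change</computeroutput> are all hypothetical,
and a record is dropped as soon as its start address falls below the
new stack pointer.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RECORDS 64   /* hypothetical fixed capacity */

/* One hypothetical NOACCESS record for the range [addr, addr+len). */
typedef struct {
    uintptr_t addr;
    size_t    len;
} StackRecord;

static StackRecord records[MAX_RECORDS];
static size_t      n_records = 0;

/* Register a protected stack range.
   Returns 0 on success, -1 if the table is full. */
static int add_stack_record(uintptr_t addr, size_t len)
{
    assert(len > 0);
    if (n_records == MAX_RECORDS)
        return -1;
    records[n_records].addr = addr;
    records[n_records].len  = len;
    n_records++;
    return 0;
}

/* Called when the client's stack pointer becomes new_sp.  The stack
   grows downwards, so a range starting below new_sp has been
   uncovered and its record is discarded.  Returns the number of
   records dropped.  The linear scan shows why hundreds of records
   could make every stack-pointer change expensive. */
static size_t on_sp_change(uintptr_t new_sp)
{
    size_t kept = 0, dropped = 0;
    for (size_t i = 0; i < n_records; i++) {
        if (records[i].addr < new_sp)
            dropped++;
        else
            records[kept++] = records[i];
    }
    n_records = kept;
    return dropped;
}
```
]]></programlisting>
<para>In this sketch, a decrease in the stack pointer discards
nothing; only an increase past a recorded range removes it, which
matches the behaviour wanted on function exit, including exits caused
by C++ exceptions.</para>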
   2732 
   2733 <para>Although this is some work, it is all eminently doable, and it
   2734 would make Valgrind into an even more useful tool.</para>
   2735 
   2736 </sect3>
   2737 
   2738 </sect2>
   2739 
   2740 </sect1>
   2741 </chapter>
   2742