-----------------------------------------------------------------------------
Notes on performance
-----------------------------------------------------------------------------
The intent of this file is to record progress in improving performance.

-----------------------------------------------------------------------------
Just before 3.1.0:
- Julian made LibVEX_Alloc() inlinable.  Saved a couple of percent.
- Julian started building Vex at -O2.  Saved up to 8% or so(?) in some
  cases.

Post 3.1.0:
- Julian made the tree builder linear.  Saved 2--13% on a range of programs.
- Nick improved vg_SP_update_pass() to identify more small constant
  increments/decrements of SP, so the fast cases can be used more often.
  Saved 1--3% on a few programs.
- r5345, r5346, r5352: Julian improved the dispatcher so that x86 and
  amd64 use jumps instead of call/return for calling translations.
  Also, on x86, amd64, ppc32 and ppc64, --profile-flags style profiling was
  removed from the dispatch loop unless --profile-flags is being used.
  Improved Nulgrind performance typically by 10--20%, and Memcheck
  performance typically by 2--20%.
- Julian changed findSb to slowly move superblocks to the front of the list
  as they were accessed.  This sped up perf/heap by 25--50%, and some big
  programs (e.g. ktuberling) by a couple of percent.
- Nick reduced the iteration count of the loop in swizzle() from 20 to 5,
  which gave almost identical results while saving 2% in perf/tinycc and 10%
  in perf/heap on a 3GHz Prescott P4.
- Nick changed ExeContext gathering to not record/save extra zeroes at the
  end.  Saved 7% on perf/heap with --num-callers=50, and about 1% on
  perf/tinycc.
- Julian vectorised copy_address_range_perms for common cases, which
  gives about a 40% speedup on artificial programs which just do
  realloc() and nothing else, and about a 3--4% speedup on starting
  kpresenter-1.5.0 and loading a 16-slide presentation.
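The findSb entry above describes moving superblocks gradually towards the
front of the list as they are accessed.  The following is a minimal sketch
of one way to do that (swap a hit with its predecessor), not the actual
Valgrind code.  The Range type, the array representation and the name
find_range() are all hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a superblock: a half-open address range. */
typedef struct {
    size_t start;
    size_t end;   /* one past the last byte */
} Range;

/* Search ranges[0..n) for the range containing addr.  On a hit at index
   i > 0, swap the entry with its predecessor, so that frequently hit
   entries drift towards the front over repeated lookups.  Returns the
   entry's index after any swap, or -1 if no range contains addr. */
static int find_range(Range *ranges, int n, size_t addr)
{
    assert(ranges != NULL || n == 0);
    for (int i = 0; i < n; i++) {
        if (addr >= ranges[i].start && addr < ranges[i].end) {
            if (i > 0) {
                Range tmp = ranges[i - 1];
                ranges[i - 1] = ranges[i];
                ranges[i] = tmp;
                return i - 1;
            }
            return i;
        }
    }
    return -1;
}
```

Each hit moves an entry only one place forward, so a single stray lookup
barely disturbs the order, while a repeatedly accessed entry reaches the
front after a few lookups and is then found on the first comparison.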

COMPVBITS branch:
- Nick switched to compressed V bits.  The initial version saved 0--5% in
  most cases, with a 30% improvement in one case (tsim_arch) which calls
  set_address_range_perms() a lot.
- Nick rewrote set_address_range_perms(), which gained 0--3% typically,
  and 22% on tsim_arch.
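To illustrate why a packed representation and a fast range setter go
together, here is a small sketch.  It assumes a 2-bit state per byte of
client memory, packed four to a map byte.  These notes do not describe the
actual encoding, so the state values, the flat map layout and the names
get_state(), set_state() and set_range() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 2-bit per-byte states; this encoding is an assumption. */
enum { ST_NOACCESS = 0, ST_UNDEFINED = 1, ST_DEFINED = 2 };

/* Read the 2-bit state of client byte number a from the packed map. */
static unsigned get_state(const uint8_t *map, size_t a)
{
    return (map[a >> 2] >> ((a & 3) * 2)) & 3u;
}

/* Overwrite the 2-bit state of client byte number a. */
static void set_state(uint8_t *map, size_t a, unsigned st)
{
    unsigned shift = (unsigned)(a & 3) * 2;
    map[a >> 2] = (uint8_t)((map[a >> 2] & ~(3u << shift))
                            | ((st & 3u) << shift));
}

/* Set client bytes [a, a+len) to state st.  Unaligned head and tail
   bytes are set one at a time; the aligned middle is filled with
   memset, covering four client bytes per map byte written. */
static void set_range(uint8_t *map, size_t a, size_t len, unsigned st)
{
    assert(map != NULL);
    size_t end = a + len;
    while (a < end && (a & 3) != 0)
        set_state(map, a++, st);
    size_t whole = (end - a) / 4;
    /* st * 0x55 replicates the 2-bit state into all four slots. */
    memset(&map[a >> 2], (int)((st & 3u) * 0x55u), whole);
    a += whole * 4;
    while (a < end)
        set_state(map, a++, st);
}
```

With this layout, marking a large region costs roughly one map-byte write
per four client bytes, which is the kind of saving that matters for a
program that changes permissions on address ranges very frequently.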