Home | History | Annotate | Download | only in subzero
      1 Design of the Subzero fast code generator
      2 =========================================
      3 
      4 Introduction
      5 ------------
      6 
      7 The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes
      8 compiler technology based on `LLVM <http://llvm.org/>`_.  The developer uses the
      9 PNaCl toolchain to compile their application to architecture-neutral PNaCl
     10 bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as
     11 possible.  The ``.pexe`` file is downloaded to the user's browser where the
     12 PNaCl translator (a component of Chrome) compiles the ``.pexe`` file to
     13 `sandboxed
     14 <https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
     15 native code.  The translator uses architecture-specific optimizations as much as
     16 practical to generate good native code.
     17 
     18 The native code can be cached by the browser to avoid repeating translation on
     19 future page loads.  However, first-time user experience is hampered by long
     20 translation times.  The LLVM-based PNaCl translator is pretty slow, even when
     21 using ``-O0`` to minimize optimizations, so delays are especially noticeable on
     22 slow browser platforms such as ARM-based Chromebooks.
     23 
     24 Translator slowness can be mitigated or hidden in a number of ways.
     25 
     26 - Parallel translation.  However, slow machines where this matters most, e.g.
     27   ARM-based Chromebooks, are likely to have fewer cores to parallelize across,
     28   and are likely to less memory available for multiple translation threads to
     29   use.
     30 
     31 - Streaming translation, i.e. start translating as soon as the download starts.
     32   This doesn't help much when translation speed is 10 slower than download
     33   speed, or the ``.pexe`` file is already cached while the translated binary was
     34   flushed from the cache.
     35 
     36 - Arrange the web page such that translation is done in parallel with
     37   downloading large assets.
     38 
     39 - Arrange the web page to distract the user with `cat videos
     40   <https://www.youtube.com/watch?v=tLt5rBfNucc>`_ while translation is in
     41   progress.
     42 
     43 Or, improve translator performance to something more reasonable.
     44 
     45 This document describes Subzero's attempt to improve translation speed by an
     46 order of magnitude while rivaling LLVM's code quality.  Subzero does this
     47 through minimal IR layering, lean data structures and passes, and a careful
     48 selection of fast optimization passes.  It has two optimization recipes: full
     49 optimizations (``O2``) and minimal optimizations (``Om1``).  The recipes are the
     50 following (described in more detail below):
     51 
     52 +----------------------------------------+-----------------------------+
     53 | O2 recipe                              | Om1 recipe                  |
     54 +========================================+=============================+
     55 | Parse .pexe file                       | Parse .pexe file            |
     56 +----------------------------------------+-----------------------------+
     57 | Loop nest analysis                     |                             |
     58 +----------------------------------------+-----------------------------+
     59 | Local common subexpression elimination |                             |
     60 +----------------------------------------+-----------------------------+
     61 | Address mode inference                 |                             |
     62 +----------------------------------------+-----------------------------+
     63 | Read-modify-write (RMW) transform      |                             |
     64 +----------------------------------------+-----------------------------+
     65 | Basic liveness analysis                |                             |
     66 +----------------------------------------+-----------------------------+
     67 | Load optimization                      |                             |
     68 +----------------------------------------+-----------------------------+
     69 |                                        | Phi lowering (simple)       |
     70 +----------------------------------------+-----------------------------+
     71 | Target lowering                        | Target lowering             |
     72 +----------------------------------------+-----------------------------+
     73 | Full liveness analysis                 |                             |
     74 +----------------------------------------+-----------------------------+
     75 | Register allocation                    | Minimal register allocation |
     76 +----------------------------------------+-----------------------------+
     77 | Phi lowering (advanced)                |                             |
     78 +----------------------------------------+-----------------------------+
     79 | Post-phi register allocation           |                             |
     80 +----------------------------------------+-----------------------------+
     81 | Branch optimization                    |                             |
     82 +----------------------------------------+-----------------------------+
     83 | Code emission                          | Code emission               |
     84 +----------------------------------------+-----------------------------+
     85 
     86 Goals
     87 =====
     88 
     89 Translation speed
     90 -----------------
     91 
     92 We'd like to be able to translate a ``.pexe`` file as fast as download speed.
     93 Any faster is in a sense wasted effort.  Download speed varies greatly, but
     94 we'll arbitrarily say 1 MB/sec.  We'll pick the ARM A15 CPU as the example of a
     95 slow machine.  We observe a 3 single-thread performance difference between A15
     96 and a high-end x86 Xeon E5-2690 based workstation, and aggressively assume a
     97 ``.pexe`` file could be compressed to 50% on the web server using gzip transport
     98 compression, so we set the translation speed goal to 6 MB/sec on the high-end
     99 Xeon workstation.
    100 
    101 Currently, at the ``-O0`` level, the LLVM-based PNaCl translation translates at
    102  the target rate.  The ``-O2`` mode takes 3 as long as the ``-O0`` mode.
    103 
    104 In other words, Subzero's goal is to improve over LLVM's translation speed by
    105 10.
    106 
    107 Code quality
    108 ------------
    109 
    110 Subzero's initial goal is to produce code that meets or exceeds LLVM's ``-O0``
    111 code quality.  The stretch goal is to approach LLVM ``-O2`` code quality.  On
    112 average, LLVM ``-O2`` performs twice as well as LLVM ``-O0``.
    113 
    114 It's important to note that the quality of Subzero-generated code depends on
    115 target-neutral optimizations and simplifications being run beforehand in the
    116 developer environment.  The ``.pexe`` file reflects these optimizations.  For
    117 example, Subzero assumes that the basic blocks are ordered topologically where
    118 possible (which makes liveness analysis converge fastest), and Subzero does not
    119 do any function inlining because it should already have been done.
    120 
    121 Translator size
    122 ---------------
    123 
    124 The current LLVM-based translator binary (``pnacl-llc``) is about 10 MB in size.
    125 We think 1 MB is a more reasonable size -- especially for such a component that
    126 is distributed to a billion Chrome users.  Thus we target a 10 reduction in
    127 binary size.
    128 
    129 For development, Subzero can be built for all target architectures, and all
    130 debugging and diagnostic options enabled.  For a smaller translator, we restrict
    131 to a single target architecture, and define a ``MINIMAL`` build where
    132 unnecessary features are compiled out.
    133 
    134 Subzero leverages some data structures from LLVM's ``ADT`` and ``Support``
    135 include directories, which have little impact on translator size.  It also uses
    136 some of LLVM's bitcode decoding code (for binary-format ``.pexe`` files), again
    137 with little size impact.  In non-``MINIMAL`` builds, the translator size is much
    138 larger due to including code for parsing text-format bitcode files and forming
    139 LLVM IR.
    140 
    141 Memory footprint
    142 ----------------
    143 
    144 The current LLVM-based translator suffers from an issue in which some
    145 function-specific data has to be retained in memory until all translation
    146 completes, and therefore the memory footprint grows without bound.  Large
    147 ``.pexe`` files can lead to the translator process holding hundreds of MB of
    148 memory by the end.  The translator runs in a separate process, so this memory
    149 growth doesn't *directly* affect other processes, but it does dirty the physical
    150 memory and contributes to a perception of bloat and sometimes a reality of
    151 out-of-memory tab killing, especially noticeable on weaker systems.
    152 
    153 Subzero should maintain a stable memory footprint throughout translation.  It's
    154 not really practical to set a specific limit, because there is not really a
    155 practical limit on a single function's size, but the footprint should be
    156 "reasonable" and be proportional to the largest input function size, not the
    157 total ``.pexe`` file size.  Simply put, Subzero should not have memory leaks or
    158 inexorable memory growth.  (We use ASAN builds to test for leaks.)
    159 
    160 Multithreaded translation
    161 -------------------------
    162 
    163 It should be practical to translate different functions concurrently and see
    164 good scalability.  Some locking may be needed, such as accessing output buffers
    165 or constant pools, but that should be fairly minimal.  In contrast, LLVM was
    166 only designed for module-level parallelism, and as such, the PNaCl translator
    167 internally splits a ``.pexe`` file into several modules for concurrent
    168 translation.  All output needs to be deterministic regardless of the level of
    169 multithreading, i.e. functions and data should always be output in the same
    170 order.
    171 
    172 Target architectures
    173 --------------------
    174 
    175 Initial target architectures are x86-32, x86-64, ARM32, and MIPS32.  Future
    176 targets include ARM64 and MIPS64, though these targets lack NaCl support
    177 including a sandbox model or a validator.
    178 
    179 The first implementation is for x86-32, because it was expected to be
    180 particularly challenging, and thus more likely to draw out any design problems
    181 early:
    182 
    183 - There are a number of special cases, asymmetries, and warts in the x86
    184   instruction set.
    185 
    186 - Complex addressing modes may be leveraged for better code quality.
    187 
    188 - 64-bit integer operations have to be lowered into longer sequences of 32-bit
    189   operations.
    190 
    191 - Paucity of physical registers may reveal code quality issues early in the
    192   design.
    193 
    194 Detailed design
    195 ===============
    196 
    197 Intermediate representation - ICE
    198 ---------------------------------
    199 
    200 Subzero's IR is called ICE.  It is designed to be reasonably similar to LLVM's
    201 IR, which is reflected in the ``.pexe`` file's bitcode structure.  It has a
    202 representation of global variables and initializers, and a set of functions.
    203 Each function contains a list of basic blocks, and each basic block constains a
    204 list of instructions.  Instructions that operate on stack and register variables
    205 do so using static single assignment (SSA) form.
    206 
    207 The ``.pexe`` file is translated one function at a time (or in parallel by
    208 multiple translation threads).  The recipe for optimization passes depends on
    209 the specific target and optimization level, and is described in detail below.
    210 Global variables (types and initializers) are simply and directly translated to
    211 object code, without any meaningful attempts at optimization.
    212 
    213 A function's control flow graph (CFG) is represented by the ``Ice::Cfg`` class.
    214 Its key contents include:
    215 
    216 - A list of ``CfgNode`` pointers, generally held in topological order.
    217 
    218 - A list of ``Variable`` pointers corresponding to local variables used in the
    219   function plus compiler-generated temporaries.
    220 
    221 A basic block is represented by the ``Ice::CfgNode`` class.  Its key contents
    222 include:
    223 
    224 - A linear list of instructions, in the same style as LLVM.  The last
    225   instruction of the list is always a terminator instruction: branch, switch,
    226   return, unreachable.
    227 
    228 - A list of Phi instructions, also in the same style as LLVM.  They are held as
    229   a linear list for convenience, though per Phi semantics, they are executed "in
    230   parallel" without dependencies on each other.
    231 
    232 - An unordered list of ``CfgNode`` pointers corresponding to incoming edges, and
    233   another list for outgoing edges.
    234 
    235 - The node's unique, 0-based index into the CFG's node list.
    236 
    237 An instruction is represented by the ``Ice::Inst`` class.  Its key contents
    238 include:
    239 
    240 - A list of source operands.
    241 
    242 - Its destination variable, if the instruction produces a result in an
    243   ``Ice::Variable``.
    244 
    245 - A bitvector indicating which variables' live ranges this instruction ends.
    246   This is computed during liveness analysis.
    247 
    248 Instructions kinds are divided into high-level ICE instructions and low-level
    249 ICE instructions.  High-level instructions consist of the PNaCl/LLVM bitcode
    250 instruction kinds.  Each target architecture implementation extends the
    251 instruction space with its own set of low-level instructions.  Generally,
    252 low-level instructions correspond to individual machine instructions.  The
    253 high-level ICE instruction space includes a few additional instruction kinds
    254 that are not part of LLVM but are generally useful (e.g., an Assignment
    255 instruction), or are useful across targets (e.g., BundleLock and BundleUnlock
    256 instructions for sandboxing).
    257 
    258 Specifically, high-level ICE instructions that derive from LLVM (but with PNaCl
    259 ABI restrictions as documented in the `PNaCl Bitcode Reference Manual
    260 <https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_) are
    261 the following:
    262 
    263 - Alloca: allocate data on the stack
    264 
    265 - Arithmetic: binary operations of the form ``A = B op C``
    266 
    267 - Br: conditional or unconditional branch
    268 
    269 - Call: function call
    270 
    271 - Cast: unary type-conversion operations
    272 
    273 - ExtractElement: extract a scalar element from a vector-type value
    274 
    275 - Fcmp: floating-point comparison
    276 
    277 - Icmp: integer comparison
    278 
    279 - IntrinsicCall: call a known intrinsic
    280 
    281 - InsertElement: insert a scalar element into a vector-type value
    282 
    283 - Load: load a value from memory
    284 
    285 - Phi: implement the SSA phi node
    286 
    287 - Ret: return from the function
    288 
    289 - Select: essentially the C language operation of the form ``X = C ? Y : Z``
    290 
    291 - Store: store a value into memory
    292 
    293 - Switch: generalized branch to multiple possible locations
    294 
    295 - Unreachable: indicate that this portion of the code is unreachable
    296 
    297 The additional high-level ICE instructions are the following:
    298 
    299 - Assign: a simple ``A=B`` assignment.  This is useful for e.g. lowering Phi
    300   instructions to non-SSA assignments, before lowering to machine code.
    301 
    302 - BundleLock, BundleUnlock.  These are markers used for sandboxing, but are
    303   common across all targets and so they are elevated to the high-level
    304   instruction set.
    305 
    306 - FakeDef, FakeUse, FakeKill.  These are tools used to preserve consistency in
    307   liveness analysis, elevated to the high-level because they are used by all
    308   targets.  They are described in more detail at the end of this section.
    309 
    310 - JumpTable: this represents the result of switch optimization analysis, where
    311   some switch instructions may use jump tables instead of cascading
    312   compare/branches.
    313 
    314 An operand is represented by the ``Ice::Operand`` class.  In high-level ICE, an
    315 operand is either an ``Ice::Constant`` or an ``Ice::Variable``.  Constants
    316 include scalar integer constants, scalar floating point constants, Undef (an
    317 unspecified constant of a particular scalar or vector type), and symbol
    318 constants (essentially addresses of globals).  Note that the PNaCl ABI does not
    319 include vector-type constants besides Undef, and as such, Subzero (so far) has
    320 no reason to represent vector-type constants internally.  A variable represents
    321 a value allocated on the stack (though not including alloca-derived storage).
    322 Among other things, a variable holds its unique, 0-based index into the CFG's
    323 variable list.
    324 
    325 Each target can extend the ``Constant`` and ``Variable`` classes for its own
    326 needs.  In addition, the ``Operand`` class may be extended, e.g. to define an
    327 x86 ``MemOperand`` that encodes a base register, an index register, an index
    328 register shift amount, and a constant offset.
    329 
    330 Register allocation and liveness analysis are restricted to Variable operands.
    331 Because of the importance of register allocation to code quality, and the
    332 translation-time cost of liveness analysis, Variable operands get some special
    333 treatment in ICE.  Most notably, a frequent pattern in Subzero is to iterate
    334 across all the Variables of an instruction.  An instruction holds a list of
    335 operands, but an operand may contain 0, 1, or more Variables.  As such, the
    336 ``Operand`` class specially holds a list of Variables contained within, for
    337 quick access.
    338 
    339 A Subzero transformation pass may work by deleting an existing instruction and
    340 replacing it with zero or more new instructions.  Instead of actually deleting
    341 the existing instruction, we generally mark it as deleted and insert the new
    342 instructions right after the deleted instruction.  When printing the IR for
    343 debugging, this is a big help because it makes it much more clear how the
    344 non-deleted instructions came about.
    345 
    346 Subzero has a few special instructions to help with liveness analysis
    347 consistency.
    348 
    349 - The FakeDef instruction gives a fake definition of some variable.  For
    350   example, on x86-32, a divide instruction defines both ``%eax`` and ``%edx``
    351   but an ICE instruction can represent only one destination variable.  This is
    352   similar for multiply instructions, and for function calls that return a 64-bit
    353   integer result in the ``%edx:%eax`` pair.  Also, using the ``xor %eax, %eax``
    354   trick to set ``%eax`` to 0 requires an initial FakeDef of ``%eax``.
    355 
    356 - The FakeUse instruction registers a use of a variable, typically to prevent an
    357   earlier assignment to that variable from being dead-code eliminated.  For
    358   example, lowering an operation like ``x=cc?y:z`` may be done using x86's
    359   conditional move (cmov) instruction: ``mov z, x; cmov_cc y, x``.  Without a
    360   FakeUse of ``x`` between the two instructions, the liveness analysis pass may
    361   dead-code eliminate the first instruction.
    362 
    363 - The FakeKill instruction is added after a call instruction, and is a quick way
    364   of indicating that caller-save registers are invalidated.
    365 
    366 Pexe parsing
    367 ------------
    368 
    369 Subzero includes an integrated PNaCl bitcode parser for ``.pexe`` files.  It
    370 parses the ``.pexe`` file function by function, ultimately constructing an ICE
    371 CFG for each function.  After a function is parsed, its CFG is handed off to the
    372 translation phase.  The bitcode parser also parses global initializer data and
    373 hands it off to be translated to data sections in the object file.
    374 
    375 Subzero has another parsing strategy for testing/debugging.  LLVM libraries can
    376 be used to parse a module into LLVM IR (though very slowly relative to Subzero
    377 native parsing).  Then we iterate across the LLVM IR and construct high-level
    378 ICE, handing off each CFG to the translation phase.
    379 
    380 Overview of lowering
    381 --------------------
    382 
    383 In general, translation goes like this:
    384 
    385 - Parse the next function from the ``.pexe`` file and construct a CFG consisting
    386   of high-level ICE.
    387 
    388 - Do analysis passes and transformation passes on the high-level ICE, as
    389   desired.
    390 
    391 - Lower each high-level ICE instruction into a sequence of zero or more
    392   low-level ICE instructions.  Each high-level instruction is generally lowered
    393   independently, though the target lowering is allowed to look ahead in the
    394   CfgNode's instruction list if desired.
    395 
    396 - Do more analysis and transformation passes on the low-level ICE, as desired.
    397 
    398 - Assemble the low-level CFG into an ELF object file (alternatively, a textual
    399   assembly file that is later assembled by some external tool).
    400 
    401 - Repeat for all functions, and also produce object code for data such as global
    402   initializers and internal constant pools.
    403 
    404 Currently there are two optimization levels: ``O2`` and ``Om1``.  For ``O2``,
    405 the intention is to apply all available optimizations to get the best code
    406 quality (though the initial code quality goal is measured against LLVM's ``O0``
    407 code quality).  For ``Om1``, the intention is to apply as few optimizations as
    408 possible and produce code as quickly as possible, accepting poor code quality.
    409 ``Om1`` is short for "O-minus-one", i.e. "worse than O0", or in other words,
    410 "sub-zero".
    411 
    412 High-level debuggability of generated code is so far not a design requirement.
    413 Subzero doesn't really do transformations that would obfuscate debugging; the
    414 main thing might be that register allocation (including stack slot coalescing
    415 for stack-allocated variables whose live ranges don't overlap) may render a
    416 variable's value unobtainable after its live range ends.  This would not be an
    417 issue for ``Om1`` since it doesn't register-allocate program-level variables,
    418 nor does it coalesce stack slots.  That said, fully supporting debuggability
    419 would require a few additions:
    420 
    421 - DWARF support would need to be added to Subzero's ELF file emitter.  Subzero
    422   propagates global symbol names, local variable names, and function-internal
    423   label names that are present in the ``.pexe`` file.  This would allow a
    424   debugger to map addresses back to symbols in the ``.pexe`` file.
    425 
    426 - To map ``.pexe`` file symbols back to meaningful source-level symbol names,
    427   file names, line numbers, etc., Subzero would need to handle `LLVM bitcode
    428   metadata <http://llvm.org/docs/LangRef.html#metadata>`_ and ``llvm.dbg``
    429   `instrinsics<http://llvm.org/docs/LangRef.html#dbg-intrinsics>`_.
    430 
    431 - The PNaCl toolchain explicitly strips all this from the ``.pexe`` file, and so
    432   the toolchain would need to be modified to preserve it.
    433 
    434 Our experience so far is that ``Om1`` translates twice as fast as ``O2``, but
    435 produces code with one third the code quality.  ``Om1`` is good for testing and
    436 debugging -- during translation, it tends to expose errors in the basic lowering
    437 that might otherwise have been hidden by the register allocator or other
    438 optimization passes.  It also helps determine whether a code correctness problem
    439 is a fundamental problem in the basic lowering, or an error in another
    440 optimization pass.
    441 
    442 The implementation of target lowering also controls the recipe of passes used
    443 for ``Om1`` and ``O2`` translation.  For example, address mode inference may
    444 only be relevant for x86.
    445 
    446 Lowering strategy
    447 -----------------
    448 
    449 The core of Subzero's lowering from high-level ICE to low-level ICE is to lower
    450 each high-level instruction down to a sequence of low-level target-specific
    451 instructions, in a largely context-free setting.  That is, each high-level
    452 instruction conceptually has a simple template expansion into low-level
    453 instructions, and lowering can in theory be done in any order.  This may sound
    454 like a small effort, but quite a large number of templates may be needed because
    455 of the number of PNaCl types and instruction variants.  Furthermore, there may
    456 be optimized templates, e.g. to take advantage of operator commutativity (for
    457 example, ``x=x+1`` might allow a bettern lowering than ``x=1+x``).  This is
    458 similar to other template-based approaches in fast code generation or
    459 interpretation, though some decisions are deferred until after some global
    460 analysis passes, mostly related to register allocation, stack slot assignment,
    461 and specific choice of instruction variant and addressing mode.
    462 
    463 The key idea for a lowering template is to produce valid low-level instructions
    464 that are guaranteed to meet address mode and other structural requirements of
    465 the instruction set.  For example, on x86, the source operand of an integer
    466 store instruction must be an immediate or a physical register; a shift
    467 instruction's shift amount must be an immediate or in register ``%cl``; a
    468 function's integer return value is in ``%eax``; most x86 instructions are
    469 two-operand, in contrast to corresponding three-operand high-level instructions;
    470 etc.
    471 
    472 Because target lowering runs before register allocation, there is no way to know
    473 whether a given ``Ice::Variable`` operand lives on the stack or in a physical
    474 register.  When the low-level instruction calls for a physical register operand,
    475 the target lowering can create an infinite-weight Variable.  This tells the
    476 register allocator to assign infinite weight when making decisions, effectively
    477 guaranteeing some physical register.  Variables can also be pre-colored to a
    478 specific physical register (``cl`` in the shift example above), which also gives
    479 infinite weight.
    480 
    481 To illustrate, consider a high-level arithmetic instruction on 32-bit integer
    482 operands::
    483 
    484     A = B + C
    485 
    486 X86 target lowering might produce the following::
    487 
    488     T.inf = B  // mov instruction
    489     T.inf += C // add instruction
    490     A = T.inf  // mov instruction
    491 
    492 Here, ``T.inf`` is an infinite-weight temporary.  As long as ``T.inf`` has a
    493 physical register, the three lowered instructions are all encodable regardless
    494 of whether ``B`` and ``C`` are physical registers, memory, or immediates, and
    495 whether ``A`` is a physical register or in memory.
    496 
    497 In this example, ``A`` must be a Variable and one may be tempted to simplify the
    498 lowering sequence by setting ``A`` as infinite-weight and using::
    499 
    500         A = B  // mov instruction
    501         A += C // add instruction
    502 
    503 This has two problems.  First, if the original instruction was actually ``A =
    504 B + A``, the result would be incorrect.  Second, assigning ``A`` a physical
    505 register applies throughout ``A``'s entire live range.  This is probably not
    506 what is intended, and may ultimately lead to a failure to allocate a register
    507 for an infinite-weight variable.
    508 
    509 This style of lowering leads to many temporaries being generated, so in ``O2``
    510 mode, we rely on the register allocator to clean things up.  For example, in the
    511 example above, if ``B`` ends up getting a physical register and its live range
    512 ends at this instruction, the register allocator is likely to reuse that
    513 register for ``T.inf``.  This leads to ``T.inf=B`` being a redundant register
    514 copy, which is removed as an emission-time peephole optimization.
    515 
    516 O2 lowering
    517 -----------
    518 
    519 Currently, the ``O2`` lowering recipe is the following:
    520 
    521 - Loop nest analysis
    522 
    523 - Local common subexpression elimination
    524 
    525 - Address mode inference
    526 
    527 - Read-modify-write (RMW) transformation
    528 
    529 - Basic liveness analysis
    530 
    531 - Load optimization
    532 
    533 - Target lowering
    534 
    535 - Full liveness analysis
    536 
    537 - Register allocation
    538 
    539 - Phi instruction lowering (advanced)
    540 
    541 - Post-phi lowering register allocation
    542 
    543 - Branch optimization
    544 
    545 These passes are described in more detail below.
    546 
    547 Om1 lowering
    548 ------------
    549 
    550 Currently, the ``Om1`` lowering recipe is the following:
    551 
    552 - Phi instruction lowering (simple)
    553 
    554 - Target lowering
    555 
    556 - Register allocation (infinite-weight and pre-colored only)
    557 
    558 Optimization passes
    559 -------------------
    560 
    561 Liveness analysis
    562 ^^^^^^^^^^^^^^^^^
    563 
    564 Liveness analysis is a standard dataflow optimization, implemented as follows.
    565 For each node (basic block), its live-out set is computed as the union of the
    566 live-in sets of its successor nodes.  Then the node's instructions are processed
    567 in reverse order, updating the live set, until the beginning of the node is
    568 reached, and the node's live-in set is recorded.  If this iteration has changed
    569 the node's live-in set, the node's predecessors are marked for reprocessing.
    570 This continues until no more nodes need reprocessing.  If nodes are processed in
    571 reverse topological order, the number of iterations over the CFG is generally
    572 equal to the maximum loop nest depth.
    573 
    574 To implement this, each node records its live-in and live-out sets, initialized
    575 to the empty set.  Each instruction records which of its Variables' live ranges
    576 end in that instruction, initialized to the empty set.  A side effect of
    577 liveness analysis is dead instruction elimination.  Each instruction can be
    578 marked as tentatively dead, and after the algorithm converges, the tentatively
    579 dead instructions are permanently deleted.
    580 
    581 Optionally, after this liveness analysis completes, we can do live range
    582 construction, in which we calculate the live range of each variable in terms of
    583 instruction numbers.  A live range is represented as a union of segments, where
    584 the segment endpoints are instruction numbers.  Instruction numbers are required
    585 to be unique across the CFG, and monotonically increasing within a basic block.
    586 As a union of segments, live ranges can contain "gaps" and are therefore
    587 precise.  Because of SSA properties, a variable's live range can start at most
    588 once in a basic block, and can end at most once in a basic block.  Liveness
    589 analysis keeps track of which variable/instruction tuples begin live ranges and
    590 end live ranges, and combined with live-in and live-out sets, we can efficiently
    591 build up live ranges of all variables across all basic blocks.
    592 
    593 A lot of care is taken to try to make liveness analysis fast and efficient.
    594 Because of the lowering strategy, the number of variables is generally
    595 proportional to the number of instructions, leading to an O(N^2) complexity
    596 algorithm if implemented naively.  To improve things based on sparsity, we note
    597 that most variables are "local" and referenced in at most one basic block (in
    598 contrast to the "global" variables with multi-block usage), and therefore cannot
    599 be live across basic blocks.  Therefore, the live-in and live-out sets,
    600 typically represented as bit vectors, can be limited to the set of global
    601 variables, and the intra-block liveness bit vector can be compacted to hold the
    602 global variables plus the local variables for that block.
    603 
    604 Register allocation
    605 ^^^^^^^^^^^^^^^^^^^
    606 
    607 Subzero implements a simple linear-scan register allocator, based on the
    608 allocator described by Hanspeter Mssenbck and Michael Pfeiffer in `Linear Scan
    609 Register Allocation in the Context of SSA Form and Register Constraints
    610 <ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_.  This allocator has
    611 several nice features:
    612 
    613 - Live ranges are represented as unions of segments, as described above, rather
    614   than a single start/end tuple.
    615 
    616 - It allows pre-coloring of variables with specific physical registers.
    617 
    618 - It applies equally well to pre-lowered Phi instructions.
    619 
    620 The paper suggests an approach of aggressively coalescing variables across Phi
    621 instructions (i.e., trying to force Phi source and destination variables to have
    622 the same register assignment), but we reject that in favor of the more natural
    623 preference mechanism described below.
    624 
    625 We enhance the algorithm in the paper with the capability of automatic inference
    626 of register preference, and with the capability of allowing overlapping live
    627 ranges to safely share the same register in certain circumstances.  If we are
    628 considering register allocation for variable ``A``, and ``A`` has a single
    629 defining instruction ``A=B+C``, then the preferred register for ``A``, if
    630 available, would be the register assigned to ``B`` or ``C``, if any, provided
    631 that ``B`` or ``C``'s live range does not overlap ``A``'s live range.  In this
    632 way we infer a good register preference for ``A``.
    633 
    634 We allow overlapping live ranges to get the same register in certain cases.
    635 Suppose a high-level instruction like::
    636 
    637     A = unary_op(B)
    638 
    639 has been target-lowered like::
    640 
    641     T.inf = B
    642     A = unary_op(T.inf)
    643 
    644 Further, assume that ``B``'s live range continues beyond this instruction
    645 sequence, and that ``B`` has already been assigned some register.  Normally, we
    646 might want to infer ``B``'s register as a good candidate for ``T.inf``, but it
    647 turns out that ``T.inf`` and ``B``'s live ranges overlap, requiring them to have
    648 different registers.  But ``T.inf`` is just a read-only copy of ``B`` that is
    649 guaranteed to be in a register, so in theory these overlapping live ranges could
    650 safely have the same register.  Our implementation allows this overlap as long
    651 as ``T.inf`` is never modified within ``B``'s live range, and ``B`` is never
    652 modified within ``T.inf``'s live range.
    653 
    654 Subzero's register allocator can be run in 3 configurations.
    655 
    656 - Normal mode.  All Variables are considered for register allocation.  It
    657   requires full liveness analysis and live range construction as a prerequisite.
    658   This is used by ``O2`` lowering.
    659 
    660 - Minimal mode.  Only infinite-weight or pre-colored Variables are considered.
    661   All other Variables are stack-allocated.  It does not require liveness
    662   analysis; instead, it quickly scans the instructions and records first
    663   definitions and last uses of all relevant Variables, using that to construct a
    664   single-segment live range.  Although this includes most of the Variables, the
    665   live ranges are mostly simple, short, and rarely overlapping, which the
    666   register allocator handles efficiently.  This is used by ``Om1`` lowering.
    667 
    668 - Post-phi lowering mode.  Advanced phi lowering is done after normal-mode
    669   register allocation, and may result in new infinite-weight Variables that need
    670   registers.  One would like to just run something like minimal mode to assign
    671   registers to the new Variables while respecting existing register allocation
    672   decisions.  However, it sometimes happens that there are no free registers.
    673   In this case, some register needs to be forcibly spilled to the stack and
    674   temporarily reassigned to the new Variable, and reloaded at the end of the new
    675   Variable's live range.  The register must be one that has no explicit
    676   references during the Variable's live range.  Since Subzero currently doesn't
    677   track def/use chains (though it does record the CfgNode where a Variable is
    678   defined), we just do a brute-force search across the CfgNode's instruction
    679   list for the instruction numbers of interest.  This situation happens very
    680   rarely, so there's little point for now in improving its performance.
    681 
    682 The basic linear-scan algorithm may, as it proceeds, rescind an early register
    683 allocation decision, leaving that Variable to be stack-allocated.  Some of these
    684 times, it turns out that the Variable could have been given a different register
    685 without conflict, but by this time it's too late.  The literature recognizes
    686 this situation and describes "second-chance bin-packing", which Subzero can do.
    687 We can rerun the register allocator in a mode that respects existing register
    688 allocation decisions, and sometimes it finds new non-conflicting opportunities.
    689 In fact, we can repeatedly run the register allocator until convergence.
    690 Unfortunately, in the current implementation, these subsequent register
    691 allocation passes end up being extremely expensive.  This is because of the
    692 treatment of the "unhandled pre-colored" Variable set, which is normally very
    693 small but ends up being quite large on subsequent passes.  Its performance can
    694 probably be made acceptable with a better choice of data structures, but for now
    695 this second-chance mechanism is disabled.
    696 
    697 Future work is to implement LLVM's `Greedy
    698 <http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_
    699 register allocator as a replacement for the basic linear-scan algorithm, given
    700 LLVM's experience with its improvement in code quality.  (The blog post claims
    701 that the Greedy allocator also improved maintainability because a lot of hacks
    702 could be removed, but Subzero is probably not yet to that level of hacks, and is
    703 less likely to see that particular benefit.)
    704 
    705 Local common subexpression elimination
    706 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    707 
    708 The Local CSE implementation goes through each instruction and records a portion
    709 of each ``Seen`` instruction in a hashset-like container.  That portion consists
    710 of the entire instruction except for any dest variable. That means ``A = X + Y``
    711 and ``B = X + Y`` will be considered to be 'equal' for this purpose. This allows
    712 us to detect common subexpressions.
    713 
    714 Whenever a repetition is detected, the redundant variables are stored in a
    715 container mapping the replacee to the replacement. In the case above, it would
    716 be ``MAP[B] = A`` assuming ``B = X + Y`` comes after ``A = X + Y``.
    717 
    718 At any point if a variable that has an entry in the replacement table is
    719 encountered, it is replaced with the variable it is mapped to. This ensures that
    720 the redundant variables will not have any uses in the basic block, allowing
    721 dead code elimination to clean up the redundant instruction.
    722 
    723 With SSA, the information stored is never invalidated. However, non-SSA input is
    724 supported with the ``-lcse=no-ssa`` option. This has to maintain some extra
    725 dependency information to ensure proper invalidation on variable assignment.
    726 This is not rigorously tested because this pass is run at an early stage where
    727 it is safe to assume variables have a single definition. This is not enabled by
    728 default because it bumps the compile time overhead from 2% to 6%.
    729 
    730 Loop-invariant code motion
    731 ^^^^^^^^^^^^^^^^^^^^^^^^^^
    732 
    733 This pass utilizes the loop analysis information to hoist invariant instructions
    734 to loop pre-headers. A loop must have a single entry node (header) and that node
    735 must have a single external predecessor for this optimization to work, as it is
    736 currently implemented.
    737 
    738 The pass works by iterating over all instructions in the loop until the set of
    739 invariant instructions converges. In each iteration, a non-invariant instruction
    740 involving only constants or a variable known to be invariant is added to the
    741 result set. The destination variable of that instruction is added to the set of
    742 variables known to be invariant (which is initialized with the constant args).
    743 
    744 Improving the loop-analysis infrastructure is likely to have significant impact
    745 on this optimization. Inserting an extra node to act as the pre-header when the
    746 header has multiple incoming edges from outside could also be a good idea.
    747 Expanding the initial invariant variable set to contain all variables that do
    748 not have definitions inside the loop does not seem to improve anything.
    749 
    750 Short circuit evaluation
    751 ^^^^^^^^^^^^^^^^^^^^^^^^
    752 
    753 Short circuit evaluation splits nodes and introduces early jumps when the result
    754 of a logical operation can be determined early and there are no observable side
    755 effects of skipping the rest of the computation. The instructions considered
    756 backwards from the end of the basic blocks. When a definition of a variable
    757 involved in a conditional jump is found, an extra jump can be inserted in that
    758 location, moving the rest of the instructions in the node to a newly inserted
    759 node. Consider this example::
    760 
    761   __N :
    762     a = <something>
    763     Instruction 1 without side effect
    764     ... b = <something> ...
    765     Instruction N without side effect
    766     t1 = or a b
    767     br t1 __X __Y
    768 
    769 is transformed to::
    770 
    771   __N :
    772     a = <something>
    773     br a __X __N_ext
    774 
    775   __N_ext :
    776     Instruction 1 without side effect
    777     ... b = <something> ...
    778     Instruction N without side effect
    779     br b __X __Y
    780 
    781 The logic for AND is analogous, the only difference is that the early jump is
    782 facilitated by a ``false`` value instead of ``true``.
    783 
    784 Global Variable Splitting
    785 ^^^^^^^^^^^^^^^^^^^^^^^^^
    786 
    787 Global variable splitting (``-split-global-vars``) is run after register
    788 allocation. It works on the variables that did not manage to get registers (but
    789 are allowed to) and decomposes their live ranges into the individual segments
    790 (which span a single node at most). New variables are created (but not yet used)
    791 with these smaller live ranges and the register allocator is run again. This is
    792 not inefficient as the old variables that already had registers are now
    793 considered pre-colored.
    794 
    795 The new variables that get registers replace their parent variables for their
    796 portion of its (parent's) live range. A copy from the old variable to the new
    797 is introduced before the first use and the reverse after the last def in the
    798 live range.
    799 
    800 Basic phi lowering
    801 ^^^^^^^^^^^^^^^^^^
    802 
    803 The simplest phi lowering strategy works as follows (this is how LLVM ``-O0``
    804 implements it).  Consider this example::
    805 
    806     L1:
    807       ...
    808       br L3
    809     L2:
    810       ...
    811       br L3
    812     L3:
    813       A = phi [B, L1], [C, L2]
    814       X = phi [Y, L1], [Z, L2]
    815 
    816 For each destination of a phi instruction, we can create a temporary and insert
    817 the temporary's assignment at the end of the predecessor block::
    818 
    819     L1:
    820       ...
    821       A' = B
    822       X' = Y
    823       br L3
    824     L2:
    825       ...
    826       A' = C
    827       X' = Z
    828       br L3
    829     L2:
    830       A = A'
    831       X = X'
    832 
    833 This transformation is very simple and reliable.  It can be done before target
    834 lowering and register allocation, and it easily avoids the classic lost-copy and
    835 related problems.  ``Om1`` lowering uses this strategy.
    836 
    837 However, it has the disadvantage of initializing temporaries even for branches
    838 not taken, though that could be mitigated by splitting non-critical edges and
    839 putting assignments in the edge-split nodes.  Another problem is that without
    840 extra machinery, the assignments to ``A``, ``A'``, ``X``, and ``X'`` are given a
    841 specific ordering even though phi semantics are that the assignments are
    842 parallel or unordered.  This sometimes imposes false live range overlaps and
    843 leads to poorer register allocation.
    844 
    845 Advanced phi lowering
    846 ^^^^^^^^^^^^^^^^^^^^^
    847 
    848 ``O2`` lowering defers phi lowering until after register allocation to avoid the
    849 problem of false live range overlaps.  It works as follows.  We split each
    850 incoming edge and move the (parallel) phi assignments into the split nodes.  We
    851 linearize each set of assignments by finding a safe, topological ordering of the
    852 assignments, respecting register assignments as well.  For example::
    853 
    854     A = B
    855     X = Y
    856 
    857 Normally these assignments could be executed in either order, but if ``B`` and
    858 ``X`` are assigned the same physical register, we would want to use the above
    859 ordering.  Dependency cycles are broken by introducing a temporary.  For
    860 example::
    861 
    862     A = B
    863     B = A
    864 
    865 Here, a temporary breaks the cycle::
    866 
    867     t = A
    868     A = B
    869     B = t
    870 
    871 Finally, we use the existing target lowering to lower the assignments in this
    872 basic block, and once that is done for all basic blocks, we run the post-phi
    873 variant of register allocation on the edge-split basic blocks.
    874 
    875 When computing a topological order, we try to first schedule assignments whose
    876 source has a physical register, and last schedule assignments whose destination
    877 has a physical register.  This helps reduce register pressure.
    878 
    879 X86 address mode inference
    880 ^^^^^^^^^^^^^^^^^^^^^^^^^^
    881 
    882 We try to take advantage of the x86 addressing mode that includes a base
    883 register, an index register, an index register scale amount, and an immediate
    884 offset.  We do this through simple pattern matching.  Starting with a load or
    885 store instruction where the address is a variable, we initialize the base
    886 register to that variable, and look up the instruction where that variable is
    887 defined.  If that is an add instruction of two variables and the index register
    888 hasn't been set, we replace the base and index register with those two
    889 variables.  If instead it is an add instruction of a variable and a constant, we
    890 replace the base register with the variable and add the constant to the
    891 immediate offset.
    892 
    893 There are several more patterns that can be matched.  This pattern matching
    894 continues on the load or store instruction until no more matches are found.
    895 Because a program typically has few load and store instructions (not to be
    896 confused with instructions that manipulate stack variables), this address mode
    897 inference pass is fast.
    898 
    899 X86 read-modify-write inference
    900 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    901 
    902 A reasonably common bitcode pattern is a non-atomic update of a memory
    903 location::
    904 
    905     x = load addr
    906     y = add x, 1
    907     store y, addr
    908 
    909 On x86, with good register allocation, the Subzero passes described above
    910 generate code with only this quality::
    911 
    912     mov [%ebx], %eax
    913     add $1, %eax
    914     mov %eax, [%ebx]
    915 
    916 However, x86 allows for this kind of code::
    917 
    918     add $1, [%ebx]
    919 
    920 which requires fewer instructions, but perhaps more importantly, requires fewer
    921 physical registers.
    922 
    923 It's also important to note that this transformation only makes sense if the
    924 store instruction ends ``x``'s live range.
    925 
    926 Subzero's ``O2`` recipe includes an early pass to find read-modify-write (RMW)
    927 opportunities via simple pattern matching.  The only problem is that it is run
    928 before liveness analysis, which is needed to determine whether ``x``'s live
    929 range ends after the RMW.  Since liveness analysis is one of the most expensive
    930 passes, it's not attractive to run it an extra time just for RMW analysis.
    931 Instead, we essentially generate both the RMW and the non-RMW versions, and then
    932 during lowering, the RMW version deletes itself if it finds x still live.
    933 
    934 X86 compare-branch inference
    935 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    936 
    937 In the LLVM instruction set, the compare/branch pattern works like this::
    938 
    939     cond = icmp eq a, b
    940     br cond, target
    941 
    942 The result of the icmp instruction is a single bit, and a conditional branch
    943 tests that bit.  By contrast, most target architectures use this pattern::
    944 
    945     cmp a, b  // implicitly sets various bits of FLAGS register
    946     br eq, target  // branch on a particular FLAGS bit
    947 
    948 A naive lowering sequence conditionally sets ``cond`` to 0 or 1, then tests
    949 ``cond`` and conditionally branches.  Subzero has a pass that identifies
    950 boolean-based operations like this and folds them into a single
    951 compare/branch-like operation.  It is set up for more than just cmp/br though.
    952 Boolean producers include icmp (integer compare), fcmp (floating-point compare),
    953 and trunc (integer truncation when the destination has bool type).  Boolean
    954 consumers include branch, select (the ternary operator from the C language), and
    955 sign-extend and zero-extend when the source has bool type.
    956 
    957 Sandboxing
    958 ^^^^^^^^^^
    959 
    960 Native Client's sandbox model uses software fault isolation (SFI) to provide
    961 safety when running untrusted code in a browser or other environment.  Subzero
    962 implements Native Client's `sandboxing
    963 <https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
    964 to enable Subzero-translated executables to be run inside Chrome.  Subzero also
    965 provides a fairly simple framework for investigating alternative sandbox models
    966 or other restrictions on the sandbox model.
    967 
    968 Sandboxing in Subzero is not actually implemented as a separate pass, but is
    969 integrated into lowering and assembly.
    970 
    971 - Indirect branches, including the ret instruction, are masked to a bundle
    972   boundary and bundle-locked.
    973 
    974 - Call instructions are aligned to the end of the bundle so that the return
    975   address is bundle-aligned.
    976 
    977 - Indirect branch targets, including function entry and targets in a switch
    978   statement jump table, are bundle-aligned.
    979 
    980 - The intrinsic for reading the thread pointer is inlined appropriately.
    981 
    982 - For x86-64, non-stack memory accesses are with respect to the reserved sandbox
    983   base register.  We reduce the aggressiveness of address mode inference to
    984   leave room for the sandbox base register during lowering.  There are no memory
    985   sandboxing changes for x86-32.
    986 
    987 Code emission
    988 -------------
    989 
    990 Subzero's integrated assembler is derived from Dart's `assembler code
    991 <https://github.com/dart-lang/sdk/tree/master/runtime/vm>'_.  There is a pass
    992 that iterates through the low-level ICE instructions and invokes the relevant
    993 assembler functions.  Placeholders are added for later fixup of branch target
    994 offsets.  (Backward branches use short offsets if possible; forward branches
    995 generally use long offsets unless it is an intra-block branch of "known" short
    996 length.)  The assembler emits into a staging buffer.  Once emission into the
    997 staging buffer for a function is complete, the data is emitted to the output
    998 file as an ELF object file, and metadata such as relocations, symbol table, and
    999 string table, are accumulated for emission at the end.  Global data initializers
   1000 are emitted similarly.  A key point is that at this point, the staging buffer
   1001 can be deallocated, and only a minimum of data needs to held until the end.
   1002 
   1003 As a debugging alternative, Subzero can emit textual assembly code which can
   1004 then be run through an external assembler.  This is of course super slow, but
   1005 quite valuable when bringing up a new target.
   1006 
   1007 As another debugging option, the staging buffer can be emitted as textual
   1008 assembly, primarily in the form of ".byte" lines.  This allows the assembler to
   1009 be tested separately from the ELF related code.
   1010 
   1011 Memory management
   1012 -----------------
   1013 
   1014 Where possible, we allocate from a ``CfgLocalAllocator`` which derives from
   1015 LLVM's ``BumpPtrAllocator``.  This is an arena-style allocator where objects
   1016 allocated from the arena are never actually freed; instead, when the CFG
   1017 translation completes and the CFG is deleted, the entire arena memory is
   1018 reclaimed at once.  This style of allocation works well in an environment like a
   1019 compiler where there are distinct phases with only easily-identifiable objects
   1020 living across phases.  It frees the developer from having to manage object
   1021 deletion, and it amortizes deletion costs across just a single arena deletion at
   1022 the end of the phase.  Furthermore, it helps scalability by allocating entirely
   1023 from thread-local memory pools, and minimizing global locking of the heap.
   1024 
   1025 Instructions are probably the most heavily allocated complex class in Subzero.
   1026 We represent an instruction list as an intrusive doubly linked list, allocate
   1027 all instructions from the ``CfgLocalAllocator``, and we make sure each
   1028 instruction subclass is basically `POD
   1029 <http://en.cppreference.com/w/cpp/concept/PODType>`_ (Plain Old Data) with a
   1030 trivial destructor.  This way, when the CFG is finished, we don't need to
   1031 individually deallocate every instruction.  We do similar for Variables, which
   1032 is probably the second most popular complex class.
   1033 
   1034 There are some situations where passes need to use some `STL container class
   1035 <http://en.cppreference.com/w/cpp/container>`_.  Subzero has a way of using the
   1036 ``CfgLocalAllocator`` as the container allocator if this is needed.
   1037 
   1038 Multithreaded translation
   1039 -------------------------
   1040 
   1041 Subzero is designed to be able to translate functions in parallel.  With the
   1042 ``-threads=N`` command-line option, there is a 3-stage producer-consumer
   1043 pipeline:
   1044 
   1045 - A single thread parses the ``.pexe`` file and produces a sequence of work
   1046   units.  A work unit can be either a fully constructed CFG, or a set of global
   1047   initializers.  The work unit includes its sequence number denoting its parse
   1048   order.  Each work unit is added to the translation queue.
   1049 
   1050 - There are N translation threads that draw work units from the translation
   1051   queue and lower them into assembler buffers.  Each assembler buffer is added
   1052   to the emitter queue, tagged with its sequence number.  The CFG and its
   1053   ``CfgLocalAllocator`` are disposed of at this point.
   1054 
   1055 - A single thread draws assembler buffers from the emitter queue and appends to
   1056   the output file.  It uses the sequence numbers to reintegrate the assembler
   1057   buffers according to the original parse order, such that output order is
   1058   always deterministic.
   1059 
   1060 This means that with ``-threads=N``, there are actually ``N+1`` spawned threads
   1061 for a total of ``N+2`` execution threads, taking the parser and emitter threads
   1062 into account.  For the special case of ``N=0``, execution is entirely sequential
   1063 -- the same thread parses, translates, and emits, one function at a time.  This
   1064 is useful for performance measurements.
   1065 
   1066 Ideally, we would like to get near-linear scalability as the number of
   1067 translation threads increases.  We expect that ``-threads=1`` should be slightly
   1068 faster than ``-threads=0`` as the small amount of time spent parsing and
   1069 emitting is done largely in parallel with translation.  With perfect
   1070 scalability, we see ``-threads=N`` translating ``N`` times as fast as
   1071 ``-threads=1``, up until the point where parsing or emitting becomes the
   1072 bottleneck, or ``N+2`` exceeds the number of CPU cores.  In reality, memory
   1073 performance would become a bottleneck and efficiency might peak at, say, 75%.
   1074 
   1075 Currently, parsing takes about 11% of total sequential time.  If translation
   1076 scalability ever gets so fast and awesomely scalable that parsing becomes a
   1077 bottleneck, it should be possible to make parsing multithreaded as well.
   1078 
   1079 Internally, all shared, mutable data is held in the GlobalContext object, and
   1080 access to each field is guarded by a mutex.
   1081 
   1082 Security
   1083 --------
   1084 
   1085 Subzero includes a number of security features in the generated code, as well as
   1086 in the Subzero translator itself, which run on top of the existing Native Client
   1087 sandbox as well as Chrome's OS-level sandbox.
   1088 
   1089 Sandboxed translator
   1090 ^^^^^^^^^^^^^^^^^^^^
   1091 
   1092 When running inside the browser, the Subzero translator executes as sandboxed,
   1093 untrusted code that is initially checked by the validator, just like the
   1094 LLVM-based ``pnacl-llc`` translator.  As such, the Subzero binary should be no
   1095 more or less secure than the translator it replaces, from the point of view of
   1096 the Chrome sandbox.  That said, Subzero is much smaller than ``pnacl-llc`` and
   1097 was designed from the start with security in mind, so one expects fewer attacker
   1098 opportunities here.
   1099 
   1100 Code diversification
   1101 ^^^^^^^^^^^^^^^^^^^^
   1102 
   1103 `Return-oriented programming
   1104 <https://en.wikipedia.org/wiki/Return-oriented_programming>`_ (ROP) is a
   1105 now-common technique for starting with e.g. a known buffer overflow situation
   1106 and launching it into a deeper exploit.  The attacker scans the executable
   1107 looking for ROP gadgets, which are short sequences of code that happen to load
   1108 known values into known registers and then return.  An attacker who manages to
   1109 overwrite parts of the stack can overwrite it with carefully chosen return
   1110 addresses such that certain ROP gadgets are effectively chained together to set
   1111 up the register state as desired, finally returning to some code that manages to
   1112 do something nasty based on those register values.
   1113 
   1114 If there is a popular ``.pexe`` with a large install base, the attacker could
   1115 run Subzero on it and scan the executable for suitable ROP gadgets to use as
   1116 part of a potential exploit.  Note that if the trusted validator is working
   1117 correctly, these ROP gadgets are limited to starting at a bundle boundary and
   1118 cannot use the trick of finding a gadget that happens to begin inside another
   1119 instruction.  All the same, gadgets with these constraints still exist and the
   1120 attacker has access to them.  This is the attack model we focus most on --
   1121 protecting the user against misuse of a "trusted" developer's application, as
   1122 opposed to mischief from a malicious ``.pexe`` file.
   1123 
   1124 Subzero can mitigate these attacks to some degree through code diversification.
   1125 Specifically, we can apply some randomness to the code generation that makes ROP
   1126 gadgets less predictable.  This randomness can have some compile-time cost, and
   1127 it can affect the code quality; and some diversifications may be more effective
   1128 than others.  A more detailed treatment of hardening techniques may be found in
   1129 the Matasano report "`Attacking Clientside JIT Compilers
   1130 <https://www.nccgroup.trust/globalassets/resources/us/presentations/documents/attacking_clientside_jit_compilers_paper.pdf>`_".
   1131 
   1132 To evaluate diversification effectiveness, we use a third-party ROP gadget
   1133 finder and limit its results to bundle-aligned addresses.  For a given
   1134 diversification technique, we run it with a number of different random seeds,
   1135 find ROP gadgets for each version, and determine how persistent each ROP gadget
   1136 is across the different versions.  A gadget is persistent if the same gadget is
   1137 found at the same code address.  The best diversifications are ones with low
   1138 gadget persistence rates.
   1139 
   1140 Subzero implements 7 different diversification techniques.  Below is a
   1141 discussion of each technique, its effectiveness, and its cost.  The discussions
   1142 of cost and effectiveness are for a single diversification technique; the
   1143 translation-time costs for multiple techniques are additive, but the effects of
   1144 multiple techniques on code quality and effectiveness are not yet known.
   1145 
   1146 In Subzero's implementation, each randomization is "repeatable" in a sense.
   1147 Each pass that includes a randomization option gets its own private instance of
   1148 a random number generator (RNG).  The RNG is seeded with a combination of a
   1149 global seed, the pass ID, and the function's sequence number.  The global seed
   1150 is designed to be different across runs (perhaps based on the current time), but
   1151 for debugging, the global seed can be set to a specific value and the results
   1152 will be repeatable.
   1153 
   1154 Subzero-generated code is subject to diversification once per translation, and
   1155 then Chrome caches the diversified binary for subsequent executions.  An
   1156 attacker may attempt to run the binary multiple times hoping for
   1157 higher-probability combinations of ROP gadgets.  When the attacker guesses
   1158 wrong, a likely outcome is an application crash.  Chrome throttles creation of
   1159 crashy processes which reduces the likelihood of the attacker eventually gaining
   1160 a foothold.
   1161 
   1162 Constant blinding
   1163 ~~~~~~~~~~~~~~~~~
   1164 
   1165 Here, we prevent attackers from controlling large immediates in the text
   1166 (executable) section.  A random cookie is generated for each function, and if
   1167 the constant exceeds a specified threshold, the constant is obfuscated with the
   1168 cookie and equivalent code is generated.  For example, instead of this x86
   1169 instruction::
   1170 
   1171     mov $0x11223344, <%Reg/Mem>
   1172 
   1173 the following code might be generated::
   1174 
   1175     mov $(0x11223344+Cookie), %temp
   1176     lea -Cookie(%temp), %temp
   1177     mov %temp, <%Reg/Mem>
   1178 
   1179 The ``lea`` instruction is used rather than e.g. ``add``/``sub`` or ``xor``, to
   1180 prevent unintended effects on the flags register.
   1181 
   1182 This transformation has almost no effect on translation time, and about 1%
   1183 impact on code quality, depending on the threshold chosen.  It does little to
   1184 reduce gadget persistence, but it does remove a lot of potential opportunities
   1185 to construct intra-instruction ROP gadgets (which an attacker could use only if
   1186 a validator bug were discovered, since the Native Client sandbox and associated
   1187 validator force returns and other indirect branches to be to bundle-aligned
   1188 addresses).
   1189 
   1190 Constant pooling
   1191 ~~~~~~~~~~~~~~~~
   1192 
   1193 This is similar to constant blinding, in that large immediates are removed from
   1194 the text section.  In this case, each unique constant above the threshold is
   1195 stored in a read-only data section and the constant is accessed via a memory
   1196 load.  For the above example, the following code might be generated::
   1197 
   1198     mov $Label$1, %temp
   1199     mov %temp, <%Reg/Mem>
   1200 
   1201 This has a similarly small impact on translation time and ROP gadget
   1202 persistence, and a smaller (better) impact on code quality.  This is because it
   1203 uses fewer instructions, and in some cases good register allocation leads to no
   1204 increase in instruction count.  Note that this still gives an attacker some
   1205 limited amount of control over some text section values, unless we randomize the
   1206 constant pool layout.
   1207 
   1208 Static data reordering
   1209 ~~~~~~~~~~~~~~~~~~~~~~
   1210 
   1211 This transformation limits the attacker's ability to control bits in global data
   1212 address references.  It simply permutes the order in memory of global variables
   1213 and internal constant pool entries.  For the constant pool, we only permute
   1214 within a type (i.e., emit a randomized list of ints, followed by a randomized
   1215 list of floats, etc.) to maintain good packing in the face of alignment
   1216 constraints.
   1217 
   1218 As might be expected, this has no impact on code quality, translation time, or
   1219 ROP gadget persistence (though as above, it limits opportunities for
   1220 intra-instruction ROP gadgets with a broken validator).
   1221 
   1222 Basic block reordering
   1223 ~~~~~~~~~~~~~~~~~~~~~~
   1224 
   1225 Here, we randomize the order of basic blocks within a function, with the
   1226 constraint that we still want to maintain a topological order as much as
   1227 possible, to avoid making the code too branchy.
   1228 
   1229 This has no impact on code quality, and about 1% impact on translation time, due
   1230 to a separate pass to recompute layout.  It ends up having a huge effect on ROP
   1231 gadget persistence, tied for best with nop insertion, reducing ROP gadget
   1232 persistence to less than 5%.
   1233 
   1234 Function reordering
   1235 ~~~~~~~~~~~~~~~~~~~
   1236 
   1237 Here, we permute the order that functions are emitted, primarily to shift ROP
   1238 gadgets around to less predictable locations.  It may also change call address
   1239 offsets in case the attacker was trying to control that offset in the code.
   1240 
   1241 To control latency and memory footprint, we don't arbitrarily permute functions.
   1242 Instead, for some relatively small value of N, we queue up N assembler buffers,
   1243 and then emit the N functions in random order, and repeat until all functions
   1244 are emitted.
   1245 
   1246 Function reordering has no impact on translation time or code quality.
   1247 Measurements indicate that it reduces ROP gadget persistence to about 15%.
   1248 
   1249 Nop insertion
   1250 ~~~~~~~~~~~~~
   1251 
   1252 This diversification randomly adds a nop instruction after each regular
   1253 instruction, with some probability.  Nop instructions of different lengths may
   1254 be selected.  Nop instructions are never added inside a bundle_lock region.
   1255 Note that when sandboxing is enabled, nop instructions are already being added
   1256 for bundle alignment, so the diversification nop instructions may simply be
   1257 taking the place of alignment nop instructions, though distributed differently
   1258 through the bundle.
   1259 
   1260 In Subzero's currently implementation, nop insertion adds 3-5% to the
   1261 translation time, but this is probably because it is implemented as a separate
   1262 pass that adds actual nop instructions to the IR.  The overhead would probably
   1263 be a lot less if it were integrated into the assembler pass.  The code quality
   1264 is also reduced by 3-5%, making nop insertion the most expensive of the
   1265 diversification techniques.
   1266 
   1267 Nop insertion is very effective in reducing ROP gadget persistence, at the same
   1268 level as basic block randomization (less than 5%).  But given nop insertion's
   1269 impact on translation time and code quality, one would most likely prefer to use
   1270 basic block randomization instead (though the combined effects of the different
   1271 diversification techniques have not yet been studied).
   1272 
   1273 Register allocation randomization
   1274 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   1275 
   1276 In this diversification, the register allocator tries to make different but
   1277 mostly functionally equivalent choices, while maintaining stable code quality.
   1278 
   1279 A naive approach would be the following.  Whenever the allocator has more than
   1280 one choice for assigning a register, choose randomly among those options.  And
   1281 whenever there are no registers available and there is a tie for the
   1282 lowest-weight variable, randomly select one of the lowest-weight variables to
   1283 evict.  Because of the one-pass nature of the linear-scan algorithm, this
   1284 randomization strategy can have a large impact on which variables are ultimately
   1285 assigned registers, with a corresponding large impact on code quality.
   1286 
   1287 Instead, we choose an approach that tries to keep code quality stable regardless
   1288 of the random seed.  We partition the set of physical registers into equivalence
   1289 classes.  If a register is pre-colored in the function (i.e., referenced
   1290 explicitly by name), it forms its own equivalence class.  The remaining
   1291 registers are partitioned according to their combination of attributes such as
   1292 integer versus floating-point, 8-bit versus 32-bit, caller-save versus
   1293 callee-saved, etc.  Each equivalence class is randomly permuted, and the
   1294 complete permutation is applied to the final register assignments.
   1295 
   1296 Register randomization reduces ROP gadget persistence to about 10% on average,
   1297 though there tends to be fairly high variance across functions and applications.
   1298 This probably has to do with the set of restrictions in the x86-32 instruction
   1299 set and ABI, such as few general-purpose registers, ``%eax`` used for return
   1300 values, ``%edx`` used for division, ``%cl`` used for shifting, etc.  As
   1301 intended, register randomization has no impact on code quality, and a slight
   1302 (0.5%) impact on translation time due to an extra scan over the variables to
   1303 identify pre-colored registers.
   1304 
   1305 Fuzzing
   1306 ^^^^^^^
   1307 
   1308 We have started fuzz-testing the ``.pexe`` files input to Subzero, using a
   1309 combination of `afl-fuzz <http://lcamtuf.coredump.cx/afl/>`_, LLVM's `libFuzzer
   1310 <http://llvm.org/docs/LibFuzzer.html>`_, and custom tooling.  The purpose is to
   1311 find and fix cases where Subzero crashes or otherwise ungracefully fails on
   1312 unexpected inputs, and to do so automatically over a large range of unexpected
   1313 inputs.  By fixing bugs that arise from fuzz testing, we reduce the possibility
   1314 of an attacker exploiting these bugs.
   1315 
   1316 Most of the problems found so far are ones most appropriately handled in the
   1317 parser.  However, there have been a couple that have identified problems in the
   1318 lowering, or otherwise inappropriately triggered assertion failures and fatal
   1319 errors.  We continue to dig into this area.
   1320 
   1321 Future security work
   1322 ^^^^^^^^^^^^^^^^^^^^
   1323 
   1324 Subzero is well-positioned to explore other future security enhancements, e.g.:
   1325 
   1326 - Tightening the Native Client sandbox.  ABI changes, such as the previous work
   1327   on `hiding the sandbox base address
   1328   <https://docs.google.com/document/d/1eskaI4353XdsJQFJLRnZzb_YIESQx4gNRzf31dqXVG8>`_
   1329   in x86-64, are easy to experiment with in Subzero.
   1330 
   1331 - Making the executable code section read-only.  This would prevent a PNaCl
   1332   application from inspecting its own binary and trying to find ROP gadgets even
   1333   after code diversification has been performed.  It may still be susceptible to
   1334   `blind ROP <http://www.scs.stanford.edu/brop/bittau-brop.pdf>`_ attacks,
   1335   security is still overall improved.
   1336 
   1337 - Instruction selection diversification.  It may be possible to lower a given
   1338   instruction in several largely equivalent ways, which gives more opportunities
   1339   for code randomization.
   1340 
   1341 Chrome integration
   1342 ------------------
   1343 
   1344 Currently Subzero is available in Chrome for the x86-32 architecture, but under
   1345 a flag.  When the flag is enabled, Subzero is used when the `manifest file
   1346 <https://developer.chrome.com/native-client/reference/nacl-manifest-format>`_
   1347 linking to the ``.pexe`` file specifies the ``O0`` optimization level.
   1348 
   1349 The next step is to remove the flag, i.e. invoke Subzero as the only translator
   1350 for ``O0``-specified manifest files.
   1351 
   1352 Ultimately, Subzero might produce code rivaling LLVM ``O2`` quality, in which
   1353 case Subzero could be used for all PNaCl translation.
   1354 
   1355 Command line options
   1356 --------------------
   1357 
   1358 Subzero has a number of command-line options for debugging and diagnostics.
   1359 Among the more interesting are the following.
   1360 
   1361 - Using the ``-verbose`` flag, Subzero will dump the CFG, or produce other
   1362   diagnostic output, with various levels of detail after each pass.  Instruction
   1363   numbers can be printed or suppressed.  Deleted instructions can be printed or
   1364   suppressed (they are retained in the instruction list, as discussed earlier,
   1365   because they can help explain how lower-level instructions originated).
   1366   Liveness information can be printed when available.  Details of register
   1367   allocation can be printed as register allocator decisions are made.  And more.
   1368 
   1369 - Running Subzero with any level of verbosity produces an enormous amount of
   1370   output.  When debugging a single function, verbose output can be suppressed
   1371   except for a particular function.  The ``-verbose-focus`` flag suppresses
   1372   verbose output except for the specified function.
   1373 
   1374 - Subzero has a ``-timing`` option that prints a breakdown of pass-level timing
   1375   at exit.  Timing markers can be placed in the Subzero source code to demarcate
   1376   logical operations or passes of interest.  Basic timing information plus
   1377   call-stack type timing information is printed at the end.
   1378 
   1379 - Along with ``-timing``, the user can instead get a report on the overall
   1380   translation time for each function, to help focus on timing outliers.  Also,
   1381   ``-timing-focus`` limits the ``-timing`` reporting to a single function,
   1382   instead of aggregating pass timing across all functions.
   1383 
   1384 - The ``-szstats`` option reports various statistics on each function, such as
   1385   stack frame size, static instruction count, etc.  It may be helpful to track
   1386   these stats over time as Subzero is improved, as an approximate measure of
   1387   code quality.
   1388 
   1389 - The flag ``-asm-verbose``, in conjunction with emitting textual assembly
   1390   output, annotate the assembly output with register-focused liveness
   1391   information.  In particular, each basic block is annotated with which
   1392   registers are live-in and live-out, and each instruction is annotated with
   1393   which registers' and stack locations' live ranges end at that instruction.
   1394   This is really useful when studying the generated code to find opportunities
   1395   for code quality improvements.
   1396 
   1397 Testing and debugging
   1398 ---------------------
   1399 
   1400 LLVM lit tests
   1401 ^^^^^^^^^^^^^^
   1402 
   1403 For basic testing, Subzero uses LLVM's `lit
   1404 <http://llvm.org/docs/CommandGuide/lit.html>`_ framework for running tests.  We
   1405 have a suite of hundreds of small functions where we test for particular
   1406 assembly code patterns across different target architectures.
   1407 
   1408 Cross tests
   1409 ^^^^^^^^^^^
   1410 
   1411 Unfortunately, the lit tests don't do a great job of precisely testing the
   1412 correctness of the output.  Much better are the cross tests, which are execution
   1413 tests that compare Subzero and ``pnacl-llc`` translated bitcode across a wide
   1414 variety of interesting inputs.  Each cross test consists of a set of C, C++,
   1415 and/or low-level bitcode files.  The C and C++ source files are compiled down to
   1416 bitcode.  The bitcode files are translated by ``pnacl-llc`` and also by Subzero.
   1417 Subzero mangles global symbol names with a special prefix to avoid duplicate
   1418 symbol errors.  A driver program invokes both versions on a large set of
   1419 interesting inputs, and reports when the Subzero and ``pnacl-llc`` results
   1420 differ.  Cross tests turn out to be an excellent way of testing the basic
   1421 lowering patterns, but they are less useful for testing more global things like
   1422 liveness analysis and register allocation.
   1423 
   1424 Bisection debugging
   1425 ^^^^^^^^^^^^^^^^^^^
   1426 
   1427 Sometimes with a new application, Subzero will end up producing incorrect code
   1428 that either crashes at runtime or otherwise produces the wrong results.  When
   1429 this happens, we need to narrow it down to a single function (or small set of
   1430 functions) that yield incorrect behavior.  For this, we have a bisection
   1431 debugging framework.  Here, we initially translate the entire application once
   1432 with Subzero and once with ``pnacl-llc``.  We then use ``objdump`` to
   1433 selectively weaken symbols based on a whitelist or blacklist provided on the
   1434 command line.  The two object files can then be linked together without link
   1435 errors, with the desired version of each method "winning".  Then the binary is
   1436 tested, and bisection proceeds based on whether the binary produces correct
   1437 output.
   1438 
   1439 When the bisection completes, we are left with a minimal set of
   1440 Subzero-translated functions that cause the failure.  Usually it is a single
   1441 function, though sometimes it might require a combination of several functions
   1442 to cause a failure; this may be due to an incorrect call ABI, for example.
   1443 However, Murphy's Law implies that the single failing function is enormous and
   1444 impractical to debug.  In that case, we can restart the bisection, explicitly
   1445 blacklisting the enormous function, and try to find another candidate to debug.
   1446 (Future work is to automate this to find all minimal sets of functions, so that
   1447 debugging can focus on the simplest example.)
   1448 
   1449 Fuzz testing
   1450 ^^^^^^^^^^^^
   1451 
   1452 As described above, we try to find internal Subzero bugs using fuzz testing
   1453 techniques.
   1454 
   1455 Sanitizers
   1456 ^^^^^^^^^^
   1457 
   1458 Subzero can be built with `AddressSanitizer
   1459 <http://clang.llvm.org/docs/AddressSanitizer.html>`_ (ASan) or `ThreadSanitizer
   1460 <http://clang.llvm.org/docs/ThreadSanitizer.html>`_ (TSan) support.  This is
   1461 done using something as simple as ``make ASAN=1`` or ``make TSAN=1``.  So far,
   1462 multithreading has been simple enough that TSan hasn't found any bugs, but ASan
   1463 has found at least one memory leak which was subsequently fixed.
   1464 `UndefinedBehaviorSanitizer
   1465 <http://clang.llvm.org/docs/UsersManual.html#controlling-code-generation>`_
   1466 (UBSan) support is in progress.  `Control flow integrity sanitization
   1467 <http://clang.llvm.org/docs/ControlFlowIntegrity.html>`_ is also under
   1468 consideration.
   1469 
   1470 Current status
   1471 ==============
   1472 
   1473 Target architectures
   1474 --------------------
   1475 
   1476 Subzero is currently more or less complete for the x86-32 target.  It has been
   1477 refactored and extended to handle x86-64 as well, and that is mostly complete at
   1478 this point.
   1479 
   1480 ARM32 work is in progress.  It currently lacks the testing level of x86, at
   1481 least in part because Subzero's register allocator needs modifications to handle
   1482 ARM's aliasing of floating point and vector registers.  Specifically, a 64-bit
   1483 register is actually a gang of two consecutive and aligned 32-bit registers, and
   1484 a 128-bit register is a gang of 4 consecutive and aligned 32-bit registers.
   1485 ARM64 work has not started; when it does, it will be native-only since the
   1486 Native Client sandbox model, validator, and other tools have never been defined.
   1487 
   1488 An external contributor is adding MIPS support, in most part by following the
   1489 ARM work.
   1490 
   1491 Translator performance
   1492 ----------------------
   1493 
   1494 Single-threaded translation speed is currently about 5 the ``pnacl-llc``
   1495 translation speed.  For a large ``.pexe`` file, the time breaks down as:
   1496 
   1497 - 11% for parsing and initial IR building
   1498 
   1499 - 4% for emitting to /dev/null
   1500 
   1501 - 27% for liveness analysis (two liveness passes plus live range construction)
   1502 
   1503 - 15% for linear-scan register allocation
   1504 
   1505 - 9% for basic lowering
   1506 
   1507 - 10% for advanced phi lowering
   1508 
   1509 - ~11% for other minor analysis
   1510 
   1511 - ~10% measurement overhead to acquire these numbers
   1512 
   1513 Some improvements could undoubtedly be made, but it will be hard to increase the
   1514 speed to 10 of ``pnacl-llc`` while keeping acceptable code quality.  With
   1515 ``-Om1`` (lack of) optimization, we do actually achieve roughly 10
   1516 ``pnacl-llc`` translation speed, but code quality drops by a factor of 3.
   1517 
   1518 Code quality
   1519 ------------
   1520 
   1521 Measured across 16 components of spec2k, Subzero's code quality is uniformly
   1522 better than ``pnacl-llc`` ``-O0`` code quality, and in many cases solidly
   1523 between ``pnacl-llc`` ``-O0`` and ``-O2``.
   1524 
   1525 Translator size
   1526 ---------------
   1527 
   1528 When built in MINIMAL mode, the x86-64 native translator size for the x86-32
   1529 target is about 700 KB, not including the size of functions referenced in
   1530 dynamically-linked libraries.  The sandboxed version of Subzero is a bit over 1
   1531 MB, and it is statically linked and also includes nop padding for bundling as
   1532 well as indirect branch masking.
   1533 
   1534 Translator memory footprint
   1535 ---------------------------
   1536 
   1537 It's hard to draw firm conclusions about memory footprint, since the footprint
   1538 is at least proportional to the input function size, and there is no real limit
   1539 on the size of functions in the ``.pexe`` file.
   1540 
   1541 That said, we looked at the memory footprint over time as Subzero translated
   1542 ``pnacl-llc.pexe``, which is the largest ``.pexe`` file (7.2 MB) at our
   1543 disposal.  One of LLVM's libraries that Subzero uses can report the current
   1544 malloc heap usage.  With single-threaded translation, Subzero tends to hover
   1545 around 15 MB of memory usage.  There are a couple of monstrous functions where
   1546 Subzero grows to around 100 MB, but then it drops back down after those
   1547 functions finish translating.  In contrast, ``pnacl-llc`` grows larger and
   1548 larger throughout translation, reaching several hundred MB by the time it
   1549 completes.
   1550 
   1551 It's a bit more interesting when we enable multithreaded translation.  When
   1552 there are N translation threads, Subzero implements a policy that limits the
   1553 size of the translation queue to N entries -- if it is "full" when the parser
   1554 tries to add a new CFG, the parser blocks until one of the translation threads
   1555 removes a CFG.  This means the number of in-memory CFGs can (and generally does)
   1556 reach 2*N+1, and so the memory footprint rises in proportion to the number of
   1557 threads.  Adding to the pressure is the observation that the monstrous functions
   1558 also take proportionally longer time to translate, so there's a good chance many
   1559 of the monstrous functions will be active at the same time with multithreaded
   1560 translation.  As a result, for N=32, Subzero's memory footprint peaks at about
   1561 260 MB, but drops back down as the large functions finish translating.
   1562 
   1563 If this peak memory size becomes a problem, it might be possible for the parser
   1564 to resequence the functions to try to spread out the larger functions, or to
   1565 throttle the translation queue to prevent too many in-flight large functions.
   1566 It may also be possible to throttle based on memory pressure signaling from
   1567 Chrome.
   1568 
   1569 Translator scalability
   1570 ----------------------
   1571 
   1572 Currently scalability is "not very good".  Multiple translation threads lead to
   1573 faster translation, but not to the degree desired.  We haven't dug in to
   1574 investigate yet.
   1575 
   1576 There are a few areas to investigate.  First, there may be contention on the
   1577 constant pool, which all threads access, and which requires locked access even
   1578 for reading.  This could be mitigated by keeping a CFG-local cache of the most
   1579 common constants.
   1580 
   1581 Second, there may be contention on memory allocation.  While almost all CFG
   1582 objects are allocated from the CFG-local allocator, some passes use temporary
   1583 STL containers that use the default allocator, which may require global locking.
   1584 This could be mitigated by switching these to the CFG-local allocator.
   1585 
   1586 Third, multithreading may make the default allocator strategy more expensive.
   1587 In a single-threaded environment, a pass will allocate its containers, run the
   1588 pass, and deallocate the containers.  This results in stack-like allocation
   1589 behavior and makes the heap free list easier to manage, with less heap
   1590 fragmentation.  But when multithreading is added, the allocations and
   1591 deallocations become much less stack-like, making allocation and deallocation
   1592 operations individually more expensive.  Again, this could be mitigated by
   1593 switching these to the CFG-local allocator.
   1594