===========================================
Control Flow Integrity Design Documentation
===========================================

This page documents the design of the :doc:`ControlFlowIntegrity` schemes
supported by Clang.

Forward-Edge CFI for Virtual Calls
==================================

This scheme works by allocating, for each static type used to make a virtual
call, a region of read-only storage in the object file holding a bit vector
that maps onto the region of storage used for those virtual tables. Each
set bit in the bit vector corresponds to the `address point`_ for a virtual
table compatible with the static type for which the bit vector is being built.

For example, consider the following three C++ classes:

.. code-block:: c++

  struct A {
    virtual void f1();
    virtual void f2();
    virtual void f3();
  };

  struct B : A {
    virtual void f1();
    virtual void f2();
    virtual void f3();
  };

  struct C : A {
    virtual void f1();
    virtual void f2();
    virtual void f3();
  };

The scheme will cause the virtual tables for A, B and C to be laid out
consecutively:

.. csv-table:: Virtual Table Layout for A, B, C
  :header: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14

  A::offset-to-top, &A::rtti, &A::f1, &A::f2, &A::f3, B::offset-to-top, &B::rtti, &B::f1, &B::f2, &B::f3, C::offset-to-top, &C::rtti, &C::f1, &C::f2, &C::f3

The bit vectors for static types A, B and C will look like this:

.. csv-table:: Bit Vectors for A, B, C
  :header: Class, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14

  A, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0
  B, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0
  C, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0

Bit vectors are represented in the object file as byte arrays. By loading
from indexed offsets into the byte array and applying a mask, a program can
test bits from the bit vector with a relatively short instruction sequence.
Bit vectors may overlap so long as they use different bits. For the full
details, see the `ByteArrayBuilder`_ class.

In this case, assuming A is laid out at offset 0 in bit 0, B at offset 0 in
bit 1 and C at offset 0 in bit 2, the byte array would look like this:

.. code-block:: c++

  char bits[] = { 0, 0, 1, 0, 0, 0, 3, 0, 0, 0, 0, 5, 0, 0 };

To emit a virtual call, the compiler will assemble code that checks that
the object's virtual table pointer is in bounds and correctly aligned, and
that the relevant bit is set in the bit vector.
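
In C++ terms the emitted check is roughly equivalent to the following sketch
(the names used here are purely illustrative, 8-byte pointer alignment is
assumed, and the compiler emits this logic inline rather than calling a
function):

.. code-block:: c++

  #include <cstdint>

  // Illustrative only: RegionStart is the address corresponding to the first
  // byte of the bit vector, RegionSize is the number of bytes it covers, Bits
  // is the byte array shown above and BitMask selects this static type's bit.
  bool VPtrIsValid(uint64_t VPtr, uint64_t RegionStart, uint64_t RegionSize,
                   const unsigned char *Bits, unsigned char BitMask) {
    uint64_t Diff = VPtr - RegionStart;
    // Rotate right by log2(alignment): a misaligned pointer moves its low
    // bits into the high bits, so the range comparison below rejects it.
    uint64_t Index = (Diff >> 3) | (Diff << 61);
    if (Index >= RegionSize)
      return false;                      // out of range or misaligned
    return (Bits[Index] & BitMask) != 0; // bit vector membership test
  }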

For example, on x86 a typical virtual call may look like this:

.. code-block:: none

  ca7fbb:       48 8b 0f                mov    (%rdi),%rcx
  ca7fbe:       48 8d 15 c3 42 fb 07    lea    0x7fb42c3(%rip),%rdx
  ca7fc5:       48 89 c8                mov    %rcx,%rax
  ca7fc8:       48 29 d0                sub    %rdx,%rax
  ca7fcb:       48 c1 c0 3d             rol    $0x3d,%rax
  ca7fcf:       48 3d 7f 01 00 00       cmp    $0x17f,%rax
  ca7fd5:       0f 87 36 05 00 00       ja     ca8511
  ca7fdb:       48 8d 15 c0 0b f7 06    lea    0x6f70bc0(%rip),%rdx
  ca7fe2:       f6 04 10 10             testb  $0x10,(%rax,%rdx,1)
  ca7fe6:       0f 84 25 05 00 00       je     ca8511
  ca7fec:       ff 91 98 00 00 00       callq  *0x98(%rcx)
    [...]
  ca8511:       0f 0b                   ud2

The compiler relies on co-operation from the linker in order to assemble
the bit vectors for the whole program. It currently does this using LLVM's
`type metadata`_ mechanism together with link-time optimization.

.. _address point: https://mentorembedded.github.io/cxx-abi/abi.html#vtable-general
.. _type metadata: http://llvm.org/docs/TypeMetadata.html
.. _ByteArrayBuilder: http://llvm.org/docs/doxygen/html/structllvm_1_1ByteArrayBuilder.html

Optimizations
-------------

The scheme described above is the fully general variant. Most of the time
we are able to apply one or more of the following optimizations to improve
binary size or performance.

In fact, if you try the above example with the current version of the
compiler, you will probably find that it will not use the described virtual
table layout or machine instructions. Some of the optimizations we are about
to introduce cause the compiler to use a different layout or a different
sequence of machine instructions.

Stripping Leading/Trailing Zeros in Bit Vectors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If a bit vector contains leading or trailing zeros, we can strip them from
the vector. The compiler will emit code to check if the pointer is in range
of the region covered by ones, and perform the bit vector check using a
truncated version of the bit vector. For example, the bit vectors for our
example class hierarchy will be emitted like this:

.. csv-table:: Bit Vectors for A, B, C
  :header: Class, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14

  A,  ,  , 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,  ,
  B,  ,  ,  ,  ,  ,  ,  , 1,  ,  ,  ,  ,  ,  ,
  C,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  , 1,  ,

Short Inline Bit Vectors
~~~~~~~~~~~~~~~~~~~~~~~~

If the vector is sufficiently short, we can represent it as an inline constant
on x86. This saves us a few instructions when reading the correct element
of the bit vector.
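
Expressed in C++, the 32-bit example below boils down to a check of the
following shape (a sketch only; the constants are taken from that example,
where four 8-byte-aligned slots are covered and the inline bit pattern is
``0x9``):

.. code-block:: c++

  #include <cstdint>

  // Illustrative only. Diff is the distance from the start of the covered
  // region to the virtual table pointer; 3 = log2(8-byte alignment).
  bool InlineBitVectorCheck(uint64_t Diff) {
    uint64_t Index = (Diff >> 3) | (Diff << 61); // rotate right by 3
    return Index <= 3 && ((0x9u >> Index) & 1);
  }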

If the bit vector fits in 32 bits, the code looks like this:

.. code-block:: none

     dc2:       48 8b 03                mov    (%rbx),%rax
     dc5:       48 8d 15 14 1e 00 00    lea    0x1e14(%rip),%rdx
     dcc:       48 89 c1                mov    %rax,%rcx
     dcf:       48 29 d1                sub    %rdx,%rcx
     dd2:       48 c1 c1 3d             rol    $0x3d,%rcx
     dd6:       48 83 f9 03             cmp    $0x3,%rcx
     dda:       77 2f                   ja     e0b <main+0x9b>
     ddc:       ba 09 00 00 00          mov    $0x9,%edx
     de1:       0f a3 ca                bt     %ecx,%edx
     de4:       73 25                   jae    e0b <main+0x9b>
     de6:       48 89 df                mov    %rbx,%rdi
     de9:       ff 10                   callq  *(%rax)
    [...]
     e0b:       0f 0b                   ud2

Or if the bit vector fits in 64 bits:

.. code-block:: none

    11a6:       48 8b 03                mov    (%rbx),%rax
    11a9:       48 8d 15 d0 28 00 00    lea    0x28d0(%rip),%rdx
    11b0:       48 89 c1                mov    %rax,%rcx
    11b3:       48 29 d1                sub    %rdx,%rcx
    11b6:       48 c1 c1 3d             rol    $0x3d,%rcx
    11ba:       48 83 f9 2a             cmp    $0x2a,%rcx
    11be:       77 35                   ja     11f5 <main+0xb5>
    11c0:       48 ba 09 00 00 00 00    movabs $0x40000000009,%rdx
    11c7:       04 00 00
    11ca:       48 0f a3 ca             bt     %rcx,%rdx
    11ce:       73 25                   jae    11f5 <main+0xb5>
    11d0:       48 89 df                mov    %rbx,%rdi
    11d3:       ff 10                   callq  *(%rax)
    [...]
    11f5:       0f 0b                   ud2

If the bit vector consists of a single bit, there is only one possible
virtual table, and the check can consist of a single equality comparison:

.. code-block:: none

     9a2:   48 8b 03                mov    (%rbx),%rax
     9a5:   48 8d 0d a4 13 00 00    lea    0x13a4(%rip),%rcx
     9ac:   48 39 c8                cmp    %rcx,%rax
     9af:   75 25                   jne    9d6 <main+0x86>
     9b1:   48 89 df                mov    %rbx,%rdi
     9b4:   ff 10                   callq  *(%rax)
     [...]
     9d6:   0f 0b                   ud2

Virtual Table Layout
~~~~~~~~~~~~~~~~~~~~

The compiler lays out classes of disjoint hierarchies in separate regions
of the object file. At worst, bit vectors in disjoint hierarchies only
need to cover their disjoint hierarchy. But the closer that classes in
sub-hierarchies are laid out to each other, the smaller the bit vectors for
those sub-hierarchies need to be (see "Stripping Leading/Trailing Zeros in Bit
Vectors" above). The `GlobalLayoutBuilder`_ class is responsible for laying
out the globals efficiently to minimize the sizes of the underlying bitsets.

.. _GlobalLayoutBuilder: http://llvm.org/viewvc/llvm-project/llvm/trunk/include/llvm/Transforms/IPO/LowerTypeTests.h?view=markup

Alignment
~~~~~~~~~

If all gaps between address points in a particular bit vector are multiples
of powers of 2, the compiler can compress the bit vector by strengthening
the alignment requirements of the virtual table pointer. For example, given
this class hierarchy:

.. code-block:: c++

  struct A {
    virtual void f1();
    virtual void f2();
  };

  struct B : A {
    virtual void f1();
    virtual void f2();
    virtual void f3();
    virtual void f4();
    virtual void f5();
    virtual void f6();
  };

  struct C : A {
    virtual void f1();
    virtual void f2();
  };

The virtual tables will be laid out like this:

.. csv-table:: Virtual Table Layout for A, B, C
  :header: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

  A::offset-to-top, &A::rtti, &A::f1, &A::f2, B::offset-to-top, &B::rtti, &B::f1, &B::f2, &B::f3, &B::f4, &B::f5, &B::f6, C::offset-to-top, &C::rtti, &C::f1, &C::f2

Notice that the address points for A are spaced 4 words apart. This lets us
emit a compressed bit vector for A that looks like this:

.. csv-table::
  :header: 2, 6, 10, 14

  1, 1, 0, 1

At call sites, the compiler will strengthen the alignment requirements by
using a different rotate count. For example, on a 64-bit machine where the
address points are 4-word aligned (as in A from our example), the ``rol``
instruction may look like this:

.. code-block:: none

     dd2:       48 c1 c1 3b             rol    $0x3b,%rcx
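
The relationship between the alignment and the rotate count is, roughly
(a sketch; ``RotateCount`` is an illustrative helper, not a compiler API):

.. code-block:: c++

  // Rotating left by 64 - log2(alignment) is the same as rotating right by
  // log2(alignment), so any low bits that a correctly aligned pointer would
  // have as zero end up in the high bits and make the range check fail.
  unsigned RotateCount(unsigned AlignmentInBytes) {
    return 64 - __builtin_ctz(AlignmentInBytes); // 32-byte alignment -> 0x3b
  }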

Padding to Powers of 2
~~~~~~~~~~~~~~~~~~~~~~

Of course, this alignment scheme works best if the address points are
in fact aligned correctly. To make this more likely to happen, we insert
padding between virtual tables that in many cases aligns address points to
a power of 2. Specifically, our padding aligns virtual tables to the next
highest power of 2 bytes; because address points for specific base classes
normally appear at fixed offsets within the virtual table, this normally
has the effect of aligning the address points as well.

This scheme introduces tradeoffs between decreased space overhead for
instructions and bit vectors and increased overhead in the form of padding. We
therefore limit the amount of padding so that we align to no more than 128
bytes. This number was found experimentally to provide a good tradeoff.
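
As a sketch, the padding rule described above amounts to something like the
following (``VTableAlignment`` is an illustrative helper, not part of the
implementation):

.. code-block:: c++

  #include <cstdint>

  // Round a virtual table's alignment up to the next power of 2 that covers
  // its size, but never pad to more than 128 bytes.
  uint64_t VTableAlignment(uint64_t VTableSizeInBytes) {
    uint64_t Align = 1;
    while (Align < VTableSizeInBytes && Align < 128)
      Align *= 2;
    return Align;
  }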

Eliminating Bit Vector Checks for All-Ones Bit Vectors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If the bit vector is all ones, the bit vector check is redundant; we simply
need to check that the address is in range and well aligned. This is more
likely to occur if the virtual tables are padded.

Forward-Edge CFI for Indirect Function Calls
============================================

Under forward-edge CFI for indirect function calls, each unique function
type has its own bit vector, and at each call site we need to check that the
function pointer is a member of the function type's bit vector. This scheme
works in a similar way to forward-edge CFI for virtual calls, the distinction
being that we need to build bit vectors of function entry points rather than
of virtual tables.

Unlike when re-arranging global variables, we cannot re-arrange functions
in a particular order and base our calculations on the layout of the
functions' entry points, as we have no idea how large a particular function
will end up being (the function sizes could even depend on how we arrange
the functions). Instead, we build a jump table: a block of code consisting
of one branch instruction for each of the functions in the bit set, each
branching to its target function. We then redirect any taken function
addresses to the corresponding jump table entry. In this way, the distance
between function entry points is predictable and controllable. In the object
file's symbol table, the symbols for the target functions also refer to the
jump table entries, so that addresses taken outside the module will pass
any verification done inside the module.

In more concrete terms, suppose we have three functions ``f``, ``g``,
``h`` which are all of the same type, and a function ``foo`` that takes
their addresses:

.. code-block:: none

  f:
  mov 0, %eax
  ret

  g:
  mov 1, %eax
  ret

  h:
  mov 2, %eax
  ret

  foo:
  mov f, %eax
  mov g, %edx
  mov h, %ecx
  ret

Our jump table will (conceptually) look like this:

.. code-block:: none

  f:
  jmp .Ltmp0 ; 5 bytes
  int3       ; 1 byte
  int3       ; 1 byte
  int3       ; 1 byte

  g:
  jmp .Ltmp1 ; 5 bytes
  int3       ; 1 byte
  int3       ; 1 byte
  int3       ; 1 byte

  h:
  jmp .Ltmp2 ; 5 bytes
  int3       ; 1 byte
  int3       ; 1 byte
  int3       ; 1 byte

  .Ltmp0:
  mov 0, %eax
  ret

  .Ltmp1:
  mov 1, %eax
  ret

  .Ltmp2:
  mov 2, %eax
  ret

  foo:
  mov f, %eax
  mov g, %edx
  mov h, %ecx
  ret

Because the addresses of ``f``, ``g``, ``h`` are evenly spaced by a power of
2, and function types do not overlap (unlike class types with base classes),
we can normally apply the `Alignment`_ and `Eliminating Bit Vector Checks
for All-Ones Bit Vectors`_ optimizations, thus simplifying the check at each
call site to a range and alignment check.
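
For the three functions above, the simplified call site check is roughly the
following (a sketch only; ``JumpTableStart`` is an illustrative symbol for
the start of the jump table, whose entries are 8 bytes each):

.. code-block:: c++

  #include <cstdint>

  extern char JumpTableStart[]; // illustrative: start of the jump table

  bool ValidTargetOfThisType(void (*F)()) {
    uint64_t Diff = reinterpret_cast<uint64_t>(F) -
                    reinterpret_cast<uint64_t>(JumpTableStart);
    // Rotate right by log2(8) so that targets not aligned to an entry
    // boundary fail the range check below.
    uint64_t Index = (Diff >> 3) | (Diff << 61);
    return Index < 3; // three functions of this type: f, g and h
  }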

Shared library support
======================

**EXPERIMENTAL**

The basic CFI mode described above assumes that the application is a
monolithic binary; at least, that all possible virtual/indirect call
targets and the entire class hierarchy are known at link time. The
cross-DSO mode, enabled with **-f[no-]sanitize-cfi-cross-dso**, relaxes
this requirement by allowing virtual and indirect calls to cross the
DSO boundary.

Assume the following setup: the binary consists of several instrumented
and several uninstrumented DSOs, some of which may be dlopen-ed/dlclose-d
periodically, even frequently.

  - Calls made from uninstrumented DSOs are not checked and just work.
  - Calls inside any instrumented DSO are fully protected.
  - Calls between different instrumented DSOs are also protected, with
    a performance penalty (in addition to the monolithic CFI overhead).
  - Calls from an instrumented DSO to an uninstrumented one are
    unchecked and just work, with a performance penalty.
  - Calls from an instrumented DSO to a target outside of any known DSO
    are detected as CFI violations.

In the monolithic scheme, a call site is instrumented as

.. code-block:: none

   if (!InlinedFastCheck(f))
     abort();
   call *f

In the cross-DSO scheme, it becomes

.. code-block:: none

   if (!InlinedFastCheck(f))
     __cfi_slowpath(CallSiteTypeId, f);
   call *f

CallSiteTypeId
--------------

``CallSiteTypeId`` is a stable process-wide identifier of the call-site
type. For a virtual call site, the type in question is the class type; for
an indirect function call, it is the function signature. The mapping from
a type to an identifier is an ABI detail. In the current (experimental)
implementation, the identifier of type T is calculated as follows:

  -  Obtain the mangled name for "typeinfo name for T".
  -  Calculate the MD5 hash of the name as a string.
  -  Reinterpret the first 8 bytes of the hash as a little-endian
     64-bit integer.
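
As an illustration only (this is not the in-tree implementation), the
computation can be sketched as follows, here using OpenSSL's MD5 for the
hashing step:

.. code-block:: c++

  #include <openssl/md5.h>
  #include <cstdint>
  #include <string>

  // Sketch: MangledName is the mangled "typeinfo name for T" string.
  uint64_t CallSiteTypeId(const std::string &MangledName) {
    unsigned char Digest[MD5_DIGEST_LENGTH];
    MD5(reinterpret_cast<const unsigned char *>(MangledName.data()),
        MangledName.size(), Digest);
    // Reinterpret the first 8 bytes of the hash as a little-endian
    // 64-bit integer.
    uint64_t Id = 0;
    for (int I = 7; I >= 0; --I)
      Id = (Id << 8) | Digest[I];
    return Id;
  }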

Collisions in the ``CallSiteTypeId`` hashing are possible but unlikely;
a collision would result in weaker CFI checks, which would still be
conservatively correct.

CFI_Check
---------

In the general case, only the target DSO knows whether the call to
function ``f`` with type ``CallSiteTypeId`` is valid or not. To export
this information, every DSO implements

.. code-block:: none

   void __cfi_check(uint64 CallSiteTypeId, void *TargetAddr)

This function provides external modules with access to CFI checks for the
targets inside this DSO. For each known ``CallSiteTypeId``, this function
performs an ``llvm.type.test`` with the corresponding type identifier. It
aborts if the type is unknown, or if the check fails.

The basic implementation is a large switch statement over all values
of ``CallSiteTypeId`` supported by this DSO, where each case is similar
to the ``InlinedFastCheck()`` sequence in the basic CFI mode.
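
The generated function has roughly the following shape (a sketch; the type
identifiers, address ranges and the ``CheckType`` helper are illustrative,
not what the compiler actually emits):

.. code-block:: c++

  #include <cstdint>
  #include <cstdlib>

  // Illustrative per-type check: test that the target lies in that type's
  // jump table or virtual table region and is suitably aligned, just like
  // the inlined fast check.
  static bool CheckType(uintptr_t Begin, uintptr_t Size, uintptr_t Align,
                        void *TargetAddr) {
    uintptr_t Addr = reinterpret_cast<uintptr_t>(TargetAddr);
    return Addr - Begin < Size && (Addr & (Align - 1)) == 0;
  }

  extern "C" void __cfi_check(uint64_t CallSiteTypeId, void *TargetAddr) {
    switch (CallSiteTypeId) {
    case 0x1122334455667788ULL: // illustrative id for one function type
      if (CheckType(0x1000, 0x40, 8, TargetAddr))
        return;
      break;
    case 0x99aabbccddeeff00ULL: // illustrative id for one class type
      if (CheckType(0x2000, 0x80, 16, TargetAddr))
        return;
      break;
    }
    abort(); // unknown type id or failed check
  }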

CFI Shadow
----------

To route CFI checks to the target DSO's ``__cfi_check`` function, a mapping
from possible virtual/indirect call targets to the corresponding
``__cfi_check`` functions is maintained. This mapping is implemented as a
sparse array of 2 bytes for every possible page (4096 bytes) of memory. The
table is kept read-only (FIXME: not yet) most of the time.

There are 3 types of shadow values:

  -  Address in a CFI-instrumented DSO.
  -  Unchecked address (a trusted non-instrumented DSO). Encoded as
     value 0xFFFF.
  -  Invalid address (everything else). Encoded as value 0.

For a CFI-instrumented DSO, a shadow value encodes the address of the
``__cfi_check`` function for all call targets in the corresponding memory
page. If ``Addr`` is the target address, and ``V`` is the shadow value, then
the address of ``__cfi_check`` is calculated as

.. code-block:: none

  __cfi_check = AlignUpTo(Addr, 4096) - (V + 1) * 4096

This works as long as ``__cfi_check`` is aligned to 4096 bytes and located
below any call targets in its DSO, but no more than 256MB away from them.
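
In C++ terms, the computation is simply (a sketch; the names are
illustrative):

.. code-block:: c++

  #include <cstdint>

  constexpr uintptr_t kPageSize = 4096;

  uintptr_t CfiCheckAddress(uintptr_t Addr, uint16_t V) {
    uintptr_t AlignedUp = (Addr + kPageSize - 1) & ~(kPageSize - 1);
    return AlignedUp - (uintptr_t)(V + 1) * kPageSize;
  }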

CFI_SlowPath
------------

The slow path check is implemented in the compiler-rt library as

.. code-block:: none

  void __cfi_slowpath(uint64 CallSiteTypeId, void *TargetAddr)

This function loads the shadow value for ``TargetAddr``, finds the address
of ``__cfi_check`` as described above, and calls that.
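
Putting the pieces together, the slow path behaves roughly as follows
(a sketch; ``CfiShadow`` and ``kPageSize`` are illustrative names for the
shadow base and page size, and the 0 / 0xFFFF encodings follow the
description above):

.. code-block:: c++

  #include <cstdint>
  #include <cstdlib>

  extern const uint16_t *CfiShadow; // illustrative: base of the CFI shadow
  constexpr uintptr_t kPageSize = 4096;

  typedef void (*CfiCheckFn)(uint64_t CallSiteTypeId, void *TargetAddr);

  extern "C" void __cfi_slowpath(uint64_t CallSiteTypeId, void *TargetAddr) {
    uintptr_t Addr = reinterpret_cast<uintptr_t>(TargetAddr);
    uint16_t V = CfiShadow[Addr / kPageSize];
    if (V == 0xFFFF) // unchecked address: trusted, uninstrumented DSO
      return;
    if (V == 0)      // invalid address: not a known call target
      abort();
    uintptr_t AlignedUp = (Addr + kPageSize - 1) & ~(kPageSize - 1);
    CfiCheckFn Check = reinterpret_cast<CfiCheckFn>(
        AlignedUp - (uintptr_t)(V + 1) * kPageSize);
    Check(CallSiteTypeId, TargetAddr); // the target DSO validates or aborts
  }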

Position-independent executable requirement
-------------------------------------------

Cross-DSO CFI mode requires that the main executable be built as PIE. In
non-PIE executables, the address of an external function (taken from the
main executable) is the address of that function's PLT record in the main
executable. This would break the CFI checks.
    500