# Overview of gemmlowp design

## Primer on GEMM, kernels, and cache friendliness

gemmlowp, like most GEMMs, implements the straightforward matrix multiplication
algorithm, which takes n^3 multiply-accumulate instructions for n*n sized
matrices. Because the arithmetic complexity grows faster than the memory
complexity (n^3 vs. n^2), memory accesses are redundant (each matrix entry is
accessed n times). A large part of a GEMM's performance and design goes toward
minimizing the inefficiency resulting from these redundant memory accesses.
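
For concreteness, here is a minimal sketch of that straightforward algorithm.
The function name and the float, row-major matrices are assumptions made only
for illustration; gemmlowp itself operates on quantized integer matrices.

```
// Naive n x n matrix multiplication: n^3 multiply-accumulates, and every
// entry of lhs and rhs is re-loaded from memory n times.
void NaiveGemm(const float* lhs, const float* rhs, float* result, int n) {
  for (int r = 0; r < n; r++) {
    for (int c = 0; c < n; c++) {
      float accum = 0;
      for (int d = 0; d < n; d++) {
        accum += lhs[r * n + d] * rhs[d * n + c];  // redundant memory accesses
      }
      result[r * n + c] = accum;
    }
  }
}
```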

Ultimately, once values are loaded into CPU registers, they cost nothing to
access, so as long as we can work within registers, this problem doesn't exist.
Thus, in order to be efficient, a GEMM's inner loops must wisely use the
available registers to do as much arithmetic work as possible before loading
more data from memory into registers. This means that a GEMM implementation
needs architecture-specific inner loops, tailored to details such as the number
of available registers, and typically written in assembly. This
architecture-specific 'inner loops' component is referred to as the GEMM kernel.
(More details about kernels are in [kernel.md](kernel.md)).
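
To illustrate the register-blocking idea, here is a schematic 4x4 micro-kernel
written as plain C++ with float data and invented names, whereas real gemmlowp
kernels use integer SIMD and assembly with wider tiles. The 16 accumulators
live in local variables that the compiler can keep in registers, so each value
loaded from the packed panels is reused four times before touching memory again.

```
// Schematic 4x4 micro-kernel: 8 loads feed 16 multiply-accumulates per depth
// step, with all accumulators held in (presumably) registers.
void MicroKernel4x4(const float* lhs_panel,  // 4 rows x depth, 4 values per depth step
                    const float* rhs_panel,  // depth x 4 cols, 4 values per depth step
                    float* acc,              // 4x4 accumulator tile, row-major
                    int depth) {
  float a00 = 0, a01 = 0, a02 = 0, a03 = 0;
  float a10 = 0, a11 = 0, a12 = 0, a13 = 0;
  float a20 = 0, a21 = 0, a22 = 0, a23 = 0;
  float a30 = 0, a31 = 0, a32 = 0, a33 = 0;
  for (int d = 0; d < depth; d++) {
    const float l0 = lhs_panel[4 * d + 0], l1 = lhs_panel[4 * d + 1];
    const float l2 = lhs_panel[4 * d + 2], l3 = lhs_panel[4 * d + 3];
    const float r0 = rhs_panel[4 * d + 0], r1 = rhs_panel[4 * d + 1];
    const float r2 = rhs_panel[4 * d + 2], r3 = rhs_panel[4 * d + 3];
    a00 += l0 * r0; a01 += l0 * r1; a02 += l0 * r2; a03 += l0 * r3;
    a10 += l1 * r0; a11 += l1 * r1; a12 += l1 * r2; a13 += l1 * r3;
    a20 += l2 * r0; a21 += l2 * r1; a22 += l2 * r2; a23 += l2 * r3;
    a30 += l3 * r0; a31 += l3 * r1; a32 += l3 * r2; a33 += l3 * r3;
  }
  acc[0] = a00;  acc[1] = a01;  acc[2] = a02;  acc[3] = a03;
  acc[4] = a10;  acc[5] = a11;  acc[6] = a12;  acc[7] = a13;
  acc[8] = a20;  acc[9] = a21;  acc[10] = a22; acc[11] = a23;
  acc[12] = a30; acc[13] = a31; acc[14] = a32; acc[15] = a33;
}
```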

However, only small blocks can fit at a given time in registers, so at larger
scales one needs to repeatedly load blocks of matrices from memory, and these
accesses are redundant for the reason outlined above. The way that one minimizes
the resulting inefficiency is by organizing for cache locality, so that most of
these accesses hit the L1 cache, and most of the remaining ones hit the L2
cache, etc.

This is achieved by subdividing the matrices into blocks sized to fit in L2
cache, subdividing these blocks into sub-blocks sized to fit in L1 cache, and
performing the matrix multiplication one such block at a time.

In practice, it tends to pay off to "pack" input blocks for optimally efficient
traversal by the kernel, since they will be traversed multiple times. "Packing"
means at least reordering the data layout for 1) simple access patterns that fit
the CPU's cache behavior (in particular, the cache line size), and 2) simple
loading into SIMD vector registers by the kernel.
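
As an illustration of the idea, here is an invented helper with an assumed
layout, not gemmlowp's actual packing code (see [packing.md](packing.md) for
the real formats): it packs a column-major lhs block into contiguous 4-row
panels, so that a 4-row kernel can read 4 contiguous values per depth step.

```
// Hypothetical packing helper: copies a (rows x depth) column-major block,
// whose columns are `stride` elements apart, into panels of 4 rows laid out
// so that the 4 values needed at each depth step are contiguous in memory
// (cache-line and SIMD friendly). Assumes rows is a multiple of 4.
void PackLhsBlock(const float* block, int rows, int depth, int stride,
                  float* packed) {
  for (int r0 = 0; r0 < rows; r0 += 4) {      // one 4-row panel at a time
    for (int d = 0; d < depth; d++) {
      for (int i = 0; i < 4; i++) {
        *packed++ = block[(r0 + i) + d * stride];  // strided gather, done once
      }
    }
  }
}
```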

So a typical GEMM, in pseudo-code, tends to look like this:

```
allocate(some_lhs_L2_block);
allocate(some_rhs_L2_block);
for (some_lhs_L2_block) {
  pack(some_lhs_L2_block);
  for (some_rhs_L2_block) {
    pack(some_rhs_L2_block);
    for (some_lhs_sub_block in some_lhs_L2_block) {
      for (some_rhs_sub_block in some_rhs_L2_block) {
        kernel(some_lhs_sub_block, some_rhs_sub_block);
      }
    }
  }
}
```

## Impact of low-precision computation on gemmlowp design

Refer to [low-precision.md](low-precision.md) for specifics of the
low-precision-computation paradigm and how it's implemented in gemmlowp.

Inputs and outputs are matrices of uint8 values, but internally we are
accumulating int32 values, only converting them back to uint8 at the end. This
means that we need to store a block of int32 accumulators at a time. We compute
a block of the result in int32 accumulators and then we "unpack" it into the
destination matrix all at once. In this way, we minimize the amount of memory
used to store int32 values at a given time.
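
As a rough idea of what that final int32-to-uint8 conversion involves, here is
a hedged sketch modeled on the legacy (result_offset, result_mult_int,
result_shift) parameters that appear in the code excerpt further below; the
function name is invented, and the real conversion, including the newer
flexible output pipelines, is described in [low-precision.md](low-precision.md)
and [output.md](output.md).

```
#include <algorithm>
#include <cstdint>

// Sketch of the simplest int32 -> uint8 conversion done while unpacking an
// accumulator: offset, scale, round, shift, then saturate to [0, 255].
std::uint8_t Requantize(std::int32_t accum, std::int32_t result_offset,
                        std::int32_t result_mult_int, int result_shift) {
  const std::int32_t rounding =
      (result_shift < 1) ? 0 : (1 << (result_shift - 1));
  std::int32_t value =
      ((accum + result_offset) * result_mult_int + rounding) >> result_shift;
  value = std::min<std::int32_t>(255, std::max<std::int32_t>(0, value));
  return static_cast<std::uint8_t>(value);
}
```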

Because of that, besides the "pack" and "kernel" stages outlined above, a third
stage is needed in gemmlowp, which we call "unpack". Thus we arrive at the
3-stage computation scheme that gemmlowp uses:

1.  Pack lhs/rhs blocks from the input matrices.
2.  Compute the product of the packed blocks, using the kernel.
3.  Unpack the result block into the output matrix.

The pseudo-code overview of gemmlowp now looks like:

```
allocate(some_lhs_L2_block);
allocate(some_rhs_L2_block);
// new: temp storage for int32 accums
allocate(some_int32_accumulators_block);
for (some_lhs_L2_block) {
  pack(some_lhs_L2_block);
  for (some_rhs_L2_block) {
    pack(some_rhs_L2_block);
    for (some_lhs_sub_block in some_lhs_L2_block) {
      for (some_rhs_sub_block in some_rhs_L2_block) {
        // new: pass int32 accums to kernel
        kernel(&some_int32_accumulators_block,
               some_lhs_sub_block,
               some_rhs_sub_block);
      }
    }
    // new: unpack int32 accums into destination matrix
    unpack(some_int32_accumulators_block);
  }
}
```

## Exploring gemmlowp code

The design outlined above can be readily matched to gemmlowp source code, in
particular in this file, which gives a simple GEMM implementation fitting in one
rather small function:

```
internal/single_thread_gemm.h
```

The reader can compare the above pseudo-code to the actual code in this file:

```
for (int r = 0; r < rows; r += block_params.l2_rows) {
  int rs = std::min(block_params.l2_rows, rows - r);

  PackLhs(&packed_lhs, lhs.block(r, 0, rs, depth));

  for (int c = 0; c < cols; c += block_params.l2_cols) {
    int cs = std::min(block_params.l2_cols, cols - c);

    if (!pack_rhs_once) {
      PackRhs(&packed_rhs, rhs.block(0, c, depth, cs));
    }

    Compute(kernel, block_params, &packed_result, packed_lhs, packed_rhs);

    auto result_block = result->block(r, c, rs, cs);
    UnpackResult(&result_block, packed_result, packed_lhs, packed_rhs, depth,
                 result_offset, result_mult_int, result_shift);
  }
}
```

The files in `internal/` fall into a few categories:

There are two top-level GEMM implementations,

*   [internal/single_thread_gemm.h](../internal/single_thread_gemm.h)
*   [internal/multi_thread_gemm.h](../internal/multi_thread_gemm.h)

They both call into pack/compute/unpack stages (see [kernel.md](kernel.md) and
[packing.md](packing.md)) implemented in the following files:

*   [internal/pack.h](../internal/pack.h)
*   [internal/compute.h](../internal/compute.h)
*   [internal/unpack.h](../internal/unpack.h)
    *   This in turn calls into [internal/output.h](../internal/output.h) for
        the output pipeline (see [output.md](output.md))

The pack.h and unpack.h files contain generic templated code that can be
overridden by optimized code in template specializations; for example, see the
NEON optimized code here:

*   [internal/pack_neon.h](../internal/pack_neon.h)
*   [internal/unpack_neon.h](../internal/unpack_neon.h)
    *   This in turn calls into
        [internal/output_neon.h](../internal/output_neon.h)

The compute stage contains generic code in compute.h that only calls into
optimized code through the Kernel::Run() entry point. Each kernel is basically
just a struct offering a Run() implementation; see the NEON kernels in:

*   [internal/kernel_neon.h](../internal/kernel_neon.h)
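
To make "a struct offering a Run() implementation" concrete, here is a purely
schematic sketch of the shape of such a kernel. The names and the signature are
illustrative rather than gemmlowp's actual declarations, which live in
[internal/kernel.h](../internal/kernel.h) and are described in
[kernel.md](kernel.md).

```
#include <cstdint>

// Schematic only: a kernel advertises a name and provides a Run() that
// multiply-accumulates packed lhs/rhs panels into a block of int32
// accumulators. In a real gemmlowp kernel, Run() is hand-written NEON/SSE
// code sized to the architecture's register file.
struct ExampleKernel {
  const char* Name() const { return "illustrative example kernel"; }

  void Run(std::int32_t* dst_ptr, int dst_stride,
           const std::uint8_t* lhs_ptr, const std::uint8_t* rhs_ptr,
           int start_depth, int run_depth) const {
    // ... architecture-specific inner loop over run_depth ...
  }
};
```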