# The low-precision paradigm in gemmlowp, and how it's implemented

## Introduction

"Low-precision" means that the input and output matrix entries are integers on
at most 8 bits. The scalar type is uint8_t.

This isn't the same as just doing plain matrix arithmetic over uint8_t, because
that would overflow. To avoid overflow, we internally accumulate results on more
than 8 bits, and at the end we keep only some significant 8 bits. This relies on
the caller providing suitable offset/multiplier/shift parameters, which
effectively govern how we extract those significant 8 bits from our
more-than-8bit temporary accumulators.
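
As a toy illustration of why the accumulators must be wider than 8 bits,
consider the dot product at the heart of a matrix multiplication. For a
hypothetical depth of 1024, the worst-case accumulator value is
255 * 255 * 1024 = 66,585,600, which needs 27 bits; hence the internal 32-bit
accumulators. The sketch below is illustrative only, not gemmlowp code:

```
#include <cstdint>

// Toy illustration, not gemmlowp code: products of 8-bit values must be
// accumulated in a wider type. For depth = 1024 the worst case is
// 255 * 255 * 1024 = 66,585,600, i.e. 27 bits, hence int32 accumulators.
std::int32_t DotProductOfUint8(const std::uint8_t* a, const std::uint8_t* b,
                               int depth) {
  std::int32_t accumulator = 0;  // a uint8_t accumulator would overflow
  for (int i = 0; i < depth; i++) {
    accumulator +=
        static_cast<std::int32_t>(a[i]) * static_cast<std::int32_t>(b[i]);
  }
  return accumulator;
}
```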

## Low-precision paradigms

gemmlowp is flexible enough to support multiple low-precision paradigms, i.e.
multiple ways that a meaning is attached to 8bit values so that a computation
can rely on an 8bit GEMM provided by gemmlowp.

### The current flexible design with arbitrary "output pipelines".

See [output.md](output.md) for more details about output pipelines. This is a
mechanism by which gemmlowp becomes generic enough to support multiple 8bit
computation paradigms, by allowing the user to set up a chain of transformations
to be performed on internal 32bit accumulators to obtain the final outputs.

The public entry point in [public/gemmlowp.h](../public/gemmlowp.h) that allows
setting up an arbitrary output pipeline is `GemmWithOutputPipeline`.
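
For illustration, a call to this entry point could look roughly as follows.
This is a hedged sketch: the exact template parameters, matrix-map types and
output-stage types should be checked against
[public/gemmlowp.h](../public/gemmlowp.h) and
[public/output_stages.h](../public/output_stages.h).

```
#include <cstdint>
#include <tuple>

#include "public/gemmlowp.h"
#include "public/output_stages.h"

// Sketch of calling GemmWithOutputPipeline with a simple two-stage output
// pipeline: quantize the int32 accumulators down with an
// offset/multiplier/shift, then saturating-cast to uint8.
void ExampleGemm(const std::uint8_t* lhs_data, const std::uint8_t* rhs_data,
                 std::uint8_t* result_data, int rows, int depth, int cols,
                 std::int32_t lhs_offset, std::int32_t rhs_offset,
                 std::int32_t result_offset, std::int32_t result_mult_int,
                 std::int32_t result_shift) {
  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::RowMajor> lhs(
      lhs_data, rows, depth);
  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::ColMajor> rhs(
      rhs_data, depth, cols);
  gemmlowp::MatrixMap<std::uint8_t, gemmlowp::MapOrder::ColMajor> result(
      result_data, rows, cols);

  gemmlowp::OutputStageQuantizeDownInt32ToUint8Scale quantize_down_stage;
  quantize_down_stage.result_offset = result_offset;
  quantize_down_stage.result_mult_int = result_mult_int;
  quantize_down_stage.result_shift = result_shift;
  gemmlowp::OutputStageSaturatingCastToUint8 saturating_cast_stage;
  const auto output_pipeline =
      std::make_tuple(quantize_down_stage, saturating_cast_stage);

  gemmlowp::GemmContext context;
  gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::uint8_t,
                                   gemmlowp::DefaultL8R8BitDepthParams>(
      &context, lhs, rhs, &result, lhs_offset, rhs_offset, output_pipeline);
}
```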

Refer to [quantization.md](quantization.md) for details of how one gets from
first principles to the actual output pipelines to assemble for successful
real-world quantized calculations.

For the scope of the present document, it suffices to say that quantized matrix
multiplication takes the following parameters:

-   The lhs matrix of uint8 quantized values.
-   The rhs matrix of uint8 quantized values.
-   An int32 lhs_offset, that will be added to each entry of the lhs matrix.
-   An int32 rhs_offset, that will be added to each entry of the rhs matrix.
-   An output pipeline, that will process int32 accumulators into final outputs.

The overall computation goes through the following steps:

1.  Cast lhs entries from uint8 to int32 and add lhs_offset to each of them.
2.  Cast rhs entries from uint8 to int32 and add rhs_offset to each of them.
3.  Compute the int32 matrix product of the resulting lhs times rhs.
4.  Apply the output pipeline on these int32 accumulators, to obtain the final
    outputs.
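
As an unoptimized reference, these four steps could be written out as plain
code as follows. This is an illustrative sketch only, not gemmlowp's actual
implementation; in particular, the output pipeline is simplified here to a
per-accumulator function, whereas real output stages can also act on whole
rows, columns or blocks (see [output.md](output.md)).

```
#include <cstdint>
#include <functional>
#include <vector>

// Unoptimized reference for the four steps above. Matrices are row-major.
// All names are illustrative only.
std::vector<std::uint8_t> ReferenceQuantizedGemm(
    const std::vector<std::uint8_t>& lhs,  // rows x depth
    const std::vector<std::uint8_t>& rhs,  // depth x cols
    int rows, int depth, int cols,
    std::int32_t lhs_offset, std::int32_t rhs_offset,
    const std::function<std::uint8_t(std::int32_t)>& output_pipeline) {
  std::vector<std::uint8_t> result(rows * cols);
  for (int r = 0; r < rows; r++) {
    for (int c = 0; c < cols; c++) {
      std::int32_t accumulator = 0;
      for (int d = 0; d < depth; d++) {
        // Steps 1 and 2: cast entries to int32 and add the offsets.
        const std::int32_t lhs_value = lhs[r * depth + d] + lhs_offset;
        const std::int32_t rhs_value = rhs[d * cols + c] + rhs_offset;
        // Step 3: accumulate the int32 products.
        accumulator += lhs_value * rhs_value;
      }
      // Step 4: run the output pipeline on the int32 accumulator.
      result[r * cols + c] = output_pipeline(accumulator);
    }
  }
  return result;
}
```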

### The legacy low-precision paradigm

This older paradigm is the one exposed by the following entry points:

*   In [public/gemmlowp.h](../public/gemmlowp.h), the `Gemm` entry point.
*   The deprecated `eight_bit_int_gemm` directory.

Originally, gemmlowp started as an implementation of the (now deprecated)
EightBitIntGemm paradigm, where quantized matrix multiplication takes the
following input parameters:

-   the lhs matrix of uint8 quantized values
-   the rhs matrix of uint8 quantized values
-   the following int32 "quantization parameters", which control how the uint8
    quantized values in the matrices are to be interpreted during the matrix
    computation:
    -   lhs_offset
    -   rhs_offset
    -   result_offset
    -   result_mult_int
    -   result_shift

In that legacy paradigm, the mathematical expression to be computed is the
result of the following steps:

1.  Cast lhs entries from uint8 to int32 and add lhs_offset to each of them.
2.  Cast rhs entries from uint8 to int32 and add rhs_offset to each of them.
3.  Compute the int32 matrix product of the resulting lhs times rhs.
4.  Add result_offset to each entry of the result.
5.  Multiply each entry of the result by the following fraction, and round to
    the nearest integer:

```
result_mult_int
---------------                             (1)
2^result_shift
```

6.  Clamp the resulting int32 values to the `[0..255]` range and cast to uint8.
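
The following sketch spells out steps 4 to 6 on a single int32 accumulator. It
is illustrative only, not the actual `eight_bit_int_gemm` code; note that its
plain integer multiplication is exactly the operation whose overflow risk is
discussed below.

```
#include <algorithm>
#include <cstdint>

// Reference for legacy steps 4 to 6: add result_offset, scale by the
// fraction (1) rounding to nearest, then clamp to [0, 255] and cast to uint8.
// Illustrative only; assumes result_shift >= 0.
std::uint8_t LegacyRequantize(std::int32_t accumulator,
                              std::int32_t result_offset,
                              std::int32_t result_mult_int,
                              std::int32_t result_shift) {
  // Step 4: add result_offset.
  // Step 5: multiply by result_mult_int and divide by 2^result_shift,
  // rounding to nearest by adding half of the divisor before shifting.
  const std::int32_t rounding =
      result_shift > 0 ? (1 << (result_shift - 1)) : 0;
  std::int32_t value =
      ((accumulator + result_offset) * result_mult_int + rounding) >>
      result_shift;
  // Step 6: clamp to the [0..255] range and cast to uint8.
  value = std::min(std::max(value, std::int32_t(0)), std::int32_t(255));
  return static_cast<std::uint8_t>(value);
}
```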

Again, this paradigm is not recommended for new usage. See
[quantization.md](quantization.md) for how, reasoning from first principles,
one arrives at a substantially different quantization paradigm.

In addition, note that the integer multiplication by the numerator in the above
step 5 risks overflowing. That concern is avoided in the currently recommended
output stages by performing a fixed-point multiplication instead of an ordinary
integer multiplication.
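
The idea of that fixed-point multiplication is to compute the product of two
int32 values in 64-bit arithmetic and keep only the high bits, with rounding,
so that no intermediate value can overflow. gemmlowp's actual implementation
is `SaturatingRoundingDoublingHighMul` in
[fixedpoint/fixedpoint.h](../fixedpoint/fixedpoint.h); the following is only an
illustrative rendition of the idea, not the library code.

```
#include <cstdint>
#include <limits>

// Illustrative rendition of a saturating, rounding, doubling high
// multiplication: returns the high 32 bits of 2 * a * b, rounded to nearest,
// computing the product in 64 bits so it cannot overflow.
std::int32_t SaturatingRoundingDoublingHighMulSketch(std::int32_t a,
                                                     std::int32_t b) {
  // The only case whose result would overflow: both inputs are INT32_MIN.
  if (a == std::numeric_limits<std::int32_t>::min() &&
      b == std::numeric_limits<std::int32_t>::min()) {
    return std::numeric_limits<std::int32_t>::max();
  }
  const std::int64_t product = static_cast<std::int64_t>(a) * b;
  const std::int64_t nudge =
      product >= 0 ? (std::int64_t(1) << 30) : 1 - (std::int64_t(1) << 30);
  return static_cast<std::int32_t>((product + nudge) / (std::int64_t(1) << 31));
}
```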

# Efficient handling of offsets

At first glance it may seem like the above-described quantized computation
scheme requires adding the lhs_offset and rhs_offset to each of the lhs and rhs
matrix entries.

Doing that in the GEMM kernel would incur substantial overhead:

-   It would mean extra arithmetic work in the GEMM kernel;
-   It would require storing the lhs_offset and rhs_offset in registers, which
    would eat into the register space available for the rest of the GEMM
    kernel.

One may then consider adding the lhs_offset and rhs_offset once and for all to
lhs and rhs blocks, in a GEMM implementation operating on one lhs block and one
rhs block at a time. However, doing so would require storing lhs and rhs blocks
in 32 bit (or at least in 16 bit in real-world cases), which would partially
negate the memory bandwidth benefits of low-precision computation.

Fortunately, there is another way to handle these offsets that has none of the
costs of the approaches described above. The idea is as follows.

Let `P` denote the matrix shaped like `lhs`, but filled with 1's.

Let `Q` denote the matrix shaped like `rhs`, but filled with 1's.

Adding lhs_offset to each entry of `lhs` means adding `lhs_offset * P` to
`lhs`.

Adding rhs_offset to each entry of `rhs` means adding `rhs_offset * Q` to
`rhs`.

Thus, as far as handling `lhs_offset` and `rhs_offset` goes, the matrix product
to be computed is:

```
(lhs + lhs_offset * P) * (rhs + rhs_offset * Q)
```

Expanding this (using distributivity of matrix multiplication over addition), we
see that the above product is equal to the following sum of 4 terms:

```
  lhs * rhs                                 (2)
+ lhs_offset * P * rhs
+ lhs * rhs_offset * Q
+ lhs_offset * rhs_offset * P * Q
```

The first term, `lhs * rhs`, is just the matrix multiplication ignoring the
offsets, i.e. as if `lhs_offset==rhs_offset==0`. Our claim here is that this is
all we have to compute in the GEMM kernel.

In the second term, `lhs_offset * P * rhs`, notice that since `P` is filled with
1's, `P * rhs` has all its rows equal to each other, and equal to the row-vector
of sums of all the entries in each column of rhs.

Thus, we can compute the second term, `lhs_offset * P * rhs`, by summing each
column of rhs. This produces a single row-vector, and in order to add the second
term, we simply need to add this row-vector (multiplied by lhs_offset) to each
row of the result. This is just a rank one update of the result (equivalently,
the second term is a rank one matrix), and we can efficiently store it as a
single vector.

The third term, `lhs * rhs_offset * Q`, is entirely similar to the second one,
and can be similarly computed by summing each row of lhs, storing this in a
single column-vector, and later multiplying these sums by rhs_offset.

The fourth term is a single constant, repeated into all the entries of the
matrix. The matrix `P * Q` is filled with the single constant value 'depth' (the
depth of the matrix product, i.e. the number of columns of the lhs). Thus the
fourth term is simply the rank zero update adding this constant to each matrix
entry:

```
lhs_offset * rhs_offset * depth
```
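
The following self-contained check makes this decomposition concrete. It is
illustrative only, not gemmlowp code: it verifies, entry by entry, that the sum
of the four terms in (2) equals the product of the offset-shifted matrices.

```
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative check that the four-term expansion (2) equals
// (lhs + lhs_offset * P) * (rhs + rhs_offset * Q). Matrices are row-major.
void CheckOffsetDecomposition(const std::vector<std::uint8_t>& lhs,  // rows x depth
                              const std::vector<std::uint8_t>& rhs,  // depth x cols
                              int rows, int depth, int cols,
                              std::int32_t lhs_offset, std::int32_t rhs_offset) {
  // Data for the two rank one updates: the sums of each row of lhs and the
  // sums of each column of rhs.
  std::vector<std::int32_t> sums_of_lhs_rows(rows, 0);
  std::vector<std::int32_t> sums_of_rhs_cols(cols, 0);
  for (int r = 0; r < rows; r++)
    for (int d = 0; d < depth; d++) sums_of_lhs_rows[r] += lhs[r * depth + d];
  for (int d = 0; d < depth; d++)
    for (int c = 0; c < cols; c++) sums_of_rhs_cols[c] += rhs[d * cols + c];

  for (int r = 0; r < rows; r++) {
    for (int c = 0; c < cols; c++) {
      // Direct computation: (lhs + lhs_offset * P) * (rhs + rhs_offset * Q).
      std::int32_t direct = 0;
      // First term of (2): lhs * rhs, the only term the GEMM kernel computes.
      std::int32_t first_term = 0;
      for (int d = 0; d < depth; d++) {
        direct += (lhs[r * depth + d] + lhs_offset) *
                  (rhs[d * cols + c] + rhs_offset);
        first_term += std::int32_t(lhs[r * depth + d]) * rhs[d * cols + c];
      }
      // Terms 2, 3 and 4 of (2): two rank one updates and one constant.
      const std::int32_t expanded = first_term +
                                    lhs_offset * sums_of_rhs_cols[c] +
                                    rhs_offset * sums_of_lhs_rows[r] +
                                    lhs_offset * rhs_offset * depth;
      assert(direct == expanded);
    }
  }
}
```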

# Implementation of this technique in gemmlowp

In gemmlowp, at the packing stage (where we traverse blocks of the lhs and rhs
to prepare them for efficient repeated traversal by the kernel), we compute the
sum of each row of the lhs block and the sum of each column of the rhs block.

See in [internal/pack.h](../internal/pack.h), in the `PackedSideBlock` class, the
following member:

```
// Handle on the additional buffer backing the vector of sums of slices
// associated with this block. Owned.
Allocator::Handle sums_of_each_slice_handle_;
```

`sums_of_each_slice_handle_` is the handle to the buffer allocated to store the
vector containing the sums of the rows of lhs, or of the columns of rhs.
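
Schematically, the extra work done while packing a block looks like the sketch
below. This is not the actual code in [internal/pack.h](../internal/pack.h),
which packs into a kernel-specific layout; it only illustrates how the
per-slice sums are accumulated alongside the packed data.

```
#include <cstdint>

// Schematic rendition of packing an lhs block: alongside copying the 8-bit
// entries into the packed buffer, accumulate the sum of each row ("slice")
// into the sums-of-each-slice vector. Illustrative only; the real packing
// reorders entries for efficient kernel traversal.
void PackLhsBlockWithRowSums(const std::uint8_t* lhs_block,  // block_rows x block_depth
                             int block_rows, int block_depth,
                             std::uint8_t* packed_block,
                             std::int32_t* sums_of_each_slice) {
  for (int r = 0; r < block_rows; r++) {
    std::int32_t row_sum = 0;
    for (int d = 0; d < block_depth; d++) {
      const std::uint8_t value = lhs_block[r * block_depth + d];
      packed_block[r * block_depth + d] = value;
      row_sum += value;
    }
    sums_of_each_slice[r] = row_sum;
  }
}
```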

After these rank one updates have been computed at the packing stage, they are
ignored at the compute kernel stage, since that stage is only concerned with the
first of the four terms in (2); they are only used at the unpacking stage. See
the default/reference implementation, `UnpackResultImpl`, in
[internal/unpack.h](../internal/unpack.h).
    193