# The low-precision paradigm in gemmlowp, and how it's implemented

## Introduction

"Low-precision" means that the input and output matrix entries are integers on
at most 8 bits. The scalar type is uint8_t.

This isn't the same as just doing plain matrix arithmetic over uint8_t, because
that would overflow. To avoid overflow, we internally accumulate results on
more than 8 bits, and at the end we keep only some significant 8 bits. This
relies on the caller providing suitable offset/multiplier/shift parameters,
which effectively govern how we extract some significant 8 bits from our
more-than-8bit temporary accumulators.

## Low-precision paradigms

gemmlowp is flexible enough to support multiple low-precision paradigms, i.e.
multiple ways that a meaning is attached to 8bit values so that a computation
can rely on an 8bit GEMM provided by gemmlowp.

### The current flexible design with arbitrary "output pipelines"

See [output.md](output.md) for more details about output pipelines. This is a
mechanism by which gemmlowp becomes generic enough to support multiple 8bit
computation paradigms, by allowing the user to set up a chain of
transformations to be performed on internal 32bit accumulators to obtain the
final outputs.

The public entry point in [public/gemmlowp.h](../public/gemmlowp.h) that
allows setting up an arbitrary output pipeline is `GemmWithOutputPipeline`.

Refer to [quantization.md](quantization.md) for details of how one gets from
first principles to the actual output pipelines to assemble for successful
real-world quantized calculations.

For the scope of the present document, it suffices to say that quantized
matrix multiplication takes the following parameters:

- The lhs matrix of uint8 quantized values.
- The rhs matrix of uint8 quantized values.
- An int32 lhs_offset, which will be added to each entry of the lhs matrix.
- An int32 rhs_offset, which will be added to each entry of the rhs matrix.
- An output pipeline, which will process int32 accumulators into final
  outputs.

The overall computation goes through the following steps, illustrated by the
sketch after this list:

1. Cast lhs entries from uint8 to int32 and add lhs_offset to each of them.
2. Cast rhs entries from uint8 to int32 and add rhs_offset to each of them.
3. Compute the int32 matrix product of the resulting lhs times rhs.
4. Apply the output pipeline on these int32 accumulators, to obtain the final
   outputs.
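To make these four steps concrete, here is a minimal sketch of the reference
semantics in plain C++. This is not gemmlowp's actual implementation; the
function name, the matrix storage orders (row-major lhs, column-major rhs),
and the trivial "identity" output pipeline that just returns the raw int32
accumulators are all assumptions of this sketch.

```
#include <cstdint>
#include <vector>

// Reference semantics of the quantized matrix multiplication described
// above. lhs is rows x depth (row-major), rhs is depth x cols
// (column-major). The output pipeline is left abstract: here we simply
// return the raw int32 accumulators, i.e. an empty output pipeline.
std::vector<std::int32_t> ReferenceQuantizedGemm(
    const std::uint8_t* lhs, const std::uint8_t* rhs, int rows, int depth,
    int cols, std::int32_t lhs_offset, std::int32_t rhs_offset) {
  std::vector<std::int32_t> result(rows * cols);
  for (int r = 0; r < rows; r++) {
    for (int c = 0; c < cols; c++) {
      std::int32_t accum = 0;
      for (int d = 0; d < depth; d++) {
        // Steps 1 and 2: widen each entry to int32 and add the offsets.
        const std::int32_t lhs_val = lhs[r * depth + d] + lhs_offset;
        const std::int32_t rhs_val = rhs[c * depth + d] + rhs_offset;
        // Step 3: accumulate the int32 products.
        accum += lhs_val * rhs_val;
      }
      // Step 4: the output pipeline would transform accum here.
      result[r * cols + c] = accum;
    }
  }
  return result;
}
```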
### The legacy low-precision paradigm

This older paradigm is the one exposed by the following entry points:

* In [public/gemmlowp.h](../public/gemmlowp.h), the `Gemm` entry point.
* The deprecated `eight_bit_int_gemm` directory.

Originally, gemmlowp started as an implementation of the (now deprecated)
EightBitIntGemm paradigm, where quantized matrix multiplication takes the
following input parameters:

- the lhs matrix of uint8 quantized values
- the rhs matrix of uint8 quantized values
- the following int32 "quantization parameters", which control how the uint8
  quantized values in the matrices are to be interpreted during the matrix
  computation:
  - lhs_offset
  - rhs_offset
  - result_offset
  - result_mult_int
  - result_shift

In that legacy paradigm, the mathematical expression to be computed is the
result of the following steps:

1. Cast lhs entries from uint8 to int32 and add lhs_offset to each of them.
2. Cast rhs entries from uint8 to int32 and add rhs_offset to each of them.
3. Compute the int32 matrix product of the resulting lhs times rhs.
4. Add result_offset to each entry of the result.
5. Multiply each entry of the result by the following fraction, and round to
   the nearest integer:

   ```
   result_mult_int
   --------------- (1)
   2^result_shift
   ```

6. Clamp the resulting int32 values to the `[0..255]` range and cast to
   uint8.
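For concreteness, here is a hedged sketch in plain C++ of the per-entry
arithmetic of steps 4 through 6. This is not the actual eight_bit_int_gemm
code; the function name and the exact round-to-nearest convention (adding
half the divisor before shifting, assuming result_shift > 0) are assumptions
of this sketch.

```
#include <algorithm>
#include <cstdint>

// Sketch of the legacy output stage (steps 4-6 above), applied to one int32
// accumulator. We compute in int64 here to sidestep the overflow risk in the
// multiplication by result_mult_int, which is discussed below.
std::uint8_t LegacyRequantize(std::int32_t accum, std::int32_t result_offset,
                              std::int32_t result_mult_int, int result_shift) {
  // Step 4: add result_offset.
  const std::int64_t with_offset =
      static_cast<std::int64_t>(accum) + result_offset;
  // Step 5: multiply by result_mult_int / 2^result_shift, rounding to the
  // nearest integer. Assumes result_shift > 0.
  const std::int64_t rounding = std::int64_t{1} << (result_shift - 1);
  const std::int64_t scaled =
      (with_offset * result_mult_int + rounding) >> result_shift;
  // Step 6: clamp to [0, 255] and cast to uint8.
  const std::int64_t clamped =
      std::min<std::int64_t>(255, std::max<std::int64_t>(0, scaled));
  return static_cast<std::uint8_t>(clamped);
}
```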
Again, this paradigm is not recommended for new usage. See
[quantization.md](quantization.md) for how, reasoning from first principles,
one arrives at a substantially different quantization paradigm.

In addition, note that the integer multiplication by the numerator in step 5
above risks overflowing. That concern is avoided in the currently recommended
output stages by performing a fixed-point multiplication instead of an
ordinary integer multiplication.

# Efficient handling of offsets

At first glance it may seem like the above-described quantized computation
scheme requires adding the lhs_offset and rhs_offset to each of the lhs and
rhs matrix entries.

Doing that in the GEMM kernel would incur substantial overhead:

- It would mean extra arithmetic work in the GEMM kernel;
- It would require storing the lhs_offset and rhs_offset in registers, which
  would eat into the register space available for the rest of the GEMM
  kernel.

One may then consider adding the lhs_offset and rhs_offset once and for all
to lhs and rhs blocks, in a GEMM implementation operating on one lhs block
and one rhs block at a time. However, doing so would require storing lhs and
rhs blocks in 32 bit (or at least in 16 bit in real-world cases), which would
partially negate the memory bandwidth benefits of low-precision computation.

Fortunately, there is another way to handle these offsets that has none of
the costs of the approaches described above. The idea is as follows.

Let `P` denote the matrix shaped like `lhs`, but filled with 1's.

Let `Q` denote the matrix shaped like `rhs`, but filled with 1's.

Adding lhs_offset to each entry of `lhs` means adding `lhs_offset * P` to
`lhs`.

Adding rhs_offset to each entry of `rhs` means adding `rhs_offset * Q` to
`rhs`.

Thus, as far as handling `lhs_offset` and `rhs_offset` goes, the matrix
product to be computed is:

```
(lhs + lhs_offset * P) * (rhs + rhs_offset * Q)
```

Expanding this (using distributivity of matrix multiplication over addition),
we see that the above product is equal to the following sum of 4 terms:

```
  lhs * rhs                             (2)
+ lhs_offset * P * rhs
+ lhs * rhs_offset * Q
+ lhs_offset * rhs_offset * P * Q
```

The first term, `lhs * rhs`, is just the matrix multiplication ignoring the
offsets, i.e. as if `lhs_offset==rhs_offset==0`. Our claim here is that this
is all we have to compute in the GEMM kernel.

In the second term, `lhs_offset * P * rhs`, notice that since P is filled
with 1's, `P * rhs` has all its rows equal to each other, and equal to the
row-vector of sums of all the entries in each column of rhs.

Thus, we can compute the second term, `lhs_offset * P * rhs`, by summing each
column of rhs. This produces a single row-vector, and in order to add the
second term, we simply need to add this row-vector (multiplied by lhs_offset)
to each row of the result. This is just a rank one update of the result
(equivalently, the second term is a rank one matrix), and we can efficiently
store it as a single vector.

The third term, `lhs * rhs_offset * Q`, is entirely similar to the second
one, and can be similarly computed by summing each row of lhs, storing this
in a single column-vector, and later multiplying these sums by rhs_offset.

The fourth term is a single constant, repeated into all the entries of the
matrix. The matrix `P * Q` is filled with the single constant value 'depth'
(the depth of the matrix product, i.e. the number of columns of the lhs).
Thus the fourth term is simply the rank zero update adding this constant to
each matrix entry:

```
lhs_offset * rhs_offset * depth
```

# Implementation of this technique in gemmlowp

In gemmlowp, at the packing stage (where we traverse blocks of the lhs and
rhs to prepare them for efficient repeated traversal by the kernel), we
compute the sum of each row of the lhs block and the sum of each column of
the rhs block.

See in [internal/pack.h](../internal/pack.h), in the PackedSideBlock class,
the following member:

```
// Handle on the additional buffer backing the vector of sums of slices
// associated with this block. Owned.
Allocator::Handle sums_of_each_slice_handle_;
```

`sums_of_each_slice_handle_` is the handle to the buffer allocated to store
the vector containing the sums of rows of lhs, or the sums of columns of rhs.

After these rank one updates have been computed at the packing stage, they
are ignored at the compute kernel stage, since that stage is only concerned
with the first of the four terms in (2); they are only used at the unpacking
stage. See the default/reference implementation, `UnpackResultImpl`, in
[internal/unpack.h](../internal/unpack.h).
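To illustrate how the unpacking stage can combine the four terms of (2), here
is a hedged C++ sketch. It is not gemmlowp's actual `UnpackResultImpl`; the
function name, the parameter layout, and the row-major result storage are
assumptions made for illustration.

```
#include <cstdint>
#include <vector>

// Sketch of the offset handling at the unpacking stage. raw_accum holds the
// kernel's result, i.e. the first term lhs * rhs of (2).
// rhs_sums_of_each_slice holds the sums of each column of rhs, and
// lhs_sums_of_each_slice the sums of each row of lhs, both computed during
// packing. We add the remaining three terms of (2) to each entry.
void ApplyOffsetUpdates(
    std::vector<std::int32_t>& raw_accum, int rows, int cols, int depth,
    const std::vector<std::int32_t>& lhs_sums_of_each_slice,
    const std::vector<std::int32_t>& rhs_sums_of_each_slice,
    std::int32_t lhs_offset, std::int32_t rhs_offset) {
  // Fourth term of (2): the rank zero update, a single constant.
  const std::int32_t term4 = lhs_offset * rhs_offset * depth;
  for (int r = 0; r < rows; r++) {
    for (int c = 0; c < cols; c++) {
      raw_accum[r * cols + c] +=
          lhs_offset * rhs_sums_of_each_slice[c]    // second term of (2)
          + rhs_offset * lhs_sums_of_each_slice[r]  // third term of (2)
          + term4;                                  // fourth term of (2)
    }
  }
}
```

Note how the two rank one updates cost only one vector each, and the kernel
itself never sees the offsets, which is the point of the technique described
above.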