# Building a quantization paradigm from first principles

**TLDR:** If you prefer example code over theory, look at
[doc/quantization_example.cc](quantization_example.cc).

## Overview

gemmlowp allows performing calculations on matrices of uint8 values, but these
matrices are only useful insofar as they somehow approximate matrices of real
numbers. By a _quantization paradigm_ we mean a correspondence between matrices
of quantized 8bit values and matrices of real numbers. The choice of a
quantization paradigm affects the calculations that gemmlowp itself needs to
perform; specifically, it affects how one goes from internal 32bit accumulators
to final 8bit outputs.

The part of gemmlowp transforming internal 32bit accumulators to final 8bit
outputs is the "output pipeline" described in [output.md](output.md).

gemmlowp's `GemmWithOutputPipeline` entry point allows specifying an arbitrary
output pipeline, allowing the user to implement their own preferred quantized
arithmetic paradigm.

In the present document, our purpose is to show how, reasoning from first
principles and some domain-specific knowledge of neural networks, we can arrive
naturally at some specific quantization paradigm, and how that can be
implemented using a specific output pipeline.

We also aim to show how that differs from the older, legacy quantization
paradigm implemented by gemmlowp's legacy interfaces, and why the change to the
newer quantization paradigm described in this document was useful for some
applications of gemmlowp.

## Quantization as an affine map.

In order for arithmetic on real values to map directly to arithmetic on
quantized uint8 values, the mapping between real and quantized uint8 values must
be affine, which means it must be of the form

```
real_value = A * quantized_value + B             (1)
```

for some constants A, B, or equivalently, of the form

```
real_value = C * (quantized_value + D)           (2)
```

for some constants C, D. Indeed, anything other than such an affine map would
mean that the results of the quantized calculations no longer readily provide
an approximation to the results of the real-numbers calculation.

## Domain-specific constraint: the real value 0 must be exactly representable.

Here a domain-specific constraint from neural networks appears: for some neural
network layers, it is very useful for optimized implementations that the real
value 0 be exactly representable.

For instance, in a Convolutional or Pooling layer with padding, it is useful to
be able to implement the padding by zero-padding the input array, so that
optimized loops do not need to become more complex to avoid overrunning the
array bounds.

In order for such zero-padding to be feasible in a quantized implementation of
such layers, it is important that the real value '0' be exactly representable in
quantized form, i.e. that it correspond exactly to some quantized value, which
we call the _zero-point_.

Indeed, if '0' were not exactly representable, then we would have to use some
quantized value for padding that does not exactly correspond to the real value
'0'. That would typically introduce inaccuracy in the result. In fact, always
using the same such value would be worse: it would introduce _bias_ in the
result.

## The final form of the quantization equation

Now let us phrase what this constraint — that the real value 0 be exactly
representable — means in each of the quantization equations (1) and (2).

In equation (1), plugging `real_value = 0` and `quantized_value = zero_point`,
we get:

```
0 = A * zero_point + B
```

or equivalently:

```
zero_point = -B / A
```

We are thus left with a rather awkward constraint: the real number `-B / A` must
somehow be guaranteed to be exactly integral, so that the special uint8 value
`zero_point` can be exactly equal to it. Quite awkward!

Now let us look at equation (2). Plugging `real_value = 0` and
`quantized_value = zero_point`, we get:

```
0 = C * (zero_point + D)
```

Conveniently, the constant `C` plays no role anymore, so this equation
simplifies to:

```
0 = zero_point + D
```

In other words, `D = -zero_point`. This suggests rewriting the quantization
equation (2) into the following form (3), which will be the final form that we
will consistently use:

```
real_value = scale * (quantized_value - zero_point)        (3)
```

To go from (2) to (3), we merely renamed `C` to `scale` and `D` to
`-zero_point`.

With this quantization equation (3), the condition that 0 be exactly
representable is vacuously satisfied: `zero_point` is by definition one of the
possible `quantized_value`'s, and equation (3) maps it to a `real_value` of
exactly 0.

Note that the final quantization equation (3) depends on two constants, one
integral, the other an arbitrary positive real number:

*   `zero_point` is integral, more specifically it is one of the possible
    quantized values (i.e. typically a uint8 value).
*   `scale` is a positive real number. Thus at this stage we have not yet shown
    how to eliminate all usage of floating-point arithmetic. That will come
    below.

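To make equation (3) concrete, here is a minimal C++ sketch of dequantizing and
quantizing a single value (the helper functions are illustrative only, not part
of gemmlowp's API):

```
#include <algorithm>
#include <cmath>
#include <cstdint>

// Equation (3): real_value = scale * (quantized_value - zero_point).
float Dequantize(std::uint8_t quantized_value, float scale,
                 std::uint8_t zero_point) {
  return scale * (static_cast<int>(quantized_value) - zero_point);
}

// Inverse of equation (3): round to nearest and clamp to the uint8 range.
std::uint8_t Quantize(float real_value, float scale, std::uint8_t zero_point) {
  float q = zero_point + real_value / scale;
  q = std::min(255.f, std::max(0.f, std::round(q)));
  return static_cast<std::uint8_t>(q);
}
```

By construction, dequantizing `zero_point` itself yields exactly 0.
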
## Quantizing a matrix multiplication

Now that we know — equation (3) — how real numbers are to correspond
to quantized values (typically uint8), we turn to applying this knowledge to
rewriting a multiplication of matrices of real numbers as the equivalent
multiplication of matrices of quantized values.

Say that we have two matrices of real values `lhs_real_matrix`,
`rhs_real_matrix`. Each entry of their product is the sum (accumulation) of many
products of individual matrix entries, say `lhs_real_value * rhs_real_value`.

Now suppose that we have already quantized these two matrices according to the
above equation (3), with some already-known quantization parameters `lhs_scale`,
`rhs_scale`, `lhs_zero_point`, `rhs_zero_point`, so that their matrix entries
are quantized as

```
lhs_real_value[i] = lhs_scale * (lhs_quantized_value[i] - lhs_zero_point)
rhs_real_value[i] = rhs_scale * (rhs_quantized_value[i] - rhs_zero_point)
```

We then rewrite the matrix product accumulator accordingly:

```
result_real_value
  = Sum_over_i(lhs_real_value[i] * rhs_real_value[i])
  = Sum_over_i(
        lhs_scale * (lhs_quantized_value[i] - lhs_zero_point) *
        rhs_scale * (rhs_quantized_value[i] - rhs_zero_point)
    )
  = lhs_scale * rhs_scale * Sum_over_i(
        (lhs_quantized_value[i] - lhs_zero_point) *
        (rhs_quantized_value[i] - rhs_zero_point)
    )                                                      (4)
```

Now our goal is to represent this result itself as a quantized matrix, i.e.
still according to equation (3), for some pre-established quantization
parameters `result_scale` and `result_zero_point`, as

```
result_real_value = result_scale *
    (result_quantized_value - result_zero_point)
```

Here we need to keep in mind that our goal is to specify what the quantized
matrix multiplication should do, i.e. how to compute `result_quantized_value`.
The last equation above is equivalent to

```
result_quantized_value = result_zero_point +
    result_real_value / result_scale
```

Now we can use equation (4) above to plug into this the expression of
`result_real_value` in terms of the quantized operands, and we obtain:

```
result_quantized_value = result_zero_point +
    (lhs_scale * rhs_scale / result_scale) *
        Sum_over_i(
            (lhs_quantized_value[i] - lhs_zero_point) *
            (rhs_quantized_value[i] - rhs_zero_point)
        )                                                  (5)
```

Equation (5) is the conclusion of this general discussion of how to specify what
"quantized matrix multiplication" should actually compute, in order to be able
to replace real matrix multiplications.

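As a sanity check, here is a small, unoptimized C++ sketch evaluating equation
(5) directly for row-major matrices (illustrative only, not gemmlowp code). It
still uses a floating-point multiplier; the next section discusses how to avoid
that.

```
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Unoptimized reference for equation (5): lhs is rows x depth, rhs is
// depth x cols, result is rows x cols, all stored row-major; result must be
// pre-sized to rows * cols. Illustrative only.
void QuantizedMatMulReference(
    const std::vector<std::uint8_t>& lhs, const std::vector<std::uint8_t>& rhs,
    std::vector<std::uint8_t>* result, int rows, int depth, int cols,
    std::uint8_t lhs_zero_point, std::uint8_t rhs_zero_point,
    std::uint8_t result_zero_point, float lhs_scale, float rhs_scale,
    float result_scale) {
  const float real_multiplier = lhs_scale * rhs_scale / result_scale;
  for (int r = 0; r < rows; r++) {
    for (int c = 0; c < cols; c++) {
      // Inner accumulation of equation (5), in int32.
      std::int32_t acc = 0;
      for (int d = 0; d < depth; d++) {
        const std::int32_t lhs_val = lhs[r * depth + d] - lhs_zero_point;
        const std::int32_t rhs_val = rhs[d * cols + c] - rhs_zero_point;
        acc += lhs_val * rhs_val;
      }
      // Scale, add the result zero_point, round and clamp to uint8.
      float q = result_zero_point + real_multiplier * acc;
      q = std::min(255.f, std::max(0.f, std::round(q)));
      (*result)[r * cols + c] = static_cast<std::uint8_t>(q);
    }
  }
}
```
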
## Implementation of quantized matrix multiplication

Having obtained the mathematical form (5) of quantized matrix multiplication, we
now turn to its actual implementation.

The inner-most part of (5),

```
int32_accumulator =
    Sum_over_i(
        (lhs_quantized_value[i] - lhs_zero_point) *
        (rhs_quantized_value[i] - rhs_zero_point)
    )
```

is the "kernel" accumulation loop. It is where the bulk of the computational
cost goes. Luckily, it only involves integers: the quantized operand matrix
entries, and their `zero_point` quantization parameters. Typically, all of these
values are uint8. Typically, the above differences of uint8 values would be
represented as signed int16; their products as signed int32.

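For illustration, here is a scalar C++ sketch of this accumulation loop,
showing the data types just mentioned; actual gemmlowp kernels perform the
equivalent computation with SIMD instructions, and the function name here is
ours, not gemmlowp's.

```
#include <cstdint>

// Scalar sketch of the kernel accumulation: uint8 operands, int16
// differences, int32 products and accumulator.
std::int32_t AccumulateWithOffsets(const std::uint8_t* lhs_quantized,
                                   const std::uint8_t* rhs_quantized, int depth,
                                   std::int16_t lhs_zero_point,
                                   std::int16_t rhs_zero_point) {
  std::int32_t int32_accumulator = 0;
  for (int i = 0; i < depth; i++) {
    const std::int16_t lhs_diff = lhs_quantized[i] - lhs_zero_point;
    const std::int16_t rhs_diff = rhs_quantized[i] - rhs_zero_point;
    int32_accumulator += static_cast<std::int32_t>(lhs_diff) * rhs_diff;
  }
  return int32_accumulator;
}
```
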
It is out of scope of the present doc to discuss how to avoid the overhead of
having to subtract these `zero_point` constants in this inner loop; refer to
[this section of
low-precision.md](low-precision.md#efficient-handling-of-offsets) for that. The
gist of it is that a mathematical trick allows us to take the handling of these
`zero_point` constants out of this accumulation loop, so that it simplifies to

```
int32_accumulator =
    Sum_over_i(
      lhs_quantized_value[i] *
      rhs_quantized_value[i]
    )                                                      (6)
```

Either way, the result is an `int32_accumulator` that we now plug back into the
rest of (5):

```
result_quantized_value = result_zero_point +
    (lhs_scale * rhs_scale / result_scale) * int32_accumulator       (7)
```

The difficulty here is of course that `(lhs_scale * rhs_scale / result_scale)`
is a positive real number, not an integer in general. It is a constant, though.
So what we have to implement here is the (approximate) scaling of an int32 value
by some arbitrary positive constant multiplier.

Moreover, it is safe to assume that this positive constant multiplier is smaller
than one — each of the `scale` values here is typically smaller than one,
as we are typically mapping the `[0..255]` quantized uint8 value range to an
interval of real values that is much narrower than that, typically within
`[-10,10]` in most neural networks. For example, a neural network using Relu6
activation functions will typically have real activation values in the interval
[0,6].

So how do we implement the multiplication of an int32 value by a positive real
constant that is smaller than one? Typically, by multiplying by a fixed-point
constant multiplier in the normalized interval `[1/2,1)`, and right-shifting
the result to account for the remaining power-of-two factor.

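For concreteness, here is a sketch of how such a real multiplier, known to lie
in (0, 1), can be decomposed into an int32 fixed-point multiplier representing
a value in `[1/2,1)` (in Q31 format, i.e. scaled by 2^31) and a right shift.
This mirrors what [doc/quantization_example.cc](quantization_example.cc) does;
the helper below is illustrative, not part of gemmlowp's API.

```
#include <cassert>
#include <cmath>
#include <cstdint>

// Decompose real_multiplier, assumed to be in (0, 1), as
//   real_multiplier ~= (quantized_multiplier / 2^31) * 2^(-right_shift)
// where quantized_multiplier / 2^31 lies in [1/2, 1). Illustrative sketch.
void DecomposeMultiplier(float real_multiplier,
                         std::int32_t* quantized_multiplier, int* right_shift) {
  assert(real_multiplier > 0.f && real_multiplier < 1.f);
  int shift = 0;
  // Bring the multiplier into [1/2, 1) by repeated doubling, recording how
  // many doublings were needed: that is the right shift to apply afterwards.
  while (real_multiplier < 0.5f) {
    real_multiplier *= 2.f;
    shift++;
  }
  // Represent the normalized multiplier as a Q31 fixed-point number.
  std::int64_t q = static_cast<std::int64_t>(
      std::round(real_multiplier * static_cast<float>(1ll << 31)));
  // If rounding pushed it up to exactly 1.0, halve it and shift one bit less.
  if (q == (1ll << 31)) {
    q /= 2;
    shift--;
  }
  *quantized_multiplier = static_cast<std::int32_t>(q);
  *right_shift = shift;
}
```
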
At this point we have obtained the int32 value of the product

```
(lhs_scale * rhs_scale / result_scale) * int32_accumulator
```

Looking at (7), it only remains to add to it the integral value
`result_zero_point`, and we are done.

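Putting these pieces together, a minimal sketch of this final step is shown
below. For clarity it uses a 64-bit intermediate and a plain saturating cast to
uint8, which we assume is the desired final output type; gemmlowp's actual
output stages achieve the same effect with 32-bit fixed-point arithmetic.

```
#include <algorithm>
#include <cstdint>

// Apply equation (7) given the multiplier decomposition from the previous
// sketch, then add the result zero_point and saturate to uint8. Illustrative
// only: gemmlowp implements this with 32-bit fixed-point operations.
std::uint8_t QuantizeDownAccumulator(std::int32_t int32_accumulator,
                                     std::int32_t quantized_multiplier,
                                     int right_shift,
                                     std::int32_t result_zero_point) {
  // The multiplier stands for (quantized_multiplier / 2^31) * 2^(-right_shift),
  // so the 64-bit product must be shifted right by 31 + right_shift in total.
  const int total_shift = 31 + right_shift;
  const std::int64_t product =
      static_cast<std::int64_t>(int32_accumulator) * quantized_multiplier;
  const std::int64_t rounding = 1ll << (total_shift - 1);
  const std::int64_t scaled = (product + rounding) >> total_shift;
  // Add the result zero_point and saturate to the uint8 range [0, 255].
  const std::int64_t result = scaled + result_zero_point;
  return static_cast<std::uint8_t>(
      std::min<std::int64_t>(255, std::max<std::int64_t>(0, result)));
}
```
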
## How this is implemented in gemmlowp

The different parts of gemmlowp implementing aspects of the above discussion
are:

*   The packing stage (see [packing.md](packing.md)) implements the special
    mathematical trick to handle `lhs_offset`, `rhs_offset` that we alluded to
    above; see [this section of
    low-precision.md](low-precision.md#efficient-handling-of-offsets) for
    details. Thanks to it, the rest of the calculation can proceed as if
    `lhs_offset`, `rhs_offset` were 0.

*   The compute/kernel stage (see [kernel.md](kernel.md)) performs the core
    accumulation loop producing the `int32_accumulator`; see equation (6) above.

*   The unpacking stage feeds into the output pipeline (see
    [output.md](output.md)), which implements the rest of the evaluation of the
    above equation (5), as discussed in the previous section.

Now, the point of gemmlowp's flexible output-pipelines mechanism (see
[output.md](output.md)) is to support different quantization paradigms, so we
now have to specify which particular flavor of output pipeline corresponds to
the particular quantization paradigm that we detailed above in this document.

The specific output pipeline stage implementing the present quantization
paradigm, i.e. implementing the precise computation detailed in the previous
section (equation (5)), is
`OutputStageQuantizeDownInt32ByFixedPoint`.

Please refer to the comment explaining it in
[public/output_stages.h](../public/output_stages.h).

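For reference, here is a sketch of how such an output pipeline is typically
assembled, modeled on [doc/quantization_example.cc](quantization_example.cc);
please check [public/output_stages.h](../public/output_stages.h) for the
authoritative definitions of these stages and their fields.

```
#include <cstdint>
#include <tuple>

// Adjust the include path to your setup.
#include "public/output_stages.h"

// quantized_multiplier and right_shift come from decomposing
// lhs_scale * rhs_scale / result_scale, as discussed above.
auto MakeOutputPipeline(std::int32_t quantized_multiplier, int right_shift,
                        std::int32_t result_zero_point) {
  gemmlowp::OutputStageQuantizeDownInt32ByFixedPoint quantize_down_stage;
  quantize_down_stage.result_fixedpoint_multiplier = quantized_multiplier;
  quantize_down_stage.result_shift = right_shift;
  quantize_down_stage.result_offset_after_shift = result_zero_point;
  // Saturate the re-quantized int32 values and cast them down to uint8.
  gemmlowp::OutputStageSaturatingCastToUint8 saturating_cast_stage;
  return std::make_tuple(quantize_down_stage, saturating_cast_stage);
}
```

This pipeline is then passed to `GemmWithOutputPipeline`, as shown in the
example code.
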
## How this differs from the older legacy gemmlowp quantization paradigm

The difference between the older legacy quantization paradigm described in
[low-precision.md](low-precision.md) and the newer one described in this
document boils down to the difference between the legacy output stage
implementing it, `OutputStageQuantizeDownInt32ToUint8Scale`, and the new output
stage implementing the new paradigm,
`OutputStageQuantizeDownInt32ByFixedPoint`.

Please refer to the comments in
[public/output_stages.h](../public/output_stages.h) for details about these two
output stages and how they differ.

Issues with the old output stage `OutputStageQuantizeDownInt32ToUint8Scale` are:

1.  The int32 accumulators (inputs to the output stage) undergo a plain int32
    multiplication with an int32 multiplier, which may overflow. By contrast, in
    the newer `OutputStageQuantizeDownInt32ByFixedPoint`, this
    integer multiplication becomes a fixed-point multiplication and cannot
    overflow.

    *   In practice, to limit the risk of overflow, this pushes users to choose
        smaller values for this integer multiplier, which means limited
        multiplicative accuracy, which may cause multiplicative bias depending
        on how it is used.

2.  Note how the order of multiplying by the multiplier and adding the
    `result_offset` is swapped between these two output stages. In the legacy
    stage, this reflects a quantization equation of the form (1) above, as
    opposed to the form (2)/(3) that the new quantization paradigm uses. As a
    result, it is essentially impossible to guarantee that 0 is an
    exactly-representable value, which as discussed above is an issue at least
    in some convolutional neural network applications.

## Example code illustrating the new quantization paradigm

Example code showing how to perform a quantized matrix multiplication in the
quantization paradigm discussed here is in
[doc/quantization_example.cc](quantization_example.cc).