# Computation with less than 8 bits in gemmlowp

## Introduction

We assume familiarity with gemmlowp's low-precision uint8 computation paradigm,
which is described in [low-precision.md](low-precision.md).

This document is about the possibility of further reducing precision below 8
bits.

This allows higher arithmetic throughput on some architectures, at the cost of
decreased accuracy.

## The past, present, and future of less-than-8-bit computation in gemmlowp

A meta note is needed here as to how this fits with the general gemmlowp design.

### The past

Less-than-8-bit computation was initially designed and implemented in gemmlowp
as a drop-in replacement for regular 8bit computation, a plain optimization. The
idea was to automatically requantize 8bit operands to less-than-8bit during the
O(N^2) packing stage, then take advantage of the lower bit depth during the
O(N^3) compute stage. For large enough matrices, that should be worth it.

### The present

TODO(benoitjacob): update this documentation. This 'present' state just
became the past (February 2017).

At the moment, this less-than-8-bit mode of gemmlowp is not much used in
practice, because the implicit requantization of operands from 8bit to
less-than-8bit turned out to be more expensive than initially expected, both in
terms of speed and accuracy:

1.  Speed: the O(N^2) requantization is only negligible compared to the O(N^3)
    compute kernel when the matrix size N is large enough; in practice, smaller
    matrix sizes turned out to be very important, making the requantization
    approach slower than expected.

2.  Accuracy: As neural networks were optimized for size, their sensitivity to
    numerical accuracy increased. Then the approach of requantizing
    already-quantized data turned out to be more wasteful of accuracy than we
    could afford.

### The future

Less-than-8-bit computation still probably has good prospects; what should be
dropped is the requantization. In other words, in the future, we might have
neural networks trained from the start for some bit depth lower than 8 bits. The
resulting values would probably still be stored as 8 bits (unless the bit depth
eventually becomes very low). Thus, no particular work would be needed in the
packing stage; no overhead or loss of accuracy would be incurred anymore.

In other words: the design of less-than-8-bit kernels is probably useful in the
long run; what is on the way out is requantization and the
packing/unpacking-stage aspects.

With that said, the rest of this page retains its old content about the present
approach:

## Public interface

### The BitDepthSetting parameter in the EightBitIntGemm interface

Accessing less-than-8-bit computation via the EightBitIntGemm interface is very
simple: EightBitIntGemm takes a BitDepthSetting enum which allows choosing among
a fixed set of supported bit-depth combinations.

### The BitDepthParams parameter in the public/gemmlowp.h interface

The public/gemmlowp.h interface exposes more extensive control over
quantization, by means of a BitDepthParams template parameter, which is a type
parameter carrying information about:

1.  The LHS and RHS bit depths, which can be set arbitrarily and independently;

2.  The 'RoundingStrategy', which is the heuristic used to choose a rounding
    mode, based on the accumulation size (a.k.a. the "depth" dimension of the
    Gemm).

Details can be seen in public/bit_depth.h.
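
As an illustration, a call through this interface might look like the sketch
below. It assumes the DefaultL7R5BitDepthParams type from public/bit_depth.h
and the Gemm entry point as declared in public/gemmlowp.h; the exact names and
parameter lists should be checked against those headers for your version.

    // Sketch: requesting 7-bit Lhs / 5-bit Rhs internal precision.
    // Input/output matrices remain plain uint8 regardless of BitDepthParams.
    #include "public/gemmlowp.h"

    void GemmL7R5(gemmlowp::GemmContext* context,
                  const gemmlowp::MatrixMap<const std::uint8_t,
                                            gemmlowp::MapOrder::RowMajor>& lhs,
                  const gemmlowp::MatrixMap<const std::uint8_t,
                                            gemmlowp::MapOrder::ColMajor>& rhs,
                  gemmlowp::MatrixMap<std::uint8_t,
                                      gemmlowp::MapOrder::ColMajor>* result,
                  int lhs_offset, int rhs_offset, int result_offset,
                  int result_mult_int, int result_shift) {
      gemmlowp::Gemm<std::uint8_t, gemmlowp::DefaultL7R5BitDepthParams>(
          context, lhs, rhs, result, lhs_offset, rhs_offset, result_offset,
          result_mult_int, result_shift);
    }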

### How does BitDepth{Setting,Params} affect input/output uint8 matrix data?

Input/output matrix data is all uint8's, ranging from 0 to 255, regardless of
the BitDepth{Setting,Params}.

So the BitDepth{Setting,Params} is only an internal detail. It merely allows
gemmlowp to use lower precision internally; the input/output data format is
unaffected.

As far as the API contract goes, the only thing that the
BitDepth{Setting,Params} does is to relax the accuracy requirement. With
standard 8bit/8bit computation, gemmlowp is required to return the exact result
as specified in [low-precision.md](low-precision.md). With lower bit depths,
gemmlowp is no longer required to return an exact result.

## Implementation

Here we refer to the 3 stages of computation as described in
[design.md](design.md), namely: packing, computation kernel, unpacking.

The general idea is that at the packing stage, we requantize input (Lhs/Rhs)
data to less-than-8-bit depths by scaling them, thus shrinking the range of the
packed matrix entries; for instance, if the Rhs bit depth is to be 5 bits then
packed Rhs matrix entries will be in the range [0 ... 31]. This then allows the
GEMM kernel to use narrower accumulators without risking overflow, thus
achieving higher arithmetic throughput. Finally, at the unpacking stage, it only
remains to scale the result values to compensate for the scalings applied
earlier.

Let us go into more detail for each of those stages:

### Packing stage

The packing stage is where most of the work specific to the BitDepthParams takes
place.

Here, we have to scale input matrix values from their original range of [0 ...
255] to the range specified by the BitDepthParams, which is [0 ... (2^N)-1]
where N is the number of bits for the matrix at hand (Lhs or Rhs). For example,
for a bit depth of 5 bits, we need to scale down to [0 ... 31].
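
As a concrete example of the scaling itself (the rounding details are discussed
further below), requantizing the input value 200 to 5 bits amounts to:

    200 * 31 / 255 = 24.3...  -->  24 after rounding to nearest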

This scaling is what we call "requantization". The pedantic name reflects the
fact that this is actually quite nontrivial to do correctly, i.e. in such a way
that the result accuracy will be good enough for real-world applications. See
the section below on requantization details.

Concretely, this work happens in PackingRegisterBlock::Pack(), which calls
Requantize(). This is in internal/pack.h. This code can be overridden for
specific architectures, see internal/pack_neon.h.

This requantization work is costly and makes packing slower. This means that, at
least in our approach, less-than-8-bit computation is only interesting for
large-enough, square-enough GEMMs where packing is only a small fraction of the
overall cost. In cases where packing overhead is more prevalent (highly
rectangular cases), less-than-8-bit is probably a waste of time as long as we
treat it as an internal computation detail. What might help there would be to
shrink the input/output data format itself, to lower memory bandwidth usage.

### Computation kernel stage

In principle, the computation kernel stage simply doesn't have to care about the
bit depth at all. In fact, on architectures where we do not have specific
optimized kernels for less-than-8-bit cases, we simply use our standard kernel
there, and that's correct!

However, while the kernel doesn't have to know that the operands are on less
than 8 bits, it can use that information to make special optimizations that
would be incorrect in the general 8-bit case and become correct here thanks to
the more restricted range of inputs. That's the whole point of this
less-than-8-bit computation idea.

With Lhs entries guaranteed to be smaller than 2^N, and Rhs entries guaranteed
to be smaller than 2^M, each product is thus guaranteed to be smaller than
2^(M+N). Thus, one may accumulate 2^(16-(M+N)) such products and still be
guaranteed that such an accumulator will be smaller than 2^16, and thus can be
stored as a uint16 without risking overflow.

For example, in the L7R5 case, the Lhs entries are on 7 bits (N=7) and the Rhs
entries are on 5 bits (M=5), so each product fits in 12 bits and one can thus
accumulate 16 (= 2^(16-12)) such products into uint16 accumulators with no risk
of overflow.

This means that a computation kernel may use uint16 accumulators for several
loop iterations (16 in the above example), provided that it is allowed to assume
that inputs are in such a restricted range.

After this fixed number of loop iterations, the kernel must accumulate the local
uint16 accumulators back into global uint32 accumulators.
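
As a scalar illustration of this two-level accumulator structure (the real
kernels are SIMD code, see below), a sketch of a single dot product in the L7R5
case could look like this; the function name and layout are purely illustrative:

    // Scalar sketch of local uint16 accumulation for the L7R5 case.
    // Each product fits in 12 bits (7 + 5), so 16 of them can be summed in a
    // uint16 without overflow before flushing into the uint32 accumulator.
    #include <algorithm>
    #include <cstdint>

    std::uint32_t DotProductL7R5(const std::uint8_t* lhs,
                                 const std::uint8_t* rhs, int depth) {
      std::uint32_t global_acc = 0;
      int d = 0;
      while (d < depth) {
        std::uint16_t local_acc = 0;
        const int chunk = std::min(16, depth - d);
        for (int i = 0; i < chunk; ++i) {
          local_acc += static_cast<std::uint16_t>(lhs[d + i]) * rhs[d + i];
        }
        // Flush the 16-bit local accumulator into the 32-bit global one.
        global_acc += local_acc;
        d += chunk;
      }
      return global_acc;
    }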

On SIMD architectures with suitable uint16 arithmetic, this in principle allows
multiplying arithmetic throughput by up to 2x, since twice as many accumulators
now fit in each SIMD vector register. This is partially offset by the cost of
accumulating back into global uint32 accumulators every several loop iterations,
but our experience on ARM NEON has been that we still get quite close to a 2x
speedup. See internal/kernel_neon.h, specifically
NEON32Kernel12x4Depth2Assuming12BitProducts.

### Unpacking stage

At the unpacking stage, it only remains to scale the result values to compensate
for the scaling of the inputs. This is easier because now we are expanding the
range instead of shrinking it, so we don't need to worry about ways to minimize
a loss of accuracy. We simply need to multiply result values by a constant
fraction, rounding to nearest.

Since the inputs were scaled by factors of (2^lhs_bits - 1)/255 and
(2^rhs_bits - 1)/255 respectively, the scaling of the outputs needs to be by the
following factor:

                 255 * 255
    -----------------------------------
    (2^lhs_bits - 1) * (2^rhs_bits - 1)

This is done by a MultiplyByConstantFraction function; see internal/unpack.h.
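
Written out directly, this rescaling amounts to something like the following
sketch; it is only meant to spell out the formula above with round-to-nearest,
not to reproduce the actual MultiplyByConstantFraction implementation:

    // Sketch of the unpacking-stage rescaling: multiply each accumulator by
    // 255*255 / ((2^lhs_bits - 1) * (2^rhs_bits - 1)), rounding to nearest.
    // Accumulators of uint8 products are nonnegative here.
    #include <cstdint>

    std::uint32_t ScaleResult(std::uint32_t acc, int lhs_bits, int rhs_bits) {
      const std::uint64_t numerator = 255 * 255;
      const std::uint64_t denominator =
          std::uint64_t((1 << lhs_bits) - 1) * ((1 << rhs_bits) - 1);
      return static_cast<std::uint32_t>(
          (acc * numerator + denominator / 2) / denominator);
    }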

## Requantization details

Here we go into more detail on the Requantize() function used at the packing
stage to requantize input matrix data. See this function in internal/pack.h.

It depends on the bit depth and on a rounding mode, and requantizes an input
value in [0 ... 255] to the range [0 ... (2^N)-1] specified by the bit depth N.

### Naive, bad rounding, that's plainly biased

Naive and inaccurate ways to achieve this requantization include:

1.  Shifting bits right by (8-N) bits;

2.  Multiplying by ((2^N) - 1) and dividing by 255.

Both of those are biased in some way: 1. has the wrong "derivative", since it
approximates (((2^N) - 1) / 255) by ((2^N) / 256); 2. has a bias since it
effectively implements rounding towards 0.
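
To see the size of the bias in approach 1., take N = 5: shifting right by 3 bits
scales by

    32/256 = 0.125    instead of    31/255 = 0.1216...

so requantized values come out systematically about 3% too high relative to the
ideal scaling, even before accounting for the truncation towards 0.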

In practice, both of the above requantization functions give results that are
too inaccurate for the neural network that we tried (GoogLeNet).

### Round-to-nearest rounding: unbiased in principle but not in practice

The simplest fix is to avoid the bias in approach 2. by rounding to nearest
instead of rounding towards 0. This can be achieved by computing

    dst = (src * maxval + rounding_offset) / 255;

where maxval = ((2^N) - 1) is the highest requantized value, and the
rounding_offset can be set to

    rounding_offset = 127

to achieve rounding-to-nearest (while the above rounding towards 0 corresponded
to rounding_offset = 0).
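
As a scalar sketch (the actual code is Requantize() in internal/pack.h, with
SIMD overrides elsewhere), this round-to-nearest requantization could be written
as:

    // Scalar sketch of round-to-nearest requantization from [0 .. 255] to
    // [0 .. (2^N)-1]. Illustrative only, not the actual gemmlowp code.
    #include <cstdint>

    std::uint8_t RequantizeRoundNearest(std::uint8_t src, int bits) {
      const std::uint32_t maxval = (1u << bits) - 1;
      const std::uint32_t rounding_offset = 127;  // approximates 255/2
      return static_cast<std::uint8_t>((src * maxval + rounding_offset) / 255);
    }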

In principle, rounding-to-nearest is unbiased and optimal in various ways.

In practice though, our input data is not random real numbers, but
already-quantized 8-bit values. That means that even in the best case, there
would be at most 255 different possible input values; in practice, we generally
see the input values distributed non-uniformly in that range, so that a majority
of input values tend to be in a much smaller range. See test/test_data.cc.

Having a large part of the input values in a very small finite set means that
the corresponding rounding errors are also in a very small finite set, which can
be small enough that the mean of these rounding errors is significantly
different from 0. That rounding-to-nearest is "unbiased" only means that over a
sufficiently large set of input values, the bias would become arbitrarily close
to 0; here, the set of input values is effectively small enough that the
resulting bias is significant.

This leads to biasing the matrix product entries, resulting in an error that
grows linearly with the depth dimension of the GEMM.

### Probabilistic rounding: unbiased even on small finite input distributions

To address that, we can instead use probabilistic rounding. The idea is that,
for instance, if we have to round the value 3.8 to the nearest integer, we can
round it to 3 with 20% probability and to 4 with 80% probability. If that value
3.8 occurs many times, the mean requantized value will thus tend to 3.8.

This amounts to keeping the above requantization formula,

    dst = (src * maxval + rounding_offset) / 255;

but now the rounding_offset is a random value in [0 .. 254].

This guarantees zero bias no matter how small the distribution of input values
is.
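
Relative to the earlier round-to-nearest sketch, the only change is where the
rounding offset comes from; schematically (with the offset supplied, per entry,
by one of the generators described below):

    // Probabilistic variant of the requantization sketch above: same formula,
    // but rounding_offset is a fresh value in [0 .. 254] for every entry.
    #include <cstdint>

    std::uint8_t RequantizeProbabilistic(std::uint8_t src, int bits,
                                         std::uint8_t rounding_offset) {
      const std::uint32_t maxval = (1u << bits) - 1;
      return static_cast<std::uint8_t>((src * maxval + rounding_offset) / 255);
    }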

On the other hand, the variance of the error term here is higher than with
rounding-to-nearest --- one can check that it is 2x higher.

So the error term coming from the Central Limit Theorem, which grows with the
square root of the accumulator depth i.e. the GEMM depth, will have 2x higher
variance here, i.e. a standard deviation higher by a factor of sqrt(2).

Still, for large enough GEMM depth, that is better than rounding-to-nearest,
which has an error term growing linearly with the GEMM depth.

### Switching between rounding-to-nearest and probabilistic rounding

Thus, for fixed input values and bit depths, we expect that probabilistic
rounding will give more accurate results for large enough GEMM depths, while
rounding-to-nearest will be more accurate for smaller GEMM depths.

That is why we switch between these rounding modes based on the GEMM depth; see
ChooseRoundingMode in internal/bit_depth_util.h.

It is based on a constant, kProbabilisticRoundingThreshold, defined in
internal/common.h and empirically determined. See the comment there. It would be
nice to better understand the statistics here and come up with better heuristics
for this switching.
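
Conceptually, the heuristic boils down to a single comparison against that
threshold, along these lines (a sketch; the real logic is ChooseRoundingMode in
internal/bit_depth_util.h, and the enum name here is illustrative):

    // Depth-based switch between rounding modes. For small depths, the bias of
    // round-to-nearest has not accumulated much, so its lower variance wins;
    // for large depths, the linearly growing bias makes probabilistic rounding
    // the better choice.
    enum class RoundingModeSketch { kRoundToNearest, kProbabilistic };

    RoundingModeSketch ChooseRoundingModeSketch(int depth, int threshold) {
      return depth < threshold ? RoundingModeSketch::kRoundToNearest
                               : RoundingModeSketch::kProbabilistic;
    }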

### Choice of pseudorandom number generator

We provide two PRNGs. The first is an 8-bit Xorshift. It is fast, naturally
produces values ranging over an interval of width 255, which is what we need
here (as opposed to an interval of width 256), and turns out, from empirical
tests, to produce better results than a linear congruential generator (LCG).
That's unfortunate, as an 8-bit LCG performs better, i.e. runs faster (we
confirmed that on various ARM devices), but we need as close to perfect
unbiasedness as we can get.

The second is an "add-mod" sequence generator, which generates non-random values
in the sequence x = (x + 97) % 255. This generates a low-discrepancy sequence
that minimizes the "clumpiness" of the random offsets (thus, for example,
quantizing a 3x3 matrix will have a maximum additive error of about 200 from the
random offsets). While not random, this sequence performs well empirically for
many quantizations. (For information about why 97 is a good value, see
https://en.wikipedia.org/wiki/Low-discrepancy_sequence#Additive_recurrence and
http://mollwollfumble.blogspot.com/2011/03/subrandom-numbers.html. 97/255 =
0.38, which is close to 0.382, the best choice. For discrete numbers, the choice
must be relatively prime to the modulus. 97 is prime, so it is safely relatively
prime to 255. 107 is another near-optimal choice.)

The low-discrepancy sequence generator is the default.
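
A minimal version of such an add-mod generator, implementing exactly the
x = (x + 97) % 255 recurrence above (the real generator lives alongside the
Xorshift one in internal/pack.h), could look like:

    // Sketch of the "add-mod" low-discrepancy rounding-offset generator.
    // Each call returns the next value of x = (x + 97) % 255, i.e. an offset
    // in [0 .. 254] as required by the probabilistic requantization formula.
    #include <cstdint>

    class AddModSequenceSketch {
     public:
      std::uint8_t Next() {
        x_ = static_cast<std::uint8_t>((x_ + 97) % 255);
        return x_;
      }

     private:
      std::uint8_t x_ = 0;
    };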

More details and results are given in a comment on the default PRNG in
internal/pack.h. Interested users can change the PRNG used by setting
DefaultRoundingGenerator in bit_depth_util.h.
    314