# The packing stage in gemmlowp

## Introduction

We assume familiarity with [design.md](design.md) and with the overall 3 stages
of computations described there: packing, kernel, unpacking.

This page goes into more detail about the first stage: packing.

We also assume familiarity with [kernel.md](kernel.md), as it describes the
packed format requirements that the kernels expect; that essentially forms the
contract that the packing stage must honor.

Some parts below also assume familiarity with
[low-precision.md](low-precision.md), as the packing stage also has to compute
the vectors of sums of rows or columns as described there.

## The storage order of packed blocks, partly hidden behind sequential access

As explained in [design.md](design.md), the primary purpose of packing is to
ensure that when the kernel traverses Lhs/Rhs matrix data, it can do so
efficiently thanks to having the data stored in an order that is as similar as
possible to the order in which the compute stage has to traverse this data.

This traversal order is nontrivial for the reasons outlined in
[design.md](design.md): at the innermost level, one tries to work within
registers as much as possible; at the next level, one tries to stay within L1
cache as much as possible. The packed blocks that we handle are supposed to fit
entirely in L2 cache.

Thus it has become standard in GEMM design to describe complicated "Z-order" or
"fractal order" storage for packed blocks.

However, we should keep in mind that the whole point of the packed storage order
is to be as similar as possible to the order of traversal during the compute
stage. The storage order doesn't matter in itself; the only thing that matters
is simple access patterns during the compute stage.

This suggests the following approach to implementing packing: take the exact
same hierarchy of nested loops as the compute stage, drop the loops that are not
relevant to the side (Lhs or Rhs) being packed, and try to use mostly sequential
access to the destination packed data.

This hierarchy of nested loops can be seen in PackSideBlockImpl (PackL2, PackL1,
PackRun); compare it to the similar hierarchy of loops in internal/compute.h.
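
To make the shape of this loop nest concrete, here is a rough, self-contained
sketch. It is illustrative only, not gemmlowp's actual code; names like
`kKernelWidth` and the parameter handling are simplified stand-ins:

```
#include <algorithm>

// Illustrative sketch of the packing-side loop nest. PackL2 walks over the
// whole packed block (sized to fit in L2 cache) in L1-sized sub-blocks,
// PackL1 walks over kernel-width slices, and PackRun is the sequential
// innermost loop along the depth dimension.
struct PackingLoopNestSketch {
  int l2_width, l2_depth;  // dimensions of the whole packed block
  int l1_width, l1_depth;  // dimensions of an L1-friendly sub-block

  static constexpr int kKernelWidth = 12;  // example value only

  void PackL2() {
    for (int d = 0; d < l2_depth; d += l1_depth) {
      const int depth = std::min(l1_depth, l2_depth - d);
      for (int w = 0; w < l2_width; w += l1_width) {
        const int width = std::min(l1_width, l2_width - w);
        PackL1(w, width, d, depth);
      }
    }
  }

  void PackL1(int start_width, int width, int start_depth, int depth) {
    // One run per kernel-width slice (boundary handling omitted here).
    for (int w = 0; w < width; w += kKernelWidth) {
      PackRun(start_width + w, start_depth, depth);
    }
  }

  void PackRun(int start_width, int start_depth, int depth) {
    // Innermost loop along the depth dimension: writes packed cells one
    // after another, i.e. purely sequential access to the destination.
    // (Body omitted in this sketch.)
  }
};
```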

In this way, the more intricate small-scale details of the packed data format
never need to be made explicit (which would be complicated). We still use some
"seeking", but only at larger scales, where the storage order is less
complicated to describe.

### Sequential access to PackedSideBlock data

See PackedSideBlock in internal/pack.h, specifically the following data members:

```
// Handle on the buffer backing this packed block. Owned.
Allocator::Handle data_handle_;
```

and:

```
// pos_ is the current position in the buffer, which we access
// sequentially, like a file.
// The idea is that we pack data in the same order as it is
// going to be traversed during the computation, which for
// cache-friendliness reasons is complicated to random-access,
// as the offsets calculations would be intricate. So we
// give up random-access addressing, and instead content ourselves
// with sequential access.
//
// pos_ is mutable because during the computation we will want to
// be able to iterate on the data in a const PackedSideBlock.
mutable int pos_;
```

The methods exposing sequential access are:

```
std::uint8_t* current_data() {
  return allocator_->GetPointer<std::uint8_t>(data_handle_) + pos_;
}
```

and:

```
void seek_next_cell() const { pos_ += KernelSideFormat::Cell::kSize; }

void seek_forward_n_cells(int n) const {
  pos_ += n * KernelSideFormat::Cell::kSize;
}
```

### Random access to PackedSideBlock data at larger scales

We still need some random access at larger scales (at a coarse granularity),
which is unavoidable since GEMM is O(n^3) and has to traverse each of the
O(n^2) inputs O(n) times.

The watershed between sequential access and random access is at the level of a
'Run'. Throughout gemmlowp we consistently use the term 'Run' to refer to the
innermost GEMM loop in the depth dimension. That's the critical inner loop that
must be as fast as possible, and for which we therefore absolutely want
sequential access to packed data, so that the storage order is optimal by
construction. At larger scales, i.e. between runs, we accept that the storage
order is less optimal; since it's also less intricate, it's not too hard to
implement random access there.

This is done by the seek_run method:

```
void seek_run(int start_width, int start_depth) const {
  int kernel_run_depth =
      std::min<int>(params_.l1_depth, params_.l2_depth - start_depth);
  pos_ = params_.l2_width * start_depth + start_width * kernel_run_depth;
}
```

We see that the formula involves the l1_depth parameter, which is how the packed
storage order depends on L1 cache size. Again, the whole packed block is
supposed to fit in L2 cache.
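
To make the formula concrete, here is a worked example with made-up parameter
values (illustrative only, assuming one byte per packed entry). With
l2_width = 96, l2_depth = 512, l1_depth = 128, seeking to the run at
start_width = 12, start_depth = 256 gives:

```
kernel_run_depth = min(l1_depth, l2_depth - start_depth)
                 = min(128, 512 - 256)
                 = 128
pos_ = l2_width * start_depth + start_width * kernel_run_depth
     = 96 * 256 + 12 * 128
     = 24576 + 1536
     = 26112
```

In other words, all runs belonging to earlier depth slices are stored first
(l2_width * start_depth entries), followed by the runs of the current depth
slice, each kernel_run_depth deep, up to start_width.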

## The innermost loop of the packing stage, PackRun, and PackingRegisterBlock

In keeping with our consistent usage of the term 'Run' throughout gemmlowp, the
innermost loop is called PackRun().

Here we recall a very important principle that was explained in
[kernel.md](kernel.md): the kernel is free to dictate the precise data format
that it expects; the packing code has to honor it. So there's an asymmetry here:
the kernel is the master, the packing is the slave. That's why the packing code
is templatized in the KernelSideFormat. At larger scales, the packing is
independent of kernel format details, but inside PackRun is where we take care
of the small-scale details that do depend on the kernel format. That's why it's
a good thing that we only need sequential access here, as it would be very
complicated to spell out random access at this scale.

Anyway, PackRun.

Since it is the critical inner loop, it is what we want to allow specializing
for particular CPU architectures. To allow that, we handle blocks of fixed
dimensions at a time, which are intended to be friendly enough to optimization.
These blocks are PackingRegisterBlocks, and their dimensions are:

```
  width = KernelWidth
  depth = kRegisterSize
```

See [kernel.md](kernel.md) and internal/kernel.h for the former, and
internal/common.h for the latter.

See the comments around PackingRegisterBlock in internal/pack.h:

```
// A PackingRegisterBlock is a small fixed-size block of a matrix being
// packed. This class is the generic non-optimized implementation,
// it is inherited by the generic implementation of PackingRegisterBlock,
// which may be overriden by template specialization. Overriding it is how
// one may provide optimized packing code paths.
//
// The packing of a block proceeds in two steps:
//   1. Ensuring that we have a complete block of source data, i.e. a block of
//      the compile-time prescribed size. This is where we handle unaligned
//      boundaries: if we don't have a complete block of source data, then
//      we copy and zero-extend it into a local temporary (complete_src_),
//      see MakeCompleteSrc. In the generic case, we do have a complete block,
//      so we just use it in-place, see UseCompleteSrcInPlace.
//   2. Packing a complete block into the destination, see Pack. This is the
//      most critical part, so it's convenient that unaligned boundaries have
//      already been handled in step 1.
```
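
The following self-contained sketch illustrates the two-step idea in isolation.
It is not gemmlowp's actual code; kBlockDepth, PackCompleteBlock and
PackWithBoundaryHandling are hypothetical stand-ins for kRegisterSize, Pack()
and the surrounding PackRun logic:

```
#include <cstdint>
#include <cstring>

// Stand-in for kRegisterSize: the fixed depth of a register block.
constexpr int kBlockDepth = 16;

// Step 2: the performance-critical part, which always sees a complete block
// of kBlockDepth entries. (Real packing reorders entries into cells and
// accumulates sums; a plain copy is enough to show the structure.)
void PackCompleteBlock(const std::uint8_t* complete_src, std::uint8_t* dst) {
  std::memcpy(dst, complete_src, kBlockDepth);
}

// Step 1: ensure we have a complete block of source data. If the source has
// fewer than kBlockDepth entries left, copy and zero-extend it into a local
// temporary (the MakeCompleteSrc case); otherwise use it in place (the
// UseCompleteSrcInPlace case).
void PackWithBoundaryHandling(const std::uint8_t* src, int available_depth,
                              std::uint8_t* dst) {
  std::uint8_t complete_src[kBlockDepth];
  const std::uint8_t* block_src = src;
  if (available_depth < kBlockDepth) {
    std::memset(complete_src, 0, kBlockDepth);
    std::memcpy(complete_src, src, available_depth);
    block_src = complete_src;
  }
  PackCompleteBlock(block_src, dst);  // step 2 never sees unaligned boundaries
}
```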

## Other things that the packing stage has to do

Besides storing matrix entries in a suitable order, the packing stage also has
two other things to do.

First, packing has to compute the vectors of sums of entries along the depth
dimension. If this sounds mysterious, read [low-precision.md](low-precision.md).
These sums will only be used at the unpacking stage.
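
As a simplified sketch of how that fits into packing (illustrative code, not
gemmlowp's; in the real code the equivalent accumulation happens inside the
Pack() step, cell by cell, into the packed block's sums buffer):

```
#include <cstdint>
#include <vector>

// Pack one run of `depth` entries for each of `width` rows/columns,
// accumulating the per-slice sums along the depth dimension as we go.
void PackRunWithSums(const std::uint8_t* src, int src_stride, int width,
                     int depth, std::vector<std::uint8_t>* packed,
                     std::vector<std::int32_t>* sums_of_each_slice) {
  for (int d = 0; d < depth; d++) {
    for (int w = 0; w < width; w++) {
      const std::uint8_t entry = src[w * src_stride + d];
      packed->push_back(entry);           // sequential, file-like write
      (*sums_of_each_slice)[w] += entry;  // sum along the depth dimension
    }
  }
}
```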

Second, if the BitDepthSetting requires less than 8 bits of precision, then at
the packing stage we have to requantize inputs accordingly. See
[less-than-8-bit.md](less-than-8-bit.md) for details. This is the Requantize()
function.

## Specialized packing paths for specific formats on specific CPU architectures

Please refer to internal/pack_neon.h for examples of how this is done. The
piece of code to be specialized is PackingRegisterBlock. However, inside of it,
only the Pack() method typically needs to be specialized (the rest is unlikely
to be performance-critical). So one typically specializes PackingRegisterBlock
while still inheriting PackingRegisterBlockBase, to keep the generic parts, and
overrides only the Pack() method.
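
The shape of the pattern is sketched below. This is a self-contained
illustration of the specialization mechanism only; the real template parameters
in internal/pack.h and internal/pack_neon.h are the source map and packed-block
types, and their exact signatures differ from these stand-ins:

```
// Generic base: provides the non-optimized Pack() and the shared helpers.
template <typename SrcMap, typename PackedBlock>
class PackingRegisterBlockBase {
 public:
  void Pack(PackedBlock* dst) {
    // Generic, non-optimized packing path.
  }
  // Shared helpers (MakeCompleteSrc, UseCompleteSrcInPlace, ...) live here.
};

// Generic case: nothing but the base.
template <typename SrcMap, typename PackedBlock>
class PackingRegisterBlock
    : public PackingRegisterBlockBase<SrcMap, PackedBlock> {};

// Hypothetical format tags standing in for the real source-map and
// PackedSideBlock<KernelSideFormat<...>> types.
struct WidthMajorUint8Src {};
struct SomeNeonFriendlyPackedBlock {};

// Specialization for one (source format, packed format) pair: keep the
// generic helpers by inheriting the base, override only the critical Pack().
template <>
class PackingRegisterBlock<WidthMajorUint8Src, SomeNeonFriendlyPackedBlock>
    : public PackingRegisterBlockBase<WidthMajorUint8Src,
                                      SomeNeonFriendlyPackedBlock> {
 public:
  void Pack(SomeNeonFriendlyPackedBlock* dst) {
    // Optimized (e.g. NEON intrinsics) packing path for this format pair.
  }
};
```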

Template specialization for the right template parameters is how one specifies
in which case a given path is to be used in place of the generic packing code.

It is entirely possible to set the value of kRegisterSize differently based on
the CPU architecture (for example, 32 on x86 with AVX) as long as all the
specialized packing paths used on that CPU architecture are consistent with it.
    205