# Gemmlowp's public entry points

gemmlowp's public interface is defined in
[public/gemmlowp.h](../public/gemmlowp.h).

## GemmWithOutputPipeline

The primary public entry point is: `GemmWithOutputPipeline`.

A usage example is given in
[doc/quantization_example.cc](quantization_example.cc).

The high-level overview of how this specifies a low-precision matrix
multiplication is explained in [low-precision.md](low-precision.md). The
rationale for a specific quantization paradigm is given in
[quantization.md](quantization.md). That specific quantization paradigm is
implemented at two different stages of the computation: as pre-processing on
the operands and as post-processing on the result:

*   Pre-processing on the LHS, RHS operands, in the form of adding constant
    `lhs_offset`, `rhs_offset` to them, is explained in
    [low-precision.md](low-precision.md).

*   Post-processing on the result, in the form of a flexible "output pipeline",
    is explained in [output.md](output.md).

More details on this below as we discuss specific function parameters.

The prototype is:

```
template <typename InputScalar, typename OutputScalar, typename BitDepthParams,
          MapOrder LhsOrder, MapOrder RhsOrder, MapOrder ResultOrder,
          typename OutputPipelineType, typename GemmContextType>
void GemmWithOutputPipeline(GemmContextType* context,
                            const MatrixMap<const InputScalar, LhsOrder>& lhs,
                            const MatrixMap<const InputScalar, RhsOrder>& rhs,
                            MatrixMap<OutputScalar, ResultOrder>* result,
                            int lhs_offset, int rhs_offset,
                            const OutputPipelineType& output_pipeline);
```

A typical call looks like (from the [usage example](quantization_example.cc)):

```
gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::uint8_t,
                                 gemmlowp::DefaultL8R8BitDepthParams>(
    &gemm_context, uint8_lhs_matrix, uint8_rhs_matrix,
    &uint8_result_matrix, lhs_offset, rhs_offset, output_pipeline);
```

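For a self-contained picture of what surrounds such a call, here is a minimal
sketch, assuming pre-allocated external buffers; all names, shapes and
quantization parameter values are illustrative assumptions, with only the
gemmlowp types and entry point taken from
[public/gemmlowp.h](../public/gemmlowp.h):

```
#include <cstdint>
#include <tuple>

#include "public/gemmlowp.h"
#include "public/output_stages.h"

// Sketch: wrap pre-existing buffers, build a simple output pipeline, and
// run the GEMM. Buffer names, shapes and the three quantization parameter
// values are illustrative assumptions, not part of gemmlowp's API.
void RunGemm(const std::uint8_t* lhs_data, const std::uint8_t* rhs_data,
             std::uint8_t* result_data, int rows, int depth, int cols,
             int lhs_offset, int rhs_offset) {
  // MatrixMap objects merely map the external buffers; no data is copied.
  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::RowMajor>
      uint8_lhs_matrix(lhs_data, rows, depth);
  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::ColMajor>
      uint8_rhs_matrix(rhs_data, depth, cols);
  gemmlowp::MatrixMap<std::uint8_t, gemmlowp::MapOrder::ColMajor>
      uint8_result_matrix(result_data, rows, cols);

  // Holds persistent state (such as the worker thread pool) across calls.
  gemmlowp::GemmContext gemm_context;

  // A simple two-stage output pipeline: quantize the int32 accumulators
  // down, then saturating-cast to uint8. The three values are placeholders;
  // see output.md and quantization.md for how to choose them.
  gemmlowp::OutputStageQuantizeDownInt32ToUint8Scale quantize_down_stage;
  quantize_down_stage.result_offset = 501;
  quantize_down_stage.result_mult_int = 12345;
  quantize_down_stage.result_shift = 20;
  gemmlowp::OutputStageSaturatingCastToUint8 saturating_cast_stage;
  const auto output_pipeline =
      std::make_tuple(quantize_down_stage, saturating_cast_stage);

  gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::uint8_t,
                                   gemmlowp::DefaultL8R8BitDepthParams>(
      &gemm_context, uint8_lhs_matrix, uint8_rhs_matrix, &uint8_result_matrix,
      lhs_offset, rhs_offset, output_pipeline);
}
```
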
### Template parameters

Typically, only the first three template parameters need to be specified; the
rest are automatically deduced from the function parameters:

*   `InputScalar`: The scalar type of the LHS and RHS operands. At the moment,
    this must be `std::uint8_t`.
*   `OutputScalar`: The scalar type of the result matrix. At the moment,
    this must be `std::uint8_t`.
*   `BitDepthParams`: Defines the bit format of the input and output matrices
    and the required accuracy of the computation. At the moment, the only
    non-deprecated valid value is `gemmlowp::DefaultL8R8BitDepthParams`. See
    [less-than-8-bit.md](less-than-8-bit.md) for other values and the general
    idea of this, and how it may become more useful in the future.

The other template parameters, which typically do not need to be specified, are
listed below; a fully explicit instantiation is sketched after the list:

*   `LhsOrder`, `RhsOrder`, `ResultOrder`: the storage orders (row-major or
    column-major) of the LHS, RHS, result matrices. See
    [public/map.h](../public/map.h). See also the performance note below: we
    recommend RowMajor, ColMajor, ColMajor respectively, for optimal
    performance.
*   `OutputPipelineType`: the actual `std::tuple` type of the output pipeline.
    See the below explanation of the `output_pipeline` parameter, and
    [output.md](output.md).
*   `GemmContextType`: the type of the `context` parameter. At the moment, this
    must be `gemmlowp::GemmContext`.

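For reference, here is a minimal sketch of a fully explicit instantiation,
with every template parameter spelled out instead of deduced. It reuses the
illustrative variable names from the sketch above, and assumes an output
pipeline of exactly the two-stage `std::tuple` type aliased below:

```
// Hedged sketch: all template parameters written out explicitly. Normally
// the last five are deduced from the function arguments. PipelineType must
// exactly match the type of output_pipeline.
using PipelineType =
    std::tuple<gemmlowp::OutputStageQuantizeDownInt32ToUint8Scale,
               gemmlowp::OutputStageSaturatingCastToUint8>;

gemmlowp::GemmWithOutputPipeline<
    std::uint8_t, std::uint8_t, gemmlowp::DefaultL8R8BitDepthParams,
    gemmlowp::MapOrder::RowMajor, gemmlowp::MapOrder::ColMajor,
    gemmlowp::MapOrder::ColMajor, PipelineType, gemmlowp::GemmContext>(
    &gemm_context, uint8_lhs_matrix, uint8_rhs_matrix, &uint8_result_matrix,
    lhs_offset, rhs_offset, output_pipeline);
```
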
### Function parameters

The function parameters taken by `GemmWithOutputPipeline` are:

*   `context`: The `gemmlowp::GemmContext` object holding state and resources to
    be used for this gemmlowp call.
*   `lhs`, `rhs`: The LHS and RHS operand matrices. Note that these are
    `MatrixMap` objects, mapping external buffers as matrices, not owning data.
    See [public/map.h](../public/map.h).
*   `result`: pointer to the destination `MatrixMap` object, which must be
    already constructed, wrapping the external destination buffer with the
    wanted destination matrix shape and storage layout. No memory allocation
    will be performed by gemmlowp for the destination buffer. See
    [public/map.h](../public/map.h).
*   `lhs_offset`, `rhs_offset`: constants added to each matrix entry in the
    LHS, RHS matrices respectively, as explained in
    [low-precision.md](low-precision.md). This is only the part of the
    quantization paradigm explained in [quantization.md](quantization.md) that
    needs to be implemented as operations on the operands; everything else is
    operations on the result, see `output_pipeline`.
*   `output_pipeline`: a `std::tuple` of output stages (see
    [public/output_stages.h](../public/output_stages.h)), specifying the output
    pipeline (see [output.md](output.md)). This is the part of the quantization
    paradigm explained in [quantization.md](quantization.md) that needs to be
    implemented as operations on the result matrix. A concrete pipeline is
    sketched right after this list.

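As a concrete illustration of the `output_pipeline` parameter, here is a
pipeline in the style of the [usage example](quantization_example.cc): a
fixed-point quantize-down stage followed by a saturating cast to
`std::uint8_t`. The three parameter values (`quantized_multiplier`,
`right_shift`, `result_offset`) are placeholders, to be derived from the
matrices' quantization parameters as explained in
[quantization.md](quantization.md):

```
// Stage 1: scale the int32 accumulators down to the uint8 value range,
// using a fixed-point multiplier and a shift, then add the result offset.
gemmlowp::OutputStageQuantizeDownInt32ByFixedPoint quantize_down_stage;
quantize_down_stage.result_fixedpoint_multiplier = quantized_multiplier;
quantize_down_stage.result_shift = right_shift;
quantize_down_stage.result_offset_after_shift = result_offset;

// Stage 2: saturating cast of the quantized-down values to uint8.
gemmlowp::OutputStageSaturatingCastToUint8 saturating_cast_stage;

// The pipeline itself is just a std::tuple of the stages, in order.
const auto output_pipeline =
    std::make_tuple(quantize_down_stage, saturating_cast_stage);
```
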
### Performance note on storage orders

gemmlowp supports arbitrary combinations of storage orders for the LHS, RHS and
result matrices. However, not all are equally optimized for.

Because gemmlowp is primarily aimed at neural network inference workloads,
optimization focus is on this particular combination of storage orders
(illustrated in the sketch after this list):

*   `LhsOrder=RowMajor`
*   `RhsOrder=ColMajor`
*   `ResultOrder=ColMajor`

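Here is a minimal sketch of `MatrixMap` declarations using this recommended
combination. It also shows the stride-taking `MatrixMap` constructor from
[public/map.h](../public/map.h), useful when the mapped matrix is a sub-block
of a larger buffer; all names and shapes are illustrative assumptions:

```
// Recommended storage orders: row-major LHS, column-major RHS and result.
gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::RowMajor>
    lhs(lhs_data, rows, depth);
gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::ColMajor>
    rhs(rhs_data, depth, cols);

// The stride-taking constructor maps a rows x cols sub-block of a larger
// column-major buffer whose column stride is buffer_rows.
gemmlowp::MatrixMap<std::uint8_t, gemmlowp::MapOrder::ColMajor>
    result(result_data, rows, cols, buffer_rows);
```
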
The rationale is that the LHS is typically the constant weights of a neural
network layer (e.g. the weights of a Convolutional layer implemented as a matrix
multiplication), while the RHS and result are neural network activations,
respectively the input and output activations of the layer.

Because the RHS and result are activations, we want them to share the same
storage order -- so that one layer's output activations can be readily used as
the next layer's input activations. Thus, we focus on `RhsOrder=ResultOrder`.

We also know from general considerations on matrix multiplication that it is
slightly more efficient to have the direction of accumulation (the "depth"
dimension) be the direction of contiguous storage in memory. That means that it
is always going to be slightly easier and more efficient to have
`LhsOrder=RowMajor` and `RhsOrder=ColMajor`.

Putting this together, we arrive at gemmlowp's focus on the above-described
combination of storage orders.

Using other storage orders will typically mean taking less efficient paths in
the packing and unpacking stages, see [packing.md](packing.md). The compute
kernel stage ([kernel.md](kernel.md)) is unaffected.

## GemmWithOutputPipelinePC

This is a variant where `lhs_offset` and `rhs_offset` may be vectors instead of
scalars. They are then broadcast against the LHS, RHS respectively.

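Here is a hedged sketch of such a call. It assumes, based on
[public/map.h](../public/map.h) and gemmlowp's tests, that the offset vectors
are passed as `VectorMap` objects: a `Col`-shaped vector with one offset per
LHS row, and a `Row`-shaped vector with one offset per RHS column. Variable
names reuse the illustrative ones from the sketches above:

```
// Per-channel offsets as VectorMap objects (sketch; see public/map.h).
// Assumes one offset per LHS row and one per RHS column.
gemmlowp::VectorMap<const std::int32_t, gemmlowp::VectorShape::Col>
    lhs_offset_vector(lhs_offset_data, rows);
gemmlowp::VectorMap<const std::int32_t, gemmlowp::VectorShape::Row>
    rhs_offset_vector(rhs_offset_data, cols);

gemmlowp::GemmWithOutputPipelinePC<std::uint8_t, std::uint8_t,
                                   gemmlowp::DefaultL8R8BitDepthParams>(
    &gemm_context, uint8_lhs_matrix, uint8_rhs_matrix, &uint8_result_matrix,
    lhs_offset_vector, rhs_offset_vector, output_pipeline);
```
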
This is useful for some flavors of neural network inference with "per-channel
quantization", whence the PC suffix. This has been useful in some settings where
a neural network trained in float arithmetic was subsequently quantized. On the
other hand, retraining neural networks for quantized inference tends to remove
the need for per-channel quantization. For that reason, the long-term usefulness
of this entry point is in question.

## Gemm

This is gemmlowp's original, now legacy and deprecated, entry point. See the
section of [low-precision.md](low-precision.md) on the legacy quantization
paradigm. Avoid in new code.

## The eight_bit_int_gemm directory

As explained in the top-level [README.md](../README.md#public-interfaces), this
is entirely deprecated.
    162