
# Output pipelines in gemmlowp

In gemmlowp, the "output pipeline" is the process that takes a final `int32`
accumulator value (the output of the compute/kernel stage), and processes it to
obtain the final value (typically a `uint8` value) and write it to the
destination matrix.

gemmlowp is generic over the arithmetic transformations that take place in the
output pipeline, so that different users can implement different quantization
paradigms. See [low-precision.md](low-precision.md) and
[quantization.md](quantization.md).

Besides implementing a quantization paradigm, output pipelines are also useful
for implementing fused operations, where a matrix multiplication feeds into
other operations applied to its result without additional array traversals.
For instance, when implementing neural network inference, one might have a
convolutional layer with a bias-addition and an activation. One then wants to
feed the result of the matrix multiplication that implements the convolution
itself directly into the bias-addition and activation function. gemmlowp's
output pipelines allow exactly that: the bias-addition and activation function
are just additional stages in the output pipeline.

## Usage

The gemmlowp entry point that accepts an arbitrary output pipeline is
`GemmWithOutputPipeline` in [public/gemmlowp.h](../public/gemmlowp.h).

The output pipeline is specified as a `std::tuple` of "output stages", each of
which defines an elementary arithmetic transformation.

All available output stages are defined in
[public/output_stages.h](../public/output_stages.h).
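
To make the tuple-of-stages idea concrete, here is a minimal, self-contained
sketch that does not use gemmlowp at all. The stage types (`AddBiasStage`,
`ScaleDownStage`, `ClampStage`) and the helpers `RunPipeline` and
`ExampleOutput` are hypothetical names made up for this illustration; they are
not the structs from `public/output_stages.h`, and the scaling formula is only
an assumed example of a quantize-down step. The sketch applies each stage of a
`std::tuple` in order to a single `int32` accumulator and then saturates the
result to `uint8`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <tuple>

// Hypothetical stage types, for illustration only. They are NOT the structs
// defined in public/output_stages.h.

// Adds a fixed bias to the accumulator (a stand-in for bias-addition).
struct AddBiasStage {
  std::int32_t bias;
  std::int32_t Eval(std::int32_t x) const { return x + bias; }
};

// Scales the accumulator down: ((x + offset) * mult + rounding) >> shift.
// This formula is an assumption chosen for the example.
struct ScaleDownStage {
  std::int32_t offset;
  std::int32_t mult;
  int shift;
  std::int32_t Eval(std::int32_t x) const {
    assert(shift >= 0 && shift < 31);
    const std::int32_t rounding = (shift == 0) ? 0 : (1 << (shift - 1));
    return ((x + offset) * mult + rounding) >> shift;
  }
};

// Clamps to [lo, hi]; with lo = 0 this acts like a ReLU-style activation.
struct ClampStage {
  std::int32_t lo;
  std::int32_t hi;
  std::int32_t Eval(std::int32_t x) const {
    return std::min(hi, std::max(lo, x));
  }
};

// Applies stages I, I+1, ..., N-1 of the tuple, in that order.
template <std::size_t I, std::size_t N, typename Pipeline>
struct PipelineRunner {
  static std::int32_t Run(const Pipeline& p, std::int32_t x) {
    return PipelineRunner<I + 1, N, Pipeline>::Run(p, std::get<I>(p).Eval(x));
  }
};

// End of the tuple: nothing left to apply.
template <std::size_t N, typename Pipeline>
struct PipelineRunner<N, N, Pipeline> {
  static std::int32_t Run(const Pipeline&, std::int32_t x) { return x; }
};

// Runs every stage on one accumulator, then saturates to the uint8 range.
template <typename... Stages>
std::uint8_t RunPipeline(const std::tuple<Stages...>& pipeline,
                         std::int32_t accumulator) {
  const std::int32_t v =
      PipelineRunner<0, sizeof...(Stages), std::tuple<Stages...>>::Run(
          pipeline, accumulator);
  return static_cast<std::uint8_t>(
      std::min<std::int32_t>(255, std::max<std::int32_t>(0, v)));
}

// Example pipeline: bias-addition, scale-down, then clamp, in one pass.
std::uint8_t ExampleOutput(std::int32_t accumulator) {
  const auto pipeline = std::make_tuple(
      AddBiasStage{24}, ScaleDownStage{0, 1, 2}, ClampStage{0, 255});
  return RunPipeline(pipeline, accumulator);
}
```

With these made-up parameters, `ExampleOutput(100)` yields 31 and
`ExampleOutput(1000)` saturates to 255. The point of the design is that the
bias and clamp steps run on each accumulator as it is produced, so no second
traversal of the result matrix is needed.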

## Example usage

The best place to see examples of using various output pipelines is the unit
test,

```
test/test.cc
```

specifically in this function:

```
TestOutputStages
```

Separately, a self-contained example showing how to use gemmlowp to compute a
quantized matrix multiplication with a sound quantization paradigm is here:

[doc/quantization_example.cc](quantization_example.cc)