# TensorFlow Lite inference

[TOC]

## Overview

TensorFlow Lite inference is the process of executing a TensorFlow Lite
model on-device and extracting meaningful results from it. Inference is the
final step in using the model on-device in the
[architecture](index.md#tensorflow_lite_architecture).

Inference for TensorFlow Lite models is run through an interpreter. This
document outlines the various APIs for the interpreter along with the
[supported platforms](#supported-platforms).

### Important Concepts

TensorFlow Lite inference on device typically follows these steps:

1. **Loading a Model**

   The user loads the `.tflite` model, which contains the model's execution
   graph, into memory.

1. **Transforming Data**

   Input data acquired by the user generally does not match the input data
   format expected by the model. For example, a user may need to resize an
   image or change the image format before the model can consume it (see the
   sketch after this list).

1. **Running Inference**

   This step involves using the API to execute the model. It involves a few
   steps such as building the interpreter and allocating tensors, as explained
   in detail in [Running a Model](#running_a_model).

1. **Interpreting Output**

   The user retrieves results from model inference and interprets the tensors in
   a meaningful way to be used in the application.

   For example, a model may only return a list of probabilities. It is up to the
   application developer to meaningfully map them to relevant categories and
   present them to the user.

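A minimal sketch of this kind of input transformation in Java, assuming a
hypothetical float model that expects a 224x224 RGB image normalized to the
range [0, 1] (the `argbPixels` array stands in for pixel data from an
already-resized image):

```java
// `argbPixels` is assumed to hold 224 * 224 packed ARGB pixel values,
// for example obtained from a resized bitmap.
int[] argbPixels = new int[224 * 224];

// The hypothetical model expects input of shape [1, 224, 224, 3].
float[][][][] input = new float[1][224][224][3];
for (int y = 0; y < 224; y++) {
  for (int x = 0; x < 224; x++) {
    int pixel = argbPixels[y * 224 + x];
    input[0][y][x][0] = ((pixel >> 16) & 0xFF) / 255.0f;  // Red channel.
    input[0][y][x][1] = ((pixel >> 8) & 0xFF) / 255.0f;   // Green channel.
    input[0][y][x][2] = (pixel & 0xFF) / 255.0f;          // Blue channel.
  }
}
```
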
### Supported Platforms
TensorFlow Lite inference APIs are provided for most common mobile/embedded
platforms such as Android, iOS and Linux.

#### Android
On Android, TensorFlow Lite inference can be performed using either Java or C++
APIs. The Java APIs provide convenience and can be used directly within your
Android Activity classes. The C++ APIs, on the other hand, may offer more
flexibility and speed, but may require writing JNI wrappers to move data between
the Java and C++ layers. You can find an example [here](android.md).

#### iOS
TensorFlow Lite provides Swift and Objective-C APIs for inference on iOS. An
example can be found [here](ios.md).

#### Linux
On Linux platforms such as [Raspberry Pi](build_rpi.md), the TensorFlow Lite C++
and Python APIs can be used to run inference.

## API Guides

TensorFlow Lite provides programming APIs in C++, Java and Python, with
experimental bindings for several other languages (C, Swift, Objective-C). In
most cases, the API design reflects a preference for performance over ease of
use. TensorFlow Lite is designed for fast inference on small devices so it
should be no surprise that the APIs try to avoid unnecessary copies at the
expense of convenience. Similarly, consistency with TensorFlow APIs was not an
explicit goal and some variance is to be expected.

There is also a [Python API for TensorFlow Lite](../convert/python_api.md).

### Loading a Model

#### C++
The `FlatBufferModel` class encapsulates a model and can be built in a couple of
slightly different ways depending on where the model is stored:

```c++
class FlatBufferModel {
  // Build a model based on a file. Return a nullptr in case of failure.
  static std::unique_ptr<FlatBufferModel> BuildFromFile(
      const char* filename,
      ErrorReporter* error_reporter);

  // Build a model based on a pre-loaded flatbuffer. The caller retains
  // ownership of the buffer and should keep it alive until the returned object
  // is destroyed. Return a nullptr in case of failure.
  static std::unique_ptr<FlatBufferModel> BuildFromBuffer(
      const char* buffer,
      size_t buffer_size,
      ErrorReporter* error_reporter);
};
```

For example, to build a model from a file:

```c++
std::unique_ptr<tflite::FlatBufferModel> model =
    tflite::FlatBufferModel::BuildFromFile(path_to_model);
```

Note that if TensorFlow Lite detects the presence of Android's NNAPI, it will
automatically try to use shared memory to store the `FlatBufferModel`.

#### Java

TensorFlow Lite's Java API supports on-device inference and is provided as an
Android Studio library that allows loading models, feeding inputs, and
retrieving inference outputs.

The `Interpreter` class drives model inference with TensorFlow Lite. In
most cases, this is the only class an app developer will need.

The `Interpreter` can be initialized with a model file using the constructor:

```java
public Interpreter(@NotNull File modelFile);
```

or with a `MappedByteBuffer`:

```java
public Interpreter(@NotNull MappedByteBuffer mappedByteBuffer);
```

In both cases, a valid TensorFlow Lite model must be provided or an
`IllegalArgumentException` will be thrown. If a `MappedByteBuffer` is used to
initialize an `Interpreter`, it should remain unchanged for the whole lifetime
of the `Interpreter`.
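
For example, a minimal sketch of initializing an `Interpreter` from a
memory-mapped model file (the file path below is hypothetical):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

try (FileInputStream inputStream = new FileInputStream("/path/to/model.tflite");
     FileChannel fileChannel = inputStream.getChannel()) {
  // Memory-map the .tflite file. The buffer must stay unchanged for as long as
  // the Interpreter built from it is in use.
  MappedByteBuffer modelBuffer =
      fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileChannel.size());
  try (Interpreter interpreter = new Interpreter(modelBuffer)) {
    // ... feed inputs and read outputs here ...
  }
} catch (IOException e) {
  // Handle the error as appropriate for the application.
}
```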

### Running a Model {#running_a_model}

#### C++
Running a model involves a few simple steps:

  * Build an `Interpreter` based on an existing `FlatBufferModel`.
  * Optionally resize input tensors if the predefined sizes are not desired.
  * Set input tensor values.
  * Invoke inference.
  * Read output tensor values.

The important parts of the `Interpreter`'s public interface are provided
below. It should be noted that:

  * Tensors are represented by integers, in order to avoid string comparisons
    (and any fixed dependency on string libraries).
  * An interpreter must not be accessed from concurrent threads.
  * Memory allocation for input and output tensors must be triggered
    by calling `AllocateTensors()` right after resizing tensors.

In order to run the inference model in TensorFlow Lite, one has to load the
model into a `FlatBufferModel` object which then can be executed by an
`Interpreter`. The `FlatBufferModel` needs to remain valid for the whole
lifetime of the `Interpreter`, and a single `FlatBufferModel` can be
simultaneously used by more than one `Interpreter`. In concrete terms, the
`FlatBufferModel` object must be created before any `Interpreter` objects that
use it, and must be kept around until they have all been destroyed.

The simplest usage of TensorFlow Lite will look like this:

```c++
std::unique_ptr<tflite::FlatBufferModel> model =
    tflite::FlatBufferModel::BuildFromFile(path_to_model);

tflite::ops::builtin::BuiltinOpResolver resolver;
std::unique_ptr<tflite::Interpreter> interpreter;
tflite::InterpreterBuilder(*model, resolver)(&interpreter);

// Resize input tensors, if desired.
interpreter->AllocateTensors();

float* input = interpreter->typed_input_tensor<float>(0);
// Fill `input`.

interpreter->Invoke();

float* output = interpreter->typed_output_tensor<float>(0);
```

#### Java

The simplest usage of the TensorFlow Lite Java API looks like this:

```java
try (Interpreter interpreter = new Interpreter(file_of_a_tensorflowlite_model)) {
  interpreter.run(input, output);
}
```

If a model takes only one input and returns only one output, the following will
trigger an inference run:

```java
interpreter.run(input, output);
```

For models with multiple inputs or outputs, use:

```java
interpreter.runForMultipleInputsOutputs(inputs, map_of_indices_to_outputs);
```

where each entry in `inputs` corresponds to an input tensor and
`map_of_indices_to_outputs` maps indices of output tensors to the corresponding
output data. In both cases the tensor indices should correspond to the values
given to the
[TensorFlow Lite Optimized Converter](../convert/cmdline_examples.md) when the
model was created. Be aware that the order of tensors in `inputs` must match the
order given to the TensorFlow Lite Optimized Converter.
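
For instance, a minimal sketch for a hypothetical model with two inputs and two
outputs (all shapes below are placeholders for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Input arrays are placed in the order of the model's input tensor indices.
float[][] input0 = new float[1][128];
float[][] input1 = new float[1][128];
Object[] inputs = {input0, input1};

// Output containers are keyed by output tensor index.
float[][] output0 = new float[1][10];
float[][] output1 = new float[1][4];
Map<Integer, Object> outputs = new HashMap<>();
outputs.put(0, output0);
outputs.put(1, output1);

interpreter.runForMultipleInputsOutputs(inputs, outputs);
```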

The Java API also provides convenient functions for app developers to get the
index of any model input or output using a tensor name:

```java
public int getInputIndex(String tensorName);
public int getOutputIndex(String tensorName);
```

If `tensorName` is not a valid name in the model, an `IllegalArgumentException`
will be thrown.
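
For example, the returned index can be used to key the output container passed
to `runForMultipleInputsOutputs` (the tensor name and output shape below are
hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// "probabilities" is a hypothetical output tensor name; substitute the name
// used in your own model.
int probabilitiesIndex = interpreter.getOutputIndex("probabilities");

Map<Integer, Object> outputs = new HashMap<>();
outputs.put(probabilitiesIndex, new float[1][1001]);  // Placeholder shape.
```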

##### Releasing Resources After Use

An `Interpreter` owns resources. To avoid memory leaks, the resources must be
released after use by:

```java
interpreter.close();
```

##### Supported Data Types

To use TensorFlow Lite, the data types of the input and output tensors must be
one of the following primitive types:

*   `float`
*   `int`
*   `long`
*   `byte`

`String` types are also supported, but they are encoded differently than the
primitive types. In particular, the shape of a string Tensor dictates the number
and arrangement of strings in the Tensor, with each element itself being a
variable length string. In this sense, the (byte) size of the Tensor cannot be
computed from the shape and type alone, and consequently strings cannot be
provided as a single, flat `ByteBuffer` argument.

If other data types, including boxed types like `Integer` and `Float`, are used,
an `IllegalArgumentException` will be thrown.

##### Inputs

Each input should be an array or multi-dimensional array of the supported
primitive types, or a raw `ByteBuffer` of the appropriate size. If the input is
an array or multi-dimensional array, the associated input tensor will be
implicitly resized to the array's dimensions at inference time. If the input is
a `ByteBuffer`, the caller should first manually resize the associated input
tensor (via `Interpreter.resizeInput()`) before running inference.

When using a `ByteBuffer`, prefer direct byte buffers, as this allows the
`Interpreter` to avoid unnecessary copies. If the `ByteBuffer` is a direct byte
buffer, its order must be `ByteOrder.nativeOrder()`. Once it has been passed in
for a model inference, it must remain unchanged until the inference is finished.
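
A minimal sketch of feeding a direct `ByteBuffer`, assuming a hypothetical
float input tensor of shape [1, 224, 224, 3] and a placeholder output shape:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical float input of shape [1, 224, 224, 3]; 4 bytes per float value.
int[] inputShape = {1, 224, 224, 3};
ByteBuffer inputBuffer =
    ByteBuffer.allocateDirect(4 * 224 * 224 * 3).order(ByteOrder.nativeOrder());

// When the input is a ByteBuffer, resize the input tensor explicitly first.
interpreter.resizeInput(0, inputShape);

// ... fill inputBuffer with preprocessed values via putFloat() ...
inputBuffer.rewind();

float[][] output = new float[1][1001];  // Placeholder output shape.
interpreter.run(inputBuffer, output);
```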

##### Outputs

Each output should be an array or multi-dimensional array of the supported
primitive types, or a `ByteBuffer` of the appropriate size. Note that some
models have dynamic outputs, where the shape of output tensors can vary
depending on the input. There's no straightforward way of handling this with
the existing Java inference API, but planned extensions will make this possible.


## Writing Custom Operators

All TensorFlow Lite operators (both custom and builtin) are defined using a
simple pure-C interface that consists of four functions:

```c++
typedef struct {
  void* (*init)(TfLiteContext* context, const char* buffer, size_t length);
  void (*free)(TfLiteContext* context, void* buffer);
  TfLiteStatus (*prepare)(TfLiteContext* context, TfLiteNode* node);
  TfLiteStatus (*invoke)(TfLiteContext* context, TfLiteNode* node);
} TfLiteRegistration;
```

Refer to `context.h` for details on `TfLiteContext` and `TfLiteNode`. The
former provides error reporting facilities and access to global objects,
including all the tensors. The latter allows implementations to access their
inputs and outputs.

When the interpreter loads a model, it calls `init()` once for each node in the
graph. A given `init()` will be called more than once if the op is used
multiple times in the graph. For custom ops a configuration buffer will be
provided, containing a flexbuffer that maps parameter names to their values.
The buffer is empty for builtin ops because the interpreter has already parsed
the op parameters. Kernel implementations that require state should initialize
it here and transfer ownership to the caller. For each `init()` call, there
will be a corresponding call to `free()`, allowing implementations to dispose
of the buffer they might have allocated in `init()`.

Whenever the input tensors are resized, the interpreter will go through the
graph notifying implementations of the change. This gives them the chance to
resize their internal buffers, check validity of input shapes and types, and
recalculate output shapes. This is all done through `prepare()`, and
implementations can access their state using `node->user_data`.

Finally, each time inference runs the interpreter traverses the graph calling
`invoke()`, and here too the state is available as `node->user_data`.

Custom ops can be implemented in exactly the same way as builtin ops, by
defining those four functions and a global registration function that usually
looks like this:

```c++
namespace tflite {
namespace ops {
namespace custom {
  TfLiteRegistration* Register_MY_CUSTOM_OP() {
    static TfLiteRegistration r = {my_custom_op::Init,
                                   my_custom_op::Free,
                                   my_custom_op::Prepare,
                                   my_custom_op::Eval};
    return &r;
  }
}  // namespace custom
}  // namespace ops
}  // namespace tflite
```

Note that registration is not automatic and an explicit call to
`Register_MY_CUSTOM_OP` should be made somewhere. While the standard
`BuiltinOpResolver` (available from the `:builtin_ops` target) takes care of the
registration of builtins, custom ops will have to be collected in separate
custom libraries.

### Customizing the kernel library

Behind the scenes the interpreter will load a library of kernels which will be
assigned to execute each of the operators in the model. While the default
library only contains builtin kernels, it is possible to replace it with a
custom library.

The interpreter uses an `OpResolver` to translate operator codes and names into
actual code:

```c++
class OpResolver {
  virtual TfLiteRegistration* FindOp(tflite::BuiltinOperator op) const = 0;
  virtual TfLiteRegistration* FindOp(const char* op) const = 0;
  virtual void AddOp(tflite::BuiltinOperator op, TfLiteRegistration* registration) = 0;
  virtual void AddOp(const char* op, TfLiteRegistration* registration) = 0;
};
```

Regular usage will require the developer to use the `BuiltinOpResolver` and
write:

```c++
tflite::ops::builtin::BuiltinOpResolver resolver;
```

They can then optionally register custom ops:

```c++
resolver.AddOp("MY_CUSTOM_OP", Register_MY_CUSTOM_OP());
```

before the resolver is passed to the `InterpreterBuilder`.

If the set of builtin ops is deemed to be too large, a new `OpResolver` could
be code-generated based on a given subset of ops, possibly only the ones
contained in a given model. This is the equivalent of TensorFlow's selective
registration (and a simple version of it is available in the `tools`
directory).
    378