# TensorFlow Lite inference

[TOC]

## Overview

TensorFlow Lite inference is the process of executing a TensorFlow Lite model
on-device and extracting meaningful results from it. Inference is the final
step in using the model on-device in the
[architecture](index.md#tensorflow_lite_architecture).

Inference for TensorFlow Lite models is run through an interpreter. This
document outlines the various APIs for the interpreter along with the
[supported platforms](#supported-platforms).

### Important Concepts

TensorFlow Lite inference on device typically follows these steps:

1.  **Loading a Model**

    The user loads the `.tflite` model, which contains the model's execution
    graph, into memory.

1.  **Transforming Data**

    Input data generally does not arrive in the format expected by the model.
    For example, a user may need to resize an image or change the image format
    to be compatible with the model.

1.  **Running Inference**

    This step involves using the API to execute the model. It involves a few
    steps such as building the interpreter and allocating tensors, as explained
    in detail in [Running a Model](#running_a_model).

1.  **Interpreting Output**

    The user retrieves results from model inference and interprets the tensors
    in a way that is meaningful for the application.

    For example, a model may only return a list of probabilities. It is up to
    the application developer to map them to relevant categories and present
    them to the user.

### Supported Platforms

TensorFlow Lite inference APIs are provided for most common mobile/embedded
platforms such as Android, iOS and Linux.

#### Android

On Android, TensorFlow Lite inference can be performed using either Java or C++
APIs. The Java APIs provide convenience and can be used directly within your
Android Activity classes. The C++ APIs, on the other hand, offer more
flexibility and speed, but may require writing JNI wrappers to move data
between Java and C++ layers. You can find an example [here](android.md).

#### iOS

TensorFlow Lite provides Swift and Objective-C APIs for inference on iOS. An
example can be found [here](ios.md).

#### Linux

On Linux platforms such as [Raspberry Pi](build_rpi.md), the TensorFlow Lite
C++ and Python APIs can be used to run inference.

## API Guides

TensorFlow Lite provides programming APIs in C++, Java and Python, with
experimental bindings for several other languages (C, Swift, Objective-C). In
most cases, the API design reflects a preference for performance over ease of
use. TensorFlow Lite is designed for fast inference on small devices, so it
should be no surprise that the APIs try to avoid unnecessary copies at the
expense of convenience. Similarly, consistency with TensorFlow APIs was not an
explicit goal and some variance is to be expected.

There is also a [Python API for TensorFlow Lite](../convert/python_api.md).

### Loading a Model

#### C++

The `FlatBufferModel` class encapsulates a model and can be built in a couple
of slightly different ways depending on where the model is stored:

```c++
class FlatBufferModel {
  // Build a model based on a file. Return a nullptr in case of failure.
  static std::unique_ptr<FlatBufferModel> BuildFromFile(
      const char* filename,
      ErrorReporter* error_reporter);

  // Build a model based on a pre-loaded flatbuffer. The caller retains
  // ownership of the buffer and should keep it alive until the returned object
  // is destroyed. Return a nullptr in case of failure.
  static std::unique_ptr<FlatBufferModel> BuildFromBuffer(
      const char* buffer,
      size_t buffer_size,
      ErrorReporter* error_reporter);
};
```

For example, to build a model from a file:

```c++
std::unique_ptr<tflite::FlatBufferModel> model =
    tflite::FlatBufferModel::BuildFromFile(path_to_model);
```

Note that if TensorFlow Lite detects the presence of Android's NNAPI, it will
automatically try to use shared memory to store the `FlatBufferModel`.
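When the model is already in memory (for example, read from storage or app
assets by the caller), `BuildFromBuffer` can be used instead. The sketch below
is a minimal illustration under a few assumptions: the helper name, the
`std::vector<char>` used as the caller-owned buffer, the header path, and the
use of the library's default error reporter are illustrative choices rather
than part of the API shown above. As documented, the buffer must outlive the
returned `FlatBufferModel`.

```c++
// Sketch only; adjust the include path to match your TensorFlow Lite checkout.
#include "tensorflow/lite/model.h"

#include <fstream>
#include <memory>
#include <vector>

std::unique_ptr<tflite::FlatBufferModel> LoadModelFromMemory(
    const char* path_to_model, std::vector<char>* buffer) {
  // Read the whole .tflite file into a caller-owned buffer.
  std::ifstream file(path_to_model, std::ios::binary | std::ios::ate);
  buffer->resize(file.tellg());
  file.seekg(0);
  file.read(buffer->data(), buffer->size());

  // The buffer must stay alive until the returned FlatBufferModel (and any
  // Interpreter built from it) has been destroyed.
  return tflite::FlatBufferModel::BuildFromBuffer(
      buffer->data(), buffer->size(), tflite::DefaultErrorReporter());
}
```

The returned model can then be handed to an `InterpreterBuilder` exactly as in
the file-based case.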
#### Java

TensorFlow Lite's Java API supports on-device inference and is provided as an
Android Studio library that allows loading models, feeding inputs, and
retrieving inference outputs.

The `Interpreter` class drives model inference with TensorFlow Lite. In most
cases, this is the only class an app developer will need.

The `Interpreter` can be initialized with a model file using the constructor:

```java
public Interpreter(@NotNull File modelFile);
```

or with a `MappedByteBuffer`:

```java
public Interpreter(@NotNull MappedByteBuffer mappedByteBuffer);
```

In both cases a valid TensorFlow Lite model must be provided or an
`IllegalArgumentException` will be thrown. If a `MappedByteBuffer` is used to
initialize an `Interpreter`, it should remain unchanged for the whole lifetime
of the `Interpreter`.

### Running a Model {#running_a_model}

#### C++

Running a model involves a few simple steps:

* Build an `Interpreter` based on an existing `FlatBufferModel`.
* Optionally resize input tensors if the predefined sizes are not desired.
* Set input tensor values.
* Invoke inference.
* Read output tensor values.

The important parts of the `Interpreter`'s public interface are illustrated
below. It should be noted that:

* Tensors are represented by integers, in order to avoid string comparisons
  (and any fixed dependency on string libraries).
* An interpreter must not be accessed from concurrent threads.
* Memory allocation for input and output tensors must be triggered by calling
  `AllocateTensors()` right after resizing tensors (see the resizing sketch
  after the example below).

To run inference with TensorFlow Lite, the model must be loaded into a
`FlatBufferModel` object, which can then be executed by an `Interpreter`. The
`FlatBufferModel` needs to remain valid for the whole lifetime of the
`Interpreter`, and a single `FlatBufferModel` can be simultaneously used by
more than one `Interpreter`. In concrete terms, the `FlatBufferModel` object
must be created before any `Interpreter` objects that use it, and must be kept
around until they have all been destroyed.

The simplest usage of TensorFlow Lite will look like this:

```c++
std::unique_ptr<tflite::FlatBufferModel> model =
    tflite::FlatBufferModel::BuildFromFile(path_to_model);

tflite::ops::builtin::BuiltinOpResolver resolver;
std::unique_ptr<tflite::Interpreter> interpreter;
tflite::InterpreterBuilder(*model, resolver)(&interpreter);

// Resize input tensors, if desired.
interpreter->AllocateTensors();

float* input = interpreter->typed_input_tensor<float>(0);
// Fill `input`.

interpreter->Invoke();

float* output = interpreter->typed_output_tensor<float>(0);
```
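Continuing the example above, the following sketch shows the optional resizing
step in more detail. The input index, the `{1, 224, 224, 3}` shape, and the use
of the first input and output are assumptions for illustration; the key point
is that `AllocateTensors()` must be called again after any resize, before the
input buffer is filled.

```c++
// Optionally give the first input a new (hypothetical) shape, then allocate.
int input_index = interpreter->inputs()[0];
interpreter->ResizeInputTensor(input_index, {1, 224, 224, 3});
if (interpreter->AllocateTensors() != kTfLiteOk) {
  // Handle the error; tensor buffers could not be allocated.
}

// After allocation, tensor metadata can be inspected, e.g. the output shape.
TfLiteTensor* output_tensor = interpreter->tensor(interpreter->outputs()[0]);
for (int i = 0; i < output_tensor->dims->size; ++i) {
  printf("output dim %d = %d\n", i, output_tensor->dims->data[i]);
}
```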
#### Java

The simplest usage of the TensorFlow Lite Java API looks like this:

```java
try (Interpreter interpreter = new Interpreter(file_of_a_tensorflowlite_model)) {
  interpreter.run(input, output);
}
```

If a model takes only one input and returns only one output, the following will
trigger an inference run:

```java
interpreter.run(input, output);
```

For models with multiple inputs, or multiple outputs, use:

```java
interpreter.runForMultipleInputsOutputs(inputs, map_of_indices_to_outputs);
```

where each entry in `inputs` corresponds to an input tensor and
`map_of_indices_to_outputs` maps indices of output tensors to the corresponding
output data. In both cases the tensor indices should correspond to the values
given to the
[TensorFlow Lite Optimized Converter](../convert/cmdline_examples.md) when the
model was created. Be aware that the order of tensors in `inputs` must match
the order given to the TensorFlow Lite Optimized Converter.

The Java API also provides convenient functions for app developers to get the
index of any model input or output using a tensor name:

```java
public int getInputIndex(String tensorName);
public int getOutputIndex(String tensorName);
```

If `tensorName` is not a valid name in the model, an `IllegalArgumentException`
will be thrown.

##### Releasing Resources After Use

An `Interpreter` owns resources. To avoid memory leaks, the resources must be
released after use by calling:

```java
interpreter.close();
```

##### Supported Data Types

To use TensorFlow Lite, the data types of the input and output tensors must be
one of the following primitive types:

* `float`
* `int`
* `long`
* `byte`

`String` types are also supported, but they are encoded differently than the
primitive types. In particular, the shape of a string Tensor dictates the
number and arrangement of strings in the Tensor, with each element itself being
a variable-length string. In this sense, the (byte) size of the Tensor cannot
be computed from the shape and type alone, and consequently strings cannot be
provided as a single, flat `ByteBuffer` argument.

If other data types, including boxed types like `Integer` and `Float`, are
used, an `IllegalArgumentException` will be thrown.

##### Inputs

Each input should be an array or multi-dimensional array of the supported
primitive types, or a raw `ByteBuffer` of the appropriate size. If the input is
an array or multi-dimensional array, the associated input tensor will be
implicitly resized to the array's dimensions at inference time. If the input is
a `ByteBuffer`, the caller should first manually resize the associated input
tensor (via `Interpreter.resizeInput()`) before running inference.

When using a `ByteBuffer`, prefer using direct byte buffers, as this allows the
`Interpreter` to avoid unnecessary copies. If the `ByteBuffer` is a direct byte
buffer, its order must be `ByteOrder.nativeOrder()`. After it has been used for
a model inference, it must remain unchanged until the inference is finished.
##### Outputs

Each output should be an array or multi-dimensional array of the supported
primitive types, or a `ByteBuffer` of the appropriate size. Note that some
models have dynamic outputs, where the shape of the output tensors can vary
depending on the input. There's no straightforward way of handling this with
the existing Java inference API, but planned extensions will make this
possible.

## Writing Custom Operators

All TensorFlow Lite operators (both custom and builtin) are defined using a
simple pure-C interface that consists of four functions:

```c++
typedef struct {
  void* (*init)(TfLiteContext* context, const char* buffer, size_t length);
  void (*free)(TfLiteContext* context, void* buffer);
  TfLiteStatus (*prepare)(TfLiteContext* context, TfLiteNode* node);
  TfLiteStatus (*invoke)(TfLiteContext* context, TfLiteNode* node);
} TfLiteRegistration;
```

Refer to `context.h` for details on `TfLiteContext` and `TfLiteNode`. The
former provides error reporting facilities and access to global objects,
including all the tensors. The latter allows implementations to access their
inputs and outputs.

When the interpreter loads a model, it calls `init()` once for each node in the
graph. A given `init()` will be called more than once if the op is used
multiple times in the graph. For custom ops a configuration buffer will be
provided, containing a flexbuffer that maps parameter names to their values.
The buffer is empty for builtin ops because the interpreter has already parsed
the op parameters. Kernel implementations that require state should initialize
it here and transfer ownership to the caller. For each `init()` call, there
will be a corresponding call to `free()`, allowing implementations to dispose
of the buffer they might have allocated in `init()`.

Whenever the input tensors are resized, the interpreter will go through the
graph notifying implementations of the change. This gives them the chance to
resize their internal buffers, check the validity of input shapes and types,
and recalculate output shapes. This is all done through `prepare()`, and
implementations can access their state using `node->user_data`.

Finally, each time inference runs, the interpreter traverses the graph calling
`invoke()`, and here too the state is available as `node->user_data`.

Custom ops can be implemented in exactly the same way as builtin ops, by
defining those four functions and a global registration function that usually
looks like this:

```c++
namespace tflite {
namespace ops {
namespace custom {
TfLiteRegistration* Register_MY_CUSTOM_OP() {
  static TfLiteRegistration r = {my_custom_op::Init,
                                 my_custom_op::Free,
                                 my_custom_op::Prepare,
                                 my_custom_op::Eval};
  return &r;
}
}  // namespace custom
}  // namespace ops
}  // namespace tflite
```

Note that registration is not automatic and an explicit call to
`Register_MY_CUSTOM_OP` should be made somewhere. While the standard
`BuiltinOpResolver` (available from the `:builtin_ops` target) takes care of
the registration of builtins, custom ops will have to be collected in separate
custom libraries.
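For concreteness, here is a minimal sketch of what the four functions
referenced by `Register_MY_CUSTOM_OP` might look like, for a hypothetical
element-wise op that simply copies its single float input to its output. It is
illustrative only: it keeps no per-node state, does minimal error checking, and
accesses tensors directly through the `TfLiteContext` and `TfLiteNode`
structures described above; a real kernel would typically also validate tensor
types and counts.

```c++
namespace my_custom_op {

// This sketch needs no per-node state, so Init/Free are trivial.
void* Init(TfLiteContext* context, const char* buffer, size_t length) {
  return nullptr;
}

void Free(TfLiteContext* context, void* buffer) {}

TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
  // Look up the node's first input and first output tensor.
  const TfLiteTensor* input = &context->tensors[node->inputs->data[0]];
  TfLiteTensor* output = &context->tensors[node->outputs->data[0]];
  // Give the output the same shape as the input. ResizeTensor takes ownership
  // of the dims array passed to it.
  TfLiteIntArray* output_dims = TfLiteIntArrayCopy(input->dims);
  return context->ResizeTensor(context, output, output_dims);
}

TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
  const TfLiteTensor* input = &context->tensors[node->inputs->data[0]];
  TfLiteTensor* output = &context->tensors[node->outputs->data[0]];
  // Number of elements is the product of all dimensions.
  int num_elements = 1;
  for (int i = 0; i < input->dims->size; ++i) {
    num_elements *= input->dims->data[i];
  }
  for (int i = 0; i < num_elements; ++i) {
    output->data.f[i] = input->data.f[i];
  }
  return kTfLiteOk;
}

}  // namespace my_custom_op
```

With these in place, the registration function above can be linked into the
application and registered with a resolver, as described in the next section.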
### Customizing the kernel library

Behind the scenes, the interpreter will load a library of kernels which will be
assigned to execute each of the operators in the model. While the default
library only contains builtin kernels, it is possible to replace it with a
custom library.

The interpreter uses an `OpResolver` to translate operator codes and names into
actual code:

```c++
class OpResolver {
  virtual TfLiteRegistration* FindOp(tflite::BuiltinOperator op) const = 0;
  virtual TfLiteRegistration* FindOp(const char* op) const = 0;
  virtual void AddOp(tflite::BuiltinOperator op, TfLiteRegistration* registration) = 0;
  virtual void AddOp(const char* op, TfLiteRegistration* registration) = 0;
};
```

Regular usage requires the developer to use the `BuiltinOpResolver` and write:

```c++
tflite::ops::builtin::BuiltinOpResolver resolver;
```

They can then optionally register custom ops:

```c++
resolver.AddOp("MY_CUSTOM_OP", Register_MY_CUSTOM_OP());
```

before the resolver is passed to the `InterpreterBuilder`.

If the set of builtin ops is deemed too large, a new `OpResolver` could be
code-generated based on a given subset of ops, possibly only the ones contained
in a given model. This is the equivalent of TensorFlow's selective registration
(and a simple version of it is available in the `tools` directory).
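Putting the pieces of this section together, a minimal sketch of wiring a
custom op into an interpreter might look like the following. It assumes the
`Register_MY_CUSTOM_OP` function from the previous section and a `model` built
as shown in [Running a Model](#running_a_model); it illustrates the order of
operations rather than a complete program.

```c++
// Start from the builtin kernels and add the custom op by name.
tflite::ops::builtin::BuiltinOpResolver resolver;
resolver.AddOp("MY_CUSTOM_OP", tflite::ops::custom::Register_MY_CUSTOM_OP());

// The resolver is consulted while the interpreter is being built, so it must
// be fully populated before this point.
std::unique_ptr<tflite::Interpreter> interpreter;
tflite::InterpreterBuilder(*model, resolver)(&interpreter);
interpreter->AllocateTensors();
// The interpreter can now execute graphs that contain MY_CUSTOM_OP nodes.
```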