# Using AOT compilation

## What is tfcompile?

`tfcompile` is a standalone tool that ahead-of-time (AOT) compiles TensorFlow
graphs into executable code. It can reduce total binary size, and also avoid
some runtime overheads. A typical use case of `tfcompile` is to compile an
inference graph into executable code for mobile devices.

The TensorFlow graph is normally executed by the TensorFlow runtime. This incurs
some runtime overhead for execution of each node in the graph. This also leads
to a larger total binary size, since the code for the TensorFlow runtime needs
to be available, in addition to the graph itself. The executable code produced
by `tfcompile` does not use the TensorFlow runtime, and only has dependencies on
kernels that are actually used in the computation.

The compiler is built on top of the XLA framework. The code bridging TensorFlow
to the XLA framework resides under
[tensorflow/compiler](https://www.tensorflow.org/code/tensorflow/compiler/),
which also includes support for @{$jit$just-in-time (JIT) compilation} of
TensorFlow graphs.

## What does tfcompile do?

`tfcompile` takes a subgraph, identified by the TensorFlow concepts of
feeds and fetches, and generates a function that implements that subgraph.
The `feeds` are the input arguments for the function, and the `fetches` are the
output arguments for the function. All inputs must be fully specified by the
feeds; the resulting pruned subgraph cannot contain Placeholder or Variable
nodes. It is common to specify all Placeholders and Variables as feeds, which
ensures the resulting subgraph no longer contains these nodes. The generated
function is packaged as a `cc_library`, with a header file exporting the
function signature, and an object file containing the implementation. The user
writes code to invoke the generated function as appropriate.

## Using tfcompile

This section details the high-level steps for generating an executable binary
with `tfcompile` from a TensorFlow subgraph. The steps are:

* Step 1: Configure the subgraph to compile
* Step 2: Use the `tf_library` build macro to compile the subgraph
* Step 3: Write code to invoke the subgraph
* Step 4: Create the final binary

### Step 1: Configure the subgraph to compile

Identify the feeds and fetches that correspond to the input and output
arguments for the generated function. Then configure the `feeds` and `fetches`
in a
[`tensorflow.tf2xla.Config`](https://www.tensorflow.org/code/tensorflow/compiler/tf2xla/tf2xla.proto)
proto.

```textproto
# Each feed is a positional input argument for the generated function. The
# order of each entry matches the order of each input argument. Here x_hold
# and y_hold refer to the names of placeholder nodes defined in the graph.
feed {
  id { node_name: "x_hold" }
  shape {
    dim { size: 2 }
    dim { size: 3 }
  }
}
feed {
  id { node_name: "y_hold" }
  shape {
    dim { size: 3 }
    dim { size: 2 }
  }
}

# Each fetch is a positional output argument for the generated function. The
# order of each entry matches the order of each output argument. Here x_y_prod
# refers to the name of a matmul node defined in the graph.
fetch {
  id { node_name: "x_y_prod" }
}
```
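
The config may also be assembled programmatically with the generated protobuf
API rather than written by hand. Below is a minimal sketch, assuming the
generated proto header `tensorflow/compiler/tf2xla/tf2xla.pb.h` and protobuf's
`TextFormat` are available to your build; redirect its output to a `.pbtxt`
file to use it in the next step.

```c++
#include <iostream>
#include <string>

#include "google/protobuf/text_format.h"
#include "tensorflow/compiler/tf2xla/tf2xla.pb.h"  // generated from tf2xla.proto

int main() {
  tensorflow::tf2xla::Config config;

  // Feed for x_hold: a 2x3 placeholder.
  auto* x = config.add_feed();
  x->mutable_id()->set_node_name("x_hold");
  x->mutable_shape()->add_dim()->set_size(2);
  x->mutable_shape()->add_dim()->set_size(3);

  // Feed for y_hold: a 3x2 placeholder.
  auto* y = config.add_feed();
  y->mutable_id()->set_node_name("y_hold");
  y->mutable_shape()->add_dim()->set_size(3);
  y->mutable_shape()->add_dim()->set_size(2);

  // Fetch for the matmul output node.
  config.add_fetch()->mutable_id()->set_node_name("x_y_prod");

  // Emit the config in text format.
  std::string text;
  google::protobuf::TextFormat::PrintToString(config, &text);
  std::cout << text;
  return 0;
}
```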

### Step 2: Use tf_library build macro to compile the subgraph

This step converts the graph into a `cc_library` using the `tf_library` build
macro. The `cc_library` consists of an object file containing the code generated
from the graph, along with a header file that gives access to the generated
code. `tf_library` utilizes `tfcompile` to compile the TensorFlow graph into
executable code.

```build
load("//third_party/tensorflow/compiler/aot:tfcompile.bzl", "tf_library")

# Use the tf_library macro to compile your graph into executable code.
tf_library(
    # name is used to generate the following underlying build rules:
    # <name>           : cc_library packaging the generated header and object files
    # <name>_test      : cc_test containing a simple test and benchmark
    # <name>_benchmark : cc_binary containing a stand-alone benchmark with
    #                    minimal deps; can be run on a mobile device
    name = "test_graph_tfmatmul",
    # cpp_class specifies the name of the generated C++ class, with namespaces
    # allowed. The class will be generated in the given namespace(s), or if no
    # namespaces are given, within the global namespace.
    cpp_class = "foo::bar::MatMulComp",
    # graph is the input GraphDef proto, by default expected in binary format.
    # To use the text format instead, just use the .pbtxt suffix. A subgraph
    # will be created from this input graph, with feeds as inputs and fetches
    # as outputs. No Placeholder or Variable ops may exist in this subgraph.
    graph = "test_graph_tfmatmul.pb",
    # config is the input Config proto, by default expected in binary format.
    # To use the text format instead, use the .pbtxt suffix. This is where the
    # feeds and fetches were specified above, in the previous step.
    config = "test_graph_tfmatmul.config.pbtxt",
)
```

> To generate the GraphDef proto (test_graph_tfmatmul.pb) for this example, run
> [make_test_graphs.py](https://www.tensorflow.org/code/tensorflow/compiler/aot/tests/make_test_graphs.py)
> and specify the output location with the `--out_dir` flag.

Typical graphs contain @{$python/state_ops$`Variables`}
representing the weights that are learned via training, but `tfcompile` cannot
compile a subgraph that contains `Variables`. The
[freeze_graph.py](https://www.tensorflow.org/code/tensorflow/python/tools/freeze_graph.py)
tool converts variables into constants, using values stored in a checkpoint
file. As a convenience, the `tf_library` macro supports the `freeze_checkpoint`
argument, which runs the tool. For more examples see
[tensorflow/compiler/aot/tests/BUILD](https://www.tensorflow.org/code/tensorflow/compiler/aot/tests/BUILD).

> Constants that show up in the compiled subgraph are compiled directly into the
> generated code. To pass the constants into the generated function, rather than
> having them compiled-in, simply pass them in as feeds.

For details on the `tf_library` build macro, see
[tfcompile.bzl](https://www.tensorflow.org/code/tensorflow/compiler/aot/tfcompile.bzl).

For details on the underlying `tfcompile` tool, see
[tfcompile_main.cc](https://www.tensorflow.org/code/tensorflow/compiler/aot/tfcompile_main.cc).

### Step 3: Write code to invoke the subgraph

This step uses the header file (`test_graph_tfmatmul.h`) generated by the
`tf_library` build macro in the previous step to invoke the generated code. The
header file is located in the `bazel-genfiles` directory corresponding to the
build package, and is named based on the `name` attribute set on the
`tf_library` build macro. For example, the header generated for
`test_graph_tfmatmul` would be `test_graph_tfmatmul.h`. Below is an abbreviated
version of what is generated. The generated file, in `bazel-genfiles`, contains
additional useful comments.

```c++
namespace foo {
namespace bar {

// MatMulComp represents a computation previously specified in a
// TensorFlow graph, now compiled into executable code.
class MatMulComp {
 public:
  // AllocMode controls the buffer allocation mode.
  enum class AllocMode {
    ARGS_RESULTS_AND_TEMPS,  // Allocate arg, result and temp buffers
    RESULTS_AND_TEMPS_ONLY,  // Only allocate result and temp buffers
  };

  MatMulComp(AllocMode mode = AllocMode::ARGS_RESULTS_AND_TEMPS);
  ~MatMulComp();

  // Runs the computation, with inputs read from arg buffers, and outputs
  // written to result buffers. Returns true on success and false on failure.
  bool Run();

  // Arg methods for managing input buffers. Buffers are in row-major order.
  // There is a set of methods for each positional argument.
  void** args();

  void set_arg0_data(float* data);
  float* arg0_data();
  float& arg0(size_t dim0, size_t dim1);

  void set_arg1_data(float* data);
  float* arg1_data();
  float& arg1(size_t dim0, size_t dim1);

  // Result methods for managing output buffers. Buffers are in row-major order.
  // Must only be called after a successful Run call. There is a set of methods
  // for each positional result.
  void** results();

  float* result0_data();
  float& result0(size_t dim0, size_t dim1);
};

}  // end namespace bar
}  // end namespace foo
```

The generated C++ class is called `MatMulComp` in the `foo::bar` namespace,
because that was the `cpp_class` specified in the `tf_library` macro. All
generated classes have a similar API, with the only difference being the methods
to handle arg and result buffers. Those methods differ based on the number and
types of the buffers, which were specified by the `feed` and `fetch` arguments
to the `tf_library` macro.

There are three types of buffers managed within the generated class: `args`
representing the inputs, `results` representing the outputs, and `temps`
representing temporary buffers used internally to perform the computation. By
default, each instance of the generated class allocates and manages all of these
buffers for you. The `AllocMode` constructor argument may be used to change this
behavior. A convenience library is provided in
[`tensorflow/compiler/aot/runtime.h`](https://www.tensorflow.org/code/tensorflow/compiler/aot/runtime.h)
to help with manual buffer allocation; usage of this library is optional. All
buffers should be aligned to 32-byte boundaries.
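
As an illustration of `RESULTS_AND_TEMPS_ONLY`, here is a minimal sketch in
which the caller owns the argument buffers. The buffer sizes follow the 2x3 and
3x2 feed shapes configured in step 1, and the helper function name is
illustrative:

```c++
#include "tensorflow/compiler/aot/tests/test_graph_tfmatmul.h"  // generated

// Caller-owned argument buffers, aligned to 32-byte boundaries as required.
alignas(32) float arg0[2 * 3] = {1, 2, 3, 4, 5, 6};
alignas(32) float arg1[3 * 2] = {7, 8, 9, 10, 11, 12};

void InitWithManualArgs() {
  // The instance allocates only result and temp buffers; the argument
  // buffers are owned by the caller and registered explicitly.
  foo::bar::MatMulComp matmul(
      foo::bar::MatMulComp::AllocMode::RESULTS_AND_TEMPS_ONLY);
  matmul.set_arg0_data(arg0);
  matmul.set_arg1_data(arg1);
  // ... invoke Run() as shown in the example below.
}
```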

The generated C++ class is just a wrapper around the low-level code generated by
XLA.

Example of invoking the generated function based on
[`tfcompile_test.cc`](https://www.tensorflow.org/code/tensorflow/compiler/aot/tests/tfcompile_test.cc):

```c++
#define EIGEN_USE_THREADS
#define EIGEN_USE_CUSTOM_THREAD_POOL

#include <algorithm>
#include <iostream>
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
#include "tensorflow/compiler/aot/tests/test_graph_tfmatmul.h"  // generated

int main(int argc, char** argv) {
  Eigen::ThreadPool tp(2);  // Size the thread pool as appropriate.
  Eigen::ThreadPoolDevice device(&tp, tp.NumThreads());

  foo::bar::MatMulComp matmul;
  matmul.set_thread_pool(&device);

  // Set up args and run the computation.
  const float args[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
  std::copy(args + 0, args + 6, matmul.arg0_data());
  std::copy(args + 6, args + 12, matmul.arg1_data());
  matmul.Run();

  // Check result.
  if (matmul.result0(0, 0) == 58) {
    std::cout << "Success" << std::endl;
  } else {
    std::cout << "Failed. Expected value 58 at 0,0. Got: "
              << matmul.result0(0, 0) << std::endl;
  }

  return 0;
}
```
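
An instance can generally be reused for repeated computations by overwriting
the argument buffers and calling `Run` again; the generated `<name>_benchmark`
rule presumably exercises the function the same way. A rough sketch under that
assumption, with the thread-pool setup copied from the example above:

```c++
#define EIGEN_USE_THREADS
#define EIGEN_USE_CUSTOM_THREAD_POOL

#include <algorithm>
#include <iostream>
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
#include "tensorflow/compiler/aot/tests/test_graph_tfmatmul.h"  // generated

int main() {
  Eigen::ThreadPool tp(2);
  Eigen::ThreadPoolDevice device(&tp, tp.NumThreads());
  foo::bar::MatMulComp matmul;
  matmul.set_thread_pool(&device);

  // Reuse the same instance: refill the arg buffers and run again.
  for (int step = 1; step <= 3; ++step) {
    std::fill(matmul.arg0_data(), matmul.arg0_data() + 6,
              static_cast<float>(step));
    std::fill(matmul.arg1_data(), matmul.arg1_data() + 6,
              static_cast<float>(step));
    if (!matmul.Run()) return 1;  // Run reports failure by returning false.
    std::cout << "result0(0, 0) = " << matmul.result0(0, 0) << std::endl;
  }
  return 0;
}
```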

### Step 4: Create the final binary

This step combines the library generated by `tf_library` in step 2 and the code
written in step 3 to create a final binary. Below is an example `bazel` BUILD
file.

```build
# Example of linking your binary
# Also see //third_party/tensorflow/compiler/aot/tests/BUILD
load("//third_party/tensorflow/compiler/aot:tfcompile.bzl", "tf_library")

# The same tf_library call from step 2 above.
tf_library(
    name = "test_graph_tfmatmul",
    ...
)

# The executable code generated by tf_library can then be linked into your code.
cc_binary(
    name = "my_binary",
    srcs = [
        "my_code.cc",  # include test_graph_tfmatmul.h to access the generated header
    ],
    deps = [
        ":test_graph_tfmatmul",  # link in the generated object file
        "//third_party/eigen3",
    ],
    linkopts = [
        "-lpthread",
    ],
)
```
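
Building the `my_binary` target with `bazel` then produces a self-contained
executable that embeds the compiled graph, with no dependency on the TensorFlow
runtime.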