# Using AOT compilation

## What is tfcompile?

`tfcompile` is a standalone tool that ahead-of-time (AOT) compiles TensorFlow
graphs into executable code. It can reduce total binary size, and also avoid
some runtime overheads. A typical use case of `tfcompile` is to compile an
inference graph into executable code for mobile devices.

The TensorFlow graph is normally executed by the TensorFlow runtime. This incurs
some runtime overhead for execution of each node in the graph. This also leads
to a larger total binary size, since the code for the TensorFlow runtime needs
to be available, in addition to the graph itself. The executable code produced
by `tfcompile` does not use the TensorFlow runtime, and only has dependencies on
kernels that are actually used in the computation.

The compiler is built on top of the XLA framework. The code bridging TensorFlow
to the XLA framework resides under
[tensorflow/compiler](https://www.tensorflow.org/code/tensorflow/compiler/),
which also includes support for @{$jit$just-in-time (JIT) compilation} of
TensorFlow graphs.

## What does tfcompile do?

`tfcompile` takes a subgraph, identified by the TensorFlow concepts of
feeds and fetches, and generates a function that implements that subgraph.
The `feeds` are the input arguments for the function, and the `fetches` are the
output arguments for the function. All inputs must be fully specified by the
feeds; the resulting pruned subgraph cannot contain Placeholder or Variable
nodes. It is common to specify all Placeholders and Variables as feeds, which
ensures the resulting subgraph no longer contains these nodes. The generated
function is packaged as a `cc_library`, with a header file exporting the
function signature, and an object file containing the implementation. The user
writes code to invoke the generated function as appropriate.

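As a rough mental model only (the function name and types below are
illustrative, not part of any generated API; the real interface is the C++
class shown in Step 3), a subgraph with two feeds and one fetch compiles into
something shaped like:

```c++
// Illustrative only: each feed becomes a positional input buffer and each
// fetch becomes a positional output buffer of the generated function.
void compiled_subgraph(const float* feed0, const float* feed1, float* fetch0);
```
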
## Using tfcompile

This section details high-level steps for generating an executable binary with
`tfcompile` from a TensorFlow subgraph. The steps are:

*   Step 1: Configure the subgraph to compile
*   Step 2: Use the `tf_library` build macro to compile the subgraph
*   Step 3: Write code to invoke the subgraph
*   Step 4: Create the final binary

### Step 1: Configure the subgraph to compile

Identify the feeds and fetches that correspond to the input and output
arguments for the generated function. Then configure the `feeds` and `fetches`
in a [`tensorflow.tf2xla.Config`](https://www.tensorflow.org/code/tensorflow/compiler/tf2xla/tf2xla.proto)
proto.

```textproto
# Each feed is a positional input argument for the generated function.  The order
# of each entry matches the order of each input argument.  Here x_hold and y_hold
# refer to the names of placeholder nodes defined in the graph.
feed {
  id { node_name: "x_hold" }
  shape {
    dim { size: 2 }
    dim { size: 3 }
  }
}
feed {
  id { node_name: "y_hold" }
  shape {
    dim { size: 3 }
    dim { size: 2 }
  }
}

# Each fetch is a positional output argument for the generated function.  The order
# of each entry matches the order of each output argument.  Here x_y_prod
# refers to the name of a matmul node defined in the graph.
fetch {
  id { node_name: "x_y_prod" }
}
```

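The node names in the config must match nodes in the input graph. The test
graph for this example is produced by a Python script (see the note in Step 2
below), but as a sketch, a graph with matching node names could also be built
and serialized with the TensorFlow C++ graph-construction API, assuming a
binary that links in `//tensorflow/cc`:

```c++
#include "tensorflow/cc/framework/scope.h"
#include "tensorflow/cc/ops/standard_ops.h"
#include "tensorflow/core/framework/graph.pb.h"
#include "tensorflow/core/platform/env.h"

int main() {
  // Build x_y_prod = x_hold * y_hold with the node names used in the config.
  tensorflow::Scope root = tensorflow::Scope::NewRootScope();
  auto x = tensorflow::ops::Placeholder(root.WithOpName("x_hold"),
                                        tensorflow::DT_FLOAT);
  auto y = tensorflow::ops::Placeholder(root.WithOpName("y_hold"),
                                        tensorflow::DT_FLOAT);
  auto prod = tensorflow::ops::MatMul(root.WithOpName("x_y_prod"), x, y);

  // Serialize the graph to the binary GraphDef consumed in Step 2.
  tensorflow::GraphDef graph_def;
  TF_CHECK_OK(root.ToGraphDef(&graph_def));
  TF_CHECK_OK(tensorflow::WriteBinaryProto(tensorflow::Env::Default(),
                                           "test_graph_tfmatmul.pb",
                                           graph_def));
  return 0;
}
```
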
### Step 2: Use the `tf_library` build macro to compile the subgraph

This step converts the graph into a `cc_library` using the `tf_library` build
macro. The `cc_library` consists of an object file containing the code generated
from the graph, along with a header file that gives access to the generated
code. `tf_library` utilizes `tfcompile` to compile the TensorFlow graph into
executable code.

```build
load("//third_party/tensorflow/compiler/aot:tfcompile.bzl", "tf_library")

# Use the tf_library macro to compile your graph into executable code.
tf_library(
    # name is used to generate the following underlying build rules:
    # <name>           : cc_library packaging the generated header and object files
    # <name>_test      : cc_test containing a simple test and benchmark
    # <name>_benchmark : cc_binary containing a stand-alone benchmark with minimal deps;
    #                    can be run on a mobile device
    name = "test_graph_tfmatmul",
    # cpp_class specifies the name of the generated C++ class, with namespaces allowed.
    # The class will be generated in the given namespace(s), or if no namespaces are
    # given, within the global namespace.
    cpp_class = "foo::bar::MatMulComp",
    # graph is the input GraphDef proto, by default expected in binary format.  To
    # use the text format instead, just use the .pbtxt suffix.  A subgraph will be
    # created from this input graph, with feeds as inputs and fetches as outputs.
    # No Placeholder or Variable ops may exist in this subgraph.
    graph = "test_graph_tfmatmul.pb",
    # config is the input Config proto, by default expected in binary format.  To
    # use the text format instead, use the .pbtxt suffix.  This is where the
    # feeds and fetches were specified in the previous step.
    config = "test_graph_tfmatmul.config.pbtxt",
)
```

> To generate the GraphDef proto (test_graph_tfmatmul.pb) for this example, run
> [make_test_graphs.py](https://www.tensorflow.org/code/tensorflow/compiler/aot/tests/make_test_graphs.py)
> and specify the output location with the `--out_dir` flag.

Typical graphs contain @{$python/state_ops$`Variables`}
representing the weights that are learned via training, but `tfcompile` cannot
compile a subgraph that contains `Variables`. The
[freeze_graph.py](https://www.tensorflow.org/code/tensorflow/python/tools/freeze_graph.py)
tool converts variables into constants, using values stored in a checkpoint
file. As a convenience, the `tf_library` macro supports the `freeze_checkpoint`
argument, which runs the tool. For more examples see
[tensorflow/compiler/aot/tests/BUILD](https://www.tensorflow.org/code/tensorflow/compiler/aot/tests/BUILD).

> Constants that show up in the compiled subgraph are compiled directly into the
> generated code. To pass the constants into the generated function, rather than
> having them compiled in, simply pass them in as feeds.

For details on the `tf_library` build macro, see
[tfcompile.bzl](https://www.tensorflow.org/code/tensorflow/compiler/aot/tfcompile.bzl).

For details on the underlying `tfcompile` tool, see
[tfcompile_main.cc](https://www.tensorflow.org/code/tensorflow/compiler/aot/tfcompile_main.cc).

### Step 3: Write code to invoke the subgraph

This step uses the header file (`test_graph_tfmatmul.h`) generated by the
`tf_library` build macro in the previous step to invoke the generated code. The
header file is located in the `bazel-genfiles` directory corresponding to the
build package, and is named based on the `name` attribute set on the `tf_library`
build macro. For example, the header generated for `test_graph_tfmatmul` would
be `test_graph_tfmatmul.h`. Below is an abbreviated version of what is
generated. The generated file, in `bazel-genfiles`, contains additional useful
comments.

```c++
namespace foo {
namespace bar {

// MatMulComp represents a computation previously specified in a
// TensorFlow graph, now compiled into executable code.
class MatMulComp {
 public:
  // AllocMode controls the buffer allocation mode.
  enum class AllocMode {
    ARGS_RESULTS_AND_TEMPS,  // Allocate arg, result and temp buffers
    RESULTS_AND_TEMPS_ONLY,  // Only allocate result and temp buffers
  };

  MatMulComp(AllocMode mode = AllocMode::ARGS_RESULTS_AND_TEMPS);
  ~MatMulComp();

  // Runs the computation, with inputs read from arg buffers, and outputs
  // written to result buffers. Returns true on success and false on failure.
  bool Run();

  // Arg methods for managing input buffers. Buffers are in row-major order.
  // There is a set of methods for each positional argument.
  void** args();

  void set_arg0_data(float* data);
  float* arg0_data();
  float& arg0(size_t dim0, size_t dim1);

  void set_arg1_data(float* data);
  float* arg1_data();
  float& arg1(size_t dim0, size_t dim1);

  // Result methods for managing output buffers. Buffers are in row-major order.
  // Must only be called after a successful Run call. There is a set of methods
  // for each positional result.
  void** results();

  float* result0_data();
  float& result0(size_t dim0, size_t dim1);
};

}  // end namespace bar
}  // end namespace foo
```

The generated C++ class is called `MatMulComp` in the `foo::bar` namespace,
because that was the `cpp_class` specified in the `tf_library` macro. All
generated classes have a similar API, with the only difference being the methods
to handle arg and result buffers. Those methods differ based on the number and
types of the buffers, which were specified by the `feed` and `fetch` arguments
to the `tf_library` macro.

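For instance, given the header above, an arg buffer can be written either
through its flat row-major pointer or through its indexed accessor; the helper
below (hypothetical, for illustration only) touches the same element both ways:

```c++
#include "tensorflow/compiler/aot/tests/test_graph_tfmatmul.h"  // generated

// Illustrative helper: two equivalent ways to write row 0, column 2 of the
// 2x3 arg0 buffer declared in the header above.
void WriteFirstArg(foo::bar::MatMulComp& matmul) {
  matmul.arg0_data()[0 * 3 + 2] = 42.0f;  // flat pointer, row-major indexing
  matmul.arg0(0, 2) = 42.0f;              // indexed accessor, same element
}
```
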
There are three types of buffers managed within the generated class: `args`
representing the inputs, `results` representing the outputs, and `temps`
representing temporary buffers used internally to perform the computation. By
default, each instance of the generated class allocates and manages all of these
buffers for you. The `AllocMode` constructor argument may be used to change this
behavior. A convenience library is provided in
[`tensorflow/compiler/aot/runtime.h`](https://www.tensorflow.org/code/tensorflow/compiler/aot/runtime.h)
to help with manual buffer allocation; usage of this library is optional. All
buffers should be aligned to 32-byte boundaries.

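For example, here is a minimal sketch of manual argument-buffer management,
assuming the `MatMulComp` class above. Only the `AllocMode` constructor
argument, the `set_argN_data` methods, and the 32-byte alignment requirement
come from this document; the allocation strategy is one choice among many, and
error handling is omitted:

```c++
#include <cstdlib>

#include "tensorflow/compiler/aot/tests/test_graph_tfmatmul.h"  // generated

int main() {
  // Let the class allocate only result and temp buffers; we own the args.
  foo::bar::MatMulComp matmul(
      foo::bar::MatMulComp::AllocMode::RESULTS_AND_TEMPS_ONLY);

  // Each arg holds 6 floats (2x3 and 3x2), i.e. 24 bytes. std::aligned_alloc
  // (C++17) requires the size to be a multiple of the alignment, so round up
  // to 32 bytes, which also satisfies the 32-byte alignment requirement.
  float* arg0 = static_cast<float*>(std::aligned_alloc(32, 32));
  float* arg1 = static_cast<float*>(std::aligned_alloc(32, 32));
  for (int i = 0; i < 6; ++i) {
    arg0[i] = 1.0f + i;
    arg1[i] = 7.0f + i;
  }
  matmul.set_arg0_data(arg0);
  matmul.set_arg1_data(arg1);

  matmul.Run();

  std::free(arg1);
  std::free(arg0);
  return 0;
}
```
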
The generated C++ class is just a wrapper around the low-level code generated by
XLA.

Example of invoking the generated function based on
[`tfcompile_test.cc`](https://www.tensorflow.org/code/tensorflow/compiler/aot/tests/tfcompile_test.cc):

```c++
#define EIGEN_USE_THREADS
#define EIGEN_USE_CUSTOM_THREAD_POOL

#include <algorithm>
#include <iostream>
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
#include "tensorflow/compiler/aot/tests/test_graph_tfmatmul.h"  // generated

int main(int argc, char** argv) {
  Eigen::ThreadPool tp(2);  // Size the thread pool as appropriate.
  Eigen::ThreadPoolDevice device(&tp, tp.NumThreads());

  foo::bar::MatMulComp matmul;
  // set_thread_pool is part of the full generated interface; it is omitted
  // from the abbreviated header above.
  matmul.set_thread_pool(&device);

  // Set up args and run the computation.
  const float args[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
  std::copy(args + 0, args + 6, matmul.arg0_data());
  std::copy(args + 6, args + 12, matmul.arg1_data());
  matmul.Run();

  // Check the result.
  if (matmul.result0(0, 0) == 58) {
    std::cout << "Success" << std::endl;
  } else {
    std::cout << "Failed. Expected value 58 at 0,0. Got:"
              << matmul.result0(0, 0) << std::endl;
  }

  return 0;
}
```

### Step 4: Create the final binary

This step combines the library generated by `tf_library` in step 2 and the code
written in step 3 to create a final binary. Below is an example `bazel` BUILD
file.

```build
# Example of linking your binary
# Also see //third_party/tensorflow/compiler/aot/tests/BUILD
load("//third_party/tensorflow/compiler/aot:tfcompile.bzl", "tf_library")

# The same tf_library call from step 2 above.
tf_library(
    name = "test_graph_tfmatmul",
    ...
)

# The executable code generated by tf_library can then be linked into your code.
cc_binary(
    name = "my_binary",
    srcs = [
        "my_code.cc",  # include test_graph_tfmatmul.h to access the generated header
    ],
    deps = [
        ":test_graph_tfmatmul",  # link in the generated object file
        "//third_party/eigen3",
    ],
    linkopts = [
        "-lpthread",
    ],
)
```