<h1>TensorFlow Profiler and Advisor</h1>

* [Features](#features)
* [Quick Start](#quick-start)
* [Demo](#demo)
* [Feature Request and Bug Report](#feature-request-and-bug-report)

### Features

* Profile model architectures.
  * parameters, tensor shapes, float operations, device placement, etc.
* Profile multi-step model performance.
  * execution time, memory consumption.
* Auto-profile and advise.
  * accelerator utilization check
  * expensive operation check
  * operation configuration check
  * distributed runtime check (not OSS)

### Quick Start

```python
# When using a high-level API, the session is usually hidden.
#
# Under the default ProfileContext, run a few hundred steps.
# The ProfileContext will sample some steps and dump the profiles
# to files. Users can then use the command line tool or Web UI for
# interactive profiling.
with tf.contrib.tfprof.ProfileContext('/tmp/train_dir') as pctx:
  # High-level API, such as slim, Estimator, etc.
  train_loop()
```

```shell
bazel-bin/tensorflow/core/profiler/profiler \
    --profile_path=/tmp/train_dir/profile_xx
tfprof> op -select micros,bytes,occurrence -order_by micros

# To be open sourced...
bazel-bin/tensorflow/python/profiler/profiler_ui \
    --profile_path=/tmp/profiles/profile_1
```
![ProfilerUI](g3doc/profiler_ui.jpg)

```python
# When using lower-level APIs with a Session object, the user has
# explicit control of each step.
#
# Create options to profile the time and memory information.
builder = tf.profiler.ProfileOptionBuilder
opts = builder(builder.time_and_memory()).order_by('micros').build()
# Create a profiling context; set the constructor arguments `trace_steps`
# and `dump_steps` to empty for explicit control.
with tf.contrib.tfprof.ProfileContext('/tmp/train_dir',
                                      trace_steps=[],
                                      dump_steps=[]) as pctx:
  with tf.Session() as sess:
    # Enable tracing for the next session.run.
    pctx.trace_next_step()
    # Dump the profile to '/tmp/train_dir' after the step.
    pctx.dump_next_step()
    _ = sess.run(train_op)
    pctx.profiler.profile_operations(options=opts)
```

```python
# For more advanced usage, the user can control the tracing and dumping
# steps, and can also run online profiling during training.
#
# Create options to profile time/memory as well as parameters.
builder = tf.profiler.ProfileOptionBuilder
opts = builder(builder.time_and_memory()).order_by('micros').build()
opts2 = tf.profiler.ProfileOptionBuilder.trainable_variables_parameter()

# Collect traces of steps 10 through 19, then dump the whole profile (with
# those traces) at step 20. The dumped profile can be used for further
# profiling with the command line interface or Web UI.
with tf.contrib.tfprof.ProfileContext('/tmp/train_dir',
                                      trace_steps=range(10, 20),
                                      dump_steps=[20]) as pctx:
  # Run online profiling with the 'op' view and 'opts' options at steps 15, 18, 20.
  pctx.add_auto_profiling('op', opts, [15, 18, 20])
  # Run online profiling with the 'scope' view and 'opts2' options at step 20.
  pctx.add_auto_profiling('scope', opts2, [20])
  # High-level API, such as slim, Estimator, etc.
  train_loop()
```


<b>Detailed Tutorials</b>

*  [Python API](g3doc/python_api.md)
*  [Command Line Interface](g3doc/command_line.md)
*  [Profile Time](g3doc/profile_time.md)
*  [Profile Memory](g3doc/profile_memory.md)
*  [Profile Model Architecture](g3doc/profile_model_architecture.md)
*  [Auto Detect and Advise](g3doc/advise.md)

<b>Detailed Documentation</b>

*  [Options](g3doc/options.md)

## Demo

### Attribute TensorFlow graph running time to your Python code.
```shell
tfprof> code -max_depth 1000 -show_name_regexes .*model_analyzer.*py.* -select micros -account_type_regexes .* -order_by micros
_TFProfRoot (0us/22.44ms)
  model_analyzer_test.py:149:run_filename_as_m...:none (0us/22.44ms)
    model_analyzer_test.py:33:_run_code_in_main:none (0us/22.44ms)
      model_analyzer_test.py:208:<module>:test.main() (0us/22.44ms)
        model_analyzer_test.py:132:testComplexCodeView:x = lib.BuildFull... (0us/22.44ms)
          model_analyzer_testlib.py:63:BuildFullModel:return sgd_op.min... (0us/21.83ms)
          model_analyzer_testlib.py:58:BuildFullModel:cell, array_ops.c... (0us/333us)
          model_analyzer_testlib.py:54:BuildFullModel:seq.append(array_... (0us/254us)
            model_analyzer_testlib.py:42:BuildSmallModel:x = nn_ops.conv2d... (0us/134us)
            model_analyzer_testlib.py:46:BuildSmallModel:initializer=init_... (0us/40us)
            ...
          model_analyzer_testlib.py:61:BuildFullModel:loss = nn_ops.l2_... (0us/28us)
          model_analyzer_testlib.py:60:BuildFullModel:target = array_op... (0us/0us)
        model_analyzer_test.py:134:testComplexCodeView:sess.run(variable... (0us/0us)
```
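
From the numbers above, each frame appears to show a `(self time / total time)` pair: the first value is time spent in ops created by that line itself, and the second adds everything attributed to its callees (note the child totals of `testComplexCodeView` summing to roughly 22.44ms). A minimal sketch of that aggregation, with a made-up call tree and timings rather than real tfprof data:

```python
# Hypothetical sketch, not tfprof code: how "(self/total)" pairs in the
# code view can be derived from per-frame self times.
class Frame:
    def __init__(self, name, self_us, children=()):
        self.name = name
        self.self_us = self_us          # micros attributed to this line itself
        self.children = list(children)  # frames created below this line

    def total_us(self):
        # The second number of each "(self/total)" pair: own time plus
        # the totals of every child frame.
        return self.self_us + sum(c.total_us() for c in self.children)

# Toy tree loosely mirroring the shape of the output above (made-up numbers).
tree = Frame('_TFProfRoot', 0, [
    Frame('BuildSmallModel', 0, [
        Frame('conv2d', 134),
        Frame('matmul', 254),
    ]),
    Frame('l2_loss', 28),
])
print(tree.total_us())  # 0 + (0 + 134 + 254) + 28 = 416
```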

### Show your model variables and the number of parameters.
```
tfprof> scope -account_type_regexes VariableV2 -max_depth 4 -select params
_TFProfRoot (--/930.58k params)
  global_step (1/1 params)
  init/init_conv/DW (3x3x3x16, 432/864 params)
  pool_logit/DW (64x10, 640/1.28k params)
    pool_logit/DW/Momentum (64x10, 640/640 params)
  pool_logit/biases (10, 10/20 params)
    pool_logit/biases/Momentum (10, 10/10 params)
  unit_last/final_bn/beta (64, 64/128 params)
  unit_last/final_bn/gamma (64, 64/128 params)
  unit_last/final_bn/moving_mean (64, 64/64 params)
  unit_last/final_bn/moving_variance (64, 64/64 params)
```
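
Reading the `a/b params` pairs above: the first number is simply the product of the variable's shape dimensions, and the second appears to include nested nodes as well (e.g. `pool_logit/DW` at 640 params plus its `Momentum` slot gives 1.28k). A quick sketch of the shape arithmetic:

```python
# Sketch of the per-variable parameter count: the product of the shape
# dimensions. (The "total" side of each a/b pair appears to add children
# such as optimizer slot variables; that interpretation is an assumption.)
from functools import reduce
from operator import mul

def param_count(shape):
    """Number of scalar parameters in a variable of the given shape."""
    return reduce(mul, shape, 1)

print(param_count([3, 3, 3, 16]))  # 432, matching init/init_conv/DW
print(param_count([64, 10]))       # 640, matching pool_logit/DW
# With its Momentum slot: 640 + 640 = 1280, shown above as 1.28k params.
```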

### Show the most expensive operation types.
```
tfprof> op -select micros,bytes,occurrence -order_by micros
node name | requested bytes | total execution time | accelerator execution time | cpu execution time | op occurrence (run|defined)
SoftmaxCrossEntropyWithLogits      36.58MB (100.00%, 0.05%),      1.37sec (100.00%, 26.68%),           0us (100.00%, 0.00%),      1.37sec (100.00%, 30.75%),      30|30
MatMul                        2720.57MB (99.95%, 3.66%),      708.14ms (73.32%, 13.83%),     280.76ms (100.00%, 41.42%),       427.39ms (69.25%, 9.62%),  2694|3450
ConcatV2                       741.37MB (96.29%, 1.00%),       389.63ms (59.49%, 7.61%),        31.80ms (58.58%, 4.69%),       357.83ms (59.63%, 8.05%),  4801|6098
Mul                           3957.24MB (95.29%, 5.33%),       338.02ms (51.88%, 6.60%),       80.88ms (53.88%, 11.93%),       257.14ms (51.58%, 5.79%),  7282|9427
Add                            740.05MB (89.96%, 1.00%),       321.76ms (45.28%, 6.28%),        13.50ms (41.95%, 1.99%),       308.26ms (45.79%, 6.94%),  1699|2180
Sub                             32.46MB (88.97%, 0.04%),       216.20ms (39.00%, 4.22%),          241us (39.96%, 0.04%),       215.96ms (38.85%, 4.86%),  1780|4372
Slice                          708.07MB (88.92%, 0.95%),       179.88ms (34.78%, 3.51%),        25.38ms (39.92%, 3.74%),       154.50ms (33.99%, 3.48%),  5800|7277
AddN                           733.21MB (87.97%, 0.99%),       158.36ms (31.26%, 3.09%),        50.10ms (36.18%, 7.39%),       108.26ms (30.51%, 2.44%),  4567|5481
Fill                           954.27MB (86.98%, 1.28%),       138.29ms (28.17%, 2.70%),        16.21ms (28.79%, 2.39%),       122.08ms (28.08%, 2.75%),  3278|9686
Select                         312.33MB (85.70%, 0.42%),       104.75ms (25.47%, 2.05%),        18.30ms (26.40%, 2.70%),        86.45ms (25.33%, 1.95%),  2880|5746
ApplyAdam                      231.65MB (85.28%, 0.31%),        92.66ms (23.43%, 1.81%),            0us (23.70%, 0.00%),        92.66ms (23.38%, 2.09%),      27|27
```
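
One plausible reading of the paired percentages in each column (inferred from the numbers above, not from the tfprof source): the second percentage is the op type's share of the column total, and the first is the share not yet accounted for before this row, so it starts at 100.00% and shrinks as rows are consumed. A sketch with truncated times and a made-up column total of 5135ms chosen so the shares land near the table's values:

```python
# Hypothetical sketch of the "(remaining%, share%)" pairs in the op view.
# Times are the first three rows above; total_ms is an assumed column total.
times_ms = [('SoftmaxCrossEntropyWithLogits', 1370.0),
            ('MatMul', 708.14),
            ('ConcatV2', 389.63)]
total_ms = 5135.0  # hypothetical total execution time across all op types

rows = []
remaining = 100.0
for name, t in times_ms:
    share = 100.0 * t / total_ms        # this op type's fraction of the total
    rows.append((name, round(remaining, 2), round(share, 2)))
    remaining -= share                  # what later rows still have to explain

for name, rem, share in rows:
    print('%-30s (%.2f%%, %.2f%%)' % (name, rem, share))
```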

### Auto-profile.

```shell
tfprof> advise
Not running under xxxx. Skip JobChecker.

AcceleratorUtilizationChecker:
device: /job:worker/replica:0/task:0/device:GPU:0 low utilization: 0.03
device: /job:worker/replica:0/task:0/device:GPU:1 low utilization: 0.08
device: /job:worker/replica:0/task:0/device:GPU:2 low utilization: 0.04
device: /job:worker/replica:0/task:0/device:GPU:3 low utilization: 0.21

OperationChecker:
Found operation using NHWC data_format on GPU. Maybe NCHW is faster.

JobChecker:

ExpensiveOperationChecker:
top 1 operation type: SoftmaxCrossEntropyWithLogits, cpu: 1.37sec, accelerator: 0us, total: 1.37sec (26.68%)
top 2 operation type: MatMul, cpu: 427.39ms, accelerator: 280.76ms, total: 708.14ms (13.83%)
top 3 operation type: ConcatV2, cpu: 357.83ms, accelerator: 31.80ms, total: 389.63ms (7.61%)
top 1 graph node: seq2seq/loss/sampled_sequence_loss/sequence_loss_by_example/SoftmaxCrossEntropyWithLogits_11, cpu: 89.92ms, accelerator: 0us, total: 89.92ms
top 2 graph node: train_step/update_seq2seq/output_projection/w/ApplyAdam, cpu: 84.52ms, accelerator: 0us, total: 84.52ms
top 3 graph node: seq2seq/loss/sampled_sequence_loss/sequence_loss_by_example/SoftmaxCrossEntropyWithLogits_19, cpu: 73.02ms, accelerator: 0us, total: 73.02ms
seq2seq_attention_model.py:360:build_graph:self._add_seq2seq(), cpu: 3.16sec, accelerator: 214.84ms, total: 3.37sec
  seq2seq_attention_model.py:293:_add_seq2seq:decoder_outputs, ..., cpu: 2.46sec, accelerator: 3.25ms, total: 2.47sec
    seq2seq_lib.py:181:sampled_sequence_...:average_across_ti..., cpu: 2.46sec, accelerator: 3.24ms, total: 2.47sec
      seq2seq_lib.py:147:sequence_loss_by_...:crossent = loss_f..., cpu: 2.46sec, accelerator: 3.06ms, total: 2.46sec
      seq2seq_lib.py:148:sequence_loss_by_...:log_perp_list.app..., cpu: 1.33ms, accelerator: 120us, total: 1.45ms
  seq2seq_attention_model.py:192:_add_seq2seq:sequence_length=a..., cpu: 651.56ms, accelerator: 158.92ms, total: 810.48ms
    seq2seq_lib.py:104:bidirectional_rnn:sequence_length, ..., cpu: 306.58ms, accelerator: 73.54ms, total: 380.12ms
      core_rnn.py:195:static_rnn:state_size=cell.s..., cpu: 306.52ms, accelerator: 73.54ms, total: 380.05ms
    seq2seq_lib.py:110:bidirectional_rnn:initial_state_bw,..., cpu: 296.21ms, accelerator: 73.54ms, total: 369.75ms
      core_rnn.py:195:static_rnn:state_size=cell.s..., cpu: 296.11ms, accelerator: 73.54ms, total: 369.65ms
    seq2seq_lib.py:113:bidirectional_rnn:outputs = [tf.con..., cpu: 46.88ms, accelerator: 3.87ms, total: 50.75ms
  seq2seq_attention_model.py:253:_add_seq2seq:initial_state_att..., cpu: 32.48ms, accelerator: 50.01ms, total: 82.50ms
    seq2seq.py:693:attention_decoder:attns = attention..., cpu: 11.73ms, accelerator: 38.41ms, total: 50.14ms
      seq2seq.py:653:attention:s = math_ops.redu..., cpu: 2.62ms, accelerator: 17.80ms, total: 20.41ms
      seq2seq.py:658:attention:array_ops.reshape..., cpu: 1.90ms, accelerator: 12.08ms, total: 13.98ms
      seq2seq.py:655:attention:a = nn_ops.softma..., cpu: 4.15ms, accelerator: 4.25ms, total: 8.40ms
    seq2seq.py:686:attention_decoder:cell_output, stat..., cpu: 14.43ms, accelerator: 4.85ms, total: 19.27ms
    seq2seq.py:696:attention_decoder:output = linear([..., cpu: 3.04ms, accelerator: 2.88ms, total: 5.93ms
      core_rnn_cell_impl.py:1009:_linear:res = math_ops.ma..., cpu: 2.33ms, accelerator: 2.71ms, total: 5.04ms
seq2seq_attention_model.py:363:build_graph:self._add_train_o..., cpu: 1.28sec, accelerator: 462.93ms, total: 1.74sec
  seq2seq_attention_model.py:307:_add_train_op:tf.gradients(self..., cpu: 967.84ms, accelerator: 462.88ms, total: 1.43sec
    gradients_impl.py:563:gradients:grad_scope, op, f..., cpu: 692.60ms, accelerator: 390.75ms, total: 1.08sec
    gradients_impl.py:554:gradients:out_grads[i] = co..., cpu: 164.71ms, accelerator: 16.21ms, total: 180.92ms
      control_flow_ops.py:1314:ZerosLikeOutsideL...:return array_ops...., cpu: 121.85ms, accelerator: 16.21ms, total: 138.05ms
      control_flow_ops.py:1313:ZerosLikeOutsideL...:zeros_shape = arr..., cpu: 22.85ms, accelerator: 0us, total: 22.85ms
      control_flow_ops.py:1312:ZerosLikeOutsideL...:switch_val = swit..., cpu: 20.02ms, accelerator: 0us, total: 20.02ms
    gradients_impl.py:515:gradients:out_grads = _Aggr..., cpu: 108.69ms, accelerator: 51.92ms, total: 160.61ms
      gradients_impl.py:846:_AggregatedGrads:out_grads[i] = _M..., cpu: 107.99ms, accelerator: 50.05ms, total: 158.04ms
      gradients_impl.py:856:_AggregatedGrads:array_ops.concat(..., cpu: 340us, accelerator: 1.87ms, total: 2.21ms
  seq2seq_attention_model.py:322:_add_train_op:zip(grads, tvars)..., cpu: 307.56ms, accelerator: 0us, total: 307.56ms
    optimizer.py:456:apply_gradients:update_ops.append..., cpu: 307.43ms, accelerator: 0us, total: 307.43ms
      optimizer.py:102:update_op:return optimizer...., cpu: 222.66ms, accelerator: 0us, total: 222.66ms
      optimizer.py:97:update_op:return optimizer...., cpu: 84.76ms, accelerator: 0us, total: 84.76ms
```
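
The essence of a check like `AcceleratorUtilizationChecker` can be sketched in a few lines: compare each accelerator's busy fraction against a threshold and report the laggards. The function name, threshold, and data below are illustrative assumptions, not the tfprof implementation:

```python
# Hypothetical sketch of a utilization check (not tfprof's actual code):
# flag devices whose accelerator busy time is a small fraction of the step.
def low_utilization_devices(utilization_by_device, threshold=0.5):
    """Return (device, utilization) pairs below the threshold, sorted by name."""
    return [(d, u) for d, u in sorted(utilization_by_device.items())
            if u < threshold]

# Made-up utilization figures echoing the advise output above.
utilization = {
    '/job:worker/replica:0/task:0/device:GPU:0': 0.03,
    '/job:worker/replica:0/task:0/device:GPU:3': 0.21,
    '/job:worker/replica:0/task:0/device:CPU:0': 0.90,  # busy; not flagged
}
for device, util in low_utilization_devices(utilization):
    print('device: %s low utilization: %.2f' % (device, util))
```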

### Visualize time and memory

```shell
# The following example generates a timeline.
tfprof> graph -step -1 -max_depth 100000 -output timeline:outfile=<filename>

generating trace file.

******************************************************
Timeline file is written to <filename>.
Open a Chrome browser, enter URL chrome://tracing and load the timeline file.
******************************************************
```

![Timeline](g3doc/graph_timeline.png)

```shell
# The following example generates a pprof graph (only supported by the code view).
# Since TensorFlow runs the graph instead of the Python code, the pprof graph
# doesn't profile the statistics of Python, but of the TensorFlow graph
# nodes created by the Python call stack.
# Nevertheless, it surfaces the critical Python code paths for us.
#
# `-trim_name_regexes` trims some traces that have no valuable information.
# `-select accelerator_micros` picks accelerator time for the pprof graph. Users
# can also generate a memory profile using `-select bytes`.
tfprof> code -select accelerator_micros -max_depth 100000 -output pprof:outfile=<filename>  -trim_name_regexes .*apply_op.*

# Use google-pprof, from the google-perftools package, to visualize the generated file.
# On Ubuntu you can install it with `apt-get install google-perftools`.
google-pprof --pdf --nodecount=100 <filename>
```

![PprofGraph](g3doc/pprof.jpg)

### Feature Request and Bug Report

Contact: xpan@google.com

Providing the `GraphDef` and `RunMetadata` files will greatly help
with bug fixes. Including the `OpLogProto`, if used, is a plus.


#### Teams

* Xin Pan
* Chris Antaki
* Yao Zhang
* Jon Shlens