<html devsite>
  <head>
    <title>Understanding Systrace</title>
    <meta name="project_path" value="/_project.yaml" />
    <meta name="book_path" value="/_book.yaml" />
  </head>
  <body>
  <!--
      Copyright 2017 The Android Open Source Project

      Licensed under the Apache License, Version 2.0 (the "License");
      you may not use this file except in compliance with the License.
      You may obtain a copy of the License at

          http://www.apache.org/licenses/LICENSE-2.0

      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      See the License for the specific language governing permissions and
      limitations under the License.
  -->

<p class="caution"><strong>Caution:</strong> If you've never used systrace
before, we strongly recommend reading the
<a href="https://developer.android.com/studio/profile/systrace.html">systrace
overview</a> before continuing.</p>

<p>systrace is the primary tool for analyzing Android device performance.
However, it's really a wrapper around other tools: it is the host-side wrapper
around <em>atrace</em>, the device-side executable that controls userspace
tracing and sets up <em>ftrace</em>, the primary tracing mechanism in the Linux
kernel. systrace uses atrace to enable tracing, then reads the ftrace buffer and
wraps it all in a self-contained HTML viewer. (While newer kernels have support
for Linux Enhanced Berkeley Packet Filter (eBPF), the following documentation
pertains to the 3.18 kernel (no eBPF) as that's what was used on the Pixel/Pixel
XL.)</p>
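
<p>Because atrace does the device-side work, you can also invoke it directly
over adb when a full host-side systrace capture isn't needed. This is a minimal
sketch; the categories, buffer size, and output filename are illustrative
choices, not required values:</p>

<pre class="devsite-click-to-copy">
adb shell atrace --list_categories
adb shell atrace -t 10 -b 96000 gfx view sched freq idle > atrace_raw.txt
</pre>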

<p>systrace is owned by the Google Android and Google Chrome teams and is
developed in the open as part of the
<a href="https://github.com/catapult-project/catapult">Catapult project</a>. In
addition to systrace, Catapult includes other useful utilities. For example,
ftrace has more features than can be directly enabled by systrace or atrace and
contains some advanced functionality that is critical to debugging performance
problems. (These features require root access and often a new kernel.)</p>
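
<p>As one example of going below what atrace exposes, a rooted device lets you
drive ftrace directly through its debugfs interface. The following is a sketch
of the general mechanism, assuming the typical
<code>/sys/kernel/debug/tracing</code> mount point on a 3.18 kernel:</p>

<pre class="devsite-click-to-copy">
adb root
adb shell 'echo 96000 > /sys/kernel/debug/tracing/buffer_size_kb'
adb shell 'echo 1 > /sys/kernel/debug/tracing/events/sched/sched_switch/enable'
adb shell 'echo 1 > /sys/kernel/debug/tracing/tracing_on'
sleep 10
adb shell 'echo 0 > /sys/kernel/debug/tracing/tracing_on'
adb shell 'cat /sys/kernel/debug/tracing/trace' > raw_ftrace.txt
</pre>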

<h2 id="running_systrace">Running systrace</h2>
<p>When debugging jitter on Pixel/Pixel XL, start with the following
command:</p>

<pre class="devsite-terminal devsite-click-to-copy">
./systrace.py sched freq idle am wm gfx view sync binder_driver irq workq input -b 96000
</pre>
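
<p>If you prefer a timed capture that writes straight to a file, systrace also
accepts a duration and an output name; the 10-second duration and filename here
are arbitrary examples:</p>

<pre class="devsite-terminal devsite-click-to-copy">
./systrace.py -t 10 -b 96000 -o mytrace.html sched freq idle am wm gfx view sync binder_driver irq workq input
</pre>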

<p>When combined with the additional tracepoints required for GPU and display
pipeline activity, this gives you the ability to trace from user input to frame
displayed on screen. Set the buffer size to something large to avoid
losing events (which usually manifests as some CPUs containing no events after
some point in the trace).</p>
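
<p>If you suspect the buffer still overflowed, a rooted device lets you check
the per-CPU ftrace statistics directly; a nonzero overrun count means events
were dropped. The debugfs path here is an assumption based on the typical 3.18
layout:</p>

<pre class="devsite-terminal devsite-click-to-copy">
adb shell 'grep overrun /sys/kernel/debug/tracing/per_cpu/cpu*/stats'
</pre>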

<p>When going through systrace, keep in mind that <strong>every event is
triggered by something on the CPU</strong>.</p>

<p class="note"><strong>Note:</strong> Hardware interrupts are not controlled by
the CPU and do trigger things in ftrace, but the actual commit to the trace log
is done by the interrupt handler, which could have been delayed if your
interrupt arrived while (for example) a misbehaving driver had interrupts
disabled. The critical element is the CPU.</p>

<p>Because systrace is built on top of ftrace and ftrace runs on the CPU,
something on the CPU must write the ftrace buffer that logs hardware changes.
This means that if you're curious about why a display fence changed state, you
can see what was running on the CPU at the exact point of its transition
(something that is running on the CPU triggered that change in the log). This
concept is the foundation of analyzing performance using systrace.</p>

<h2 id="example_1">Example: Working frame</h2>
<p>This example describes a systrace for a normal UI pipeline. To follow along
with the example, <a href="perf_traces.zip">download the zip file</a> of traces
(which also includes other traces referred to in this section), unzip the file,
and open the systrace_tutorial.html file in your browser. Be warned: this
systrace is a large file, and unless you use systrace in your day-to-day work,
it likely contains far more information than you've ever seen in a single trace
before.</p>
<p>For a consistent, periodic workload such as TouchLatency, the UI pipeline
contains the following steps (a command sketch for spotting them in a raw trace
follows the list):</p>
<ol>
<li>EventThread in SurfaceFlinger wakes the app UI thread, signaling it's time
to render a new frame.</li>
<li>App renders the frame in the UI thread, RenderThread, and hwuiTasks, using
CPU and GPU resources. This is where the bulk of the capacity for UI is
spent.</li>
<li>App sends the rendered frame to SurfaceFlinger via binder and goes to
sleep.</li>
<li>A second EventThread in SurfaceFlinger wakes SurfaceFlinger to trigger
composition and display output. If SurfaceFlinger determines there is no work to
be done, it goes back to sleep.</li>
<li>SurfaceFlinger handles composition via HWC/HWC2 or GL. HWC/HWC2 composition
is faster and lower power but has limitations depending on the SOC. This usually
takes ~4-6ms, but can overlap with step 2 because Android applications are
always triple-buffered. (While applications are always triple-buffered, there
may only be one pending frame waiting in SurfaceFlinger, which makes it appear
identical to double-buffering.)</li>
<li>SurfaceFlinger dispatches the final output to the display via the vendor
driver and goes back to sleep, waiting for the EventThread wakeup.</li>
</ol>
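
<p>To see these steps as raw events rather than in the viewer, you can capture
a plain atrace dump and search it. The categories and grep pattern below are
illustrative; exact event and thread names vary by release:</p>

<pre class="devsite-click-to-copy">
adb shell atrace -t 5 -b 32000 gfx view sched binder_driver > pipeline.txt
grep -E 'EventThread|queueBuffer' pipeline.txt | head -n 20
</pre>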

<p>Let's walk through the frame beginning at 15409ms:</p>

<p><img src="/devices/tech/debug/images/perf_trace_normal_pipeline.png"></p>
<p class="img-caption"><strong>Figure 1.</strong> Normal UI pipeline,
EventThread running.</p>

<p>Figure 1 is a normal frame surrounded by normal frames, so it's a good
starting point for understanding how the UI pipeline works. The UI thread row
for TouchLatency includes different colors at different times. Bars denote
different states for the thread:</p>

<ul>
<li><strong>Gray</strong>. Sleeping.</li>
<li><strong>Blue</strong>. Runnable (it could run, but the scheduler hasn't
picked it to run yet).</li>
<li><strong>Green</strong>. Actively running (the scheduler thinks it's
running).
<p class="note"><strong>Note</strong>: Interrupt handlers aren't shown in the
normal per-CPU timeline, so it's possible that you're actually running interrupt
handlers or softirqs during some portion of a thread's runtime. Check the irq
section of the trace (under process 0) to confirm whether an interrupt is
running instead of a standard thread.</p>
</li>
<li><strong>Red</strong>. Uninterruptible sleep (generally sleeping on a lock
in the kernel). Can be indicative of I/O load. Extremely useful for debugging
performance issues.</li>
<li><strong>Orange</strong>. Uninterruptible sleep due to I/O load.</li>
</ul>

<p>To view the reason for uninterruptible sleep (available from the
<code>sched_blocked_reason</code> tracepoint), select the red uninterruptible
sleep slice.</p>
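
<p>The <code>sched_blocked_reason</code> tracepoint isn't present on every
kernel; you can check whether your device exposes it (debugfs path assumed):</p>

<pre class="devsite-terminal devsite-click-to-copy">
adb shell ls /sys/kernel/debug/tracing/events/sched/sched_blocked_reason
</pre>
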
<p>While EventThread is running, the UI thread for TouchLatency becomes
runnable. To see what woke it, click the blue section:</p>

<p><img src="/devices/tech/debug/images/perf_trace_tl.png"></p>
<p class="img-caption"><strong>Figure 2.</strong> UI thread for
TouchLatency.</p>

<p>Figure 2 shows the TouchLatency UI thread was woken by tid 6843, which
corresponds to EventThread. The UI thread wakes:</p>

<p><img src="/devices/tech/debug/images/perf_trace_wake_render_enqueue.png"></p>
<p class="img-caption"><strong>Figure 3.</strong> UI thread wakes, renders a
frame, and enqueues it for SurfaceFlinger to consume.</p>

<p>If the <code>binder_driver</code> tag is enabled in a trace, you can select a
binder transaction to view information on all of the processes involved in that
transaction:</p>
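
<p>The <code>binder_driver</code> category corresponds to the kernel's binder
tracepoints, so the same transactions are visible in a raw dump. The grep below
is just one way to pull them out:</p>

<pre class="devsite-click-to-copy">
adb shell atrace -t 5 -b 32000 binder_driver gfx > binder_trace.txt
grep binder_transaction binder_trace.txt | head -n 10
</pre>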

<p><img src="/devices/tech/debug/images/perf_trace_binder_trans.png"></p>
<p class="img-caption"><strong>Figure 4.</strong> Binder transaction.</p>

<p>Figure 4 shows that, at 15423.65ms, Binder:6832_1 in SurfaceFlinger becomes
runnable because of tid 9579, which is TouchLatency's RenderThread. You can also
see queueBuffer on both sides of the binder transaction.</p>

<p>During the queueBuffer on the SurfaceFlinger side, the number of pending
frames from TouchLatency goes from 1 to 2:</p>

<p><img src="/devices/tech/debug/images/perf_trace_pending_frames.png"></p>
<p class="img-caption"><strong>Figure 5.</strong> Pending frames goes from 1 to
2.</p>

<p>Figure 5 shows triple-buffering, where there are two completed frames and the
app will soon start rendering a third. This is because we've already dropped
some frames, so the app keeps two pending frames instead of one to try to avoid
further dropped frames.</p>
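
<p>You can also sample frame timing outside of systrace: on devices of this era,
<code>dumpsys SurfaceFlinger</code> can print per-frame timestamps for a layer.
Treat the exact layer name as device- and app-specific, so list the layers
first:</p>

<pre class="devsite-click-to-copy">
adb shell dumpsys SurfaceFlinger --list
adb shell dumpsys SurfaceFlinger --latency <var>LAYER_NAME</var>
</pre>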

<p>Soon after, SurfaceFlinger's main thread is woken by a second EventThread so
it can output the older pending frame to the display:</p>

<p><img src="/devices/tech/debug/images/perf_trace_sf_woken_et.png"></p>
<p class="img-caption"><strong>Figure 6.</strong> SurfaceFlinger's main thread
is woken by a second EventThread.</p>

<p>SurfaceFlinger first latches the older pending buffer, which causes the
pending buffer count to decrease from 2 to 1:</p>

<p><img src="/devices/tech/debug/images/perf_trace_sf_latches_pend.png"></p>
<p class="img-caption"><strong>Figure 7.</strong> SurfaceFlinger first latches
the older pending buffer.</p>

<p>After latching the buffer, SurfaceFlinger sets up composition and submits the
final frame to the display. (Some of these sections are enabled as part of the
<code>mdss</code> tracepoint, so they may not be there on your SOC.)</p>
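
<p>The <code>mdss</code> category has an effect only when the kernel exposes
MDSS tracepoints. A quick way to check is to look for the event directory (the
directory name here is assumed from Qualcomm MDSS kernels):</p>

<pre class="devsite-terminal devsite-click-to-copy">
adb shell ls /sys/kernel/debug/tracing/events/mdss
</pre>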

<p><img src="/devices/tech/debug/images/perf_trace_sf_comp_submit.png"></p>
<p class="img-caption"><strong>Figure 8.</strong> SurfaceFlinger sets up
composition and submits the final frame.</p>

<p>Next, <code>mdss_fb0</code> wakes on CPU 0. <code>mdss_fb0</code> is the
display pipeline's kernel thread for outputting a rendered frame to the display.
We can see <code>mdss_fb0</code> as its own row in the trace (scroll down to
view).</p>

<p><img src="/devices/tech/debug/images/perf_trace_wake_cpu0.png"></p>
<p class="img-caption"><strong>Figure 9.</strong> <code>mdss_fb0</code> wakes
on CPU 0.</p>

<p><code>mdss_fb0</code> wakes up, runs for a bit, enters uninterruptible sleep,
then wakes again.</p>

<p class="note"><strong>Note</strong>: From this point forward, the trace is
more complicated as the final work is split between <code>mdss_fb0</code>,
interrupts, and workqueue functions. If you need that level of detail, refer to
the exact characteristics of the driver stack for your SOC (as what happens on
the Pixel XL might not be useful to you).</p>

<h2 id="example_2">Example: Non-working frame</h2>
<p>This example describes a systrace used to debug Pixel/Pixel XL jitter. To
follow along with the example, <a href="/devices/tech/debug/perf_traces.zip">download the zip
file</a> of traces (which also includes other traces referred to in this
section), unzip the file, and open the systrace_tutorial.html file in your
browser.</p>

<p>When you first open the systrace, you'll see something like this:</p>

<p><img src="/devices/tech/debug/images/perf_trace_tl_pxl.png"></p>
<p class="img-caption"><strong>Figure 10</strong>. TouchLatency running on Pixel
XL (most options enabled, including mdss and kgsl tracepoints).</p>

<p>When looking for jank, check the FrameMissed row under SurfaceFlinger.
FrameMissed is a quality-of-life improvement provided by Hardware Composer 2
(HWC2). As of December 2016, HWC2 is used only on Pixel/Pixel XL; when viewing
systrace for other devices, the FrameMissed row may not be present. In either
case, FrameMissed is correlated with SurfaceFlinger missing one of its
extremely-regular runtimes and an unchanged pending-buffer count for the app
(<code>com.prefabulated.touchlatency</code>) at a vsync:</p>

<p><img src="/devices/tech/debug/images/perf_trace_fm_sf.png"></p>
<p class="img-caption"><strong>Figure 11</strong>. FrameMissed correlation with
SurfaceFlinger.</p>

<p>Figure 11 shows a missed frame at 15598.29ms. SurfaceFlinger woke briefly at
the vsync interval and went back to sleep without doing any work, which means
SurfaceFlinger determined it was not worth trying to send a frame to the display
again. Why?</p>

<p>To understand how the pipeline broke down for this frame, first review the
working frame example above to see how a normal UI pipeline appears in systrace.
When ready, return to the missed frame and work backwards. Notice that
SurfaceFlinger wakes and immediately goes to sleep. When viewing the number of
pending frames from TouchLatency, there are two frames (a good clue to help
figure out what's going on).</p>

<p><img src="/devices/tech/debug/images/perf_trace_sf_wake_sleep.png"></p>
<p class="img-caption"><strong>Figure 12.</strong> SurfaceFlinger wakes and
immediately goes to sleep.</p>

<p>Because we have frames in SurfaceFlinger, it's not an app issue. In addition,
SurfaceFlinger is waking at the correct time, so it's not a SurfaceFlinger
issue. If SurfaceFlinger and the app both look normal, it's probably a
driver issue.</p>

<p>Because the <code>mdss</code> and <code>sync</code> tracepoints are enabled,
we can get information about the fences (shared between the display driver and
SurfaceFlinger) that control when frames are actually submitted to the display.
The fences we care about are listed under <code>mdss_fb0_retire</code>, which
denotes when a frame is actually on the display. These fences are provided as
part of the <code>sync</code> trace category. Which fences correspond to
particular events in SurfaceFlinger depends on your SOC and driver stack, so
work with your SOC vendor to understand the meaning of fence categories in your
traces.</p>
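
<p>Because fence activity arrives through the <code>sync</code> ftrace events,
you can also pull the retire-fence timeline out of a raw dump. The timeline name
below matches the Pixel XL trace in this example and may differ on other
devices:</p>

<pre class="devsite-click-to-copy">
adb shell atrace -t 5 -b 32000 sync gfx > fence_trace.txt
grep mdss_fb0_retire fence_trace.txt | head -n 10
</pre>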

<p><img src="/devices/tech/debug/images/perf_traces_fences.png"></p>
<p class="img-caption"><strong>Figure 13.</strong> <code>mdss_fb0_retire</code>
fences.</p>

<p>Figure 13 shows a frame that was displayed for 33ms, not 16.7ms as expected.
Halfway through that slice, that frame should have been replaced by a new one
but wasn't. View the previous frame and look for anything interesting:</p>

<p><img src="/devices/tech/debug/images/perf_trace_frame_previous.png"></p>
<p class="img-caption"><strong>Figure 14.</strong> Frame previous to busted
frame.</p>

<p>Figure 14 shows a frame that took 14.482ms. The busted two-frame segment was
33.6ms, which is roughly what we would expect for two frames (we render at 60Hz,
16.7ms a frame, which is close). But 14.482ms is not anywhere close to 16.7ms,
which suggests that something is very wrong with the display pipe.</p>

<p>Investigate exactly where that fence ends to determine what controls it:</p>

<p><img src="/devices/tech/debug/images/perf_trace_fence_end.png"></p>
<p class="img-caption"><strong>Figure 15.</strong> Investigate fence end.</p>

<p>A workqueue contains a <code>__vsync_retire_work_handler</code> that is
running when the fence changes. Looking through the kernel source, you can see
it's part of the display driver. It definitely appears to be on the critical
path for the display pipeline, so it must run as quickly as possible. It's
runnable for 70us or so (not a long scheduling delay), but it's a workqueue and
might not get scheduled accurately.</p>

<p>Check the previous frame to determine if that contributed; sometimes jitter
can add up over time and eventually cause us to miss a deadline.</p>

<p><img src="/devices/tech/debug/images/perf_trace_previous_frame.png"></p>
<p class="img-caption"><strong>Figure 16.</strong> Previous frame.</p>

<p>The runnable line on the kworker isn't visible because the viewer turns it
white when it's selected, but the statistics tell the story: 2.3ms of scheduler
delay for part of the display pipeline critical path is <strong>bad</strong>.
Before we do anything else, we should fix that by moving this part of the
display pipeline critical path from a workqueue (which runs as a
<code>SCHED_OTHER</code> CFS thread) to a dedicated <code>SCHED_FIFO</code>
kthread. This function needs timing guarantees that workqueues can't (and aren't
intended to) provide.</p>
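
<p>As a quick diagnostic before touching kernel code, you can inspect (and,
with root, experimentally change) a thread's scheduling class with
<code>chrt</code>. The pid is a placeholder for the kworker or kthread you've
identified in the trace; note that work items can migrate between kworkers, so
pinning one kworker to <code>SCHED_FIFO</code> is only a rough experiment:</p>

<pre class="devsite-click-to-copy">
adb shell chrt -p <var>PID</var>
adb root
adb shell chrt -f -p 2 <var>PID</var>
</pre>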

<p>Is this the reason for the jank? It's hard to say conclusively. Outside of
easy-to-diagnose cases such as kernel lock contention causing display-critical
threads to sleep, traces usually won't tell you directly what the problem is.
Could this jitter have been the cause of the dropped frame? Absolutely. The
fence times should be 16.7ms, but they aren't close to that at all in the frames
leading up to the dropped frame. (There's a 19ms fence followed by a 14ms
fence.) Given how tightly coupled the display pipeline is, it's entirely
possible the jitter around fence timings resulted in an eventual dropped
frame.</p>

<p>In this example, the solution involved converting
<code>__vsync_retire_work_handler</code> from a workqueue to a dedicated
kthread. This resulted in noticeable jitter improvements and reduced jank in the
bouncing ball test. Subsequent traces show fence timings that hover very close
to 16.7ms.</p>

</body>
</html>