<html devsite>
  <head>
    <title>Evaluating Performance</title>
    <meta name="project_path" value="/_project.yaml" />
    <meta name="book_path" value="/_book.yaml" />
  </head>
  <body>
  <!--
      Copyright 2017 The Android Open Source Project

      Licensed under the Apache License, Version 2.0 (the "License");
      you may not use this file except in compliance with the License.
      You may obtain a copy of the License at

          http://www.apache.org/licenses/LICENSE-2.0

      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      See the License for the specific language governing permissions and
      limitations under the License.
  -->


<p>There are two user-visible indicators of performance:</p>

<ul>
<li><strong>Predictable, perceptible performance</strong>. Does the user
interface (UI) drop frames or consistently render at 60FPS? Does audio play
without artifacts or popping? How long is the delay between the user touching
the screen and the effect showing on the display?</li>
<li><strong>Length of time required for longer operations</strong> (such as
opening applications).</li>
</ul>

<p>The first is more noticeable than the second. Users typically notice jank
but they won't be able to tell 500ms vs 600ms application startup time unless
they are looking at two devices side-by-side. Touch latency is immediately
noticeable and significantly contributes to the perception of a device.</p>

<p>As a result, in a fast device, the UI pipeline is the most important thing in
the system other than what is necessary to keep the UI pipeline functional. This
means that the UI pipeline should preempt any other work that is not necessary
for fluid UI.
To maintain a fluid UI, background syncing, notification delivery,
and similar work must all be delayed if UI work can be run. It is
acceptable to trade the performance of longer operations (HDR+ runtime,
application startup, etc.) to maintain a fluid UI.</p>

<h2 id="capacity_vs_jitter">Capacity vs jitter</h2>
<p>When considering device performance, <em>capacity</em> and <em>jitter</em>
are two meaningful metrics.</p>

<h3 id="capacity">Capacity</h3>
<p>Capacity is the total amount of some resource that the device possesses over
some amount of time. This can be CPU resources, GPU resources, I/O resources,
network resources, memory bandwidth, or any similar metric. When examining
whole-system performance, it can be useful to abstract the individual components
and assume a single metric that determines performance (especially when tuning a
new device because the workloads run on that device are likely fixed).</p>

<p>The capacity of a system varies based on the computing resources online.
Changing CPU/GPU frequency is the primary means of changing capacity, but there
are others such as changing the number of CPU cores online. Accordingly, the
capacity of a system corresponds with power consumption; <strong>changing
capacity always results in a similar change in power consumption.</strong></p>

<p>The capacity required at a given time is overwhelmingly determined by the
running application. As a result, the platform can do little to adjust the
capacity required for a given workload, and the means to do so are limited to
runtime improvements (Android framework, ART, Bionic, GPU compiler/drivers,
kernel).</p>

<h3 id="jitter">Jitter</h3>
<p>While the required capacity for a workload is easy to see, jitter is a more
nebulous concept.
For a good introduction to jitter as an impediment to fast
systems, refer to
<em><a href="http://permalink.lanl.gov/object/tr?what=info:lanl-repo/lareport/LA-UR-03-3116">THE
CASE OF THE MISSING SUPERCOMPUTER PERFORMANCE: ACHIEVING OPTIMAL PERFORMANCE ON
THE 8,192 PROCESSORS OF ASCI Q</a></em>. (It's an investigation of why the ASCI
Q supercomputer did not achieve its expected performance and is a great
introduction to optimizing large systems.)</p>

<p>This page uses the term jitter to describe what the ASCI Q paper calls
<em>noise</em>. Jitter is the random system behavior that prevents perceptible
work from running. It is often work that must be run, but it may not have strict
timing requirements that cause it to run at any particular time. Because it is
random, it is extremely difficult to disprove the existence of jitter for a
given workload. It is also extremely difficult to prove that a known source of
jitter was the cause of a particular performance issue.
The tools most commonly
used for diagnosing causes of jitter (such as tracing or logging) can introduce
their own jitter.</p>

<p>Sources of jitter experienced in real-world implementations of Android
include:</p>
<ul>
<li>Scheduler delay</li>
<li>Interrupt handlers</li>
<li>Driver code running for too long with preemption or interrupts disabled</li>
<li>Long-running softirqs</li>
<li>Lock contention (application, framework, kernel driver, binder lock, mmap
lock)</li>
<li>File descriptor contention where a low-priority thread holds the lock on a
file, preventing a high-priority thread from running</li>
<li>Running UI-critical code in workqueues where it could be delayed</li>
<li>CPU idle transitions</li>
<li>Logging</li>
<li>I/O delays</li>
<li>Unnecessary process creation (e.g., CONNECTIVITY_CHANGE broadcasts)</li>
<li>Page cache thrashing caused by insufficient free memory</li>
</ul>

<p>The required amount of time for a given period of jitter may or may not
decrease as capacity increases. For example, if a driver leaves interrupts
disabled while waiting for a read from across an i2c bus, it will take a fixed
amount of time regardless of whether the CPU is at 384MHz or 2GHz. Increasing
capacity is not a feasible solution to improve performance when jitter is
involved. As a result, <strong>faster processors will not usually improve
performance in jitter-constrained situations.</strong></p>

<p>Finally, unlike capacity, jitter is almost entirely within the domain of the
system vendor.</p>

<h3 id="memory_consumption">Memory consumption</h3>
<p>Memory consumption is traditionally blamed for poor performance. While
consumption itself is not a performance issue, it can cause jitter via
lowmemorykiller overhead, service restarts, and page cache thrashing.
Reducing
memory consumption can avoid the direct causes of poor performance, but there
may be other targeted improvements that avoid those causes as well (for example,
pinning the framework to prevent it from being paged out when it will be paged
in soon after).</p>

<h2 id="analyze_initial">Analyzing initial device performance</h2>
<p>Starting from a functional but poorly-performing system and attempting to fix
the system's behavior by looking at individual cases of user-visible poor
performance is <strong>not</strong> a sound strategy. Because poor performance
is usually either not easily reproducible (i.e., jitter) or an application issue,
too many variables in the full system prevent this strategy from being effective.
As a result, it's very easy to misidentify causes and make minor improvements
while missing systemic opportunities for fixing performance across the system.</p>

<p>Instead, use the following general approach when bringing up a new
device:</p>
<ol>
<li>Get the system booting to UI with all drivers running and some basic
frequency governor settings (if you change the frequency governor settings,
repeat all steps below).</li>
<li>Ensure the kernel supports the <code>sched_blocked_reason</code> tracepoint
as well as other tracepoints in the display pipeline that denote when the frame
is delivered to the display.</li>
<li>Take long traces of the entire UI pipeline (from receiving input via an IRQ
to final scanout) while running a lightweight and consistent workload (e.g.,
<a href="https://android.googlesource.com/platform/frameworks/base.git/+/master/tests/UiBench/">UiBench</a>
or the ball test in <a href="#touchlatency">TouchLatency</a>).</li>
<li>Fix the frame drops detected in the lightweight and consistent
workload.</li>
<li>Repeat steps 3-4 until you can run with zero dropped frames for 20+ seconds
at a time.
</li>
<li>Move on to other user-visible sources of jank.</li>
</ol>

<p>Other simple things you can do early on in device bringup include:</p>

<ul>
<li>Ensure your kernel has the
<a href="https://android.googlesource.com/kernel/msm/+/c9f00aa0e25e397533c198a0fcf6246715f99a7b%5E!/">sched_blocked_reason
tracepoint patch</a>. This tracepoint is enabled with the sched trace category
in systrace and provides the function responsible for sleeping when that
thread enters uninterruptible sleep. It is critical for performance analysis
because uninterruptible sleep is a very common indicator of jitter.</li>
<li>Ensure you have sufficient tracing for the GPU and display pipelines. On
recent Qualcomm SOCs, tracepoints are enabled using:</li>
<pre class="devsite-click-to-copy">
<code class="devsite-terminal">adb shell "echo 1 > /d/tracing/events/kgsl/enable"</code>
<code class="devsite-terminal">adb shell "echo 1 > /d/tracing/events/mdss/enable"</code>
</pre>

<p>These events remain enabled when you run systrace so you can see additional
information in the trace about the display pipeline (MDSS) in the
<code>mdss_fb0</code> section. On Qualcomm SOCs, you won't see any additional
information about the GPU in the standard systrace view, but the results are
present in the trace itself (for details, see
<a href="/devices/tech/debug/systrace.html">Understanding
systrace</a>).</p>

<p>What you want from this kind of display tracing is a single event that
directly indicates a frame has been delivered to the display. From there, you
can determine if you've hit your frame time successfully; if event X<em>n</em>
occurs less than 16.7ms after event X<em>n-1</em> (assuming a 60Hz display),
then you know you did not jank. If your SOC does not provide such signals, work
with your vendor to get them.
Debugging jitter is extremely difficult without a
definitive signal of frame completion.</p></ul>

<h3 id="synthetic_benchmarks">Using synthetic benchmarks</h3>
<p>Synthetic benchmarks are useful for ensuring a device's basic functionality
is present. However, treating benchmarks as a proxy for perceived device
performance is not useful.</p>

<p>Based on experiences with SOCs, differences in synthetic benchmark
performance between SOCs are not correlated with a similar difference in
perceptible UI performance (number of dropped frames, 99th percentile frame
time, etc.). Synthetic benchmarks are capacity-only benchmarks; jitter impacts
the measured performance of these benchmarks only by stealing time from the bulk
operation of the benchmark. As a result, synthetic benchmark scores are mostly
irrelevant as a metric of user-perceived performance.</p>

<p>Consider two SOCs running Benchmark X that renders 1000 frames of UI and
reports the total rendering time in milliseconds (lower score is better).</p>

<ul>
<li>SOC 1 renders each frame of Benchmark X in 10ms and scores 10,000.</li>
<li>SOC 2 renders 99% of frames in 1ms but 1% of frames in 100ms and scores
1,990 (990 frames at 1ms plus 10 frames at 100ms), a dramatically better
score.</li>
</ul>

<p>If the benchmark workload is indicative of actual UI performance, SOC 2 would
be unusable despite its better score. Assuming a 60Hz refresh rate, SOC 2 would
have a janky frame once in every 100 frames, or roughly every 1.7s of operation.
Meanwhile, SOC 1 (the slower SOC according to Benchmark X) would be perfectly
fluid.</p>

<h3 id="bug_reports">Using bug reports</h3>
<p>Bug reports are sometimes useful for performance analysis, but because they
are so heavyweight, they are rarely useful for debugging sporadic jank issues.
They may provide some hints on what the system was doing at a given time,
especially if the jank was around an application transition (which is logged in
a bug report).
Bug reports can also indicate when something is more broadly
wrong with the system that could reduce its effective capacity (such as thermal
throttling or memory fragmentation).</p>

<h3 id="touchlatency">Using TouchLatency</h3>
<p>Several examples of bad behavior come from TouchLatency, which is the
preferred periodic workload used for the Pixel and Pixel XL. It's available at
<code>frameworks/base/tests/TouchLatency</code> and has two modes: touch latency
and bouncing ball (to switch modes, click the button in the upper-right
corner).</p>

<p>The bouncing ball test is exactly as simple as it appears: A ball bounces
around the screen forever, regardless of user input. It is usually also
<strong>by far</strong> the hardest test to run perfectly, but the closer it
comes to running without any dropped frames, the better your device will be. The
bouncing ball test is difficult because it is a trivial but perfectly consistent
workload that runs at a very low clock (this assumes the device has a frequency
governor; if the device is instead running with fixed clocks, downclock the
CPU/GPU to near-minimum when running the bouncing ball test for the first time).
As the system quiesces and the clocks drop closer to idle, the required CPU/GPU
time per frame increases. You can watch the ball and see things jank, and you'll
be able to see missed frames in systrace as well.</p>

<p>Because the workload is so consistent, you can identify most sources of
jitter much more easily than in most user-visible workloads by tracking what
exactly is running on the system during each missed frame instead of the UI
pipeline.
<strong>The lower clocks amplify the effects of jitter by making it
more likely that any jitter causes a dropped frame.</strong> As a result, the
closer TouchLatency is to 60FPS, the less likely you are to have bad system
behaviors that cause sporadic, hard-to-reproduce jank in larger
applications.</p>

<p>As jitter is often (but not always) clockspeed-invariant, use a test that
runs at very low clocks to diagnose jitter for the following reasons:</p>
<ul>
<li>Not all jitter is clockspeed-invariant; many sources just consume CPU
time.</li>
<li>The governor should get the average frame time close to the deadline by
clocking down, so time spent running non-UI work can push it over the edge to
dropping a frame.</li>
</ul>

  </body>
</html>