<html devsite>
  <head>
    <title>Identifying Jitter-Related Jank</title>
    <meta name="project_path" value="/_project.yaml" />
    <meta name="book_path" value="/_book.yaml" />
  </head>
  <body>
  <!--
      Copyright 2017 The Android Open Source Project

      Licensed under the Apache License, Version 2.0 (the "License");
      you may not use this file except in compliance with the License.
      You may obtain a copy of the License at

          http://www.apache.org/licenses/LICENSE-2.0

      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      See the License for the specific language governing permissions and
      limitations under the License.
  -->

<p>Jitter is the random system behavior that prevents perceptible work from
running. This page describes how to identify and address jitter-related jank
issues.</p>

<h2 id="application">Application thread scheduler delay</h2>
<p>Scheduler delay is the most obvious symptom of jitter: A process that should
run is made runnable but does not run for a significant amount of time.
The significance of the delay varies according to the context. For example:</p>
<ul>
<li>A random helper thread in an app can probably be delayed for many
milliseconds without an issue.</li>
<li>An application's UI thread may be able to tolerate 1-2ms of jitter.</li>
<li>Driver kthreads running as SCHED_FIFO may cause issues if they are runnable
for 500us before running.</li>
</ul>
<p>Runnable times can be identified in systrace by the blue bar preceding a
running segment of a thread. A runnable time can also be determined by the
length of time between the <code>sched_wakeup</code> event for a thread and the
<code>sched_switch</code> event that signals the start of thread execution.</p>
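
<p>As a rough sketch of this measurement, the following Python computes runnable
times from simplified (timestamp, event, pid) tuples. Real ftrace/systrace
output carries a richer line format, so treat the parsing model (and the sample
events) as assumptions for illustration:</p>

```python
# Sketch: estimate runnable time (wakeup-to-run latency) for one thread
# from simplified sched events. Real trace lines carry more fields; the
# tuple format here is an assumption for illustration.

def runnable_times_us(events, pid):
    """events: list of (timestamp_us, event_name, event_pid) tuples."""
    latencies = []
    wakeup_ts = None
    for ts, name, epid in events:
        if name == "sched_wakeup" and epid == pid:
            wakeup_ts = ts
        elif name == "sched_switch" and epid == pid and wakeup_ts is not None:
            latencies.append(ts - wakeup_ts)  # time spent runnable, in us
            wakeup_ts = None
    return latencies

events = [
    (1000, "sched_wakeup", 42),
    (1900, "sched_switch", 42),   # ran 900us after becoming runnable
    (5000, "sched_wakeup", 42),
    (5080, "sched_switch", 42),   # ran 80us after becoming runnable
]
print(runnable_times_us(events, 42))  # [900, 80]
```

<p>Against the thresholds above, the 900us latency would be a problem for a
SCHED_FIFO driver kthread but unremarkable for a random helper thread.</p>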

<h3 id="long_threads">Threads that run too long</h3>
<p>Application UI threads that are runnable for too long can cause problems.
Lower-level threads with long runnable times generally have different causes,
but attempting to push UI thread runnable time toward zero may require fixing
some of the same issues that cause lower-level threads to have long runnable
times. To mitigate delays:</p>
<ol>
<li>Use cpusets as described in
<a href="/devices/tech/debug/jank_capacity.html#thermal-throttling">Thermal
throttling</a>.</li>
<li>Increase the CONFIG_HZ value.<ul>
<li>Historically, the value has been set to 100 on arm and arm64 platforms.
However, this is an accident of history and is not a good value to use for
interactive devices. CONFIG_HZ=100 means that a jiffy is 10ms long, which means
that load balancing between CPUs may take 20ms (two jiffies) to happen. This can
significantly contribute to jank on a loaded system.</li>
<li>Recent devices (Nexus 5X, Nexus 6P, Pixel, and Pixel XL) have shipped with
CONFIG_HZ=300. This should have a negligible power cost while significantly
improving runnable times. If you do see significant increases in power
consumption or performance issues after changing CONFIG_HZ, it's likely one of
your drivers is using a timer based on raw jiffies instead of milliseconds and
converting to jiffies. This is usually an easy fix (see the
<a href="https://android.googlesource.com/kernel/msm.git/+/daf0cdf29244ce4098cff186d16df23cfa782993%5E!/">patch</a>
that fixed kgsl timer issues on Nexus 5X and 6P when converting to
CONFIG_HZ=300).</li>
<li>Finally, we've experimented with CONFIG_HZ=1000 on Nexus/Pixel and found it
imposes a noticeable performance and power cost due to increased RCU
overhead.</li>
</ul>
</li>
</ol>
<p>With those two changes alone, a device should look much better for UI thread
runnable time under load.</p>
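
<p>The jiffy arithmetic behind the CONFIG_HZ recommendation is simple enough to
check directly. The two-jiffy load-balance figure matches the text above; exact
balancing intervals vary by scheduler version:</p>

```python
# Sketch: why CONFIG_HZ matters for scheduler latency. A jiffy is one
# timer tick; cross-CPU load balancing may take two jiffies (per the
# text above), so worst-case delay scales inversely with CONFIG_HZ.

def jiffy_ms(config_hz):
    return 1000.0 / config_hz

def worst_case_balance_ms(config_hz, jiffies=2):
    return jiffies * jiffy_ms(config_hz)

for hz in (100, 300, 1000):
    print(f"CONFIG_HZ={hz}: jiffy {jiffy_ms(hz):.2f}ms, "
          f"balance delay up to {worst_case_balance_ms(hz):.2f}ms")
# CONFIG_HZ=100:  jiffy 10.00ms, balance delay up to 20.00ms
# CONFIG_HZ=300:  jiffy 3.33ms,  balance delay up to 6.67ms
# CONFIG_HZ=1000: jiffy 1.00ms,  balance delay up to 2.00ms
```

<p>This is why CONFIG_HZ=300 lands in a sweet spot: it cuts the worst-case
balance delay from 20ms to under 7ms without the RCU overhead seen at 1000.</p>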

<h3 id="sys.use_fifo_ui">Using sys.use_fifo_ui</h3>
<p>You can try to drive UI thread runnable time to zero by setting the
<code>sys.use_fifo_ui</code> property to 1.</p>

<p class="warning"><strong>Warning</strong>: Do not use this option on
heterogeneous CPU configurations unless you have a capacity-aware RT scheduler.
And, at this moment, <strong>NO CURRENTLY SHIPPING RT SCHEDULER IS
CAPACITY-AWARE</strong>. We're working on one for EAS, but it's not yet
available. The default RT scheduler is based purely on RT priorities and whether
a CPU already has an RT thread of equal or higher priority.

<br><br>As a result, the default RT scheduler will happily move your
relatively long-running UI thread from a high-frequency big core to a little
core at minimum frequency if a higher-priority FIFO kthread happens to wake up
on the same big core. <strong>This will introduce significant performance
regressions</strong>. As this option has not yet been used on a shipping Android
device, if you want to use it, get in touch with the Android performance team to
help you validate it.</p>

<p>When <code>sys.use_fifo_ui</code> is enabled, ActivityManager tracks the UI
thread and RenderThread (the two most UI-critical threads) of the top
application and makes those threads SCHED_FIFO instead of SCHED_OTHER. This
effectively eliminates jitter from UI and RenderThreads; the traces we've
gathered with this option enabled show runnable times on the order of
microseconds instead of milliseconds.</p>

<p>However, because the RT load balancer was not capacity-aware, there was a
30% reduction in application startup performance because the UI thread
responsible for starting up the app would be moved from a 2.1GHz gold Kryo core
to a 1.5GHz silver Kryo core. With a capacity-aware RT load balancer, we see
equivalent performance in bulk operations and a 10-15% reduction in 95th and
99th percentile frame times in many of our UI benchmarks.</p>
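
<p>Percentile frame times like the ones cited above can be computed from a list
of per-frame durations captured during a benchmark run. This is a minimal
nearest-rank percentile sketch with invented frame times, not the benchmark
harness we used:</p>

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile; pct in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100.0 * len(ordered))
    return ordered[rank - 1]

# Invented frame durations (ms): mostly at 60Hz pace, two janky frames.
frame_times_ms = [16.6, 16.7, 16.5, 17.0, 16.6, 33.2, 16.8, 16.6, 16.9, 50.1]
print(percentile(frame_times_ms, 50))  # 16.7 -- median looks healthy
print(percentile(frame_times_ms, 95))  # 50.1 -- the tail tells the story
```

<p>The tail percentiles matter more than the median here: jitter shows up as
rare, large outliers that averages hide.</p>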

<h2 id="interrupt">Interrupt traffic</h2>
<p>Because ARM platforms deliver interrupts to CPU 0 only by default, we
recommend the use of an IRQ balancer (irqbalance or msm_irqbalance on Qualcomm
platforms).</p>
<p>During Pixel development, we saw jank that could be directly attributed to
overwhelming CPU 0 with interrupts. For example, if the <code>mdss_fb0</code>
thread was scheduled on CPU 0, it was much more likely to jank because of an
interrupt triggered by the display almost immediately before scanout.
<code>mdss_fb0</code> would be in the middle of its own work with a very tight
deadline, and then it would lose some time to the MDSS interrupt handler.
Initially, we attempted to fix this by setting the CPU affinity of the
mdss_fb0 thread to CPUs 1-3 to avoid contention with the interrupt, but then we
realized that we had not yet enabled msm_irqbalance. With msm_irqbalance
enabled, jank was noticeably improved even when both mdss_fb0 and the MDSS
interrupt were on the same CPU, due to reduced contention from other
interrupts.</p>
<p>This can be identified in systrace by looking at the sched section as well as
the irq section. The sched section shows what has been scheduled, but an
overlapping region in the irq section means an interrupt is running during that
time instead of the normally scheduled process. If you see significant chunks of
time taken during an interrupt, your options include:</p>
<ul>
<li>Make the interrupt handler faster.</li>
<li>Prevent the interrupt from happening in the first place.</li>
<li>Change the frequency of the interrupt to be out of phase with other regular
work that it may be interfering with (if it is a regular interrupt).</li>
<li>Set CPU affinity of the interrupt directly and prevent it from being
balanced.</li>
<li>Set CPU affinity of the thread the interrupt is interfering with to avoid
the interrupt.</li>
<li>Rely on the interrupt balancer to move the interrupt to a less loaded
CPU.</li>
</ul>
<p>Setting CPU affinity is generally not recommended but can be useful for
specific cases. In general, it's too hard to predict the state of the system for
most common interrupts, but if you have a very specific set of conditions that
triggers certain interrupts where the system is more constrained than normal
(such as VR), explicit CPU affinity may be a good solution.</p>
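
<p>One quick way to spot the kind of CPU 0 overload described above is to sum
per-CPU interrupt counts. This sketch parses <code>/proc/interrupts</code>-style
text; the sample rows and IRQ names are invented, and real device rows vary, so
it only sums the numeric per-CPU columns:</p>

```python
# Sketch: detect interrupt-count imbalance across CPUs by parsing
# /proc/interrupts-style text. Column layout follows the common
# upstream format; sample data is invented for illustration.

def irq_counts_per_cpu(text):
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())          # header row: CPU0 CPU1 ...
    totals = [0] * ncpus
    for line in lines[1:]:
        fields = line.split()
        for i, field in enumerate(fields[1:1 + ncpus]):
            if field.isdigit():
                totals[i] += int(field)    # skip non-numeric tail fields
    return totals

sample = """\
           CPU0       CPU1
 30:     120000        500   GIC  mmc0
 47:      80000        200   GIC  wlan
"""
print(irq_counts_per_cpu(sample))  # [200000, 700] -> CPU 0 heavily loaded
```

<p>A lopsided result like this is the cue to enable an IRQ balancer, or to set
explicit affinity in the narrow cases described above.</p>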

<h2 id="softirq">Long softirqs</h2>
<p>While a softirq is running, it disables preemption. Softirqs can be
triggered at many places within the kernel and can run inside of a user
process. If there's enough softirq activity, user processes stop running
softirqs inline, and ksoftirqd wakes to run them (where it can be load balanced
like a normal thread). Usually, this is fine. However, a single very long
softirq can wreak havoc on the system.</p>

<section class="expandable">
<h4 class="showalways">Show Issue: Janky headtracking while streaming data over
Wi-Fi</h4>

<p>In this issue, we saw that VR performance was inconsistent under specific
network conditions (Wi-Fi). What we found in the trace was that a single NET_RX
softirq could run for 30+ milliseconds. This was eventually tracked down to
Receive Packet Steering, a Qualcomm Wi-Fi feature that coalesces many NET_RX
softirqs into a single softirq. The resulting softirq could, under the right
conditions, have a very large (potentially unbounded) runtime.</p>
<p>While this feature may have reduced total CPU cycles spent on networking, it
prevented the system from running the right things at the right time. Disabling
the feature didn't impact network throughput or battery life, but it did fix the
VR headtracking jank by allowing ksoftirqd to load balance the softirqs instead
of circumventing it.</p>
</section>
<hr>

<p>Softirqs are visible within the irq section of a trace, so they are easy to
spot if the issue can be reproduced while tracing. Because a softirq can run
within a user process, a bad softirq can also manifest as extra runtime inside
of a user process for no obvious reason. If you see that, check the irq section
to see if softirqs are to blame.</p>
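
<p>If you suspect a long softirq, pairing entry and exit events gives per-vector
durations. A minimal sketch, assuming microsecond timestamps and a simplified
(timestamp, event, cpu, vector) tuple format rather than raw ftrace lines:</p>

```python
# Sketch: flag long-running softirqs from softirq_entry/softirq_exit
# pairs. The 1ms default threshold is illustrative; the NET_RX case
# described above ran for 30+ milliseconds.

def long_softirqs(events, threshold_us=1000):
    open_entries = {}            # (cpu, vector) -> entry timestamp
    long_runs = []
    for ts, kind, cpu, vec in events:
        if kind == "softirq_entry":
            open_entries[(cpu, vec)] = ts
        elif kind == "softirq_exit" and (cpu, vec) in open_entries:
            duration = ts - open_entries.pop((cpu, vec))
            if duration >= threshold_us:
                long_runs.append((vec, duration))
    return long_runs

events = [
    (0, "softirq_entry", 0, "NET_RX"),
    (31000, "softirq_exit", 0, "NET_RX"),   # 31ms: pathological
    (40000, "softirq_entry", 1, "TIMER"),
    (40050, "softirq_exit", 1, "TIMER"),    # 50us: normal
]
print(long_softirqs(events))  # [('NET_RX', 31000)]
```

<p>Remember that the whole flagged interval is time during which preemption was
disabled on that CPU.</p>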

<h2 id="drivers">Drivers leaving preemption or IRQs disabled too long</h2>
<p>Disabling preemption or interrupts for too long (tens of milliseconds)
results in jank. Typically, the jank manifests as a thread becoming runnable but
not running on a particular CPU, even if the runnable thread is significantly
higher priority (or SCHED_FIFO) than the other thread.</p>

<p>Some guidelines:</p>
<ul>
<li>If the runnable thread is SCHED_FIFO and the running thread is SCHED_OTHER,
the running thread has preemption or interrupts disabled.</li>
<li>If the runnable thread is significantly higher priority (100) than the
running thread (120), the running thread likely has preemption or interrupts
disabled if the runnable thread doesn't run within two jiffies.</li>
<li>If the runnable thread and the running thread have the same priority, the
running thread likely has preemption or interrupts disabled if the runnable
thread doesn't run within 20ms.</li>
</ul>
<p>Keep in mind that running an interrupt handler prevents you from servicing
other interrupts, which also disables preemption.</p>
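
<p>The guidelines above can be encoded as a rough triage heuristic when scanning
trace data. Priorities follow the kernel's convention shown above (lower number
means higher priority); the "+10" margin for "significantly higher priority" and
the CONFIG_HZ=300 jiffy length are assumptions for illustration:</p>

```python
# Sketch: classify whether a stalled runnable thread suggests the
# running thread has preemption/interrupts disabled, per the
# guidelines above. Thresholds are heuristic, not kernel-defined.

JIFFY_MS = 1000.0 / 300   # assumes CONFIG_HZ=300, as discussed earlier

def preemption_likely_disabled(runnable_prio, running_prio,
                               runnable_ms, runnable_is_fifo,
                               running_is_fifo):
    if runnable_is_fifo and not running_is_fifo:
        return True                            # FIFO stuck behind SCHED_OTHER
    if runnable_prio + 10 <= running_prio:     # "significantly higher" waiter
        return runnable_ms > 2 * JIFFY_MS      # two jiffies
    if runnable_prio == running_prio:
        return runnable_ms > 20.0              # same priority: 20ms rule
    return False

# SCHED_FIFO thread waiting on a SCHED_OTHER thread: suspicious at once.
print(preemption_likely_disabled(50, 120, 0.5, True, False))    # True
# Equal-priority threads, 5ms runnable: normal scheduling noise.
print(preemption_likely_disabled(120, 120, 5.0, False, False))  # False
```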

<section class="expandable">
<h4 class="showalways">Show Issue: CONFIG_BUS_AUTO_SUSPEND causes significant
jank</h4>

<p>In this issue, we identified a major source of jank in Pixel during bringup.
To follow along with the example, <a href="perf_traces.zip">download the zip
file</a> of traces (which also includes other traces referred to in this
section), unzip the file, and open the trace_30293222.html file in your
browser.</p>

<p>In the trace, locate the SurfaceFlinger EventThread beginning
at 2235.195 ms. When performing bouncing ball tuning, we often saw one dropped
frame when either SurfaceFlinger or a UI-critical thread ran after being
runnable for 6.6ms (two jiffies at CONFIG_HZ=300). Critical SurfaceFlinger
threads and the app's UI thread were SCHED_FIFO at the time.</p>
<p>According to the trace, the thread would wake on a particular CPU, remain
runnable for two jiffies, and get load balanced to a different CPU, where it
would then run. The thread that was running while the UI thread and
SurfaceFlinger were runnable was always a priority 120 kworker in
pm_runtime_work.</p>
<p>While digging through the kernel to see what pm_runtime_work was actually
doing, we discovered the Wi-Fi driver's power management was handled through
pm_runtime_work. We took additional traces with Wi-Fi disabled, and the jank
disappeared. To double-check, we also disabled the Wi-Fi driver's power
management in the kernel and took more traces with Wi-Fi connected, and the jank
was still gone. Qualcomm was then able to find the offending region with
preemption disabled and fix it, and we were able to reenable that option for
launch.</p>
</section>
<hr>

<p>Another option for identifying offending regions is the preemptirqsoff
tracer (see <a href="/devices/tech/debug/ftrace.html#dftrace">Using
dynamic ftrace</a>). This tracer can give much greater insight into the root
cause of an uninterruptible region (such as function names), but requires more
invasive work to enable. While it may have more of a performance impact, it is
definitely worth trying.</p>

<h2 id="workqueues">Incorrect use of workqueues</h2>
<p>Interrupt handlers often need to do work that can run outside of an interrupt
context, enabling work to be farmed out to different threads in the kernel. A
driver developer may notice the kernel has a very convenient system-wide
asynchronous task functionality called <em>workqueues</em> and might use that
for interrupt-related work.</p>

<p>However, workqueues are almost always the wrong answer for this problem
because they are always SCHED_OTHER. Many hardware interrupts are in the
critical path of performance and must be run immediately. Workqueues have no
guarantees about when they will be run. Every time we've seen a workqueue in the
critical path of performance, it's been a source of sporadic jank, regardless of
the device. On Pixel, with a flagship processor, we saw that a single workqueue
could be delayed up to 7ms if the device was under load, depending on scheduler
behavior and other things running on the system.</p>

<p>Instead of a workqueue, drivers that need to handle interrupt-like work
inside a separate thread should create their own SCHED_FIFO kthread. For help
doing this with kthread_work functions, refer to this
<a href="https://android.googlesource.com/kernel/msm/+/1a7a93bd33f48a369de29f6f2b56251127bf6ab4%5E!/">patch</a>.</p>

<h2 id="framework-lock">Framework lock contention</h2>
<p>Framework lock contention can be a source of jank or other performance
issues. It's usually caused by the ActivityManagerService lock but can be seen
in other locks as well. For example, the PowerManagerService lock can impact
screen-on performance. If you're seeing this on your device, there's no good fix
because it can only be improved via architectural improvements to the framework.
However, if you are modifying code that runs inside of system_server, it's
critical to avoid holding locks for a long time, especially the
ActivityManagerService lock.</p>

<h2 id="binder-lock">Binder lock contention</h2>
<p>Historically, binder has had a single global lock. If the thread running a
binder transaction was preempted while holding the lock, no other thread could
perform a binder transaction until the original thread released the lock.
This is bad; binder contention can block everything in the system, including
sending UI updates to the display (UI threads communicate with SurfaceFlinger
via binder).</p>
<p>Android 6.0 included several patches to improve this behavior by disabling
preemption while holding the binder lock. This was safe only because the binder
lock should be held for a few microseconds of actual runtime. This dramatically
improved performance in uncontended situations and prevented contention by
preventing most scheduler switches while the binder lock was held. However,
preemption could not be disabled for the entire runtime of holding the binder
lock, meaning that preemption was enabled for functions that could sleep (such
as copy_from_user), which could cause the same preemption as the original case.
When we sent the patches upstream, they promptly told us that this was the worst
idea in history. (We agreed with them, but we also couldn't argue with the
efficacy of the patches toward preventing jank.)</p>

<h2 id="fd-contention">fd contention within a process</h2>
<p>This is rare. Your jank is probably not caused by this.</p>
<p>That said, if you have multiple threads within a process writing to the same
fd, it is possible to see contention on that fd. However, the only time we saw
this during Pixel bringup was during a test where low-priority threads attempted
to occupy all CPU time while a single high-priority thread was running within
the same process. All threads were writing to the trace marker fd, and the
high-priority thread could get blocked on the trace marker fd if a low-priority
thread was holding the fd lock and was then preempted. When tracing was disabled
from the low-priority threads, there was no performance issue.</p>
<p>We weren't able to reproduce this in any other situation, but it's worth
pointing out as a potential cause of performance issues while tracing.</p>

<h2 id="cpu-idle">Unnecessary CPU idle transitions</h2>
<p>When dealing with IPC, especially multi-process pipelines, it's common to see
variations on the following runtime behavior:</p>
<ol>
<li>Thread A runs on CPU 1.</li>
<li>Thread A wakes up thread B.</li>
<li>Thread B starts running on CPU 2.</li>
<li>Thread A immediately goes to sleep, to be awakened by thread B when thread B
has finished its current work.</li>
</ol>

<p>A common source of overhead is between steps 2 and 3. If CPU 2 is idle, it
must be brought back to an active state before thread B can run. Depending on
the SOC and how deep the idle state is, this could take tens of microseconds
before thread B begins running. If the actual runtime of each side of the IPC is
close enough to the overhead, the overall performance of that pipeline can be
significantly reduced by CPU idle transitions. The most common place for Android
to hit this is around binder transactions, and many services that use binder end
up looking like the situation described above.</p>
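
<p>To see when this overhead starts to matter, compare the real work per side of
the IPC to the idle-exit latency. The tens-of-microseconds wakeup figure comes
from the text above; the specific numbers in this small model are invented, and
actual values are SOC- and idle-state-specific:</p>

```python
# Sketch: fraction of an IPC round trip's wall time spent waiting for
# idle CPUs to wake. "hops" counts idle-exit events per round trip
# (A wakes B, then B wakes A).

def idle_overhead_fraction(work_us_per_side, wakeup_latency_us, hops=2):
    useful = 2 * work_us_per_side          # real work on both sides
    overhead = hops * wakeup_latency_us    # time lost to idle exits
    return overhead / (useful + overhead)

# 100us of real work per side, 50us idle-exit latency each way:
print(round(idle_overhead_fraction(100, 50), 2))   # 0.33 -- a third wasted
# 5ms of work per side: idle exits become noise.
print(round(idle_overhead_fraction(5000, 50), 3))  # 0.01
```

<p>Short, chatty transactions (like many binder calls) sit at the bad end of
this curve, which is why the mitigations below focus on them.</p>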

<p>First, use the <code>wake_up_interruptible_sync()</code> function in your
kernel drivers and support this from any custom scheduler. Treat this as a
requirement, not a hint. Binder uses this today, and it helps a lot with
synchronous binder transactions avoiding unnecessary CPU idle transitions.</p>
<p>Second, ensure your cpuidle transition times are realistic and the cpuidle
governor is taking them into account correctly. If your SOC is thrashing in and
out of its deepest idle state, you won't save power by going to deepest
idle.</p>
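
<p>A useful way to sanity-check those transition times is the break-even
residency: entering a deep idle state only saves energy if the CPU stays there
longer than the time it takes the lower floor power to pay back the transition
cost. The power and energy figures below are invented for illustration, not
taken from any SOC:</p>

```python
# Sketch: break-even residency for a deep idle state. Below this
# residency, entering the deep state costs more energy than it saves.

def break_even_us(shallow_mw, deep_mw, entry_exit_energy_uj):
    """Residency beyond which the deep state wins, in microseconds."""
    saved_mw = shallow_mw - deep_mw        # floor-power savings while idle
    # energy (uJ) = power (mW) * time (ms); solve for the time at which
    # the transition energy is repaid, then convert ms -> us.
    return entry_exit_energy_uj / saved_mw * 1000.0

# e.g. 40mW shallow floor, 5mW deep floor, 70uJ to enter and exit:
print(break_even_us(40, 5, 70))  # 2000.0 -> only worth it for 2ms+ sleeps
```

<p>If the governor's configured target residency is far below this kind of
break-even point, the SOC will thrash in and out of deep idle and pay both the
power and the latency cost.</p>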

<h2 id="logging">Logging</h2>
<p>Logging is not free for CPU cycles or memory, so don't spam the log buffer.
Logging costs cycles in your application (directly) and in the log daemon.
Remove any debugging logs before shipping your device.</p>

<h2 id="io-issues">I/O issues</h2>
<p>I/O operations are common sources of jitter. If a thread accesses a
memory-mapped file and the page is not in the page cache, it faults and reads
the page from disk. This blocks the thread (usually for 10+ ms) and, if it
happens in the critical path of UI rendering, can result in jank. There are too
many causes of I/O operations to discuss here, but check the following locations
when trying to improve I/O behavior:</p>
<ul>
<li><strong>PinnerService</strong>. Added in Android 7.0, PinnerService enables
the framework to lock some files in the page cache. This removes the memory from
use by any other process, but if there are some files that are known a priori to
be used regularly, it can be effective to mlock those files.<br><br>
On Pixel and Nexus 6P devices running Android 7.0, we mlocked four files:
<ul>
<li>/system/framework/arm64/boot-framework.oat</li>
<li>/system/framework/oat/arm64/services.odex</li>
<li>/system/framework/arm64/boot.oat</li>
<li>/system/framework/arm64/boot-core-libart.oat</li>
</ul>
These files are constantly in use by most applications and system_server, so
they should not be paged out. In particular, we've found that if any of those
files are paged out, they will be paged back in and cause jank when switching
from a heavyweight application.
</li>
<li><strong>Encryption</strong>. Another possible cause of I/O problems. We find
that inline encryption offers the best performance when compared to CPU-based
encryption or using a hardware block accessible via DMA. Most importantly,
inline encryption reduces the jitter associated with I/O, especially when
compared to CPU-based encryption. Because fetches to the page cache are often in
the critical path of UI rendering, CPU-based encryption introduces additional
CPU load in the critical path, which adds more jitter than just the I/O fetch.
<br><br>
DMA-based hardware encryption engines have a similar problem, since the kernel
has to spend cycles managing that work even if other critical work is available
to run. We strongly recommend that any SOC vendor building new hardware include
support for inline encryption.</li>
</ul>

<h2 id="small-task">Aggressive small task packing</h2>
<p>Some schedulers offer support for packing small tasks onto single CPU cores
to try to reduce power consumption by keeping more CPUs idle for longer. While
this works well for throughput and power consumption, it can be
<em>catastrophic</em> to latency. There are several short-running threads in the
critical path of UI rendering that can be considered small; if these threads are
delayed as they are slowly migrated to other CPUs, it <strong>will</strong>
cause jank. We recommend using small task packing very conservatively.</p>

<h2 id="page-cache">Page cache thrashing</h2>
<p>A device without enough free memory may suddenly become extremely sluggish
while performing a long-running operation, such as opening a new application. A
trace of the application may reveal it is consistently blocked in I/O during a
particular run even when it usually is not. This is usually a sign of page
cache thrashing, especially on devices with less memory.</p>

<p>One way to identify this is to take a systrace using the pagecache tag and
feed that trace to the script at
<code>system/extras/pagecache/pagecache.py</code>. pagecache.py translates
individual requests to map files into the page cache into aggregate per-file
statistics. If you find that more bytes of a file have been read than the total
size of that file on disk, you are definitely hitting page cache thrashing.</p>
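
<p>The test described above (bytes read exceeding file size) is easy to express
once per-file statistics have been extracted from the trace. This sketch assumes
that extraction has already happened; the sample paths and byte counts are
invented:</p>

```python
# Sketch: a pagecache.py-style thrashing check -- if more bytes of a
# file were read than the file contains, pages were evicted and
# re-fetched during the run.

def thrashing_files(stats):
    """stats: {path: (bytes_read, file_size_bytes)} -> refault ratio."""
    flagged = {}
    for path, (bytes_read, size) in stats.items():
        if size > 0 and bytes_read > size:
            flagged[path] = bytes_read / size   # >1.0 means re-reads
    return flagged

stats = {
    "/system/framework/arm64/boot.oat": (150_000_000, 50_000_000),
    "/data/app/example.apk": (4_000_000, 12_000_000),
}
print(thrashing_files(stats))
# {'/system/framework/arm64/boot.oat': 3.0} -- read 3x its own size
```

<p>A ratio well above 1.0 on a hot file is a strong signal that the working set
exceeds available page cache, which is exactly the condition described next.</p>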

<p>What this means is that the working set required by your workload (typically
a single application plus system_server) is greater than the amount of memory
available to the page cache on your device. As a result, as one part of the
workload gets the data it needs into the page cache, another part that will be
used in the near future is evicted and has to be fetched again, causing the
problem to recur until the load has completed. This is the fundamental cause of
performance problems when not enough memory is available on a device.</p>

<p>There's no foolproof way to fix page cache thrashing, but there are a few
ways to try to improve this on a given device.</p>
<ul>
<li>Use less memory in persistent processes. The less memory used by persistent
processes, the more memory available to applications and the page cache.</li>
<li>Audit the carveouts you have for your device to ensure you are not
unnecessarily removing memory from the OS. We've seen situations where carveouts
used for debugging were accidentally left in shipping kernel configurations,
consuming tens of megabytes of memory. This can make the difference between
hitting page cache thrashing and not, especially on devices with less
memory.</li>
<li>If you're seeing page cache thrashing in system_server on critical files,
consider pinning those files. This will increase memory pressure elsewhere, but
it may modify behavior enough to avoid thrashing.</li>
<li>Retune lowmemorykiller to try to keep more memory free. lowmemorykiller's
thresholds are based on both absolute free memory and the page cache, so
increasing the threshold at which processes at a given oom_adj level are killed
may result in better behavior at the expense of increased background app
death.</li>
<li>Try using ZRAM. We use ZRAM on Pixel, even though Pixel has 4GB of RAM,
because it can help with rarely used dirty pages.</li>
</ul>

</body>
</html>