# bcc Tutorial

This tutorial covers how to use [bcc](https://github.com/iovisor/bcc) tools to quickly solve performance, troubleshooting, and networking issues. If you want to develop new bcc tools, see [tutorial_bcc_python_developer.md](tutorial_bcc_python_developer.md) for that tutorial.

It is assumed for this tutorial that bcc is already installed, and you can run tools like execsnoop successfully. See [INSTALL.md](../INSTALL.md). This uses enhancements added to the Linux 4.x series.

## Observability

Some quick wins.

### 0. Before bcc

Before using bcc, you should start with the Linux basics. One reference is the [Linux Performance Analysis in 60s](http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html) post, which covers these commands:

1. uptime
1. dmesg | tail
1. vmstat 1
1. mpstat -P ALL 1
1. pidstat 1
1. iostat -xz 1
1. free -m
1. sar -n DEV 1
1. sar -n TCP,ETCP 1
1. top

### 1. General Performance

Here is a generic checklist for performance investigations with bcc, first as a list, then in detail:

1. execsnoop
1. opensnoop
1. ext4slower (or btrfs\*, xfs\*, zfs\*)
1. biolatency
1. biosnoop
1. cachestat
1. tcpconnect
1. tcpaccept
1. tcpretrans
1. runqlat
1. profile

These tools may be installed on your system under /usr/share/bcc/tools, or you can run them from the bcc github repo under /tools, where they have a .py extension. Browse the 50+ tools available for more analysis options.

#### 1.1. execsnoop

```
# ./execsnoop
PCOMM            PID    RET ARGS
supervise        9660     0 ./run
supervise        9661     0 ./run
mkdir            9662     0 /bin/mkdir -p ./main
run              9663     0 ./run
[...]
```

execsnoop prints one line of output for each new process. Check for short-lived processes: these can consume CPU resources without showing up in most monitoring tools that periodically take snapshots of which processes are running.

It works by tracing exec(), not fork(), so it will catch many types of new processes but not all (eg, it won't see an application launching worker processes that don't exec() anything else).
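
Options can narrow the output (flags vary by bcc version; see `execsnoop -h`). A sketch:

```
# ./execsnoop -t               # include a timestamp column
# ./execsnoop -x               # also include failed exec()s
# ./execsnoop -n main          # only print commands matching "main"
```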

More [examples](../tools/execsnoop_example.txt).

#### 1.2. opensnoop

```
# ./opensnoop
PID    COMM      FD ERR PATH
1565   redis-server        5   0 /proc/1565/stat
1565   redis-server        5   0 /proc/1565/stat
1565   redis-server        5   0 /proc/1565/stat
1603   snmpd               9   0 /proc/net/dev
1603   snmpd              11   0 /proc/net/if_inet6
1603   snmpd              -1   2 /sys/class/net/eth0/device/vendor
1603   snmpd              11   0 /proc/sys/net/ipv4/neigh/eth0/retrans_time_ms
1603   snmpd              11   0 /proc/sys/net/ipv6/neigh/eth0/retrans_time_ms
1603   snmpd              11   0 /proc/sys/net/ipv6/conf/eth0/forwarding
[...]
```

opensnoop prints one line of output for each open() syscall, including details.

Files that are opened can tell you a lot about how applications work: identifying their data files, config files, and log files. Sometimes applications misbehave and perform poorly when they are constantly attempting to read files that do not exist. opensnoop gives you a quick look.
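
In the output above, the snmpd open of `/sys/class/net/eth0/device/vendor` returned FD -1 with ERR 2 (ENOENT: no such file or directory). Failed opens like these are often the interesting ones, and can usually be isolated (flags vary by version; see `opensnoop -h`):

```
# ./opensnoop -x               # only show failed opens
# ./opensnoop -p 1603          # only trace PID 1603
```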

More [examples](../tools/opensnoop_example.txt).

#### 1.3. ext4slower (or btrfs\*, xfs\*, zfs\*)

```
# ./ext4slower
Tracing ext4 operations slower than 10 ms
TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
06:35:01 cron           16464  R 1249    0          16.05 common-auth
06:35:01 cron           16463  R 1249    0          16.04 common-auth
06:35:01 cron           16465  R 1249    0          16.03 common-auth
06:35:01 cron           16465  R 4096    0          10.62 login.defs
06:35:01 cron           16464  R 4096    0          10.61 login.defs
```

ext4slower traces the ext4 file system and times common operations, and then only prints those that exceed a threshold.

This is great for identifying or exonerating one type of performance issue: individually slow disk I/O via the file system. Disks process I/O asynchronously, and it can be difficult to associate latency at that layer with the latency applications experience. Tracing higher up in the kernel stack, at the VFS -> file system interface, will more closely match what an application suffers. Use this tool to identify if file system latency exceeds a given threshold.

Similar tools exist in bcc for other file systems: btrfsslower, xfsslower, and zfsslower. There is also fileslower, which works at the VFS layer and traces everything (although at somewhat higher overhead).
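
The threshold defaults to 10 ms and can be given as an argument; a threshold of 0 traces all operations:

```
# ./ext4slower 1               # trace ext4 operations slower than 1 ms
# ./ext4slower 0               # trace all ext4 operations (verbose)
```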

More [examples](../tools/ext4slower_example.txt).

#### 1.4. biolatency

```
# ./biolatency
Tracing block device I/O... Hit Ctrl-C to end.
^C
     usecs           : count     distribution
       0 -> 1        : 0        |                                      |
       2 -> 3        : 0        |                                      |
       4 -> 7        : 0        |                                      |
       8 -> 15       : 0        |                                      |
      16 -> 31       : 0        |                                      |
      32 -> 63       : 0        |                                      |
      64 -> 127      : 1        |                                      |
     128 -> 255      : 12       |********                              |
     256 -> 511      : 15       |**********                            |
     512 -> 1023     : 43       |*******************************       |
    1024 -> 2047     : 52       |**************************************|
    2048 -> 4095     : 47       |**********************************    |
    4096 -> 8191     : 52       |**************************************|
    8192 -> 16383    : 36       |**************************            |
   16384 -> 32767    : 15       |**********                            |
   32768 -> 65535    : 2        |*                                     |
   65536 -> 131071   : 2        |*                                     |
```

biolatency traces disk I/O latency (time from device issue to completion), and when the tool ends (Ctrl-C, or after a given interval), it prints a histogram summary of the latency.

This is great for understanding disk I/O latency beyond the average times given by tools like iostat. I/O latency outliers will be visible at the end of the distribution, as will multi-modal distributions.
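
Instead of tracing until Ctrl-C, you can print periodic summaries (see `biolatency -h` for your version's options):

```
# ./biolatency 1 5             # print 1-second summaries, 5 times, then exit
# ./biolatency -m              # histogram in milliseconds instead of microseconds
```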

More [examples](../tools/biolatency_example.txt).

#### 1.5. biosnoop

```
# ./biosnoop
TIME(s)        COMM           PID    DISK    T  SECTOR    BYTES   LAT(ms)
0.000004001    supervise      1950   xvda1   W  13092560  4096       0.74
0.000178002    supervise      1950   xvda1   W  13092432  4096       0.61
0.001469001    supervise      1956   xvda1   W  13092440  4096       1.24
0.001588002    supervise      1956   xvda1   W  13115128  4096       1.09
1.022346001    supervise      1950   xvda1   W  13115272  4096       0.98
1.022568002    supervise      1950   xvda1   W  13188496  4096       0.93
[...]
```

biosnoop prints a line of output for each disk I/O, with details including latency (time from device issue to completion).

This allows you to examine disk I/O in more detail, and look for time-ordered patterns (eg, reads queueing behind writes). Note that the output will be verbose if your system performs a high rate of disk I/O.
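
For busy systems, it can help to capture the output to a file for post-processing, or to filter it with standard tools. For example, using the DISK column (field 4) of the output above:

```
# ./biosnoop > out.log                   # capture for later analysis
# ./biosnoop | awk '$4 == "xvda1"'       # only show I/O to the xvda1 device
```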

More [examples](../tools/biosnoop_example.txt).

#### 1.6. cachestat

```
# ./cachestat
    HITS   MISSES  DIRTIES  READ_HIT% WRITE_HIT%   BUFFERS_MB  CACHED_MB
    1074       44       13      94.9%       2.9%            1        223
    2195      170        8      92.5%       6.8%            1        143
     182       53       56      53.6%       1.3%            1        143
   62480    40960    20480      40.6%      19.8%            1        223
       7        2        5      22.2%      22.2%            1        223
     348        0        0     100.0%       0.0%            1        223
[...]
```

cachestat prints a one line summary every second (or every custom interval) showing statistics from the file system cache.

Use this to identify a low cache hit ratio or a high rate of misses: one lead for performance tuning.
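
An interval (and optionally a count) can usually be given as arguments (check `cachestat -h` on your version):

```
# ./cachestat 5                # 5-second summaries instead of 1-second
# ./cachestat 1 30             # 1-second summaries, 30 times, then exit
```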

More [examples](../tools/cachestat_example.txt).

#### 1.7. tcpconnect

```
# ./tcpconnect
PID    COMM         IP SADDR            DADDR            DPORT
1479   telnet       4  127.0.0.1        127.0.0.1        23
1469   curl         4  10.201.219.236   54.245.105.25    80
1469   curl         4  10.201.219.236   54.67.101.145    80
1991   telnet       6  ::1              ::1              23
2015   ssh          6  fe80::2000:bff:fe82:3ac fe80::2000:bff:fe82:3ac 22
[...]
```

tcpconnect prints one line of output for every active TCP connection (eg, via connect()), with details including source and destination addresses.

Look for unexpected connections that may point to inefficiencies in application configuration, or an intruder.
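
When hunting for a particular connection, the output can be narrowed down (flag support varies by version; see `tcpconnect -h`):

```
# ./tcpconnect -p 1469         # only trace connections by PID 1469
# ./tcpconnect -P 80,443       # only trace connections to ports 80 and 443
```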

More [examples](../tools/tcpconnect_example.txt).

#### 1.8. tcpaccept

```
# ./tcpaccept
PID    COMM         IP RADDR            LADDR            LPORT
907    sshd         4  192.168.56.1     192.168.56.102   22
907    sshd         4  127.0.0.1        127.0.0.1        22
5389   perl         6  1234:ab12:2040:5020:2299:0:5:0 1234:ab12:2040:5020:2299:0:5:0 7001
[...]
```

tcpaccept prints one line of output for every passive TCP connection (eg, via accept()), with details including source and destination addresses.

Look for unexpected connections that may point to inefficiencies in application configuration, or an intruder.
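
As with tcpconnect, filters and timestamps can help when correlating accepts with other events (flags vary by version; see `tcpaccept -h`):

```
# ./tcpaccept -t               # include a timestamp column
# ./tcpaccept -p 907           # only trace accepts by PID 907 (sshd above)
```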

More [examples](../tools/tcpaccept_example.txt).

#### 1.9. tcpretrans

```
# ./tcpretrans
TIME     PID    IP LADDR:LPORT          T> RADDR:RPORT          STATE
01:55:05 0      4  10.153.223.157:22    R> 69.53.245.40:34619   ESTABLISHED
01:55:05 0      4  10.153.223.157:22    R> 69.53.245.40:34619   ESTABLISHED
01:55:17 0      4  10.153.223.157:22    R> 69.53.245.40:22957   ESTABLISHED
[...]
```

tcpretrans prints one line of output for every TCP retransmit packet, with details including source and destination addresses, and the kernel state of the TCP connection.

TCP retransmissions cause latency and throughput issues. For ESTABLISHED retransmits, look for patterns pointing to problems with the network. For SYN_SENT retransmits, this may point to target kernel CPU saturation and kernel packet drops.
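
If your version supports it, `-l` also traces tail loss probe attempts, which show an `L>` type instead of `R>`:

```
# ./tcpretrans -l              # include tail loss probe (TLP) attempts
```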

More [examples](../tools/tcpretrans_example.txt).

#### 1.10. runqlat

```
# ./runqlat
Tracing run queue latency... Hit Ctrl-C to end.
^C
     usecs               : count     distribution
         0 -> 1          : 233      |***********                             |
         2 -> 3          : 742      |************************************    |
         4 -> 7          : 203      |**********                              |
         8 -> 15         : 173      |********                                |
        16 -> 31         : 24       |*                                       |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 30       |*                                       |
       128 -> 255        : 6        |                                        |
       256 -> 511        : 3        |                                        |
       512 -> 1023       : 5        |                                        |
      1024 -> 2047       : 27       |*                                       |
      2048 -> 4095       : 30       |*                                       |
      4096 -> 8191       : 20       |                                        |
      8192 -> 16383      : 29       |*                                       |
     16384 -> 32767      : 809      |****************************************|
     32768 -> 65535      : 64       |***                                     |
```

runqlat times how long threads were waiting on the CPU run queues, and prints this as a histogram.

This can help quantify time lost waiting for a turn on CPU during periods of CPU saturation.
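
Like biolatency, runqlat can print periodic summaries and switch units (see `runqlat -h`):

```
# ./runqlat 1 5                # print 1-second summaries, 5 times, then exit
# ./runqlat -m                 # histogram in milliseconds instead of microseconds
```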

More [examples](../tools/runqlat_example.txt).

#### 1.11. profile

```
# ./profile
Sampling at 49 Hertz of all threads by user + kernel stack... Hit Ctrl-C to end.
^C
    00007f31d76c3251 [unknown]
    47a2c1e752bf47f7 [unknown]
    -                sign-file (8877)
        1

    ffffffff813d0af8 __clear_user
    ffffffff813d5277 iov_iter_zero
    ffffffff814ec5f2 read_iter_zero
    ffffffff8120be9d __vfs_read
    ffffffff8120c385 vfs_read
    ffffffff8120d786 sys_read
    ffffffff817cc076 entry_SYSCALL_64_fastpath
    00007fc5652ad9b0 read
    -                dd (25036)
        4

    0000000000400542 func_a
    0000000000400598 main
    00007f12a133e830 __libc_start_main
    083e258d4c544155 [unknown]
    -                func_ab (13549)
        5

[...]

    ffffffff8105eb66 native_safe_halt
    ffffffff8103659e default_idle
    ffffffff81036d1f arch_cpu_idle
    ffffffff810bba5a default_idle_call
    ffffffff810bbd07 cpu_startup_entry
    ffffffff8104df55 start_secondary
    -                swapper/1 (0)
        75
```

profile is a CPU profiler, which takes samples of stack traces at timed intervals, and prints a summary of unique stack traces and a count of their occurrence.

Use this tool to understand the code paths that are consuming CPU resources.
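
The sample rate and duration can be adjusted, and folded output can be emitted for building flame graphs with the [FlameGraph](https://github.com/brendangregg/FlameGraph) scripts (option names vary by version; see `profile -h`):

```
# ./profile -F 99 10           # sample at 99 Hertz for 10 seconds, then print
# ./profile -f 30 > out.stacks # 30 seconds, folded output for flamegraph.pl
```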

More [examples](../tools/profile_example.txt).

### 2. Observability with Generic Tools

In addition to the above tools for performance tuning, below is a checklist of bcc generic tools, first as a list, then in detail:

1. trace
1. argdist
1. funccount

These generic tools can provide visibility to help solve your specific problems.

#### 2.1. trace
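
trace attaches ad-hoc probes to kernel and user functions and prints custom messages when they fire. As a quick first example (taken from the trace documentation; syscall entry symbol names such as `sys_read` vs `SyS_read` vary by kernel version), this prints a message for every read() larger than 20000 bytes:

```
# trace 'sys_read (arg3 > 20000) "read %d bytes", arg3'
```

Three worked examples follow.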

##### Example 1

Suppose you want to track file ownership changes. There are three syscalls, `chown`, `fchown` and `lchown`, which users can use to change file ownership. The corresponding syscall entry is `SyS_[f|l]chown`. The following command can be used to print out the syscall parameters and the calling process's user id. You can use the `id` command to find the uid of a particular user.

```
$ trace.py \
  'p::SyS_chown "file = %s, to_uid = %d, to_gid = %d, from_uid = %d", arg1, arg2, arg3, $uid' \
  'p::SyS_fchown "fd = %d, to_uid = %d, to_gid = %d, from_uid = %d", arg1, arg2, arg3, $uid' \
  'p::SyS_lchown "file = %s, to_uid = %d, to_gid = %d, from_uid = %d", arg1, arg2, arg3, $uid'
PID    TID    COMM         FUNC             -
1269255 1269255 python3.6    SyS_lchown       file = /tmp/dotsync-usisgezu/tmp, to_uid = 128203, to_gid = 100, from_uid = 128203
1269441 1269441 zstd         SyS_chown        file = /tmp/dotsync-vic7ygj0/dotsync-package.zst, to_uid = 128203, to_gid = 100, from_uid = 128203
1269255 1269255 python3.6    SyS_lchown       file = /tmp/dotsync-a40zd7ev/tmp, to_uid = 128203, to_gid = 100, from_uid = 128203
1269442 1269442 zstd         SyS_chown        file = /tmp/dotsync-gzp413o_/dotsync-package.zst, to_uid = 128203, to_gid = 100, from_uid = 128203
1269255 1269255 python3.6    SyS_lchown       file = /tmp/dotsync-whx4fivm/tmp/.bash_profile, to_uid = 128203, to_gid = 100, from_uid = 128203
```

##### Example 2

Suppose you want to count nonvoluntary context switches (`nvcsw`) in your bpf-based performance monitoring tools, and you are not sure of the proper method. `/proc/<pid>/status` already tells you the number (`nonvoluntary_ctxt_switches`) for a pid, and you can use `trace.py` for a quick experiment to verify your method. In the kernel source code, `nvcsw` is counted in file `linux/kernel/sched/core.c`, function `__schedule`, under the condition
```
!(!preempt && prev->state) // i.e., preempt || !prev->state
```

The `__schedule` function is marked `notrace`, so the best place to evaluate the above condition seems to be the `sched:sched_switch` tracepoint, which is called inside `__schedule` and defined in `linux/include/trace/events/sched.h`. `trace.py` already provides `args` as a pointer to the tracepoint's `TP_STRUCT__entry`. The above condition in function `__schedule` can be represented as
```
args->prev_state == TASK_STATE_MAX || args->prev_state == 0
```

The command below can be used to count involuntary context switches (per process or per thread) and compare against `/proc/<pid>/status` or `/proc/<pid>/task/<task_id>/status` for correctness; in typical cases, involuntary context switches are not very common.
```
$ trace.py -p 1134138 't:sched:sched_switch (args->prev_state == TASK_STATE_MAX || args->prev_state == 0)'
PID    TID    COMM         FUNC
1134138 1134140 contention_test sched_switch
1134138 1134142 contention_test sched_switch
...
$ trace.py -L 1134140 't:sched:sched_switch (args->prev_state == TASK_STATE_MAX || args->prev_state == 0)'
PID    TID    COMM         FUNC
1134138 1134140 contention_test sched_switch
1134138 1134140 contention_test sched_switch
...
```

##### Example 3

This example is related to issues [1231](https://github.com/iovisor/bcc/issues/1231) and [1516](https://github.com/iovisor/bcc/issues/1516), where uprobes did not work at all in certain cases. First, you can run `strace` as below:

```
$ strace trace.py 'r:bash:readline "%s", retval'
...
perf_event_open(0x7ffd968212f0, -1, 0, -1, 0x8 /* PERF_FLAG_??? */) = -1 EIO (Input/output error)
...
```

The `perf_event_open` syscall returned `-EIO`. Digging into the uprobe-related kernel code in the `kernel/trace` and `kernel/events` directories and searching for `EIO`, the function `uprobe_register` looks the most suspicious. Let us find out whether this function is called, and if so, what its return value is. In one terminal, use the following command to print the return value of `uprobe_register`:
```
$ trace.py 'r::uprobe_register "ret = %d", retval'
```
In another terminal, run the same bash uretprobe tracing example, and you should get:
```
$ trace.py 'r::uprobe_register "ret = %d", retval'
PID    TID    COMM         FUNC             -
1041401 1041401 python2.7    uprobe_register  ret = -5
```

The `-5` error code is `EIO`. This confirms that the following code in function `uprobe_register` is the culprit:
```
 if (!inode->i_mapping->a_ops->readpage && !shmem_mapping(inode->i_mapping))
        return -EIO;
```
The `shmem_mapping` function is defined as
```
bool shmem_mapping(struct address_space *mapping)
{
        return mapping->a_ops == &shmem_aops;
}
```

To confirm the theory, find out what `inode->i_mapping->a_ops` is with the following command:
```
$ trace.py -I 'linux/fs.h' 'p::uprobe_register(struct inode *inode) "a_ops = %llx", inode->i_mapping->a_ops'
PID    TID    COMM         FUNC             -
814288 814288 python2.7    uprobe_register  a_ops = ffffffff81a2adc0
^C$ grep ffffffff81a2adc0 /proc/kallsyms
ffffffff81a2adc0 R empty_aops
```

The kernel symbol `empty_aops` does not have `readpage` defined, hence the above suspicious condition is true. Further examination of the kernel source code shows that `overlayfs` does not provide its own `a_ops`, while some other file systems do (e.g., ext4 defines `ext4_da_aops`, which defines `readpage`). Hence, uprobes work fine on ext4 but not on overlayfs.

More [examples](../tools/trace_example.txt).

#### 2.2. argdist
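
argdist collects the values of function parameters (or return values) into frequency counts or histograms, which lets you characterize a distribution without writing a new tool. As a sketch (an example from its documentation; probe availability depends on your kernel), this prints a power-of-2 histogram of kernel `__kmalloc()` allocation sizes:

```
# ./argdist -H 'p::__kmalloc(u64 size):u64:size'
```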

More [examples](../tools/argdist_example.txt).

#### 2.3. funccount
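
funccount counts events matching a pattern: kernel or user function calls, tracepoints, or USDT probes (support varies by version). For example, to count kernel VFS calls while a workload runs:

```
# ./funccount 'vfs_*'          # count kernel functions matching "vfs_*" until Ctrl-C
```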

More [examples](../tools/funccount_example.txt).

## Networking

To do.