# bcc Tutorial

This tutorial covers how to use [bcc](https://github.com/iovisor/bcc) tools to quickly solve performance, troubleshooting, and networking issues. If you want to develop new bcc tools, see [tutorial_bcc_python_developer.md](tutorial_bcc_python_developer.md) for that tutorial.

It is assumed for this tutorial that bcc is already installed, and that you can run tools like execsnoop successfully. See [INSTALL.md](../INSTALL.md). This tutorial uses enhancements added to the Linux 4.x series.

## Observability

Some quick wins.

### 0. Before bcc

Before using bcc, you should start with the Linux basics. One reference is the [Linux Performance Analysis in 60s](http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html) post, which covers these commands:

1. uptime
1. dmesg | tail
1. vmstat 1
1. mpstat -P ALL 1
1. pidstat 1
1. iostat -xz 1
1. free -m
1. sar -n DEV 1
1. sar -n TCP,ETCP 1
1. top

### 1. General Performance

Here is a generic checklist for performance investigations with bcc, first as a list, then in detail:

1. execsnoop
1. opensnoop
1. ext4slower (or btrfs\*, xfs\*, zfs\*)
1. biolatency
1. biosnoop
1. cachestat
1. tcpconnect
1. tcpaccept
1. tcpretrans
1. runqlat
1. profile

These tools may be installed on your system under /usr/share/bcc/tools, or you can run them from the bcc github repo under /tools, where they have a .py extension. Browse the 50+ tools available for more analysis options.

#### 1.1. execsnoop

```
# ./execsnoop
PCOMM            PID    RET ARGS
supervise        9660     0 ./run
supervise        9661     0 ./run
mkdir            9662     0 /bin/mkdir -p ./main
run              9663     0 ./run
[...]
```

execsnoop prints one line of output for each new process. Check for short-lived processes: these can consume CPU resources yet not show up in most monitoring tools that periodically take snapshots of which processes are running.

It works by tracing exec(), not fork(), so it will catch many types of new processes but not all (eg, it won't see an application launching worker processes that don't exec() anything else).

More [examples](../tools/execsnoop_example.txt).

#### 1.2. opensnoop

```
# ./opensnoop
PID    COMM               FD ERR PATH
1565   redis-server        5   0 /proc/1565/stat
1565   redis-server        5   0 /proc/1565/stat
1565   redis-server        5   0 /proc/1565/stat
1603   snmpd               9   0 /proc/net/dev
1603   snmpd              11   0 /proc/net/if_inet6
1603   snmpd              -1   2 /sys/class/net/eth0/device/vendor
1603   snmpd              11   0 /proc/sys/net/ipv4/neigh/eth0/retrans_time_ms
1603   snmpd              11   0 /proc/sys/net/ipv6/neigh/eth0/retrans_time_ms
1603   snmpd              11   0 /proc/sys/net/ipv6/conf/eth0/forwarding
[...]
```

opensnoop prints one line of output for each open() syscall, including details.

Files that are opened can tell you a lot about how applications work: identifying their data files, config files, and log files. Applications can misbehave and perform poorly when they are constantly attempting to read files that do not exist. opensnoop gives you a quick look.
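For example, you can narrow the output to only the failed opens that often indicate an application probing for files that do not exist (options as in recent bcc versions; run ./opensnoop -h to confirm on your system):

```
# show only failed opens:
./opensnoop -x

# show failed opens for one process only (the PID is illustrative):
./opensnoop -x -p 1565
```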
More [examples](../tools/opensnoop_example.txt).

#### 1.3. ext4slower (or btrfs\*, xfs\*, zfs\*)

```
# ./ext4slower
Tracing ext4 operations slower than 10 ms
TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
06:35:01 cron           16464  R 1249    0          16.05 common-auth
06:35:01 cron           16463  R 1249    0          16.04 common-auth
06:35:01 cron           16465  R 1249    0          16.03 common-auth
06:35:01 cron           16465  R 4096    0          10.62 login.defs
06:35:01 cron           16464  R 4096    0          10.61 login.defs
```

ext4slower traces the ext4 file system, times common operations, and then prints only those that exceed a threshold.

This is great for identifying or exonerating one type of performance issue: individually slow disk I/O, via the file system. Disks process I/O asynchronously, and it can be difficult to associate latency at that layer with the latency applications experience. Tracing higher up in the kernel stack, at the VFS -> file system interface, will more closely match what an application suffers. Use this tool to identify whether file system latency exceeds a given threshold.

Similar tools exist in bcc for other file systems: btrfsslower, xfsslower, and zfsslower. There is also fileslower, which works at the VFS layer and traces everything (although at some higher overhead).

More [examples](../tools/ext4slower_example.txt).

#### 1.4. biolatency

```
# ./biolatency
Tracing block device I/O... Hit Ctrl-C to end.
^C
     usecs           : count     distribution
         0 -> 1      : 0        |                                      |
         2 -> 3      : 0        |                                      |
         4 -> 7      : 0        |                                      |
         8 -> 15     : 0        |                                      |
        16 -> 31     : 0        |                                      |
        32 -> 63     : 0        |                                      |
        64 -> 127    : 1        |                                      |
       128 -> 255    : 12       |********                              |
       256 -> 511    : 15       |**********                            |
       512 -> 1023   : 43       |*******************************       |
      1024 -> 2047   : 52       |**************************************|
      2048 -> 4095   : 47       |**********************************    |
      4096 -> 8191   : 52       |**************************************|
      8192 -> 16383  : 36       |**************************            |
     16384 -> 32767  : 15       |**********                            |
     32768 -> 65535  : 2        |*                                     |
     65536 -> 131071 : 2        |*                                     |
```

biolatency traces disk I/O latency (time from device issue to completion), and when the tool ends (Ctrl-C, or after a given interval), it prints a histogram summary of the latency.

This is great for understanding disk I/O latency beyond the average times given by tools like iostat. I/O latency outliers will be visible at the end of the distribution, as will multi-modal distributions.
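For example, to print the distribution in milliseconds at a fixed interval, or to separate OS queueing time from device time (options as in recent bcc versions; check ./biolatency -h):

```
# print the latency distribution in milliseconds, once per second, five times:
./biolatency -m 1 5

# include time the I/O spent queued in the OS, not just on the device:
./biolatency -Q
```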
More [examples](../tools/biolatency_example.txt).

#### 1.5. biosnoop

```
# ./biosnoop
TIME(s)        COMM           PID    DISK    T  SECTOR    BYTES   LAT(ms)
0.000004001    supervise      1950   xvda1   W  13092560  4096       0.74
0.000178002    supervise      1950   xvda1   W  13092432  4096       0.61
0.001469001    supervise      1956   xvda1   W  13092440  4096       1.24
0.001588002    supervise      1956   xvda1   W  13115128  4096       1.09
1.022346001    supervise      1950   xvda1   W  13115272  4096       0.98
1.022568002    supervise      1950   xvda1   W  13188496  4096       0.93
[...]
```

biosnoop prints a line of output for each disk I/O, with details including latency (time from device issue to completion).

This allows you to examine disk I/O in more detail, and look for time-ordered patterns (eg, reads queueing behind writes). Note that the output will be verbose if your system performs a high rate of disk I/O.

More [examples](../tools/biosnoop_example.txt).

#### 1.6. cachestat

```
# ./cachestat
    HITS   MISSES  DIRTIES  READ_HIT% WRITE_HIT%   BUFFERS_MB  CACHED_MB
    1074       44       13      94.9%       2.9%            1        223
    2195      170        8      92.5%       6.8%            1        143
     182       53       56      53.6%       1.3%            1        143
   62480    40960    20480      40.6%      19.8%            1        223
       7        2        5      22.2%      22.2%            1        223
     348        0        0     100.0%       0.0%            1        223
[...]
```

cachestat prints a one-line summary every second (or every custom interval) showing statistics from the file system cache.

Use this to identify a low cache hit ratio and a high rate of misses, which gives one lead for performance tuning.

More [examples](../tools/cachestat_example.txt).

#### 1.7. tcpconnect

```
# ./tcpconnect
PID    COMM         IP SADDR            DADDR            DPORT
1479   telnet       4  127.0.0.1        127.0.0.1        23
1469   curl         4  10.201.219.236   54.245.105.25    80
1469   curl         4  10.201.219.236   54.67.101.145    80
1991   telnet       6  ::1              ::1              23
2015   ssh          6  fe80::2000:bff:fe82:3ac fe80::2000:bff:fe82:3ac 22
[...]
```

tcpconnect prints one line of output for every active TCP connection (eg, via connect()), with details including source and destination addresses.

Look for unexpected connections that may point to inefficiencies in application configuration, or an intruder.

More [examples](../tools/tcpconnect_example.txt).

#### 1.8. tcpaccept

```
# ./tcpaccept
PID    COMM         IP RADDR            LADDR            LPORT
907    sshd         4  192.168.56.1     192.168.56.102   22
907    sshd         4  127.0.0.1        127.0.0.1        22
5389   perl         6  1234:ab12:2040:5020:2299:0:5:0 1234:ab12:2040:5020:2299:0:5:0 7001
[...]
```

tcpaccept prints one line of output for every passive TCP connection (eg, via accept()), with details including source and destination addresses.

Look for unexpected connections that may point to inefficiencies in application configuration, or an intruder.

More [examples](../tools/tcpaccept_example.txt).

#### 1.9. tcpretrans

```
# ./tcpretrans
TIME     PID    IP LADDR:LPORT          T> RADDR:RPORT          STATE
01:55:05 0      4  10.153.223.157:22    R> 69.53.245.40:34619   ESTABLISHED
01:55:05 0      4  10.153.223.157:22    R> 69.53.245.40:34619   ESTABLISHED
01:55:17 0      4  10.153.223.157:22    R> 69.53.245.40:22957   ESTABLISHED
[...]
```

tcpretrans prints one line of output for every TCP retransmit packet, with details including source and destination addresses, and the kernel state of the TCP connection.

TCP retransmissions cause latency and throughput issues. For retransmits on ESTABLISHED connections, look for patterns with external networks. For SYN_SENT, this may point to target kernel CPU saturation and kernel packet drops.

More [examples](../tools/tcpretrans_example.txt).

#### 1.10. runqlat

```
# ./runqlat
Tracing run queue latency... Hit Ctrl-C to end.
^C
     usecs           : count     distribution
         0 -> 1      : 233      |***********                             |
         2 -> 3      : 742      |************************************    |
         4 -> 7      : 203      |**********                              |
         8 -> 15     : 173      |********                                |
        16 -> 31     : 24       |*                                       |
        32 -> 63     : 0        |                                        |
        64 -> 127    : 30       |*                                       |
       128 -> 255    : 6        |                                        |
       256 -> 511    : 3        |                                        |
       512 -> 1023   : 5        |                                        |
      1024 -> 2047   : 27       |*                                       |
      2048 -> 4095   : 30       |*                                       |
      4096 -> 8191   : 20       |                                        |
      8192 -> 16383  : 29       |*                                       |
     16384 -> 32767  : 809      |****************************************|
     32768 -> 65535  : 64       |***                                     |
```

runqlat times how long threads were waiting on the CPU run queues, and prints this as a histogram.

This can help quantify time lost waiting for a turn on CPU, during periods of CPU saturation.
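For example (options as in recent bcc versions; check ./runqlat -h):

```
# print one 5-second summary of run queue latency, in microseconds:
./runqlat 5 1

# summarize in milliseconds, for a single process (the PID is illustrative):
./runqlat -m -p 4005 5 1
```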
More [examples](../tools/runqlat_example.txt).

#### 1.11. profile

```
# ./profile
Sampling at 49 Hertz of all threads by user + kernel stack... Hit Ctrl-C to end.
^C
    00007f31d76c3251 [unknown]
    47a2c1e752bf47f7 [unknown]
    -                sign-file (8877)
        1

    ffffffff813d0af8 __clear_user
    ffffffff813d5277 iov_iter_zero
    ffffffff814ec5f2 read_iter_zero
    ffffffff8120be9d __vfs_read
    ffffffff8120c385 vfs_read
    ffffffff8120d786 sys_read
    ffffffff817cc076 entry_SYSCALL_64_fastpath
    00007fc5652ad9b0 read
    -                dd (25036)
        4

    0000000000400542 func_a
    0000000000400598 main
    00007f12a133e830 __libc_start_main
    083e258d4c544155 [unknown]
    -                func_ab (13549)
        5

[...]

    ffffffff8105eb66 native_safe_halt
    ffffffff8103659e default_idle
    ffffffff81036d1f arch_cpu_idle
    ffffffff810bba5a default_idle_call
    ffffffff810bbd07 cpu_startup_entry
    ffffffff8104df55 start_secondary
    -                swapper/1 (0)
        75
```

profile is a CPU profiler: it takes samples of stack traces at timed intervals, and prints a summary of unique stack traces and a count of their occurrence.

Use this tool to understand the code paths that are consuming CPU resources.
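profile can also emit folded-format output for building flame graphs. A sketch of that workflow, using Brendan Gregg's FlameGraph tools (options as in recent bcc versions; the output filenames are arbitrary):

```
# sample all CPUs at 49 Hertz for 30 seconds, printing folded stacks:
./profile -F 49 -f 30 > out.profile-folded

# render an interactive flame graph SVG:
git clone https://github.com/brendangregg/FlameGraph
./FlameGraph/flamegraph.pl out.profile-folded > out.svg
```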
More [examples](../tools/profile_example.txt).

### 2. Observability with Generic Tools

In addition to the above tools for performance tuning, below is a checklist of bcc generic tools, first as a list, then in detail:

1. trace
1. argdist
1. funccount

These generic tools can provide the visibility to solve your specific problems.

#### 2.1. trace

##### Example 1

Suppose you want to track file ownership changes. There are three syscalls, `chown`, `fchown` and `lchown`, which users can use to change file ownership. The corresponding syscall entries are `SyS_[f|l]chown`. The following command can be used to print the syscall parameters and the user id of the calling process. You can use the `id` command to find the uid of a particular user.

```
$ trace.py \
  'p::SyS_chown "file = %s, to_uid = %d, to_gid = %d, from_uid = %d", arg1, arg2, arg3, $uid' \
  'p::SyS_fchown "fd = %d, to_uid = %d, to_gid = %d, from_uid = %d", arg1, arg2, arg3, $uid' \
  'p::SyS_lchown "file = %s, to_uid = %d, to_gid = %d, from_uid = %d", arg1, arg2, arg3, $uid'
PID     TID     COMM            FUNC             -
1269255 1269255 python3.6       SyS_lchown       file = /tmp/dotsync-usisgezu/tmp, to_uid = 128203, to_gid = 100, from_uid = 128203
1269441 1269441 zstd            SyS_chown        file = /tmp/dotsync-vic7ygj0/dotsync-package.zst, to_uid = 128203, to_gid = 100, from_uid = 128203
1269255 1269255 python3.6       SyS_lchown       file = /tmp/dotsync-a40zd7ev/tmp, to_uid = 128203, to_gid = 100, from_uid = 128203
1269442 1269442 zstd            SyS_chown        file = /tmp/dotsync-gzp413o_/dotsync-package.zst, to_uid = 128203, to_gid = 100, from_uid = 128203
1269255 1269255 python3.6       SyS_lchown       file = /tmp/dotsync-whx4fivm/tmp/.bash_profile, to_uid = 128203, to_gid = 100, from_uid = 128203
```

##### Example 2

Suppose you want to count nonvoluntary context switches (`nvcsw`) in your BPF-based performance monitoring tools, and you do not know the proper method. `/proc/<pid>/status` already tells you the number (`nonvoluntary_ctxt_switches`) for a pid, and you can use `trace.py` for a quick experiment to verify your method. In the kernel source code, `nvcsw` is counted in the `__schedule` function of `linux/kernel/sched/core.c` under the condition
```
!(!preempt && prev->state) // i.e., preempt || !prev->state
```

The `__schedule` function is marked `notrace`, so the best place to evaluate the above condition seems to be the `sched:sched_switch` tracepoint, which is called inside `__schedule` and defined in `linux/include/trace/events/sched.h`. In `trace.py`, `args` is already a pointer to the tracepoint's `TP_STRUCT__entry`. The above condition in `__schedule` can be represented as
```
args->prev_state == TASK_STATE_MAX || args->prev_state == 0
```

The command below counts involuntary context switches (per process or per thread); compare the result to `/proc/<pid>/status` or `/proc/<pid>/task/<task_id>/status` for correctness, since in typical cases involuntary context switches are not very common.
```
$ trace.py -p 1134138 't:sched:sched_switch (args->prev_state == TASK_STATE_MAX || args->prev_state == 0)'
PID     TID     COMM            FUNC
1134138 1134140 contention_test sched_switch
1134138 1134142 contention_test sched_switch
...
$ trace.py -L 1134140 't:sched:sched_switch (args->prev_state == TASK_STATE_MAX || args->prev_state == 0)'
PID     TID     COMM            FUNC
1134138 1134140 contention_test sched_switch
1134138 1134140 contention_test sched_switch
...
```

##### Example 3

This example is related to issues [1231](https://github.com/iovisor/bcc/issues/1231) and [1516](https://github.com/iovisor/bcc/issues/1516), where uprobes do not work at all in certain cases. First, run `strace` as below:

```
$ strace trace.py 'r:bash:readline "%s", retval'
...
perf_event_open(0x7ffd968212f0, -1, 0, -1, 0x8 /* PERF_FLAG_??? */) = -1 EIO (Input/output error)
...
```

The `perf_event_open` syscall returns `-EIO`. Searching for `EIO` in the kernel uprobe-related code under the `kernel/trace` and `kernel/events` directories, the function `uprobe_register` is the most suspicious. Let us find out whether this function is called, and if so, what its return value is. In one terminal, print the return value of `uprobe_register` with:
```
$ trace.py 'r::uprobe_register "ret = %d", retval'
```
In another terminal, run the same bash uretprobe tracing example, and you should get:
```
$ trace.py 'r::uprobe_register "ret = %d", retval'
PID     TID     COMM            FUNC             -
1041401 1041401 python2.7       uprobe_register  ret = -5
```

The `-5` error code is `EIO`. This confirms that the following code in `uprobe_register` is the most suspicious culprit:
```
if (!inode->i_mapping->a_ops->readpage && !shmem_mapping(inode->i_mapping))
        return -EIO;
```
The `shmem_mapping` function is defined as:
```
bool shmem_mapping(struct address_space *mapping)
{
        return mapping->a_ops == &shmem_aops;
}
```

To confirm the theory, find out what `inode->i_mapping->a_ops` is with the following command:
```
$ trace.py -I 'linux/fs.h' 'p::uprobe_register(struct inode *inode) "a_ops = %llx", inode->i_mapping->a_ops'
PID     TID     COMM            FUNC             -
814288  814288  python2.7       uprobe_register  a_ops = ffffffff81a2adc0
^C$ grep ffffffff81a2adc0 /proc/kallsyms
ffffffff81a2adc0 R empty_aops
```

The kernel symbol `empty_aops` does not have `readpage` defined, so the suspicious condition above is true. Further examination of the kernel source code shows that `overlayfs` does not provide its own `a_ops`, while some other file systems (e.g., ext4) define their own `a_ops` (e.g., `ext4_da_aops`), and `ext4_da_aops` defines `readpage`. Hence, uprobes work fine on ext4 but not on overlayfs.

More [examples](../tools/trace_example.txt).

#### 2.2. argdist
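argdist aggregates a function's parameter or return value into an in-kernel histogram or frequency count, letting you answer one-off questions without writing a new tool. A minimal usage sketch, assuming a recent bcc version (the PID is illustrative; see argdist -h for the probe syntax):

```
# histogram of allocation sizes passed to libc malloc() by PID 1005:
./argdist -p 1005 -H 'p:c:malloc(size_t size):size_t:size'
```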
More [examples](../tools/argdist_example.txt).

#### 2.3. funccount

funccount counts events matching a user-specified pattern, such as all kernel functions beginning with "vfs_". Use it to find out which functions are being called, and how often, before drilling down with trace or argdist.

More [examples](../tools/funccount_example.txt).

## Networking

To do.