Demonstrations of funclatency, the Linux eBPF/bcc version.


Timing the do_sys_open() kernel function until Ctrl-C:

# ./funclatency do_sys_open
Tracing do_sys_open... Hit Ctrl-C to end.
^C
     nsecs               : count     distribution
         0 -> 1          : 0        |                                      |
         2 -> 3          : 0        |                                      |
         4 -> 7          : 0        |                                      |
         8 -> 15         : 0        |                                      |
        16 -> 31         : 0        |                                      |
        32 -> 63         : 0        |                                      |
        64 -> 127        : 0        |                                      |
       128 -> 255        : 0        |                                      |
       256 -> 511        : 0        |                                      |
       512 -> 1023       : 0        |                                      |
      1024 -> 2047       : 0        |                                      |
      2048 -> 4095       : 124      |****************                      |
      4096 -> 8191       : 291      |**************************************|
      8192 -> 16383      : 36       |****                                  |
     16384 -> 32767      : 16       |**                                    |
     32768 -> 65535      : 8        |*                                     |
     65536 -> 131071     : 0        |                                      |
    131072 -> 262143     : 0        |                                      |
    262144 -> 524287     : 0        |                                      |
    524288 -> 1048575    : 0        |                                      |
   1048576 -> 2097151    : 0        |                                      |
   2097152 -> 4194303    : 1        |                                      |
Detaching...

The output shows a histogram of function latency (call time), measured from when
the function began executing (was called) to when it finished (returned).

This example output shows that most of the time, do_sys_open() took between
2048 and 65536 nanoseconds (2 to 65 microseconds). The peak of this distribution
shows 291 calls of between 4096 and 8191 nanoseconds. There was also one
occurrence, an outlier, in the 2 to 4 millisecond range.

How this works: the function entry and return are traced using the kernel kprobe
and kretprobe tracer. Timestamps are collected, the delta time is calculated, and
the delta is then bucketized and stored as an in-kernel histogram for efficiency.
The histogram is visible in the output: it's the "count" column; everything else
is decoration. Only the count column is copied to user-level on output. This is
an efficient way to time kernel functions and examine their latency distribution.
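
The bucketizing step can be sketched in plain Python. This is only an
illustration of the power-of-2 slot math (the real tool does it in BPF C,
in-kernel); the function name here is made up for the example:

```python
def log2_slot(delta_ns):
    """Return the power-of-2 histogram slot for a latency delta,
    mirroring the log2 bucketing done in-kernel: slot 0 holds
    0 -> 1, slot 1 holds 2 -> 3, slot 12 holds 4096 -> 8191, etc."""
    slot = 0
    while delta_ns > 1:
        delta_ns >>= 1
        slot += 1
    return slot

# A 5000 ns do_sys_open() call lands in the "4096 -> 8191" bucket:
print(log2_slot(5000))  # -> 12
```

Because only a per-slot counter is incremented in kernel context, the cost per
traced call stays small no matter how many events fire.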


Now trace a user function, pthread_mutex_lock in libpthread, to determine if
there is considerable lock contention:

# ./funclatency pthread:pthread_mutex_lock -p $(pidof contentions)
Tracing 1 function for "pthread:pthread_mutex_lock"... Hit Ctrl-C to end.
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 508967   |****************************************|
      4096 -> 8191       : 70072    |*****                                   |
      8192 -> 16383      : 27686    |**                                      |
     16384 -> 32767      : 5075     |                                        |
     32768 -> 65535      : 2318     |                                        |
     65536 -> 131071     : 581      |                                        |
    131072 -> 262143     : 38       |                                        |
    262144 -> 524287     : 5        |                                        |
    524288 -> 1048575    : 1        |                                        |
   1048576 -> 2097151    : 9        |                                        |
Detaching...

It seems that most calls to pthread_mutex_lock completed rather quickly (in
under 4us), but there were some cases of considerable contention, sometimes
over a full millisecond.


Run a quick-and-dirty profiler over all the functions in an executable:

# ./funclatency /home/user/primes:* -p $(pidof primes) -F
Tracing 15 functions for "/home/user/primes:*"... Hit Ctrl-C to end.
^C

Function = is_prime [6556]
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 1495322  |****************************************|
      4096 -> 8191       : 95744    |**                                      |
      8192 -> 16383      : 9926     |                                        |
     16384 -> 32767      : 3070     |                                        |
     32768 -> 65535      : 1415     |                                        |
     65536 -> 131071     : 112      |                                        |
    131072 -> 262143     : 9        |                                        |
    262144 -> 524287     : 3        |                                        |
    524288 -> 1048575    : 0        |                                        |
   1048576 -> 2097151    : 8        |                                        |

Function = insert_result [6556]
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 111047   |****************************************|
      4096 -> 8191       : 3998     |*                                       |
      8192 -> 16383      : 720      |                                        |
     16384 -> 32767      : 238      |                                        |
     32768 -> 65535      : 106      |                                        |
     65536 -> 131071     : 5        |                                        |
    131072 -> 262143     : 4        |                                        |
Detaching...

From the results, we can see that the is_prime function has something resembling
an exponential distribution -- very few primes take a very long time to test,
while most numbers are verified as prime or composite in less than 4us. The
insert_result function exhibits a similar phenomenon, likely due to contention
over a shared results container.


Now vfs_read() is traced, and a microseconds histogram printed:

# ./funclatency -u vfs_read
Tracing vfs_read... Hit Ctrl-C to end.
^C
     usecs               : count     distribution
         0 -> 1          : 1143     |**************************************|
         2 -> 3          : 420      |*************                         |
         4 -> 7          : 159      |*****                                 |
         8 -> 15         : 295      |*********                             |
        16 -> 31         : 25       |                                      |
        32 -> 63         : 5        |                                      |
        64 -> 127        : 1        |                                      |
       128 -> 255        : 0        |                                      |
       256 -> 511        : 0        |                                      |
       512 -> 1023       : 0        |                                      |
      1024 -> 2047       : 1        |                                      |
      2048 -> 4095       : 0        |                                      |
      4096 -> 8191       : 5        |                                      |
      8192 -> 16383      : 0        |                                      |
     16384 -> 32767      : 0        |                                      |
     32768 -> 65535      : 0        |                                      |
     65536 -> 131071     : 7        |                                      |
    131072 -> 262143     : 7        |                                      |
    262144 -> 524287     : 3        |                                      |
    524288 -> 1048575    : 7        |                                      |
Detaching...

This shows a bimodal distribution. Many vfs_read() calls were faster than 15
microseconds; however, there was also a small handful between 65 milliseconds
and 1 second, seen at the bottom of the table. These are likely network reads
from SSH, waiting on interactive keystrokes.


Tracing do_nanosleep() in milliseconds:

# ./funclatency -m do_nanosleep
Tracing do_nanosleep... Hit Ctrl-C to end.
^C
     msecs               : count     distribution
         0 -> 1          : 0        |                                      |
         2 -> 3          : 0        |                                      |
         4 -> 7          : 0        |                                      |
         8 -> 15         : 0        |                                      |
        16 -> 31         : 0        |                                      |
        32 -> 63         : 0        |                                      |
        64 -> 127        : 0        |                                      |
       128 -> 255        : 0        |                                      |
       256 -> 511        : 0        |                                      |
       512 -> 1023       : 328      |**************************************|
      1024 -> 2047       : 0        |                                      |
      2048 -> 4095       : 0        |                                      |
      4096 -> 8191       : 32       |***                                   |
      8192 -> 16383      : 0        |                                      |
     16384 -> 32767      : 0        |                                      |
     32768 -> 65535      : 2        |                                      |
Detaching...

This looks like it has found threads that are sleeping every 1, 5, and 60
seconds.


An interval can be provided using -i, and timestamps added using -T. For
example, tracing vfs_read() latency in milliseconds and printing output
every 5 seconds:

# ./funclatency -mTi 5 vfs_read
Tracing vfs_read... Hit Ctrl-C to end.

20:10:08
     msecs               : count     distribution
         0 -> 1          : 1500     |*************************************+|
         2 -> 3          : 3        |                                      |
         4 -> 7          : 1        |                                      |
         8 -> 15         : 2        |                                      |
        16 -> 31         : 0        |                                      |
        32 -> 63         : 0        |                                      |
        64 -> 127        : 4        |                                      |
       128 -> 255        : 3        |                                      |
       256 -> 511        : 1        |                                      |
       512 -> 1023       : 7        |                                      |

20:10:13
     msecs               : count     distribution
         0 -> 1          : 1251     |*************************************+|
         2 -> 3          : 3        |                                      |
         4 -> 7          : 2        |                                      |
         8 -> 15         : 0        |                                      |
        16 -> 31         : 2        |                                      |
        32 -> 63         : 3        |                                      |
        64 -> 127        : 5        |                                      |
       128 -> 255        : 5        |                                      |
       256 -> 511        : 3        |                                      |
       512 -> 1023       : 6        |                                      |
      1024 -> 2047       : 2        |                                      |

20:10:18
     msecs               : count     distribution
         0 -> 1          : 1265     |*************************************+|
         2 -> 3          : 0        |                                      |
         4 -> 7          : 5        |                                      |
         8 -> 15         : 9        |                                      |
        16 -> 31         : 7        |                                      |
        32 -> 63         : 1        |                                      |
        64 -> 127        : 2        |                                      |
       128 -> 255        : 3        |                                      |
       256 -> 511        : 5        |                                      |
       512 -> 1023       : 5        |                                      |
      1024 -> 2047       : 0        |                                      |
      2048 -> 4095       : 1        |                                      |
^C
20:10:20
     msecs               : count     distribution
         0 -> 1          : 249      |*************************************+|
         2 -> 3          : 0        |                                      |
         4 -> 7          : 0        |                                      |
         8 -> 15         : 1        |                                      |
        16 -> 31         : 0        |                                      |
        32 -> 63         : 0        |                                      |
        64 -> 127        : 0        |                                      |
       128 -> 255        : 0        |                                      |
       256 -> 511        : 0        |                                      |
       512 -> 1023       : 1        |                                      |
Detaching...


A single process can be traced, which filters in-kernel for efficiency. Here,
the vfs_read() function is timed as milliseconds for PID 17064, which is a
bash shell:

# ./funclatency -mp 17064 vfs_read
Tracing vfs_read... Hit Ctrl-C to end.
^C
     msecs               : count     distribution
         0 -> 1          : 1        |**                                    |
         2 -> 3          : 0        |                                      |
         4 -> 7          : 0        |                                      |
         8 -> 15         : 1        |**                                    |
        16 -> 31         : 2        |*****                                 |
        32 -> 63         : 0        |                                      |
        64 -> 127        : 13       |**************************************|
       128 -> 255        : 10       |*****************************         |
       256 -> 511        : 4        |***********                           |
Detaching...

The distribution between 64 and 511 milliseconds shows keystroke latency.
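
As noted earlier, only the per-bucket counts cross the kernel/user boundary;
the range labels and star bars are decoration drawn in user space. A rough
Python sketch of that rendering step (illustrative only, with a made-up
render_hist name, not bcc's actual histogram-printing code):

```python
def render_hist(counts, width=38):
    """Render power-of-2 bucket counts as an ASCII histogram,
    scaling each bar against the largest bucket. counts[slot]
    covers the value range 2**slot -> 2**(slot+1) - 1."""
    peak = max(counts) or 1  # avoid dividing by zero when all buckets are empty
    lines = []
    for slot, count in enumerate(counts):
        low = 0 if slot == 0 else 1 << slot
        high = (1 << (slot + 1)) - 1
        stars = "*" * (count * width // peak)
        lines.append("%10d -> %-10d : %-8d |%-*s|" % (low, high, count, width, stars))
    return "\n".join(lines)

print(render_hist([0, 2, 4, 13]))
```

Keeping this formatting at user level is what makes the in-kernel side cheap:
the BPF program never touches strings, only counters.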


The -F option can be used to print a histogram per function. E.g.:

# ./funclatency -uF 'vfs_r*'
Tracing 5 functions for "vfs_r*"... Hit Ctrl-C to end.
^C

Function = vfs_read
     usecs               : count     distribution
         0 -> 1          : 1044     |****************************************|
         2 -> 3          : 383      |**************                          |
         4 -> 7          : 76       |**                                      |
         8 -> 15         : 41       |*                                       |
        16 -> 31         : 26       |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 1        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 4        |                                        |
      4096 -> 8191       : 2        |                                        |
      8192 -> 16383      : 0        |                                        |
     16384 -> 32767      : 0        |                                        |
     32768 -> 65535      : 2        |                                        |
     65536 -> 131071     : 5        |                                        |
    131072 -> 262143     : 5        |                                        |
    262144 -> 524287     : 3        |                                        |
    524288 -> 1048575    : 7        |                                        |

Function = vfs_rename
     usecs               : count     distribution
         0 -> 1          : 2        |****                                    |
         2 -> 3          : 2        |****                                    |
         4 -> 7          : 2        |****                                    |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 6        |*************                           |
        32 -> 63         : 18       |****************************************|
Detaching...



USAGE message:

# ./funclatency -h
usage: funclatency [-h] [-p PID] [-i INTERVAL] [-d DURATION] [-T] [-u] [-m]
                   [-F] [-r] [-v]
                   pattern

Time functions and print latency as a histogram

positional arguments:
  pattern               search expression for functions

optional arguments:
  -h, --help            show this help message and exit
  -p PID, --pid PID     trace this PID only
  -i INTERVAL, --interval INTERVAL
                        summary interval, in seconds
  -d DURATION, --duration DURATION
                        total duration of trace, in seconds
  -T, --timestamp       include timestamp on output
  -u, --microseconds    microsecond histogram
  -m, --milliseconds    millisecond histogram
  -F, --function        show a separate histogram per function
  -r, --regexp          use regular expressions. Default is "*" wildcards
                        only.
  -v, --verbose         print the BPF program (for debugging purposes)

examples:
    ./funclatency do_sys_open       # time the do_sys_open() kernel function
    ./funclatency c:read            # time the read() C library function
    ./funclatency -u vfs_read       # time vfs_read(), in microseconds
    ./funclatency -m do_nanosleep   # time do_nanosleep(), in milliseconds
    ./funclatency -i 2 -d 10 c:open # output every 2 seconds, for duration 10s
    ./funclatency -mTi 5 vfs_read   # output every 5 seconds, with timestamps
    ./funclatency -p 181 vfs_read   # time process 181 only
    ./funclatency 'vfs_fstat*'      # time both vfs_fstat() and vfs_fstatat()
    ./funclatency 'c:*printf'       # time the *printf family of functions
    ./funclatency -F 'vfs_r*'       # show one histogram per matched function
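
The default (non -r) pattern matching behaves like shell globbing, as the
'vfs_fstat*' and 'c:*printf' examples show. Python's fnmatch module
illustrates the matching semantics (a sketch of the behavior only, not the
tool's own implementation):

```python
from fnmatch import fnmatchcase

# "*" wildcards match function names as in the examples above:
for name in ("vfs_fstat", "vfs_fstatat", "vfs_read"):
    print(name, fnmatchcase(name, "vfs_fstat*"))
# vfs_fstat True
# vfs_fstatat True
# vfs_read False
```

With -r, the pattern is instead treated as a full regular expression, which
allows alternation and anchoring that plain wildcards cannot express.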