%
% Copyright (C) 2007 Alan D. Brunelle <Alan.Brunelle (a] hp.com>
%
% This program is free software; you can redistribute it and/or modify
% it under the terms of the GNU General Public License as published by
% the Free Software Foundation; either version 2 of the License, or
% (at your option) any later version.
%
% This program is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
% GNU General Public License for more details.
%
% You should have received a copy of the GNU General Public License
% along with this program; if not, write to the Free Software
% Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
%
% vi :set textwidth=75
%
\documentclass{article}
\usepackage{multirow,graphicx,placeins}

\begin{document}
%---------------------
\title{\texttt{btrecord} and \texttt{btreplay} User Guide}
\author{Alan D. Brunelle (Alan.Brunelle (a] hp.com)}
\date{\today}
\maketitle
\begin{abstract}
\input{abstract.tex}
\end{abstract}
\thispagestyle{empty}\newpage
%---------------------
\tableofcontents\thispagestyle{empty}\newpage
%---------------------
\section{Introduction}
\input{abstract.tex}

\bigskip
This document presents the command line overview for
\texttt{btrecord} and \texttt{btreplay}, and shows some commonly used
examples of their use in everyday work here at OSLO's Scalability and
Performance Group.

\subsection*{Build Note}

To build these tools, one needs to
place the source directory next to a valid
\texttt{blktrace}\footnote{\texttt{git://git.kernel.dk/blktrace.git}}
directory, as the \texttt{Makefile} includes \texttt{../blktrace}.


%---------------------
\newpage\section{\texttt{btrecord} and \texttt{btreplay} Operating Model}

The \texttt{blktrace} utility provides the ability to collect detailed
traces from the kernel for each IO processed by the block IO layer. The
traces provide a complete timeline for each IO processed, including
detailed information concerning when an IO was first received by the block
IO layer -- indicating the device, CPU number, time stamp, IO direction,
sector number and IO size (number of sectors). Using this information,
one is able to \emph{replay} the IO again on the same machine or on another
setup entirely.

\subsection{Basic Workflow}
The basic operating workflow to replay IOs would be something like:

\begin{enumerate}
  \item Run \texttt{blktrace} to collect traces. Here you specify the
  device or devices that you wish to trace and later replay IOs upon. Note:
  the only traces you are interested in are \emph{QUEUE} requests --
  thus, to save system resources (including storage for traces), one could
  specify the \texttt{-a queue} command line option to \texttt{blktrace}.

  \item While \texttt{blktrace} is running, you run the workload that you
  are interested in.

  \item When the workload has completed, you stop the \texttt{blktrace}
  utility (thus saving all traces over the complete workload).

  \item You extract the pertinent IO information from the traces saved by
  \texttt{blktrace} using the \texttt{btrecord} utility. This will parse
  each trace file created by \texttt{blktrace}, and craft IO descriptions
  to be used in the next phase of the workload processing.

  \item Once \texttt{btrecord} has successfully created a series of data
  files to be processed, you can run the \texttt{btreplay} utility which
  attempts to generate the same IOs seen during the sample workload phase.
\end{enumerate}

\subsection{IO Stream Replay Characteristics}
The major characteristics of the IO stream that are kept intact include:

\begin{description}
  \item[Device] The IOs are replayed on the same device as was seen
  during the sample workload.

  \item[IO direction] The same IO direction (read/write) is maintained.

  \item[IO offset] The same device offset is maintained.

  \item[IO size] The same number of sectors is transferred.

  \item[Time differential] The time stamps stored during the
  \texttt{blktrace} run are used to determine the amount of time between
  IOs during the sample workload. \texttt{btreplay} \emph{attempts} to
  maintain the same time differential between IOs, but no guarantees as
  to complete accuracy are provided by the utility.

  \item[Device IO Stream Ordering] All IOs on a device are submitted in
  the precise order they were seen during the sample workload run.
\end{description}

As noted above, the time between IOs may not be accurately maintained
during replays. In addition, the actual ordering of IOs \emph{between}
devices is not necessarily maintained. (Each device with an IO stream
maintains its own concept of time, and thus there may be slippage of the
time kept between managing threads.)

\begin{quotation}
  We have prototyped a different approach, wherein a single managing
  thread handles all IOs across all devices. This approach, while
  guaranteeing correct ordering of IOs across all devices, resulted in
  much worse timing on a per IO basis.
\end{quotation}

\subsection{\texttt{btrecord/btreplay} Method of Operation}

As noted above, \texttt{btrecord} extracts \texttt{QUEUE} operations from
\texttt{blktrace} output. These \texttt{QUEUE} operations indicate the
entrance of IOs into the block IO layer.
In order to replay these IOs with
some accuracy with regard to ordering and timeliness, we decided to take
multiple sequential (in time) IOs and put them in a single \emph{bunch} of
IOs that will be processed as a single \emph{asynchronous IO} call to the
kernel\footnote{Attempts to do them individually resulted in too large a
turnaround time penalty (user-space to kernel and back). Note that in a
number of workloads, the IOs are coming in from the page cache handling
code, and thus are submitted to the block IO layer with \emph{very small}
time intervals between issues.}. To manage the size of the \emph{bunches},
the \texttt{btrecord} utility provides you with two controlling knobs:

\begin{description}
  \item[\texttt{--max-bunch-time}] This is the amount of time to encompass
  in one bunch -- only IOs within the time specified are eligible
  for \emph{bunching.} The default time is 10 milliseconds (10,000,000
  nanoseconds). Refer to section~\ref{sec:c-o-m} on page~\pageref{sec:c-o-m}
  for more information.

  \item[\texttt{--max-pkts}] A \emph{bunch} can contain anywhere from
  1 to 512 packets; by default, a bunch is capped at
  8 individual IOs. With this option, one can increase or
  decrease the maximum \emph{bunch} size. Refer to section~\ref{sec:c-o-M}
  on page~\pageref{sec:c-o-M} for more information.
\end{description}

Each input data file (one per device per CPU) results in a new record
data file (again, one per device per CPU) which contains information
about \emph{bunches} of IOs to be replayed. \texttt{btreplay} operates on
these record data files by spawning a new pair of threads per file.
One
thread manages the submitting of AIOs per bunch in the record data file,
while the other thread manages reclaiming AIOs completed\footnote{We
have found that having the same thread do both results in a further
reduction in replay timing accuracy.}.

Each submitting thread simply reads the input file of \emph{bunches}
recorded by \texttt{btrecord}, and attempts to faithfully reproduce the
ordering and timing of IOs seen during the sample workload. The reclaiming
thread simply waits for AIO completions, freeing up resources for the
submitting thread to utilize to submit new AIOs.

The number of CPUs being used on the replay system can be different from
the number on the recorded system. To help with mappings here, the
\texttt{--cpus} option allows one to state how many CPUs on the replay
system to utilize. If the number of CPUs on the replay system is less than
on the recording system, we wrap CPU IDs. This \emph{may} result in an
overload of CPU processing capabilities on the replay system. (Refer to
section~\ref{sec:p-o-c} on page~\pageref{sec:p-o-c} for more details about the
\texttt{--cpus} option.)

\newpage\subsection{Known Deficiencies and Proposed Possible Fixes}

The known deficiencies of the current set of utilities are
outlined here; in some cases, ideas on additions and/or improvements are
included as well.

\begin{enumerate}
  \item Lack of IO ordering across devices.

  \begin{quote}
    \emph{We could institute the notion of global time across threads,
    and thus ensure IO ordering across devices, with some reduction in
    timing accuracy.}
  \end{quote}

  \item Lack of IO timing accuracy -- additional time between IO bunches.

  \begin{quote}
    \emph{This is the primary problem with any IO replay mechanism -- how
    to guarantee per-IO timing accuracy with respect to other replayed IOs?
    One idea to reduce errors in this area would be to push the IO replay
    into the kernel, where you \emph{may} receive more responsive timings.}
  \end{quote}

  \item Bunching of IOs results in reduced time amongst IOs within a bunch.

  \begin{quote}
    \emph{The user has \emph{some} control over this (via the
    \texttt{--max-pkts} option). One \emph{could} simply specify
    \texttt{--max-pkts=1} and then each IO would be treated individually. Of
    course, this would probably then run into the problem of excessive
    inter-IO times.}
  \end{quote}

  \item 1-to-1 mapping of devices -- for now the devices on the replay
  machine must be the same as on the recording machine.

  \begin{quote}
    \emph{It should be relatively trivial to add in the notion of
    mapping -- simply include a file that is read which maps devices
    on one machine to devices (with offsets and sizes) on the replay
    machine\footnote{The notion of an offset and device size to replay on
    could be used to both allow for a single device to masquerade as more
    than one device, and could be utilized in case the replay device is
    smaller than the recorded device.}.}

    \medskip\emph{One could also add in the notion of CPU mappings --
    device $D_{rec}$ managed by CPU $C_{rec}$ on the recorded system
    shall be replayed on device $D_{rep}$ and CPU $C_{rep}$ on the
    replay machine.}

    \bigskip
    \begin{quote}
      With version 0.9.1 we now support the \texttt{-M} option to do this
      -- see section~\ref{sec:p-o-M} on page~\pageref{sec:p-o-M} for more
      information on device mapping.
    \end{quote}
  \end{quote}

\end{enumerate}

%---------------------
\newpage\section{\label{sec:command-line}Command Line Options}
\subsection{\texttt{btrecord} Command Line Options}
\begin{figure}[h!]
\begin{verbatim}
Usage: btrecord -- version 0.9.3

    [ -d <dir>  : --input-directory=<dir> ] Default: .
    [ -D <dir>  : --output-directory=<dir>] Default: .
    [ -F        : --find-traces           ] Default: Off
    [ -h        : --help                  ] Default: Off
    [ -m <nsec> : --max-bunch-time=<nsec> ] Default: 10 msec
    [ -M <pkts> : --max-pkts=<pkts>       ] Default: 8
    [ -o <base> : --output-base=<base>    ] Default: replay
    [ -v        : --verbose               ] Default: Off
    [ -V        : --version               ] Default: Off
    <dev>...                                Default: None
\end{verbatim}
\caption{\label{fig:btrecord--help}\texttt{btrecord --help} Output}
\end{figure}
\FloatBarrier

\subsubsection{\label{sec:c-o-d}\texttt{-d} or
\texttt{--input-directory}\\Set Input Directory}

The \texttt{-d} option requires a single parameter providing the name of
the directory in which input files are to be found. The default directory is the
current directory (\texttt{.}).

\subsubsection{\label{sec:c-o-D}\texttt{-D} or
\texttt{--output-directory}\\Set Output Directory}

The \texttt{-D} option requires a single parameter providing the name of
the directory in which output files are to be placed. The default directory is the
current directory (\texttt{.}).

\subsubsection{\texttt{-F} or \texttt{--find-traces}\\Find Trace Files
Automatically}

The \texttt{-F} option instructs \texttt{btrecord} to find all the
trace files in the directory specified (either via the \texttt{-d}
option, or in the default directory \texttt{.}).

\subsubsection{\texttt{-h} or \texttt{--help}\\Display Help Message}
\subsubsection{\texttt{-V} or \texttt{--version}\\Display
\texttt{btrecord} Version}

The \texttt{-h} option displays the command line options and
defaults, as presented in figure~\ref{fig:btrecord--help} on
page~\pageref{fig:btrecord--help}.

The \texttt{-V} option displays the \texttt{btrecord} version, as shown here:

\begin{verbatim}
$ btrecord --version
btrecord -- version 0.9.0
\end{verbatim}

Both commands exit immediately after processing the option.

\subsubsection{\label{sec:c-o-m}\texttt{-m} or
\texttt{--max-bunch-time}\\Set Maximum Time Per Bunch}

The \texttt{-m} option requires a single parameter which specifies an
amount of time (in nanoseconds) to include in any one bunch of IOs that
are to be processed. The smaller the value, the smaller the number of
IOs processed at one time -- perhaps yielding a more realistic replay.
However, after a certain point the amount of overhead per bunch may result
in additional real replay time, thus yielding less accurate replay times.

The default value is 10,000,000 nanoseconds (10 milliseconds).

\subsubsection{\label{sec:c-o-M}\texttt{-M} or
\texttt{--max-pkts}\\Set Maximum Packets Per Bunch}

The \texttt{-M} option requires a single parameter which specifies the
maximum number of IOs to store in a single bunch. As with the \texttt{-m}
option (section~\ref{sec:c-o-m}), smaller values \emph{may} or \emph{may not}
yield more accurate replay times.

The default value is 8, and values up to a maximum of 512 are supported.

\subsubsection{\label{sec:c-o-o}\texttt{-o} or
\texttt{--output-base}\\Set Base Name for Output Files}

Each output file name has 3 fields:

\begin{enumerate}
  \item Device identifier (taken directly from the device name of the
  \texttt{blktrace} output file).

  \item \texttt{btrecord} base name -- by default ``replay''.

  \item And the CPU number (again, taken directly from the
  \texttt{blktrace} output file name).
\end{enumerate}

This option requires a single parameter, which replaces the default base
name (\texttt{replay}) with the specified value.
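
As a sketch of the naming scheme, assume a hypothetical base name of
\texttt{nightly} and traces for device \texttt{sdab} taken on a system
with 4 CPUs. Assuming the three fields are joined with periods, as in the
\texttt{sdab.replay.3} file name used elsewhere in this guide, the record
files would be expected to be named as follows (the names are
illustrative, not captured output):

\begin{verbatim}
$ btrecord -o nightly sdab
# expected record files, one per CPU (illustrative):
#   sdab.nightly.0  sdab.nightly.1  sdab.nightly.2  sdab.nightly.3
\end{verbatim}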

\subsubsection{\label{sec:c-o-v}\texttt{-v} or
\texttt{--verbose}\\Select Verbose Output}

This option will output some simple statistics at the end of a successful
run. Figure~\ref{fig:verb-out} (page~\pageref{fig:verb-out}) shows
an example of some output, while figure~\ref{fig:verb-defs}
(page~\pageref{fig:verb-defs}) shows what the fields mean.

\begin{figure}[h!]
\begin{verbatim}
sdab:0:  580661 pkts (tot),  126030 pkts (replay),   89809 bunches, 1.4 pkts/bunch
sdab:1: 2559775 pkts (tot),  430172 pkts (replay),  293029 bunches, 1.5 pkts/bunch
sdab:2:  653559 pkts (tot),  136522 pkts (replay),  102288 bunches, 1.3 pkts/bunch
sdab:3:  474773 pkts (tot),  117849 pkts (replay),   69572 bunches, 1.7 pkts/bunch
\end{verbatim}
\caption{\label{fig:verb-out}Verbose Output Example}
\end{figure}
\FloatBarrier

\begin{figure}[h!]
\begin{description}
  \item[Field 1] The first field contains the device name and CPU
  identifier. Thus: \texttt{sdab:0:} means the device \texttt{sdab} and
  traces on CPU 0.

  \item[Field 2] The second field contains the total number of packets
  processed for each device file.

  \item[Field 3] The next field shows the number of packets eligible for
  replay.

  \item[Field 4] The fourth field contains the total number of IO bunches.

  \item[Field 5] The last field shows the average number of IOs per bunch
  recorded.
\end{description}
\caption{\label{fig:verb-defs}Verbose Field Definitions}
\end{figure}
\FloatBarrier

%---------------------
\newpage\subsection{\texttt{btreplay} Command Line Options}
\begin{figure}[h!]
\begin{verbatim}
Usage: btreplay -- version 0.9.3

    [ -c <cpus> : --cpus=<cpus>           ] Default: 1
    [ -d <dir>  : --input-directory=<dir> ] Default: .
    [ -F        : --find-records          ] Default: Off
    [ -h        : --help                  ] Default: Off
    [ -i <base> : --input-base=<base>     ] Default: replay
    [ -I <iters>: --iterations=<iters>    ] Default: 1
    [ -M <file> : --map-devs=<file>       ] Default: None
    [ -N        : --no-stalls             ] Default: Off
    [ -x <int>  : --acc-factor=<int>      ] Default: 1
    [ -v        : --verbose               ] Default: Off
    [ -V        : --version               ] Default: Off
    [ -W        : --write-enable          ] Default: Off
    <dev...>                                Default: None
\end{verbatim}
\caption{\label{fig:btreplay--help}\texttt{btreplay --help} Output}
\end{figure}
\FloatBarrier

\subsubsection{\label{sec:p-o-c}\texttt{-c} or
\texttt{--cpus}\\Set Number of CPUs to Use}

The \texttt{-c} option requires a single parameter specifying the number
of CPUs on the replay system to utilize. If this is fewer than the number
of CPUs on the recording system, CPU IDs are wrapped. The default value
is 1.

\subsubsection{\label{sec:p-o-d}\texttt{-d} or
\texttt{--input-directory}\\Set Input Directory}

The \texttt{-d} option requires a single parameter providing the name of
the directory in which input files are to be found. The default directory is the
current directory (\texttt{.}).

\subsubsection{\texttt{-F} or \texttt{--find-records}\\Find Record Files
Automatically}

The \texttt{-F} option instructs \texttt{btreplay} to find all the
record files in the directory specified (either via the \texttt{-d}
option, or in the default directory \texttt{.}).

\subsubsection{\texttt{-h} or \texttt{--help}\\Display Help Message}
\subsubsection{\texttt{-V} or \texttt{--version}\\Display
\texttt{btreplay} Version}

The \texttt{-h} option displays the command line options and
defaults, as presented in figure~\ref{fig:btreplay--help} on
page~\pageref{fig:btreplay--help}.

The \texttt{-V} option displays the \texttt{btreplay} version, as shown here:

\begin{verbatim}
$ btreplay --version
btreplay -- version 0.9.0
\end{verbatim}

Both commands exit immediately after processing the option.

\subsubsection{\label{sec:p-o-i}\texttt{-i} or
\texttt{--input-base}\\Set Base Name for Input Files}

Each input file name has 3 fields:

\begin{enumerate}
  \item Device identifier (taken directly from the device name of the
  \texttt{blktrace} output file).

  \item \texttt{btrecord} base name -- by default ``replay''.

  \item And the CPU number (again, taken directly from the
  \texttt{blktrace} output file name).
\end{enumerate}

This option requires a single parameter, which replaces the default base
name (\texttt{replay}) with the specified value.

\subsubsection{\label{sec:p-o-I}\texttt{-I} or
\texttt{--iterations}\\Set Number of Iterations to Run}

This option requires a single parameter which specifies the number of times
to run through the input files. The default value is 1.

\subsubsection{\label{sec:p-o-M}\texttt{-M} or \texttt{--map-devs}\\
Specify Device Mappings}

This option requires a single parameter which specifies the name of a
file containing device mappings. The file format is very simple, with
just two pieces of data per line:

\begin{enumerate}
  \item The device name on the recorded system (with the \texttt{'/dev/'}
  removed). Example: \texttt{/dev/sda} would just be \texttt{sda}.

  \item The device name on the replay system to use (again, without the
  \texttt{'/dev/'} path prepended).
\end{enumerate}

An example file that maps devices \texttt{/dev/sda} and
\texttt{/dev/sdb} on the recorded system to \texttt{/dev/sdg} and
\texttt{/dev/sdh} on the replay system would be:

\begin{verbatim}
sda sdg
sdb sdh
\end{verbatim}

The only entries allowed in the file are these two-element lines
-- we do not (yet?) support the notion of blank lines, or comment lines, or
the like.
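
Because blank and comment lines are not accepted, it can help to check a
mapping file before handing it to \texttt{btreplay}. The following is
only a sketch of such a check using standard \texttt{awk}; it is not part
of these utilities, the file name \texttt{devmap.txt} is hypothetical,
and it assumes the two fields are separated by whitespace, as in the
example above. It reports every line that does not contain exactly two
fields, and exits with a non-zero status if any are found:

\begin{verbatim}
$ awk 'NF != 2 { printf "%s:%d: expected 2 fields\n", FILENAME, NR; bad = 1 }
       END { exit bad + 0 }' devmap.txt
\end{verbatim}

Note that a comment line which happens to contain exactly two fields
would not be caught by this check.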

The utility \emph{does} allow for multiple \texttt{-M} options to be
supplied on the command line.

\subsubsection{\label{sec:o-N}\texttt{-N} or \texttt{--no-stalls}\\Disable
Pre-bunch Stalls}

When specified on the command line, all pre-bunch stall indicators will be
ignored. IOs will be replayed without inter-bunch delays.

\subsubsection{\label{sec:o-x}\texttt{-x} or \texttt{--acc-factor}\\Acceleration
Factor}

While the \texttt{--no-stalls} option allows the traces to be replayed
with no waiting time, this option specifies an acceleration factor
to apply to the stall times. If a value of two is used, then each stall
time is halved, reducing the execution time
accordingly. Note that if this number is too high, the results will
be equivalent to having no stalls at all.

\subsubsection{\label{sec:p-o-v}\texttt{-v} or
\texttt{--verbose}\\Select Verbose Output}

When specified on the command line, this option instructs \texttt{btreplay}
to store, in a file, information concerning each \emph{stall} and IO operation
it performs. The name of each file so created will be
the input file name with an extension of \texttt{.rep} appended to
it. Thus, an input file named \texttt{sdab.replay.3} would generate a
verbose output file named \texttt{sdab.replay.3.rep} in the
directory specified for input files.

In addition, \texttt{btreplay} will also output to \texttt{stderr} the
names of the input files being processed.

\subsubsection{\label{sec:p-o-W}\texttt{-W} or
\texttt{--write-enable}\\Enable Writing During Replay}

As a precautionary measure, by default \texttt{btreplay} will \emph{not}
process \emph{write} requests. In order to enable \texttt{btreplay} to
actually \emph{write} to devices, one must explicitly specify the
\texttt{-W} option.

\end{document}