%
% Copyright (C) 2007 Alan D. Brunelle <Alan.Brunelle@hp.com>
%
%  This program is free software; you can redistribute it and/or modify
%  it under the terms of the GNU General Public License as published by
%  the Free Software Foundation; either version 2 of the License, or
%  (at your option) any later version.
%
%  This program is distributed in the hope that it will be useful,
%  but WITHOUT ANY WARRANTY; without even the implied warranty of
%  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
%  GNU General Public License for more details.
%
%  You should have received a copy of the GNU General Public License
%  along with this program; if not, write to the Free Software
%  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
%
%  vi :set textwidth=75
%
\documentclass{article}
\usepackage{multirow,graphicx,placeins}

\begin{document}
%---------------------
\title{\texttt{btrecord} and \texttt{btreplay} User Guide}
\author{Alan D. Brunelle (Alan.Brunelle@hp.com)}
\date{\today}
\maketitle
\begin{abstract}
\input{abstract.tex}
\end{abstract}
\thispagestyle{empty}\newpage
%---------------------
\tableofcontents\thispagestyle{empty}\newpage
%---------------------
\section{Introduction}
\input{abstract.tex}

\bigskip
This document presents the command line overview for
\texttt{btrecord} and \texttt{btreplay}, and shows some commonly used
examples of their use in everyday work here at OSLO's Scalability and
Performance Group.

\subsection*{Build Note}

To build these tools, one needs to
place the source directory next to a valid
\texttt{blktrace}\footnote{\texttt{git://git.kernel.dk/blktrace.git}}
directory, because the \texttt{Makefile} includes \texttt{../blktrace}.


%---------------------
\newpage\section{\texttt{btrecord} and \texttt{btreplay} Operating Model}

The \texttt{blktrace} utility provides the ability to collect detailed
traces from the kernel for each IO processed by the block IO layer. The
traces provide a complete timeline for each IO processed, including
detailed information concerning when an IO was first received by the block
IO layer -- indicating the device, CPU number, time stamp, IO direction,
sector number and IO size (number of sectors). Using this information,
one is able to \emph{replay} the IO again on the same machine or on another
setup entirely.

\subsection{Basic Workflow}
The basic operating workflow to replay IOs would be something like:

\begin{enumerate}
  \item Run \texttt{blktrace} to collect traces. Here you specify the
  device or devices that you wish to trace and later replay IOs upon. Note:
  the only traces you are interested in are \emph{QUEUE} requests --
  thus, to save system resources (including storage for traces), one could
  specify the \texttt{-a queue} command line option to \texttt{blktrace}.

  \item While \texttt{blktrace} is running, you run the workload that you
  are interested in.

  \item When the workload has completed, you stop the \texttt{blktrace}
  utility (thus saving all traces over the complete workload).

  \item You extract the pertinent IO information from the traces saved by
  \texttt{blktrace} using the \texttt{btrecord} utility. This will parse
  each trace file created by \texttt{blktrace}, and craft IO descriptions
  to be used in the next phase of the workload processing.

  \item Once \texttt{btrecord} has successfully created a series of data
  files to be processed, you can run the \texttt{btreplay} utility which
  attempts to generate the same IOs seen during the sample workload phase.
\end{enumerate}
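
The steps above can be sketched as a small Python helper that only
\emph{builds} the three command lines and does not run them. This is an
illustration rather than a recommended invocation: the device name
\texttt{sda} and the directory name \texttt{records} are placeholders, and
the \texttt{-d} device option of \texttt{blktrace} comes from the
\texttt{blktrace} documentation, not from this guide. All other options are
the ones described in section~\ref{sec:command-line}.

\begin{verbatim}
def workflow_commands(dev, record_dir="records", write_enable=False):
    """Return the three command lines (as argv lists) for one device."""
    # Step 1: trace only QUEUE events to save system resources.
    trace = ["blktrace", "-d", "/dev/" + dev, "-a", "queue"]
    # Step 4: turn the traces in the current directory into record files.
    record = ["btrecord", "-d", ".", "-D", record_dir, dev]
    # Step 5: replay; writes are only issued when -W is given.
    replay = ["btreplay", "-d", record_dir]
    if write_enable:
        replay.append("-W")
    replay.append(dev)
    return trace, record, replay

for argv in workflow_commands("sda"):
    print(" ".join(argv))
\end{verbatim}

Steps 2 and 3 (running the workload and stopping \texttt{blktrace}) happen
between the first and second command and are left to the user.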

\subsection{IO Stream Replay Characteristics}
  The major characteristics of the IO stream that are kept intact include:

  \begin{description}
    \item[Device] The IOs are replayed on the same device as was seen
    during the sample workload.

    \item[IO direction] The same IO direction (read/write) is maintained.

    \item[IO offset] The same device offset is maintained.

    \item[IO size] The same number of sectors is transferred.

    \item[Time differential] The time stamps stored during the
    \texttt{blktrace} run are used to determine the amount of time between
    IOs during the sample workload. \texttt{btreplay} \emph{attempts} to
    maintain the same time differential between IOs, but no guarantees as
    to complete accuracy are provided by the utility.

    \item[Device IO Stream Ordering] All IOs on a device are submitted in
    the precise order they were seen during the sample workload run.
  \end{description}

  As noted above, the time between IOs may not be accurately maintained
  during replays. In addition, the actual ordering of IOs \emph{between}
  devices is not necessarily maintained. (Each device with an IO stream
  maintains its own concept of time, and thus there may be slippage of the
  time kept between managing threads.)

  \begin{quotation}
    We have prototyped a different approach, wherein a single managing
    thread handles all IOs across all devices. This approach, while
    guaranteeing correct ordering of IOs across all devices, resulted in
    much worse timing on a per IO basis.
  \end{quotation}

\subsection{\texttt{btrecord/btreplay} Method of Operation}

As noted above, \texttt{btrecord} extracts \texttt{QUEUE} operations from
\texttt{blktrace} output. These \texttt{QUEUE} operations indicate the
entrance of IOs into the block IO layer. In order to replay these IOs with
some accuracy with regard to ordering and timeliness, we decided to take
multiple sequential (in time) IOs and put them in a single \emph{bunch} of
IOs that will be processed as a single \emph{asynchronous IO} call to the
kernel\footnote{Attempts to do them individually resulted in too large a
turnaround time penalty (user-space to kernel and back). Note that in a
number of workloads, the IOs are coming in from the page cache handling
code, and thus are submitted to the block IO layer with \emph{very small}
time intervals between issues.}. To manage the size of the \emph{bunches},
the \texttt{btrecord} utility provides you with two controlling knobs:

\begin{description}
  \item[\texttt{--max-bunch-time}] This is the amount of time to encompass
  in one bunch -- only IOs within the time specified are eligible
  for \emph{bunching}. The default time is 10 milliseconds (10,000,000
  nanoseconds). Refer to section~\ref{sec:c-o-m} on page~\pageref{sec:c-o-m}
  for more information.

  \item[\texttt{--max-pkts}] A \emph{bunch} can hold anywhere from
  1 to 512 packets, and by default a bunch is limited to no
  more than 8 individual IOs. With this option, one can increase or
  decrease the maximum \emph{bunch} size.  Refer to section~\ref{sec:c-o-M}
  on page~\pageref{sec:c-o-M} for more information.
\end{description}
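
The interaction of the two knobs can be illustrated with a minimal Python
sketch. It is not the \texttt{btrecord} implementation: the function name,
the choice to measure the time window from the first IO of a bunch, and the
treatment of an IO that lands exactly on the window boundary are all
assumptions made for illustration.

\begin{verbatim}
def make_bunches(times_ns, max_bunch_time_ns=10_000_000, max_pkts=8):
    """Group time-ordered IO time stamps (nanoseconds) into bunches."""
    bunches = []
    for t in times_ns:
        cur = bunches[-1] if bunches else None
        # A bunch is closed when it is full, or when this IO falls
        # outside the time window opened by the bunch's first IO.
        full = cur is not None and len(cur) >= max_pkts
        late = cur is not None and t - cur[0] >= max_bunch_time_ns
        if cur is None or full or late:
            bunches.append([t])
        else:
            cur.append(t)
    return bunches

# Three IOs within 2 msec, then one IO 15 msec after the first.
print(make_bunches([0, 1_000_000, 2_000_000, 15_000_000]))
\end{verbatim}

With the defaults, the first three IOs share a bunch and the fourth starts
a new one; lowering \texttt{max\_pkts} to 2 would split the first bunch
after two IOs.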

Each input data file (one per device per CPU) results in a new record
data file (again, one per device per CPU) which contains information
about \emph{bunches} of IOs to be replayed. \texttt{btreplay} operates on
these record data files by spawning a new pair of threads per file. One
thread manages the submitting of AIOs per bunch in the record data file,
while the other thread manages reclaiming AIOs completed\footnote{We
have found that having the same thread do both results in a further
reduction in replay timing accuracy.}.

Each submitting thread simply reads the input file of \emph{bunches}
recorded by \texttt{btrecord}, and attempts to faithfully reproduce the
ordering and timing of IOs seen during the sample workload. The reclaiming
thread simply waits for AIO completions, freeing up resources for the
submitting thread to utilize to submit new AIOs.
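
The division of labour between the two threads can be sketched in Python
as follows. This is only a model of the idea: no real AIO is performed, a
queue stands in for the kernel, and the names \texttt{replay\_file} and
\texttt{max\_in\_flight} are hypothetical and do not appear in
\texttt{btreplay}.

\begin{verbatim}
import queue
import threading

def replay_file(bunches, max_in_flight=4):
    """Submit bunches in order while a second thread reclaims them."""
    slots = threading.Semaphore(max_in_flight)  # free AIO resources
    in_flight = queue.Queue()
    completed = []

    def submitter():
        for bunch in bunches:
            slots.acquire()          # wait for a free resource
            in_flight.put(bunch)     # stands in for one AIO submission
        in_flight.put(None)          # tell the reclaimer we are done

    def reclaimer():
        while True:
            bunch = in_flight.get()  # stands in for an AIO completion
            if bunch is None:
                break
            completed.append(bunch)
            slots.release()          # hand the resource back

    threads = [threading.Thread(target=submitter),
               threading.Thread(target=reclaimer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return completed
\end{verbatim}

The submitter only blocks when every resource is in use, so the reclaimer
never delays a submission that still has a free slot.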

The number of CPUs being used on the replay system can be different from
the number on the recorded system. To help with the mapping, the
\texttt{--cpus} option allows one to state how many CPUs on the replay
system to utilize. If the number of CPUs on the replay system is less than
on the recording system, we wrap CPU IDs. This \emph{may} result in an
overload of CPU processing capabilities on the replay system. (Refer to
section~\ref{sec:p-o-c} on page~\pageref{sec:p-o-c} for more details about the
\texttt{--cpus} option.)
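
As a rough illustration of CPU ID wrapping, the following Python sketch
assumes that wrapping means taking the recorded CPU ID modulo the number of
replay CPUs. The guide does not spell out the exact rule, so treat the
modulo and the function name \texttt{replay\_cpu} as assumptions.

\begin{verbatim}
def replay_cpu(recorded_cpu, replay_cpus):
    """Map a recorded CPU ID onto one of replay_cpus CPUs by wrapping."""
    if replay_cpus < 1:
        raise ValueError("need at least one replay CPU")
    return recorded_cpu % replay_cpus

# Four recorded CPUs squeezed onto two replay CPUs.
print([replay_cpu(c, 2) for c in range(4)])
\end{verbatim}

Under this assumption, recorded CPUs 0 and 2 share replay CPU 0, which is
the kind of doubling-up that can overload the replay system.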

\newpage\subsection{Known Deficiencies and Proposed Possible Fixes}

The overall known deficiencies with this current set of utilities are
outlined here; in some cases, ideas on additions and/or improvements are
included as well.

\begin{enumerate}
  \item Lack of IO ordering across devices.

  \begin{quote}
    \emph{We could institute the notion of global time across threads,
    and thus ensure IO ordering across devices, with some reduction in
    timing accuracy.}
  \end{quote}

  \item Lack of IO timing accuracy -- additional time between IO bunches.

  \begin{quote}
    \emph{This is the primary problem with any IO replay mechanism -- how
    to guarantee per-IO timing accuracy with respect to other replayed IOs?
    One idea to reduce errors in this area would be to push the IO replay
    into the kernel, where you \emph{may} receive more responsive timings.}
  \end{quote}

  \item Bunching of IOs results in reduced time amongst IOs within a bunch.

  \begin{quote}
    \emph{The user has \emph{some} control over this (via the
    \texttt{--max-pkts} option). One \emph{could} simply specify
    \texttt{--max-pkts=1} and then each IO would be treated individually. Of
    course, this would probably then run into the problem of excessive
    inter-IO times.}
  \end{quote}

  \item 1-to-1 mapping of devices -- for now the devices on the replay
  machine must be the same as on the recording machine.

  \begin{quote}
    \emph{It should be relatively trivial to add in the notion of
    mapping -- simply include a file that is read which maps devices
    on one machine to devices (with offsets and sizes) on the replay
    machine\footnote{The notion of an offset and device size to replay on
    could be used to both allow for a single device to masquerade as more
    than one device, and could be utilized in case the replay device is
    smaller than the recorded device.}.}

    \medskip\emph{One could also add in the notion of CPU mappings as well --
    device $D_{rec}$ managed by CPU $C_{rec}$ on the recorded system
    shall be replayed on device $D_{rep}$ and CPU $C_{rep}$ on the
    replay machine.}

    \bigskip
    \begin{quote}
      With version 0.9.1 we now support the \texttt{-M} option to do this
      -- see section~\ref{sec:p-o-M} on page~\pageref{sec:p-o-M} for more
      information on device mapping.
    \end{quote}
  \end{quote}

\end{enumerate}

%---------------------
\newpage\section{\label{sec:command-line}Command Line Options}
\subsection{\texttt{btrecord} Command Line Options}
\begin{figure}[h!]
\begin{verbatim}
Usage: btrecord -- version 0.9.3

	[ -d <dir>  : --input-directory=<dir> ] Default: .
	[ -D <dir>  : --output-directory=<dir>] Default: .
	[ -F        : --find-traces           ] Default: Off
	[ -h        : --help                  ] Default: Off
	[ -m <nsec> : --max-bunch-time=<nsec> ] Default: 10 msec
	[ -M <pkts> : --max-pkts=<pkts>       ] Default: 8
	[ -o <base> : --output-base=<base>    ] Default: replay
	[ -v        : --verbose               ] Default: Off
	[ -V        : --version               ] Default: Off
	<dev>...                                Default: None
\end{verbatim}
\caption{\label{fig:btrecord--help}\texttt{btrecord --help} Output}
\end{figure}
\FloatBarrier

\subsubsection{\label{sec:c-o-d}\texttt{-d} or
\texttt{--input-directory}\\Set Input Directory}

The \texttt{-d} option requires a single parameter providing the name of
the directory in which input files are to be found. The default directory
is the current directory (\texttt{.}).

\subsubsection{\label{sec:c-o-D}\texttt{-D} or
\texttt{--output-directory}\\Set Output Directory}

The \texttt{-D} option requires a single parameter providing the name of
the directory in which output files are to be placed. The default directory
is the current directory (\texttt{.}).

\subsubsection{\texttt{-F} or \texttt{--find-traces}\\Find Trace Files
Automatically}

The \texttt{-F} option instructs \texttt{btrecord} to go find all the
trace files in the directory specified (either via the \texttt{-d}
option, or in the default directory \texttt{.}).

\subsubsection{\texttt{-h} or \texttt{--help}\\Display Help Message}
\subsubsection{\texttt{-V} or \texttt{--version}\\Display
\texttt{btrecord} Version}

The \texttt{-h} option displays the command line options and
defaults, as presented in figure~\ref{fig:btrecord--help} on
page~\pageref{fig:btrecord--help}.

The \texttt{-V} option displays the \texttt{btrecord} version, as shown here:

\begin{verbatim}
$ btrecord --version
btrecord -- version 0.9.0
\end{verbatim}

Both commands exit immediately after processing the option.

\subsubsection{\label{sec:c-o-m}\texttt{-m} or
\texttt{--max-bunch-time}\\Set Maximum Time Per Bunch}

The \texttt{-m} option requires a single parameter which specifies an
amount of time (in nanoseconds) to include in any one bunch of IOs that
are to be processed. The smaller the value, the smaller the number of
IOs processed at one time -- perhaps yielding a more realistic replay.
However, after a certain point the amount of overhead per bunch may result
in additional real replay time, thus yielding less accurate replay times.

The default value is 10,000,000 nanoseconds (10 milliseconds).

\subsubsection{\label{sec:c-o-M}\texttt{-M} or
\texttt{--max-pkts}\\Set Maximum Packets Per Bunch}

The \texttt{-M} option requires a single parameter which specifies the
maximum number of IOs to store in a single bunch. As with the \texttt{-m}
option (section~\ref{sec:c-o-m}), smaller values \emph{may} or \emph{may not}
yield more accurate replay times.

The default value is 8, and the maximum supported value is 512.

\subsubsection{\label{sec:c-o-o}\texttt{-o} or
\texttt{--output-base}\\Set Base Name for Output Files}

Each output file name has 3 fields:

\begin{enumerate}
  \item Device identifier (taken directly from the device name of the
  \texttt{blktrace} output file).

  \item \texttt{btrecord} base name -- by default ``replay''.

  \item And the CPU number (again, taken directly from the
  \texttt{blktrace} output file name).
\end{enumerate}

This option requires a single parameter that will override the default
base name (replay), and replace it with the specified value.
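
The three fields can be put together as in the following Python sketch,
which follows the pattern of the file name \texttt{sdab.replay.3} used
elsewhere in this guide. The helper names are hypothetical, and the
splitting helper assumes the device identifier itself contains no dot.

\begin{verbatim}
def record_file_name(device, cpu, base="replay"):
    """Build <device>.<base>.<cpu>, for example sdab.replay.3."""
    return "{}.{}.{}".format(device, base, cpu)

def split_record_file_name(name):
    """Split a record file name back into (device, base, cpu)."""
    device, rest = name.split(".", 1)   # first field: device
    base, cpu = rest.rsplit(".", 1)     # last field: CPU number
    return device, base, int(cpu)

print(record_file_name("sdab", 3))
print(split_record_file_name("sdab.mybase.0"))
\end{verbatim}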

\subsubsection{\label{sec:c-o-v}\texttt{-v} or
\texttt{--verbose}\\Select Verbose Output}

This option will output some simple statistics at the end of a successful
run. Figure~\ref{fig:verb-out} (page~\pageref{fig:verb-out}) shows
an example of some output, while figure~\ref{fig:verb-defs}
(page~\pageref{fig:verb-defs}) shows what the fields mean.

\begin{figure}[h!]
\begin{verbatim}
sdab:0: 580661 pkts (tot), 126030 pkts (replay), 89809 bunches, 1.4 pkts/bunch
sdab:1: 2559775 pkts (tot), 430172 pkts (replay), 293029 bunches, 1.5 pkts/bunch
sdab:2: 653559 pkts (tot), 136522 pkts (replay), 102288 bunches, 1.3 pkts/bunch
sdab:3: 474773 pkts (tot), 117849 pkts (replay), 69572 bunches, 1.7 pkts/bunch
\end{verbatim}
\caption{\label{fig:verb-out}Verbose Output Example}
\end{figure}
\FloatBarrier

\begin{figure}[h!]
\begin{description}
  \item[Field 1] The first field contains the device name and CPU
  identifier. Thus: \texttt{sdab:0:} means the device \texttt{sdab} and
  traces on CPU 0.

  \item[Field 2] The second field contains the total number of packets
  processed for each device file.

  \item[Field 3] The next field shows the number of packets eligible for
  replay.

  \item[Field 4] The fourth field contains the total number of IO bunches.

  \item[Field 5] The last field shows the average number of IOs per bunch
  recorded.
\end{description}
\caption{\label{fig:verb-defs}Verbose Field Definitions}
\end{figure}
\FloatBarrier
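
For post-processing, the verbose lines can be picked apart with a short
Python sketch such as the one below. The regular expression is derived
only from the sample output in figure~\ref{fig:verb-out}; it assumes
device names made of letters, digits and underscores, and the function name
\texttt{parse\_verbose} is hypothetical. In the sample output, field 5
agrees with field 3 divided by field 4, rounded to one decimal place.

\begin{verbatim}
import re

# One line of btrecord -v output, as in the example figure.
LINE = re.compile(
    r"(\w+):(\d+): (\d+) pkts \(tot\), (\d+) pkts \(replay\), "
    r"(\d+) bunches, ([\d.]+) pkts/bunch"
)

def parse_verbose(line):
    """Split one verbose line into its fields."""
    m = LINE.match(line)
    if m is None:
        raise ValueError("unrecognized line: " + line)
    dev, cpu, total, replay, bunches, avg = m.groups()
    return {
        "device": dev, "cpu": int(cpu),
        "total": int(total), "replay": int(replay),
        "bunches": int(bunches), "avg": float(avg),
    }

row = parse_verbose("sdab:0: 580661 pkts (tot), 126030 pkts (replay), "
                    "89809 bunches, 1.4 pkts/bunch")
# Recompute the average from fields 3 and 4 and compare with field 5.
print(round(row["replay"] / row["bunches"], 1) == row["avg"])
\end{verbatim}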

%---------------------
\newpage\subsection{\texttt{btreplay} Command Line Options}
\begin{figure}[h!]
\begin{verbatim}
Usage: btreplay -- version 0.9.3

	[ -c <cpus> : --cpus=<cpus>           ] Default: 1
	[ -d <dir>  : --input-directory=<dir> ] Default: .
	[ -F        : --find-records          ] Default: Off
	[ -h        : --help                  ] Default: Off
	[ -i <base> : --input-base=<base>     ] Default: replay
	[ -I <iters>: --iterations=<iters>    ] Default: 1
	[ -M <file> : --map-devs=<file>       ] Default: None
	[ -N        : --no-stalls             ] Default: Off
	[ -x <int>  : --acc-factor=<int>      ] Default: 1
	[ -v        : --verbose               ] Default: Off
	[ -V        : --version               ] Default: Off
	[ -W        : --write-enable          ] Default: Off
	<dev...>                                Default: None
\end{verbatim}
\caption{\label{fig:btreplay--help}\texttt{btreplay --help} Output}
\end{figure}
\FloatBarrier

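\subsubsection{\label{sec:p-o-c}\texttt{-c} or
\texttt{--cpus}\\Set Number of CPUs to Use}

The \texttt{-c} option requires a single parameter which specifies the
number of CPUs on the replay system to utilize. If this is less than the
number of CPUs on the recording system, CPU IDs are wrapped. The default
shown in figure~\ref{fig:btreplay--help} is 1.
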
\subsubsection{\label{sec:p-o-d}\texttt{-d} or
\texttt{--input-directory}\\Set Input Directory}

The \texttt{-d} option requires a single parameter providing the name of
the directory in which input files are to be found. The default directory
is the current directory (\texttt{.}).

\subsubsection{\texttt{-F} or \texttt{--find-records}\\Find Record Files
Automatically}

The \texttt{-F} option instructs \texttt{btreplay} to go find all the
record files in the directory specified (either via the \texttt{-d}
option, or in the default directory \texttt{.}).

\subsubsection{\texttt{-h} or \texttt{--help}\\Display Help Message}
\subsubsection{\texttt{-V} or \texttt{--version}\\Display
\texttt{btreplay} Version}

The \texttt{-h} option displays the command line options and
defaults, as presented in figure~\ref{fig:btreplay--help} on
page~\pageref{fig:btreplay--help}.

The \texttt{-V} option displays the \texttt{btreplay} version, as shown here:

\begin{verbatim}
$ btreplay --version
btreplay -- version 0.9.0
\end{verbatim}

Both commands exit immediately after processing the option.

\subsubsection{\label{sec:p-o-i}\texttt{-i} or
\texttt{--input-base}\\Set Base Name for Input Files}

Each input file name has 3 fields:

\begin{enumerate}
  \item Device identifier (taken directly from the device name of the
  \texttt{blktrace} output file).

  \item \texttt{btrecord} base name -- by default ``replay''.

  \item And the CPU number (again, taken directly from the
  \texttt{blktrace} output file name).
\end{enumerate}

This option requires a single parameter that will override the default
base name (replay), and replace it with the specified value.

\subsubsection{\label{sec:p-o-I}\texttt{-I} or
\texttt{--iterations}\\Set Number of Iterations to Run}

This option requires a single parameter which specifies the number of times
to run through the input files. The default value is 1.

\subsubsection{\label{sec:p-o-M}\texttt{-M} or \texttt{--map-devs}\\
Specify Device Mappings}

This option requires a single parameter which specifies the name of a
file containing device mappings. The file format is very simple, with
just two pieces of data per line:

\begin{enumerate}
  \item The device name on the recorded system (with the \texttt{'/dev/'}
  removed). Example: \texttt{/dev/sda} would just be \texttt{sda}.

  \item The device name on the replay system to use (again, without the
  \texttt{'/dev/'} path prepended).
\end{enumerate}

An example file that maps devices \texttt{/dev/sda} and
\texttt{/dev/sdb} on the recorded system to \texttt{/dev/sdg} and
\texttt{/dev/sdh} on the replay system would be:

\begin{verbatim}
sda sdg
sdb sdh
\end{verbatim}

The only entries allowed in the file are these two-element lines
-- we do not (yet?) support the notion of blank lines, or comment lines, or
the like.

The utility \emph{does} allow for multiple \texttt{-M} options to be
supplied on the command line.
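
A map file can be checked before use with a strict reader such as the
following Python sketch. It is not the parser used by \texttt{btreplay};
it simply enforces the rule stated above, that every line must consist of
exactly two device names, and the function name \texttt{parse\_dev\_map}
is hypothetical.

\begin{verbatim}
def parse_dev_map(text):
    """Parse 'recorded replay' lines into a dict; reject anything else."""
    mapping = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        # Blank lines and comment lines are not supported, so any line
        # without exactly two fields is treated as an error.
        if len(fields) != 2:
            raise ValueError("line %d: expected two device names" % lineno)
        recorded, replay = fields
        mapping[recorded] = replay
    return mapping

print(parse_dev_map("sda sdg\nsdb sdh\n"))
\end{verbatim}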

\subsubsection{\label{sec:o-N}\texttt{-N} or \texttt{--no-stalls}\\Disable
Pre-bunch Stalls}

When specified on the command line, all pre-bunch stall indicators will be
ignored. IOs will be replayed without inter-bunch delays.

\subsubsection{\label{sec:o-x}\texttt{-x} or \texttt{--acc-factor}\\Acceleration
Factor}

  While the \texttt{--no-stalls} option allows the traces to be replayed
  with no waiting time, this option specifies an acceleration factor
  to apply to the stall times. With a value of two, each stall time is
  halved, resulting in a corresponding reduction of the execution time.
  Note that if the factor is too high, the result will be equivalent to
  having no stalls at all.
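
The effect of \texttt{--no-stalls} and \texttt{--acc-factor} on a single
pre-bunch stall can be illustrated with the Python sketch below. The
function name, the use of integer division and the rejection of factors
below 1 are assumptions made for illustration, not a description of the
\texttt{btreplay} source.

\begin{verbatim}
def stall_ns(recorded_stall_ns, acc_factor=1, no_stalls=False):
    """Return how long to wait before submitting the next bunch."""
    if no_stalls:
        return 0                       # --no-stalls: never wait
    if acc_factor < 1:
        raise ValueError("acceleration factor must be at least 1")
    # --acc-factor: shrink the recorded stall by the given factor.
    return recorded_stall_ns // acc_factor

print(stall_ns(10_000_000, acc_factor=2))
\end{verbatim}

With a very large factor the computed stall approaches zero, which is why
a high acceleration factor behaves like \texttt{--no-stalls}.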

\subsubsection{\label{sec:p-o-v}\texttt{-v} or
\texttt{--verbose}\\Select Verbose Output}

When specified on the command line, this option instructs \texttt{btreplay}
to store information concerning each \emph{stall} and IO operation
performed by \texttt{btreplay}. The name of each file so created will be
the input file name used with an extension of \texttt{.rep} appended onto
it. Thus, an input file of the name \texttt{sdab.replay.3} would generate a
verbose output file with the name \texttt{sdab.replay.3.rep} in the
directory specified for input files.

In addition, \texttt{btreplay} will also output to \texttt{stderr} the
names of the input files being processed.

\subsubsection{\label{sec:p-o-W}\texttt{-W} or
\texttt{--write-enable}\\Enable Writing During Replay}

As a precautionary measure, by default \texttt{btreplay} will \emph{not}
process \emph{write} requests. In order to enable \texttt{btreplay} to
actually \emph{write} to devices one must explicitly specify the
\texttt{-W} option.

\end{document}