      1 \input texinfo  @c -*-texinfo-*-
      2 @c
      3 @c -- Stuff that needs adding: ----------------------------------------------
      4 @c (document the `;' command-separator)
      5 @c --------------------------------------------------------------------------
      6 @c Check for consistency: regexps in @code, text that they match in @samp.
      7 @c 
      8 @c Tips:
      9 @c    @command for command
     10 @c    @samp for command fragments: @samp{cat -s}
     11 @c    @code for sed commands and flags
     12 @c    Use ``quote'' not `quote' or "quote".
     13 @c
     14 @c %**start of header
     15 @setfilename sed.info
     16 @settitle sed, a stream editor
     17 @c %**end of header
     18 
     19 @c @smallbook
     20 
     21 @include version.texi
     22 
     23 @c Combine indices.
     24 @syncodeindex ky cp
     25 @syncodeindex pg cp
     26 @syncodeindex tp cp
     27 
     28 @defcodeindex op
     29 @syncodeindex op fn
     30 
     31 @include config.texi
     32 
     33 @copying
     34 This file documents version @value{VERSION} of
     35 @value{SSED}, a stream editor.
     36 
     37 Copyright @copyright{} 1998, 1999, 2001, 2002, 2003, 2004 Free
     38 Software Foundation, Inc.
     39 
     40 This document is released under the terms of the @acronym{GNU} Free
     41 Documentation License as published by the Free Software Foundation;
     42 either version 1.1, or (at your option) any later version.
     43 
     44 You should have received a copy of the @acronym{GNU} Free Documentation
     45 License along with @value{SSED}; see the file @file{COPYING.DOC}.
     46 If not, write to the Free Software Foundation, 59 Temple Place - Suite
     47 330, Boston, MA 02110-1301, USA.
     48 
     49 There are no Cover Texts and no Invariant Sections; this text, along
     50 with its equivalent in the printed manual, constitutes the Title Page.
     51 @end copying
     52 
     53 @setchapternewpage off
     54 
     55 @titlepage
     56 @title @command{sed}, a stream editor
     57 @subtitle version @value{VERSION}, @value{UPDATED}
     58 @author by Ken Pizzini, Paolo Bonzini
     59 
     60 @page
     61 @vskip 0pt plus 1filll
     62 Copyright @copyright{} 1998, 1999 Free Software Foundation, Inc.
     63 
     64 @insertcopying
     65 
     66 Published by the Free Software Foundation, @*
     67 51 Franklin Street, Fifth Floor @*
     68 Boston, MA 02110-1301, USA
     69 @end titlepage
     70 
     71 
     72 @node Top
     73 @top
     74 
     75 @ifnottex
     76 @insertcopying
     77 @end ifnottex
     78 
     79 @menu
     80 * Introduction::               Introduction
     81 * Invoking sed::               Invocation
     82 * sed Programs::               @command{sed} programs
     83 * Examples::                   Some sample scripts
     84 * Limitations::                Limitations and (non-)limitations of @value{SSED}
     85 * Other Resources::            Other resources for learning about @command{sed}
     86 * Reporting Bugs::             Reporting bugs
     87 
     88 * Extended regexps::           @command{egrep}-style regular expressions
     89 @ifset PERL
     90 * Perl regexps::               Perl-style regular expressions
     91 @end ifset
     92 
     93 * Concept Index::              A menu with all the topics in this manual.
     94 * Command and Option Index::   A menu with all @command{sed} commands and
     95                                command-line options.
     96 
     97 @detailmenu
     98 --- The detailed node listing ---
     99 
    100 sed Programs:
    101 * Execution Cycle::                 How @command{sed} works
    102 * Addresses::                       Selecting lines with @command{sed}
    103 * Regular Expressions::             Overview of regular expression syntax
    104 * Common Commands::                 Often used commands
    105 * The "s" Command::                 @command{sed}'s Swiss Army Knife
    106 * Other Commands::                  Less frequently used commands
    107 * Programming Commands::            Commands for @command{sed} gurus
* Extended Commands::               Commands specific to @value{SSED}
    109 * Escapes::                         Specifying special characters
    110 
    111 Examples:
    112 * Centering lines::
    113 * Increment a number::
    114 * Rename files to lower case::
    115 * Print bash environment::
    116 * Reverse chars of lines::
    117 * tac::                             Reverse lines of files
    118 * cat -n::                          Numbering lines
    119 * cat -b::                          Numbering non-blank lines
    120 * wc -c::                           Counting chars
    121 * wc -w::                           Counting words
    122 * wc -l::                           Counting lines
    123 * head::                            Printing the first lines
    124 * tail::                            Printing the last lines
    125 * uniq::                            Make duplicate lines unique
    126 * uniq -d::                         Print duplicated lines of input
    127 * uniq -u::                         Remove all duplicated lines
    128 * cat -s::                          Squeezing blank lines
    129 
    130 @ifset PERL
Perl regexps:
    132 * Backslash::                       Introduces special sequences
    133 * Circumflex/dollar sign/period::   Behave specially with regard to new lines
    134 * Square brackets::                 Are a bit different in strange cases
    135 * Options setting::                 Toggle modifiers in the middle of a regexp
    136 * Non-capturing subpatterns::       Are not counted when backreferencing
    137 * Repetition::                      Allows for non-greedy matching
    138 * Backreferences::                  Allows for more than 10 back references
    139 * Assertions::                      Allows for complex look ahead matches
    140 * Non-backtracking subpatterns::    Often gives more performance
    141 * Conditional subpatterns::         Allows if/then/else branches
    142 * Recursive patterns::              For example to match parentheses
    143 * Comments::                        Because things can get complex...
    144 @end ifset
    145 
    146 @end detailmenu
    147 @end menu
    148 
    149 
    150 @node Introduction
    151 @chapter Introduction
    152 
    153 @cindex Stream editor
    154 @command{sed} is a stream editor.
    155 A stream editor is used to perform basic text
    156 transformations on an input stream
    157 (a file or input from a pipeline).
    158 While in some ways similar to an editor which
    159 permits scripted edits (such as @command{ed}),
    160 @command{sed} works by making only one pass over the
    161 input(s), and is consequently more efficient.
    162 But it is @command{sed}'s ability to filter text in a pipeline
    163 which particularly distinguishes it from other types of
    164 editors.
    165 
    166 
    167 @node Invoking sed
    168 @chapter Invocation
    169 
    170 Normally @command{sed} is invoked like this:
    171 
    172 @example
    173 sed SCRIPT INPUTFILE...
    174 @end example
    175 
    176 The full format for invoking @command{sed} is:
    177 
    178 @example
    179 sed OPTIONS... [SCRIPT] [INPUTFILE...]
    180 @end example
    181 
If you do not specify @var{INPUTFILE}, or if @var{INPUTFILE} is @file{-},
@command{sed} filters the contents of the standard input.  The @var{SCRIPT}
is actually the first non-option parameter, which @command{sed} treats
as a script rather than as an input file if (and only if) no option
specifies a script to be executed, that is, if neither
the @option{-e} nor the @option{-f} option is given.
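
As a minimal sketch of the two ways of supplying the script (the input
text is made up for illustration), the following commands should behave
identically.  In the second one the script comes from @option{-e}, so
the remaining argument @file{-} is taken as an input file and names the
standard input.

@example
# Script given as the first non-option argument.
printf 'hello world\n' | sed 's/hello/goodbye/'

# Script given with -e; `-' explicitly selects standard input.
printf 'hello world\n' | sed -e 's/hello/goodbye/' -
@end example

Both commands print @samp{goodbye world}.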
    188 
    189 @command{sed} may be invoked with the following command-line options:
    190 
    191 @table @code
    192 @item --version
    193 @opindex --version
    194 @cindex Version, printing
    195 Print out the version of @command{sed} that is being run and a copyright notice,
    196 then exit.
    197 
    198 @item --help
    199 @opindex --help
    200 @cindex Usage summary, printing
    201 Print a usage message briefly summarizing these command-line options
    202 and the bug-reporting address,
    203 then exit.
    204 
    205 @item -n
    206 @itemx --quiet
    207 @itemx --silent
    208 @opindex -n
    209 @opindex --quiet
    210 @opindex --silent
    211 @cindex Disabling autoprint, from command line
    212 By default, @command{sed} prints out the pattern space
    213 at the end of each cycle through the script (@pxref{Execution Cycle, ,
    214 How @code{sed} works}).
    215 These options disable this automatic printing,
    216 and @command{sed} only produces output when explicitly told to
    217 via the @code{p} command.
    218 
    219 @item -e @var{script}
    220 @itemx --expression=@var{script}
    221 @opindex -e
    222 @opindex --expression
    223 @cindex Script, from command line
    224 Add the commands in @var{script} to the set of commands to be
    225 run while processing the input.
    226 
    227 @item -f @var{script-file}
    228 @itemx --file=@var{script-file}
    229 @opindex -f
    230 @opindex --file
    231 @cindex Script, from a file
    232 Add the commands contained in the file @var{script-file}
    233 to the set of commands to be run while processing the input.
    234 
    235 @item -i[@var{SUFFIX}]
    236 @itemx --in-place[=@var{SUFFIX}]
    237 @opindex -i
    238 @opindex --in-place
    239 @cindex In-place editing, activating
    240 @cindex @value{SSEDEXT}, in-place editing
    241 This option specifies that files are to be edited in-place.
    242 @value{SSED} does this by creating a temporary file and
    243 sending output to this file rather than to the standard
    244 output.@footnote{This applies to commands such as @code{=},
    245 @code{a}, @code{c}, @code{i}, @code{l}, @code{p}.  You can
    246 still write to the standard output by using the @code{w}
    247 @cindex @value{SSEDEXT}, @file{/dev/stdout} file
    248 or @code{W} commands together with the @file{/dev/stdout}
    249 special file}.
    250 
    251 This option implies @option{-s}.
    252 
When the end of the file is reached, the temporary file is
renamed to the output file's original name.  The extension,
if supplied, is used to modify the name of the old file
before renaming the temporary file, thereby making a backup
copy.@footnote{Note that @value{SSED} creates the backup
file whether or not any output is actually changed.}
    259 
    260 @cindex In-place editing, Perl-style backup file names
    261 This rule is followed: if the extension doesn't contain a @code{*},
    262 then it is appended to the end of the current filename as a
    263 suffix; if the extension does contain one or more @code{*}
    264 characters, then @emph{each} asterisk is replaced with the
    265 current filename.  This allows you to add a prefix to the
    266 backup file, instead of (or in addition to) a suffix, or
    267 even to place backup copies of the original files into another
    268 directory (provided the directory already exists).
    269 
    270 If no extension is supplied, the original file is
    271 overwritten without making a backup.
    272 
    273 @item -l @var{N}
    274 @itemx --line-length=@var{N}
    275 @opindex -l
    276 @opindex --line-length
    277 @cindex Line length, setting
    278 Specify the default line-wrap length for the @code{l} command.
    279 A length of 0 (zero) means to never wrap long lines.  If
    280 not specified, it is taken to be 70.
    281 
    282 @item --posix
    283 @cindex @value{SSEDEXT}, disabling
    284 @value{SSED} includes several extensions to @acronym{POSIX}
    285 sed.  In order to simplify writing portable scripts, this
    286 option disables all the extensions that this manual documents,
    287 including additional commands.
    288 @cindex @code{POSIXLY_CORRECT} behavior, enabling
    289 Most of the extensions accept @command{sed} programs that
    290 are outside the syntax mandated by @acronym{POSIX}, but some
of them (such as the behavior of the @code{N} command;
@pxref{Reporting Bugs}) actually violate the
    293 standard.  If you want to disable only the latter kind of
    294 extension, you can set the @code{POSIXLY_CORRECT} variable
    295 to a non-empty value.
    296 
    297 @item -b
    298 @itemx --binary
    299 @opindex -b
    300 @opindex --binary
    301 This option is available on every platform, but is only effective where the
    302 operating system makes a distinction between text files and binary files.
When such a distinction is made---as is the case for MS-DOS, Windows, and
Cygwin---text files are composed of lines separated by a carriage return
    305 @emph{and} a line feed character, and @command{sed} does not see the
    306 ending CR.  When this option is specified, @command{sed} will open
    307 input files in binary mode, thus not requesting this special processing
    308 and considering lines to end at a line feed.
    309 
    310 @item --follow-symlinks
    311 @opindex --follow-symlinks
    312 This option is available only on platforms that support
    313 symbolic links and has an effect only if option @option{-i}
    314 is specified.  In this case, if the file that is specified
    315 on the command line is a symbolic link, @command{sed} will
    316 follow the link and edit the ultimate destination of the
    317 link.  The default behavior is to break the symbolic link,
    318 so that the link destination will not be modified.
    319 
    320 @item -r
    321 @itemx --regexp-extended
    322 @opindex -r
    323 @opindex --regexp-extended
    324 @cindex Extended regular expressions, choosing
    325 @cindex @acronym{GNU} extensions, extended regular expressions
    326 Use extended regular expressions rather than basic
    327 regular expressions.  Extended regexps are those that
@command{egrep} accepts; they can be clearer because they
usually have fewer backslashes, but are a @acronym{GNU} extension
    330 and hence scripts that use them are not portable.
    331 @xref{Extended regexps, , Extended regular expressions}.
    332 
    333 @ifset PERL
    334 @item -R
    335 @itemx --regexp-perl
    336 @opindex -R
    337 @opindex --regexp-perl
    338 @cindex Perl-style regular expressions, choosing
    339 @cindex @value{SSEDEXT}, Perl-style regular expressions
    340 Use Perl-style regular expressions rather than basic
    341 regular expressions.  Perl-style regexps are extremely
powerful but are a @value{SSED} extension and hence scripts that
use them are not portable.  @xref{Perl regexps, ,
    344 Perl-style regular expressions}.
    345 @end ifset
    346 
    347 @item -s
@itemx --separate
@opindex -s
@opindex --separate
    349 @cindex Working on separate files
    350 By default, @command{sed} will consider the files specified on the
    351 command line as a single continuous long stream.  This @value{SSED}
    352 extension allows the user to consider them as separate files:
    353 range addresses (such as @samp{/abc/,/def/}) are not allowed
    354 to span several files, line numbers are relative to the start
    355 of each file, @code{$} refers to the last line of each file,
    356 and files invoked from the @code{R} commands are rewound at the
    357 start of each file.
    358 
    359 @item -u
    360 @itemx --unbuffered
    361 @opindex -u
    362 @opindex --unbuffered
    363 @cindex Unbuffered I/O, choosing
    364 Buffer both input and output as minimally as practical.
    365 (This is particularly useful if the input is coming from
    366 the likes of @samp{tail -f}, and you wish to see the transformed
    367 output as soon as possible.)
    368 
    369 @end table
    370 
    371 If no @option{-e}, @option{-f}, @option{--expression}, or @option{--file}
    372 options are given on the command-line,
    373 then the first non-option argument on the command line is
    374 taken to be the @var{script} to be executed.
    375 
    376 @cindex Files to be processed as input
    377 If any command-line parameters remain after processing the above,
    378 these parameters are interpreted as the names of input files to
    379 be processed.
    380 @cindex Standard input, processing as input
    381 A file name of @samp{-} refers to the standard input stream.
    382 The standard input will be processed if no file names are specified.
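
The following sketch exercises a few of the options described above.
The file names @file{one.txt} and @file{two.txt} are hypothetical and
are created only for the demonstration; @option{-s} and @option{-i}
are extensions, so this assumes @value{SSED} or a compatible
@command{sed}.

@example
# -n suppresses automatic printing; only the `p' command prints.
seq 3 | sed -n '2p'

# Several -e options are combined, in order, into one script.
echo abc | sed -e 's/a/x/' -e 's/c/z/'

# With -s, `$' matches the last line of each file.
printf 'a\nb\n' > one.txt
printf 'c\nd\n' > two.txt
sed -s -n '$p' one.txt two.txt

# Edit one.txt in place, keeping the original as one.txt.bak.
sed -i.bak 's/a/A/' one.txt
@end example

The first three commands print @samp{2}, @samp{xbz}, and the two lines
@samp{b} and @samp{d}, respectively.  After the last command,
@file{one.txt} starts with @samp{A} while @file{one.txt.bak} still
holds the original contents.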
    383 
    384 
    385 @node sed Programs
    386 @chapter @command{sed} Programs
    387 
    388 @cindex @command{sed} program structure
    389 @cindex Script structure
    390 A @command{sed} program consists of one or more @command{sed} commands,
    391 passed in by one or more of the
    392 @option{-e}, @option{-f}, @option{--expression}, and @option{--file}
options, or the first non-option argument if none of these
options is used.
    395 This document will refer to ``the'' @command{sed} script;
    396 this is understood to mean the in-order catenation
    397 of all of the @var{script}s and @var{script-file}s passed in.
    398 
    399 Each @code{sed} command consists of an optional address or
    400 address range, followed by a one-character command name
    401 and any additional command-specific code.
    402 
    403 @menu
    404 * Execution Cycle::          How @command{sed} works
    405 * Addresses::                Selecting lines with @command{sed}
    406 * Regular Expressions::      Overview of regular expression syntax
    407 * Common Commands::          Often used commands
    408 * The "s" Command::          @command{sed}'s Swiss Army Knife
    409 * Other Commands::           Less frequently used commands
    410 * Programming Commands::     Commands for @command{sed} gurus
* Extended Commands::        Commands specific to @value{SSED}
    412 * Escapes::                  Specifying special characters
    413 @end menu
    414 
    415 
    416 @node Execution Cycle
    417 @section How @command{sed} Works
    418 
    419 @cindex Buffer spaces, pattern and hold
    420 @cindex Spaces, pattern and hold
    421 @cindex Pattern space, definition
    422 @cindex Hold space, definition
    423 @command{sed} maintains two data buffers: the active @emph{pattern} space,
    424 and the auxiliary @emph{hold} space. Both are initially empty.
    425 
@command{sed} operates by performing the following cycle on each
line of input: first, @command{sed} reads one line from the input
stream, removes any trailing newline, and places it in the pattern space.
Then commands are executed; each command can have an address associated
with it: addresses are a kind of condition code, and a command is only
executed if the condition is verified before the command is to be
executed.
    433 
    434 When the end of the script is reached, unless the @option{-n} option
    435 is in use, the contents of pattern space are printed out to the output
    436 stream, adding back the trailing newline if it was removed.@footnote{Actually,
    437 if @command{sed} prints a line without the terminating newline, it will
    438 nevertheless print the missing newline as soon as more text is sent to
    439 the same output stream, which gives the ``least expected surprise''
    440 even though it does not make commands like @samp{sed -n p} exactly
    441 identical to @command{cat}.} Then the next cycle starts for the next
    442 input line.
    443 
    444 Unless special commands (like @samp{D}) are used, the pattern space is
    445 deleted between two cycles. The hold space, on the other hand, keeps
    446 its data between cycles (see commands @samp{h}, @samp{H}, @samp{x},
    447 @samp{g}, @samp{G} to move data between both buffers).
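
As an illustrative sketch of how the two buffers cooperate across
cycles (the three input lines are made up), the following command
prints its input in reverse line order.  On every line but the first,
@code{G} appends the hold space to the pattern space; @code{h} then
copies the pattern space to the hold space, so the hold space
accumulates the lines seen so far in reverse; on the last line,
@code{p} prints the result.

@example
printf 'one\ntwo\nthree\n' | sed -n -e '1!G' -e 'h' -e '$p'
@end example

The output is @samp{three}, @samp{two}, @samp{one}, one per line.
Because @option{-n} is given, nothing is printed at the end of the
earlier cycles.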
    448 
    449 
    450 @node Addresses
    451 @section Selecting lines with @command{sed}
    452 @cindex Addresses, in @command{sed} scripts
    453 @cindex Line selection
    454 @cindex Selecting lines to process
    455 
    456 Addresses in a @command{sed} script can be in any of the following forms:
    457 @table @code
    458 @item @var{number}
    459 @cindex Address, numeric
    460 @cindex Line, selecting by number
    461 Specifying a line number will match only that line in the input.
    462 (Note that @command{sed} counts lines continuously across all input files
    463 unless @option{-i} or @option{-s} options are specified.)
    464 
    465 @item @var{first}~@var{step}
    466 @cindex @acronym{GNU} extensions, @samp{@var{n}~@var{m}} addresses
    467 This @acronym{GNU} extension matches every @var{step}th line
    468 starting with line @var{first}.
    469 In particular, lines will be selected when there exists
    470 a non-negative @var{n} such that the current line-number equals
    471 @var{first} + (@var{n} * @var{step}).
    472 Thus, to select the odd-numbered lines,
    473 one would use @code{1~2};
    474 to pick every third line starting with the second, @samp{2~3} would be used;
    475 to pick every fifth line starting with the tenth, use @samp{10~5};
    476 and @samp{50~0} is just an obscure way of saying @code{50}.
    477 
    478 @item $
    479 @cindex Address, last line
    480 @cindex Last line, selecting
    481 @cindex Line, selecting last
    482 This address matches the last line of the last file of input, or
    483 the last line of each file when the @option{-i} or @option{-s} options
    484 are specified.
    485 
    486 @item /@var{regexp}/
    487 @cindex Address, as a regular expression
    488 @cindex Line, selecting by regular expression match
    489 This will select any line which matches the regular expression @var{regexp}.
    490 If @var{regexp} itself includes any @code{/} characters,
    491 each must be escaped by a backslash (@code{\}).
    492 
    493 @cindex empty regular expression
    494 @cindex @value{SSEDEXT}, modifiers and the empty regular expression
    495 The empty regular expression @samp{//} repeats the last regular
    496 expression match (the same holds if the empty regular expression is
    497 passed to the @code{s} command).  Note that modifiers to regular expressions
    498 are evaluated when the regular expression is compiled, thus it is invalid to
    499 specify them together with the empty regular expression.
    500 
    501 @item \%@var{regexp}%
    502 (The @code{%} may be replaced by any other single character.)
    503 
    504 @cindex Slash character, in regular expressions
    505 This also matches the regular expression @var{regexp},
    506 but allows one to use a different delimiter than @code{/}.
    507 This is particularly useful if the @var{regexp} itself contains
    508 a lot of slashes, since it avoids the tedious escaping of every @code{/}.
    509 If @var{regexp} itself includes any delimiter characters,
    510 each must be escaped by a backslash (@code{\}).
    511 
    512 @item /@var{regexp}/I
    513 @itemx \%@var{regexp}%I
    514 @cindex @acronym{GNU} extensions, @code{I} modifier
    515 @ifset PERL
    516 @cindex Perl-style regular expressions, case-insensitive
    517 @end ifset
    518 The @code{I} modifier to regular-expression matching is a @acronym{GNU}
    519 extension which causes the @var{regexp} to be matched in
    520 a case-insensitive manner.
    521 
    522 @item /@var{regexp}/M
    523 @itemx \%@var{regexp}%M
@cindex @value{SSEDEXT}, @code{M} modifier
@ifset PERL
@cindex Perl-style regular expressions, multiline
@end ifset
    528 The @code{M} modifier to regular-expression matching is a @value{SSED}
    529 extension which causes @code{^} and @code{$} to match respectively
    530 (in addition to the normal behavior) the empty string after a newline,
    531 and the empty string before a newline.  There are special character
    532 sequences
    533 @ifset PERL
    534 (@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'}
    535 in basic or extended regular expression modes)
    536 @end ifset
    537 @ifclear PERL
    538 (@code{\`} and @code{\'})
    539 @end ifclear
    540 which always match the beginning or the end of the buffer.
    541 @code{M} stands for @cite{multi-line}.
    542 
    543 @ifset PERL
    544 @item /@var{regexp}/S
    545 @itemx \%@var{regexp}%S
    546 @cindex @value{SSEDEXT}, @code{S} modifier
    547 @cindex Perl-style regular expressions, single line
    548 The @code{S} modifier to regular-expression matching is only valid
    549 in Perl mode and specifies that the dot character (@code{.}) will
    550 match the newline character too.  @code{S} stands for @cite{single-line}.
    551 @end ifset
    552 
    553 @ifset PERL
    554 @item /@var{regexp}/X
    555 @itemx \%@var{regexp}%X
    556 @cindex @value{SSEDEXT}, @code{X} modifier
    557 @cindex Perl-style regular expressions, extended
    558 The @code{X} modifier to regular-expression matching is also
    559 valid in Perl mode only.  If it is used, whitespace in the
    560 pattern (other than in a character class) and
    561 characters between a @kbd{#} outside a character class and the
    562 next newline character are ignored. An escaping backslash
    563 can be used to include a whitespace or @kbd{#} character as part
    564 of the pattern.
    565 @end ifset
    566 @end table
    567 
    568 If no addresses are given, then all lines are matched;
    569 if one address is given, then only lines matching that
    570 address are matched.
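
A few one-address commands, sketched with made-up input, may make the
forms above concrete.  Each uses @option{-n} together with the
@code{p} command so that only the selected lines are printed; the
@samp{@var{first}~@var{step}} form and the @code{I} modifier assume a
@command{sed} that supports these @acronym{GNU} extensions.

@example
seq 10 | sed -n '3p'      # line 3 only
seq 10 | sed -n '1~2p'    # odd-numbered lines: 1 3 5 7 9
seq 10 | sed -n '$p'      # last line: 10

# Case-insensitive match: prints Foo.
printf 'Foo\nbar\n' | sed -n '/foo/Ip'

# Alternate delimiter, no need to escape slashes: prints /usr/bin.
printf '/usr/bin\n/etc\n' | sed -n '\%^/usr/%p'
@end example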
    571 
    572 @cindex Range of lines
    573 @cindex Several lines, selecting
    574 An address range can be specified by specifying two addresses
    575 separated by a comma (@code{,}).  An address range matches lines
    576 starting from where the first address matches, and continues
    577 until the second address matches (inclusively).
    578 
    579 If the second address is a @var{regexp}, then checking for the
    580 ending match will start with the line @emph{following} the
    581 line which matched the first address: a range will always
    582 span at least two lines (except of course if the input stream
    583 ends).
    584 
    585 If the second address is a @var{number} less than (or equal to)
    586 the line matching the first address, then only the one line is
    587 matched.
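
The range rules above can be sketched as follows, using made-up input:

@example
# Numeric range: prints lines 2, 3 and 4.
seq 5 | sed -n '2,4p'

# The ending regexp is first checked on the line after the start,
# so the range runs from the first `a' to the second: a b a.
printf 'a\nb\na\nc\n' | sed -n '/a/,/a/p'

# Second address not greater than the first: prints only line 4.
seq 5 | sed -n '4,2p'
@end example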
    588 
    589 @cindex Special addressing forms
    590 @cindex Range with start address of zero
    591 @cindex Zero, as range start address
    592 @cindex @var{addr1},+N
    593 @cindex @var{addr1},~N
    594 @cindex @acronym{GNU} extensions, special two-address forms
    595 @cindex @acronym{GNU} extensions, @code{0} address
    596 @cindex @acronym{GNU} extensions, 0,@var{addr2} addressing
    597 @cindex @acronym{GNU} extensions, @var{addr1},+@var{N} addressing
    598 @cindex @acronym{GNU} extensions, @var{addr1},~@var{N} addressing
    599 @value{SSED} also supports some special two-address forms; all these
    600 are @acronym{GNU} extensions:
    601 @table @code
    602 @item 0,/@var{regexp}/
    603 A line number of @code{0} can be used in an address specification like
    604 @code{0,/@var{regexp}/} so that @command{sed} will try to match
    605 @var{regexp} in the first input line too.  In other words,
    606 @code{0,/@var{regexp}/} is similar to @code{1,/@var{regexp}/},
except that if @var{regexp} matches the very first line of input the
    608 @code{0,/@var{regexp}/} form will consider it to end the range, whereas
    609 the @code{1,/@var{regexp}/} form will match the beginning of its range and
    610 hence make the range span up to the @emph{second} occurrence of the
    611 regular expression.
    612 
    613 Note that this is the only place where the @code{0} address makes
    614 sense; there is no 0-th line and commands which are given the @code{0}
    615 address in any other way will give an error.
    616 
    617 @item @var{addr1},+@var{N}
    618 Matches @var{addr1} and the @var{N} lines following @var{addr1}.
    619 
    620 @item @var{addr1},~@var{N}
    621 Matches @var{addr1} and the lines following @var{addr1}
    622 until the next line whose input line number is a multiple of @var{N}.
    623 @end table
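
The following sketch, with made-up input, contrasts these forms.  It
assumes a @command{sed} that implements these @acronym{GNU} extensions.

@example
# 0,/regexp/ may end on the very first line: prints foo.
printf 'foo\nbar\nfoo\nbaz\n' | sed -n '0,/foo/p'

# 1,/regexp/ looks for the end from line 2 on: prints foo bar foo.
printf 'foo\nbar\nfoo\nbaz\n' | sed -n '1,/foo/p'

# Line 2 and the 2 lines after it: 2 3 4.
seq 10 | sed -n '2,+2p'

# Line 5 up to the next multiple of 4: 5 6 7 8.
seq 10 | sed -n '5,~4p'
@end example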
    624 
    625 @cindex Excluding lines
    626 @cindex Selecting non-matching lines
    627 Appending the @code{!} character to the end of an address
    628 specification negates the sense of the match.
    629 That is, if the @code{!} character follows an address range,
    630 then only lines which do @emph{not} match the address range
    631 will be selected.
    632 This also works for singleton addresses,
    633 and, perhaps perversely, for the null address.
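
For instance, as a small sketch with made-up input:

@example
# Print every line outside the range 2,4: prints 1 and 5.
seq 5 | sed -n '2,4!p'

# Delete every line that is not the last one: prints 5.
seq 5 | sed '$!d'
@end example

The script is enclosed in single quotes so that the shell does not
interpret the @code{!} or @code{$} characters itself.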
    634 
    635 
    636 @node Regular Expressions
    637 @section Overview of Regular Expression Syntax
    638 
To use @command{sed} effectively, you need to understand regular
expressions (@dfn{regexp} for short).  A regular expression
    641 is a pattern that is matched against a
    642 subject string from left to right.  Most characters are
    643 @dfn{ordinary}: they stand for
    644 themselves in a pattern, and match the corresponding characters
    645 in the subject.  As a trivial example, the pattern
    646 
    647 @example
    648 The quick brown fox
    649 @end example
    650 
    651 @noindent
    652 matches a portion of a subject string that is identical to
    653 itself.  The power of regular expressions comes from the
    654 ability to include alternatives and repetitions in the pattern.
    655 These are encoded in the pattern by the use of @dfn{special characters},
    656 which do not stand for themselves but instead
    657 are interpreted in some special way.  Here is a brief description
    658 of regular expression syntax as used in @command{sed}.
    659 
    660 @table @code
    661 @item @var{char}
    662 A single ordinary character matches itself.
    663 
    664 @item *
    665 @cindex @acronym{GNU} extensions, to basic regular expressions
    666 Matches a sequence of zero or more instances of matches for the
    667 preceding regular expression, which must be an ordinary character, a
    668 special character preceded by @code{\}, a @code{.}, a grouped regexp
    669 (see below), or a bracket expression.  As a @acronym{GNU} extension, a
    670 postfixed regular expression can also be followed by @code{*}; for
    671 example, @code{a**} is equivalent to @code{a*}.  @acronym{POSIX}
    672 1003.1-2001 says that @code{*} stands for itself when it appears at
    673 the start of a regular expression or subexpression, but many
non-@acronym{GNU} implementations do not support this and portable
    675 scripts should instead use @code{\*} in these contexts.
    676 
    677 @item \+
    678 @cindex @acronym{GNU} extensions, to basic regular expressions
    679 As @code{*}, but matches one or more.  It is a @acronym{GNU} extension.
    680 
    681 @item \?
    682 @cindex @acronym{GNU} extensions, to basic regular expressions
    683 As @code{*}, but only matches zero or one.  It is a @acronym{GNU} extension.
    684 
    685 @item \@{@var{i}\@}
    686 As @code{*}, but matches exactly @var{i} sequences (@var{i} is a
    687 decimal integer; for portability, keep it between 0 and 255
    688 inclusive).
    689 
    690 @item \@{@var{i},@var{j}\@}
Matches between @var{i} and @var{j} sequences, inclusive.
    692 
    693 @item \@{@var{i},\@}
Matches @var{i} or more sequences.
    695 
    696 @item \(@var{regexp}\)
Groups the inner @var{regexp} as a whole.  This is used to:
    698 
    699 @itemize @bullet
    700 @item
    701 @cindex @acronym{GNU} extensions, to basic regular expressions
    702 Apply postfix operators, like @code{\(abcd\)*}:
    703 this will search for zero or more whole sequences 
    704 of @samp{abcd}, while @code{abcd*} would search
    705 for @samp{abc} followed by zero or more occurrences
    706 of @samp{d}.  Note that support for @code{\(abcd\)*} is
    707 required by @acronym{POSIX} 1003.1-2001, but many non-@acronym{GNU}
    708 implementations do not support it and hence it is not universally
    709 portable.         
    710 
    711 @item
    712 Use back references (see below).
    713 @end itemize
    714 
    715 @item .
    716 Matches any character, including newline.
    717 
    718 @item ^
    719 Matches the null string at beginning of the pattern space, i.e. what
    720 appears after the circumflex must appear at the beginning of the
    721 pattern space.
    722 
    723 In most scripts, pattern space is initialized to the content of each
    724 line (@pxref{Execution Cycle, , How @code{sed} works}).  So, it is a
    725 useful simplification to think of @code{^#include} as matching only
lines where @samp{#include} is the first thing on the line---if there are
    727 spaces before, for example, the match fails.  This simplification is
    728 valid as long as the original content of pattern space is not modified,
    729 for example with an @code{s} command.
    730 
    731 @code{^} acts as a special character only at the beginning of the
    732 regular expression or subexpression (that is, after @code{\(} or
    733 @code{\|}).  Portable scripts should avoid @code{^} at the beginning of
    734 a subexpression, though, as @acronym{POSIX} allows implementations that
    735 treat @code{^} as an ordinary character in that context.
    736 
    737 @item $
    738 It is the same as @code{^}, but refers to end of pattern space.
    739 @code{$} also acts as a special character only at the end
    740 of the regular expression or subexpression (that is, before @code{\)}
    741 or @code{\|}), and its use at the end of a subexpression is not
    742 portable.
    743 
    744 
    745 @item [@var{list}]
    746 @itemx [^@var{list}]
    747 Matches any single character in @var{list}: for example,
    748 @code{[aeiou]} matches all vowels.  A list may include
    749 sequences like @code{@var{char1}-@var{char2}}, which
    750 matches any character between (inclusive) @var{char1}
    751 and @var{char2}.
    752 
    753 A leading @code{^} reverses the meaning of @var{list}, so that
    754 it matches any single character @emph{not} in @var{list}.  To include
@code{]} in the list, make it the first character (after
the @code{^} if needed); to include @code{-} in the list,
make it the first or last; to include @code{^}, put
it after the first character.
    759 
    760 @cindex @code{POSIXLY_CORRECT} behavior, bracket expressions
    761 The characters @code{$}, @code{*}, @code{.}, @code{[}, and @code{\}
    762 are normally not special within @var{list}.  For example, @code{[\*]}
    763 matches either @samp{\} or @samp{*}, because the @code{\} is not
    764 special here.  However, strings like @code{[.ch.]}, @code{[=a=]}, and
    765 @code{[:space:]} are special within @var{list} and represent collating
    766 symbols, equivalence classes, and character classes, respectively, and
    767 @code{[} is therefore special within @var{list} when it is followed by
    768 @code{.}, @code{=}, or @code{:}.  Also, when not in
    769 @env{POSIXLY_CORRECT} mode, special escapes like @code{\n} and
    770 @code{\t} are recognized within @var{list}.  @xref{Escapes}.
    771 
    772 @item @var{regexp1}\|@var{regexp2}
    773 @cindex @acronym{GNU} extensions, to basic regular expressions
    774 Matches either @var{regexp1} or @var{regexp2}.  Use
parentheses to group complex alternative regular expressions.
    776 The matching process tries each alternative in turn, from
    777 left to right, and the first one that succeeds is used.
    778 It is a @acronym{GNU} extension.
    779 
    780 @item @var{regexp1}@var{regexp2}
    781 Matches the concatenation of @var{regexp1} and @var{regexp2}.
    782 Concatenation binds more tightly than @code{\|}, @code{^}, and
    783 @code{$}, but less tightly than the other regular expression
    784 operators.
    785 
    786 @item \@var{digit}
    787 Matches the @var{digit}-th @code{\(@dots{}\)} parenthesized
    788 subexpression in the regular expression.  This is called a @dfn{back
reference}.  Subexpressions are implicitly numbered by counting
    790 occurrences of @code{\(} left-to-right.
    791 
    792 @item \n
    793 Matches the newline character.
    794 
    795 @item \@var{char}
    796 Matches @var{char}, where @var{char} is one of @code{$},
    797 @code{*}, @code{.}, @code{[}, @code{\}, or @code{^}.
    798 Note that the only C-like
    799 backslash sequences that you can portably assume to be
    800 interpreted are @code{\n} and @code{\\}; in particular
    801 @code{\t} is not portable, and matches a @samp{t} under most
    802 implementations of @command{sed}, rather than a tab character.
    803 
    804 @end table
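
The operators above can be tried from a shell.  The following is a
minimal sketch, assuming a POSIX shell and a @command{sed} that
supports grouping with postfix operators; the variable names and
sample strings are illustrative only.

```shell
# Grouping makes * apply to the whole sequence "abcd".
grouped=$(printf 'abcdabcdX\n' | sed 's/\(abcd\)*/-/')          # -X
# Without grouping, * applies only to the final "d".
ungrouped=$(printf 'abcdddabcdX\n' | sed 's/abcd*/-/')          # -abcdX
# A bracket expression with a range; the leading ^ negates the list.
digits=$(printf 'room 42b\n' | sed 's/[^0-9]//g')               # 42
# An interval: exactly three lowercase letters at the start.
interval=$(printf 'abcdef\n' | sed 's/^[a-z]\{3\}/-/')          # -def
# A back reference: \1 must match the same text the group matched.
doubled=$(printf 'hello hello world\n' | sed 's/\([a-z]*\) \1/\1/')  # hello world
printf '%s\n' "$grouped" "$ungrouped" "$digits" "$interval" "$doubled"
```

Only basic regular expression syntax is used here, so no
@acronym{GNU} extension is required.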
    805 
    806 @cindex Greedy regular expression matching
    807 Note that the regular expression matcher is greedy, i.e., matches
    808 are attempted from left to right and, if two or more matches are
    809 possible starting at the same character, it selects the longest.
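
The effect of greedy matching is easiest to see on a small input.
This is a sketch under the assumption of a POSIX shell; the sample
strings and variable names are made up for illustration.

```shell
# .* is greedy: it extends to the LAST ">" on the line.
greedy=$(printf 'a<b>c<d>e\n' | sed 's/<.*>/X/')        # aXe
# Excluding ">" from the repeated set stops at the first ">".
bounded=$(printf 'a<b>c<d>e\n' | sed 's/<[^>]*>/X/')    # aXc<d>e
# Leftmost comes first: the earlier, shorter run of "a" is chosen
# over the later, longer one.
leftmost=$(printf 'xaayaaaa\n' | sed 's/a\{1,\}/-/')    # x-yaaaa
printf '%s\n' "$greedy" "$bounded" "$leftmost"
```

A negated bracket expression, as in the second command, is the usual
way to bound a match, since these regular expressions have no
non-greedy operators.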
    810 
    811 @noindent
    812 Examples:
    813 @table @samp
    814 @item abcdef
    815 Matches @samp{abcdef}.
    816 
    817 @item a*b
    818 Matches zero or more @samp{a}s followed by a single
    819 @samp{b}.  For example, @samp{b} or @samp{aaaaab}. 
    820 
    821 @item a\?b
    822 Matches @samp{b} or @samp{ab}.
    823 
    824 @item a\+b\+
    825 Matches one or more @samp{a}s followed by one or more
    826 @samp{b}s: @samp{ab} is the shortest possible match, but
    827 other examples are @samp{aaaab} or @samp{abbbbb} or
    828 @samp{aaaaaabbbbbbb}.
    829 
    830 @item .*
    831 @itemx .\+
    832 These two both match all the characters in a string;
    833 however, the first matches every string (including the empty
    834 string), while the second matches only strings containing
    835 at least one character.
    836 
    837 @item ^main.*(.*)
This matches a string starting with @samp{main},
    839 followed by an opening and closing
    840 parenthesis.  The @samp{n}, @samp{(} and @samp{)} need not
    841 be adjacent.
    842 
    843 @item ^#
    844 This matches a string beginning with @samp{#}.
    845 
    846 @item \\$
    847 This matches a string ending with a single backslash.  The
    848 regexp contains two backslashes for escaping.
    849 
    850 @item \$
    851 Instead, this matches a string consisting of a single dollar sign,
    852 because it is escaped.
    853 
    854 @item [a-zA-Z0-9]
In the C locale, this matches any single @acronym{ASCII} letter or digit.
    856 
    857 @item [^ @kbd{tab}]\+
    858 (Here @kbd{tab} stands for a single tab character.)
    859 This matches a string of one or more
    860 characters, none of which is a space or a tab.
    861 Usually this means a word.
    862 
    863 @item ^\(.*\)\n\1$
    864 This matches a string consisting of two equal substrings separated by
    865 a newline.
    866 
    867 @item .\@{9\@}A$
This matches nine characters followed by an @samp{A} at the end of the string.
    869 
    870 @item ^.\@{15\@}A
    871 This matches the start of a string that contains 16 characters,
    872 the last of which is an @samp{A}.
    873 
    874 @end table
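
A few of these examples can be checked as addresses with
@samp{sed -n}.  The sketch below assumes a POSIX shell; the input
text and variable names are hypothetical.

```shell
# \\$ selects lines ending with a single backslash.
cont=$(printf 'line one\\\nline two\n' | sed -n '/\\$/p')
# .\{9\}A$ needs at least nine characters before a final "A".
nine=$(printf '123456789A\n12345A\n' | sed -n '/.\{9\}A$/p')
# ^\(.*\)\n\1$ needs a newline in the pattern space, so N first
# joins two input lines.
same=$(printf 'foo\nfoo\n' | sed -n 'N;/^\(.*\)\n\1$/p')
differ=$(printf 'foo\nbar\n' | sed -n 'N;/^\(.*\)\n\1$/p')
printf '%s\n' "$cont" "$nine" "$same"
```

Here @code{differ} stays empty, because the two joined lines are not
equal.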
    875 
    876 
    877 
    878 @node Common Commands
    879 @section Often-Used Commands
    880 
    881 If you use @command{sed} at all, you will quite likely want to know
    882 these commands.
    883 
    884 @table @code
    885 @item #
    886 [No addresses allowed.]
    887 
    888 @findex # (comments)
    889 @cindex Comments, in scripts
    890 The @code{#} character begins a comment;
    891 the comment continues until the next newline.
    892 
    893 @cindex Portability, comments
    894 If you are concerned about portability, be aware that
    895 some implementations of @command{sed} (which are not @sc{posix}
    896 conformant) may only support a single one-line comment,
    897 and then only when the very first character of the script is a @code{#}.
    898 
    899 @findex -n, forcing from within a script
    900 @cindex Caveat --- #n on first line
    901 Warning: if the first two characters of the @command{sed} script
    902 are @code{#n}, then the @option{-n} (no-autoprint) option is forced.
    903 If you want to put a comment in the first line of your script
    904 and that comment begins with the letter @samp{n}
    905 and you do not want this behavior,
    906 then be sure to either use a capital @samp{N},
    907 or place at least one space before the @samp{n}.
    908 
    909 @item q [@var{exit-code}]
    910 This command only accepts a single address.
    911 
    912 @findex q (quit) command
    913 @cindex @value{SSEDEXT}, returning an exit code
    914 @cindex Quitting
    915 Exit @command{sed} without processing any more commands or input.
    916 Note that the current pattern space is printed if auto-print is
not disabled with the @option{-n} option.  The ability to return
    918 an exit code from the @command{sed} script is a @value{SSED} extension.
    919 
    920 @item d
    921 @findex d (delete) command
    922 @cindex Text, deleting
    923 Delete the pattern space;
    924 immediately start next cycle.
    925 
    926 @item p
    927 @findex p (print) command
    928 @cindex Text, printing
    929 Print out the pattern space (to the standard output).
    930 This command is usually only used in conjunction with the @option{-n}
    931 command-line option.
    932 
    933 @item n
    934 @findex n (next-line) command
    935 @cindex Next input line, replace pattern space with
    936 @cindex Read next input line
    937 If auto-print is not disabled, print the pattern space,
    938 then, regardless, replace the pattern space with the next line of input.
    939 If there is no more input then @command{sed} exits without processing
    940 any more commands.
    941 
    942 @item @{ @var{commands} @}
    943 @findex @{@} command grouping
    944 @cindex Grouping commands
    945 @cindex Command groups
    946 A group of commands may be enclosed between
    947 @code{@{} and @code{@}} characters.
    948 This is particularly useful when you want a group of commands
    949 to be triggered by a single address (or address-range) match.
    950 
    951 @end table
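
The commands in this section can be combined as follows.  This is an
illustrative sketch that assumes a POSIX shell; the sample input and
variable names are not taken from any real script.

```shell
input=$(printf '# header\none\ntwo\nthree\n')

# d: delete comment lines; everything else is auto-printed.
no_comments=$(printf '%s\n' "$input" | sed '/^#/d')
# p with -n: print only the second line.
second=$(printf '%s\n' "$input" | sed -n '2p')
# q: auto-print through line 2, then quit.
head2=$(printf '%s\n' "$input" | sed '2q')
# n then d: print a line, load the next one, and delete it.
odd=$(printf '%s\n' "$input" | sed 'n;d')
# { }: run two commands under a single address.
marked=$(printf '%s\n' "$input" | sed -n '/^t/{
s/^/> /
p
}')
printf '%s\n' "$no_comments" "$second" "$head2" "$odd" "$marked"
```

The group is written on separate lines because some implementations
are strict about what may follow a command before the closing brace.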
    952 
    953 @node The "s" Command
    954 @section The @code{s} Command
    955 
    956 The syntax of the @code{s} (as in substitute) command is
    957 @samp{s/@var{regexp}/@var{replacement}/@var{flags}}.  The @code{/}
    958 characters may be uniformly replaced by any other single
    959 character within any given @code{s} command.  The @code{/}
    960 character (or whatever other character is used in its stead)
    961 can appear in the @var{regexp} or @var{replacement}
    962 only if it is preceded by a @code{\} character.
    963 
    964 The @code{s} command is probably the most important in @command{sed}
    965 and has a lot of different options.  Its basic concept is simple:
    966 the @code{s} command attempts to match the pattern
    967 space against the supplied @var{regexp}; if the match is
    968 successful, then that portion of the pattern
    969 space which was matched is replaced with @var{replacement}.
    970 
    971 @cindex Backreferences, in regular expressions
    972 @cindex Parenthesized substrings
    973 The @var{replacement} can contain @code{\@var{n}} (@var{n} being
    974 a number from 1 to 9, inclusive) references, which refer to
    975 the portion of the match which is contained between the @var{n}th
    976 @code{\(} and its matching @code{\)}.
    977 Also, the @var{replacement} can contain unescaped @code{&}
    978 characters which reference the whole matched portion
    979 of the pattern space.
    980 @cindex @value{SSEDEXT}, case modifiers in @code{s} commands
    981 Finally, as a @value{SSED} extension, you can include a
    982 special sequence made of a backslash and one of the letters
    983 @code{L}, @code{l}, @code{U}, @code{u}, or @code{E}.
    984 The meaning is as follows:
    985 
    986 @table @code
    987 @item \L
    988 Turn the replacement
    989 to lowercase until a @code{\U} or @code{\E} is found,
    990 
    991 @item \l
    992 Turn the
    993 next character to lowercase,
    994 
    995 @item \U
    996 Turn the replacement to uppercase
    997 until a @code{\L} or @code{\E} is found,
    998 
    999 @item \u
   1000 Turn the next character
   1001 to uppercase,
   1002 
   1003 @item \E
   1004 Stop case conversion started by @code{\L} or @code{\U}.
   1005 @end table
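
The case-conversion sequences can be sketched as shown here.  The
example assumes @acronym{GNU} @command{sed}, since these sequences
are an extension, and a POSIX shell; the variable names are
illustrative.

```shell
# \U converts the rest of the replacement to uppercase.
upper=$(printf 'hello world\n' | sed 's/.*/\U&/')
# \L converts the rest of the replacement to lowercase.
lower=$(printf 'HELLO World\n' | sed 's/.*/\L&/')
# \u converts only the next character.
title=$(printf 'hello world\n' | sed 's/\([a-z]*\) \([a-z]*\)/\u\1 \u\2/')
# \E ends the conversion started by \U.
mixed=$(printf 'hello world\n' | sed 's/\([a-z]*\) \([a-z]*\)/\U\1\E \2/')
printf '%s\n' "$upper" "$lower" "$title" "$mixed"
```

The four results are @samp{HELLO WORLD}, @samp{hello world},
@samp{Hello World}, and @samp{HELLO world}.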
   1006 
   1007 To include a literal @code{\}, @code{&}, or newline in the final
   1008 replacement, be sure to precede the desired @code{\}, @code{&},
   1009 or newline in the @var{replacement} with a @code{\}.
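
The references and escapes described above might be used as in this
sketch, which assumes a POSIX shell; the sample strings and variable
names are illustrative.

```shell
# & stands for the whole matched text.
whole=$(printf 'cat\n' | sed 's/cat/[&]/')                            # [cat]
# \& is a literal ampersand.
literal=$(printf 'cat\n' | sed 's/cat/\&/')                           # &
# \1 and \2 refer to the first and second groups.
swapped=$(printf 'hello world\n' | sed 's/\(hello\) \(world\)/\2 \1/') # world hello
# Another delimiter avoids escaping every "/" in a path.
piped=$(printf '/usr/bin\n' | sed 's|/usr|/opt|')                     # /opt/bin
# With "/" as the delimiter, each "/" in the text must be escaped.
escaped=$(printf '/usr/bin\n' | sed 's/\/usr/\/opt/')                 # /opt/bin
printf '%s\n' "$whole" "$literal" "$swapped" "$piped" "$escaped"
```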
   1010 
   1011 @findex s command, option flags
   1012 @cindex Substitution of text, options
   1013 The @code{s} command can be followed by zero or more of the
   1014 following @var{flags}:
   1015 
   1016 @table @code
   1017 @item g
   1018 @cindex Global substitution
   1019 @cindex Replacing all text matching regexp in a line
   1020 Apply the replacement to @emph{all} matches to the @var{regexp},
   1021 not just the first.
   1022 
   1023 @item @var{number}
   1024 @cindex Replacing only @var{n}th match of regexp in a line
   1025 Only replace the @var{number}th match of the @var{regexp}.
   1026 
   1027 @cindex @acronym{GNU} extensions, @code{g} and @var{number} modifier interaction in @code{s} command
   1028 @cindex Mixing @code{g} and @var{number} modifiers in the @code{s} command
   1029 Note: the @sc{posix} standard does not specify what should happen
   1030 when you mix the @code{g} and @var{number} modifiers,
   1031 and currently there is no widely agreed upon meaning
   1032 across @command{sed} implementations.
   1033 For @value{SSED}, the interaction is defined to be:
   1034 ignore matches before the @var{number}th,
   1035 and then match and replace all matches from
   1036 the @var{number}th on.
   1037 
   1038 @item p
   1039 @cindex Text, printing after substitution
   1040 If the substitution was made, then print the new pattern space.
   1041 
   1042 Note: when both the @code{p} and @code{e} options are specified,
   1043 the relative ordering of the two produces very different results.
   1044 In general, @code{ep} (evaluate then print) is what you want,
   1045 but operating the other way round can be useful for debugging.
   1046 For this reason, the current version of @value{SSED} interprets
   1047 specially the presence of @code{p} options both before and after
   1048 @code{e}, printing the pattern space before and after evaluation,
   1049 while in general flags for the @code{s} command show their
   1050 effect just once.  This behavior, although documented, might
   1051 change in future versions.
   1052 
   1053 @item w @var{file-name}
   1054 @cindex Text, writing to a file after substitution
   1055 @cindex @value{SSEDEXT}, @file{/dev/stdout} file
   1056 @cindex @value{SSEDEXT}, @file{/dev/stderr} file
   1057 If the substitution was made, then write out the result to the named file.
   1058 As a @value{SSED} extension, two special values of @var{file-name} are
   1059 supported: @file{/dev/stderr}, which writes the result to the standard
   1060 error, and @file{/dev/stdout}, which writes to the standard
   1061 output.@footnote{This is equivalent to @code{p} unless the @option{-i}
   1062 option is being used.}
   1063 
   1064 @item e
   1065 @cindex Evaluate Bourne-shell commands, after substitution
   1066 @cindex Subprocesses
   1067 @cindex @value{SSEDEXT}, evaluating Bourne-shell commands
   1068 @cindex @value{SSEDEXT}, subprocesses
   1069 This command allows one to pipe input from a shell command
   1070 into pattern space.  If a substitution was made, the command
   1071 that is found in pattern space is executed and pattern space
   1072 is replaced with its output.  A trailing newline is suppressed;
   1073 results are undefined if the command to be executed contains
   1074 a @sc{nul} character.  This is a @value{SSED} extension.
   1075 
   1076 @item I
   1077 @itemx i
   1078 @cindex @acronym{GNU} extensions, @code{I} modifier
   1079 @cindex Case-insensitive matching
   1080 @ifset PERL
   1081 @cindex Perl-style regular expressions, case-insensitive
   1082 @end ifset
   1083 The @code{I} modifier to regular-expression matching is a @acronym{GNU}
   1084 extension which makes @command{sed} match @var{regexp} in a
   1085 case-insensitive manner.
   1086 
   1087 @item M
   1088 @itemx m
   1089 @cindex @value{SSEDEXT}, @code{M} modifier
   1090 @ifset PERL
   1091 @cindex Perl-style regular expressions, multiline
   1092 @end ifset
   1093 The @code{M} modifier to regular-expression matching is a @value{SSED}
   1094 extension which causes @code{^} and @code{$} to match respectively
   1095 (in addition to the normal behavior) the empty string after a newline,
   1096 and the empty string before a newline.  There are special character
   1097 sequences
   1098 @ifset PERL
   1099 (@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'}
   1100 in basic or extended regular expression modes)
   1101 @end ifset
   1102 @ifclear PERL
   1103 (@code{\`} and @code{\'})
   1104 @end ifclear
   1105 which always match the beginning or the end of the buffer.
   1106 @code{M} stands for @cite{multi-line}.
   1107 
   1108 @ifset PERL
   1109 @item S
   1110 @itemx s
   1111 @cindex @value{SSEDEXT}, @code{S} modifier
   1112 @cindex Perl-style regular expressions, single line
   1113 The @code{S} modifier to regular-expression matching is only valid
   1114 in Perl mode and specifies that the dot character (@code{.}) will
   1115 match the newline character too.  @code{S} stands for @cite{single-line}.
   1116 @end ifset
   1117 
   1118 @ifset PERL
   1119 @item X
   1120 @itemx x
   1121 @cindex @value{SSEDEXT}, @code{X} modifier
   1122 @cindex Perl-style regular expressions, extended
   1123 The @code{X} modifier to regular-expression matching is also
   1124 valid in Perl mode only.  If it is used, whitespace in the
   1125 pattern (other than in a character class) and
   1126 characters between a @kbd{#} outside a character class and the
next newline character are ignored.  An escaping backslash
   1128 can be used to include a whitespace or @kbd{#} character as part
   1129 of the pattern.
   1130 @end ifset
   1131 @end table
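
The following sketch shows several of the flags side by side.  It
assumes @acronym{GNU} @command{sed} for the combined
@var{number}-and-@code{g} form and for the @code{I} and @code{M}
modifiers, and a POSIX shell; all names and inputs are illustrative.

```shell
first=$(printf 'a-a-a-a\n' | sed 's/a/X/')       # first match only
all=$(printf 'a-a-a-a\n' | sed 's/a/X/g')        # every match
third=$(printf 'a-a-a-a\n' | sed 's/a/X/3')      # third match only
from2=$(printf 'a-a-a-a\n' | sed 's/a/X/2g')     # second match onward (GNU)
# p with -n: print only lines where a substitution was made.
printed=$(printf 'keep\nskip\n' | sed -n 's/keep/kept/p')
# I: case-insensitive matching (GNU).
nocase=$(printf 'Hello\n' | sed 's/hello/bye/I')
# M: ^ also matches after the newline that N put in pattern space (GNU).
multi=$(printf 'one\ntwo\n' | sed 'N;s/^/> /Mg')
printf '%s\n' "$first" "$all" "$third" "$from2" "$printed" "$nocase" "$multi"
```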
   1132 
   1133 
   1134 @node Other Commands
   1135 @section Less Frequently-Used Commands
   1136 
   1137 Though perhaps less frequently used than those in the previous
   1138 section, some very small yet useful @command{sed} scripts can be built with
   1139 these commands.
   1140 
   1141 @table @code
   1142 @item y/@var{source-chars}/@var{dest-chars}/
   1143 (The @code{/} characters may be uniformly replaced by
   1144 any other single character within any given @code{y} command.)
   1145 
   1146 @findex y (transliterate) command
   1147 @cindex Transliteration
   1148 Transliterate any characters in the pattern space which match
   1149 any of the @var{source-chars} with the corresponding character
   1150 in @var{dest-chars}.
   1151 
   1152 Instances of the @code{/} (or whatever other character is used in its stead),
   1153 @code{\}, or newlines can appear in the @var{source-chars} or @var{dest-chars}
lists, provided that each instance is escaped by a @code{\}.
   1155 The @var{source-chars} and @var{dest-chars} lists @emph{must}
   1156 contain the same number of characters (after de-escaping).
   1157 
   1158 @item a\
   1159 @itemx @var{text}
   1160 @cindex @value{SSEDEXT}, two addresses supported by most commands
   1161 As a @acronym{GNU} extension, this command accepts two addresses.
   1162 
   1163 @findex a (append text lines) command
   1164 @cindex Appending text after a line
   1165 @cindex Text, appending
   1166 Queue the lines of text which follow this command
   1167 (each but the last ending with a @code{\},
   1168 which are removed from the output)
   1169 to be output at the end of the current cycle,
   1170 or when the next input line is read.
   1171 
   1172 Escape sequences in @var{text} are processed, so you should
   1173 use @code{\\} in @var{text} to print a single backslash.
   1174 
As a @acronym{GNU} extension, if between the @code{a} and the newline there is
anything other than a whitespace-@code{\} sequence, then the text of this line,
   1177 starting at the first non-whitespace character after the @code{a},
   1178 is taken as the first line of the @var{text} block.
   1179 (This enables a simplification in scripting a one-line add.)
   1180 This extension also works with the @code{i} and @code{c} commands.
   1181 
   1182 @item i\
   1183 @itemx @var{text}
   1184 @cindex @value{SSEDEXT}, two addresses supported by most commands
   1185 As a @acronym{GNU} extension, this command accepts two addresses.
   1186 
   1187 @findex i (insert text lines) command
   1188 @cindex Inserting text before a line
   1189 @cindex Text, insertion
   1190 Immediately output the lines of text which follow this command
   1191 (each but the last ending with a @code{\},
   1192 which are removed from the output).
   1193 
   1194 @item c\
   1195 @itemx @var{text}
   1196 @findex c (change to text lines) command
   1197 @cindex Replacing selected lines with other text
   1198 Delete the lines matching the address or address-range,
   1199 and output the lines of text which follow this command
   1200 (each but the last ending with a @code{\},
   1201 which are removed from the output)
   1202 in place of the last line
   1203 (or in place of each line, if no addresses were specified).
   1204 A new cycle is started after this command is done,
   1205 since the pattern space will have been deleted.
   1206 
   1207 @item =
   1208 @cindex @value{SSEDEXT}, two addresses supported by most commands
   1209 As a @acronym{GNU} extension, this command accepts two addresses.
   1210 
   1211 @findex = (print line number) command
   1212 @cindex Printing line number
   1213 @cindex Line number, printing
   1214 Print out the current input line number (with a trailing newline).
   1215 
   1216 @item l @var{n}
   1217 @findex l (list unambiguously) command
   1218 @cindex List pattern space
   1219 @cindex Printing text unambiguously
   1220 @cindex Line length, setting
   1221 @cindex @value{SSEDEXT}, setting line length
   1222 Print the pattern space in an unambiguous form:
   1223 non-printable characters (and the @code{\} character)
   1224 are printed in C-style escaped form; long lines are split,
   1225 with a trailing @code{\} character to indicate the split;
   1226 the end of each line is marked with a @code{$}.
   1227 
   1228 @var{n} specifies the desired line-wrap length;
   1229 a length of 0 (zero) means to never wrap long lines.  If omitted,
   1230 the default as specified on the command line is used.  The @var{n}
   1231 parameter is a @value{SSED} extension.
   1232 
   1233 @item r @var{filename}
   1234 @cindex @value{SSEDEXT}, two addresses supported by most commands
   1235 As a @acronym{GNU} extension, this command accepts two addresses.
   1236 
   1237 @findex r (read file) command
   1238 @cindex Read text from a file
   1239 @cindex @value{SSEDEXT}, @file{/dev/stdin} file
   1240 Queue the contents of @var{filename} to be read and
   1241 inserted into the output stream at the end of the current cycle,
   1242 or when the next input line is read.
   1243 Note that if @var{filename} cannot be read, it is treated as
   1244 if it were an empty file, without any error indication.
   1245 
   1246 As a @value{SSED} extension, the special value @file{/dev/stdin}
   1247 is supported for the file name, which reads the contents of the
   1248 standard input.
   1249 
   1250 @item w @var{filename}
   1251 @findex w (write file) command
   1252 @cindex Write to a file
   1253 @cindex @value{SSEDEXT}, @file{/dev/stdout} file
   1254 @cindex @value{SSEDEXT}, @file{/dev/stderr} file
   1255 Write the pattern space to @var{filename}.
As a @value{SSED} extension, two special values of @var{filename} are
   1257 supported: @file{/dev/stderr}, which writes the result to the standard
   1258 error, and @file{/dev/stdout}, which writes to the standard
   1259 output.@footnote{This is equivalent to @code{p} unless the @option{-i}
   1260 option is being used.}
   1261 
   1262 The file will be created (or truncated) before the
   1263 first input line is read; all @code{w} commands
(including instances of the @code{w} flag on successful @code{s} commands)
   1265 which refer to the same @var{filename} are output without
   1266 closing and reopening the file.
   1267 
   1268 @item D
   1269 @findex D (delete first line) command
   1270 @cindex Delete first line from pattern space
   1271 Delete text in the pattern space up to the first newline.
   1272 If any text is left, restart cycle with the resultant
   1273 pattern space (without reading a new line of input),
   1274 otherwise start a normal new cycle.
   1275 
   1276 @item N
   1277 @findex N (append Next line) command
   1278 @cindex Next input line, append to pattern space
   1279 @cindex Append next input line to pattern space
   1280 Add a newline to the pattern space,
   1281 then append the next line of input to the pattern space.
   1282 If there is no more input then @command{sed} exits without processing
   1283 any more commands.
   1284 
   1285 @item P
   1286 @findex P (print first line) command
   1287 @cindex Print first line from pattern space
   1288 Print out the portion of the pattern space up to the first newline.
   1289 
   1290 @item h
   1291 @findex h (hold) command
   1292 @cindex Copy pattern space into hold space
   1293 @cindex Replace hold space with copy of pattern space
   1294 @cindex Hold space, copying pattern space into
   1295 Replace the contents of the hold space with the contents of the pattern space.
   1296 
   1297 @item H
   1298 @findex H (append Hold) command
   1299 @cindex Append pattern space to hold space
   1300 @cindex Hold space, appending from pattern space
   1301 Append a newline to the contents of the hold space,
   1302 and then append the contents of the pattern space to that of the hold space.
   1303 
   1304 @item g
   1305 @findex g (get) command
   1306 @cindex Copy hold space into pattern space
   1307 @cindex Replace pattern space with copy of hold space
   1308 @cindex Hold space, copy into pattern space
   1309 Replace the contents of the pattern space with the contents of the hold space.
   1310 
   1311 @item G
   1312 @findex G (appending Get) command
   1313 @cindex Append hold space to pattern space
   1314 @cindex Hold space, appending to pattern space
   1315 Append a newline to the contents of the pattern space,
   1316 and then append the contents of the hold space to that of the pattern space.
   1317 
   1318 @item x
   1319 @findex x (eXchange) command
   1320 @cindex Exchange hold space with pattern space
   1321 @cindex Hold space, exchange with pattern space
   1322 Exchange the contents of the hold and pattern spaces.
   1323 
   1324 @end table
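
Several of these commands are sketched below.  The example assumes a
POSIX shell and uses the traditional multi-line form of @code{a},
@code{i} and @code{c}; the inputs and variable names are
illustrative.

```shell
# y: transliterate characters one for one.
upper=$(printf 'abc\n' | sed 'y/abc/ABC/')
# =: print the line number before each line.
numbered=$(printf 'x\ny\n' | sed '=')
# i, c and a: insert before, change, and append after a line.
edited=$(printf 'one\ntwo\nthree\n' | sed '1i\
before one
2c\
TWO
3a\
after three')
# N: join each pair of lines, then replace the embedded newline.
pairs=$(printf 'a\nb\nc\nd\n' | sed '$!N;s/\n/,/')
# G and h: accumulate lines in the hold space in reverse order,
# then print the result on the last line.
reversed=$(printf 'a\nb\nc\n' | sed -n '1!G;h;$p')
printf '%s\n' "$upper" "$numbered" "$edited" "$pairs" "$reversed"
```

In the last command, @code{1!G} appends the hold space to every line
but the first, and @code{h} saves the growing result, so the final
pattern space holds the lines in reverse order.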
   1325 
   1326 
   1327 @node Programming Commands
   1328 @section Commands for @command{sed} gurus
   1329 
   1330 In most cases, use of these commands indicates that you are
   1331 probably better off programming in something like @command{awk}
   1332 or Perl.  But occasionally one is committed to sticking
   1333 with @command{sed}, and these commands can enable one to write
   1334 quite convoluted scripts.
   1335 
   1336 @cindex Flow of control in scripts
   1337 @table @code
   1338 @item : @var{label}
   1339 [No addresses allowed.]
   1340 
   1341 @findex : (label) command
   1342 @cindex Labels, in scripts
   1343 Specify the location of @var{label} for branch commands.
   1344 In all other respects, a no-op.
   1345 
   1346 @item b @var{label}
   1347 @findex b (branch) command
   1348 @cindex Branch to a label, unconditionally
   1349 @cindex Goto, in scripts
   1350 Unconditionally branch to @var{label}.
   1351 The @var{label} may be omitted, in which case the next cycle is started.
   1352 
   1353 @item t @var{label}
   1354 @findex t (test and branch if successful) command
   1355 @cindex Branch to a label, if @code{s///} succeeded
   1356 @cindex Conditional branch
   1357 Branch to @var{label} only if there has been a successful @code{s}ubstitution
   1358 since the last input line was read or conditional branch was taken.
   1359 The @var{label} may be omitted, in which case the next cycle is started.
   1360 
   1361 @end table
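
Labels and branches can be sketched as follows.  The example assumes
a POSIX shell and passes each command with its own @option{-e}, so
that a label name is never followed by other text; the label names
and inputs are hypothetical.

```shell
# b without a label: skip the rest of the script for comment lines.
skipped=$(printf '# note\nx\n' | sed -e '/^#/b' -e 's/^/> /')
# b with a label: loop until the last line has been appended,
# then join all lines with spaces.
joined=$(printf 'x\ny\nz\n' | sed -e ':more' -e 'N' -e '$!b more' -e 's/\n/ /g')
# t: repeat the substitution for as long as it keeps succeeding,
# removing one innermost pair of parentheses per pass.
stripped=$(printf '((a)(b))\n' | sed -e ':again' -e 's/(\([^()]*\))/\1/' -e 't again')
printf '%s\n' "$skipped" "$joined" "$stripped"
```

The @code{t} loop ends on the first pass in which the @code{s}
command finds nothing left to replace.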
   1362 
   1363 @node Extended Commands
   1364 @section Commands Specific to @value{SSED}
   1365 
   1366 These commands are specific to @value{SSED}, so you
   1367 must use them with care and only when you are sure that
   1368 hindering portability is not evil.  They allow you to check
   1369 for @value{SSED} extensions or to do tasks that are required
   1370 quite often, yet are unsupported by standard @command{sed}s.
   1371 
   1372 @table @code
   1373 @item e [@var{command}]
   1374 @findex e (evaluate) command
   1375 @cindex Evaluate Bourne-shell commands
   1376 @cindex Subprocesses
   1377 @cindex @value{SSEDEXT}, evaluating Bourne-shell commands
   1378 @cindex @value{SSEDEXT}, subprocesses
   1379 This command allows one to pipe input from a shell command
   1380 into pattern space.  Without parameters, the @code{e} command
   1381 executes the command that is found in pattern space and
   1382 replaces the pattern space with the output; a trailing newline
   1383 is suppressed.
   1384 
   1385 If a parameter is specified, instead, the @code{e} command
   1386 interprets it as a command and sends its output to the output stream
   1387 (like @code{r} does).  The command can run across multiple
   1388 lines, all but the last ending with a back-slash.
   1389 
   1390 In both cases, the results are undefined if the command to be
   1391 executed contains a @sc{nul} character.
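
Both forms of @code{e}, and the related @code{e} flag of the @code{s}
command, can be sketched as shown.  This assumes @acronym{GNU}
@command{sed} with command execution enabled and a POSIX shell that
provides @command{echo} and @command{expr}; the variable names are
illustrative.

```shell
# Without a parameter: run the pattern space as a command and
# replace the pattern space with its output.
ran=$(printf 'echo hello\n' | sed 'e')
# With a parameter: run the given command and send its output
# to the output stream ahead of the current line.
before=$(printf 'line\n' | sed 'e echo before')
# The e flag of s: build a command by substitution, then run it.
sum=$(printf '3 4\n' | sed 's/\(.*\) \(.*\)/expr \1 + \2/e')
printf '%s\n' "$ran" "$before" "$sum"
```

Because the input is executed by the shell, this should only be used
on input that is trusted.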
   1392 
   1393 @item L @var{n}
   1394 @findex L (fLow paragraphs) command
   1395 @cindex Reformat pattern space
   1396 @cindex Reformatting paragraphs
   1397 @cindex @value{SSEDEXT}, reformatting paragraphs
   1398 @cindex @value{SSEDEXT}, @code{L} command
   1399 This @value{SSED} extension fills and joins lines in pattern space
   1400 to produce output lines of (at most) @var{n} characters, like
   1401 @code{fmt} does; if @var{n} is omitted, the default as specified
on the command line is used.  This command is considered a failed
experiment and, unless there is enough demand (which seems unlikely),
it will be removed in future versions.
   1405 
   1406 @ignore
   1407 Blank lines, spaces between words, and indentation are
   1408 preserved in the output; successive input lines with different
   1409 indentation are not joined; tabs are expanded to 8 columns.
   1410 
   1411 If the pattern space contains multiple lines, they are joined, but
   1412 since the pattern space usually contains a single line, the behavior
   1413 of a simple @code{L;d} script is the same as @samp{fmt -s} (i.e.,
   1414 it does not join short lines to form longer ones).
   1415 
   1416 @var{n} specifies the desired line-wrap length; if omitted,
   1417 the default as specified on the command line is used.
   1418 @end ignore
   1419 
   1420 @item Q [@var{exit-code}]
   1421 This command only accepts a single address.
   1422 
   1423 @findex Q (silent Quit) command
   1424 @cindex @value{SSEDEXT}, quitting silently
   1425 @cindex @value{SSEDEXT}, returning an exit code
   1426 @cindex Quitting
   1427 This command is the same as @code{q}, but will not print the
   1428 contents of pattern space.  Like @code{q}, it provides the
   1429 ability to return an exit code to the caller.
   1430 
This command can be useful because the only alternative ways
to accomplish this apparently trivial function are to use
the @option{-n} option (which can unnecessarily complicate
your script) or to resort to the following snippet, which
wastes time by reading the whole file without any visible effect:
   1436 
   1437 @example
   1438 :eat
   1439 $d       @i{@r{Quit silently on the last line}}
   1440 N        @i{@r{Read another line, silently}}
   1441 g        @i{@r{Overwrite pattern space each time to save memory}}
   1442 b eat
   1443 @end example
   1444 
   1445 @item R @var{filename}
   1446 @findex R (read line) command
   1447 @cindex Read text from a file
   1448 @cindex @value{SSEDEXT}, reading a file a line at a time
   1449 @cindex @value{SSEDEXT}, @code{R} command
   1450 @cindex @value{SSEDEXT}, @file{/dev/stdin} file
   1451 Queue a line of @var{filename} to be read and
   1452 inserted into the output stream at the end of the current cycle,
   1453 or when the next input line is read.
   1454 Note that if @var{filename} cannot be read, or if its end is
   1455 reached, no line is appended, without any error indication.
   1456 
   1457 As with the @code{r} command, the special value @file{/dev/stdin}
   1458 is supported for the file name, which reads a line from the
   1459 standard input.
   1460 
   1461 @item T @var{label}
   1462 @findex T (test and branch if failed) command
   1463 @cindex @value{SSEDEXT}, branch if @code{s///} failed
   1464 @cindex Branch to a label, if @code{s///} failed
   1465 @cindex Conditional branch
   1466 Branch to @var{label} only if there have been no successful
   1467 @code{s}ubstitutions since the last input line was read or
   1468 conditional branch was taken. The @var{label} may be omitted,
   1469 in which case the next cycle is started.
   1470 
   1471 @item v @var{version}
   1472 @findex v (version) command
   1473 @cindex @value{SSEDEXT}, checking for their presence
   1474 @cindex Requiring @value{SSED}
   1475 This command does nothing, but makes @command{sed} fail if
   1476 @value{SSED} extensions are not supported, simply because other
   1477 versions of @command{sed} do not implement it.  In addition, you
   1478 can specify the version of @command{sed} that your script
   1479 requires, such as @code{4.0.5}.  The default is @code{4.0}
   1480 because that is the first version that implemented this command.
   1481 
   1482 This command enables all @value{SSEDEXT} even if
   1483 @env{POSIXLY_CORRECT} is set in the environment.
   1484 
   1485 @item W @var{filename}
   1486 @findex W (write first line) command
   1487 @cindex Write first line to a file
   1488 @cindex @value{SSEDEXT}, writing first line to a file
   1489 Write to the given filename the portion of the pattern space up to
   1490 the first newline.  Everything said under the @code{w} command about
   1491 file handling holds here too.
   1492 
   1493 @item z
   1494 @findex z (Zap) command
   1495 @cindex @value{SSEDEXT}, emptying pattern space
   1496 @cindex Emptying pattern space
   1497 This command empties the content of pattern space.  It is
   1498 usually the same as @samp{s/.*//}, but is more efficient
   1499 and works in the presence of invalid multibyte sequences
   1500 in the input stream.  @sc{posix} mandates that such sequences
   1501 are @emph{not} matched by @samp{.}, so that there is no portable
   1502 way to clear @command{sed}'s buffers in the middle of the
   1503 script in most multibyte locales (including UTF-8 locales).
   1504 @end table
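
As a rough illustration of three of the extensions above, here is a
minimal shell sketch that contrasts @code{q} with @code{Q}, uses
@code{T} to skip commands after a failed substitution, and uses
@code{z} to empty pattern space.  It assumes that @acronym{GNU}
@command{sed} is installed as @file{sed}; the function names are
hypothetical and exist only for this example.

```shell
#!/bin/sh
# Sketch only: assumes GNU sed is available as "sed".
# All function names are made up for illustration.

# q prints pattern space before quitting, so lines 1 and 2 appear.
first_two_q() { sed '2q'; }

# Q quits without printing, so only line 1 appears.
first_one_Q() { sed '2Q'; }

# Q accepts an exit code: stop silently with status 3 at the first ERROR line.
stop_on_error() { sed '/ERROR/Q3'; }

# T without a label skips the rest of the script when the s command
# did not match, so only changed lines receive the suffix.
tag_changed() {
  sed '
    s/foo/bar/
    T
    s/$/ (changed)/
  '
}

# z empties pattern space; the s command that follows then matches
# the empty pattern space and puts a fixed marker there.
blank_secrets() {
  sed '
    /^secret/ {
      z
      s/^$/[removed]/
    }
  '
}
```

For instance, @samp{printf 'ok\nERROR\n' | stop_on_error} prints only
@samp{ok} and leaves the exit status 3 for the caller to inspect.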
   1505 
   1506 @node Escapes
   1507 @section @acronym{GNU} Extensions for Escapes in Regular Expressions
   1508 
   1509 @cindex @acronym{GNU} extensions, special escapes
   1510 Until this chapter, we have only encountered escapes of the form
   1511 @samp{\^}, which tell @command{sed} not to interpret the circumflex
   1512 as a special character, but rather to take it literally.  For
   1513 example, @samp{\*} matches a single asterisk rather than zero
   1514 or more backslashes.
   1515 
   1516 @cindex @code{POSIXLY_CORRECT} behavior, escapes
   1517 This chapter introduces another kind of escape@footnote{All
   1518 the escapes introduced here are @acronym{GNU}
   1519 extensions, with the exception of @code{\n}.  In basic regular
   1520 expression mode, setting @code{POSIXLY_CORRECT} disables them inside
   1521 bracket expressions.}---that
   1522 is, escapes that are applied to a character or sequence of characters
   1523 that ordinarily are taken literally, and that @command{sed} replaces
   1524 with a special character.  This provides a way
   1525 of encoding non-printable characters in patterns in a visible manner.
   1526 There is no restriction on the appearance of non-printing characters
   1527 in a @command{sed} script but when a script is being prepared in the
   1528 shell or by text editing, it is usually easier to use one of
   1529 the following escape sequences than the binary character it
represents.
   1531 
   1532 The list of these escapes is:
   1533 
   1534 @table @code
   1535 @item \a
   1536 Produces or matches a @sc{bel} character, that is an ``alert'' (@sc{ascii} 7).
   1537 
   1538 @item \f
   1539 Produces or matches a form feed (@sc{ascii} 12).
   1540 
   1541 @item \n
   1542 Produces or matches a newline (@sc{ascii} 10).
   1543 
   1544 @item \r
   1545 Produces or matches a carriage return (@sc{ascii} 13).
   1546 
   1547 @item \t
   1548 Produces or matches a horizontal tab (@sc{ascii} 9).
   1549 
   1550 @item \v
Produces or matches a so-called ``vertical tab'' (@sc{ascii} 11).
   1552 
   1553 @item \c@var{x}
   1554 Produces or matches @kbd{@sc{Control}-@var{x}}, where @var{x} is
   1555 any character.  The precise effect of @samp{\c@var{x}} is as follows:
   1556 if @var{x} is a lower case letter, it is converted to upper case.
   1557 Then bit 6 of the character (hex 40) is inverted.  Thus @samp{\cz} becomes
   1558 hex 1A, but @samp{\c@{} becomes hex 3B, while @samp{\c;} becomes hex 7B.
   1559 
   1560 @item \d@var{xxx}
   1561 Produces or matches a character whose decimal @sc{ascii} value is @var{xxx}.
   1562 
   1563 @item \o@var{xxx}
   1564 @ifset PERL
   1565 @item \@var{xxx}
   1566 @end ifset
   1567 Produces or matches a character whose octal @sc{ascii} value is @var{xxx}.
   1568 @ifset PERL
   1569 The syntax without the @code{o} is active in Perl mode, while the one
   1570 with the @code{o} is active in the normal or extended @sc{posix} regular
   1571 expression modes.
   1572 @end ifset
   1573 
   1574 @item \x@var{xx}
   1575 Produces or matches a character whose hexadecimal @sc{ascii} value is @var{xx}.
   1576 @end table
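
The following shell sketch shows a few of these escapes on the
left-hand side of an @code{s} command.  It is only an illustration
under the assumption that @acronym{GNU} @command{sed} is installed as
@file{sed}; the helper names are hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes GNU sed; helper names are made up.

# Make each tab visible; \t stands for ASCII 9.
show_tabs() { sed 's/\t/<TAB>/g'; }

# Strip a trailing carriage return (ASCII 13), as found in DOS text files.
strip_cr() { sed 's/\r$//'; }

# \x41 (hex), \o101 (octal) and \d065 (decimal) all denote "A" (ASCII 65).
# Each s command replaces the first remaining "A" with an asterisk.
hide_A() {
  sed '
    s/\x41/*/
    s/\o101/*/
    s/\d065/*/
  '
}

# \cz is Control-Z (hex 1A); show it in caret notation.
show_ctrl_z() { sed 's/\cz/^Z/g'; }
```

Writing @samp{\t} rather than a literal tab keeps the script readable
in an editor where tabs and spaces look alike.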
   1577 
   1578 @samp{\b} (backspace) was omitted because of the conflict with
   1579 the existing ``word boundary'' meaning.
   1580 
   1581 Other escapes match a particular character class and are valid only in
   1582 regular expressions:
   1583 
   1584 @table @code
   1585 @item \w
   1586 Matches any ``word'' character.  A ``word'' character is any
   1587 letter or digit or the underscore character.
   1588 
   1589 @item \W
   1590 Matches any ``non-word'' character.
   1591 
   1592 @item \b
Matches a word boundary; that is, it matches if the character
   1594 to the left is a ``word'' character and the character to the
   1595 right is a ``non-word'' character, or vice-versa.
   1596 
   1597 @item \B
Matches everywhere but on a word boundary; that is, it matches
   1599 if the character to the left and the character to the right
   1600 are either both ``word'' characters or both ``non-word''
   1601 characters.
   1602 
   1603 @item \`
   1604 Matches only at the start of pattern space.  This is different
   1605 from @code{^} in multi-line mode.
   1606 
   1607 @item \'
   1608 Matches only at the end of pattern space.  This is different
   1609 from @code{$} in multi-line mode.
   1610 
   1611 @ifset PERL
   1612 @item \G
   1613 Match only at the start of pattern space or, when doing a global
   1614 substitution using the @code{s///g} command and option, at
   1615 the end-of-match position of the prior match.  For example,
   1616 @samp{s/\Ga/Z/g} will change an initial run of @code{a}s to
a run of @code{Z}s.
   1618 @end ifset
   1619 @end table
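
A short shell sketch may help to tell these escapes apart.  It
assumes @acronym{GNU} @command{sed} (including the @code{M} flag of
the @code{s} command for multi-line mode); the function names are
hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes GNU sed; function names are made up.

# \b anchors the match at word boundaries, so "cat" is replaced
# but the "cat" inside "concat" is left alone.
whole_word() { sed 's/\bcat\b/dog/g'; }

# \W matches any non-word character: squeeze each run of them
# into a single underscore.
slugify() { sed 's/\W\W*/_/g'; }

# After N, pattern space holds two lines.  In multi-line mode (M),
# ^ matches at the start of each of them ...
mark_caret() { sed 'N; s/^/> /Mg'; }

# ... while \` still matches only at the very start of pattern space.
mark_backtick() { sed 'N; s/\`/> /Mg'; }
```

Given the two input lines @samp{a} and @samp{b}, @code{mark_caret}
prefixes both lines, whereas @code{mark_backtick} prefixes only the
first.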
   1620 
   1621 @node Examples
   1622 @chapter Some Sample Scripts
   1623 
   1624 Here are some @command{sed} scripts to guide you in the art of mastering
   1625 @command{sed}.
   1626 
   1627 @menu
   1628 Some exotic examples:
   1629 * Centering lines::
   1630 * Increment a number::
   1631 * Rename files to lower case::
   1632 * Print bash environment::
   1633 * Reverse chars of lines::
   1634 
   1635 Emulating standard utilities:
   1636 * tac::                             Reverse lines of files
   1637 * cat -n::                          Numbering lines
   1638 * cat -b::                          Numbering non-blank lines
   1639 * wc -c::                           Counting chars
   1640 * wc -w::                           Counting words
   1641 * wc -l::                           Counting lines
   1642 * head::                            Printing the first lines
   1643 * tail::                            Printing the last lines
   1644 * uniq::                            Make duplicate lines unique
   1645 * uniq -d::                         Print duplicated lines of input
   1646 * uniq -u::                         Remove all duplicated lines
   1647 * cat -s::                          Squeezing blank lines
   1648 @end menu
   1649 
   1650 @node Centering lines
   1651 @section Centering Lines
   1652 
This script centers all lines of a file on an 80-column width.
   1654 To change that width, the number in @code{\@{@dots{}\@}} must be
   1655 replaced, and the number of added spaces also must be changed.
   1656 
   1657 Note how the buffer commands are used to separate parts in
   1658 the regular expressions to be matched---this is a common
   1659 technique.
   1660 
   1661 @c start-------------------------------------------
   1662 @example
   1663 #!/usr/bin/sed -f
   1664 
   1665 # Put 80 spaces in the buffer
   1666 1 @{
   1667   x
   1668   s/^$/          /
   1669   s/^.*$/&&&&&&&&/
   1670   x
   1671 @}
   1672 
   1673 # del leading and trailing spaces
   1674 y/@kbd{tab}/ /
   1675 s/^ *//
   1676 s/ *$//
   1677 
   1678 # add a newline and 80 spaces to end of line
   1679 G
   1680 
   1681 # keep first 81 chars (80 + a newline)
   1682 s/^\(.\@{81\@}\).*$/\1/
   1683 
   1684 # \2 matches half of the spaces, which are moved to the beginning
   1685 s/^\(.*\)\n\(.*\)\2/\2\1/
   1686 @end example
   1687 @c end---------------------------------------------
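
To make the two width-dependent numbers easier to spot, here is a
sketch of the same technique for a 20-column width, wrapped in a shell
function.  The name @code{center20} is hypothetical.  Two liberties
are taken: the padding is built from ten placeholder characters so
that the count is visible, and tabs are not converted.

```shell
#!/bin/sh
# Sketch only: a 20-column variant of the centering script.
# center20 is a made-up name; tab handling is left out.

center20() {
  sed '
    # Put 20 spaces in the hold space: ten placeholder characters
    # become spaces, then the run is doubled.
    1 {
      x
      s/^$/0123456789/
      s/./ /g
      s/^.*$/&&/
      x
    }

    # Delete leading and trailing spaces.
    s/^ *//
    s/ *$//

    # Append a newline and the 20 spaces.
    G

    # Keep the first 21 characters (20 plus the newline).
    s/^\(.\{21\}\).*$/\1/

    # \2 matches half of the remaining spaces, which move to the front.
    s/^\(.*\)\n\(.*\)\2/\2\1/
  '
}
```

With the input @samp{abcd}, 16 spaces remain after truncation, so
eight of them end up in front of the text.  When the number of
remaining spaces is odd, one space is left over at the end of the
line.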
   1688 
   1689 @node Increment a number
   1690 @section Increment a Number
   1691 
   1692 This script is one of a few that demonstrate how to do arithmetic
   1693 in @command{sed}.  This is indeed possible,@footnote{@command{sed} guru Greg
   1694 Ubben wrote an implementation of the @command{dc} @sc{rpn} calculator!
   1695 It is distributed together with sed.} but must be done manually.
   1696 
To increment a number you just add 1 to the last digit, replacing
it by the following digit.  There is one exception: when the digit
is a nine, the preceding digits must also be incremented until you
reach one that is not a nine.
   1701 
This solution by Bruno Haible is very clever because
   1703 it uses a single buffer; if you don't have this limitation, the
   1704 algorithm used in @ref{cat -n, Numbering lines}, is faster.
   1705 It works by replacing trailing nines with an underscore, then
   1706 using multiple @code{s} commands to increment the last digit,
   1707 and then again substituting underscores with zeros.
   1708 
   1709 @c start-------------------------------------------
   1710 @example
   1711 #!/usr/bin/sed -f
   1712 
   1713 /[^0-9]/ d
   1714 
# replace all trailing 9s by _ (any other character except digits could
# be used)
   1717 :d
   1718 s/9\(_*\)$/_\1/
   1719 td
   1720 
   1721 # incr last digit only.  The first line adds a most-significant
   1722 # digit of 1 if we have to add a digit.
   1723 #
   1724 # The @code{tn} commands are not necessary, but make the thing
   1725 # faster
   1726 
   1727 s/^\(_*\)$/1\1/; tn
   1728 s/8\(_*\)$/9\1/; tn
   1729 s/7\(_*\)$/8\1/; tn
   1730 s/6\(_*\)$/7\1/; tn
   1731 s/5\(_*\)$/6\1/; tn
   1732 s/4\(_*\)$/5\1/; tn
   1733 s/3\(_*\)$/4\1/; tn
   1734 s/2\(_*\)$/3\1/; tn
   1735 s/1\(_*\)$/2\1/; tn
   1736 s/0\(_*\)$/1\1/; tn
   1737 
   1738 :n
   1739 y/_/0/
   1740 @end example
   1741 @c end---------------------------------------------
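
To see the underscore technique at work, the script above can be
wrapped in a shell function and fed a few numbers.  This is only a
sketch: the name @code{increment} is hypothetical, and the comments
trace the intermediate states of pattern space for the input
@samp{1299}.

```shell
#!/bin/sh
# Sketch only: the increment script wrapped in a made-up function.

increment() {
  sed '
    # Lines that are not plain numbers are dropped.
    /[^0-9]/ d

    # Trailing nines become underscores: 1299 -> 129_ -> 12__
    :d
    s/9\(_*\)$/_\1/
    td

    # Bump the last real digit (or prepend 1): 12__ -> 13__
    s/^\(_*\)$/1\1/; tn
    s/8\(_*\)$/9\1/; tn
    s/7\(_*\)$/8\1/; tn
    s/6\(_*\)$/7\1/; tn
    s/5\(_*\)$/6\1/; tn
    s/4\(_*\)$/5\1/; tn
    s/3\(_*\)$/4\1/; tn
    s/2\(_*\)$/3\1/; tn
    s/1\(_*\)$/2\1/; tn
    s/0\(_*\)$/1\1/; tn

    # Underscores turn into zeros: 13__ -> 1300
    :n
    y/_/0/
  '
}
```

A number made only of nines, such as @samp{999}, becomes
@samp{___} after the loop, and the first rule of the second group
then prepends the new leading @samp{1}.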
   1742 
   1743 @node Rename files to lower case
   1744 @section Rename Files to Lower Case
   1745 
This is a pretty strange use of @command{sed}.  We transform text
into shell commands, then just feed them to the shell.
   1748 Don't worry, even worse hacks are done when using @command{sed}; I have
   1749 seen a script converting the output of @command{date} into a @command{bc}
   1750 program!
   1751 
   1752 The main body of this is the @command{sed} script, which remaps the name
from lower to upper (or vice-versa) and even checks
whether the remapped name is the same as the original name.
   1755 Note how the script is parameterized using shell
   1756 variables and proper quoting.
   1757 
   1758 @c start-------------------------------------------
   1759 @example
   1760 #! /bin/sh
   1761 # rename files to lower/upper case... 
   1762 #
   1763 # usage: 
   1764 #    move-to-lower * 
   1765 #    move-to-upper * 
   1766 # or
   1767 #    move-to-lower -R .
   1768 #    move-to-upper -R .
   1769 #
   1770 
   1771 help()
   1772 @{
   1773         cat << eof
Usage: $0 [-n] [-R] [-h] files...
   1775 
   1776 -n      do nothing, only see what would be done
   1777 -R      recursive (use find)
   1778 -h      this message
   1779 files   files to remap to lower case
   1780 
   1781 Examples:
   1782        $0 -n *        (see if everything is ok, then...)
   1783        $0 *
   1784 
   1785        $0 -R .
   1786 
   1787 eof
   1788 @}
   1789 
   1790 apply_cmd='sh'
   1791 finder='echo "$@@" | tr " " "\n"'
   1792 files_only=
   1793 
   1794 while :
   1795 do
   1796     case "$1" in 
   1797         -n) apply_cmd='cat' ;;
   1798         -R) finder='find "$@@" -type f';;
   1799         -h) help ; exit 1 ;;
   1800         *) break ;;
   1801     esac
   1802     shift
   1803 done
   1804 
   1805 if [ -z "$1" ]; then
        echo "Usage: $0 [-h] [-n] [-R] files..."
   1807         exit 1
   1808 fi
   1809 
   1810 LOWER='abcdefghijklmnopqrstuvwxyz'
   1811 UPPER='ABCDEFGHIJKLMNOPQRSTUVWXYZ'
   1812 
   1813 case `basename $0` in
   1814         *upper*) TO=$UPPER; FROM=$LOWER ;;
   1815         *)       FROM=$UPPER; TO=$LOWER ;;
   1816 esac
   1817 
   1818 eval $finder | sed -n '
   1819 
   1820 # remove all trailing slashes
   1821 s/\/*$//
   1822 
   1823 # add ./ if there is no path, only a filename
   1824 /\//! s/^/.\//
   1825 
   1826 # save path+filename
   1827 h
   1828 
   1829 # remove path
   1830 s/.*\///
   1831 
   1832 # do conversion only on filename
   1833 y/'$FROM'/'$TO'/
   1834 
   1835 # now line contains original path+file, while
   1836 # hold space contains the new filename
   1837 x
   1838 
   1839 # add converted file name to line, which now contains
   1840 # path/file-name\nconverted-file-name
   1841 G
   1842 
   1843 # check if converted file name is equal to original file name,
# if it is, do not print anything
   1845 /^.*\/\(.*\)\n\1/b
   1846 
   1847 # now, transform path/fromfile\n, into
   1848 # mv path/fromfile path/tofile and print it
   1849 s/^\(.*\/\)\(.*\)\n\(.*\)$/mv "\1\2" "\1\3"/p
   1850 
   1851 ' | $apply_cmd
   1852 @end example
   1853 @c end---------------------------------------------
   1854 
   1855 @node Print bash environment
   1856 @section Print @command{bash} Environment
   1857 
   1858 This script strips the definition of the shell functions
   1859 from the output of the @command{set} Bourne-shell command.
   1860 
   1861 @c start-------------------------------------------
   1862 @example
   1863 #!/bin/sh
   1864 
   1865 set | sed -n '
   1866 :x
   1867 
   1868 @ifinfo
   1869 # if no occurrence of "=()" print and load next line
   1870 @end ifinfo
   1871 @ifnotinfo
   1872 # if no occurrence of @samp{=()} print and load next line
   1873 @end ifnotinfo
   1874 /=()/! @{ p; b; @}
   1875 / () $/! @{ p; b; @}
   1876 
   1877 # possible start of functions section
   1878 # save the line in case this is a var like FOO="() "
   1879 h
   1880 
   1881 # if the next line has a brace, we quit because
   1882 # nothing comes after functions
   1883 n
   1884 /^@{/ q
   1885 
   1886 # print the old line
   1887 x; p
   1888 
   1889 # work on the new line now
   1890 x; bx
   1891 '
   1892 @end example
   1893 @c end---------------------------------------------
   1894 
   1895 @node Reverse chars of lines
   1896 @section Reverse Characters of Lines
   1897 
   1898 This script can be used to reverse the position of characters
   1899 in lines.  The technique moves two characters at a time, hence
   1900 it is faster than more intuitive implementations.
   1901 
   1902 Note the @code{tx} command before the definition of the label.
   1903 This is often needed to reset the flag that is tested by
   1904 the @code{t} command.
   1905 
   1906 Imaginative readers will find uses for this script.  An example
   1907 is reversing the output of @command{banner}.@footnote{This requires
   1908 another script to pad the output of banner; for example
   1909 
   1910 @example
   1911 #! /bin/sh
   1912 
   1913 banner -w $1 $2 $3 $4 |
   1914   sed -e :a -e '/^.\@{0,'$1'\@}$/ @{ s/$/ /; ba; @}' |
   1915   ~/sedscripts/reverseline.sed
   1916 @end example
   1917 }
   1918 
   1919 @c start-------------------------------------------
   1920 @example
   1921 #!/usr/bin/sed -f
   1922 
   1923 /../! b
   1924 
   1925 # Reverse a line.  Begin embedding the line between two newlines
   1926 s/^.*$/\
   1927 &\
   1928 /
   1929 
# Move first character to the end.  The regexp matches until
   1931 # there are zero or one characters between the markers
   1932 tx
   1933 :x
   1934 s/\(\n.\)\(.*\)\(.\n\)/\3\2\1/
   1935 tx
   1936 
   1937 # Remove the newline markers
   1938 s/\n//g
   1939 @end example
   1940 @c end---------------------------------------------
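
The swapping step is easier to follow with a trace.  The sketch below
wraps the script in a shell function (the name @code{reverse_line} is
hypothetical) and its comments show how pattern space evolves for the
input @samp{abcd}, writing each newline marker as @samp{|}.

```shell
#!/bin/sh
# Sketch only: the reversing script wrapped in a made-up function.
# Trace for "abcd", with | standing for a newline marker:
#   |abcd|  ->  d|bc|a  ->  dc||ba  ->  dcba

reverse_line() {
  sed '
    # Lines with fewer than two characters need no work.
    /../! b

    # Embed the line between two newline markers.
    s/^.*$/\
&\
/

    # Reset the flag tested by t, then swap the outermost characters
    # until at most one character is left between the markers.
    tx
    :x
    s/\(\n.\)\(.*\)\(.\n\)/\3\2\1/
    tx

    # Remove the markers.
    s/\n//g
  '
}
```

Each pass of the loop moves one character from each end, which is why
a line of @var{n} characters needs only about @var{n}/2 substitutions.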
   1941 
   1942 @node tac
   1943 @section Reverse Lines of Files
   1944 
   1945 This one begins a series of totally useless (yet interesting)
   1946 scripts emulating various Unix commands.  This, in particular,
   1947 is a @command{tac} workalike.
   1948 
   1949 Note that on implementations other than @acronym{GNU} @command{sed}
   1950 @ifset PERL
   1951 and @value{SSED}
   1952 @end ifset
   1953 this script might easily overflow internal buffers.
   1954 
   1955 @c start-------------------------------------------
   1956 @example
   1957 #!/usr/bin/sed -nf
   1958 
# reverse all lines of input, i.e. the first line becomes the last, ...
   1960 
   1961 # from the second line, the buffer (which contains all previous lines)
   1962 # is *appended* to current line, so, the order will be reversed
   1963 1! G
   1964 
   1965 # on the last line we're done -- print everything
   1966 $ p
   1967 
   1968 # store everything on the buffer again
   1969 h
   1970 @end example
   1971 @c end---------------------------------------------
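
As a compact illustration, the same three commands fit on one line.
The sketch below uses a hypothetical function name, @code{tac_sed},
and traces the hold space for a three-line input.

```shell
#!/bin/sh
# Sketch only: tac_sed is a made-up name for the script above.
# Trace for the input lines a, b, c (| stands for a newline):
#   line 1: pattern "a"                  hold becomes "a"
#   line 2: G gives "b|a"                hold becomes "b|a"
#   line 3: G gives "c|b|a", $p prints it

tac_sed() { sed -n '1!G; $p; h'; }
```

Because every line read so far is kept in the hold space, memory use
grows with the size of the input.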
   1972 
   1973 @node cat -n
   1974 @section Numbering Lines
   1975 
   1976 This script replaces @samp{cat -n}; in fact it formats its output
   1977 exactly like @acronym{GNU} @command{cat} does.
   1978 
Of course this is completely useless, for two reasons:  first,
   1980 because somebody else did it in C, second, because the following
   1981 Bourne-shell script could be used for the same purpose and would
   1982 be much faster:
   1983 
   1984 @c start-------------------------------------------
   1985 @example
   1986 #! /bin/sh
sed -e "=" "$@@" | sed -e '
   1988   s/^/      /
   1989   N
   1990   s/^ *\(......\)\n/\1  /
   1991 '
   1992 @end example
   1993 @c end---------------------------------------------
   1994 
   1995 It uses @command{sed} to print the line number, then groups lines two
   1996 by two using @code{N}.  Of course, this script does not teach as much as
   1997 the one presented below.
   1998 
   1999 The algorithm used for incrementing uses both buffers, so the line
   2000 is printed as soon as possible and then discarded.  The number
   2001 is split so that changing digits go in a buffer and unchanged ones go
   2002 in the other; the changed digits are modified in a single step
   2003 (using a @code{y} command).  The line number for the next line
   2004 is then composed and stored in the hold space, to be used in the
   2005 next iteration.
   2006 
   2007 @c start-------------------------------------------
   2008 @example
   2009 #!/usr/bin/sed -nf
   2010 
   2011 # Prime the pump on the first line
   2012 x
   2013 /^$/ s/^.*$/1/
   2014 
   2015 # Add the correct line number before the pattern
   2016 G
   2017 h
   2018 
   2019 # Format it and print it
   2020 s/^/      /
   2021 s/^ *\(......\)\n/\1  /p
   2022 
   2023 # Get the line number from hold space; add a zero
   2024 # if we're going to add a digit on the next line
   2025 g
   2026 s/\n.*$//
   2027 /^9*$/ s/^/0/
   2028 
   2029 # separate changing/unchanged digits with an x
   2030 s/.9*$/x&/
   2031 
   2032 # keep changing digits in hold space
   2033 h
   2034 s/^.*x//
   2035 y/0123456789/1234567890/
   2036 x
   2037 
   2038 # keep unchanged digits in pattern space
   2039 s/x.*$//
   2040 
   2041 # compose the new number, remove the newline implicitly added by G
   2042 G
   2043 s/\n//
   2044 h
   2045 @end example
   2046 @c end---------------------------------------------
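
The incrementing step can be tried in isolation.  The following sketch
extracts just that part of the script into a shell function that
increments the number on each input line; the name @code{incr_split}
is hypothetical, and the comments trace the input @samp{1299}.

```shell
#!/bin/sh
# Sketch only: the two-buffer increment, isolated in a made-up function.

incr_split() {
  sed '
    # A number made only of nines gains a leading zero: 999 -> 0999
    /^9*$/ s/^/0/

    # Mark where the changing digits start: 1299 -> 1x299
    s/.9*$/x&/

    # Changing digits, incremented in one step: 299 -> 300
    h
    s/^.*x//
    y/0123456789/1234567890/

    # Swap: pattern space gets 1x299, hold space gets 300
    x

    # Unchanged digits: 1x299 -> 1
    s/x.*$//

    # Join both parts and drop the newline added by G: 1300
    G
    s/\n//
  '
}
```

The @code{y} command maps every 9 to 0 and the single non-nine digit
to its successor, so no loop is needed.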
   2047 
   2048 @node cat -b
   2049 @section Numbering Non-blank Lines
   2050 
   2051 Emulating @samp{cat -b} is almost the same as @samp{cat -n}---we only
   2052 have to select which lines are to be numbered and which are not.
   2053 
   2054 The part that is common to this script and the previous one is
   2055 not commented to show how important it is to comment @command{sed}
   2056 scripts properly...
   2057 
   2058 @c start-------------------------------------------
   2059 @example
   2060 #!/usr/bin/sed -nf
   2061 
   2062 /^$/ @{
   2063   p
   2064   b
   2065 @}
   2066 
   2067 # Same as cat -n from now
   2068 x
   2069 /^$/ s/^.*$/1/
   2070 G
   2071 h
   2072 s/^/      /
   2073 s/^ *\(......\)\n/\1  /p
   2074 x
   2075 s/\n.*$//
   2076 /^9*$/ s/^/0/
   2077 s/.9*$/x&/
   2078 h
   2079 s/^.*x//
   2080 y/0123456789/1234567890/
   2081 x
   2082 s/x.*$//
   2083 G
   2084 s/\n//
   2085 h
   2086 @end example
   2087 @c end---------------------------------------------
   2088 
   2089 @node wc -c
   2090 @section Counting Characters
   2091 
   2092 This script shows another way to do arithmetic with @command{sed}.
   2093 In this case we have to add possibly large numbers, so implementing
   2094 this by successive increments would not be feasible (and possibly
   2095 even more complicated to contrive than this script).
   2096 
   2097 The approach is to map numbers to letters, kind of an abacus
   2098 implemented with @command{sed}.  @samp{a}s are units, @samp{b}s are
   2099 tens and so on: we simply add the number of characters
   2100 on the current line as units, and then propagate the carry
   2101 to tens, hundreds, and so on.
   2102 
   2103 As usual, running totals are kept in hold space.
   2104 
   2105 On the last line, we convert the abacus form back to decimal.
   2106 For the sake of variety, this is done with a loop rather than
   2107 with some 80 @code{s} commands@footnote{Some implementations
have a limit of 199 commands per script.}: first we
   2109 convert units, removing @samp{a}s from the number; then we
   2110 rotate letters so that tens become @samp{a}s, and so on
   2111 until no more letters remain.
   2112 
   2113 @c start-------------------------------------------
   2114 @example
   2115 #!/usr/bin/sed -nf
   2116 
   2117 # Add n+1 a's to hold space (+1 is for the newline)
   2118 s/./a/g
   2119 H
   2120 x
   2121 s/\n/a/
   2122 
   2123 # Do the carry.  The t's and b's are not necessary,
   2124 # but they do speed up the thing
   2125 t a
   2126 : a;  s/aaaaaaaaaa/b/g; t b; b done
   2127 : b;  s/bbbbbbbbbb/c/g; t c; b done
   2128 : c;  s/cccccccccc/d/g; t d; b done
   2129 : d;  s/dddddddddd/e/g; t e; b done
   2130 : e;  s/eeeeeeeeee/f/g; t f; b done
   2131 : f;  s/ffffffffff/g/g; t g; b done
   2132 : g;  s/gggggggggg/h/g; t h; b done
   2133 : h;  s/hhhhhhhhhh//g
   2134 
   2135 : done
   2136 $! @{
   2137   h
   2138   b
   2139 @}
   2140 
   2141 # On the last line, convert back to decimal
   2142 
   2143 : loop
   2144 /a/! s/[b-h]*/&0/
   2145 s/aaaaaaaaa/9/
   2146 s/aaaaaaaa/8/
   2147 s/aaaaaaa/7/
   2148 s/aaaaaa/6/
   2149 s/aaaaa/5/
   2150 s/aaaa/4/
   2151 s/aaa/3/
   2152 s/aa/2/
   2153 s/a/1/
   2154 
   2155 : next
   2156 y/bcdefgh/abcdefg/
   2157 /[a-h]/ b loop
   2158 p
   2159 @end example
   2160 @c end---------------------------------------------
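
The two halves of the abacus arithmetic can be tried separately.  The
sketch below is a cut-down illustration, not the full script: the
hypothetical function @code{carry} propagates carries only up to
hundreds, and the hypothetical function @code{to_decimal} is the
conversion loop from the last part of the script.

```shell
#!/bin/sh
# Sketch only: made-up helpers showing the abacus representation.

# Carry: ten units (a) make one ten (b), ten tens make one hundred (c).
# Example: 23 units become "bbaaa".
carry() {
  sed '
    s/aaaaaaaaaa/b/g
    s/bbbbbbbbbb/c/g
  '
}

# Convert the abacus form back to decimal, one digit per pass.
# Example: bbaaa -> bb3 -> aa3 -> 23
to_decimal() {
  sed '
    : loop
    # No units left: this digit is a zero.
    /a/! s/[b-h]*/&0/
    s/aaaaaaaaa/9/
    s/aaaaaaaa/8/
    s/aaaaaaa/7/
    s/aaaaaa/6/
    s/aaaaa/5/
    s/aaaa/4/
    s/aaa/3/
    s/aa/2/
    s/a/1/
    # Rotate the letters so that tens become the new units.
    y/bcdefgh/abcdefg/
    /[a-h]/ b loop
  '
}
```

For example, @samp{printf '%23s\n' '' | tr ' ' a | carry | to_decimal}
builds a line of 23 @samp{a}s and prints @samp{23}.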
   2161 
   2162 @node wc -w
   2163 @section Counting Words
   2164 
   2165 This script is almost the same as the previous one, once each
   2166 of the words on the line is converted to a single @samp{a}
   2167 (in the previous script each letter was changed to an @samp{a}).
   2168 
   2169 It is interesting that real @command{wc} programs have optimized
   2170 loops for @samp{wc -c}, so they are much slower at counting
words than at counting characters.  This script's bottleneck,
   2172 instead, is arithmetic, and hence the word-counting one
   2173 is faster (it has to manage smaller numbers).
   2174 
   2175 Again, the common parts are not commented to show the importance
   2176 of commenting @command{sed} scripts.
   2177 
   2178 @c start-------------------------------------------
   2179 @example
   2180 #!/usr/bin/sed -nf
   2181 
   2182 # Convert words to a's
   2183 s/[ @kbd{tab}][ @kbd{tab}]*/ /g
   2184 s/^/ /
   2185 s/ [^ ][^ ]*/a /g
   2186 s/ //g
   2187 
   2188 # Append them to hold space
   2189 H
   2190 x
   2191 s/\n//
   2192 
   2193 # From here on it is the same as in wc -c.
   2194 /aaaaaaaaaa/! bx;   s/aaaaaaaaaa/b/g
   2195 /bbbbbbbbbb/! bx;   s/bbbbbbbbbb/c/g
   2196 /cccccccccc/! bx;   s/cccccccccc/d/g
   2197 /dddddddddd/! bx;   s/dddddddddd/e/g
   2198 /eeeeeeeeee/! bx;   s/eeeeeeeeee/f/g
   2199 /ffffffffff/! bx;   s/ffffffffff/g/g
   2200 /gggggggggg/! bx;   s/gggggggggg/h/g
   2201 s/hhhhhhhhhh//g
   2202 :x
   2203 $! @{ h; b; @}
   2204 :y
   2205 /a/! s/[b-h]*/&0/
   2206 s/aaaaaaaaa/9/
   2207 s/aaaaaaaa/8/
   2208 s/aaaaaaa/7/
   2209 s/aaaaaa/6/
   2210 s/aaaaa/5/
   2211 s/aaaa/4/
   2212 s/aaa/3/
   2213 s/aa/2/
   2214 s/a/1/
   2215 y/bcdefgh/abcdefg/
   2216 /[a-h]/ by
   2217 p
   2218 @end example
   2219 @c end---------------------------------------------
   2220 
   2221 @node wc -l
   2222 @section Counting Lines
   2223 
   2224 No strange things are done now, because @command{sed} gives us
   2225 @samp{wc -l} functionality for free!!! Look:
   2226 
   2227 @c start-------------------------------------------
   2228 @example
   2229 #!/usr/bin/sed -nf
   2230 $=
   2231 @end example
   2232 @c end---------------------------------------------
   2233 
   2234 @node head
   2235 @section Printing the First Lines
   2236 
   2237 This script is probably the simplest useful @command{sed} script.
   2238 It displays the first 10 lines of input; the number of displayed
   2239 lines is right before the @code{q} command.
   2240 
   2241 @c start-------------------------------------------
   2242 @example
   2243 #!/usr/bin/sed -f
   2244 10q
   2245 @end example
   2246 @c end---------------------------------------------
   2247 
   2248 @node tail
   2249 @section Printing the Last Lines
   2250 
   2251 Printing the last @var{n} lines rather than the first is more complex
   2252 but indeed possible.  @var{n} is encoded in the second line, before
   2253 the bang character.
   2254 
   2255 This script is similar to the @command{tac} script in that it keeps the
   2256 final output in the hold space and prints it at the end:
   2257 
   2258 @c start-------------------------------------------
   2259 @example
   2260 #!/usr/bin/sed -nf
   2261 
   2262 1! @{; H; g; @}
   2263 1,10 !s/[^\n]*\n//
   2264 $p
   2265 h
   2266 @end example
   2267 @c end---------------------------------------------
   2268 
Mainly, the script keeps a window of 10 lines and slides it
   2270 by adding a line and deleting the oldest (the substitution command
   2271 on the second line works like a @code{D} command but does not
   2272 restart the loop).
   2273 
   2274 The ``sliding window'' technique is a very powerful way to write
   2275 efficient and complex @command{sed} scripts, because commands like
   2276 @code{P} would require a lot of work if implemented manually.
   2277 
   2278 To introduce the technique, which is fully demonstrated in the
   2279 rest of this chapter and is based on the @code{N}, @code{P}
   2280 and @code{D} commands, here is an implementation of @command{tail}
   2281 using a simple ``sliding window.''
   2282 
This looks complicated but in fact it works the same way as
the last script: after we have kicked in the appropriate number
   2285 of lines, however, we stop using the hold space to keep inter-line
   2286 state, and instead use @code{N} and @code{D} to slide pattern
   2287 space by one line:
   2288 
   2289 @c start-------------------------------------------
   2290 @example
   2291 #!/usr/bin/sed -f
   2292 
   2293 1h
   2294 2,10 @{; H; g; @}
   2295 $q
   2296 1,9d
   2297 N
   2298 D
   2299 @end example
   2300 @c end---------------------------------------------
   2301 
Note how the first, second and fourth lines are inactive after
   2303 the first ten lines of input.  After that, all the script does
   2304 is: exiting on the last line of input, appending the next input
   2305 line to pattern space, and removing the first line.
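
A smaller window makes both variants easier to trace.  The sketch
below restates the two scripts for a window of three lines; the
function names are hypothetical, and the only changes are the line
numbers that encode the window size.

```shell
#!/bin/sh
# Sketch only: both tail techniques with a 3-line window.
# Function names are made up.

# Hold-space version: the window lives in the hold space and the
# oldest line is cut off once more than 3 lines have been seen.
tail3_hold() {
  sed -n '
    1! { H; g; }
    1,3! s/[^\n]*\n//
    $p
    h
  '
}

# N/D version: once 3 lines are collected, N appends the next line
# and D drops the oldest one.  For the input 1..5, pattern space goes
#   1|2|3  ->  1|2|3|4  ->  2|3|4  ->  2|3|4|5  ->  3|4|5
# where | stands for a newline.
tail3_slide() {
  sed '
    1h
    2,3 { H; g; }
    $q
    1,2d
    N
    D
  '
}
```

Both functions print the same lines; the second one stops using the
hold space after the third input line.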
   2306 
   2307 @node uniq
   2308 @section Make Duplicate Lines Unique
   2309 
   2310 This is an example of the art of using the @code{N}, @code{P}
   2311 and @code{D} commands, probably the most difficult to master.
   2312 
   2313 @c start-------------------------------------------
   2314 @example
   2315 #!/usr/bin/sed -f
   2316 h
   2317 
   2318 :b
   2319 # On the last line, print and exit
   2320 $b
   2321 N
   2322 /^\(.*\)\n\1$/ @{
   2323     # The two lines are identical.  Undo the effect of
    # the N command.
   2325     g
   2326     bb
   2327 @}
   2328 
   2329 # If the @code{N} command had added the last line, print and exit
   2330 $b
   2331 
   2332 # The lines are different; print the first and go
   2333 # back working on the second.
   2334 P
   2335 D
   2336 @end example
   2337 @c end---------------------------------------------
   2338 
As you can see, we maintain a 2-line window using @code{P} and @code{D}.
This technique is often used in advanced @command{sed} scripts.
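
The same two-line window can also be written very compactly.  The
following one-liner is a well-known alternative formulation, not the
script above; it uses @code{$!N} so that it does not depend on what
@code{N} does on the last line of input.

@example
# join each line with the next, print the first only if they differ,
# then drop the first and start over with what is left
printf '%s\n' a a b b b c | sed '$!N; /^\(.*\)\n\1$/!P; D'
@end example

@noindent
For this input the output should be the three lines @samp{a},
@samp{b} and @samp{c}.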
   2341 
   2342 @node uniq -d
   2343 @section Print Duplicated Lines of Input
   2344 
   2345 This script prints only duplicated lines, like @samp{uniq -d}.
   2346 
@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

$b
N
/^\(.*\)\n\1$/ @{
    # Print the first of the duplicated lines
    s/.*\n//
    p

    # Loop until we get a different line
    :b
    $b
    N
    /^\(.*\)\n\1$/ @{
        s/.*\n//
        bb
    @}
@}

# The last line cannot be followed by duplicates
$b

# Found a different one.  Leave it alone in the pattern space
# and go back to the top, hunting its duplicates
D
@end example
@c end---------------------------------------------
   2376 
   2377 @node uniq -u
   2378 @section Remove All Duplicated Lines
   2379 
   2380 This script prints only unique lines, like @samp{uniq -u}.
   2381 
@c start-------------------------------------------
@example
#!/usr/bin/sed -f

# Search for a duplicate line --- until that, print what you find.
$b
N
/^\(.*\)\n\1$/ ! @{
    P
    D
@}

:c
# Got two equal lines in pattern space.  At the
# end of the file we simply exit
$d

# Else, we keep reading lines with @code{N} until we
# find a different one
s/.*\n//
N
/^\(.*\)\n\1$/ @{
    bc
@}

# Remove the last instance of the duplicate line
# and go back to the top
D
@end example
@c end---------------------------------------------
   2412 
   2413 @node cat -s
   2414 @section Squeezing Blank Lines
   2415 
   2416 As a final example, here are three scripts, of increasing complexity
   2417 and speed, that implement the same function as @samp{cat -s}, that is
   2418 squeezing blank lines.
   2419 
   2420 The first leaves a blank line at the beginning and end if there are
   2421 some already.
   2422 
@c start-------------------------------------------
@example
#!/usr/bin/sed -f

# on empty lines, join with next
# Note there is a star in the regexp
:x
/^\n*$/ @{
N
bx
@}

# now, squeeze all leading '\n' into one (at least one must be
# present, or a newline would be added to every line); this can
# also be done by:
# s/^\(\n\)*/\1/
s/\n\n*/\
/
@end example
@c end---------------------------------------------
   2441 
This one is a bit more complex and removes all empty lines
at the beginning.  It does leave a single blank line at the end
if one was there.
   2445 
@c start-------------------------------------------
@example
#!/usr/bin/sed -f

# delete all leading empty lines; the 0 address lets the range
# end on line 1 when the first line is already non-empty
0,/^./@{
/./!d
@}

# on an empty line we remove it and all the following
# empty lines, but one
:x
/./!@{
N
s/^\n$//
tx
@}
@end example
@c end---------------------------------------------
   2465 
This removes leading and trailing blank lines.  It is also the
fastest.  Note that loops are done entirely with @code{n} and
@code{b}, without relying on @command{sed} to restart
the script automatically at the end of a line.
   2470 
@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

# delete all (leading) blanks
/./!d

# get here: so there is a non empty
:x
# print it
p
# get next
n
# got chars? print it again, etc...
/./bx

# no, don't have chars: got an empty line
:z
# get next, if last line we finish here so no trailing
# empty lines are written
n
# also empty? then ignore it, and get next... this will
# remove ALL empty lines
/./!bz

# all empty lines were deleted/ignored, but we have a non empty.  As
# what we want to do is to squeeze, insert a blank line artificially
i\

bx
@end example
@c end---------------------------------------------
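
To try the last script, it can be saved to a file and run on input with
leading, repeated and trailing blank lines.  The file name
@file{squeeze.sed} is only a placeholder chosen for this sketch, and the
comments of the script are left out.

@example
cat > squeeze.sed <<'EOF'
/./!d
:x
p
n
/./bx
:z
n
/./!bz
i\

bx
EOF
# two leading blanks, two blanks in the middle, one trailing blank
printf '\n\none\n\n\ntwo\n\n' | sed -n -f squeeze.sed
@end example

@noindent
The expected output is @samp{one}, a single empty line, and @samp{two}.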
   2503 
   2504 @node Limitations
   2505 @chapter @value{SSED}'s Limitations and Non-limitations
   2506 
   2507 @cindex @acronym{GNU} extensions, unlimited line length
   2508 @cindex Portability, line length limitations
   2509 For those who want to write portable @command{sed} scripts,
   2510 be aware that some implementations have been known to
   2511 limit line lengths (for the pattern and hold spaces)
   2512 to be no more than 4000 bytes.
   2513 The @sc{posix} standard specifies that conforming @command{sed}
   2514 implementations shall support at least 8192 byte line lengths.
   2515 @value{SSED} has no built-in limit on line length;
   2516 as long as it can @code{malloc()} more (virtual) memory,
   2517 you can feed or construct lines as long as you like.
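
A rough way to see this in practice is to build a line much longer than
the limits mentioned above and pass it through @command{sed}.  The
sketch assumes that @samp{head -c}, @command{tr} and @command{wc} are
available; the length of 100000 bytes is an arbitrary choice.

@example
# one line of 100000 'x' characters plus a newline; replace the last
# 'x' and count the bytes that come out
(head -c 100000 /dev/zero | tr '\0' x; echo) | sed 's/x$/y/' | wc -c
@end example

@noindent
With an implementation that has no line length limit, the count
printed should be 100001.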
   2518 
   2519 However, recursion is used to handle subpatterns and indefinite
   2520 repetition.  This means that the available stack space may limit
   2521 the size of the buffer that can be processed by certain patterns.
   2522 
   2523 @ifset PERL
   2524 There are some size limitations in the regular expression
   2525 matcher but it is hoped that they will never in practice
   2526 be relevant.  The maximum length of a compiled pattern
   2527 is 65539 (sic) bytes.  All values in repeating quantifiers
   2528 must be less than 65536.  The maximum nesting depth of
   2529 all parenthesized subpatterns, including capturing and
   2530 non-capturing subpatterns@footnote{The
   2531 distinction is meaningful when referring to Perl-style
   2532 regular expressions.}, assertions, and other types of
   2533 subpattern, is 200.
   2534 
   2535 Also, @value{SSED} recognizes the @sc{posix} syntax
   2536 @code{[.@var{ch}.]} and @code{[=@var{ch}=]}
   2537 where @var{ch} is a ``collating element'', but these
   2538 are not supported, and an error is given if they are
   2539 encountered.
   2540 
   2541 Here are a few distinctions between the real Perl-style
   2542 regular expressions and those that @option{-R} recognizes.
   2543 
   2544 @enumerate
@item
Lookahead assertions do not allow repeat quantifiers after them.
Perl permits them, but they do not mean what you
might think.  For example, @samp{(?!a)@{3@}} does not assert that the
next three characters are not @samp{a}.  It just asserts three times that the
next character is not @samp{a}, which is a waste of time and nothing else.

@item
Capturing subpatterns that occur inside negative lookahead
assertions are counted, but their entries are treated
as empty in the second half of an @code{s} command.
Perl sets its numerical variables from any such patterns
that are matched before the assertion fails to match
something (thereby succeeding), but only if the negative
lookahead assertion contains just one branch.
   2560 
   2561 @item
   2562 The following Perl escape sequences are not supported:
   2563 @samp{\l}, @samp{\u}, @samp{\L}, @samp{\U}, @samp{\E},
   2564 @samp{\Q}. In fact these are implemented by Perl's general
   2565 string-handling and are not part of its pattern matching engine.
   2566 
   2567 @item
   2568 The Perl @samp{\G} assertion is not supported as it is not
   2569 relevant to single pattern matches.
   2570 
   2571 @item
   2572 Fairly obviously, @value{SSED} does not support the @samp{(?@{code@})}
   2573 and @samp{(?p@{code@})} constructions. However, there is some experimental
   2574 support for recursive patterns using the non-Perl item @samp{(?R)}.
   2575 
   2576 @item
   2577 There are at the time of writing some oddities in Perl
   2578 5.005_02 concerned with the settings of captured strings
   2579 when part of a pattern is repeated. For example, matching
   2580 @samp{aba} against the pattern @samp{/^(a(b)?)+$/} sets
   2581 @samp{$2}@footnote{@samp{$2} would be @samp{\2} in @value{SSED}.}
   2582 to the value @samp{b}, but matching @samp{aabbaa}
   2583 against @samp{/^(aa(bb)?)+$/} leaves @samp{$2}
   2584 unset.  However, if the pattern is changed to
   2585 @samp{/^(aa(b(b))?)+$/} then @samp{$2} (and @samp{$3}) are set.
   2586 In Perl 5.004 @samp{$2} is set in both cases, and that is also
   2587 true of @value{SSED}.
   2588 
   2589 @item
   2590 Another as yet unresolved discrepancy is that in Perl
   2591 5.005_02 the pattern @samp{/^(a)?(?(1)a|b)+$/} matches
   2592 the string @samp{a}, whereas in @value{SSED} it does not.
However, in both Perl and @value{SSED} @samp{/^(a)?a/} matched
against @samp{a} leaves @samp{$1} unset.
   2595 @end enumerate
   2596 @end ifset
   2597 
   2598 @node Other Resources
   2599 @chapter Other Resources for Learning About @command{sed}
   2600 
   2601 @cindex Additional reading about @command{sed}
   2602 In addition to several books that have been written about @command{sed}
   2603 (either specifically or as chapters in books which discuss
   2604 shell programming), one can find out more about @command{sed}
   2605 (including suggestions of a few books) from the FAQ
   2606 for the @code{sed-users} mailing list, available from:
   2607 @display
   2608 @uref{http://sed.sourceforge.net/sedfaq.html}
   2609 @end display
   2610 
   2611 Also of interest are
   2612 @uref{http://www.student.northpark.edu/pemente/sed/index.htm}
   2613 and @uref{http://sed.sf.net/grabbag},
   2614 which include @command{sed} tutorials and other @command{sed}-related goodies.
   2615 
The @code{sed-users} mailing list itself is maintained by Sven Guckes.
To subscribe, visit @uref{http://groups.yahoo.com} and search
for the @code{sed-users} mailing list.
   2619 
   2620 @node Reporting Bugs
   2621 @chapter Reporting Bugs
   2622 
   2623 @cindex Bugs, reporting
   2624 Email bug reports to @email{bonzini@@gnu.org}.
   2625 Be sure to include the word ``sed'' somewhere in the @code{Subject:} field.
   2626 Also, please include the output of @samp{sed --version} in the body
   2627 of your report if at all possible.
   2628 
   2629 Please do not send a bug report like this:
   2630 
@example
@i{@r{while building frobme-1.3.4}}
$ configure
@error{} sed: file sedscr line 1: Unknown option to 's'
@end example
   2636 
   2637 If @value{SSED} doesn't configure your favorite package, take a
   2638 few extra minutes to identify the specific problem and make a stand-alone
   2639 test case.  Unlike other programs such as C compilers, making such test
   2640 cases for @command{sed} is quite simple.
   2641 
   2642 A stand-alone test case includes all the data necessary to perform the
   2643 test, and the specific invocation of @command{sed} that causes the problem.
   2644 The smaller a stand-alone test case is, the better.  A test case should
   2645 not involve something as far removed from @command{sed} as ``try to configure
   2646 frobme-1.3.4''.  Yes, that is in principle enough information to look
   2647 for the bug, but that is not a very practical prospect.
   2648 
   2649 Here are a few commonly reported bugs that are not bugs.
   2650 
   2651 @table @asis
   2652 @item @code{N} command on the last line
   2653 @cindex Portability, @code{N} command on the last line
   2654 @cindex Non-bugs, @code{N} command on the last line
   2655 
Most versions of @command{sed} exit without printing anything when
the @code{N} command is issued on the last line of a file.
@value{SSED} prints pattern space before exiting unless of course
the @option{-n} command switch has been specified.  This choice is
by design.

For example, under the traditional behavior the output of
@example
sed N foo bar
@end example
@noindent
would depend on whether @file{foo} has an even or an odd number of
lines@footnote{which is the actual ``bug'' that prompted the
change in behavior}.  Or, when writing a script to read the
next few lines following a pattern match, traditional
implementations of @command{sed} would force you to write
something like
@example
/foo/@{ $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N @}
@end example
@noindent
instead of just
@example
/foo/@{ N;N;N;N;N;N;N;N;N; @}
@end example

@cindex @code{POSIXLY_CORRECT} behavior, @code{N} command
In any case, the simplest workaround is to use @code{$d;N} in
scripts that rely on the traditional behavior, or to set
the @code{POSIXLY_CORRECT} variable to a non-empty value.
   2686 
   2687 @item Regex syntax clashes (problems with backslashes)
   2688 @cindex @acronym{GNU} extensions, to basic regular expressions
   2689 @cindex Non-bugs, regex syntax clashes
   2690 @command{sed} uses the @sc{posix} basic regular expression syntax.  According to
   2691 the standard, the meaning of some escape sequences is undefined in
   2692 this syntax;  notable in the case of @command{sed} are @code{\|},
   2693 @code{\+}, @code{\?}, @code{\`}, @code{\'}, @code{\<},
   2694 @code{\>}, @code{\b}, @code{\B}, @code{\w}, and @code{\W}.
   2695 
   2696 As in all @acronym{GNU} programs that use @sc{posix} basic regular
   2697 expressions, @command{sed} interprets these escape sequences as special
   2698 characters.  So, @code{x\+} matches one or more occurrences of @samp{x}.
   2699 @code{abc\|def} matches either @samp{abc} or @samp{def}.
   2700 
   2701 This syntax may cause problems when running scripts written for other
   2702 @command{sed}s.  Some @command{sed} programs have been written with the
   2703 assumption that @code{\|} and @code{\+} match the literal characters
   2704 @code{|} and @code{+}.  Such scripts must be modified by removing the
   2705 spurious backslashes if they are to be used with modern implementations
   2706 of @command{sed}, like
   2707 @ifset PERL
   2708 @value{SSED} or
   2709 @end ifset
   2710 @acronym{GNU} @command{sed}.
   2711 
On the other hand, some scripts use @code{s|abc\|def||g} to remove occurrences
of @emph{either} @code{abc} or @code{def}.  While this worked until
@command{sed} 4.0.x, newer versions interpret this as removing the
string @code{abc|def}.  This is again undefined behavior according to
@sc{posix}, and this interpretation is arguably more robust: older
@command{sed}s, for example, required the regex matcher to parse
@code{\/} as @code{/} in the common case of escaping a slash, which is
again undefined behavior; the new behavior avoids this, and this is good
because the regex matcher is only partially under our control.
   2721 
   2722 @cindex @acronym{GNU} extensions, special escapes
   2723 In addition, this version of @command{sed} supports several escape characters
   2724 (some of which are multi-character) to insert non-printable characters
   2725 in scripts (@code{\a}, @code{\c}, @code{\d}, @code{\o}, @code{\r},
   2726 @code{\t}, @code{\v}, @code{\x}).  These can cause similar problems
   2727 with scripts written for other @command{sed}s.
   2728 
   2729 @item @option{-i} clobbers read-only files
   2730 @cindex In-place editing
   2731 @cindex @value{SSEDEXT}, in-place editing
   2732 @cindex Non-bugs, in-place editing
   2733 
   2734 In short, @samp{sed -i} will let you delete the contents of
   2735 a read-only file, and in general the @option{-i} option
   2736 (@pxref{Invoking sed, , Invocation}) lets you clobber
   2737 protected files.  This is not a bug, but rather a consequence
   2738 of how the Unix filesystem works.
   2739 
The permissions on a file say what can happen to the data
in that file, while the permissions on a directory say what can
happen to the list of files in that directory.  @samp{sed -i}
will never open for writing a file that is already on disk.
Rather, it will work on a temporary file that is finally renamed
to the original name: if you rename or delete files, you're actually
modifying the contents of the directory, so the operation depends on
the permissions of the directory, not of the file.  For this same
reason, @command{sed} does not let you use @option{-i} on a writable file
in a read-only directory, and will break hard or symbolic links when
@option{-i} is used on such a file.
   2751 
   2752 @item @code{0a} does not work (gives an error)
   2753 @cindex @code{0} address
   2754 @cindex @acronym{GNU} extensions, @code{0} address
   2755 @cindex Non-bugs, @code{0} address
   2756 
   2757 There is no line 0.  0 is a special address that is only used to treat
   2758 addresses like @code{0,/@var{RE}/} as active when the script starts: if
   2759 you write @code{1,/abc/d} and the first line includes the word @samp{abc},
   2760 then that match would be ignored because address ranges must span at least
   2761 two lines (barring the end of the file); but what you probably wanted is
   2762 to delete every line up to the first one including @samp{abc}, and this
   2763 is obtained with @code{0,/abc/d}.
   2764 
   2765 @ifclear PERL
   2766 @item @code{[a-z]} is case insensitive
   2767 @cindex Non-bugs, localization-related
   2768 
You are encountering problems with locales.  @sc{posix} mandates that @code{[a-z]}
uses the current locale's collation order -- in C parlance, that means using
@code{strcoll(3)} instead of @code{strcmp(3)}.  Some locales have a
case-insensitive collation order, others don't.

Another problem is that @code{[a-z]} tries to use collation symbols.
This only happens if you are on the @acronym{GNU} system, using
@acronym{GNU} libc's regular expression matcher instead of compiling the
one supplied with @acronym{GNU} @command{sed}.  In a Danish locale, for example,
the regular expression @code{^[a-z]$} matches the string @samp{aa},
because this is a single collating symbol that comes after @samp{a}
and before @samp{b}; @samp{ll} behaves similarly in Spanish
locales, or @samp{ij} in Dutch locales.
   2782 
   2783 To work around these problems, which may cause bugs in shell scripts, set
   2784 the @env{LC_COLLATE} and @env{LC_CTYPE} environment variables to @samp{C}.
   2785 
   2786 @item @code{s/.*//} does not clear pattern space
   2787 @cindex Non-bugs, localization-related
   2788 @cindex @value{SSEDEXT}, emptying pattern space
   2789 @cindex Emptying pattern space
   2790 
This happens if your input stream includes invalid multibyte
sequences.  @sc{posix} mandates that such sequences
are @emph{not} matched by @samp{.}, so that @samp{s/.*//} will not clear
pattern space as you would expect.  In fact, there is no way to clear
@command{sed}'s buffers in the middle of the script in most multibyte locales
(including UTF-8 locales).  For this reason, @value{SSED} provides a @code{z}
command (for ``zap'') as an extension.
   2798 
   2799 To work around these problems, which may cause bugs in shell scripts, set
   2800 the @env{LC_COLLATE} and @env{LC_CTYPE} environment variables to @samp{C}.
   2801 @end ifclear
   2802 @end table
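
The first non-bug in the table above, the @code{N} command on the last
line, can be observed with a three-line input.  This sketch assumes
@acronym{GNU} @command{sed}; the comments state what each line is
expected to print according to the description above.

@example
printf '1\n2\n3\n' | sed N                     # prints 1, 2 and 3
printf '1\n2\n3\n' | POSIXLY_CORRECT=1 sed N   # prints 1 and 2 only
printf '1\n2\n3\n' | sed '$d;N'                # prints 1 and 2 only
printf '1\n2\n3\n' | sed '$!N'                 # prints 1, 2 and 3
@end example

@noindent
The last form, @code{$!N}, never runs @code{N} on the last line, so it
does not depend on which behavior is in effect.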
   2803 
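The point about @option{-i} and read-only files, also from the table
above, can be reproduced in a scratch directory.  This is a minimal
sketch that assumes @command{mktemp} is available; the names
@code{dir} and @file{file} are placeholders.

@example
dir=$(mktemp -d)
printf 'hello\n' > "$dir/file"
chmod 444 "$dir/file"                   # the file itself is read-only
sed -i 's/hello/goodbye/' "$dir/file"   # works: the directory is writable
cat "$dir/file"                         # prints goodbye
rm -rf "$dir"
@end example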
   2804 
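The difference between the @code{1,/@var{RE}/} and @code{0,/@var{RE}/}
addresses described in the table above shows up when the first line
already matches.  The input below is made up for the illustration.

@example
# the range starts on line 1 and ends at the next match, on line 3
printf 'abc\nx\nabc\ny\n' | sed '1,/abc/d'   # prints y
# the range may end on line 1 itself
printf 'abc\nx\nabc\ny\n' | sed '0,/abc/d'   # prints x, abc and y
@end example
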
   2805 @node Extended regexps
   2806 @appendix Extended regular expressions
   2807 @cindex Extended regular expressions, syntax
   2808 
The only difference between basic and extended regular expressions is in
the behavior of a few characters: @samp{?}, @samp{+}, parentheses,
braces (@samp{@{@}}), and @samp{|}.  While basic regular expressions require
these to be escaped if you want them to behave as special characters,
when using extended regular expressions you must escape them if
you want them @emph{to match a literal character}.
   2815 
   2816 @noindent
   2817 Examples:
   2818 @table @code
   2819 @item abc?
   2820 becomes @samp{abc\?} when using extended regular expressions.  It matches
   2821 the literal string @samp{abc?}.
   2822 
   2823 @item c\+
   2824 becomes @samp{c+} when using extended regular expressions.  It matches
   2825 one or more @samp{c}s.
   2826 
   2827 @item a\@{3,\@}
   2828 becomes @samp{a@{3,@}} when using extended regular expressions.  It matches
   2829 three or more @samp{a}s.
   2830 
   2831 @item \(abc\)\@{2,3\@}
   2832 becomes @samp{(abc)@{2,3@}} when using extended regular expressions.  It
   2833 matches either @samp{abcabc} or @samp{abcabcabc}.
   2834 
   2835 @item \(abc*\)\1
   2836 becomes @samp{(abc*)\1} when using extended regular expressions.
   2837 Backreferences must still be escaped when using extended regular
   2838 expressions.
   2839 @end table
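
Some of the pairs above can be checked side by side.  This sketch
assumes that extended syntax is selected with the @option{-r} option of
@acronym{GNU} @command{sed}; the input strings are made up for the
illustration.

@example
echo 'accc' | sed    's/c\+/X/'          # basic:    aX
echo 'accc' | sed -r 's/c+/X/'           # extended: aX
echo 'abc?' | sed    's/abc?/X/'         # basic:    X
echo 'abc?' | sed -r 's/abc\?/X/'        # extended: X
echo 'abab' | sed    's/\(ab\)\1/X/'     # basic:    X
echo 'abab' | sed -r 's/(ab)\1/X/'       # extended: X
@end example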
   2840 
   2841 @ifset PERL
   2842 @node Perl regexps
   2843 @appendix Perl-style regular expressions
   2844 @cindex Perl-style regular expressions, syntax
   2845 
   2846 @emph{This part is taken from the @file{pcre.txt} file distributed together
   2847 with the free @sc{pcre} regular expression matcher; it was written by Philip Hazel.}
   2848 
   2849 Perl introduced several extensions to regular expressions, some
   2850 of them incompatible with the syntax of regular expressions
   2851 accepted by Emacs and other @acronym{GNU} tools (whose matcher was
   2852 based on the Emacs matcher).  @value{SSED} implements
   2853 both kinds of extensions.
   2854 
   2855 @iftex
   2856 Summarizing, we have:
   2857 
   2858 @itemize @bullet
   2859 @item
   2860 A backslash can introduce several special sequences
   2861 
   2862 @item
   2863 The circumflex, dollar sign, and period characters behave specially 
   2864 with regard to new lines
   2865 
   2866 @item
   2867 Strange uses of square brackets are parsed differently
   2868 
   2869 @item
   2870 You can toggle modifiers in the middle of a regular expression
   2871 
   2872 @item
   2873 You can specify that a subpattern does not count when numbering backreferences
   2874 
   2875 @item
   2876 @cindex Greedy regular expression matching
   2877 You can specify greedy or non-greedy matching
   2878 
   2879 @item
   2880 You can have more than ten back references
   2881 
   2882 @item
   2883 You can do complex look aheads and look behinds (in the spirit of
   2884 @code{\b}, but with subpatterns).
   2885 
   2886 @item
   2887 You can often improve performance by avoiding that @command{sed} wastes
   2888 time with backtracking
   2889 
   2890 @item
   2891 You can have if/then/else branches
   2892 
   2893 @item
   2894 You can do recursive matches, for example to look for unbalanced parentheses
   2895 
   2896 @item
   2897 You can have comments and non-significant whitespace, because things can
   2898 get complex...
   2899 @end itemize
   2900 
   2901 Most of these extensions are introduced by the special @code{(?}
   2902 sequence, which gives special meanings to parenthesized groups.
   2903 @end iftex
   2904 @menu
Other extensions can be roughly subdivided into two categories.
On one hand, Perl introduces several more escaped sequences
(that is, sequences introduced by a backslash).  On the other
hand, it specifies that if a question mark follows an open
parenthesis it should give a special meaning to the parenthesized
group.
   2912 * Backslash::                       Introduces special sequences
   2913 * Circumflex/dollar sign/period::   Behave specially with regard to new lines
   2914 * Square brackets::                 Are a bit different in strange cases
   2915 * Options setting::                 Toggle modifiers in the middle of a regexp
   2916 * Non-capturing subpatterns::       Are not counted when backreferencing
   2917 * Repetition::                      Allows for non-greedy matching
   2918 * Backreferences::                  Allows for more than 10 back references
   2919 * Assertions::                      Allows for complex look ahead matches
   2920 * Non-backtracking subpatterns::    Often gives more performance
   2921 * Conditional subpatterns::         Allows if/then/else branches
   2922 * Recursive patterns::              For example to match parentheses
   2923 * Comments::                        Because things can get complex...
   2924 @end menu
   2925 
   2926 @node Backslash
   2927 @appendixsec Backslash
   2928 @cindex Perl-style regular expressions, escaped sequences
   2929 
There are a few differences in the handling of backslashed
sequences in Perl mode.
   2932 
   2933 First of all, there are no @code{\o} and @code{\d} sequences.
   2934 @sc{ascii} values for characters can be specified in octal
   2935 with a @code{\@var{xxx}} sequence, where @var{xxx} is a
   2936 sequence of up to three octal digits.  If the first digit
   2937 is a zero, the treatment of the sequence is straightforward;
   2938 just note that if the character that follows the escaped digit
   2939 is itself an octal digit, you have to supply three octal digits
   2940 for @var{xxx}.  For example @code{\07} is a @sc{bel} character
   2941 rather than a @sc{nul} and a literal @code{7} (this sequence is
   2942 instead represented by @code{\0007}).
   2943 
   2944 @cindex Perl-style regular expressions, backreferences
   2945 The handling of a backslash followed by a digit other than 0
   2946 is complicated.  Outside a character class, @command{sed} reads it
   2947 and any following digits as a decimal number. If the number
   2948 is less than 10, or if there have been at least that many
   2949 previous capturing left parentheses in the expression, the
   2950 entire sequence is taken as a back reference. A description
   2951 of how this works is given later, following the discussion
   2952 of parenthesized subpatterns.
   2953 
   2954 Inside a character class, or if the decimal number is
   2955 greater than 9 and there have not been that many capturing
   2956 subpatterns, @command{sed} re-reads up to three octal digits following
   2957 the backslash, and generates a single byte from the
   2958 least significant 8 bits of the value. Any subsequent digits
   2959 stand for themselves.  For example:
   2960 
@example
\040  @i{@r{is another way of writing a space}}
\40   @i{@r{is the same, provided there are fewer than 40}}
      @i{@r{previous capturing subpatterns}}
\7    @i{@r{is always a back reference}}
\011  @i{@r{is always a tab}}
\11   @i{@r{might be a back reference, or another way of writing a tab}}
\0113 @i{@r{is a tab followed by the character @samp{3}}}
\113  @i{@r{is the character with octal code 113 (since there}}
      @i{@r{can be no more than 99 back references)}}
\377  @i{@r{is a byte consisting entirely of 1 bits (@sc{ascii} 255)}}
\81   @i{@r{is either a back reference, or a binary zero}}
      @i{@r{followed by the two characters @samp{81}}}
@end example
   2975 
Note that octal values of 100 or greater must not be introduced
by a leading zero, because no more than three octal
digits are ever read.  Note also that this applies only to the LHS
pattern; it is not yet possible to specify more than 9 backreferences
on the RHS of the @code{s} command.
   2981 
   2982 All the sequences that define a single byte value can be
   2983 used both inside and outside character classes. In addition,
   2984 inside a character class, the sequence @code{\b} is interpreted
   2985 as the backspace character (hex 08). Outside a character
   2986 class it has a different meaning (see below).
   2987 
In addition, the following escapes specify
generic character classes (as @code{\w} and @code{\W} do):
   2990 
   2991 @cindex Perl-style regular expressions, character classes
   2992 @table @samp
   2993 @item \d
   2994 Matches any decimal digit
   2995 
   2996 @item \D
   2997 Matches any character that is not a decimal digit
   2998 @end table
   2999 
In Perl mode, these character type sequences can appear both inside and
outside character classes.  In @sc{posix} mode, by contrast, these sequences
(as well as @code{\w} and @code{\W}) are treated as two literal characters
(a backslash and a letter) inside square brackets.
   3004 
   3005 Escaped sequences specifying assertions are also different in
   3006 Perl mode.  An assertion specifies a condition that has to be met
   3007 at a particular point in a match, without consuming any
   3008 characters from the subject string. The use of subpatterns
   3009 for more complicated assertions is described below.  The
   3010 backslashed assertions are
   3011 
   3012 @cindex Perl-style regular expressions, assertions
   3013 @table @samp
   3014 @item \b
   3015 Asserts that the point is at a word boundary.
   3016 A word boundary is a position in the subject string where
   3017 the current character and the previous character do not both
   3018 match @code{\w} or @code{\W} (i.e. one matches @code{\w} and
   3019 the other matches @code{\W}), or the start or end of the string
   3020 if the first or last character matches @code{\w}, respectively.
   3021 
   3022 @item \B
   3023 Asserts that the point is not at a word boundary.
   3024 
   3025 @item \A
   3026 Asserts the matcher is at the start of pattern space (independent
   3027 of multiline mode).
   3028 
   3029 @item \Z
Asserts the matcher is at the end of pattern space,
or at a newline before the end of pattern space (independent of
multiline mode).

@item \z
Asserts the matcher is at the end of pattern space (independent
of multiline mode).
   3037 @end table
   3038 
These assertions may not appear in character classes (but
note that @code{\b} has a different meaning, namely the
backspace character, inside a character class).
Note that Perl mode does not directly support assertions
for the beginning and the end of a word; the @acronym{GNU} extensions
@code{\<} and @code{\>} achieve this purpose in @sc{posix} mode
instead.
   3046 
   3047 The @code{\A}, @code{\Z}, and @code{\z} assertions differ
   3048 from the traditional circumflex and dollar sign (described below)
   3049 in that they only ever match at the very start and end of the
   3050 subject string, whatever options are set; in particular @code{\A}
   3051 and @code{\z} are the same as the @acronym{GNU} extensions
   3052 @code{\`} and @code{\'} that are active in @sc{posix} mode.
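
The word-boundary assertions can be tried out in the default
(@sc{posix}) mode of @acronym{GNU} @command{sed}, where @code{\b} is
also available as an extension.  The sketch below therefore does not
use Perl mode, and its input is made up for the illustration.

@example
# only the stand-alone words are replaced, not the tail of "concat"
echo 'cat concat cat' | sed 's/\bcat\b/dog/g'   # dog concat dog
echo 'cat concat cat' | sed 's/\<cat\>/dog/g'   # dog concat dog
@end example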
   3053 
   3054 @node Circumflex/dollar sign/period
   3055 @appendixsec Circumflex, dollar sign, period
   3056 @cindex Perl-style regular expressions, newlines
   3057 
   3058 Outside a character class, in the default matching mode, the
   3059 circumflex character is an assertion which is true only if
   3060 the current matching point is at the start of the subject
   3061 string.  Inside a character class, the circumflex has an entirely
   3062 different meaning (see below).
   3063 
   3064 The circumflex need not be the first character of the pattern if
   3065 a number of alternatives are involved, but it should be the
   3066 first thing in each alternative in which it appears if the
pattern is ever to match that branch. If all possible alternatives
start with a circumflex, that is, if the pattern is
constrained to match only at the start of the subject, it is
said to be an @dfn{anchored} pattern. (There are also other
constructs that can cause a pattern to be anchored.)
   3072 
   3073 A dollar sign is an assertion which is true only if the
   3074 current matching point is at the end of the subject string,
   3075 or immediately before a newline character that is the last
   3076 character in the string (by default).  A dollar sign need not be the
   3077 last character of the pattern if a number of alternatives
   3078 are involved, but it should be the last item in any branch
   3079 in which it appears.  A dollar sign has no special meaning in a
   3080 character class.
   3081 
   3082 @cindex Perl-style regular expressions, multiline
   3083 The meanings of the circumflex and dollar sign characters are
   3084 changed if the @code{M} modifier option is used. When this is
   3085 the case, they match immediately after and immediately
   3086 before an internal @code{\n} character, respectively, in addition
   3087 to matching at the start and end of the subject string.  For
   3088 example, the pattern @code{/^abc$/} matches the subject string
   3089 @samp{def\nabc} in multiline mode, but not otherwise.  Consequently,
   3090 patterns that are anchored in single line mode
   3091 because all branches start with @code{^} are not anchored in
   3092 multiline mode.
   3093 
   3094 @cindex Perl-style regular expressions, multiline
   3095 Note that the sequences @code{\A}, @code{\Z}, and @code{\z}
   3096 can be used to match the start and end of the subject in both
modes, and if all branches of a pattern start with @code{\A}
it is always anchored, whether the @code{M} modifier is set or not.
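
As a rough illustration, the following sketch replays the @samp{def\nabc}
example with Python's @code{re} module.  Python is only a stand-in for a
Perl-style matcher here, not @value{SSED}, and @code{re.MULTILINE} is
assumed to play the role of the @code{M} modifier.  @code{\Z} is left out
because Python gives it the meaning that this manual assigns to @code{\z}.

@verbatim
```python
import re

SUBJECT = "def\nabc"

# Default mode: ^ and $ anchor only at the ends of the whole subject.
default_match = re.search(r"^abc$", SUBJECT)

# Multiline mode: ^ also matches just after the internal newline.
multiline_match = re.search(r"^abc$", SUBJECT, re.MULTILINE)

# \A still means "start of subject", even in multiline mode.
anchored_match = re.search(r"\Aabc", SUBJECT, re.MULTILINE)

print(default_match, multiline_match, anchored_match)
```
@end verbatim

Only the multiline search succeeds, starting at offset 4.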
   3099 
   3100 @cindex Perl-style regular expressions, single line
   3101 Outside a character class, a dot in the pattern matches any
   3102 one character in the subject, including a non-printing character,
   3103 but not (by default) newline.  If the @code{S} modifier is used,
   3104 dots match newlines as well.  Actually, the handling of
   3105 dot is entirely independent of the handling of circumflex
   3106 and dollar sign, the only relationship being that they both
   3107 involve newline characters. Dot has no special meaning in a
   3108 character class.
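
The effect of the @code{S} modifier on the dot can be sketched as follows.
Python's @code{re} module stands in for a Perl-style matcher, and
@code{re.DOTALL} is assumed to correspond to @code{S}.

@verbatim
```python
import re

# By default the dot matches any character except newline.
tab_match = re.search(r"a.b", "a\tb")
newline_match = re.search(r"a.b", "a\nb")

# With the single-line flag the dot matches newline as well.
dotall_match = re.search(r"a.b", "a\nb", re.DOTALL)

print(tab_match is not None, newline_match is not None, dotall_match is not None)
```
@end verbatim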
   3109 
   3110 @node Square brackets
   3111 @appendixsec Square brackets
   3112 @cindex Perl-style regular expressions, character classes
   3113 
   3114 An opening square bracket introduces a character class, terminated
   3115 by a closing square bracket.  A closing square bracket on its own
   3116 is not special.  If a closing square bracket is required as a
   3117 member of the class, it should be the first data character in
   3118 the class (after an initial circumflex, if present) or escaped with a backslash.
   3119 
   3120 A character class matches a single character in the subject;
   3121 the character must be in the set of characters defined by
   3122 the class, unless the first character in the class is a circumflex,
   3123 in which case the subject character must not be in
   3124 the set defined by the class. If a circumflex is actually
   3125 required as a member of the class, ensure it is not the
   3126 first character, or escape it with a backslash.
   3127 
For example, the character class @code{[aeiou]} matches any lower
case vowel, while @code{[^aeiou]} matches any character that is not
a lower case vowel. Note that a circumflex is just a convenient
notation for specifying the characters which are in
the class by enumerating those that are not. It is not an
   3133 assertion: it still consumes a character from the subject
   3134 string, and fails if the current pointer is at the end of
   3135 the string.
   3136 
   3137 @cindex Perl-style regular expressions, case-insensitive
   3138 When caseless matching is set, any letters in a class
   3139 represent both their upper case and lower case versions, so
   3140 for example, a caseless @code{[aeiou]} matches uppercase
   3141 and lowercase @samp{A}s, and a caseless @code{[^aeiou]}
   3142 does not match @samp{A}, whereas a case-sensitive version would.
   3143 
   3144 @cindex Perl-style regular expressions, single line
   3145 @cindex Perl-style regular expressions, multiline
   3146 The newline character is never treated in any special way in
   3147 character classes, whatever the setting of the @code{S} and
   3148 @code{M} options (modifiers) is.  A class such as @code{[^a]} will
   3149 always match a newline.
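
A minimal sketch of these points about negated classes, using Python's
@code{re} module as a stand-in for a Perl-style matcher (the flag names
are Python's, assumed to correspond to the @code{I}, @code{S} and
@code{M} modifiers):

@verbatim
```python
import re

# [^a] matches a newline whatever the single-line/multiline settings are.
newline_plain = re.fullmatch(r"[^a]", "\n")
newline_flags = re.fullmatch(r"[^a]", "\n", re.DOTALL | re.MULTILINE)

# A negated class is not an assertion: it needs a character to consume,
# so it fails when the matching point is at the end of the subject.
at_end = re.search(r"b[^aeiou]", "b")

# Caseless matching also applies to the letters listed in the class.
caseless_upper_a = re.fullmatch(r"[^aeiou]", "A", re.IGNORECASE)
caseful_upper_a = re.fullmatch(r"[^aeiou]", "A")
```
@end verbatim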
   3150 
   3151 The minus (hyphen) character can be used to specify a range
   3152 of characters in a character class.  For example, @code{[d-m]}
   3153 matches any letter between d and m, inclusive.  If a minus
   3154 character is required in a class, it must be escaped with a
   3155 backslash or appear in a position where it cannot be interpreted
   3156 as indicating a range, typically as the first or last
   3157 character in the class.
   3158 
   3159 It is not possible to have the literal character @code{]} as the
   3160 end character of a range.  A pattern such as @code{[W-]46]} is
   3161 interpreted as a class of two characters (@code{W} and @code{-})
   3162 followed by a literal string @code{46]}, so it would match
   3163 @samp{W46]} or @samp{-46]}. However, if the @code{]} is escaped
   3164 with a backslash it is interpreted as the end of range, so
   3165 @code{[W-\]46]} is interpreted as a single class containing a
   3166 range followed by two separate characters. The octal or
   3167 hexadecimal representation of @code{]} can also be used to end a range.
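
The two readings can be compared directly.  The sketch below uses
Python's @code{re} module, which happens to parse these two classes the
same way; it is a stand-in, not @value{SSED} itself.

@verbatim
```python
import re

# Class of W and -, followed by the literal string "46]".
two_char_class = re.compile(r"[W-]46]")
literal_hits = [s for s in ("W46]", "-46]", "X46]") if two_char_class.fullmatch(s)]

# Single class: the range W..] plus the characters 4 and 6.
range_class = re.compile(r"[W-\]46]")
range_hits = [c for c in "WX[]46-" if range_class.fullmatch(c)]

print(literal_hits)
print(range_hits)
```
@end verbatim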
   3168 
   3169 Ranges operate in @sc{ascii} collating sequence. They can also be
   3170 used for characters specified numerically, for example
   3171 @code{[\000-\037]}. If a range that includes letters is used when
   3172 caseless matching is set, it matches the letters in either
   3173 case. For example, a caseless @code{[W-c]} is equivalent to
@code{[][\\^_`wxyzabc]}, matched caselessly, and if character
   3175 tables for the French locale are in use, @code{[\xc8-\xcb]}
   3176 matches accented E characters in both cases.
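
The caseless range can be checked by brute force over the @sc{ascii}
characters.  This sketch uses Python's @code{re} module as a stand-in;
the explicit class is written with an escaped backslash so that it
covers every character from @samp{W} to @samp{c}.

@verbatim
```python
import re

ASCII_CHARS = [chr(i) for i in range(128)]

caseless_range = re.compile(r"[W-c]", re.IGNORECASE)
explicit_class = re.compile(r"[][\\^_`wxyzabc]", re.IGNORECASE)

from_range = {c for c in ASCII_CHARS if caseless_range.fullmatch(c)}
from_class = {c for c in ASCII_CHARS if explicit_class.fullmatch(c)}

# Ranges may also be written with numeric (octal) escapes.
control_chars = [c for c in ASCII_CHARS if re.fullmatch(r"[\000-\037]", c)]

print(from_range == from_class, len(control_chars))
```
@end verbatim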
   3177 
   3178 Unlike in @sc{posix} mode, the character types @code{\d},
   3179 @code{\D}, @code{\s}, @code{\S}, @code{\w}, and @code{\W}
   3180 may also appear in a character class, and add the characters
   3181 that they match to the class. For example, @code{[\dABCDEF]} matches any
   3182 hexadecimal digit.  A circumflex can conveniently be used
   3183 with the upper case character types to specify a more restricted
   3184 set of characters than the matching lower case type.
   3185 For example, the class @code{[^\W_]} matches any letter or digit,
   3186 but not underscore.
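
Both classes can be tried out as below.  Python's @code{re} module is
used as a stand-in, and @code{re.ASCII} is an assumption made here so
that @code{\d} and @code{\W} are limited to @sc{ascii} characters.

@verbatim
```python
import re

hex_upper = re.compile(r"[\dABCDEF]", re.ASCII)
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)

hex_hits = [c for c in "09AFGa" if hex_upper.fullmatch(c)]
alnum_hits = [c for c in "a7_ -" if letter_or_digit.fullmatch(c)]

print(hex_hits)
print(alnum_hits)
```
@end verbatim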
   3187 
All non-alphanumeric characters other than @code{\}, @code{-},
   3189 @code{^} (at the start) and the terminating @code{]}
   3190 are non-special in character classes, but it does no harm
   3191 if they are escaped.
   3192 
   3193 Perl 5.6 supports the @sc{posix} notation for character classes, which
   3194 uses names enclosed by @code{[:} and @code{:]} within the enclosing
   3195 square brackets, and @value{SSED} supports this notation as well.
   3196 For example,
   3197 
   3198 @example
   3199 [01[:alpha:]%]
   3200 @end example
   3201 
   3202 @noindent
   3203 matches @samp{0}, @samp{1}, any alphabetic character, or @samp{%}.
   3204 The supported class names are
   3205 
   3206 @table @code
   3207 @item alnum
   3208 Matches letters and digits
   3209 
   3210 @item alpha
   3211 Matches letters
   3212 
   3213 @item ascii
   3214 Matches character codes 0 - 127
   3215 
   3216 @item cntrl
   3217 Matches control characters
   3218 
   3219 @item digit
Matches decimal digits (same as @code{\d})
   3221 
   3222 @item graph
   3223 Matches printing characters, excluding space
   3224 
   3225 @item lower
   3226 Matches lower case letters
   3227 
   3228 @item print
   3229 Matches printing characters, including space
   3230 
   3231 @item punct
   3232 Matches printing characters, excluding letters and digits
   3233 
   3234 @item space
Matches white space (same as @code{\s})
   3236 
   3237 @item upper
   3238 Matches upper case letters
   3239 
   3240 @item word
Matches ``word'' characters (same as @code{\w})
   3242 
   3243 @item xdigit
   3244 Matches hexadecimal digits
   3245 @end table
   3246 
   3247 The names @code{ascii} and @code{word} are extensions valid only in
   3248 Perl mode.  Another Perl extension is negation, which is
   3249 indicated by a circumflex character after the colon. For example,
   3250 
   3251 @example
   3252 [12[:^digit:]]
   3253 @end example
   3254 
   3255 @noindent
   3256 matches @samp{1}, @samp{2}, or any non-digit.
   3257 
   3258 @node Options setting
   3259 @appendixsec Options setting
   3260 @cindex Perl-style regular expressions, toggling options
   3261 @cindex Perl-style regular expressions, case-insensitive
   3262 @cindex Perl-style regular expressions, multiline
   3263 @cindex Perl-style regular expressions, single line
   3264 @cindex Perl-style regular expressions, extended
   3265 
   3266 The settings of the @code{I}, @code{M}, @code{S}, @code{X}
   3267 modifiers can be changed from within the pattern by
   3268 a sequence of Perl option letters enclosed between @code{(?}
   3269 and @code{)}. The option letters must be lowercase.
   3270 
   3271 For example, @code{(?im)} sets caseless, multiline matching. It is
   3272 also possible to unset these options by preceding the letter
   3273 with a hyphen; you can also have combined settings and unsettings:
@code{(?im-sx)} sets caseless and multiline matching,
while unsetting single line matching (for dots) and extended
   3276 whitespace interpretation.  If a letter appears both before
   3277 and after the hyphen, the option is unset.
   3278 
   3279 The scope of these option changes depends on where in the
   3280 pattern the setting occurs. For settings that are outside
   3281 any subpattern (defined below), the effect is the same as if
   3282 the options were set or unset at the start of matching. The
   3283 following patterns all behave in exactly the same way:
   3284 
   3285 @example
   3286 (?i)abc
   3287 a(?i)bc
   3288 ab(?i)c
   3289 abc(?i)
   3290 @end example
   3291 
@noindent
which in turn is the same as specifying the pattern @code{abc} with
   3293 the @code{I} modifier.  In other words, ``top level'' settings
   3294 apply to the whole pattern (unless there are other
   3295 changes inside subpatterns). If there is more than one setting
   3296 of the same option at top level, the rightmost setting
   3297 is used.
   3298 
   3299 If an option change occurs inside a subpattern, the effect
   3300 is different.  This is a change of behaviour in Perl 5.005.
   3301 An option change inside a subpattern affects only that part
   3302 of the subpattern @emph{that follows} it, so
   3303 
   3304 @example
   3305 (a(?i)b)c
   3306 @end example
   3307 
   3308 @noindent
matches @samp{abc} and @samp{aBc} and no other strings (assuming
case-sensitive matching is used).  By this means, options can
   3311 be made to have different settings in different parts of the
   3312 pattern.  Any changes made in one alternative do carry on
   3313 into subsequent branches within the same subpattern.  For
   3314 example,
   3315 
   3316 @example
   3317 (a(?i)b|c)
   3318 @end example
   3319 
   3320 @noindent
   3321 matches @samp{ab}, @samp{aB}, @samp{c}, and @samp{C},
   3322 even though when matching @samp{C} the first branch is
   3323 abandoned before the option setting.
   3324 This is because the effects of option settings happen at
   3325 compile time. There would be some very weird behaviour otherwise.
   3326 
   3327 @ignore
   3328 There are two PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA
   3329 that can be changed in the same way as the Perl-compatible options by
   3330 using the characters U and X respectively.  The (?X) flag
   3331 setting is special in that it must always occur earlier in
   3332 the pattern than any of the additional features it turns on,
   3333 even when it is at top level. It is best put at the start.
   3334 @end ignore
   3335 
   3336 
   3337 @node Non-capturing subpatterns
   3338 @appendixsec Non-capturing subpatterns
   3339 @cindex Perl-style regular expressions, non-capturing subpatterns
   3340 
   3341 Marking part of a pattern as a subpattern does two things.
   3342 On one hand, it localizes a set of alternatives; on the other
   3343 hand, it sets up the subpattern as a capturing subpattern (as
   3344 defined above).  The subpattern can be backreferenced and
   3345 referenced in the right side of @code{s} commands.
   3346 
   3347 For example, if the string @samp{the red king} is matched against
   3348 the pattern
   3349 
   3350 @example
   3351 the ((red|white) (king|queen))
   3352 @end example
   3353 
   3354 @noindent
   3355 the captured substrings are @samp{red king}, @samp{red},
   3356 and @samp{king}, and are numbered 1, 2, and 3.
   3357 
   3358 The fact that plain parentheses fulfil two functions is not
   3359 always helpful.  There are often times when a grouping
   3360 subpattern is required without a capturing requirement.  If an
   3361 opening parenthesis is followed by @code{?:}, the subpattern does
   3362 not do any capturing, and is not counted when computing the
   3363 number of any subsequent capturing subpatterns. For example,
   3364 if the string @samp{the white queen} is matched against the pattern
   3365 
   3366 @example
   3367 the ((?:red|white) (king|queen))
   3368 @end example
   3369 
   3370 @noindent
   3371 the captured substrings are @samp{white queen} and @samp{queen},
   3372 and are numbered 1 and 2. The maximum number of captured
   3373 substrings is 99, while the maximum number of all subpatterns,
   3374 both capturing and non-capturing, is 200.
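
The difference in numbering can be seen in this small sketch, which uses
Python's @code{re} module as a stand-in for a Perl-style matcher:

@verbatim
```python
import re

# Every pair of plain parentheses captures and is numbered.
capturing = re.match(r"the ((red|white) (king|queen))", "the red king")

# (?: ... ) groups the alternatives without taking a number.
grouping = re.match(r"the ((?:red|white) (king|queen))", "the white queen")

print(capturing.groups())
print(grouping.groups())
```
@end verbatim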
   3375 
As a convenient shorthand, if any option settings are
required at the start of a non-capturing subpattern, the
   3378 option letters may appear between the @code{?} and the
   3379 @code{:}.  Thus the two patterns
   3380 
   3381 @example
   3382 (?i:saturday|sunday)
   3383 (?:(?i)saturday|sunday)
   3384 @end example
   3385 
   3386 @noindent
   3387 match exactly the same set of strings.  Because alternative
   3388 branches are tried from left to right, and options are not
   3389 reset until the end of the subpattern is reached, an option
   3390 setting in one branch does affect subsequent branches, so
   3391 the above patterns match @samp{SUNDAY} as well as @samp{Saturday}.
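
The shorthand form can be tried with Python's @code{re} module, used
here as a stand-in.  Recent Python versions reject an option setting
such as @code{(?i)} in the middle of a pattern, so only the
@code{(?i:...)} form is shown; the day names in the list are
illustrative.

@verbatim
```python
import re

weekend = re.compile(r"(?i:saturday|sunday)")
days = ["SUNDAY", "Saturday", "sunday", "monday"]
weekend_hits = [d for d in days if weekend.fullmatch(d)]

# The option applies only inside the subpattern that sets it.
partial = re.compile(r"(?i:sat)urday")
partial_hits = [d for d in ("SATurday", "SATURDAY") if partial.fullmatch(d)]

print(weekend_hits)
print(partial_hits)
```
@end verbatim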
   3392 
   3393 
   3394 @node Repetition
   3395 @appendixsec Repetition
   3396 @cindex Perl-style regular expressions, repetitions
   3397 
   3398 Repetition is specified by quantifiers, which can follow any
   3399 of the following items:
   3400 
   3401 @itemize @bullet
   3402 @item
   3403 a single character, possibly escaped
   3404 
   3405 @item
   3406 the @code{.} special character
   3407 
   3408 @item
   3409 a character class
   3410 
   3411 @item
   3412 a back reference (see next section)
   3413 
   3414 @item
   3415 a parenthesized subpattern (unless it is an assertion; @pxref{Assertions})
   3416 @end itemize
   3417 
   3418 The general repetition quantifier specifies a minimum and
   3419 maximum number of permitted matches, by giving the two
   3420 numbers in curly brackets (braces), separated by a comma.
   3421 The numbers must be less than 65536, and the first must be
   3422 less than or equal to the second. For example:
   3423 
   3424 @example
   3425 z@{2,4@}
   3426 @end example
   3427 
   3428 @noindent
   3429 matches @samp{zz}, @samp{zzz}, or @samp{zzzz}. A closing brace on its own
   3430 is not a special character. If the second number is omitted,
   3431 but the comma is present, there is no upper limit; if the
   3432 second number and the comma are both omitted, the quantifier
   3433 specifies an exact number of required matches. Thus
   3434 
   3435 @example
   3436 [aeiou]@{3,@}
   3437 @end example
   3438 
   3439 @noindent
   3440 matches at least 3 successive vowels, but may match many
   3441 more, while
   3442 
   3443 @example
   3444 \d@{8@}
   3445 @end example
   3446 
   3447 @noindent
   3448 matches exactly 8 digits.  An opening curly bracket that
   3449 appears in a position where a quantifier is not allowed, or
   3450 one that does not match the syntax of a quantifier, is taken
as a literal character. For example, @code{@{,6@}} is not a quantifier,
   3452 but a literal string of four characters.@footnote{It
   3453 raises an error if @option{-R} is not used.}
   3454 
   3455 The quantifier @samp{@{0@}} is permitted, causing the expression to
   3456 behave as if the previous item and the quantifier were not
   3457 present.
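
The quantifier forms described so far can be exercised as follows.
This is a sketch with Python's @code{re} module as a stand-in;
@code{full} is a helper introduced only for this example, and the
literal treatment of a malformed quantifier is not shown because Python
parses that case differently.

@verbatim
```python
import re

def full(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

z_hits = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz") if full(r"z{2,4}", s)]
vowel_hits = [s for s in ("ae", "aei", "aeiouaeiou") if full(r"[aeiou]{3,}", s)]
digit_hits = [s for s in ("1234567", "12345678", "123456789") if full(r"\d{8}", s)]

# {0} makes the quantified item vanish from the pattern.
zero_repeat = (full(r"ab{0}c", "ac"), full(r"ab{0}c", "abc"))

print(z_hits, vowel_hits, digit_hits, zero_repeat)
```
@end verbatim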
   3458 
   3459 For convenience (and historical compatibility) the three
   3460 most common quantifiers have single-character abbreviations:
   3461 
   3462 @table @code
   3463 @item *
is equivalent to @code{@{0,@}}
   3465 
   3466 @item +
is equivalent to @code{@{1,@}}
   3468 
   3469 @item ?
is equivalent to @code{@{0,1@}}
   3471 @end table
   3472 
   3473 It is possible to construct infinite loops by following a
   3474 subpattern that can match no characters with a quantifier
   3475 that has no upper limit, for example:
   3476 
   3477 @example
   3478 (a?)*
   3479 @end example
   3480 
   3481 Earlier versions of Perl used to give an error at
   3482 compile time for such patterns. However, because there are
   3483 cases where this can be useful, such patterns are now
   3484 accepted, but if any repetition of the subpattern does in
   3485 fact match no characters, the loop is forcibly broken.
   3486 
   3487 @cindex Greedy regular expression matching
   3488 @cindex Perl-style regular expressions, stingy repetitions
By default, the quantifiers are @dfn{greedy}, as in @sc{posix}
   3490 mode, that is, they match as much as possible (up to the maximum
   3491 number of permitted times), without causing the rest of the
   3492 pattern to fail. The classic example of where this gives problems
   3493 is in trying to match comments in C programs. These appear between
   3494 the sequences @code{/*} and @code{*/} and within the sequence, individual
   3495 @code{*} and @code{/} characters may appear. An attempt to match C
   3496 comments by applying the pattern
   3497 
   3498 @example
   3499 /\*.*\*/
   3500 @end example
   3501 
   3502 @noindent
   3503 to the string
   3504 
   3505 @example
/* first comment */ not comment /* second comment */
   3507 @end example
   3508 
@noindent
   3511 fails, because it matches the entire string owing to the
   3512 greediness of the @code{.*} item.
   3513 
   3514 However, if a quantifier is followed by a question mark, it
   3515 ceases to be greedy, and instead matches the minimum number
   3516 of times possible, so the pattern @code{/\*.*?\*/}
   3517 does the right thing with the C comments. The meaning of the
   3518 various quantifiers is not otherwise changed, just the preferred
   3519 number of matches.  Do not confuse this use of question
   3520 mark with its use as a quantifier in its own right.
   3521 Because it has two uses, it can sometimes appear doubled, as in
   3522 
   3523 @example
   3524 \d??\d
   3525 @end example
   3526 
   3527 which matches one digit by preference, but can match two if
   3528 that is the only way the rest of the pattern matches.
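
A minimal sketch contrasting greedy and non-greedy matching, with
Python's @code{re} module as a stand-in for a Perl-style matcher (the
variable names are illustrative):

@verbatim
```python
import re

SOURCE = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the last */ in the subject.
greedy = re.findall(r"/\*.*\*/", SOURCE)

# Non-greedy: .*? stops at the first */ that lets the match succeed.
lazy = re.findall(r"/\*.*?\*/", SOURCE)

# \d??\d prefers one digit, but takes two when the rest requires it.
prefer_one = re.match(r"\d??\d", "12").group()
forced_two = re.fullmatch(r"\d??\d", "12").group()

print(greedy)
print(lazy)
print(prefer_one, forced_two)
```
@end verbatim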
   3529 
   3530 Note that greediness does not matter when specifying addresses,
   3531 but can be nevertheless used to improve performance.
   3532 
   3533 @ignore
   3534 If the PCRE_UNGREEDY option is set (an option which is not
   3535 available in Perl), the quantifiers are not greedy by
   3536 default, but individual ones can be made greedy by following
   3537 them with a question mark. In other words, it inverts the
   3538 default behaviour.
   3539 @end ignore
   3540 
   3541 When a parenthesized subpattern is quantified with a minimum
   3542 repeat count that is greater than 1 or with a limited maximum,
   3543 more store is required for the compiled pattern, in
   3544 proportion to the size of the minimum or maximum.
   3545 
   3546 @cindex Perl-style regular expressions, single line
   3547 If a pattern starts with @code{.*} or @code{.@{0,@}} and the
   3548 @code{S} modifier is used, the pattern is implicitly anchored,
   3549 because whatever follows will be tried against every character
   3550 position in the subject string, so there is no point in
   3551 retrying the overall match at any position after the first.
PCRE treats such a pattern as though it were preceded by @code{\A}.
   3553 
   3554 When a capturing subpattern is repeated, the value captured
   3555 is the substring that matched the final iteration. For example,
   3556 after
   3557 
   3558 @example
   3559 (tweedle[dume]@{3@}\s*)+
   3560 @end example
   3561 
   3562 @noindent
   3563 has matched @samp{tweedledum tweedledee} the value of the
   3564 captured substring is @samp{tweedledee}.  However, if there are
   3565 nested capturing subpatterns, the corresponding captured
   3566 values may have been set in previous iterations. For example,
   3567 after
   3568 
   3569 @example
   3570 /(a|(b))+/
   3571 @end example
   3572 
   3573 matches @samp{aba}, the value of the second captured substring is
   3574 @samp{b}.
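
Both cases can be reproduced with Python's @code{re} module, which
behaves the same way for these two patterns; it is used as a stand-in,
not as a description of @value{SSED} internals.

@verbatim
```python
import re

# The repeated group keeps the text of its final iteration.
repeated = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")

# The nested group keeps the value set in an earlier iteration.
nested = re.match(r"(a|(b))+", "aba")

print(repeated.group(1))
print(nested.groups())
```
@end verbatim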
   3575 
   3576 @node Backreferences
   3577 @appendixsec Backreferences
   3578 @cindex Perl-style regular expressions, backreferences
   3579 
   3580 Outside a character class, a backslash followed by a digit
   3581 greater than 0 (and possibly further digits) is a back
   3582 reference to a capturing subpattern earlier (i.e.  to its
   3583 left) in the pattern, provided there have been that many
   3584 previous capturing left parentheses.
   3585 
   3586 However, if the decimal number following the backslash is
   3587 less than 10, it is always taken as a back reference, and
   3588 causes an error only if there are not that many capturing
   3589 left parentheses in the entire pattern. In other words, the
   3590 parentheses that are referenced need not be to the left of
the reference for numbers less than 10. @xref{Backslash},
for further details of the handling of digits following a backslash.
   3593 
   3594 A back reference matches whatever actually matched the capturing
   3595 subpattern in the current subject string, rather than
   3596 anything matching the subpattern itself. So the pattern
   3597 
   3598 @example
   3599 (sens|respons)e and \1ibility
   3600 @end example
   3601 
   3602 @noindent
   3603 matches @samp{sense and sensibility} and @samp{response and responsibility},
   3604 but not @samp{sense and responsibility}. If caseful
   3605 matching is in force at the time of the back reference, the
   3606 case of letters is relevant. For example,
   3607 
   3608 @example
   3609 ((?i)blah)\s+\1
   3610 @end example
   3611 
   3612 @noindent
   3613 matches @samp{blah blah} and @samp{Blah Blah}, but not
   3614 @samp{BLAH blah}, even though the original capturing
   3615 subpattern is matched caselessly.
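
These examples can be sketched with Python's @code{re} module as a
stand-in.  Python does not accept @code{(?i)} inside a group, so the
second pattern is written with the scoped form @code{(?i:blah)}, which
is an assumption made for this illustration.

@verbatim
```python
import re

pair = re.compile(r"(sens|respons)e and \1ibility")
phrases = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]
pair_hits = [p for p in phrases if pair.fullmatch(p)]

# The group is matched caselessly, the back reference is not.
blah = re.compile(r"((?i:blah))\s+\1")
blah_hits = [s for s in ("blah blah", "Blah Blah", "BLAH blah") if blah.fullmatch(s)]

print(pair_hits)
print(blah_hits)
```
@end verbatim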
   3616 
   3617 There may be more than one back reference to the same subpattern.
   3618 Also, if a subpattern has not actually been used in a
   3619 particular match, any back references to it always fail. For
   3620 example, the pattern
   3621 
   3622 @example
   3623 (a|(bc))\2
   3624 @end example
   3625 
   3626 @noindent
   3627 always fails if it starts to match @samp{a} rather than
   3628 @samp{bc}.  Because there may be up to 99 back references, all
   3629 digits following the backslash are taken as part of a potential
   3630 back reference number; this is different from what happens
   3631 in @sc{posix} mode. If the pattern continues with a digit
   3632 character, some delimiter must be used to terminate the back
   3633 reference.  If the @code{X} modifier option is set, this can be
   3634 whitespace.  Otherwise an empty comment can be used, or the
   3635 following character can be expressed in hexadecimal or octal.
Note that this applies only to the LHS pattern; it is
not yet possible to specify more than 9 backreferences on the
RHS of the @code{s} command.
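
The following sketch shows a back reference to an unused subpattern
failing, and two ways of ending a back reference before a digit.
Python's @code{re} module is used as a stand-in; the subject strings are
illustrative.

@verbatim
```python
import re

unused = re.compile(r"(a|(bc))\2")
via_bc = unused.fullmatch("bcbc")   # group 2 took part in the match
via_a = unused.match("aa")          # group 2 unset, so \2 cannot match

# An empty comment, or a hexadecimal escape for the digit, ends the
# back reference so that the following 0 is not read as part of it.
with_comment = re.fullmatch(r"(a)\1(?#)0", "aa0")
with_hex = re.fullmatch(r"(a)\1\x30", "aa0")

print(via_bc is not None, via_a is not None)
print(with_comment is not None, with_hex is not None)
```
@end verbatim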
   3639 
   3640 A back reference that occurs inside the parentheses to which
   3641 it refers fails when the subpattern is first used, so, for
   3642 example, @code{(a\1)} never matches.  However, such references
   3643 can be useful inside repeated subpatterns. For example, the
   3644 pattern
   3645 
   3646 @example
   3647 (a|b\1)+
   3648 @end example
   3649 
   3650 @noindent
   3651 matches any number of @samp{a}s and also @samp{aba}, @samp{ababbaa},
   3652 etc. At each iteration of the subpattern, the back reference matches
   3653 the character string corresponding to the previous iteration.  In
   3654 order for this to work, the pattern must be such that the first
   3655 iteration does not need to match the back reference.  This can be
   3656 done using alternation, as in the example above, or by a
   3657 quantifier with a minimum of zero.
   3658 
   3659 @node Assertions
   3660 @appendixsec Assertions
   3661 @cindex Perl-style regular expressions, assertions
   3662 @cindex Perl-style regular expressions, asserting subpatterns
   3663 
   3664 An assertion is a test on the characters following or
   3665 preceding the current matching point that does not actually
   3666 consume any characters. The simple assertions coded as @code{\b},
   3667 @code{\B}, @code{\A}, @code{\Z}, @code{\z}, @code{^} and @code{$}
   3668 are described above. More complicated assertions are coded as
   3669 subpatterns.  There are two kinds: those that look ahead of the
   3670 current position in the subject string, and those that look behind it.
   3671 
   3672 @cindex Perl-style regular expressions, lookahead subpatterns
   3673 An assertion subpattern is matched in the normal way, except
   3674 that it does not cause the current matching position to be
   3675 changed. Lookahead assertions start with @code{(?=} for positive
   3676 assertions and @code{(?!} for negative assertions. For example,
   3677 
   3678 @example
   3679 \w+(?=;)
   3680 @end example
   3681 
   3682 @noindent
   3683 matches a word followed by a semicolon, but does not include
   3684 the semicolon in the match, and
   3685 
   3686 @example
   3687 foo(?!bar)
   3688 @end example
   3689 
   3690 @noindent
   3691 matches any occurrence of @samp{foo} that is not followed by
   3692 @samp{bar}.
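
A short sketch of both lookahead forms, using Python's @code{re} module
as a stand-in for a Perl-style matcher (the subject strings are
illustrative):

@verbatim
```python
import re

# Positive lookahead: the semicolon is required but not consumed.
word = re.search(r"\w+(?=;)", "return value;")

# Negative lookahead: keep only the foo's not followed by bar.
foo_starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz foo")]

print(word.group())
print(foo_starts)
```
@end verbatim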
   3693 
   3694 Note that the apparently similar pattern
   3695 
   3696 @example
   3697 (?!foo)bar
   3698 @end example
   3699 
   3700 @noindent
   3701 @cindex Perl-style regular expressions, lookbehind subpatterns
   3702 finds any occurrence of @samp{bar} even if it is preceded by
   3703 @samp{foo}, because the assertion @code{(?!foo)} is always true
   3704 when the next three characters are @samp{bar}. A lookbehind
   3705 assertion is needed to achieve this effect.
   3706 Lookbehind assertions start with @code{(?<=} for positive
   3707 assertions and @code{(?<!} for negative assertions. So,
   3708 
   3709 @example
   3710 (?<!foo)bar
   3711 @end example
   3712 
   3713 achieves the required effect of finding an occurrence of
   3714 @samp{bar} that is not preceded by @samp{foo}. The contents of a
   3715 lookbehind assertion are restricted
   3716 such that all the strings it matches must have a fixed
   3717 length.  However, if there are several alternatives, they do
   3718 not all have to have the same fixed length.  This is an extension
   3719 compared with Perl 5.005, which requires all branches to match
   3720 the same length of string. Thus
   3721 
   3722 @example
   3723 (?<=dogs|cats|)
   3724 @end example
   3725 
   3726 @noindent
   3727 is permitted, but the apparently equivalent regular expression
   3728 
   3729 @example
   3730 (?<!dogs?|cats?)
   3731 @end example
   3732 
   3733 @noindent
   3734 causes an error at compile time. Branches that match different
   3735 length strings are permitted only at the top level of
   3736 a lookbehind assertion: an assertion such as
   3737 
   3738 @example
   3739 (?<=ab(c|de))
   3740 @end example
   3741 
   3742 @noindent
   3743 is not permitted, because its single top-level branch can
   3744 match two different lengths, but it is acceptable if rewritten
   3745 to use two top-level branches:
   3746 
   3747 @example
   3748 (?<=abc|abde)
   3749 @end example
   3750 
   3751 All this is required because lookbehind assertions simply
   3752 move the current position back by the alternative's fixed
   3753 width and then try to match.  If there are
   3754 insufficient characters before the current position, the
   3755 match is deemed to fail.  Lookbehinds, in conjunction with
non-backtracking subpatterns, can be particularly useful for
   3757 matching at the ends of strings; an example is given at the end
   3758 of the section on non-backtracking subpatterns.
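
The contrast between @code{(?!foo)bar} and @code{(?<!foo)bar} can be
sketched as below, with Python's @code{re} module as a stand-in.
Python requires every alternative of a lookbehind to have the same
width, so the variable-width alternatives described above are not shown.

@verbatim
```python
import re

SUBJECT = "foobar bar"

# The lookahead is tested where "bar" starts, so it never sees "foo".
lookahead_starts = [m.start() for m in re.finditer(r"(?!foo)bar", SUBJECT)]

# The lookbehind inspects the three characters before "bar".
lookbehind_starts = [m.start() for m in re.finditer(r"(?<!foo)bar", SUBJECT)]

# Too few preceding characters: a positive lookbehind simply fails.
too_short = re.search(r"(?<=\d\d\d)x", "12x")

print(lookahead_starts, lookbehind_starts, too_short)
```
@end verbatim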
   3759 
   3760 Several assertions (of any sort) may occur in succession.
   3761 For example,
   3762 
   3763 @example
   3764 (?<=\d@{3@})(?<!999)foo
   3765 @end example
   3766 
   3767 @noindent
   3768 matches @samp{foo} preceded by three digits that are not @samp{999}.
   3769 Notice that each of the assertions is applied independently
   3770 at the same point in the subject string. First there is a
   3771 check that the previous three characters are all digits, and
   3772 then there is a check that the same three characters are not
@samp{999}.  This pattern does not match @samp{foo} preceded by six
characters, the first three of which are digits and the last three
of which are not @samp{999}.  For example, it doesn't match
   3776 @samp{123abcfoo}. A pattern to do that is
   3777 
   3778 @example
   3779 (?<=\d@{3@}...)(?<!999)foo
   3780 @end example
   3781 
   3782 @noindent
   3783 This time the first assertion looks at the preceding six
   3784 characters, checking that the first three are digits, and
   3785 then the second assertion checks that the preceding three
   3786 characters are not @samp{999}.  Actually, assertions can be
   3787 nested in any combination, so one can write this as 
   3788 
   3789 @example
   3790 (?<=\d@{3@}(?!999)...)foo
   3791 @end example
   3792 
   3793 or
   3794 
   3795 @example
   3796 (?<=\d@{3@}...(?<!999))foo
   3797 @end example
   3798 
   3799 @noindent
   3800 both of which might be considered more readable.
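
The two non-nested patterns can be compared on a few subjects.  This is
a sketch with Python's @code{re} module as a stand-in; the subject
strings are illustrative.

@verbatim
```python
import re

three_back = re.compile(r"(?<=\d{3})(?<!999)foo")
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

subjects = ["123foo", "999foo", "123abcfoo", "123999foo"]

# Both assertions look at the same three characters before foo.
three_hits = [s for s in subjects if three_back.search(s)]

# The first assertion now reaches back six characters.
six_hits = [s for s in subjects if six_back.search(s)]

print(three_hits)
print(six_hits)
```
@end verbatim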
   3801 
   3802 Assertion subpatterns are not capturing subpatterns, and may
   3803 not be repeated, because it makes no sense to assert the
   3804 same thing several times. If any kind of assertion contains
   3805 capturing subpatterns within it, these are counted for the
   3806 purposes of numbering the capturing subpatterns in the whole
   3807 pattern.  However, substring capturing is carried out only
   3808 for positive assertions, because it does not make sense for
   3809 negative assertions.
   3810 
   3811 Assertions count towards the maximum of 200 parenthesized
   3812 subpatterns.
   3813 
   3814 @node Non-backtracking subpatterns
   3815 @appendixsec Non-backtracking subpatterns
   3816 @cindex Perl-style regular expressions, non-backtracking subpatterns
   3817 
   3818 With both maximizing and minimizing repetition, failure of
   3819 what follows normally causes the repeated item to be evaluated
   3820 again to see if a different number of repeats allows the
rest of the pattern to match.  Sometimes it is useful to
prevent this, either to change the nature of the match, or
to cause it to fail earlier than it otherwise might, when the
author of the pattern knows there is no point in carrying
on.
   3826 
   3827 Consider, for example, the pattern @code{\d+foo} when applied to
   3828 the subject line
   3829 
   3830 @example
   3831 123456bar
   3832 @end example
   3833 
   3834 After matching all 6 digits and then failing to match @samp{foo},
   3835 the normal action of the matcher is to try again with only 5
   3836 digits matching the @code{\d+} item, and then with 4, and so on,
   3837 before ultimately failing. Non-backtracking subpatterns
   3838 provide the means for specifying that once a portion of the
   3839 pattern has matched, it is not to be re-evaluated in this way,
   3840 so the matcher would give up immediately on failing to match
   3841 @samp{foo} the first time.  The notation is another kind of special
   3842 parenthesis, starting with @code{(?>} as in this example:
   3843 
   3844 @example
(?>\d+)foo
   3846 @end example
   3847 
   3848 This kind of parenthesis ``locks up'' the part of the pattern
   3849 it contains once it has matched, and a failure further into
   3850 the pattern is prevented from backtracking into it.
   3851 Backtracking past it to previous items, however, works as
   3852 normal.
   3853 
   3854 Non-backtracking subpatterns are not capturing subpatterns.  Simple
   3855 cases such as the above example can be thought of as a maximizing
   3856 repeat that must swallow everything it can.  So,
   3857 while both @code{\d+} and @code{\d+?} are prepared to adjust the number of
   3858 digits they match in order to make the rest of the pattern
   3859 match, @code{(?>\d+)} can only match an entire sequence of digits.
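
The difference between the three forms can be sketched in Python.  This is
an illustration under stated assumptions, not a description of how
@value{SSED} works internally: Python's @code{re} module accepts
@code{(?>...)} only from version 3.11 on, so the sketch swaps in a common
emulation, a lookahead that captures the digits followed by a
backreference that consumes them.  The group name @code{run} and the
variable names are hypothetical.

```python
import re

# Emulation of (?>\d+): the lookahead captures the longest run of digits,
# then the backreference consumes exactly that text.  The matcher never
# backtracks into a lookahead that has already succeeded.
ATOMIC_DIGITS = r"(?=(?P<run>\d+))(?P=run)"

GREEDY = re.compile(r"\d+6")           # gives back one digit so that 6 can match
LAZY = re.compile(r"\d+?6")            # grows one digit at a time until 6 matches
ATOMIC = re.compile(ATOMIC_DIGITS + "6")      # swallows every digit, so 6 never matches
ATOMIC_FOO = re.compile(ATOMIC_DIGITS + "foo")

print(GREEDY.search("123456") is not None)
print(LAZY.search("123456") is not None)
print(ATOMIC.search("123456") is not None)
print(ATOMIC_FOO.search("123456foo") is not None)
print(ATOMIC_FOO.search("123456bar") is not None)
```

The first two patterns adjust the number of digits they take and match
@samp{123456}; the emulated non-backtracking form cannot, because the
whole run of digits is consumed before the final @samp{6} is tried.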
   3860 
   3861 This construction can of course contain arbitrarily complicated
   3862 subpatterns, and it can be nested.
   3863 
   3864 @cindex Perl-style regular expressions, lookbehind subpatterns
   3865 Non-backtracking subpatterns can be used in conjunction with look-behind
   3866 assertions to specify efficient matching at the end
   3867 of the subject string. Consider a simple pattern such as
   3868 
   3869 @example
   3870 abcd$
   3871 @end example
   3872 
   3873 @noindent
   3874 when applied to a long string which does not match.  Because
   3875 matching proceeds from left to right, @command{sed} will look for
   3876 each @samp{a} in the subject and then see if what follows matches
   3877 the rest of the pattern. If the pattern is specified as
   3878 
   3879 @example
   3880 ^.*abcd$
   3881 @end example
   3882 
   3883 @noindent
   3884 the initial @code{.*} matches the entire string at first, but when
   3885 this fails (because there is no following @samp{a}), it backtracks
   3886 to match all but the last character, then all but the
   3887 last two characters, and so on. Once again the search for
   3888 @samp{a} covers the entire string, from right to left, so we are
   3889 no better off. However, if the pattern is written as
   3890 
   3891 @example
   3892 ^(?>.*)(?<=abcd)
   3893 @end example
   3894 
@noindent
there can be no backtracking for the @code{.*} item; it can match
only the entire string.  The subsequent lookbehind assertion
does a single test on the last four characters.  If it fails,
the match fails immediately.  For long strings, this approach
makes a significant difference to the processing time.
   3900 
When a pattern contains an unlimited repeat inside a subpattern
that can itself be repeated an unlimited number of
times, the use of a non-backtracking subpattern is the only way to
avoid some failing matches taking a very long time
   3905 indeed.@footnote{Actually, the matcher embedded in @value{SSED}
   3906 tries to do something for this in the simplest cases,
   3907 like @code{([^b]*b)*}.  These cases are actually quite
   3908 common: they happen for example in a regular expression
   3909 like @code{\/\*([^*]*\*)*\/} which matches C comments.}
   3910 
   3911 The pattern
   3912 
   3913 @example
   3914 (\D+|<\d+>)*[!?]
   3915 @end example
   3916 
@c ([^0-9<]+<(\d+>)?)*[!?]
   3918 
   3919 @noindent
   3920 matches an unlimited number of substrings that either consist
   3921 of non-digits, or digits enclosed in angular brackets, followed by
   3922 an exclamation or question mark. When it matches, it runs quickly.
   3923 However, if it is applied to
   3924 
   3925 @example
   3926 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
   3927 @end example
   3928 
   3929 @noindent
   3930 it takes a long time before reporting failure.  This is
   3931 because the string can be divided between the two repeats in
   3932 a large number of ways, and all have to be tried.@footnote{The
   3933 example used @code{[!?]} rather than a single character at the end,
   3934 because both @value{SSED} and Perl have an optimization that allows
   3935 for fast failure when a single character is used. They
   3936 remember the last single character that is required for a
   3937 match, and fail early if it is not present in the string.}
   3938 
   3939 If the pattern is changed to
   3940 
   3941 @example
   3942 ((?>\D+)|<\d+>)*[!?]
   3943 @end example
   3944 
   3945 sequences of non-digits cannot be broken, and failure happens
   3946 quickly.
   3947 
   3948 @node Conditional subpatterns
   3949 @appendixsec Conditional subpatterns
   3950 @cindex Perl-style regular expressions, conditional subpatterns
   3951 
   3952 It is possible to cause the matching process to obey a subpattern
   3953 conditionally or to choose between two alternative
   3954 subpatterns, depending on the result of an assertion, or
   3955 whether a previous capturing subpattern matched or not. The
   3956 two possible forms of conditional subpattern are
   3957 
   3958 @example
   3959 (?(@var{condition})@var{yes-pattern})
   3960 (?(@var{condition})@var{yes-pattern}|@var{no-pattern})
   3961 @end example
   3962 
   3963 If the condition is satisfied, the yes-pattern is used; otherwise
   3964 the no-pattern (if present) is used. If there are more than two
   3965 alternatives in the subpattern, a compile-time error occurs.
   3966 
   3967 There are two kinds of condition. If the text between the
   3968 parentheses consists of a sequence of digits, the condition
   3969 is satisfied if the capturing subpattern of that number has
   3970 previously matched.  The number must be greater than zero.
   3971 Consider the following pattern, which contains non-significant
   3972 white space to make it more readable (assume the @code{X} modifier)
   3973 and to divide it into three parts for ease of discussion:
   3974 
   3975 @example
   3976 ( \( )?   [^()]+   (?(1) \) )
   3977 @end example
   3978 
   3979 The first part matches an optional opening parenthesis, and
   3980 if that character is present, sets it as the first captured
   3981 substring. The second part matches one or more characters
   3982 that are not parentheses. The third part is a conditional
   3983 subpattern that tests whether the first set of parentheses
matched or not.  If they did, that is, if the subject started
with an opening parenthesis, the condition is true, and so
the yes-pattern is executed and a closing parenthesis is
required.  Otherwise, since the no-pattern is not present, the
subpattern matches nothing.  In other words, this pattern
   3989 matches a sequence of non-parentheses, optionally enclosed
   3990 in parentheses.
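
The behaviour of this conditional can be tried out with a short Python
sketch.  Python's @code{re} module is used here only as a stand-in that
happens to accept the same @code{(?(1)...)} syntax, and
@code{re.VERBOSE} plays the role of the @code{X} modifier.  The names
@code{OPTIONAL_PARENS} and @code{is_whole_match} are hypothetical.

```python
import re

# re.VERBOSE ignores the white space in the pattern, like the X modifier.
# Python's re is a stand-in for the Perl-style matcher (an assumption).
OPTIONAL_PARENS = re.compile(r"( \( )?   [^()]+   (?(1) \) )", re.VERBOSE)


def is_whole_match(subject):
    """Return True if the entire subject matches the conditional pattern."""
    return OPTIONAL_PARENS.fullmatch(subject) is not None


for subject in ("abc", "(abc)", "(abc", "abc)"):
    print(subject, is_whole_match(subject))
```

Requiring the whole subject to match makes the effect visible: a lone
opening or closing parenthesis is rejected, while the bare and the fully
parenthesized forms are accepted.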
   3991 
   3992 @cindex Perl-style regular expressions, lookahead subpatterns
   3993 If the condition is not a sequence of digits, it must be an
   3994 assertion.  This may be a positive or negative lookahead or
lookbehind assertion.  Consider this pattern, again containing
non-significant white space, and with the two alternatives
on the second and third lines:
   3998 
   3999 @example
   4000 (?(?=...[a-z])
   4001    \d\d-[a-z]@{3@}-\d\d |
   4002    \d\d-\d\d-\d\d )
   4003 @end example
   4004 
The condition is a positive lookahead assertion that skips
three characters and then requires a letter, so it tests the fourth
character from the current point.  If a letter is found there, the
subject is matched against the first
alternative @samp{@var{dd}-@var{aaa}-@var{dd}} (where @var{aaa} are
letters and @var{dd} are digits); otherwise it is matched against
the second alternative, @samp{@var{dd}-@var{dd}-@var{dd}}.
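
Some matchers accept only a group number as a condition.  Python's
@code{re} module is one of them, so the sketch below does not use the
conditional syntax itself.  Instead it swaps in an equivalent rewrite:
the yes-pattern is guarded by the assertion and the no-pattern by its
negation.  The name @code{DATE_LIKE} is hypothetical, and
@code{[a-z][a-z][a-z]} stands in for @code{[a-z]@{3@}}.

```python
import re

# Rewrite of (?(?=A)yes|no) as (?:(?=A)yes|(?!A)no).  This is a swapped-in
# equivalent for matchers without assertion conditions, not the
# conditional syntax described in the text.
DATE_LIKE = re.compile(r"""
    (?: (?=...[a-z])  \d\d-[a-z][a-z][a-z]-\d\d    # fourth character is a letter
      | (?!...[a-z])  \d\d-\d\d-\d\d )             # fourth character is not a letter
    """, re.VERBOSE)

for subject in ("12-abc-34", "12-34-56", "12-ab1-34", "12-3bc-45"):
    print(subject, DATE_LIKE.fullmatch(subject) is not None)
```

Because each alternative is guarded, a subject whose fourth character is
a letter is never tried against the all-digit form, which mirrors the
way the conditional commits to one branch.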
   4011 
   4012 
   4013 @node Recursive patterns
   4014 @appendixsec Recursive patterns
   4015 @cindex Perl-style regular expressions, recursive patterns
   4016 @cindex Perl-style regular expressions, recursion
   4017 
   4018 Consider the problem of matching a string in parentheses,
   4019 allowing for unlimited nested parentheses. Without the use
   4020 of recursion, the best that can be done is to use a pattern
   4021 that matches up to some fixed depth of nesting. It is not
   4022 possible to handle an arbitrary nesting depth. Perl 5.6 has
   4023 provided an experimental facility that allows regular
   4024 expressions to recurse (amongst other things). It does this
   4025 by interpolating Perl code in the expression at run time,
and the code can refer to the expression itself.  A Perl pattern
to solve the parentheses problem can be created like
this:
   4029 
   4030 @example
   4031 $re = qr@{\( (?: (?>[^()]+) | (?p@{$re@}) )* \)@}x;
   4032 @end example
   4033 
   4034 The @code{(?p@{...@})} item interpolates Perl code at run time,
   4035 and in this case refers recursively to the pattern in which it
   4036 appears. Obviously, @command{sed} cannot support the interpolation of
   4037 Perl code.  Instead, the special item @code{(?R)} is provided for
   4038 the specific case of recursion. This pattern solves the
   4039 parentheses problem (assume the @code{X} modifier option is used
   4040 so that white space is ignored):
   4041 
   4042 @example
   4043 \( ( (?>[^()]+) | (?R) )* \)
   4044 @end example
   4045 
   4046 First it matches an opening parenthesis. Then it matches any
   4047 number of substrings which can either be a sequence of
   4048 non-parentheses, or a recursive match of the pattern itself
   4049 (i.e. a correctly parenthesized substring). Finally there is
   4050 a closing parenthesis.
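
The three steps just described can be mirrored by a small hand-written
matcher.  The sketch below is in Python, whose @code{re} module has no
@code{(?R)} item, so it replaces the recursive pattern with an ordinary
recursive function; it illustrates what the pattern accepts when anchored
at a given position and says nothing about how @value{SSED} implements
recursion.  The function name @code{match_parens} is hypothetical.

```python
def match_parens(subject, start=0):
    # Return the index just past a balanced group that begins at start,
    # or -1 if there is none.  Hand-written counterpart of
    #     \( ( [^()]+ | (?R) )* \)
    if start >= len(subject) or subject[start] != "(":
        return -1                              # the leading \( failed
    pos = start + 1
    while pos < len(subject):
        if subject[pos] == ")":
            return pos + 1                     # the closing \)
        if subject[pos] == "(":
            pos = match_parens(subject, pos)   # the (?R) alternative
            if pos == -1:
                return -1
        else:
            pos += 1                           # one character of [^()]+
    return -1                                  # input ended before the closing \)


for subject in ("(ab(cd)ef)", "()", "(a)b", "(a()", "abc"):
    print(subject, match_parens(subject))
```

The function scans each character once, so an unbalanced subject is
rejected in linear time.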
   4051 
   4052 This particular example pattern contains nested unlimited
   4053 repeats, and so the use of a non-backtracking subpattern for
   4054 matching strings of non-parentheses is important when applying
   4055 the pattern to strings that do not match. For example, when
   4056 it is applied to
   4057 
   4058 @example
   4059 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
   4060 @end example
   4061 
it yields a ``no match'' response quickly.  However, if a
non-backtracking subpattern is not used, the match runs
for a very long time indeed because there are so many different
ways the @code{+} and @code{*} repeats can carve up the subject,
and all have to be tested before failure can be reported.
   4067 
   4068 The values set for any capturing subpatterns are those from
   4069 the outermost level of the recursion at which the subpattern
   4070 value is set. If the pattern above is matched against
   4071 
   4072 @example
   4073 (ab(cd)ef)
   4074 @end example
   4075 
   4076 @noindent
   4077 the value for the capturing parentheses is @samp{ef}, which is
   4078 the last value taken on at the top level.
   4079 
   4080 @node Comments
   4081 @appendixsec Comments
   4082 @cindex Perl-style regular expressions, comments
   4083 
The sequence @code{(?#} marks the start of a comment which continues
up to the next closing parenthesis.  Nested parentheses
are not permitted.  The characters that make up a comment
play no part in the pattern matching at all.
   4088 
   4089 @cindex Perl-style regular expressions, extended
   4090 If the @code{X} modifier option is used, an unescaped @code{#} character
   4091 outside a character class introduces a comment that continues
   4092 up to the next newline character in the pattern.
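
Both comment styles can be seen in the following Python sketch.
Python's @code{re} module is only a stand-in that accepts the same two
notations, with @code{re.VERBOSE} playing the role of the @code{X}
modifier; the variable names are hypothetical.

```python
import re

# Inline comment: everything from (?# up to the next ) is ignored.
INLINE = re.compile(r"ab(?# the first half)cd")

# Extended mode: an unescaped # starts a comment that runs to the newline.
EXTENDED = re.compile("""
    ab   # the first half
    cd   # the second half
    """, re.VERBOSE)

# Inside a character class, # is an ordinary character even in extended mode.
LITERAL_HASH = re.compile(r"a [#] b", re.VERBOSE)

print(INLINE.fullmatch("abcd") is not None)
print(EXTENDED.fullmatch("abcd") is not None)
print(LITERAL_HASH.fullmatch("a#b") is not None)
```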
   4093 @end ifset
   4094 
   4095 
   4096 @page
   4097 @node Concept Index
   4098 @unnumbered Concept Index
   4099 
   4100 This is a general index of all issues discussed in this manual, with the
   4101 exception of the @command{sed} commands and command-line options.
   4102 
   4103 @printindex cp
   4104 
   4105 @page
   4106 @node Command and Option Index
   4107 @unnumbered Command and Option Index
   4108 
   4109 This is an alphabetical list of all @command{sed} commands and command-line
   4110 options.
   4111 
   4112 @printindex fn
   4113 
   4114 @contents
   4115 @bye
   4116 
   4117 @c XXX FIXME: the term "cycle" is never defined...
   4118