      1 \input texinfo  @c -*-texinfo-*-
      2 @c Do not edit this file!! It is automatically generated from sed-in.texi.
      3 @c
      4 @c -- Stuff that needs adding: ----------------------------------------------
      5 @c (document the `;' command-separator)
      6 @c --------------------------------------------------------------------------
      7 @c Check for consistency: regexps in @code, text that they match in @samp.
      8 @c 
      9 @c Tips:
     10 @c    @command for command
     11 @c    @samp for command fragments: @samp{cat -s}
     12 @c    @code for sed commands and flags
     13 @c    Use ``quote'' not `quote' or "quote".
     14 @c
     15 @c %**start of header
     16 @setfilename sed.info
     17 @settitle sed, a stream editor
     18 @c %**end of header
     19 
     20 @c @smallbook
     21 
     22 @include version.texi
     23 
     24 @c Combine indices.
     25 @syncodeindex ky cp
     26 @syncodeindex pg cp
     27 @syncodeindex tp cp
     28 
     29 @defcodeindex op
     30 @syncodeindex op fn
     31 
     32 @include config.texi
     33 
     34 @copying
     35 This file documents version @value{VERSION} of
     36 @value{SSED}, a stream editor.
     37 
     38 Copyright @copyright{} 1998, 1999, 2001, 2002, 2003, 2004 Free
     39 Software Foundation, Inc.
     40 
     41 This document is released under the terms of the @acronym{GNU} Free
     42 Documentation License as published by the Free Software Foundation;
     43 either version 1.1, or (at your option) any later version.
     44 
     45 You should have received a copy of the @acronym{GNU} Free Documentation
     46 License along with @value{SSED}; see the file @file{COPYING.DOC}.
     47 If not, write to the Free Software Foundation, 51 Franklin Street,
     48 Fifth Floor, Boston, MA 02110-1301, USA.
     49 
     50 There are no Cover Texts and no Invariant Sections; this text, along
     51 with its equivalent in the printed manual, constitutes the Title Page.
     52 @end copying
     53 
     54 @setchapternewpage off
     55 
     56 @titlepage
     57 @title @command{sed}, a stream editor
     58 @subtitle version @value{VERSION}, @value{UPDATED}
     59 @author by Ken Pizzini, Paolo Bonzini
     60 
     61 @page
     62 @vskip 0pt plus 1filll
     63 Copyright @copyright{} 1998, 1999 Free Software Foundation, Inc.
     64 
     65 @insertcopying
     66 
     67 Published by the Free Software Foundation, @*
     68 51 Franklin Street, Fifth Floor @*
     69 Boston, MA 02110-1301, USA
     70 @end titlepage
     71 
     72 
     73 @node Top
     74 @top
     75 
     76 @ifnottex
     77 @insertcopying
     78 @end ifnottex
     79 
     80 @menu
     81 * Introduction::               Introduction
     82 * Invoking sed::               Invocation
     83 * sed Programs::               @command{sed} programs
     84 * Examples::                   Some sample scripts
     85 * Limitations::                Limitations and (non-)limitations of @value{SSED}
     86 * Other Resources::            Other resources for learning about @command{sed}
     87 * Reporting Bugs::             Reporting bugs
     88 
     89 * Extended regexps::           @command{egrep}-style regular expressions
     90 @ifset PERL
     91 * Perl regexps::               Perl-style regular expressions
     92 @end ifset
     93 
     94 * Concept Index::              A menu with all the topics in this manual.
     95 * Command and Option Index::   A menu with all @command{sed} commands and
     96                                command-line options.
     97 
     98 @detailmenu
     99 --- The detailed node listing ---
    100 
    101 sed Programs:
    102 * Execution Cycle::                 How @command{sed} works
    103 * Addresses::                       Selecting lines with @command{sed}
    104 * Regular Expressions::             Overview of regular expression syntax
    105 * Common Commands::                 Often used commands
    106 * The "s" Command::                 @command{sed}'s Swiss Army Knife
    107 * Other Commands::                  Less frequently used commands
    108 * Programming Commands::            Commands for @command{sed} gurus
    109 * Extended Commands::               Commands specific to @value{SSED}
    110 * Escapes::                         Specifying special characters
    111 
    112 Examples:
    113 * Centering lines::
    114 * Increment a number::
    115 * Rename files to lower case::
    116 * Print bash environment::
    117 * Reverse chars of lines::
    118 * tac::                             Reverse lines of files
    119 * cat -n::                          Numbering lines
    120 * cat -b::                          Numbering non-blank lines
    121 * wc -c::                           Counting chars
    122 * wc -w::                           Counting words
    123 * wc -l::                           Counting lines
    124 * head::                            Printing the first lines
    125 * tail::                            Printing the last lines
    126 * uniq::                            Make duplicate lines unique
    127 * uniq -d::                         Print duplicated lines of input
    128 * uniq -u::                         Remove all duplicated lines
    129 * cat -s::                          Squeezing blank lines
    130 
    131 @ifset PERL
    132 Perl regexps:
    133 * Backslash::                       Introduces special sequences
    134 * Circumflex/dollar sign/period::   Behave specially with regard to new lines
    135 * Square brackets::                 Are a bit different in strange cases
    136 * Options setting::                 Toggle modifiers in the middle of a regexp
    137 * Non-capturing subpatterns::       Are not counted when backreferencing
    138 * Repetition::                      Allows for non-greedy matching
    139 * Backreferences::                  Allows for more than 10 back references
    140 * Assertions::                      Allows for complex look ahead matches
    141 * Non-backtracking subpatterns::    Often gives more performance
    142 * Conditional subpatterns::         Allows if/then/else branches
    143 * Recursive patterns::              For example to match parentheses
    144 * Comments::                        Because things can get complex...
    145 @end ifset
    146 
    147 @end detailmenu
    148 @end menu
    149 
    150 
    151 @node Introduction
    152 @chapter Introduction
    153 
    154 @cindex Stream editor
    155 @command{sed} is a stream editor.
    156 A stream editor is used to perform basic text
    157 transformations on an input stream
    158 (a file or input from a pipeline).
    159 While in some ways similar to an editor which
    160 permits scripted edits (such as @command{ed}),
    161 @command{sed} works by making only one pass over the
    162 input(s), and is consequently more efficient.
    163 But it is @command{sed}'s ability to filter text in a pipeline
    164 which particularly distinguishes it from other types of
    165 editors.
    166 
    167 
    168 @node Invoking sed
    169 @chapter Invocation
    170 
    171 Normally @command{sed} is invoked like this:
    172 
    173 @example
    174 sed SCRIPT INPUTFILE...
    175 @end example
    176 
    177 The full format for invoking @command{sed} is:
    178 
    179 @example
    180 sed OPTIONS... [SCRIPT] [INPUTFILE...]
    181 @end example
    182 
    183 If you do not specify @var{INPUTFILE}, or if @var{INPUTFILE} is @file{-},
    184 @command{sed} filters the contents of the standard input.  The @var{script}
    185 is actually the first non-option parameter, which @command{sed} specially
    186 considers a script and not an input file if (and only if) none of the
    187 other @var{options} specifies a script to be executed, that is if neither
    188 of the @option{-e} and @option{-f} options is specified.
    189 
    190 @command{sed} may be invoked with the following command-line options:
    191 
    192 @table @code
    193 @item --version
    194 @opindex --version
    195 @cindex Version, printing
    196 Print out the version of @command{sed} that is being run and a copyright notice,
    197 then exit.
    198 
    199 @item --help
    200 @opindex --help
    201 @cindex Usage summary, printing
    202 Print a usage message briefly summarizing these command-line options
    203 and the bug-reporting address,
    204 then exit.
    205 
    206 @item -n
    207 @itemx --quiet
    208 @itemx --silent
    209 @opindex -n
    210 @opindex --quiet
    211 @opindex --silent
    212 @cindex Disabling autoprint, from command line
    213 By default, @command{sed} prints out the pattern space
    214 at the end of each cycle through the script (@pxref{Execution Cycle, ,
    215 How @code{sed} works}).
    216 These options disable this automatic printing,
    217 and @command{sed} only produces output when explicitly told to
    218 via the @code{p} command.
    219 
    220 @item -e @var{script}
    221 @itemx --expression=@var{script}
    222 @opindex -e
    223 @opindex --expression
    224 @cindex Script, from command line
    225 Add the commands in @var{script} to the set of commands to be
    226 run while processing the input.
    227 
    228 @item -f @var{script-file}
    229 @itemx --file=@var{script-file}
    230 @opindex -f
    231 @opindex --file
    232 @cindex Script, from a file
    233 Add the commands contained in the file @var{script-file}
    234 to the set of commands to be run while processing the input.
    235 
    236 @item -i[@var{SUFFIX}]
    237 @itemx --in-place[=@var{SUFFIX}]
    238 @opindex -i
    239 @opindex --in-place
    240 @cindex In-place editing, activating
    241 @cindex @value{SSEDEXT}, in-place editing
    242 This option specifies that files are to be edited in-place.
    243 @value{SSED} does this by creating a temporary file and
    244 sending output to this file rather than to the standard
    245 output.@footnote{This applies to commands such as @code{=},
    246 @code{a}, @code{c}, @code{i}, @code{l}, @code{p}.  You can
    247 still write to the standard output by using the @code{w}
    248 @cindex @value{SSEDEXT}, @file{/dev/stdout} file
    249 or @code{W} commands together with the @file{/dev/stdout}
    250 special file}.
    251 
    252 This option implies @option{-s}.
    253 
    254 When the end of the file is reached, the temporary file is
    255 renamed to the output file's original name.  The extension,
    256 if supplied, is used to modify the name of the old file
    257 before renaming the temporary file, thereby making a backup
    258 copy@footnote{Note that @value{SSED} creates the backup
    259 file whether or not any output is actually changed.}.
    260 
    261 @cindex In-place editing, Perl-style backup file names
    262 This rule is followed: if the extension doesn't contain a @code{*},
    263 then it is appended to the end of the current filename as a
    264 suffix; if the extension does contain one or more @code{*}
    265 characters, then @emph{each} asterisk is replaced with the
    266 current filename.  This allows you to add a prefix to the
    267 backup file, instead of (or in addition to) a suffix, or
    268 even to place backup copies of the original files into another
    269 directory (provided the directory already exists).
    270 
    271 If no extension is supplied, the original file is
    272 overwritten without making a backup.
    273 
    274 @item -l @var{N}
    275 @itemx --line-length=@var{N}
    276 @opindex -l
    277 @opindex --line-length
    278 @cindex Line length, setting
    279 Specify the default line-wrap length for the @code{l} command.
    280 A length of 0 (zero) means to never wrap long lines.  If
    281 not specified, it is taken to be 70.
    282 
    283 @item --posix
    284 @cindex @value{SSEDEXT}, disabling
    285 @value{SSED} includes several extensions to @acronym{POSIX}
    286 sed.  In order to simplify writing portable scripts, this
    287 option disables all the extensions that this manual documents,
    288 including additional commands.
    289 @cindex @code{POSIXLY_CORRECT} behavior, enabling
    290 Most of the extensions accept @command{sed} programs that
    291 are outside the syntax mandated by @acronym{POSIX}, but some
    292 of them (such as the behavior of the @code{N} command
    293 described in @ref{Reporting Bugs}) actually violate the
    294 standard.  If you want to disable only the latter kind of
    295 extension, you can set the @code{POSIXLY_CORRECT} variable
    296 to a non-empty value.
    297 
    298 @item -b
    299 @itemx --binary
    300 @opindex -b
    301 @opindex --binary
    302 This option is available on every platform, but is only effective where the
    303 operating system makes a distinction between text files and binary files.
    304 When such a distinction is made---as is the case for MS-DOS, Windows,
    305 and Cygwin---text files are composed of lines separated by a carriage return
    306 @emph{and} a line feed character, and @command{sed} does not see the
    307 ending CR.  When this option is specified, @command{sed} will open
    308 input files in binary mode, thus not requesting this special processing
    309 and considering lines to end at a line feed.
    310 
    311 @item --follow-symlinks
    312 @opindex --follow-symlinks
    313 This option is available only on platforms that support
    314 symbolic links and has an effect only if option @option{-i}
    315 is specified.  In this case, if the file that is specified
    316 on the command line is a symbolic link, @command{sed} will
    317 follow the link and edit the ultimate destination of the
    318 link.  The default behavior is to break the symbolic link,
    319 so that the link destination will not be modified.
    320 
    321 @item -r
    322 @itemx --regexp-extended
    323 @opindex -r
    324 @opindex --regexp-extended
    325 @cindex Extended regular expressions, choosing
    326 @cindex @acronym{GNU} extensions, extended regular expressions
    327 Use extended regular expressions rather than basic
    328 regular expressions.  Extended regexps are those that
    329 @command{egrep} accepts; they can be clearer because they
    330 usually have fewer backslashes, but are a @acronym{GNU} extension
    331 and hence scripts that use them are not portable.
    332 @xref{Extended regexps, , Extended regular expressions}.
    333 
    334 @ifset PERL
    335 @item -R
    336 @itemx --regexp-perl
    337 @opindex -R
    338 @opindex --regexp-perl
    339 @cindex Perl-style regular expressions, choosing
    340 @cindex @value{SSEDEXT}, Perl-style regular expressions
    341 Use Perl-style regular expressions rather than basic
    342 regular expressions.  Perl-style regexps are extremely
    343 powerful but are a @value{SSED} extension and hence scripts that
    344 use them are not portable.  @xref{Perl regexps, ,
    345 Perl-style regular expressions}.
    346 @end ifset
    347 
    348 @item -s
    349 @itemx --separate
    350 @cindex Working on separate files
    351 By default, @command{sed} will consider the files specified on the
    352 command line as a single continuous long stream.  This @value{SSED}
    353 extension allows the user to consider them as separate files:
    354 range addresses (such as @samp{/abc/,/def/}) are not allowed
    355 to span several files, line numbers are relative to the start
    356 of each file, @code{$} refers to the last line of each file,
    357 and files invoked from the @code{R} commands are rewound at the
    358 start of each file.
    359 
    360 @item -u
    361 @itemx --unbuffered
    362 @opindex -u
    363 @opindex --unbuffered
    364 @cindex Unbuffered I/O, choosing
    365 Buffer both input and output as minimally as practical.
    366 (This is particularly useful if the input is coming from
    367 the likes of @samp{tail -f}, and you wish to see the transformed
    368 output as soon as possible.)
    369 
    370 @end table
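
As a rough illustration of a few of the options above, the following
sketch assumes @acronym{GNU} @command{sed} (the @option{-s} and
@option{-i} options are extensions).  The file names @file{a.txt} and
@file{b.txt} and the temporary directory are invented for the example.

```shell
# Sketch only: a.txt, b.txt and the temporary directory are hypothetical.
tmp=$(mktemp -d)
printf 'one\ntwo\n' > "$tmp/a.txt"
printf 'three\nfour\n' > "$tmp/b.txt"

# -n disables autoprint, so only the explicit p command produces output.
# Both files form one continuous stream, so line 2 is "two".
quiet=$(sed -n -e '2p' "$tmp/a.txt" "$tmp/b.txt")

# With -s each file is treated separately, so each file has its own line 2.
separate=$(sed -s -n -e '2p' "$tmp/a.txt" "$tmp/b.txt")

# -i with a suffix edits the file in place and keeps the original
# contents in a.txt.bak.
sed -i.bak -e 's/one/ONE/' "$tmp/a.txt"
edited=$(cat "$tmp/a.txt")
backup=$(cat "$tmp/a.txt.bak")

printf '%s\n' "$quiet" "$separate" "$edited" "$backup"
rm -r "$tmp"
```

Here the backup file is written even though only one line changed,
matching the footnote above about backups being created unconditionally.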
    371 
    372 If no @option{-e}, @option{-f}, @option{--expression}, or @option{--file}
    373 options are given on the command line,
    374 then the first non-option argument on the command line is
    375 taken to be the @var{script} to be executed.
    376 
    377 @cindex Files to be processed as input
    378 If any command-line parameters remain after processing the above,
    379 these parameters are interpreted as the names of input files to
    380 be processed.
    381 @cindex Standard input, processing as input
    382 A file name of @samp{-} refers to the standard input stream.
    383 The standard input will be processed if no file names are specified.
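
The rules for telling the script apart from the input files can be
sketched as follows.  This is a minimal example that reads from a pipe;
the sample text is made up.

```shell
# With no -e or -f option, the first non-option argument is the script
# and standard input is read because no file names remain.
first=$(printf 'hello\n' | sed 's/hello/world/')

# With -e, every non-option argument is an input file;
# the file name "-" refers to standard input.
second=$(printf 'hello\n' | sed -e 's/hello/world/' -)

# Several -e scripts are run in the order given.
third=$(printf 'hello\n' | sed -e 's/hello/world/' -e 's/world/sed/')

printf '%s\n' "$first" "$second" "$third"
# prints: world, world, sed (one per line)
```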
    384 
    385 
    386 @node sed Programs
    387 @chapter @command{sed} Programs
    388 
    389 @cindex @command{sed} program structure
    390 @cindex Script structure
    391 A @command{sed} program consists of one or more @command{sed} commands,
    392 passed in by one or more of the
    393 @option{-e}, @option{-f}, @option{--expression}, and @option{--file}
    394 options, or the first non-option argument if none of these
    395 options is used.
    396 This document will refer to ``the'' @command{sed} script;
    397 this is understood to mean the in-order concatenation
    398 of all of the @var{script}s and @var{script-file}s passed in.
    399 
    400 Each @code{sed} command consists of an optional address or
    401 address range, followed by a one-character command name
    402 and any additional command-specific code.
    403 
    404 @menu
    405 * Execution Cycle::          How @command{sed} works
    406 * Addresses::                Selecting lines with @command{sed}
    407 * Regular Expressions::      Overview of regular expression syntax
    408 * Common Commands::          Often used commands
    409 * The "s" Command::          @command{sed}'s Swiss Army Knife
    410 * Other Commands::           Less frequently used commands
    411 * Programming Commands::     Commands for @command{sed} gurus
    412 * Extended Commands::        Commands specific of @value{SSED}
    413 * Escapes::                  Specifying special characters
    414 @end menu
    415 
    416 
    417 @node Execution Cycle
    418 @section How @command{sed} Works
    419 
    420 @cindex Buffer spaces, pattern and hold
    421 @cindex Spaces, pattern and hold
    422 @cindex Pattern space, definition
    423 @cindex Hold space, definition
    424 @command{sed} maintains two data buffers: the active @emph{pattern} space,
    425 and the auxiliary @emph{hold} space. Both are initially empty.
    426 
    427 @command{sed} operates by performing the following cycle on each
    428 line of input: first, @command{sed} reads one line from the input
    429 stream, removes any trailing newline, and places it in the pattern space.
    430 Then commands are executed; each command can have an address associated
    431 with it: addresses are a kind of condition code, and a command is only
    432 executed if the condition is verified before the command is to be
    433 executed.
    434 
    435 When the end of the script is reached, unless the @option{-n} option
    436 is in use, the contents of pattern space are printed out to the output
    437 stream, adding back the trailing newline if it was removed.@footnote{Actually,
    438 if @command{sed} prints a line without the terminating newline, it will
    439 nevertheless print the missing newline as soon as more text is sent to
    440 the same output stream, which gives the ``least expected surprise''
    441 even though it does not make commands like @samp{sed -n p} exactly
    442 identical to @command{cat}.} Then the next cycle starts for the next
    443 input line.
    444 
    445 Unless special commands (like @samp{D}) are used, the pattern space is
    446 deleted between two cycles. The hold space, on the other hand, keeps
    447 its data between cycles (see commands @samp{h}, @samp{H}, @samp{x},
    448 @samp{g}, @samp{G} to move data between the two buffers).
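
The cycle described above can be observed with two small experiments.
This is only a sketch: the three-line input is invented, and the second
script is one common way to use the hold space, not the only one.

```shell
input=$(printf 'a\nb\nc')

# Autoprint plus an explicit p prints every line twice.
doubled=$(printf '%s\n' "$input" | sed 'p')

# The hold space survives between cycles.  On every line but the first,
# G appends the hold space to the pattern space; h then saves the result.
# On the last line, p prints the accumulated (reversed) text.
reversed=$(printf '%s\n' "$input" | sed -n '1!G;h;$p')

printf '%s\n' "$doubled" "$reversed"
```

The second command prints @samp{c}, @samp{b}, @samp{a} on separate
lines, because each cycle puts the new line in front of what was saved.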
    449 
    450 
    451 @node Addresses
    452 @section Selecting lines with @command{sed}
    453 @cindex Addresses, in @command{sed} scripts
    454 @cindex Line selection
    455 @cindex Selecting lines to process
    456 
    457 Addresses in a @command{sed} script can be in any of the following forms:
    458 @table @code
    459 @item @var{number}
    460 @cindex Address, numeric
    461 @cindex Line, selecting by number
    462 Specifying a line number will match only that line in the input.
    463 (Note that @command{sed} counts lines continuously across all input files
    464 unless @option{-i} or @option{-s} options are specified.)
    465 
    466 @item @var{first}~@var{step}
    467 @cindex @acronym{GNU} extensions, @samp{@var{n}~@var{m}} addresses
    468 This @acronym{GNU} extension matches every @var{step}th line
    469 starting with line @var{first}.
    470 In particular, lines will be selected when there exists
    471 a non-negative @var{n} such that the current line-number equals
    472 @var{first} + (@var{n} * @var{step}).
    473 Thus, to select the odd-numbered lines,
    474 one would use @code{1~2};
    475 to pick every third line starting with the second, @samp{2~3} would be used;
    476 to pick every fifth line starting with the tenth, use @samp{10~5};
    477 and @samp{50~0} is just an obscure way of saying @code{50}.
    478 
    479 @item $
    480 @cindex Address, last line
    481 @cindex Last line, selecting
    482 @cindex Line, selecting last
    483 This address matches the last line of the last file of input, or
    484 the last line of each file when the @option{-i} or @option{-s} options
    485 are specified.
    486 
    487 @item /@var{regexp}/
    488 @cindex Address, as a regular expression
    489 @cindex Line, selecting by regular expression match
    490 This will select any line which matches the regular expression @var{regexp}.
    491 If @var{regexp} itself includes any @code{/} characters,
    492 each must be escaped by a backslash (@code{\}).
    493 
    494 @cindex empty regular expression
    495 @cindex @value{SSEDEXT}, modifiers and the empty regular expression
    496 The empty regular expression @samp{//} repeats the last regular
    497 expression match (the same holds if the empty regular expression is
    498 passed to the @code{s} command).  Note that modifiers to regular expressions
    499 are evaluated when the regular expression is compiled, thus it is invalid to
    500 specify them together with the empty regular expression.
    501 
    502 @item \%@var{regexp}%
    503 (The @code{%} may be replaced by any other single character.)
    504 
    505 @cindex Slash character, in regular expressions
    506 This also matches the regular expression @var{regexp},
    507 but allows one to use a different delimiter than @code{/}.
    508 This is particularly useful if the @var{regexp} itself contains
    509 a lot of slashes, since it avoids the tedious escaping of every @code{/}.
    510 If @var{regexp} itself includes any delimiter characters,
    511 each must be escaped by a backslash (@code{\}).
    512 
    513 @item /@var{regexp}/I
    514 @itemx \%@var{regexp}%I
    515 @cindex @acronym{GNU} extensions, @code{I} modifier
    516 @ifset PERL
    517 @cindex Perl-style regular expressions, case-insensitive
    518 @end ifset
    519 The @code{I} modifier to regular-expression matching is a @acronym{GNU}
    520 extension which causes the @var{regexp} to be matched in
    521 a case-insensitive manner.
    522 
    523 @item /@var{regexp}/M
    524 @itemx \%@var{regexp}%M
    525 @cindex @value{SSEDEXT}, @code{M} modifier
    526 @ifset PERL
    527 @cindex Perl-style regular expressions, multiline
    528 @end ifset
    529 The @code{M} modifier to regular-expression matching is a @value{SSED}
    530 extension which causes @code{^} and @code{$} to match respectively
    531 (in addition to the normal behavior) the empty string after a newline,
    532 and the empty string before a newline.  There are special character
    533 sequences
    534 @ifset PERL
    535 (@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'}
    536 in basic or extended regular expression modes)
    537 @end ifset
    538 @ifclear PERL
    539 (@code{\`} and @code{\'})
    540 @end ifclear
    541 which always match the beginning or the end of the buffer.
    542 @code{M} stands for @cite{multi-line}.
    543 
    544 @ifset PERL
    545 @item /@var{regexp}/S
    546 @itemx \%@var{regexp}%S
    547 @cindex @value{SSEDEXT}, @code{S} modifier
    548 @cindex Perl-style regular expressions, single line
    549 The @code{S} modifier to regular-expression matching is only valid
    550 in Perl mode and specifies that the dot character (@code{.}) will
    551 match the newline character too.  @code{S} stands for @cite{single-line}.
    552 @end ifset
    553 
    554 @ifset PERL
    555 @item /@var{regexp}/X
    556 @itemx \%@var{regexp}%X
    557 @cindex @value{SSEDEXT}, @code{X} modifier
    558 @cindex Perl-style regular expressions, extended
    559 The @code{X} modifier to regular-expression matching is also
    560 valid in Perl mode only.  If it is used, whitespace in the
    561 pattern (other than in a character class) and
    562 characters between a @kbd{#} outside a character class and the
    563 next newline character are ignored. An escaping backslash
    564 can be used to include a whitespace or @kbd{#} character as part
    565 of the pattern.
    566 @end ifset
    567 @end table
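
The address forms in the table above might be exercised like this.  The
sketch assumes @acronym{GNU} @command{sed}, since @code{first~step} and
the @code{I} modifier are extensions; the sample inputs are made up.

```shell
numbers=$(printf '%s\n' 1 2 3 4 5 6 7 8 9 10)
paths=$(printf '%s\n' /usr/bin /tmp /USR/lib)

third=$(printf '%s\n' "$numbers" | sed -n '3p')      # a line number
odd=$(printf '%s\n' "$numbers" | sed -n '1~2p')      # first~step (GNU)
last=$(printf '%s\n' "$numbers" | sed -n '$p')       # the last line

# \%regexp% avoids escaping the slash inside the regular expression.
usr=$(printf '%s\n' "$paths" | sed -n '\%^/usr%p')

# The I modifier makes the match case-insensitive (GNU).
usr_any=$(printf '%s\n' "$paths" | sed -n '\%^/usr%Ip')

# An empty regular expression reuses the last one, here /b/.
reuse=$(printf 'abc\n' | sed -n '/b/s//X/p')

printf '%s\n' "$third" "$odd" "$last" "$usr" "$usr_any" "$reuse"
```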
    568 
    569 If no addresses are given, then all lines are matched;
    570 if one address is given, then only lines matching that
    571 address are matched.
    572 
    573 @cindex Range of lines
    574 @cindex Several lines, selecting
    575 An address range can be specified by specifying two addresses
    576 separated by a comma (@code{,}).  An address range matches lines
    577 starting from where the first address matches, and continues
    578 until the second address matches (inclusively).
    579 
    580 If the second address is a @var{regexp}, then checking for the
    581 ending match will start with the line @emph{following} the
    582 line which matched the first address: a range will always
    583 span at least two lines (except of course if the input stream
    584 ends).
    585 
    586 If the second address is a @var{number} less than (or equal to)
    587 the line matching the first address, then only the one line is
    588 matched.
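
A short sketch of the range rules just described, using invented input
lines:

```shell
# A plain range: from the line matching BEGIN through the line matching END.
between=$(printf '%s\n' x BEGIN a END y | sed -n '/BEGIN/,/END/p')

# The line "ab" matches both /a/ and /b/, but the end of the range is only
# looked for starting with the following line, so the range runs to "b".
nextline=$(printf '%s\n' x ab c b d | sed -n '/a/,/b/p')

# A second address that is a number not greater than the first
# selects just the one line.
single=$(printf '%s\n' 1 2 3 4 5 | sed -n '4,2p')

printf '%s\n' "$between" "$nextline" "$single"
```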
    589 
    590 @cindex Special addressing forms
    591 @cindex Range with start address of zero
    592 @cindex Zero, as range start address
    593 @cindex @var{addr1},+N
    594 @cindex @var{addr1},~N
    595 @cindex @acronym{GNU} extensions, special two-address forms
    596 @cindex @acronym{GNU} extensions, @code{0} address
    597 @cindex @acronym{GNU} extensions, 0,@var{addr2} addressing
    598 @cindex @acronym{GNU} extensions, @var{addr1},+@var{N} addressing
    599 @cindex @acronym{GNU} extensions, @var{addr1},~@var{N} addressing
    600 @value{SSED} also supports some special two-address forms; all these
    601 are @acronym{GNU} extensions:
    602 @table @code
    603 @item 0,/@var{regexp}/
    604 A line number of @code{0} can be used in an address specification like
    605 @code{0,/@var{regexp}/} so that @command{sed} will try to match
    606 @var{regexp} in the first input line too.  In other words,
    607 @code{0,/@var{regexp}/} is similar to @code{1,/@var{regexp}/},
    608 except that if @var{addr2} matches the very first line of input the
    609 @code{0,/@var{regexp}/} form will consider it to end the range, whereas
    610 the @code{1,/@var{regexp}/} form will match the beginning of its range and
    611 hence make the range span up to the @emph{second} occurrence of the
    612 regular expression.
    613 
    614 Note that this is the only place where the @code{0} address makes
    615 sense; there is no 0-th line and commands which are given the @code{0}
    616 address in any other way will give an error.
    617 
    618 @item @var{addr1},+@var{N}
    619 Matches @var{addr1} and the @var{N} lines following @var{addr1}.
    620 
    621 @item @var{addr1},~@var{N}
    622 Matches @var{addr1} and the lines following @var{addr1}
    623 until the next line whose input line number is a multiple of @var{N}.
    624 @end table
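
The difference between these forms is easiest to see side by side.  The
following sketch assumes @acronym{GNU} @command{sed}, as all three forms
are extensions, and uses made-up input.

```shell
words=$(printf '%s\n' foo bar foo baz)
numbers=$(printf '%s\n' 1 2 3 4 5 6 7 8 9 10)

# 0,/foo/ may end on the very first line.
zero=$(printf '%s\n' "$words" | sed -n '0,/foo/p')

# 1,/foo/ starts on line 1 and ends at the next line matching foo.
one=$(printf '%s\n' "$words" | sed -n '1,/foo/p')

# addr1,+N: line 3 and the 2 lines after it.
plus=$(printf '%s\n' "$numbers" | sed -n '3,+2p')

# addr1,~N: line 3 up to the next line number that is a multiple of 4.
tilde=$(printf '%s\n' "$numbers" | sed -n '3,~4p')

printf '%s\n' "$zero" "$one" "$plus" "$tilde"
```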
    625 
    626 @cindex Excluding lines
    627 @cindex Selecting non-matching lines
    628 Appending the @code{!} character to the end of an address
    629 specification negates the sense of the match.
    630 That is, if the @code{!} character follows an address range,
    631 then only lines which do @emph{not} match the address range
    632 will be selected.
    633 This also works for singleton addresses,
    634 and, perhaps perversely, for the null address.
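
Negation can be sketched with a few one-liners; the input lines are
invented for the example.

```shell
# Print every line outside the range 2,3.
outside=$(printf '%s\n' 1 2 3 4 5 | sed -n '2,3!p')

# Delete every line that does not start with "#".
comments=$(printf '%s\n' '#a' b '#c' | sed '/^#/!d')

# Print every line except the last.
notlast=$(printf '%s\n' 1 2 3 | sed -n '$!p')

printf '%s\n' "$outside" "$comments" "$notlast"
```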
    635 
    636 
    637 @node Regular Expressions
    638 @section Overview of Regular Expression Syntax
    639 
    640 To know how to use @command{sed}, people should understand regular
    641 expressions (@dfn{regexp} for short).  A regular expression
    642 is a pattern that is matched against a
    643 subject string from left to right.  Most characters are
    644 @dfn{ordinary}: they stand for
    645 themselves in a pattern, and match the corresponding characters
    646 in the subject.  As a trivial example, the pattern
    647 
    648 @example
    649 The quick brown fox
    650 @end example
    651 
    652 @noindent
    653 matches a portion of a subject string that is identical to
    654 itself.  The power of regular expressions comes from the
    655 ability to include alternatives and repetitions in the pattern.
    656 These are encoded in the pattern by the use of @dfn{special characters},
    657 which do not stand for themselves but instead
    658 are interpreted in some special way.  Here is a brief description
    659 of regular expression syntax as used in @command{sed}.
    660 
    661 @table @code
    662 @item @var{char}
    663 A single ordinary character matches itself.
    664 
    665 @item *
    666 @cindex @acronym{GNU} extensions, to basic regular expressions
    667 Matches a sequence of zero or more instances of matches for the
    668 preceding regular expression, which must be an ordinary character, a
    669 special character preceded by @code{\}, a @code{.}, a grouped regexp
    670 (see below), or a bracket expression.  As a @acronym{GNU} extension, a
    671 postfixed regular expression can also be followed by @code{*}; for
    672 example, @code{a**} is equivalent to @code{a*}.  @acronym{POSIX}
    673 1003.1-2001 says that @code{*} stands for itself when it appears at
    674 the start of a regular expression or subexpression, but many
    675 non-@acronym{GNU} implementations do not support this and portable
    676 scripts should instead use @code{\*} in these contexts.
    677 
    678 @item \+
    679 @cindex @acronym{GNU} extensions, to basic regular expressions
    680 As @code{*}, but matches one or more.  It is a @acronym{GNU} extension.
    681 
    682 @item \?
    683 @cindex @acronym{GNU} extensions, to basic regular expressions
    684 As @code{*}, but only matches zero or one.  It is a @acronym{GNU} extension.
    685 
    686 @item \@{@var{i}\@}
    687 As @code{*}, but matches exactly @var{i} sequences (@var{i} is a
    688 decimal integer; for portability, keep it between 0 and 255
    689 inclusive).
    690 
    691 @item \@{@var{i},@var{j}\@}
    692 Matches between @var{i} and @var{j}, inclusive, sequences.
    693 
    694 @item \@{@var{i},\@}
    695 Matches @var{i} or more sequences.
    696 
    697 @item \(@var{regexp}\)
Groups the inner @var{regexp} as a whole.  This is used to:
    699 
    700 @itemize @bullet
    701 @item
    702 @cindex @acronym{GNU} extensions, to basic regular expressions
    703 Apply postfix operators, like @code{\(abcd\)*}:
    704 this will search for zero or more whole sequences 
    705 of @samp{abcd}, while @code{abcd*} would search
    706 for @samp{abc} followed by zero or more occurrences
    707 of @samp{d}.  Note that support for @code{\(abcd\)*} is
    708 required by @acronym{POSIX} 1003.1-2001, but many non-@acronym{GNU}
    709 implementations do not support it and hence it is not universally
    710 portable.         
    711 
    712 @item
    713 Use back references (see below).
    714 @end itemize
    715 
    716 @item .
    717 Matches any character, including newline.
    718 
    719 @item ^
    720 Matches the null string at beginning of the pattern space, i.e. what
    721 appears after the circumflex must appear at the beginning of the
    722 pattern space.
    723 
    724 In most scripts, pattern space is initialized to the content of each
    725 line (@pxref{Execution Cycle, , How @code{sed} works}).  So, it is a
    726 useful simplification to think of @code{^#include} as matching only
lines where @samp{#include} is the first thing on the line; if there are
spaces before it, for example, the match fails.  This simplification is
    729 valid as long as the original content of pattern space is not modified,
    730 for example with an @code{s} command.
    731 
    732 @code{^} acts as a special character only at the beginning of the
    733 regular expression or subexpression (that is, after @code{\(} or
    734 @code{\|}).  Portable scripts should avoid @code{^} at the beginning of
    735 a subexpression, though, as @acronym{POSIX} allows implementations that
    736 treat @code{^} as an ordinary character in that context.
    737 
    738 @item $
    739 It is the same as @code{^}, but refers to end of pattern space.
    740 @code{$} also acts as a special character only at the end
    741 of the regular expression or subexpression (that is, before @code{\)}
    742 or @code{\|}), and its use at the end of a subexpression is not
    743 portable.
    744 
    745 
    746 @item [@var{list}]
    747 @itemx [^@var{list}]
    748 Matches any single character in @var{list}: for example,
    749 @code{[aeiou]} matches all vowels.  A list may include
    750 sequences like @code{@var{char1}-@var{char2}}, which
    751 matches any character between (inclusive) @var{char1}
    752 and @var{char2}.
    753 
    754 A leading @code{^} reverses the meaning of @var{list}, so that
    755 it matches any single character @emph{not} in @var{list}.  To include
@code{]} in the list, make it the first character (after
the @code{^} if needed); to include @code{-} in the list,
make it the first or last; to include @code{^}, put
it after the first character.
    760 
    761 @cindex @code{POSIXLY_CORRECT} behavior, bracket expressions
    762 The characters @code{$}, @code{*}, @code{.}, @code{[}, and @code{\}
    763 are normally not special within @var{list}.  For example, @code{[\*]}
    764 matches either @samp{\} or @samp{*}, because the @code{\} is not
    765 special here.  However, strings like @code{[.ch.]}, @code{[=a=]}, and
    766 @code{[:space:]} are special within @var{list} and represent collating
    767 symbols, equivalence classes, and character classes, respectively, and
    768 @code{[} is therefore special within @var{list} when it is followed by
    769 @code{.}, @code{=}, or @code{:}.  Also, when not in
    770 @env{POSIXLY_CORRECT} mode, special escapes like @code{\n} and
    771 @code{\t} are recognized within @var{list}.  @xref{Escapes}.
    772 
    773 @item @var{regexp1}\|@var{regexp2}
    774 @cindex @acronym{GNU} extensions, to basic regular expressions
    775 Matches either @var{regexp1} or @var{regexp2}.  Use
    776 parentheses to use complex alternative regular expressions.
    777 The matching process tries each alternative in turn, from
    778 left to right, and the first one that succeeds is used.
    779 It is a @acronym{GNU} extension.
    780 
    781 @item @var{regexp1}@var{regexp2}
    782 Matches the concatenation of @var{regexp1} and @var{regexp2}.
    783 Concatenation binds more tightly than @code{\|}, @code{^}, and
    784 @code{$}, but less tightly than the other regular expression
    785 operators.
    786 
    787 @item \@var{digit}
    788 Matches the @var{digit}-th @code{\(@dots{}\)} parenthesized
    789 subexpression in the regular expression.  This is called a @dfn{back
reference}.  Subexpressions are implicitly numbered by counting
    791 occurrences of @code{\(} left-to-right.
    792 
    793 @item \n
    794 Matches the newline character.
    795 
    796 @item \@var{char}
    797 Matches @var{char}, where @var{char} is one of @code{$},
    798 @code{*}, @code{.}, @code{[}, @code{\}, or @code{^}.
    799 Note that the only C-like
    800 backslash sequences that you can portably assume to be
    801 interpreted are @code{\n} and @code{\\}; in particular
    802 @code{\t} is not portable, and matches a @samp{t} under most
    803 implementations of @command{sed}, rather than a tab character.
    804 
    805 @end table
    806 
    807 @cindex Greedy regular expression matching
    808 Note that the regular expression matcher is greedy, i.e., matches
    809 are attempted from left to right and, if two or more matches are
    810 possible starting at the same character, it selects the longest.
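
The following is a minimal sketch of greedy matching, assuming a
POSIX shell and any reasonably conformant @command{sed}; the variable
names and sample strings are made up for illustration.  It shows that
@code{.*} consumes as much as it can, and that a negated bracket
expression is one common way to stop at the first delimiter instead.

```shell
# ".*" is greedy: a.*X runs up to the LAST X, not the first one.
greedy=$(printf 'a1X2X3\n' | sed 's/a.*X/-/')

# The match still starts at the leftmost possible position (the first "b").
leftmost=$(printf 'abcabc\n' | sed 's/b.*/-/')

# A negated bracket expression stops at the first X instead.
first_only=$(printf 'a1X2X3\n' | sed 's/a[^X]*X/-/')

printf '%s\n' "$greedy" "$leftmost" "$first_only"
```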
    811 
    812 @noindent
    813 Examples:
    814 @table @samp
    815 @item abcdef
    816 Matches @samp{abcdef}.
    817 
    818 @item a*b
    819 Matches zero or more @samp{a}s followed by a single
    820 @samp{b}.  For example, @samp{b} or @samp{aaaaab}. 
    821 
    822 @item a\?b
    823 Matches @samp{b} or @samp{ab}.
    824 
    825 @item a\+b\+
    826 Matches one or more @samp{a}s followed by one or more
    827 @samp{b}s: @samp{ab} is the shortest possible match, but
    828 other examples are @samp{aaaab} or @samp{abbbbb} or
    829 @samp{aaaaaabbbbbbb}.
    830 
    831 @item .*
    832 @itemx .\+
    833 These two both match all the characters in a string;
    834 however, the first matches every string (including the empty
    835 string), while the second matches only strings containing
    836 at least one character.
    837 
    838 @item ^main.*(.*)
This matches a string starting with @samp{main},
    840 followed by an opening and closing
    841 parenthesis.  The @samp{n}, @samp{(} and @samp{)} need not
    842 be adjacent.
    843 
    844 @item ^#
    845 This matches a string beginning with @samp{#}.
    846 
    847 @item \\$
    848 This matches a string ending with a single backslash.  The
    849 regexp contains two backslashes for escaping.
    850 
    851 @item \$
Instead, this matches a single literal dollar sign,
because the @samp{$} is escaped.
    854 
    855 @item [a-zA-Z0-9]
In the C locale, this matches any single @acronym{ASCII} letter or digit.
    857 
    858 @item [^ @kbd{tab}]\+
    859 (Here @kbd{tab} stands for a single tab character.)
    860 This matches a string of one or more
    861 characters, none of which is a space or a tab.
    862 Usually this means a word.
    863 
    864 @item ^\(.*\)\n\1$
    865 This matches a string consisting of two equal substrings separated by
    866 a newline.
    867 
    868 @item .\@{9\@}A$
This matches nine characters followed by an @samp{A} at the end of
the string.
    870 
    871 @item ^.\@{15\@}A
    872 This matches the start of a string that contains 16 characters,
    873 the last of which is an @samp{A}.
    874 
    875 @end table
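
Several of the patterns above can be tried as addresses with
@samp{sed -n '/@var{regexp}/p'}.  The sketch below assumes a POSIX
shell; the sample input and variable names are invented for the
illustration, and the @code{a*b} pattern is anchored with @code{^} and
@code{$} here only to keep the selection unambiguous.

```shell
# Sample input; the contents are made up for this illustration.
lines='main(void)
 main()
#define X
aab
b
0123456789A'

# ^main.*(.*) : the line must START with "main"; " main()" is rejected.
main_hits=$(printf '%s\n' "$lines" | sed -n '/^main.*(.*)/p')

# ^# : lines beginning with "#".
hash_hits=$(printf '%s\n' "$lines" | sed -n '/^#/p')

# a*b, anchored: zero or more "a"s followed by a single "b".
star_hits=$(printf '%s\n' "$lines" | sed -n '/^a*b$/p')

# .\{9\}A$ : nine characters followed by "A" at the end of the line.
nine_hits=$(printf '%s\n' "$lines" | sed -n '/.\{9\}A$/p')

# ^\(.*\)\n\1$ : two equal halves separated by a newline (N joins two lines).
pair_hits=$(printf 'abc\nabc\n' | sed -n 'N;/^\(.*\)\n\1$/p')

printf '%s\n' "$main_hits" "$hash_hits" "$star_hits" "$nine_hits" "$pair_hits"
```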
    876 
    877 
    878 
    879 @node Common Commands
    880 @section Often-Used Commands
    881 
    882 If you use @command{sed} at all, you will quite likely want to know
    883 these commands.
    884 
    885 @table @code
    886 @item #
    887 [No addresses allowed.]
    888 
    889 @findex # (comments)
    890 @cindex Comments, in scripts
    891 The @code{#} character begins a comment;
    892 the comment continues until the next newline.
    893 
    894 @cindex Portability, comments
    895 If you are concerned about portability, be aware that
    896 some implementations of @command{sed} (which are not @sc{posix}
    897 conformant) may only support a single one-line comment,
    898 and then only when the very first character of the script is a @code{#}.
    899 
    900 @findex -n, forcing from within a script
    901 @cindex Caveat --- #n on first line
    902 Warning: if the first two characters of the @command{sed} script
    903 are @code{#n}, then the @option{-n} (no-autoprint) option is forced.
    904 If you want to put a comment in the first line of your script
    905 and that comment begins with the letter @samp{n}
    906 and you do not want this behavior,
    907 then be sure to either use a capital @samp{N},
    908 or place at least one space before the @samp{n}.
    909 
    910 @item q [@var{exit-code}]
    911 This command only accepts a single address.
    912 
    913 @findex q (quit) command
    914 @cindex @value{SSEDEXT}, returning an exit code
    915 @cindex Quitting
    916 Exit @command{sed} without processing any more commands or input.
    917 Note that the current pattern space is printed if auto-print is
not disabled with the @option{-n} option.  The ability to return
    919 an exit code from the @command{sed} script is a @value{SSED} extension.
    920 
    921 @item d
    922 @findex d (delete) command
    923 @cindex Text, deleting
    924 Delete the pattern space;
    925 immediately start next cycle.
    926 
    927 @item p
    928 @findex p (print) command
    929 @cindex Text, printing
    930 Print out the pattern space (to the standard output).
    931 This command is usually only used in conjunction with the @option{-n}
    932 command-line option.
    933 
    934 @item n
    935 @findex n (next-line) command
    936 @cindex Next input line, replace pattern space with
    937 @cindex Read next input line
    938 If auto-print is not disabled, print the pattern space,
    939 then, regardless, replace the pattern space with the next line of input.
    940 If there is no more input then @command{sed} exits without processing
    941 any more commands.
    942 
    943 @item @{ @var{commands} @}
    944 @findex @{@} command grouping
    945 @cindex Grouping commands
    946 @cindex Command groups
    947 A group of commands may be enclosed between
    948 @code{@{} and @code{@}} characters.
    949 This is particularly useful when you want a group of commands
    950 to be triggered by a single address (or address-range) match.
    951 
    952 @end table
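
As a rough illustration of the commands in this section, the sketch
below runs each of them over a five-line input.  It assumes a POSIX
shell; the variable names are arbitrary, and the behavior of @code{n}
on the last line is that of @acronym{GNU} @command{sed} outside
@env{POSIXLY_CORRECT} mode.

```shell
# Five numbered lines; made-up input for this illustration.
five=$(printf '1\n2\n3\n4\n5')

# q: quit after line 2 (the pattern space is still auto-printed).
quit=$(printf '%s\n' "$five" | sed '2q')

# d: delete lines 2 through 4.
deleted=$(printf '%s\n' "$five" | sed '2,4d')

# p with -n: print only line 3.
printed=$(printf '%s\n' "$five" | sed -n '3p')

# n then d: print a line, fetch the next one, delete it (keeps odd lines).
odd=$(printf '%s\n' "$five" | sed 'n;d')

# Grouping: on line 2, print and then quit, all under one address.
grouped=$(printf '%s\n' "$five" | sed -n '2{p;q}')

printf '%s\n' "$quit" "$deleted" "$printed" "$odd" "$grouped"
```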
    953 
    954 @node The "s" Command
    955 @section The @code{s} Command
    956 
    957 The syntax of the @code{s} (as in substitute) command is
    958 @samp{s/@var{regexp}/@var{replacement}/@var{flags}}.  The @code{/}
    959 characters may be uniformly replaced by any other single
    960 character within any given @code{s} command.  The @code{/}
    961 character (or whatever other character is used in its stead)
    962 can appear in the @var{regexp} or @var{replacement}
    963 only if it is preceded by a @code{\} character.
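
For instance, when the text to be matched contains slashes, choosing a
different delimiter avoids the escaping.  This is a small sketch
assuming a POSIX shell; the path and variable names are only examples.

```shell
# Default delimiter: every "/" inside the regexp or replacement needs a "\".
escaped=$(printf '/usr/bin\n' | sed 's/\/usr/\/opt/')

# Same substitution with "|" as the delimiter: no escaping required.
piped=$(printf '/usr/bin\n' | sed 's|/usr|/opt|')

# Any other single character works as well, for example a comma.
comma=$(printf '/usr/bin\n' | sed 's,/usr,/opt,')

printf '%s\n' "$escaped" "$piped" "$comma"
```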
    964 
    965 The @code{s} command is probably the most important in @command{sed}
    966 and has a lot of different options.  Its basic concept is simple:
    967 the @code{s} command attempts to match the pattern
    968 space against the supplied @var{regexp}; if the match is
    969 successful, then that portion of the pattern
    970 space which was matched is replaced with @var{replacement}.
    971 
    972 @cindex Backreferences, in regular expressions
    973 @cindex Parenthesized substrings
    974 The @var{replacement} can contain @code{\@var{n}} (@var{n} being
    975 a number from 1 to 9, inclusive) references, which refer to
    976 the portion of the match which is contained between the @var{n}th
    977 @code{\(} and its matching @code{\)}.
    978 Also, the @var{replacement} can contain unescaped @code{&}
    979 characters which reference the whole matched portion
    980 of the pattern space.
    981 @cindex @value{SSEDEXT}, case modifiers in @code{s} commands
    982 Finally, as a @value{SSED} extension, you can include a
    983 special sequence made of a backslash and one of the letters
    984 @code{L}, @code{l}, @code{U}, @code{u}, or @code{E}.
    985 The meaning is as follows:
    986 
    987 @table @code
    988 @item \L
    989 Turn the replacement
    990 to lowercase until a @code{\U} or @code{\E} is found,
    991 
    992 @item \l
    993 Turn the
    994 next character to lowercase,
    995 
    996 @item \U
    997 Turn the replacement to uppercase
    998 until a @code{\L} or @code{\E} is found,
    999 
   1000 @item \u
   1001 Turn the next character
   1002 to uppercase,
   1003 
   1004 @item \E
   1005 Stop case conversion started by @code{\L} or @code{\U}.
   1006 @end table
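
The sketch below exercises back references, @code{&}, and the case
conversion sequences in a replacement.  It assumes a POSIX shell and a
@command{sed} that supports the case-conversion and @code{\+}
extensions described in this manual; the input strings and variable
names are illustrative only.

```shell
# \1 and \2 refer to the first and second \( ... \) groups.
swapped=$(printf 'hello world\n' | sed 's/\(hello\) \(world\)/\2 \1/')

# An unescaped & stands for the whole matched text.
bracketed=$(printf 'hello world\n' | sed 's/world/[&]/')

# \U uppercases everything that follows in the replacement ...
upper=$(printf 'hello world\n' | sed 's/.*/\U&/')

# ... until \E stops the conversion.
partial=$(printf 'hello world\n' | sed 's/\(hello\) \(world\)/\U\1\E \2/')

# \u affects only the next character; here, the first letter of each word.
capitalized=$(printf 'hello world\n' | sed 's/[a-z]\+/\u&/g')

# \L is the lowercase counterpart of \U.
lower=$(printf 'HELLO\n' | sed 's/.*/\L&/')

printf '%s\n' "$swapped" "$bracketed" "$upper" "$partial" "$capitalized" "$lower"
```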
   1007 
   1008 To include a literal @code{\}, @code{&}, or newline in the final
   1009 replacement, be sure to precede the desired @code{\}, @code{&},
   1010 or newline in the @var{replacement} with a @code{\}.
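
A short sketch of these three escapes, assuming a POSIX shell (where a
single-quoted script is passed to @command{sed} unchanged); the
variable names are made up for the illustration.

```shell
# \& inserts a literal ampersand rather than the matched text.
amp=$(printf 'a b\n' | sed 's/ /\&/')

# \\ inserts a single literal backslash.
bs=$(printf 'a b\n' | sed 's/ /\\/')

# A backslash followed by an actual newline inserts a newline.
nl=$(printf 'a b\n' | sed 's/ /\
/')

printf '%s\n' "$amp" "$bs" "$nl"
```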
   1011 
   1012 @findex s command, option flags
   1013 @cindex Substitution of text, options
   1014 The @code{s} command can be followed by zero or more of the
   1015 following @var{flags}:
   1016 
   1017 @table @code
   1018 @item g
   1019 @cindex Global substitution
   1020 @cindex Replacing all text matching regexp in a line
   1021 Apply the replacement to @emph{all} matches to the @var{regexp},
   1022 not just the first.
   1023 
   1024 @item @var{number}
   1025 @cindex Replacing only @var{n}th match of regexp in a line
   1026 Only replace the @var{number}th match of the @var{regexp}.
   1027 
   1028 @cindex @acronym{GNU} extensions, @code{g} and @var{number} modifier interaction in @code{s} command
   1029 @cindex Mixing @code{g} and @var{number} modifiers in the @code{s} command
   1030 Note: the @sc{posix} standard does not specify what should happen
   1031 when you mix the @code{g} and @var{number} modifiers,
   1032 and currently there is no widely agreed upon meaning
   1033 across @command{sed} implementations.
   1034 For @value{SSED}, the interaction is defined to be:
   1035 ignore matches before the @var{number}th,
   1036 and then match and replace all matches from
   1037 the @var{number}th on.
   1038 
   1039 @item p
   1040 @cindex Text, printing after substitution
   1041 If the substitution was made, then print the new pattern space.
   1042 
   1043 Note: when both the @code{p} and @code{e} options are specified,
   1044 the relative ordering of the two produces very different results.
   1045 In general, @code{ep} (evaluate then print) is what you want,
   1046 but operating the other way round can be useful for debugging.
   1047 For this reason, the current version of @value{SSED} interprets
   1048 specially the presence of @code{p} options both before and after
   1049 @code{e}, printing the pattern space before and after evaluation,
   1050 while in general flags for the @code{s} command show their
   1051 effect just once.  This behavior, although documented, might
   1052 change in future versions.
   1053 
   1054 @item w @var{file-name}
   1055 @cindex Text, writing to a file after substitution
   1056 @cindex @value{SSEDEXT}, @file{/dev/stdout} file
   1057 @cindex @value{SSEDEXT}, @file{/dev/stderr} file
   1058 If the substitution was made, then write out the result to the named file.
   1059 As a @value{SSED} extension, two special values of @var{file-name} are
   1060 supported: @file{/dev/stderr}, which writes the result to the standard
   1061 error, and @file{/dev/stdout}, which writes to the standard
   1062 output.@footnote{This is equivalent to @code{p} unless the @option{-i}
   1063 option is being used.}
   1064 
   1065 @item e
   1066 @cindex Evaluate Bourne-shell commands, after substitution
   1067 @cindex Subprocesses
   1068 @cindex @value{SSEDEXT}, evaluating Bourne-shell commands
   1069 @cindex @value{SSEDEXT}, subprocesses
   1070 This command allows one to pipe input from a shell command
   1071 into pattern space.  If a substitution was made, the command
   1072 that is found in pattern space is executed and pattern space
   1073 is replaced with its output.  A trailing newline is suppressed;
   1074 results are undefined if the command to be executed contains
   1075 a @sc{nul} character.  This is a @value{SSED} extension.
   1076 
   1077 @item I
   1078 @itemx i
   1079 @cindex @acronym{GNU} extensions, @code{I} modifier
   1080 @cindex Case-insensitive matching
   1081 @ifset PERL
   1082 @cindex Perl-style regular expressions, case-insensitive
   1083 @end ifset
   1084 The @code{I} modifier to regular-expression matching is a @acronym{GNU}
   1085 extension which makes @command{sed} match @var{regexp} in a
   1086 case-insensitive manner.
   1087 
   1088 @item M
   1089 @itemx m
   1090 @cindex @value{SSEDEXT}, @code{M} modifier
   1091 @ifset PERL
   1092 @cindex Perl-style regular expressions, multiline
   1093 @end ifset
   1094 The @code{M} modifier to regular-expression matching is a @value{SSED}
   1095 extension which causes @code{^} and @code{$} to match respectively
   1096 (in addition to the normal behavior) the empty string after a newline,
   1097 and the empty string before a newline.  There are special character
   1098 sequences
   1099 @ifset PERL
   1100 (@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'}
   1101 in basic or extended regular expression modes)
   1102 @end ifset
   1103 @ifclear PERL
   1104 (@code{\`} and @code{\'})
   1105 @end ifclear
   1106 which always match the beginning or the end of the buffer.
   1107 @code{M} stands for @cite{multi-line}.
   1108 
   1109 @ifset PERL
   1110 @item S
   1111 @itemx s
   1112 @cindex @value{SSEDEXT}, @code{S} modifier
   1113 @cindex Perl-style regular expressions, single line
   1114 The @code{S} modifier to regular-expression matching is only valid
   1115 in Perl mode and specifies that the dot character (@code{.}) will
   1116 match the newline character too.  @code{S} stands for @cite{single-line}.
   1117 @end ifset
   1118 
   1119 @ifset PERL
   1120 @item X
   1121 @itemx x
   1122 @cindex @value{SSEDEXT}, @code{X} modifier
   1123 @cindex Perl-style regular expressions, extended
   1124 The @code{X} modifier to regular-expression matching is also
   1125 valid in Perl mode only.  If it is used, whitespace in the
   1126 pattern (other than in a character class) and
   1127 characters between a @kbd{#} outside a character class and the
   1128 next newline character are ignored. An escaping backslash
   1129 can be used to include a whitespace or @kbd{#} character as part
   1130 of the pattern.
   1131 @end ifset
   1132 @end table
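
The effect of the most common flags can be compared side by side.
The sketch below assumes a POSIX shell and @acronym{GNU} @command{sed}
(the combined @code{2g} form and the @code{I} and @code{M} flags are
extensions, as noted above); inputs and variable names are
illustrative.

```shell
# No flag: only the first match is replaced.
first=$(printf 'aaaa\n' | sed 's/a/b/')

# g: every match is replaced.
all=$(printf 'aaaa\n' | sed 's/a/b/g')

# number: only the second match is replaced.
second=$(printf 'aaaa\n' | sed 's/a/b/2')

# number and g together: the second match and all later ones.
from_second=$(printf 'aaaa\n' | sed 's/a/b/2g')

# p with -n: print only the lines where a substitution was made.
changed=$(printf 'xyz\nabc\n' | sed -n 's/a/b/p')

# I: case-insensitive matching.
nocase=$(printf 'Hello\n' | sed 's/hello/bye/I')

# M: ^ also matches after each embedded newline (N joins two lines first).
multi=$(printf 'ab\ncd\n' | sed 'N;s/^./X/Mg')

printf '%s\n' "$first" "$all" "$second" "$from_second" "$changed" "$nocase" "$multi"
```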
   1133 
   1134 
   1135 @node Other Commands
   1136 @section Less Frequently-Used Commands
   1137 
   1138 Though perhaps less frequently used than those in the previous
   1139 section, some very small yet useful @command{sed} scripts can be built with
   1140 these commands.
   1141 
   1142 @table @code
   1143 @item y/@var{source-chars}/@var{dest-chars}/
   1144 (The @code{/} characters may be uniformly replaced by
   1145 any other single character within any given @code{y} command.)
   1146 
   1147 @findex y (transliterate) command
   1148 @cindex Transliteration
   1149 Transliterate any characters in the pattern space which match
   1150 any of the @var{source-chars} with the corresponding character
   1151 in @var{dest-chars}.
   1152 
   1153 Instances of the @code{/} (or whatever other character is used in its stead),
   1154 @code{\}, or newlines can appear in the @var{source-chars} or @var{dest-chars}
lists, provided that each instance is escaped by a @code{\}.
   1156 The @var{source-chars} and @var{dest-chars} lists @emph{must}
   1157 contain the same number of characters (after de-escaping).
   1158 
   1159 @item a\
   1160 @itemx @var{text}
   1161 @cindex @value{SSEDEXT}, two addresses supported by most commands
   1162 As a @acronym{GNU} extension, this command accepts two addresses.
   1163 
   1164 @findex a (append text lines) command
   1165 @cindex Appending text after a line
   1166 @cindex Text, appending
   1167 Queue the lines of text which follow this command
   1168 (each but the last ending with a @code{\},
   1169 which are removed from the output)
   1170 to be output at the end of the current cycle,
   1171 or when the next input line is read.
   1172 
   1173 Escape sequences in @var{text} are processed, so you should
   1174 use @code{\\} in @var{text} to print a single backslash.
   1175 
As a @acronym{GNU} extension, if anything other than a whitespace-@code{\}
sequence appears between the @code{a} and the newline, then the text of this line,
   1178 starting at the first non-whitespace character after the @code{a},
   1179 is taken as the first line of the @var{text} block.
   1180 (This enables a simplification in scripting a one-line add.)
   1181 This extension also works with the @code{i} and @code{c} commands.
   1182 
   1183 @item i\
   1184 @itemx @var{text}
   1185 @cindex @value{SSEDEXT}, two addresses supported by most commands
   1186 As a @acronym{GNU} extension, this command accepts two addresses.
   1187 
   1188 @findex i (insert text lines) command
   1189 @cindex Inserting text before a line
   1190 @cindex Text, insertion
   1191 Immediately output the lines of text which follow this command
   1192 (each but the last ending with a @code{\},
   1193 which are removed from the output).
   1194 
   1195 @item c\
   1196 @itemx @var{text}
   1197 @findex c (change to text lines) command
   1198 @cindex Replacing selected lines with other text
   1199 Delete the lines matching the address or address-range,
   1200 and output the lines of text which follow this command
   1201 (each but the last ending with a @code{\},
   1202 which are removed from the output)
   1203 in place of the last line
   1204 (or in place of each line, if no addresses were specified).
   1205 A new cycle is started after this command is done,
   1206 since the pattern space will have been deleted.
   1207 
   1208 @item =
   1209 @cindex @value{SSEDEXT}, two addresses supported by most commands
   1210 As a @acronym{GNU} extension, this command accepts two addresses.
   1211 
   1212 @findex = (print line number) command
   1213 @cindex Printing line number
   1214 @cindex Line number, printing
   1215 Print out the current input line number (with a trailing newline).
   1216 
   1217 @item l @var{n}
   1218 @findex l (list unambiguously) command
   1219 @cindex List pattern space
   1220 @cindex Printing text unambiguously
   1221 @cindex Line length, setting
   1222 @cindex @value{SSEDEXT}, setting line length
   1223 Print the pattern space in an unambiguous form:
   1224 non-printable characters (and the @code{\} character)
   1225 are printed in C-style escaped form; long lines are split,
   1226 with a trailing @code{\} character to indicate the split;
   1227 the end of each line is marked with a @code{$}.
   1228 
   1229 @var{n} specifies the desired line-wrap length;
   1230 a length of 0 (zero) means to never wrap long lines.  If omitted,
   1231 the default as specified on the command line is used.  The @var{n}
   1232 parameter is a @value{SSED} extension.
   1233 
   1234 @item r @var{filename}
   1235 @cindex @value{SSEDEXT}, two addresses supported by most commands
   1236 As a @acronym{GNU} extension, this command accepts two addresses.
   1237 
   1238 @findex r (read file) command
   1239 @cindex Read text from a file
   1240 @cindex @value{SSEDEXT}, @file{/dev/stdin} file
   1241 Queue the contents of @var{filename} to be read and
   1242 inserted into the output stream at the end of the current cycle,
   1243 or when the next input line is read.
   1244 Note that if @var{filename} cannot be read, it is treated as
   1245 if it were an empty file, without any error indication.
   1246 
   1247 As a @value{SSED} extension, the special value @file{/dev/stdin}
   1248 is supported for the file name, which reads the contents of the
   1249 standard input.
   1250 
   1251 @item w @var{filename}
   1252 @findex w (write file) command
   1253 @cindex Write to a file
   1254 @cindex @value{SSEDEXT}, @file{/dev/stdout} file
   1255 @cindex @value{SSEDEXT}, @file{/dev/stderr} file
   1256 Write the pattern space to @var{filename}.
As a @value{SSED} extension, two special values of @var{filename} are
   1258 supported: @file{/dev/stderr}, which writes the result to the standard
   1259 error, and @file{/dev/stdout}, which writes to the standard
   1260 output.@footnote{This is equivalent to @code{p} unless the @option{-i}
   1261 option is being used.}
   1262 
   1263 The file will be created (or truncated) before the
   1264 first input line is read; all @code{w} commands
   1265 (including instances of @code{w} flag on successful @code{s} commands)
   1266 which refer to the same @var{filename} are output without
   1267 closing and reopening the file.
   1268 
   1269 @item D
   1270 @findex D (delete first line) command
   1271 @cindex Delete first line from pattern space
   1272 Delete text in the pattern space up to the first newline.
   1273 If any text is left, restart cycle with the resultant
   1274 pattern space (without reading a new line of input),
   1275 otherwise start a normal new cycle.
   1276 
   1277 @item N
   1278 @findex N (append Next line) command
   1279 @cindex Next input line, append to pattern space
   1280 @cindex Append next input line to pattern space
   1281 Add a newline to the pattern space,
   1282 then append the next line of input to the pattern space.
   1283 If there is no more input then @command{sed} exits without processing
   1284 any more commands.
   1285 
   1286 @item P
   1287 @findex P (print first line) command
   1288 @cindex Print first line from pattern space
   1289 Print out the portion of the pattern space up to the first newline.
   1290 
   1291 @item h
   1292 @findex h (hold) command
   1293 @cindex Copy pattern space into hold space
   1294 @cindex Replace hold space with copy of pattern space
   1295 @cindex Hold space, copying pattern space into
   1296 Replace the contents of the hold space with the contents of the pattern space.
   1297 
   1298 @item H
   1299 @findex H (append Hold) command
   1300 @cindex Append pattern space to hold space
   1301 @cindex Hold space, appending from pattern space
   1302 Append a newline to the contents of the hold space,
   1303 and then append the contents of the pattern space to that of the hold space.
   1304 
   1305 @item g
   1306 @findex g (get) command
   1307 @cindex Copy hold space into pattern space
   1308 @cindex Replace pattern space with copy of hold space
   1309 @cindex Hold space, copy into pattern space
   1310 Replace the contents of the pattern space with the contents of the hold space.
   1311 
   1312 @item G
   1313 @findex G (appending Get) command
   1314 @cindex Append hold space to pattern space
   1315 @cindex Hold space, appending to pattern space
   1316 Append a newline to the contents of the pattern space,
   1317 and then append the contents of the hold space to that of the pattern space.
   1318 
   1319 @item x
   1320 @findex x (eXchange) command
   1321 @cindex Exchange hold space with pattern space
   1322 @cindex Hold space, exchange with pattern space
   1323 Exchange the contents of the hold and pattern spaces.
   1324 
   1325 @end table
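
A few of these commands are easier to follow in small scripts.  The
sketch below assumes a POSIX shell and @acronym{GNU} @command{sed} (the
one-line @code{a} and @code{i} forms are the extension described
above); the inputs and variable names are invented for the
illustration, and the line-reversing and duplicate-removing scripts are
well-known idioms rather than anything specific to this manual.

```shell
# y: transliterate lowercase letters to uppercase, character by character.
upper=$(printf 'hello\n' |
  sed 'y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/')

# =: print the number of the last line, i.e. count the input lines.
count=$(printf 'a\nb\n' | sed -n '$=')

# l: list the pattern space unambiguously (tab shown as \t, end marked by $).
listed=$(printf 'a\tb\n' | sed -n 'l')

# i and a (one-line GNU form): output text before and after each line.
wrapped=$(printf 'x\n' | sed -e 'i before' -e 'a after')

# G, h: append the hold space to each line after the first and save the
# result, so that the last line ends up holding the input in reverse order.
reversed=$(printf '1\n2\n3\n' | sed -n '1!G;h;$p')

# N, P, D: keep two lines in the pattern space; print the first one only
# when it differs from the second, then drop it and restart the cycle.
unique=$(printf 'a\na\nb\n' | sed '$!N;/^\(.*\)\n\1$/!P;D')

printf '%s\n' "$upper" "$count" "$listed" "$wrapped" "$reversed" "$unique"
```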
   1326 
   1327 
   1328 @node Programming Commands
   1329 @section Commands for @command{sed} gurus
   1330 
   1331 In most cases, use of these commands indicates that you are
   1332 probably better off programming in something like @command{awk}
   1333 or Perl.  But occasionally one is committed to sticking
   1334 with @command{sed}, and these commands can enable one to write
   1335 quite convoluted scripts.
   1336 
   1337 @cindex Flow of control in scripts
   1338 @table @code
   1339 @item : @var{label}
   1340 [No addresses allowed.]
   1341 
   1342 @findex : (label) command
   1343 @cindex Labels, in scripts
   1344 Specify the location of @var{label} for branch commands.
   1345 In all other respects, a no-op.
   1346 
   1347 @item b @var{label}
   1348 @findex b (branch) command
   1349 @cindex Branch to a label, unconditionally
   1350 @cindex Goto, in scripts
   1351 Unconditionally branch to @var{label}.
   1352 The @var{label} may be omitted, in which case the next cycle is started.
   1353 
   1354 @item t @var{label}
   1355 @findex t (test and branch if successful) command
   1356 @cindex Branch to a label, if @code{s///} succeeded
   1357 @cindex Conditional branch
   1358 Branch to @var{label} only if there has been a successful @code{s}ubstitution
   1359 since the last input line was read or conditional branch was taken.
   1360 The @var{label} may be omitted, in which case the next cycle is started.
   1361 
   1362 @end table
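
As an illustration of labels and branches, the sketch below builds two
small loops.  It assumes a POSIX shell; the label names @code{again}
and @code{join}, the inputs, and the variable names are arbitrary.
Each label is given in its own @option{-e} fragment so that it is
clearly terminated.

```shell
# t: repeat the substitution for as long as it keeps succeeding,
# squeezing every run of spaces down to a single space.
squeezed=$(printf 'a    b\n' |
  sed -e ':again' -e 's/  / /' -e 't again')

# b: unconditional loop that appends lines until the last one is reached,
# then replaces the embedded newlines with commas.
joined=$(printf 'a\nb\nc\n' |
  sed -e ':join' -e '$!N' -e '$!b join' -e 's/\n/,/g')

printf '%s\n' "$squeezed" "$joined"
```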
   1363 
   1364 @node Extended Commands
   1365 @section Commands Specific to @value{SSED}
   1366 
   1367 These commands are specific to @value{SSED}, so you
must use them with care and only when you are sure that
sacrificing portability is acceptable.  They allow you to check
   1370 for @value{SSED} extensions or to do tasks that are required
   1371 quite often, yet are unsupported by standard @command{sed}s.
   1372 
   1373 @table @code
   1374 @item e [@var{command}]
   1375 @findex e (evaluate) command
   1376 @cindex Evaluate Bourne-shell commands
   1377 @cindex Subprocesses
   1378 @cindex @value{SSEDEXT}, evaluating Bourne-shell commands
   1379 @cindex @value{SSEDEXT}, subprocesses
   1380 This command allows one to pipe input from a shell command
   1381 into pattern space.  Without parameters, the @code{e} command
   1382 executes the command that is found in pattern space and
   1383 replaces the pattern space with the output; a trailing newline
   1384 is suppressed.
   1385 
   1386 If a parameter is specified, instead, the @code{e} command
   1387 interprets it as a command and sends its output to the output stream
   1388 (like @code{r} does).  The command can run across multiple
lines, all but the last ending with a backslash.
   1390 
   1391 In both cases, the results are undefined if the command to be
   1392 executed contains a @sc{nul} character.
   1393 
   1394 @item L @var{n}
   1395 @findex L (fLow paragraphs) command
   1396 @cindex Reformat pattern space
   1397 @cindex Reformatting paragraphs
   1398 @cindex @value{SSEDEXT}, reformatting paragraphs
   1399 @cindex @value{SSEDEXT}, @code{L} command
   1400 This @value{SSED} extension fills and joins lines in pattern space
   1401 to produce output lines of (at most) @var{n} characters, like
   1402 @code{fmt} does; if @var{n} is omitted, the default as specified
   1403 on the command line is used.  This command is considered a failed
experiment and unless there is enough demand (which seems unlikely)
   1405 will be removed in future versions.
   1406 
   1407 @ignore
   1408 Blank lines, spaces between words, and indentation are
   1409 preserved in the output; successive input lines with different
   1410 indentation are not joined; tabs are expanded to 8 columns.
   1411 
   1412 If the pattern space contains multiple lines, they are joined, but
   1413 since the pattern space usually contains a single line, the behavior
   1414 of a simple @code{L;d} script is the same as @samp{fmt -s} (i.e.,
   1415 it does not join short lines to form longer ones).
   1416 
   1417 @var{n} specifies the desired line-wrap length; if omitted,
   1418 the default as specified on the command line is used.
   1419 @end ignore
   1420 
   1421 @item Q [@var{exit-code}]
   1422 This command only accepts a single address.
   1423 
   1424 @findex Q (silent Quit) command
   1425 @cindex @value{SSEDEXT}, quitting silently
   1426 @cindex @value{SSEDEXT}, returning an exit code
   1427 @cindex Quitting
   1428 This command is the same as @code{q}, but will not print the
   1429 contents of pattern space.  Like @code{q}, it provides the
   1430 ability to return an exit code to the caller.
   1431 
   1432 This command can be useful because the only alternative ways
   1433 to accomplish this apparently trivial function are to use
   1434 the @option{-n} option (which can unnecessarily complicate
your script) or to resort to the following snippet, which
   1436 wastes time by reading the whole file without any visible effect:
   1437 
   1438 @example
   1439 :eat
   1440 $d       @i{@r{Quit silently on the last line}}
   1441 N        @i{@r{Read another line, silently}}
   1442 g        @i{@r{Overwrite pattern space each time to save memory}}
   1443 b eat
   1444 @end example
   1445 
   1446 @item R @var{filename}
   1447 @findex R (read line) command
   1448 @cindex Read text from a file
   1449 @cindex @value{SSEDEXT}, reading a file a line at a time
   1450 @cindex @value{SSEDEXT}, @code{R} command
   1451 @cindex @value{SSEDEXT}, @file{/dev/stdin} file
   1452 Queue a line of @var{filename} to be read and
   1453 inserted into the output stream at the end of the current cycle,
   1454 or when the next input line is read.
   1455 Note that if @var{filename} cannot be read, or if its end is
   1456 reached, no line is appended, without any error indication.
   1457 
   1458 As with the @code{r} command, the special value @file{/dev/stdin}
   1459 is supported for the file name, which reads a line from the
   1460 standard input.
   1461 
   1462 @item T @var{label}
   1463 @findex T (test and branch if failed) command
   1464 @cindex @value{SSEDEXT}, branch if @code{s///} failed
   1465 @cindex Branch to a label, if @code{s///} failed
   1466 @cindex Conditional branch
   1467 Branch to @var{label} only if there have been no successful
   1468 @code{s}ubstitutions since the last input line was read or
   1469 conditional branch was taken. The @var{label} may be omitted,
   1470 in which case the next cycle is started.
   1471 
   1472 @item v @var{version}
   1473 @findex v (version) command
   1474 @cindex @value{SSEDEXT}, checking for their presence
   1475 @cindex Requiring @value{SSED}
   1476 This command does nothing, but makes @command{sed} fail if
   1477 @value{SSED} extensions are not supported, simply because other
   1478 versions of @command{sed} do not implement it.  In addition, you
   1479 can specify the version of @command{sed} that your script
   1480 requires, such as @code{4.0.5}.  The default is @code{4.0}
   1481 because that is the first version that implemented this command.
   1482 
   1483 This command enables all @value{SSEDEXT} even if
   1484 @env{POSIXLY_CORRECT} is set in the environment.
   1485 
   1486 @item W @var{filename}
   1487 @findex W (write first line) command
   1488 @cindex Write first line to a file
   1489 @cindex @value{SSEDEXT}, writing first line to a file
   1490 Write to the given filename the portion of the pattern space up to
   1491 the first newline.  Everything said under the @code{w} command about
   1492 file handling holds here too.
   1493 
   1494 @item z
   1495 @findex z (Zap) command
   1496 @cindex @value{SSEDEXT}, emptying pattern space
   1497 @cindex Emptying pattern space
   1498 This command empties the content of pattern space.  It is
   1499 usually the same as @samp{s/.*//}, but is more efficient
   1500 and works in the presence of invalid multibyte sequences
   1501 in the input stream.  @sc{posix} mandates that such sequences
   1502 are @emph{not} matched by @samp{.}, so that there is no portable
   1503 way to clear @command{sed}'s buffers in the middle of the
   1504 script in most multibyte locales (including UTF-8 locales).
   1505 @end table
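
As an informal illustration of some of the commands above, the following
shell lines exercise @code{Q}, @code{T}, @code{z} and @code{e}.  This is
only a sketch: it assumes that @acronym{GNU} @command{sed} is installed
under the name @command{sed}, and the input strings and variable names
are made up for the example.

```shell
# Assumption: "sed" is GNU sed; the variable names are illustrative only.

# Q: quit on line 2 without printing it, so only line 1 is output.
q_out=$(printf 'one\ntwo\nthree\n' | sed 2Q)

# Q with an exit code: the exit status of sed becomes 5.
q_status=0
printf 'one\ntwo\n' | sed 2Q5 >/dev/null || q_status=$?

# T without a label: skip the rest of the script when s did not match.
t_out=$(printf 'foo\nbar\n' | sed -e 's/foo/FOO/' -e T -e 's/$/ (changed)/')

# z: empty pattern space, then show that it is really empty.
z_out=$(printf 'abc\n' | sed -e z -e 's/^$/EMPTY/')

# e without a parameter: run the command found in pattern space
# and replace pattern space with its output.
e_out=$(printf 'echo hello\n' | sed e)

printf '%s\n' "$q_out" "$q_status" "$t_out" "$z_out" "$e_out"
```

The separate @option{-e} options keep each command on its own script
line, which avoids relying on how a label is terminated.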
   1506 
   1507 @node Escapes
   1508 @section @acronym{GNU} Extensions for Escapes in Regular Expressions
   1509 
   1510 @cindex @acronym{GNU} extensions, special escapes
   1511 Until this chapter, we have only encountered escapes of the form
   1512 @samp{\^}, which tell @command{sed} not to interpret the circumflex
   1513 as a special character, but rather to take it literally.  For
   1514 example, @samp{\*} matches a single asterisk rather than zero
   1515 or more backslashes.
   1516 
   1517 @cindex @code{POSIXLY_CORRECT} behavior, escapes
   1518 This chapter introduces another kind of escape@footnote{All
   1519 the escapes introduced here are @acronym{GNU}
   1520 extensions, with the exception of @code{\n}.  In basic regular
   1521 expression mode, setting @code{POSIXLY_CORRECT} disables them inside
   1522 bracket expressions.}---that
   1523 is, escapes that are applied to a character or sequence of characters
   1524 that ordinarily are taken literally, and that @command{sed} replaces
   1525 with a special character.  This provides a way
   1526 of encoding non-printable characters in patterns in a visible manner.
There is no restriction on the appearance of non-printing characters
in a @command{sed} script, but when a script is being prepared in the
shell or by text editing, it is usually easier to use one of
the following escape sequences than the binary character it
represents.

The list of these escapes is:
   1534 
   1535 @table @code
   1536 @item \a
   1537 Produces or matches a @sc{bel} character, that is an ``alert'' (@sc{ascii} 7).
   1538 
   1539 @item \f
   1540 Produces or matches a form feed (@sc{ascii} 12).
   1541 
   1542 @item \n
   1543 Produces or matches a newline (@sc{ascii} 10).
   1544 
   1545 @item \r
   1546 Produces or matches a carriage return (@sc{ascii} 13).
   1547 
   1548 @item \t
   1549 Produces or matches a horizontal tab (@sc{ascii} 9).
   1550 
   1551 @item \v
Produces or matches a so-called ``vertical tab'' (@sc{ascii} 11).
   1553 
   1554 @item \c@var{x}
   1555 Produces or matches @kbd{@sc{Control}-@var{x}}, where @var{x} is
   1556 any character.  The precise effect of @samp{\c@var{x}} is as follows:
   1557 if @var{x} is a lower case letter, it is converted to upper case.
   1558 Then bit 6 of the character (hex 40) is inverted.  Thus @samp{\cz} becomes
   1559 hex 1A, but @samp{\c@{} becomes hex 3B, while @samp{\c;} becomes hex 7B.
   1560 
   1561 @item \d@var{xxx}
   1562 Produces or matches a character whose decimal @sc{ascii} value is @var{xxx}.
   1563 
   1564 @item \o@var{xxx}
   1565 @ifset PERL
   1566 @item \@var{xxx}
   1567 @end ifset
   1568 Produces or matches a character whose octal @sc{ascii} value is @var{xxx}.
   1569 @ifset PERL
   1570 The syntax without the @code{o} is active in Perl mode, while the one
   1571 with the @code{o} is active in the normal or extended @sc{posix} regular
   1572 expression modes.
   1573 @end ifset
   1574 
   1575 @item \x@var{xx}
   1576 Produces or matches a character whose hexadecimal @sc{ascii} value is @var{xx}.
   1577 @end table
   1578 
   1579 @samp{\b} (backspace) was omitted because of the conflict with
   1580 the existing ``word boundary'' meaning.
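
The escapes listed above can be tried out from the shell.  The lines
below are a small sketch, assuming @acronym{GNU} @command{sed} is
installed as @command{sed}; the sample inputs and variable names are
invented for the illustration.

```shell
# Assumption: "sed" is GNU sed; inputs and names are illustrative only.

# \t in the replacement produces a horizontal tab.
tab_out=$(printf 'a b\n' | sed 's/ /\t/')

# \x42 matches the character with hexadecimal value 42 ("B").
hex_out=$(printf 'ABC\n' | sed 's/\x42/-/')

# \o103 matches the character with octal value 103 ("C").
oct_out=$(printf 'ABC\n' | sed 's/\o103/-/')

# \d065 matches the character with decimal value 65 ("A").
dec_out=$(printf 'ABC\n' | sed 's/\d065/-/')

# \cI matches Control-I, which is the horizontal tab (hex 09).
ctl_out=$(printf 'a\tb\n' | sed 's/\cI/-/')

printf '%s\n' "$tab_out" "$hex_out" "$oct_out" "$dec_out" "$ctl_out"
```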
   1581 
   1582 Other escapes match a particular character class and are valid only in
   1583 regular expressions:
   1584 
   1585 @table @code
   1586 @item \w
   1587 Matches any ``word'' character.  A ``word'' character is any
   1588 letter or digit or the underscore character.
   1589 
   1590 @item \W
   1591 Matches any ``non-word'' character.
   1592 
   1593 @item \b
Matches a word boundary; that is, it matches if the character
   1595 to the left is a ``word'' character and the character to the
   1596 right is a ``non-word'' character, or vice-versa.
   1597 
   1598 @item \B
Matches everywhere but on a word boundary; that is, it matches
   1600 if the character to the left and the character to the right
   1601 are either both ``word'' characters or both ``non-word''
   1602 characters.
   1603 
   1604 @item \`
   1605 Matches only at the start of pattern space.  This is different
   1606 from @code{^} in multi-line mode.
   1607 
   1608 @item \'
   1609 Matches only at the end of pattern space.  This is different
   1610 from @code{$} in multi-line mode.
   1611 
   1612 @ifset PERL
   1613 @item \G
   1614 Match only at the start of pattern space or, when doing a global
   1615 substitution using the @code{s///g} command and option, at
   1616 the end-of-match position of the prior match.  For example,
   1617 @samp{s/\Ga/Z/g} will change an initial run of @code{a}s to
a run of @code{Z}s.
   1619 @end ifset
   1620 @end table
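
A few of these escapes are shown in action below.  This is a sketch
under the assumption that @acronym{GNU} @command{sed} is installed as
@command{sed}; the sample text is made up.  The last two commands
contrast @code{^} with the start-of-pattern-space escape in multi-line
mode (the @code{M} flag of the @code{s} command), after @code{N} has
joined two input lines.

```shell
# Assumption: "sed" is GNU sed; inputs and names are illustrative only.

# \w matches word characters, so each run of them becomes one X.
w_out=$(printf 'hello world\n' | sed 's/\w\+/X/g')

# \b anchors at word boundaries, so "concat" is left alone.
b_out=$(printf 'cat concat cat\n' | sed 's/\bcat\b/dog/g')

# In multi-line mode ^ also matches right after an embedded newline ...
caret_out=$(printf 'one\ntwo\n' | sed -e N -e 's/^two/2/M')

# ... while \` still matches only at the start of pattern space,
# so this substitution finds nothing to replace.
tick_out=$(printf 'one\ntwo\n' | sed -e N -e 's/\`two/2/M')

printf '%s\n' "$w_out" "$b_out" "$caret_out" "$tick_out"
```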
   1621 
   1622 @node Examples
   1623 @chapter Some Sample Scripts
   1624 
   1625 Here are some @command{sed} scripts to guide you in the art of mastering
   1626 @command{sed}.
   1627 
   1628 @menu
   1629 Some exotic examples:
   1630 * Centering lines::
   1631 * Increment a number::
   1632 * Rename files to lower case::
   1633 * Print bash environment::
   1634 * Reverse chars of lines::
   1635 
   1636 Emulating standard utilities:
   1637 * tac::                             Reverse lines of files
   1638 * cat -n::                          Numbering lines
   1639 * cat -b::                          Numbering non-blank lines
   1640 * wc -c::                           Counting chars
   1641 * wc -w::                           Counting words
   1642 * wc -l::                           Counting lines
   1643 * head::                            Printing the first lines
   1644 * tail::                            Printing the last lines
   1645 * uniq::                            Make duplicate lines unique
   1646 * uniq -d::                         Print duplicated lines of input
   1647 * uniq -u::                         Remove all duplicated lines
   1648 * cat -s::                          Squeezing blank lines
   1649 @end menu
   1650 
   1651 @node Centering lines
   1652 @section Centering Lines
   1653 
This script centers all lines of a file within a width of 80 columns.
   1655 To change that width, the number in @code{\@{@dots{}\@}} must be
   1656 replaced, and the number of added spaces also must be changed.
   1657 
   1658 Note how the buffer commands are used to separate parts in
   1659 the regular expressions to be matched---this is a common
   1660 technique.
   1661 
   1662 @c start-------------------------------------------
   1663 @example
   1664 #!/usr/bin/sed -f
   1665 
   1666 @group
   1667 # Put 80 spaces in the buffer
   1668 1 @{
   1669   x
   1670   s/^$/          /
   1671   s/^.*$/&&&&&&&&/
   1672   x
   1673 @}
   1674 @end group
   1675 
   1676 @group
   1677 # del leading and trailing spaces
   1678 y/@kbd{tab}/ /
   1679 s/^ *//
   1680 s/ *$//
   1681 @end group
   1682 
   1683 @group
   1684 # add a newline and 80 spaces to end of line
   1685 G
   1686 @end group
   1687 
   1688 @group
   1689 # keep first 81 chars (80 + a newline)
   1690 s/^\(.\@{81\@}\).*$/\1/
   1691 @end group
   1692 
   1693 @group
   1694 # \2 matches half of the spaces, which are moved to the beginning
   1695 s/^\(.*\)\n\(.*\)\2/\2\1/
   1696 @end group
   1697 @end example
   1698 @c end---------------------------------------------
   1699 
   1700 @node Increment a number
   1701 @section Increment a Number
   1702 
   1703 This script is one of a few that demonstrate how to do arithmetic
   1704 in @command{sed}.  This is indeed possible,@footnote{@command{sed} guru Greg
   1705 Ubben wrote an implementation of the @command{dc} @sc{rpn} calculator!
   1706 It is distributed together with sed.} but must be done manually.
   1707 
To increment a number you just add 1 to the last digit, replacing
it by the following digit.  There is one exception: when the digit
is a nine, the previous digits must also be incremented until you
reach one that is not a nine.
   1712 
This solution by Bruno Haible is very clever because
   1714 it uses a single buffer; if you don't have this limitation, the
   1715 algorithm used in @ref{cat -n, Numbering lines}, is faster.
   1716 It works by replacing trailing nines with an underscore, then
   1717 using multiple @code{s} commands to increment the last digit,
   1718 and then again substituting underscores with zeros.
   1719 
   1720 @c start-------------------------------------------
   1721 @example
   1722 #!/usr/bin/sed -f
   1723 
   1724 /[^0-9]/ d
   1725 
   1726 @group
# replace all trailing 9s by _ (any other character except digits could
# be used)
   1729 :d
   1730 s/9\(_*\)$/_\1/
   1731 td
   1732 @end group
   1733 
   1734 @group
   1735 # incr last digit only.  The first line adds a most-significant
   1736 # digit of 1 if we have to add a digit.
   1737 #
   1738 # The @code{tn} commands are not necessary, but make the thing
   1739 # faster
   1740 @end group
   1741 
   1742 @group
   1743 s/^\(_*\)$/1\1/; tn
   1744 s/8\(_*\)$/9\1/; tn
   1745 s/7\(_*\)$/8\1/; tn
   1746 s/6\(_*\)$/7\1/; tn
   1747 s/5\(_*\)$/6\1/; tn
   1748 s/4\(_*\)$/5\1/; tn
   1749 s/3\(_*\)$/4\1/; tn
   1750 s/2\(_*\)$/3\1/; tn
   1751 s/1\(_*\)$/2\1/; tn
   1752 s/0\(_*\)$/1\1/; tn
   1753 @end group
   1754 
   1755 @group
   1756 :n
   1757 y/_/0/
   1758 @end group
   1759 @end example
   1760 @c end---------------------------------------------
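
To see the script at work without saving it to a file, it can be kept
in a shell variable and passed to @command{sed} directly.  The lines
below are only a usage sketch: the variable names and sample numbers
are made up, and the commands are the ones from the script above.

```shell
# The sed commands are those of the script above; "increment" and
# "inc_out" are illustrative names.
increment='
/[^0-9]/ d
:d
s/9\(_*\)$/_\1/
td
s/^\(_*\)$/1\1/; tn
s/8\(_*\)$/9\1/; tn
s/7\(_*\)$/8\1/; tn
s/6\(_*\)$/7\1/; tn
s/5\(_*\)$/6\1/; tn
s/4\(_*\)$/5\1/; tn
s/3\(_*\)$/4\1/; tn
s/2\(_*\)$/3\1/; tn
s/1\(_*\)$/2\1/; tn
s/0\(_*\)$/1\1/; tn
:n
y/_/0/
'

# 199 becomes 1__ after the loop, then 2__, and finally 200.
inc_out=$(printf '%s\n' 41 199 999 | sed "$increment")
printf '%s\n' "$inc_out"
```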
   1761 
   1762 @node Rename files to lower case
   1763 @section Rename Files to Lower Case
   1764 
This is a pretty strange use of @command{sed}.  We transform text
into shell commands, then just feed them to the shell.
   1767 Don't worry, even worse hacks are done when using @command{sed}; I have
   1768 seen a script converting the output of @command{date} into a @command{bc}
   1769 program!
   1770 
   1771 The main body of this is the @command{sed} script, which remaps the name
from lower to upper (or vice versa) and even checks
whether the remapped name is the same as the original name.
   1774 Note how the script is parameterized using shell
   1775 variables and proper quoting.
   1776 
   1777 @c start-------------------------------------------
   1778 @example
   1779 @group
   1780 #! /bin/sh
   1781 # rename files to lower/upper case... 
   1782 #
   1783 # usage: 
   1784 #    move-to-lower * 
   1785 #    move-to-upper * 
   1786 # or
   1787 #    move-to-lower -R .
   1788 #    move-to-upper -R .
   1789 #
   1790 @end group
   1791 
   1792 @group
   1793 help()
   1794 @{
   1795         cat << eof
Usage: $0 [-n] [-R] [-h] files...
   1797 @end group
   1798 
   1799 @group
   1800 -n      do nothing, only see what would be done
   1801 -R      recursive (use find)
   1802 -h      this message
   1803 files   files to remap to lower case
   1804 @end group
   1805 
   1806 @group
   1807 Examples:
   1808        $0 -n *        (see if everything is ok, then...)
   1809        $0 *
   1810 @end group
   1811 
   1812        $0 -R .
   1813 
   1814 @group
   1815 eof
   1816 @}
   1817 @end group
   1818 
   1819 @group
   1820 apply_cmd='sh'
   1821 finder='echo "$@@" | tr " " "\n"'
   1822 files_only=
   1823 @end group
   1824 
   1825 @group
   1826 while :
   1827 do
   1828     case "$1" in 
   1829         -n) apply_cmd='cat' ;;
   1830         -R) finder='find "$@@" -type f';;
   1831         -h) help ; exit 1 ;;
   1832         *) break ;;
   1833     esac
   1834     shift
   1835 done
   1836 @end group
   1837 
   1838 @group
   1839 if [ -z "$1" ]; then
        echo "Usage: $0 [-h] [-n] [-R] files..."
   1841         exit 1
   1842 fi
   1843 @end group
   1844 
   1845 @group
   1846 LOWER='abcdefghijklmnopqrstuvwxyz'
   1847 UPPER='ABCDEFGHIJKLMNOPQRSTUVWXYZ'
   1848 @end group
   1849 
   1850 @group
   1851 case `basename $0` in
   1852         *upper*) TO=$UPPER; FROM=$LOWER ;;
   1853         *)       FROM=$UPPER; TO=$LOWER ;;
   1854 esac
   1855 @end group
   1856 
   1857 eval $finder | sed -n '
   1858 
   1859 @group
   1860 # remove all trailing slashes
   1861 s/\/*$//
   1862 @end group
   1863 
   1864 @group
   1865 # add ./ if there is no path, only a filename
   1866 /\//! s/^/.\//
   1867 @end group
   1868 
   1869 @group
   1870 # save path+filename
   1871 h
   1872 @end group
   1873 
   1874 @group
   1875 # remove path
   1876 s/.*\///
   1877 @end group
   1878 
   1879 @group
   1880 # do conversion only on filename
   1881 y/'$FROM'/'$TO'/
   1882 @end group
   1883 
   1884 @group
   1885 # now line contains original path+file, while
   1886 # hold space contains the new filename
   1887 x
   1888 @end group
   1889 
   1890 @group
   1891 # add converted file name to line, which now contains
   1892 # path/file-name\nconverted-file-name
   1893 G
   1894 @end group
   1895 
   1896 @group
   1897 # check if converted file name is equal to original file name,
# if it is, do not print anything
   1899 /^.*\/\(.*\)\n\1/b
   1900 @end group
   1901 
   1902 @group
# now, transform path/fromfile\ntofile into
   1904 # mv path/fromfile path/tofile and print it
   1905 s/^\(.*\/\)\(.*\)\n\(.*\)$/mv "\1\2" "\1\3"/p
   1906 @end group
   1907 
   1908 ' | $apply_cmd
   1909 @end example
   1910 @c end---------------------------------------------
   1911 
   1912 @node Print bash environment
   1913 @section Print @command{bash} Environment
   1914 
   1915 This script strips the definition of the shell functions
   1916 from the output of the @command{set} Bourne-shell command.
   1917 
   1918 @c start-------------------------------------------
   1919 @example
   1920 #!/bin/sh
   1921 
   1922 @group
   1923 set | sed -n '
   1924 :x
   1925 @end group
   1926 
   1927 @group
   1928 @ifinfo
   1929 # if no occurrence of "=()" print and load next line
   1930 @end ifinfo
   1931 @ifnotinfo
   1932 # if no occurrence of @samp{=()} print and load next line
   1933 @end ifnotinfo
   1934 /=()/! @{ p; b; @}
   1935 / () $/! @{ p; b; @}
   1936 @end group
   1937 
   1938 @group
   1939 # possible start of functions section
   1940 # save the line in case this is a var like FOO="() "
   1941 h
   1942 @end group
   1943 
   1944 @group
   1945 # if the next line has a brace, we quit because
   1946 # nothing comes after functions
   1947 n
   1948 /^@{/ q
   1949 @end group
   1950 
   1951 @group
   1952 # print the old line
   1953 x; p
   1954 @end group
   1955 
   1956 @group
   1957 # work on the new line now
   1958 x; bx
   1959 '
   1960 @end group
   1961 @end example
   1962 @c end---------------------------------------------
   1963 
   1964 @node Reverse chars of lines
   1965 @section Reverse Characters of Lines
   1966 
   1967 This script can be used to reverse the position of characters
   1968 in lines.  The technique moves two characters at a time, hence
   1969 it is faster than more intuitive implementations.
   1970 
   1971 Note the @code{tx} command before the definition of the label.
   1972 This is often needed to reset the flag that is tested by
   1973 the @code{t} command.
   1974 
   1975 Imaginative readers will find uses for this script.  An example
   1976 is reversing the output of @command{banner}.@footnote{This requires
   1977 another script to pad the output of banner; for example
   1978 
   1979 @example
   1980 #! /bin/sh
   1981 
   1982 banner -w $1 $2 $3 $4 |
   1983   sed -e :a -e '/^.\@{0,'$1'\@}$/ @{ s/$/ /; ba; @}' |
   1984   ~/sedscripts/reverseline.sed
   1985 @end example
   1986 }
   1987 
   1988 @c start-------------------------------------------
   1989 @example
   1990 #!/usr/bin/sed -f
   1991 
   1992 /../! b
   1993 
   1994 @group
   1995 # Reverse a line.  Begin embedding the line between two newlines
   1996 s/^.*$/\
   1997 &\
   1998 /
   1999 @end group
   2000 
   2001 @group
# Move first character to the end.  The regexp matches until
   2003 # there are zero or one characters between the markers
   2004 tx
   2005 :x
   2006 s/\(\n.\)\(.*\)\(.\n\)/\3\2\1/
   2007 tx
   2008 @end group
   2009 
   2010 @group
   2011 # Remove the newline markers
   2012 s/\n//g
   2013 @end group
   2014 @end example
   2015 @c end---------------------------------------------
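
A quick way to try the script is to keep it in a shell variable.  The
following is a usage sketch only; the variable names and the sample
lines are invented, and the commands are those of the script above.
A one-character line is passed through unchanged by the first command.

```shell
# The sed commands are those of the script above; "reverse" and
# "rev_out" are illustrative names.
reverse='
/../! b
s/^.*$/\
&\
/
tx
:x
s/\(\n.\)\(.*\)\(.\n\)/\3\2\1/
tx
s/\n//g
'

rev_out=$(printf 'hello\na\nabcd\n' | sed "$reverse")
printf '%s\n' "$rev_out"
```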
   2016 
   2017 @node tac
   2018 @section Reverse Lines of Files
   2019 
   2020 This one begins a series of totally useless (yet interesting)
   2021 scripts emulating various Unix commands.  This, in particular,
   2022 is a @command{tac} workalike.
   2023 
   2024 Note that on implementations other than @acronym{GNU} @command{sed}
   2025 @ifset PERL
   2026 and @value{SSED}
   2027 @end ifset
   2028 this script might easily overflow internal buffers.
   2029 
   2030 @c start-------------------------------------------
   2031 @example
   2032 #!/usr/bin/sed -nf
   2033 
# reverse all lines of input, i.e. the first line becomes the last, ...
   2035 
   2036 @group
   2037 # from the second line, the buffer (which contains all previous lines)
   2038 # is *appended* to current line, so, the order will be reversed
   2039 1! G
   2040 @end group
   2041 
   2042 @group
   2043 # on the last line we're done -- print everything
   2044 $ p
   2045 @end group
   2046 
   2047 @group
   2048 # store everything on the buffer again
   2049 h
   2050 @end group
   2051 @end example
   2052 @c end---------------------------------------------
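
The same three commands also fit on one command line.  The sketch
below uses made-up input and an illustrative variable name.

```shell
# 1!G appends the lines seen so far, $p prints on the last line,
# h saves the reversed text for the next cycle.
tac_out=$(printf '1\n2\n3\n' | sed -n '1!G;$p;h')
printf '%s\n' "$tac_out"
```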
   2053 
   2054 @node cat -n
   2055 @section Numbering Lines
   2056 
   2057 This script replaces @samp{cat -n}; in fact it formats its output
   2058 exactly like @acronym{GNU} @command{cat} does.
   2059 
Of course this is completely useless, for two reasons:  first,
   2061 because somebody else did it in C, second, because the following
   2062 Bourne-shell script could be used for the same purpose and would
   2063 be much faster:
   2064 
   2065 @c start-------------------------------------------
   2066 @example
   2067 @group
   2068 #! /bin/sh
   2069 sed -e "=" $@@ | sed -e '
   2070   s/^/      /
   2071   N
   2072   s/^ *\(......\)\n/\1  /
   2073 '
   2074 @end group
   2075 @end example
   2076 @c end---------------------------------------------
   2077 
   2078 It uses @command{sed} to print the line number, then groups lines two
   2079 by two using @code{N}.  Of course, this script does not teach as much as
   2080 the one presented below.
   2081 
   2082 The algorithm used for incrementing uses both buffers, so the line
   2083 is printed as soon as possible and then discarded.  The number
   2084 is split so that changing digits go in a buffer and unchanged ones go
   2085 in the other; the changed digits are modified in a single step
   2086 (using a @code{y} command).  The line number for the next line
   2087 is then composed and stored in the hold space, to be used in the
   2088 next iteration.
   2089 
   2090 @c start-------------------------------------------
   2091 @example
   2092 #!/usr/bin/sed -nf
   2093 
   2094 @group
   2095 # Prime the pump on the first line
   2096 x
   2097 /^$/ s/^.*$/1/
   2098 @end group
   2099 
   2100 @group
   2101 # Add the correct line number before the pattern
   2102 G
   2103 h
   2104 @end group
   2105 
   2106 @group
   2107 # Format it and print it
   2108 s/^/      /
   2109 s/^ *\(......\)\n/\1  /p
   2110 @end group
   2111 
   2112 @group
   2113 # Get the line number from hold space; add a zero
   2114 # if we're going to add a digit on the next line
   2115 g
   2116 s/\n.*$//
   2117 /^9*$/ s/^/0/
   2118 @end group
   2119 
   2120 @group
   2121 # separate changing/unchanged digits with an x
   2122 s/.9*$/x&/
   2123 @end group
   2124 
   2125 @group
   2126 # keep changing digits in hold space
   2127 h
   2128 s/^.*x//
   2129 y/0123456789/1234567890/
   2130 x
   2131 @end group
   2132 
   2133 @group
   2134 # keep unchanged digits in pattern space
   2135 s/x.*$//
   2136 @end group
   2137 
   2138 @group
   2139 # compose the new number, remove the newline implicitly added by G
   2140 G
   2141 s/\n//
   2142 h
   2143 @end group
   2144 @end example
   2145 @c end---------------------------------------------
   2146 
   2147 @node cat -b
   2148 @section Numbering Non-blank Lines
   2149 
   2150 Emulating @samp{cat -b} is almost the same as @samp{cat -n}---we only
   2151 have to select which lines are to be numbered and which are not.
   2152 
   2153 The part that is common to this script and the previous one is
   2154 not commented to show how important it is to comment @command{sed}
   2155 scripts properly...
   2156 
   2157 @c start-------------------------------------------
   2158 @example
   2159 #!/usr/bin/sed -nf
   2160 
   2161 @group
   2162 /^$/ @{
   2163   p
   2164   b
   2165 @}
   2166 @end group
   2167 
   2168 @group
   2169 # Same as cat -n from now
   2170 x
   2171 /^$/ s/^.*$/1/
   2172 G
   2173 h
   2174 s/^/      /
   2175 s/^ *\(......\)\n/\1  /p
   2176 x
   2177 s/\n.*$//
   2178 /^9*$/ s/^/0/
   2179 s/.9*$/x&/
   2180 h
   2181 s/^.*x//
   2182 y/0123456789/1234567890/
   2183 x
   2184 s/x.*$//
   2185 G
   2186 s/\n//
   2187 h
   2188 @end group
   2189 @end example
   2190 @c end---------------------------------------------
   2191 
   2192 @node wc -c
   2193 @section Counting Characters
   2194 
   2195 This script shows another way to do arithmetic with @command{sed}.
   2196 In this case we have to add possibly large numbers, so implementing
   2197 this by successive increments would not be feasible (and possibly
   2198 even more complicated to contrive than this script).
   2199 
   2200 The approach is to map numbers to letters, kind of an abacus
   2201 implemented with @command{sed}.  @samp{a}s are units, @samp{b}s are
   2202 tens and so on: we simply add the number of characters
   2203 on the current line as units, and then propagate the carry
   2204 to tens, hundreds, and so on.
   2205 
   2206 As usual, running totals are kept in hold space.
   2207 
   2208 On the last line, we convert the abacus form back to decimal.
   2209 For the sake of variety, this is done with a loop rather than
   2210 with some 80 @code{s} commands@footnote{Some implementations
have a limit of 199 commands per script.}: first we
   2212 convert units, removing @samp{a}s from the number; then we
   2213 rotate letters so that tens become @samp{a}s, and so on
   2214 until no more letters remain.
   2215 
   2216 @c start-------------------------------------------
   2217 @example
   2218 #!/usr/bin/sed -nf
   2219 
   2220 @group
   2221 # Add n+1 a's to hold space (+1 is for the newline)
   2222 s/./a/g
   2223 H
   2224 x
   2225 s/\n/a/
   2226 @end group
   2227 
   2228 @group
   2229 # Do the carry.  The t's and b's are not necessary,
   2230 # but they do speed up the thing
   2231 t a
   2232 : a;  s/aaaaaaaaaa/b/g; t b; b done
   2233 : b;  s/bbbbbbbbbb/c/g; t c; b done
   2234 : c;  s/cccccccccc/d/g; t d; b done
   2235 : d;  s/dddddddddd/e/g; t e; b done
   2236 : e;  s/eeeeeeeeee/f/g; t f; b done
   2237 : f;  s/ffffffffff/g/g; t g; b done
   2238 : g;  s/gggggggggg/h/g; t h; b done
   2239 : h;  s/hhhhhhhhhh//g
   2240 @end group
   2241 
   2242 @group
   2243 : done
   2244 $! @{
   2245   h
   2246   b
   2247 @}
   2248 @end group
   2249 
   2250 # On the last line, convert back to decimal
   2251 
   2252 @group
   2253 : loop
   2254 /a/! s/[b-h]*/&0/
   2255 s/aaaaaaaaa/9/
   2256 s/aaaaaaaa/8/
   2257 s/aaaaaaa/7/
   2258 s/aaaaaa/6/
   2259 s/aaaaa/5/
   2260 s/aaaa/4/
   2261 s/aaa/3/
   2262 s/aa/2/
   2263 s/a/1/
   2264 @end group
   2265 
   2266 @group
   2267 : next
   2268 y/bcdefgh/abcdefg/
   2269 /[a-h]/ b loop
   2270 p
   2271 @end group
   2272 @end example
   2273 @c end---------------------------------------------
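
The two halves of the technique can be watched separately on a tiny
input.  The lines below are a sketch with invented variable names:
first twelve @samp{a}s are carried into the abacus form, then the
conversion loop from the script above turns that form back into
decimal digits.

```shell
# Carry: ten units become one ten, so twelve a's turn into "baa".
abacus=$(printf 'aaaaaaaaaaaa\n' | sed 's/aaaaaaaaaa/b/g')

# Conversion loop taken from the script above ("to_decimal" is an
# illustrative name): convert the units, rotate the letters, repeat.
to_decimal='
: loop
/a/! s/[b-h]*/&0/
s/aaaaaaaaa/9/
s/aaaaaaaa/8/
s/aaaaaaa/7/
s/aaaaaa/6/
s/aaaaa/5/
s/aaaa/4/
s/aaa/3/
s/aa/2/
s/a/1/
y/bcdefgh/abcdefg/
/[a-h]/ b loop
'

decimal=$(printf '%s\n' "$abacus" | sed "$to_decimal")
printf '%s %s\n' "$abacus" "$decimal"
```

Here @samp{baa} first becomes @samp{b2}, the rotation turns it into
@samp{a2}, and the second pass yields @samp{12}.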
   2274 
   2275 @node wc -w
   2276 @section Counting Words
   2277 
   2278 This script is almost the same as the previous one, once each
   2279 of the words on the line is converted to a single @samp{a}
   2280 (in the previous script each letter was changed to an @samp{a}).
   2281 
   2282 It is interesting that real @command{wc} programs have optimized
   2283 loops for @samp{wc -c}, so they are much slower at counting
words than at counting characters.  This script's bottleneck,
   2285 instead, is arithmetic, and hence the word-counting one
   2286 is faster (it has to manage smaller numbers).
   2287 
   2288 Again, the common parts are not commented to show the importance
   2289 of commenting @command{sed} scripts.
   2290 
   2291 @c start-------------------------------------------
   2292 @example
   2293 #!/usr/bin/sed -nf
   2294 
   2295 @group
   2296 # Convert words to a's
   2297 s/[ @kbd{tab}][ @kbd{tab}]*/ /g
   2298 s/^/ /
   2299 s/ [^ ][^ ]*/a /g
   2300 s/ //g
   2301 @end group
   2302 
   2303 @group
   2304 # Append them to hold space
   2305 H
   2306 x
   2307 s/\n//
   2308 @end group
   2309 
   2310 @group
   2311 # From here on it is the same as in wc -c.
   2312 /aaaaaaaaaa/! bx;   s/aaaaaaaaaa/b/g
   2313 /bbbbbbbbbb/! bx;   s/bbbbbbbbbb/c/g
   2314 /cccccccccc/! bx;   s/cccccccccc/d/g
   2315 /dddddddddd/! bx;   s/dddddddddd/e/g
   2316 /eeeeeeeeee/! bx;   s/eeeeeeeeee/f/g
   2317 /ffffffffff/! bx;   s/ffffffffff/g/g
   2318 /gggggggggg/! bx;   s/gggggggggg/h/g
   2319 s/hhhhhhhhhh//g
   2320 :x
   2321 $! @{ h; b; @}
   2322 :y
   2323 /a/! s/[b-h]*/&0/
   2324 s/aaaaaaaaa/9/
   2325 s/aaaaaaaa/8/
   2326 s/aaaaaaa/7/
   2327 s/aaaaaa/6/
   2328 s/aaaaa/5/
   2329 s/aaaa/4/
   2330 s/aaa/3/
   2331 s/aa/2/
   2332 s/a/1/
   2333 y/bcdefgh/abcdefg/
   2334 /[a-h]/ by
   2335 p
   2336 @end group
   2337 @end example
   2338 @c end---------------------------------------------
   2339 
   2340 @node wc -l
   2341 @section Counting Lines
   2342 
   2343 No strange things are done now, because @command{sed} gives us
   2344 @samp{wc -l} functionality for free!!! Look:
   2345 
   2346 @c start-------------------------------------------
   2347 @example
   2348 @group
   2349 #!/usr/bin/sed -nf
   2350 $=
   2351 @end group
   2352 @end example
   2353 @c end---------------------------------------------
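
The sketch below runs the same command from the shell on made-up
input.  It also shows one difference from @samp{wc -l} worth keeping
in mind: on empty input there is no last line, so nothing is printed
at all.

```shell
# $= prints the line number of the last line, i.e. the line count.
count=$(printf 'a\nb\nc\n' | sed -n '$=')

# With no input lines the command never runs and the output is empty.
empty=$(printf '' | sed -n '$=')

printf 'count=%s empty=[%s]\n' "$count" "$empty"
```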
   2354 
   2355 @node head
   2356 @section Printing the First Lines
   2357 
   2358 This script is probably the simplest useful @command{sed} script.
   2359 It displays the first 10 lines of input; the number of displayed
   2360 lines is right before the @code{q} command.
   2361 
   2362 @c start-------------------------------------------
   2363 @example
   2364 @group
   2365 #!/usr/bin/sed -f
   2366 10q
   2367 @end group
   2368 @end example
   2369 @c end---------------------------------------------
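
Since the line count is simply the address in front of @code{q}, it
can be supplied from a shell variable.  This is a small sketch; the
variable names and the input are made up.

```shell
# Print the first n lines; n is an illustrative shell variable.
n=3
head_out=$(printf '1\n2\n3\n4\n5\n' | sed "$n"q)
printf '%s\n' "$head_out"
```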
   2370 
   2371 @node tail
   2372 @section Printing the Last Lines
   2373 
   2374 Printing the last @var{n} lines rather than the first is more complex
   2375 but indeed possible.  @var{n} is encoded in the second line, before
   2376 the bang character.
   2377 
   2378 This script is similar to the @command{tac} script in that it keeps the
   2379 final output in the hold space and prints it at the end:
   2380 
   2381 @c start-------------------------------------------
   2382 @example
   2383 #!/usr/bin/sed -nf
   2384 
   2385 @group
   2386 1! @{; H; g; @}
   2387 1,10 !s/[^\n]*\n//
   2388 $p
   2389 h
   2390 @end group
   2391 @end example
   2392 @c end---------------------------------------------
   2393 
Mainly, the script keeps a window of 10 lines and slides it
   2395 by adding a line and deleting the oldest (the substitution command
   2396 on the second line works like a @code{D} command but does not
   2397 restart the loop).
   2398 
   2399 The ``sliding window'' technique is a very powerful way to write
   2400 efficient and complex @command{sed} scripts, because commands like
   2401 @code{P} would require a lot of work if implemented manually.
   2402 
   2403 To introduce the technique, which is fully demonstrated in the
   2404 rest of this chapter and is based on the @code{N}, @code{P}
   2405 and @code{D} commands, here is an implementation of @command{tail}
   2406 using a simple ``sliding window.''
   2407 
   2408 This looks complicated but in fact the working is the same as
   2409 the last script: after we have kicked in the appropriate number
   2410 of lines, however, we stop using the hold space to keep inter-line
   2411 state, and instead use @code{N} and @code{D} to slide pattern
   2412 space by one line:
   2413 
   2414 @c start-------------------------------------------
   2415 @example
   2416 #!/usr/bin/sed -f
   2417 
   2418 @group
   2419 1h
   2420 2,10 @{; H; g; @}
   2421 $q
   2422 1,9d
   2423 N
   2424 D
   2425 @end group
   2426 @end example
   2427 @c end---------------------------------------------
   2428 
Note how the first, second and fourth lines are inactive after
the first ten lines of input.  After that, all the script does
is exit on the last line of input, append the next input
line to pattern space, and remove the first line.
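
To watch the window slide, the second script can be fed a short numbered
input.  This is a sketch only; @code{tail10} is a hypothetical wrapper
name, and the script is passed as separate @option{-e} fragments rather
than as a file.

```shell
# tail10 is a hypothetical wrapper around the second script above.
tail10() {
    sed -e '1h' -e '2,10{H;g;}' -e '$q' -e '1,9d' -e 'N' -e 'D'
}

# Twelve input lines: only lines 3 through 12 should survive.
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 | tail10
```

With fewer than ten lines of input the window never fills, and the
whole input is printed when the last line is reached.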
   2433 
   2434 @node uniq
   2435 @section Make Duplicate Lines Unique
   2436 
   2437 This is an example of the art of using the @code{N}, @code{P}
   2438 and @code{D} commands, probably the most difficult to master.
   2439 
   2440 @c start-------------------------------------------
   2441 @example
   2442 @group
   2443 #!/usr/bin/sed -f
   2444 h
   2445 @end group
   2446 
   2447 @group
   2448 :b
   2449 # On the last line, print and exit
   2450 $b
   2451 N
   2452 /^\(.*\)\n\1$/ @{
    # The two lines are identical.  Undo the effect of
    # the N command.
   2455     g
   2456     bb
   2457 @}
   2458 @end group
   2459 
   2460 @group
   2461 # If the @code{N} command had added the last line, print and exit
   2462 $b
   2463 @end group
   2464 
   2465 @group
   2466 # The lines are different; print the first and go
   2467 # back working on the second.
   2468 P
   2469 D
   2470 @end group
   2471 @end example
   2472 @c end---------------------------------------------
   2473 
As you can see, we maintain a 2-line window using @code{P} and @code{D}.
   2475 This technique is often used in advanced @command{sed} scripts.
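
The script can be exercised from a shell as follows.  This is a sketch
under the assumption that the script is passed inline instead of through
a @samp{#!} line; @code{sed_uniq} is a made-up function name and the
comments have been stripped.

```shell
# sed_uniq is a hypothetical name for the script above.
sed_uniq() {
    sed '
h
:b
$b
N
/^\(.*\)\n\1$/ {
    g
    bb
}
$b
P
D
'
}

# Runs of identical adjacent lines collapse to a single line each.
printf 'a\na\nb\nb\nb\nc\n' | sed_uniq
```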
   2476 
   2477 @node uniq -d
   2478 @section Print Duplicated Lines of Input
   2479 
   2480 This script prints only duplicated lines, like @samp{uniq -d}.
   2481 
   2482 @c start-------------------------------------------
   2483 @example
   2484 #!/usr/bin/sed -nf
   2485 
   2486 @group
   2487 $b
   2488 N
   2489 /^\(.*\)\n\1$/ @{
   2490     # Print the first of the duplicated lines
   2491     s/.*\n//
   2492     p
   2493 @end group
   2494 
   2495 @group
   2496     # Loop until we get a different line
   2497     :b
   2498     $b
   2499     N
   2500     /^\(.*\)\n\1$/ @{
   2501         s/.*\n//
   2502         bb
   2503     @}
   2504 @}
   2505 @end group
   2506 
   2507 @group
   2508 # The last line cannot be followed by duplicates
   2509 $b
   2510 @end group
   2511 
   2512 @group
   2513 # Found a different one.  Leave it alone in the pattern space
   2514 # and go back to the top, hunting its duplicates
   2515 D
   2516 @end group
   2517 @end example
   2518 @c end---------------------------------------------
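
A quick way to try the script, sketched with a hypothetical function
name (@code{sed_uniq_d}) and the comments removed:

```shell
# sed_uniq_d is a hypothetical name for the script above.
sed_uniq_d() {
    sed -n '
$b
N
/^\(.*\)\n\1$/ {
    s/.*\n//
    p
    :b
    $b
    N
    /^\(.*\)\n\1$/ {
        s/.*\n//
        bb
    }
}
$b
D
'
}

# Only lines that occur more than once in a row are printed, once each.
printf 'a\na\nb\nc\nc\n' | sed_uniq_d
```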
   2519 
   2520 @node uniq -u
   2521 @section Remove All Duplicated Lines
   2522 
   2523 This script prints only unique lines, like @samp{uniq -u}.
   2524 
   2525 @c start-------------------------------------------
   2526 @example
   2527 #!/usr/bin/sed -f
   2528 
   2529 @group
   2530 # Search for a duplicate line --- until that, print what you find.
   2531 $b
   2532 N
   2533 /^\(.*\)\n\1$/ ! @{
   2534     P
   2535     D
   2536 @}
   2537 @end group
   2538 
   2539 @group
   2540 :c
   2541 # Got two equal lines in pattern space.  At the
   2542 # end of the file we simply exit
   2543 $d
   2544 @end group
   2545 
   2546 @group
   2547 # Else, we keep reading lines with @code{N} until we
   2548 # find a different one
   2549 s/.*\n//
   2550 N
   2551 /^\(.*\)\n\1$/ @{
   2552     bc
   2553 @}
   2554 @end group
   2555 
   2556 @group
   2557 # Remove the last instance of the duplicate line
   2558 # and go back to the top
   2559 D
   2560 @end group
   2561 @end example
   2562 @c end---------------------------------------------
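
The same kind of check works here.  Again this is only a sketch, with
@code{sed_uniq_u} as an invented function name:

```shell
# sed_uniq_u is a hypothetical name for the script above.
sed_uniq_u() {
    sed '
$b
N
/^\(.*\)\n\1$/ ! {
    P
    D
}
:c
$d
s/.*\n//
N
/^\(.*\)\n\1$/ {
    bc
}
D
'
}

# Lines that are repeated on adjacent lines are dropped entirely.
printf 'a\na\nb\nc\nc\n' | sed_uniq_u
```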
   2563 
   2564 @node cat -s
   2565 @section Squeezing Blank Lines
   2566 
   2567 As a final example, here are three scripts, of increasing complexity
   2568 and speed, that implement the same function as @samp{cat -s}, that is
   2569 squeezing blank lines.
   2570 
   2571 The first leaves a blank line at the beginning and end if there are
   2572 some already.
   2573 
   2574 @c start-------------------------------------------
   2575 @example
   2576 #!/usr/bin/sed -f
   2577 
   2578 @group
   2579 # on empty lines, join with next
   2580 # Note there is a star in the regexp
   2581 :x
   2582 /^\n*$/ @{
   2583 N
   2584 bx
   2585 @}
   2586 @end group
   2587 
   2588 @group
# now, squeeze all leading '\n' into one.  The group repeats zero
# or more times, so a line without a leading newline is unchanged
s/^\(\n\)*/\1/
   2593 @end group
   2594 @end example
   2595 @c end---------------------------------------------
   2596 
This one is a bit more complex and removes all empty lines
at the beginning.  It does leave a single blank line at the end
if one was there.
   2600 
   2601 @c start-------------------------------------------
   2602 @example
   2603 #!/usr/bin/sed -f
   2604 
   2605 @group
   2606 # delete all leading empty lines
   2607 1,/^./@{
   2608 /./!d
   2609 @}
   2610 @end group
   2611 
   2612 @group
   2613 # on an empty line we remove it and all the following
   2614 # empty lines, but one
   2615 :x
   2616 /./!@{
   2617 N
   2618 s/^\n$//
   2619 tx
   2620 @}
   2621 @end group
   2622 @end example
   2623 @c end---------------------------------------------
   2624 
   2625 This removes leading and trailing blank lines.  It is also the
   2626 fastest.  Note that loops are completely done with @code{n} and
@code{b}, without relying on @command{sed} to restart
the script automatically at the end of a line.
   2629 
   2630 @c start-------------------------------------------
   2631 @example
   2632 #!/usr/bin/sed -nf
   2633 
   2634 @group
   2635 # delete all (leading) blanks
   2636 /./!d
   2637 @end group
   2638 
   2639 @group
   2640 # get here: so there is a non empty
   2641 :x
   2642 # print it
   2643 p
   2644 # get next
   2645 n
   2646 # got chars? print it again, etc... 
   2647 /./bx
   2648 @end group
   2649 
   2650 @group
   2651 # no, don't have chars: got an empty line
   2652 :z
   2653 # get next, if last line we finish here so no trailing
   2654 # empty lines are written
   2655 n
   2656 # also empty? then ignore it, and get next... this will
   2657 # remove ALL empty lines
   2658 /./!bz
   2659 @end group
   2660 
   2661 @group
   2662 # all empty lines were deleted/ignored, but we have a non empty.  As
   2663 # what we want to do is to squeeze, insert a blank line artificially
   2664 i\
   2665 @end group
   2666 
   2667 bx
   2668 @end example
   2669 @c end---------------------------------------------
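
As an illustration, the second of the three scripts can be run on input
that starts with blank lines and has a run of blank lines in the middle.
This is a sketch; @code{squeeze_blank} is a hypothetical function name.

```shell
# squeeze_blank is a hypothetical name for the second script above.
squeeze_blank() {
    sed '
1,/^./{
/./!d
}
:x
/./!{
N
s/^\n$//
tx
}
'
}

# Two leading blank lines are removed; the two blank lines between
# "a" and "b" are squeezed into one.
printf '\n\na\n\n\nb\n' | squeeze_blank
```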
   2670 
   2671 @node Limitations
   2672 @chapter @value{SSED}'s Limitations and Non-limitations
   2673 
   2674 @cindex @acronym{GNU} extensions, unlimited line length
   2675 @cindex Portability, line length limitations
   2676 For those who want to write portable @command{sed} scripts,
   2677 be aware that some implementations have been known to
   2678 limit line lengths (for the pattern and hold spaces)
   2679 to be no more than 4000 bytes.
   2680 The @sc{posix} standard specifies that conforming @command{sed}
   2681 implementations shall support at least 8192 byte line lengths.
   2682 @value{SSED} has no built-in limit on line length;
   2683 as long as it can @code{malloc()} more (virtual) memory,
   2684 you can feed or construct lines as long as you like.
   2685 
   2686 However, recursion is used to handle subpatterns and indefinite
   2687 repetition.  This means that the available stack space may limit
   2688 the size of the buffer that can be processed by certain patterns.
   2689 
   2690 @ifset PERL
   2691 There are some size limitations in the regular expression
   2692 matcher but it is hoped that they will never in practice
   2693 be relevant.  The maximum length of a compiled pattern
   2694 is 65539 (sic) bytes.  All values in repeating quantifiers
   2695 must be less than 65536.  The maximum nesting depth of
   2696 all parenthesized subpatterns, including capturing and
   2697 non-capturing subpatterns@footnote{The
   2698 distinction is meaningful when referring to Perl-style
   2699 regular expressions.}, assertions, and other types of
   2700 subpattern, is 200.
   2701 
   2702 Also, @value{SSED} recognizes the @sc{posix} syntax
   2703 @code{[.@var{ch}.]} and @code{[=@var{ch}=]}
   2704 where @var{ch} is a ``collating element'', but these
   2705 are not supported, and an error is given if they are
   2706 encountered.
   2707 
   2708 Here are a few distinctions between the real Perl-style
   2709 regular expressions and those that @option{-R} recognizes.
   2710 
   2711 @enumerate
   2712 @item
Lookahead assertions do not allow repeat quantifiers after them.
Perl permits them, but they do not mean what you
   2715 might think. For example, @samp{(?!a)@{3@}} does not assert that the
   2716 next three characters are not @samp{a}. It just asserts three times that the
   2717 next character is not @samp{a} --- a waste of time and nothing else.
   2718 
   2719 @item
Capturing subpatterns that occur inside negative lookahead
assertions are counted, but their entries are treated
as empty in the second half of an @code{s} command.
   2723 Perl sets its numerical variables from any such patterns
   2724 that are matched before the assertion fails to match
   2725 something (thereby succeeding), but only if the negative
   2726 lookahead assertion contains just one branch.
   2727 
   2728 @item
   2729 The following Perl escape sequences are not supported:
   2730 @samp{\l}, @samp{\u}, @samp{\L}, @samp{\U}, @samp{\E},
   2731 @samp{\Q}. In fact these are implemented by Perl's general
   2732 string-handling and are not part of its pattern matching engine.
   2733 
   2734 @item
   2735 The Perl @samp{\G} assertion is not supported as it is not
   2736 relevant to single pattern matches.
   2737 
   2738 @item
   2739 Fairly obviously, @value{SSED} does not support the @samp{(?@{code@})}
   2740 and @samp{(?p@{code@})} constructions. However, there is some experimental
   2741 support for recursive patterns using the non-Perl item @samp{(?R)}.
   2742 
   2743 @item
   2744 There are at the time of writing some oddities in Perl
   2745 5.005_02 concerned with the settings of captured strings
   2746 when part of a pattern is repeated. For example, matching
   2747 @samp{aba} against the pattern @samp{/^(a(b)?)+$/} sets
   2748 @samp{$2}@footnote{@samp{$2} would be @samp{\2} in @value{SSED}.}
   2749 to the value @samp{b}, but matching @samp{aabbaa}
   2750 against @samp{/^(aa(bb)?)+$/} leaves @samp{$2}
   2751 unset.  However, if the pattern is changed to
   2752 @samp{/^(aa(b(b))?)+$/} then @samp{$2} (and @samp{$3}) are set.
   2753 In Perl 5.004 @samp{$2} is set in both cases, and that is also
   2754 true of @value{SSED}.
   2755 
   2756 @item
   2757 Another as yet unresolved discrepancy is that in Perl
   2758 5.005_02 the pattern @samp{/^(a)?(?(1)a|b)+$/} matches
   2759 the string @samp{a}, whereas in @value{SSED} it does not.
However, in both Perl and @value{SSED} @samp{/^(a)?a/} matched
against @samp{a} leaves @samp{$1} unset.
   2762 @end enumerate
   2763 @end ifset
   2764 
   2765 @node Other Resources
   2766 @chapter Other Resources for Learning About @command{sed}
   2767 
   2768 @cindex Additional reading about @command{sed}
   2769 In addition to several books that have been written about @command{sed}
   2770 (either specifically or as chapters in books which discuss
   2771 shell programming), one can find out more about @command{sed}
   2772 (including suggestions of a few books) from the FAQ
   2773 for the @code{sed-users} mailing list, available from:
   2774 @display
   2775 @uref{http://sed.sourceforge.net/sedfaq.html}
   2776 @end display
   2777 
   2778 Also of interest are
   2779 @uref{http://www.student.northpark.edu/pemente/sed/index.htm}
   2780 and @uref{http://sed.sf.net/grabbag},
   2781 which include @command{sed} tutorials and other @command{sed}-related goodies.
   2782 
The @code{sed-users} mailing list itself is maintained by Sven Guckes.
   2784 To subscribe, visit @uref{http://groups.yahoo.com} and search
   2785 for the @code{sed-users} mailing list.
   2786 
   2787 @node Reporting Bugs
   2788 @chapter Reporting Bugs
   2789 
   2790 @cindex Bugs, reporting
   2791 Email bug reports to @email{bonzini@@gnu.org}.
   2792 Be sure to include the word ``sed'' somewhere in the @code{Subject:} field.
   2793 Also, please include the output of @samp{sed --version} in the body
   2794 of your report if at all possible.
   2795 
   2796 Please do not send a bug report like this:
   2797 
   2798 @example
@i{@r{while building frobme-1.3.4}}
   2800 $ configure 
   2801 @error{} sed: file sedscr line 1: Unknown option to 's'
   2802 @end example
   2803 
   2804 If @value{SSED} doesn't configure your favorite package, take a
   2805 few extra minutes to identify the specific problem and make a stand-alone
   2806 test case.  Unlike other programs such as C compilers, making such test
   2807 cases for @command{sed} is quite simple.
   2808 
   2809 A stand-alone test case includes all the data necessary to perform the
   2810 test, and the specific invocation of @command{sed} that causes the problem.
   2811 The smaller a stand-alone test case is, the better.  A test case should
   2812 not involve something as far removed from @command{sed} as ``try to configure
   2813 frobme-1.3.4''.  Yes, that is in principle enough information to look
   2814 for the bug, but that is not a very practical prospect.
   2815 
   2816 Here are a few commonly reported bugs that are not bugs.
   2817 
   2818 @table @asis
   2819 @item @code{N} command on the last line
   2820 @cindex Portability, @code{N} command on the last line
   2821 @cindex Non-bugs, @code{N} command on the last line
   2822 
Most versions of @command{sed} exit without printing anything when
the @code{N} command is issued on the last line of a file.
@value{SSED} prints pattern space before exiting unless, of course,
the @option{-n} command switch has been specified.  This choice is
by design.
   2828 
   2829 For example, the behavior of
   2830 @example
   2831 sed N foo bar
   2832 @end example
   2833 @noindent
   2834 would depend on whether foo has an even or an odd number of
   2835 lines@footnote{which is the actual ``bug'' that prompted the
   2836 change in behavior}.  Or, when writing a script to read the
   2837 next few lines following a pattern match, traditional
implementations of @command{sed} would force you to write
   2839 something like
   2840 @example
   2841 /foo/@{ $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N @}
   2842 @end example
   2843 @noindent
   2844 instead of just
   2845 @example
   2846 /foo/@{ N;N;N;N;N;N;N;N;N; @}
   2847 @end example
   2848 
   2849 @cindex @code{POSIXLY_CORRECT} behavior, @code{N} command
   2850 In any case, the simplest workaround is to use @code{$d;N} in
   2851 scripts that rely on the traditional behavior, or to set
   2852 the @code{POSIXLY_CORRECT} variable to a non-empty value.
   2853 
   2854 @item Regex syntax clashes (problems with backslashes)
   2855 @cindex @acronym{GNU} extensions, to basic regular expressions
   2856 @cindex Non-bugs, regex syntax clashes
   2857 @command{sed} uses the @sc{posix} basic regular expression syntax.  According to
   2858 the standard, the meaning of some escape sequences is undefined in
   2859 this syntax;  notable in the case of @command{sed} are @code{\|},
   2860 @code{\+}, @code{\?}, @code{\`}, @code{\'}, @code{\<},
   2861 @code{\>}, @code{\b}, @code{\B}, @code{\w}, and @code{\W}.
   2862 
   2863 As in all @acronym{GNU} programs that use @sc{posix} basic regular
   2864 expressions, @command{sed} interprets these escape sequences as special
   2865 characters.  So, @code{x\+} matches one or more occurrences of @samp{x}.
   2866 @code{abc\|def} matches either @samp{abc} or @samp{def}.
   2867 
   2868 This syntax may cause problems when running scripts written for other
   2869 @command{sed}s.  Some @command{sed} programs have been written with the
   2870 assumption that @code{\|} and @code{\+} match the literal characters
   2871 @code{|} and @code{+}.  Such scripts must be modified by removing the
   2872 spurious backslashes if they are to be used with modern implementations
   2873 of @command{sed}, like
   2874 @ifset PERL
   2875 @value{SSED} or
   2876 @end ifset
   2877 @acronym{GNU} @command{sed}.
   2878 
On the other hand, some scripts use @code{s|abc\|def||g} to remove occurrences
   2880 of @emph{either} @code{abc} or @code{def}.  While this worked until
   2881 @command{sed} 4.0.x, newer versions interpret this as removing the
   2882 string @code{abc|def}.  This is again undefined behavior according to
   2883 @acronym{POSIX}, and this interpretation is arguably more robust: older
   2884 @command{sed}s, for example, required that the regex matcher parsed
   2885 @code{\/} as @code{/} in the common case of escaping a slash, which is
   2886 again undefined behavior; the new behavior avoids this, and this is good
   2887 because the regex matcher is only partially under our control.
   2888 
   2889 @cindex @acronym{GNU} extensions, special escapes
   2890 In addition, this version of @command{sed} supports several escape characters
   2891 (some of which are multi-character) to insert non-printable characters
   2892 in scripts (@code{\a}, @code{\c}, @code{\d}, @code{\o}, @code{\r},
   2893 @code{\t}, @code{\v}, @code{\x}).  These can cause similar problems
   2894 with scripts written for other @command{sed}s.
   2895 
   2896 @item @option{-i} clobbers read-only files
   2897 @cindex In-place editing
   2898 @cindex @value{SSEDEXT}, in-place editing
   2899 @cindex Non-bugs, in-place editing
   2900 
   2901 In short, @samp{sed -i} will let you delete the contents of
   2902 a read-only file, and in general the @option{-i} option
   2903 (@pxref{Invoking sed, , Invocation}) lets you clobber
   2904 protected files.  This is not a bug, but rather a consequence
   2905 of how the Unix filesystem works.
   2906 
   2907 The permissions on a file say what can happen to the data
   2908 in that file, while the permissions on a directory say what can
   2909 happen to the list of files in that directory.  @samp{sed -i}
will never open for writing a file that is already on disk.
   2911 Rather, it will work on a temporary file that is finally renamed
   2912 to the original name: if you rename or delete files, you're actually
   2913 modifying the contents of the directory, so the operation depends on
   2914 the permissions of the directory, not of the file.  For this same
   2915 reason, @command{sed} does not let you use @option{-i} on a writeable file
   2916 in a read-only directory, and will break hard or symbolic links when
   2917 @option{-i} is used on such a file.
   2918 
   2919 @item @code{0a} does not work (gives an error)
   2920 @cindex @code{0} address
   2921 @cindex @acronym{GNU} extensions, @code{0} address
   2922 @cindex Non-bugs, @code{0} address
   2923 
   2924 There is no line 0.  0 is a special address that is only used to treat
   2925 addresses like @code{0,/@var{RE}/} as active when the script starts: if
   2926 you write @code{1,/abc/d} and the first line includes the word @samp{abc},
   2927 then that match would be ignored because address ranges must span at least
   2928 two lines (barring the end of the file); but what you probably wanted is
   2929 to delete every line up to the first one including @samp{abc}, and this
   2930 is obtained with @code{0,/abc/d}.
   2931 
   2932 @ifclear PERL
   2933 @item @code{[a-z]} is case insensitive
   2934 @cindex Non-bugs, localization-related
   2935 
You are encountering problems with locales.  @sc{posix} mandates that @code{[a-z]}
   2937 uses the current locale's collation order -- in C parlance, that means using
   2938 @code{strcoll(3)} instead of @code{strcmp(3)}.  Some locales have a
   2939 case-insensitive collation order, others don't.
   2940 
   2941 Another problem is that @code{[a-z]} tries to use collation symbols.
   2942 This only happens if you are on the @acronym{GNU} system, using
   2943 @acronym{GNU} libc's regular expression matcher instead of compiling the
   2944 one supplied with @acronym{GNU} sed.  In a Danish locale, for example,
   2945 the regular expression @code{^[a-z]$} matches the string @samp{aa},
   2946 because this is a single collating symbol that comes after @samp{a}
   2947 and before @samp{b}; @samp{ll} behaves similarly in Spanish
   2948 locales, or @samp{ij} in Dutch locales.
   2949 
   2950 To work around these problems, which may cause bugs in shell scripts, set
   2951 the @env{LC_COLLATE} and @env{LC_CTYPE} environment variables to @samp{C}.
   2952 
   2953 @item @code{s/.*//} does not clear pattern space
   2954 @cindex Non-bugs, localization-related
   2955 @cindex @value{SSEDEXT}, emptying pattern space
   2956 @cindex Emptying pattern space
   2957 
   2958 This happens if your input stream includes invalid multibyte
   2959 sequences.  @sc{posix} mandates that such sequences
   2960 are @emph{not} matched by @samp{.}, so that @samp{s/.*//} will not clear
   2961 pattern space as you would expect.  In fact, there is no way to clear
@command{sed}'s buffers in the middle of the script in most multibyte locales
(including UTF-8 locales).  For this reason, @value{SSED} provides a @code{z}
command (for ``zap'') as an extension.
   2965 
   2966 To work around these problems, which may cause bugs in shell scripts, set
   2967 the @env{LC_COLLATE} and @env{LC_CTYPE} environment variables to @samp{C}.
   2968 @end ifclear
   2969 @end table
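
A few of the workarounds discussed in the table above can be sketched as
shell functions.  All function names are invented for this illustration.
The last function uses @env{LC_ALL} rather than the two variables named
above, on the assumption that only a single command needs the @samp{C}
locale; @env{LC_ALL} overrides both @env{LC_COLLATE} and @env{LC_CTYPE}.

```shell
# All function names here are hypothetical.

# Join each pair of lines.  $!N skips N on the last line, so an odd
# trailing line is printed by every sed, not only by GNU sed.
join_pairs() {
    sed '$!N;s/\n/ /'
}

# Portable spelling of "one or more x": the interval \{1,\} is defined
# by POSIX for basic regular expressions, whereas \+ is an extension.
mark_x_runs() {
    sed 's/x\{1,\}/<X>/g'
}

# Force the C locale for one command so that [a-z] means the ASCII
# lowercase letters.  LC_ALL overrides both LC_COLLATE and LC_CTYPE.
ascii_lower_only() {
    LC_ALL=C sed -n '/^[a-z]$/p'
}

printf '1\n2\n3\n' | join_pairs
printf 'axxbx\n' | mark_x_runs
printf 'a\nA\nb\n' | ascii_lower_only
```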
   2970 
   2971 
   2972 @node Extended regexps
   2973 @appendix Extended regular expressions
   2974 @cindex Extended regular expressions, syntax
   2975 
   2976 The only difference between basic and extended regular expressions is in
   2977 the behavior of a few characters: @samp{?}, @samp{+}, parentheses,
   2978 and braces (@samp{@{@}}).  While basic regular expressions require
   2979 these to be escaped if you want them to behave as special characters,
   2980 when using extended regular expressions you must escape them if
   2981 you want them @emph{to match a literal character}.
   2982 
   2983 @noindent
   2984 Examples:
   2985 @table @code
   2986 @item abc?
   2987 becomes @samp{abc\?} when using extended regular expressions.  It matches
   2988 the literal string @samp{abc?}.
   2989 
   2990 @item c\+
   2991 becomes @samp{c+} when using extended regular expressions.  It matches
   2992 one or more @samp{c}s.
   2993 
   2994 @item a\@{3,\@}
   2995 becomes @samp{a@{3,@}} when using extended regular expressions.  It matches
   2996 three or more @samp{a}s.
   2997 
   2998 @item \(abc\)\@{2,3\@}
   2999 becomes @samp{(abc)@{2,3@}} when using extended regular expressions.  It
   3000 matches either @samp{abcabc} or @samp{abcabcabc}.
   3001 
   3002 @item \(abc*\)\1
   3003 becomes @samp{(abc*)\1} when using extended regular expressions.
   3004 Backreferences must still be escaped when using extended regular
   3005 expressions.
   3006 @end table
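
The correspondence can be checked by writing the same substitutions in
both syntaxes.  This is a sketch under one assumption: that the
@option{-E} option selects extended regular expressions
(@acronym{GNU} @command{sed} also accepts @option{-r}).  The function
names are made up for the example.

```shell
# Hypothetical helpers: the same two substitutions written in basic and
# in extended syntax.  -E is assumed to select extended regular
# expressions (GNU sed also accepts -r).
bre_demo() {
    sed -e 's/a\{3,\}/<A>/' -e 's/\(xyz\)\{2,3\}/<X>/'
}

ere_demo() {
    sed -E -e 's/a{3,}/<A>/' -e 's/(xyz){2,3}/<X>/'
}

# Both commands should produce the same line: the interval is greedy,
# so three of the four "xyz" repetitions are replaced.
printf 'aaaa xyzxyzxyzxyz\n' | bre_demo
printf 'aaaa xyzxyzxyzxyz\n' | ere_demo
```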
   3007 
   3008 @ifset PERL
   3009 @node Perl regexps
   3010 @appendix Perl-style regular expressions
   3011 @cindex Perl-style regular expressions, syntax
   3012 
   3013 @emph{This part is taken from the @file{pcre.txt} file distributed together
   3014 with the free @sc{pcre} regular expression matcher; it was written by Philip Hazel.}
   3015 
   3016 Perl introduced several extensions to regular expressions, some
   3017 of them incompatible with the syntax of regular expressions
   3018 accepted by Emacs and other @acronym{GNU} tools (whose matcher was
   3019 based on the Emacs matcher).  @value{SSED} implements
   3020 both kinds of extensions.
   3021 
   3022 @iftex
   3023 Summarizing, we have:
   3024 
   3025 @itemize @bullet
   3026 @item
   3027 A backslash can introduce several special sequences
   3028 
   3029 @item
   3030 The circumflex, dollar sign, and period characters behave specially 
   3031 with regard to new lines
   3032 
   3033 @item
   3034 Strange uses of square brackets are parsed differently
   3035 
   3036 @item
   3037 You can toggle modifiers in the middle of a regular expression
   3038 
   3039 @item
   3040 You can specify that a subpattern does not count when numbering backreferences
   3041 
   3042 @item
   3043 @cindex Greedy regular expression matching
   3044 You can specify greedy or non-greedy matching
   3045 
   3046 @item
   3047 You can have more than ten back references
   3048 
   3049 @item
   3050 You can do complex look aheads and look behinds (in the spirit of
   3051 @code{\b}, but with subpatterns).
   3052 
   3053 @item
   3054 You can often improve performance by avoiding that @command{sed} wastes
   3055 time with backtracking
   3056 
   3057 @item
   3058 You can have if/then/else branches
   3059 
   3060 @item
   3061 You can do recursive matches, for example to look for unbalanced parentheses
   3062 
   3063 @item
   3064 You can have comments and non-significant whitespace, because things can
   3065 get complex...
   3066 @end itemize
   3067 
   3068 Most of these extensions are introduced by the special @code{(?}
   3069 sequence, which gives special meanings to parenthesized groups.
   3070 @end iftex
   3071 @menu
Other extensions can be roughly subdivided into two categories.
On one hand, Perl introduces several more escaped sequences
(that is, sequences introduced by a backslash).  On the other
hand, it specifies that if a question mark follows an open
parenthesis, it gives a special meaning to the parenthesized
group.
   3078 
   3079 * Backslash::                       Introduces special sequences
   3080 * Circumflex/dollar sign/period::   Behave specially with regard to new lines
   3081 * Square brackets::                 Are a bit different in strange cases
   3082 * Options setting::                 Toggle modifiers in the middle of a regexp
   3083 * Non-capturing subpatterns::       Are not counted when backreferencing
   3084 * Repetition::                      Allows for non-greedy matching
   3085 * Backreferences::                  Allows for more than 10 back references
   3086 * Assertions::                      Allows for complex look ahead matches
   3087 * Non-backtracking subpatterns::    Often gives more performance
   3088 * Conditional subpatterns::         Allows if/then/else branches
   3089 * Recursive patterns::              For example to match parentheses
   3090 * Comments::                        Because things can get complex...
   3091 @end menu
   3092 
   3093 @node Backslash
   3094 @appendixsec Backslash
   3095 @cindex Perl-style regular expressions, escaped sequences
   3096 
There are a few differences in the handling of backslashed
   3098 sequences in Perl mode.
   3099 
   3100 First of all, there are no @code{\o} and @code{\d} sequences.
   3101 @sc{ascii} values for characters can be specified in octal
   3102 with a @code{\@var{xxx}} sequence, where @var{xxx} is a
   3103 sequence of up to three octal digits.  If the first digit
   3104 is a zero, the treatment of the sequence is straightforward;
   3105 just note that if the character that follows the escaped digit
   3106 is itself an octal digit, you have to supply three octal digits
   3107 for @var{xxx}.  For example @code{\07} is a @sc{bel} character
   3108 rather than a @sc{nul} and a literal @code{7} (this sequence is
   3109 instead represented by @code{\0007}).
   3110 
   3111 @cindex Perl-style regular expressions, backreferences
   3112 The handling of a backslash followed by a digit other than 0
   3113 is complicated.  Outside a character class, @command{sed} reads it
   3114 and any following digits as a decimal number. If the number
   3115 is less than 10, or if there have been at least that many
   3116 previous capturing left parentheses in the expression, the
   3117 entire sequence is taken as a back reference. A description
   3118 of how this works is given later, following the discussion
   3119 of parenthesized subpatterns.
   3120 
   3121 Inside a character class, or if the decimal number is
   3122 greater than 9 and there have not been that many capturing
   3123 subpatterns, @command{sed} re-reads up to three octal digits following
   3124 the backslash, and generates a single byte from the
   3125 least significant 8 bits of the value. Any subsequent digits
   3126 stand for themselves.  For example:
   3127 
   3128 @example
   3129 \040  @i{@r{is another way of writing a space}}
   3130 \40   @i{@r{is the same, provided there are fewer than 40}}
   3131       @i{@r{previous capturing subpatterns}}
   3132 \7    @i{@r{is always a back reference}}
   3133 \011  @i{@r{is always a tab}}
   3134 \11   @i{@r{might be a back reference, or another way of writing a tab}}
   3135 \0113 @i{@r{is a tab followed by the character @samp{3}}}
   3136 \113  @i{@r{is the character with octal code 113 (since there}}
   3137       @i{@r{can be no more than 99 back references)}}
   3138 \377  @i{@r{is a byte consisting entirely of 1 bits (@sc{ascii} 255)}}
   3139 \81   @i{@r{is either a back reference, or a binary zero}}
   3140       @i{@r{followed by the two characters @samp{81}}}
   3141 @end example
   3142 
   3143 Note that octal values of 100 or greater must not be introduced
   3144 by a leading zero, because no more than three octal
digits are ever read. Note that this applies only to the LHS
pattern; it is not yet possible to specify more than 9 backreferences
on the RHS of the @code{s} command.
   3148 
   3149 All the sequences that define a single byte value can be
   3150 used both inside and outside character classes. In addition,
   3151 inside a character class, the sequence @code{\b} is interpreted
   3152 as the backspace character (hex 08). Outside a character
   3153 class it has a different meaning (see below).
   3154 
In addition, the following escapes specify
generic character classes (like @code{\w} and @code{\W} do):
   3157 
   3158 @cindex Perl-style regular expressions, character classes
   3159 @table @samp
   3160 @item \d
   3161 Matches any decimal digit
   3162 
   3163 @item \D
   3164 Matches any character that is not a decimal digit
   3165 @end table
   3166 
   3167 In Perl mode, these character type sequences can appear both inside and
   3168 outside character classes. Instead, in @sc{posix} mode these sequences
   3169 (as well as @code{\w} and @code{\W}) are treated as two literal characters
   3170 (a backslash and a letter) inside square brackets.
   3171 
   3172 Escaped sequences specifying assertions are also different in
   3173 Perl mode.  An assertion specifies a condition that has to be met
   3174 at a particular point in a match, without consuming any
   3175 characters from the subject string. The use of subpatterns
   3176 for more complicated assertions is described below.  The
   3177 backslashed assertions are
   3178 
   3179 @cindex Perl-style regular expressions, assertions
   3180 @table @samp
   3181 @item \b
   3182 Asserts that the point is at a word boundary.
   3183 A word boundary is a position in the subject string where
   3184 the current character and the previous character do not both
   3185 match @code{\w} or @code{\W} (i.e. one matches @code{\w} and
   3186 the other matches @code{\W}), or the start or end of the string
   3187 if the first or last character matches @code{\w}, respectively.
   3188 
   3189 @item \B
   3190 Asserts that the point is not at a word boundary.
   3191 
   3192 @item \A
   3193 Asserts the matcher is at the start of pattern space (independent
   3194 of multiline mode).
   3195 
   3196 @item \Z
   3197 Asserts the matcher is at the end of pattern space,
   3198 or at a newline before the end of pattern space (independent of
   3199 multiline mode).
   3200 
   3201 @item \z
   3202 Asserts the matcher is at the end of pattern space (independent
   3203 of multiline mode).
   3204 @end table
   3205 
   3206 These assertions may not appear in character classes (but
   3207 note that @code{\b} has a different meaning, namely the
   3208 backspace character, inside a character class).
   3209 Note that Perl mode does not directly support assertions
   3210 for the beginning and the end of a word; the @acronym{GNU} extensions
   3211 @code{\<} and @code{\>} achieve this purpose in @sc{posix} mode
   3212 instead.
   3213 
   3214 The @code{\A}, @code{\Z}, and @code{\z} assertions differ
   3215 from the traditional circumflex and dollar sign (described below)
   3216 in that they only ever match at the very start and end of the
   3217 subject string, whatever options are set; in particular @code{\A}
   3218 and @code{\z} are the same as the @acronym{GNU} extensions
   3219 @code{\`} and @code{\'} that are active in @sc{posix} mode.
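
The behaviour of @code{\b}, @code{\B} and @code{\A} can be tried out
in any engine with Perl-style syntax.  The following is a minimal
sketch that assumes Python's @code{re} module as a stand-in; it is not
the matcher used by @value{SSED}, and the sample strings are
illustrative only.  Python spells the end-of-subject assertions
differently, so only @code{\A} is shown.

```python
import re

# Assumption: Python's re module stands in for the Perl-mode matcher.
text = "cat concatenate"

# \b...\b: "cat" as a whole word only.
whole_words = [m.start() for m in re.finditer(r"\bcat\b", text)]

# \B...\B: "cat" with word characters on both sides.
inner = [m.start() for m in re.finditer(r"\Bcat\B", text)]

# \A matches only at the very start, even in multiline mode,
# whereas ^ also matches after the internal newline.
subject = "foo\nfoo"
with_A = [m.start() for m in re.finditer(r"\Afoo", subject, re.M)]
with_caret = [m.start() for m in re.finditer(r"^foo", subject, re.M)]

print(whole_words, inner, with_A, with_caret)
```

The whole-word search finds only the first word, the @code{\B} search
finds only the occurrence inside @samp{concatenate}, and @code{\A}
ignores the second line that @code{^} accepts in multiline mode.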
   3220 
   3221 @node Circumflex/dollar sign/period
   3222 @appendixsec Circumflex, dollar sign, period
   3223 @cindex Perl-style regular expressions, newlines
   3224 
   3225 Outside a character class, in the default matching mode, the
   3226 circumflex character is an assertion which is true only if
   3227 the current matching point is at the start of the subject
   3228 string.  Inside a character class, the circumflex has an entirely
   3229 different meaning (see below).
   3230 
   3231 The circumflex need not be the first character of the pattern if
   3232 a number of alternatives are involved, but it should be the
   3233 first thing in each alternative in which it appears if the
   3234 pattern is ever to match that branch. If all possible alternatives
   3235 start with a circumflex, that is, if the pattern is
   3236 constrained to match only at the start of the subject, it is
   3237 said to be an @dfn{anchored} pattern. (There are also other
   3238 constructs that can cause a pattern to be anchored.)
   3239 
   3240 A dollar sign is an assertion which is true only if the
   3241 current matching point is at the end of the subject string,
   3242 or immediately before a newline character that is the last
   3243 character in the string (by default).  A dollar sign need not be the
   3244 last character of the pattern if a number of alternatives
   3245 are involved, but it should be the last item in any branch
   3246 in which it appears.  A dollar sign has no special meaning in a
   3247 character class.
   3248 
   3249 @cindex Perl-style regular expressions, multiline
   3250 The meanings of the circumflex and dollar sign characters are
   3251 changed if the @code{M} modifier option is used. When this is
   3252 the case, they match immediately after and immediately
   3253 before an internal @code{\n} character, respectively, in addition
   3254 to matching at the start and end of the subject string.  For
   3255 example, the pattern @code{/^abc$/} matches the subject string
   3256 @samp{def\nabc} in multiline mode, but not otherwise.  Consequently,
   3257 patterns that are anchored in single line mode
   3258 because all branches start with @code{^} are not anchored in
   3259 multiline mode.
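
The effect of multiline mode on the example above can be sketched as
follows.  This assumes Python's @code{re} module, whose @code{re.M}
flag plays the role of the @code{M} modifier; it is an illustration
only, not the matcher used by @value{SSED}.

```python
import re

# Assumption: re.M corresponds to the M modifier described above.
subject = "def\nabc"

single_line = re.search(r"^abc$", subject)        # ^ only at subject start
multi_line = re.search(r"^abc$", subject, re.M)   # ^ also after "\n"

print(single_line, multi_line)
```

Without the flag the search fails, because @samp{abc} does not begin
the subject; with it, the match starts right after the newline.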
   3260 
   3261 @cindex Perl-style regular expressions, multiline
   3262 Note that the sequences @code{\A}, @code{\Z}, and @code{\z}
   3263 can be used to match the start and end of the subject in both
   3264 modes, and if all branches of a pattern start with @code{\A}
   3265 it is always anchored, whether the @code{M} modifier is set or not.
   3266 
   3267 @cindex Perl-style regular expressions, single line
   3268 Outside a character class, a dot in the pattern matches any
   3269 one character in the subject, including a non-printing character,
   3270 but not (by default) newline.  If the @code{S} modifier is used,
   3271 dots match newlines as well.  Actually, the handling of
   3272 dot is entirely independent of the handling of circumflex
   3273 and dollar sign, the only relationship being that they both
   3274 involve newline characters. Dot has no special meaning in a
   3275 character class.
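
A short sketch of the dot's treatment of newlines, again assuming
Python's @code{re} module as a stand-in, with @code{re.S} taking the
place of the @code{S} modifier:

```python
import re

# Assumption: re.S corresponds to the S modifier described above.
subject = "one\ntwo"

default_dot = re.match(r".*", subject).group()        # stops at the newline
dotall = re.match(r".*", subject, re.S).group()       # crosses the newline

print(repr(default_dot), repr(dotall))
```

The first match stops before the newline; the second runs to the end
of the subject.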
   3276 
   3277 @node Square brackets
   3278 @appendixsec Square brackets
   3279 @cindex Perl-style regular expressions, character classes
   3280 
   3281 An opening square bracket introduces a character class, terminated
   3282 by a closing square bracket.  A closing square bracket on its own
   3283 is not special.  If a closing square bracket is required as a
   3284 member of the class, it should be the first data character in
   3285 the class (after an initial circumflex, if present) or escaped with a backslash.
   3286 
   3287 A character class matches a single character in the subject;
   3288 the character must be in the set of characters defined by
   3289 the class, unless the first character in the class is a circumflex,
   3290 in which case the subject character must not be in
   3291 the set defined by the class. If a circumflex is actually
   3292 required as a member of the class, ensure it is not the
   3293 first character, or escape it with a backslash.
   3294 
   3295 For example, the character class @code{[aeiou]} matches any lower
   3296 case vowel, while @code{[^aeiou]} matches any character that is not
   3297 a lower case vowel. Note that a circumflex is just a convenient
   3298 notation for specifying the characters which are in
   3299 the class by enumerating those that are not. It is not an
   3300 assertion: it still consumes a character from the subject
   3301 string, and fails if the current pointer is at the end of
   3302 the string.
   3303 
   3304 @cindex Perl-style regular expressions, case-insensitive
   3305 When caseless matching is set, any letters in a class
   3306 represent both their upper case and lower case versions, so
   3307 for example, a caseless @code{[aeiou]} matches both @samp{A}
   3308 and @samp{a}, and a caseless @code{[^aeiou]}
   3309 does not match @samp{A}, whereas a case-sensitive version would.
   3310 
   3311 @cindex Perl-style regular expressions, single line
   3312 @cindex Perl-style regular expressions, multiline
   3313 The newline character is never treated in any special way in
   3314 character classes, whatever the setting of the @code{S} and
   3315 @code{M} options (modifiers) is.  A class such as @code{[^a]} will
   3316 always match a newline.
   3317 
   3318 The minus (hyphen) character can be used to specify a range
   3319 of characters in a character class.  For example, @code{[d-m]}
   3320 matches any letter between d and m, inclusive.  If a minus
   3321 character is required in a class, it must be escaped with a
   3322 backslash or appear in a position where it cannot be interpreted
   3323 as indicating a range, typically as the first or last
   3324 character in the class.
   3325 
   3326 It is not possible to have the literal character @code{]} as the
   3327 end character of a range.  A pattern such as @code{[W-]46]} is
   3328 interpreted as a class of two characters (@code{W} and @code{-})
   3329 followed by a literal string @code{46]}, so it would match
   3330 @samp{W46]} or @samp{-46]}. However, if the @code{]} is escaped
   3331 with a backslash it is interpreted as the end of range, so
   3332 @code{[W-\]46]} is interpreted as a single class containing a
   3333 range followed by two separate characters. The octal or
   3334 hexadecimal representation of @code{]} can also be used to end a range.
   3335 
   3336 Ranges operate in @sc{ascii} collating sequence. They can also be
   3337 used for characters specified numerically, for example
   3338 @code{[\000-\037]}. If a range that includes letters is used when
   3339 caseless matching is set, it matches the letters in either
   3340 case. For example, a caseless @code{[W-c]} is equivalent to
   3341 @code{[][\^_`wxyzabc]}, matched caselessly, and if character
   3342 tables for the French locale are in use, @code{[\xc8-\xcb]}
   3343 matches accented E characters in both cases.
   3344 
   3345 Unlike in @sc{posix} mode, the character types @code{\d},
   3346 @code{\D}, @code{\s}, @code{\S}, @code{\w}, and @code{\W}
   3347 may also appear in a character class, and add the characters
   3348 that they match to the class. For example, @code{[\dABCDEF]} matches any
   3349 hexadecimal digit.  A circumflex can conveniently be used
   3350 with the upper case character types to specify a more restricted
   3351 set of characters than the matching lower case type.
   3352 For example, the class @code{[^\W_]} matches any letter or digit,
   3353 but not underscore.
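
The class constructions above can be checked with a small sketch.
It assumes Python's @code{re} module as a stand-in for the Perl-mode
matcher, and uses @code{re.ASCII} so that @code{\d} and @code{\w}
cover only @sc{ascii} characters; the variable names are illustrative.

```python
import re

# Assumption: Python's re module stands in for the Perl-mode matcher.
# A character type inside a class adds its characters to the class.
hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)

# Negating the upper case type narrows the lower case one:
# everything \w matches, except underscore.
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)

# A negated class matches a newline like any other character.
newline_match = re.fullmatch(r"[^a]", "\n")

hex_ok = [c for c in "7CGc" if hex_digit.fullmatch(c)]
alnum_ok = [c for c in "x9_-" if letter_or_digit.fullmatch(c)]
print(hex_ok, alnum_ok, newline_match is not None)
```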
   3354 
   3355 All non-alphanumeric characters other than @code{\}, @code{-},
   3356 @code{^} (at the start) and the terminating @code{]}
   3357 are non-special in character classes, but it does no harm
   3358 if they are escaped.
   3359 
   3360 Perl 5.6 supports the @sc{posix} notation for character classes, which
   3361 uses names enclosed by @code{[:} and @code{:]} within the enclosing
   3362 square brackets, and @value{SSED} supports this notation as well.
   3363 For example,
   3364 
   3365 @example
   3366 [01[:alpha:]%]
   3367 @end example
   3368 
   3369 @noindent
   3370 matches @samp{0}, @samp{1}, any alphabetic character, or @samp{%}.
   3371 The supported class names are
   3372 
   3373 @table @code
   3374 @item alnum
   3375 Matches letters and digits
   3376 
   3377 @item alpha
   3378 Matches letters
   3379 
   3380 @item ascii
   3381 Matches character codes 0 - 127
   3382 
   3383 @item cntrl
   3384 Matches control characters
   3385 
   3386 @item digit
   3387 Matches decimal digits (same as @code{\d})
   3388 
   3389 @item graph
   3390 Matches printing characters, excluding space
   3391 
   3392 @item lower
   3393 Matches lower case letters
   3394 
   3395 @item print
   3396 Matches printing characters, including space
   3397 
   3398 @item punct
   3399 Matches printing characters, excluding letters and digits
   3400 
   3401 @item space
   3402 Matches white space (same as @code{\s})
   3403 
   3404 @item upper
   3405 Matches upper case letters
   3406 
   3407 @item word
   3408 Matches ``word'' characters (same as @code{\w})
   3409 
   3410 @item xdigit
   3411 Matches hexadecimal digits
   3412 @end table
   3413 
   3414 The names @code{ascii} and @code{word} are extensions valid only in
   3415 Perl mode.  Another Perl extension is negation, which is
   3416 indicated by a circumflex character after the colon. For example,
   3417 
   3418 @example
   3419 [12[:^digit:]]
   3420 @end example
   3421 
   3422 @noindent
   3423 matches @samp{1}, @samp{2}, or any non-digit.
   3424 
   3425 @node Options setting
   3426 @appendixsec Options setting
   3427 @cindex Perl-style regular expressions, toggling options
   3428 @cindex Perl-style regular expressions, case-insensitive
   3429 @cindex Perl-style regular expressions, multiline
   3430 @cindex Perl-style regular expressions, single line
   3431 @cindex Perl-style regular expressions, extended
   3432 
   3433 The settings of the @code{I}, @code{M}, @code{S}, @code{X}
   3434 modifiers can be changed from within the pattern by
   3435 a sequence of Perl option letters enclosed between @code{(?}
   3436 and @code{)}. The option letters must be lowercase.
   3437 
   3438 For example, @code{(?im)} sets caseless, multiline matching. It is
   3439 also possible to unset these options by preceding the letter
   3440 with a hyphen; you can also have combined settings and unsettings:
   3441 @code{(?im-sx)} sets caseless and multiline matching,
   3442 while unsetting single line matching (for dots) and extended
   3443 whitespace interpretation.  If a letter appears both before
   3444 and after the hyphen, the option is unset.
   3445 
   3446 The scope of these option changes depends on where in the
   3447 pattern the setting occurs. For settings that are outside
   3448 any subpattern (defined below), the effect is the same as if
   3449 the options were set or unset at the start of matching. The
   3450 following patterns all behave in exactly the same way:
   3451 
   3452 @example
   3453 (?i)abc
   3454 a(?i)bc
   3455 ab(?i)c
   3456 abc(?i)
   3457 @end example
   3458 
   3459 which in turn is the same as specifying the pattern @code{abc} with
   3460 the @code{I} modifier.  In other words, ``top level'' settings
   3461 apply to the whole pattern (unless there are other
   3462 changes inside subpatterns). If there is more than one setting
   3463 of the same option at top level, the rightmost setting
   3464 is used.
   3465 
   3466 If an option change occurs inside a subpattern, the effect
   3467 is different.  This is a change of behaviour in Perl 5.005.
   3468 An option change inside a subpattern affects only that part
   3469 of the subpattern @emph{that follows} it, so
   3470 
   3471 @example
   3472 (a(?i)b)c
   3473 @end example
   3474 
   3475 @noindent
   3476 matches @samp{abc} and @samp{aBc} and no other strings (assuming
   3477 case-sensitive matching is otherwise in effect).  By this means, options can
   3478 be made to have different settings in different parts of the
   3479 pattern.  Any changes made in one alternative do carry on
   3480 into subsequent branches within the same subpattern.  For
   3481 example,
   3482 
   3483 @example
   3484 (a(?i)b|c)
   3485 @end example
   3486 
   3487 @noindent
   3488 matches @samp{ab}, @samp{aB}, @samp{c}, and @samp{C},
   3489 even though when matching @samp{C} the first branch is
   3490 abandoned before the option setting.
   3491 This is because the effects of option settings happen at
   3492 compile time. There would be some very weird behaviour otherwise.
   3493 
   3494 @ignore
   3495 There are two PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA
   3496 that can be changed in the same way as the Perl-compatible options by
   3497 using the characters U and X respectively.  The (?X) flag
   3498 setting is special in that it must always occur earlier in
   3499 the pattern than any of the additional features it turns on,
   3500 even when it is at top level. It is best put at the start.
   3501 @end ignore
   3502 
   3503 
   3504 @node Non-capturing subpatterns
   3505 @appendixsec Non-capturing subpatterns
   3506 @cindex Perl-style regular expressions, non-capturing subpatterns
   3507 
   3508 Marking part of a pattern as a subpattern does two things.
   3509 On one hand, it localizes a set of alternatives; on the other
   3510 hand, it sets up the subpattern as a capturing subpattern (as
   3511 defined above).  The subpattern can be backreferenced and
   3512 referenced in the right side of @code{s} commands.
   3513 
   3514 For example, if the string @samp{the red king} is matched against
   3515 the pattern
   3516 
   3517 @example
   3518 the ((red|white) (king|queen))
   3519 @end example
   3520 
   3521 @noindent
   3522 the captured substrings are @samp{red king}, @samp{red},
   3523 and @samp{king}, and are numbered 1, 2, and 3.
   3524 
   3525 The fact that plain parentheses fulfil two functions is not
   3526 always helpful.  There are often times when a grouping
   3527 subpattern is required without a capturing requirement.  If an
   3528 opening parenthesis is followed by @code{?:}, the subpattern does
   3529 not do any capturing, and is not counted when computing the
   3530 number of any subsequent capturing subpatterns. For example,
   3531 if the string @samp{the white queen} is matched against the pattern
   3532 
   3533 @example
   3534 the ((?:red|white) (king|queen))
   3535 @end example
   3536 
   3537 @noindent
   3538 the captured substrings are @samp{white queen} and @samp{queen},
   3539 and are numbered 1 and 2. The maximum number of captured
   3540 substrings is 99, while the maximum number of all subpatterns,
   3541 both capturing and non-capturing, is 200.
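
The numbering of captured substrings in the two patterns above can be
sketched as follows, assuming Python's @code{re} module as a stand-in
(it numbers groups in the same left-to-right order of opening
parentheses):

```python
import re

# Assumption: Python's re module stands in for the Perl-mode matcher.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
grouping = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")

# groups() lists captured substrings 1, 2, 3, ... in order.
print(capturing.groups())
print(grouping.groups())
```

With @code{?:} the colour alternatives are still grouped, but they no
longer take up a number, so the king/queen group moves from 3 to 2.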
   3542 
   3543 As a convenient shorthand, if any option settings are
   3544 required at the start of a non-capturing subpattern, the
   3545 option letters may appear between the @code{?} and the
   3546 @code{:}.  Thus the two patterns
   3547 
   3548 @example
   3549 (?i:saturday|sunday)
   3550 (?:(?i)saturday|sunday)
   3551 @end example
   3552 
   3553 @noindent
   3554 match exactly the same set of strings.  Because alternative
   3555 branches are tried from left to right, and options are not
   3556 reset until the end of the subpattern is reached, an option
   3557 setting in one branch does affect subsequent branches, so
   3558 the above patterns match @samp{SUNDAY} as well as @samp{Saturday}.
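
A minimal sketch of the shorthand, assuming Python's @code{re} module
as a stand-in.  Python accepts the @code{(?i:...)} form, but recent
versions reject an option setting such as @code{(?i)} in the middle of
a pattern, so only the shorthand is shown; the second pattern is a
hypothetical example added to show the limited scope.

```python
import re

# Assumption: Python's re module stands in for the Perl-mode matcher.
weekend = re.compile(r"(?i:saturday|sunday)")
days = ("Saturday", "SUNDAY", "sunday", "Monday")
matched = [d for d in days if weekend.fullmatch(d)]

# The option applies only inside the subpattern: "week" is caseless,
# "end" is not.
scoped = re.compile(r"(?i:week)end")
scoped_hits = [w for w in ("WEEKend", "weekEND") if scoped.fullmatch(w)]

print(matched, scoped_hits)
```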
   3559 
   3560 
   3561 @node Repetition
   3562 @appendixsec Repetition
   3563 @cindex Perl-style regular expressions, repetitions
   3564 
   3565 Repetition is specified by quantifiers, which can follow any
   3566 of the following items:
   3567 
   3568 @itemize @bullet
   3569 @item
   3570 a single character, possibly escaped
   3571 
   3572 @item
   3573 the @code{.} special character
   3574 
   3575 @item
   3576 a character class
   3577 
   3578 @item
   3579 a back reference (see next section)
   3580 
   3581 @item
   3582 a parenthesized subpattern (unless it is an assertion; @pxref{Assertions})
   3583 @end itemize
   3584 
   3585 The general repetition quantifier specifies a minimum and
   3586 maximum number of permitted matches, by giving the two
   3587 numbers in curly brackets (braces), separated by a comma.
   3588 The numbers must be less than 65536, and the first must be
   3589 less than or equal to the second. For example:
   3590 
   3591 @example
   3592 z@{2,4@}
   3593 @end example
   3594 
   3595 @noindent
   3596 matches @samp{zz}, @samp{zzz}, or @samp{zzzz}. A closing brace on its own
   3597 is not a special character. If the second number is omitted,
   3598 but the comma is present, there is no upper limit; if the
   3599 second number and the comma are both omitted, the quantifier
   3600 specifies an exact number of required matches. Thus
   3601 
   3602 @example
   3603 [aeiou]@{3,@}
   3604 @end example
   3605 
   3606 @noindent
   3607 matches at least 3 successive vowels, but may match many
   3608 more, while
   3609 
   3610 @example
   3611 \d@{8@}
   3612 @end example
   3613 
   3614 @noindent
   3615 matches exactly 8 digits.  An opening curly bracket that
   3616 appears in a position where a quantifier is not allowed, or
   3617 one that does not match the syntax of a quantifier, is taken
   3618 as a literal character. For example, @code{@{,6@}} is not a quantifier,
   3619 but a literal string of four characters.@footnote{It
   3620 raises an error if @option{-R} is not used.}
   3621 
   3622 The quantifier @samp{@{0@}} is permitted, causing the expression to
   3623 behave as if the previous item and the quantifier were not
   3624 present.
   3625 
   3626 For convenience (and historical compatibility) the three
   3627 most common quantifiers have single-character abbreviations:
   3628 
   3629 @table @code
   3630 @item *
   3631 is equivalent to @{0,@}
   3632 
   3633 @item +
   3634 is equivalent to @{1,@}
   3635 
   3636 @item ?
   3637 is equivalent to @{0,1@}
   3638 @end table
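
The quantifier forms described so far can be exercised with a short
sketch.  It assumes Python's @code{re} module as a stand-in; the
helper @code{full} is a hypothetical name introduced for the example.
(Python treats a brace expression with the first number omitted as a
quantifier, unlike the behaviour described above, so that case is left
out.)

```python
import re

# Assumption: Python's re module stands in for the Perl-mode matcher.
def full(pattern, text):
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

bounded = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz") if full(r"z{2,4}", s)]
vowels = full(r"[aeiou]{3,}", "aeiouaeiou")   # no upper limit
eight = full(r"\d{8}", "20240131")            # exact count
seven = full(r"\d{8}", "2024013")

# The abbreviations agree with their brace forms on these samples.
samples = ("", "a", "aa", "aaa")
star_same = all(full(r"a*", s) == full(r"a{0,}", s) for s in samples)
plus_same = all(full(r"a+", s) == full(r"a{1,}", s) for s in samples)
opt_same = all(full(r"a?", s) == full(r"a{0,1}", s) for s in samples)

print(bounded, vowels, eight, seven, star_same, plus_same, opt_same)
```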
   3639 
   3640 It is possible to construct infinite loops by following a
   3641 subpattern that can match no characters with a quantifier
   3642 that has no upper limit, for example:
   3643 
   3644 @example
   3645 (a?)*
   3646 @end example
   3647 
   3648 Earlier versions of Perl used to give an error at
   3649 compile time for such patterns. However, because there are
   3650 cases where this can be useful, such patterns are now
   3651 accepted, but if any repetition of the subpattern does in
   3652 fact match no characters, the loop is forcibly broken.
   3653 
   3654 @cindex Greedy regular expression matching
   3655 @cindex Perl-style regular expressions, stingy repetitions
   3656 By default, the quantifiers are @dfn{greedy}, as in @sc{posix}
   3657 mode, that is, they match as much as possible (up to the maximum
   3658 number of permitted times), without causing the rest of the
   3659 pattern to fail. The classic example of where this gives problems
   3660 is in trying to match comments in C programs. These appear between
   3661 the sequences @code{/*} and @code{*/} and within the sequence, individual
   3662 @code{*} and @code{/} characters may appear. An attempt to match C
   3663 comments by applying the pattern
   3664 
   3665 @example
   3666 /\*.*\*/
   3667 @end example
   3668 
   3669 @noindent
   3670 to the string
   3671 
   3672 @example
   3673 /* first comment */ not comment /* second comment */
   3674 @end example
   3675 
   3676 @noindent
   3678 fails, because it matches the entire string owing to the
   3679 greediness of the @code{.*} item.
   3680 
   3681 However, if a quantifier is followed by a question mark, it
   3682 ceases to be greedy, and instead matches the minimum number
   3683 of times possible, so the pattern @code{/\*.*?\*/}
   3684 does the right thing with the C comments. The meaning of the
   3685 various quantifiers is not otherwise changed, just the preferred
   3686 number of matches.  Do not confuse this use of question
   3687 mark with its use as a quantifier in its own right.
   3688 Because it has two uses, it can sometimes appear doubled, as in
   3689 
   3690 @example
   3691 \d??\d
   3692 @end example
   3693 
   3694 which matches one digit by preference, but can match two if
   3695 that is the only way the rest of the pattern matches.
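
The difference between greedy and non-greedy matching can be sketched
as follows, assuming Python's @code{re} module as a stand-in for the
Perl-mode matcher:

```python
import re

# Assumption: Python's re module stands in for the Perl-mode matcher.
source = "/* first comment */ not comment /* second comment */"

greedy = re.findall(r"/\*.*\*/", source)    # .* runs to the last */
lazy = re.findall(r"/\*.*?\*/", source)     # .*? stops at the first */

# \d??\d prefers one digit, but takes two when the rest requires it.
prefers_one = re.match(r"\d??\d", "12").group()
forced_two = re.fullmatch(r"\d??\d", "12").group()

print(greedy, lazy, prefers_one, forced_two)
```

The greedy pattern returns the whole subject as a single match, while
the non-greedy one returns the two comments separately.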
   3696 
   3697 Note that greediness does not matter when specifying addresses,
   3698 but can be nevertheless used to improve performance.
   3699 
   3700 @ignore
   3701 If the PCRE_UNGREEDY option is set (an option which is not
   3702 available in Perl), the quantifiers are not greedy by
   3703 default, but individual ones can be made greedy by following
   3704 them with a question mark. In other words, it inverts the
   3705 default behaviour.
   3706 @end ignore
   3707 
   3708 When a parenthesized subpattern is quantified with a minimum
   3709 repeat count that is greater than 1 or with a limited maximum,
   3710 more memory is required for the compiled pattern, in
   3711 proportion to the size of the minimum or maximum.
   3712 
   3713 @cindex Perl-style regular expressions, single line
   3714 If a pattern starts with @code{.*} or @code{.@{0,@}} and the
   3715 @code{S} modifier is used, the pattern is implicitly anchored,
   3716 because whatever follows will be tried against every character
   3717 position in the subject string, so there is no point in
   3718 retrying the overall match at any position after the first.
   3719 PCRE treats such a pattern as though it were preceded by @code{\A}.
   3720 
   3721 When a capturing subpattern is repeated, the value captured
   3722 is the substring that matched the final iteration. For example,
   3723 after
   3724 
   3725 @example
   3726 (tweedle[dume]@{3@}\s*)+
   3727 @end example
   3728 
   3729 @noindent
   3730 has matched @samp{tweedledum tweedledee} the value of the
   3731 captured substring is @samp{tweedledee}.  However, if there are
   3732 nested capturing subpatterns, the corresponding captured
   3733 values may have been set in previous iterations. For example,
   3734 after
   3735 
   3736 @example
   3737 /(a|(b))+/
   3738 @end example
   3739 
   3740 matches @samp{aba}, the value of the second captured substring is
   3741 @samp{b}.
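
Both cases can be reproduced in a short sketch.  It assumes Python's
@code{re} module as a stand-in; that module happens to keep a nested
capture from an earlier iteration in the same way, but other engines
may differ on this point.

```python
import re

# Assumption: Python's re module stands in for the Perl-mode matcher.
repeated = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = repeated.group(1)

# The inner group was last set on the second iteration ("b") and keeps
# that value even though the final iteration matched "a".
nested = re.fullmatch(r"(a|(b))+", "aba").groups()

print(last_iteration, nested)
```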
   3742 
   3743 @node Backreferences
   3744 @appendixsec Backreferences
   3745 @cindex Perl-style regular expressions, backreferences
   3746 
   3747 Outside a character class, a backslash followed by a digit
   3748 greater than 0 (and possibly further digits) is a back
   3749 reference to a capturing subpattern earlier (i.e.  to its
   3750 left) in the pattern, provided there have been that many
   3751 previous capturing left parentheses.
   3752 
   3753 However, if the decimal number following the backslash is
   3754 less than 10, it is always taken as a back reference, and
   3755 causes an error only if there are not that many capturing
   3756 left parentheses in the entire pattern. In other words, the
   3757 parentheses that are referenced need not be to the left of
   3758 the reference for numbers less than 10. @xref{Backslash},
   3759 for further details of the handling of digits following a backslash.
   3760 
   3761 A back reference matches whatever actually matched the capturing
   3762 subpattern in the current subject string, rather than
   3763 anything matching the subpattern itself. So the pattern
   3764 
   3765 @example
   3766 (sens|respons)e and \1ibility
   3767 @end example
   3768 
   3769 @noindent
   3770 matches @samp{sense and sensibility} and @samp{response and responsibility},
   3771 but not @samp{sense and responsibility}. If caseful
   3772 matching is in force at the time of the back reference, the
   3773 case of letters is relevant. For example,
   3774 
   3775 @example
   3776 ((?i)blah)\s+\1
   3777 @end example
   3778 
   3779 @noindent
   3780 matches @samp{blah blah} and @samp{Blah Blah}, but not
   3781 @samp{BLAH blah}, even though the original capturing
   3782 subpattern is matched caselessly.
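
A minimal sketch of both examples, assuming Python's @code{re} module
as a stand-in.  Because recent Python versions do not accept
@code{(?i)} in the middle of a pattern, the second pattern is written
with the equivalent scoped form @code{(?i:blah)}; that rewriting is an
adaptation for the sketch, not the syntax shown above.

```python
import re

# Assumption: Python's re module stands in for the Perl-mode matcher.
pair = re.compile(r"(sens|respons)e and \1ibility")
titles = (
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
)
pair_hits = [t for t in titles if pair.fullmatch(t)]

# The group is matched caselessly, but \1 is compared case-sensitively.
echo = re.compile(r"((?i:blah))\s+\1")
echo_hits = [s for s in ("blah blah", "Blah Blah", "BLAH blah") if echo.fullmatch(s)]

print(pair_hits, echo_hits)
```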
   3783 
   3784 There may be more than one back reference to the same subpattern.
   3785 Also, if a subpattern has not actually been used in a
   3786 particular match, any back references to it always fail. For
   3787 example, the pattern
   3788 
   3789 @example
   3790 (a|(bc))\2
   3791 @end example
   3792 
   3793 @noindent
   3794 always fails if it starts to match @samp{a} rather than
   3795 @samp{bc}.  Because there may be up to 99 back references, all
   3796 digits following the backslash are taken as part of a potential
   3797 back reference number; this is different from what happens
   3798 in @sc{posix} mode. If the pattern continues with a digit
   3799 character, some delimiter must be used to terminate the back
   3800 reference.  If the @code{X} modifier option is set, this can be
   3801 whitespace.  Otherwise an empty comment can be used, or the
   3802 following character can be expressed in hexadecimal or octal.
   3803 Note that this applies only to the LHS pattern; it is
   3804 not yet possible to specify more than 9 backreferences on the
   3805 RHS of the @code{s} command.
   3806 
   3807 A back reference that occurs inside the parentheses to which
   3808 it refers fails when the subpattern is first used, so, for
   3809 example, @code{(a\1)} never matches.  However, such references
   3810 can be useful inside repeated subpatterns. For example, the
   3811 pattern
   3812 
   3813 @example
   3814 (a|b\1)+
   3815 @end example
   3816 
   3817 @noindent
   3818 matches any number of @samp{a}s and also @samp{aba}, @samp{ababbaa},
   3819 etc. At each iteration of the subpattern, the back reference matches
   3820 the character string corresponding to the previous iteration.  In
   3821 order for this to work, the pattern must be such that the first
   3822 iteration does not need to match the back reference.  This can be
   3823 done using alternation, as in the example above, or by a
   3824 quantifier with a minimum of zero.
   3825 
   3826 @node Assertions
   3827 @appendixsec Assertions
   3828 @cindex Perl-style regular expressions, assertions
   3829 @cindex Perl-style regular expressions, asserting subpatterns
   3830 
   3831 An assertion is a test on the characters following or
   3832 preceding the current matching point that does not actually
   3833 consume any characters. The simple assertions coded as @code{\b},
   3834 @code{\B}, @code{\A}, @code{\Z}, @code{\z}, @code{^} and @code{$}
   3835 are described above. More complicated assertions are coded as
   3836 subpatterns.  There are two kinds: those that look ahead of the
   3837 current position in the subject string, and those that look behind it.
   3838 
   3839 @cindex Perl-style regular expressions, lookahead subpatterns
   3840 An assertion subpattern is matched in the normal way, except
   3841 that it does not cause the current matching position to be
   3842 changed. Lookahead assertions start with @code{(?=} for positive
   3843 assertions and @code{(?!} for negative assertions. For example,
   3844 
   3845 @example
   3846 \w+(?=;)
   3847 @end example
   3848 
   3849 @noindent
   3850 matches a word followed by a semicolon, but does not include
   3851 the semicolon in the match, and
   3852 
   3853 @example
   3854 foo(?!bar)
   3855 @end example
   3856 
   3857 @noindent
   3858 matches any occurrence of @samp{foo} that is not followed by
   3859 @samp{bar}.
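
The two lookahead examples can be sketched as follows, assuming
Python's @code{re} module as a stand-in; the subject strings are
illustrative only.

```python
import re

# Assumption: Python's re module stands in for the Perl-mode matcher.
# Positive lookahead: the semicolon is required but not consumed.
words = re.findall(r"\w+(?=;)", "alpha; beta gamma;")

# Negative lookahead: only the "foo" not followed by "bar" is found.
foo_starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

print(words, foo_starts)
```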
   3860 
   3861 Note that the apparently similar pattern
   3862 
   3863 @example
   3864 (?!foo)bar
   3865 @end example
   3866 
   3867 @noindent
   3868 @cindex Perl-style regular expressions, lookbehind subpatterns
   3869 finds any occurrence of @samp{bar} even if it is preceded by
   3870 @samp{foo}, because the assertion @code{(?!foo)} is always true
   3871 when the next three characters are @samp{bar}. A lookbehind
   3872 assertion is needed to reject a @samp{bar} that follows @samp{foo}.
   3873 Lookbehind assertions start with @code{(?<=} for positive
   3874 assertions and @code{(?<!} for negative assertions. So,
   3875 
   3876 @example
   3877 (?<!foo)bar
   3878 @end example
   3879 
   3880 achieves the required effect of finding an occurrence of
   3881 @samp{bar} that is not preceded by @samp{foo}. The contents of a
   3882 lookbehind assertion are restricted
   3883 such that all the strings it matches must have a fixed
   3884 length.  However, if there are several alternatives, they do
   3885 not all have to have the same fixed length.  This is an extension
   3886 compared with Perl 5.005, which requires all branches to match
   3887 the same length of string. Thus
   3888 
   3889 @example
   3890 (?<=dog|dogs|cat|cats)
   3891 @end example
   3892 
   3893 @noindent
   3894 is permitted, but the apparently equivalent regular expression
   3895 
   3896 @example
   3897 (?<=dogs?|cats?)
   3898 @end example
   3899 
   3900 @noindent
   3901 causes an error at compile time. Branches that match different
   3902 length strings are permitted only at the top level of
   3903 a lookbehind assertion: an assertion such as
   3904 
   3905 @example
   3906 (?<=ab(c|de))
   3907 @end example
   3908 
   3909 @noindent
   3910 is not permitted, because its single top-level branch can
   3911 match two different lengths, but it is acceptable if rewritten
   3912 to use two top-level branches:
   3913 
   3914 @example
   3915 (?<=abc|abde)
   3916 @end example
   3917 
   3918 All this is required because lookbehind assertions simply
   3919 move the current position back by the alternative's fixed
   3920 width and then try to match.  If there are
   3921 insufficient characters before the current position, the
match is deemed to fail.  Lookbehinds, in conjunction with
non-backtracking subpatterns, can be particularly useful for
   3924 matching at the ends of strings; an example is given at the end
   3925 of the section on non-backtracking subpatterns.
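
The difference between @code{(?!foo)bar} and @code{(?<!foo)bar}, and
the rule about insufficient characters before the current position,
can be tried out with the sketch below.  It assumes Python's @code{re}
module rather than the @value{SSED} matcher.  Python is stricter than
the rules described here (it rejects lookbehind alternatives of
differing widths), so only single-width assertions are used.  The
subject strings and variable names are hypothetical.

```python
import re

subject = "foobar xbar"

# (?!foo)bar: the lookahead inspects the same three characters that
# must be "bar", so it can never reject anything.
ahead = [m.start() for m in re.finditer(r"(?!foo)bar", subject)]

# (?<!foo)bar: the lookbehind inspects the three characters before "bar".
behind = [m.start() for m in re.finditer(r"(?<!foo)bar", subject)]

# A positive lookbehind fails when fewer characters than its width
# precede the current position: only "bc" comes before the "d" here.
too_short = re.search(r"(?<=abc)d", "bcd")

print(ahead, behind, too_short)
```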
   3926 
   3927 Several assertions (of any sort) may occur in succession.
   3928 For example,
   3929 
   3930 @example
   3931 (?<=\d@{3@})(?<!999)foo
   3932 @end example
   3933 
   3934 @noindent
   3935 matches @samp{foo} preceded by three digits that are not @samp{999}.
   3936 Notice that each of the assertions is applied independently
   3937 at the same point in the subject string. First there is a
   3938 check that the previous three characters are all digits, and
   3939 then there is a check that the same three characters are not
   3940 @samp{999}.  This pattern does not match @samp{foo} preceded by six
characters, the first three of which are digits and the last three
   3942 of which are not @samp{999}.  For example, it doesn't match
   3943 @samp{123abcfoo}. A pattern to do that is
   3944 
   3945 @example
   3946 (?<=\d@{3@}...)(?<!999)foo
   3947 @end example
   3948 
   3949 @noindent
   3950 This time the first assertion looks at the preceding six
   3951 characters, checking that the first three are digits, and
   3952 then the second assertion checks that the preceding three
   3953 characters are not @samp{999}.  Actually, assertions can be
   3954 nested in any combination, so one can write this as 
   3955 
   3956 @example
   3957 (?<=\d@{3@}(?!999)...)foo
   3958 @end example
   3959 
   3960 or
   3961 
   3962 @example
   3963 (?<=\d@{3@}...(?<!999))foo
   3964 @end example
   3965 
   3966 @noindent
   3967 both of which might be considered more readable.
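
The behaviour of successive assertions can be checked with a short
sketch.  It assumes Python's @code{re} module, which accepts these
particular patterns unchanged but is not the matcher used by
@value{SSED}; the names @code{three_back} and @code{six_back} are made
up for this example.

```python
import re

# Both assertions are tested at the same position, just before "foo".
three_back = re.compile(r"(?<=\d{3})(?<!999)foo")
# Here the first assertion looks at six characters, the second at three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

results = {
    "three_back 123foo": three_back.search("123foo") is not None,
    "three_back 999foo": three_back.search("999foo") is not None,
    "three_back 123abcfoo": three_back.search("123abcfoo") is not None,
    "six_back 123abcfoo": six_back.search("123abcfoo") is not None,
    "six_back 123999foo": six_back.search("123999foo") is not None,
}

for label, matched in results.items():
    print(label, matched)
```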
   3968 
   3969 Assertion subpatterns are not capturing subpatterns, and may
   3970 not be repeated, because it makes no sense to assert the
   3971 same thing several times. If any kind of assertion contains
   3972 capturing subpatterns within it, these are counted for the
   3973 purposes of numbering the capturing subpatterns in the whole
   3974 pattern.  However, substring capturing is carried out only
   3975 for positive assertions, because it does not make sense for
   3976 negative assertions.
   3977 
   3978 Assertions count towards the maximum of 200 parenthesized
   3979 subpatterns.
   3980 
   3981 @node Non-backtracking subpatterns
   3982 @appendixsec Non-backtracking subpatterns
   3983 @cindex Perl-style regular expressions, non-backtracking subpatterns
   3984 
   3985 With both maximizing and minimizing repetition, failure of
   3986 what follows normally causes the repeated item to be evaluated
   3987 again to see if a different number of repeats allows the
   3988 rest of the pattern to match. Sometimes it is useful to
   3989 prevent this, either to change the nature of the match, or
to cause it to fail earlier than it otherwise might, when the
   3991 author of the pattern knows there is no point in carrying
   3992 on.
   3993 
   3994 Consider, for example, the pattern @code{\d+foo} when applied to
   3995 the subject line
   3996 
   3997 @example
   3998 123456bar
   3999 @end example
   4000 
   4001 After matching all 6 digits and then failing to match @samp{foo},
   4002 the normal action of the matcher is to try again with only 5
   4003 digits matching the @code{\d+} item, and then with 4, and so on,
   4004 before ultimately failing. Non-backtracking subpatterns
   4005 provide the means for specifying that once a portion of the
   4006 pattern has matched, it is not to be re-evaluated in this way,
   4007 so the matcher would give up immediately on failing to match
   4008 @samp{foo} the first time.  The notation is another kind of special
   4009 parenthesis, starting with @code{(?>} as in this example:
   4010 
   4011 @example
(?>\d+)foo
   4013 @end example
   4014 
   4015 This kind of parenthesis ``locks up'' the part of the pattern
   4016 it contains once it has matched, and a failure further into
   4017 the pattern is prevented from backtracking into it.
   4018 Backtracking past it to previous items, however, works as
   4019 normal.
   4020 
   4021 Non-backtracking subpatterns are not capturing subpatterns.  Simple
   4022 cases such as the above example can be thought of as a maximizing
   4023 repeat that must swallow everything it can.  So,
   4024 while both @code{\d+} and @code{\d+?} are prepared to adjust the number of
   4025 digits they match in order to make the rest of the pattern
   4026 match, @code{(?>\d+)} can only match an entire sequence of digits.
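
The ``locking'' behaviour can be observed with the following sketch,
written for Python's @code{re} module (an assumption of this example,
not the @value{SSED} matcher).  Because the @code{(?>} syntax is only
accepted by newer Python versions, the sketch swaps in a different but
related idiom: a lookahead captures the longest run of digits and a
backreference then consumes exactly that text.  A lookahead is not
re-entered on backtracking, so the effect resembles @code{(?>\d+)}.
The group name @code{n} and the subject string are invented for
illustration.

```python
import re

subject = "1234"

# Ordinary greedy repeat: \d+ first takes "1234", then gives back one
# digit so that the literal 4 can match.
greedy = re.search(r"\d+4", subject)

# Emulated non-backtracking group: the lookahead captures the whole run
# of digits and (?P=n) consumes exactly that run, so nothing is left
# over for the literal 4 and no shorter run is ever tried.
locked = re.search(r"(?=(?P<n>\d+))(?P=n)4", subject)

print(greedy is not None, locked is not None)
```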
   4027 
   4028 This construction can of course contain arbitrarily complicated
   4029 subpatterns, and it can be nested.
   4030 
   4031 @cindex Perl-style regular expressions, lookbehind subpatterns
Non-backtracking subpatterns can be used in conjunction with lookbehind
   4033 assertions to specify efficient matching at the end
   4034 of the subject string. Consider a simple pattern such as
   4035 
   4036 @example
   4037 abcd$
   4038 @end example
   4039 
   4040 @noindent
   4041 when applied to a long string which does not match.  Because
   4042 matching proceeds from left to right, @command{sed} will look for
   4043 each @samp{a} in the subject and then see if what follows matches
   4044 the rest of the pattern. If the pattern is specified as
   4045 
   4046 @example
   4047 ^.*abcd$
   4048 @end example
   4049 
   4050 @noindent
   4051 the initial @code{.*} matches the entire string at first, but when
   4052 this fails (because there is no following @samp{a}), it backtracks
   4053 to match all but the last character, then all but the
   4054 last two characters, and so on. Once again the search for
   4055 @samp{a} covers the entire string, from right to left, so we are
   4056 no better off. However, if the pattern is written as
   4057 
   4058 @example
   4059 ^(?>.*)(?<=abcd)
   4060 @end example
   4061 
@noindent
there can be no backtracking for the @code{.*} item; it can match
   4063 only the entire string. The subsequent lookbehind assertion
   4064 does a single test on the last four characters. If it fails,
   4065 the match fails immediately. For long strings, this approach
   4066 makes a significant difference to the processing time.
   4067 
   4068 When a pattern contains an unlimited repeat inside a subpattern
   4069 that can itself be repeated an unlimited number of
times, the use of a non-backtracking subpattern is the only way to
   4071 avoid some failing matches taking a very long time
   4072 indeed.@footnote{Actually, the matcher embedded in @value{SSED}
   4073 tries to do something for this in the simplest cases,
   4074 like @code{([^b]*b)*}.  These cases are actually quite
   4075 common: they happen for example in a regular expression
   4076 like @code{\/\*([^*]*\*)*\/} which matches C comments.}
   4077 
   4078 The pattern
   4079 
   4080 @example
   4081 (\D+|<\d+>)*[!?]
   4082 @end example
   4083 
   4086 @noindent
matches an unlimited number of substrings that consist either
of non-digits or of digits enclosed in angle brackets, followed by
an exclamation or question mark. When it matches, it runs quickly.
   4090 However, if it is applied to
   4091 
   4092 @example
   4093 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
   4094 @end example
   4095 
   4096 @noindent
   4097 it takes a long time before reporting failure.  This is
   4098 because the string can be divided between the two repeats in
   4099 a large number of ways, and all have to be tried.@footnote{The
   4100 example used @code{[!?]} rather than a single character at the end,
   4101 because both @value{SSED} and Perl have an optimization that allows
   4102 for fast failure when a single character is used. They
   4103 remember the last single character that is required for a
   4104 match, and fail early if it is not present in the string.}
   4105 
   4106 If the pattern is changed to
   4107 
   4108 @example
   4109 ((?>\D+)|<\d+>)*[!?]
   4110 @end example
   4111 
   4112 sequences of non-digits cannot be broken, and failure happens
   4113 quickly.
   4114 
   4115 @node Conditional subpatterns
   4116 @appendixsec Conditional subpatterns
   4117 @cindex Perl-style regular expressions, conditional subpatterns
   4118 
   4119 It is possible to cause the matching process to obey a subpattern
   4120 conditionally or to choose between two alternative
   4121 subpatterns, depending on the result of an assertion, or
   4122 whether a previous capturing subpattern matched or not. The
   4123 two possible forms of conditional subpattern are
   4124 
   4125 @example
   4126 (?(@var{condition})@var{yes-pattern})
   4127 (?(@var{condition})@var{yes-pattern}|@var{no-pattern})
   4128 @end example
   4129 
   4130 If the condition is satisfied, the yes-pattern is used; otherwise
   4131 the no-pattern (if present) is used. If there are more than two
   4132 alternatives in the subpattern, a compile-time error occurs.
   4133 
   4134 There are two kinds of condition. If the text between the
   4135 parentheses consists of a sequence of digits, the condition
   4136 is satisfied if the capturing subpattern of that number has
   4137 previously matched.  The number must be greater than zero.
   4138 Consider the following pattern, which contains non-significant
   4139 white space to make it more readable (assume the @code{X} modifier)
   4140 and to divide it into three parts for ease of discussion:
   4141 
   4142 @example
   4143 ( \( )?   [^()]+   (?(1) \) )
   4144 @end example
   4145 
   4146 The first part matches an optional opening parenthesis, and
   4147 if that character is present, sets it as the first captured
   4148 substring. The second part matches one or more characters
   4149 that are not parentheses. The third part is a conditional
   4150 subpattern that tests whether the first set of parentheses
matched or not.  If they did, that is, if the subject started
   4152 with an opening parenthesis, the condition is true, and so
   4153 the yes-pattern is executed and a closing parenthesis is
   4154 required. Otherwise, since no-pattern is not present, the
   4155 subpattern matches nothing.  In other words, this pattern
   4156 matches a sequence of non-parentheses, optionally enclosed
   4157 in parentheses.
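
The three-part pattern above can be exercised with the sketch below.
It assumes Python's @code{re} module, whose @code{re.VERBOSE} flag
plays a role comparable to the @code{X} modifier; it is not the
@value{SSED} matcher, and the name @code{optional_parens} and the test
strings are hypothetical.

```python
import re

optional_parens = re.compile(r"""
    ( \( )?      # part 1: optional opening parenthesis, captured as group 1
    [^()]+       # part 2: one or more characters that are not parentheses
    (?(1) \) )   # part 3: closing parenthesis required only if group 1 matched
""", re.VERBOSE)

# fullmatch is used so that the whole subject must be accounted for.
checks = {
    "(abc)": optional_parens.fullmatch("(abc)") is not None,
    "abc": optional_parens.fullmatch("abc") is not None,
    "(abc": optional_parens.fullmatch("(abc") is not None,
    "abc)": optional_parens.fullmatch("abc)") is not None,
}

for text, ok in checks.items():
    print(text, ok)
```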
   4158 
   4159 @cindex Perl-style regular expressions, lookahead subpatterns
   4160 If the condition is not a sequence of digits, it must be an
   4161 assertion.  This may be a positive or negative lookahead or
   4162 lookbehind assertion. Consider this pattern, again containing
non-significant white space, and with the two alternatives
on the second and third lines:
   4165 
   4166 @example
   4167 (?(?=...[a-z])
   4168    \d\d-[a-z]@{3@}-\d\d |
   4169    \d\d-\d\d-\d\d )
   4170 @end example
   4171 
   4172 The condition is a positive lookahead assertion that matches
   4173 a letter that is three characters away from the current point.
   4174 If a letter is found, the subject is matched against the first
   4175 alternative @samp{@var{dd}-@var{aaa}-@var{dd}} (where @var{aaa} are
   4176 letters and @var{dd} are digits); otherwise it is matched against 
   4177 the second alternative, @samp{@var{dd}-@var{dd}-@var{dd}}.
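
Python's @code{re} module, used here purely as an assumed test bed,
accepts only a group number or name as a condition, so the sketch below
does not use the conditional syntax itself.  Instead it spells out the
same decision with an ordinary alternation: the yes-pattern is guarded
by the positive lookahead and the no-pattern by its negation.  The name
@code{date_like} and the sample strings are invented for illustration.

```python
import re

# (?(?=A)Y|N) behaves like (?:(?=A)Y|(?!A)N): exactly one of the two
# guards can succeed at any given position.
date_like = re.compile(
    r"(?:(?=...[a-z])\d\d-[a-z]{3}-\d\d"   # a letter three characters away
    r"|(?!...[a-z])\d\d-\d\d-\d\d)"        # otherwise, the all-digit form
)

with_letters = date_like.fullmatch("12-abc-34")
all_digits = date_like.fullmatch("12-34-56")
# The condition selects the first alternative (a letter is seen), that
# alternative then fails, and the second one is never tried.
mixed = date_like.fullmatch("12-ab1-34")

print(with_letters is not None, all_digits is not None, mixed is not None)
```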
   4178 
   4179 
   4180 @node Recursive patterns
   4181 @appendixsec Recursive patterns
   4182 @cindex Perl-style regular expressions, recursive patterns
   4183 @cindex Perl-style regular expressions, recursion
   4184 
   4185 Consider the problem of matching a string in parentheses,
   4186 allowing for unlimited nested parentheses. Without the use
   4187 of recursion, the best that can be done is to use a pattern
   4188 that matches up to some fixed depth of nesting. It is not
   4189 possible to handle an arbitrary nesting depth. Perl 5.6 has
   4190 provided an experimental facility that allows regular
   4191 expressions to recurse (amongst other things). It does this
   4192 by interpolating Perl code in the expression at run time,
and the code can refer to the expression itself. A Perl pattern
to solve the parentheses problem can be created like
   4195 this:
   4196 
   4197 @example
   4198 $re = qr@{\( (?: (?>[^()]+) | (?p@{$re@}) )* \)@}x;
   4199 @end example
   4200 
   4201 The @code{(?p@{...@})} item interpolates Perl code at run time,
   4202 and in this case refers recursively to the pattern in which it
   4203 appears. Obviously, @command{sed} cannot support the interpolation of
   4204 Perl code.  Instead, the special item @code{(?R)} is provided for
   4205 the specific case of recursion. This pattern solves the
   4206 parentheses problem (assume the @code{X} modifier option is used
   4207 so that white space is ignored):
   4208 
   4209 @example
   4210 \( ( (?>[^()]+) | (?R) )* \)
   4211 @end example
   4212 
   4213 First it matches an opening parenthesis. Then it matches any
   4214 number of substrings which can either be a sequence of
   4215 non-parentheses, or a recursive match of the pattern itself
   4216 (i.e. a correctly parenthesized substring). Finally there is
   4217 a closing parenthesis.
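
To make the three steps concrete, here is a hand-written sketch of
what the recursive pattern does.  It is plain Python rather than a
regular expression, because Python's standard @code{re} module has no
@code{(?R)} item; the function name @code{match_parens} is
hypothetical, and the function only tries to match at the given
position instead of searching the whole subject.

```python
def match_parens(s, i=0):
    """Return the index just past a balanced group starting at s[i], or None."""
    # First, an opening parenthesis.
    if i >= len(s) or s[i] != "(":
        return None
    i += 1
    # Then any number of substrings...
    while i < len(s) and s[i] != ")":
        if s[i] == "(":
            # ...either a recursive match of the whole pattern, like (?R),
            end = match_parens(s, i)
            if end is None:
                return None
            i = end
        else:
            # ...or a run of non-parentheses that is taken whole and
            # never shortened afterwards, like (?>[^()]+).
            while i < len(s) and s[i] not in "()":
                i += 1
    # Finally, a closing parenthesis.
    if i < len(s) and s[i] == ")":
        return i + 1
    return None


print(match_parens("(ab(cd)ef)"))
print(match_parens("(aaaa()"))
```

Because each run of non-parentheses is consumed once and never split
differently, the failing case is rejected after a single pass over the
subject.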
   4218 
   4219 This particular example pattern contains nested unlimited
   4220 repeats, and so the use of a non-backtracking subpattern for
   4221 matching strings of non-parentheses is important when applying
   4222 the pattern to strings that do not match. For example, when
   4223 it is applied to
   4224 
   4225 @example
   4226 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
   4227 @end example
   4228 
it yields a ``no match'' response quickly. However, if a
non-backtracking subpattern is not used, the match runs
   4231 for a very long time indeed because there are so many different
   4232 ways the @code{+} and @code{*} repeats can carve up the subject,
   4233 and all have to be tested before failure can be reported.
   4234 
   4235 The values set for any capturing subpatterns are those from
   4236 the outermost level of the recursion at which the subpattern
   4237 value is set. If the pattern above is matched against
   4238 
   4239 @example
   4240 (ab(cd)ef)
   4241 @end example
   4242 
   4243 @noindent
   4244 the value for the capturing parentheses is @samp{ef}, which is
   4245 the last value taken on at the top level.
   4246 
   4247 @node Comments
   4248 @appendixsec Comments
   4249 @cindex Perl-style regular expressions, comments
   4250 
The sequence @code{(?#} marks the start of a comment which continues
up to the next closing parenthesis. Nested parentheses
   4253 are not permitted. The characters that make up a comment
   4254 play no part in the pattern matching at all.
   4255 
   4256 @cindex Perl-style regular expressions, extended
   4257 If the @code{X} modifier option is used, an unescaped @code{#} character
   4258 outside a character class introduces a comment that continues
   4259 up to the next newline character in the pattern.
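
Both comment styles can be tried with the sketch below.  It assumes
Python's @code{re} module, which accepts @code{(?#} comments and whose
@code{re.VERBOSE} flag is comparable to the @code{X} modifier; it is
not the @value{SSED} matcher, and the names @code{inline} and
@code{extended} are made up for this example.

```python
import re

# The (?#...) comment runs up to the next closing parenthesis and plays
# no part in matching.
inline = re.compile(r"ab(?# this text is ignored)c")

# In verbose mode an unescaped # outside a character class starts a
# comment that runs to the end of the line.
extended = re.compile(r"""
    ab    # first two letters
    c     # third letter
    [#]   # inside a character class the sign is an ordinary character
""", re.VERBOSE)

print(inline.fullmatch("abc") is not None)
print(extended.fullmatch("abc#") is not None)
```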
   4260 @end ifset
   4261 
   4262 
   4263 @page
   4264 @node Concept Index
   4265 @unnumbered Concept Index
   4266 
   4267 This is a general index of all issues discussed in this manual, with the
   4268 exception of the @command{sed} commands and command-line options.
   4269 
   4270 @printindex cp
   4271 
   4272 @page
   4273 @node Command and Option Index
   4274 @unnumbered Command and Option Index
   4275 
   4276 This is an alphabetical list of all @command{sed} commands and command-line
   4277 options.
   4278 
   4279 @printindex fn
   4280 
   4281 @contents
   4282 @bye
   4283 
   4284 @c XXX FIXME: the term "cycle" is never defined...
   4285