\input texinfo @c -*-texinfo-*-
@c Do not edit this file!! It is automatically generated from sed-in.texi.
@c
@c -- Stuff that needs adding: ----------------------------------------------
@c (document the `;' command-separator)
@c --------------------------------------------------------------------------
@c Check for consistency: regexps in @code, text that they match in @samp.
@c
@c Tips:
@c @command for command
@c @samp for command fragments: @samp{cat -s}
@c @code for sed commands and flags
@c Use ``quote'' not `quote' or "quote".
@c
@c %**start of header
@setfilename sed.info
@settitle sed, a stream editor
@c %**end of header

@c @smallbook

@include version.texi

@c Combine indices.
@syncodeindex ky cp
@syncodeindex pg cp
@syncodeindex tp cp

@defcodeindex op
@syncodeindex op fn

@include config.texi

@copying
This file documents version @value{VERSION} of
@value{SSED}, a stream editor.

Copyright @copyright{} 1998, 1999, 2001, 2002, 2003, 2004 Free
Software Foundation, Inc.

This document is released under the terms of the @acronym{GNU} Free
Documentation License as published by the Free Software Foundation;
either version 1.1, or (at your option) any later version.

You should have received a copy of the @acronym{GNU} Free Documentation
License along with @value{SSED}; see the file @file{COPYING.DOC}.
If not, write to the Free Software Foundation, 51 Franklin Street,
Fifth Floor, Boston, MA 02110-1301, USA.

There are no Cover Texts and no Invariant Sections; this text, along
with its equivalent in the printed manual, constitutes the Title Page.
@end copying

@setchapternewpage off

@titlepage
@title @command{sed}, a stream editor
@subtitle version @value{VERSION}, @value{UPDATED}
@author by Ken Pizzini, Paolo Bonzini

@page
@vskip 0pt plus 1filll
Copyright @copyright{} 1998, 1999 Free Software Foundation, Inc.

@insertcopying

Published by the Free Software Foundation, @*
51 Franklin Street, Fifth Floor @*
Boston, MA 02110-1301, USA
@end titlepage


@node Top
@top

@ifnottex
@insertcopying
@end ifnottex

@menu
* Introduction::               Introduction
* Invoking sed::               Invocation
* sed Programs::               @command{sed} programs
* Examples::                   Some sample scripts
* Limitations::                Limitations and (non-)limitations of @value{SSED}
* Other Resources::            Other resources for learning about @command{sed}
* Reporting Bugs::             Reporting bugs

* Extended regexps::           @command{egrep}-style regular expressions
@ifset PERL
* Perl regexps::               Perl-style regular expressions
@end ifset

* Concept Index::              A menu with all the topics in this manual.
* Command and Option Index::   A menu with all @command{sed} commands and
                               command-line options.

@detailmenu
--- The detailed node listing ---

sed Programs:
* Execution Cycle::            How @command{sed} works
* Addresses::                  Selecting lines with @command{sed}
* Regular Expressions::        Overview of regular expression syntax
* Common Commands::            Often used commands
* The "s" Command::            @command{sed}'s Swiss Army Knife
* Other Commands::             Less frequently used commands
* Programming Commands::       Commands for @command{sed} gurus
* Extended Commands::          Commands specific to @value{SSED}
* Escapes::                    Specifying special characters

Examples:
* Centering lines::
* Increment a number::
* Rename files to lower case::
* Print bash environment::
* Reverse chars of lines::
* tac::                        Reverse lines of files
* cat -n::                     Numbering lines
* cat -b::                     Numbering non-blank lines
* wc -c::                      Counting chars
* wc -w::                      Counting words
* wc -l::                      Counting lines
* head::                       Printing the first lines
* tail::                       Printing the last lines
* uniq::                       Make duplicate lines unique
* uniq -d::                    Print duplicated lines of input
* uniq -u::                    Remove all duplicated lines
* cat -s::                     Squeezing blank lines

@ifset PERL
Perl regexps::                 Perl-style regular expressions
* Backslash::                  Introduces special sequences
* Circumflex/dollar sign/period::  Behave specially with regard to new lines
* Square brackets::            Are a bit different in strange cases
* Options setting::            Toggle modifiers in the middle of a regexp
* Non-capturing subpatterns::  Are not counted when backreferencing
* Repetition::                 Allows for non-greedy matching
* Backreferences::             Allows for more than 10 back references
* Assertions::                 Allows for complex look ahead matches
* Non-backtracking subpatterns::  Often gives more performance
* Conditional subpatterns::    Allows if/then/else branches
* Recursive patterns::         For example to match parentheses
* Comments::                   Because things can get complex...
@end ifset

@end detailmenu
@end menu


@node Introduction
@chapter Introduction

@cindex Stream editor
@command{sed} is a stream editor.
A stream editor is used to perform basic text
transformations on an input stream
(a file or input from a pipeline).
While in some ways similar to an editor which
permits scripted edits (such as @command{ed}),
@command{sed} works by making only one pass over the
input(s), and is consequently more efficient.
But it is @command{sed}'s ability to filter text in a pipeline
which particularly distinguishes it from other types of
editors.


@node Invoking sed
@chapter Invocation

Normally @command{sed} is invoked like this:

@example
sed SCRIPT INPUTFILE...
@end example

The full format for invoking @command{sed} is:

@example
sed OPTIONS... [SCRIPT] [INPUTFILE...]
@end example

If you do not specify @var{INPUTFILE}, or if @var{INPUTFILE} is @file{-},
@command{sed} filters the contents of the standard input. The @var{script}
is actually the first non-option parameter, which @command{sed} specially
considers a script and not an input file if (and only if) none of the
other @var{options} specifies a script to be executed, that is if neither
of the @option{-e} and @option{-f} options is specified.

@command{sed} may be invoked with the following command-line options:

@table @code
@item --version
@opindex --version
@cindex Version, printing
Print out the version of @command{sed} that is being run and a copyright notice,
then exit.

@item --help
@opindex --help
@cindex Usage summary, printing
Print a usage message briefly summarizing these command-line options
and the bug-reporting address,
then exit.

@item -n
@itemx --quiet
@itemx --silent
@opindex -n
@opindex --quiet
@opindex --silent
@cindex Disabling autoprint, from command line
By default, @command{sed} prints out the pattern space
at the end of each cycle through the script (@pxref{Execution Cycle, ,
How @code{sed} works}).
These options disable this automatic printing,
and @command{sed} only produces output when explicitly told to
via the @code{p} command.

@item -e @var{script}
@itemx --expression=@var{script}
@opindex -e
@opindex --expression
@cindex Script, from command line
Add the commands in @var{script} to the set of commands to be
run while processing the input.

@item -f @var{script-file}
@itemx --file=@var{script-file}
@opindex -f
@opindex --file
@cindex Script, from a file
Add the commands contained in the file @var{script-file}
to the set of commands to be run while processing the input.

@item -i[@var{SUFFIX}]
@itemx --in-place[=@var{SUFFIX}]
@opindex -i
@opindex --in-place
@cindex In-place editing, activating
@cindex @value{SSEDEXT}, in-place editing
This option specifies that files are to be edited in-place.
@value{SSED} does this by creating a temporary file and
sending output to this file rather than to the standard
output.@footnote{This applies to commands such as @code{=},
@code{a}, @code{c}, @code{i}, @code{l}, @code{p}. You can
still write to the standard output by using the @code{w}
@cindex @value{SSEDEXT}, @file{/dev/stdout} file
or @code{W} commands together with the @file{/dev/stdout}
special file}.

This option implies @option{-s}.

When the end of the file is reached, the temporary file is
renamed to the output file's original name.
The extension,
if supplied, is used to modify the name of the old file
before renaming the temporary file, thereby making a backup
copy@footnote{Note that @value{SSED} creates the backup
file whether or not any output is actually changed.}.

@cindex In-place editing, Perl-style backup file names
This rule is followed: if the extension doesn't contain a @code{*},
then it is appended to the end of the current filename as a
suffix; if the extension does contain one or more @code{*}
characters, then @emph{each} asterisk is replaced with the
current filename. This allows you to add a prefix to the
backup file, instead of (or in addition to) a suffix, or
even to place backup copies of the original files into another
directory (provided the directory already exists).

If no extension is supplied, the original file is
overwritten without making a backup.

@item -l @var{N}
@itemx --line-length=@var{N}
@opindex -l
@opindex --line-length
@cindex Line length, setting
Specify the default line-wrap length for the @code{l} command.
A length of 0 (zero) means to never wrap long lines. If
not specified, it is taken to be 70.

@item --posix
@cindex @value{SSEDEXT}, disabling
@value{SSED} includes several extensions to @acronym{POSIX}
@command{sed}. In order to simplify writing portable scripts, this
option disables all the extensions that this manual documents,
including additional commands.
@cindex @code{POSIXLY_CORRECT} behavior, enabling
Most of the extensions accept @command{sed} programs that
are outside the syntax mandated by @acronym{POSIX}, but some
of them (such as the behavior of the @code{N} command
described in @ref{Reporting Bugs}) actually violate the
standard. If you want to disable only the latter kind of
extension, you can set the @code{POSIXLY_CORRECT} variable
to a non-empty value.

@item -b
@itemx --binary
@opindex -b
@opindex --binary
This option is available on every platform, but is only effective where the
operating system makes a distinction between text files and binary files.
When such a distinction is made---as is the case for MS-DOS, Windows, and
Cygwin---text files are composed of lines separated by a carriage return
@emph{and} a line feed character, and @command{sed} does not see the
ending CR. When this option is specified, @command{sed} will open
input files in binary mode, thus not requesting this special processing
and considering lines to end at a line feed.

@item --follow-symlinks
@opindex --follow-symlinks
This option is available only on platforms that support
symbolic links and has an effect only if option @option{-i}
is specified. In this case, if the file that is specified
on the command line is a symbolic link, @command{sed} will
follow the link and edit the ultimate destination of the
link. The default behavior is to break the symbolic link,
so that the link destination will not be modified.

@item -r
@itemx --regexp-extended
@opindex -r
@opindex --regexp-extended
@cindex Extended regular expressions, choosing
@cindex @acronym{GNU} extensions, extended regular expressions
Use extended regular expressions rather than basic
regular expressions. Extended regexps are those that
@command{egrep} accepts; they can be clearer because they
usually have fewer backslashes, but are a @acronym{GNU} extension
and hence scripts that use them are not portable.
@xref{Extended regexps, , Extended regular expressions}.
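
As a brief illustration (a sketch that is not part of the original
text; the input string @samp{aaab} is made up), the same substitution
can be written in both syntaxes. With basic regular expressions the
@acronym{GNU} operator @code{\+} needs a backslash, while with
@option{-r} a plain @code{+} is already special:

@example
# Basic regular expressions: one or more a's is written a\+
echo 'aaab' | sed 's/a\+b/X/'
# Extended regular expressions: the same operator is a bare +
echo 'aaab' | sed -r 's/a+b/X/'
@end example

@noindent
With @value{SSED}, both commands should print @samp{X}.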

@ifset PERL
@item -R
@itemx --regexp-perl
@opindex -R
@opindex --regexp-perl
@cindex Perl-style regular expressions, choosing
@cindex @value{SSEDEXT}, Perl-style regular expressions
Use Perl-style regular expressions rather than basic
regular expressions. Perl-style regexps are extremely
powerful but are a @value{SSED} extension and hence scripts that
use them are not portable. @xref{Perl regexps, ,
Perl-style regular expressions}.
@end ifset

@item -s
@itemx --separate
@cindex Working on separate files
By default, @command{sed} will consider the files specified on the
command line as a single continuous long stream. This @value{SSED}
extension allows the user to consider them as separate files:
range addresses (such as @samp{/abc/,/def/}) are not allowed
to span several files, line numbers are relative to the start
of each file, @code{$} refers to the last line of each file,
and files invoked from the @code{R} commands are rewound at the
start of each file.

@item -u
@itemx --unbuffered
@opindex -u
@opindex --unbuffered
@cindex Unbuffered I/O, choosing
Buffer both input and output as minimally as practical.
(This is particularly useful if the input is coming from
the likes of @samp{tail -f}, and you wish to see the transformed
output as soon as possible.)

@end table

If no @option{-e}, @option{-f}, @option{--expression}, or @option{--file}
options are given on the command line,
then the first non-option argument on the command line is
taken to be the @var{script} to be executed.

@cindex Files to be processed as input
If any command-line parameters remain after processing the above,
these parameters are interpreted as the names of input files to
be processed.
@cindex Standard input, processing as input
A file name of @samp{-} refers to the standard input stream.
The standard input will be processed if no file names are specified.


@node sed Programs
@chapter @command{sed} Programs

@cindex @command{sed} program structure
@cindex Script structure
A @command{sed} program consists of one or more @command{sed} commands,
passed in by one or more of the
@option{-e}, @option{-f}, @option{--expression}, and @option{--file}
options, or the first non-option argument if none of these
options is used.
This document will refer to ``the'' @command{sed} script;
this is understood to mean the in-order catenation
of all of the @var{script}s and @var{script-file}s passed in.

Each @command{sed} command consists of an optional address or
address range, followed by a one-character command name
and any additional command-specific code.

@menu
* Execution Cycle::            How @command{sed} works
* Addresses::                  Selecting lines with @command{sed}
* Regular Expressions::        Overview of regular expression syntax
* Common Commands::            Often used commands
* The "s" Command::            @command{sed}'s Swiss Army Knife
* Other Commands::             Less frequently used commands
* Programming Commands::       Commands for @command{sed} gurus
* Extended Commands::          Commands specific to @value{SSED}
* Escapes::                    Specifying special characters
@end menu


@node Execution Cycle
@section How @command{sed} Works

@cindex Buffer spaces, pattern and hold
@cindex Spaces, pattern and hold
@cindex Pattern space, definition
@cindex Hold space, definition
@command{sed} maintains two data buffers: the active @emph{pattern} space,
and the auxiliary @emph{hold} space. Both are initially empty.

@command{sed} operates by performing the following cycle on each
line of input: first, @command{sed} reads one line from the input
stream, removes any trailing newline, and places it in the pattern space.
Then commands are executed; each command can have an address associated
with it: addresses are a kind of condition code, and a command is only
executed if the condition is verified before the command is to be
executed.

When the end of the script is reached, unless the @option{-n} option
is in use, the contents of pattern space are printed out to the output
stream, adding back the trailing newline if it was removed.@footnote{Actually,
if @command{sed} prints a line without the terminating newline, it will
nevertheless print the missing newline as soon as more text is sent to
the same output stream, which gives the ``least expected surprise''
even though it does not make commands like @samp{sed -n p} exactly
identical to @command{cat}.} Then the next cycle starts for the next
input line.

Unless special commands (like @samp{D}) are used, the pattern space is
deleted between two cycles. The hold space, on the other hand, keeps
its data between cycles (see commands @samp{h}, @samp{H}, @samp{x},
@samp{g}, @samp{G} to move data between both buffers).


@node Addresses
@section Selecting lines with @command{sed}
@cindex Addresses, in @command{sed} scripts
@cindex Line selection
@cindex Selecting lines to process

Addresses in a @command{sed} script can be in any of the following forms:
@table @code
@item @var{number}
@cindex Address, numeric
@cindex Line, selecting by number
Specifying a line number will match only that line in the input.
(Note that @command{sed} counts lines continuously across all input files
unless @option{-i} or @option{-s} options are specified.)

@item @var{first}~@var{step}
@cindex @acronym{GNU} extensions, @samp{@var{n}~@var{m}} addresses
This @acronym{GNU} extension matches every @var{step}th line
starting with line @var{first}.
In particular, lines will be selected when there exists
a non-negative @var{n} such that the current line-number equals
@var{first} + (@var{n} * @var{step}).
Thus, to select the odd-numbered lines,
one would use @code{1~2};
to pick every third line starting with the second, @samp{2~3} would be used;
to pick every fifth line starting with the tenth, use @samp{10~5};
and @samp{50~0} is just an obscure way of saying @code{50}.

@item $
@cindex Address, last line
@cindex Last line, selecting
@cindex Line, selecting last
This address matches the last line of the last file of input, or
the last line of each file when the @option{-i} or @option{-s} options
are specified.

@item /@var{regexp}/
@cindex Address, as a regular expression
@cindex Line, selecting by regular expression match
This will select any line which matches the regular expression @var{regexp}.
If @var{regexp} itself includes any @code{/} characters,
each must be escaped by a backslash (@code{\}).

@cindex empty regular expression
@cindex @value{SSEDEXT}, modifiers and the empty regular expression
The empty regular expression @samp{//} repeats the last regular
expression match (the same holds if the empty regular expression is
passed to the @code{s} command). Note that modifiers to regular expressions
are evaluated when the regular expression is compiled, thus it is invalid to
specify them together with the empty regular expression.

@item \%@var{regexp}%
(The @code{%} may be replaced by any other single character.)

@cindex Slash character, in regular expressions
This also matches the regular expression @var{regexp},
but allows one to use a different delimiter than @code{/}.
This is particularly useful if the @var{regexp} itself contains
a lot of slashes, since it avoids the tedious escaping of every @code{/}.
If @var{regexp} itself includes any delimiter characters,
each must be escaped by a backslash (@code{\}).

@item /@var{regexp}/I
@itemx \%@var{regexp}%I
@cindex @acronym{GNU} extensions, @code{I} modifier
@ifset PERL
@cindex Perl-style regular expressions, case-insensitive
@end ifset
The @code{I} modifier to regular-expression matching is a @acronym{GNU}
extension which causes the @var{regexp} to be matched in
a case-insensitive manner.

@item /@var{regexp}/M
@itemx \%@var{regexp}%M
@cindex @value{SSEDEXT}, @code{M} modifier
@ifset PERL
@cindex Perl-style regular expressions, multiline
@end ifset
The @code{M} modifier to regular-expression matching is a @value{SSED}
extension which causes @code{^} and @code{$} to match respectively
(in addition to the normal behavior) the empty string after a newline,
and the empty string before a newline. There are special character
sequences
@ifset PERL
(@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'}
in basic or extended regular expression modes)
@end ifset
@ifclear PERL
(@code{\`} and @code{\'})
@end ifclear
which always match the beginning or the end of the buffer.
@code{M} stands for @cite{multi-line}.

@ifset PERL
@item /@var{regexp}/S
@itemx \%@var{regexp}%S
@cindex @value{SSEDEXT}, @code{S} modifier
@cindex Perl-style regular expressions, single line
The @code{S} modifier to regular-expression matching is only valid
in Perl mode and specifies that the dot character (@code{.}) will
match the newline character too. @code{S} stands for @cite{single-line}.
@end ifset

@ifset PERL
@item /@var{regexp}/X
@itemx \%@var{regexp}%X
@cindex @value{SSEDEXT}, @code{X} modifier
@cindex Perl-style regular expressions, extended
The @code{X} modifier to regular-expression matching is also
valid in Perl mode only.
If it is used, whitespace in the
pattern (other than in a character class) and
characters between a @kbd{#} outside a character class and the
next newline character are ignored. An escaping backslash
can be used to include a whitespace or @kbd{#} character as part
of the pattern.
@end ifset
@end table

If no addresses are given, then all lines are matched;
if one address is given, then only lines matching that
address are matched.

@cindex Range of lines
@cindex Several lines, selecting
An address range can be specified by specifying two addresses
separated by a comma (@code{,}). An address range matches lines
starting from where the first address matches, and continues
until the second address matches (inclusively).

If the second address is a @var{regexp}, then checking for the
ending match will start with the line @emph{following} the
line which matched the first address: a range will always
span at least two lines (except of course if the input stream
ends).

If the second address is a @var{number} less than (or equal to)
the line matching the first address, then only the one line is
matched.
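
The range rules above can be tried out with a short sketch. It is not
part of the original text and assumes that the @command{seq} utility is
available to generate numbered input lines; any other numbered input
would do.

@example
# Two numeric addresses: prints lines 2, 3 and 4.
seq 6 | sed -n '2,4p'
# The ending regexp is first checked on the line after the start, so
# line 2 itself does not end the range; it runs through line 12.
seq 15 | sed -n '2,/2/p'
# Second address not greater than the first: only line 4 is printed.
seq 6 | sed -n '4,2p'
@end example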

@cindex Special addressing forms
@cindex Range with start address of zero
@cindex Zero, as range start address
@cindex @var{addr1},+N
@cindex @var{addr1},~N
@cindex @acronym{GNU} extensions, special two-address forms
@cindex @acronym{GNU} extensions, @code{0} address
@cindex @acronym{GNU} extensions, 0,@var{addr2} addressing
@cindex @acronym{GNU} extensions, @var{addr1},+@var{N} addressing
@cindex @acronym{GNU} extensions, @var{addr1},~@var{N} addressing
@value{SSED} also supports some special two-address forms; all these
are @acronym{GNU} extensions:
@table @code
@item 0,/@var{regexp}/
A line number of @code{0} can be used in an address specification like
@code{0,/@var{regexp}/} so that @command{sed} will try to match
@var{regexp} in the first input line too. In other words,
@code{0,/@var{regexp}/} is similar to @code{1,/@var{regexp}/},
except that if @var{regexp} matches the very first line of input the
@code{0,/@var{regexp}/} form will consider it to end the range, whereas
the @code{1,/@var{regexp}/} form will match the beginning of its range and
hence make the range span up to the @emph{second} occurrence of the
regular expression.

Note that this is the only place where the @code{0} address makes
sense; there is no 0-th line and commands which are given the @code{0}
address in any other way will give an error.

@item @var{addr1},+@var{N}
Matches @var{addr1} and the @var{N} lines following @var{addr1}.

@item @var{addr1},~@var{N}
Matches @var{addr1} and the lines following @var{addr1}
until the next line whose input line number is a multiple of @var{N}.
@end table

@cindex Excluding lines
@cindex Selecting non-matching lines
Appending the @code{!} character to the end of an address
specification negates the sense of the match.
That is, if the @code{!} character follows an address range,
then only lines which do @emph{not} match the address range
will be selected.
This also works for singleton addresses,
and, perhaps perversely, for the null address.


@node Regular Expressions
@section Overview of Regular Expression Syntax

To know how to use @command{sed}, people should understand regular
expressions (@dfn{regexp} for short). A regular expression
is a pattern that is matched against a
subject string from left to right. Most characters are
@dfn{ordinary}: they stand for
themselves in a pattern, and match the corresponding characters
in the subject. As a trivial example, the pattern

@example
The quick brown fox
@end example

@noindent
matches a portion of a subject string that is identical to
itself. The power of regular expressions comes from the
ability to include alternatives and repetitions in the pattern.
These are encoded in the pattern by the use of @dfn{special characters},
which do not stand for themselves but instead
are interpreted in some special way. Here is a brief description
of regular expression syntax as used in @command{sed}.

@table @code
@item @var{char}
A single ordinary character matches itself.

@item *
@cindex @acronym{GNU} extensions, to basic regular expressions
Matches a sequence of zero or more instances of matches for the
preceding regular expression, which must be an ordinary character, a
special character preceded by @code{\}, a @code{.}, a grouped regexp
(see below), or a bracket expression. As a @acronym{GNU} extension, a
postfixed regular expression can also be followed by @code{*}; for
example, @code{a**} is equivalent to @code{a*}.
@acronym{POSIX}
1003.1-2001 says that @code{*} stands for itself when it appears at
the start of a regular expression or subexpression, but many
non-@acronym{GNU} implementations do not support this and portable
scripts should instead use @code{\*} in these contexts.

@item \+
@cindex @acronym{GNU} extensions, to basic regular expressions
As @code{*}, but matches one or more. It is a @acronym{GNU} extension.

@item \?
@cindex @acronym{GNU} extensions, to basic regular expressions
As @code{*}, but only matches zero or one. It is a @acronym{GNU} extension.

@item \@{@var{i}\@}
As @code{*}, but matches exactly @var{i} sequences (@var{i} is a
decimal integer; for portability, keep it between 0 and 255
inclusive).

@item \@{@var{i},@var{j}\@}
Matches between @var{i} and @var{j} sequences, inclusive.

@item \@{@var{i},\@}
Matches @var{i} or more sequences.

@item \(@var{regexp}\)
Groups the inner @var{regexp} as a whole; this is used to:

@itemize @bullet
@item
@cindex @acronym{GNU} extensions, to basic regular expressions
Apply postfix operators, like @code{\(abcd\)*}:
this will search for zero or more whole sequences
of @samp{abcd}, while @code{abcd*} would search
for @samp{abc} followed by zero or more occurrences
of @samp{d}. Note that support for @code{\(abcd\)*} is
required by @acronym{POSIX} 1003.1-2001, but many non-@acronym{GNU}
implementations do not support it and hence it is not universally
portable.

@item
Use back references (see below).
@end itemize

@item .
Matches any character, including newline.

@item ^
Matches the null string at beginning of the pattern space, i.e. what
appears after the circumflex must appear at the beginning of the
pattern space.

In most scripts, pattern space is initialized to the content of each
line (@pxref{Execution Cycle, , How @code{sed} works}). So, it is a
useful simplification to think of @code{^#include} as matching only
lines where @samp{#include} is the first thing on the line---if there are
spaces before, for example, the match fails. This simplification is
valid as long as the original content of pattern space is not modified,
for example with an @code{s} command.

@code{^} acts as a special character only at the beginning of the
regular expression or subexpression (that is, after @code{\(} or
@code{\|}). Portable scripts should avoid @code{^} at the beginning of
a subexpression, though, as @acronym{POSIX} allows implementations that
treat @code{^} as an ordinary character in that context.

@item $
It is the same as @code{^}, but refers to end of pattern space.
@code{$} also acts as a special character only at the end
of the regular expression or subexpression (that is, before @code{\)}
or @code{\|}), and its use at the end of a subexpression is not
portable.


@item [@var{list}]
@itemx [^@var{list}]
Matches any single character in @var{list}: for example,
@code{[aeiou]} matches all vowels. A list may include
sequences like @code{@var{char1}-@var{char2}}, which
matches any character between (inclusive) @var{char1}
and @var{char2}.

A leading @code{^} reverses the meaning of @var{list}, so that
it matches any single character @emph{not} in @var{list}. To include
@code{]} in the list, make it the first character (after
the @code{^} if needed); to include @code{-} in the list,
make it the first or last; to include @code{^}, put
it after the first character.

@cindex @code{POSIXLY_CORRECT} behavior, bracket expressions
The characters @code{$}, @code{*}, @code{.}, @code{[}, and @code{\}
are normally not special within @var{list}.
For example, @code{[\*]}
matches either @samp{\} or @samp{*}, because the @code{\} is not
special here. However, strings like @code{[.ch.]}, @code{[=a=]}, and
@code{[:space:]} are special within @var{list} and represent collating
symbols, equivalence classes, and character classes, respectively, and
@code{[} is therefore special within @var{list} when it is followed by
@code{.}, @code{=}, or @code{:}. Also, when not in
@env{POSIXLY_CORRECT} mode, special escapes like @code{\n} and
@code{\t} are recognized within @var{list}. @xref{Escapes}.

@item @var{regexp1}\|@var{regexp2}
@cindex @acronym{GNU} extensions, to basic regular expressions
Matches either @var{regexp1} or @var{regexp2}. Use
parentheses to use complex alternative regular expressions.
The matching process tries each alternative in turn, from
left to right, and the first one that succeeds is used.
It is a @acronym{GNU} extension.

@item @var{regexp1}@var{regexp2}
Matches the concatenation of @var{regexp1} and @var{regexp2}.
Concatenation binds more tightly than @code{\|}, @code{^}, and
@code{$}, but less tightly than the other regular expression
operators.

@item \@var{digit}
Matches the @var{digit}-th @code{\(@dots{}\)} parenthesized
subexpression in the regular expression. This is called a @dfn{back
reference}. Subexpressions are implicitly numbered by counting
occurrences of @code{\(} left-to-right.

@item \n
Matches the newline character.

@item \@var{char}
Matches @var{char}, where @var{char} is one of @code{$},
@code{*}, @code{.}, @code{[}, @code{\}, or @code{^}.
Note that the only C-like
backslash sequences that you can portably assume to be
interpreted are @code{\n} and @code{\\}; in particular
@code{\t} is not portable, and matches a @samp{t} under most
implementations of @command{sed}, rather than a tab character.
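
As a small sketch of escaping special characters (not part of the
original text; the input strings are made up for illustration), a
backslash turns @code{$} and @code{.} into literal characters:

@example
# \$ is a literal dollar sign, not the end of the pattern space.
echo 'cost: $5' | sed 's/\$5/five dollars/'
# \. is a literal period; a bare . would already match the x in 1x5.
echo '1x5 1.5' | sed 's/1\.5/one and a half/'
@end example

@noindent
The first command should print @samp{cost: five dollars} and the
second @samp{1x5 one and a half}.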

@end table

@cindex Greedy regular expression matching
Note that the regular expression matcher is greedy, i.e., matches
are attempted from left to right and, if two or more matches are
possible starting at the same character, it selects the longest.

@noindent
Examples:
@table @samp
@item abcdef
Matches @samp{abcdef}.

@item a*b
Matches zero or more @samp{a}s followed by a single
@samp{b}.  For example, @samp{b} or @samp{aaaaab}.

@item a\?b
Matches @samp{b} or @samp{ab}.

@item a\+b\+
Matches one or more @samp{a}s followed by one or more
@samp{b}s: @samp{ab} is the shortest possible match, but
other examples are @samp{aaaab} or @samp{abbbbb} or
@samp{aaaaaabbbbbbb}.

@item .*
@itemx .\+
These two both match all the characters in a string;
however, the first matches every string (including the empty
string), while the second matches only strings containing
at least one character.

@item ^main.*(.*)
This matches a string starting with @samp{main},
followed by an opening and closing
parenthesis.  The @samp{n}, @samp{(} and @samp{)} need not
be adjacent.

@item ^#
This matches a string beginning with @samp{#}.

@item \\$
This matches a string ending with a single backslash.  The
regexp contains two backslashes for escaping.

@item \$
In contrast, this matches a single literal dollar sign,
because the @samp{$} is escaped.

@item [a-zA-Z0-9]
In the C locale, this matches any single @acronym{ASCII} letter or digit.

@item [^ @kbd{tab}]\+
(Here @kbd{tab} stands for a single tab character.)
This matches a string of one or more
characters, none of which is a space or a tab.
Usually this means a word.

@item ^\(.*\)\n\1$
This matches a string consisting of two equal substrings separated by
a newline.

@item .\@{9\@}A$
This matches nine characters followed by an @samp{A} at the end
of the pattern space.

@item ^.\@{15\@}A
This matches the start of a string that contains 16 characters,
the last of which is an @samp{A}.

@end table



@node Common Commands
@section Often-Used Commands

If you use @command{sed} at all, you will quite likely want to know
these commands.

@table @code
@item #
[No addresses allowed.]

@findex # (comments)
@cindex Comments, in scripts
The @code{#} character begins a comment;
the comment continues until the next newline.

@cindex Portability, comments
If you are concerned about portability, be aware that
some implementations of @command{sed} (which are not @sc{posix}
conformant) may only support a single one-line comment,
and then only when the very first character of the script is a @code{#}.

@findex -n, forcing from within a script
@cindex Caveat --- #n on first line
Warning: if the first two characters of the @command{sed} script
are @code{#n}, then the @option{-n} (no-autoprint) option is forced.
If you want to put a comment in the first line of your script
and that comment begins with the letter @samp{n}
and you do not want this behavior,
then be sure to either use a capital @samp{N},
or place at least one space before the @samp{n}.

@item q [@var{exit-code}]
This command only accepts a single address.

@findex q (quit) command
@cindex @value{SSEDEXT}, returning an exit code
@cindex Quitting
Exit @command{sed} without processing any more commands or input.
Note that the current pattern space is printed if auto-print is
not disabled with the @option{-n} option.  The ability to return
an exit code from the @command{sed} script is a @value{SSED} extension.
920 921 @item d 922 @findex d (delete) command 923 @cindex Text, deleting 924 Delete the pattern space; 925 immediately start next cycle. 926 927 @item p 928 @findex p (print) command 929 @cindex Text, printing 930 Print out the pattern space (to the standard output). 931 This command is usually only used in conjunction with the @option{-n} 932 command-line option. 933 934 @item n 935 @findex n (next-line) command 936 @cindex Next input line, replace pattern space with 937 @cindex Read next input line 938 If auto-print is not disabled, print the pattern space, 939 then, regardless, replace the pattern space with the next line of input. 940 If there is no more input then @command{sed} exits without processing 941 any more commands. 942 943 @item @{ @var{commands} @} 944 @findex @{@} command grouping 945 @cindex Grouping commands 946 @cindex Command groups 947 A group of commands may be enclosed between 948 @code{@{} and @code{@}} characters. 949 This is particularly useful when you want a group of commands 950 to be triggered by a single address (or address-range) match. 951 952 @end table 953 954 @node The "s" Command 955 @section The @code{s} Command 956 957 The syntax of the @code{s} (as in substitute) command is 958 @samp{s/@var{regexp}/@var{replacement}/@var{flags}}. The @code{/} 959 characters may be uniformly replaced by any other single 960 character within any given @code{s} command. The @code{/} 961 character (or whatever other character is used in its stead) 962 can appear in the @var{regexp} or @var{replacement} 963 only if it is preceded by a @code{\} character. 964 965 The @code{s} command is probably the most important in @command{sed} 966 and has a lot of different options. Its basic concept is simple: 967 the @code{s} command attempts to match the pattern 968 space against the supplied @var{regexp}; if the match is 969 successful, then that portion of the pattern 970 space which was matched is replaced with @var{replacement}. 
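
As a small illustration of the delimiter rule and of the basic
match-and-replace behavior described above, consider the following
shell commands.  This is only a sketch: the input strings are invented
for the example, and the commands use nothing beyond the plain
@code{s} command without flags.

```shell
# Without flags, only the first match on each line is replaced.
echo 'cat and cat' | sed 's/cat/dog/'
# prints: dog and cat

# With "/" as the delimiter, every "/" inside the regexp or the
# replacement has to be escaped with a backslash ...
echo '/usr/local/bin' | sed 's/\/usr\/local/\/opt/'
# prints: /opt/bin

# ... whereas another delimiter, here ",", makes the escapes unnecessary.
echo '/usr/local/bin' | sed 's,/usr/local,/opt,'
# prints: /opt/bin
```

Choosing a delimiter that occurs in neither the @var{regexp} nor the
@var{replacement} usually keeps the command easier to read.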
971 972 @cindex Backreferences, in regular expressions 973 @cindex Parenthesized substrings 974 The @var{replacement} can contain @code{\@var{n}} (@var{n} being 975 a number from 1 to 9, inclusive) references, which refer to 976 the portion of the match which is contained between the @var{n}th 977 @code{\(} and its matching @code{\)}. 978 Also, the @var{replacement} can contain unescaped @code{&} 979 characters which reference the whole matched portion 980 of the pattern space. 981 @cindex @value{SSEDEXT}, case modifiers in @code{s} commands 982 Finally, as a @value{SSED} extension, you can include a 983 special sequence made of a backslash and one of the letters 984 @code{L}, @code{l}, @code{U}, @code{u}, or @code{E}. 985 The meaning is as follows: 986 987 @table @code 988 @item \L 989 Turn the replacement 990 to lowercase until a @code{\U} or @code{\E} is found, 991 992 @item \l 993 Turn the 994 next character to lowercase, 995 996 @item \U 997 Turn the replacement to uppercase 998 until a @code{\L} or @code{\E} is found, 999 1000 @item \u 1001 Turn the next character 1002 to uppercase, 1003 1004 @item \E 1005 Stop case conversion started by @code{\L} or @code{\U}. 1006 @end table 1007 1008 To include a literal @code{\}, @code{&}, or newline in the final 1009 replacement, be sure to precede the desired @code{\}, @code{&}, 1010 or newline in the @var{replacement} with a @code{\}. 1011 1012 @findex s command, option flags 1013 @cindex Substitution of text, options 1014 The @code{s} command can be followed by zero or more of the 1015 following @var{flags}: 1016 1017 @table @code 1018 @item g 1019 @cindex Global substitution 1020 @cindex Replacing all text matching regexp in a line 1021 Apply the replacement to @emph{all} matches to the @var{regexp}, 1022 not just the first. 1023 1024 @item @var{number} 1025 @cindex Replacing only @var{n}th match of regexp in a line 1026 Only replace the @var{number}th match of the @var{regexp}. 
1027 1028 @cindex @acronym{GNU} extensions, @code{g} and @var{number} modifier interaction in @code{s} command 1029 @cindex Mixing @code{g} and @var{number} modifiers in the @code{s} command 1030 Note: the @sc{posix} standard does not specify what should happen 1031 when you mix the @code{g} and @var{number} modifiers, 1032 and currently there is no widely agreed upon meaning 1033 across @command{sed} implementations. 1034 For @value{SSED}, the interaction is defined to be: 1035 ignore matches before the @var{number}th, 1036 and then match and replace all matches from 1037 the @var{number}th on. 1038 1039 @item p 1040 @cindex Text, printing after substitution 1041 If the substitution was made, then print the new pattern space. 1042 1043 Note: when both the @code{p} and @code{e} options are specified, 1044 the relative ordering of the two produces very different results. 1045 In general, @code{ep} (evaluate then print) is what you want, 1046 but operating the other way round can be useful for debugging. 1047 For this reason, the current version of @value{SSED} interprets 1048 specially the presence of @code{p} options both before and after 1049 @code{e}, printing the pattern space before and after evaluation, 1050 while in general flags for the @code{s} command show their 1051 effect just once. This behavior, although documented, might 1052 change in future versions. 1053 1054 @item w @var{file-name} 1055 @cindex Text, writing to a file after substitution 1056 @cindex @value{SSEDEXT}, @file{/dev/stdout} file 1057 @cindex @value{SSEDEXT}, @file{/dev/stderr} file 1058 If the substitution was made, then write out the result to the named file. 
1059 As a @value{SSED} extension, two special values of @var{file-name} are 1060 supported: @file{/dev/stderr}, which writes the result to the standard 1061 error, and @file{/dev/stdout}, which writes to the standard 1062 output.@footnote{This is equivalent to @code{p} unless the @option{-i} 1063 option is being used.} 1064 1065 @item e 1066 @cindex Evaluate Bourne-shell commands, after substitution 1067 @cindex Subprocesses 1068 @cindex @value{SSEDEXT}, evaluating Bourne-shell commands 1069 @cindex @value{SSEDEXT}, subprocesses 1070 This command allows one to pipe input from a shell command 1071 into pattern space. If a substitution was made, the command 1072 that is found in pattern space is executed and pattern space 1073 is replaced with its output. A trailing newline is suppressed; 1074 results are undefined if the command to be executed contains 1075 a @sc{nul} character. This is a @value{SSED} extension. 1076 1077 @item I 1078 @itemx i 1079 @cindex @acronym{GNU} extensions, @code{I} modifier 1080 @cindex Case-insensitive matching 1081 @ifset PERL 1082 @cindex Perl-style regular expressions, case-insensitive 1083 @end ifset 1084 The @code{I} modifier to regular-expression matching is a @acronym{GNU} 1085 extension which makes @command{sed} match @var{regexp} in a 1086 case-insensitive manner. 1087 1088 @item M 1089 @itemx m 1090 @cindex @value{SSEDEXT}, @code{M} modifier 1091 @ifset PERL 1092 @cindex Perl-style regular expressions, multiline 1093 @end ifset 1094 The @code{M} modifier to regular-expression matching is a @value{SSED} 1095 extension which causes @code{^} and @code{$} to match respectively 1096 (in addition to the normal behavior) the empty string after a newline, 1097 and the empty string before a newline. 
There are special character 1098 sequences 1099 @ifset PERL 1100 (@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'} 1101 in basic or extended regular expression modes) 1102 @end ifset 1103 @ifclear PERL 1104 (@code{\`} and @code{\'}) 1105 @end ifclear 1106 which always match the beginning or the end of the buffer. 1107 @code{M} stands for @cite{multi-line}. 1108 1109 @ifset PERL 1110 @item S 1111 @itemx s 1112 @cindex @value{SSEDEXT}, @code{S} modifier 1113 @cindex Perl-style regular expressions, single line 1114 The @code{S} modifier to regular-expression matching is only valid 1115 in Perl mode and specifies that the dot character (@code{.}) will 1116 match the newline character too. @code{S} stands for @cite{single-line}. 1117 @end ifset 1118 1119 @ifset PERL 1120 @item X 1121 @itemx x 1122 @cindex @value{SSEDEXT}, @code{X} modifier 1123 @cindex Perl-style regular expressions, extended 1124 The @code{X} modifier to regular-expression matching is also 1125 valid in Perl mode only. If it is used, whitespace in the 1126 pattern (other than in a character class) and 1127 characters between a @kbd{#} outside a character class and the 1128 next newline character are ignored. An escaping backslash 1129 can be used to include a whitespace or @kbd{#} character as part 1130 of the pattern. 1131 @end ifset 1132 @end table 1133 1134 1135 @node Other Commands 1136 @section Less Frequently-Used Commands 1137 1138 Though perhaps less frequently used than those in the previous 1139 section, some very small yet useful @command{sed} scripts can be built with 1140 these commands. 1141 1142 @table @code 1143 @item y/@var{source-chars}/@var{dest-chars}/ 1144 (The @code{/} characters may be uniformly replaced by 1145 any other single character within any given @code{y} command.) 

@findex y (transliterate) command
@cindex Transliteration
Transliterate any characters in the pattern space which match
any of the @var{source-chars} into the corresponding character
in @var{dest-chars}.

Instances of the @code{/} (or whatever other character is used in its stead),
@code{\}, or newlines can appear in the @var{source-chars} or @var{dest-chars}
lists, provided that each instance is escaped by a @code{\}.
The @var{source-chars} and @var{dest-chars} lists @emph{must}
contain the same number of characters (after de-escaping).

@item a\
@itemx @var{text}
@cindex @value{SSEDEXT}, two addresses supported by most commands
As a @acronym{GNU} extension, this command accepts two addresses.

@findex a (append text lines) command
@cindex Appending text after a line
@cindex Text, appending
Queue the lines of text which follow this command
(each but the last ending with a @code{\},
which are removed from the output)
to be output at the end of the current cycle,
or when the next input line is read.

Escape sequences in @var{text} are processed, so you should
use @code{\\} in @var{text} to print a single backslash.

As a @acronym{GNU} extension, if between the @code{a} and the newline there is
anything other than a whitespace-@code{\} sequence, then the text of this line,
starting at the first non-whitespace character after the @code{a},
is taken as the first line of the @var{text} block.
(This enables a simplification in scripting a one-line add.)
This extension also works with the @code{i} and @code{c} commands.

@item i\
@itemx @var{text}
@cindex @value{SSEDEXT}, two addresses supported by most commands
As a @acronym{GNU} extension, this command accepts two addresses.
1187 1188 @findex i (insert text lines) command 1189 @cindex Inserting text before a line 1190 @cindex Text, insertion 1191 Immediately output the lines of text which follow this command 1192 (each but the last ending with a @code{\}, 1193 which are removed from the output). 1194 1195 @item c\ 1196 @itemx @var{text} 1197 @findex c (change to text lines) command 1198 @cindex Replacing selected lines with other text 1199 Delete the lines matching the address or address-range, 1200 and output the lines of text which follow this command 1201 (each but the last ending with a @code{\}, 1202 which are removed from the output) 1203 in place of the last line 1204 (or in place of each line, if no addresses were specified). 1205 A new cycle is started after this command is done, 1206 since the pattern space will have been deleted. 1207 1208 @item = 1209 @cindex @value{SSEDEXT}, two addresses supported by most commands 1210 As a @acronym{GNU} extension, this command accepts two addresses. 1211 1212 @findex = (print line number) command 1213 @cindex Printing line number 1214 @cindex Line number, printing 1215 Print out the current input line number (with a trailing newline). 1216 1217 @item l @var{n} 1218 @findex l (list unambiguously) command 1219 @cindex List pattern space 1220 @cindex Printing text unambiguously 1221 @cindex Line length, setting 1222 @cindex @value{SSEDEXT}, setting line length 1223 Print the pattern space in an unambiguous form: 1224 non-printable characters (and the @code{\} character) 1225 are printed in C-style escaped form; long lines are split, 1226 with a trailing @code{\} character to indicate the split; 1227 the end of each line is marked with a @code{$}. 1228 1229 @var{n} specifies the desired line-wrap length; 1230 a length of 0 (zero) means to never wrap long lines. If omitted, 1231 the default as specified on the command line is used. The @var{n} 1232 parameter is a @value{SSED} extension. 
1233 1234 @item r @var{filename} 1235 @cindex @value{SSEDEXT}, two addresses supported by most commands 1236 As a @acronym{GNU} extension, this command accepts two addresses. 1237 1238 @findex r (read file) command 1239 @cindex Read text from a file 1240 @cindex @value{SSEDEXT}, @file{/dev/stdin} file 1241 Queue the contents of @var{filename} to be read and 1242 inserted into the output stream at the end of the current cycle, 1243 or when the next input line is read. 1244 Note that if @var{filename} cannot be read, it is treated as 1245 if it were an empty file, without any error indication. 1246 1247 As a @value{SSED} extension, the special value @file{/dev/stdin} 1248 is supported for the file name, which reads the contents of the 1249 standard input. 1250 1251 @item w @var{filename} 1252 @findex w (write file) command 1253 @cindex Write to a file 1254 @cindex @value{SSEDEXT}, @file{/dev/stdout} file 1255 @cindex @value{SSEDEXT}, @file{/dev/stderr} file 1256 Write the pattern space to @var{filename}. 1257 As a @value{SSED} extension, two special values of @var{file-name} are 1258 supported: @file{/dev/stderr}, which writes the result to the standard 1259 error, and @file{/dev/stdout}, which writes to the standard 1260 output.@footnote{This is equivalent to @code{p} unless the @option{-i} 1261 option is being used.} 1262 1263 The file will be created (or truncated) before the 1264 first input line is read; all @code{w} commands 1265 (including instances of @code{w} flag on successful @code{s} commands) 1266 which refer to the same @var{filename} are output without 1267 closing and reopening the file. 1268 1269 @item D 1270 @findex D (delete first line) command 1271 @cindex Delete first line from pattern space 1272 Delete text in the pattern space up to the first newline. 1273 If any text is left, restart cycle with the resultant 1274 pattern space (without reading a new line of input), 1275 otherwise start a normal new cycle. 
1276 1277 @item N 1278 @findex N (append Next line) command 1279 @cindex Next input line, append to pattern space 1280 @cindex Append next input line to pattern space 1281 Add a newline to the pattern space, 1282 then append the next line of input to the pattern space. 1283 If there is no more input then @command{sed} exits without processing 1284 any more commands. 1285 1286 @item P 1287 @findex P (print first line) command 1288 @cindex Print first line from pattern space 1289 Print out the portion of the pattern space up to the first newline. 1290 1291 @item h 1292 @findex h (hold) command 1293 @cindex Copy pattern space into hold space 1294 @cindex Replace hold space with copy of pattern space 1295 @cindex Hold space, copying pattern space into 1296 Replace the contents of the hold space with the contents of the pattern space. 1297 1298 @item H 1299 @findex H (append Hold) command 1300 @cindex Append pattern space to hold space 1301 @cindex Hold space, appending from pattern space 1302 Append a newline to the contents of the hold space, 1303 and then append the contents of the pattern space to that of the hold space. 1304 1305 @item g 1306 @findex g (get) command 1307 @cindex Copy hold space into pattern space 1308 @cindex Replace pattern space with copy of hold space 1309 @cindex Hold space, copy into pattern space 1310 Replace the contents of the pattern space with the contents of the hold space. 1311 1312 @item G 1313 @findex G (appending Get) command 1314 @cindex Append hold space to pattern space 1315 @cindex Hold space, appending to pattern space 1316 Append a newline to the contents of the pattern space, 1317 and then append the contents of the hold space to that of the pattern space. 1318 1319 @item x 1320 @findex x (eXchange) command 1321 @cindex Exchange hold space with pattern space 1322 @cindex Hold space, exchange with pattern space 1323 Exchange the contents of the hold and pattern spaces. 
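
The hold space commands are easiest to understand in combination.
The following sketch shows two small uses; the sample input is
invented for illustration, and the second command is just one possible
way to print a line together with its predecessor.

```shell
# G appends a newline plus the hold space (which is still empty here)
# to every line, so the input comes out double-spaced.
printf 'a\nb\n' | sed G

# Print each line matching /two/ together with the line before it.
# h saves every line at the end of its cycle; on a match, x;p;x;p
# prints the saved previous line first and then the current line.
printf 'one\ntwo\nthree\n' | sed -n '/two/{x;p;x;p;};h'
# prints: one
#         two
```

Note that if the very first input line matches, the second command
prints an empty line first, because the hold space is empty at that
point.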
1324 1325 @end table 1326 1327 1328 @node Programming Commands 1329 @section Commands for @command{sed} gurus 1330 1331 In most cases, use of these commands indicates that you are 1332 probably better off programming in something like @command{awk} 1333 or Perl. But occasionally one is committed to sticking 1334 with @command{sed}, and these commands can enable one to write 1335 quite convoluted scripts. 1336 1337 @cindex Flow of control in scripts 1338 @table @code 1339 @item : @var{label} 1340 [No addresses allowed.] 1341 1342 @findex : (label) command 1343 @cindex Labels, in scripts 1344 Specify the location of @var{label} for branch commands. 1345 In all other respects, a no-op. 1346 1347 @item b @var{label} 1348 @findex b (branch) command 1349 @cindex Branch to a label, unconditionally 1350 @cindex Goto, in scripts 1351 Unconditionally branch to @var{label}. 1352 The @var{label} may be omitted, in which case the next cycle is started. 1353 1354 @item t @var{label} 1355 @findex t (test and branch if successful) command 1356 @cindex Branch to a label, if @code{s///} succeeded 1357 @cindex Conditional branch 1358 Branch to @var{label} only if there has been a successful @code{s}ubstitution 1359 since the last input line was read or conditional branch was taken. 1360 The @var{label} may be omitted, in which case the next cycle is started. 1361 1362 @end table 1363 1364 @node Extended Commands 1365 @section Commands Specific to @value{SSED} 1366 1367 These commands are specific to @value{SSED}, so you 1368 must use them with care and only when you are sure that 1369 hindering portability is not evil. They allow you to check 1370 for @value{SSED} extensions or to do tasks that are required 1371 quite often, yet are unsupported by standard @command{sed}s. 
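
As a brief illustration of why such extensions can be convenient, the
following sketch contrasts the portable @code{q} command with the
@code{Q} command described in this section.  It assumes that the
@command{sed} found on the search path is @acronym{GNU} @command{sed};
the sample input and the exit code 3 are invented for the example.

```shell
# q prints the current pattern space before quitting, so the line
# that triggered it is still part of the output.
printf 'a\nb\nEND\nc\n' | sed '/^END$/q'
# prints: a, b, END (one per line)

# Q quits without printing, so the output stops before the marker line.
printf 'a\nb\nEND\nc\n' | sed '/^END$/Q'
# prints: a, b (one per line)

# Both commands accept an optional exit code as an extension.
printf 'ok\nEND\n' | sed '/^END$/Q 3' || echo "sed exited with status $?"
# prints: ok
#         sed exited with status 3
```

Other implementations of @command{sed} are likely to reject the
second and third commands, which is the portability cost mentioned
above.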
1372 1373 @table @code 1374 @item e [@var{command}] 1375 @findex e (evaluate) command 1376 @cindex Evaluate Bourne-shell commands 1377 @cindex Subprocesses 1378 @cindex @value{SSEDEXT}, evaluating Bourne-shell commands 1379 @cindex @value{SSEDEXT}, subprocesses 1380 This command allows one to pipe input from a shell command 1381 into pattern space. Without parameters, the @code{e} command 1382 executes the command that is found in pattern space and 1383 replaces the pattern space with the output; a trailing newline 1384 is suppressed. 1385 1386 If a parameter is specified, instead, the @code{e} command 1387 interprets it as a command and sends its output to the output stream 1388 (like @code{r} does). The command can run across multiple 1389 lines, all but the last ending with a back-slash. 1390 1391 In both cases, the results are undefined if the command to be 1392 executed contains a @sc{nul} character. 1393 1394 @item L @var{n} 1395 @findex L (fLow paragraphs) command 1396 @cindex Reformat pattern space 1397 @cindex Reformatting paragraphs 1398 @cindex @value{SSEDEXT}, reformatting paragraphs 1399 @cindex @value{SSEDEXT}, @code{L} command 1400 This @value{SSED} extension fills and joins lines in pattern space 1401 to produce output lines of (at most) @var{n} characters, like 1402 @code{fmt} does; if @var{n} is omitted, the default as specified 1403 on the command line is used. This command is considered a failed 1404 experiment and unless there is enough request (which seems unlikely) 1405 will be removed in future versions. 1406 1407 @ignore 1408 Blank lines, spaces between words, and indentation are 1409 preserved in the output; successive input lines with different 1410 indentation are not joined; tabs are expanded to 8 columns. 
1411 1412 If the pattern space contains multiple lines, they are joined, but 1413 since the pattern space usually contains a single line, the behavior 1414 of a simple @code{L;d} script is the same as @samp{fmt -s} (i.e., 1415 it does not join short lines to form longer ones). 1416 1417 @var{n} specifies the desired line-wrap length; if omitted, 1418 the default as specified on the command line is used. 1419 @end ignore 1420 1421 @item Q [@var{exit-code}] 1422 This command only accepts a single address. 1423 1424 @findex Q (silent Quit) command 1425 @cindex @value{SSEDEXT}, quitting silently 1426 @cindex @value{SSEDEXT}, returning an exit code 1427 @cindex Quitting 1428 This command is the same as @code{q}, but will not print the 1429 contents of pattern space. Like @code{q}, it provides the 1430 ability to return an exit code to the caller. 1431 1432 This command can be useful because the only alternative ways 1433 to accomplish this apparently trivial function are to use 1434 the @option{-n} option (which can unnecessarily complicate 1435 your script) or resorting to the following snippet, which 1436 wastes time by reading the whole file without any visible effect: 1437 1438 @example 1439 :eat 1440 $d @i{@r{Quit silently on the last line}} 1441 N @i{@r{Read another line, silently}} 1442 g @i{@r{Overwrite pattern space each time to save memory}} 1443 b eat 1444 @end example 1445 1446 @item R @var{filename} 1447 @findex R (read line) command 1448 @cindex Read text from a file 1449 @cindex @value{SSEDEXT}, reading a file a line at a time 1450 @cindex @value{SSEDEXT}, @code{R} command 1451 @cindex @value{SSEDEXT}, @file{/dev/stdin} file 1452 Queue a line of @var{filename} to be read and 1453 inserted into the output stream at the end of the current cycle, 1454 or when the next input line is read. 1455 Note that if @var{filename} cannot be read, or if its end is 1456 reached, no line is appended, without any error indication. 
1457 1458 As with the @code{r} command, the special value @file{/dev/stdin} 1459 is supported for the file name, which reads a line from the 1460 standard input. 1461 1462 @item T @var{label} 1463 @findex T (test and branch if failed) command 1464 @cindex @value{SSEDEXT}, branch if @code{s///} failed 1465 @cindex Branch to a label, if @code{s///} failed 1466 @cindex Conditional branch 1467 Branch to @var{label} only if there have been no successful 1468 @code{s}ubstitutions since the last input line was read or 1469 conditional branch was taken. The @var{label} may be omitted, 1470 in which case the next cycle is started. 1471 1472 @item v @var{version} 1473 @findex v (version) command 1474 @cindex @value{SSEDEXT}, checking for their presence 1475 @cindex Requiring @value{SSED} 1476 This command does nothing, but makes @command{sed} fail if 1477 @value{SSED} extensions are not supported, simply because other 1478 versions of @command{sed} do not implement it. In addition, you 1479 can specify the version of @command{sed} that your script 1480 requires, such as @code{4.0.5}. The default is @code{4.0} 1481 because that is the first version that implemented this command. 1482 1483 This command enables all @value{SSEDEXT} even if 1484 @env{POSIXLY_CORRECT} is set in the environment. 1485 1486 @item W @var{filename} 1487 @findex W (write first line) command 1488 @cindex Write first line to a file 1489 @cindex @value{SSEDEXT}, writing first line to a file 1490 Write to the given filename the portion of the pattern space up to 1491 the first newline. Everything said under the @code{w} command about 1492 file handling holds here too. 1493 1494 @item z 1495 @findex z (Zap) command 1496 @cindex @value{SSEDEXT}, emptying pattern space 1497 @cindex Emptying pattern space 1498 This command empties the content of pattern space. 
It is 1499 usually the same as @samp{s/.*//}, but is more efficient 1500 and works in the presence of invalid multibyte sequences 1501 in the input stream. @sc{posix} mandates that such sequences 1502 are @emph{not} matched by @samp{.}, so that there is no portable 1503 way to clear @command{sed}'s buffers in the middle of the 1504 script in most multibyte locales (including UTF-8 locales). 1505 @end table 1506 1507 @node Escapes 1508 @section @acronym{GNU} Extensions for Escapes in Regular Expressions 1509 1510 @cindex @acronym{GNU} extensions, special escapes 1511 Until this chapter, we have only encountered escapes of the form 1512 @samp{\^}, which tell @command{sed} not to interpret the circumflex 1513 as a special character, but rather to take it literally. For 1514 example, @samp{\*} matches a single asterisk rather than zero 1515 or more backslashes. 1516 1517 @cindex @code{POSIXLY_CORRECT} behavior, escapes 1518 This chapter introduces another kind of escape@footnote{All 1519 the escapes introduced here are @acronym{GNU} 1520 extensions, with the exception of @code{\n}. In basic regular 1521 expression mode, setting @code{POSIXLY_CORRECT} disables them inside 1522 bracket expressions.}---that 1523 is, escapes that are applied to a character or sequence of characters 1524 that ordinarily are taken literally, and that @command{sed} replaces 1525 with a special character. This provides a way 1526 of encoding non-printable characters in patterns in a visible manner. 1527 There is no restriction on the appearance of non-printing characters 1528 in a @command{sed} script but when a script is being prepared in the 1529 shell or by text editing, it is usually easier to use one of 1530 the following escape sequences than the binary character it 1531 represents: 1532 1533 The list of these escapes is: 1534 1535 @table @code 1536 @item \a 1537 Produces or matches a @sc{bel} character, that is an ``alert'' (@sc{ascii} 7). 
1538 1539 @item \f 1540 Produces or matches a form feed (@sc{ascii} 12). 1541 1542 @item \n 1543 Produces or matches a newline (@sc{ascii} 10). 1544 1545 @item \r 1546 Produces or matches a carriage return (@sc{ascii} 13). 1547 1548 @item \t 1549 Produces or matches a horizontal tab (@sc{ascii} 9). 1550 1551 @item \v 1552 Produces or matches a so called ``vertical tab'' (@sc{ascii} 11). 1553 1554 @item \c@var{x} 1555 Produces or matches @kbd{@sc{Control}-@var{x}}, where @var{x} is 1556 any character. The precise effect of @samp{\c@var{x}} is as follows: 1557 if @var{x} is a lower case letter, it is converted to upper case. 1558 Then bit 6 of the character (hex 40) is inverted. Thus @samp{\cz} becomes 1559 hex 1A, but @samp{\c@{} becomes hex 3B, while @samp{\c;} becomes hex 7B. 1560 1561 @item \d@var{xxx} 1562 Produces or matches a character whose decimal @sc{ascii} value is @var{xxx}. 1563 1564 @item \o@var{xxx} 1565 @ifset PERL 1566 @item \@var{xxx} 1567 @end ifset 1568 Produces or matches a character whose octal @sc{ascii} value is @var{xxx}. 1569 @ifset PERL 1570 The syntax without the @code{o} is active in Perl mode, while the one 1571 with the @code{o} is active in the normal or extended @sc{posix} regular 1572 expression modes. 1573 @end ifset 1574 1575 @item \x@var{xx} 1576 Produces or matches a character whose hexadecimal @sc{ascii} value is @var{xx}. 1577 @end table 1578 1579 @samp{\b} (backspace) was omitted because of the conflict with 1580 the existing ``word boundary'' meaning. 1581 1582 Other escapes match a particular character class and are valid only in 1583 regular expressions: 1584 1585 @table @code 1586 @item \w 1587 Matches any ``word'' character. A ``word'' character is any 1588 letter or digit or the underscore character. 1589 1590 @item \W 1591 Matches any ``non-word'' character. 

@item \b
Matches a word boundary; that is, it matches if the character
to the left is a ``word'' character and the character to the
right is a ``non-word'' character, or vice versa.

@item \B
Matches everywhere but on a word boundary; that is, it matches
if the character to the left and the character to the right
are either both ``word'' characters or both ``non-word''
characters.

@item \`
Matches only at the start of pattern space.  This is different
from @code{^} in multi-line mode.

@item \'
Matches only at the end of pattern space.  This is different
from @code{$} in multi-line mode.

@ifset PERL
@item \G
Match only at the start of pattern space or, when doing a global
substitution using the @code{s///g} command and option, at
the end-of-match position of the prior match.  For example,
@samp{s/\Ga/Z/g} will change an initial run of @code{a}s to
a run of @code{Z}s.
@end ifset
@end table

@node Examples
@chapter Some Sample Scripts

Here are some @command{sed} scripts to guide you in the art of mastering
@command{sed}.

@menu
Some exotic examples:
* Centering lines::
* Increment a number::
* Rename files to lower case::
* Print bash environment::
* Reverse chars of lines::

Emulating standard utilities:
* tac::                             Reverse lines of files
* cat -n::                          Numbering lines
* cat -b::                          Numbering non-blank lines
* wc -c::                           Counting chars
* wc -w::                           Counting words
* wc -l::                           Counting lines
* head::                            Printing the first lines
* tail::                            Printing the last lines
* uniq::                            Make duplicate lines unique
* uniq -d::                         Print duplicated lines of input
* uniq -u::                         Remove all duplicated lines
* cat -s::                          Squeezing blank lines
@end menu

@node Centering lines
@section Centering Lines

This script centers all lines of a file within an 80-column width.
To change that width, the number in @code{\@{@dots{}\@}} must be
replaced, and the number of added spaces must be changed as well.

Note how the buffer commands are used to separate parts in
the regular expressions to be matched---this is a common
technique.
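
As a rough sketch of the width change described above, the following
shell function builds the same kind of script for an arbitrary width.
The function name @code{center} and its optional width argument are
assumptions made for illustration, and the tab-to-space @code{y}
command of the full script is left out for brevity.

```shell
#!/bin/sh
# Hypothetical wrapper: center each input line in a field of the given
# width (default 80).  Not part of the original script.
center() {
  width=${1:-80}
  # A string of $width spaces; it becomes the initial hold-space content.
  pad=$(printf "%${width}s" '')
  sed -e "1{x;s/^\$/${pad}/;x;}" \
      -e 's/^ *//' -e 's/ *$//' \
      -e 'G' \
      -e "s/^\\(.\\{$((width + 1))\\}\\).*\$/\\1/" \
      -e 's/^\(.*\)\n\(.*\)\2/\2\1/'
}
```

Both numbers that the text mentions are derived from one variable here:
the pad string holds @var{width} spaces and the interval expression
keeps @var{width} plus one characters.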

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

@group
# Put 80 spaces in the buffer
1 @{
  x
  s/^$/          /
  s/^.*$/&&&&&&&&/
  x
@}
@end group

@group
# delete leading and trailing spaces
y/@kbd{tab}/ /
s/^ *//
s/ *$//
@end group

@group
# add a newline and 80 spaces to end of line
G
@end group

@group
# keep first 81 chars (80 + a newline)
s/^\(.\@{81\@}\).*$/\1/
@end group

@group
# \2 matches half of the spaces, which are moved to the beginning
s/^\(.*\)\n\(.*\)\2/\2\1/
@end group
@end example
@c end---------------------------------------------

@node Increment a number
@section Increment a Number

This script is one of a few that demonstrate how to do arithmetic
in @command{sed}.  This is indeed possible,@footnote{@command{sed} guru Greg
Ubben wrote an implementation of the @command{dc} @sc{rpn} calculator!
It is distributed together with @command{sed}.} but must be done manually.

To increment a number you add 1 to its last digit, replacing
it with the following digit.  There is one exception: when the digit
is a nine, it becomes a zero and the preceding digit must be incremented
as well, and so on until a digit other than nine is reached.

This solution by Bruno Haible is clever because
it uses a single buffer; if you don't have this limitation, the
algorithm used in @ref{cat -n, Numbering lines}, is faster.
It works by replacing trailing nines with underscores, then
using multiple @code{s} commands to increment the last digit,
and finally replacing the underscores with zeros.
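
The three stages can be followed on a single sample value.  The sketch
below is only an illustration: the variable names and the sample number
1299 are made up, and each stage is run as a separate @command{sed}
invocation so that the intermediate strings can be inspected.

```shell
#!/bin/sh
# Trace the three stages of the underscore technique on one sample number.
n=1299

# Stage 1: turn every trailing 9 into an underscore, one per iteration.
stage1=$(printf '%s\n' "$n" | sed -e ':d' -e 's/9\(_*\)$/_\1/' -e 'td')

# Stage 2: increment the last real digit.  Only the rule for 2 is needed
# for this sample; a complete script needs one rule per digit.
stage2=$(printf '%s\n' "$stage1" | sed -e 's/2\(_*\)$/3\1/')

# Stage 3: the underscores become zeros.
stage3=$(printf '%s\n' "$stage2" | sed -e 'y/_/0/')

printf '%s -> %s -> %s -> %s\n' "$n" "$stage1" "$stage2" "$stage3"
```

The trace printed for this sample is @samp{1299 -> 12__ -> 13__ -> 1300}.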

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

/[^0-9]/ d

@group
# replace all trailing 9s by _ (any other character except digits could
# be used)
:d
s/9\(_*\)$/_\1/
td
@end group

@group
# incr last digit only.  The first line adds a most-significant
# digit of 1 if we have to add a digit.
#
# The @code{tn} commands are not necessary, but make the thing
# faster
@end group

@group
s/^\(_*\)$/1\1/; tn
s/8\(_*\)$/9\1/; tn
s/7\(_*\)$/8\1/; tn
s/6\(_*\)$/7\1/; tn
s/5\(_*\)$/6\1/; tn
s/4\(_*\)$/5\1/; tn
s/3\(_*\)$/4\1/; tn
s/2\(_*\)$/3\1/; tn
s/1\(_*\)$/2\1/; tn
s/0\(_*\)$/1\1/; tn
@end group

@group
:n
y/_/0/
@end group
@end example
@c end---------------------------------------------

@node Rename files to lower case
@section Rename Files to Lower Case

This is a pretty strange use of @command{sed}.  We transform text
into shell commands, then just feed them to the shell.
Don't worry, even worse hacks are done when using @command{sed}; I have
seen a script converting the output of @command{date} into a @command{bc}
program!

The main body of this is the @command{sed} script, which remaps the name
from lower to upper case (or vice versa) and even checks
whether the remapped name is the same as the original name.
Note how the script is parameterized using shell
variables and proper quoting.

@c start-------------------------------------------
@example
@group
#! /bin/sh
# rename files to lower/upper case...
#
# usage:
#    move-to-lower *
#    move-to-upper *
# or
#    move-to-lower -R .
#    move-to-upper -R .
#
@end group

@group
help()
@{
        cat << eof
Usage: $0 [-n] [-R] [-h] files...
@end group

@group
-n      do nothing, only see what would be done
-R      recursive (use find)
-h      this message
files   files to remap to lower case
@end group

@group
Examples:
       $0 -n *        (see if everything is ok, then...)
       $0 *
@end group

       $0 -R .

@group
eof
@}
@end group

@group
apply_cmd='sh'
finder='echo "$@@" | tr " " "\n"'
files_only=
@end group

@group
while :
do
    case "$1" in
        -n) apply_cmd='cat' ;;
        -R) finder='find "$@@" -type f';;
        -h) help ; exit 1 ;;
        *) break ;;
    esac
    shift
done
@end group

@group
if [ -z "$1" ]; then
        echo Usage: $0 [-h] [-n] [-R] files...
        exit 1
fi
@end group

@group
LOWER='abcdefghijklmnopqrstuvwxyz'
UPPER='ABCDEFGHIJKLMNOPQRSTUVWXYZ'
@end group

@group
case `basename $0` in
        *upper*) TO=$UPPER; FROM=$LOWER ;;
        *)       FROM=$UPPER; TO=$LOWER ;;
esac
@end group

eval $finder | sed -n '

@group
# remove all trailing slashes
s/\/*$//
@end group

@group
# add ./ if there is no path, only a filename
/\//! s/^/.\//
@end group

@group
# save path+filename
h
@end group

@group
# remove path
s/.*\///
@end group

@group
# do conversion only on filename
y/'$FROM'/'$TO'/
@end group

@group
# now line contains original path+file, while
# hold space contains the new filename
x
@end group

@group
# add converted file name to line, which now contains
# path/file-name\nconverted-file-name
G
@end group

@group
# check if converted file name is equal to original file name,
# if it is, do not print anything
/^.*\/\(.*\)\n\1/b
@end group

@group
# now, transform path/fromfile\n, into
# mv path/fromfile path/tofile and print it
s/^\(.*\/\)\(.*\)\n\(.*\)$/mv "\1\2" "\1\3"/p
@end group

' | $apply_cmd
@end example
@c end---------------------------------------------

@node Print bash environment
@section Print @command{bash} Environment

This script strips the definition of the shell functions
from the output of the @command{set} Bourne-shell command.

@c start-------------------------------------------
@example
#!/bin/sh

@group
set | sed -n '
:x
@end group

@group
@ifinfo
# if no occurrence of "=()" print and load next line
@end ifinfo
@ifnotinfo
# if no occurrence of @samp{=()} print and load next line
@end ifnotinfo
/=()/! @{ p; b; @}
/ () $/! @{ p; b; @}
@end group

@group
# possible start of functions section
# save the line in case this is a var like FOO="() "
h
@end group

@group
# if the next line has a brace, we quit because
# nothing comes after functions
n
/^@{/ q
@end group

@group
# print the old line
x; p
@end group

@group
# work on the new line now
x; bx
'
@end group
@end example
@c end---------------------------------------------

@node Reverse chars of lines
@section Reverse Characters of Lines

This script can be used to reverse the position of characters
in lines.  The technique moves two characters at a time, hence
it is faster than more intuitive implementations.

Note the @code{tx} command before the definition of the label.
This is often needed to reset the flag that is tested by
the @code{t} command.

Imaginative readers will find uses for this script.  An example
is reversing the output of @command{banner}.@footnote{This requires
another script to pad the output of @command{banner}; for example:

@example
#! /bin/sh

banner -w $1 $2 $3 $4 |
  sed -e :a -e '/^.\@{0,'$1'\@}$/ @{ s/$/ /; ba; @}' |
  ~/sedscripts/reverseline.sed
@end example
}

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

/../! b

@group
# Reverse a line.  Begin embedding the line between two newlines
s/^.*$/\
&\
/
@end group

@group
# Move first character to the end.
# The regexp matches until
# there are zero or one characters between the markers
tx
:x
s/\(\n.\)\(.*\)\(.\n\)/\3\2\1/
tx
@end group

@group
# Remove the newline markers
s/\n//g
@end group
@end example
@c end---------------------------------------------

@node tac
@section Reverse Lines of Files

This one begins a series of totally useless (yet interesting)
scripts emulating various Unix commands.  This, in particular,
is a @command{tac} workalike.

Note that on implementations other than @acronym{GNU} @command{sed}
@ifset PERL
and @value{SSED}
@end ifset
this script might easily overflow internal buffers.

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

# reverse all lines of input, i.e. first line becomes last, ...

@group
# from the second line, the buffer (which contains all previous lines)
# is *appended* to current line, so, the order will be reversed
1! G
@end group

@group
# on the last line we're done -- print everything
$ p
@end group

@group
# store everything on the buffer again
h
@end group
@end example
@c end---------------------------------------------

@node cat -n
@section Numbering Lines

This script replaces @samp{cat -n}; in fact it formats its output
exactly like @acronym{GNU} @command{cat} does.

Of course this is completely useless, for two reasons: first,
because somebody else did it in C; second, because the following
Bourne-shell script could be used for the same purpose and would
be much faster:

@c start-------------------------------------------
@example
@group
#! /bin/sh
sed -e "=" $@@ | sed -e '
  s/^/      /
  N
  s/^ *\(......\)\n/\1 /
'
@end group
@end example
@c end---------------------------------------------

It uses @command{sed} to print the line number, then groups lines two
by two using @code{N}.  Of course, this script does not teach as much as
the one presented below.

The algorithm used for incrementing uses both buffers, so the line
is printed as soon as possible and then discarded.  The number
is split so that changing digits go in a buffer and unchanged ones go
in the other; the changed digits are modified in a single step
(using a @code{y} command).  The line number for the next line
is then composed and stored in the hold space, to be used in the
next iteration.

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

@group
# Prime the pump on the first line
x
/^$/ s/^.*$/1/
@end group

@group
# Add the correct line number before the pattern
G
h
@end group

@group
# Format it and print it
s/^/      /
s/^ *\(......\)\n/\1 /p
@end group

@group
# Get the line number from hold space; add a zero
# if we're going to add a digit on the next line
g
s/\n.*$//
/^9*$/ s/^/0/
@end group

@group
# separate changing/unchanged digits with an x
s/.9*$/x&/
@end group

@group
# keep changing digits in hold space
h
s/^.*x//
y/0123456789/1234567890/
x
@end group

@group
# keep unchanged digits in pattern space
s/x.*$//
@end group

@group
# compose the new number, remove the newline implicitly added by G
G
s/\n//
h
@end group
@end example
@c end---------------------------------------------

@node cat -b
@section Numbering Non-blank Lines

Emulating @samp{cat -b} is almost the same as @samp{cat -n}---we only
have to select which lines are to be numbered and which are not.

The part that is common to this script and the previous one is
not commented to show how important it is to comment @command{sed}
scripts properly...

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

@group
/^$/ @{
  p
  b
@}
@end group

@group
# Same as cat -n from now
x
/^$/ s/^.*$/1/
G
h
s/^/      /
s/^ *\(......\)\n/\1 /p
x
s/\n.*$//
/^9*$/ s/^/0/
s/.9*$/x&/
h
s/^.*x//
y/0123456789/1234567890/
x
s/x.*$//
G
s/\n//
h
@end group
@end example
@c end---------------------------------------------

@node wc -c
@section Counting Characters

This script shows another way to do arithmetic with @command{sed}.
In this case we have to add possibly large numbers, so implementing
this by successive increments would not be feasible (and possibly
even more complicated to contrive than this script).

The approach is to map numbers to letters, kind of an abacus
implemented with @command{sed}.  @samp{a}s are units, @samp{b}s are
tens and so on: we simply add the number of characters
on the current line as units, and then propagate the carry
to tens, hundreds, and so on.

As usual, running totals are kept in hold space.

On the last line, we convert the abacus form back to decimal.
For the sake of variety, this is done with a loop rather than
with some 80 @code{s} commands@footnote{Some implementations
have a limit of 199 commands per script}: first we
convert units, removing @samp{a}s from the number; then we
rotate letters so that tens become @samp{a}s, and so on
until no more letters remain.
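
The abacus encoding and the conversion loop can be tried in isolation.
The following is a minimal sketch, not the script itself: the helper
names @code{to_abacus} and @code{from_abacus} are hypothetical, only the
letters @samp{a} to @samp{c} are handled (so counts must stay below
1000), and the trailing newline is not counted.

```shell
#!/bin/sh
# Encode the length of a string as an abacus:
# a = units, b = tens, c = hundreds (hypothetical helper, counts < 1000).
to_abacus() {
  printf '%s\n' "$1" | sed -e 's/./a/g' \
                           -e 's/aaaaaaaaaa/b/g' \
                           -e 's/bbbbbbbbbb/c/g'
}

# Convert an abacus string back to decimal, one digit per pass:
# turn the a's into a digit, then rotate the letters so that the
# next power of ten becomes the units.
from_abacus() {
  printf '%s\n' "$1" | sed -e ':loop' \
      -e '/a/!s/[bc]*/&0/' \
      -e 's/aaaaaaaaa/9/' -e 's/aaaaaaaa/8/' -e 's/aaaaaaa/7/' \
      -e 's/aaaaaa/6/' -e 's/aaaaa/5/' -e 's/aaaa/4/' \
      -e 's/aaa/3/' -e 's/aa/2/' -e 's/a/1/' \
      -e 'y/bc/ab/' \
      -e '/[abc]/b loop'
}
```

For instance, @samp{to_abacus 'hello, world'} yields @samp{baa} (one
ten, two units), and @samp{from_abacus cbbaaa} yields @samp{123}.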

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

@group
# Add n+1 a's to hold space (+1 is for the newline)
s/./a/g
H
x
s/\n/a/
@end group

@group
# Do the carry.  The t's and b's are not necessary,
# but they do speed up the thing
t a
: a; s/aaaaaaaaaa/b/g; t b; b done
: b; s/bbbbbbbbbb/c/g; t c; b done
: c; s/cccccccccc/d/g; t d; b done
: d; s/dddddddddd/e/g; t e; b done
: e; s/eeeeeeeeee/f/g; t f; b done
: f; s/ffffffffff/g/g; t g; b done
: g; s/gggggggggg/h/g; t h; b done
: h; s/hhhhhhhhhh//g
@end group

@group
: done
$! @{
  h
  b
@}
@end group

# On the last line, convert back to decimal

@group
: loop
/a/! s/[b-h]*/&0/
s/aaaaaaaaa/9/
s/aaaaaaaa/8/
s/aaaaaaa/7/
s/aaaaaa/6/
s/aaaaa/5/
s/aaaa/4/
s/aaa/3/
s/aa/2/
s/a/1/
@end group

@group
: next
y/bcdefgh/abcdefg/
/[a-h]/ b loop
p
@end group
@end example
@c end---------------------------------------------

@node wc -w
@section Counting Words

This script is almost the same as the previous one, once each
of the words on the line is converted to a single @samp{a}
(in the previous script each letter was changed to an @samp{a}).

It is interesting that real @command{wc} programs have optimized
loops for @samp{wc -c}, so they are much slower at counting
words than at counting characters.  This script's bottleneck,
instead, is arithmetic, and hence the word-counting one
is faster (it has to manage smaller numbers).

Again, the common parts are not commented to show the importance
of commenting @command{sed} scripts.
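
The word-to-letter step can be checked on its own before any counting
is done.  This is a small sketch under two assumptions: the function
name @code{words_to_a} is made up, and the @sc{posix} class
@code{[[:blank:]]} stands in for the literal space and tab of the
script.

```shell
#!/bin/sh
# Hypothetical helper: replace every word on each input line with a
# single "a", so that the length of the result is the word count.
words_to_a() {
  sed -e 's/[[:blank:]][[:blank:]]*/ /g' \
      -e 's/^/ /' \
      -e 's/ [^ ][^ ]*/a /g' \
      -e 's/ //g'
}

printf 'the quick  brown fox\n' | words_to_a
```

The sample line contains four words, so the command prints @samp{aaaa}.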

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

@group
# Convert words to a's
s/[ @kbd{tab}][ @kbd{tab}]*/ /g
s/^/ /
s/ [^ ][^ ]*/a /g
s/ //g
@end group

@group
# Append them to hold space
H
x
s/\n//
@end group

@group
# From here on it is the same as in wc -c.
/aaaaaaaaaa/! bx; s/aaaaaaaaaa/b/g
/bbbbbbbbbb/! bx; s/bbbbbbbbbb/c/g
/cccccccccc/! bx; s/cccccccccc/d/g
/dddddddddd/! bx; s/dddddddddd/e/g
/eeeeeeeeee/! bx; s/eeeeeeeeee/f/g
/ffffffffff/! bx; s/ffffffffff/g/g
/gggggggggg/! bx; s/gggggggggg/h/g
s/hhhhhhhhhh//g
:x
$! @{ h; b; @}
:y
/a/! s/[b-h]*/&0/
s/aaaaaaaaa/9/
s/aaaaaaaa/8/
s/aaaaaaa/7/
s/aaaaaa/6/
s/aaaaa/5/
s/aaaa/4/
s/aaa/3/
s/aa/2/
s/a/1/
y/bcdefgh/abcdefg/
/[a-h]/ by
p
@end group
@end example
@c end---------------------------------------------

@node wc -l
@section Counting Lines

No strange things are done now, because @command{sed} gives us
@samp{wc -l} functionality for free!!!  Look:

@c start-------------------------------------------
@example
@group
#!/usr/bin/sed -nf
$=
@end group
@end example
@c end---------------------------------------------

@node head
@section Printing the First Lines

This script is probably the simplest useful @command{sed} script.
It displays the first 10 lines of input; the number of displayed
lines is right before the @code{q} command.
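
Since the line count is just the address in front of @code{q}, it can
also be supplied from the shell.  The wrapper below is a sketch; the
name @code{sed_head} and its argument order are assumptions, and a
count of zero is not handled because @samp{0q} is not a valid address.

```shell
#!/bin/sh
# Hypothetical wrapper: print the first N lines (default 10) of the
# named files, or of standard input when no file is given.
sed_head() {
  n=${1:-10}
  [ $# -gt 0 ] && shift
  sed "${n}q" "$@"
}

printf '1\n2\n3\n4\n' | sed_head 2
```

The last command prints the lines @samp{1} and @samp{2}.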

@c start-------------------------------------------
@example
@group
#!/usr/bin/sed -f
10q
@end group
@end example
@c end---------------------------------------------

@node tail
@section Printing the Last Lines

Printing the last @var{n} lines rather than the first is more complex
but indeed possible.  @var{n} is encoded in the second line, before
the bang character.

This script is similar to the @command{tac} script in that it keeps the
final output in the hold space and prints it at the end:

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

@group
1! @{; H; g; @}
1,10 !s/[^\n]*\n//
$p
h
@end group
@end example
@c end---------------------------------------------

Mainly, the script keeps a window of 10 lines and slides it
by adding a line and deleting the oldest (the substitution command
on the second line works like a @code{D} command but does not
restart the loop).

The ``sliding window'' technique is a very powerful way to write
efficient and complex @command{sed} scripts, because commands like
@code{P} would require a lot of work if implemented manually.
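
The window size of the hold-space script can be made a parameter in
the same way.  The following sketch assumes a wrapper named
@code{sed_tail}, which is not part of the original script, and it
keeps the script's @code{[^\n]} bracket expression, which is written
for @acronym{GNU} @command{sed}.

```shell
#!/bin/sh
# Hypothetical wrapper: print the last N lines (default 10) of the
# named files, or of standard input when no file is given.
sed_tail() {
  n=${1:-10}
  [ $# -gt 0 ] && shift
  # Once more than N lines have been collected, each new line pushes
  # the oldest one out of the window.
  sed -n -e '1!{H;g;}' \
         -e "1,${n}"'!s/[^\n]*\n//' \
         -e '$p' \
         -e 'h' "$@"
}

printf '1\n2\n3\n4\n' | sed_tail 2
```

The last command prints the lines @samp{3} and @samp{4}; when the input
has fewer than @var{n} lines, all of it is printed.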

To introduce the technique, which is fully demonstrated in the
rest of this chapter and is based on the @code{N}, @code{P}
and @code{D} commands, here is an implementation of @command{tail}
using a simple ``sliding window.''

This looks complicated, but in fact it works the same way as
the last script: after we have kicked in the appropriate number
of lines, however, we stop using the hold space to keep inter-line
state, and instead use @code{N} and @code{D} to slide pattern
space by one line:

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

@group
1h
2,10 @{; H; g; @}
$q
1,9d
N
D
@end group
@end example
@c end---------------------------------------------

Note how the first, second and fourth lines are inactive after
the first ten lines of input.  After that, all the script does
is exit on the last line of input, append the next input
line to pattern space, and remove the first line.

@node uniq
@section Make Duplicate Lines Unique

This is an example of the art of using the @code{N}, @code{P}
and @code{D} commands, probably the most difficult to master.

@c start-------------------------------------------
@example
@group
#!/usr/bin/sed -f
h
@end group

@group
:b
# On the last line, print and exit
$b
N
/^\(.*\)\n\1$/ @{
    # The two lines are identical.  Undo the effect of
    # the N command.
    g
    bb
@}
@end group

@group
# If the @code{N} command had added the last line, print and exit
$b
@end group

@group
# The lines are different; print the first and go
# back working on the second.
P
D
@end group
@end example
@c end---------------------------------------------

As you can see, we maintain a 2-line window using @code{P} and @code{D}.
This technique is often used in advanced @command{sed} scripts.

@node uniq -d
@section Print Duplicated Lines of Input

This script prints only duplicated lines, like @samp{uniq -d}.

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

@group
$b
N
/^\(.*\)\n\1$/ @{
    # Print the first of the duplicated lines
    s/.*\n//
    p
@end group

@group
    # Loop until we get a different line
    :b
    $b
    N
    /^\(.*\)\n\1$/ @{
        s/.*\n//
        bb
    @}
@}
@end group

@group
# The last line cannot be followed by duplicates
$b
@end group

@group
# Found a different one.  Leave it alone in the pattern space
# and go back to the top, hunting its duplicates
D
@end group
@end example
@c end---------------------------------------------

@node uniq -u
@section Remove All Duplicated Lines

This script prints only unique lines, like @samp{uniq -u}.

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

@group
# Search for a duplicate line --- until that, print what you find.
$b
N
/^\(.*\)\n\1$/ ! @{
    P
    D
@}
@end group

@group
:c
# Got two equal lines in pattern space.
# At the end of the file we simply exit
$d
@end group

@group
# Else, we keep reading lines with @code{N} until we
# find a different one
s/.*\n//
N
/^\(.*\)\n\1$/ @{
    bc
@}
@end group

@group
# Remove the last instance of the duplicate line
# and go back to the top
D
@end group
@end example
@c end---------------------------------------------

@node cat -s
@section Squeezing Blank Lines

As a final example, here are three scripts, of increasing complexity
and speed, that implement the same function as @samp{cat -s}, that is,
squeezing blank lines.

The first leaves a blank line at the beginning and end if there are
some already.

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

@group
# on empty lines, join with next
# Note there is a star in the regexp
:x
/^\n*$/ @{
  N
  bx
@}
@end group

@group
# now, squeeze all '\n', this can be also done by:
# s/^\(\n\)*/\1/
s/\n*/\
/
@end group
@end example
@c end---------------------------------------------

This one is a bit more complex and removes all empty lines
at the beginning.  It does leave a single blank line at the end
if one was there.

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

@group
# delete all leading empty lines
1,/^./@{
/./!d
@}
@end group

@group
# on an empty line we remove it and all the following
# empty lines, but one
:x
/./!@{
N
s/^\n$//
tx
@}
@end group
@end example
@c end---------------------------------------------

This removes leading and trailing blank lines.  It is also the
fastest.
Note that loops are completely done with @code{n} and
@code{b}, without relying on @command{sed} to restart
the script automatically at the end of a line.

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

@group
# delete all (leading) blanks
/./!d
@end group

@group
# get here: so there is a non empty
:x
# print it
p
# get next
n
# got chars? print it again, etc...
/./bx
@end group

@group
# no, don't have chars: got an empty line
:z
# get next, if last line we finish here so no trailing
# empty lines are written
n
# also empty? then ignore it, and get next... this will
# remove ALL empty lines
/./!bz
@end group

@group
# all empty lines were deleted/ignored, but we have a non empty.  As
# what we want to do is to squeeze, insert a blank line artificially
i\
@end group

bx
@end example
@c end---------------------------------------------

@node Limitations
@chapter @value{SSED}'s Limitations and Non-limitations

@cindex @acronym{GNU} extensions, unlimited line length
@cindex Portability, line length limitations
For those who want to write portable @command{sed} scripts,
be aware that some implementations have been known to
limit line lengths (for the pattern and hold spaces)
to be no more than 4000 bytes.
The @sc{posix} standard specifies that conforming @command{sed}
implementations shall support at least 8192 byte line lengths.
@value{SSED} has no built-in limit on line length;
as long as it can @code{malloc()} more (virtual) memory,
you can feed or construct lines as long as you like.

However, recursion is used to handle subpatterns and indefinite
repetition.
This means that the available stack space may limit
the size of the buffer that can be processed by certain patterns.

@ifset PERL
There are some size limitations in the regular expression
matcher but it is hoped that they will never in practice
be relevant.  The maximum length of a compiled pattern
is 65539 (sic) bytes.  All values in repeating quantifiers
must be less than 65536.  The maximum nesting depth of
all parenthesized subpatterns, including capturing and
non-capturing subpatterns@footnote{The
distinction is meaningful when referring to Perl-style
regular expressions.}, assertions, and other types of
subpattern, is 200.

Also, @value{SSED} recognizes the @sc{posix} syntax
@code{[.@var{ch}.]} and @code{[=@var{ch}=]}
where @var{ch} is a ``collating element'', but these
are not supported, and an error is given if they are
encountered.

Here are a few distinctions between the real Perl-style
regular expressions and those that @option{-R} recognizes.

@enumerate
@item
Lookahead assertions do not allow repeat quantifiers after them.
Perl permits them, but they do not mean what you
might think.  For example, @samp{(?!a)@{3@}} does not assert that the
next three characters are not @samp{a}.  It just asserts three times that the
next character is not @samp{a} --- a waste of time and nothing else.

@item
Capturing subpatterns that occur inside negative lookahead
assertions are counted, but their entries are counted
as empty in the second half of an @code{s} command.
Perl sets its numerical variables from any such patterns
that are matched before the assertion fails to match
something (thereby succeeding), but only if the negative
lookahead assertion contains just one branch.

@item
The following Perl escape sequences are not supported:
@samp{\l}, @samp{\u}, @samp{\L}, @samp{\U}, @samp{\E},
@samp{\Q}.  In fact these are implemented by Perl's general
string-handling and are not part of its pattern matching engine.

@item
The Perl @samp{\G} assertion is not supported as it is not
relevant to single pattern matches.

@item
Fairly obviously, @value{SSED} does not support the @samp{(?@{code@})}
and @samp{(?p@{code@})} constructions.  However, there is some experimental
support for recursive patterns using the non-Perl item @samp{(?R)}.

@item
There are at the time of writing some oddities in Perl
5.005_02 concerned with the settings of captured strings
when part of a pattern is repeated.  For example, matching
@samp{aba} against the pattern @samp{/^(a(b)?)+$/} sets
@samp{$2}@footnote{@samp{$2} would be @samp{\2} in @value{SSED}.}
to the value @samp{b}, but matching @samp{aabbaa}
against @samp{/^(aa(bb)?)+$/} leaves @samp{$2}
unset.  However, if the pattern is changed to
@samp{/^(aa(b(b))?)+$/} then @samp{$2} (and @samp{$3}) are set.
In Perl 5.004 @samp{$2} is set in both cases, and that is also
true of @value{SSED}.

@item
Another as yet unresolved discrepancy is that in Perl
5.005_02 the pattern @samp{/^(a)?(?(1)a|b)+$/} matches
the string @samp{a}, whereas in @value{SSED} it does not.
However, in both Perl and @value{SSED} @samp{/^(a)?a/} matched
against @samp{a} leaves @samp{$1} unset.
@end enumerate
@end ifset

@node Other Resources
@chapter Other Resources for Learning About @command{sed}

@cindex Additional reading about @command{sed}
In addition to several books that have been written about @command{sed}
(either specifically or as chapters in books which discuss
shell programming), one can find out more about @command{sed}
(including suggestions of a few books) from the FAQ
for the @code{sed-users} mailing list, available from:
@display
@uref{http://sed.sourceforge.net/sedfaq.html}
@end display

Also of interest are
@uref{http://www.student.northpark.edu/pemente/sed/index.htm}
and @uref{http://sed.sf.net/grabbag},
which include @command{sed} tutorials and other @command{sed}-related goodies.

The @code{sed-users} mailing list itself is maintained by Sven Guckes.
To subscribe, visit @uref{http://groups.yahoo.com} and search
for the @code{sed-users} mailing list.

@node Reporting Bugs
@chapter Reporting Bugs

@cindex Bugs, reporting
Email bug reports to @email{bonzini@@gnu.org}.
Be sure to include the word ``sed'' somewhere in the @code{Subject:} field.
Also, please include the output of @samp{sed --version} in the body
of your report if at all possible.

Please do not send a bug report like this:

@example
@i{@i{@r{while building frobme-1.3.4}}}
$ configure
@error{} sed: file sedscr line 1: Unknown option to 's'
@end example

If @value{SSED} doesn't configure your favorite package, take a
few extra minutes to identify the specific problem and make a stand-alone
test case.  Unlike other programs such as C compilers, making such test
cases for @command{sed} is quite simple.

A stand-alone test case includes all the data necessary to perform the
test, and the specific invocation of @command{sed} that causes the problem.
The smaller a stand-alone test case is, the better.  A test case should
not involve something as far removed from @command{sed} as ``try to configure
frobme-1.3.4''.  Yes, that is in principle enough information to look
for the bug, but that is not a very practical prospect.

Here are a few commonly reported bugs that are not bugs.

@table @asis
@item @code{N} command on the last line
@cindex Portability, @code{N} command on the last line
@cindex Non-bugs, @code{N} command on the last line

Most versions of @command{sed} exit without printing anything when
the @command{N} command is issued on the last line of a file.
@value{SSED} prints pattern space before exiting unless, of course,
the @option{-n} option has been specified.  This choice is
by design.

For example, the behavior of
@example
sed N foo bar
@end example
@noindent
would depend on whether @file{foo} has an even or an odd number of
lines@footnote{which is the actual ``bug'' that prompted the
change in behavior}.  Or, when writing a script to read the
next few lines following a pattern match, traditional
implementations of @code{sed} would force you to write
something like
@example
/foo/@{ $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N @}
@end example
@noindent
instead of just
@example
/foo/@{ N;N;N;N;N;N;N;N;N; @}
@end example

@cindex @code{POSIXLY_CORRECT} behavior, @code{N} command
In any case, the simplest workaround is to use @code{$d;N} in
scripts that rely on the traditional behavior, or to set
the @code{POSIXLY_CORRECT} variable to a non-empty value.

@item Regex syntax clashes (problems with backslashes)
@cindex @acronym{GNU} extensions, to basic regular expressions
@cindex Non-bugs, regex syntax clashes
@command{sed} uses the @sc{posix} basic regular expression syntax.
According to
the standard, the meaning of some escape sequences is undefined in
this syntax; notable in the case of @command{sed} are @code{\|},
@code{\+}, @code{\?}, @code{\`}, @code{\'}, @code{\<},
@code{\>}, @code{\b}, @code{\B}, @code{\w}, and @code{\W}.

As in all @acronym{GNU} programs that use @sc{posix} basic regular
expressions, @command{sed} interprets these escape sequences as special
characters.  So, @code{x\+} matches one or more occurrences of @samp{x}.
@code{abc\|def} matches either @samp{abc} or @samp{def}.

This syntax may cause problems when running scripts written for other
@command{sed}s.  Some @command{sed} programs have been written with the
assumption that @code{\|} and @code{\+} match the literal characters
@code{|} and @code{+}.  Such scripts must be modified by removing the
spurious backslashes if they are to be used with modern implementations
of @command{sed}, like
@ifset PERL
@value{SSED} or
@end ifset
@acronym{GNU} @command{sed}.

On the other hand, some scripts use @samp{s|abc\|def||g} to remove occurrences
of @emph{either} @code{abc} or @code{def}.  While this worked until
@command{sed} 4.0.x, newer versions interpret this as removing the
string @code{abc|def}.  This is again undefined behavior according to
@acronym{POSIX}, and this interpretation is arguably more robust: older
@command{sed}s, for example, required that the regex matcher parse
@code{\/} as @code{/} in the common case of escaping a slash, which is
again undefined behavior; the new behavior avoids this, and this is good
because the regex matcher is only partially under our control.
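
As a quick check of the @acronym{GNU} interpretation of @code{\+} and
@code{\|} described in this item, the following shell sketch may help.
It assumes that the @command{sed} found on the search path is
@acronym{GNU} @command{sed}; the variable names @code{plus} and @code{alt}
are only illustrative.  An implementation that treats these escapes as
literal characters would leave both inputs unchanged.

```shell
# Assumption: "sed" is GNU sed, which treats \+ and \| as special in
# basic regular expressions.

# \+ means "one or more of the preceding item".
plus=$(printf '%s\n' 'axxxb' | sed 's/x\+/-/')

# \| separates two alternatives.
alt=$(printf '%s\n' 'abc def ghi' | sed 's/abc\|def/-/g')

printf '%s\n' "$plus" "$alt"
```

Running the same two commands with another @command{sed} is a simple way
to find out whether a script that relies on these escapes is portable.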

@cindex @acronym{GNU} extensions, special escapes
In addition, this version of @command{sed} supports several escape characters
(some of which are multi-character) to insert non-printable characters
in scripts (@code{\a}, @code{\c}, @code{\d}, @code{\o}, @code{\r},
@code{\t}, @code{\v}, @code{\x}).  These can cause similar problems
with scripts written for other @command{sed}s.

@item @option{-i} clobbers read-only files
@cindex In-place editing
@cindex @value{SSEDEXT}, in-place editing
@cindex Non-bugs, in-place editing

In short, @samp{sed -i} will let you delete the contents of
a read-only file, and in general the @option{-i} option
(@pxref{Invoking sed, , Invocation}) lets you clobber
protected files.  This is not a bug, but rather a consequence
of how the Unix filesystem works.

The permissions on a file say what can happen to the data
in that file, while the permissions on a directory say what can
happen to the list of files in that directory.  @samp{sed -i}
will not ever open for writing a file that is already on disk.
Rather, it will work on a temporary file that is finally renamed
to the original name: if you rename or delete files, you're actually
modifying the contents of the directory, so the operation depends on
the permissions of the directory, not of the file.  For this same
reason, @command{sed} does not let you use @option{-i} on a writeable file
in a read-only directory, and will break hard or symbolic links when
@option{-i} is used on such a file.

@item @code{0a} does not work (gives an error)
@cindex @code{0} address
@cindex @acronym{GNU} extensions, @code{0} address
@cindex Non-bugs, @code{0} address

There is no line 0.
0 is a special address that is only used to treat
addresses like @code{0,/@var{RE}/} as active when the script starts: if
you write @code{1,/abc/d} and the first line includes the word @samp{abc},
then that match would be ignored because address ranges must span at least
two lines (barring the end of the file); but what you probably wanted is
to delete every line up to the first one including @samp{abc}, and this
is obtained with @code{0,/abc/d}.

@ifclear PERL
@item @code{[a-z]} is case insensitive
@cindex Non-bugs, localization-related

You are encountering problems with locales.  POSIX mandates that @code{[a-z]}
uses the current locale's collation order -- in C parlance, that means using
@code{strcoll(3)} instead of @code{strcmp(3)}.  Some locales have a
case-insensitive collation order, others don't.

Another problem is that @code{[a-z]} tries to use collation symbols.
This only happens if you are on the @acronym{GNU} system, using
@acronym{GNU} libc's regular expression matcher instead of compiling the
one supplied with @acronym{GNU} sed.  In a Danish locale, for example,
the regular expression @code{^[a-z]$} matches the string @samp{aa},
because this is a single collating symbol that comes after @samp{a}
and before @samp{b}; @samp{ll} behaves similarly in Spanish
locales, or @samp{ij} in Dutch locales.

To work around these problems, which may cause bugs in shell scripts, set
the @env{LC_COLLATE} and @env{LC_CTYPE} environment variables to @samp{C}.

@item @code{s/.*//} does not clear pattern space
@cindex Non-bugs, localization-related
@cindex @value{SSEDEXT}, emptying pattern space
@cindex Emptying pattern space

This happens if your input stream includes invalid multibyte
sequences.
@sc{posix} mandates that such sequences
are @emph{not} matched by @samp{.}, so that @samp{s/.*//} will not clear
pattern space as you would expect.  In fact, there is no way to clear
@command{sed}'s buffers in the middle of the script in most multibyte locales
(including UTF-8 locales).  For this reason, @value{SSED} provides a @code{z}
command (for ``zap'') as an extension.

To work around these problems, which may cause bugs in shell scripts, set
the @env{LC_COLLATE} and @env{LC_CTYPE} environment variables to @samp{C}.
@end ifclear
@end table


@node Extended regexps
@appendix Extended regular expressions
@cindex Extended regular expressions, syntax

The only difference between basic and extended regular expressions is in
the behavior of a few characters: @samp{?}, @samp{+}, parentheses,
and braces (@samp{@{@}}).  While basic regular expressions require
these to be escaped if you want them to behave as special characters,
when using extended regular expressions you must escape them if
you want them @emph{to match a literal character}.

@noindent
Examples:
@table @code
@item abc?
becomes @samp{abc\?} when using extended regular expressions.  It matches
the literal string @samp{abc?}.

@item c\+
becomes @samp{c+} when using extended regular expressions.  It matches
one or more @samp{c}s.

@item a\@{3,\@}
becomes @samp{a@{3,@}} when using extended regular expressions.  It matches
three or more @samp{a}s.

@item \(abc\)\@{2,3\@}
becomes @samp{(abc)@{2,3@}} when using extended regular expressions.  It
matches either @samp{abcabc} or @samp{abcabcabc}.

@item \(abc*\)\1
becomes @samp{(abc*)\1} when using extended regular expressions.
Backreferences must still be escaped when using extended regular
expressions.
@end table

@ifset PERL
@node Perl regexps
@appendix Perl-style regular expressions
@cindex Perl-style regular expressions, syntax

@emph{This part is taken from the @file{pcre.txt} file distributed together
with the free @sc{pcre} regular expression matcher; it was written by Philip Hazel.}

Perl introduced several extensions to regular expressions, some
of them incompatible with the syntax of regular expressions
accepted by Emacs and other @acronym{GNU} tools (whose matcher was
based on the Emacs matcher).  @value{SSED} implements
both kinds of extensions.

@iftex
Summarizing, we have:

@itemize @bullet
@item
A backslash can introduce several special sequences

@item
The circumflex, dollar sign, and period characters behave specially
with regard to new lines

@item
Strange uses of square brackets are parsed differently

@item
You can toggle modifiers in the middle of a regular expression

@item
You can specify that a subpattern does not count when numbering backreferences

@item
@cindex Greedy regular expression matching
You can specify greedy or non-greedy matching

@item
You can have more than ten back references

@item
You can do complex look aheads and look behinds (in the spirit of
@code{\b}, but with subpatterns).

@item
You can often improve performance by keeping @command{sed} from wasting
time on backtracking

@item
You can have if/then/else branches

@item
You can do recursive matches, for example to look for unbalanced parentheses

@item
You can have comments and non-significant whitespace, because things can
get complex...
@end itemize

Most of these extensions are introduced by the special @code{(?}
sequence, which gives special meanings to parenthesized groups.
@end iftex
@menu
Other extensions can be roughly subdivided in two categories.
On the one hand, Perl introduces several more escaped sequences
(that is, sequences introduced by a backslash).  On the other
hand, it specifies that if a question mark follows an opening
parenthesis it should give a special meaning to the parenthesized
group.

* Backslash::                         Introduces special sequences
* Circumflex/dollar sign/period::     Behave specially with regard to new lines
* Square brackets::                   Are a bit different in strange cases
* Options setting::                   Toggle modifiers in the middle of a regexp
* Non-capturing subpatterns::         Are not counted when backreferencing
* Repetition::                        Allows for non-greedy matching
* Backreferences::                    Allows for more than 10 back references
* Assertions::                        Allows for complex look ahead matches
* Non-backtracking subpatterns::      Often gives more performance
* Conditional subpatterns::           Allows if/then/else branches
* Recursive patterns::                For example to match parentheses
* Comments::                          Because things can get complex...
@end menu

@node Backslash
@appendixsec Backslash
@cindex Perl-style regular expressions, escaped sequences

There are a few differences in the handling of backslashed
sequences in Perl mode.

First of all, there are no @code{\o} and @code{\d} sequences.
@sc{ascii} values for characters can be specified in octal
with a @code{\@var{xxx}} sequence, where @var{xxx} is a
sequence of up to three octal digits.  If the first digit
is a zero, the treatment of the sequence is straightforward;
just note that if the character that follows the escaped digit
is itself an octal digit, you have to supply three octal digits
for @var{xxx}.
For example, @code{\07} is a @sc{bel} character
rather than a @sc{nul} and a literal @code{7} (this sequence is
instead represented by @code{\0007}).

@cindex Perl-style regular expressions, backreferences
The handling of a backslash followed by a digit other than 0
is complicated.  Outside a character class, @command{sed} reads it
and any following digits as a decimal number.  If the number
is less than 10, or if there have been at least that many
previous capturing left parentheses in the expression, the
entire sequence is taken as a back reference.  A description
of how this works is given later, following the discussion
of parenthesized subpatterns.

Inside a character class, or if the decimal number is
greater than 9 and there have not been that many capturing
subpatterns, @command{sed} re-reads up to three octal digits following
the backslash, and generates a single byte from the
least significant 8 bits of the value.  Any subsequent digits
stand for themselves.  For example:

@example
\040   @i{@r{is another way of writing a space}}
\40    @i{@r{is the same, provided there are fewer than 40}}
       @i{@r{previous capturing subpatterns}}
\7     @i{@r{is always a back reference}}
\011   @i{@r{is always a tab}}
\11    @i{@r{might be a back reference, or another way of writing a tab}}
\0113  @i{@r{is a tab followed by the character @samp{3}}}
\113   @i{@r{is the character with octal code 113 (since there}}
       @i{@r{can be no more than 99 back references)}}
\377   @i{@r{is a byte consisting entirely of 1 bits (@sc{ascii} 255)}}
\81    @i{@r{is either a back reference, or a binary zero}}
       @i{@r{followed by the two characters @samp{81}}}
@end example

Note that octal values of 100 or greater must not be introduced
by a leading zero, because no more than three octal
digits are ever read.
Note that this applies only to the LHS
pattern; it is not yet possible to specify more than 9 backreferences
on the RHS of the @code{s} command.

All the sequences that define a single byte value can be
used both inside and outside character classes.  In addition,
inside a character class, the sequence @code{\b} is interpreted
as the backspace character (hex 08).  Outside a character
class it has a different meaning (see below).

Perl mode also provides additional escapes specifying
generic character classes (like @code{\w} and @code{\W} do):

@cindex Perl-style regular expressions, character classes
@table @samp
@item \d
Matches any decimal digit

@item \D
Matches any character that is not a decimal digit
@end table

In Perl mode, these character type sequences can appear both inside and
outside character classes.  Instead, in @sc{posix} mode these sequences
(as well as @code{\w} and @code{\W}) are treated as two literal characters
(a backslash and a letter) inside square brackets.

Escaped sequences specifying assertions are also different in
Perl mode.  An assertion specifies a condition that has to be met
at a particular point in a match, without consuming any
characters from the subject string.  The use of subpatterns
for more complicated assertions is described below.  The
backslashed assertions are

@cindex Perl-style regular expressions, assertions
@table @samp
@item \b
Asserts that the point is at a word boundary.
A word boundary is a position in the subject string where
the current character and the previous character do not both
match @code{\w} or @code{\W} (i.e. one matches @code{\w} and
the other matches @code{\W}), or the start or end of the string
if the first or last character matches @code{\w}, respectively.

@item \B
Asserts that the point is not at a word boundary.

@item \A
Asserts the matcher is at the start of pattern space (independent
of multiline mode).

@item \Z
Asserts the matcher is at the end of pattern space,
or at a newline before the end of pattern space (independent of
multiline mode).

@item \z
Asserts the matcher is at the end of pattern space (independent
of multiline mode).
@end table

These assertions may not appear in character classes (but
note that @code{\b} has a different meaning, namely the
backspace character, inside a character class).
Note that Perl mode does not directly support assertions
for the beginning and the end of word; the @acronym{GNU} extensions
@code{\<} and @code{\>} achieve this purpose in @sc{posix} mode
instead.

The @code{\A}, @code{\Z}, and @code{\z} assertions differ
from the traditional circumflex and dollar sign (described below)
in that they only ever match at the very start and end of the
subject string, whatever options are set; in particular @code{\A}
and @code{\z} are the same as the @acronym{GNU} extensions
@code{\`} and @code{\'} that are active in @sc{posix} mode.

@node Circumflex/dollar sign/period
@appendixsec Circumflex, dollar sign, period
@cindex Perl-style regular expressions, newlines

Outside a character class, in the default matching mode, the
circumflex character is an assertion which is true only if
the current matching point is at the start of the subject
string.  Inside a character class, the circumflex has an entirely
different meaning (see below).

The circumflex need not be the first character of the pattern if
a number of alternatives are involved, but it should be the
first thing in each alternative in which it appears if the
pattern is ever to match that branch.
If all possible alternatives
start with a circumflex, that is, if the pattern is
constrained to match only at the start of the subject, it is
said to be an @dfn{anchored} pattern.  (There are also other
constructs that can cause a pattern to be anchored.)

A dollar sign is an assertion which is true only if the
current matching point is at the end of the subject string,
or immediately before a newline character that is the last
character in the string (by default).  A dollar sign need not be the
last character of the pattern if a number of alternatives
are involved, but it should be the last item in any branch
in which it appears.  A dollar sign has no special meaning in a
character class.

@cindex Perl-style regular expressions, multiline
The meanings of the circumflex and dollar sign characters are
changed if the @code{M} modifier option is used.  When this is
the case, they match immediately after and immediately
before an internal @code{\n} character, respectively, in addition
to matching at the start and end of the subject string.  For
example, the pattern @code{/^abc$/} matches the subject string
@samp{def\nabc} in multiline mode, but not otherwise.  Consequently,
patterns that are anchored in single line mode
because all branches start with @code{^} are not anchored in
multiline mode.

@cindex Perl-style regular expressions, multiline
Note that the sequences @code{\A}, @code{\Z}, and @code{\z}
can be used to match the start and end of the subject in both
modes, and if all branches of a pattern start with @code{\A}
it is always anchored, whether the @code{M} modifier is set or not.
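
The effect of the @code{M} modifier on the circumflex and dollar sign can
be tried out with a short shell sketch.  It assumes @acronym{GNU}
@command{sed} in its default mode, where the @code{M} flag of the @code{s}
command has the meaning described in this section; the variable names
@code{single} and @code{multi} are only illustrative.  The @code{N} command
joins the two input lines, so that pattern space holds @samp{def},
a newline, and @samp{abc}.

```shell
# Assumption: "sed" is GNU sed; the M flag is a GNU extension.

# Without M, ^ and $ only match at the ends of pattern space,
# so ^abc$ does not match and the text is left unchanged.
single=$(printf 'def\nabc\n' | sed 'N;s/^abc$/X/')

# With M, ^ also matches just after the embedded newline,
# so the second line is replaced.
multi=$(printf 'def\nabc\n' | sed 'N;s/^abc$/X/M')

printf '%s\n' "$single" '--' "$multi"
```

The first command prints the two lines unchanged, while the second prints
@samp{def} followed by @samp{X}.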

@cindex Perl-style regular expressions, single line
Outside a character class, a dot in the pattern matches any
one character in the subject, including a non-printing character,
but not (by default) newline.  If the @code{S} modifier is used,
dots match newlines as well.  Actually, the handling of
dot is entirely independent of the handling of circumflex
and dollar sign, the only relationship being that they both
involve newline characters.  Dot has no special meaning in a
character class.

@node Square brackets
@appendixsec Square brackets
@cindex Perl-style regular expressions, character classes

An opening square bracket introduces a character class, terminated
by a closing square bracket.  A closing square bracket on its own
is not special.  If a closing square bracket is required as a
member of the class, it should be the first data character in
the class (after an initial circumflex, if present) or escaped with a backslash.

A character class matches a single character in the subject;
the character must be in the set of characters defined by
the class, unless the first character in the class is a circumflex,
in which case the subject character must not be in
the set defined by the class.  If a circumflex is actually
required as a member of the class, ensure it is not the
first character, or escape it with a backslash.

For example, the character class @code{[aeiou]} matches any lower
case vowel, while @code{[^aeiou]} matches any character that is not
a lower case vowel.  Note that a circumflex is just a convenient
notation for specifying the characters which are in
the class by enumerating those that are not.  It is not an
assertion: it still consumes a character from the subject
string, and fails if the current pointer is at the end of
the string.
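
The difference between a class and its negation, and the fact that a
negated class still has to consume a character, can be seen in the
following shell sketch.  It only uses plain bracket expressions, which
behave the same way in @sc{posix} mode, so any @command{sed} should do;
the variable names are illustrative.

```shell
# Replace every lower case vowel, then every character that is not one.
vowels=$(printf '%s\n' 'queue' | sed 's/[aeiou]/_/g')
others=$(printf '%s\n' 'queue' | sed 's/[^aeiou]/_/g')

# [^aeiou] must consume a character, so it cannot match at the end of
# the string: nothing follows the "a" here and the line is unchanged.
at_end=$(printf '%s\n' 'a' | sed 's/a[^aeiou]/X/')

printf '%s\n' "$vowels" "$others" "$at_end"
```

The three results are @samp{q____}, @samp{_ueue}, and an unchanged
@samp{a}.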

@cindex Perl-style regular expressions, case-insensitive
When caseless matching is set, any letters in a class
represent both their upper case and lower case versions, so
for example, a caseless @code{[aeiou]} matches uppercase
and lowercase @samp{A}s, and a caseless @code{[^aeiou]}
does not match @samp{A}, whereas a case-sensitive version would.

@cindex Perl-style regular expressions, single line
@cindex Perl-style regular expressions, multiline
The newline character is never treated in any special way in
character classes, whatever the setting of the @code{S} and
@code{M} options (modifiers) is.  A class such as @code{[^a]} will
always match a newline.

The minus (hyphen) character can be used to specify a range
of characters in a character class.  For example, @code{[d-m]}
matches any letter between @samp{d} and @samp{m}, inclusive.  If a minus
character is required in a class, it must be escaped with a
backslash or appear in a position where it cannot be interpreted
as indicating a range, typically as the first or last
character in the class.

It is not possible to have the literal character @code{]} as the
end character of a range.  A pattern such as @code{[W-]46]} is
interpreted as a class of two characters (@code{W} and @code{-})
followed by a literal string @code{46]}, so it would match
@samp{W46]} or @samp{-46]}.  However, if the @code{]} is escaped
with a backslash it is interpreted as the end of range, so
@code{[W-\]46]} is interpreted as a single class containing a
range followed by two separate characters.  The octal or
hexadecimal representation of @code{]} can also be used to end a range.

Ranges operate in @sc{ascii} collating sequence.  They can also be
used for characters specified numerically, for example
@code{[\000-\037]}.
If a range that includes letters is used when
caseless matching is set, it matches the letters in either
case.  For example, a caseless @code{[W-c]} is equivalent to
@code{[][\^_`wxyzabc]}, matched caselessly, and if character
tables for the French locale are in use, @code{[\xc8-\xcb]}
matches accented E characters in both cases.

Unlike in @sc{posix} mode, the character types @code{\d},
@code{\D}, @code{\s}, @code{\S}, @code{\w}, and @code{\W}
may also appear in a character class, and add the characters
that they match to the class.  For example, @code{[\dABCDEF]} matches any
hexadecimal digit.  A circumflex can conveniently be used
with the upper case character types to specify a more restricted
set of characters than the matching lower case type.
For example, the class @code{[^\W_]} matches any letter or digit,
but not underscore.

All non-alphameric characters other than @code{\}, @code{-},
@code{^} (at the start) and the terminating @code{]}
are non-special in character classes, but it does no harm
if they are escaped.

Perl 5.6 supports the @sc{posix} notation for character classes, which
uses names enclosed by @code{[:} and @code{:]} within the enclosing
square brackets, and @value{SSED} supports this notation as well.
For example,

@example
[01[:alpha:]%]
@end example

@noindent
matches @samp{0}, @samp{1}, any alphabetic character, or @samp{%}.
The supported class names are

@table @code
@item alnum
Matches letters and digits

@item alpha
Matches letters

@item ascii
Matches character codes 0 - 127

@item cntrl
Matches control characters

@item digit
Matches decimal digits (same as @code{\d})

@item graph
Matches printing characters, excluding space

@item lower
Matches lower case letters

@item print
Matches printing characters, including space

@item punct
Matches printing characters, excluding letters and digits

@item space
Matches white space (same as @code{\s})

@item upper
Matches upper case letters

@item word
Matches ``word'' characters (same as @code{\w})

@item xdigit
Matches hexadecimal digits
@end table

The names @code{ascii} and @code{word} are extensions valid only in
Perl mode.  Another Perl extension is negation, which is
indicated by a circumflex character after the colon.  For example,

@example
[12[:^digit:]]
@end example

@noindent
matches @samp{1}, @samp{2}, or any non-digit.

@node Options setting
@appendixsec Options setting
@cindex Perl-style regular expressions, toggling options
@cindex Perl-style regular expressions, case-insensitive
@cindex Perl-style regular expressions, multiline
@cindex Perl-style regular expressions, single line
@cindex Perl-style regular expressions, extended

The settings of the @code{I}, @code{M}, @code{S}, @code{X}
modifiers can be changed from within the pattern by
a sequence of Perl option letters enclosed between @code{(?}
and @code{)}.  The option letters must be lowercase.

For example, @code{(?im)} sets caseless, multiline matching.
It is
also possible to unset these options by preceding the letter
with a hyphen; you can also have combined settings and unsettings:
@code{(?im-sx)} sets caseless and multiline matching,
while unsetting single line matching (for dots) and extended
whitespace interpretation.  If a letter appears both before
and after the hyphen, the option is unset.

The scope of these option changes depends on where in the
pattern the setting occurs.  For settings that are outside
any subpattern (defined below), the effect is the same as if
the options were set or unset at the start of matching.  The
following patterns all behave in exactly the same way:

@example
(?i)abc
a(?i)bc
ab(?i)c
abc(?i)
@end example

@noindent
which in turn is the same as specifying the pattern @code{abc} with
the @code{I} modifier.  In other words, ``top level'' settings
apply to the whole pattern (unless there are other
changes inside subpatterns).  If there is more than one setting
of the same option at top level, the rightmost setting
is used.

If an option change occurs inside a subpattern, the effect
is different.  This is a change of behaviour in Perl 5.005.
An option change inside a subpattern affects only that part
of the subpattern @emph{that follows} it, so

@example
(a(?i)b)c
@end example

@noindent
matches @samp{abc} and @samp{aBc} and no other strings (assuming
case-sensitive matching is used).  By this means, options can
be made to have different settings in different parts of the
pattern.  Any changes made in one alternative do carry on
into subsequent branches within the same subpattern.
For
example,

@example
(a(?i)b|c)
@end example

@noindent
matches @samp{ab}, @samp{aB}, @samp{c}, and @samp{C},
even though when matching @samp{C} the first branch is
abandoned before the option setting.
This is because the effects of option settings happen at
compile time.  There would be some very weird behaviour otherwise.

@ignore
There are two PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA
that can be changed in the same way as the Perl-compatible options by
using the characters U and X respectively.  The (?X) flag
setting is special in that it must always occur earlier in
the pattern than any of the additional features it turns on,
even when it is at top level.  It is best put at the start.
@end ignore


@node Non-capturing subpatterns
@appendixsec Non-capturing subpatterns
@cindex Perl-style regular expressions, non-capturing subpatterns

Marking part of a pattern as a subpattern does two things.
On one hand, it localizes a set of alternatives; on the other
hand, it sets up the subpattern as a capturing subpattern (as
defined above).  The subpattern can be backreferenced and
referenced in the right side of @code{s} commands.

For example, if the string @samp{the red king} is matched against
the pattern

@example
the ((red|white) (king|queen))
@end example

@noindent
the captured substrings are @samp{red king}, @samp{red},
and @samp{king}, and are numbered 1, 2, and 3.

The fact that plain parentheses fulfil two functions is not
always helpful.  There are often times when a grouping
subpattern is required without a capturing requirement.  If an
opening parenthesis is followed by @code{?:}, the subpattern does
not do any capturing, and is not counted when computing the
number of any subsequent capturing subpatterns.
For example,
if the string @samp{the white queen} is matched against the pattern

@example
the ((?:red|white) (king|queen))
@end example

@noindent
the captured substrings are @samp{white queen} and @samp{queen},
and are numbered 1 and 2.  The maximum number of captured
substrings is 99, while the maximum number of all subpatterns,
both capturing and non-capturing, is 200.

As a convenient shorthand, if any option settings are
required at the start of a non-capturing subpattern, the
option letters may appear between the @code{?} and the
@code{:}.  Thus the two patterns

@example
(?i:saturday|sunday)
(?:(?i)saturday|sunday)
@end example

@noindent
match exactly the same set of strings.  Because alternative
branches are tried from left to right, and options are not
reset until the end of the subpattern is reached, an option
setting in one branch does affect subsequent branches, so
the above patterns match @samp{SUNDAY} as well as @samp{Saturday}.


@node Repetition
@appendixsec Repetition
@cindex Perl-style regular expressions, repetitions

Repetition is specified by quantifiers, which can follow any
of the following items:

@itemize @bullet
@item
a single character, possibly escaped

@item
the @code{.} special character

@item
a character class

@item
a back reference (see next section)

@item
a parenthesized subpattern (unless it is an assertion; @pxref{Assertions})
@end itemize

The general repetition quantifier specifies a minimum and
maximum number of permitted matches, by giving the two
numbers in curly brackets (braces), separated by a comma.
The numbers must be less than 65536, and the first must be
less than or equal to the second.
For example:

@example
z@{2,4@}
@end example

@noindent
matches @samp{zz}, @samp{zzz}, or @samp{zzzz}. A closing brace on its own
is not a special character. If the second number is omitted,
but the comma is present, there is no upper limit; if the
second number and the comma are both omitted, the quantifier
specifies an exact number of required matches. Thus

@example
[aeiou]@{3,@}
@end example

@noindent
matches at least 3 successive vowels, but may match many
more, while

@example
\d@{8@}
@end example

@noindent
matches exactly 8 digits. An opening curly bracket that
appears in a position where a quantifier is not allowed, or
one that does not match the syntax of a quantifier, is taken
as a literal character. For example, @code{@{,6@}} is not a quantifier,
but a literal string of four characters.@footnote{It
raises an error if @option{-R} is not used.}

The quantifier @samp{@{0@}} is permitted, causing the expression to
behave as if the previous item and the quantifier were not
present.

For convenience (and historical compatibility) the three
most common quantifiers have single-character abbreviations:

@table @code
@item *
is equivalent to @{0,@}

@item +
is equivalent to @{1,@}

@item ?
is equivalent to @{0,1@}
@end table

It is possible to construct infinite loops by following a
subpattern that can match no characters with a quantifier
that has no upper limit, for example:

@example
(a?)*
@end example

Earlier versions of Perl used to give an error at
compile time for such patterns.
However, because there are
cases where this can be useful, such patterns are now
accepted, but if any repetition of the subpattern does in
fact match no characters, the loop is forcibly broken.

@cindex Greedy regular expression matching
@cindex Perl-style regular expressions, stingy repetitions
By default, the quantifiers are @dfn{greedy}, as in @sc{posix}
mode; that is, they match as much as possible (up to the maximum
number of permitted times), without causing the rest of the
pattern to fail. The classic example of where this gives problems
is in trying to match comments in C programs. These appear between
the sequences @code{/*} and @code{*/}, and within the sequence individual
@code{*} and @code{/} characters may appear. An attempt to match C
comments by applying the pattern

@example
/\*.*\*/
@end example

@noindent
to the string

@example
/* first comment */ not comment /* second comment */
@end example

@noindent
fails, because it matches the entire string owing to the
greediness of the @code{.*} item.

However, if a quantifier is followed by a question mark, it
ceases to be greedy, and instead matches the minimum number
of times possible, so the pattern @code{/\*.*?\*/}
does the right thing with the C comments. The meaning of the
various quantifiers is not otherwise changed, just the preferred
number of matches. Do not confuse this use of the question
mark with its use as a quantifier in its own right.
Because it has two uses, it can sometimes appear doubled, as in

@example
\d??\d
@end example

@noindent
which matches one digit by preference, but can match two if
that is the only way the rest of the pattern matches.

Note that greediness does not matter when specifying addresses,
because an address only tests whether a line matches, not how much
of it is matched. A non-greedy quantifier can nevertheless be used
there to improve performance.
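The difference between greedy and non-greedy quantifiers can be tried
outside @command{sed}. The following is only a sketch: it uses Python's
@code{re} module as a stand-in, on the assumption that its syntax for
these quantifiers matches the Perl-style syntax described here; the
variable names are illustrative and the subject is a string like the one
above.

```python
import re

# A subject string containing two C comments.
subject = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the last "*/", so one match spans the whole string.
greedy = re.findall(r"/\*.*\*/", subject)

# Non-greedy: .*? stops at the first "*/", giving one match per comment.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers a single digit, but takes two when the rest of the
# pattern (here, the requirement to match the whole string) needs it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()

print(greedy)
print(lazy)
print(one_digit, two_digits)
```

The greedy list holds a single element equal to the whole subject, while
the non-greedy list holds the two comments separately.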

@ignore
If the PCRE_UNGREEDY option is set (an option which is not
available in Perl), the quantifiers are not greedy by
default, but individual ones can be made greedy by following
them with a question mark. In other words, it inverts the
default behaviour.
@end ignore

When a parenthesized subpattern is quantified with a minimum
repeat count that is greater than 1 or with a limited maximum,
more memory is required for the compiled pattern, in
proportion to the size of the minimum or maximum.

@cindex Perl-style regular expressions, single line
If a pattern starts with @code{.*} or @code{.@{0,@}} and the
@code{S} modifier is used, the pattern is implicitly anchored,
because whatever follows will be tried against every character
position in the subject string, so there is no point in
retrying the overall match at any position after the first.
PCRE treats such a pattern as though it were preceded by @code{\A}.

When a capturing subpattern is repeated, the value captured
is the substring that matched the final iteration. For example,
after

@example
(tweedle[dume]@{3@}\s*)+
@end example

@noindent
has matched @samp{tweedledum tweedledee} the value of the
captured substring is @samp{tweedledee}. However, if there are
nested capturing subpatterns, the corresponding captured
values may have been set in previous iterations. For example,
after

@example
/(a|(b))+/
@end example

@noindent
matches @samp{aba}, the value of the second captured substring is
@samp{b}.

@node Backreferences
@appendixsec Backreferences
@cindex Perl-style regular expressions, backreferences

Outside a character class, a backslash followed by a digit
greater than 0 (and possibly further digits) is a back
reference to a capturing subpattern earlier (i.e.
to its
left) in the pattern, provided there have been that many
previous capturing left parentheses.

However, if the decimal number following the backslash is
less than 10, it is always taken as a back reference, and
causes an error only if there are not that many capturing
left parentheses in the entire pattern. In other words, the
parentheses that are referenced need not be to the left of
the reference for numbers less than 10. @xref{Backslash},
for further details of the handling of digits following a backslash.

A back reference matches whatever actually matched the capturing
subpattern in the current subject string, rather than
anything matching the subpattern itself. So the pattern

@example
(sens|respons)e and \1ibility
@end example

@noindent
matches @samp{sense and sensibility} and @samp{response and responsibility},
but not @samp{sense and responsibility}. If caseful
matching is in force at the time of the back reference, the
case of letters is relevant. For example,

@example
((?i)blah)\s+\1
@end example

@noindent
matches @samp{blah blah} and @samp{Blah Blah}, but not
@samp{BLAH blah}, even though the original capturing
subpattern is matched caselessly.

There may be more than one back reference to the same subpattern.
Also, if a subpattern has not actually been used in a
particular match, any back references to it always fail. For
example, the pattern

@example
(a|(bc))\2
@end example

@noindent
always fails if it starts to match @samp{a} rather than
@samp{bc}. Because there may be up to 99 back references, all
digits following the backslash are taken as part of a potential
back reference number; this is different from what happens
in @sc{posix} mode.
If the pattern continues with a digit
character, some delimiter must be used to terminate the back
reference. If the @code{X} modifier option is set, this can be
whitespace. Otherwise an empty comment can be used, or the
following character can be expressed in hexadecimal or octal.
Note that this applies only to the LHS pattern; it is
not yet possible to specify more than 9 backreferences on the
RHS of the @code{s} command.

A back reference that occurs inside the parentheses to which
it refers fails when the subpattern is first used, so, for
example, @code{(a\1)} never matches. However, such references
can be useful inside repeated subpatterns. For example, the
pattern

@example
(a|b\1)+
@end example

@noindent
matches any number of @samp{a}s and also @samp{aba}, @samp{ababbaa},
etc. At each iteration of the subpattern, the back reference matches
the character string corresponding to the previous iteration. In
order for this to work, the pattern must be such that the first
iteration does not need to match the back reference. This can be
done using alternation, as in the example above, or by a
quantifier with a minimum of zero.

@node Assertions
@appendixsec Assertions
@cindex Perl-style regular expressions, assertions
@cindex Perl-style regular expressions, asserting subpatterns

An assertion is a test on the characters following or
preceding the current matching point that does not actually
consume any characters. The simple assertions coded as @code{\b},
@code{\B}, @code{\A}, @code{\Z}, @code{\z}, @code{^} and @code{$}
are described above. More complicated assertions are coded as
subpatterns. There are two kinds: those that look ahead of the
current position in the subject string, and those that look behind it.
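As a rough illustration of assertions that test the surrounding text
without consuming it, the following sketch uses Python's @code{re}
module as a stand-in (an assumption: its lookahead, lookbehind and
@code{\b} syntax is taken to match the Perl-style syntax of this
appendix). The sample strings and variable names are made up for the
example.

```python
import re

# Lookahead: match a word only if a semicolon follows; the semicolon
# itself is not part of the match.
words = re.findall(r"\w+(?=;)", "alpha; beta gamma;")

# Negative lookbehind: match "bar" only when "foo" does not precede it.
positions = [m.start() for m in re.finditer(r"(?<!foo)bar", "foobar bar")]

# Simple assertion: \b matches at a word boundary and consumes nothing.
whole_words = re.findall(r"\bcat\b", "cat concat cats")

print(words)
print(positions)
print(whole_words)
```

In each case the asserted text (the semicolon, the preceding
@samp{foo}, the word boundary) decides whether a match is accepted but
never appears in the matched string.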

@cindex Perl-style regular expressions, lookahead subpatterns
An assertion subpattern is matched in the normal way, except
that it does not cause the current matching position to be
changed. Lookahead assertions start with @code{(?=} for positive
assertions and @code{(?!} for negative assertions. For example,

@example
\w+(?=;)
@end example

@noindent
matches a word followed by a semicolon, but does not include
the semicolon in the match, and

@example
foo(?!bar)
@end example

@noindent
matches any occurrence of @samp{foo} that is not followed by
@samp{bar}.

Note that the apparently similar pattern

@example
(?!foo)bar
@end example

@noindent
@cindex Perl-style regular expressions, lookbehind subpatterns
finds any occurrence of @samp{bar} even if it is preceded by
@samp{foo}, because the assertion @code{(?!foo)} is always true
when the next three characters are @samp{bar}. To match @samp{bar}
only when it is not preceded by @samp{foo}, a lookbehind
assertion is needed.
Lookbehind assertions start with @code{(?<=} for positive
assertions and @code{(?<!} for negative assertions. So,

@example
(?<!foo)bar
@end example

@noindent
achieves the required effect of finding an occurrence of
@samp{bar} that is not preceded by @samp{foo}. The contents of a
lookbehind assertion are restricted
such that all the strings it matches must have a fixed
length. However, if there are several alternatives, they do
not all have to have the same fixed length. This is an extension
compared with Perl 5.005, which requires all branches to match
the same length of string. Thus

@example
(?<=dogs|cats|)
@end example

@noindent
is permitted, but the apparently equivalent regular expression

@example
(?<!dogs?|cats?)
@end example

@noindent
causes an error at compile time. Branches that match different
length strings are permitted only at the top level of
a lookbehind assertion: an assertion such as

@example
(?<=ab(c|de))
@end example

@noindent
is not permitted, because its single top-level branch can
match two different lengths, but it is acceptable if rewritten
to use two top-level branches:

@example
(?<=abc|abde)
@end example

All this is required because lookbehind assertions simply
move the current position back by the alternative's fixed
width and then try to match. If there are
insufficient characters before the current position, the
match is deemed to fail. Lookbehinds, in conjunction with
non-backtracking subpatterns, can be particularly useful for
matching at the ends of strings; an example is given at the end
of the section on non-backtracking subpatterns.

Several assertions (of any sort) may occur in succession.
For example,

@example
(?<=\d@{3@})(?<!999)foo
@end example

@noindent
matches @samp{foo} preceded by three digits that are not @samp{999}.
Notice that each of the assertions is applied independently
at the same point in the subject string. First there is a
check that the previous three characters are all digits, and
then there is a check that the same three characters are not
@samp{999}. This pattern does not match @samp{foo} preceded by six
characters, the first three of which are digits and the last three
of which are not @samp{999}. For example, it doesn't match
@samp{123abcfoo}.
A pattern to do that is

@example
(?<=\d@{3@}...)(?<!999)foo
@end example

@noindent
This time the first assertion looks at the preceding six
characters, checking that the first three are digits, and
then the second assertion checks that the preceding three
characters are not @samp{999}. Actually, assertions can be
nested in any combination, so one can write this as

@example
(?<=\d@{3@}(?!999)...)foo
@end example

@noindent
or

@example
(?<=\d@{3@}...(?<!999))foo
@end example

@noindent
both of which might be considered more readable.

Assertion subpatterns are not capturing subpatterns, and may
not be repeated, because it makes no sense to assert the
same thing several times. If any kind of assertion contains
capturing subpatterns within it, these are counted for the
purposes of numbering the capturing subpatterns in the whole
pattern. However, substring capturing is carried out only
for positive assertions, because it does not make sense for
negative assertions.

Assertions count towards the maximum of 200 parenthesized
subpatterns.

@node Non-backtracking subpatterns
@appendixsec Non-backtracking subpatterns
@cindex Perl-style regular expressions, non-backtracking subpatterns

With both maximizing and minimizing repetition, failure of
what follows normally causes the repeated item to be evaluated
again to see if a different number of repeats allows the
rest of the pattern to match. Sometimes it is useful to
prevent this, either to change the nature of the match, or
to cause it to fail earlier than it otherwise might, when the
author of the pattern knows there is no point in carrying
on.

Consider, for example, the pattern @code{\d+foo} when applied to
the subject line

@example
123456bar
@end example

After matching all 6 digits and then failing to match @samp{foo},
the normal action of the matcher is to try again with only 5
digits matching the @code{\d+} item, and then with 4, and so on,
before ultimately failing. Non-backtracking subpatterns
provide the means for specifying that once a portion of the
pattern has matched, it is not to be re-evaluated in this way,
so the matcher would give up immediately on failing to match
@samp{foo} the first time. The notation is another kind of special
parenthesis, starting with @code{(?>} as in this example:

@example
(?>\d+)foo
@end example

This kind of parenthesis ``locks up'' the part of the pattern
it contains once it has matched, and a failure further into
the pattern is prevented from backtracking into it.
Backtracking past it to previous items, however, works as
normal.

Non-backtracking subpatterns are not capturing subpatterns. Simple
cases such as the above example can be thought of as a maximizing
repeat that must swallow everything it can. So,
while both @code{\d+} and @code{\d+?} are prepared to adjust the number of
digits they match in order to make the rest of the pattern
match, @code{(?>\d+)} can only match an entire sequence of digits.

This construction can of course contain arbitrarily complicated
subpatterns, and it can be nested.

@cindex Perl-style regular expressions, lookbehind subpatterns
Non-backtracking subpatterns can be used in conjunction with lookbehind
assertions to specify efficient matching at the end
of the subject string.
Consider a simple pattern such as

@example
abcd$
@end example

@noindent
when applied to a long string which does not match. Because
matching proceeds from left to right, @command{sed} will look for
each @samp{a} in the subject and then see if what follows matches
the rest of the pattern. If the pattern is specified as

@example
^.*abcd$
@end example

@noindent
the initial @code{.*} matches the entire string at first, but when
this fails (because there is no following @samp{a}), it backtracks
to match all but the last character, then all but the
last two characters, and so on. Once again the search for
@samp{a} covers the entire string, from right to left, so we are
no better off. However, if the pattern is written as

@example
^(?>.*)(?<=abcd)
@end example

@noindent
there can be no backtracking for the @code{.*} item; it can match
only the entire string. The subsequent lookbehind assertion
does a single test on the last four characters. If it fails,
the match fails immediately. For long strings, this approach
makes a significant difference to the processing time.

When a pattern contains an unlimited repeat inside a subpattern
that can itself be repeated an unlimited number of
times, the use of a non-backtracking subpattern is the only way to
avoid some failing matches taking a very long time
indeed.@footnote{Actually, the matcher embedded in @value{SSED}
tries to do something for this in the simplest cases,
like @code{([^b]*b)*}. These cases are actually quite
common: they happen for example in a regular expression
like @code{\/\*([^*]*\*)*\/} which matches C comments.}

The pattern

@example
(\D+|<\d+>)*[!?]
@end example

([^0-9<]+<(\d+>)?)*[!?]

@noindent
matches an unlimited number of substrings that either consist
of non-digits, or digits enclosed in angle brackets, followed by
an exclamation or question mark. When it matches, it runs quickly.
However, if it is applied to

@example
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
@end example

@noindent
it takes a long time before reporting failure. This is
because the string can be divided between the two repeats in
a large number of ways, and all have to be tried.@footnote{The
example used @code{[!?]} rather than a single character at the end,
because both @value{SSED} and Perl have an optimization that allows
for fast failure when a single character is used. They
remember the last single character that is required for a
match, and fail early if it is not present in the string.}

If the pattern is changed to

@example
((?>\D+)|<\d+>)*[!?]
@end example

@noindent
sequences of non-digits cannot be broken, and failure happens
quickly.

@node Conditional subpatterns
@appendixsec Conditional subpatterns
@cindex Perl-style regular expressions, conditional subpatterns

It is possible to cause the matching process to obey a subpattern
conditionally or to choose between two alternative
subpatterns, depending on the result of an assertion, or
whether a previous capturing subpattern matched or not. The
two possible forms of conditional subpattern are

@example
(?(@var{condition})@var{yes-pattern})
(?(@var{condition})@var{yes-pattern}|@var{no-pattern})
@end example

If the condition is satisfied, the yes-pattern is used; otherwise
the no-pattern (if present) is used. If there are more than two
alternatives in the subpattern, a compile-time error occurs.

There are two kinds of condition.
If the text between the
parentheses consists of a sequence of digits, the condition
is satisfied if the capturing subpattern of that number has
previously matched. The number must be greater than zero.
Consider the following pattern, which contains non-significant
white space to make it more readable (assume the @code{X} modifier)
and to divide it into three parts for ease of discussion:

@example
( \( )?    [^()]+    (?(1) \) )
@end example

The first part matches an optional opening parenthesis, and
if that character is present, sets it as the first captured
substring. The second part matches one or more characters
that are not parentheses. The third part is a conditional
subpattern that tests whether the first set of parentheses
matched or not. If they did, that is, if the subject started
with an opening parenthesis, the condition is true, and so
the yes-pattern is executed and a closing parenthesis is
required. Otherwise, since no-pattern is not present, the
subpattern matches nothing. In other words, this pattern
matches a sequence of non-parentheses, optionally enclosed
in parentheses.

@cindex Perl-style regular expressions, lookahead subpatterns
If the condition is not a sequence of digits, it must be an
assertion. This may be a positive or negative lookahead or
lookbehind assertion. Consider this pattern, again containing
non-significant white space, and with the two alternatives
on the second and third lines:

@example
(?(?=...[a-z])
   \d\d-[a-z]@{3@}-\d\d |
   \d\d-\d\d-\d\d )
@end example

The condition is a positive lookahead assertion that matches
a letter that is three characters away from the current point.
If a letter is found, the subject is matched against the first
alternative @samp{@var{dd}-@var{aaa}-@var{dd}} (where @var{aaa} are
letters and @var{dd} are digits); otherwise it is matched against
the second alternative, @samp{@var{dd}-@var{dd}-@var{dd}}.


@node Recursive patterns
@appendixsec Recursive patterns
@cindex Perl-style regular expressions, recursive patterns
@cindex Perl-style regular expressions, recursion

Consider the problem of matching a string in parentheses,
allowing for unlimited nested parentheses. Without the use
of recursion, the best that can be done is to use a pattern
that matches up to some fixed depth of nesting. It is not
possible to handle an arbitrary nesting depth. Perl 5.6 has
provided an experimental facility that allows regular
expressions to recurse (amongst other things). It does this
by interpolating Perl code in the expression at run time,
and the code can refer to the expression itself. A Perl pattern
to solve the parentheses problem can be created like
this:

@example
$re = qr@{\( (?: (?>[^()]+) | (?p@{$re@}) )* \)@}x;
@end example

The @code{(?p@{...@})} item interpolates Perl code at run time,
and in this case refers recursively to the pattern in which it
appears. Obviously, @command{sed} cannot support the interpolation of
Perl code. Instead, the special item @code{(?R)} is provided for
the specific case of recursion. This pattern solves the
parentheses problem (assume the @code{X} modifier option is used
so that white space is ignored):

@example
\( ( (?>[^()]+) | (?R) )* \)
@end example

First it matches an opening parenthesis. Then it matches any
number of substrings which can either be a sequence of
non-parentheses, or a recursive match of the pattern itself
(i.e. a correctly parenthesized substring).
Finally there is
a closing parenthesis.

This particular example pattern contains nested unlimited
repeats, and so the use of a non-backtracking subpattern for
matching strings of non-parentheses is important when applying
the pattern to strings that do not match. For example, when
it is applied to

@example
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
@end example

@noindent
it yields a ``no match'' response quickly. However, if a
non-backtracking subpattern is not used, the match runs
for a very long time indeed because there are so many different
ways the @code{+} and @code{*} repeats can carve up the subject,
and all have to be tested before failure can be reported.

The values set for any capturing subpatterns are those from
the outermost level of the recursion at which the subpattern
value is set. If the pattern above is matched against

@example
(ab(cd)ef)
@end example

@noindent
the value for the capturing parentheses is @samp{ef}, which is
the last value taken on at the top level.

@node Comments
@appendixsec Comments
@cindex Perl-style regular expressions, comments

The sequence @code{(?#} marks the start of a comment which
continues up to the next closing parenthesis. Nested parentheses
are not permitted. The characters that make up a comment
play no part in the pattern matching at all.

@cindex Perl-style regular expressions, extended
If the @code{X} modifier option is used, an unescaped @code{#} character
outside a character class introduces a comment that continues
up to the next newline character in the pattern.
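Both comment forms can be tried with the following sketch. It uses
Python's @code{re} module as a stand-in, and treats its
@code{re.VERBOSE} flag as a rough analogue of the @code{X} modifier;
both are assumptions made for illustration, not a description of how
@value{SSED} is implemented. The patterns and names are invented for
the example.

```python
import re

# (?#...) is an inline comment; it plays no part in matching.
inline = re.compile(r"\d+(?# one or more digits)foo")

# With re.VERBOSE (assumed analogue of the X modifier), white space is
# ignored and an unescaped # starts a comment that runs to end of line.
verbose = re.compile(r"""
    \d+    # one or more digits
    foo    # then the literal text foo
    [#]    # inside a character class, # is an ordinary character
""", re.VERBOSE)

inline_match = inline.search("abc123foo").group()
verbose_match = verbose.search("abc123foo#").group()

print(inline_match)
print(verbose_match)
```

The first pattern matches exactly as if the comment were absent; the
second shows that a @code{#} inside a character class is matched
literally rather than starting a comment.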
@end ifset


@page
@node Concept Index
@unnumbered Concept Index

This is a general index of all issues discussed in this manual, with the
exception of the @command{sed} commands and command-line options.

@printindex cp

@page
@node Command and Option Index
@unnumbered Command and Option Index

This is an alphabetical list of all @command{sed} commands and command-line
options.

@printindex fn

@contents
@bye

@c XXX FIXME: the term "cycle" is never defined...