\input texinfo  @c -*-texinfo-*-
@c
@c -- Stuff that needs adding: ----------------------------------------------
@c (document the `;' command-separator)
@c --------------------------------------------------------------------------
@c Check for consistency: regexps in @code, text that they match in @samp.
@c
@c Tips:
@c    @command for command
@c    @samp for command fragments: @samp{cat -s}
@c    @code for sed commands and flags
@c Use ``quote'' not `quote' or "quote".
@c
@c %**start of header
@setfilename sed.info
@settitle sed, a stream editor
@c %**end of header

@c @smallbook

@include version.texi

@c Combine indices.
@syncodeindex ky cp
@syncodeindex pg cp
@syncodeindex tp cp

@defcodeindex op
@syncodeindex op fn

@include config.texi

@copying
This file documents version @value{VERSION} of
@value{SSED}, a stream editor.

Copyright @copyright{} 1998, 1999, 2001, 2002, 2003, 2004 Free
Software Foundation, Inc.

This document is released under the terms of the @acronym{GNU} Free
Documentation License as published by the Free Software Foundation;
either version 1.1, or (at your option) any later version.

You should have received a copy of the @acronym{GNU} Free Documentation
License along with @value{SSED}; see the file @file{COPYING.DOC}.
If not, write to the Free Software Foundation, 51 Franklin Street,
Fifth Floor, Boston, MA 02110-1301, USA.

There are no Cover Texts and no Invariant Sections; this text, along
with its equivalent in the printed manual, constitutes the Title Page.
@end copying

@setchapternewpage off

@titlepage
@title @command{sed}, a stream editor
@subtitle version @value{VERSION}, @value{UPDATED}
@author by Ken Pizzini, Paolo Bonzini

@page
@vskip 0pt plus 1filll
Copyright @copyright{} 1998, 1999 Free Software Foundation, Inc.

@insertcopying

Published by the Free Software Foundation, @*
51 Franklin Street, Fifth Floor @*
Boston, MA 02110-1301, USA
@end titlepage


@node Top
@top

@ifnottex
@insertcopying
@end ifnottex

@menu
* Introduction::               Introduction
* Invoking sed::               Invocation
* sed Programs::               @command{sed} programs
* Examples::                   Some sample scripts
* Limitations::                Limitations and (non-)limitations of @value{SSED}
* Other Resources::            Other resources for learning about @command{sed}
* Reporting Bugs::             Reporting bugs

* Extended regexps::           @command{egrep}-style regular expressions
@ifset PERL
* Perl regexps::               Perl-style regular expressions
@end ifset

* Concept Index::              A menu with all the topics in this manual.
* Command and Option Index::   A menu with all @command{sed} commands and
                               command-line options.

@detailmenu
--- The detailed node listing ---

sed Programs:
* Execution Cycle::                 How @command{sed} works
* Addresses::                       Selecting lines with @command{sed}
* Regular Expressions::             Overview of regular expression syntax
* Common Commands::                 Often used commands
* The "s" Command::                 @command{sed}'s Swiss Army Knife
* Other Commands::                  Less frequently used commands
* Programming Commands::            Commands for @command{sed} gurus
* Extended Commands::               Commands specific to @value{SSED}
* Escapes::                         Specifying special characters

Examples:
* Centering lines::
* Increment a number::
* Rename files to lower case::
* Print bash environment::
* Reverse chars of lines::
* tac::                             Reverse lines of files
* cat -n::                          Numbering lines
* cat -b::                          Numbering non-blank lines
* wc -c::                           Counting chars
* wc -w::                           Counting words
* wc -l::                           Counting lines
* head::                            Printing the first lines
* tail::                            Printing the last lines
* uniq::                            Make duplicate lines unique
* uniq -d::                         Print duplicated lines of input
* uniq -u::                         Remove
                                    all duplicated lines
* cat -s::                          Squeezing blank lines

@ifset PERL
Perl regexps::                      Perl-style regular expressions
* Backslash::                       Introduces special sequences
* Circumflex/dollar sign/period::   Behave specially with regard to new lines
* Square brackets::                 Are a bit different in strange cases
* Options setting::                 Toggle modifiers in the middle of a regexp
* Non-capturing subpatterns::       Are not counted when backreferencing
* Repetition::                      Allows for non-greedy matching
* Backreferences::                  Allows for more than 10 back references
* Assertions::                      Allows for complex look ahead matches
* Non-backtracking subpatterns::    Often gives more performance
* Conditional subpatterns::         Allows if/then/else branches
* Recursive patterns::              For example to match parentheses
* Comments::                        Because things can get complex...
@end ifset

@end detailmenu
@end menu


@node Introduction
@chapter Introduction

@cindex Stream editor
@command{sed} is a stream editor.
A stream editor is used to perform basic text
transformations on an input stream
(a file or input from a pipeline).
While in some ways similar to an editor which
permits scripted edits (such as @command{ed}),
@command{sed} works by making only one pass over the
input(s), and is consequently more efficient.
But it is @command{sed}'s ability to filter text in a pipeline
which particularly distinguishes it from other types of
editors.


@node Invoking sed
@chapter Invocation

Normally @command{sed} is invoked like this:

@example
sed SCRIPT INPUTFILE...
@end example

The full format for invoking @command{sed} is:

@example
sed OPTIONS... [SCRIPT] [INPUTFILE...]
@end example

If you do not specify @var{INPUTFILE}, or if @var{INPUTFILE} is @file{-},
@command{sed} filters the contents of the standard input.
The @var{script}
is actually the first non-option parameter, which @command{sed} specially
considers a script and not an input file if (and only if) none of the
other @var{options} specifies a script to be executed, that is, if neither
of the @option{-e} and @option{-f} options is specified.

@command{sed} may be invoked with the following command-line options:

@table @code
@item --version
@opindex --version
@cindex Version, printing
Print out the version of @command{sed} that is being run and a copyright notice,
then exit.

@item --help
@opindex --help
@cindex Usage summary, printing
Print a usage message briefly summarizing these command-line options
and the bug-reporting address,
then exit.

@item -n
@itemx --quiet
@itemx --silent
@opindex -n
@opindex --quiet
@opindex --silent
@cindex Disabling autoprint, from command line
By default, @command{sed} prints out the pattern space
at the end of each cycle through the script (@pxref{Execution Cycle, ,
How @code{sed} works}).
These options disable this automatic printing,
and @command{sed} only produces output when explicitly told to
via the @code{p} command.

@item -e @var{script}
@itemx --expression=@var{script}
@opindex -e
@opindex --expression
@cindex Script, from command line
Add the commands in @var{script} to the set of commands to be
run while processing the input.

@item -f @var{script-file}
@itemx --file=@var{script-file}
@opindex -f
@opindex --file
@cindex Script, from a file
Add the commands contained in the file @var{script-file}
to the set of commands to be run while processing the input.

@item -i[@var{SUFFIX}]
@itemx --in-place[=@var{SUFFIX}]
@opindex -i
@opindex --in-place
@cindex In-place editing, activating
@cindex @value{SSEDEXT}, in-place editing
This option specifies that files are to be edited in-place.
@value{SSED} does this by creating a temporary file and
sending output to this file rather than to the standard
output.@footnote{This applies to commands such as @code{=},
@code{a}, @code{c}, @code{i}, @code{l}, @code{p}. You can
still write to the standard output by using the @code{w}
@cindex @value{SSEDEXT}, @file{/dev/stdout} file
or @code{W} commands together with the @file{/dev/stdout}
special file.}

This option implies @option{-s}.

When the end of the file is reached, the temporary file is
renamed to the output file's original name. The extension,
if supplied, is used to modify the name of the old file
before renaming the temporary file, thereby making a backup
copy.@footnote{Note that @value{SSED} creates the backup
file whether or not any output is actually changed.}

@cindex In-place editing, Perl-style backup file names
The following rule applies: if the extension doesn't contain a @code{*},
then it is appended to the end of the current filename as a
suffix; if the extension does contain one or more @code{*}
characters, then @emph{each} asterisk is replaced with the
current filename. This allows you to add a prefix to the
backup file, instead of (or in addition to) a suffix, or
even to place backup copies of the original files into another
directory (provided the directory already exists).

If no extension is supplied, the original file is
overwritten without making a backup.

@item -l @var{N}
@itemx --line-length=@var{N}
@opindex -l
@opindex --line-length
@cindex Line length, setting
Specify the default line-wrap length for the @code{l} command.
A length of 0 (zero) means to never wrap long lines. If
not specified, it is taken to be 70.

@item --posix
@cindex @value{SSEDEXT}, disabling
@value{SSED} includes several extensions to @acronym{POSIX}
@command{sed}. In order to simplify writing portable scripts, this
option disables all the extensions that this manual documents,
including additional commands.
@cindex @code{POSIXLY_CORRECT} behavior, enabling
Most of the extensions accept @command{sed} programs that
are outside the syntax mandated by @acronym{POSIX}, but some
of them (such as the behavior of the @code{N} command
described in @ref{Reporting Bugs}) actually violate the
standard. If you want to disable only the latter kind of
extension, you can set the @env{POSIXLY_CORRECT} variable
to a non-empty value.

@item -b
@itemx --binary
@opindex -b
@opindex --binary
This option is available on every platform, but is only effective where the
operating system makes a distinction between text files and binary files.
When such a distinction is made---as is the case for MS-DOS, Windows, and
Cygwin---text files are composed of lines separated by a carriage return
@emph{and} a line feed character, and @command{sed} does not see the
ending CR. When this option is specified, @command{sed} will open
input files in binary mode, thus not requesting this special processing
and considering lines to end at a line feed.

@item --follow-symlinks
@opindex --follow-symlinks
This option is available only on platforms that support
symbolic links and has an effect only if option @option{-i}
is specified. In this case, if the file that is specified
on the command line is a symbolic link, @command{sed} will
follow the link and edit the ultimate destination of the
link. The default behavior is to break the symbolic link,
so that the link destination will not be modified.

@item -r
@itemx --regexp-extended
@opindex -r
@opindex --regexp-extended
@cindex Extended regular expressions, choosing
@cindex @acronym{GNU} extensions, extended regular expressions
Use extended regular expressions rather than basic
regular expressions. Extended regexps are those that
@command{egrep} accepts; they can be clearer because they
usually have fewer backslashes, but are a @acronym{GNU} extension
and hence scripts that use them are not portable.
@xref{Extended regexps, , Extended regular expressions}.

@ifset PERL
@item -R
@itemx --regexp-perl
@opindex -R
@opindex --regexp-perl
@cindex Perl-style regular expressions, choosing
@cindex @value{SSEDEXT}, Perl-style regular expressions
Use Perl-style regular expressions rather than basic
regular expressions. Perl-style regexps are extremely
powerful but are a @value{SSED} extension and hence scripts that
use them are not portable. @xref{Perl regexps, ,
Perl-style regular expressions}.
@end ifset

@item -s
@itemx --separate
@cindex Working on separate files
By default, @command{sed} will consider the files specified on the
command line as a single continuous long stream. This @value{SSED}
extension allows the user to consider them as separate files:
range addresses (such as @samp{/abc/,/def/}) are not allowed
to span several files, line numbers are relative to the start
of each file, @code{$} refers to the last line of each file,
and files invoked from the @code{R} commands are rewound at the
start of each file.

@item -u
@itemx --unbuffered
@opindex -u
@opindex --unbuffered
@cindex Unbuffered I/O, choosing
Buffer both input and output as minimally as practical.
(This is particularly useful if the input is coming from
the likes of @samp{tail -f}, and you wish to see the transformed
output as soon as possible.)

@end table

If no @option{-e}, @option{-f}, @option{--expression}, or @option{--file}
options are given on the command line,
then the first non-option argument on the command line is
taken to be the @var{script} to be executed.

@cindex Files to be processed as input
If any command-line parameters remain after processing the above,
these parameters are interpreted as the names of input files to
be processed.
@cindex Standard input, processing as input
A file name of @samp{-} refers to the standard input stream.
The standard input will be processed if no file names are specified.


@node sed Programs
@chapter @command{sed} Programs

@cindex @command{sed} program structure
@cindex Script structure
A @command{sed} program consists of one or more @command{sed} commands,
passed in by one or more of the
@option{-e}, @option{-f}, @option{--expression}, and @option{--file}
options, or the first non-option argument if none of these
options is used.
This document will refer to ``the'' @command{sed} script;
this is understood to mean the in-order concatenation
of all of the @var{script}s and @var{script-file}s passed in.

Each @code{sed} command consists of an optional address or
address range, followed by a one-character command name
and any additional command-specific code.
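
As a minimal sketch of this structure (the sample input and the variable
names @code{out} and @code{joined} are made up for illustration), the
first command below combines an address range (@samp{2,3}), a
one-character command name (@samp{s}), and command-specific code
(@samp{/o/0/}); the second passes two @option{-e} scripts, which
@command{sed} treats as one script in the order given:

```shell
# Address range "2,3" + command name "s" + command-specific code "/o/0/":
# only lines 2 and 3 have their first "o" replaced by "0".
out=$(printf 'foo\nfoo\nfoo\nfoo\n' | sed '2,3s/o/0/')
printf '%s\n' "$out"

# Two -e scripts are joined in order: delete line 1, then
# substitute on whatever lines remain.
joined=$(printf 'a\nb\n' | sed -e '1d' -e 's/b/B/')
printf '%s\n' "$joined"
```

Only features required by @acronym{POSIX} are used here, so the sketch
should behave the same way with other @command{sed} implementations.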

@menu
* Execution Cycle::          How @command{sed} works
* Addresses::                Selecting lines with @command{sed}
* Regular Expressions::      Overview of regular expression syntax
* Common Commands::          Often used commands
* The "s" Command::          @command{sed}'s Swiss Army Knife
* Other Commands::           Less frequently used commands
* Programming Commands::     Commands for @command{sed} gurus
* Extended Commands::        Commands specific to @value{SSED}
* Escapes::                  Specifying special characters
@end menu


@node Execution Cycle
@section How @command{sed} Works

@cindex Buffer spaces, pattern and hold
@cindex Spaces, pattern and hold
@cindex Pattern space, definition
@cindex Hold space, definition
@command{sed} maintains two data buffers: the active @emph{pattern} space,
and the auxiliary @emph{hold} space. Both are initially empty.

@command{sed} operates by performing the following cycle on each
line of input: first, @command{sed} reads one line from the input
stream, removes any trailing newline, and places it in the pattern space.
Then commands are executed; each command can have an address associated
with it: addresses are a kind of condition code, and a command is only
executed if the condition is verified before the command is to be
executed.

When the end of the script is reached, unless the @option{-n} option
is in use, the contents of pattern space are printed out to the output
stream, adding back the trailing newline if it was removed.@footnote{Actually,
if @command{sed} prints a line without the terminating newline, it will
nevertheless print the missing newline as soon as more text is sent to
the same output stream, which gives the ``least expected surprise''
even though it does not make commands like @samp{sed -n p} exactly
identical to @command{cat}.} Then the next cycle starts for the next
input line.
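
The effect of the automatic print at the end of each cycle can be seen
in a small sketch (the two-line input and the variable names
@code{twice} and @code{once} are illustrative only). With autoprint
active, the @code{p} command prints the pattern space and the end of the
cycle prints it again; with @option{-n}, only the explicit @code{p}
produces output:

```shell
# Autoprint on: "p" prints the pattern space, then the end of the
# cycle prints it once more, so every input line appears twice.
twice=$(printf 'a\nb\n' | sed 'p')
printf '%s\n' "$twice"

# Autoprint off (-n): only the explicit "p" command prints.
once=$(printf 'a\nb\n' | sed -n 'p')
printf '%s\n' "$once"
```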

Unless special commands (like @samp{D}) are used, the pattern space is
deleted between two cycles. The hold space, on the other hand, keeps
its data between cycles (see commands @samp{h}, @samp{H}, @samp{x},
@samp{g}, @samp{G} to move data between the two buffers).


@node Addresses
@section Selecting lines with @command{sed}
@cindex Addresses, in @command{sed} scripts
@cindex Line selection
@cindex Selecting lines to process

Addresses in a @command{sed} script can be in any of the following forms:
@table @code
@item @var{number}
@cindex Address, numeric
@cindex Line, selecting by number
Specifying a line number will match only that line in the input.
(Note that @command{sed} counts lines continuously across all input files
unless the @option{-i} or @option{-s} option is specified.)

@item @var{first}~@var{step}
@cindex @acronym{GNU} extensions, @samp{@var{n}~@var{m}} addresses
This @acronym{GNU} extension matches every @var{step}th line
starting with line @var{first}.
In particular, lines will be selected when there exists
a non-negative @var{n} such that the current line-number equals
@var{first} + (@var{n} * @var{step}).
Thus, to select the odd-numbered lines,
one would use @code{1~2};
to pick every third line starting with the second, @samp{2~3} would be used;
to pick every fifth line starting with the tenth, use @samp{10~5};
and @samp{50~0} is just an obscure way of saying @code{50}.

@item $
@cindex Address, last line
@cindex Last line, selecting
@cindex Line, selecting last
This address matches the last line of the last file of input, or
the last line of each file when the @option{-i} or @option{-s} options
are specified.

@item /@var{regexp}/
@cindex Address, as a regular expression
@cindex Line, selecting by regular expression match
This will select any line which matches the regular expression @var{regexp}.
If @var{regexp} itself includes any @code{/} characters,
each must be escaped by a backslash (@code{\}).

@cindex empty regular expression
@cindex @value{SSEDEXT}, modifiers and the empty regular expression
The empty regular expression @samp{//} repeats the last regular
expression match (the same holds if the empty regular expression is
passed to the @code{s} command). Note that modifiers to regular expressions
are evaluated when the regular expression is compiled; thus it is invalid to
specify them together with the empty regular expression.

@item \%@var{regexp}%
(The @code{%} may be replaced by any other single character.)

@cindex Slash character, in regular expressions
This also matches the regular expression @var{regexp},
but allows one to use a different delimiter than @code{/}.
This is particularly useful if the @var{regexp} itself contains
a lot of slashes, since it avoids the tedious escaping of every @code{/}.
If @var{regexp} itself includes any delimiter characters,
each must be escaped by a backslash (@code{\}).

@item /@var{regexp}/I
@itemx \%@var{regexp}%I
@cindex @acronym{GNU} extensions, @code{I} modifier
@ifset PERL
@cindex Perl-style regular expressions, case-insensitive
@end ifset
The @code{I} modifier to regular-expression matching is a @acronym{GNU}
extension which causes the @var{regexp} to be matched in
a case-insensitive manner.

@item /@var{regexp}/M
@itemx \%@var{regexp}%M
@cindex @value{SSEDEXT}, @code{M} modifier
@ifset PERL
@cindex Perl-style regular expressions, multiline
@end ifset
The @code{M} modifier to regular-expression matching is a @value{SSED}
extension which causes @code{^} and @code{$} to match respectively
(in addition to the normal behavior) the empty string after a newline,
and the empty string before a newline. There are special character
sequences
@ifset PERL
(@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'}
in basic or extended regular expression modes)
@end ifset
@ifclear PERL
(@code{\`} and @code{\'})
@end ifclear
which always match the beginning or the end of the buffer.
@code{M} stands for @cite{multi-line}.

@ifset PERL
@item /@var{regexp}/S
@itemx \%@var{regexp}%S
@cindex @value{SSEDEXT}, @code{S} modifier
@cindex Perl-style regular expressions, single line
The @code{S} modifier to regular-expression matching is only valid
in Perl mode and specifies that the dot character (@code{.}) will
match the newline character too. @code{S} stands for @cite{single-line}.
@end ifset

@ifset PERL
@item /@var{regexp}/X
@itemx \%@var{regexp}%X
@cindex @value{SSEDEXT}, @code{X} modifier
@cindex Perl-style regular expressions, extended
The @code{X} modifier to regular-expression matching is also
valid in Perl mode only. If it is used, whitespace in the
pattern (other than in a character class) and
characters between a @kbd{#} outside a character class and the
next newline character are ignored. An escaping backslash
can be used to include a whitespace or @kbd{#} character as part
of the pattern.
@end ifset
@end table

If no addresses are given, then all lines are matched;
if one address is given, then only lines matching that
address are matched.
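
The single-address forms can be tried out with a short sketch. The
three-line input and the variable names are invented for the
illustration, and only the address forms that are not @acronym{GNU}
extensions are used, so that the commands do not depend on
@value{SSED}:

```shell
input=$(printf 'alpha\nbeta\ngamma')

# No address: the command applies to every line.
all=$(printf '%s\n' "$input" | sed -n 'p')

# Numeric address: only line 2.
second=$(printf '%s\n' "$input" | sed -n '2p')

# "$" address: only the last line of input.
last=$(printf '%s\n' "$input" | sed -n '$p')

# Regular-expression address: only lines that start with "b".
byregexp=$(printf '%s\n' "$input" | sed -n '/^b/p')

printf '%s\n' "$all" "$second" "$last" "$byregexp"
```

The scripts are single-quoted so that the shell does not try to expand
@samp{$p} as a variable.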

@cindex Range of lines
@cindex Several lines, selecting
An address range can be specified by specifying two addresses
separated by a comma (@code{,}). An address range matches lines
starting from where the first address matches, and continues
until the second address matches (inclusively).

If the second address is a @var{regexp}, then checking for the
ending match will start with the line @emph{following} the
line which matched the first address: a range will always
span at least two lines (except of course if the input stream
ends).

If the second address is a @var{number} less than (or equal to)
the line matching the first address, then only the one line is
matched.

@cindex Special addressing forms
@cindex Range with start address of zero
@cindex Zero, as range start address
@cindex @var{addr1},+N
@cindex @var{addr1},~N
@cindex @acronym{GNU} extensions, special two-address forms
@cindex @acronym{GNU} extensions, @code{0} address
@cindex @acronym{GNU} extensions, 0,@var{addr2} addressing
@cindex @acronym{GNU} extensions, @var{addr1},+@var{N} addressing
@cindex @acronym{GNU} extensions, @var{addr1},~@var{N} addressing
@value{SSED} also supports some special two-address forms; all these
are @acronym{GNU} extensions:
@table @code
@item 0,/@var{regexp}/
A line number of @code{0} can be used in an address specification like
@code{0,/@var{regexp}/} so that @command{sed} will try to match
@var{regexp} in the first input line too. In other words,
@code{0,/@var{regexp}/} is similar to @code{1,/@var{regexp}/},
except that if @var{regexp} matches the very first line of input the
@code{0,/@var{regexp}/} form will consider it to end the range, whereas
the @code{1,/@var{regexp}/} form will match the beginning of its range and
hence make the range span up to the @emph{second} occurrence of the
regular expression.

Note that this is the only place where the @code{0} address makes
sense; there is no 0-th line and commands which are given the @code{0}
address in any other way will give an error.

@item @var{addr1},+@var{N}
Matches @var{addr1} and the @var{N} lines following @var{addr1}.

@item @var{addr1},~@var{N}
Matches @var{addr1} and the lines following @var{addr1}
until the next line whose input line number is a multiple of @var{N}.
@end table

@cindex Excluding lines
@cindex Selecting non-matching lines
Appending the @code{!} character to the end of an address
specification negates the sense of the match.
That is, if the @code{!} character follows an address range,
then only lines which do @emph{not} match the address range
will be selected.
This also works for singleton addresses,
and, perhaps perversely, for the null address.


@node Regular Expressions
@section Overview of Regular Expression Syntax

To use @command{sed} effectively, you should understand regular
expressions (@dfn{regexp} for short). A regular expression
is a pattern that is matched against a
subject string from left to right. Most characters are
@dfn{ordinary}: they stand for
themselves in a pattern, and match the corresponding characters
in the subject. As a trivial example, the pattern

@example
The quick brown fox
@end example

@noindent
matches a portion of a subject string that is identical to
itself. The power of regular expressions comes from the
ability to include alternatives and repetitions in the pattern.
These are encoded in the pattern by the use of @dfn{special characters},
which do not stand for themselves but instead
are interpreted in some special way. Here is a brief description
of regular expression syntax as used in @command{sed}.

@table @code
@item @var{char}
A single ordinary character matches itself.

@item *
@cindex @acronym{GNU} extensions, to basic regular expressions
Matches a sequence of zero or more instances of matches for the
preceding regular expression, which must be an ordinary character, a
special character preceded by @code{\}, a @code{.}, a grouped regexp
(see below), or a bracket expression. As a @acronym{GNU} extension, a
postfixed regular expression can also be followed by @code{*}; for
example, @code{a**} is equivalent to @code{a*}. @acronym{POSIX}
1003.1-2001 says that @code{*} stands for itself when it appears at
the start of a regular expression or subexpression, but many
non-@acronym{GNU} implementations do not support this and portable
scripts should instead use @code{\*} in these contexts.

@item \+
@cindex @acronym{GNU} extensions, to basic regular expressions
As @code{*}, but matches one or more. It is a @acronym{GNU} extension.

@item \?
@cindex @acronym{GNU} extensions, to basic regular expressions
As @code{*}, but only matches zero or one. It is a @acronym{GNU} extension.

@item \@{@var{i}\@}
As @code{*}, but matches exactly @var{i} sequences (@var{i} is a
decimal integer; for portability, keep it between 0 and 255
inclusive).

@item \@{@var{i},@var{j}\@}
Matches between @var{i} and @var{j}, inclusive, sequences.

@item \@{@var{i},\@}
Matches @var{i} or more sequences.

@item \(@var{regexp}\)
Groups the inner @var{regexp} as a whole; this is used to:

@itemize @bullet
@item
@cindex @acronym{GNU} extensions, to basic regular expressions
Apply postfix operators, like @code{\(abcd\)*}:
this will search for zero or more whole sequences
of @samp{abcd}, while @code{abcd*} would search
for @samp{abc} followed by zero or more occurrences
of @samp{d}.
Note that support for @code{\(abcd\)*} is
required by @acronym{POSIX} 1003.1-2001, but many non-@acronym{GNU}
implementations do not support it and hence it is not universally
portable.

@item
Use back references (see below).
@end itemize

@item .
Matches any character, including newline.

@item ^
Matches the null string at beginning of the pattern space, i.e. what
appears after the circumflex must appear at the beginning of the
pattern space.

In most scripts, pattern space is initialized to the content of each
line (@pxref{Execution Cycle, , How @code{sed} works}). So, it is a
useful simplification to think of @code{^#include} as matching only
lines where @samp{#include} is the first thing on the line---if there are
spaces before, for example, the match fails. This simplification is
valid as long as the original content of pattern space is not modified,
for example with an @code{s} command.

@code{^} acts as a special character only at the beginning of the
regular expression or subexpression (that is, after @code{\(} or
@code{\|}). Portable scripts should avoid @code{^} at the beginning of
a subexpression, though, as @acronym{POSIX} allows implementations that
treat @code{^} as an ordinary character in that context.

@item $
It is the same as @code{^}, but refers to end of pattern space.
@code{$} also acts as a special character only at the end
of the regular expression or subexpression (that is, before @code{\)}
or @code{\|}), and its use at the end of a subexpression is not
portable.


@item [@var{list}]
@itemx [^@var{list}]
Matches any single character in @var{list}: for example,
@code{[aeiou]} matches any vowel. A list may include
sequences like @code{@var{char1}-@var{char2}}, which
matches any character between (inclusive) @var{char1}
and @var{char2}.

A leading @code{^} reverses the meaning of @var{list}, so that
it matches any single character @emph{not} in @var{list}. To include
@code{]} in the list, make it the first character (after
the @code{^} if needed); to include @code{-} in the list,
make it the first or last; to include @code{^} put
it after the first character.

@cindex @code{POSIXLY_CORRECT} behavior, bracket expressions
The characters @code{$}, @code{*}, @code{.}, @code{[}, and @code{\}
are normally not special within @var{list}. For example, @code{[\*]}
matches either @samp{\} or @samp{*}, because the @code{\} is not
special here. However, strings like @code{[.ch.]}, @code{[=a=]}, and
@code{[:space:]} are special within @var{list} and represent collating
symbols, equivalence classes, and character classes, respectively, and
@code{[} is therefore special within @var{list} when it is followed by
@code{.}, @code{=}, or @code{:}. Also, when not in
@env{POSIXLY_CORRECT} mode, special escapes like @code{\n} and
@code{\t} are recognized within @var{list}. @xref{Escapes}.

@item @var{regexp1}\|@var{regexp2}
@cindex @acronym{GNU} extensions, to basic regular expressions
Matches either @var{regexp1} or @var{regexp2}. Use
parentheses to use complex alternative regular expressions.
The matching process tries each alternative in turn, from
left to right, and the first one that succeeds is used.
It is a @acronym{GNU} extension.

@item @var{regexp1}@var{regexp2}
Matches the concatenation of @var{regexp1} and @var{regexp2}.
Concatenation binds more tightly than @code{\|}, @code{^}, and
@code{$}, but less tightly than the other regular expression
operators.

@item \@var{digit}
Matches the @var{digit}-th @code{\(@dots{}\)} parenthesized
subexpression in the regular expression. This is called a @dfn{back
reference}.
Subexpressions are implicitly numbered by counting
occurrences of @code{\(} left-to-right.

@item \n
Matches the newline character.

@item \@var{char}
Matches @var{char}, where @var{char} is one of @code{$},
@code{*}, @code{.}, @code{[}, @code{\}, or @code{^}.
Note that the only C-like
backslash sequences that you can portably assume to be
interpreted are @code{\n} and @code{\\}; in particular
@code{\t} is not portable, and matches a @samp{t} under most
implementations of @command{sed}, rather than a tab character.

@end table

@cindex Greedy regular expression matching
Note that the regular expression matcher is greedy, i.e., matches
are attempted from left to right and, if two or more matches are
possible starting at the same character, it selects the longest.

@noindent
Examples:
@table @samp
@item abcdef
Matches @samp{abcdef}.

@item a*b
Matches zero or more @samp{a}s followed by a single
@samp{b}.  For example, @samp{b} or @samp{aaaaab}.

@item a\?b
Matches @samp{b} or @samp{ab}.

@item a\+b\+
Matches one or more @samp{a}s followed by one or more
@samp{b}s: @samp{ab} is the shortest possible match, but
other examples are @samp{aaaab} or @samp{abbbbb} or
@samp{aaaaaabbbbbbb}.

@item .*
@itemx .\+
These two both match all the characters in a string;
however, the first matches every string (including the empty
string), while the second matches only strings containing
at least one character.

@item ^main.*(.*)
This matches a string starting with @samp{main},
followed by an opening and closing
parenthesis.  The @samp{n}, @samp{(} and @samp{)} need not
be adjacent.

@item ^#
This matches a string beginning with @samp{#}.

@item \\$
This matches a string ending with a single backslash.  The
regexp contains two backslashes for escaping.
849 850 @item \$ 851 Instead, this matches a string consisting of a single dollar sign, 852 because it is escaped. 853 854 @item [a-zA-Z0-9] 855 In the C locale, this matches any @acronym{ASCII} letters or digits. 856 857 @item [^ @kbd{tab}]\+ 858 (Here @kbd{tab} stands for a single tab character.) 859 This matches a string of one or more 860 characters, none of which is a space or a tab. 861 Usually this means a word. 862 863 @item ^\(.*\)\n\1$ 864 This matches a string consisting of two equal substrings separated by 865 a newline. 866 867 @item .\@{9\@}A$ 868 This matches nine characters followed by an @samp{A}. 869 870 @item ^.\@{15\@}A 871 This matches the start of a string that contains 16 characters, 872 the last of which is an @samp{A}. 873 874 @end table 875 876 877 878 @node Common Commands 879 @section Often-Used Commands 880 881 If you use @command{sed} at all, you will quite likely want to know 882 these commands. 883 884 @table @code 885 @item # 886 [No addresses allowed.] 887 888 @findex # (comments) 889 @cindex Comments, in scripts 890 The @code{#} character begins a comment; 891 the comment continues until the next newline. 892 893 @cindex Portability, comments 894 If you are concerned about portability, be aware that 895 some implementations of @command{sed} (which are not @sc{posix} 896 conformant) may only support a single one-line comment, 897 and then only when the very first character of the script is a @code{#}. 898 899 @findex -n, forcing from within a script 900 @cindex Caveat --- #n on first line 901 Warning: if the first two characters of the @command{sed} script 902 are @code{#n}, then the @option{-n} (no-autoprint) option is forced. 903 If you want to put a comment in the first line of your script 904 and that comment begins with the letter @samp{n} 905 and you do not want this behavior, 906 then be sure to either use a capital @samp{N}, 907 or place at least one space before the @samp{n}. 

@item q [@var{exit-code}]
This command only accepts a single address.

@findex q (quit) command
@cindex @value{SSEDEXT}, returning an exit code
@cindex Quitting
Exit @command{sed} without processing any more commands or input.
Note that the current pattern space is printed if auto-print is
not disabled with the @option{-n} option.  The ability to return
an exit code from the @command{sed} script is a @value{SSED} extension.

@item d
@findex d (delete) command
@cindex Text, deleting
Delete the pattern space;
immediately start next cycle.

@item p
@findex p (print) command
@cindex Text, printing
Print out the pattern space (to the standard output).
This command is usually only used in conjunction with the @option{-n}
command-line option.

@item n
@findex n (next-line) command
@cindex Next input line, replace pattern space with
@cindex Read next input line
If auto-print is not disabled, print the pattern space,
then, regardless, replace the pattern space with the next line of input.
If there is no more input then @command{sed} exits without processing
any more commands.

@item @{ @var{commands} @}
@findex @{@} command grouping
@cindex Grouping commands
@cindex Command groups
A group of commands may be enclosed between
@code{@{} and @code{@}} characters.
This is particularly useful when you want a group of commands
to be triggered by a single address (or address-range) match.

@end table

@node The "s" Command
@section The @code{s} Command

The syntax of the @code{s} (as in substitute) command is
@samp{s/@var{regexp}/@var{replacement}/@var{flags}}.  The @code{/}
characters may be uniformly replaced by any other single
character within any given @code{s} command.
The @code{/} 960 character (or whatever other character is used in its stead) 961 can appear in the @var{regexp} or @var{replacement} 962 only if it is preceded by a @code{\} character. 963 964 The @code{s} command is probably the most important in @command{sed} 965 and has a lot of different options. Its basic concept is simple: 966 the @code{s} command attempts to match the pattern 967 space against the supplied @var{regexp}; if the match is 968 successful, then that portion of the pattern 969 space which was matched is replaced with @var{replacement}. 970 971 @cindex Backreferences, in regular expressions 972 @cindex Parenthesized substrings 973 The @var{replacement} can contain @code{\@var{n}} (@var{n} being 974 a number from 1 to 9, inclusive) references, which refer to 975 the portion of the match which is contained between the @var{n}th 976 @code{\(} and its matching @code{\)}. 977 Also, the @var{replacement} can contain unescaped @code{&} 978 characters which reference the whole matched portion 979 of the pattern space. 980 @cindex @value{SSEDEXT}, case modifiers in @code{s} commands 981 Finally, as a @value{SSED} extension, you can include a 982 special sequence made of a backslash and one of the letters 983 @code{L}, @code{l}, @code{U}, @code{u}, or @code{E}. 984 The meaning is as follows: 985 986 @table @code 987 @item \L 988 Turn the replacement 989 to lowercase until a @code{\U} or @code{\E} is found, 990 991 @item \l 992 Turn the 993 next character to lowercase, 994 995 @item \U 996 Turn the replacement to uppercase 997 until a @code{\L} or @code{\E} is found, 998 999 @item \u 1000 Turn the next character 1001 to uppercase, 1002 1003 @item \E 1004 Stop case conversion started by @code{\L} or @code{\U}. 1005 @end table 1006 1007 To include a literal @code{\}, @code{&}, or newline in the final 1008 replacement, be sure to precede the desired @code{\}, @code{&}, 1009 or newline in the @var{replacement} with a @code{\}. 
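
As a brief illustration of the replacement syntax described above,
here is a minimal sketch.  The sample input strings are invented for
this example, and the last command relies on the case-conversion
extension, so it assumes @value{SSED}:

@example
# \1 and \2 refer to the first and second \(...\) groups.
echo 'hello world' | sed 's/\(hello\) \(world\)/\2 \1/'

# An unescaped & stands for the whole match; \& is a literal ampersand.
echo 'cat' | sed 's/cat/[&] \&/'

# \U starts upper-casing and \E stops it (an extension, not portable).
echo 'hello world' | sed 's/\(hello\) \(world\)/\U\1\E \2/'
@end example

@noindent
These commands should print @samp{world hello}, @samp{[cat] &} and
@samp{HELLO world}, respectively.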
1010 1011 @findex s command, option flags 1012 @cindex Substitution of text, options 1013 The @code{s} command can be followed by zero or more of the 1014 following @var{flags}: 1015 1016 @table @code 1017 @item g 1018 @cindex Global substitution 1019 @cindex Replacing all text matching regexp in a line 1020 Apply the replacement to @emph{all} matches to the @var{regexp}, 1021 not just the first. 1022 1023 @item @var{number} 1024 @cindex Replacing only @var{n}th match of regexp in a line 1025 Only replace the @var{number}th match of the @var{regexp}. 1026 1027 @cindex @acronym{GNU} extensions, @code{g} and @var{number} modifier interaction in @code{s} command 1028 @cindex Mixing @code{g} and @var{number} modifiers in the @code{s} command 1029 Note: the @sc{posix} standard does not specify what should happen 1030 when you mix the @code{g} and @var{number} modifiers, 1031 and currently there is no widely agreed upon meaning 1032 across @command{sed} implementations. 1033 For @value{SSED}, the interaction is defined to be: 1034 ignore matches before the @var{number}th, 1035 and then match and replace all matches from 1036 the @var{number}th on. 1037 1038 @item p 1039 @cindex Text, printing after substitution 1040 If the substitution was made, then print the new pattern space. 1041 1042 Note: when both the @code{p} and @code{e} options are specified, 1043 the relative ordering of the two produces very different results. 1044 In general, @code{ep} (evaluate then print) is what you want, 1045 but operating the other way round can be useful for debugging. 1046 For this reason, the current version of @value{SSED} interprets 1047 specially the presence of @code{p} options both before and after 1048 @code{e}, printing the pattern space before and after evaluation, 1049 while in general flags for the @code{s} command show their 1050 effect just once. This behavior, although documented, might 1051 change in future versions. 
1052 1053 @item w @var{file-name} 1054 @cindex Text, writing to a file after substitution 1055 @cindex @value{SSEDEXT}, @file{/dev/stdout} file 1056 @cindex @value{SSEDEXT}, @file{/dev/stderr} file 1057 If the substitution was made, then write out the result to the named file. 1058 As a @value{SSED} extension, two special values of @var{file-name} are 1059 supported: @file{/dev/stderr}, which writes the result to the standard 1060 error, and @file{/dev/stdout}, which writes to the standard 1061 output.@footnote{This is equivalent to @code{p} unless the @option{-i} 1062 option is being used.} 1063 1064 @item e 1065 @cindex Evaluate Bourne-shell commands, after substitution 1066 @cindex Subprocesses 1067 @cindex @value{SSEDEXT}, evaluating Bourne-shell commands 1068 @cindex @value{SSEDEXT}, subprocesses 1069 This command allows one to pipe input from a shell command 1070 into pattern space. If a substitution was made, the command 1071 that is found in pattern space is executed and pattern space 1072 is replaced with its output. A trailing newline is suppressed; 1073 results are undefined if the command to be executed contains 1074 a @sc{nul} character. This is a @value{SSED} extension. 1075 1076 @item I 1077 @itemx i 1078 @cindex @acronym{GNU} extensions, @code{I} modifier 1079 @cindex Case-insensitive matching 1080 @ifset PERL 1081 @cindex Perl-style regular expressions, case-insensitive 1082 @end ifset 1083 The @code{I} modifier to regular-expression matching is a @acronym{GNU} 1084 extension which makes @command{sed} match @var{regexp} in a 1085 case-insensitive manner. 
1086 1087 @item M 1088 @itemx m 1089 @cindex @value{SSEDEXT}, @code{M} modifier 1090 @ifset PERL 1091 @cindex Perl-style regular expressions, multiline 1092 @end ifset 1093 The @code{M} modifier to regular-expression matching is a @value{SSED} 1094 extension which causes @code{^} and @code{$} to match respectively 1095 (in addition to the normal behavior) the empty string after a newline, 1096 and the empty string before a newline. There are special character 1097 sequences 1098 @ifset PERL 1099 (@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'} 1100 in basic or extended regular expression modes) 1101 @end ifset 1102 @ifclear PERL 1103 (@code{\`} and @code{\'}) 1104 @end ifclear 1105 which always match the beginning or the end of the buffer. 1106 @code{M} stands for @cite{multi-line}. 1107 1108 @ifset PERL 1109 @item S 1110 @itemx s 1111 @cindex @value{SSEDEXT}, @code{S} modifier 1112 @cindex Perl-style regular expressions, single line 1113 The @code{S} modifier to regular-expression matching is only valid 1114 in Perl mode and specifies that the dot character (@code{.}) will 1115 match the newline character too. @code{S} stands for @cite{single-line}. 1116 @end ifset 1117 1118 @ifset PERL 1119 @item X 1120 @itemx x 1121 @cindex @value{SSEDEXT}, @code{X} modifier 1122 @cindex Perl-style regular expressions, extended 1123 The @code{X} modifier to regular-expression matching is also 1124 valid in Perl mode only. If it is used, whitespace in the 1125 pattern (other than in a character class) and 1126 characters between a @kbd{#} outside a character class and the 1127 next newline character are ignored. An escaping backslash 1128 can be used to include a whitespace or @kbd{#} character as part 1129 of the pattern. 
@end ifset
@end table


@node Other Commands
@section Less Frequently-Used Commands

Though perhaps less frequently used than those in the previous
section, some very small yet useful @command{sed} scripts can be built with
these commands.

@table @code
@item y/@var{source-chars}/@var{dest-chars}/
(The @code{/} characters may be uniformly replaced by
any other single character within any given @code{y} command.)

@findex y (transliterate) command
@cindex Transliteration
Transliterate any characters in the pattern space which match
any of the @var{source-chars} with the corresponding character
in @var{dest-chars}.

Instances of the @code{/} (or whatever other character is used in its stead),
@code{\}, or newlines can appear in the @var{source-chars} or @var{dest-chars}
lists, provided that each instance is escaped by a @code{\}.
The @var{source-chars} and @var{dest-chars} lists @emph{must}
contain the same number of characters (after de-escaping).

@item a\
@itemx @var{text}
@cindex @value{SSEDEXT}, two addresses supported by most commands
As a @acronym{GNU} extension, this command accepts two addresses.

@findex a (append text lines) command
@cindex Appending text after a line
@cindex Text, appending
Queue the lines of text which follow this command
(each but the last ending with a @code{\},
which are removed from the output)
to be output at the end of the current cycle,
or when the next input line is read.

Escape sequences in @var{text} are processed, so you should
use @code{\\} in @var{text} to print a single backslash.
1174 1175 As a @acronym{GNU} extension, if between the @code{a} and the newline there is 1176 other than a whitespace-@code{\} sequence, then the text of this line, 1177 starting at the first non-whitespace character after the @code{a}, 1178 is taken as the first line of the @var{text} block. 1179 (This enables a simplification in scripting a one-line add.) 1180 This extension also works with the @code{i} and @code{c} commands. 1181 1182 @item i\ 1183 @itemx @var{text} 1184 @cindex @value{SSEDEXT}, two addresses supported by most commands 1185 As a @acronym{GNU} extension, this command accepts two addresses. 1186 1187 @findex i (insert text lines) command 1188 @cindex Inserting text before a line 1189 @cindex Text, insertion 1190 Immediately output the lines of text which follow this command 1191 (each but the last ending with a @code{\}, 1192 which are removed from the output). 1193 1194 @item c\ 1195 @itemx @var{text} 1196 @findex c (change to text lines) command 1197 @cindex Replacing selected lines with other text 1198 Delete the lines matching the address or address-range, 1199 and output the lines of text which follow this command 1200 (each but the last ending with a @code{\}, 1201 which are removed from the output) 1202 in place of the last line 1203 (or in place of each line, if no addresses were specified). 1204 A new cycle is started after this command is done, 1205 since the pattern space will have been deleted. 1206 1207 @item = 1208 @cindex @value{SSEDEXT}, two addresses supported by most commands 1209 As a @acronym{GNU} extension, this command accepts two addresses. 1210 1211 @findex = (print line number) command 1212 @cindex Printing line number 1213 @cindex Line number, printing 1214 Print out the current input line number (with a trailing newline). 
1215 1216 @item l @var{n} 1217 @findex l (list unambiguously) command 1218 @cindex List pattern space 1219 @cindex Printing text unambiguously 1220 @cindex Line length, setting 1221 @cindex @value{SSEDEXT}, setting line length 1222 Print the pattern space in an unambiguous form: 1223 non-printable characters (and the @code{\} character) 1224 are printed in C-style escaped form; long lines are split, 1225 with a trailing @code{\} character to indicate the split; 1226 the end of each line is marked with a @code{$}. 1227 1228 @var{n} specifies the desired line-wrap length; 1229 a length of 0 (zero) means to never wrap long lines. If omitted, 1230 the default as specified on the command line is used. The @var{n} 1231 parameter is a @value{SSED} extension. 1232 1233 @item r @var{filename} 1234 @cindex @value{SSEDEXT}, two addresses supported by most commands 1235 As a @acronym{GNU} extension, this command accepts two addresses. 1236 1237 @findex r (read file) command 1238 @cindex Read text from a file 1239 @cindex @value{SSEDEXT}, @file{/dev/stdin} file 1240 Queue the contents of @var{filename} to be read and 1241 inserted into the output stream at the end of the current cycle, 1242 or when the next input line is read. 1243 Note that if @var{filename} cannot be read, it is treated as 1244 if it were an empty file, without any error indication. 1245 1246 As a @value{SSED} extension, the special value @file{/dev/stdin} 1247 is supported for the file name, which reads the contents of the 1248 standard input. 1249 1250 @item w @var{filename} 1251 @findex w (write file) command 1252 @cindex Write to a file 1253 @cindex @value{SSEDEXT}, @file{/dev/stdout} file 1254 @cindex @value{SSEDEXT}, @file{/dev/stderr} file 1255 Write the pattern space to @var{filename}. 
1256 As a @value{SSED} extension, two special values of @var{file-name} are 1257 supported: @file{/dev/stderr}, which writes the result to the standard 1258 error, and @file{/dev/stdout}, which writes to the standard 1259 output.@footnote{This is equivalent to @code{p} unless the @option{-i} 1260 option is being used.} 1261 1262 The file will be created (or truncated) before the 1263 first input line is read; all @code{w} commands 1264 (including instances of @code{w} flag on successful @code{s} commands) 1265 which refer to the same @var{filename} are output without 1266 closing and reopening the file. 1267 1268 @item D 1269 @findex D (delete first line) command 1270 @cindex Delete first line from pattern space 1271 Delete text in the pattern space up to the first newline. 1272 If any text is left, restart cycle with the resultant 1273 pattern space (without reading a new line of input), 1274 otherwise start a normal new cycle. 1275 1276 @item N 1277 @findex N (append Next line) command 1278 @cindex Next input line, append to pattern space 1279 @cindex Append next input line to pattern space 1280 Add a newline to the pattern space, 1281 then append the next line of input to the pattern space. 1282 If there is no more input then @command{sed} exits without processing 1283 any more commands. 1284 1285 @item P 1286 @findex P (print first line) command 1287 @cindex Print first line from pattern space 1288 Print out the portion of the pattern space up to the first newline. 1289 1290 @item h 1291 @findex h (hold) command 1292 @cindex Copy pattern space into hold space 1293 @cindex Replace hold space with copy of pattern space 1294 @cindex Hold space, copying pattern space into 1295 Replace the contents of the hold space with the contents of the pattern space. 
1296 1297 @item H 1298 @findex H (append Hold) command 1299 @cindex Append pattern space to hold space 1300 @cindex Hold space, appending from pattern space 1301 Append a newline to the contents of the hold space, 1302 and then append the contents of the pattern space to that of the hold space. 1303 1304 @item g 1305 @findex g (get) command 1306 @cindex Copy hold space into pattern space 1307 @cindex Replace pattern space with copy of hold space 1308 @cindex Hold space, copy into pattern space 1309 Replace the contents of the pattern space with the contents of the hold space. 1310 1311 @item G 1312 @findex G (appending Get) command 1313 @cindex Append hold space to pattern space 1314 @cindex Hold space, appending to pattern space 1315 Append a newline to the contents of the pattern space, 1316 and then append the contents of the hold space to that of the pattern space. 1317 1318 @item x 1319 @findex x (eXchange) command 1320 @cindex Exchange hold space with pattern space 1321 @cindex Hold space, exchange with pattern space 1322 Exchange the contents of the hold and pattern spaces. 1323 1324 @end table 1325 1326 1327 @node Programming Commands 1328 @section Commands for @command{sed} gurus 1329 1330 In most cases, use of these commands indicates that you are 1331 probably better off programming in something like @command{awk} 1332 or Perl. But occasionally one is committed to sticking 1333 with @command{sed}, and these commands can enable one to write 1334 quite convoluted scripts. 1335 1336 @cindex Flow of control in scripts 1337 @table @code 1338 @item : @var{label} 1339 [No addresses allowed.] 1340 1341 @findex : (label) command 1342 @cindex Labels, in scripts 1343 Specify the location of @var{label} for branch commands. 1344 In all other respects, a no-op. 1345 1346 @item b @var{label} 1347 @findex b (branch) command 1348 @cindex Branch to a label, unconditionally 1349 @cindex Goto, in scripts 1350 Unconditionally branch to @var{label}. 
1351 The @var{label} may be omitted, in which case the next cycle is started. 1352 1353 @item t @var{label} 1354 @findex t (test and branch if successful) command 1355 @cindex Branch to a label, if @code{s///} succeeded 1356 @cindex Conditional branch 1357 Branch to @var{label} only if there has been a successful @code{s}ubstitution 1358 since the last input line was read or conditional branch was taken. 1359 The @var{label} may be omitted, in which case the next cycle is started. 1360 1361 @end table 1362 1363 @node Extended Commands 1364 @section Commands Specific to @value{SSED} 1365 1366 These commands are specific to @value{SSED}, so you 1367 must use them with care and only when you are sure that 1368 hindering portability is not evil. They allow you to check 1369 for @value{SSED} extensions or to do tasks that are required 1370 quite often, yet are unsupported by standard @command{sed}s. 1371 1372 @table @code 1373 @item e [@var{command}] 1374 @findex e (evaluate) command 1375 @cindex Evaluate Bourne-shell commands 1376 @cindex Subprocesses 1377 @cindex @value{SSEDEXT}, evaluating Bourne-shell commands 1378 @cindex @value{SSEDEXT}, subprocesses 1379 This command allows one to pipe input from a shell command 1380 into pattern space. Without parameters, the @code{e} command 1381 executes the command that is found in pattern space and 1382 replaces the pattern space with the output; a trailing newline 1383 is suppressed. 1384 1385 If a parameter is specified, instead, the @code{e} command 1386 interprets it as a command and sends its output to the output stream 1387 (like @code{r} does). The command can run across multiple 1388 lines, all but the last ending with a back-slash. 1389 1390 In both cases, the results are undefined if the command to be 1391 executed contains a @sc{nul} character. 
1392 1393 @item L @var{n} 1394 @findex L (fLow paragraphs) command 1395 @cindex Reformat pattern space 1396 @cindex Reformatting paragraphs 1397 @cindex @value{SSEDEXT}, reformatting paragraphs 1398 @cindex @value{SSEDEXT}, @code{L} command 1399 This @value{SSED} extension fills and joins lines in pattern space 1400 to produce output lines of (at most) @var{n} characters, like 1401 @code{fmt} does; if @var{n} is omitted, the default as specified 1402 on the command line is used. This command is considered a failed 1403 experiment and unless there is enough request (which seems unlikely) 1404 will be removed in future versions. 1405 1406 @ignore 1407 Blank lines, spaces between words, and indentation are 1408 preserved in the output; successive input lines with different 1409 indentation are not joined; tabs are expanded to 8 columns. 1410 1411 If the pattern space contains multiple lines, they are joined, but 1412 since the pattern space usually contains a single line, the behavior 1413 of a simple @code{L;d} script is the same as @samp{fmt -s} (i.e., 1414 it does not join short lines to form longer ones). 1415 1416 @var{n} specifies the desired line-wrap length; if omitted, 1417 the default as specified on the command line is used. 1418 @end ignore 1419 1420 @item Q [@var{exit-code}] 1421 This command only accepts a single address. 1422 1423 @findex Q (silent Quit) command 1424 @cindex @value{SSEDEXT}, quitting silently 1425 @cindex @value{SSEDEXT}, returning an exit code 1426 @cindex Quitting 1427 This command is the same as @code{q}, but will not print the 1428 contents of pattern space. Like @code{q}, it provides the 1429 ability to return an exit code to the caller. 
1430 1431 This command can be useful because the only alternative ways 1432 to accomplish this apparently trivial function are to use 1433 the @option{-n} option (which can unnecessarily complicate 1434 your script) or resorting to the following snippet, which 1435 wastes time by reading the whole file without any visible effect: 1436 1437 @example 1438 :eat 1439 $d @i{@r{Quit silently on the last line}} 1440 N @i{@r{Read another line, silently}} 1441 g @i{@r{Overwrite pattern space each time to save memory}} 1442 b eat 1443 @end example 1444 1445 @item R @var{filename} 1446 @findex R (read line) command 1447 @cindex Read text from a file 1448 @cindex @value{SSEDEXT}, reading a file a line at a time 1449 @cindex @value{SSEDEXT}, @code{R} command 1450 @cindex @value{SSEDEXT}, @file{/dev/stdin} file 1451 Queue a line of @var{filename} to be read and 1452 inserted into the output stream at the end of the current cycle, 1453 or when the next input line is read. 1454 Note that if @var{filename} cannot be read, or if its end is 1455 reached, no line is appended, without any error indication. 1456 1457 As with the @code{r} command, the special value @file{/dev/stdin} 1458 is supported for the file name, which reads a line from the 1459 standard input. 1460 1461 @item T @var{label} 1462 @findex T (test and branch if failed) command 1463 @cindex @value{SSEDEXT}, branch if @code{s///} failed 1464 @cindex Branch to a label, if @code{s///} failed 1465 @cindex Conditional branch 1466 Branch to @var{label} only if there have been no successful 1467 @code{s}ubstitutions since the last input line was read or 1468 conditional branch was taken. The @var{label} may be omitted, 1469 in which case the next cycle is started. 
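
For instance, the following sketch (the sample input is invented for
illustration) marks only the lines on which the first substitution
succeeded.  Here @code{T} is given without a label, so lines where
nothing was replaced skip the remaining commands:

@example
printf 'foo\nbaz\n' | sed -e 's/foo/bar/' -e 'T' -e 's/$/ (changed)/'
@end example

@noindent
With @value{SSED} this should print @samp{bar (changed)} followed by
@samp{baz}.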
1470 1471 @item v @var{version} 1472 @findex v (version) command 1473 @cindex @value{SSEDEXT}, checking for their presence 1474 @cindex Requiring @value{SSED} 1475 This command does nothing, but makes @command{sed} fail if 1476 @value{SSED} extensions are not supported, simply because other 1477 versions of @command{sed} do not implement it. In addition, you 1478 can specify the version of @command{sed} that your script 1479 requires, such as @code{4.0.5}. The default is @code{4.0} 1480 because that is the first version that implemented this command. 1481 1482 This command enables all @value{SSEDEXT} even if 1483 @env{POSIXLY_CORRECT} is set in the environment. 1484 1485 @item W @var{filename} 1486 @findex W (write first line) command 1487 @cindex Write first line to a file 1488 @cindex @value{SSEDEXT}, writing first line to a file 1489 Write to the given filename the portion of the pattern space up to 1490 the first newline. Everything said under the @code{w} command about 1491 file handling holds here too. 1492 1493 @item z 1494 @findex z (Zap) command 1495 @cindex @value{SSEDEXT}, emptying pattern space 1496 @cindex Emptying pattern space 1497 This command empties the content of pattern space. It is 1498 usually the same as @samp{s/.*//}, but is more efficient 1499 and works in the presence of invalid multibyte sequences 1500 in the input stream. @sc{posix} mandates that such sequences 1501 are @emph{not} matched by @samp{.}, so that there is no portable 1502 way to clear @command{sed}'s buffers in the middle of the 1503 script in most multibyte locales (including UTF-8 locales). 1504 @end table 1505 1506 @node Escapes 1507 @section @acronym{GNU} Extensions for Escapes in Regular Expressions 1508 1509 @cindex @acronym{GNU} extensions, special escapes 1510 Until this chapter, we have only encountered escapes of the form 1511 @samp{\^}, which tell @command{sed} not to interpret the circumflex 1512 as a special character, but rather to take it literally. 
For
example, @samp{\*} matches a single asterisk rather than zero
or more backslashes.

@cindex @code{POSIXLY_CORRECT} behavior, escapes
This chapter introduces another kind of escape@footnote{All
the escapes introduced here are @acronym{GNU}
extensions, with the exception of @code{\n}.  In basic regular
expression mode, setting @code{POSIXLY_CORRECT} disables them inside
bracket expressions.}---that
is, escapes that are applied to a character or sequence of characters
that ordinarily are taken literally, and that @command{sed} replaces
with a special character.  This provides a way
of encoding non-printable characters in patterns in a visible manner.
There is no restriction on the appearance of non-printing characters
in a @command{sed} script, but when a script is being prepared in the
shell or by text editing, it is usually easier to use one of
the following escape sequences than the binary character it
represents.

The list of these escapes is:

@table @code
@item \a
Produces or matches a @sc{bel} character, that is an ``alert'' (@sc{ascii} 7).

@item \f
Produces or matches a form feed (@sc{ascii} 12).

@item \n
Produces or matches a newline (@sc{ascii} 10).

@item \r
Produces or matches a carriage return (@sc{ascii} 13).

@item \t
Produces or matches a horizontal tab (@sc{ascii} 9).

@item \v
Produces or matches a so-called ``vertical tab'' (@sc{ascii} 11).

@item \c@var{x}
Produces or matches @kbd{@sc{Control}-@var{x}}, where @var{x} is
any character.  The precise effect of @samp{\c@var{x}} is as follows:
if @var{x} is a lower case letter, it is converted to upper case.
Then bit 6 of the character (hex 40) is inverted.  Thus @samp{\cz} becomes
hex 1A, but @samp{\c@{} becomes hex 3B, while @samp{\c;} becomes hex 7B.
1559 1560 @item \d@var{xxx} 1561 Produces or matches a character whose decimal @sc{ascii} value is @var{xxx}. 1562 1563 @item \o@var{xxx} 1564 @ifset PERL 1565 @item \@var{xxx} 1566 @end ifset 1567 Produces or matches a character whose octal @sc{ascii} value is @var{xxx}. 1568 @ifset PERL 1569 The syntax without the @code{o} is active in Perl mode, while the one 1570 with the @code{o} is active in the normal or extended @sc{posix} regular 1571 expression modes. 1572 @end ifset 1573 1574 @item \x@var{xx} 1575 Produces or matches a character whose hexadecimal @sc{ascii} value is @var{xx}. 1576 @end table 1577 1578 @samp{\b} (backspace) was omitted because of the conflict with 1579 the existing ``word boundary'' meaning. 1580 1581 Other escapes match a particular character class and are valid only in 1582 regular expressions: 1583 1584 @table @code 1585 @item \w 1586 Matches any ``word'' character. A ``word'' character is any 1587 letter or digit or the underscore character. 1588 1589 @item \W 1590 Matches any ``non-word'' character. 1591 1592 @item \b 1593 Matches a word boundary; that is it matches if the character 1594 to the left is a ``word'' character and the character to the 1595 right is a ``non-word'' character, or vice-versa. 1596 1597 @item \B 1598 Matches everywhere but on a word boundary; that is it matches 1599 if the character to the left and the character to the right 1600 are either both ``word'' characters or both ``non-word'' 1601 characters. 1602 1603 @item \` 1604 Matches only at the start of pattern space. This is different 1605 from @code{^} in multi-line mode. 1606 1607 @item \' 1608 Matches only at the end of pattern space. This is different 1609 from @code{$} in multi-line mode. 1610 1611 @ifset PERL 1612 @item \G 1613 Match only at the start of pattern space or, when doing a global 1614 substitution using the @code{s///g} command and option, at 1615 the end-of-match position of the prior match. 
For example,
@samp{s/\Ga/Z/g} will change an initial run of @code{a}s to
a run of @code{Z}s.
@end ifset
@end table

@node Examples
@chapter Some Sample Scripts

Here are some @command{sed} scripts to guide you in the art of mastering
@command{sed}.

@menu
Some exotic examples:
* Centering lines::
* Increment a number::
* Rename files to lower case::
* Print bash environment::
* Reverse chars of lines::

Emulating standard utilities:
* tac::                             Reverse lines of files
* cat -n::                          Numbering lines
* cat -b::                          Numbering non-blank lines
* wc -c::                           Counting chars
* wc -w::                           Counting words
* wc -l::                           Counting lines
* head::                            Printing the first lines
* tail::                            Printing the last lines
* uniq::                            Make duplicate lines unique
* uniq -d::                         Print duplicated lines of input
* uniq -u::                         Remove all duplicated lines
* cat -s::                          Squeezing blank lines
@end menu

@node Centering lines
@section Centering Lines

This script centers all lines of a file within a width of 80 columns.
To change that width, the number in @code{\@{@dots{}\@}} must be
replaced, and the number of added spaces must also be changed.

Note how the buffer commands are used to separate parts in
the regular expressions to be matched---this is a common
technique.
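
The last substitution of the script that follows does the actual
centering, and it can be tried on its own.  The following is only a
sketch under simplifying assumptions: the helper name
@code{halve_padding} is hypothetical, a @samp{|} stands in for the
newline that @code{G} would add, and the input is assumed to be
already padded to the target width (10 columns here).

```shell
# halve_padding: move half of the padding that follows the '|' marker
# to the front of the line (hypothetical helper, for illustration).
halve_padding() (
  # \2 has to match twice in a row, so it can hold only half of the spaces
  sed 's/^\(.*\)|\(.*\)\2/\2\1/'
)

# "hi" plus 8 spaces of padding, 10 columns in total
printf 'hi|        \n' | halve_padding
```

With the padding shown, this should print @samp{hi} preceded by four
spaces.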

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

# Put 80 spaces in the buffer
1 @{
  x
  s/^$/          /
  s/^.*$/&&&&&&&&/
  x
@}

# del leading and trailing spaces
y/@kbd{tab}/ /
s/^ *//
s/ *$//

# add a newline and 80 spaces to end of line
G

# keep first 81 chars (80 + a newline)
s/^\(.\@{81\@}\).*$/\1/

# \2 matches half of the spaces, which are moved to the beginning
s/^\(.*\)\n\(.*\)\2/\2\1/
@end example
@c end---------------------------------------------

@node Increment a number
@section Increment a Number

This script is one of a few that demonstrate how to do arithmetic
in @command{sed}.  This is indeed possible,@footnote{@command{sed} guru Greg
Ubben wrote an implementation of the @command{dc} @sc{rpn} calculator!
It is distributed together with @command{sed}.} but must be done manually.

To increment a number you just add 1 to the last digit, replacing
it by the following digit.  There is one exception: when the digit
is a nine, the previous digits must also be incremented until you
reach a digit that is not a nine.

This solution by Bruno Haible is very clever and smart because
it uses a single buffer; if you don't have this limitation, the
algorithm used in @ref{cat -n, Numbering lines}, is faster.
It works by replacing trailing nines with an underscore, then
using multiple @code{s} commands to increment the last digit,
and then again substituting underscores with zeros.

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

/[^0-9]/ d

# replace all trailing 9s by _ (any other character except digits
# could be used)
:d
s/9\(_*\)$/_\1/
td

# incr last digit only.  The first line adds a most-significant
# digit of 1 if we have to add a digit.
#
# The @code{tn} commands are not necessary, but make the thing
# faster

s/^\(_*\)$/1\1/; tn
s/8\(_*\)$/9\1/; tn
s/7\(_*\)$/8\1/; tn
s/6\(_*\)$/7\1/; tn
s/5\(_*\)$/6\1/; tn
s/4\(_*\)$/5\1/; tn
s/3\(_*\)$/4\1/; tn
s/2\(_*\)$/3\1/; tn
s/1\(_*\)$/2\1/; tn
s/0\(_*\)$/1\1/; tn

:n
y/_/0/
@end example
@c end---------------------------------------------

@node Rename files to lower case
@section Rename Files to Lower Case

This is a pretty strange use of @command{sed}.  We transform text
into shell commands, then just feed them to the shell.
Don't worry, even worse hacks are done when using @command{sed}; I have
seen a script converting the output of @command{date} into a @command{bc}
program!

The main body of this is the @command{sed} script, which remaps the name
from lower to upper (or vice-versa) and even checks
whether the remapped name is the same as the original name.
Note how the script is parameterized using shell
variables and proper quoting.

@c start-------------------------------------------
@example
#! /bin/sh
# rename files to lower/upper case...
#
# usage:
#    move-to-lower *
#    move-to-upper *
# or
#    move-to-lower -R .
#    move-to-upper -R .
#

help()
@{
        cat << eof
Usage: $0 [-n] [-R] [-h] files...

-n      do nothing, only see what would be done
-R      recursive (use find)
-h      this message
files   files to remap to lower case

Examples:
       $0 -n *        (see if everything is ok, then...)
       $0 *

       $0 -R .

eof
@}

apply_cmd='sh'
finder='echo "$@@" | tr " " "\n"'
files_only=

while :
do
    case "$1" in
        -n) apply_cmd='cat' ;;
        -R) finder='find "$@@" -type f';;
        -h) help ; exit 1 ;;
        *) break ;;
    esac
    shift
done

if [ -z "$1" ]; then
        echo Usage: $0 [-h] [-n] [-R] files...
        exit 1
fi

LOWER='abcdefghijklmnopqrstuvwxyz'
UPPER='ABCDEFGHIJKLMNOPQRSTUVWXYZ'

case `basename $0` in
        *upper*) TO=$UPPER; FROM=$LOWER ;;
        *)       FROM=$UPPER; TO=$LOWER ;;
esac

eval $finder | sed -n '

# remove all trailing slashes
s/\/*$//

# add ./ if there is no path, only a filename
/\//! s/^/.\//

# save path+filename
h

# remove path
s/.*\///

# do conversion only on filename
y/'$FROM'/'$TO'/

# now line contains original path+file, while
# hold space contains the new filename
x

# add converted file name to line, which now contains
# path/file-name\nconverted-file-name
G

# check if converted file name is equal to original file name,
# if it is, do not print anything
/^.*\/\(.*\)\n\1/b

# now, transform path/fromfile\n, into
# mv path/fromfile path/tofile and print it
s/^\(.*\/\)\(.*\)\n\(.*\)$/mv "\1\2" "\1\3"/p

' | $apply_cmd
@end example
@c end---------------------------------------------

@node Print bash environment
@section Print @command{bash} Environment

This script strips the definition of the shell functions
from the output of the @command{set} Bourne-shell command.

@c start-------------------------------------------
@example
#!/bin/sh

set | sed -n '
:x

@ifinfo
# if no occurrence of "=()" print and load next line
@end ifinfo
@ifnotinfo
# if no occurrence of @samp{=()} print and load next line
@end ifnotinfo
/=()/! @{ p; b; @}
/ () $/! @{ p; b; @}

# possible start of functions section
# save the line in case this is a var like FOO="() "
h

# if the next line has a brace, we quit because
# nothing comes after functions
n
/^@{/ q

# print the old line
x; p

# work on the new line now
x; bx
'
@end example
@c end---------------------------------------------

@node Reverse chars of lines
@section Reverse Characters of Lines

This script can be used to reverse the position of characters
in lines.  The technique moves two characters at a time, hence
it is faster than more intuitive implementations.

Note the @code{tx} command before the definition of the label.
This is often needed to reset the flag that is tested by
the @code{t} command.

Imaginative readers will find uses for this script.  An example
is reversing the output of @command{banner}.@footnote{This requires
another script to pad the output of banner; for example

@example
#! /bin/sh

banner -w $1 $2 $3 $4 |
  sed -e :a -e '/^.\@{0,'$1'\@}$/ @{ s/$/ /; ba; @}' |
  ~/sedscripts/reverseline.sed
@end example
}

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

/../! b

# Reverse a line.  Begin embedding the line between two newlines
s/^.*$/\
&\
/

# Move first character at the end.
# The regexp matches until
# there are zero or one characters between the markers
tx
:x
s/\(\n.\)\(.*\)\(.\n\)/\3\2\1/
tx

# Remove the newline markers
s/\n//g
@end example
@c end---------------------------------------------

@node tac
@section Reverse Lines of Files

This one begins a series of totally useless (yet interesting)
scripts emulating various Unix commands.  This, in particular,
is a @command{tac} workalike.

Note that on implementations other than @acronym{GNU} @command{sed}
@ifset PERL
and @value{SSED}
@end ifset
this script might easily overflow internal buffers.

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

# reverse all lines of input, i.e. the first line becomes the last, ...

# from the second line, the buffer (which contains all previous lines)
# is *appended* to current line, so, the order will be reversed
1! G

# on the last line we're done -- print everything
$ p

# store everything on the buffer again
h
@end example
@c end---------------------------------------------

@node cat -n
@section Numbering Lines

This script replaces @samp{cat -n}; in fact it formats its output
exactly like @acronym{GNU} @command{cat} does.

Of course this is completely useless, for two reasons: first,
because somebody else did it in C; second, because the following
Bourne-shell script could be used for the same purpose and would
be much faster:

@c start-------------------------------------------
@example
#! /bin/sh
sed -e "=" $@@ | sed -e '
  s/^/      /
  N
  s/^ *\(......\)\n/\1  /
'
@end example
@c end---------------------------------------------

It uses @command{sed} to print the line number, then groups lines two
by two using @code{N}.  Of course, this script does not teach as much as
the one presented below.

The algorithm used for incrementing uses both buffers, so the line
is printed as soon as possible and then discarded.  The number
is split so that changing digits go in a buffer and unchanged ones go
in the other; the changed digits are modified in a single step
(using a @code{y} command).  The line number for the next line
is then composed and stored in the hold space, to be used in the
next iteration.

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

# Prime the pump on the first line
x
/^$/ s/^.*$/1/

# Add the correct line number before the pattern
G
h

# Format it and print it
s/^/      /
s/^ *\(......\)\n/\1  /p

# Get the line number from hold space; add a zero
# if we're going to add a digit on the next line
g
s/\n.*$//
/^9*$/ s/^/0/

# separate changing/unchanged digits with an x
s/.9*$/x&/

# keep changing digits in hold space
h
s/^.*x//
y/0123456789/1234567890/
x

# keep unchanged digits in pattern space
s/x.*$//

# compose the new number, remove the newline implicitly added by G
G
s/\n//
h
@end example
@c end---------------------------------------------

@node cat -b
@section Numbering Non-blank Lines

Emulating @samp{cat -b} is almost the same as @samp{cat -n}---we only
have to select which lines are to be numbered and which are not.
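
The selection step can be looked at in isolation.  In this sketch
(an illustration, not part of the @samp{cat -b} emulation itself) the
hypothetical helper @code{mark_nonblank} lets blank lines through
untouched and applies a stand-in edit, a @samp{> } prefix, to all
other lines; a real emulation performs the numbering at that point.

```shell
# mark_nonblank: pass blank lines through, prefix the others with "> ".
# The prefix is only a placeholder for the numbering logic.
mark_nonblank() (
  # 'b' without a label jumps to the end of the script, so an empty
  # line is printed unchanged by the automatic print ending the cycle
  sed -e '/^$/b' -e 's/^/> /'
)

printf 'one\n\ntwo\n' | mark_nonblank
```

Only the first and third lines should come out with the prefix; the
empty line in between is copied as it is.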

The part that is common to this script and the previous one is
not commented to show how important it is to comment @command{sed}
scripts properly...

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

/^$/ @{
  p
  b
@}

# Same as cat -n from now
x
/^$/ s/^.*$/1/
G
h
s/^/      /
s/^ *\(......\)\n/\1  /p
x
s/\n.*$//
/^9*$/ s/^/0/
s/.9*$/x&/
h
s/^.*x//
y/0123456789/1234567890/
x
s/x.*$//
G
s/\n//
h
@end example
@c end---------------------------------------------

@node wc -c
@section Counting Characters

This script shows another way to do arithmetic with @command{sed}.
In this case we have to add possibly large numbers, so implementing
this by successive increments would not be feasible (and possibly
even more complicated to contrive than this script).

The approach is to map numbers to letters, kind of an abacus
implemented with @command{sed}.  @samp{a}s are units, @samp{b}s are
tens and so on: we simply add the number of characters
on the current line as units, and then propagate the carry
to tens, hundreds, and so on.

As usual, running totals are kept in hold space.

On the last line, we convert the abacus form back to decimal.
For the sake of variety, this is done with a loop rather than
with some 80 @code{s} commands@footnote{Some implementations
have a limit of 199 commands per script}: first we
convert units, removing @samp{a}s from the number; then we
rotate letters so that tens become @samp{a}s, and so on
until no more letters remain.
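
The carry step is easier to follow on a tiny input.  The sketch below
is an illustration only, with a hypothetical helper name
(@code{abacus_carry}) and just two carry levels: it takes a run of
@samp{a}s (units) and folds every ten of them into a @samp{b} (a ten),
then every ten @samp{b}s into a @samp{c} (a hundred).

```shell
# abacus_carry: propagate carries from units (a) to tens (b)
# and from tens to hundreds (c).  Illustration only.
abacus_carry() (
  sed -e 's/aaaaaaaaaa/b/g' -e 's/bbbbbbbbbb/c/g'
)

# 23 units: ten a's, ten a's, three a's
printf '%s%s%s\n' aaaaaaaaaa aaaaaaaaaa aaa | abacus_carry
```

For 23 units this should print @samp{bbaaa}, to be read as two tens
and three units.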

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

# Add n+1 a's to hold space (+1 is for the newline)
s/./a/g
H
x
s/\n/a/

# Do the carry.  The t's and b's are not necessary,
# but they do speed up the thing
t a
: a;  s/aaaaaaaaaa/b/g; t b; b done
: b;  s/bbbbbbbbbb/c/g; t c; b done
: c;  s/cccccccccc/d/g; t d; b done
: d;  s/dddddddddd/e/g; t e; b done
: e;  s/eeeeeeeeee/f/g; t f; b done
: f;  s/ffffffffff/g/g; t g; b done
: g;  s/gggggggggg/h/g; t h; b done
: h;  s/hhhhhhhhhh//g

: done
$! @{
  h
  b
@}

# On the last line, convert back to decimal

: loop
/a/! s/[b-h]*/&0/
s/aaaaaaaaa/9/
s/aaaaaaaa/8/
s/aaaaaaa/7/
s/aaaaaa/6/
s/aaaaa/5/
s/aaaa/4/
s/aaa/3/
s/aa/2/
s/a/1/

: next
y/bcdefgh/abcdefg/
/[a-h]/ b loop
p
@end example
@c end---------------------------------------------

@node wc -w
@section Counting Words

This script is almost the same as the previous one, once each
of the words on the line is converted to a single @samp{a}
(in the previous script each letter was changed to an @samp{a}).

It is interesting that real @command{wc} programs have optimized
loops for @samp{wc -c}, so they are much slower at counting
words than at counting characters.  This script's bottleneck,
instead, is arithmetic, and hence the word-counting one
is faster (it has to manage smaller numbers).

Again, the common parts are not commented to show the importance
of commenting @command{sed} scripts.
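
The conversion of words into @samp{a}s can also be tried separately.
This sketch uses a hypothetical helper, @code{words_to_units}, and
simplifies the script that follows by treating only spaces (not tabs)
as separators.

```shell
# words_to_units: replace each run of non-space characters with one 'a',
# then drop the spaces, leaving one 'a' per word.  Simplified sketch.
words_to_units() (
  sed -e 's/[^ ][^ ]*/a/g' -e 's/ //g'
)

printf 'counting  three words\n' | words_to_units
```

The three words should collapse to @samp{aaa}, which is the form the
carry logic of the previous section expects.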

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

# Convert words to a's
s/[ @kbd{tab}][ @kbd{tab}]*/ /g
s/^/ /
s/ [^ ][^ ]*/a /g
s/ //g

# Append them to hold space
H
x
s/\n//

# From here on it is the same as in wc -c.
/aaaaaaaaaa/! bx;   s/aaaaaaaaaa/b/g
/bbbbbbbbbb/! bx;   s/bbbbbbbbbb/c/g
/cccccccccc/! bx;   s/cccccccccc/d/g
/dddddddddd/! bx;   s/dddddddddd/e/g
/eeeeeeeeee/! bx;   s/eeeeeeeeee/f/g
/ffffffffff/! bx;   s/ffffffffff/g/g
/gggggggggg/! bx;   s/gggggggggg/h/g
s/hhhhhhhhhh//g
:x
$! @{ h; b; @}
:y
/a/! s/[b-h]*/&0/
s/aaaaaaaaa/9/
s/aaaaaaaa/8/
s/aaaaaaa/7/
s/aaaaaa/6/
s/aaaaa/5/
s/aaaa/4/
s/aaa/3/
s/aa/2/
s/a/1/
y/bcdefgh/abcdefg/
/[a-h]/ by
p
@end example
@c end---------------------------------------------

@node wc -l
@section Counting Lines

No strange things are done now, because @command{sed} gives us
@samp{wc -l} functionality for free!  Look:

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf
$=
@end example
@c end---------------------------------------------

@node head
@section Printing the First Lines

This script is probably the simplest useful @command{sed} script.
It displays the first 10 lines of input; the number of displayed
lines is right before the @code{q} command.

@c start-------------------------------------------
@example
#!/usr/bin/sed -f
10q
@end example
@c end---------------------------------------------

@node tail
@section Printing the Last Lines

Printing the last @var{n} lines rather than the first is more complex
but indeed possible.  @var{n} is encoded in the second line, before
the bang character.

This script is similar to the @command{tac} script in that it keeps the
final output in the hold space and prints it at the end:

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

1! @{; H; g; @}
1,10 !s/[^\n]*\n//
$p
h
@end example
@c end---------------------------------------------

Mainly, the script keeps a window of 10 lines and slides it
by adding a line and deleting the oldest (the substitution command
on the second line works like a @code{D} command but does not
restart the loop).

The ``sliding window'' technique is a very powerful way to write
efficient and complex @command{sed} scripts, because commands like
@code{P} would require a lot of work if implemented manually.

To introduce the technique, which is fully demonstrated in the
rest of this chapter and is based on the @code{N}, @code{P}
and @code{D} commands, here is an implementation of @command{tail}
using a simple ``sliding window.''

This looks complicated but in fact it works the same way as
the last script: after we have kicked in the appropriate number
of lines, however, we stop using the hold space to keep inter-line
state, and instead use @code{N} and @code{D} to slide pattern
space by one line:

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

1h
2,10 @{; H; g; @}
$q
1,9d
N
D
@end example
@c end---------------------------------------------

Note how the first, second and fourth lines are inactive after
the first ten lines of input.  After that, all the script does
is: exiting on the last line of input, appending the next input
line to pattern space, and removing the first line.
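
Since @var{n} appears in more than one place in the sliding-window
script, it can be convenient to generate that script from the shell.
The wrapper below is a sketch rather than part of the original example:
@code{last_lines} is a hypothetical name, and it assumes that @var{n}
is at least 2, because the address ranges degenerate for smaller
values.

```shell
# last_lines N: print the last N lines of standard input (N >= 2).
# Hypothetical wrapper around the sliding-window script.
last_lines() (
  n=$1
  # Lines 1..N are collected through the hold space; from then on
  # N and D slide pattern space forward by one line per cycle.
  sed -e '1h' \
      -e "2,$n H" -e "2,$n g" \
      -e '$q' \
      -e "1,$((n - 1)) d" \
      -e 'N' -e 'D'
)

printf '%s\n' 1 2 3 4 5 | last_lines 3
```

For the five input lines shown, this should print lines 3, 4 and 5.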

@node uniq
@section Make Duplicate Lines Unique

This is an example of the art of using the @code{N}, @code{P}
and @code{D} commands, probably the most difficult to master.

@c start-------------------------------------------
@example
#!/usr/bin/sed -f
h

:b
# On the last line, print and exit
$b
N
/^\(.*\)\n\1$/ @{
    # The two lines are identical.  Undo the effect of
    # the @code{N} command.
    g
    bb
@}

# If the @code{N} command had added the last line, print and exit
$b

# The lines are different; print the first and go
# back working on the second.
P
D
@end example
@c end---------------------------------------------

As you can see, we maintain a 2-line window using @code{P} and @code{D}.
This technique is often used in advanced @command{sed} scripts.

@node uniq -d
@section Print Duplicated Lines of Input

This script prints only duplicated lines, like @samp{uniq -d}.

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

$b
N
/^\(.*\)\n\1$/ @{
    # Print the first of the duplicated lines
    s/.*\n//
    p

    # Loop until we get a different line
    :b
    $b
    N
    /^\(.*\)\n\1$/ @{
        s/.*\n//
        bb
    @}
@}

# The last line cannot be followed by duplicates
$b

# Found a different one.  Leave it alone in the pattern space
# and go back to the top, hunting its duplicates
D
@end example
@c end---------------------------------------------

@node uniq -u
@section Remove All Duplicated Lines

This script prints only unique lines, like @samp{uniq -u}.

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

# Search for a duplicate line --- until that, print what you find.
$b
N
/^\(.*\)\n\1$/ ! @{
    P
    D
@}

:c
# Got two equal lines in pattern space.  At the
# end of the file we simply exit
$d

# Else, we keep reading lines with @code{N} until we
# find a different one
s/.*\n//
N
/^\(.*\)\n\1$/ @{
    bc
@}

# Remove the last instance of the duplicate line
# and go back to the top
D
@end example
@c end---------------------------------------------

@node cat -s
@section Squeezing Blank Lines

As a final example, here are three scripts, of increasing complexity
and speed, that implement the same function as @samp{cat -s}, that is
squeezing blank lines.

The first leaves a blank line at the beginning and end if there are
some already.

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

# on empty lines, join with next
# Note there is a star in the regexp
:x
/^\n*$/ @{
N
bx
@}

# now, squeeze all '\n', this can be also done by:
# s/^\(\n\)*/\1/
s/\n*/\
/
@end example
@c end---------------------------------------------

This one is a bit more complex and removes all empty lines
at the beginning.  It does leave a single blank line at end
if one was there.

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

# delete all leading empty lines
1,/^./@{
/./!d
@}

# on an empty line we remove it and all the following
# empty lines, but one
:x
/./!@{
N
s/^\n$//
tx
@}
@end example
@c end---------------------------------------------

This removes leading and trailing blank lines.  It is also the
fastest.
Note that loops are completely done with @code{n} and
@code{b}, without relying on @command{sed} to restart
the script automatically at the end of a line.

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

# delete all (leading) blanks
/./!d

# get here: so there is a non empty
:x
# print it
p
# get next
n
# got chars? print it again, etc...
/./bx

# no, don't have chars: got an empty line
:z
# get next, if last line we finish here so no trailing
# empty lines are written
n
# also empty? then ignore it, and get next... this will
# remove ALL empty lines
/./!bz

# all empty lines were deleted/ignored, but we have a non empty.  As
# what we want to do is to squeeze, insert a blank line artificially
i\

bx
@end example
@c end---------------------------------------------

@node Limitations
@chapter @value{SSED}'s Limitations and Non-limitations

@cindex @acronym{GNU} extensions, unlimited line length
@cindex Portability, line length limitations
For those who want to write portable @command{sed} scripts,
be aware that some implementations have been known to
limit line lengths (for the pattern and hold spaces)
to be no more than 4000 bytes.
The @sc{posix} standard specifies that conforming @command{sed}
implementations shall support at least 8192 byte line lengths.
@value{SSED} has no built-in limit on line length;
as long as it can @code{malloc()} more (virtual) memory,
you can feed or construct lines as long as you like.

However, recursion is used to handle subpatterns and indefinite
repetition.  This means that the available stack space may limit
the size of the buffer that can be processed by certain patterns.
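
Whether a particular @command{sed} copes with long lines is easy to
probe.  The following sketch assumes a system that provides
@file{/dev/zero} and a @command{head} that accepts @option{-c}; the
helper name @code{long_line_bytes} is made up for this example.  It
builds one line of the requested length, lets @command{sed} append a
character to it, and reports how many bytes come out.

```shell
# long_line_bytes N: feed sed a single line of N 'x' characters and
# print the size of the output in bytes.  Hypothetical helper.
long_line_bytes() (
  ( head -c "$1" /dev/zero | tr '\0' 'x'; echo ) |
    sed 's/$/!/' |
    wc -c |
    tr -d ' '
)

# a line well above the 8192 bytes that POSIX requires
long_line_bytes 100000
```

An implementation without a line-length limit should report the
requested length plus two bytes: the appended @samp{!} and the
newline.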

@ifset PERL
There are some size limitations in the regular expression
matcher but it is hoped that they will never in practice
be relevant.  The maximum length of a compiled pattern
is 65539 (sic) bytes.  All values in repeating quantifiers
must be less than 65536.  The maximum nesting depth of
all parenthesized subpatterns, including capturing and
non-capturing subpatterns@footnote{The
distinction is meaningful when referring to Perl-style
regular expressions.}, assertions, and other types of
subpattern, is 200.

Also, @value{SSED} recognizes the @sc{posix} syntax
@code{[.@var{ch}.]} and @code{[=@var{ch}=]}
where @var{ch} is a ``collating element'', but these
are not supported, and an error is given if they are
encountered.

Here are a few distinctions between the real Perl-style
regular expressions and those that @option{-R} recognizes.

@enumerate
@item
Lookahead assertions do not allow repeat quantifiers after them.
Perl permits them, but they do not mean what you
might think.  For example, @samp{(?!a)@{3@}} does not assert that the
next three characters are not @samp{a}.  It just asserts three times that the
next character is not @samp{a} --- a waste of time and nothing else.

@item
Capturing subpatterns that occur inside negative lookahead
assertions are counted, but their entries are counted
as empty in the second half of an @code{s} command.
Perl sets its numerical variables from any such patterns
that are matched before the assertion fails to match
something (thereby succeeding), but only if the negative
lookahead assertion contains just one branch.

@item
The following Perl escape sequences are not supported:
@samp{\l}, @samp{\u}, @samp{\L}, @samp{\U}, @samp{\E},
@samp{\Q}.
In fact these are implemented by Perl's general
string-handling and are not part of its pattern matching engine.

@item
The Perl @samp{\G} assertion is not supported as it is not
relevant to single pattern matches.

@item
Fairly obviously, @value{SSED} does not support the @samp{(?@{code@})}
and @samp{(?p@{code@})} constructions.  However, there is some experimental
support for recursive patterns using the non-Perl item @samp{(?R)}.

@item
There are at the time of writing some oddities in Perl
5.005_02 concerned with the settings of captured strings
when part of a pattern is repeated.  For example, matching
@samp{aba} against the pattern @samp{/^(a(b)?)+$/} sets
@samp{$2}@footnote{@samp{$2} would be @samp{\2} in @value{SSED}.}
to the value @samp{b}, but matching @samp{aabbaa}
against @samp{/^(aa(bb)?)+$/} leaves @samp{$2}
unset.  However, if the pattern is changed to
@samp{/^(aa(b(b))?)+$/} then @samp{$2} (and @samp{$3}) are set.
In Perl 5.004 @samp{$2} is set in both cases, and that is also
true of @value{SSED}.

@item
Another as yet unresolved discrepancy is that in Perl
5.005_02 the pattern @samp{/^(a)?(?(1)a|b)+$/} matches
the string @samp{a}, whereas in @value{SSED} it does not.
However, in both Perl and @value{SSED} @samp{/^(a)?a/} matched
against @samp{a} leaves @samp{$1} unset.
@end enumerate
@end ifset

@node Other Resources
@chapter Other Resources for Learning About @command{sed}

@cindex Additional reading about @command{sed}
In addition to several books that have been written about @command{sed}
(either specifically or as chapters in books which discuss
shell programming), one can find out more about @command{sed}
(including suggestions of a few books) from the FAQ
for the @code{sed-users} mailing list, available from:
@display
@uref{http://sed.sourceforge.net/sedfaq.html}
@end display

Also of interest are
@uref{http://www.student.northpark.edu/pemente/sed/index.htm}
and @uref{http://sed.sf.net/grabbag},
which include @command{sed} tutorials and other @command{sed}-related goodies.

The @code{sed-users} mailing list itself is maintained by Sven Guckes.
To subscribe, visit @uref{http://groups.yahoo.com} and search
for the @code{sed-users} mailing list.

@node Reporting Bugs
@chapter Reporting Bugs

@cindex Bugs, reporting
Email bug reports to @email{bonzini@@gnu.org}.
Be sure to include the word ``sed'' somewhere in the @code{Subject:} field.
Also, please include the output of @samp{sed --version} in the body
of your report if at all possible.

Please do not send a bug report like this:

@example
@i{@i{@r{while building frobme-1.3.4}}}
$ configure
@error{} sed: file sedscr line 1: Unknown option to 's'
@end example

If @value{SSED} doesn't configure your favorite package, take a
few extra minutes to identify the specific problem and make a stand-alone
test case.  Unlike other programs such as C compilers, making such test
cases for @command{sed} is quite simple.

A stand-alone test case includes all the data necessary to perform the
test, and the specific invocation of @command{sed} that causes the problem.
The smaller a stand-alone test case is, the better.  A test case should
not involve something as far removed from @command{sed} as ``try to configure
frobme-1.3.4''.  Yes, that is in principle enough information to look
for the bug, but that is not a very practical prospect.

Here are a few commonly reported bugs that are not bugs.

@table @asis
@item @code{N} command on the last line
@cindex Portability, @code{N} command on the last line
@cindex Non-bugs, @code{N} command on the last line

Most versions of @command{sed} exit without printing anything when
the @code{N} command is issued on the last line of a file.
@value{SSED} prints pattern space before exiting unless of course
the @option{-n} command-line switch has been specified.  This choice is
by design.

For example, the behavior of
@example
sed N foo bar
@end example
@noindent
would depend on whether @file{foo} has an even or an odd number of
lines@footnote{which is the actual ``bug'' that prompted the
change in behavior}.  Or, when writing a script to read the
next few lines following a pattern match, traditional
implementations of @command{sed} would force you to write
something like
@example
/foo/@{ $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N @}
@end example
@noindent
instead of just
@example
/foo/@{ N;N;N;N;N;N;N;N;N; @}
@end example

@cindex @code{POSIXLY_CORRECT} behavior, @code{N} command
In any case, the simplest workaround is to use @code{$d;N} in
scripts that rely on the traditional behavior, or to set
the @code{POSIXLY_CORRECT} variable to a non-empty value.

@item Regex syntax clashes (problems with backslashes)
@cindex @acronym{GNU} extensions, to basic regular expressions
@cindex Non-bugs, regex syntax clashes
@command{sed} uses the @sc{posix} basic regular expression syntax.
According to
the standard, the meaning of some escape sequences is undefined in
this syntax; notable in the case of @command{sed} are @code{\|},
@code{\+}, @code{\?}, @code{\`}, @code{\'}, @code{\<},
@code{\>}, @code{\b}, @code{\B}, @code{\w}, and @code{\W}.

As in all @acronym{GNU} programs that use @sc{posix} basic regular
expressions, @command{sed} interprets these escape sequences as special
characters.  So, @code{x\+} matches one or more occurrences of @samp{x}.
@code{abc\|def} matches either @samp{abc} or @samp{def}.

This syntax may cause problems when running scripts written for other
@command{sed}s.  Some @command{sed} programs have been written with the
assumption that @code{\|} and @code{\+} match the literal characters
@samp{|} and @samp{+}.  Such scripts must be modified by removing the
spurious backslashes if they are to be used with modern implementations
of @command{sed}, like
@ifset PERL
@value{SSED} or
@end ifset
@acronym{GNU} @command{sed}.

On the other hand, some scripts use @code{s|abc\|def||g} to remove occurrences
of @emph{either} @samp{abc} or @samp{def}.  While this worked until
@command{sed} 4.0.x, newer versions interpret this as removing the
string @samp{abc|def}.  This is again undefined behavior according to
@sc{posix}, and this interpretation is arguably more robust: older
@command{sed}s, for example, required that the regex matcher parsed
@code{\/} as @code{/} in the common case of escaping a slash, which is
again undefined behavior; the new behavior avoids this, and this is good
because the regex matcher is only partially under our control.
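
The following shell sketch illustrates the interpretations described
above.  It assumes that the @command{sed} found on the @env{PATH} is
@acronym{GNU} @command{sed}; other implementations may give different
results, and the variable names are purely illustrative.

@example
# Assumes GNU sed; the variable names are illustrative only.
# \+ is a repetition operator, so the whole run of x is replaced.
plus=$(printf 'axxxb\n' | sed 's/x\+/-/')
# \| separates alternatives when | is not the delimiter of s.
alt=$(printf 'abc def\n' | sed 's/abc\|def/X/g')
# With | as the delimiter, \| is a literal |: only abc|def is removed.
lit=$(printf 'abc|def,abc\n' | sed 's|abc\|def||g')
printf '%s\n' "$plus" "$alt" "$lit"
@end example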

@cindex @acronym{GNU} extensions, special escapes
In addition, this version of @command{sed} supports several escape characters
(some of which are multi-character) to insert non-printable characters
in scripts (@code{\a}, @code{\c}, @code{\d}, @code{\o}, @code{\r},
@code{\t}, @code{\v}, @code{\x}).  These can cause similar problems
with scripts written for other @command{sed}s.

@item @option{-i} clobbers read-only files
@cindex In-place editing
@cindex @value{SSEDEXT}, in-place editing
@cindex Non-bugs, in-place editing

In short, @samp{sed -i} will let you delete the contents of
a read-only file, and in general the @option{-i} option
(@pxref{Invoking sed, , Invocation}) lets you clobber
protected files.  This is not a bug, but rather a consequence
of how the Unix filesystem works.

The permissions on a file say what can happen to the data
in that file, while the permissions on a directory say what can
happen to the list of files in that directory.  @samp{sed -i}
will not ever open for writing a file that is already on disk.
Rather, it will work on a temporary file that is finally renamed
to the original name: if you rename or delete files, you're actually
modifying the contents of the directory, so the operation depends on
the permissions of the directory, not of the file.  For this same
reason, @command{sed} does not let you use @option{-i} on a writeable file
in a read-only directory, and will break hard or symbolic links when
@option{-i} is used on such a file.

@item @code{0a} does not work (gives an error)
@cindex @code{0} address
@cindex @acronym{GNU} extensions, @code{0} address
@cindex Non-bugs, @code{0} address

There is no line 0.
0 is a special address that is only used to treat
addresses like @code{0,/@var{RE}/} as active when the script starts: if
you write @code{1,/abc/d} and the first line includes the word @samp{abc},
then that match would be ignored because address ranges must span at least
two lines (barring the end of the file); but what you probably wanted is
to delete every line up to the first one including @samp{abc}, and this
is obtained with @code{0,/abc/d}.

@ifclear PERL
@item @code{[a-z]} is case insensitive
@cindex Non-bugs, localization-related

You are encountering problems with locales.  @sc{posix} mandates that @code{[a-z]}
uses the current locale's collation order -- in C parlance, that means using
@code{strcoll(3)} instead of @code{strcmp(3)}.  Some locales have a
case-insensitive collation order, others don't.

Another problem is that @code{[a-z]} tries to use collation symbols.
This only happens if you are on the @acronym{GNU} system, using
@acronym{GNU} libc's regular expression matcher instead of compiling the
one supplied with @acronym{GNU} @command{sed}.  In a Danish locale, for example,
the regular expression @code{^[a-z]$} matches the string @samp{aa},
because this is a single collating symbol that comes after @samp{a}
and before @samp{b}; @samp{ll} behaves similarly in Spanish
locales, or @samp{ij} in Dutch locales.

To work around these problems, which may cause bugs in shell scripts, set
the @env{LC_COLLATE} and @env{LC_CTYPE} environment variables to @samp{C}.

@item @code{s/.*//} does not clear pattern space
@cindex Non-bugs, localization-related
@cindex @value{SSEDEXT}, emptying pattern space
@cindex Emptying pattern space

This happens if your input stream includes invalid multibyte
sequences.
@sc{posix} mandates that such sequences
are @emph{not} matched by @samp{.}, so that @samp{s/.*//} will not clear
pattern space as you would expect.  In fact, there is no way to clear
@command{sed}'s buffers in the middle of the script in most multibyte locales
(including UTF-8 locales).  For this reason, @value{SSED} provides a @code{z}
command (for ``zap'') as an extension.

To work around these problems, which may cause bugs in shell scripts, set
the @env{LC_COLLATE} and @env{LC_CTYPE} environment variables to @samp{C}.
@end ifclear
@end table


@node Extended regexps
@appendix Extended regular expressions
@cindex Extended regular expressions, syntax

The only difference between basic and extended regular expressions is in
the behavior of a few characters: @samp{?}, @samp{+}, parentheses,
and braces (@samp{@{@}}).  While basic regular expressions require
these to be escaped if you want them to behave as special characters,
when using extended regular expressions you must escape them if
you want them @emph{to match a literal character}.

@noindent
Examples:
@table @code
@item abc?
becomes @samp{abc\?} when using extended regular expressions.  It matches
the literal string @samp{abc?}.

@item c\+
becomes @samp{c+} when using extended regular expressions.  It matches
one or more @samp{c}s.

@item a\@{3,\@}
becomes @samp{a@{3,@}} when using extended regular expressions.  It matches
three or more @samp{a}s.

@item \(abc\)\@{2,3\@}
becomes @samp{(abc)@{2,3@}} when using extended regular expressions.  It
matches either @samp{abcabc} or @samp{abcabcabc}.

@item \(abc*\)\1
becomes @samp{(abc*)\1} when using extended regular expressions.
Backreferences must still be escaped when using extended regular
expressions.
@end table

@ifset PERL
@node Perl regexps
@appendix Perl-style regular expressions
@cindex Perl-style regular expressions, syntax

@emph{This part is taken from the @file{pcre.txt} file distributed together
with the free @sc{pcre} regular expression matcher; it was written by Philip Hazel.}

Perl introduced several extensions to regular expressions, some
of them incompatible with the syntax of regular expressions
accepted by Emacs and other @acronym{GNU} tools (whose matcher was
based on the Emacs matcher).  @value{SSED} implements
both kinds of extensions.

@iftex
Summarizing, we have:

@itemize @bullet
@item
A backslash can introduce several special sequences

@item
The circumflex, dollar sign, and period characters behave specially
with regard to newlines

@item
Strange uses of square brackets are parsed differently

@item
You can toggle modifiers in the middle of a regular expression

@item
You can specify that a subpattern does not count when numbering backreferences

@item
@cindex Greedy regular expression matching
You can specify greedy or non-greedy matching

@item
You can have more than ten back references

@item
You can do complex lookaheads and lookbehinds (in the spirit of
@code{\b}, but with subpatterns).

@item
You can often improve performance by preventing @command{sed} from
wasting time on backtracking

@item
You can have if/then/else branches

@item
You can do recursive matches, for example to look for unbalanced parentheses

@item
You can have comments and non-significant whitespace, because things can
get complex...
@end itemize

Most of these extensions are introduced by the special @code{(?}
sequence, which gives special meanings to parenthesized groups.
@end iftex
@menu
Other extensions can be roughly subdivided in two categories.
On one hand Perl introduces several more escaped sequences
(that is, sequences introduced by a backslash).  On the other
hand, it specifies that if a question mark follows an open
parenthesis it should give a special meaning to the parenthesized
group.

* Backslash::                         Introduces special sequences
* Circumflex/dollar sign/period::     Behave specially with regard to new lines
* Square brackets::                   Are a bit different in strange cases
* Options setting::                   Toggle modifiers in the middle of a regexp
* Non-capturing subpatterns::         Are not counted when backreferencing
* Repetition::                        Allows for non-greedy matching
* Backreferences::                    Allows for more than 10 back references
* Assertions::                        Allows for complex look ahead matches
* Non-backtracking subpatterns::      Often gives more performance
* Conditional subpatterns::           Allows if/then/else branches
* Recursive patterns::                For example to match parentheses
* Comments::                          Because things can get complex...
@end menu

@node Backslash
@appendixsec Backslash
@cindex Perl-style regular expressions, escaped sequences

There are a few differences in the handling of backslashed
sequences in Perl mode.

First of all, there are no @code{\o} and @code{\d} sequences.
@sc{ascii} values for characters can be specified in octal
with a @code{\@var{xxx}} sequence, where @var{xxx} is a
sequence of up to three octal digits.  If the first digit
is a zero, the treatment of the sequence is straightforward;
just note that if the character that follows the escaped digit
is itself an octal digit, you have to supply three octal digits
for @var{xxx}.
For example, @code{\07} is a @sc{bel} character
rather than a @sc{nul} and a literal @code{7} (this sequence is
instead represented by @code{\0007}).

@cindex Perl-style regular expressions, backreferences
The handling of a backslash followed by a digit other than 0
is complicated.  Outside a character class, @command{sed} reads it
and any following digits as a decimal number.  If the number
is less than 10, or if there have been at least that many
previous capturing left parentheses in the expression, the
entire sequence is taken as a back reference.  A description
of how this works is given later, following the discussion
of parenthesized subpatterns.

Inside a character class, or if the decimal number is
greater than 9 and there have not been that many capturing
subpatterns, @command{sed} re-reads up to three octal digits following
the backslash, and generates a single byte from the
least significant 8 bits of the value.  Any subsequent digits
stand for themselves.  For example:

@example
\040   @i{@r{is another way of writing a space}}
\40    @i{@r{is the same, provided there are fewer than 40}}
       @i{@r{previous capturing subpatterns}}
\7     @i{@r{is always a back reference}}
\011   @i{@r{is always a tab}}
\11    @i{@r{might be a back reference, or another way of writing a tab}}
\0113  @i{@r{is a tab followed by the character @samp{3}}}
\113   @i{@r{is the character with octal code 113 (since there}}
       @i{@r{can be no more than 99 back references)}}
\377   @i{@r{is a byte consisting entirely of 1 bits (@sc{ascii} 255)}}
\81    @i{@r{is either a back reference, or a binary zero}}
       @i{@r{followed by the two characters @samp{81}}}
@end example

Note that octal values of 100 or greater must not be introduced
by a leading zero, because no more than three octal
digits are ever read.
Note that this applies only to the LHS
pattern; it is not possible yet to specify more than 9 backreferences
on the RHS of the @code{s} command.

All the sequences that define a single byte value can be
used both inside and outside character classes.  In addition,
inside a character class, the sequence @code{\b} is interpreted
as the backspace character (hex 08).  Outside a character
class it has a different meaning (see below).

There are also four escapes specifying
generic character classes (like @code{\w} and @code{\W} do):

@cindex Perl-style regular expressions, character classes
@table @samp
@item \d
Matches any decimal digit

@item \D
Matches any character that is not a decimal digit

@item \s
Matches any whitespace character

@item \S
Matches any character that is not a whitespace character
@end table

In Perl mode, these character type sequences can appear both inside and
outside character classes.  Instead, in @sc{posix} mode these sequences
(as well as @code{\w} and @code{\W}) are treated as two literal characters
(a backslash and a letter) inside square brackets.

Escaped sequences specifying assertions are also different in
Perl mode.  An assertion specifies a condition that has to be met
at a particular point in a match, without consuming any
characters from the subject string.  The use of subpatterns
for more complicated assertions is described below.  The
backslashed assertions are

@cindex Perl-style regular expressions, assertions
@table @samp
@item \b
Asserts that the point is at a word boundary.
A word boundary is a position in the subject string where
the current character and the previous character do not both
match @code{\w} or @code{\W} (i.e. one matches @code{\w} and
the other matches @code{\W}), or the start or end of the string
if the first or last character matches @code{\w}, respectively.

@item \B
Asserts that the point is not at a word boundary.

@item \A
Asserts the matcher is at the start of pattern space (independent
of multiline mode).

@item \Z
Asserts the matcher is at the end of pattern space,
or at a newline before the end of pattern space (independent of
multiline mode).

@item \z
Asserts the matcher is at the end of pattern space (independent
of multiline mode).
@end table

These assertions may not appear in character classes (but
note that @code{\b} has a different meaning, namely the
backspace character, inside a character class).
Note that Perl mode does not directly support assertions
for the beginning and the end of a word; the @acronym{GNU} extensions
@code{\<} and @code{\>} achieve this purpose in @sc{posix} mode
instead.

The @code{\A}, @code{\Z}, and @code{\z} assertions differ
from the traditional circumflex and dollar sign (described below)
in that they only ever match at the very start and end of the
subject string, whatever options are set; in particular @code{\A}
and @code{\z} are the same as the @acronym{GNU} extensions
@code{\`} and @code{\'} that are active in @sc{posix} mode.

@node Circumflex/dollar sign/period
@appendixsec Circumflex, dollar sign, period
@cindex Perl-style regular expressions, newlines

Outside a character class, in the default matching mode, the
circumflex character is an assertion which is true only if
the current matching point is at the start of the subject
string.  Inside a character class, the circumflex has an entirely
different meaning (see below).

The circumflex need not be the first character of the pattern if
a number of alternatives are involved, but it should be the
first thing in each alternative in which it appears if the
pattern is ever to match that branch.
If all possible alternatives
start with a circumflex, that is, if the pattern is
constrained to match only at the start of the subject, it is
said to be an @dfn{anchored} pattern.  (There are also other
constructs that can cause a pattern to be anchored.)

A dollar sign is an assertion which is true only if the
current matching point is at the end of the subject string,
or immediately before a newline character that is the last
character in the string (by default).  A dollar sign need not be the
last character of the pattern if a number of alternatives
are involved, but it should be the last item in any branch
in which it appears.  A dollar sign has no special meaning in a
character class.

@cindex Perl-style regular expressions, multiline
The meanings of the circumflex and dollar sign characters are
changed if the @code{M} modifier option is used.  When this is
the case, they match immediately after and immediately
before an internal @code{\n} character, respectively, in addition
to matching at the start and end of the subject string.  For
example, the pattern @code{/^abc$/} matches the subject string
@samp{def\nabc} in multiline mode, but not otherwise.  Consequently,
patterns that are anchored in single line mode
because all branches start with @code{^} are not anchored in
multiline mode.

@cindex Perl-style regular expressions, multiline
Note that the sequences @code{\A}, @code{\Z}, and @code{\z}
can be used to match the start and end of the subject in both
modes, and if all branches of a pattern start with @code{\A}
it is always anchored, whether the @code{M} modifier is set or not.
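
The effect of the @code{M} modifier on the anchors can be tried out
with a short shell sketch.  It assumes @acronym{GNU} @command{sed} in
its default @sc{posix} mode rather than Perl mode, on the assumption
that @code{M} treats @code{^} and @code{$} alike in both modes;
@code{\`} stands in for @code{\A}, and the variable names are
purely illustrative.

@example
# Assumes GNU sed in its default (POSIX) mode; names are illustrative.
# N joins both input lines, so pattern space holds def<newline>abc.
single=$(printf 'def\nabc\n' | sed -n 'N;/^abc$/p')
# With M, ^ and $ also match next to the embedded newline.
multi=$(printf 'def\nabc\n' | sed -n 'N;/^abc$/Mp')
# \` matches only at the start of pattern space, even with M.
start=$(printf 'def\nabc\n' | sed -n 'N;/\`abc/Mp')
printf '[%s]\n' "$single" "$multi" "$start"
@end example

@noindent
Only the second command is expected to print pattern space; the
other two variables should stay empty.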

@cindex Perl-style regular expressions, single line
Outside a character class, a dot in the pattern matches any
one character in the subject, including a non-printing character,
but not (by default) newline.  If the @code{S} modifier is used,
dots match newlines as well.  Actually, the handling of
dot is entirely independent of the handling of circumflex
and dollar sign, the only relationship being that they both
involve newline characters.  Dot has no special meaning in a
character class.

@node Square brackets
@appendixsec Square brackets
@cindex Perl-style regular expressions, character classes

An opening square bracket introduces a character class, terminated
by a closing square bracket.  A closing square bracket on its own
is not special.  If a closing square bracket is required as a
member of the class, it should be the first data character in
the class (after an initial circumflex, if present) or escaped with a backslash.

A character class matches a single character in the subject;
the character must be in the set of characters defined by
the class, unless the first character in the class is a circumflex,
in which case the subject character must not be in
the set defined by the class.  If a circumflex is actually
required as a member of the class, ensure it is not the
first character, or escape it with a backslash.

For example, the character class @code{[aeiou]} matches any lower
case vowel, while @code{[^aeiou]} matches any character that is not
a lower case vowel.  Note that a circumflex is just a convenient
notation for specifying the characters which are in
the class by enumerating those that are not.  It is not an
assertion: it still consumes a character from the subject
string, and fails if the current pointer is at the end of
the string.
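
The point that a negated class still consumes a character can be
checked with a small shell sketch.  It is run in @sc{posix} mode, on
the assumption that such simple classes behave the same way there as
in Perl mode; the variable names are purely illustrative.

@example
# Assumes a POSIX sed; the variable names are illustrative only.
# [^aeiou] needs a character to consume, so it fails at the end of pa.
at_end=$(printf 'pa\n' | sed -n '/a[^aeiou]/p')
# In pat the class consumes the t, so the line is printed.
inside=$(printf 'pat\n' | sed -n '/a[^aeiou]/p')
# A circumflex that is not first in the class is an ordinary member.
caret=$(printf 'a^b\n' | sed 's/[b^]/_/g')
printf '[%s]\n' "$at_end" "$inside" "$caret"
@end example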

@cindex Perl-style regular expressions, case-insensitive
When caseless matching is set, any letters in a class
represent both their upper case and lower case versions, so
for example, a caseless @code{[aeiou]} matches uppercase
and lowercase @samp{A}s, and a caseless @code{[^aeiou]}
does not match @samp{A}, whereas a case-sensitive version would.

@cindex Perl-style regular expressions, single line
@cindex Perl-style regular expressions, multiline
The newline character is never treated in any special way in
character classes, whatever the setting of the @code{S} and
@code{M} options (modifiers) is.  A class such as @code{[^a]} will
always match a newline.

The minus (hyphen) character can be used to specify a range
of characters in a character class.  For example, @code{[d-m]}
matches any letter between @samp{d} and @samp{m}, inclusive.  If a minus
character is required in a class, it must be escaped with a
backslash or appear in a position where it cannot be interpreted
as indicating a range, typically as the first or last
character in the class.

It is not possible to have the literal character @code{]} as the
end character of a range.  A pattern such as @code{[W-]46]} is
interpreted as a class of two characters (@code{W} and @code{-})
followed by a literal string @code{46]}, so it would match
@samp{W46]} or @samp{-46]}.  However, if the @code{]} is escaped
with a backslash it is interpreted as the end of range, so
@code{[W-\]46]} is interpreted as a single class containing a
range followed by two separate characters.  The octal or
hexadecimal representation of @code{]} can also be used to end a range.

Ranges operate in @sc{ascii} collating sequence.  They can also be
used for characters specified numerically, for example
@code{[\000-\037]}.
If a range that includes letters is used when
caseless matching is set, it matches the letters in either
case.  For example, a caseless @code{[W-c]} is equivalent to
@code{[][\^_`wxyzabc]}, matched caselessly, and if character
tables for the French locale are in use, @code{[\xc8-\xcb]}
matches accented E characters in both cases.

Unlike in @sc{posix} mode, the character types @code{\d},
@code{\D}, @code{\s}, @code{\S}, @code{\w}, and @code{\W}
may also appear in a character class, and add the characters
that they match to the class.  For example, @code{[\dABCDEF]} matches any
hexadecimal digit written with upper case letters.  A circumflex can
conveniently be used
with the upper case character types to specify a more restricted
set of characters than the matching lower case type.
For example, the class @code{[^\W_]} matches any letter or digit,
but not underscore.

All non-alphanumeric characters other than @code{\}, @code{-},
@code{^} (at the start) and the terminating @code{]}
are non-special in character classes, but it does no harm
if they are escaped.

Perl 5.6 supports the @sc{posix} notation for character classes, which
uses names enclosed by @code{[:} and @code{:]} within the enclosing
square brackets, and @value{SSED} supports this notation as well.
For example,

@example
[01[:alpha:]%]
@end example

@noindent
matches @samp{0}, @samp{1}, any alphabetic character, or @samp{%}.
The supported class names are

@table @code
@item alnum
Matches letters and digits

@item alpha
Matches letters

@item ascii
Matches character codes 0 - 127

@item cntrl
Matches control characters

@item digit
Matches decimal digits (same as @code{\d})

@item graph
Matches printing characters, excluding space

@item lower
Matches lower case letters

@item print
Matches printing characters, including space

@item punct
Matches printing characters, excluding letters and digits

@item space
Matches white space (same as @code{\s})

@item upper
Matches upper case letters

@item word
Matches ``word'' characters (same as @code{\w})

@item xdigit
Matches hexadecimal digits
@end table

The names @code{ascii} and @code{word} are extensions valid only in
Perl mode.  Another Perl extension is negation, which is
indicated by a circumflex character after the colon.  For example,

@example
[12[:^digit:]]
@end example

@noindent
matches @samp{1}, @samp{2}, or any non-digit.

@node Options setting
@appendixsec Options setting
@cindex Perl-style regular expressions, toggling options
@cindex Perl-style regular expressions, case-insensitive
@cindex Perl-style regular expressions, multiline
@cindex Perl-style regular expressions, single line
@cindex Perl-style regular expressions, extended

The settings of the @code{I}, @code{M}, @code{S}, @code{X}
modifiers can be changed from within the pattern by
a sequence of Perl option letters enclosed between @code{(?}
and @code{)}.  The option letters must be lowercase.

For example, @code{(?im)} sets caseless, multiline matching.
It is
also possible to unset these options by preceding the letter
with a hyphen; you can also have combined settings and unsettings:
@code{(?im-sx)} sets caseless and multiline matching,
while unsetting single line matching (for dots) and extended
whitespace interpretation.  If a letter appears both before
and after the hyphen, the option is unset.

The scope of these option changes depends on where in the
pattern the setting occurs.  For settings that are outside
any subpattern (defined below), the effect is the same as if
the options were set or unset at the start of matching.  The
following patterns all behave in exactly the same way:

@example
(?i)abc
a(?i)bc
ab(?i)c
abc(?i)
@end example

@noindent
which in turn is the same as specifying the pattern @code{abc} with
the @code{I} modifier.  In other words, ``top level'' settings
apply to the whole pattern (unless there are other
changes inside subpatterns).  If there is more than one setting
of the same option at top level, the rightmost setting
is used.

If an option change occurs inside a subpattern, the effect
is different.  This is a change of behaviour in Perl 5.005.
An option change inside a subpattern affects only that part
of the subpattern @emph{that follows} it, so

@example
(a(?i)b)c
@end example

@noindent
matches @samp{abc} and @samp{aBc} and no other strings (assuming
case-sensitive matching is used).  By this means, options can
be made to have different settings in different parts of the
pattern.  Any changes made in one alternative do carry on
into subsequent branches within the same subpattern.
For
example,

@example
(a(?i)b|c)
@end example

@noindent
matches @samp{ab}, @samp{aB}, @samp{c}, and @samp{C},
even though when matching @samp{C} the first branch is
abandoned before the option setting.
This is because the effects of option settings happen at
compile time.  There would be some very weird behaviour otherwise.

@ignore
There are two PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA
that can be changed in the same way as the Perl-compatible options by
using the characters U and X respectively.  The (?X) flag
setting is special in that it must always occur earlier in
the pattern than any of the additional features it turns on,
even when it is at top level.  It is best put at the start.
@end ignore


@node Non-capturing subpatterns
@appendixsec Non-capturing subpatterns
@cindex Perl-style regular expressions, non-capturing subpatterns

Marking part of a pattern as a subpattern does two things.
On one hand, it localizes a set of alternatives; on the other
hand, it sets up the subpattern as a capturing subpattern (as
defined above).  The subpattern can be backreferenced and
referenced in the right side of @code{s} commands.

For example, if the string @samp{the red king} is matched against
the pattern

@example
the ((red|white) (king|queen))
@end example

@noindent
the captured substrings are @samp{red king}, @samp{red},
and @samp{king}, and are numbered 1, 2, and 3.

The fact that plain parentheses fulfil two functions is not
always helpful.  There are often times when a grouping
subpattern is required without a capturing requirement.  If an
opening parenthesis is followed by @code{?:}, the subpattern does
not do any capturing, and is not counted when computing the
number of any subsequent capturing subpatterns.
For example,
if the string @samp{the white queen} is matched against the pattern

@example
the ((?:red|white) (king|queen))
@end example

@noindent
the captured substrings are @samp{white queen} and @samp{queen},
and are numbered 1 and 2.  The maximum number of captured
substrings is 99, while the maximum number of all subpatterns,
both capturing and non-capturing, is 200.

As a convenient shorthand, if any option settings are
required at the start of a non-capturing subpattern, the
option letters may appear between the @code{?} and the
@code{:}.  Thus the two patterns

@example
(?i:saturday|sunday)
(?:(?i)saturday|sunday)
@end example

@noindent
match exactly the same set of strings.  Because alternative
branches are tried from left to right, and options are not
reset until the end of the subpattern is reached, an option
setting in one branch does affect subsequent branches, so
the above patterns match @samp{SUNDAY} as well as @samp{Saturday}.


@node Repetition
@appendixsec Repetition
@cindex Perl-style regular expressions, repetitions

Repetition is specified by quantifiers, which can follow any
of the following items:

@itemize @bullet
@item
a single character, possibly escaped

@item
the @code{.} special character

@item
a character class

@item
a back reference (see next section)

@item
a parenthesized subpattern (unless it is an assertion; @pxref{Assertions})
@end itemize

The general repetition quantifier specifies a minimum and
maximum number of permitted matches, by giving the two
numbers in curly brackets (braces), separated by a comma.
The numbers must be less than 65536, and the first must be
less than or equal to the second.
For example:

@example
z@{2,4@}
@end example

@noindent
matches @samp{zz}, @samp{zzz}, or @samp{zzzz}.  A closing brace on its own
is not a special character.  If the second number is omitted,
but the comma is present, there is no upper limit; if the
second number and the comma are both omitted, the quantifier
specifies an exact number of required matches.  Thus

@example
[aeiou]@{3,@}
@end example

@noindent
matches at least 3 successive vowels, but may match many
more, while

@example
\d@{8@}
@end example

@noindent
matches exactly 8 digits.  An opening curly bracket that
appears in a position where a quantifier is not allowed, or
one that does not match the syntax of a quantifier, is taken
as a literal character.  For example, @code{@{,6@}} is not a quantifier,
but a literal string of four characters.@footnote{It
raises an error if @option{-R} is not used.}

The quantifier @samp{@{0@}} is permitted, causing the expression to
behave as if the previous item and the quantifier were not
present.

For convenience (and historical compatibility) the three
most common quantifiers have single-character abbreviations:

@table @code
@item *
is equivalent to @code{@{0,@}}

@item +
is equivalent to @code{@{1,@}}

@item ?
is equivalent to @code{@{0,1@}}
@end table

It is possible to construct infinite loops by following a
subpattern that can match no characters with a quantifier
that has no upper limit, for example:

@example
(a?)*
@end example

Earlier versions of Perl used to give an error at
compile time for such patterns.
However, because there are
cases where this can be useful, such patterns are now
accepted, but if any repetition of the subpattern does in
fact match no characters, the loop is forcibly broken.

@cindex Greedy regular expression matching
@cindex Perl-style regular expressions, stingy repetitions
By default, the quantifiers are @dfn{greedy}, as in @sc{posix}
mode; that is, they match as much as possible (up to the maximum
number of permitted times), without causing the rest of the
pattern to fail. The classic example of where this gives problems
is in trying to match comments in C programs. These appear between
the sequences @code{/*} and @code{*/} and within the sequence, individual
@code{*} and @code{/} characters may appear. An attempt to match C
comments by applying the pattern

@example
/\*.*\*/
@end example

@noindent
to the string

@example
/* first comment */ not comment /* second comment */
@end example

@noindent
fails, because it matches the entire string owing to the
greediness of the @code{.*} item.

However, if a quantifier is followed by a question mark, it
ceases to be greedy, and instead matches the minimum number
of times possible, so the pattern @code{/\*.*?\*/}
does the right thing with the C comments. The meaning of the
various quantifiers is not otherwise changed, just the preferred
number of matches. Do not confuse this use of question
mark with its use as a quantifier in its own right.
Because it has two uses, it can sometimes appear doubled, as in

@example
\d??\d
@end example

@noindent
which matches one digit by preference, but can match two if
that is the only way the rest of the pattern matches.

Note that greediness does not matter when specifying addresses,
but can nevertheless be used to improve performance.
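
The difference between greedy and minimizing quantifiers is easy to
check outside @command{sed}. The following is a minimal sketch in
Python, whose @code{re} module accepts the same @code{*?} and @code{??}
syntax; it is only a stand-in for the Perl-style matcher described
here, and the variable names are illustrative.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the last */ in the subject, so one match spans everything.
greedy = re.findall(r"/\*.*\*/", subject)

# Minimizing: .*? stops at the first */ that lets the pattern succeed.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the rest of the pattern needs it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.match(r"\d??\d$", "12").group()

print(greedy)
print(lazy)
print(one_digit, two_digits)
```

The greedy pattern yields a single match covering the whole subject,
while the minimizing pattern yields the two comments separately.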

@ignore
If the PCRE_UNGREEDY option is set (an option which is not
available in Perl), the quantifiers are not greedy by
default, but individual ones can be made greedy by following
them with a question mark. In other words, it inverts the
default behaviour.
@end ignore

When a parenthesized subpattern is quantified with a minimum
repeat count that is greater than 1 or with a limited maximum,
more memory is required for the compiled pattern, in
proportion to the size of the minimum or maximum.

@cindex Perl-style regular expressions, single line
If a pattern starts with @code{.*} or @code{.@{0,@}} and the
@code{S} modifier is used, the pattern is implicitly anchored,
because whatever follows will be tried against every character
position in the subject string, so there is no point in
retrying the overall match at any position after the first.
PCRE treats such a pattern as though it were preceded by @code{\A}.

When a capturing subpattern is repeated, the value captured
is the substring that matched the final iteration. For example,
after

@example
(tweedle[dume]@{3@}\s*)+
@end example

@noindent
has matched @samp{tweedledum tweedledee} the value of the
captured substring is @samp{tweedledee}. However, if there are
nested capturing subpatterns, the corresponding captured
values may have been set in previous iterations. For example,
after

@example
/(a|(b))+/
@end example

@noindent
matches @samp{aba}, the value of the second captured substring is
@samp{b}.

@node Backreferences
@appendixsec Backreferences
@cindex Perl-style regular expressions, backreferences

Outside a character class, a backslash followed by a digit
greater than 0 (and possibly further digits) is a back
reference to a capturing subpattern earlier (i.e.
to its
left) in the pattern, provided there have been that many
previous capturing left parentheses.

However, if the decimal number following the backslash is
less than 10, it is always taken as a back reference, and
causes an error only if there are not that many capturing
left parentheses in the entire pattern. In other words, the
parentheses that are referenced need not be to the left of
the reference for numbers less than 10. @xref{Backslash},
for further details of the handling of digits following a backslash.

A back reference matches whatever actually matched the capturing
subpattern in the current subject string, rather than
anything matching the subpattern itself. So the pattern

@example
(sens|respons)e and \1ibility
@end example

@noindent
matches @samp{sense and sensibility} and @samp{response and responsibility},
but not @samp{sense and responsibility}. If caseful
matching is in force at the time of the back reference, the
case of letters is relevant. For example,

@example
((?i)blah)\s+\1
@end example

@noindent
matches @samp{blah blah} and @samp{Blah Blah}, but not
@samp{BLAH blah}, even though the original capturing
subpattern is matched caselessly.

There may be more than one back reference to the same subpattern.
Also, if a subpattern has not actually been used in a
particular match, any back references to it always fail. For
example, the pattern

@example
(a|(bc))\2
@end example

@noindent
always fails if it starts to match @samp{a} rather than
@samp{bc}. Because there may be up to 99 back references, all
digits following the backslash are taken as part of a potential
back reference number; this is different from what happens
in @sc{posix} mode.
If the pattern continues with a digit
character, some delimiter must be used to terminate the back
reference. If the @code{X} modifier option is set, this can be
whitespace. Otherwise an empty comment can be used, or the
following character can be expressed in hexadecimal or octal.
Note that this applies only to the LHS pattern; it is
not yet possible to specify more than 9 backreferences on the
RHS of the @code{s} command.

A back reference that occurs inside the parentheses to which
it refers fails when the subpattern is first used, so, for
example, @code{(a\1)} never matches. However, such references
can be useful inside repeated subpatterns. For example, the
pattern

@example
(a|b\1)+
@end example

@noindent
matches any number of @samp{a}s and also @samp{aba}, @samp{ababbaa},
etc. At each iteration of the subpattern, the back reference matches
the character string corresponding to the previous iteration. In
order for this to work, the pattern must be such that the first
iteration does not need to match the back reference. This can be
done using alternation, as in the example above, or by a
quantifier with a minimum of zero.

@node Assertions
@appendixsec Assertions
@cindex Perl-style regular expressions, assertions
@cindex Perl-style regular expressions, asserting subpatterns

An assertion is a test on the characters following or
preceding the current matching point that does not actually
consume any characters. The simple assertions coded as @code{\b},
@code{\B}, @code{\A}, @code{\Z}, @code{\z}, @code{^} and @code{$}
are described above. More complicated assertions are coded as
subpatterns. There are two kinds: those that look ahead of the
current position in the subject string, and those that look behind it.
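
That an assertion consumes no characters can be seen from the span of
a match. The following is a minimal sketch in Python, whose @code{re}
module shares @code{\b} and @code{\B} with the Perl-style syntax; it
is a stand-in for the matcher described here, and the subject string
is made up for illustration. The sketch avoids @code{\Z} and
@code{\z}, which Python treats differently.

```python
import re

subject = "cat concat cat."

# \b asserts a word boundary; the matched text is still just "cat".
words = [m.span() for m in re.finditer(r"\bcat\b", subject)]

# \B asserts the absence of a boundary, so only the "cat" inside
# "concat" matches.
inner = [m.span() for m in re.finditer(r"\Bcat", subject)]

# An assertion on its own matches the empty string: start and end coincide.
empty = re.search(r"\b", subject).span()

print(words, inner, empty)
```

Each reported span is exactly three characters wide, and the bare
@code{\b} yields a span of width zero.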

@cindex Perl-style regular expressions, lookahead subpatterns
An assertion subpattern is matched in the normal way, except
that it does not cause the current matching position to be
changed. Lookahead assertions start with @code{(?=} for positive
assertions and @code{(?!} for negative assertions. For example,

@example
\w+(?=;)
@end example

@noindent
matches a word followed by a semicolon, but does not include
the semicolon in the match, and

@example
foo(?!bar)
@end example

@noindent
matches any occurrence of @samp{foo} that is not followed by
@samp{bar}.

Note that the apparently similar pattern

@example
(?!foo)bar
@end example

@noindent
@cindex Perl-style regular expressions, lookbehind subpatterns
finds any occurrence of @samp{bar} even if it is preceded by
@samp{foo}, because the assertion @code{(?!foo)} is always true
when the next three characters are @samp{bar}. A lookbehind
assertion is needed to achieve this effect.
Lookbehind assertions start with @code{(?<=} for positive
assertions and @code{(?<!} for negative assertions. So,

@example
(?<!foo)bar
@end example

@noindent
achieves the required effect of finding an occurrence of
@samp{bar} that is not preceded by @samp{foo}. The contents of a
lookbehind assertion are restricted
such that all the strings it matches must have a fixed
length. However, if there are several alternatives, they do
not all have to have the same fixed length. This is an extension
compared with Perl 5.005, which requires all branches to match
the same length of string. Thus

@example
(?<=dogs|cats|)
@end example

@noindent
is permitted, but the apparently equivalent regular expression

@example
(?<!dogs?|cats?)
@end example

@noindent
causes an error at compile time. Branches that match different
length strings are permitted only at the top level of
a lookbehind assertion: an assertion such as

@example
(?<=ab(c|de))
@end example

@noindent
is not permitted, because its single top-level branch can
match two different lengths, but it is acceptable if rewritten
to use two top-level branches:

@example
(?<=abc|abde)
@end example

All this is required because lookbehind assertions simply
move the current position back by the alternative's fixed
width and then try to match. If there are
insufficient characters before the current position, the
match is deemed to fail. Lookbehinds, in conjunction with
non-backtracking subpatterns, can be particularly useful for
matching at the ends of strings; an example is given at the end
of the section on non-backtracking subpatterns.

Several assertions (of any sort) may occur in succession.
For example,

@example
(?<=\d@{3@})(?<!999)foo
@end example

@noindent
matches @samp{foo} preceded by three digits that are not @samp{999}.
Notice that each of the assertions is applied independently
at the same point in the subject string. First there is a
check that the previous three characters are all digits, and
then there is a check that the same three characters are not
@samp{999}. This pattern does not match @samp{foo} preceded by six
characters, the first three of which are digits and the last three
of which are not @samp{999}. For example, it doesn't match
@samp{123abcfoo}.
A pattern to do that is

@example
(?<=\d@{3@}...)(?<!999)foo
@end example

@noindent
This time the first assertion looks at the preceding six
characters, checking that the first three are digits, and
then the second assertion checks that the preceding three
characters are not @samp{999}. Actually, assertions can be
nested in any combination, so one can write this as

@example
(?<=\d@{3@}(?!999)...)foo
@end example

@noindent
or

@example
(?<=\d@{3@}...(?<!999))foo
@end example

@noindent
both of which might be considered more readable.

Assertion subpatterns are not capturing subpatterns, and may
not be repeated, because it makes no sense to assert the
same thing several times. If any kind of assertion contains
capturing subpatterns within it, these are counted for the
purposes of numbering the capturing subpatterns in the whole
pattern. However, substring capturing is carried out only
for positive assertions, because it does not make sense for
negative assertions.

Assertions count towards the maximum of 200 parenthesized
subpatterns.

@node Non-backtracking subpatterns
@appendixsec Non-backtracking subpatterns
@cindex Perl-style regular expressions, non-backtracking subpatterns

With both maximizing and minimizing repetition, failure of
what follows normally causes the repeated item to be evaluated
again to see if a different number of repeats allows the
rest of the pattern to match. Sometimes it is useful to
prevent this, either to change the nature of the match, or
to cause it to fail earlier than it otherwise might, when the
author of the pattern knows there is no point in carrying
on.
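
The re-evaluation of a repeated item described above can be observed
with ordinary capturing groups. The following is a minimal sketch in
Python's @code{re} module, used here only as a stand-in for the
Perl-style matcher; the subject @samp{1525} is made up for
illustration.

```python
import re

# Greedy \d+ first takes "1525", then gives digits back until the
# literal 5 that follows can match.
greedy = re.match(r"(\d+)5", "1525")

# Minimizing \d+? first takes "1" and would only grow if the literal 5
# could not match right after it.
lazy = re.match(r"(\d+?)5", "1525")

print(greedy.group(1), greedy.group())
print(lazy.group(1), lazy.group())
```

In both cases the repeat count is adjusted to let the rest of the
pattern match; the two forms differ only in the order in which the
counts are tried.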

Consider, for example, the pattern @code{\d+foo} when applied to
the subject line

@example
123456bar
@end example

After matching all 6 digits and then failing to match @samp{foo},
the normal action of the matcher is to try again with only 5
digits matching the @code{\d+} item, and then with 4, and so on,
before ultimately failing. Non-backtracking subpatterns
provide the means for specifying that once a portion of the
pattern has matched, it is not to be re-evaluated in this way,
so the matcher would give up immediately on failing to match
@samp{foo} the first time. The notation is another kind of special
parenthesis, starting with @code{(?>} as in this example:

@example
(?>\d+)foo
@end example

This kind of parenthesis ``locks up'' the part of the pattern
it contains once it has matched, and a failure further into
the pattern is prevented from backtracking into it.
Backtracking past it to previous items, however, works as
normal.

Non-backtracking subpatterns are not capturing subpatterns. Simple
cases such as the above example can be thought of as a maximizing
repeat that must swallow everything it can. So,
while both @code{\d+} and @code{\d+?} are prepared to adjust the number of
digits they match in order to make the rest of the pattern
match, @code{(?>\d+)} can only match an entire sequence of digits.

This construction can of course contain arbitrarily complicated
subpatterns, and it can be nested.

@cindex Perl-style regular expressions, lookbehind subpatterns
Non-backtracking subpatterns can be used in conjunction with look-behind
assertions to specify efficient matching at the end
of the subject string.
Consider a simple pattern such as

@example
abcd$
@end example

@noindent
when applied to a long string which does not match. Because
matching proceeds from left to right, @command{sed} will look for
each @samp{a} in the subject and then see if what follows matches
the rest of the pattern. If the pattern is specified as

@example
^.*abcd$
@end example

@noindent
the initial @code{.*} matches the entire string at first, but when
this fails (because there is no following @samp{a}), it backtracks
to match all but the last character, then all but the
last two characters, and so on. Once again the search for
@samp{a} covers the entire string, from right to left, so we are
no better off. However, if the pattern is written as

@example
^(?>.*)(?<=abcd)
@end example

@noindent
there can be no backtracking for the @code{.*} item; it can match
only the entire string. The subsequent lookbehind assertion
does a single test on the last four characters. If it fails,
the match fails immediately. For long strings, this approach
makes a significant difference to the processing time.

When a pattern contains an unlimited repeat inside a subpattern
that can itself be repeated an unlimited number of
times, the use of a once-only subpattern is the only way to
avoid some failing matches taking a very long time
indeed.@footnote{Actually, the matcher embedded in @value{SSED}
tries to do something for this in the simplest cases,
like @code{([^b]*b)*}. These cases are actually quite
common: they happen for example in a regular expression
like @code{\/\*([^*]*\*)*\/} which matches C comments.}

The pattern

@example
(\D+|<\d+>)*[!?]
@end example

@c ([^0-9<]+<(\d+>)?)*[!?]

@noindent
matches an unlimited number of substrings that either consist
of non-digits, or digits enclosed in angle brackets, followed by
an exclamation or question mark. When it matches, it runs quickly.
However, if it is applied to

@example
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
@end example

@noindent
it takes a long time before reporting failure. This is
because the string can be divided between the two repeats in
a large number of ways, and all have to be tried.@footnote{The
example used @code{[!?]} rather than a single character at the end,
because both @value{SSED} and Perl have an optimization that allows
for fast failure when a single character is used. They
remember the last single character that is required for a
match, and fail early if it is not present in the string.}

If the pattern is changed to

@example
((?>\D+)|<\d+>)*[!?]
@end example

@noindent
sequences of non-digits cannot be broken, and failure happens
quickly.

@node Conditional subpatterns
@appendixsec Conditional subpatterns
@cindex Perl-style regular expressions, conditional subpatterns

It is possible to cause the matching process to obey a subpattern
conditionally or to choose between two alternative
subpatterns, depending on the result of an assertion, or
whether a previous capturing subpattern matched or not. The
two possible forms of conditional subpattern are

@example
(?(@var{condition})@var{yes-pattern})
(?(@var{condition})@var{yes-pattern}|@var{no-pattern})
@end example

If the condition is satisfied, the yes-pattern is used; otherwise
the no-pattern (if present) is used. If there are more than two
alternatives in the subpattern, a compile-time error occurs.

There are two kinds of condition.
If the text between the
parentheses consists of a sequence of digits, the condition
is satisfied if the capturing subpattern of that number has
previously matched. The number must be greater than zero.
Consider the following pattern, which contains non-significant
white space to make it more readable (assume the @code{X} modifier)
and to divide it into three parts for ease of discussion:

@example
( \( )? [^()]+ (?(1) \) )
@end example

The first part matches an optional opening parenthesis, and
if that character is present, sets it as the first captured
substring. The second part matches one or more characters
that are not parentheses. The third part is a conditional
subpattern that tests whether the first set of parentheses
matched or not. If they did, that is, if the subject started
with an opening parenthesis, the condition is true, and so
the yes-pattern is executed and a closing parenthesis is
required. Otherwise, since no-pattern is not present, the
subpattern matches nothing. In other words, this pattern
matches a sequence of non-parentheses, optionally enclosed
in parentheses.

@cindex Perl-style regular expressions, lookahead subpatterns
If the condition is not a sequence of digits, it must be an
assertion. This may be a positive or negative lookahead or
lookbehind assertion. Consider this pattern, again containing
non-significant white space, and with the two alternatives
on the second line:

@example
(?(?=...[a-z])
\d\d-[a-z]@{3@}-\d\d |
\d\d-\d\d-\d\d )
@end example

The condition is a positive lookahead assertion that matches
a letter that is three characters away from the current point.
If a letter is found, the subject is matched against the first
alternative @samp{@var{dd}-@var{aaa}-@var{dd}} (where @var{aaa} are
letters and @var{dd} are digits); otherwise it is matched against
the second alternative, @samp{@var{dd}-@var{dd}-@var{dd}}.


@node Recursive patterns
@appendixsec Recursive patterns
@cindex Perl-style regular expressions, recursive patterns
@cindex Perl-style regular expressions, recursion

Consider the problem of matching a string in parentheses,
allowing for unlimited nested parentheses. Without the use
of recursion, the best that can be done is to use a pattern
that matches up to some fixed depth of nesting. It is not
possible to handle an arbitrary nesting depth. Perl 5.6 has
provided an experimental facility that allows regular
expressions to recurse (amongst other things). It does this
by interpolating Perl code in the expression at run time,
and the code can refer to the expression itself. A Perl pattern
to solve the parentheses problem can be created like
this:

@example
$re = qr@{\( (?: (?>[^()]+) | (?p@{$re@}) )* \)@}x;
@end example

The @code{(?p@{...@})} item interpolates Perl code at run time,
and in this case refers recursively to the pattern in which it
appears. Obviously, @command{sed} cannot support the interpolation of
Perl code. Instead, the special item @code{(?R)} is provided for
the specific case of recursion. This pattern solves the
parentheses problem (assume the @code{X} modifier option is used
so that white space is ignored):

@example
\( ( (?>[^()]+) | (?R) )* \)
@end example

First it matches an opening parenthesis. Then it matches any
number of substrings which can either be a sequence of
non-parentheses, or a recursive match of the pattern itself
(i.e. a correctly parenthesized substring).
Finally there is
a closing parenthesis.

This particular example pattern contains nested unlimited
repeats, and so the use of a non-backtracking subpattern for
matching strings of non-parentheses is important when applying
the pattern to strings that do not match. For example, when
it is applied to

@example
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
@end example

@noindent
it yields a ``no match'' response quickly. However, if a
non-backtracking subpattern is not used, the match runs
for a very long time indeed because there are so many different
ways the @code{+} and @code{*} repeats can carve up the subject,
and all have to be tested before failure can be reported.

The values set for any capturing subpatterns are those from
the outermost level of the recursion at which the subpattern
value is set. If the pattern above is matched against

@example
(ab(cd)ef)
@end example

@noindent
the value for the capturing parentheses is @samp{ef}, which is
the last value taken on at the top level.

@node Comments
@appendixsec Comments
@cindex Perl-style regular expressions, comments

The sequence @code{(?#} marks the start of a comment which continues
up to the next closing parenthesis. Nested parentheses
are not permitted. The characters that make up a comment
play no part in the pattern matching at all.

@cindex Perl-style regular expressions, extended
If the @code{X} modifier option is used, an unescaped @code{#} character
outside a character class introduces a comment that continues
up to the next newline character in the pattern.
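
Both comment forms can be tried out in Python, whose @code{re} module
accepts @code{(?#...)} and, with its @code{VERBOSE} flag, the
@code{#} form. The following is a minimal sketch under the assumption
that Python's flag behaves like the @code{X} modifier for this
purpose; the pattern and the names @code{inline} and @code{extended}
are illustrative only.

```python
import re

# (?#...) is skipped by the matcher; this behaves like plain \d+-\d+.
inline = re.compile(r"\d+(?#major)-\d+(?#minor)")

# With the extended flag, white space in the pattern is ignored and an
# unescaped # starts a comment that runs to the end of the line.
extended = re.compile(r"""
    \d+   # major
    -
    \d+   # minor
""", re.VERBOSE)

print(inline.fullmatch("4-1") is not None)
print(extended.fullmatch("4-1") is not None)
```

Both compiled patterns accept and reject the same strings, since the
comments contribute nothing to the match.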
@end ifset


@page
@node Concept Index
@unnumbered Concept Index

This is a general index of all issues discussed in this manual, with the
exception of the @command{sed} commands and command-line options.

@printindex cp

@page
@node Command and Option Index
@unnumbered Command and Option Index

This is an alphabetical list of all @command{sed} commands and command-line
options.

@printindex fn

@contents
@bye

@c XXX FIXME: the term "cycle" is never defined...