<html>
<head>
<title>pcreperform specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcreperform man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
PCRE PERFORMANCE
</b><br>
<P>
Two aspects of performance are discussed below: memory usage and processing
time. The way you express your pattern as a regular expression can affect both
of them.
</P>
<br><b>
COMPILED PATTERN MEMORY USAGE
</b><br>
<P>
Patterns are compiled by PCRE into a reasonably efficient byte code, so that
most simple patterns do not use much memory. However, there is one case where
the memory usage of a compiled pattern can be unexpectedly large. If a
parenthesized subpattern has a quantifier with a minimum greater than 1 and/or
a limited maximum, the whole subpattern is repeated in the compiled code. For
example, the pattern
<pre>
  (abc|def){2,4}
</pre>
is compiled as if it were
<pre>
  (abc|def)(abc|def)((abc|def)(abc|def)?)?
</pre>
(Technical aside: It is done this way so that backtrack points within each of
the repetitions can be independently maintained.)
</P>
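<P>
The expansion described above can be sketched in a few lines of Python. This is
only an illustration of the textual rewriting: the function name
expand_bounded_repeat is hypothetical, the group passed in must already be
parenthesized, and Python's re module (not PCRE) is used to check that the two
forms accept the same subjects.
</P>

```python
import re


def expand_bounded_repeat(group: str, minimum: int, maximum: int) -> str:
    """Spell out group{minimum,maximum} as explicit copies of group.

    Illustrative only; this is not how PCRE's compiler is implemented.
    """
    if not 0 <= minimum <= maximum:
        raise ValueError("need 0 <= minimum <= maximum")
    optional = ""
    # Build the optional tail from the inside out: X?, then (XX?)?, and so on.
    for _ in range(maximum - minimum):
        optional = f"({group}{optional})?" if optional else f"{group}?"
    return group * minimum + optional


expanded = expand_bounded_repeat("(abc|def)", 2, 4)
print(expanded)

# Check a handful of subjects against both spellings of the pattern.
subjects = ["abc", "abcdef", "abcabcabc", "defdefdefdef", "abcabcabcabcabc"]
agree = all(
    bool(re.fullmatch(r"(abc|def){2,4}", s)) == bool(re.fullmatch(expanded, s))
    for s in subjects
)
```

<P>
Each extra unit in the maximum adds one more full copy of the group, which is
why large bounds make the compiled pattern grow.
</P>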
<P>
For regular expressions whose quantifiers use only small numbers, this is not
usually a problem. However, if the numbers are large, and particularly if such
repetitions are nested, the memory usage can become an embarrassment. For
example, the very simple pattern
<pre>
  ((ab){1,1000}c){1,3}
</pre>
uses 51K bytes when compiled. When PCRE is compiled with its default internal
pointer size of two bytes, the size limit on a compiled pattern is 64K, and
this is reached with the above pattern if the outer repetition is increased
from 3 to 4. PCRE can be compiled to use larger internal pointers and thus
handle larger compiled patterns, but it is better to try to rewrite your
pattern to use less memory if you can.
</P>
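<P>
A rough back-of-the-envelope check of those figures, as a Python sketch. It
assumes that the compiled size grows linearly with the outer maximum, which is
an approximation rather than PCRE's actual accounting; the constant and function
names are hypothetical.
</P>

```python
# Figures quoted in the text above.
SIZE_WITH_THREE_COPIES_K = 51   # ((ab){1,1000}c){1,3}
LIMIT_K = 64                    # limit with two-byte internal pointers


def estimated_size_k(outer_max: int) -> float:
    """Estimate compiled size in K, assuming linear growth with outer_max."""
    per_copy = SIZE_WITH_THREE_COPIES_K / 3
    return per_copy * outer_max


for outer_max in (3, 4):
    size = estimated_size_k(outer_max)
    print(outer_max, size, "over the limit" if size > LIMIT_K else "fits")
```

<P>
Under that assumption each outer copy costs about 17K, so a fourth copy takes
the estimate past the 64K limit.
</P>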
<P>
One way of reducing the memory usage for such patterns is to make use of PCRE's
<a href="pcrepattern.html#subpatternsassubroutines">"subroutine"</a>
facility. Re-writing the above pattern as
<pre>
  ((ab)(?2){0,999}c)(?1){0,2}
</pre>
reduces the memory requirements to 18K, and indeed it remains under 20K even
with the outer repetition increased to 100. However, this pattern is not
exactly equivalent, because the "subroutine" calls are treated as
<a href="pcrepattern.html#atomicgroup">atomic groups</a>
into which there can be no backtracking if there is a subsequent matching
failure. Therefore, PCRE cannot do this kind of rewriting automatically.
Furthermore, there is a noticeable loss of speed when executing the modified
pattern. Nevertheless, if the atomic grouping is not a problem and the loss of
speed is acceptable, this kind of rewriting will allow you to process patterns
that PCRE cannot otherwise handle.
</P>
<br><b>
STACK USAGE AT RUN TIME
</b><br>
<P>
When <b>pcre_exec()</b> is used for matching, certain kinds of pattern can cause
it to use large amounts of the process stack. In some environments the default
process stack is quite small, and if it runs out the result is often SIGSEGV.
This issue is probably the most frequently raised problem with PCRE. Rewriting
your pattern can often help. The
<a href="pcrestack.html"><b>pcrestack</b></a>
documentation discusses this issue in detail.
</P>
<br><b>
PROCESSING TIME
</b><br>
<P>
Certain items in regular expression patterns are processed more efficiently
than others. It is more efficient to use a character class like [aeiou] than a
set of single-character alternatives such as (a|e|i|o|u). In general, the
simplest construction that provides the required behaviour is usually the most
efficient. Jeffrey Friedl's book "Mastering Regular Expressions" contains a lot
of useful general discussion about optimizing regular expressions for efficient
performance. This document contains a few observations about PCRE.
</P>
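<P>
The two spellings can be compared with a small Python sketch. It uses Python's
re module as a stand-in for PCRE, so any timings it produces say nothing
definite about PCRE itself; the helper names are hypothetical. The check that
is actually run only confirms that both forms find the same matches.
</P>

```python
import re
import timeit

char_class = re.compile(r"[aeiou]")
alternation = re.compile(r"(a|e|i|o|u)")

subject = "the quick brown fox jumps over the lazy dog " * 100


def vowel_positions(pattern, text):
    """Return the start offset of every match of pattern in text."""
    return [m.start() for m in pattern.finditer(text)]


# Both patterns describe the same set of single characters.
same_result = vowel_positions(char_class, subject) == vowel_positions(alternation, subject)


def time_pattern(pattern, repeats=20):
    """Time repeated scans of the subject; results vary by machine and engine."""
    return timeit.timeit(lambda: vowel_positions(pattern, subject), number=repeats)
```

<P>
Calling time_pattern(char_class) and time_pattern(alternation) gives a quick
local comparison.
</P>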
<P>
Using Unicode character properties (the \p, \P, and \X escapes) is slow,
because PCRE has to scan a structure that contains data for over fifteen
thousand characters whenever it needs a character's property. If you can find
an alternative pattern that does not use character properties, it will probably
be faster.
</P>
<P>
By default, the escape sequences \b, \d, \s, and \w, and the POSIX
character classes such as [:alpha:] do not use Unicode properties, partly for
backwards compatibility, and partly for performance reasons. However, you can
set PCRE_UCP if you want Unicode character properties to be used. This can
double the matching time for items such as \d, when matched with
<b>pcre_exec()</b>; the performance loss is less with <b>pcre_dfa_exec()</b>, and
in both cases there is not much difference for \b.
</P>
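<P>
What the choice means for \d can be seen by analogy in Python's re module,
used here only as a stand-in. Note that the defaults are the other way round:
for text patterns Python uses Unicode properties unless re.ASCII is given,
whereas PCRE uses them only when PCRE_UCP is set.
</P>

```python
import re

ARABIC_INDIC_THREE = "\u0663"   # a decimal digit outside the ASCII range

# Roughly comparable to PCRE's default: \d means [0-9] only.
ascii_digit = re.compile(r"\d", re.ASCII)
# Roughly comparable to PCRE with PCRE_UCP: \d consults Unicode properties.
unicode_digit = re.compile(r"\d")

ascii_hit = ascii_digit.fullmatch(ARABIC_INDIC_THREE) is not None
unicode_hit = unicode_digit.fullmatch(ARABIC_INDIC_THREE) is not None
print(ascii_hit, unicode_hit)
```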
<P>
When a pattern begins with .* not in parentheses, or in parentheses that are
not the subject of a backreference, and the PCRE_DOTALL option is set, the
pattern is implicitly anchored by PCRE, since it can match only at the start of
a subject string. However, if PCRE_DOTALL is not set, PCRE cannot make this
optimization, because the . metacharacter does not then match a newline, and if
the subject string contains newlines, the pattern may match from the character
immediately following one of them instead of from the very start. For example,
the pattern
<pre>
  .*second
</pre>
matches the subject "first\nand second" (where \n stands for a newline
character), with the match starting at the seventh character. In order to do
this, PCRE has to retry the match starting after every newline in the subject.
</P>
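<P>
The same behaviour can be observed with Python's re module, used here as a
stand-in for PCRE; re.DOTALL plays the part of PCRE_DOTALL. Offsets in Python
are zero-based, so the seventh character is offset 6.
</P>

```python
import re

subject = "first\nand second"

# Without DOTALL, "." does not match the newline, so the match can only
# start after it.
m = re.search(r".*second", subject)

# With DOTALL, ".*" runs across the newline and the match starts at the
# very beginning of the subject.
m_dotall = re.search(r".*second", subject, re.DOTALL)

start_default = m.start()
start_dotall = m_dotall.start()
print(start_default, start_dotall)
```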
<P>
If you are using such a pattern with subject strings that do not contain
newlines, the best performance is obtained by setting PCRE_DOTALL, or starting
the pattern with ^.* or ^.*? to indicate explicit anchoring. That saves PCRE
from having to scan along the subject looking for a newline to restart at.
</P>
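<P>
The condition about newlines matters, because explicit anchoring changes what
the pattern matches when a newline is present. A minimal sketch, again using
Python's re module as a stand-in for PCRE (without any multiline option):
</P>

```python
import re

one_line = "first and second"
two_lines = "first\nand second"

anchored = re.compile(r"^.*second")

# On a subject without newlines the anchored pattern matches from offset 0.
on_one_line = anchored.search(one_line)

# With a newline before "second" the anchored pattern no longer matches,
# because "." stops at the newline and "^" allows only the start.
on_two_lines = anchored.search(two_lines)
print(on_one_line is not None, on_two_lines is not None)
```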
<P>
Beware of patterns that contain nested indefinite repeats. These can take a
long time to run when applied to a string that does not match. Consider the
pattern fragment
<pre>
  ^(a+)*
</pre>
This can match "aaaa" in 16 different ways, and this number increases very
rapidly as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4
times, and for each of those cases other than 0 or 4, the + repeats can match
different numbers of times.) When the remainder of the pattern is such that the
entire match is going to fail, PCRE has in principle to try every possible
variation, and this can take an extremely long time, even for relatively short
strings.
</P>
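<P>
The count of 16 can be reproduced by enumerating the choices directly. The
Python sketch below does not run a regular expression at all; the function name
ways is hypothetical, and it simply counts how the outer * and inner + can
divide up a run of "a" characters.
</P>

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def ways(remaining: int) -> int:
    """Count the ways ^(a+)* can match at the start of `remaining` "a"s."""
    # One way is to stop repeating here. Otherwise a+ takes k characters
    # and the outer * goes round again on what is left.
    return 1 + sum(ways(remaining - k) for k in range(1, remaining + 1))


for length in (4, 10, 20):
    print(length, ways(length))
```

<P>
The count doubles with each additional character, which is why a failing match
on even a short line can take so long.
</P>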
<P>
An optimization catches some of the simpler cases such as
<pre>
  (a+)*b
</pre>
where a literal character follows. Before embarking on the standard matching
procedure, PCRE checks that there is a "b" later in the subject string, and if
there is not, it fails the match immediately. However, when there is no
following literal this optimization cannot be used. You can see the difference
by comparing the behaviour of
<pre>
  (a+)*\d
</pre>
with the pattern above. The pattern above, (a+)*b, gives a failure almost
instantly when applied to a whole line of "a" characters, whereas (a+)*\d takes
an appreciable time with strings longer than about 20 characters.
</P>
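<P>
The idea behind this check can be sketched as follows. This is an illustration
in Python, not PCRE's implementation: the helper search_with_required_literal
is hypothetical, and the caller supplies the required literal by hand, whereas
PCRE works it out for itself.
</P>

```python
import re


def search_with_required_literal(pattern: str, required: str, subject: str):
    """Search, but give up at once if a literal the pattern needs is absent."""
    if required not in subject:
        return None   # no need to start the backtracking matcher at all
    return re.search(pattern, subject)


line_of_a = "a" * 40

# No "b" anywhere in the subject, so the failure is reported immediately.
quick_failure = search_with_required_literal(r"(a+)*b", "b", line_of_a)

# When the literal is present, matching proceeds as normal.
success = search_with_required_literal(r"(a+)*b", "b", "aaab")
print(quick_failure, success.group())
```

<P>
No such shortcut is available for (a+)*\d, because \d does not name a single
literal character that can be searched for in advance.
</P>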
<P>
In many cases, the solution to this kind of performance issue is to use an
atomic group or a possessive quantifier, for example (?&#62;a+)*\d or (a++)*\d,
so that PCRE does not backtrack into the inner repeat after a failure.
</P>
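<P>
The effect of making the inner repeat atomic can be tried out in Python's re
module, used here as a stand-in for PCRE. Because atomic groups and possessive
quantifiers are only available in re from Python 3.11, this sketch emulates
one with a lookahead and a backreference; that emulation is a substitute
technique chosen for portability, not PCRE syntax you need to use.
</P>

```python
import re

# A lookahead captures the whole run of "a"s and \1 then consumes it. The
# matcher does not backtrack into a completed lookahead, so the run is
# never split up in different ways, much as with an atomic group.
atomic_like = re.compile(r"(?:(?=(a+))\1)*\d")

line_of_a = "a" * 40

# Fails quickly even though the subject is well over 20 characters long.
no_match = atomic_like.search(line_of_a)

# Still matches when a digit does follow the run of "a"s.
match = atomic_like.search("aaa1")
print(no_match, match.group())
```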
<br><b>
AUTHOR
</b><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><b>
REVISION
</b><br>
<P>
Last updated: 16 May 2010
<br>
Copyright &copy; 1997-2010 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>