      1 -----------------------------------------------------------------------------
      2 This file contains a concatenation of the PCRE man pages, converted to plain
      3 text format for ease of searching with a text editor, or for use on systems
      4 that do not have a man page processor. The small individual files that give
      5 synopses of each function in the library have not been included. Neither has
      6 the pcredemo program. There are separate text files for the pcregrep and
      7 pcretest commands.
      8 -----------------------------------------------------------------------------
      9 
     10 
     11 PCRE(3)                    Library Functions Manual                    PCRE(3)
     12 
     13 
     14 
     15 NAME
     16        PCRE - Perl-compatible regular expressions (original API)
     17 
     18 PLEASE TAKE NOTE
     19 
     20        This  document relates to PCRE releases that use the original API, with
     21        library names libpcre, libpcre16, and libpcre32. January 2015  saw  the
     22        first release of a new API, known as PCRE2, with release numbers start-
     23        ing  at  10.00  and  library   names   libpcre2-8,   libpcre2-16,   and
     24        libpcre2-32. The old libraries (now called PCRE1) are still being main-
     25        tained for bug fixes,  but  there  will  be  no  new  development.  New
     26        projects are advised to use the new PCRE2 libraries.
     27 
     28 
     29 INTRODUCTION
     30 
     31        The  PCRE  library is a set of functions that implement regular expres-
     32        sion pattern matching using the same syntax and semantics as Perl, with
     33        just  a few differences. Some features that appeared in Python and PCRE
     34        before they appeared in Perl are also available using the  Python  syn-
     35        tax,  there  is  some  support for one or two .NET and Oniguruma syntax
     36        items, and there is an option for requesting some  minor  changes  that
     37        give better JavaScript compatibility.
     38 
     39        Starting with release 8.30, it is possible to compile two separate PCRE
     40        libraries:  the  original,  which  supports  8-bit  character   strings
     41        (including  UTF-8  strings),  and a second library that supports 16-bit
     42        character strings (including UTF-16 strings). The build process  allows
     43        either  one  or both to be built. The majority of the work to make this
     44        possible was done by Zoltan Herczeg.
     45 
     46        Starting with release 8.32 it is possible to compile a  third  separate
     47        PCRE  library  that supports 32-bit character strings (including UTF-32
     48        strings). The build process allows any combination of the 8-,  16-  and
     49        32-bit  libraries. The work to make this possible was done by Christian
     50        Persch.
     51 
     52        The three libraries contain identical sets of  functions,  except  that
     53        the  names  in  the 16-bit library start with pcre16_ instead of pcre_,
     54        and the names in the 32-bit  library  start  with  pcre32_  instead  of
     55        pcre_.  To avoid over-complication and reduce the documentation mainte-
     56        nance load, most of the documentation describes the 8-bit library, with
     57        the  differences  for  the  16-bit and 32-bit libraries described sepa-
     58        rately in the pcre16 and  pcre32  pages.  References  to  functions  or
     59        structures  of  the  form  pcre[16|32]_xxx  should  be  read as meaning
     60        "pcre_xxx when using the  8-bit  library,  pcre16_xxx  when  using  the
     61        16-bit library, or pcre32_xxx when using the 32-bit library".
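
       The naming rule above can be illustrated with a small helper that
       builds the width-specific name from a generic suffix. This is only a
       sketch: sketch_api_name() is a hypothetical function that is not part
       of PCRE, and it produces a name as text rather than calling anything.

```c
#include <stdio.h>

/* Hypothetical helper (not part of PCRE): writes the library-specific name
   for a generic "pcre[16|32]_xxx" reference into buf. The width argument is
   the code unit width (8, 16 or 32) and suffix is the "xxx" part. Returns
   the length of the name, or -1 for an unknown width or a buffer that is
   too small. */
int sketch_api_name(char *buf, size_t size, int width, const char *suffix)
{
    const char *prefix;

    switch (width) {
        case 8:  prefix = "pcre_";   break;
        case 16: prefix = "pcre16_"; break;
        case 32: prefix = "pcre32_"; break;
        default: return -1;
    }

    int n = snprintf(buf, size, "%s%s", prefix, suffix);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

       For example, a width of 16 and the suffix "compile" give the name
       pcre16_compile.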
     62 
     63        The  current implementation of PCRE corresponds approximately with Perl
     64        5.12, including support for UTF-8/16/32  encoded  strings  and  Unicode
     65        general  category  properties. However, UTF-8/16/32 and Unicode support
     66        has to be explicitly enabled; it is not the default. The Unicode tables
     67        correspond to Unicode release 6.3.0.
     68 
     69        In  addition to the Perl-compatible matching function, PCRE contains an
     70        alternative function that matches the same compiled patterns in a  dif-
     71        ferent way. In certain circumstances, the alternative function has some
     72        advantages.  For a discussion of the two matching algorithms,  see  the
     73        pcrematching page.
     74 
     75        PCRE  is  written  in C and released as a C library. A number of people
     76        have written wrappers and interfaces of various kinds.  In  particular,
     77        Google  Inc.   have  provided a comprehensive C++ wrapper for the 8-bit
     78        library. This is now included as part of  the  PCRE  distribution.  The
     79        pcrecpp  page  has  details of this interface. Other people's contribu-
     80        tions can be found in the Contrib directory at the  primary  FTP  site,
     81        which is:
     82 
     83        ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
     84 
     85        Details  of  exactly which Perl regular expression features are and are
     86        not supported by PCRE are given in separate documents. See the pcrepat-
     87        tern  and pcrecompat pages. There is a syntax summary in the pcresyntax
     88        page.
     89 
     90        Some features of PCRE can be included, excluded, or  changed  when  the
     91        library  is  built.  The pcre_config() function makes it possible for a
     92        client to discover which features are  available.  The  features  them-
     93        selves  are described in the pcrebuild page. Documentation about build-
     94        ing PCRE for various operating systems can be found in the  README  and
     95        NON-AUTOTOOLS_BUILD files in the source distribution.
     96 
     97        The  libraries  contain a number of undocumented internal functions and
     98        data tables that are used by more than one  of  the  exported  external
     99        functions,  but  which  are  not  intended for use by external callers.
    100        Their names all begin with "_pcre_" or "_pcre16_" or "_pcre32_",  which
    101        hopefully  will  not provoke any name clashes. In some environments, it
    102        is possible to control which  external  symbols  are  exported  when  a
    103        shared  library  is  built, and in these cases the undocumented symbols
    104        are not exported.
    105 
    106 
    107 SECURITY CONSIDERATIONS
    108 
    109        If you are using PCRE in a non-UTF application that  permits  users  to
    110        supply  arbitrary  patterns  for  compilation, you should be aware of a
    111        feature that allows users to turn on UTF support from within a pattern,
    112        provided  that  PCRE  was built with UTF support. For example, an 8-bit
    113        pattern that begins with "(*UTF8)" or "(*UTF)"  turns  on  UTF-8  mode,
    114        which  interprets  patterns and subjects as strings of UTF-8 characters
    115        instead of individual 8-bit characters.  This causes both  the  pattern
    116        and any data against which it is matched to be checked for UTF-8 valid-
    117        ity. If the data string is very long, such a  check  might  use  suffi-
    118        ciently  many  resources  as  to cause your application to lose perfor-
    119        mance.
    120 
    121        One  way  of  guarding  against  this  possibility  is   to   use   the
    122        pcre_fullinfo()  function  to  check the compiled pattern's options for
    123        UTF.  Alternatively, from release 8.33, you can set the  PCRE_NEVER_UTF
    124        option  at  compile time. This causes a compile-time error if a pattern
    125        contains a UTF-setting sequence.
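
       To make "begins with" concrete, the following sketch shows an
       application-level pre-check on a pattern string. It is not a PCRE
       function and does not replace pcre_fullinfo() or PCRE_NEVER_UTF; the
       name sketch_pattern_requests_utf() is hypothetical. It assumes that a
       UTF-setting item may be preceded by other leading "(*NAME)" items, and
       it deliberately errs on the side of reporting too much.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical pre-check (not part of PCRE). Returns 1 if the leading run of
   "(*NAME)" items in an 8-bit pattern contains "(*UTF8)" or "(*UTF)", and 0
   otherwise. Only the start of the pattern is examined. */
int sketch_pattern_requests_utf(const char *pattern)
{
    static const char *const utf_names[] = { "UTF8", "UTF" };
    const char *p = pattern;

    while (p[0] == '(' && p[1] == '*') {
        const char *name = p + 2;
        const char *close = strchr(name, ')');
        size_t len, i;

        if (close == NULL) break;          /* unterminated item: stop */
        len = (size_t)(close - name);

        for (i = 0; i < sizeof utf_names / sizeof utf_names[0]; i++) {
            if (strlen(utf_names[i]) == len &&
                strncmp(name, utf_names[i], len) == 0)
                return 1;
        }
        p = close + 1;                     /* move on to the next item */
    }
    return 0;
}
```

       A check of this kind can give an early, readable rejection message,
       but the library mechanisms described above remain the authoritative
       test because they see the pattern exactly as PCRE parses it.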
    126 
    127        If your application is one that supports UTF, be  aware  that  validity
    128        checking  can  take time. If the same data string is to be matched many
    129        times, you can use the PCRE_NO_UTF[8|16|32]_CHECK option for the second
    130        and subsequent matches to save redundant checks.
    131 
    132        Another  way  that  performance can be hit is by running a pattern that
    133        has a very large search tree against a string that  will  never  match.
    134        Nested  unlimited  repeats in a pattern are a common example. PCRE pro-
    135        vides some protection against this: see the PCRE_EXTRA_MATCH_LIMIT fea-
    136        ture in the pcreapi page.
    137 
    138 
    139 USER DOCUMENTATION
    140 
    141        The  user  documentation  for PCRE comprises a number of different sec-
    142        tions. In the "man" format, each of these is a separate "man page".  In
    143        the  HTML  format, each is a separate page, linked from the index page.
    144        In the plain text format, the descriptions of the pcregrep and pcretest
    145        programs  are  in  files  called pcregrep.txt and pcretest.txt, respec-
    146        tively. The remaining sections, except for the pcredemo section  (which
    147        is  a  program  listing),  are  concatenated  in  pcre.txt, for ease of
    148        searching. The sections are as follows:
    149 
    150          pcre              this document
    151          pcre-config       show PCRE installation configuration information
    152          pcre16            details of the 16-bit library
    153          pcre32            details of the 32-bit library
    154          pcreapi           details of PCRE's native C API
    155          pcrebuild         building PCRE
    156          pcrecallout       details of the callout feature
    157          pcrecompat        discussion of Perl compatibility
    158          pcrecpp           details of the C++ wrapper for the 8-bit library
    159          pcredemo          a demonstration C program that uses PCRE
    160          pcregrep          description of the pcregrep command (8-bit only)
    161          pcrejit           discussion of the just-in-time optimization support
    162          pcrelimits        details of size and other limits
    163          pcrematching      discussion of the two matching algorithms
    164          pcrepartial       details of the partial matching facility
    165          pcrepattern       syntax and semantics of supported
    166                              regular expressions
    167          pcreperform       discussion of performance issues
    168          pcreposix         the POSIX-compatible C API for the 8-bit library
    169          pcreprecompile    details of saving and re-using precompiled patterns
    170          pcresample        discussion of the pcredemo program
    171          pcrestack         discussion of stack usage
    172          pcresyntax        quick syntax reference
    173          pcretest          description of the pcretest testing command
    174          pcreunicode       discussion of Unicode and UTF-8/16/32 support
    175 
    176        In the "man" and HTML formats, there is also a short page  for  each  C
    177        library function, listing its arguments and results.
    178 
    179 
    180 AUTHOR
    181 
    182        Philip Hazel
    183        University Computing Service
    184        Cambridge CB2 3QH, England.
    185 
    186        Putting  an actual email address here seems to have been a spam magnet,
    187        so I've taken it away. If you want to email me, use  my  two  initials,
    188        followed by the two digits 10, at the domain cam.ac.uk.
    189 
    190 
    191 REVISION
    192 
    193        Last updated: 10 February 2015
    194        Copyright (c) 1997-2015 University of Cambridge.
    195 ------------------------------------------------------------------------------
    196 
    197 
    198 PCRE(3)                    Library Functions Manual                    PCRE(3)
    199 
    200 
    201 
    202 NAME
    203        PCRE - Perl-compatible regular expressions
    204 
    205        #include <pcre.h>
    206 
    207 
    208 PCRE 16-BIT API BASIC FUNCTIONS
    209 
    210        pcre16 *pcre16_compile(PCRE_SPTR16 pattern, int options,
    211             const char **errptr, int *erroffset,
    212             const unsigned char *tableptr);
    213 
    214        pcre16 *pcre16_compile2(PCRE_SPTR16 pattern, int options,
    215             int *errorcodeptr,
    216             const char **errptr, int *erroffset,
    217             const unsigned char *tableptr);
    218 
    219        pcre16_extra *pcre16_study(const pcre16 *code, int options,
    220             const char **errptr);
    221 
    222        void pcre16_free_study(pcre16_extra *extra);
    223 
    224        int pcre16_exec(const pcre16 *code, const pcre16_extra *extra,
    225             PCRE_SPTR16 subject, int length, int startoffset,
    226             int options, int *ovector, int ovecsize);
    227 
    228        int pcre16_dfa_exec(const pcre16 *code, const pcre16_extra *extra,
    229             PCRE_SPTR16 subject, int length, int startoffset,
    230             int options, int *ovector, int ovecsize,
    231             int *workspace, int wscount);
    232 
    233 
    234 PCRE 16-BIT API STRING EXTRACTION FUNCTIONS
    235 
    236        int pcre16_copy_named_substring(const pcre16 *code,
    237             PCRE_SPTR16 subject, int *ovector,
    238             int stringcount, PCRE_SPTR16 stringname,
    239             PCRE_UCHAR16 *buffer, int buffersize);
    240 
    241        int pcre16_copy_substring(PCRE_SPTR16 subject, int *ovector,
    242             int stringcount, int stringnumber, PCRE_UCHAR16 *buffer,
    243             int buffersize);
    244 
    245        int pcre16_get_named_substring(const pcre16 *code,
    246             PCRE_SPTR16 subject, int *ovector,
    247             int stringcount, PCRE_SPTR16 stringname,
    248             PCRE_SPTR16 *stringptr);
    249 
    250        int pcre16_get_stringnumber(const pcre16 *code,
    251             PCRE_SPTR16 name);
    252 
    253        int pcre16_get_stringtable_entries(const pcre16 *code,
    254             PCRE_SPTR16 name, PCRE_UCHAR16 **first, PCRE_UCHAR16 **last);
    255 
    256        int pcre16_get_substring(PCRE_SPTR16 subject, int *ovector,
    257             int stringcount, int stringnumber,
    258             PCRE_SPTR16 *stringptr);
    259 
    260        int pcre16_get_substring_list(PCRE_SPTR16 subject,
    261             int *ovector, int stringcount, PCRE_SPTR16 **listptr);
    262 
    263        void pcre16_free_substring(PCRE_SPTR16 stringptr);
    264 
    265        void pcre16_free_substring_list(PCRE_SPTR16 *stringptr);
    266 
    267 
    268 PCRE 16-BIT API AUXILIARY FUNCTIONS
    269 
    270        pcre16_jit_stack *pcre16_jit_stack_alloc(int startsize, int maxsize);
    271 
    272        void pcre16_jit_stack_free(pcre16_jit_stack *stack);
    273 
    274        void pcre16_assign_jit_stack(pcre16_extra *extra,
    275             pcre16_jit_callback callback, void *data);
    276 
    277        const unsigned char *pcre16_maketables(void);
    278 
    279        int pcre16_fullinfo(const pcre16 *code, const pcre16_extra *extra,
    280             int what, void *where);
    281 
    282        int pcre16_refcount(pcre16 *code, int adjust);
    283 
    284        int pcre16_config(int what, void *where);
    285 
    286        const char *pcre16_version(void);
    287 
    288        int pcre16_pattern_to_host_byte_order(pcre16 *code,
    289             pcre16_extra *extra, const unsigned char *tables);
    290 
    291 
    292 PCRE 16-BIT API INDIRECTED FUNCTIONS
    293 
    294        void *(*pcre16_malloc)(size_t);
    295 
    296        void (*pcre16_free)(void *);
    297 
    298        void *(*pcre16_stack_malloc)(size_t);
    299 
    300        void (*pcre16_stack_free)(void *);
    301 
    302        int (*pcre16_callout)(pcre16_callout_block *);
    303 
    304 
    305 PCRE 16-BIT API 16-BIT-ONLY FUNCTION
    306 
    307        int pcre16_utf16_to_host_byte_order(PCRE_UCHAR16 *output,
    308             PCRE_SPTR16 input, int length, int *byte_order,
    309             int keep_boms);
    310 
    311 
    312 THE PCRE 16-BIT LIBRARY
    313 
    314        Starting  with  release  8.30, it is possible to compile a PCRE library
    315        that supports 16-bit character strings, including  UTF-16  strings,  as
    316        well  as  or instead of the original 8-bit library. The majority of the
    317        work to make  this  possible  was  done  by  Zoltan  Herczeg.  The  two
    318        libraries contain identical sets of functions, used in exactly the same
    319        way. Only the names of the functions and the data types of their  argu-
    320        ments  and results are different. To avoid over-complication and reduce
    321        the documentation maintenance load,  most  of  the  PCRE  documentation
    322        describes  the  8-bit  library,  with only occasional references to the
    323        16-bit library. This page describes what is different when you use  the
    324        16-bit library.
    325 
    326        WARNING:  A  single  application can be linked with both libraries, but
    327        you must take care when processing any particular pattern to use  func-
    328        tions  from  just one library. For example, if you want to study a pat-
    329        tern that was compiled with  pcre16_compile(),  you  must  do  so  with
    330        pcre16_study(), not pcre_study(), and you must free the study data with
    331        pcre16_free_study().
    332 
    333 
    334 THE HEADER FILE
    335 
    336        There is only one header file, pcre.h. It contains prototypes  for  all
    337        the functions in all libraries, as well as definitions of flags, struc-
    338        tures, error codes, etc.
    339 
    340 
    341 THE LIBRARY NAME
    342 
    343        In Unix-like systems, the 16-bit library is called libpcre16,  and  can
    344        normally  be  accessed by adding -lpcre16 to the command for linking an
    345        application that uses PCRE.
    346 
    347 
    348 STRING TYPES
    349 
    350        In the 8-bit library, strings are passed to PCRE library  functions  as
    351        vectors  of  bytes  with  the  C  type "char *". In the 16-bit library,
    352        strings are passed as vectors of unsigned 16-bit quantities. The  macro
    353        PCRE_UCHAR16  specifies  an  appropriate  data type, and PCRE_SPTR16 is
    354        defined as "const PCRE_UCHAR16 *". In very  many  environments,  "short
    355        int" is a 16-bit data type. When PCRE is built, it defines PCRE_UCHAR16
    356        as "unsigned short int", but checks that it really  is  a  16-bit  data
    357        type.  If  it is not, the build fails with an error message telling the
    358        maintainer to modify the definition appropriately.
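
       A width check of the kind described can be sketched with the
       preprocessor. This is an illustration only and is not claimed to be
       how the PCRE build performs the check; sketch_uchar16 and
       sketch_uchar16_bits() are hypothetical names.

```c
#include <limits.h>

/* Hypothetical stand-in for PCRE_UCHAR16. */
typedef unsigned short sketch_uchar16;

/* Stop the build if unsigned short cannot serve as a 16-bit code unit. */
#if USHRT_MAX != 0xffff
#error "unsigned short is not 16 bits wide: change the sketch_uchar16 typedef"
#endif

/* Reports the storage width of the code unit type in bits. */
int sketch_uchar16_bits(void)
{
    return (int)(sizeof(sketch_uchar16) * CHAR_BIT);
}
```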
    359 
    360 
    361 STRUCTURE TYPES
    362 
    363        The types of the opaque structures that are used  for  compiled  16-bit
    364        patterns  and  JIT stacks are pcre16 and pcre16_jit_stack respectively.
    365        The  type  of  the  user-accessible  structure  that  is  returned   by
    366        pcre16_study()  is  pcre16_extra, and the type of the structure that is
    367        used for passing data to a callout  function  is  pcre16_callout_block.
    368        These structures contain the same fields, with the same names, as their
    369        8-bit counterparts. The only difference is that pointers  to  character
    370        strings are 16-bit instead of 8-bit types.
    371 
    372 
    373 16-BIT FUNCTIONS
    374 
    375        For  every function in the 8-bit library there is a corresponding func-
    376        tion in the 16-bit library with a name that starts with pcre16_ instead
    377        of  pcre_.  The  prototypes are listed above. In addition, there is one
    378        extra function, pcre16_utf16_to_host_byte_order(). This  is  a  utility
    379        function  that converts a UTF-16 character string to host byte order if
    380        necessary. The other 16-bit  functions  expect  the  strings  they  are
    381        passed to be in host byte order.
    382 
    383        The input and output arguments of pcre16_utf16_to_host_byte_order() may
    384        point to the same address, that is, conversion in place  is  supported.
    385        The output buffer must be at least as long as the input.
    386 
    387        The  length  argument  specifies the number of 16-bit data units in the
    388        input string; a negative value specifies a zero-terminated string.
    389 
    390        If byte_order is NULL, it is assumed that the string starts off in host
    391        byte  order. This may be changed by byte-order marks (BOMs) anywhere in
    392        the string (commonly as the first character).
    393 
    394        If byte_order is not NULL, a non-zero value of the integer to which  it
    395        points  means  that  the input starts off in host byte order, otherwise
    396        the opposite order is assumed. Again, BOMs in  the  string  can  change
    397        this. The final byte order is passed back at the end of processing.
    398 
    399        If  keep_boms  is  not  zero,  byte-order  mark characters (0xfeff) are
    400        copied into the output string. Otherwise they are discarded.
    401 
    402        The result of the function is the number of 16-bit  units  placed  into
    403        the  output  buffer,  including  the  zero terminator if the string was
    404        zero-terminated.
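
       The behaviour described above can be summarized in a short sketch.
       The function below is a hypothetical stand-in written from the prose
       description only; it is not the library's source code, and it uses
       uint16_t in place of PCRE_UCHAR16.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for pcre16_utf16_to_host_byte_order(). */
int sketch_utf16_to_host(uint16_t *output, const uint16_t *input, int length,
                         int *byte_order, int keep_boms)
{
    /* A NULL byte_order means "assume host order at the start". */
    int host_order = (byte_order == NULL) ? 1 : (*byte_order != 0);
    int out = 0;
    int i;

    if (length < 0) {
        /* Zero-terminated input: count the units, including the terminator.
           A zero unit reads the same in either byte order. */
        length = 0;
        while (input[length] != 0) length++;
        length++;
    }

    for (i = 0; i < length; i++) {
        uint16_t c = input[i];

        if (c == 0xFEFF || c == 0xFFFE) {
            /* A BOM read as 0xfeff is in host order; 0xfffe is swapped. */
            host_order = (c == 0xFEFF);
            if (keep_boms) output[out++] = 0xFEFF;
            continue;
        }
        output[out++] = host_order ? c : (uint16_t)((c << 8) | (c >> 8));
    }

    if (byte_order != NULL) *byte_order = host_order;
    return out;   /* number of 16-bit units written */
}
```

       Because the write position never overtakes the read position, the
       sketch also works when output and input point to the same buffer.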
    405 
    406 
    407 SUBJECT STRING OFFSETS
    408 
    409        The lengths and starting offsets of subject strings must  be  specified
    410        in  16-bit  data units, and the offsets within subject strings that are
    411        returned by the matching functions are also in 16-bit units rather than
    412        bytes.
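
       The following sketch shows the unit arithmetic involved. It assumes,
       as described in the pcreapi page, that the first two elements of the
       offsets vector hold the start and end of the whole match; the function
       names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Size in bytes of a match whose start and end offsets (in 16-bit units)
   are in ovector[0] and ovector[1]. */
size_t sketch_match_bytes(const int *ovector)
{
    size_t units = (size_t)(ovector[1] - ovector[0]);
    return units * sizeof(uint16_t);
}

/* Pointer to the first unit of the match. Pointer arithmetic on a 16-bit
   type already counts in units, so no scaling is needed here. */
const uint16_t *sketch_match_start(const uint16_t *subject, const int *ovector)
{
    return subject + ovector[0];
}
```

       Scaling by the unit size is needed only when an offset is handed to
       byte-oriented code such as memcpy().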
    413 
    414 
    415 NAMED SUBPATTERNS
    416 
    417        The  name-to-number translation table that is maintained for named sub-
    418        patterns uses 16-bit characters.  The  pcre16_get_stringtable_entries()
    419        function returns the length of each entry in the table as the number of
    420        16-bit data units.
    421 
    422 
    423 OPTION NAMES
    424 
    425        There   are   two   new   general   option   names,   PCRE_UTF16    and
    426        PCRE_NO_UTF16_CHECK,     which     correspond    to    PCRE_UTF8    and
    427        PCRE_NO_UTF8_CHECK in the 8-bit library. In  fact,  these  new  options
    428        define  the  same bits in the options word. There is a discussion about
    429        the validity of UTF-16 strings in the pcreunicode page.
    430 
    431        For the pcre16_config() function there is an  option  PCRE_CONFIG_UTF16
    432        that  returns  1  if UTF-16 support is configured, otherwise 0. If this
    433        option  is  given  to  pcre_config()  or  pcre32_config(),  or  if  the
    434        PCRE_CONFIG_UTF8  or  PCRE_CONFIG_UTF32  option is given to pcre16_con-
    435        fig(), the result is the PCRE_ERROR_BADOPTION error.
    436 
    437 
    438 CHARACTER CODES
    439 
    440        In 16-bit mode, when  PCRE_UTF16  is  not  set,  character  values  are
    441        treated in the same way as in 8-bit, non UTF-8 mode, except, of course,
    442        that they can range from 0 to 0xffff instead of 0  to  0xff.  Character
    443        types  for characters less than 0xff can therefore be influenced by the
    444        locale in the same way as before.  Characters greater  than  0xff  have
    445        only one case, and no "type" (such as letter or digit).
    446 
    447        In  UTF-16  mode,  the  character  code  is  Unicode, in the range 0 to
    448        0x10ffff, with the exception of values in the range  0xd800  to  0xdfff
    449        because  those  are "surrogate" values that are used in pairs to encode
    450        values greater than 0xffff.
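
       As an illustration of how surrogate pairs encode values greater than
       0xffff, here is a minimal sketch of the standard UTF-16 encoding
       step. The function name is hypothetical and the code is not taken
       from PCRE.

```c
#include <stdint.h>

/* Encodes one Unicode code point as UTF-16 in host byte order. Returns the
   number of 16-bit units written (1 or 2), or 0 if cp is a surrogate value
   or lies above 0x10ffff. */
int sketch_utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;

    if (cp <= 0xFFFF) {
        out[0] = (uint16_t)cp;
        return 1;
    }

    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: top 10 */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: low 10 */
    return 2;
}
```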
    451 
    452        A UTF-16 string can indicate its endianness by a special code known as a
    453        byte-order mark (BOM). The PCRE functions do not handle this, expecting
    454        strings  to  be  in  host  byte  order.  A  utility   function   called
    455        pcre16_utf16_to_host_byte_order()  is  provided  to help with this (see
    456        above).
    457 
    458 
    459 ERROR NAMES
    460 
    461        The errors PCRE_ERROR_BADUTF16_OFFSET and PCRE_ERROR_SHORTUTF16  corre-
    462        spond  to  their  8-bit  counterparts.  The error PCRE_ERROR_BADMODE is
    463        given when a compiled pattern is passed to a  function  that  processes
    464        patterns  in  the  other  mode, for example, if a pattern compiled with
    465        pcre_compile() is passed to pcre16_exec().
    466 
    467        There are new error codes whose names  begin  with  PCRE_UTF16_ERR  for
    468        invalid  UTF-16  strings,  corresponding to the PCRE_UTF8_ERR codes for
    469        UTF-8 strings that are described in the section entitled "Reason  codes
    470        for  invalid UTF-8 strings" in the main pcreapi page. The UTF-16 errors
    471        are:
    472 
    473          PCRE_UTF16_ERR1  Missing low surrogate at end of string
    474          PCRE_UTF16_ERR2  Invalid low surrogate follows high surrogate
    475          PCRE_UTF16_ERR3  Isolated low surrogate
    476          PCRE_UTF16_ERR4  Non-character
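
       The three surrogate-related conditions in this list can be sketched
       as a simple scan. The function and the SKETCH_ constants are
       hypothetical and their values are illustrative; the non-character
       case is left out because this page does not say which values it
       covers.

```c
#include <stddef.h>
#include <stdint.h>

enum {
    SKETCH_UTF16_OK   = 0,
    SKETCH_UTF16_ERR1 = 1,  /* missing low surrogate at end of string */
    SKETCH_UTF16_ERR2 = 2,  /* invalid low surrogate follows high surrogate */
    SKETCH_UTF16_ERR3 = 3   /* isolated low surrogate */
};

/* Scans n 16-bit units. On failure, returns the condition found and, if
   erroffset is not NULL, stores the offset of the offending unit. */
int sketch_check_utf16(const uint16_t *s, size_t n, size_t *erroffset)
{
    size_t i;

    for (i = 0; i < n; i++) {
        uint16_t c = s[i];
        int err = SKETCH_UTF16_OK;

        if ((c & 0xF800) != 0xD800) continue;   /* not a surrogate */

        if ((c & 0x0400) != 0)
            err = SKETCH_UTF16_ERR3;            /* low surrogate came first */
        else if (i + 1 == n)
            err = SKETCH_UTF16_ERR1;            /* high surrogate at the end */
        else if ((s[i + 1] & 0xFC00) != 0xDC00)
            err = SKETCH_UTF16_ERR2;            /* high not followed by low */

        if (err != SKETCH_UTF16_OK) {
            if (erroffset != NULL) *erroffset = i;
            return err;
        }
        i++;                                    /* skip the low surrogate */
    }
    return SKETCH_UTF16_OK;
}
```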
    477 
    478 
    479 ERROR TEXTS
    480 
    481        If there is an error while compiling a pattern, the error text that  is
    482        passed  back by pcre16_compile() or pcre16_compile2() is still an 8-bit
    483        character string, zero-terminated.
    484 
    485 
    486 CALLOUTS
    487 
    488        The subject and mark fields in the callout block that is  passed  to  a
    489        callout function point to 16-bit vectors.
    490 
    491 
    492 TESTING
    493 
    494        The  pcretest  program continues to operate with 8-bit input and output
    495        files, but it can be used for testing the 16-bit library. If it is  run
    496        with the command line option -16, patterns and subject strings are con-
    497        verted from 8-bit to 16-bit before being passed to PCRE, and the 16-bit
    498        library  functions  are used instead of the 8-bit ones. Returned 16-bit
    499        strings are converted to 8-bit for output. If neither  the  8-bit  nor
    500        the  32-bit  library was compiled, pcretest defaults to 16-bit and the
    501        -16 option is ignored.
    502 
    503        When PCRE is being built, the RunTest script that is  called  by  "make
    504        check"  uses  the  pcretest  -C  option to discover which of the 8-bit,
    505        16-bit and 32-bit libraries has been built, and runs the  tests  appro-
    506        priately.
    507 
    508 
    509 NOT SUPPORTED IN 16-BIT MODE
    510 
    511        Not all the features of the 8-bit library are available with the 16-bit
    512        library. The C++ and POSIX wrapper functions  support  only  the  8-bit
    513        library, and the pcregrep program is at present 8-bit only.
    514 
    515 
    516 AUTHOR
    517 
    518        Philip Hazel
    519        University Computing Service
    520        Cambridge CB2 3QH, England.
    521 
    522 
    523 REVISION
    524 
    525        Last updated: 12 May 2013
    526        Copyright (c) 1997-2013 University of Cambridge.
    527 ------------------------------------------------------------------------------
    528 
    529 
    530 PCRE(3)                    Library Functions Manual                    PCRE(3)
    531 
    532 
    533 
    534 NAME
    535        PCRE - Perl-compatible regular expressions
    536 
    537        #include <pcre.h>
    538 
    539 
    540 PCRE 32-BIT API BASIC FUNCTIONS
    541 
    542        pcre32 *pcre32_compile(PCRE_SPTR32 pattern, int options,
    543             const char **errptr, int *erroffset,
    544             const unsigned char *tableptr);
    545 
    546        pcre32 *pcre32_compile2(PCRE_SPTR32 pattern, int options,
    547             int *errorcodeptr, const char **errptr, int *erroffset,
    548             const unsigned char *tableptr);
    549 
    550        pcre32_extra *pcre32_study(const pcre32 *code, int options,
    551             const char **errptr);
    552 
    553        void pcre32_free_study(pcre32_extra *extra);
    554 
    555        int pcre32_exec(const pcre32 *code, const pcre32_extra *extra,
    556             PCRE_SPTR32 subject, int length, int startoffset,
    557             int options, int *ovector, int ovecsize);
    558 
    559        int pcre32_dfa_exec(const pcre32 *code, const pcre32_extra *extra,
    560             PCRE_SPTR32 subject, int length, int startoffset,
    561             int options, int *ovector, int ovecsize,
    562             int *workspace, int wscount);
    563 
    564 
    565 PCRE 32-BIT API STRING EXTRACTION FUNCTIONS
    566 
    567        int pcre32_copy_named_substring(const pcre32 *code,
    568             PCRE_SPTR32 subject, int *ovector,
    569             int stringcount, PCRE_SPTR32 stringname,
    570             PCRE_UCHAR32 *buffer, int buffersize);
    571 
    572        int pcre32_copy_substring(PCRE_SPTR32 subject, int *ovector,
    573             int stringcount, int stringnumber, PCRE_UCHAR32 *buffer,
    574             int buffersize);
    575 
    576        int pcre32_get_named_substring(const pcre32 *code,
    577             PCRE_SPTR32 subject, int *ovector,
    578             int stringcount, PCRE_SPTR32 stringname,
    579             PCRE_SPTR32 *stringptr);
    580 
    581        int pcre32_get_stringnumber(const pcre32 *code,
    582             PCRE_SPTR32 name);
    583 
    584        int pcre32_get_stringtable_entries(const pcre32 *code,
    585             PCRE_SPTR32 name, PCRE_UCHAR32 **first, PCRE_UCHAR32 **last);
    586 
    587        int pcre32_get_substring(PCRE_SPTR32 subject, int *ovector,
    588             int stringcount, int stringnumber,
    589             PCRE_SPTR32 *stringptr);
    590 
    591        int pcre32_get_substring_list(PCRE_SPTR32 subject,
    592             int *ovector, int stringcount, PCRE_SPTR32 **listptr);
    593 
    594        void pcre32_free_substring(PCRE_SPTR32 stringptr);
    595 
    596        void pcre32_free_substring_list(PCRE_SPTR32 *stringptr);
    597 
    598 
    599 PCRE 32-BIT API AUXILIARY FUNCTIONS
    600 
    601        pcre32_jit_stack *pcre32_jit_stack_alloc(int startsize, int maxsize);
    602 
    603        void pcre32_jit_stack_free(pcre32_jit_stack *stack);
    604 
    605        void pcre32_assign_jit_stack(pcre32_extra *extra,
    606             pcre32_jit_callback callback, void *data);
    607 
    608        const unsigned char *pcre32_maketables(void);
    609 
    610        int pcre32_fullinfo(const pcre32 *code, const pcre32_extra *extra,
    611             int what, void *where);
    612 
    613        int pcre32_refcount(pcre32 *code, int adjust);
    614 
    615        int pcre32_config(int what, void *where);
    616 
    617        const char *pcre32_version(void);
    618 
    619        int pcre32_pattern_to_host_byte_order(pcre32 *code,
    620             pcre32_extra *extra, const unsigned char *tables);
    621 
    622 
    623 PCRE 32-BIT API INDIRECTED FUNCTIONS
    624 
    625        void *(*pcre32_malloc)(size_t);
    626 
    627        void (*pcre32_free)(void *);
    628 
    629        void *(*pcre32_stack_malloc)(size_t);
    630 
    631        void (*pcre32_stack_free)(void *);
    632 
    633        int (*pcre32_callout)(pcre32_callout_block *);
    634 
    635 
    636 PCRE 32-BIT API 32-BIT-ONLY FUNCTION
    637 
    638        int pcre32_utf32_to_host_byte_order(PCRE_UCHAR32 *output,
    639             PCRE_SPTR32 input, int length, int *byte_order,
    640             int keep_boms);
    641 
    642 
    643 THE PCRE 32-BIT LIBRARY
    644 
    645        Starting  with  release  8.32, it is possible to compile a PCRE library
    646        that supports 32-bit character strings, including  UTF-32  strings,  as
    647        well as or instead of the original 8-bit library. This work was done by
    648        Christian Persch, based on the work done  by  Zoltan  Herczeg  for  the
    649        16-bit  library.  All  three  libraries contain identical sets of func-
    650        tions, used in exactly the same way.  Only the names of  the  functions
    651        and  the  data  types  of their arguments and results are different. To
    652        avoid over-complication and reduce the documentation maintenance  load,
    653        most  of  the PCRE documentation describes the 8-bit library, with only
    654        occasional references to the 16-bit and  32-bit  libraries.  This  page
    655        describes what is different when you use the 32-bit library.
    656 
    657        WARNING:  A  single  application  can  be linked with all or any of the
    658        three libraries, but you must take care when processing any  particular
    659        pattern  to  use  functions  from just one library. For example, if you
    660        want to study a pattern that was compiled  with  pcre32_compile(),  you
    661        must do so with pcre32_study(), not pcre_study(), and you must free the
    662        study data with pcre32_free_study().
    663 
    664 
    665 THE HEADER FILE
    666 
    667        There is only one header file, pcre.h. It contains prototypes  for  all
    668        the functions in all libraries, as well as definitions of flags, struc-
    669        tures, error codes, etc.
    670 
    671 
    672 THE LIBRARY NAME
    673 
    674        In Unix-like systems, the 32-bit library is called libpcre32,  and  can
    675        normally  be  accessed by adding -lpcre32 to the command for linking an
    676        application that uses PCRE.
    677 
    678 
    679 STRING TYPES
    680 
    681        In the 8-bit library, strings are passed to PCRE library  functions  as
    682        vectors  of  bytes  with  the  C  type "char *". In the 32-bit library,
    683        strings are passed as vectors of unsigned 32-bit quantities. The  macro
    684        PCRE_UCHAR32  specifies  an  appropriate  data type, and PCRE_SPTR32 is
    685        defined as "const PCRE_UCHAR32 *". In very many environments, "unsigned
    686        int" is a 32-bit data type. When PCRE is built, it defines PCRE_UCHAR32
    687        as "unsigned int", but checks that it really is a 32-bit data type.  If
    688        it is not, the build fails with an error message telling the maintainer
    689        to modify the definition appropriately.
    690 
    691 
    692 STRUCTURE TYPES
    693 
    694        The types of the opaque structures that are used  for  compiled  32-bit
    695        patterns  and  JIT stacks are pcre32 and pcre32_jit_stack respectively.
    696        The  type  of  the  user-accessible  structure  that  is  returned   by
    697        pcre32_study()  is  pcre32_extra, and the type of the structure that is
    698        used for passing data to a callout  function  is  pcre32_callout_block.
    699        These structures contain the same fields, with the same names, as their
    700        8-bit counterparts. The only difference is that pointers  to  character
    701        strings are 32-bit instead of 8-bit types.
    702 
    703 
    704 32-BIT FUNCTIONS
    705 
    706        For  every function in the 8-bit library there is a corresponding func-
    707        tion in the 32-bit library with a name that starts with pcre32_ instead
    708        of  pcre_.  The  prototypes are listed above. In addition, there is one
    709        extra function, pcre32_utf32_to_host_byte_order(). This  is  a  utility
    710        function  that converts a UTF-32 character string to host byte order if
    711        necessary. The other 32-bit  functions  expect  the  strings  they  are
    712        passed to be in host byte order.
    713 
    714        The input and output arguments of pcre32_utf32_to_host_byte_order() may
    715        point to the same address, that is, conversion in place  is  supported.
    716        The output buffer must be at least as long as the input.
    717 
    718        The  length  argument  specifies the number of 32-bit data units in the
    719        input string; a negative value specifies a zero-terminated string.
    720 
    721        If byte_order is NULL, it is assumed that the string starts off in host
    722        byte  order. This may be changed by byte-order marks (BOMs) anywhere in
    723        the string (commonly as the first character).
    724 
    725        If byte_order is not NULL, a non-zero value of the integer to which  it
    726        points  means  that  the input starts off in host byte order, otherwise
    727        the opposite order is assumed. Again, BOMs in  the  string  can  change
    728        this. The final byte order is passed back at the end of processing.
    729 
    730        If  keep_boms  is  not  zero,  byte-order  mark characters (0xfeff) are
    731        copied into the output string. Otherwise they are discarded.
    732 
    733        The result of the function is the number of 32-bit  units  placed  into
    734        the  output  buffer,  including  the  zero terminator if the string was
    735        zero-terminated.
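
               The rules above can be sketched as a small stand-alone routine.
               This is only an illustration of the described behaviour, not the
               library's source: the names sketch_swap32() and
               sketch_utf32_to_host() are hypothetical, and uint32_t is assumed
               as a stand-in for PCRE_UCHAR32.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the four bytes of a 32-bit unit. */
uint32_t sketch_swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Hypothetical stand-in that follows the documented behaviour of
   pcre32_utf32_to_host_byte_order(); it is not PCRE's own code. */
int sketch_utf32_to_host(uint32_t *output, const uint32_t *input,
                         int length, int *byte_order, int keep_boms)
{
    /* NULL byte_order: assume the string starts in host byte order. */
    int host = (byte_order == NULL) ? 1 : (*byte_order != 0);
    int written = 0;
    int i;

    if (length < 0) {
        /* Zero-terminated input: count the units, terminator included. */
        length = 0;
        while (input[length] != 0) length++;
        length++;
    }

    for (i = 0; i < length; i++) {
        uint32_t c = input[i];
        if (c == 0x0000feffu || c == 0xfffe0000u) {
            /* A BOM read as 0xfeff means the data is in host order;
               read as 0xfffe0000 it means the opposite order. */
            host = (c == 0x0000feffu);
            if (keep_boms) output[written++] = 0x0000feffu;
        } else {
            output[written++] = host ? c : sketch_swap32(c);
        }
    }

    /* Pass the final byte order back to the caller. */
    if (byte_order != NULL) *byte_order = host;
    return written;
}
```

               Because the write index never overtakes the read index, the
               sketch also works when output and input point to the same
               buffer. In a real program the library function itself would be
               called with the same five arguments.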
    736 
    737 
    738 SUBJECT STRING OFFSETS
    739 
    740        The lengths and starting offsets of subject strings must  be  specified
    741        in  32-bit  data units, and the offsets within subject strings that are
    742        returned by the matching functions are also in 32-bit units rather than
    743        bytes.
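
               As an illustration of working in 32-bit units, the following
               sketch copies the matched text out of a subject, given the first
               pair of offsets in an output vector (which, as the pcreapi page
               describes, holds the start and end of the whole match). The
               helper names are hypothetical, and uint32_t is assumed as a
               stand-in for PCRE_UCHAR32.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert a count of 32-bit units to a byte count, for example when
   sizing a buffer for malloc() or memcpy(). */
size_t sketch_units_to_bytes(int units)
{
    return (size_t)units * sizeof(uint32_t);
}

/* Copy the units between ovector[0] and ovector[1] into dest.
   Returns the number of units copied, or -1 if the offsets are unset
   or the destination is too small. */
int sketch_copy_match(const uint32_t *subject, const int *ovector,
                      uint32_t *dest, int destsize)
{
    int start = ovector[0];
    int count = ovector[1] - ovector[0];
    int i;

    if (start < 0 || count < 0 || count > destsize) return -1;
    for (i = 0; i < count; i++) dest[i] = subject[start + i];
    return count;
}
```

               Note that the offsets index the subject directly as an array of
               units; they are multiplied by the unit size only when a byte
               count is needed.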
    744 
    745 
    746 NAMED SUBPATTERNS
    747 
    748        The  name-to-number translation table that is maintained for named sub-
    749        patterns uses 32-bit characters.  The  pcre32_get_stringtable_entries()
    750        function returns the length of each entry in the table as the number of
    751        32-bit data units.
    752 
    753 
    754 OPTION NAMES
    755 
    756        There   are   two   new   general   option   names,   PCRE_UTF32    and
    757        PCRE_NO_UTF32_CHECK,     which     correspond    to    PCRE_UTF8    and
    758        PCRE_NO_UTF8_CHECK in the 8-bit library. In  fact,  these  new  options
    759        define  the  same bits in the options word. There is a discussion about
    760        the validity of UTF-32 strings in the pcreunicode page.
    761 
    762        For the pcre32_config() function there is an  option  PCRE_CONFIG_UTF32
    763        that  returns  1  if UTF-32 support is configured, otherwise 0. If this
    764        option  is  given  to  pcre_config()  or  pcre16_config(),  or  if  the
    765        PCRE_CONFIG_UTF8  or  PCRE_CONFIG_UTF16  option is given to pcre32_con-
    766        fig(), the result is the PCRE_ERROR_BADOPTION error.
    767 
    768 
    769 CHARACTER CODES
    770 
    771        In 32-bit mode, when  PCRE_UTF32  is  not  set,  character  values  are
    772        treated in the same way as in 8-bit, non UTF-8 mode, except, of course,
    773        that they can range from 0 to 0x7fffffff instead of 0 to 0xff.  Charac-
    774        ter  types for characters less than 0xff can therefore be influenced by
    775        the locale in the same way as before.   Characters  greater  than  0xff
    776        have only one case, and no "type" (such as letter or digit).
    777 
    778        In  UTF-32  mode,  the  character  code  is  Unicode, in the range 0 to
    779        0x10ffff, with the exception of values in the range  0xd800  to  0xdfff
    780        because those are "surrogate" values that are ill-formed in UTF-32.
    781 
    782        A UTF-32 string can indicate its endianness by a special code known as a
    783        byte-order mark (BOM). The PCRE functions do not handle this, expecting
    784        strings   to   be  in  host  byte  order.  A  utility  function  called
    785        pcre32_utf32_to_host_byte_order() is provided to help  with  this  (see
    786        above).
    787 
    788 
    789 ERROR NAMES
    790 
    791        The  error  PCRE_ERROR_BADUTF32  corresponds  to its 8-bit counterpart.
    792        The error PCRE_ERROR_BADMODE is given when a compiled pattern is passed
    793        to  a  function that processes patterns in the other mode, for example,
    794        if a pattern compiled with pcre_compile() is passed to pcre32_exec().
    795 
    796        There are new error codes whose names  begin  with  PCRE_UTF32_ERR  for
    797        invalid  UTF-32  strings,  corresponding to the PCRE_UTF8_ERR codes for
    798        UTF-8 strings that are described in the section entitled "Reason  codes
    799        for  invalid UTF-8 strings" in the main pcreapi page. The UTF-32 errors
    800        are:
    801 
    802          PCRE_UTF32_ERR1  Surrogate character (range from 0xd800 to 0xdfff)
    803          PCRE_UTF32_ERR2  Non-character
    804          PCRE_UTF32_ERR3  Character > 0x10ffff
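
               The two range checks can be sketched as follows. The names and
               the result codes are hypothetical and are not the PCRE_UTF32_ERR
               values from pcre.h. The non-character case is left out because
               this page does not say which code points it covers.

```c
#include <stdint.h>

/* Hypothetical result codes for this sketch only. */
enum sketch_utf32_status {
    SKETCH_UTF32_OK,
    SKETCH_UTF32_SURROGATE,   /* 0xd800 to 0xdfff */
    SKETCH_UTF32_TOO_LARGE    /* greater than 0x10ffff */
};

/* Classify a single 32-bit unit. */
enum sketch_utf32_status sketch_check_unit(uint32_t c)
{
    if (c >= 0xd800u && c <= 0xdfffu) return SKETCH_UTF32_SURROGATE;
    if (c > 0x10ffffu) return SKETCH_UTF32_TOO_LARGE;
    return SKETCH_UTF32_OK;
}

/* Return the index of the first unit that fails the checks above,
   or -1 if every unit in the string passes. */
int sketch_first_invalid(const uint32_t *s, int length)
{
    int i;
    for (i = 0; i < length; i++) {
        if (sketch_check_unit(s[i]) != SKETCH_UTF32_OK) return i;
    }
    return -1;
}
```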
    805 
    806 
    807 ERROR TEXTS
    808 
    809        If there is an error while compiling a pattern, the error text that  is
    810        passed  back by pcre32_compile() or pcre32_compile2() is still an 8-bit
    811        character string, zero-terminated.
    812 
    813 
    814 CALLOUTS
    815 
    816        The subject and mark fields in the callout block that is  passed  to  a
    817        callout function point to 32-bit vectors.
    818 
    819 
    820 TESTING
    821 
    822        The  pcretest  program continues to operate with 8-bit input and output
    823        files, but it can be used for testing the 32-bit library. If it is  run
    824        with the command line option -32, patterns and subject strings are con-
    825        verted from 8-bit to 32-bit before being passed to PCRE, and the 32-bit
    826        library  functions  are used instead of the 8-bit ones. Returned 32-bit
    827        strings are converted to 8-bit for output. If neither the 8-bit nor the
    828        16-bit  library  was compiled, pcretest defaults to 32-bit mode and the
    829        -32 option is ignored.
    830 
    831        When PCRE is being built, the RunTest script that is  called  by  "make
    832        check"  uses  the  pcretest  -C  option to discover which of the 8-bit,
    833        16-bit and 32-bit libraries has been built, and runs the  tests  appro-
    834        priately.
    835 
    836 
    837 NOT SUPPORTED IN 32-BIT MODE
    838 
    839        Not all the features of the 8-bit library are available with the 32-bit
    840        library. The C++ and POSIX wrapper functions  support  only  the  8-bit
    841        library, and the pcregrep program is at present 8-bit only.
    842 
    843 
    844 AUTHOR
    845 
    846        Philip Hazel
    847        University Computing Service
    848        Cambridge CB2 3QH, England.
    849 
    850 
    851 REVISION
    852 
    853        Last updated: 12 May 2013
    854        Copyright (c) 1997-2013 University of Cambridge.
    855 ------------------------------------------------------------------------------
    856 
    857 
    858 PCREBUILD(3)               Library Functions Manual               PCREBUILD(3)
    859 
    860 
    861 
    862 NAME
    863        PCRE - Perl-compatible regular expressions
    864 
    865 BUILDING PCRE
    866 
    867        PCRE  is  distributed with a configure script that can be used to build
    868        the library in Unix-like environments using the applications  known  as
    869        Autotools.   Also  in  the  distribution  are files to support building
    870        using CMake instead of configure. The text file README contains general
    871        information  about  building  with Autotools (some of which is repeated
    872        below), and also has some comments about building on various  operating
    873        systems.  There  is  a lot more information about building PCRE without
    874        using Autotools (including information about using CMake  and  building
    875        "by  hand")  in  the  text file called NON-AUTOTOOLS-BUILD.  You should
    876        consult this file as well as the README file if you are building  in  a
    877        non-Unix-like environment.
    878 
    879 
    880 PCRE BUILD-TIME OPTIONS
    881 
    882        The  rest of this document describes the optional features of PCRE that
    883        can be selected when the library is compiled. It  assumes  use  of  the
    884        configure  script,  where  the  optional features are selected or dese-
    885        lected by providing options to configure before running the  make  com-
    886        mand.  However,  the same options can be selected in both Unix-like and
    887        non-Unix-like environments using the GUI facility of cmake-gui  if  you
    888        are using CMake instead of configure to build PCRE.
    889 
    890        If  you  are not using Autotools or CMake, option selection can be done
    891        by editing the config.h file, or by passing parameter settings  to  the
    892        compiler, as described in NON-AUTOTOOLS-BUILD.
    893 
    894        The complete list of options for configure (which includes the standard
    895        ones such as the  selection  of  the  installation  directory)  can  be
    896        obtained by running
    897 
    898          ./configure --help
    899 
    900        The  following  sections  include  descriptions  of options whose names
    901        begin with --enable or --disable. These settings specify changes to the
    902        defaults  for  the configure command. Because of the way that configure
    903        works, --enable and --disable always come in pairs, so  the  complemen-
    904        tary  option always exists as well, but as it specifies the default, it
    905        is not described.
    906 
    907 
    908 BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
    909 
    910        By default, a library called libpcre  is  built,  containing  functions
    911        that  take  string  arguments  contained in vectors of bytes, either as
    912        single-byte characters, or interpreted as UTF-8 strings. You  can  also
    913        build  a  separate library, called libpcre16, in which strings are con-
    914        tained in vectors of 16-bit data units and interpreted either  as  sin-
    915        gle-unit characters or UTF-16 strings, by adding
    916 
    917          --enable-pcre16
    918 
    919        to  the  configure  command.  You  can  also build yet another separate
    920        library, called libpcre32, in which strings are contained in vectors of
    921        32-bit  data  units and interpreted either as single-unit characters or
    922        UTF-32 strings, by adding
    923 
    924          --enable-pcre32
    925 
    926        to the configure command. If you do not want the 8-bit library, add
    927 
    928          --disable-pcre8
    929 
    930        as well. At least one of the three libraries must be built.  Note  that
    931        the  C++  and  POSIX  wrappers are for the 8-bit library only, and that
    932        pcregrep is an 8-bit program. None of these are  built  if  you  select
    933        only the 16-bit or 32-bit libraries.
    934 
    935 
    936 BUILDING SHARED AND STATIC LIBRARIES
    937 
    938        The  Autotools  PCRE building process uses libtool to build both shared
    939        and static libraries by default. You  can  suppress  one  of  these  by
    940        adding one of
    941 
    942          --disable-shared
    943          --disable-static
    944 
    945        to the configure command, as required.
    946 
    947 
    948 C++ SUPPORT
    949 
    950        By  default,  if the 8-bit library is being built, the configure script
    951        will search for a C++ compiler and C++ header files. If it finds  them,
    952        it  automatically  builds  the C++ wrapper library (which supports only
    953        8-bit strings). You can disable this by adding
    954 
    955          --disable-cpp
    956 
    957        to the configure command.
    958 
    959 
    960 UTF-8, UTF-16 AND UTF-32 SUPPORT
    961 
    962        To build PCRE with support for UTF Unicode character strings, add
    963 
    964          --enable-utf
    965 
    966        to the configure command. This setting applies to all three  libraries,
    967        adding  support  for  UTF-8 to the 8-bit library, support for UTF-16 to
    968        the  16-bit  library,  and  support  for  UTF-32  to  the  32-bit
    969        library.  There  are no separate options for enabling UTF-8, UTF-16 and
    970        UTF-32 independently because that would allow ridiculous settings  such
    971        as  requesting UTF-16 support while building only the 8-bit library. It
    972        is not possible to build one library with UTF support and another with-
    973        out  in the same configuration. (For backwards compatibility, --enable-
    974        utf8 is a synonym of --enable-utf.)
    975 
    976        Of itself, this setting does not make  PCRE  treat  strings  as  UTF-8,
    977        UTF-16  or UTF-32. As well as compiling PCRE with this option, you also
    978        have to set the PCRE_UTF8,  PCRE_UTF16  or  PCRE_UTF32  option  (as
    979        appropriate) when you call one of the pattern compiling functions.
    980 
    981        If  you  set --enable-utf when compiling in an EBCDIC environment, PCRE
    982        expects its input to be either ASCII or UTF-8 (depending  on  the  run-
    983        time option). It is not possible to support both EBCDIC and UTF-8 codes
    984        in the same version of  the  library.  Consequently,  --enable-utf  and
    985        --enable-ebcdic are mutually exclusive.
    986 
    987 
    988 UNICODE CHARACTER PROPERTY SUPPORT
    989 
    990        UTF  support allows the libraries to process character codepoints up to
    991        0x10ffff in the strings that they handle. On its own, however, it  does
    992        not provide any facilities for accessing the properties of such charac-
    993        ters. If you want to be able to use the pattern escapes \P, \p, and \X,
    994        which refer to Unicode character properties, you must add
    995 
    996          --enable-unicode-properties
    997 
    998        to  the  configure  command. This implies UTF support, even if you have
    999        not explicitly requested it.
   1000 
   1001        Including Unicode property support adds around 30K  of  tables  to  the
   1002        PCRE  library.  Only  the general category properties such as Lu and Nd
   1003        are supported. Details are given in the pcrepattern documentation.
   1004 
   1005 
   1006 JUST-IN-TIME COMPILER SUPPORT
   1007 
   1008        Just-in-time compiler support is included in the build by specifying
   1009 
   1010          --enable-jit
   1011 
   1012        This support is available only for certain hardware  architectures.  If
   1013        this  option  is  set  for  an unsupported architecture, a compile time
   1014        error occurs.  See the pcrejit documentation for a  discussion  of  JIT
   1015        usage. When JIT support is enabled, pcregrep automatically makes use of
   1016        it, unless you add
   1017 
   1018          --disable-pcregrep-jit
   1019 
   1020        to the "configure" command.
   1021 
   1022 
   1023 CODE VALUE OF NEWLINE
   1024 
   1025        By default, PCRE interprets the linefeed (LF) character  as  indicating
   1026        the  end  of  a line. This is the normal newline character on Unix-like
   1027        systems. You can compile PCRE to use carriage return (CR)  instead,  by
   1028        adding
   1029 
   1030          --enable-newline-is-cr
   1031 
   1032        to  the  configure  command.  There  is  also  a --enable-newline-is-lf
   1033        option, which explicitly specifies linefeed as the newline character.
   1034 
   1035        Alternatively, you can specify that line endings are to be indicated by
   1036        the two-character sequence CRLF. If you want this, add
   1037 
   1038          --enable-newline-is-crlf
   1039 
   1040        to the configure command. There is a fourth option, specified by
   1041 
   1042          --enable-newline-is-anycrlf
   1043 
   1044        which  causes  PCRE  to recognize any of the three sequences CR, LF, or
   1045        CRLF as indicating a line ending. Finally, a fifth option, specified by
   1046 
   1047          --enable-newline-is-any
   1048 
   1049        causes PCRE to recognize any Unicode newline sequence.
   1050 
   1051        Whatever line ending convention is selected when PCRE is built  can  be
   1052        overridden  when  the library functions are called. At build time it is
   1053        conventional to use the standard for your operating system.
   1054 
   1055 
   1056 WHAT \R MATCHES
   1057 
   1058        By default, the sequence \R in a pattern matches  any  Unicode  newline
   1059        sequence,  whatever  has  been selected as the line ending sequence. If
   1060        you specify
   1061 
   1062          --enable-bsr-anycrlf
   1063 
   1064        the default is changed so that \R matches only CR, LF, or  CRLF.  What-
   1065        ever  is selected when PCRE is built can be overridden when the library
   1066        functions are called.
   1067 
   1068 
   1069 POSIX MALLOC USAGE
   1070 
   1071        When the 8-bit library is called through the POSIX interface  (see  the
   1072        pcreposix  documentation),  additional  working storage is required for
   1073        holding the pointers to capturing  substrings,  because  PCRE  requires
   1074        three integers per substring, whereas the POSIX interface provides only
   1075        two. If the number of expected substrings is small, the  wrapper  func-
   1076        tion  uses  space  on the stack, because this is faster than using mal-
   1077        loc() for each call. The default threshold above which the stack is  no
   1078        longer used is 10; it can be changed by adding a setting such as
   1079 
   1080          --with-posix-malloc-threshold=20
   1081 
   1082        to the configure command.
   1083 
   1084 
   1085 HANDLING VERY LARGE PATTERNS
   1086 
   1087        Within  a  compiled  pattern,  offset values are used to point from one
   1088        part to another (for example, from an opening parenthesis to an  alter-
   1089        nation  metacharacter).  By default, in the 8-bit and 16-bit libraries,
   1090        two-byte values are used for these offsets, leading to a  maximum  size
   1091        for  a compiled pattern of around 64K. This is sufficient to handle all
   1092        but the most gigantic patterns.  Nevertheless, some people do  want  to
   1093        process  truly  enormous patterns, so it is possible to compile PCRE to
   1094        use three-byte or four-byte offsets by adding a setting such as
   1095 
   1096          --with-link-size=3
   1097 
   1098        to the configure command. The value given must be 2, 3, or 4.  For  the
   1099        16-bit  library,  a  value of 3 is rounded up to 4. In these libraries,
   1100        using longer offsets slows down the operation of PCRE because it has to
   1101        load  additional  data  when  handling them. For the 32-bit library the
   1102        value is always 4 and cannot be overridden; the value  of  --with-link-
   1103        size is ignored.
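
               The "around 64K" figure follows from the width of the offset
               field. The sketch below computes the largest value that fits in
               an offset of a given number of bytes. The function name is
               hypothetical, and the result is only the arithmetic upper bound
               of the field, not necessarily the limit that PCRE enforces.

```c
/* Largest value representable in an offset field of link_size bytes.
   Returns 0 for sizes other than 2, 3, or 4. */
unsigned long long sketch_max_link_offset(int link_size)
{
    if (link_size < 2 || link_size > 4) return 0ULL;
    return (1ULL << (8 * link_size)) - 1ULL;
}
```

               For a link size of 2 this gives 65535, which is the source of
               the 64K figure; sizes 3 and 4 give 16777215 and 4294967295.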
   1104 
   1105 
   1106 AVOIDING EXCESSIVE STACK USAGE
   1107 
   1108        When matching with the pcre_exec() function, PCRE implements backtrack-
   1109        ing by making recursive calls to an internal function  called  match().
   1110        In  environments  where  the size of the stack is limited, this can se-
   1111        verely limit PCRE's operation. (The Unix environment does  not  usually
   1112        suffer from this problem, but it may sometimes be necessary to increase
   1113        the maximum stack size.  There is a discussion in the  pcrestack  docu-
   1114        mentation.)  An alternative approach to recursion that uses memory from
   1115        the heap to remember data, instead of using recursive  function  calls,
   1116        has  been  implemented to work round the problem of limited stack size.
   1117        If you want to build a version of PCRE that works this way, add
   1118 
   1119          --disable-stack-for-recursion
   1120 
   1121        to the configure command. With this configuration, PCRE  will  use  the
   1122        pcre_stack_malloc  and pcre_stack_free variables to call memory manage-
   1123        ment functions. By default these point to malloc() and free(), but  you
   1124        can replace the pointers so that your own functions are used instead.
   1125 
   1126        Separate  functions  are  provided  rather  than  using pcre_malloc and
   1127        pcre_free because the  usage  is  very  predictable:  the  block  sizes
   1128        requested  are  always  the  same,  and  the blocks are always freed in
   1129        reverse order. A calling program might be able to  implement  optimized
   1130        functions  that  perform  better  than  malloc()  and free(). PCRE runs
   1131        noticeably more slowly when built in this way. This option affects only
   1132        the pcre_exec() function; it is not relevant for pcre_dfa_exec().
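
               A minimal sketch of such optimized functions is shown below. It
               relies on the usage pattern described above (equal block sizes,
               freed in reverse order) by keeping freed blocks on a list and
               handing the most recently freed one back first. All names are
               hypothetical, the code is not thread-safe, and it assumes a C11
               compiler for max_align_t.

```c
#include <stddef.h>
#include <stdlib.h>

/* Header stored in front of each block. The union member "align"
   keeps the payload that follows suitably aligned for any type. */
typedef union sketch_header {
    struct {
        size_t size;                  /* payload size of this block */
        union sketch_header *next;    /* next block on the free list */
    } info;
    max_align_t align;
} sketch_header;

static sketch_header *sketch_free_list = NULL;

/* Reuse the most recently freed block if it is large enough;
   otherwise fall back to malloc(). */
void *sketch_stack_malloc(size_t size)
{
    sketch_header *h = sketch_free_list;
    if (h != NULL && h->info.size >= size) {
        sketch_free_list = h->info.next;
        return h + 1;
    }
    h = malloc(sizeof(sketch_header) + size);
    if (h == NULL) return NULL;
    h->info.size = size;
    h->info.next = NULL;
    return h + 1;
}

/* Push the block onto the free list instead of returning it to the
   system, so that the next request can be served without malloc(). */
void sketch_stack_free(void *ptr)
{
    sketch_header *h;
    if (ptr == NULL) return;
    h = (sketch_header *)ptr - 1;
    h->info.next = sketch_free_list;
    sketch_free_list = h;
}

/* Return all cached blocks to the system, for example at shutdown. */
void sketch_stack_release_all(void)
{
    while (sketch_free_list != NULL) {
        sketch_header *next = sketch_free_list->info.next;
        free(sketch_free_list);
        sketch_free_list = next;
    }
}
```

               In a library built with --disable-stack-for-recursion, these
               two functions have the signatures expected by pcre_stack_malloc
               and pcre_stack_free, so the pointers could be set to them
               before matching. Whether this is faster than malloc() and
               free() on a given system would need to be measured.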
   1133 
   1134 
   1135 LIMITING PCRE RESOURCE USAGE
   1136 
   1137        Internally,  PCRE has a function called match(), which it calls repeat-
   1138        edly  (sometimes  recursively)  when  matching  a  pattern   with   the
   1139        pcre_exec()  function.  By controlling the maximum number of times this
   1140        function may be called during a single matching operation, a limit  can
   1141        be  placed  on  the resources used by a single call to pcre_exec(). The
   1142        limit can be changed at run time, as described in the pcreapi  documen-
   1143        tation.  The default is 10 million, but this can be changed by adding a
   1144        setting such as
   1145 
   1146          --with-match-limit=500000
   1147 
   1148        to  the  configure  command.  This  setting  has  no  effect   on   the
   1149        pcre_dfa_exec() matching function.
   1150 
   1151        In  some  environments  it is desirable to limit the depth of recursive
   1152        calls of match() more strictly than the total number of calls, in order
   1153        to  restrict  the maximum amount of stack (or heap, if --disable-stack-
   1154        for-recursion is specified) that is used. A second limit controls this;
   1155        it  defaults  to  the  value  that is set for --with-match-limit, which
   1156        imposes no additional constraints. However, you can set a  lower  limit
   1157        by adding, for example,
   1158 
   1159          --with-match-limit-recursion=10000
   1160 
   1161        to  the  configure  command.  This  value can also be overridden at run
   1162        time.
   1163 
   1164 
   1165 CREATING CHARACTER TABLES AT BUILD TIME
   1166 
   1167        PCRE uses fixed tables for processing characters whose code values  are
   1168        less  than 256. By default, PCRE is built with a set of tables that are
   1169        distributed in the file pcre_chartables.c.dist. These  tables  are  for
   1170        ASCII codes only. If you add
   1171 
   1172          --enable-rebuild-chartables
   1173 
   1174        to  the  configure  command, the distributed tables are no longer used.
   1175        Instead, a program called dftables is compiled and  run.  This  outputs
   1176        the source for a new set of tables, created in the default locale of your
   1177        C run-time system. (This method of replacing the tables does  not  work
   1178        if  you are cross compiling, because dftables is run on the local host.
   1179        If you need to create alternative tables when cross compiling, you will
   1180        have to do so "by hand".)
   1181 
   1182 
   1183 USING EBCDIC CODE
   1184 
   1185        PCRE  assumes  by  default that it will run in an environment where the
   1186        character code is ASCII (or Unicode, which is  a  superset  of  ASCII).
   1187        This  is  the  case for most computer operating systems. PCRE can, how-
   1188        ever, be compiled to run in an EBCDIC environment by adding
   1189 
   1190          --enable-ebcdic
   1191 
   1192        to the configure command. This setting implies --enable-rebuild-charta-
   1193        bles.  You  should  only  use  it if you know that you are in an EBCDIC
   1194        environment (for example,  an  IBM  mainframe  operating  system).  The
   1195        --enable-ebcdic option is incompatible with --enable-utf.
   1196 
   1197        The EBCDIC character that corresponds to an ASCII LF is assumed to have
   1198        the value 0x15 by default. However, in some EBCDIC  environments,  0x25
   1199        is used. In such an environment you should use
   1200 
   1201          --enable-ebcdic-nl25
   1202 
   1203        as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
   1204        has the same value as in ASCII, namely, 0x0d.  Whichever  of  0x15  and
   1205        0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
   1206        acter (which, in Unicode, is 0x85).
   1207 
   1208        The options that select newline behaviour, such as --enable-newline-is-
   1209        cr, and equivalent run-time options, refer to these character values in
   1210        an EBCDIC environment.
   1211 
   1212 
   1213 PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT
   1214 
   1215        By default, pcregrep reads all files as plain text. You can build it so
   1216        that it recognizes files whose names end in .gz or .bz2, and reads them
   1217        with libz or libbz2, respectively, by adding one or both of
   1218 
   1219          --enable-pcregrep-libz
   1220          --enable-pcregrep-libbz2
   1221 
   1222        to the configure command. These options naturally require that the rel-
   1223        evant  libraries  are installed on your system. Configuration will fail
   1224        if they are not.
   1225 
   1226 
   1227 PCREGREP BUFFER SIZE
   1228 
   1229        pcregrep uses an internal buffer to hold a "window" on the file  it  is
   1230        scanning, in order to be able to output "before" and "after" lines when
   1231        it finds a match. The size of the buffer is controlled by  a  parameter
   1232        whose default value is 20K. The buffer itself is three times this size,
   1233        but because of the way it is used for holding "before" lines, the long-
   1234        est  line  that  is guaranteed to be processable is the parameter size.
   1235        You can change the default parameter value by adding, for example,
   1236 
   1237          --with-pcregrep-bufsize=50K
   1238 
   1239        to the configure command. The caller of pcregrep can, however, override
   1240        this value by specifying a run-time option.
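
               The relationship between the parameter and the buffer can be written
               out as simple arithmetic. The following C sketch is illustrative only:
               the helper names are hypothetical, and it assumes that "K" means 1024
               bytes. The factor of three and the line-length guarantee are taken
               from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers, not part of pcregrep. */

/* Total bytes in the internal buffer: three times the parameter. */
static size_t total_buffer_bytes(size_t bufsize_param)
{
    assert(bufsize_param <= (size_t)-1 / 3);   /* avoid overflow */
    return 3 * bufsize_param;
}

/* Non-zero if a line of line_len bytes is guaranteed to be processable,
   that is, if it is no longer than the parameter value. */
static int line_is_guaranteed(size_t line_len, size_t bufsize_param)
{
    return line_len <= bufsize_param;
}
```

               With the default of 20K this gives a 61440-byte buffer, and with
               --with-pcregrep-bufsize=50K a 153600-byte buffer.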
   1241 
   1242 
   1243 PCRETEST OPTION FOR LIBREADLINE SUPPORT
   1244 
   1245        If you add
   1246 
   1247          --enable-pcretest-libreadline
   1248 
   1249        to  the  configure  command,  pcretest  is  linked with the libreadline
   1250        library, and when its input is from a terminal, it reads it  using  the
   1251        readline() function. This provides line-editing and history facilities.
   1252        Note that libreadline is GPL-licensed, so if you distribute a binary of
   1253        pcretest linked in this way, there may be licensing issues.
   1254 
   1255        Setting  this  option  causes  the -lreadline option to be added to the
   1256        pcretest build. In many operating environments with a system-installed
   1257        libreadline this is sufficient. However, in some environments (e.g.  if
   1258        an unmodified distribution version of readline is in use),  some  extra
   1259        configuration  may  be necessary. The INSTALL file for libreadline says
   1260        this:
   1261 
   1262          "Readline uses the termcap functions, but does not link with the
   1263          termcap or curses library itself, allowing applications which link
   1264          with readline the to choose an appropriate library."
   1265 
   1266        If your environment has not been set up so that an appropriate  library
   1267        is automatically included, you may need to add something like
   1268 
   1269          LIBS="-lncurses"
   1270 
   1271        immediately before the configure command.
   1272 
   1273 
   1274 DEBUGGING WITH VALGRIND SUPPORT
   1275 
   1276        By adding the
   1277 
   1278          --enable-valgrind
   1279 
   1280        option to the configure command, PCRE will use valgrind annotations
   1281        to mark certain memory regions as  unaddressable.  This  allows  it  to
   1282        detect invalid memory accesses, and is mostly useful for debugging PCRE
   1283        itself.
   1284 
   1285 
   1286 CODE COVERAGE REPORTING
   1287 
   1288        If your C compiler is gcc, you can build a version  of  PCRE  that  can
   1289        generate a code coverage report for its test suite. To enable this, you
   1290        must install lcov version 1.6 or above. Then specify
   1291 
   1292          --enable-coverage
   1293 
   1294        to the configure command and build PCRE in the usual way.
   1295 
   1296        Note that using ccache (a caching C compiler) is incompatible with code
   1297        coverage  reporting. If you have configured ccache to run automatically
   1298        on your system, you must set the environment variable
   1299 
   1300          CCACHE_DISABLE=1
   1301 
   1302        before running make to build PCRE, so that ccache is not used.
   1303 
   1304        When --enable-coverage is used, the following additional targets are
   1305        added to the Makefile:
   1306 
   1307          make coverage
   1308 
   1309        This  creates  a  fresh  coverage report for the PCRE test suite. It is
   1310        equivalent to running "make coverage-reset", "make  coverage-baseline",
   1311        "make check", and then "make coverage-report".
   1312 
   1313          make coverage-reset
   1314 
   1315        This zeroes the coverage counters, but does nothing else.
   1316 
   1317          make coverage-baseline
   1318 
   1319        This captures baseline coverage information.
   1320 
   1321          make coverage-report
   1322 
   1323        This creates the coverage report.
   1324 
   1325          make coverage-clean-report
   1326 
   1327        This  removes the generated coverage report without cleaning the cover-
   1328        age data itself.
   1329 
   1330          make coverage-clean-data
   1331 
   1332        This removes the captured coverage data without removing  the  coverage
   1333        files created at compile time (*.gcno).
   1334 
   1335          make coverage-clean
   1336 
   1337        This  cleans all coverage data including the generated coverage report.
   1338        For more information about code coverage, see the gcov and  lcov  docu-
   1339        mentation.
   1340 
   1341 
   1342 SEE ALSO
   1343 
   1344        pcreapi(3), pcre16, pcre32, pcre_config(3).
   1345 
   1346 
   1347 AUTHOR
   1348 
   1349        Philip Hazel
   1350        University Computing Service
   1351        Cambridge CB2 3QH, England.
   1352 
   1353 
   1354 REVISION
   1355 
   1356        Last updated: 12 May 2013
   1357        Copyright (c) 1997-2013 University of Cambridge.
   1358 ------------------------------------------------------------------------------
   1359 
   1360 
   1361 PCREMATCHING(3)            Library Functions Manual            PCREMATCHING(3)
   1362 
   1363 
   1364 
   1365 NAME
   1366        PCRE - Perl-compatible regular expressions
   1367 
   1368 PCRE MATCHING ALGORITHMS
   1369 
   1370        This document describes the two different algorithms that are available
   1371        in PCRE for matching a compiled regular expression against a given sub-
   1372        ject  string.  The  "standard"  algorithm  is  the  one provided by the
   1373        pcre_exec(), pcre16_exec() and pcre32_exec() functions. These  work  in
   1374        the same way as Perl's matching function, and provide a Perl-compatible
   1375        matching  operation.   The  just-in-time  (JIT)  optimization  that  is
   1376        described  in  the pcrejit documentation is compatible with these func-
   1377        tions.
   1378 
   1379        An  alternative  algorithm  is   provided   by   the   pcre_dfa_exec(),
   1380        pcre16_dfa_exec()  and  pcre32_dfa_exec()  functions; they operate in a
   1381        different way, and are not Perl-compatible. This alternative has advan-
   1382        tages and disadvantages compared with the standard algorithm, and these
   1383        are described below.
   1384 
   1385        When there is only one possible way in which a given subject string can
   1386        match  a pattern, the two algorithms give the same answer. A difference
   1387        arises, however, when there are multiple possibilities. For example, if
   1388        the pattern
   1389 
   1390          ^<.*>
   1391 
   1392        is matched against the string
   1393 
   1394          <something> <something else> <something further>
   1395 
   1396        there are three possible answers. The standard algorithm finds only one
   1397        of them, whereas the alternative algorithm finds all three.
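
               The three answers can be enumerated by hand. The following C sketch is
               a toy stand-in for the single pattern ^<.*> and does not call PCRE.
               The function name all_match_lengths is hypothetical, and the subject
               is assumed to contain no newline characters (which dot does not match
               by default).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy equivalent of matching ^<.*> at the start of the subject.
   Stores the length of every possible match, longest first, in
   lengths[0..max-1] and returns how many were stored. */
static int all_match_lengths(const char *subject, size_t *lengths, int max)
{
    size_t len;
    size_t i;
    int count = 0;

    assert(subject != NULL && lengths != NULL);
    len = strlen(subject);
    if (len == 0 || subject[0] != '<') return 0;   /* ^< must match */

    /* Every '>' after the first character ends one possible match.
       The loop body sees i = len-1 down to 1. */
    for (i = len; i-- > 1; ) {
        if (subject[i] == '>' && count < max) lengths[count++] = i + 1;
    }
    return count;
}
```

               For the subject shown above, the sketch reports matches of 48, 28, and
               11 characters.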
   1398 
   1399 
   1400 REGULAR EXPRESSIONS AS TREES
   1401 
   1402        The set of strings that are matched by a regular expression can be rep-
   1403        resented  as  a  tree structure. An unlimited repetition in the pattern
   1404        makes the tree of infinite size, but it is still a tree.  Matching  the
   1405        pattern  to a given subject string (from a given starting point) can be
   1406        thought of as a search of the tree.  There are two  ways  to  search  a
   1407        tree:  depth-first  and  breadth-first, and these correspond to the two
   1408        matching algorithms provided by PCRE.
   1409 
   1410 
   1411 THE STANDARD MATCHING ALGORITHM
   1412 
   1413        In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres-
   1414        sions",  the  standard  algorithm  is an "NFA algorithm". It conducts a
   1415        depth-first search of the pattern tree. That is, it  proceeds  along  a
   1416        single path through the tree, checking that the subject matches what is
   1417        required. When there is a mismatch, the algorithm  tries  any  alterna-
   1418        tives  at  the  current point, and if they all fail, it backs up to the
   1419        previous branch point in the  tree,  and  tries  the  next  alternative
   1420        branch  at  that  level.  This often involves backing up (moving to the
   1421        left) in the subject string as well.  The  order  in  which  repetition
   1422        branches  are  tried  is controlled by the greedy or ungreedy nature of
   1423        the quantifier.
   1424 
   1425        If a leaf node is reached, a matching string has  been  found,  and  at
   1426        that  point the algorithm stops. Thus, if there is more than one possi-
   1427        ble match, this algorithm returns the first one that it finds.  Whether
   1428        this  is the shortest, the longest, or some intermediate length depends
   1429        on the way the greedy and ungreedy repetition quantifiers are specified
   1430        in the pattern.
   1431 
   1432        Because  it  ends  up  with a single path through the tree, it is rela-
   1433        tively straightforward for this algorithm to keep  track  of  the  sub-
   1434        strings  that  are  matched  by portions of the pattern in parentheses.
   1435        This provides support for capturing parentheses and back references.
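
               The depth-first search can be illustrated with a very small
               backtracking matcher. This is a sketch under stated assumptions, not
               PCRE's implementation: the function match_here is hypothetical, it is
               anchored at the start of the subject, and it understands only literal
               characters, dot, and a single-character repeat written x* (greedy) or
               x*? (ungreedy).

```c
#include <assert.h>
#include <stddef.h>

/* Returns the length of the first match found, or -1 if there is none.
   The search stops at the first leaf it reaches. */
static int match_here(const char *pat, const char *sub)
{
    assert(pat != NULL && sub != NULL);
    if (pat[0] == '\0') return 0;              /* leaf node: match found */

    if (pat[1] == '*') {
        int lazy = (pat[2] == '?');
        const char *rest = pat + (lazy ? 3 : 2);
        size_t max = 0;                        /* most the repeat can take */
        size_t n;
        int r;

        while (sub[max] != '\0' && (pat[0] == '.' || pat[0] == sub[max]))
            max++;

        if (lazy) {                            /* try fewest repeats first */
            for (n = 0; n <= max; n++) {
                r = match_here(rest, sub + n);
                if (r >= 0) return (int)n + r;
            }
        } else {                               /* try most repeats first */
            for (n = max + 1; n-- > 0; ) {
                r = match_here(rest, sub + n); /* back up in the subject */
                if (r >= 0) return (int)n + r;
            }
        }
        return -1;                             /* all alternatives failed */
    }

    if (sub[0] != '\0' && (pat[0] == '.' || pat[0] == sub[0])) {
        int r = match_here(pat + 1, sub + 1);
        return r < 0 ? -1 : r + 1;
    }
    return -1;
}
```

               Applied to "<something> <something else> <something further>", the
               pattern <.*> yields a 48-character match and <.*?> an 11-character
               match, showing how the kind of quantifier decides which match is
               found first.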
   1436 
   1437 
   1438 THE ALTERNATIVE MATCHING ALGORITHM
   1439 
   1440        This algorithm conducts a breadth-first search of  the  tree.  Starting
   1441        from  the  first  matching  point  in the subject, it scans the subject
   1442        string from left to right, once, character by character, and as it does
   1443        this,  it remembers all the paths through the tree that represent valid
   1444        matches. In Friedl's terminology, this is a kind  of  "DFA  algorithm",
   1445        though  it is not implemented as a traditional finite state machine (it
   1446        keeps multiple states active simultaneously).
   1447 
   1448        Although the general principle of this matching algorithm  is  that  it
   1449        scans  the subject string only once, without backtracking, there is one
   1450        exception: when a lookaround assertion is encountered,  the  characters
   1451        following  or  preceding  the  current  point  have to be independently
   1452        inspected.
   1453 
   1454        The scan continues until either the end of the subject is  reached,  or
   1455        there  are  no more unterminated paths. At this point, terminated paths
   1456        represent the different matching possibilities (if there are none,  the
   1457        match  has  failed).   Thus,  if there is more than one possible match,
   1458        this algorithm finds all of them, and in particular, it finds the long-
   1459        est.  The  matches are returned in decreasing order of length. There is
   1460        an option to stop the algorithm after the first match (which is  neces-
   1461        sarily the shortest) is found.
   1462 
   1463        Note that all the matches that are found start at the same point in the
   1464        subject. If the pattern
   1465 
   1466          cat(er(pillar)?)?
   1467 
   1468        is matched against the string "the caterpillar catchment",  the  result
   1469        will  be the three strings "caterpillar", "cater", and "cat" that start
   1470        at the fifth character of the subject. The algorithm does not automati-
   1471        cally move on to find matches that start at later positions.
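
               The single left-to-right scan can be sketched for this one pattern as
               follows. The code is a hand-built toy for cat(er(pillar)?)? and does
               not call pcre_dfa_exec(); the function name and the table of accepting
               lengths are assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Scans the subject once from offset start, recording each point at
   which a path through cat(er(pillar)?)? terminates. Match lengths are
   stored longest first; the return value is the number stored. */
static int dfa_style_matches(const char *subject, size_t start,
                             size_t *lengths, int max)
{
    static const char word[] = "caterpillar";
    static const size_t accept[] = { 3, 5, 11 }; /* cat, cater, caterpillar */
    size_t found[3];
    int nfound = 0;
    int out = 0;
    size_t i = 0;
    int k;

    assert(subject != NULL && lengths != NULL);

    /* One pass, character by character, with no backtracking. */
    while (word[i] != '\0' && subject[start + i] == word[i]) {
        i++;
        for (k = 0; k < 3; k++)
            if (accept[k] == i) found[nfound++] = i;
    }

    /* Report in decreasing order of length. */
    while (nfound > 0 && out < max) lengths[out++] = found[--nfound];
    return out;
}
```

               Starting at offset 4 of "the caterpillar catchment" the sketch reports
               lengths 11, 5, and 3. The match in "catchment" is found only if the
               caller asks again with start set to 16.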
   1472 
   1473        PCRE's  "auto-possessification" optimization usually applies to charac-
   1474        ter repeats at the end of a pattern (as well as internally). For  exam-
   1475        ple, the pattern "a\d+" is compiled as if it were "a\d++" because there
   1476        is no point even considering the possibility of backtracking  into  the
   1477        repeated  digits.  For  DFA matching, this means that only one possible
   1478        match is found. If you really do want multiple matches in  such  cases,
   1479        either use an ungreedy repeat ("a\d+?") or set the PCRE_NO_AUTO_POSSESS
   1480        option when compiling.
   1481 
   1482        There are a number of features of PCRE regular expressions that are not
   1483        supported by the alternative matching algorithm. They are as follows:
   1484 
   1485        1.  Because  the  algorithm  finds  all possible matches, the greedy or
   1486        ungreedy nature of repetition quantifiers is not relevant.  Greedy  and
   1487        ungreedy quantifiers are treated in exactly the same way. However, pos-
   1488        sessive quantifiers can make a difference when what follows could  also
   1489        match what is quantified, for example in a pattern like this:
   1490 
   1491          ^a++\w!
   1492 
   1493        This  pattern matches "aaab!" but not "aaa!", which would be matched by
   1494        a non-possessive quantifier. Similarly, if an atomic group is  present,
   1495        it  is matched as if it were a standalone pattern at the current point,
   1496        and the longest match is then "locked in" for the rest of  the  overall
   1497        pattern.
   1498 
   1499        2. When dealing with multiple paths through the tree simultaneously, it
   1500        is not straightforward to keep track of  captured  substrings  for  the
   1501        different  matching  possibilities,  and  PCRE's implementation of this
   1502        algorithm does not attempt to do this. This means that no captured sub-
   1503        strings are available.
   1504 
   1505        3.  Because no substrings are captured, back references within the pat-
   1506        tern are not supported, and cause errors if encountered.
   1507 
   1508        4. For the same reason, conditional expressions that use  a  backrefer-
   1509        ence  as  the  condition or test for a specific group recursion are not
   1510        supported.
   1511 
   1512        5. Because many paths through the tree may be  active,  the  \K  escape
   1513        sequence, which resets the start of the match when encountered (but may
   1514        be on some paths and not on others), is not  supported.  It  causes  an
   1515        error if encountered.
   1516 
   1517        6.  Callouts  are  supported, but the value of the capture_top field is
   1518        always 1, and the value of the capture_last field is always -1.
   1519 
   1520        7. The \C escape sequence, which (in  the  standard  algorithm)  always
   1521        matches  a  single data unit, even in UTF-8, UTF-16 or UTF-32 modes, is
   1522        not supported in these modes, because the alternative  algorithm  moves
   1523        through the subject string one character (not data unit) at a time, for
   1524        all active paths through the tree.
   1525 
   1526        8. Except for (*FAIL), the backtracking control verbs such as  (*PRUNE)
   1527        are  not  supported.  (*FAIL)  is supported, and behaves like a failing
   1528        negative assertion.
   1529 
   1530 
   1531 ADVANTAGES OF THE ALTERNATIVE ALGORITHM
   1532 
   1533        Using the alternative matching algorithm provides the following  advan-
   1534        tages:
   1535 
   1536        1. All possible matches (at a single point in the subject) are automat-
   1537        ically found, and in particular, the longest match is  found.  To  find
   1538        more than one match using the standard algorithm, you have to do kludgy
   1539        things with callouts.
   1540 
   1541        2. Because the alternative algorithm  scans  the  subject  string  just
   1542        once, and never needs to backtrack (except for lookbehinds), it is pos-
   1543        sible to pass very long subject strings to  the  matching  function  in
   1544        several pieces, checking for partial matching each time. Although it is
   1545        possible to do multi-segment matching using the standard  algorithm  by
   1546        retaining  partially  matched  substrings,  it is more complicated. The
   1547        pcrepartial documentation gives details of partial  matching  and  dis-
   1548        cusses multi-segment matching.
   1549 
   1550 
   1551 DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
   1552 
   1553        The alternative algorithm suffers from a number of disadvantages:
   1554 
   1555        1.  It  is  substantially  slower  than the standard algorithm. This is
   1556        partly because it has to search for all possible matches, but  is  also
   1557        because it is less susceptible to optimization.
   1558 
   1559        2. Capturing parentheses and back references are not supported.
   1560 
   1561        3. Although atomic groups are supported, their use does not provide the
   1562        performance advantage that it does for the standard algorithm.
   1563 
   1564 
   1565 AUTHOR
   1566 
   1567        Philip Hazel
   1568        University Computing Service
   1569        Cambridge CB2 3QH, England.
   1570 
   1571 
   1572 REVISION
   1573 
   1574        Last updated: 12 November 2013
   1575        Copyright (c) 1997-2012 University of Cambridge.
   1576 ------------------------------------------------------------------------------
   1577 
   1578 
   1579 PCREAPI(3)                 Library Functions Manual                 PCREAPI(3)
   1580 
   1581 
   1582 
   1583 NAME
   1584        PCRE - Perl-compatible regular expressions
   1585 
   1586        #include <pcre.h>
   1587 
   1588 
   1589 PCRE NATIVE API BASIC FUNCTIONS
   1590 
   1591        pcre *pcre_compile(const char *pattern, int options,
   1592             const char **errptr, int *erroffset,
   1593             const unsigned char *tableptr);
   1594 
   1595        pcre *pcre_compile2(const char *pattern, int options,
   1596             int *errorcodeptr,
   1597             const char **errptr, int *erroffset,
   1598             const unsigned char *tableptr);
   1599 
   1600        pcre_extra *pcre_study(const pcre *code, int options,
   1601             const char **errptr);
   1602 
   1603        void pcre_free_study(pcre_extra *extra);
   1604 
   1605        int pcre_exec(const pcre *code, const pcre_extra *extra,
   1606             const char *subject, int length, int startoffset,
   1607             int options, int *ovector, int ovecsize);
   1608 
   1609        int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
   1610             const char *subject, int length, int startoffset,
   1611             int options, int *ovector, int ovecsize,
   1612             int *workspace, int wscount);
   1613 
   1614 
   1615 PCRE NATIVE API STRING EXTRACTION FUNCTIONS
   1616 
   1617        int pcre_copy_named_substring(const pcre *code,
   1618             const char *subject, int *ovector,
   1619             int stringcount, const char *stringname,
   1620             char *buffer, int buffersize);
   1621 
   1622        int pcre_copy_substring(const char *subject, int *ovector,
   1623             int stringcount, int stringnumber, char *buffer,
   1624             int buffersize);
   1625 
   1626        int pcre_get_named_substring(const pcre *code,
   1627             const char *subject, int *ovector,
   1628             int stringcount, const char *stringname,
   1629             const char **stringptr);
   1630 
   1631        int pcre_get_stringnumber(const pcre *code,
   1632             const char *name);
   1633 
   1634        int pcre_get_stringtable_entries(const pcre *code,
   1635             const char *name, char **first, char **last);
   1636 
   1637        int pcre_get_substring(const char *subject, int *ovector,
   1638             int stringcount, int stringnumber,
   1639             const char **stringptr);
   1640 
   1641        int pcre_get_substring_list(const char *subject,
   1642             int *ovector, int stringcount, const char ***listptr);
   1643 
   1644        void pcre_free_substring(const char *stringptr);
   1645 
   1646        void pcre_free_substring_list(const char **stringptr);
   1647 
   1648 
   1649 PCRE NATIVE API AUXILIARY FUNCTIONS
   1650 
   1651        int pcre_jit_exec(const pcre *code, const pcre_extra *extra,
   1652             const char *subject, int length, int startoffset,
   1653             int options, int *ovector, int ovecsize,
   1654             pcre_jit_stack *jstack);
   1655 
   1656        pcre_jit_stack *pcre_jit_stack_alloc(int startsize, int maxsize);
   1657 
   1658        void pcre_jit_stack_free(pcre_jit_stack *stack);
   1659 
   1660        void pcre_assign_jit_stack(pcre_extra *extra,
   1661             pcre_jit_callback callback, void *data);
   1662 
   1663        const unsigned char *pcre_maketables(void);
   1664 
   1665        int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
   1666             int what, void *where);
   1667 
   1668        int pcre_refcount(pcre *code, int adjust);
   1669 
   1670        int pcre_config(int what, void *where);
   1671 
   1672        const char *pcre_version(void);
   1673 
   1674        int pcre_pattern_to_host_byte_order(pcre *code,
   1675             pcre_extra *extra, const unsigned char *tables);
   1676 
   1677 
   1678 PCRE NATIVE API INDIRECTED FUNCTIONS
   1679 
   1680        void *(*pcre_malloc)(size_t);
   1681 
   1682        void (*pcre_free)(void *);
   1683 
   1684        void *(*pcre_stack_malloc)(size_t);
   1685 
   1686        void (*pcre_stack_free)(void *);
   1687 
   1688        int (*pcre_callout)(pcre_callout_block *);
   1689 
   1690        int (*pcre_stack_guard)(void);
   1691 
   1692 
   1693 PCRE 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
   1694 
   1695        As  well  as  support  for  8-bit character strings, PCRE also supports
   1696        16-bit strings (from release 8.30) and  32-bit  strings  (from  release
   1697        8.32),  by means of two additional libraries. They can be built as well
   1698        as, or instead of, the 8-bit library. To avoid too  much  complication,
   1699        this  document describes the 8-bit versions of the functions, with only
   1700        occasional references to the 16-bit and 32-bit libraries.
   1701 
   1702        The 16-bit and 32-bit functions operate in the same way as their  8-bit
   1703        counterparts;  they  just  use different data types for their arguments
   1704        and results, and their names start with pcre16_ or pcre32_  instead  of
   1705        pcre_.  For  every  option  that  has  UTF8  in  its name (for example,
   1706        PCRE_UTF8), there are corresponding 16-bit and 32-bit names  with  UTF8
   1707        replaced by UTF16 or UTF32, respectively. This facility is in fact just
   1708        cosmetic; the 16-bit and 32-bit option names define the same  bit  val-
   1709        ues.
   1710 
   1711        References to bytes and UTF-8 in this document should be read as refer-
   1712        ences to 16-bit data units and UTF-16 when using the 16-bit library, or
   1713        32-bit  data  units  and  UTF-32  when using the 32-bit library, unless
   1714        specified otherwise.  More details of the specific differences for  the
   1715        16-bit and 32-bit libraries are given in the pcre16 and pcre32 pages.
   1716 
   1717 
   1718 PCRE API OVERVIEW
   1719 
   1720        PCRE has its own native API, which is described in this document. There
   1721        are also some wrapper functions (for the 8-bit library only) that  cor-
   1722        respond  to  the  POSIX  regular  expression  API, but they do not give
   1723        access to all the functionality. They are described  in  the  pcreposix
   1724        documentation.  Both  of these APIs define a set of C function calls. A
   1725        C++ wrapper (again for the 8-bit library only) is also distributed with
   1726        PCRE. It is documented in the pcrecpp page.
   1727 
   1728        The  native  API  C  function prototypes are defined in the header file
   1729        pcre.h, and on Unix-like systems the (8-bit) library itself  is  called
   1730        libpcre.  It  can  normally be accessed by adding -lpcre to the command
   1731        for linking an application that uses PCRE. The header file defines  the
   1732        macros PCRE_MAJOR and PCRE_MINOR to contain the major and minor release
   1733        numbers for the library. Applications can use these to include  support
   1734        for different releases of PCRE.
   1735 
   1736        In a Windows environment, if you want to statically link an application
   1737        program against a non-dll pcre.a  file,  you  must  define  PCRE_STATIC
   1738        before  including  pcre.h or pcrecpp.h, because otherwise the pcre_mal-
   1739        loc()   and   pcre_free()   exported   functions   will   be   declared
   1740        __declspec(dllimport), with unwanted results.
   1741 
   1742        The   functions   pcre_compile(),  pcre_compile2(),  pcre_study(),  and
   1743        pcre_exec() are used for compiling and matching regular expressions  in
   1744        a  Perl-compatible  manner. A sample program that demonstrates the sim-
   1745        plest way of using them is provided in the file  called  pcredemo.c  in
   1746        the PCRE source distribution. A listing of this program is given in the
   1747        pcredemo documentation, and the pcresample documentation describes  how
   1748        to compile and run it.
   1749 
   1750        Just-in-time  compiler  support is an optional feature of PCRE that can
   1751        be built in appropriate hardware environments. It greatly speeds up the
   1752        matching  performance  of  many  patterns.  Simple  programs can easily
   1753        request that it be used if available, by  setting  an  option  that  is
   1754        ignored  when  it is not relevant. More complicated programs might need
   1755        to    make    use    of    the    functions     pcre_jit_stack_alloc(),
   1756        pcre_jit_stack_free(),  and pcre_assign_jit_stack() in order to control
   1757        the JIT code's memory usage.
   1758 
   1759        From release 8.32 there is also a direct interface for  JIT  execution,
   1760        which  gives  improved performance. The JIT-specific functions are dis-
   1761        cussed in the pcrejit documentation.
   1762 
   1763        A second matching function, pcre_dfa_exec(), which is not Perl-compati-
   1764        ble,  is  also provided. This uses a different algorithm for the match-
   1765        ing. The alternative algorithm finds all possible matches (at  a  given
   1766        point  in  the  subject), and scans the subject just once (unless there
   1767        are lookbehind assertions). However, this  algorithm  does  not  return
   1768        captured  substrings.  A description of the two matching algorithms and
   1769        their advantages and disadvantages is given in the  pcrematching  docu-
   1770        mentation.
   1771 
   1772        In  addition  to  the  main compiling and matching functions, there are
   1773        convenience functions for extracting captured substrings from a subject
   1774        string that is matched by pcre_exec(). They are:
   1775 
   1776          pcre_copy_substring()
   1777          pcre_copy_named_substring()
   1778          pcre_get_substring()
   1779          pcre_get_named_substring()
   1780          pcre_get_substring_list()
   1781          pcre_get_stringnumber()
   1782          pcre_get_stringtable_entries()
   1783 
   1784        pcre_free_substring() and pcre_free_substring_list() are also provided,
   1785        to free the memory used for extracted strings.
   1786 
   1787        The function pcre_maketables() is used to  build  a  set  of  character
   1788        tables   in   the   current   locale  for  passing  to  pcre_compile(),
   1789        pcre_exec(), or pcre_dfa_exec(). This is an optional facility  that  is
   1790        provided  for  specialist  use.  Most  commonly,  no special tables are
   1791        passed, in which case internal tables that are generated when  PCRE  is
   1792        built are used.
   1793 
   1794        The  function  pcre_fullinfo()  is used to find out information about a
   1795        compiled pattern. The function pcre_version() returns a  pointer  to  a
   1796        string containing the version of PCRE and its date of release.
   1797 
   1798        The  function  pcre_refcount()  maintains  a  reference count in a data
   1799        block containing a compiled pattern. This is provided for  the  benefit
   1800        of object-oriented applications.
   1801 
   1802        The  global  variables  pcre_malloc and pcre_free initially contain the
   1803        entry points of the standard malloc()  and  free()  functions,  respec-
   1804        tively. PCRE calls the memory management functions via these variables,
   1805        so a calling program can replace them if it  wishes  to  intercept  the
   1806        calls. This should be done before calling any PCRE functions.
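
               As an illustration of intercepting the calls, the following C sketch
               defines a pair of functions that count live blocks. The names
               counting_malloc, counting_free, and live_blocks are hypothetical, and
               the counter is not protected against concurrent use. Only the function
               signatures are taken from the synopsis above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static size_t live_blocks = 0;      /* blocks obtained and not yet freed */

/* Same signature as the function pointed to by pcre_malloc. */
static void *counting_malloc(size_t size)
{
    void *block = malloc(size);
    if (block != NULL) live_blocks++;
    return block;
}

/* Same signature as the function pointed to by pcre_free. */
static void counting_free(void *block)
{
    if (block != NULL) {
        assert(live_blocks > 0);
        live_blocks--;
    }
    free(block);
}

/* In a program that includes <pcre.h>, the replacements would be
   installed before any other PCRE function is called:

     pcre_malloc = counting_malloc;
     pcre_free   = counting_free;
*/
```

               A non-zero value of live_blocks at program exit would then suggest
               that a compiled pattern or other PCRE block was not released.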
   1807 
   1808        The  global  variables  pcre_stack_malloc  and pcre_stack_free are also
   1809        indirections to memory management functions.  These  special  functions
   1810        are  used  only  when  PCRE is compiled to use the heap for remembering
   1811        data, instead of recursive function calls, when running the pcre_exec()
   1812        function.  See  the  pcrebuild  documentation  for details of how to do
   1813        this. It is a non-standard way of building PCRE, for  use  in  environ-
   1814        ments  that  have  limited stacks. Because of the greater use of memory
   1815        management, it runs more slowly. Separate  functions  are  provided  so
   1816        that  special-purpose  external  code  can  be used for this case. When
   1817        used, these functions are always called in a  stack-like  manner  (last
   1818        obtained,  first freed), and always for memory blocks of the same size.
   1819        There is a discussion about PCRE's stack usage in the  pcrestack  docu-
   1820        mentation.
   1821 
   1822        The global variable pcre_callout initially contains NULL. It can be set
   1823        by the caller to a "callout" function, which PCRE  will  then  call  at
   1824        specified  points during a matching operation. Details are given in the
   1825        pcrecallout documentation.
   1826 
   1827        The global variable pcre_stack_guard initially contains NULL. It can be
   1828        set  by  the  caller  to  a function that is called by PCRE whenever it
   1829        starts to compile a parenthesized part of a pattern.  When  parentheses
   1830        are nested, PCRE uses recursive function calls, which use up the system
   1831        stack. This function is provided so that applications  with  restricted
   1832        stacks  can  force a compilation error if the stack runs out. The func-
   1833        tion should return zero if all is well, or non-zero to force an error.
   1834 
   1835 
   1836 NEWLINES
   1837 
   1838        PCRE supports five different conventions for indicating line breaks  in
   1839        strings:  a  single  CR (carriage return) character, a single LF (line-
   1840        feed) character, the two-character sequence CRLF, any of the three pre-
   1841        ceding,  or any Unicode newline sequence. The Unicode newline sequences
   1842        are the three just mentioned, plus the single characters  VT  (vertical
   1843        tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
   1844        separator, U+2028), and PS (paragraph separator, U+2029).
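
               The list of sequences can be expressed as a small classification
               function. The following C sketch works on an array of code points and
               is independent of PCRE; the name newline_length and its parameters are
               hypothetical. It covers only the last two conventions ("any of the
               three preceding" and "any Unicode newline sequence").

```c
#include <assert.h>
#include <stddef.h>

/* Returns the number of code points that form a newline at cp[0], or 0
   if there is none. remaining is the number of code points available.
   any_unicode == 0 accepts CR, LF, and CRLF only; a non-zero value also
   accepts VT, FF, NEL, LS, and PS. */
static int newline_length(const unsigned int *cp, size_t remaining,
                          int any_unicode)
{
    assert(cp != NULL);
    if (remaining == 0) return 0;

    switch (cp[0]) {
    case 0x000D:                    /* CR, possibly followed by LF */
        return (remaining > 1 && cp[1] == 0x000A) ? 2 : 1;
    case 0x000A:                    /* LF */
        return 1;
    case 0x000B:                    /* VT */
    case 0x000C:                    /* FF */
    case 0x0085:                    /* NEL */
    case 0x2028:                    /* LS */
    case 0x2029:                    /* PS */
        return any_unicode ? 1 : 0;
    default:
        return 0;
    }
}
```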
   1845 
   1846        Each of the first three conventions is used by at least  one  operating
   1847        system  as its standard newline sequence. When PCRE is built, a default
   1848        can be specified. If none is given, the default is LF, which is the Unix
   1849        standard. When PCRE is run, the default can be overridden, either when a
   1850        pattern is compiled, or when it is matched.
   1851 
   1852        At compile time, the newline convention can be specified by the options
   1853        argument  of  pcre_compile(), or it can be specified by special text at
   1854        the start of the pattern itself; this overrides any other settings. See
   1855        the pcrepattern page for details of the special character sequences.
   1856 
   1857        In the PCRE documentation the word "newline" is used to mean "the char-
   1858        acter or pair of characters that indicate a line break". The choice  of
   1859        newline  convention  affects  the  handling of the dot, circumflex, and
   1860        dollar metacharacters, the handling of #-comments in /x mode, and, when
   1861        CRLF  is a recognized line ending sequence, the match position advance-
   1862        ment for a non-anchored pattern. There is more detail about this in the
   1863        section on pcre_exec() options below.
   1864 
   1865        The  choice of newline convention does not affect the interpretation of
   1866        the \n or \r escape sequences, nor does  it  affect  what  \R  matches,
   1867        which is controlled in a similar way, but by separate options.
   1868 
   1869 
   1870 MULTITHREADING
   1871 
   1872        The  PCRE  functions  can be used in multi-threading applications, with
   1873        the  proviso  that  the  memory  management  functions  pointed  to  by
   1874        pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
   1875        callout and stack-checking functions pointed  to  by  pcre_callout  and
   1876        pcre_stack_guard, are shared by all threads.
   1877 
   1878        The  compiled form of a regular expression is not altered during match-
   1879        ing, so the same compiled pattern can safely be used by several threads
   1880        at once.
   1881 
   1882        If  the just-in-time optimization feature is being used, it needs sepa-
   1883        rate memory stack areas for each thread. See the pcrejit  documentation
   1884        for more details.
   1885 
   1886 
   1887 SAVING PRECOMPILED PATTERNS FOR LATER USE
   1888 
   1889        The compiled form of a regular expression can be saved and re-used at a
   1890        later time, possibly by a different program, and even on a  host  other
   1891        than  the  one  on  which  it  was  compiled.  Details are given in the
   1892        pcreprecompile documentation,  which  includes  a  description  of  the
   1893        pcre_pattern_to_host_byte_order()  function. However, compiling a regu-
   1894        lar expression with one version of PCRE for use with a  different  ver-
   1895        sion is not guaranteed to work and may cause crashes.
   1896 
   1897 
   1898 CHECKING BUILD-TIME OPTIONS
   1899 
   1900        int pcre_config(int what, void *where);
   1901 
   1902        The  function pcre_config() makes it possible for a PCRE client to dis-
   1903        cover which optional features have been compiled into the PCRE library.
   1904        The  pcrebuild documentation has more details about these optional fea-
   1905        tures.
   1906 
   1907        The first argument for pcre_config() is an  integer,  specifying  which
   1908        information is required; the second argument is a pointer to a variable
   1909        into which the information is placed. The returned  value  is  zero  on
   1910        success,  or  the negative error code PCRE_ERROR_BADOPTION if the value
   1911        in the first argument is not recognized. The following  information  is
   1912        available:
   1913 
   1914          PCRE_CONFIG_UTF8
   1915 
   1916        The  output is an integer that is set to one if UTF-8 support is avail-
   1917        able; otherwise it is set to zero. This value should normally be  given
   1918        to the 8-bit version of this function, pcre_config(). If it is given to
   1919        the  16-bit  or  32-bit  version  of  this  function,  the  result   is
   1920        PCRE_ERROR_BADOPTION.
   1921 
   1922          PCRE_CONFIG_UTF16
   1923 
   1924        The output is an integer that is set to one if UTF-16 support is avail-
   1925        able; otherwise it is set to zero. This value should normally be  given
   1926        to the 16-bit version of this function, pcre16_config(). If it is given
   1927        to the 8-bit  or  32-bit  version  of  this  function,  the  result  is
   1928        PCRE_ERROR_BADOPTION.
   1929 
   1930          PCRE_CONFIG_UTF32
   1931 
   1932        The output is an integer that is set to one if UTF-32 support is avail-
   1933        able; otherwise it is set to zero. This value should normally be  given
   1934        to the 32-bit version of this function, pcre32_config(). If it is given
   1935        to the 8-bit  or  16-bit  version  of  this  function,  the  result  is
   1936        PCRE_ERROR_BADOPTION.
   1937 
   1938          PCRE_CONFIG_UNICODE_PROPERTIES
   1939 
   1940        The  output  is  an  integer  that is set to one if support for Unicode
   1941        character properties is available; otherwise it is set to zero.
   1942 
   1943          PCRE_CONFIG_JIT
   1944 
   1945        The output is an integer that is set to one if support for just-in-time
   1946        compiling is available; otherwise it is set to zero.
   1947 
   1948          PCRE_CONFIG_JITTARGET
   1949 
   1950        The  output is a pointer to a zero-terminated "const char *" string. If
   1951        JIT support is available, the string contains the name of the architec-
   1952        ture  for  which the JIT compiler is configured, for example "x86 32bit
   1953        (little endian + unaligned)". If JIT  support  is  not  available,  the
   1954        result is NULL.
   1955 
   1956          PCRE_CONFIG_NEWLINE
   1957 
   1958        The  output  is  an integer whose value specifies the default character
   1959        sequence that is recognized as meaning "newline". The values  that  are
   1960        supported in ASCII/Unicode environments are: 10 for LF, 13 for CR, 3338
   1961        for CRLF, -2 for ANYCRLF, and -1 for ANY. In EBCDIC  environments,  CR,
   1962        ANYCRLF,  and  ANY  yield the same values. However, the value for LF is
   1963        normally 21, though some EBCDIC environments use 37. The  corresponding
   1964        values  for  CRLF are 3349 and 3365. The default should normally corre-
   1965        spond to the standard sequence for your operating system.
   1966 
   1967          PCRE_CONFIG_BSR
   1968 
   1969        The output is an integer whose value indicates what character sequences
   1970        the  \R  escape sequence matches by default. A value of 0 means that \R
   1971        matches any Unicode line ending sequence; a value of 1  means  that  \R
   1972        matches only CR, LF, or CRLF. The default can be overridden when a pat-
   1973        tern is compiled or matched.
   1974 
   1975          PCRE_CONFIG_LINK_SIZE
   1976 
   1977        The output is an integer that contains the number  of  bytes  used  for
   1978        internal  linkage  in  compiled  regular  expressions.  For  the  8-bit
   1979        library, the value can be 2, 3, or 4. For the 16-bit library, the value
   1980        is  either  2  or  4  and  is  still  a number of bytes. For the 32-bit
   1981        library, the value is either 2 or 4 and is still a number of bytes. The
   1982        default value of 2 is sufficient for all but the most massive patterns,
   1983        since it allows the compiled pattern to be up to 64K  in  size.  Larger
   1984        values  allow larger regular expressions to be compiled, at the expense
   1985        of slower matching.
   1986 
   1987          PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
   1988 
   1989        The output is an integer that contains the threshold  above  which  the
   1990        POSIX  interface  uses malloc() for output vectors. Further details are
   1991        given in the pcreposix documentation.
   1992 
   1993          PCRE_CONFIG_PARENS_LIMIT
   1994 
   1995        The output is a long integer that gives the maximum depth of nesting of
   1996        parentheses  (of  any  kind) in a pattern. This limit is imposed to cap
   1997        the amount of system stack used when a pattern is compiled. It is spec-
   1998        ified  when PCRE is built; the default is 250. This limit does not take
   1999        into account the stack that may already be used by the calling applica-
   2000        tion.  For  finer  control  over compilation stack usage, you can set a
   2001        pointer to an external checking function in pcre_stack_guard.
   2002 
   2003          PCRE_CONFIG_MATCH_LIMIT
   2004 
   2005        The output is a long integer that gives the default limit for the  num-
   2006        ber  of  internal  matching  function calls in a pcre_exec() execution.
   2007        Further details are given with pcre_exec() below.
   2008 
   2009          PCRE_CONFIG_MATCH_LIMIT_RECURSION
   2010 
   2011        The output is a long integer that gives the default limit for the depth
   2012        of   recursion  when  calling  the  internal  matching  function  in  a
   2013        pcre_exec() execution.  Further  details  are  given  with  pcre_exec()
   2014        below.
   2015 
   2016          PCRE_CONFIG_STACKRECURSE
   2017 
   2018        The  output is an integer that is set to one if internal recursion when
   2019        running pcre_exec() is implemented by recursive function calls that use
   2020        the  stack  to remember their state. This is the usual way that PCRE is
   2021        compiled. The output is zero if PCRE was compiled to use blocks of data
   2022        on  the  heap  instead  of  recursive  function  calls.  In  this case,
   2023        pcre_stack_malloc and  pcre_stack_free  are  called  to  manage  memory
   2024        blocks on the heap, thus avoiding the use of the stack.
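
               The integer reported for PCRE_CONFIG_NEWLINE above is easier to read
               once decoded. The CRLF values quoted there (3338, 3349, and 3365) are
               all consistent with the CR code in the high byte and the LF code in the
               low byte. The sketch below shows that arithmetic and a decoder for
               ASCII/Unicode builds; crlf_config_value and newline_config_name are
               invented helper names, not part of the PCRE API.

```c
#include <stddef.h>

/* Packs a CR code and an LF code the way the quoted CRLF values suggest. */
int crlf_config_value(int cr, int lf)
{
    return (cr << 8) | lf;
}

/* Names the convention for a PCRE_CONFIG_NEWLINE value in an
   ASCII/Unicode environment, or returns NULL if the value is unknown. */
const char *newline_config_name(int value)
{
    switch (value) {
    case 10:   return "LF";
    case 13:   return "CR";
    case 3338: return "CRLF";
    case -2:   return "ANYCRLF";
    case -1:   return "ANY";
    default:   return NULL;
    }
}
```

               In a real program the value would come from a call such as
               pcre_config(PCRE_CONFIG_NEWLINE, &value), with value declared as an
               int, and the return code checked for PCRE_ERROR_BADOPTION.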
   2025 
   2026 
   2027 COMPILING A PATTERN
   2028 
   2029        pcre *pcre_compile(const char *pattern, int options,
   2030             const char **errptr, int *erroffset,
   2031             const unsigned char *tableptr);
   2032 
   2033        pcre *pcre_compile2(const char *pattern, int options,
   2034             int *errorcodeptr,
   2035             const char **errptr, int *erroffset,
   2036             const unsigned char *tableptr);
   2037 
   2038        Either of the functions pcre_compile() or pcre_compile2() can be called
   2039        to compile a pattern into an internal form. The only difference between
   2040        the  two interfaces is that pcre_compile2() has an additional argument,
   2041        errorcodeptr, via which a numerical error  code  can  be  returned.  To
   2042        avoid  too  much repetition, we refer just to pcre_compile() below, but
   2043        the information applies equally to pcre_compile2().
   2044 
   2045        The pattern is a C string terminated by a binary zero, and is passed in
   2046        the  pattern  argument.  A  pointer to a single block of memory that is
   2047        obtained via pcre_malloc is returned. This contains the  compiled  code
   2048        and related data. The pcre type is defined for the returned block; this
   2049        is a typedef for a structure whose contents are not externally defined.
   2050        It is up to the caller to free the memory (via pcre_free) when it is no
   2051        longer required.
   2052 
   2053        Although the compiled code of a PCRE regex is relocatable, that is,  it
   2054        does not depend on memory location, the complete pcre data block is not
   2055        fully relocatable, because it may contain a copy of the tableptr  argu-
   2056        ment, which is an address (see below).
   2057 
   2058        The options argument contains various bit settings that affect the com-
   2059        pilation. It should be zero if no options are required.  The  available
   2060        options  are  described  below. Some of them (in particular, those that
   2061        are compatible with Perl, but some others as well) can also be set  and
   2062        unset  from  within  the  pattern  (see the detailed description in the
   2063        pcrepattern documentation). For those options that can be different  in
   2064        different  parts  of  the pattern, the contents of the options argument
   2065        specify their settings at the start of compilation and execution.  The
   2066        PCRE_ANCHORED,  PCRE_BSR_xxx, PCRE_NEWLINE_xxx, PCRE_NO_UTF8_CHECK, and
   2067        PCRE_NO_START_OPTIMIZE options can be set at the time  of  matching  as
   2068        well as at compile time.
   2069 
   2070        If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
   2071        if compilation of a pattern fails,  pcre_compile()  returns  NULL,  and
   2072        sets the variable pointed to by errptr to point to a textual error mes-
   2073        sage. This is a static string that is part of the library. You must not
   2074        try  to  free it. Normally, the offset from the start of the pattern to
   2075        the data unit that was being processed when the error was discovered is
   2076        placed  in the variable pointed to by erroffset, which must not be NULL
   2077        (if it is, an immediate error is given). However, for an invalid  UTF-8
   2078        or  UTF-16  string,  the  offset  is that of the first data unit of the
   2079        failing character.
   2080 
   2081        Some errors are not detected until the whole pattern has been  scanned;
   2082        in  these  cases,  the offset passed back is the length of the pattern.
   2083        Note that the offset is in data units, not characters, even  in  a  UTF
   2084        mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
   2085        acter.
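
               A common use of the offset is to mark the failing position when the
               error is reported. The helper below is a hypothetical sketch
               (mark_error_offset is not a PCRE function). It assumes the 8-bit
               library, so that one data unit is one byte, and a fixed-width display;
               in a UTF mode the caret may not line up with the character as shown.

```c
#include <stddef.h>
#include <string.h>

/* Writes the pattern, a newline, and a caret placed under the data unit
   at erroffset into buf. Returns 0 on success or -1 if buf is too small. */
int mark_error_offset(char *buf, size_t bufsize,
                      const char *pattern, int erroffset)
{
    size_t plen = strlen(pattern);
    size_t off = erroffset < 0 ? 0 : (size_t)erroffset;

    if (off > plen)
        off = plen;              /* the offset can be the pattern length */
    if (plen + off + 3 > bufsize)
        return -1;               /* pattern + '\n' + spaces + '^' + NUL */

    memcpy(buf, pattern, plen);
    buf[plen] = '\n';
    memset(buf + plen + 1, ' ', off);
    buf[plen + 1 + off] = '^';
    buf[plen + 2 + off] = '\0';
    return 0;
}
```

               The result can be printed after the message that pcre_compile() stores
               via errptr, so that the reader sees both the reason and the place.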
   2086 
   2087        If pcre_compile2() is used instead of pcre_compile(),  and  the  error-
   2088        codeptr  argument is not NULL, a non-zero error code number is returned
   2089        via this argument in the event of an error. This is in addition to  the
   2090        textual error message. Error codes and messages are listed below.
   2091 
   2092        If  the  final  argument, tableptr, is NULL, PCRE uses a default set of
   2093        character tables that are  built  when  PCRE  is  compiled,  using  the
   2094        default  C  locale.  Otherwise, tableptr must be an address that is the
   2095        result of a call to pcre_maketables(). This value is  stored  with  the
   2096        compiled  pattern,  and  used  again by pcre_exec() and pcre_dfa_exec()
   2097        when the pattern is matched. For more discussion, see  the  section  on
   2098        locale support below.
   2099 
   2100        This  code  fragment  shows a typical straightforward call to pcre_com-
   2101        pile():
   2102 
   2103          pcre *re;
   2104          const char *error;
   2105          int erroffset;
   2106          re = pcre_compile(
   2107            "^A.*Z",          /* the pattern */
   2108            0,                /* default options */
   2109            &error,           /* for error message */
   2110            &erroffset,       /* for error offset */
   2111            NULL);            /* use default character tables */
   2112 
   2113        The following names for option bits are defined in  the  pcre.h  header
   2114        file:
   2115 
   2116          PCRE_ANCHORED
   2117 
   2118        If this bit is set, the pattern is forced to be "anchored", that is, it
   2119        is constrained to match only at the first matching point in the  string
   2120        that  is being searched (the "subject string"). This effect can also be
   2121        achieved by appropriate constructs in the pattern itself, which is  the
   2122        only way to do it in Perl.
   2123 
   2124          PCRE_AUTO_CALLOUT
   2125 
   2126        If this bit is set, pcre_compile() automatically inserts callout items,
   2127        all with number 255, before each pattern item. For  discussion  of  the
   2128        callout facility, see the pcrecallout documentation.
   2129 
   2130          PCRE_BSR_ANYCRLF
   2131          PCRE_BSR_UNICODE
   2132 
   2133        These options (which are mutually exclusive) control what the \R escape
   2134        sequence matches. The choice is either to match only CR, LF,  or  CRLF,
   2135        or to match any Unicode newline sequence. The default is specified when
   2136        PCRE is built. It can be overridden from within the pattern, or by set-
   2137        ting an option when a compiled pattern is matched.
   2138 
   2139          PCRE_CASELESS
   2140 
   2141        If  this  bit is set, letters in the pattern match both upper and lower
   2142        case letters. It is equivalent to Perl's  /i  option,  and  it  can  be
   2143        changed  within a pattern by a (?i) option setting. In UTF-8 mode, PCRE
   2144        always understands the concept of case for characters whose values  are
   2145        less  than 128, so caseless matching is always possible. For characters
   2146        with higher values, the concept of case is supported if  PCRE  is  com-
   2147        piled  with Unicode property support, but not otherwise. If you want to
   2148        use caseless matching for characters 128 and  above,  you  must  ensure
   2149        that  PCRE  is  compiled  with Unicode property support as well as with
   2150        UTF-8 support.
   2151 
   2152          PCRE_DOLLAR_ENDONLY
   2153 
   2154        If this bit is set, a dollar metacharacter in the pattern matches  only
   2155        at  the  end  of the subject string. Without this option, a dollar also
   2156        matches immediately before a newline at the end of the string (but  not
   2157        before  any  other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
   2158        if PCRE_MULTILINE is set.  There is no equivalent  to  this  option  in
   2159        Perl, and no way to set it within a pattern.
   2160 
   2161          PCRE_DOTALL
   2162 
   2163        If  this bit is set, a dot metacharacter in the pattern matches a char-
   2164        acter of any value, including one that indicates a newline. However, it
   2165        only  ever  matches  one character, even if newlines are coded as CRLF.
   2166        Without this option, a dot does not match when the current position  is
   2167        at a newline. This option is equivalent to Perl's /s option, and it can
   2168        be changed within a pattern by a (?s) option setting. A negative  class
   2169        such as [^a] always matches newline characters, independent of the set-
   2170        ting of this option.
   2171 
   2172          PCRE_DUPNAMES
   2173 
   2174        If this bit is set, names used to identify capturing  subpatterns  need
   2175        not be unique. This can be helpful for certain types of pattern when it
   2176        is known that only one instance of the named  subpattern  can  ever  be
   2177        matched.  There  are  more details of named subpatterns below; see also
   2178        the pcrepattern documentation.
   2179 
   2180          PCRE_EXTENDED
   2181 
   2182        If this bit is set, most white space  characters  in  the  pattern  are
   2183        totally  ignored  except when escaped or inside a character class. How-
   2184        ever, white space is not allowed within  sequences  such  as  (?>  that
   2185        introduce  various  parenthesized  subpatterns,  nor within a numerical
   2186        quantifier such as {1,3}.  However, ignorable white space is  permitted
   2187        between an item and a following quantifier and between a quantifier and
   2188        a following + that indicates possessiveness.
   2189 
   2190        White space formerly did not include the VT character (code 11), because
   2191        Perl did not treat this character as white space. However, Perl changed
   2192        at release 5.18, so PCRE followed  at  release  8.34,  and  VT  is  now
   2193        treated as white space.
   2194 
   2195        PCRE_EXTENDED  also  causes characters between an unescaped # outside a
   2196        character class  and  the  next  newline,  inclusive,  to  be  ignored.
   2197        PCRE_EXTENDED  is equivalent to Perl's /x option, and it can be changed
   2198        within a pattern by a (?x) option setting.
   2199 
   2200        Which characters are interpreted  as  newlines  is  controlled  by  the
   2201        options  passed to pcre_compile() or by a special sequence at the start
   2202        of the pattern, as described in the section entitled  "Newline  conven-
   2203        tions" in the pcrepattern documentation. Note that the end of this type
   2204        of comment is  a  literal  newline  sequence  in  the  pattern;  escape
   2205        sequences that happen to represent a newline do not count.
   2206 
   2207        This  option  makes  it possible to include comments inside complicated
   2208        patterns.  Note, however, that this applies only  to  data  characters.
   2209        White  space  characters  may  never  appear  within  special character
   2210        sequences in a pattern, for example within the sequence (?( that intro-
   2211        duces a conditional subpattern.
   2212 
   2213          PCRE_EXTRA
   2214 
   2215        This  option  was invented in order to turn on additional functionality
   2216        of PCRE that is incompatible with Perl, but it  is  currently  of  very
   2217        little  use. When set, any backslash in a pattern that is followed by a
   2218        letter that has no special meaning  causes  an  error,  thus  reserving
   2219        these  combinations  for  future  expansion.  By default, as in Perl, a
   2220        backslash followed by a letter with no special meaning is treated as  a
   2221        literal. (Perl can, however, be persuaded to give a warning for this, by
   2222        running it with the -w option.) There are at present no other  features
   2223        controlled  by this option. It can also be set by a (?X) option setting
   2224        within a pattern.
   2225 
   2226          PCRE_FIRSTLINE
   2227 
   2228        If this option is set, an  unanchored  pattern  is  required  to  match
   2229        before  or  at  the  first  newline  in  the subject string, though the
   2230        matched text may continue over the newline.
   2231 
   2232          PCRE_JAVASCRIPT_COMPAT
   2233 
   2234        If this option is set, PCRE's behaviour is changed in some ways so that
   2235        it  is  compatible with JavaScript rather than Perl. The changes are as
   2236        follows:
   2237 
   2238        (1) A lone closing square bracket in a pattern  causes  a  compile-time
   2239        error,  because this is illegal in JavaScript (by default it is treated
   2240        as a data character). Thus, the pattern AB]CD becomes illegal when this
   2241        option is set.
   2242 
   2243        (2)  At run time, a back reference to an unset subpattern group matches
   2244        an empty string (by default this causes the current  matching  alterna-
   2245        tive  to  fail). A pattern such as (\1)(a) succeeds when this option is
   2246        set (assuming it can find an "a" in the subject), whereas it  fails  by
   2247        default, for Perl compatibility.
   2248 
   2249        (3) \U matches an upper case "U" character; by default \U causes a com-
   2250        pile time error (Perl uses \U to upper case subsequent characters).
   2251 
   2252        (4) \u matches a lower case "u" character unless it is followed by four
   2253        hexadecimal  digits,  in  which case the hexadecimal number defines the
   2254        code point to match. By default, \u causes a compile time  error  (Perl
   2255        uses it to upper case the following character).
   2256 
   2257        (5)  \x matches a lower case "x" character unless it is followed by two
   2258        hexadecimal digits, in which case the hexadecimal  number  defines  the
   2259        code  point  to  match. By default, as in Perl, a hexadecimal number is
   2260        always expected after \x, but it may have zero, one, or two digits (so,
   2261        for example, \xz matches a binary zero character followed by z).
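
               The difference described in (5) can be made concrete with a small
               parser for what follows \x. This is a sketch of the rule as stated, not
               PCRE's own parser: x_escape_value and hex_value are invented names, the
               sketch assumes ASCII, and it deliberately ignores the braced \x{...}
               form.

```c
#include <ctype.h>
#include <stddef.h>

static int hex_value(unsigned char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return tolower(c) - 'a' + 10;
}

/* rest points just past "\x". Returns the code point that the escape
   stands for and stores in *digits_used how many characters it consumed. */
unsigned int x_escape_value(const char *rest, int javascript_compat,
                            size_t *digits_used)
{
    size_t n = 0;
    unsigned int value = 0;

    /* Read at most two hexadecimal digits. */
    while (n < 2 && isxdigit((unsigned char)rest[n])) {
        value = value * 16 + (unsigned int)hex_value((unsigned char)rest[n]);
        n++;
    }
    if (javascript_compat && n < 2) {
        *digits_used = 0;        /* not two digits: \x is a literal "x" */
        return 'x';
    }
    *digits_used = n;            /* default: zero, one, or two digits */
    return value;
}
```

               For the input "z", the default rule yields a binary zero and consumes
               nothing, leaving "z" as the next character, while the JavaScript rule
               yields a literal "x".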
   2262 
   2263          PCRE_MULTILINE
   2264 
   2265        By  default,  for  the purposes of matching "start of line" and "end of
   2266        line", PCRE treats the subject string as consisting of a single line of
   2267        characters,  even if it actually contains newlines. The "start of line"
   2268        metacharacter (^) matches only at the start of the string, and the "end
   2269        of  line"  metacharacter  ($) matches only at the end of the string, or
   2270        before a terminating newline (except when PCRE_DOLLAR_ENDONLY is  set).
   2271        Note,  however,  that  unless  PCRE_DOTALL  is set, the "any character"
   2272        metacharacter (.) does not match at a newline. This behaviour  (for  ^,
   2273        $, and dot) is the same as in Perl.
   2274 
   2275        When  PCRE_MULTILINE  is  set,  the  "start of line" and "end of line"
   2276        constructs match immediately following or immediately  before  internal
   2277        newlines  in  the  subject string, respectively, as well as at the very
   2278        start and end. This is equivalent to Perl's /m option, and  it  can  be
   2279        changed within a pattern by a (?m) option setting. If there are no new-
   2280        lines in a subject string, or no occurrences of ^ or $  in  a  pattern,
   2281        setting PCRE_MULTILINE has no effect.
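
               The positions at which ^ and $ can match, with and without
               PCRE_MULTILINE and PCRE_DOLLAR_ENDONLY, can be summarized as two
               predicates. This is a sketch of the rules as described, not PCRE's
               matching code: caret_matches_at and dollar_matches_at are invented
               names, and the sketch assumes that LF is the only newline sequence.

```c
#include <stddef.h>

/* Can ^ match at offset pos of subject[0..len)? */
int caret_matches_at(const char *subject, size_t len, size_t pos,
                     int multiline)
{
    if (pos == 0)
        return 1;                                /* very start */
    /* Multiline: immediately after an internal newline only. */
    return multiline && pos < len && subject[pos - 1] == '\n';
}

/* Can $ match at offset pos of subject[0..len)? */
int dollar_matches_at(const char *subject, size_t len, size_t pos,
                      int multiline, int dollar_endonly)
{
    if (pos == len)
        return 1;                                /* very end */
    if (pos > len || subject[pos] != '\n')
        return 0;
    if (multiline)
        return 1;                                /* before any newline */
    if (dollar_endonly)
        return 0;                                /* end of subject only */
    return pos + 1 == len;                       /* before a final newline */
}
```

               Because the multiline test comes first, dollar_endonly has no effect
               when multiline is set, which mirrors the statement that
               PCRE_DOLLAR_ENDONLY is ignored if PCRE_MULTILINE is set.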
   2282 
   2283          PCRE_NEVER_UTF
   2284 
   2285        This option locks out interpretation of the pattern as UTF-8 (or UTF-16
   2286        or UTF-32 in the 16-bit and 32-bit libraries). In particular,  it  pre-
   2287        vents  the  creator of the pattern from switching to UTF interpretation
   2288        by starting the pattern with (*UTF). This may be useful in applications
   2289        that  process  patterns  from  external  sources.  The  combination  of
   2290        PCRE_UTF8 and PCRE_NEVER_UTF also causes an error.
   2291 
   2292          PCRE_NEWLINE_CR
   2293          PCRE_NEWLINE_LF
   2294          PCRE_NEWLINE_CRLF
   2295          PCRE_NEWLINE_ANYCRLF
   2296          PCRE_NEWLINE_ANY
   2297 
   2298        These options override the default newline definition that  was  chosen
   2299        when  PCRE  was built. Setting the first or the second specifies that a
   2300        newline is indicated by a single character (CR  or  LF,  respectively).
   2301        Setting  PCRE_NEWLINE_CRLF specifies that a newline is indicated by the
   2302        two-character CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF  specifies
   2303        that any of the three preceding sequences should be recognized. Setting
   2304        PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should  be
   2305        recognized.
   2306 
   2307        In  an ASCII/Unicode environment, the Unicode newline sequences are the
   2308        three just mentioned, plus the  single  characters  VT  (vertical  tab,
   2309        U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line sep-
   2310        arator, U+2028), and PS (paragraph separator, U+2029).  For  the  8-bit
   2311        library, the last two are recognized only in UTF-8 mode.
   2312 
   2313        When  PCRE is compiled to run in an EBCDIC (mainframe) environment, the
   2314        code for CR is 0x0d, the same as ASCII. However, the character code for
   2315        LF  is  normally 0x15, though in some EBCDIC environments 0x25 is used.
   2316        Whichever of these is not LF is made to  correspond  to  Unicode's  NEL
   2317        character.  EBCDIC  codes  are all less than 256. For more details, see
   2318        the pcrebuild documentation.
   2319 
   2320        The newline setting in the  options  word  uses  three  bits  that  are
   2321        treated as a number, giving eight possibilities. Currently only six are
   2322        used (default plus the five values above). This means that if  you  set
   2323        more  than one newline option, the combination may or may not be sensi-
   2324        ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
   2325        PCRE_NEWLINE_CRLF,  but other combinations may yield unused numbers and
   2326        cause an error.
   2327 
   2328        The only time that a line break in a pattern  is  specially  recognized
   2329        when  compiling is when PCRE_EXTENDED is set. CR and LF are white space
   2330        characters, and so are ignored in this mode. Also, an unescaped #  out-
   2331        side  a  character class indicates a comment that lasts until after the
   2332        next line break sequence. In other circumstances, line break  sequences
   2333        in patterns are treated as literal data.
   2334 
   2335        The newline option that is set at compile time becomes the default that
   2336        is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
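
               The three-bit newline field described above can be inspected with a
               little bit arithmetic. In the sketch below the SKETCH_ constants are
               believed to mirror the PCRE_NEWLINE_xxx values in pcre.h, but that is
               an assumption; they are repeated under invented names only to keep the
               example self-contained, and real code should include pcre.h and use the
               macros themselves.

```c
/* Assumed copies of the PCRE_NEWLINE_xxx option values (see the note above). */
#define SKETCH_NEWLINE_CR       0x00100000
#define SKETCH_NEWLINE_LF       0x00200000
#define SKETCH_NEWLINE_CRLF     0x00300000
#define SKETCH_NEWLINE_ANY      0x00400000
#define SKETCH_NEWLINE_ANYCRLF  0x00500000
#define SKETCH_NEWLINE_MASK     0x00700000   /* the three-bit field */

/* The field as a number: 0 means "use the default", 1 to 5 are the five
   conventions, and 6 and 7 are the unused possibilities. */
int newline_field(int options)
{
    return (options & SKETCH_NEWLINE_MASK) >> 20;
}

/* Non-zero if the combination of newline bits is one of the six in use. */
int newline_field_is_valid(int options)
{
    return newline_field(options) <= 5;
}
```

               With these values, CR combined with LF gives the CRLF setting, as
               stated above, whereas LF combined with ANYCRLF gives 7, which is one of
               the unused numbers.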
   2337 
   2338          PCRE_NO_AUTO_CAPTURE
   2339 
   2340        If this option is set, it disables the use of numbered capturing paren-
   2341        theses  in the pattern. Any opening parenthesis that is not followed by
   2342        ? behaves as if it were followed by ?: but named parentheses can  still
   2343        be  used  for  capturing  (and  they acquire numbers in the usual way).
   2344        There is no equivalent of this option in Perl.
   2345 
   2346          PCRE_NO_AUTO_POSSESS
   2347 
   2348        If this option is set, it disables "auto-possessification". This is  an
   2349        optimization  that,  for example, turns a+b into a++b in order to avoid
   2350        backtracks into a+ that can never be successful. However,  if  callouts
   2351        are  in  use,  auto-possessification  means that some of them are never
   2352        taken. You can set this option if you want the matching functions to do
   2353        a  full  unoptimized  search and run all the callouts, but it is mainly
   2354        provided for testing purposes.
   2355 
   2356          PCRE_NO_START_OPTIMIZE
   2357 
   2358        This is an option that acts at matching time; that is, it is really  an
   2359        option  for  pcre_exec()  or  pcre_dfa_exec().  If it is set at compile
   2360        time, it is remembered with the compiled pattern and assumed at  match-
   2361        ing  time.  This is necessary if you want to use JIT execution, because
   2362        the JIT compiler needs to know whether or not this option is  set.  For
   2363        details see the discussion of PCRE_NO_START_OPTIMIZE below.
   2364 
   2365          PCRE_UCP
   2366 
   2367        This  option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W,
   2368        \w, and some of the POSIX character classes.  By  default,  only  ASCII
   2369        characters  are  recognized, but if PCRE_UCP is set, Unicode properties
   2370        are used instead to classify characters. More details are given in  the
   2371        section  on generic character types in the pcrepattern page. If you set
   2372        PCRE_UCP, matching one of the items it affects takes much  longer.  The
   2373        option  is  available only if PCRE has been compiled with Unicode prop-
   2374        erty support.
   2375 
   2376          PCRE_UNGREEDY
   2377 
   2378        This option inverts the "greediness" of the quantifiers  so  that  they
   2379        are  not greedy by default, but become greedy if followed by "?". It is
   2380        not compatible with Perl. It can also be set by a (?U)  option  setting
   2381        within the pattern.
   2382 
   2383          PCRE_UTF8
   2384 
   2385        This  option  causes PCRE to regard both the pattern and the subject as
   2386        strings of UTF-8 characters instead of single-byte strings. However, it
   2387        is  available  only  when PCRE is built to include UTF support. If not,
   2388        the use of this option provokes an error. Details of  how  this  option
   2389        changes the behaviour of PCRE are given in the pcreunicode page.
   2390 
   2391          PCRE_NO_UTF8_CHECK
   2392 
   2393        When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
   2394        automatically checked. There is a  discussion  about  the  validity  of
   2395        UTF-8  strings in the pcreunicode page. If an invalid UTF-8 sequence is
   2396        found, pcre_compile() returns an error. If you already know  that  your
   2397        pattern  is valid, and you want to skip this check for performance rea-
   2398        sons, you can set the PCRE_NO_UTF8_CHECK option.  When it is  set,  the
   2399        effect of passing an invalid UTF-8 string as a pattern is undefined. It
   2400        may cause your program to crash or loop. Note that this option can also
   2401        be  passed to pcre_exec() and pcre_dfa_exec(), to suppress the validity
   2402        checking of subject strings only. If the same string is  being  matched
   2403        many  times, the option can be safely set for the second and subsequent
   2404        matchings to improve performance.
   2405 
   2406 
   2407 COMPILATION ERROR CODES
   2408 
   2409        The following table lists the error  codes  that  may  be  returned  by
   2410        pcre_compile2(),  along with the error messages that may be returned by
   2411        both compiling functions. Note that error  messages  are  always  8-bit
   2412        ASCII  strings,  even  in 16-bit or 32-bit mode. As PCRE has developed,
   2413        some error codes have fallen out of use. To avoid confusion, they  have
   2414        not been re-used.
   2415 
   2416           0  no error
   2417           1  \ at end of pattern
   2418           2  \c at end of pattern
   2419           3  unrecognized character follows \
   2420           4  numbers out of order in {} quantifier
   2421           5  number too big in {} quantifier
   2422           6  missing terminating ] for character class
   2423           7  invalid escape sequence in character class
   2424           8  range out of order in character class
   2425           9  nothing to repeat
   2426          10  [this code is not in use]
   2427          11  internal error: unexpected repeat
   2428          12  unrecognized character after (? or (?-
   2429          13  POSIX named classes are supported only within a class
   2430          14  missing )
   2431          15  reference to non-existent subpattern
   2432          16  erroffset passed as NULL
   2433          17  unknown option bit(s) set
   2434          18  missing ) after comment
   2435          19  [this code is not in use]
   2436          20  regular expression is too large
   2437          21  failed to get memory
   2438          22  unmatched parentheses
   2439          23  internal error: code overflow
   2440          24  unrecognized character after (?<
   2441          25  lookbehind assertion is not fixed length
   2442          26  malformed number or name after (?(
   2443          27  conditional group contains more than two branches
   2444          28  assertion expected after (?(
   2445          29  (?R or (?[+-]digits must be followed by )
   2446          30  unknown POSIX class name
   2447          31  POSIX collating elements are not supported
   2448          32  this version of PCRE is compiled without UTF support
   2449          33  [this code is not in use]
   2450          34  character value in \x{} or \o{} is too large
   2451          35  invalid condition (?(0)
   2452          36  \C not allowed in lookbehind assertion
   2453          37  PCRE does not support \L, \l, \N{name}, \U, or \u
   2454          38  number after (?C is > 255
   2455          39  closing ) for (?C expected
   2456          40  recursive call could loop indefinitely
   2457          41  unrecognized character after (?P
   2458          42  syntax error in subpattern name (missing terminator)
   2459          43  two named subpatterns have the same name
   2460          44  invalid UTF-8 string (specifically UTF-8)
   2461          45  support for \P, \p, and \X has not been compiled
   2462          46  malformed \P or \p sequence
   2463          47  unknown property name after \P or \p
   2464          48  subpattern name is too long (maximum 32 characters)
   2465          49  too many named subpatterns (maximum 10000)
   2466          50  [this code is not in use]
   2467          51  octal value is greater than \377 in 8-bit non-UTF-8 mode
   2468          52  internal error: overran compiling workspace
   2469          53  internal error: previously-checked referenced subpattern
   2470                not found
   2471          54  DEFINE group contains more than one branch
   2472          55  repeating a DEFINE group is not allowed
   2473          56  inconsistent NEWLINE options
   2474          57  \g is not followed by a braced, angle-bracketed, or quoted
   2475                name/number or by a plain number
   2476          58  a numbered reference must not be zero
   2477          59  an argument is not allowed for (*ACCEPT), (*FAIL), or (*COMMIT)
   2478          60  (*VERB) not recognized or malformed
   2479          61  number is too big
   2480          62  subpattern name expected
   2481          63  digit expected after (?+
   2482          64  ] is an invalid data character in JavaScript compatibility mode
   2483          65  different names for subpatterns of the same number are
   2484                not allowed
   2485          66  (*MARK) must have an argument
   2486          67  this version of PCRE is not compiled with Unicode property
   2487                support
   2488          68  \c must be followed by an ASCII character
   2489          69  \k is not followed by a braced, angle-bracketed, or quoted name
   2490          70  internal error: unknown opcode in find_fixedlength()
   2491          71  \N is not supported in a class
   2492          72  too many forward references
   2493          73  disallowed Unicode code point (>= 0xd800 && <= 0xdfff)
   2494          74  invalid UTF-16 string (specifically UTF-16)
   2495          75  name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)
   2496          76  character value in \u.... sequence is too large
   2497          77  invalid UTF-32 string (specifically UTF-32)
   2498          78  setting UTF is disabled by the application
   2499          79  non-hex character in \x{} (closing brace missing?)
   2500          80  non-octal character in \o{} (closing brace missing?)
   2501          81  missing opening brace after \o
   2502          82  parentheses are too deeply nested
   2503          83  invalid range in character class
   2504          84  group name must start with a non-digit
   2505          85  parentheses are too deeply nested (stack check)
   2506 
   2507        The  numbers  32  and 10000 in errors 48 and 49 are defaults; different
   2508        values may be used if the limits were changed when PCRE was built.
   2509 
   2510 
   2511 STUDYING A PATTERN
   2512 
   2513        pcre_extra *pcre_study(const pcre *code, int options,
   2514             const char **errptr);
   2515 
   2516        If a compiled pattern is going to be used several times,  it  is  worth
   2517        spending more time analyzing it in order to reduce the time  taken  for
   2518        matching. The function pcre_study() takes a pointer to a compiled  pat-
   2519        tern as its first argument. If studying the pattern produces additional
   2520        information that will help speed up matching,  pcre_study()  returns  a
   2521        pointer  to a pcre_extra block, in which the study_data field points to
   2522        the results of the study.
   2523 
   2524        The  returned  value  from  pcre_study()  can  be  passed  directly  to
   2525        pcre_exec()  or  pcre_dfa_exec(). However, a pcre_extra block also con-
   2526        tains other fields that can be set by the caller before  the  block  is
   2527        passed; these are described below in the section on matching a pattern.
   2528 
   2529        If  studying  the  pattern  does  not  produce  any useful information,
   2530        pcre_study() returns NULL by default.  In  that  circumstance,  if  the
   2531        calling program wants to pass any of the other fields to pcre_exec() or
   2532        pcre_dfa_exec(), it must set up its own pcre_extra block.  However,  if
   2533        pcre_study()  is  called  with  the  PCRE_STUDY_EXTRA_NEEDED option, it
   2534        returns a pcre_extra block even if studying did not find any additional
   2535        information.  It  may still return NULL, however, if an error occurs in
   2536        pcre_study().
   2537 
   2538        The second argument of pcre_study() contains  option  bits.  There  are
   2539        three further options in addition to PCRE_STUDY_EXTRA_NEEDED:
   2540 
   2541          PCRE_STUDY_JIT_COMPILE
   2542          PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
   2543          PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
   2544 
   2545        If  any  of  these are set, and the just-in-time compiler is available,
   2546        the pattern is further compiled into machine code  that  executes  much
   2547        faster  than  the  pcre_exec()  interpretive  matching function. If the
   2548        just-in-time compiler is not available, these options are ignored.  All
   2549        undefined bits in the options argument must be zero.
   2550 
   2551        JIT  compilation  is  a heavyweight optimization. It can take some time
   2552        for patterns to be analyzed, and for one-off matches  and  simple  pat-
   2553        terns  the benefit of faster execution might be offset by a much slower
   2554        study time.  Not all patterns can be optimized by the JIT compiler. For
   2555        those  that cannot be handled, matching automatically falls back to the
   2556        pcre_exec() interpreter. For more details, see the  pcrejit  documenta-
   2557        tion.
   2558 
   2559        The  third argument for pcre_study() is a pointer for an error message.
   2560        If studying succeeds (even if no data is  returned),  the  variable  it
   2561        points  to  is  set  to NULL. Otherwise it is set to point to a textual
   2562        error message. This is a static string that is part of the library. You
   2563        must  not  try  to  free it. You should test the error pointer for NULL
   2564        after calling pcre_study(), to be sure that it has run successfully.
   2565 
   2566        When you are finished with a pattern, you can free the memory used  for
   2567        the study data by calling pcre_free_study(). This function was added to
   2568        the API for release 8.20. For earlier versions,  the  memory  could  be
   2569        freed  with  pcre_free(), just like the pattern itself. This will still
   2570        work in cases where JIT optimization is not used, but it  is  advisable
   2571        to change to the new function when convenient.
   2572 
   2573        This  is  a typical way in which pcre_study() is used (except that in a
   2574        real application there should be tests for errors):
   2575 
                 int rc;
                 int erroroffset;
                 int ovector[30];
                 const char *error;
                 pcre *re;
                 pcre_extra *sd;
                 re = pcre_compile("pattern", 0, &error, &erroroffset, NULL);
                 sd = pcre_study(
                   re,             /* result of pcre_compile() */
                   0,              /* no options */
                   &error);        /* set to NULL or points to a message */
                 rc = pcre_exec(   /* see below for details of pcre_exec() options */
                   re, sd, "subject", 7, 0, 0, ovector, 30);
                 ...
                 pcre_free_study(sd);
                 pcre_free(re);
   2589 
   2590        Studying a pattern does two things: first, a lower bound for the length
   2591        of subject string that is needed to match the pattern is computed. This
   2592        does not mean that there are any strings of that length that match, but
   2593        it  does  guarantee that no shorter strings match. The value is used to
   2594        avoid wasting time by trying to match strings that are shorter than the
   2595        lower  bound.  You  can find out the value in a calling program via the
   2596        pcre_fullinfo() function.
   2597 
   2598        Studying a pattern is also useful for non-anchored patterns that do not
   2599        have  a  single fixed starting character. A bitmap of possible starting
   2600        bytes is created. This speeds up finding a position in the  subject  at
   2601        which to start matching. (In 16-bit mode, the bitmap is used for 16-bit
   2602        values less than 256.  In 32-bit mode, the bitmap is  used  for  32-bit
   2603        values less than 256.)
   2604 
   2605        These  two optimizations apply to both pcre_exec() and pcre_dfa_exec(),
   2606        and the information is also used by the JIT  compiler.   The  optimiza-
   2607        tions  can  be  disabled  by setting the PCRE_NO_START_OPTIMIZE option.
   2608        You might want to do this if your pattern contains callouts or  (*MARK)
   2609        and  you  want  to make use of these facilities in cases where matching
   2610        fails.
   2611 
   2612        PCRE_NO_START_OPTIMIZE can be specified at either compile time or
   2613        execution time. However, if PCRE_NO_START_OPTIMIZE is passed to
   2614        pcre_exec() (that is, after any JIT compilation has happened), JIT
   2615        execution is disabled. For JIT execution to work with
   2616        PCRE_NO_START_OPTIMIZE, the option must be set at compile time.
   2617 
   2618        There is a longer discussion of PCRE_NO_START_OPTIMIZE below.
   2619 
   2620 
   2621 LOCALE SUPPORT
   2622 
   2623        PCRE handles caseless matching, and determines whether  characters  are
   2624        letters,  digits, or whatever, by reference to a set of tables, indexed
   2625        by character code point. When running in UTF-8 mode, or in the  16-  or
   2626        32-bit libraries, this applies only to characters with code points less
   2627        than 256. By default, higher-valued code  points  never  match  escapes
   2628        such  as \w or \d. However, if PCRE is built with Unicode property sup-
   2629        port, all characters can be tested with \p and \P,  or,  alternatively,
   2630        the  PCRE_UCP option can be set when a pattern is compiled; this causes
   2631        \w and friends to use Unicode property support instead of the  built-in
   2632        tables.
   2633 
   2634        The  use  of  locales  with Unicode is discouraged. If you are handling
   2635        characters with code points greater than 127,  you  should  either  use
   2636        Unicode support, or use locales, but not try to mix the two.
   2637 
   2638        PCRE  contains  an  internal set of tables that are used when the final
   2639        argument of pcre_compile() is  NULL.  These  are  sufficient  for  many
   2640        applications.  Normally, the internal tables recognize only ASCII char-
   2641        acters. However, when PCRE is built, it is possible to cause the inter-
   2642        nal tables to be rebuilt in the default "C" locale of the local system,
   2643        which may cause them to be different.
   2644 
   2645        The internal tables can always be overridden by tables supplied by  the
   2646        application that calls PCRE. These may be created in a different locale
   2647        from the default. As more and more applications change  to  using  Uni-
   2648        code, the need for this locale support is expected to die away.
   2649 
   2650        External  tables  are  built by calling the pcre_maketables() function,
   2651        which has no arguments, in the relevant locale. The result can then  be
   2652        passed  to  pcre_compile() as often as necessary. For example, to build
   2653        and use tables that  are  appropriate  for  the  French  locale  (where
   2654        accented characters with values greater than 127  are  treated  as
   2655        letters), the following code could be used:
   2656 
   2657          setlocale(LC_CTYPE, "fr_FR");
   2658          tables = pcre_maketables();
   2659          re = pcre_compile(..., tables);
   2660 
   2661        The locale name "fr_FR" is used on Linux and other  Unix-like  systems;
   2662        if you are using Windows, the name for the French locale is "french".
   2663 
   2664        When  pcre_maketables()  runs,  the  tables are built in memory that is
   2665        obtained via pcre_malloc. It is the caller's responsibility  to  ensure
   2666        that  the memory containing the tables remains available for as long as
   2667        it is needed.
   2668 
   2669        The pointer that is passed to pcre_compile() is saved with the compiled
   2670        pattern,  and the same tables are used via this pointer by pcre_study()
   2671        and also by pcre_exec() and pcre_dfa_exec(). Thus, for any single  pat-
   2672        tern, compilation, studying and matching all happen in the same locale,
   2673        but different patterns can be processed in different locales.
   2674 
   2675        It is possible to pass a table pointer or NULL (indicating the  use  of
   2676        the internal tables) to pcre_exec() or pcre_dfa_exec() (see the discus-
   2677        sion below in the section on matching a pattern). This facility is pro-
   2678        vided  for  use  with  pre-compiled  patterns  that have been saved and
   2679        reloaded.  Character tables are not saved with patterns, so if  a  non-
   2680        standard table was used at compile time, it must be provided again when
   2681        the reloaded pattern is matched. Attempting to  use  this  facility  to
   2682        match a pattern in a different locale from the one in which it was com-
   2683        piled is likely to lead to anomalous (usually incorrect) results.
   2684 
   2685 
   2686 INFORMATION ABOUT A PATTERN
   2687 
   2688        int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
   2689             int what, void *where);
   2690 
   2691        The pcre_fullinfo() function returns information about a compiled  pat-
   2692        tern.  It replaces the pcre_info() function, which was removed from the
   2693        library at version 8.30, after more than 10 years of obsolescence.
   2694 
   2695        The first argument for pcre_fullinfo() is a  pointer  to  the  compiled
   2696        pattern.  The second argument is the result of pcre_study(), or NULL if
   2697        the pattern was not studied. The third argument specifies  which  piece
   2698        of  information  is required, and the fourth argument is a pointer to a
   2699        variable to receive the data. The yield of the  function  is  zero  for
   2700        success, or one of the following negative numbers:
   2701 
   2702          PCRE_ERROR_NULL           the argument code was NULL
   2703                                    the argument where was NULL
   2704          PCRE_ERROR_BADMAGIC       the "magic number" was not found
   2705          PCRE_ERROR_BADENDIANNESS  the pattern was compiled with different
   2706                                    endianness
   2707          PCRE_ERROR_BADOPTION      the value of what was invalid
   2708          PCRE_ERROR_UNSET          the requested field is not set
   2709 
   2710        The  "magic  number" is placed at the start of each compiled pattern as
   2711        a simple check against passing an arbitrary memory pointer. The
   2712        endianness error can occur if a compiled pattern is saved and reloaded
   2713        on a different host. Here is a typical call of pcre_fullinfo(), to
   2714        obtain the length of the compiled pattern:
   2715 
   2716          int rc;
   2717          size_t length;
   2718          rc = pcre_fullinfo(
   2719            re,               /* result of pcre_compile() */
   2720            sd,               /* result of pcre_study(), or NULL */
   2721            PCRE_INFO_SIZE,   /* what is required */
   2722            &length);         /* where to put the data */
   2723 
   2724        The  possible  values for the third argument are defined in pcre.h, and
   2725        are as follows:
   2726 
   2727          PCRE_INFO_BACKREFMAX
   2728 
   2729        Return the number of the highest back reference  in  the  pattern.  The
   2730        fourth  argument  should  point to an int variable. Zero is returned if
   2731        there are no back references.
   2732 
   2733          PCRE_INFO_CAPTURECOUNT
   2734 
   2735        Return the number of capturing subpatterns in the pattern.  The  fourth
   2736        argument should point to an int variable.
   2737 
   2738          PCRE_INFO_DEFAULT_TABLES
   2739 
   2740        Return  a pointer to the internal default character tables within PCRE.
   2741        The fourth argument should point to an unsigned char *  variable.  This
   2742        information call is provided for internal use by the pcre_study() func-
   2743        tion. External callers can cause PCRE to use  its  internal  tables  by
   2744        passing a NULL table pointer.
   2745 
   2746          PCRE_INFO_FIRSTBYTE (deprecated)
   2747 
   2748        Return information about the first data unit of any matched string, for
   2749        a non-anchored pattern. The name of this option  refers  to  the  8-bit
   2750        library,  where  data units are bytes. The fourth argument should point
   2751        to an int variable. Negative values are used for  special  cases.  How-
   2752        ever,  this  means  that when the 32-bit library is in non-UTF-32 mode,
   2753        the full 32-bit range of characters cannot be returned. For  this  rea-
   2754        son,  this  value  is deprecated; use PCRE_INFO_FIRSTCHARACTERFLAGS and
   2755        PCRE_INFO_FIRSTCHARACTER instead.
   2756 
   2757        If there is a fixed first value, for example, the  letter  "c"  from  a
   2758        pattern  such  as (cat|cow|coyote), its value is returned. In the 8-bit
   2759        library, the value is always less than 256. In the 16-bit  library  the
   2760        value can be up to 0xffff. In the 32-bit library the value can be up to
   2761        0x10ffff.
   2762 
   2763        If there is no fixed first value, and if either
   2764 
   2765        (a) the pattern was compiled with the PCRE_MULTILINE option, and  every
   2766        branch starts with "^", or
   2767 
   2768        (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
   2769        set (if it were set, the pattern would be anchored),
   2770 
   2771        -1 is returned, indicating that the pattern matches only at  the  start
   2772        of  a  subject string or after any newline within the string. Otherwise
   2773        -2 is returned. For anchored patterns, -2 is returned.
   2774 
   2775          PCRE_INFO_FIRSTCHARACTER
   2776 
   2777        Return the value of the first data  unit  (non-UTF  character)  of  any
   2778        matched  string  in  the  situation where PCRE_INFO_FIRSTCHARACTERFLAGS
   2779        returns 1; otherwise return 0. The fourth argument should point  to  a
   2780        uint_t variable.
   2781 
   2782        In  the 8-bit library, the value is always less than 256. In the 16-bit
   2783        library the value can be up to 0xffff. In the 32-bit library in  UTF-32
   2784        mode  the  value  can  be up to 0x10ffff, and up to 0xffffffff when not
   2785        using UTF-32 mode.
   2786 
   2787          PCRE_INFO_FIRSTCHARACTERFLAGS
   2788 
   2789        Return information about the first data unit of any matched string, for
   2790        a  non-anchored  pattern.  The  fourth  argument should point to an int
   2791        variable.
   2792 
   2793        If there is a fixed first value, for example, the  letter  "c"  from  a
   2794        pattern  such  as  (cat|cow|coyote),  1  is returned, and the character
   2795        value can be retrieved using PCRE_INFO_FIRSTCHARACTER. If there  is  no
   2796        fixed first value, and if either
   2797 
   2798        (a)  the pattern was compiled with the PCRE_MULTILINE option, and every
   2799        branch starts with "^", or
   2800 
   2801        (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
   2802        set (if it were set, the pattern would be anchored),
   2803 
   2804        2 is returned, indicating that the pattern matches only at the start of
   2805        a subject string or after any newline within the string. Otherwise 0 is
   2806        returned. For anchored patterns, 0 is returned.
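
               The relationship between the deprecated PCRE_INFO_FIRSTBYTE result and
               the two newer values can be summarized in code. This sketch is an
               illustration only: the struct and function names are hypothetical and
               are not part of the PCRE API.

```c
/* Hypothetical pair mirroring what PCRE_INFO_FIRSTCHARACTERFLAGS and
   PCRE_INFO_FIRSTCHARACTER report. */
struct first_unit_info {
    int flags;           /* 0 = nothing known, 1 = fixed first value,
                            2 = matches only at start or after a newline */
    unsigned int value;  /* the first data unit when flags is 1, else 0 */
};

/* Translate a legacy PCRE_INFO_FIRSTBYTE result into that pair. */
struct first_unit_info from_firstbyte(int firstbyte)
{
    struct first_unit_info info = { 0, 0 };

    if (firstbyte >= 0) {
        info.flags = 1;
        info.value = (unsigned int)firstbyte;
    } else if (firstbyte == -1) {
        info.flags = 2;
    }
    /* -2 leaves flags at 0: no fixed first value, or anchored pattern */
    return info;
}
```

               The mapping runs in one direction only: a first data unit that does
               not fit in a non-negative int cannot be expressed as a
               PCRE_INFO_FIRSTBYTE result, which is why that value is deprecated.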
   2807 
   2808          PCRE_INFO_FIRSTTABLE
   2809 
   2810        If  the pattern was studied, and this resulted in the construction of a
   2811        256-bit table indicating a fixed set of values for the first data  unit
   2812        in  any  matching string, a pointer to the table is returned. Otherwise
   2813        NULL is returned. The fourth argument should point to an unsigned  char
   2814        * variable.
   2815 
   2816          PCRE_INFO_HASCRORLF
   2817 
   2818        Return  1  if  the  pattern  contains any explicit matches for CR or LF
   2819        characters, otherwise 0. The fourth argument should  point  to  an  int
   2820        variable.  An explicit match is either a literal CR or LF character, or
   2821        \r or \n.
   2822 
   2823          PCRE_INFO_JCHANGED
   2824 
   2825        Return 1 if the (?J) or (?-J) option setting is used  in  the  pattern,
   2826        otherwise  0. The fourth argument should point to an int variable. (?J)
   2827        and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
   2828 
   2829          PCRE_INFO_JIT
   2830 
   2831        Return 1 if the pattern was studied with one of the  JIT  options,  and
   2832        just-in-time compiling was successful. The fourth argument should point
   2833        to an int variable. A return value of 0 means that JIT support  is  not
   2834        available  in this version of PCRE, or that the pattern was not studied
   2835        with a JIT option, or that the JIT compiler could not handle this  par-
   2836        ticular  pattern. See the pcrejit documentation for details of what can
   2837        and cannot be handled.
   2838 
   2839          PCRE_INFO_JITSIZE
   2840 
   2841        If the pattern was successfully studied with a JIT option,  return  the
   2842        size  of the JIT compiled code, otherwise return zero. The fourth argu-
   2843        ment should point to a size_t variable.
   2844 
   2845          PCRE_INFO_LASTLITERAL (deprecated)
   2846 
   2847        Return the value of the rightmost literal data unit that must exist  in
   2848        any  matched  string, other than at its start, if such a value has been
   2849        recorded. The fourth argument should point to an int variable. If there
   2850        is no such value, -1 is returned. For anchored patterns, a last literal
   2851        value is recorded only if it follows something of variable length.  For
   2852        example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
   2853        /^a\dz\d/ the returned value is -1.
   2854 
   2855        Because this call cannot return the full 32-bit range of characters
   2856        when the 32-bit library is used in non-UTF-32 mode, this value is
   2857        deprecated; use PCRE_INFO_REQUIREDCHARFLAGS and PCRE_INFO_REQUIREDCHAR
   2858        instead.
   2859 
   2860          PCRE_INFO_MATCH_EMPTY
   2861 
   2862        Return  1  if  the  pattern can match an empty string, otherwise 0. The
   2863        fourth argument should point to an int variable.
   2864 
   2865          PCRE_INFO_MATCHLIMIT
   2866 
   2867        If the pattern set a match limit by  including  an  item  of  the  form
   2868        (*LIMIT_MATCH=nnnn)  at  the  start,  the value is returned. The fourth
   2869        argument should point to an unsigned 32-bit integer. If no  such  value
   2870        has   been   set,   the  call  to  pcre_fullinfo()  returns  the  error
   2871        PCRE_ERROR_UNSET.
   2872 
   2873          PCRE_INFO_MAXLOOKBEHIND
   2874 
   2875        Return the number of characters (NB not  data  units)  in  the  longest
   2876        lookbehind  assertion  in  the pattern. This information is useful when
   2877        doing multi-segment matching using  the  partial  matching  facilities.
   2878        Note that the simple assertions \b and \B require a one-character look-
   2879        behind. \A also registers a one-character lookbehind,  though  it  does
   2880        not  actually inspect the previous character. This is to ensure that at
   2881        least one character from the old segment is retained when a new segment
   2882        is processed. Otherwise, if there are no lookbehinds in the pattern, \A
   2883        might match incorrectly at the start of a new segment.
   2884 
   2885          PCRE_INFO_MINLENGTH
   2886 
   2887        If the pattern was studied and a minimum length  for  matching  subject
   2888        strings  was  computed,  its  value is returned. Otherwise the returned
   2889        value is -1. The value is a number of characters, which in UTF mode may
   2890        be  different from the number of data units. The fourth argument should
   2891        point to an int variable. A non-negative value is a lower bound to  the
   2892        length  of  any  matching  string. There may not be any strings of that
   2893        length that do actually match, but every string that does match  is  at
   2894        least that long.
   2895 
   2896          PCRE_INFO_NAMECOUNT
   2897          PCRE_INFO_NAMEENTRYSIZE
   2898          PCRE_INFO_NAMETABLE
   2899 
   2900        PCRE  supports the use of named as well as numbered capturing parenthe-
   2901        ses. The names are just an additional way of identifying the  parenthe-
   2902        ses, which still acquire numbers. Several convenience functions such as
   2903        pcre_get_named_substring() are provided for  extracting  captured  sub-
   2904        strings  by  name. It is also possible to extract the data directly, by
   2905        first converting the name to a number in order to  access  the  correct
   2906        pointers in the output vector (described with pcre_exec() below). To do
   2907        the conversion, you need  to  use  the  name-to-number  map,  which  is
   2908        described by these three values.
   2909 
   2910        The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
   2911        gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
   2912        of  each  entry;  both  of  these  return  an int value. The entry size
   2913        depends on the length of the longest name. PCRE_INFO_NAMETABLE  returns
   2914        a pointer to the first entry of the table. This is a pointer to char in
   2915        the 8-bit library, where the first two bytes of each entry are the num-
   2916        ber  of  the capturing parenthesis, most significant byte first. In the
   2917        16-bit library, the pointer points to 16-bit data units, the  first  of
   2918        which  contains  the  parenthesis  number.  In  the 32-bit library, the
   2919        pointer points to 32-bit data units, the first of  which  contains  the
   2920        parenthesis  number.  The  rest of the entry is the corresponding name,
   2921        zero terminated.
   2922 
   2923        The names are in alphabetical order. If (?| is used to create  multiple
   2924        groups  with  the same number, as described in the section on duplicate
   2925        subpattern numbers in the pcrepattern page, the groups may be given the
   2926        same  name,  but  there is only one entry in the table. Different names
   2927        for groups of the same number are not permitted.  Duplicate  names  for
   2928        subpatterns with different numbers are permitted, but only if PCRE_DUP-
   2929        NAMES is set. They appear in the table in the order in which they  were
   2930        found  in  the  pattern.  In  the  absence  of (?| this is the order of
   2931        increasing number; when (?| is used this is not  necessarily  the  case
   2932        because later subpatterns may have lower numbers.
   2933 
   2934        As  a  simple  example of the name/number table, consider the following
   2935        pattern after compilation by the 8-bit library (assume PCRE_EXTENDED is
   2936        set, so white space - including newlines - is ignored):
   2937 
   2938          (?<date> (?<year>(\d\d)?\d\d) -
   2939          (?<month>\d\d) - (?<day>\d\d) )
   2940 
   2941        There  are  four  named subpatterns, so the table has four entries, and
   2942        each entry in the table is eight bytes long. The table is  as  follows,
   2943        with non-printing bytes shown in hexadecimal, and undefined bytes shown
   2944        as ??:
   2945 
   2946          00 01 d  a  t  e  00 ??
   2947          00 05 d  a  y  00 ?? ??
   2948          00 04 m  o  n  t  h  00
   2949          00 02 y  e  a  r  00 ??
   2950 
   2951        When writing code to extract data  from  named  subpatterns  using  the
   2952        name-to-number  map,  remember that the length of the entries is likely
   2953        to be different for each compiled pattern.
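
               As an illustration, the lookup described above might be coded
               along the following lines for the 8-bit library. This is only a
               sketch: find_group_number() and example_table are hypothetical
               names, not part of PCRE, and in a real program the table pointer,
               entry count, and entry size would come from pcre_fullinfo(). The
               undefined bytes of the example table are arbitrarily set to zero
               here.

```c
#include <stddef.h>
#include <string.h>

/*
 * Look up a group name in an 8-bit PCRE name table.
 * table, name_count and entry_size stand for the values reported by
 * PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and PCRE_INFO_NAMEENTRYSIZE.
 * Returns the group number, or -1 if the name is not in the table.
 * (Hypothetical helper for illustration; not a PCRE function.)
 */
int find_group_number(const unsigned char *table, int name_count,
                      int entry_size, const char *name)
{
    int i;
    for (i = 0; i < name_count; i++) {
        /* Entries are fixed-size, so step by entry_size, not by name length. */
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        /* Bytes 0-1: group number, most significant byte first. */
        int number = (entry[0] << 8) | entry[1];
        /* Bytes 2 onward: the zero-terminated name. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return number;
    }
    return -1;
}

/* The four 8-byte entries from the date example above. */
const unsigned char example_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0,   0,
    0, 5, 'd', 'a', 'y', 0,   0,   0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0,   0
};
```

               With this table, looking up "month" yields 4 and looking up an
               absent name yields -1. When PCRE_DUPNAMES is in use a name may
               have several entries, and this sketch reports only the first of
               them. The library function pcre_get_stringnumber() performs the
               same conversion.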
   2954 
   2955          PCRE_INFO_OKPARTIAL
   2956 
   2957        Return 1  if  the  pattern  can  be  used  for  partial  matching  with
   2958        pcre_exec(),  otherwise  0.  The fourth argument should point to an int
   2959        variable. From  release  8.00,  this  always  returns  1,  because  the
   2960        restrictions  that  previously  applied  to  partial matching have been
   2961        lifted. The pcrepartial documentation gives details of  partial  match-
   2962        ing.
   2963 
   2964          PCRE_INFO_OPTIONS
   2965 
   2966        Return  a  copy of the options with which the pattern was compiled. The
   2967        fourth argument should point to an unsigned long  int  variable.  These
   2968        option bits are those specified in the call to pcre_compile(), modified
   2969        by any top-level option settings at the start of the pattern itself. In
   2970        other  words,  they are the options that will be in force when matching
   2971        starts. For example, if the pattern /(?im)abc(?-i)d/ is  compiled  with
   2972        the  PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE,
   2973        and PCRE_EXTENDED.
   2974 
   2975        A pattern is automatically anchored by PCRE if  all  of  its  top-level
   2976        alternatives begin with one of the following:
   2977 
   2978          ^     unless PCRE_MULTILINE is set
   2979          \A    always
   2980          \G    always
   2981          .*    if PCRE_DOTALL is set and there are no back
   2982                  references to the subpattern in which .* appears
   2983 
   2984        For such patterns, the PCRE_ANCHORED bit is set in the options returned
   2985        by pcre_fullinfo().
   2986 
   2987          PCRE_INFO_RECURSIONLIMIT
   2988 
   2989        If the pattern set a recursion limit by including an item of  the  form
   2990        (*LIMIT_RECURSION=nnnn) at the start, the value is returned. The fourth
   2991        argument should point to an unsigned 32-bit integer. If no  such  value
   2992        has   been   set,   the  call  to  pcre_fullinfo()  returns  the  error
   2993        PCRE_ERROR_UNSET.
   2994 
   2995          PCRE_INFO_SIZE
   2996 
   2997        Return the size of  the  compiled  pattern  in  bytes  (for  all  three
   2998        libraries). The fourth argument should point to a size_t variable. This
   2999        value does not include the size of the pcre structure that is  returned
   3000        by  pcre_compile().  The  value  that  is  passed  as  the  argument to
   3001        pcre_malloc() when pcre_compile() is getting memory in which  to  place
   3002        the compiled data is the value returned by this option plus the size of
   3003        the pcre structure. Studying a compiled pattern, with or  without  JIT,
   3004        does not alter the value returned by this option.
   3005 
   3006          PCRE_INFO_STUDYSIZE
   3007 
   3008        Return  the  size  in bytes (for all three libraries) of the data block
   3009        pointed to by the study_data field in a pcre_extra block. If pcre_extra
   3010        is  NULL, or there is no study data, zero is returned. The fourth argu-
   3011        ment should point to a size_t variable. The study_data field is set  by
   3012        pcre_study() to record information that will speed up matching (see the
   3013        section entitled  "Studying  a  pattern"  above).  The  format  of  the
   3014        study_data  block is private, but its length is made available via this
   3015        option so that it can be saved and  restored  (see  the  pcreprecompile
   3016        documentation for details).
   3017 
   3018          PCRE_INFO_REQUIREDCHARFLAGS
   3019 
   3020        Return 1 if there is a rightmost literal data unit that must exist in
   3021        any matched string, other than at its start. The fourth argument should
   3022        point to an int variable. If there is no such value, 0 is returned. If
   3023        1 is returned, the character value itself can be retrieved using
   3024        PCRE_INFO_REQUIREDCHAR.
   3025 
   3026        For anchored patterns, a last literal value is recorded only if it fol-
   3027        lows something  of  variable  length.  For  example,  for  the  pattern
   3028        /^a\d+z\d+/ the returned value is 1 (with "z" returned from
   3029        PCRE_INFO_REQUIREDCHAR), but for /^a\dz\d/ the returned value is 0.
   3030 
   3031          PCRE_INFO_REQUIREDCHAR
   3032 
   3033        Return the value of the rightmost literal data unit that must exist  in
   3034        any  matched  string, other than at its start, if such a value has been
   3035        recorded. The fourth argument should point to a uint32_t variable. If
   3036        there is no such value, 0 is returned.
   3037 
   3038 
   3039 REFERENCE COUNTS
   3040 
   3041        int pcre_refcount(pcre *code, int adjust);
   3042 
   3043        The  pcre_refcount()  function is used to maintain a reference count in
   3044        the data block that contains a compiled pattern. It is provided for the
   3045        benefit  of  applications  that  operate  in an object-oriented manner,
   3046        where different parts of the application may be using the same compiled
   3047        pattern, but you want to free the block when they are all done.
   3048 
   3049        When a pattern is compiled, the reference count field is initialized to
   3050        zero.  It is changed only by calling this function, whose action is  to
   3051        add  the  adjust  value  (which may be positive or negative) to it. The
   3052        yield of the function is the new value. However, the value of the count
   3053        is  constrained to lie between 0 and 65535, inclusive. If the new value
   3054        is outside these limits, it is forced to the appropriate limit value.
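
               The arithmetic described above can be modelled as follows. This
               is a plain C model of the documented behaviour only, not code
               taken from PCRE; model_refcount() is a hypothetical name, and
               its first argument stands for the count currently stored in the
               compiled pattern.

```c
/*
 * Model of the documented reference-count arithmetic: add adjust to the
 * current count, then force the result into the range 0..65535.
 * (Hypothetical illustration; the real work is done by pcre_refcount().)
 */
int model_refcount(int current, int adjust)
{
    long long value = (long long)current + adjust;  /* avoid int overflow */
    if (value < 0)
        value = 0;          /* forced up to the lower limit */
    if (value > 65535)
        value = 65535;      /* forced down to the upper limit */
    return (int)value;
}
```

               Because the adjustment is simply added, an adjust value of zero
               leaves the count unchanged and yields its current value.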
   3055 
   3056        Except when it is zero, the reference count is not correctly  preserved
   3057        if  a  pattern  is  compiled on one host and then transferred to a host
   3058        whose byte-order is different. (This seems a highly unlikely scenario.)
   3059 
   3060 
   3061 MATCHING A PATTERN: THE TRADITIONAL FUNCTION
   3062 
   3063        int pcre_exec(const pcre *code, const pcre_extra *extra,
   3064             const char *subject, int length, int startoffset,
   3065             int options, int *ovector, int ovecsize);
   3066 
   3067        The function pcre_exec() is called to match a subject string against  a
   3068        compiled  pattern, which is passed in the code argument. If the pattern
   3069        was studied, the result of the study should  be  passed  in  the  extra
   3070        argument.  You  can call pcre_exec() with the same code and extra argu-
   3071        ments as many times as you like, in order to  match  different  subject
   3072        strings with the same pattern.
   3073 
   3074        This  function  is  the  main  matching facility of the library, and it
   3075        operates in a Perl-like manner. For specialist use  there  is  also  an
   3076        alternative  matching function, which is described below in the section
   3077        about the pcre_dfa_exec() function.
   3078 
   3079        In most applications, the pattern will have been compiled (and  option-
   3080        ally  studied)  in the same process that calls pcre_exec(). However, it
   3081        is possible to save compiled patterns and study data, and then use them
   3082        later  in  different processes, possibly even on different hosts. For a
   3083        discussion about this, see the pcreprecompile documentation.
   3084 
   3085        Here is an example of a simple call to pcre_exec():
   3086 
   3087          int rc;
   3088          int ovector[30];
   3089          rc = pcre_exec(
   3090            re,             /* result of pcre_compile() */
   3091            NULL,           /* we didn't study the pattern */
   3092            "some string",  /* the subject string */
   3093            11,             /* the length of the subject string */
   3094            0,              /* start at offset 0 in the subject */
   3095            0,              /* default options */
   3096            ovector,        /* vector of integers for substring information */
   3097            30);            /* number of elements (NOT size in bytes) */
   3098 
   3099    Extra data for pcre_exec()
   3100 
   3101        If the extra argument is not NULL, it must point to a  pcre_extra  data
   3102        block.  The pcre_study() function returns such a block (when it doesn't
   3103        return NULL), but you can also create one for yourself, and pass  addi-
   3104        tional  information  in it. The pcre_extra block contains the following
   3105        fields (not necessarily in this order):
   3106 
   3107          unsigned long int flags;
   3108          void *study_data;
   3109          void *executable_jit;
   3110          unsigned long int match_limit;
   3111          unsigned long int match_limit_recursion;
   3112          void *callout_data;
   3113          const unsigned char *tables;
   3114          unsigned char **mark;
   3115 
   3116        In the 16-bit version of  this  structure,  the  mark  field  has  type
   3117        "PCRE_UCHAR16 **".
   3118 
   3119        In  the  32-bit  version  of  this  structure,  the mark field has type
   3120        "PCRE_UCHAR32 **".
   3121 
   3122        The flags field is used to specify which of the other fields  are  set.
   3123        The flag bits are:
   3124 
   3125          PCRE_EXTRA_CALLOUT_DATA
   3126          PCRE_EXTRA_EXECUTABLE_JIT
   3127          PCRE_EXTRA_MARK
   3128          PCRE_EXTRA_MATCH_LIMIT
   3129          PCRE_EXTRA_MATCH_LIMIT_RECURSION
   3130          PCRE_EXTRA_STUDY_DATA
   3131          PCRE_EXTRA_TABLES
   3132 
   3133        Other  flag  bits should be set to zero. The study_data field and some-
   3134        times the executable_jit field are set in the pcre_extra block that  is
   3135        returned  by pcre_study(), together with the appropriate flag bits. You
   3136        should not set these yourself, but you may add to the block by  setting
   3137        other fields and their corresponding flag bits.
   3138 
   3139        The match_limit field provides a means of preventing PCRE from using up
   3140        a vast amount of resources when running patterns that are not going  to
   3141        match,  but  which  have  a very large number of possibilities in their
   3142        search trees. The classic example is a pattern that uses nested  unlim-
   3143        ited repeats.
   3144 
   3145        Internally,  pcre_exec() uses a function called match(), which it calls
   3146        repeatedly (sometimes recursively). The limit  set  by  match_limit  is
   3147        imposed  on the number of times this function is called during a match,
   3148        which has the effect of limiting the amount of  backtracking  that  can
   3149        take place. For patterns that are not anchored, the count restarts from
   3150        zero for each position in the subject string.
   3151 
   3152        When pcre_exec() is called with a pattern that was successfully studied
   3153        with  a  JIT  option, the way that the matching is executed is entirely
   3154        different.  However, there is still the possibility of runaway matching
   3155        that goes on for a very long time, and so the match_limit value is also
   3156        used in this case (but in a different way) to limit how long the match-
   3157        ing can continue.
   3158 
   3159        The  default  value  for  the  limit can be set when PCRE is built; the
   3160        default default is 10 million, which handles all but the  most  extreme
   3161        cases. You can override the default by supplying pcre_exec() with a
   3162        pcre_extra    block    in    which    match_limit    is    set,     and
   3163        PCRE_EXTRA_MATCH_LIMIT  is  set  in  the  flags  field. If the limit is
   3164        exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
   3165 
   3166        A value for the match limit may also be supplied  by  an  item  at  the
   3167        start of a pattern of the form
   3168 
   3169          (*LIMIT_MATCH=d)
   3170 
   3171        where  d is a decimal number. However, such a setting is ignored unless
   3172        d is less than the limit set by the caller of  pcre_exec()  or,  if  no
   3173        such limit is set, less than the default.
   3174 
   3175        The  match_limit_recursion field is similar to match_limit, but instead
   3176        of limiting the total number of times that match() is called, it limits
   3177        the  depth  of  recursion. The recursion depth is a smaller number than
   3178        the total number of calls, because not all calls to match() are  recur-
   3179        sive.  This limit is of use only if it is set smaller than match_limit.
   3180 
   3181        Limiting  the  recursion  depth limits the amount of machine stack that
   3182        can be used, or, when PCRE has been compiled to use memory on the  heap
   3183        instead  of the stack, the amount of heap memory that can be used. This
   3184        limit is not relevant, and is ignored, when matching is done using  JIT
   3185        compiled code.
   3186 
   3187        The  default  value  for  match_limit_recursion can be set when PCRE is
   3188        built; the default default  is  the  same  value  as  the  default  for
   3189        match_limit. You can override the default by supplying pcre_exec() with
   3190        a  pcre_extra  block  in  which  match_limit_recursion  is   set,   and
   3191        PCRE_EXTRA_MATCH_LIMIT_RECURSION  is  set  in  the  flags field. If the
   3192        limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
   3193 
   3194        A value for the recursion limit may also be supplied by an item at  the
   3195        start of a pattern of the form
   3196 
   3197          (*LIMIT_RECURSION=d)
   3198 
   3199        where  d is a decimal number. However, such a setting is ignored unless
   3200        d is less than the limit set by the caller of  pcre_exec()  or,  if  no
   3201        such limit is set, less than the default.
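
               The rule for combining a limit given in the pattern with the
               other sources, which is the same for (*LIMIT_MATCH=d) and
               (*LIMIT_RECURSION=d), can be modelled in a few lines. This is a
               sketch of the documented rule, not PCRE's own code;
               effective_limit() and its parameter names are hypothetical.

```c
/*
 * Model of how the limit in force is chosen.
 *   build_default  the default fixed when PCRE was built
 *   caller_set     non-zero if the caller set a limit in pcre_extra
 *   pattern_set    non-zero if the pattern starts with a (*LIMIT_...=d) item
 * (Hypothetical illustration; not a PCRE function.)
 */
unsigned long int effective_limit(unsigned long int build_default,
                                  int caller_set,
                                  unsigned long int caller_limit,
                                  int pattern_set,
                                  unsigned long int pattern_limit)
{
    /* The caller's value, when present, replaces the default. */
    unsigned long int limit = caller_set ? caller_limit : build_default;

    /* A value in the pattern can only lower the limit, never raise it. */
    if (pattern_set && pattern_limit < limit)
        limit = pattern_limit;

    return limit;
}
```

               The effect is that a pattern can tighten the limit that applies
               to it, but cannot loosen a limit chosen by the application or by
               whoever built the library.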
   3202 
   3203        The  callout_data  field is used in conjunction with the "callout" fea-
   3204        ture, and is described in the pcrecallout documentation.
   3205 
   3206        The tables field is provided for use with patterns that have been  pre-
   3207        compiled using custom character tables, saved to disc or elsewhere, and
   3208        then reloaded, because the tables that were used to compile  a  pattern
   3209        are  not saved with it. See the pcreprecompile documentation for a dis-
   3210        cussion of saving compiled patterns for later use. If  NULL  is  passed
   3211        using this mechanism, it forces PCRE's internal tables to be used.
   3212 
   3213        Warning:  The  tables  that  pcre_exec() uses must be the same as those
   3214        that were used when the pattern was compiled. If this is not the  case,
   3215        the behaviour of pcre_exec() is undefined. Therefore, when a pattern is
   3216        compiled and matched in the same process, this field  should  never  be
   3217        set. In this (the most common) case, the correct table pointer is auto-
   3218        matically passed with  the  compiled  pattern  from  pcre_compile()  to
   3219        pcre_exec().
   3220 
   3221        If  PCRE_EXTRA_MARK  is  set in the flags field, the mark field must be
   3222        set to point to a suitable variable. If the pattern contains any  back-
   3223        tracking  control verbs such as (*MARK:NAME), and the execution ends up
   3224        with a name to pass back, a pointer to the  name  string  (zero  termi-
   3225        nated)  is  placed  in  the  variable pointed to by the mark field. The
   3226        names are within the compiled pattern; if you wish  to  retain  such  a
   3227        name  you must copy it before freeing the memory of a compiled pattern.
   3228        If there is no name to pass back, the variable pointed to by  the  mark
   3229        field  is  set  to NULL. For details of the backtracking control verbs,
   3230        see the section entitled "Backtracking control" in the pcrepattern doc-
   3231        umentation.
   3232 
   3233    Option bits for pcre_exec()
   3234 
   3235        The  unused  bits of the options argument for pcre_exec() must be zero.
   3236        The only bits that may  be  set  are  PCRE_ANCHORED,  PCRE_NEWLINE_xxx,
   3237        PCRE_NOTBOL,    PCRE_NOTEOL,    PCRE_NOTEMPTY,   PCRE_NOTEMPTY_ATSTART,
   3238        PCRE_NO_START_OPTIMIZE,  PCRE_NO_UTF8_CHECK,   PCRE_PARTIAL_HARD,   and
   3239        PCRE_PARTIAL_SOFT.
   3240 
   3241        If  the  pattern  was successfully studied with one of the just-in-time
   3242        (JIT) compile options, the only supported options for JIT execution are
   3243        PCRE_NO_UTF8_CHECK,     PCRE_NOTBOL,     PCRE_NOTEOL,    PCRE_NOTEMPTY,
   3244        PCRE_NOTEMPTY_ATSTART, PCRE_PARTIAL_HARD, and PCRE_PARTIAL_SOFT. If  an
   3245        unsupported  option  is  used, JIT execution is disabled and the normal
   3246        interpretive code in pcre_exec() is run.
   3247 
   3248          PCRE_ANCHORED
   3249 
   3250        The PCRE_ANCHORED option limits pcre_exec() to matching  at  the  first
   3251        matching  position.  If  a  pattern was compiled with PCRE_ANCHORED, or
   3252        turned out to be anchored by virtue of its contents, it cannot be  made
   3253        unanchored at matching time.
   3254 
   3255          PCRE_BSR_ANYCRLF
   3256          PCRE_BSR_UNICODE
   3257 
   3258        These options (which are mutually exclusive) control what the \R escape
   3259        sequence matches. The choice is either to match only CR, LF,  or  CRLF,
   3260        or  to  match  any Unicode newline sequence. These options override the
   3261        choice that was made or defaulted when the pattern was compiled.
   3262 
   3263          PCRE_NEWLINE_CR
   3264          PCRE_NEWLINE_LF
   3265          PCRE_NEWLINE_CRLF
   3266          PCRE_NEWLINE_ANYCRLF
   3267          PCRE_NEWLINE_ANY
   3268 
   3269        These options override  the  newline  definition  that  was  chosen  or
   3270        defaulted  when the pattern was compiled. For details, see the descrip-
   3271        tion of pcre_compile()  above.  During  matching,  the  newline  choice
   3272        affects  the  behaviour  of the dot, circumflex, and dollar metacharac-
   3273        ters. It may also alter the way the match position is advanced after  a
   3274        match failure for an unanchored pattern.
   3275 
   3276        When  PCRE_NEWLINE_CRLF,  PCRE_NEWLINE_ANYCRLF,  or PCRE_NEWLINE_ANY is
   3277        set, and a match attempt for an unanchored pattern fails when the  cur-
   3278        rent  position  is  at  a  CRLF  sequence,  and the pattern contains no
   3279        explicit matches for  CR  or  LF  characters,  the  match  position  is
   3280        advanced by two characters instead of one, in other words, to after the
   3281        CRLF.
   3282 
   3283        The above rule is a compromise that makes the most common cases work as
   3284        expected.  For  example,  if  the  pattern  is .+A (and the PCRE_DOTALL
   3285        option is not set), it does not match the string "\r\nA" because, after
   3286        failing  at the start, it skips both the CR and the LF before retrying.
   3287        However, the pattern [\r\n]A does match that string,  because  it  con-
   3288        tains an explicit CR or LF reference, and so advances only by one char-
   3289        acter after the first failure.
   3290 
   3291        An explicit match for CR or LF is either a literal appearance of one of
   3292        those  characters,  or  one  of the \r or \n escape sequences. Implicit
   3293        matches such as [^X] do not count, nor does \s (which includes  CR  and
   3294        LF in the characters that it matches).
   3295 
   3296        Notwithstanding  the above, anomalous effects may still occur when CRLF
   3297        is a valid newline sequence and explicit \r or \n escapes appear in the
   3298        pattern.
   3299 
   3300          PCRE_NOTBOL
   3301 
   3302        This option specifies that the first character of the subject string is
   3303        not the beginning of a line, so the circumflex metacharacter should not
   3304        match before it. Setting this without PCRE_MULTILINE (at compile time)
   3305        causes circumflex never to match. This option affects only the behaviour
   3306        of the circumflex metacharacter. It does not affect \A.
   3307 
   3308          PCRE_NOTEOL
   3309 
   3310        This option specifies that the end of the subject string is not the end
   3311        of a line, so the dollar metacharacter should not match it nor  (except
   3312        in  multiline mode) a newline immediately before it. Setting this with-
   3313        out PCRE_MULTILINE (at compile time) causes dollar never to match. This
   3314        option  affects only the behaviour of the dollar metacharacter. It does
   3315        not affect \Z or \z.
   3316 
   3317          PCRE_NOTEMPTY
   3318 
   3319        An empty string is not considered to be a valid match if this option is
   3320        set.  If  there are alternatives in the pattern, they are tried. If all
   3321        the alternatives match the empty string, the entire  match  fails.  For
   3322        example, if the pattern
   3323 
   3324          a?b?
   3325 
   3326        is  applied  to  a  string not beginning with "a" or "b", it matches an
   3327        empty string at the start of the subject. With PCRE_NOTEMPTY set,  this
   3328        match is not valid, so PCRE searches further into the string for occur-
   3329        rences of "a" or "b".
   3330 
   3331          PCRE_NOTEMPTY_ATSTART
   3332 
   3333        This is like PCRE_NOTEMPTY, except that an empty string match  that  is
   3334        not  at  the  start  of  the  subject  is  permitted. If the pattern is
   3335        anchored, such a match can occur only if the pattern contains \K.
   3336 
   3337        Perl    has    no    direct    equivalent    of    PCRE_NOTEMPTY     or
   3338        PCRE_NOTEMPTY_ATSTART,  but  it  does  make a special case of a pattern
   3339        match of the empty string within its split() function, and  when  using
   3340        the  /g  modifier.  It  is  possible  to emulate Perl's behaviour after
   3341        matching a null string by first trying the match again at the same off-
   3342        set  with  PCRE_NOTEMPTY_ATSTART  and  PCRE_ANCHORED,  and then if that
   3343        fails, by advancing the starting offset (see below) and trying an ordi-
   3344        nary  match  again. There is some code that demonstrates how to do this
   3345        in the pcredemo sample program. In the most general case, you  have  to
   3346        check  to  see  if the newline convention recognizes CRLF as a newline,
   3347        and if so, and the current character is CR followed by LF, advance  the
   3348        starting offset by two characters instead of one.
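
               The "advance the starting offset" step might look like the
               following for the 8-bit library. This is a sketch, not the
               pcredemo code: next_start_offset() is a hypothetical helper, the
               caller is assumed to have worked out already whether the newline
               convention in force recognizes CRLF, and the UTF-8 handling
               (stepping over continuation bytes so that the new offset is at
               the start of a character) is an addition that assumes a valid
               UTF-8 subject.

```c
/*
 * Compute the offset at which to retry after an empty-string match at
 * `offset`, when the anchored PCRE_NOTEMPTY_ATSTART retry has failed.
 * Returns -1 if the empty match was at the end of the subject, in which
 * case there is nothing further to try.
 * (Hypothetical helper for illustration; not a PCRE function.)
 */
int next_start_offset(const char *subject, int length, int offset,
                      int crlf_is_newline, int utf8)
{
    if (offset >= length)
        return -1;

    /* Treat CR followed by LF as one unit when CRLF is a newline. */
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;

    offset++;

    /* In UTF-8 mode, skip continuation bytes (binary 10xxxxxx). */
    if (utf8) {
        while (offset < length &&
               ((unsigned char)subject[offset] & 0xC0) == 0x80)
            offset++;
    }
    return offset;
}
```

               The value returned is then passed as startoffset in an ordinary
               call to pcre_exec(), that is, one without
               PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED.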
   3349 
   3350          PCRE_NO_START_OPTIMIZE
   3351 
   3352        There  are a number of optimizations that pcre_exec() uses at the start
   3353        of a match, in order to speed up the process. For  example,  if  it  is
   3354        known that an unanchored match must start with a specific character, it
   3355        searches the subject for that character, and fails  immediately  if  it
   3356        cannot  find  it,  without actually running the main matching function.
   3357        This means that a special item such as (*COMMIT) at the start of a pat-
   3358        tern  is  not  considered until after a suitable starting point for the
   3359        match has been found. Also, when callouts or (*MARK) items are in  use,
   3360        these "start-up" optimizations can cause them to be skipped if the pat-
   3361        tern is never actually used. The start-up optimizations are in effect a
   3362        pre-scan of the subject that takes place before the pattern is run.
   3363 
   3364        The  PCRE_NO_START_OPTIMIZE option disables the start-up optimizations,
   3365        possibly causing performance to suffer,  but  ensuring  that  in  cases
   3366        where  the  result is "no match", the callouts do occur, and that items
   3367        such as (*COMMIT) and (*MARK) are considered at every possible starting
   3368        position  in  the  subject  string. If PCRE_NO_START_OPTIMIZE is set at
   3369        compile time,  it  cannot  be  unset  at  matching  time.  The  use  of
   3370        PCRE_NO_START_OPTIMIZE  at  matching  time  (that  is,  passing  it  to
   3371        pcre_exec()) disables JIT execution; in  this  situation,  matching  is
   3372        always done using the interpretive code.
   3373 
   3374        Setting  PCRE_NO_START_OPTIMIZE  can  change  the outcome of a matching
   3375        operation.  Consider the pattern
   3376 
   3377          (*COMMIT)ABC
   3378 
   3379        When this is compiled, PCRE records the fact that a  match  must  start
   3380        with  the  character  "A".  Suppose the subject string is "DEFABC". The
   3381        start-up optimization scans along the subject, finds "A" and  runs  the
   3382        first match attempt from there. The (*COMMIT) item means that the
   3383        pattern must match at the current starting position, which in this case
   3384        it does. However, if the same match is run with PCRE_NO_START_OPTIMIZE
   3385        set, the initial scan along the subject string  does  not  happen.  The
   3386        first  match  attempt  is  run  starting  from "D" and when this fails,
   3387        (*COMMIT) prevents any further matches  being  tried,  so  the  overall
   3388        result  is  "no  match". If the pattern is studied, more start-up opti-
   3389        mizations may be used. For example, a minimum length  for  the  subject
   3390        may be recorded. Consider the pattern
   3391 
   3392          (*MARK:A)(X|Y)
   3393 
   3394        The  minimum  length  for  a  match is one character. If the subject is
   3395        "ABC", there will be attempts to  match  "ABC",  "BC",  "C",  and  then
   3396        finally  an empty string.  If the pattern is studied, the final attempt
   3397        does not take place, because PCRE knows that the subject is too  short,
   3398        and  so  the  (*MARK) is never encountered.  In this case, studying the
   3399        pattern does not affect the overall match result, which  is  still  "no
   3400        match", but it does affect the auxiliary information that is returned.
   3401 
   3402          PCRE_NO_UTF8_CHECK
   3403 
   3404        When PCRE_UTF8 is set at compile time, the validity of the subject as a
   3405        UTF-8 string is automatically checked when pcre_exec() is  subsequently
   3406        called.  The entire string is checked before any other processing takes
   3407        place. The value of startoffset is  also  checked  to  ensure  that  it
   3408        points  to  the start of a UTF-8 character. There is a discussion about
   3409        the validity of UTF-8 strings in the pcreunicode page.  If  an  invalid
   3410        sequence   of   bytes   is   found,   pcre_exec()   returns  the  error
   3411        PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is a
   3412        truncated character at the end of the subject, PCRE_ERROR_SHORTUTF8. In
   3413        both cases, information about the precise nature of the error may  also
   3414        be  returned (see the descriptions of these errors in the section enti-
   3415        tled Error return values from pcre_exec() below).  If startoffset  con-
   3416        tains a value that does not point to the start of a UTF-8 character (or
   3417        to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is returned.
   3418 
   3419        If you already know that your subject is valid, and you  want  to  skip
   3420        these    checks    for   performance   reasons,   you   can   set   the
   3421        PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might  want  to
   3422        do  this  for the second and subsequent calls to pcre_exec() if you are
   3423        making repeated calls to find all  the  matches  in  a  single  subject
   3424        string.  However,  you  should  be  sure  that the value of startoffset
   3425        points to the start of a character (or the end of  the  subject).  When
   3426        PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid string as a
   3427        subject or an invalid value of startoffset is undefined.  Your  program
   3428        may crash or loop.
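
               If startoffset is computed by the caller, a cheap guard can confirm
               that it points to the start of a character before
               PCRE_NO_UTF8_CHECK is set. The following is a minimal sketch in C;
               utf8_offset_is_char_start() is a hypothetical helper, not part of
               PCRE, and it assumes that the subject itself has already been
               validated:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a PCRE function): returns 1 if startoffset is
   a safe starting point in a UTF-8 subject of `length` bytes, that is,
   it lies within 0..length and does not land on a continuation byte. */
int utf8_offset_is_char_start(const char *subject, int length,
                              int startoffset)
{
    assert(subject != NULL && length >= 0);
    if (startoffset < 0 || startoffset > length)
        return 0;                     /* outside the subject */
    if (startoffset == length)
        return 1;                     /* the end of the subject is allowed */
    /* Continuation bytes have the bit pattern 10xxxxxx. */
    return (((unsigned char)subject[startoffset]) & 0xc0) != 0x80;
}
```

               The test looks only at the byte at startoffset, so it is
               meaningful only for a subject that is known to be valid UTF-8.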
   3429 
   3430          PCRE_PARTIAL_HARD
   3431          PCRE_PARTIAL_SOFT
   3432 
   3433        These  options turn on the partial matching feature. For backwards com-
   3434        patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A  partial
   3435        match  occurs if the end of the subject string is reached successfully,
   3436        but there are not enough subject characters to complete the  match.  If
   3437        this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set,
   3438        matching continues by testing any remaining alternatives.  Only  if  no
   3439        complete  match  can be found is PCRE_ERROR_PARTIAL returned instead of
   3440        PCRE_ERROR_NOMATCH. In other words,  PCRE_PARTIAL_SOFT  says  that  the
   3441        caller  is  prepared to handle a partial match, but only if no complete
   3442        match can be found.
   3443 
   3444        If PCRE_PARTIAL_HARD is set, it overrides  PCRE_PARTIAL_SOFT.  In  this
   3445        case,  if  a  partial  match  is found, pcre_exec() immediately returns
   3446        PCRE_ERROR_PARTIAL, without  considering  any  other  alternatives.  In
   3447        other  words, when PCRE_PARTIAL_HARD is set, a partial match is consid-
   3448        ered to be more important than an alternative complete match.
   3449 
   3450        In both cases, the portion of the string that was  inspected  when  the
   3451        partial match was found is set as the first matching string. There is a
   3452        more detailed discussion of partial and  multi-segment  matching,  with
   3453        examples, in the pcrepartial documentation.
   3454 
   3455    The string to be matched by pcre_exec()
   3456 
   3457        The  subject string is passed to pcre_exec() as a pointer in subject, a
   3458        length in length, and a starting offset in startoffset. The  units  for
   3459        length  and  startoffset  are  bytes for the 8-bit library, 16-bit data
   3460        items for the 16-bit library, and 32-bit  data  items  for  the  32-bit
   3461        library.
   3462 
   3463        If  startoffset  is negative or greater than the length of the subject,
   3464        pcre_exec() returns PCRE_ERROR_BADOFFSET. When the starting  offset  is
   3465        zero,  the  search  for a match starts at the beginning of the subject,
   3466        and this is by far the most common case. In UTF-8 or UTF-16  mode,  the
   3467        offset  must  point to the start of a character, or the end of the sub-
   3468        ject (in UTF-32 mode, one data unit equals one character, so  all  off-
   3469        sets  are  valid).  Unlike  the pattern string, the subject may contain
   3470        binary zeroes.
   3471 
   3472        A non-zero starting offset is useful when searching for  another  match
   3473        in  the same subject by calling pcre_exec() again after a previous suc-
   3474        cess.  Setting startoffset differs from just passing over  a  shortened
   3475        string  and  setting  PCRE_NOTBOL  in the case of a pattern that begins
   3476        with any kind of lookbehind. For example, consider the pattern
   3477 
   3478          \Biss\B
   3479 
   3480        which finds occurrences of "iss" in the middle of  words.  (\B  matches
   3481        only  if  the  current position in the subject is not a word boundary.)
   3482        When applied to the string "Mississipi" the first call  to  pcre_exec()
   3483        finds  the  first  occurrence. If pcre_exec() is called again with just
   3484        the remainder of the subject,  namely  "issipi",  it  does  not  match,
   3485        because \B is always false at the start of the subject, which is deemed
   3486        to be a word boundary. However, if pcre_exec()  is  passed  the  entire
   3487        string again, but with startoffset set to 4, it finds the second occur-
   3488        rence of "iss" because it is able to look behind the starting point  to
   3489        discover that it is preceded by a letter.
   3490 
   3491        Finding  all  the  matches  in a subject is tricky when the pattern can
   3492        match an empty string. It is possible to emulate Perl's /g behaviour by
   3493        first   trying   the   match   again  at  the  same  offset,  with  the
   3494        PCRE_NOTEMPTY_ATSTART and  PCRE_ANCHORED  options,  and  then  if  that
   3495        fails,  advancing  the  starting  offset  and  trying an ordinary match
   3496        again. There is some code that demonstrates how to do this in the pcre-
   3497        demo sample program. In the most general case, you have to check to see
   3498        if the newline convention recognizes CRLF as a newline, and if so,  and
   3499        the current character is CR followed by LF, advance the starting offset
   3500        by two characters instead of one.
   3501 
   3502        If a non-zero starting offset is passed when the pattern  is  anchored,
   3503        one attempt to match at the given offset is made. This can only succeed
   3504        if the pattern does not require the match to be at  the  start  of  the
   3505        subject.
   3506 
   3507    How pcre_exec() returns captured substrings
   3508 
   3509        In  general, a pattern matches a certain portion of the subject, and in
   3510        addition, further substrings from the subject  may  be  picked  out  by
   3511        parts  of  the  pattern.  Following the usage in Jeffrey Friedl's book,
   3512        this is called "capturing" in what follows, and the  phrase  "capturing
   3513        subpattern"  is  used for a fragment of a pattern that picks out a sub-
   3514        string. PCRE supports several other kinds of  parenthesized  subpattern
   3515        that do not cause substrings to be captured.
   3516 
   3517        Captured substrings are returned to the caller via a vector of integers
   3518        whose address is passed in ovector. The number of elements in the  vec-
   3519        tor  is  passed in ovecsize, which must be a non-negative number. Note:
   3520        this argument is NOT the size of ovector in bytes.
   3521 
   3522        The first two-thirds of the vector is used to pass back  captured  sub-
   3523        strings,  each  substring using a pair of integers. The remaining third
   3524        of the vector is used as workspace by pcre_exec() while  matching  cap-
   3525        turing  subpatterns, and is not available for passing back information.
   3526        The number passed in ovecsize should always be a multiple of three.  If
   3527        it is not, it is rounded down.
   3528 
   3529        When  a  match  is successful, information about captured substrings is
   3530        returned in pairs of integers, starting at the  beginning  of  ovector,
   3531        and  continuing  up  to two-thirds of its length at the most. The first
   3532        element of each pair is set to the offset of the first character  in  a
   3533        substring,  and  the second is set to the offset of the first character
   3534        after the end of a substring. These values are always  data  unit  off-
   3535        sets,  even  in  UTF  mode. They are byte offsets in the 8-bit library,
   3536        16-bit data item offsets in the 16-bit library, and  32-bit  data  item
   3537        offsets in the 32-bit library. Note: they are not character counts.
   3538 
   3539        The first pair of integers, ovector[0] and ovector[1],  identifies  the
   3540        portion of the subject string matched by the entire pattern.  The  next
   3541        pair  is  used for the first capturing subpattern, and so on. The value
   3542        returned by pcre_exec() is one more than the highest numbered pair that
   3543        has  been  set.  For example, if two substrings have been captured, the
   3544        returned value is 3. If there are no capturing subpatterns, the  return
   3545        value from a successful match is 1, indicating that just the first pair
   3546        of offsets has been set.
   3547 
   3548        If a capturing subpattern is matched repeatedly, it is the last portion
   3549        of the string that it matched that is returned.
   3550 
   3551        If  the vector is too small to hold all the captured substring offsets,
   3552        it is used as far as possible (up to two-thirds of its length), and the
   3553        function  returns a value of zero. If neither the actual string matched
   3554        nor any captured substrings are of interest, pcre_exec() may be  called
   3555        with  ovector passed as NULL and ovecsize as zero. However, if the pat-
   3556        tern contains back references and the ovector  is  not  big  enough  to
   3557        remember  the related substrings, PCRE has to get additional memory for
   3558        use during matching. Thus it is usually advisable to supply an  ovector
   3559        of reasonable size.
   3560 
   3561        There  are  some  cases where zero is returned (indicating vector over-
   3562        flow) when in fact the vector is exactly the right size for  the  final
   3563        match. For example, consider the pattern
   3564 
   3565          (a)(?:(b)c|bd)
   3566 
   3567        If  a  vector of 6 elements (allowing for only 1 captured substring) is
   3568        given with subject string "abd", pcre_exec() will try to set the second
   3569        captured string, thereby recording a vector overflow, before failing to
   3570        match "c" and backing up  to  try  the  second  alternative.  The  zero
   3571        return,  however,  does  correctly  indicate that the maximum number of
   3572        slots (namely 2) has been filled. In similar cases where there is  tem-
   3573        porary  overflow,  but  the final number of used slots is actually less
   3574        than the maximum, a non-zero value is returned.
   3575 
   3576        The pcre_fullinfo() function can be used to find out how many capturing
   3577        subpatterns  there  are  in  a  compiled pattern. The smallest size for
   3578        ovector that will allow for n captured substrings, in addition  to  the
   3579        offsets of the substring matched by the whole pattern, is (n+1)*3.
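
               The sizing rule amounts to simple arithmetic, sketched here with
               two hypothetical helpers (neither is a PCRE function):

```c
#include <assert.h>

/* Smallest ovecsize that can return n captured substrings plus the
   whole-match pair: two offsets per pair, plus one third as workspace. */
int ovecsize_for_captures(int n)
{
    assert(n >= 0);
    return (n + 1) * 3;
}

/* Number of offset pairs (whole match included) that a vector of
   ovecsize elements can return. Integer division mirrors the rounding
   down of ovecsize to a multiple of three. */
int pairs_available(int ovecsize)
{
    assert(ovecsize >= 0);
    return ovecsize / 3;
}
```

               For instance, ovecsize_for_captures(1) is 6, and
               pairs_available(6) is 2, which agrees with the "abd" example
               above. The value of n would normally be obtained from
               pcre_fullinfo() with PCRE_INFO_CAPTURECOUNT.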
   3580 
   3581        It  is  possible for capturing subpattern number n+1 to match some part
   3582        of the subject when subpattern n has not been used at all. For example,
   3583        if  the  string  "abc"  is  matched against the pattern (a|(z))(bc) the
   3584        return from the function is 4, and subpatterns 1 and 3 are matched, but
   3585        2  is  not.  When  this happens, both values in the offset pairs corre-
   3586        sponding to unused subpatterns are set to -1.
   3587 
   3588        Offset values that correspond to unused subpatterns at the end  of  the
   3589        expression  are  also  set  to  -1. For example, if the string "abc" is
   3590        matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are  not
   3591        matched.  The  return  from the function is 2, because the highest used
   3592        capturing subpattern number is 1, and  the  offsets  for  the  second
   3593        and  third  capturing subpatterns (assuming the vector is large enough,
   3594        of course) are set to -1.
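
               A caller that reads ovector directly therefore has to check for
               unset groups before using a pair. A minimal sketch, in which
               capture_span() is a hypothetical helper and rc is the positive
               value returned by pcre_exec():

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: looks up capture group n after a successful
   match. Stores the start offset and length and returns 1 if the group
   took part in the match; returns 0 if it is unset (offsets of -1) or
   numbered above the highest pair that was set. */
int capture_span(const int *ovector, int rc, int n, int *start, int *len)
{
    assert(ovector != NULL && start != NULL && len != NULL);
    if (n < 0 || n >= rc)
        return 0;                  /* beyond the highest pair that was set */
    if (ovector[2 * n] < 0)
        return 0;                  /* unset group: both offsets are -1 */
    *start = ovector[2 * n];
    *len = ovector[2 * n + 1] - ovector[2 * n];
    return 1;
}
```

               For "abc" matched against (a|(z))(bc), where rc is 4, the helper
               reports groups 0, 1, and 3 as set and group 2 as unset.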
   3595 
   3596        Note: Elements in the first two-thirds of ovector that do not
   3597        correspond to capturing parentheses in the pattern are never changed.
   3598        That is, if a pattern contains n capturing parentheses, no more than
   3599        ovector[0] to ovector[2n+1] are set by pcre_exec(). The other elements
   3600        (in the first two-thirds) retain whatever values they previously had.
   3601 
   3602        Some convenience functions are provided  for  extracting  the  captured
   3603        substrings as separate strings. These are described below.
   3604 
   3605    Error return values from pcre_exec()
   3606 
   3607        If  pcre_exec()  fails, it returns a negative number. The following are
   3608        defined in the header file:
   3609 
   3610          PCRE_ERROR_NOMATCH        (-1)
   3611 
   3612        The subject string did not match the pattern.
   3613 
   3614          PCRE_ERROR_NULL           (-2)
   3615 
   3616        Either code or subject was passed as NULL,  or  ovector  was  NULL  and
   3617        ovecsize was not zero.
   3618 
   3619          PCRE_ERROR_BADOPTION      (-3)
   3620 
   3621        An unrecognized bit was set in the options argument.
   3622 
   3623          PCRE_ERROR_BADMAGIC       (-4)
   3624 
   3625        PCRE  stores a 4-byte "magic number" at the start of the compiled code,
   3626        to catch the case when it is passed a junk pointer and to detect when a
   3627        pattern that was compiled in an environment of one endianness is run in
   3628        an environment with the other endianness. This is the error  that  PCRE
   3629        gives when the magic number is not present.
   3630 
   3631          PCRE_ERROR_UNKNOWN_OPCODE (-5)
   3632 
   3633        While running the pattern match, an unknown item was encountered in the
   3634        compiled pattern. This error could be caused by a bug  in  PCRE  or  by
   3635        overwriting of the compiled pattern.
   3636 
   3637          PCRE_ERROR_NOMEMORY       (-6)
   3638 
   3639        If  a  pattern contains back references, but the ovector that is passed
   3640        to pcre_exec() is not big enough to remember the referenced substrings,
   3641        PCRE  gets  a  block of memory at the start of matching to use for this
   3642        purpose. If the call via pcre_malloc() fails, this error is given.  The
   3643        memory is automatically freed at the end of matching.
   3644 
   3645        This  error  is also given if pcre_stack_malloc() fails in pcre_exec().
   3646        This can happen only when PCRE has been compiled with  --disable-stack-
   3647        for-recursion.
   3648 
   3649          PCRE_ERROR_NOSUBSTRING    (-7)
   3650 
   3651        This  error is used by the pcre_copy_substring(), pcre_get_substring(),
   3652        and  pcre_get_substring_list()  functions  (see  below).  It  is  never
   3653        returned by pcre_exec().
   3654 
   3655          PCRE_ERROR_MATCHLIMIT     (-8)
   3656 
   3657        The  backtracking  limit,  as  specified  by the match_limit field in a
   3658        pcre_extra structure (or defaulted) was reached.  See  the  description
   3659        above.
   3660 
   3661          PCRE_ERROR_CALLOUT        (-9)
   3662 
   3663        This error is never generated by pcre_exec() itself. It is provided for
   3664        use by callout functions that want to yield a distinctive  error  code.
   3665        See the pcrecallout documentation for details.
   3666 
   3667          PCRE_ERROR_BADUTF8        (-10)
   3668 
   3669        A  string  that contains an invalid UTF-8 byte sequence was passed as a
   3670        subject, and the PCRE_NO_UTF8_CHECK option was not set. If the size  of
   3671        the  output  vector  (ovecsize)  is  at least 2, the byte offset to the
   3672        start of the invalid UTF-8 character is placed  in  the  first  ele-
   3673        ment,  and  a  reason  code is placed in the second element. The reason
   3674        codes are listed in the following section.  For backward compatibility,
   3675        if  PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 char-
   3676        acter  at  the  end  of  the   subject   (reason   codes   1   to   5),
   3677        PCRE_ERROR_SHORTUTF8 is returned instead of PCRE_ERROR_BADUTF8.
   3678 
   3679          PCRE_ERROR_BADUTF8_OFFSET (-11)
   3680 
   3681        The  UTF-8  byte  sequence that was passed as a subject was checked and
   3682        found to be valid (the PCRE_NO_UTF8_CHECK option was not set), but  the
   3683        value  of startoffset did not point to the beginning of a UTF-8 charac-
   3684        ter or the end of the subject.
   3685 
   3686          PCRE_ERROR_PARTIAL        (-12)
   3687 
   3688        The subject string did not match, but it did match partially.  See  the
   3689        pcrepartial documentation for details of partial matching.
   3690 
   3691          PCRE_ERROR_BADPARTIAL     (-13)
   3692 
   3693        This  code  is  no  longer  in  use.  It was formerly returned when the
   3694        PCRE_PARTIAL option was used with a compiled pattern  containing  items
   3695        that  were  not  supported  for  partial  matching.  From  release 8.00
   3696        onwards, there are no restrictions on partial matching.
   3697 
   3698          PCRE_ERROR_INTERNAL       (-14)
   3699 
   3700        An unexpected internal error has occurred. This error could  be  caused
   3701        by a bug in PCRE or by overwriting of the compiled pattern.
   3702 
   3703          PCRE_ERROR_BADCOUNT       (-15)
   3704 
   3705        This error is given if the value of the ovecsize argument is negative.
   3706 
   3707          PCRE_ERROR_RECURSIONLIMIT (-21)
   3708 
   3709        The internal recursion limit, as specified by the match_limit_recursion
   3710        field in a pcre_extra structure (or defaulted)  was  reached.  See  the
   3711        description above.
   3712 
   3713          PCRE_ERROR_BADNEWLINE     (-23)
   3714 
   3715        An invalid combination of PCRE_NEWLINE_xxx options was given.
   3716 
   3717          PCRE_ERROR_BADOFFSET      (-24)
   3718 
   3719        The value of startoffset was negative or greater than the length of the
   3720        subject, that is, the value in length.
   3721 
   3722          PCRE_ERROR_SHORTUTF8      (-25)
   3723 
   3724        This error is returned instead of PCRE_ERROR_BADUTF8 when  the  subject
   3725        string  ends with a truncated UTF-8 character and the PCRE_PARTIAL_HARD
   3726        option is set.  Information  about  the  failure  is  returned  as  for
   3727        PCRE_ERROR_BADUTF8. That information is in fact sufficient to detect
   3728        this case, but this special error code for PCRE_PARTIAL_HARD predates
   3729        the implementation of returned information; it is retained for
   3730        backwards compatibility.
   3731 
   3732          PCRE_ERROR_RECURSELOOP    (-26)
   3733 
   3734        This error is returned when pcre_exec() detects a recursion loop within
   3735        the  pattern. Specifically, it means that either the whole pattern or a
   3736        subpattern has been called recursively for the second time at the  same
   3737        position in the subject string. Some simple patterns that might do this
   3738        are detected and faulted at compile time, but more  complicated  cases,
   3739        in particular mutual recursions between two different subpatterns, can-
   3740        not be detected until run time.
   3741 
   3742          PCRE_ERROR_JIT_STACKLIMIT (-27)
   3743 
   3744        This error is returned when a pattern  that  was  successfully  studied
   3745        using  a  JIT compile option is being matched, but the memory available
   3746        for the just-in-time processing stack is  not  large  enough.  See  the
   3747        pcrejit documentation for more details.
   3748 
   3749          PCRE_ERROR_BADMODE        (-28)
   3750 
   3751        This error is given if a pattern that was compiled by the 8-bit library
   3752        is passed to a 16-bit or 32-bit library function, or vice versa.
   3753 
   3754          PCRE_ERROR_BADENDIANNESS  (-29)
   3755 
   3756        This error is given if  a  pattern  that  was  compiled  and  saved  is
   3757        reloaded  on  a  host  with  different endianness. The utility function
   3758        pcre_pattern_to_host_byte_order() can be used to convert such a pattern
   3759        so that it runs on the new host.
   3760 
   3761          PCRE_ERROR_JIT_BADOPTION  (-31)
   3762 
   3763        This  error  is  returned  when a pattern that was successfully studied
   3764        using a JIT compile option is being  matched,  but  the  matching  mode
   3765        (partial  or complete match) does not correspond to any JIT compilation
   3766        mode. When the JIT fast path function is used, this error may  also  be
   3767        given  for  invalid  options.  See  the  pcrejit documentation for more
   3768        details.
   3769 
   3770          PCRE_ERROR_BADLENGTH      (-32)
   3771 
   3772        This error is given if pcre_exec() is called with a negative value  for
   3773        the length argument.
   3774 
   3775        Error numbers -16 to -20, -22, and -30 are not used by pcre_exec().
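
               For logging, the codes in this section can be mapped back to
               their names. The sketch below uses the numeric values listed
               above rather than the macros from the header file, so that it
               stands alone. exec_error_name() is a hypothetical helper, and the
               value -31 for PCRE_ERROR_JIT_BADOPTION is an assumption taken
               from pcre.h:

```c
/* Hypothetical helper: maps a pcre_exec() return value to the name used
   in this section. Codes that pcre_exec() never returns fall through to
   the default case. */
const char *exec_error_name(int rc)
{
    switch (rc) {
    case  -1: return "PCRE_ERROR_NOMATCH";
    case  -2: return "PCRE_ERROR_NULL";
    case  -3: return "PCRE_ERROR_BADOPTION";
    case  -4: return "PCRE_ERROR_BADMAGIC";
    case  -5: return "PCRE_ERROR_UNKNOWN_OPCODE";
    case  -6: return "PCRE_ERROR_NOMEMORY";
    case  -8: return "PCRE_ERROR_MATCHLIMIT";
    case  -9: return "PCRE_ERROR_CALLOUT";
    case -10: return "PCRE_ERROR_BADUTF8";
    case -11: return "PCRE_ERROR_BADUTF8_OFFSET";
    case -12: return "PCRE_ERROR_PARTIAL";
    case -13: return "PCRE_ERROR_BADPARTIAL";
    case -14: return "PCRE_ERROR_INTERNAL";
    case -15: return "PCRE_ERROR_BADCOUNT";
    case -21: return "PCRE_ERROR_RECURSIONLIMIT";
    case -23: return "PCRE_ERROR_BADNEWLINE";
    case -24: return "PCRE_ERROR_BADOFFSET";
    case -25: return "PCRE_ERROR_SHORTUTF8";
    case -26: return "PCRE_ERROR_RECURSELOOP";
    case -27: return "PCRE_ERROR_JIT_STACKLIMIT";
    case -28: return "PCRE_ERROR_BADMODE";
    case -29: return "PCRE_ERROR_BADENDIANNESS";
    case -31: return "PCRE_ERROR_JIT_BADOPTION";   /* value assumed */
    case -32: return "PCRE_ERROR_BADLENGTH";
    default:  return rc >= 0 ? "not an error" : "not used by pcre_exec()";
    }
}
```

               In a program that includes pcre.h, the case labels would more
               naturally be written with the PCRE_ERROR_xxx macros themselves.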
   3776 
   3777    Reason codes for invalid UTF-8 strings
   3778 
   3779        This  section  applies  only  to  the  8-bit library. The corresponding
   3780        information for the 16-bit and 32-bit libraries is given in the  pcre16
   3781        and pcre32 pages.
   3782 
   3783        When pcre_exec() returns either PCRE_ERROR_BADUTF8 or
   3784        PCRE_ERROR_SHORTUTF8, and the size of the output vector (ovecsize) is
   3785        at least 2, the offset of the start of the invalid UTF-8 character is
   3786        placed in the first output vector element (ovector[0]) and a reason
   3787        code is placed in the second element (ovector[1]). The reason codes are
   3788        given names in the pcre.h header file:
   3789 
   3790          PCRE_UTF8_ERR1
   3791          PCRE_UTF8_ERR2
   3792          PCRE_UTF8_ERR3
   3793          PCRE_UTF8_ERR4
   3794          PCRE_UTF8_ERR5
   3795 
   3796        The string ends with a truncated UTF-8 character;  the  code  specifies
   3797        how  many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
   3798        characters to be no longer than 4 bytes, the  encoding  scheme  (origi-
   3799        nally  defined  by  RFC  2279)  allows  for  up to 6 bytes, and this is
   3800        checked first; hence the possibility of 4 or 5 missing bytes.
   3801 
   3802          PCRE_UTF8_ERR6
   3803          PCRE_UTF8_ERR7
   3804          PCRE_UTF8_ERR8
   3805          PCRE_UTF8_ERR9
   3806          PCRE_UTF8_ERR10
   3807 
   3808        The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
   3809        the  character  do  not have the binary value 0b10 (that is, either the
   3810        most significant bit is 0, or the next bit is 1).
   3811 
   3812          PCRE_UTF8_ERR11
   3813          PCRE_UTF8_ERR12
   3814 
   3815        A character that is valid by the RFC 2279 rules is either 5 or 6  bytes
   3816        long; these code points are excluded by RFC 3629.
   3817 
   3818          PCRE_UTF8_ERR13
   3819 
   3820        A 4-byte character has a value greater than 0x10ffff; these code points
   3821        are excluded by RFC 3629.
   3822 
   3823          PCRE_UTF8_ERR14
   3824 
   3825        A 3-byte character has a value in the  range  0xd800  to  0xdfff;  this
   3826        range  of  code points is reserved by RFC 3629 for use with UTF-16, and
   3827        so is excluded from UTF-8.
   3828 
   3829          PCRE_UTF8_ERR15
   3830          PCRE_UTF8_ERR16
   3831          PCRE_UTF8_ERR17
   3832          PCRE_UTF8_ERR18
   3833          PCRE_UTF8_ERR19
   3834 
   3835        A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it  codes
   3836        for  a  value that can be represented by fewer bytes, which is invalid.
   3837        For example, the two bytes 0xc0, 0xae give the value 0x2e,  whose  cor-
   3838        rect coding uses just one byte.
   3839 
   3840          PCRE_UTF8_ERR20
   3841 
   3842        The two most significant bits of the first byte of a character have the
   3843        binary value 0b10 (that is, the most significant bit is 1 and the  sec-
   3844        ond  is  0). Such a byte can only validly occur as the second or subse-
   3845        quent byte of a multi-byte character.
   3846 
   3847          PCRE_UTF8_ERR21
   3848 
   3849        The first byte of a character has the value 0xfe or 0xff. These  values
   3850        can never occur in a valid UTF-8 string.
   3851 
   3852          PCRE_UTF8_ERR22
   3853 
   3854        This  error  code  was  formerly  used when the presence of a so-called
   3855        "non-character" caused an error. Unicode corrigendum #9 makes it  clear
   3856        that  such  characters should not cause a string to be rejected, and so
   3857        this code is no longer in use and is never returned.
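
               The groups of reason codes above can be folded into a small
               classifier. This is a sketch under two assumptions: that each
               PCRE_UTF8_ERRn macro has the numeric value n, and that the codes
               within each group correspond to the listed cases in order. The
               enum and the function classify_utf8_reason() are hypothetical
               names, not part of PCRE:

```c
#include <assert.h>
#include <stddef.h>

enum utf8_fault {
    UTF8_TRUNCATED,          /* ERR1..ERR5: cut off at end of subject */
    UTF8_BAD_CONTINUATION,   /* ERR6..ERR10: top bits of a later byte not 10 */
    UTF8_TOO_LONG,           /* ERR11, ERR12: 5- or 6-byte character */
    UTF8_OUT_OF_RANGE,       /* ERR13: value above 0x10ffff */
    UTF8_SURROGATE,          /* ERR14: value in 0xd800..0xdfff */
    UTF8_OVERLONG,           /* ERR15..ERR19: too many bytes for the value */
    UTF8_STRAY_CONTINUATION, /* ERR20: first byte is 10xxxxxx */
    UTF8_INVALID_BYTE,       /* ERR21: first byte is 0xfe or 0xff */
    UTF8_UNKNOWN             /* anything else, including the retired ERR22 */
};

/* Classifies the reason code left in ovector[1]. *detail receives the
   number of missing bytes (UTF8_TRUNCATED), the position 2..6 of the
   offending byte (UTF8_BAD_CONTINUATION), or the character length in
   bytes (UTF8_TOO_LONG, UTF8_OVERLONG); otherwise it is set to 0. */
enum utf8_fault classify_utf8_reason(int reason, int *detail)
{
    assert(detail != NULL);
    *detail = 0;
    if (reason >= 1 && reason <= 5) {
        *detail = reason;           /* number of missing bytes */
        return UTF8_TRUNCATED;
    }
    if (reason >= 6 && reason <= 10) {
        *detail = reason - 4;       /* ERR6 is the 2nd byte, ERR10 the 6th */
        return UTF8_BAD_CONTINUATION;
    }
    if (reason == 11 || reason == 12) {
        *detail = reason - 6;       /* 5 or 6 bytes long */
        return UTF8_TOO_LONG;
    }
    if (reason == 13) return UTF8_OUT_OF_RANGE;
    if (reason == 14) return UTF8_SURROGATE;
    if (reason >= 15 && reason <= 19) {
        *detail = reason - 13;      /* ERR15 is 2 bytes, ERR19 is 6 */
        return UTF8_OVERLONG;
    }
    if (reason == 20) return UTF8_STRAY_CONTINUATION;
    if (reason == 21) return UTF8_INVALID_BYTE;
    return UTF8_UNKNOWN;
}
```

               A caller would pass ovector[1] after pcre_exec() has returned
               PCRE_ERROR_BADUTF8 or PCRE_ERROR_SHORTUTF8, provided that
               ovecsize was at least 2.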
   3858 
   3859 
   3860 EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
   3861 
   3862        int pcre_copy_substring(const char *subject, int *ovector,
   3863             int stringcount, int stringnumber, char *buffer,
   3864             int buffersize);
   3865 
   3866        int pcre_get_substring(const char *subject, int *ovector,
   3867             int stringcount, int stringnumber,
   3868             const char **stringptr);
   3869 
   3870        int pcre_get_substring_list(const char *subject,
   3871             int *ovector, int stringcount, const char ***listptr);
   3872 
   3873        Captured substrings can be  accessed  directly  by  using  the  offsets
   3874        returned  by  pcre_exec()  in  ovector.  For convenience, the functions
   3875        pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-
   3876        string_list()  are  provided for extracting captured substrings as new,
   3877        separate, zero-terminated strings. These functions identify  substrings
   3878        by  number.  The  next section describes functions for extracting named
   3879        substrings.
   3880 
   3881        A substring that contains a binary zero is correctly extracted and  has
   3882        a  further zero added on the end, but the result is not, of course, a C
   3883        string.  However, you can process such a string  by  referring  to  the
   3884        length  that  is  returned  by  pcre_copy_substring() and pcre_get_sub-
   3885        string().  Unfortunately, the interface to pcre_get_substring_list() is
   3886        not  adequate for handling strings containing binary zeros, because the
   3887        end of the final string is not independently indicated.
   3888 
   3889        The first three arguments are the same for all  three  of  these  func-
   3890        tions:  subject  is  the subject string that has just been successfully
   3891        matched, ovector is a pointer to the vector of integer offsets that was
   3892        passed to pcre_exec(), and stringcount is the number of substrings that
   3893        were captured by the match, including the substring  that  matched  the
   3894        entire regular expression. This is the value returned by pcre_exec() if
   3895        it is greater than zero. If pcre_exec() returned zero, indicating  that
   3896        it  ran out of space in ovector, the value passed as stringcount should
   3897        be the number of elements in the vector divided by three.
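
               The rule for choosing stringcount can be written down directly.
               stringcount_from_rc() is a hypothetical helper; rc is the value
               returned by pcre_exec() and ovecsize is the value that was passed
               to it:

```c
#include <assert.h>

/* Hypothetical helper: turns the value returned by pcre_exec() into the
   stringcount argument expected by the substring functions. Returns -1
   if rc is an error code, in which case there is nothing to extract. */
int stringcount_from_rc(int rc, int ovecsize)
{
    assert(ovecsize >= 0);
    if (rc < 0)
        return -1;                 /* no match, or an error occurred */
    if (rc == 0)
        return ovecsize / 3;       /* ovector overflowed: every pair is in use */
    return rc;
}
```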
   3898 
   3899        The functions pcre_copy_substring() and pcre_get_substring() extract  a
   3900        single  substring,  whose  number  is given as stringnumber. A value of
   3901        zero extracts the substring that matched the  entire  pattern,  whereas
   3902        higher  values  extract  the  captured  substrings.  For pcre_copy_sub-
   3903        string(), the string is placed in buffer,  whose  length  is  given  by
   3904        buffersize,  while  for  pcre_get_substring()  a new block of memory is
   3905        obtained via pcre_malloc, and its address is  returned  via  stringptr.
   3906        The  yield  of  the function is the length of the string, not including
   3907        the terminating zero, or one of these error codes:
   3908 
   3909          PCRE_ERROR_NOMEMORY       (-6)
   3910 
   3911        The buffer was too small for pcre_copy_substring(), or the  attempt  to
   3912        get memory failed for pcre_get_substring().
   3913 
   3914          PCRE_ERROR_NOSUBSTRING    (-7)
   3915 
   3916        There is no substring whose number is stringnumber.
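
               To make the contract concrete, here is a sketch that reproduces
               the documented behaviour of pcre_copy_substring() for the 8-bit
               library. It is an illustration only, not PCRE's source; copy_span()
               is a hypothetical name, and the error values are the two listed
               above:

```c
#include <assert.h>
#include <string.h>

/* Sketch of the documented behaviour: copies substring `stringnumber`
   into buffer, adds a terminating zero, and returns the length, or -7
   (PCRE_ERROR_NOSUBSTRING) or -6 (PCRE_ERROR_NOMEMORY) on failure. */
int copy_span(const char *subject, const int *ovector, int stringcount,
              int stringnumber, char *buffer, int buffersize)
{
    assert(subject != NULL && ovector != NULL && buffer != NULL);
    if (stringnumber < 0 || stringnumber >= stringcount)
        return -7;                          /* no such substring */
    int start = ovector[2 * stringnumber];
    int end = ovector[2 * stringnumber + 1];
    int len = (start < 0) ? 0 : end - start; /* unset group: empty string */
    if (buffersize < len + 1)
        return -6;                          /* buffer too small */
    if (len > 0)
        memcpy(buffer, subject + start, (size_t)len);
    buffer[len] = '\0';
    return len;
}
```

               Because the length is returned separately from the terminating
               zero, a substring that contains binary zeros can still be
               processed by the caller.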
   3917 
   3918        The  pcre_get_substring_list()  function  extracts  all  available sub-
   3919        strings and builds a list of pointers to them. All this is  done  in  a
   3920        single block of memory that is obtained via pcre_malloc. The address of
   3921        the memory block is returned via listptr, which is also  the  start  of
   3922        the  list  of  string pointers. The end of the list is marked by a NULL
   3923        pointer. The yield of the function is zero if all  went  well,  or  the
   3924        error code
   3925 
   3926          PCRE_ERROR_NOMEMORY       (-6)
   3927 
   3928        if the attempt to get the memory block failed.
   3929 
   3930        When any of these functions encounters a substring that is unset, which
   3931        can happen when capturing subpattern number n+1 matches  some  part  of
   3932        the  subject, but subpattern n has not been used at all, they return an
   3933        empty string. This can be distinguished from a genuine zero-length sub-
   3934        string  by inspecting the appropriate offset in ovector, which is nega-
   3935        tive for unset substrings.
   3936 
   3937        The two convenience functions pcre_free_substring() and  pcre_free_sub-
   3938        string_list()  can  be  used  to free the memory returned by a previous
   3939        call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-
   3940        tively.  They  do  nothing  more  than  call the function pointed to by
   3941        pcre_free, which of course could be called directly from a  C  program.
   3942        However,  PCRE is used in some situations where it is linked via a spe-
   3943        cial  interface  to  another  programming  language  that  cannot   use
   3944        pcre_free  directly;  it is for these cases that the functions are pro-
   3945        vided.
   3946 
   3947 
   3948 EXTRACTING CAPTURED SUBSTRINGS BY NAME
   3949 
   3950        int pcre_get_stringnumber(const pcre *code,
   3951             const char *name);
   3952 
   3953        int pcre_copy_named_substring(const pcre *code,
   3954             const char *subject, int *ovector,
   3955             int stringcount, const char *stringname,
   3956             char *buffer, int buffersize);
   3957 
   3958        int pcre_get_named_substring(const pcre *code,
   3959             const char *subject, int *ovector,
   3960             int stringcount, const char *stringname,
   3961             const char **stringptr);
   3962 
   3963        To extract a substring by name, you first have to find  the  associated
   3964        number. For example, for this pattern
   3965 
   3966          (a+)b(?<xxx>\d+)...
   3967 
   3968        the number of the subpattern called "xxx" is 2. If the name is known to
   3969        be unique (PCRE_DUPNAMES was not set), you can find the number from the
   3970        name by calling pcre_get_stringnumber(). The first argument is the com-
   3971        piled pattern, and the second is the name. The yield of the function is
   3972        the  subpattern  number,  or PCRE_ERROR_NOSUBSTRING (-7) if there is no
   3973        subpattern of that name.
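
               The lookup itself is a search of the pattern's name table. The
               sketch below assumes the 8-bit table layout that is documented for
               the PCRE_INFO_NAMETABLE item of pcre_fullinfo(): a run of
               fixed-size entries, each holding a 2-byte group number (most
               significant byte first) followed by the zero-terminated name.
               lookup_group_number() is a hypothetical name, and the linear scan
               is a simplification chosen for brevity:

```c
#include <assert.h>
#include <string.h>

/* Sketch of a name-to-number lookup over a name table of `namecount`
   entries, each `entrysize` bytes long. Returns the group number, or -7
   (PCRE_ERROR_NOSUBSTRING) if there is no group of that name. */
int lookup_group_number(const unsigned char *table, int namecount,
                        int entrysize, const char *name)
{
    assert(table != NULL && name != NULL && entrysize > 2);
    for (int i = 0; i < namecount; i++) {
        const unsigned char *entry = table + (size_t)i * entrysize;
        /* The name starts after the two bytes of the group number. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -7;
}
```

               For the pattern above, the table would hold a single entry
               consisting of the bytes 0, 2, 'x', 'x', 'x', 0, and a lookup of
               "xxx" would yield 2. In real code, pcre_get_stringnumber() does
               this work.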
   3974 
   3975        Given the number, you can extract the substring directly, or use one of
   3976        the functions described in the previous section. For convenience, there
   3977        are also two functions that do the whole job.
   3978 
   3979        Most   of   the   arguments    of    pcre_copy_named_substring()    and
   3980        pcre_get_named_substring()  are  the  same  as  those for the similarly
   3981        named functions that extract by number. As these are described  in  the
   3982        previous  section,  they  are not re-described here. There are just two
   3983        differences:
   3984 
   3985        First, instead of a substring number, a substring name is  given.  Sec-
   3986        ond, there is an extra argument, given at the start, which is a pointer
   3987        to the compiled pattern. This is needed in order to gain access to  the
   3988        name-to-number translation table.
   3989 
   3990        These  functions call pcre_get_stringnumber(), and if it succeeds, they
   3991        then call pcre_copy_substring() or pcre_get_substring(),  as  appropri-
   3992        ate.  NOTE:  If PCRE_DUPNAMES is set and there are duplicate names, the
   3993        behaviour may not be what you want (see the next section).
   3994 
   3995        Warning: If the pattern uses the (?| feature to set up multiple subpat-
   3996        terns  with  the  same number, as described in the section on duplicate
   3997        subpattern numbers in the pcrepattern page, you  cannot  use  names  to
   3998        distinguish  the  different subpatterns, because names are not included
   3999        in the compiled code. The matching process uses only numbers. For  this
   4000        reason,  the  use of different names for subpatterns of the same number
   4001        causes an error at compile time.
   4002 
   4003 
   4004 DUPLICATE SUBPATTERN NAMES
   4005 
   4006        int pcre_get_stringtable_entries(const pcre *code,
   4007             const char *name, char **first, char **last);
   4008 
   4009        When a pattern is compiled with the  PCRE_DUPNAMES  option,  names  for
   4010        subpatterns  are not required to be unique. (Duplicate names are always
   4011        allowed for subpatterns with the same number, created by using the  (?|
   4012        feature.  Indeed,  if  such subpatterns are named, they are required to
   4013        use the same names.)
   4014 
   4015        Normally, patterns with duplicate names are such that in any one match,
   4016        only  one of the named subpatterns participates. An example is shown in
   4017        the pcrepattern documentation.
   4018 
   4019        When   duplicates   are   present,   pcre_copy_named_substring()    and
   4020        pcre_get_named_substring()  return the first substring corresponding to
   4021        the given name that is set. If  none  are  set,  PCRE_ERROR_NOSUBSTRING
   4022        (-7)  is  returned;  no  data  is returned. The pcre_get_stringnumber()
   4023        function returns one of the numbers that are associated with the  name,
   4024        but it is not defined which it is.
   4025 
   4026        If  you want to get full details of all captured substrings for a given
   4027        name, you must use  the  pcre_get_stringtable_entries()  function.  The
   4028        first argument is the compiled pattern, and the second is the name. The
   4029        third and fourth are pointers to variables which  are  updated  by  the
   4030        function. After it has run, they point to the first and last entries in
   4031        the name-to-number table  for  the  given  name.  The  function  itself
   4032        returns  the  length  of  each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
   4033        there are none. The format of the table is described in the section
   4034        entitled "Information about a pattern" above. Given all the relevant
   4035        entries for the name, you can extract each of their numbers, and
   4036        hence the captured data, if any.
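
               A minimal sketch of that last step follows. It assumes  the  8-bit
               library, in which each table entry starts with the group number in
               two bytes, most significant first, followed by the zero-terminated
               name. The helper name collect_group_numbers is hypothetical and is
               not part of PCRE; first, last, and entrysize stand for the  values
               obtained from pcre_get_stringtable_entries().

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a PCRE function): walk the name-table entries
 * from 'first' to 'last' inclusive, as set by
 * pcre_get_stringtable_entries(), and store each group number in
 * 'numbers'. 'entrysize' is that function's return value.
 * Assumes the 8-bit library layout: two bytes of group number, most
 * significant first, then the zero-terminated name.
 * Returns the number of entries seen, which may exceed 'max'.
 */
int collect_group_numbers(const char *first, const char *last,
                          int entrysize, int *numbers, int max)
{
    const unsigned char *base = (const unsigned char *)first;
    size_t entries, i;

    /* A non-positive size means the lookup failed, for example with
       PCRE_ERROR_NOSUBSTRING, so there is nothing to walk. */
    if (entrysize <= 0 || first == NULL || last < first)
        return 0;

    entries = (size_t)(last - first) / (size_t)entrysize + 1;
    for (i = 0; i < entries; i++) {
        const unsigned char *entry = base + i * (size_t)entrysize;
        if ((int)i < max)
            numbers[i] = (entry[0] << 8) | entry[1];
    }
    return (int)entries;
}
```

               Each number n stored by the helper identifies the pair ovector[2*n]
               and ovector[2*n+1] that was filled in by pcre_exec().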
   4037 
   4038 
   4039 FINDING ALL POSSIBLE MATCHES
   4040 
   4041        The  traditional  matching  function  uses a similar algorithm to Perl,
   4042        which stops when it finds the first match, starting at a given point in
   4043        the  subject.  If you want to find all possible matches, or the longest
   4044        possible match, consider using the alternative matching  function  (see
   4045        below)  instead.  If you cannot use the alternative function, but still
   4046        need to find all possible matches, you can kludge it up by  making  use
   4047        of the callout facility, which is described in the pcrecallout documen-
   4048        tation.
   4049 
   4050        What you have to do is to insert a callout right at the end of the pat-
   4051        tern.   When your callout function is called, extract and save the cur-
   4052        rent matched substring. Then return  1,  which  forces  pcre_exec()  to
   4053        backtrack  and  try other alternatives. Ultimately, when it runs out of
   4054        matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
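
               The body of such a callout function might look  like  the  sketch
               below. So that it compiles without pcre.h, it uses a hypothetical
               stand-in structure, demo_callout_block, holding  only  the  three
               pcre_callout_block fields that the technique reads. The function
               name and the fixed-size store are likewise assumptions  made  for
               illustration. In a real  program  the  function  would  take  a
               pcre_callout_block pointer and be installed by assigning  it  to
               pcre_callout.

```c
#include <string.h>

/* Hypothetical stand-in for pcre_callout_block, reduced to the fields
   this technique reads, so that the sketch compiles without pcre.h. */
struct demo_callout_block {
    const char *subject;        /* the subject string */
    int start_match;            /* offset where this attempt started */
    int current_position;       /* offset of the current match pointer */
};

#define MAX_SAVED 16            /* illustrative limits, not PCRE values */
#define MAX_LEN   64

static char saved[MAX_SAVED][MAX_LEN];
static int saved_count = 0;

/* Body of a callout placed at the very end of the pattern: record the
   text matched so far, then return 1 so that the matcher backtracks
   and tries the remaining alternatives. */
int save_match_and_backtrack(const struct demo_callout_block *cb)
{
    int len = cb->current_position - cb->start_match;

    if (saved_count < MAX_SAVED && len >= 0 && len < MAX_LEN) {
        memcpy(saved[saved_count], cb->subject + cb->start_match,
               (size_t)len);
        saved[saved_count][len] = '\0';
        saved_count++;
    }
    return 1;   /* positive: fail at this point, but keep trying */
}
```

               Because the function always returns 1, the  overall  match  never
               succeeds; the saved strings are the result.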
   4055 
   4056 
   4057 OBTAINING AN ESTIMATE OF STACK USAGE
   4058 
   4059        Matching certain patterns using pcre_exec() can use a  lot  of  process
   4060        stack,  which  in  certain  environments can be rather limited in size.
   4061        Some users find it helpful to have an estimate of the amount  of  stack
   4062        that  is  used  by  pcre_exec(),  to help them set recursion limits, as
   4063        described in the pcrestack documentation. The estimate that  is  output
   4064        by pcretest when called with the -m and -C options is obtained by call-
   4065        ing pcre_exec with the values NULL, NULL, NULL, -999, and -999 for  its
   4066        first five arguments.
   4067 
   4068        Normally,  if  its  first  argument  is  NULL,  pcre_exec() immediately
   4069        returns the negative error code PCRE_ERROR_NULL, but with this  special
   4070        combination  of  arguments,  it returns instead a negative number whose
   4071        absolute value is the approximate stack frame size in bytes.  (A  nega-
   4072        tive  number  is  used so that it is clear that no match has happened.)
   4073        The value is approximate because in  some  cases,  recursive  calls  to
   4074        pcre_exec() occur when there are one or two additional variables on the
   4075        stack.
   4076 
   4077        If PCRE has been compiled to use the heap  instead  of  the  stack  for
   4078        recursion,  the  value  returned  is  the  size  of  each block that is
   4079        obtained from the heap.
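
               One way of turning the  estimate  into  a  recursion  limit  is
               sketched below. The helper name and  the  policy  of  dividing  a
               stack budget by the frame size are assumptions made for illustra-
               tion; they are not part of the PCRE  API.  The  result  could  be
               stored in the match_limit_recursion field of a pcre_extra block.

```c
/* Hypothetical helper (not a PCRE function). 'estimate' is the negative
   value returned by the special call
   pcre_exec(NULL, NULL, NULL, -999, -999, ...); its absolute value
   approximates the bytes used per recursion. 'stack_budget' is the
   number of bytes the caller is willing to spend on matching.
   Returns a recursion limit, or 0 if 'estimate' is not negative. */
unsigned long recursion_limit_for_stack(int estimate,
                                        unsigned long stack_budget)
{
    unsigned long frame_size;

    if (estimate >= 0)
        return 0;               /* not a stack estimate */

    frame_size = (unsigned long)(-(long)estimate);
    return stack_budget / frame_size;
}
```

               With an estimate of -500 and a budget  of  8000000  bytes,  the
               helper returns 16000.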
   4080 
   4081 
   4082 MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
   4083 
   4084        int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
   4085             const char *subject, int length, int startoffset,
   4086             int options, int *ovector, int ovecsize,
   4087             int *workspace, int wscount);
   4088 
   4089        The function pcre_dfa_exec()  is  called  to  match  a  subject  string
   4090        against  a  compiled pattern, using a matching algorithm that scans the
   4091        subject string just once, and does not backtrack.  This  has  different
   4092        characteristics  to  the  normal  algorithm, and is not compatible with
   4093        Perl. Some of the features of PCRE patterns are not  supported.  Never-
   4094        theless,  there are times when this kind of matching can be useful. For
   4095        a discussion of the two matching algorithms, and  a  list  of  features
   4096        that  pcre_dfa_exec() does not support, see the pcrematching documenta-
   4097        tion.
   4098 
   4099        The arguments for the pcre_dfa_exec() function  are  the  same  as  for
   4100        pcre_exec(), plus two extras. The ovector argument is used in a differ-
   4101        ent way, and this is described below. The other  common  arguments  are
   4102        used  in  the  same way as for pcre_exec(), so their description is not
   4103        repeated here.
   4104 
   4105        The two additional arguments provide workspace for  the  function.  The
   4106        workspace  vector  should  contain at least 20 elements. It is used for
   4107        keeping  track  of  multiple  paths  through  the  pattern  tree.  More
   4108        workspace  will  be  needed for patterns and subjects where there are a
   4109        lot of potential matches.
   4110 
   4111        Here is an example of a simple call to pcre_dfa_exec():
   4112 
   4113          int rc;
   4114          int ovector[10];
   4115          int wspace[20];
   4116          rc = pcre_dfa_exec(
   4117            re,             /* result of pcre_compile() */
   4118            NULL,           /* we didn't study the pattern */
   4119            "some string",  /* the subject string */
   4120            11,             /* the length of the subject string */
   4121            0,              /* start at offset 0 in the subject */
   4122            0,              /* default options */
   4123            ovector,        /* vector of integers for substring information */
   4124            10,             /* number of elements (NOT size in bytes) */
   4125            wspace,         /* working space vector */
   4126            20);            /* number of elements (NOT size in bytes) */
   4127 
   4128    Option bits for pcre_dfa_exec()
   4129 
   4130        The unused bits of the options argument  for  pcre_dfa_exec()  must  be
   4131        zero.  The  only  bits  that  may  be  set are PCRE_ANCHORED, PCRE_NEW-
   4132        LINE_xxx,        PCRE_NOTBOL,        PCRE_NOTEOL,        PCRE_NOTEMPTY,
   4133        PCRE_NOTEMPTY_ATSTART,       PCRE_NO_UTF8_CHECK,      PCRE_BSR_ANYCRLF,
   4134        PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD,  PCRE_PAR-
   4135        TIAL_SOFT,  PCRE_DFA_SHORTEST,  and PCRE_DFA_RESTART.  All but the last
   4136        four of these are  exactly  the  same  as  for  pcre_exec(),  so  their
   4137        description is not repeated here.
   4138 
   4139          PCRE_PARTIAL_HARD
   4140          PCRE_PARTIAL_SOFT
   4141 
   4142        These  have the same general effect as they do for pcre_exec(), but the
   4143        details are slightly  different.  When  PCRE_PARTIAL_HARD  is  set  for
   4144        pcre_dfa_exec(),  it  returns PCRE_ERROR_PARTIAL if the end of the sub-
   4145        ject is reached and there is still at least  one  matching  possibility
   4146        that requires additional characters. This happens even if some complete
   4147        matches have also been found. When PCRE_PARTIAL_SOFT is set, the return
   4148        code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end
   4149        of the subject is reached, there have been  no  complete  matches,  but
   4150        there  is  still  at least one matching possibility. The portion of the
   4151        string that was inspected when the longest partial match was  found  is
   4152        set  as  the  first  matching  string  in  both cases.  There is a more
   4153        detailed discussion of partial and multi-segment matching,  with  exam-
   4154        ples, in the pcrepartial documentation.
   4155 
   4156          PCRE_DFA_SHORTEST
   4157 
   4158        Setting  the  PCRE_DFA_SHORTEST option causes the matching algorithm to
   4159        stop as soon as it has found one match. Because of the way the alterna-
   4160        tive  algorithm  works, this is necessarily the shortest possible match
   4161        at the first possible matching point in the subject string.
   4162 
   4163          PCRE_DFA_RESTART
   4164 
   4165        When pcre_dfa_exec() returns a partial match, it is possible to call it
   4166        again,  with  additional  subject characters, and have it continue with
   4167        the same match. The PCRE_DFA_RESTART option requests this action;  when
   4168        it is set, the workspace and wscount arguments must refer to the same
   4169        vector as before, because data about the match so far is left in it
   4170        after a partial match. There is more discussion of this facility in the
   4171        pcrepartial documentation.
   4172 
   4173    Successful returns from pcre_dfa_exec()
   4174 
   4175        When pcre_dfa_exec() succeeds, it may have matched more than  one  sub-
   4176        string in the subject. Note, however, that all the matches from one run
   4177        of the function start at the same point in  the  subject.  The  shorter
   4178        matches  are all initial substrings of the longer matches. For example,
   4179        if the pattern
   4180 
   4181          <.*>
   4182 
   4183        is matched against the string
   4184 
   4185          This is <something> <something else> <something further> no more
   4186 
   4187        the three matched strings are
   4188 
   4189          <something>
   4190          <something> <something else>
   4191          <something> <something else> <something further>
   4192 
   4193        On success, the yield of the function is a number  greater  than  zero,
   4194        which  is  the  number of matched substrings. The substrings themselves
   4195        are returned in ovector. Each string uses two elements;  the  first  is
   4196        the  offset  to  the start, and the second is the offset to the end. In
   4197        fact, all the strings have the same start  offset.  (Space  could  have
   4198        been  saved by giving this only once, but it was decided to retain some
   4199        compatibility with the way pcre_exec() returns data,  even  though  the
   4200        meaning of the strings is different.)
   4201 
   4202        The strings are returned in reverse order of length; that is, the long-
   4203        est matching string is given first. If there were too many  matches  to
   4204        fit  into ovector, the yield of the function is zero, and the vector is
   4205        filled with the longest matches.  Unlike  pcre_exec(),  pcre_dfa_exec()
   4206        can use the entire ovector for returning matched strings.
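
               Reading the results back can be sketched as follows.  The  helper
               names dfa_match_count and print_dfa_matches are hypothetical; rc,
               ovector, and ovecsize stand for the return value  and  arguments
               of a previous pcre_dfa_exec() call.

```c
#include <stdio.h>

/* Hypothetical helper: number of (start, end) pairs that
   pcre_dfa_exec() left in ovector. A positive return value is the
   count itself; zero means that ovector was too small and has been
   filled with the longest matches, so every complete pair is in use.
   Negative values are errors and yield no strings. */
int dfa_match_count(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;
    if (rc == 0)
        return ovecsize / 2;
    return 0;
}

/* Print each matched string, longest first, one per line. */
void print_dfa_matches(const char *subject, const int *ovector,
                       int rc, int ovecsize)
{
    int i;
    int n = dfa_match_count(rc, ovecsize);

    for (i = 0; i < n; i++) {
        int start = ovector[2 * i];     /* same for every string */
        int end = ovector[2 * i + 1];
        printf("%d: %.*s\n", i, end - start, subject + start);
    }
}
```

               For the <.*> example above, rc is 3 and ovector  begins  8,  56,
               8, 36, 8, 19, so the three strings are printed longest first.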
   4207 
   4208        NOTE:  PCRE's  "auto-possessification"  optimization usually applies to
   4209        character repeats at the end of a pattern (as well as internally).  For
   4210        example,  the  pattern "a\d+" is compiled as if it were "a\d++" because
   4211        there is no point even considering the possibility of backtracking into
   4212        the  repeated digits. For DFA matching, this means that only one possi-
   4213        ble match is found. If you really do  want  multiple  matches  in  such
   4214        cases,   either   use   an   ungreedy   repeat  ("a\d+?")  or  set  the
   4215        PCRE_NO_AUTO_POSSESS option when compiling.
   4216 
   4217    Error returns from pcre_dfa_exec()
   4218 
   4219        The pcre_dfa_exec() function returns a negative number when  it  fails.
   4220        Many  of  the  errors  are  the  same as for pcre_exec(), and these are
   4221        described above.  There are in addition the following errors  that  are
   4222        specific to pcre_dfa_exec():
   4223 
   4224          PCRE_ERROR_DFA_UITEM      (-16)
   4225 
   4226        This  return is given if pcre_dfa_exec() encounters an item in the pat-
   4227        tern that it does not support, for instance, the use of \C  or  a  back
   4228        reference.
   4229 
   4230          PCRE_ERROR_DFA_UCOND      (-17)
   4231 
   4232        This  return  is  given  if pcre_dfa_exec() encounters a condition item
   4233        that uses a back reference for the condition, or a test  for  recursion
   4234        in a specific group. These are not supported.
   4235 
   4236          PCRE_ERROR_DFA_UMLIMIT    (-18)
   4237 
   4238        This  return  is given if pcre_dfa_exec() is called with an extra block
   4239        that contains a setting of  the  match_limit  or  match_limit_recursion
   4240        fields.  This  is  not  supported (these fields are meaningless for DFA
   4241        matching).
   4242 
   4243          PCRE_ERROR_DFA_WSSIZE     (-19)
   4244 
   4245        This return is given if  pcre_dfa_exec()  runs  out  of  space  in  the
   4246        workspace vector.
   4247 
   4248          PCRE_ERROR_DFA_RECURSE    (-20)
   4249 
   4250        When  a  recursive subpattern is processed, the matching function calls
   4251        itself recursively, using private vectors for  ovector  and  workspace.
   4252        This  error  is  given  if  the output vector is not large enough. This
   4253        should be extremely rare, as a vector of size 1000 is used.
   4254 
   4255          PCRE_ERROR_DFA_BADRESTART (-30)
   4256 
   4257        When pcre_dfa_exec() is called with the PCRE_DFA_RESTART  option,  some
   4258        plausibility  checks  are  made on the contents of the workspace, which
   4259        should contain data about the previous partial match. If any  of  these
   4260        checks fail, this error is given.
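
               For diagnostics, the codes in this section can be turned into
               names, as in the sketch below. The function name dfa_error_name
               is hypothetical. The numeric literals are the values listed above;
               a real program would use the PCRE_ERROR_xxx macros  from  pcre.h
               instead.

```c
/* Hypothetical helper: name of a pcre_dfa_exec()-specific error code.
   The literals are the values given in this section; with pcre.h
   available, use the PCRE_ERROR_xxx macros instead. */
const char *dfa_error_name(int rc)
{
    switch (rc) {
    case -16: return "PCRE_ERROR_DFA_UITEM";
    case -17: return "PCRE_ERROR_DFA_UCOND";
    case -18: return "PCRE_ERROR_DFA_UMLIMIT";
    case -19: return "PCRE_ERROR_DFA_WSSIZE";
    case -20: return "PCRE_ERROR_DFA_RECURSE";
    case -30: return "PCRE_ERROR_DFA_BADRESTART";
    default:  return "not a DFA-specific error";
    }
}
```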
   4261 
   4262 
   4263 SEE ALSO
   4264 
   4265        pcre16(3), pcre32(3), pcrebuild(3), pcrecallout(3), pcrecpp(3),
   4266        pcrematching(3), pcrepartial(3), pcreposix(3), pcreprecompile(3), pcre-
   4267        sample(3), pcrestack(3).
   4268 
   4269 
   4270 AUTHOR
   4271 
   4272        Philip Hazel
   4273        University Computing Service
   4274        Cambridge CB2 3QH, England.
   4275 
   4276 
   4277 REVISION
   4278 
   4279        Last updated: 09 February 2014
   4280        Copyright (c) 1997-2014 University of Cambridge.
   4281 ------------------------------------------------------------------------------
   4282 
   4283 
   4284 PCRECALLOUT(3)             Library Functions Manual             PCRECALLOUT(3)
   4285 
   4286 
   4287 
   4288 NAME
   4289        PCRE - Perl-compatible regular expressions
   4290 
   4291 SYNOPSIS
   4292 
   4293        #include <pcre.h>
   4294 
   4295        int (*pcre_callout)(pcre_callout_block *);
   4296 
   4297        int (*pcre16_callout)(pcre16_callout_block *);
   4298 
   4299        int (*pcre32_callout)(pcre32_callout_block *);
   4300 
   4301 
   4302 DESCRIPTION
   4303 
   4304        PCRE provides a feature called "callout", which is a means of temporar-
   4305        ily passing control to the caller of PCRE  in  the  middle  of  pattern
   4306        matching.  The  caller of PCRE provides an external function by putting
   4307        its entry point in the global variable pcre_callout (pcre16_callout for
   4308        the 16-bit library, pcre32_callout for the 32-bit library). By default,
   4309        this variable contains NULL, which disables all calling out.
   4310 
   4311        Within a regular expression, (?C) indicates the  points  at  which  the
   4312        external  function  is  to  be  called. Different callout points can be
   4313        identified by putting a number less than 256 after the  letter  C.  The
   4314        default  value  is  zero.   For  example,  this pattern has two callout
   4315        points:
   4316 
   4317          (?C1)abc(?C2)def
   4318 
   4319        If the PCRE_AUTO_CALLOUT option bit is set when a pattern is  compiled,
   4320        PCRE  automatically  inserts callouts, all with number 255, before each
   4321        item in the pattern. For example, if PCRE_AUTO_CALLOUT is used with the
   4322        pattern
   4323 
   4324          A(\d{2}|--)
   4325 
   4326        it is processed as if it were
   4327 
   4328        (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
   4329 
   4330        Notice  that  there  is a callout before and after each parenthesis and
   4331        alternation bar. If the pattern contains a conditional group whose con-
   4332        dition  is  an  assertion, an automatic callout is inserted immediately
   4333        before the condition. Such a callout may also be  inserted  explicitly,
   4334        for example:
   4335 
   4336          (?(?C9)(?=a)ab|de)
   4337 
   4338        This  applies only to assertion conditions (because they are themselves
   4339        independent groups).
   4340 
   4341        Automatic callouts can be used for tracking  the  progress  of  pattern
   4342        matching.   The pcretest program has a pattern qualifier (/C) that sets
   4343        automatic callouts; when it is used, the output indicates how the  pat-
   4344        tern  is  being matched. This is useful information when you are trying
   4345        to optimize the performance of a particular pattern.
   4346 
   4347 
   4348 MISSING CALLOUTS
   4349 
   4350        You should be aware that, because of optimizations in the way PCRE com-
   4351        piles and matches patterns, callouts sometimes do not happen exactly as
   4352        you might expect.
   4353 
   4354        At compile time, PCRE "auto-possessifies" repeated items when it  knows
   4355        that  what follows cannot be part of the repeat. For example, a+[bc] is
   4356        compiled as if it were a++[bc]. The pcretest output when  this  pattern
   4357        is  anchored  and  then  applied  with automatic callouts to the string
   4358        "aaaa" is:
   4359 
   4360          --->aaaa
   4361           +0 ^        ^
   4362           +1 ^        a+
   4363           +3 ^   ^    [bc]
   4364          No match
   4365 
   4366        This indicates that when matching [bc] fails, there is no  backtracking
   4367        into  a+  and  therefore the callouts that would be taken for the back-
   4368        tracks do not occur.  You can disable the  auto-possessify  feature  by
   4369        passing PCRE_NO_AUTO_POSSESS to pcre_compile(), or starting the pattern
   4370        with (*NO_AUTO_POSSESS). If this is done  in  pcretest  (using  the  /O
   4371        qualifier), the output changes to this:
   4372 
   4373          --->aaaa
   4374           +0 ^        ^
   4375           +1 ^        a+
   4376           +3 ^   ^    [bc]
   4377           +3 ^  ^     [bc]
   4378           +3 ^ ^      [bc]
   4379           +3 ^^       [bc]
   4380          No match
   4381 
   4382        This time, when matching [bc] fails, the matcher backtracks into a+ and
   4383        tries again, repeatedly, until a+ itself fails.
   4384 
   4385        Other optimizations that provide fast "no match"  results  also  affect
   4386        callouts.  For example, if the pattern is
   4387 
   4388          ab(?C4)cd
   4389 
   4390        PCRE knows that any matching string must contain the letter "d". If the
   4391        subject string is "abyz", the lack of "d" means that  matching  doesn't
   4392        ever  start,  and  the  callout is never reached. However, with "abyd",
   4393        though the result is still no match, the callout is obeyed.
   4394 
   4395        If the pattern is studied, PCRE knows the minimum length of a  matching
   4396        string,  and will immediately give a "no match" return without actually
   4397        running a match if the subject is not long enough, or,  for  unanchored
   4398        patterns, if it has been scanned far enough.
   4399 
   4400        You  can disable these optimizations by passing the PCRE_NO_START_OPTI-
   4401        MIZE option to the matching function, or by starting the  pattern  with
   4402        (*NO_START_OPT).  This slows down the matching process, but does ensure
   4403        that callouts such as the example above are obeyed.
   4404 
   4405 
   4406 THE CALLOUT INTERFACE
   4407 
   4408        During matching, when PCRE reaches a callout point, the external  func-
   4409        tion defined by pcre_callout or pcre[16|32]_callout is called (if it is
   4410        set). This applies to both normal and DFA matching. The  only  argument
   4411        to the callout function is a pointer  to  a  pcre_callout_block  or
   4412        pcre[16|32]_callout_block structure, which contains  the  following
   4413        fields:
   4414 
   4415          int           version;
   4416          int           callout_number;
   4417          int          *offset_vector;
   4418          const char   *subject;           (8-bit version)
   4419          PCRE_SPTR16   subject;           (16-bit version)
   4420          PCRE_SPTR32   subject;           (32-bit version)
   4421          int           subject_length;
   4422          int           start_match;
   4423          int           current_position;
   4424          int           capture_top;
   4425          int           capture_last;
   4426          void         *callout_data;
   4427          int           pattern_position;
   4428          int           next_item_length;
   4429          const unsigned char *mark;       (8-bit version)
   4430          const PCRE_UCHAR16  *mark;       (16-bit version)
   4431          const PCRE_UCHAR32  *mark;       (32-bit version)
   4432 
   4433        The  version  field  is an integer containing the version number of the
   4434        block format. The initial version was 0; the current version is 2.  The
   4435        version  number  will  change  again in future if additional fields are
   4436        added, but the intention is never to remove any of the existing fields.
   4437 
   4438        The callout_number field contains the number of the  callout,  as  com-
   4439        piled  into  the pattern (that is, the number after ?C for manual call-
   4440        outs, and 255 for automatically generated callouts).
   4441 
   4442        The offset_vector field is a pointer to the vector of offsets that  was
   4443        passed  by  the  caller  to  the matching function. When pcre_exec() or
   4444        pcre[16|32]_exec() is used, the contents can be inspected, in order  to
   4445        extract  substrings  that  have been matched so far, in the same way as
   4446        for extracting substrings after a match  has  completed.  For  the  DFA
   4447        matching functions, this field is not useful.
   4448 
   4449        The subject and subject_length fields contain copies of the values that
   4450        were passed to the matching function.
   4451 
   4452        The start_match field normally contains the offset within  the  subject
   4453        at  which  the  current  match  attempt started. However, if the escape
   4454        sequence \K has been encountered, this value is changed to reflect  the
   4455        modified  starting  point.  If the pattern is not anchored, the callout
   4456        function may be called several times from the same point in the pattern
   4457        for different starting points in the subject.
   4458 
   4459        The  current_position  field  contains the offset within the subject of
   4460        the current match pointer.
   4461 
   4462        When pcre_exec() or pcre[16|32]_exec() is used,  the  capture_top
   4463        field  contains  one  more than the number of the highest numbered cap-
   4464        tured substring so far. If no substrings have been captured, the  value
   4465        of  capture_top  is one. This is always the case when the DFA functions
   4466        are used, because they do not support captured substrings.
   4467 
   4468        The capture_last field contains the number of the  most  recently  cap-
   4469        tured  substring. However, when a recursion exits, the value reverts to
   4470        what it was outside the recursion, as do the  values  of  all  captured
   4471        substrings.  If  no  substrings  have  been captured, the value of cap-
   4472        ture_last is -1. This is always the case for  the  DFA  matching  func-
   4473        tions.
   4474 
   4475        The  callout_data  field  contains a value that is passed to a matching
   4476        function specifically so that it can be passed back in callouts. It  is
   4477        passed  in  the callout_data field of a pcre_extra or pcre[16|32]_extra
   4478        data structure. If no such data was passed, the value  of  callout_data
   4479        in  a  callout  block is NULL. There is a description of the pcre_extra
   4480        structure in the pcreapi documentation.
   4481 
   4482        The pattern_position field is present from version  1  of  the  callout
   4483        structure. It contains the offset to the next item to be matched in the
   4484        pattern string.
   4485 
   4486        The next_item_length field is present from version  1  of  the  callout
   4487        structure. It contains the length of the next item to be matched in the
   4488        pattern string. When the callout immediately  precedes  an  alternation
   4489        bar,  a  closing  parenthesis, or the end of the pattern, the length is
   4490        zero. When the callout precedes an opening parenthesis, the  length  is
   4491        that of the entire subpattern.
   4492 
   4493        The  pattern_position  and next_item_length fields are intended to help
   4494        in distinguishing between different automatic callouts, which all  have
   4495        the same callout number. However, they are set for all callouts.
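
               A callout function can use these two fields to recover the  text
               of the item that is about to be matched, as in the sketch  below.
               The helper name copy_next_item is hypothetical, and  the  sketch
               assumes that the application has kept its own copy of the pattern
               string, because the callout block does not contain one. The  two
               integer arguments stand for the pattern_position and
               next_item_length fields of the block.

```c
#include <string.h>

/* Hypothetical helper (not a PCRE function): copy the pattern item
   that a callout is about to match into 'buf'. 'pattern' is the
   original pattern string, kept by the application. The two integers
   are the pattern_position and next_item_length fields of the callout
   block (version 1 or later). The copy is truncated to fit 'buf'.
   Returns the number of characters copied. */
int copy_next_item(const char *pattern, int pattern_position,
                   int next_item_length, char *buf, int bufsize)
{
    int len = next_item_length;

    if (bufsize <= 0)
        return 0;
    if (len < 0)
        len = 0;
    if (len > bufsize - 1)
        len = bufsize - 1;

    memcpy(buf, pattern + pattern_position, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

               For the anchored pattern ^a+[bc], a position  of  3  and  a  next
               item length of 4 select the item [bc].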
   4496 
   4497        The  mark  field is present from version 2 of the callout structure. In
   4498        callouts from pcre_exec() or pcre[16|32]_exec() it contains  a  pointer
   4499        to  the  zero-terminated  name  of  the  most  recently passed (*MARK),
   4500        (*PRUNE), or (*THEN) item in the match, or NULL if no such  items  have
   4501        been  passed.  Instances  of  (*PRUNE) or (*THEN) without a name do not
   4502        obliterate a previous (*MARK). In callouts from the DFA matching  func-
   4503        tions this field always contains NULL.
   4504 
   4505 
   4506 RETURN VALUES
   4507 
   4508        The  external callout function returns an integer to PCRE. If the value
   4509        is zero, matching proceeds as normal. If  the  value  is  greater  than
   4510        zero,  matching  fails  at  the current point, but the testing of other
   4511        matching possibilities goes ahead, just as if a lookahead assertion had
   4512        failed. If the value is less than zero, the match is abandoned, and the
   4513        matching function returns the negative value.
   4514 
   4515        Negative  values  should  normally  be   chosen   from   the   set   of
   4516        PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
   4517        dard "no  match"  failure.   The  error  number  PCRE_ERROR_CALLOUT  is
   4518        reserved  for  use  by callout functions; it will never be used by PCRE
   4519        itself.
   4520 
   4521 
   4522 AUTHOR
   4523 
   4524        Philip Hazel
   4525        University Computing Service
   4526        Cambridge CB2 3QH, England.
   4527 
   4528 
   4529 REVISION
   4530 
   4531        Last updated: 12 November 2013
   4532        Copyright (c) 1997-2013 University of Cambridge.
   4533 ------------------------------------------------------------------------------
   4534 
   4535 
   4536 PCRECOMPAT(3)              Library Functions Manual              PCRECOMPAT(3)
   4537 
   4538 
   4539 
   4540 NAME
   4541        PCRE - Perl-compatible regular expressions
   4542 
   4543 DIFFERENCES BETWEEN PCRE AND PERL
   4544 
   4545        This  document describes the differences in the ways that PCRE and Perl
   4546        handle regular expressions. The differences  described  here  are  with
   4547        respect to Perl versions 5.10 and above.
   4548 
   4549        1. PCRE has only a subset of Perl's Unicode support. Details of what it
   4550        does have are given in the pcreunicode page.
   4551 
   4552        2. PCRE allows repeat quantifiers only on parenthesized assertions, but
   4553        they  do  not mean what you might think. For example, (?!a){3} does not
   4554        assert that the next three characters are not "a". It just asserts that
   4555        the next character is not "a" three times (in principle: PCRE optimizes
   4556        this to run the assertion just once). Perl allows repeat quantifiers on
   4557        other assertions such as \b, but these do not seem to have any use.
   4558 
   4559        3.  Capturing  subpatterns  that occur inside negative lookahead asser-
   4560        tions are counted, but their entries in the offsets  vector  are  never
   4561        set.  Perl sometimes (but not always) sets its numerical variables from
   4562        inside negative assertions.
   4563 
   4564        4. Though binary zero characters are supported in the  subject  string,
   4565        they are not allowed in a pattern string because it is passed as a nor-
   4566        mal C string, terminated by zero. The escape sequence \0 can be used in
   4567        the pattern to represent a binary zero.
   4568 
   4569        5.  The  following Perl escape sequences are not supported: \l, \u, \L,
   4570        \U, and \N when followed by a character name or Unicode value.  (\N  on
   4571        its own, matching a non-newline character, is supported.) In fact these
   4572        are implemented by Perl's general string-handling and are not  part  of
   4573        its  pattern  matching engine. If any of these are encountered by PCRE,
   4574        an error is generated by default. However, if the  PCRE_JAVASCRIPT_COM-
   4575        PAT  option  is set, \U and \u are interpreted as JavaScript interprets
   4576        them.
   4577 
   4578        6. The Perl escape sequences \p, \P, and \X are supported only if  PCRE
   4579        is  built  with Unicode character property support. The properties that
   4580        can be tested with \p and \P are limited to the general category  prop-
   4581        erties  such  as  Lu and Nd, script names such as Greek or Han, and the
   4582        derived properties Any and L&. PCRE does  support  the  Cs  (surrogate)
   4583        property,  which  Perl  does  not; the Perl documentation says "Because
   4584        Perl hides the need for the user to understand the internal representa-
   4585        tion  of Unicode characters, there is no need to implement the somewhat
   4586        messy concept of surrogates."
   4587 
   4588        7. PCRE does support the \Q...\E escape for quoting substrings. Charac-
   4589        ters  in  between  are  treated as literals. This is slightly different
   4590        from Perl in that $ and @ are  also  handled  as  literals  inside  the
   4591        quotes.  In Perl, they cause variable interpolation (but of course PCRE
   4592        does not have variables). Note the following examples:
   4593 
   4594            Pattern            PCRE matches      Perl matches
   4595 
   4596            \Qabc$xyz\E        abc$xyz           abc followed by the
   4597                                                   contents of $xyz
   4598            \Qabc\$xyz\E       abc\$xyz          abc\$xyz
   4599            \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
   4600 
   4601        The \Q...\E sequence is recognized both inside  and  outside  character
   4602        classes.
   4603 
   4604        8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
   4605        constructions. However, there is support for recursive  patterns.  This
   4606        is  not  available  in Perl 5.8, but it is in Perl 5.10. Also, the PCRE
   4607        "callout" feature allows an external function to be called during  pat-
   4608        tern matching. See the pcrecallout documentation for details.
   4609 
   4610        9.  Subpatterns  that  are called as subroutines (whether or not recur-
   4611        sively) are always treated as atomic  groups  in  PCRE.  This  is  like
   4612        Python, but unlike Perl. Captured values that are set outside  a  sub-
   4613        routine call can be referenced from inside in PCRE, but not  in  Perl.
   4614        There is a discussion that explains these differences in more detail in
   4615        the section on recursion differences from Perl in the pcrepattern page.
   4616 
   4617        10. If any of the backtracking control verbs are used in  a  subpattern
   4618        that  is  called  as  a  subroutine (whether or not recursively), their
   4619        effect is confined to that subpattern; it does not extend to  the  sur-
   4620        rounding  pattern.  This is not always the case in Perl. In particular,
   4621        if (*THEN) is present in a group that is called as  a  subroutine,  its
   4622        action is limited to that group, even if the group does not contain any
   4623        | characters. Note that such subpatterns are processed as  anchored  at
   4624        the point where they are tested.
   4625 
   4626        11.  If a pattern contains more than one backtracking control verb, the
   4627        first one that is backtracked onto acts. For example,  in  the  pattern
   4628        A(*COMMIT)B(*PRUNE)C  a  failure in B triggers (*COMMIT), but a failure
   4629        in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
   4630        it is the same as PCRE, but there are examples where it differs.
   4631 
   4632        12.  Most  backtracking  verbs in assertions have their normal actions.
   4633        They are not confined to the assertion.
   4634 
   4635        13. There are some differences that are concerned with the settings  of
   4636        captured  strings  when  part  of  a  pattern is repeated. For example,
   4637        matching "aba" against the  pattern  /^(a(b)?)+$/  in  Perl  leaves  $2
   4638        unset, but in PCRE it is set to "b".
   4639 
   4640        14.  PCRE's handling of duplicate subpattern numbers and duplicate sub-
   4641        pattern names is not as general as Perl's. This is a consequence of the
   4642        fact that PCRE works internally just with numbers, using  an  external
   4643        table to translate between numbers and names. In particular, a pattern
   4644        such as (?|(?<a>A)|(?<b>B)), where the two capturing parentheses  have
   4645        the same number but different names, is not supported,  and  causes  an
   4646        error  at compile time. If it were allowed, it would not be possible to
   4647        distinguish which parentheses matched, because both names map  to  cap-
   4648        turing subpattern number 1. To avoid this confusing situation, an error
   4649        is given at compile time.
   4650 
   4651        15. Perl recognizes comments in some places that  PCRE  does  not,  for
   4652        example,  between  the  ( and ? at the start of a subpattern. If the /x
   4653        modifier is set, Perl allows white space between ( and ?  (though  cur-
   4654        rent  Perls  warn that this is deprecated) but PCRE never does, even if
   4655        the PCRE_EXTENDED option is set.
   4656 
   4657        16. Perl, when in warning mode, gives warnings  for  character  classes
   4658        such  as  [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter-
   4659        als. PCRE has no warning features, so it gives an error in these  cases
   4660        because they are almost certainly user mistakes.
   4661 
   4662        17.  In  PCRE,  the upper/lower case character properties Lu and Ll are
   4663        not affected when case-independent matching is specified. For  example,
   4664        \p{Lu} always matches an upper case letter. I think Perl has changed in
   4665        this respect; in the release at the time of writing (5.16), \p{Lu}  and
   4666        \p{Ll} match all letters, regardless of case, when case independence is
   4667        specified.
   4668 
   4669        18. PCRE provides some extensions to the Perl regular expression facil-
   4670        ities.   Perl  5.10  includes new features that are not in earlier ver-
   4671        sions of Perl, some of which (such as named parentheses) have  been  in
   4672        PCRE for some time. This list is with respect to Perl 5.10:
   4673 
   4674        (a)  Although  lookbehind  assertions  in  PCRE must match fixed length
   4675        strings, each alternative branch of a lookbehind assertion can match  a
   4676        different  length  of  string.  Perl requires them all to have the same
   4677        length.
   4678 
   4679        (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the  $
   4680        meta-character matches only at the very end of the string.
   4681 
   4682        (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
   4683        cial meaning is faulted. Otherwise, like Perl, the backslash is quietly
   4684        ignored.  (Perl can be made to issue a warning.)
   4685 
   4686        (d)  If  PCRE_UNGREEDY is set, the greediness of the repetition quanti-
   4687        fiers is inverted, that is, by default they are not greedy, but if fol-
   4688        lowed by a question mark they are.
   4689 
   4690        (e) PCRE_ANCHORED can be used at matching time to force a pattern to be
   4691        tried only at the first matching position in the subject string.
   4692 
   4693        (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
   4694        and  PCRE_NO_AUTO_CAPTURE  options for pcre_exec() have no Perl equiva-
   4695        lents.
   4696 
   4697        (g) The \R escape sequence can be restricted to match only CR,  LF,  or
   4698        CRLF by the PCRE_BSR_ANYCRLF option.
   4699 
   4700        (h) The callout facility is PCRE-specific.
   4701 
   4702        (i) The partial matching facility is PCRE-specific.
   4703 
   4704        (j) Patterns compiled by PCRE can be saved and re-used at a later time,
   4705        even on different hosts that have the other endianness.  However,  this
   4706        does not apply to optimized data created by the just-in-time compiler.
   4707 
   4708        (k)    The    alternative    matching    functions    (pcre_dfa_exec(),
   4709        pcre16_dfa_exec() and pcre32_dfa_exec()) match in a different  way  and
   4710        are not Perl-compatible.
   4711 
   4712        (l)  PCRE  recognizes some special sequences such as (*CR) at the start
   4713        of a pattern that set overall options that cannot be changed within the
   4714        pattern.
   4715 
   4716 
   4717 AUTHOR
   4718 
   4719        Philip Hazel
   4720        University Computing Service
   4721        Cambridge CB2 3QH, England.
   4722 
   4723 
   4724 REVISION
   4725 
   4726        Last updated: 10 November 2013
   4727        Copyright (c) 1997-2013 University of Cambridge.
   4728 ------------------------------------------------------------------------------
   4729 
   4730 
   4731 PCREPATTERN(3)             Library Functions Manual             PCREPATTERN(3)
   4732 
   4733 
   4734 
   4735 NAME
   4736        PCRE - Perl-compatible regular expressions
   4737 
   4738 PCRE REGULAR EXPRESSION DETAILS
   4739 
   4740        The  syntax and semantics of the regular expressions that are supported
   4741        by PCRE are described in detail below. There is a quick-reference  syn-
   4742        tax summary in the pcresyntax page. PCRE tries to match Perl syntax and
   4743        semantics as closely as it can. PCRE  also  supports  some  alternative
   4744        regular  expression  syntax (which does not conflict with the Perl syn-
   4745        tax) in order to provide some compatibility with regular expressions in
   4746        Python, .NET, and Oniguruma.
   4747 
   4748        Perl's  regular expressions are described in its own documentation, and
   4749        regular expressions in general are covered in a number of  books,  some
   4750        of  which  have  copious  examples. Jeffrey Friedl's "Mastering Regular
   4751        Expressions", published by  O'Reilly,  covers  regular  expressions  in
   4752        great  detail.  This  description  of  PCRE's  regular  expressions  is
   4753        intended as reference material.
   4754 
   4755        This document discusses the patterns that are supported  by  PCRE  when
   4756        one  of  its  main  matching  functions,  pcre_exec()  (8-bit)   or
   4757        pcre[16|32]_exec() (16- or 32-bit), is used. PCRE also has  alternative
   4758        matching functions, pcre_dfa_exec() and pcre[16|32]_dfa_exec(),  which
   4759        match using a different algorithm that is not Perl-compatible. Some  of
   4760        the  features  discussed  below  are not available when DFA matching is
   4761        used. The advantages and disadvantages of  the  alternative  functions,
   4762        and  how  they  differ  from the normal functions, are discussed in the
   4763        pcrematching page.
   4764 
   4765 
   4766 SPECIAL START-OF-PATTERN ITEMS
   4767 
   4768        A number of options that can be passed to pcre_compile()  can  also  be
   4769        set by special items at the start of a pattern. These are not Perl-com-
   4770        patible, but are provided to make these options accessible  to  pattern
   4771        writers  who are not able to change the program that processes the pat-
   4772        tern. Any number of these items  may  appear,  but  they  must  all  be
   4773        together right at the start of the pattern string, and the letters must
   4774        be in upper case.
   4775 
   4776    UTF support
   4777 
   4778        The original operation of PCRE was on strings of  one-byte  characters.
   4779        However,  there  is  now also support for UTF-8 strings in the original
   4780        library, an extra library that supports  16-bit  and  UTF-16  character
   4781        strings,  and a third library that supports 32-bit and UTF-32 character
   4782        strings. To use these features, PCRE must be built to include appropri-
   4783        ate  support. When using UTF strings you must either call the compiling
   4784        function with the PCRE_UTF8, PCRE_UTF16, or PCRE_UTF32 option,  or  the
   4785        pattern must start with one of these special sequences:
   4786 
   4787          (*UTF8)
   4788          (*UTF16)
   4789          (*UTF32)
   4790          (*UTF)
   4791 
   4792        (*UTF)  is  a  generic  sequence  that  can  be  used  with  any of the
   4793        libraries.  Starting a pattern with such a sequence  is  equivalent  to
   4794        setting  the  relevant  option.  How setting a UTF mode affects pattern
   4795        matching is mentioned in several places below. There is also a  summary
   4796        of features in the pcreunicode page.
   4797 
   4798        Some applications that allow their users to supply patterns may wish to
   4799        restrict  them  to  non-UTF  data  for   security   reasons.   If   the
   4800        PCRE_NEVER_UTF  option  is  set  at  compile  time, (*UTF) etc. are not
   4801        allowed, and their appearance causes an error.
   4802 
   4803    Unicode property support
   4804 
   4805        Another special sequence that may appear at the start of a  pattern  is
   4806        (*UCP).   This  has  the same effect as setting the PCRE_UCP option: it
   4807        causes sequences such as \d and \w to use Unicode properties to  deter-
   4808        mine character types, instead of recognizing only characters with codes
   4809        less than 128 via a lookup table.
   4810 
   4811    Disabling auto-possessification
   4812 
   4813        If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect  as
   4814        setting  the  PCRE_NO_AUTO_POSSESS  option  at compile time. This stops
   4815        PCRE from making quantifiers possessive when what follows cannot  match
   4816        the  repeated item. For example, by default a+b is treated as a++b. For
   4817        more details, see the pcreapi documentation.
   4818 
   4819    Disabling start-up optimizations
   4820 
   4821        If a pattern starts with (*NO_START_OPT), it has  the  same  effect  as
   4822        setting the PCRE_NO_START_OPTIMIZE option either at compile or matching
   4823        time. This disables several  optimizations  for  quickly  reaching  "no
   4824        match" results. For more details, see the pcreapi documentation.
   4825 
   4826    Newline conventions
   4827 
   4828        PCRE  supports five different conventions for indicating line breaks in
   4829        strings: a single CR (carriage return) character, a  single  LF  (line-
   4830        feed) character, the two-character sequence CRLF, any of the three pre-
   4831        ceding, or any Unicode newline sequence. The pcreapi page  has  further
   4832        discussion  about newlines, and shows how to set the newline convention
   4833        in the options arguments for the compiling and matching functions.
   4834 
   4835        It is also possible to specify a newline convention by starting a  pat-
   4836        tern string with one of the following five sequences:
   4837 
   4838          (*CR)        carriage return
   4839          (*LF)        linefeed
   4840          (*CRLF)      carriage return, followed by linefeed
   4841          (*ANYCRLF)   any of the three above
   4842          (*ANY)       all Unicode newline sequences
   4843 
   4844        These override the default and the options given to the compiling func-
   4845        tion. For example, on a Unix system where LF  is  the  default  newline
   4846        sequence, the pattern
   4847 
   4848          (*CR)a.b
   4849 
   4850        changes the convention to CR. That pattern matches "a\nb" because LF is
   4851        no longer a newline. If more than one of these settings is present, the
   4852        last one is used.
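
               The "last one is used" rule can be illustrated with a small
               scanner over the leading items of a pattern. This is a sketch
               under stated assumptions, not PCRE's own parser: the function
               and table names are hypothetical, and only the items named in
               this section are recognised, so any other leading item simply
               ends the scan.

```c
#include <stddef.h>
#include <string.h>

/* Start-of-pattern items named in this section (an assumption of the
   sketch; the list may be incomplete). */
static const char *const newline_items[] = {
    "CR", "LF", "CRLF", "ANYCRLF", "ANY"
};
static const char *const other_items[] = {
    "UTF8", "UTF16", "UTF32", "UTF", "UCP", "NO_AUTO_POSSESS", "NO_START_OPT"
};

/* Returns the index of name[0..len) in list, or -1 if it is absent. */
static int find_item(const char *name, size_t len,
                     const char *const *list, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++) {
        if (strlen(list[i]) == len && memcmp(name, list[i], len) == 0)
            return (int)i;
    }
    return -1;
}

/* Returns the name of the newline item that takes effect ("CR", "LF", ...)
   or NULL if the pattern does not start with one. */
const char *pattern_newline_item(const char *pattern)
{
    const char *chosen = NULL;

    while (pattern[0] == '(' && pattern[1] == '*') {
        const char *name = pattern + 2;
        const char *close = strchr(name, ')');
        size_t len;
        int idx;

        if (close == NULL)
            break;
        len = (size_t)(close - name);
        idx = find_item(name, len, newline_items,
                        sizeof newline_items / sizeof newline_items[0]);
        if (idx >= 0) {
            chosen = newline_items[idx];     /* a later item overrides */
        } else if (find_item(name, len, other_items,
                             sizeof other_items / sizeof other_items[0]) < 0
                   && strncmp(name, "LIMIT_MATCH=", 12) != 0
                   && strncmp(name, "LIMIT_RECURSION=", 16) != 0) {
            break;                           /* ordinary pattern text */
        }
        pattern = close + 1;
    }
    return chosen;
}
```

               For the pattern (*CR)a.b the function returns "CR"; for
               (*LF)(*CRLF)x it returns "CRLF", because the later item wins.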
   4853 
   4854        The  newline  convention affects where the circumflex and dollar asser-
   4855        tions are true. It also affects the interpretation of the dot metachar-
   4856        acter when PCRE_DOTALL is not set, and the behaviour of \N. However, it
   4857        does not affect what the \R escape sequence matches. By  default,  this
   4858        is  any Unicode newline sequence, for Perl compatibility. However, this
   4859        can be changed; see the description of \R in the section entitled "New-
   4860        line  sequences"  below.  A change of \R setting can be combined with a
   4861        change of newline convention.
   4862 
   4863    Setting match and recursion limits
   4864 
   4865        The caller of pcre_exec() can set a limit on the number  of  times  the
   4866        internal  match() function is called and on the maximum depth of recur-
   4867        sive calls. These facilities are provided to catch runaway matches that
   4868        are provoked by patterns with huge matching trees (a typical example is
   4869        a pattern with nested unlimited repeats) and to avoid  running  out  of
   4870        system  stack  by  too  much  recursion.  When  one  of these limits is
   4871        reached, pcre_exec() gives an error return. The limits can also be  set
   4872        by items at the start of the pattern of the form
   4873 
   4874          (*LIMIT_MATCH=d)
   4875          (*LIMIT_RECURSION=d)
   4876 
   4877        where d is any number of decimal digits. However, the value of the set-
   4878        ting must be less than the value set (or defaulted) by  the  caller  of
   4879        pcre_exec()  for  it  to  have  any effect. In other words, the pattern
   4880        writer can lower the limits set by the programmer, but not raise  them.
   4881        If  there  is  more  than one setting of one of these limits, the lower
   4882        value is used.
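
               Put as arithmetic, the limit that applies is the smallest of
               the caller's value and every value given in the pattern. The
               helper below is a hypothetical illustration of that rule; its
               name and signature are not part of the PCRE API.

```c
#include <stddef.h>

/* Combines the limit set (or defaulted) by the caller with the values of
   any (*LIMIT_MATCH=d) or (*LIMIT_RECURSION=d) items found in the pattern.
   A pattern value can lower the limit but never raise it. */
unsigned long effective_limit(unsigned long caller_limit,
                              const unsigned long *pattern_limits,
                              size_t count)
{
    unsigned long limit = caller_limit;
    size_t i;

    for (i = 0; i < count; i++) {
        if (pattern_limits[i] < limit)
            limit = pattern_limits[i];
    }
    return limit;
}
```

               With a caller limit of 10000000, pattern settings of 500000 and
               2000 give an effective limit of 2000, whereas a single pattern
               setting of 20000000 has no effect.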
   4883 
   4884 
   4885 EBCDIC CHARACTER CODES
   4886 
   4887        PCRE can be compiled to run in an environment that uses EBCDIC  as  its
   4888        character code rather than ASCII or Unicode (typically a mainframe sys-
   4889        tem). In the sections below, character code values are  ASCII  or  Uni-
   4890        code; in an EBCDIC environment these characters may have different code
   4891        values, and there are no code points greater than 255.
   4892 
   4893 
   4894 CHARACTERS AND METACHARACTERS
   4895 
   4896        A regular expression is a pattern that is  matched  against  a  subject
   4897        string  from  left  to right. Most characters stand for themselves in a
   4898        pattern, and match the corresponding characters in the  subject.  As  a
   4899        trivial example, the pattern
   4900 
   4901          The quick brown fox
   4902 
   4903        matches a portion of a subject string that is identical to itself. When
   4904        caseless matching is specified (the PCRE_CASELESS option), letters  are
   4905        matched  independently  of case. In a UTF mode, PCRE always understands
   4906        the concept of case for characters whose values are less than  128,  so
   4907        caseless  matching  is always possible. For characters with higher val-
   4908        ues, the concept of case is supported if PCRE is compiled with  Unicode
   4909        property  support,  but  not  otherwise.   If  you want to use caseless
   4910        matching for characters 128 and above, you must  ensure  that  PCRE  is
   4911        compiled with Unicode property support as well as with UTF support.
   4912 
   4913        The  power  of  regular  expressions  comes from the ability to include
   4914        alternatives and repetitions in the pattern. These are encoded  in  the
   4915        pattern by the use of metacharacters, which do not stand for themselves
   4916        but instead are interpreted in some special way.
   4917 
   4918        There are two different sets of metacharacters: those that  are  recog-
   4919        nized  anywhere in the pattern except within square brackets, and those
   4920        that are recognized within square brackets.  Outside  square  brackets,
   4921        the metacharacters are as follows:
   4922 
   4923          \      general escape character with several uses
   4924          ^      assert start of string (or line, in multiline mode)
   4925          $      assert end of string (or line, in multiline mode)
   4926          .      match any character except newline (by default)
   4927          [      start character class definition
   4928          |      start of alternative branch
   4929          (      start subpattern
   4930          )      end subpattern
   4931          ?      extends the meaning of (
   4932                 also 0 or 1 quantifier
   4933                 also quantifier minimizer
   4934          *      0 or more quantifier
   4935          +      1 or more quantifier
   4936                 also "possessive quantifier"
   4937          {      start min/max quantifier
   4938 
   4939        Part  of  a  pattern  that is in square brackets is called a "character
   4940        class". In a character class the only metacharacters are:
   4941 
   4942          \      general escape character
   4943          ^      negate the class, but only if the first character
   4944          -      indicates character range
   4945          [      POSIX character class (only if followed by POSIX
   4946                   syntax)
   4947          ]      terminates the character class
   4948 
   4949        The following sections describe the use of each of the metacharacters.
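
               The two tables can be turned into a simple lookup. The function
               below is a hypothetical helper, not something PCRE provides. It
               reports table membership only and ignores context, such as the
               rule that ^ negates a class only when it is the first
               character.

```c
#include <string.h>

/* Returns non-zero if c appears in the relevant metacharacter table:
   the "outside square brackets" table when in_class is zero, otherwise
   the character class table. */
int is_metacharacter(char c, int in_class)
{
    static const char outside[] = "\\^$.[|()?*+{";
    static const char inside[]  = "\\^-[]";

    if (c == '\0')
        return 0;   /* strchr would otherwise match the terminator */
    return strchr(in_class ? inside : outside, c) != NULL;
}
```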
   4950 
   4951 
   4952 BACKSLASH
   4953 
   4954        The backslash character has several uses. Firstly, if it is followed by
   4955        a character that is not a number or a letter, it takes away any special
   4956        meaning that character may have. This use of  backslash  as  an  escape
   4957        character applies both inside and outside character classes.
   4958 
   4959        For  example,  if  you want to match a * character, you write \* in the
   4960        pattern.  This escaping action applies whether  or  not  the  following
   4961        character  would  otherwise be interpreted as a metacharacter, so it is
   4962        always safe to precede a non-alphanumeric  with  backslash  to  specify
   4963        that  it stands for itself. In particular, if you want to match a back-
   4964        slash, you write \\.
   4965 
   4966        In a UTF mode, only ASCII numbers and letters have any special  meaning
   4967        after  a  backslash.  All  other characters (in particular, those whose
   4968        codepoints are greater than 127) are treated as literals.
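
               Because a backslash before any non-alphanumeric character is
               always safe, a literal string can be made into a pattern by
               escaping every such character. The following is a minimal
               sketch with a hypothetical function name. It assumes an ASCII
               environment and leaves bytes above 127 untouched, so that
               multi-byte UTF-8 characters stay intact.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated pattern fragment that matches `text`
   literally, or NULL if allocation fails. The caller frees the result. */
char *escape_literal(const char *text)
{
    size_t len = strlen(text);
    char *out = malloc(2 * len + 1);   /* worst case: every byte escaped */
    char *p = out;
    size_t i;

    if (out == NULL)
        return NULL;
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)text[i];
        /* Letters and digits must not be escaped, because sequences such
           as \d or \1 have special meanings. */
        if (c < 128 && !isalnum(c))
            *p++ = '\\';
        *p++ = (char)c;
    }
    *p = '\0';
    return out;
}
```

               For example, the text a*b\c becomes a\*b\\c.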
   4969 
   4970        If a pattern is compiled with  the  PCRE_EXTENDED  option,  most  white
   4971        space  in the pattern (other than in a character class), and characters
   4972        between a # outside a character class and the next newline,  inclusive,
   4973        are ignored. An escaping backslash can be used to include a white space
   4974        or # character as part of the pattern.
   4975 
   4976        If you want to remove the special meaning from a  sequence  of  charac-
   4977        ters,  you can do so by putting them between \Q and \E. This is differ-
   4978        ent from Perl in that $ and  @  are  handled  as  literals  in  \Q...\E
   4979        sequences  in  PCRE, whereas in Perl, $ and @ cause variable interpola-
   4980        tion. Note the following examples:
   4981 
   4982          Pattern            PCRE matches   Perl matches
   4983 
   4984          \Qabc$xyz\E        abc$xyz        abc followed by the
   4985                                              contents of $xyz
   4986          \Qabc\$xyz\E       abc\$xyz       abc\$xyz
   4987          \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
   4988 
   4989        The \Q...\E sequence is recognized both inside  and  outside  character
   4990        classes.   An  isolated \E that is not preceded by \Q is ignored. If \Q
   4991        is not followed by \E later in the pattern, the literal  interpretation
   4992        continues  to  the  end  of  the pattern (that is, \E is assumed at the
   4993        end). If the isolated \Q is inside a character class,  this  causes  an
   4994        error, because the character class is not terminated.
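
               A program that wraps arbitrary text in \Q...\E has to deal with
               text that itself contains \E, since that would end the quoting
               early. One common approach, sketched here with a hypothetical
               function name, is to close the quote, emit the two characters
               as \\E, and reopen the quote.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated string of the form \Q...\E that matches `text`
   literally, or NULL if allocation fails. The caller frees the result. */
char *quote_with_qe(const char *text)
{
    size_t len = strlen(text);
    /* Each two-character "\E" in the input grows to seven characters, so
       4 * len covers the body; 5 more hold the \Q, the \E and the NUL. */
    char *out = malloc(4 * len + 5);
    char *p = out;
    size_t i;

    if (out == NULL)
        return NULL;
    memcpy(p, "\\Q", 2);
    p += 2;
    for (i = 0; i < len; i++) {
        if (text[i] == '\\' && text[i + 1] == 'E') {
            /* Close the quote, emit a literal backslash and E, reopen. */
            memcpy(p, "\\E\\\\E\\Q", 7);
            p += 7;
            i++;                      /* skip the E as well */
        } else {
            *p++ = text[i];
        }
    }
    memcpy(p, "\\E", 3);              /* copies the terminating NUL too */
    return out;
}
```

               The text abc$xyz becomes \Qabc$xyz\E, and the text a\Eb becomes
               \Qa\E\\E\Qb\E.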
   4995 
   4996    Non-printing characters
   4997 
   4998        A second use of backslash provides a way of encoding non-printing char-
   4999        acters in patterns in a visible manner. There is no restriction on  the
   5000        appearance  of non-printing characters, apart from the binary zero that
   5001        terminates a pattern, but when a pattern  is  being  prepared  by  text
   5002        editing,  it  is  often  easier  to  use  one  of  the following escape
   5003        sequences than the binary character it represents.  In an ASCII or Uni-
   5004        code environment, these escapes are as follows:
   5005 
   5006          \a        alarm, that is, the BEL character (hex 07)
   5007          \cx       "control-x", where x is any ASCII character
   5008          \e        escape (hex 1B)
   5009          \f        form feed (hex 0C)
   5010          \n        linefeed (hex 0A)
   5011          \r        carriage return (hex 0D)
   5012          \t        tab (hex 09)
   5013          \0dd      character with octal code 0dd
   5014          \ddd      character with octal code ddd, or back reference
   5015          \o{ddd..} character with octal code ddd..
   5016          \xhh      character with hex code hh
   5017          \x{hhh..} character with hex code hhh.. (non-JavaScript mode)
   5018          \uhhhh    character with hex code hhhh (JavaScript mode only)
   5019 
   5020        The  precise effect of \cx on ASCII characters is as follows: if x is a
   5021        lower case letter, it is converted to upper case. Then  bit  6  of  the
   5022        character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
   5023        (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and  \c;  becomes
   5024        hex  7B (; is 3B). If the data item (byte or 16-bit value) following \c
   5025        has a value greater than 127, a compile-time error occurs.  This  locks
   5026        out non-ASCII characters in all modes.
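
               The \cx rule for an ASCII environment amounts to one case
               conversion and one exclusive-or. The function below is a
               hypothetical illustration of the rule as described, not PCRE's
               own code, and it assumes that the host character set is ASCII.

```c
/* Returns the code value that \cx generates in an ASCII environment, or -1
   when x is not an ASCII character (a compile-time error in PCRE). */
int control_escape_value(int x)
{
    if (x < 0 || x > 127)
        return -1;
    if (x >= 'a' && x <= 'z')
        x = x - 'a' + 'A';      /* lower case is converted to upper case */
    return x ^ 0x40;            /* invert bit 6 (hex 40) */
}
```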
   5027 
   5028        When PCRE is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t gener-
   5029        ate the appropriate EBCDIC code values. The \c escape is  processed  as
   5030        specified for Perl in the perlebcdic document. The only characters that
   5031        are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^,  _,  or  ?.
   5032        Any other character provokes a compile-time error.  The  sequence  \c@
   5033        encodes character code 0; the letters (in either case)  encode  charac-
   5034        ters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters 27-31
   5035        (hex 1B to hex 1F), and \c? becomes either 255 (hex FF) or 95 (hex 5F).
   5036 
   5037        Thus, apart from \c?, these escapes generate the same  character  code
   5038        values  as  they do in an ASCII environment, though the meanings of the
   5039        values mostly differ. For example, \cG always generates code  value  7,
   5040        which is BEL in ASCII but DEL in EBCDIC.
   5041 
   5042        The  sequence  \c? generates DEL (127, hex 7F) in an ASCII environment,
   5043        but because 127 is not a control character in  EBCDIC,  Perl  makes  it
   5044        generate  the  APC character. Unfortunately, there are several variants
   5045        of EBCDIC. In most of them the APC character has  the  value  255  (hex
   5046        FF),  but  in  the one Perl calls POSIX-BC its value is 95 (hex 5F). If
   5047        certain other characters have POSIX-BC values, PCRE makes \c?  generate
   5048        95; otherwise it generates 255.
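
               The EBCDIC mapping for \c followed by a character can be
               written as a table lookup that does not depend on the host
               character set. The function is a hypothetical sketch. Its
               posix_bc flag stands in for PCRE's own detection of the
               POSIX-BC variant, which is not reproduced here.

```c
#include <string.h>

/* Returns the code value generated by \c followed by x in EBCDIC mode, or
   -1 for a character that is not allowed after \c. */
int ebcdic_control_value(char x, int posix_bc)
{
    static const char upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    static const char lower[] = "abcdefghijklmnopqrstuvwxyz";
    static const char punct[] = "[\\]^_";
    const char *p;

    if (x == '\0')
        return -1;                     /* strchr would match the NUL */
    if (x == '@')
        return 0;
    if (x == '?')
        return posix_bc ? 95 : 255;    /* the APC character */
    if ((p = strchr(upper, x)) != NULL)
        return (int)(p - upper) + 1;   /* letters encode 1-26 */
    if ((p = strchr(lower, x)) != NULL)
        return (int)(p - lower) + 1;
    if ((p = strchr(punct, x)) != NULL)
        return (int)(p - punct) + 27;  /* [ \ ] ^ _ encode 27-31 */
    return -1;
}
```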
   5049 
   5050        After  \0  up  to two further octal digits are read. If there are fewer
   5051        than two digits, just  those  that  are  present  are  used.  Thus  the
   5052        sequence \0\x\015 specifies two binary zeros followed by a CR character
   5053        (code value 13). Make sure you supply two digits after the initial zero
   5054        if the pattern character that follows is itself an octal digit.
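
               The reading of \0 and up to two further octal digits can be
               sketched as follows. The function name and interface are
               hypothetical and only illustrate the rule described above.

```c
#include <stddef.h>

/* Parses \0, \0d or \0dd at the start of s, where s points at the
   backslash. Stores the character code in *value and returns the number of
   pattern characters consumed, or 0 if s does not start with \0. */
size_t parse_zero_octal(const char *s, int *value)
{
    size_t n = 2;
    int v = 0;

    if (s[0] != '\\' || s[1] != '0')
        return 0;
    /* At most two further octal digits belong to the escape. */
    while (n < 4 && s[n] >= '0' && s[n] <= '7') {
        v = v * 8 + (s[n] - '0');
        n++;
    }
    *value = v;
    return n;
}
```

               Applied to \015 the function consumes four characters and
               yields 13. Applied to \0007 it also consumes four characters
               and yields 0, leaving the final 7 as a literal, which is why a
               binary zero followed by an octal digit needs the two extra
               zeros.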
   5055 
   5056        The  escape \o must be followed by a sequence of octal digits, enclosed
   5057        in braces. An error occurs if this is not the case. This  escape  is  a
   5058        recent addition to Perl; it provides a way of specifying character code
   5059        points as octal numbers greater than 0777, and  it  also  allows  octal
   5060        numbers and back references to be unambiguously specified.
   5061 
   5062        For greater clarity and unambiguity, it is best to avoid following \ by
   5063        a digit greater than zero. Instead, use \o{} or \x{} to specify charac-
   5064        ter  numbers,  and \g{} to specify back references. The following para-
   5065        graphs describe the old, ambiguous syntax.
   5066 
   5067        The handling of a backslash followed by a digit other than 0 is compli-
   5068        cated,  and  Perl  has changed in recent releases, causing PCRE also to
   5069        change. Outside a character class, PCRE reads the digit and any follow-
   5070        ing  digits  as  a  decimal number. If the number is less than 8, or if
   5071        there have been at least that many previous capturing left  parentheses
   5072        in  the expression, the entire sequence is taken as a back reference. A
   5073        description of how this works is given later, following the  discussion
   5074        of parenthesized subpatterns.
   5075 
   5076        Inside  a  character  class,  or  if  the decimal number following \ is
   5077        greater than 7 and there have not been that many capturing subpatterns,
   5078        PCRE  handles \8 and \9 as the literal characters "8" and "9", and oth-
   5079        erwise re-reads up to three octal digits following the backslash, using
   5080        them  to  generate  a  data character.  Any subsequent digits stand for
   5081        themselves. For example:
   5082 
   5083          \040   is another way of writing an ASCII space
   5084          \40    is the same, provided there are fewer than 40
   5085                    previous capturing subpatterns
   5086          \7     is always a back reference
   5087          \11    might be a back reference, or another way of
   5088                    writing a tab
   5089          \011   is always a tab
   5090          \0113  is a tab followed by the character "3"
   5091          \113   might be a back reference, otherwise the
   5092                    character with octal code 113
   5093          \377   might be a back reference, otherwise
   5094                    the value 255 (decimal)
   5095          \81    is either a back reference, or the two
   5096                    characters "8" and "1"
   5097 
   5098        Note that octal values of 100 or greater that are specified using  this
   5099        syntax  must  not be introduced by a leading zero, because no more than
   5100        three octal digits are ever read.
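
               The rules above can be sketched in Python as follows. This is an
               illustrative model only, not PCRE's implementation; the function
               name and its arguments are hypothetical. It takes the run of
               decimal digits that follows the backslash (first digit 1 to 9)
               and the number of capturing left parentheses seen so far.

```python
def interpret_backslash_digits(digits, captures_so_far, in_class=False):
    """Classify a digit run following a backslash (first digit is 1-9).

    Returns (kind, value, rest), where kind is "backref", "literal" or
    "octal" and rest holds trailing digits that stand for themselves.
    """
    number = int(digits)  # the whole run, read as a decimal number
    if not in_class and (number < 8 or number <= captures_so_far):
        return ("backref", number, "")
    if digits[0] in "89":
        # Backslash 8 and backslash 9 are the literal characters "8" and "9"
        return ("literal", digits[0], digits[1:])
    # Otherwise re-read up to three octal digits as a character code
    used = 0
    while used < min(3, len(digits)) and digits[used] in "01234567":
        used += 1
    return ("octal", int(digits[:used], 8), digits[used:])
```

               With no capturing groups, "40" is classified as octal 32 (a
               space) and "81" as the literal "8" followed by "1"; with eleven
               groups, "11" becomes a back reference.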
   5101 
   5102        By default, after \x that is not followed by {, from zero to two  hexa-
   5103        decimal  digits  are  read (letters can be in upper or lower case). Any
   5104        number of hexadecimal digits may appear between \x{ and }. If a charac-
   5105        ter  other  than  a  hexadecimal digit appears between \x{ and }, or if
   5106        there is no terminating }, an error occurs.
   5107 
   5108        If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation  of  \x
   5109        is  as  just described only when it is followed by two hexadecimal dig-
   5110        its.  Otherwise, it matches a  literal  "x"  character.  In  JavaScript
   5111        mode, support for code points greater than 255 is provided by \u, which
   5112        must be followed by four hexadecimal digits;  otherwise  it  matches  a
   5113        literal "u" character.
   5114 
   5115        Characters whose value is less than 256 can be defined by either of the
   5116        two syntaxes for \x (or by \u in JavaScript mode). There is no  differ-
   5117        ence in the way they are handled. For example, \xdc is exactly the same
   5118        as \x{dc} (or \u00dc in JavaScript mode).
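
               A minimal Python sketch of how the characters after \x are read
               when no brace follows, in the default mode and with
               PCRE_JAVASCRIPT_COMPAT. The function name and the
               javascript_compat parameter are hypothetical, and the \x{...}
               form is deliberately not modelled.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"

def read_x_escape(text, javascript_compat=False):
    # text is whatever follows the backslash-x in the pattern.
    # Returns (code_point, number_of_characters_consumed).
    used = 0
    while used < 2 and used < len(text) and text[used] in HEX_DIGITS:
        used += 1
    if javascript_compat and used < 2:
        # Without exactly two hex digits the escape is a literal "x"
        return (ord("x"), 0)
    # Default mode: zero to two digits; no digits at all gives value 0
    return (int(text[:used], 16) if used else 0, used)
```

               In the default mode the text "4g" yields the value 4 and
               consumes one character; in JavaScript mode the same text yields
               a literal "x" and consumes nothing.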
   5119 
   5120    Constraints on character values
   5121 
   5122        Characters that are specified using octal or  hexadecimal  numbers  are
   5123        limited to certain values, as follows:
   5124 
   5125          8-bit non-UTF mode    less than 0x100
   5126          8-bit UTF-8 mode      no greater than 0x10ffff and a valid codepoint
   5127          16-bit non-UTF mode   less than 0x10000
   5128          16-bit UTF-16 mode    no greater than 0x10ffff and a valid codepoint
   5129          32-bit non-UTF mode   less than 0x100000000
   5130          32-bit UTF-32 mode    no greater than 0x10ffff and a valid codepoint
   5131 
   5132        Invalid  Unicode  codepoints  are  the  range 0xd800 to 0xdfff (the so-
   5133        called "surrogate" codepoints), and 0xffef.
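
               The limits can be expressed as a small check. This Python sketch
               uses a hypothetical function name and makes two assumptions that
               should be read as such: U+10FFFF itself is treated as
               acceptable in the UTF modes, and only the surrogate range is
               rejected (the value 0xffef mentioned above is not modelled).

```python
def escape_value_allowed(value, width, utf):
    # width is the code unit size in bits: 8, 16 or 32
    if utf:
        is_surrogate = 0xD800 <= value <= 0xDFFF
        return value <= 0x10FFFF and not is_surrogate
    # Non-UTF modes: the value must fit in one code unit
    return value < (1 << width)
```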
   5134 
   5135    Escape sequences in character classes
   5136 
   5137        All the sequences that define a single character value can be used both
   5138        inside  and  outside character classes. In addition, inside a character
   5139        class, \b is interpreted as the backspace character (hex 08).
   5140 
   5141        \N is not allowed in a character class. \B, \R, and \X are not  special
   5142        inside  a  character  class.  Like other unrecognized escape sequences,
   5143        they are treated as  the  literal  characters  "B",  "R",  and  "X"  by
   5144        default,  but cause an error if the PCRE_EXTRA option is set. Outside a
   5145        character class, these sequences have different meanings.
   5146 
   5147    Unsupported escape sequences
   5148 
   5149        In Perl, the sequences \l, \L, \u, and \U are recognized by its  string
   5150        handler  and  used  to  modify  the  case  of  following characters. By
   5151        default, PCRE does not support these escape sequences. However, if  the
   5152        PCRE_JAVASCRIPT_COMPAT  option  is set, \U matches a "U" character, and
   5153        \u can be used to define a character by code point, as described in the
   5154        previous section.
   5155 
   5156    Absolute and relative back references
   5157 
   5158        The  sequence  \g followed by an unsigned or a negative number, option-
   5159        ally enclosed in braces, is an absolute or relative back  reference.  A
   5160        named back reference can be coded as \g{name}. Back references are dis-
   5161        cussed later, following the discussion of parenthesized subpatterns.
   5162 
   5163    Absolute and relative subroutine calls
   5164 
   5165        For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
   5166        name or a number enclosed either in angle brackets or single quotes, is
   5167        an alternative syntax for referencing a subpattern as  a  "subroutine".
   5168        Details  are  discussed  later.   Note  that  \g{...} (Perl syntax) and
   5169        \g<...> (Oniguruma syntax) are not synonymous. The  former  is  a  back
   5170        reference; the latter is a subroutine call.
   5171 
   5172    Generic character types
   5173 
   5174        Another use of backslash is for specifying generic character types:
   5175 
   5176          \d     any decimal digit
   5177          \D     any character that is not a decimal digit
   5178          \h     any horizontal white space character
   5179          \H     any character that is not a horizontal white space character
   5180          \s     any white space character
   5181          \S     any character that is not a white space character
   5182          \v     any vertical white space character
   5183          \V     any character that is not a vertical white space character
   5184          \w     any "word" character
   5185          \W     any "non-word" character
   5186 
   5187        There is also the single sequence \N, which matches a non-newline char-
   5188        acter.  This is the same as the "." metacharacter when  PCRE_DOTALL  is
   5189        not  set.  Perl also uses \N to match characters by name; PCRE does not
   5190        support this.
   5191 
   5192        Each pair of lower and upper case escape sequences partitions the  com-
   5193        plete  set  of  characters  into two disjoint sets. Any given character
   5194        matches one, and only one, of each pair. The sequences can appear  both
   5195        inside  and outside character classes. They each match one character of
   5196        the appropriate type. If the current matching point is at  the  end  of
   5197        the  subject string, all of them fail, because there is no character to
   5198        match.
   5199 
   5200        For compatibility with Perl, \s formerly did not match the VT character
   5201        (code  11),  which  made  it  different from the POSIX "space" class.
   5202        However, Perl added VT at release  5.18,  and  PCRE  followed  suit  at
   5203        release  8.34.  The  default  \s characters are now HT (9), LF (10), VT
   5204        (11), FF (12), CR (13), and space (32),  which  are  defined  as  white
   5205        space in the "C" locale. This list may vary if locale-specific matching
   5206        is taking place. For example, in some locales the "non-breaking  space"
   5207        character  (\xA0)  is  recognized  as white space, and in others the VT
   5208        character is not.
   5209 
   5210        A "word" character is an underscore or any character that is  a  letter
   5211        or  digit.   By  default,  the definition of letters and digits is con-
   5212        trolled by PCRE's low-valued character tables, and may vary if  locale-
   5213        specific  matching is taking place (see "Locale support" in the pcreapi
   5214        page). For example, in a French locale such  as  "fr_FR"  in  Unix-like
   5215        systems,  or "french" in Windows, some character codes greater than 127
   5216        are used for accented letters, and these are then matched  by  \w.  The
   5217        use of locales with Unicode is discouraged.
   5218 
   5219        By  default,  characters  whose  code points are greater than 127 never
   5220        match \d, \s, or \w, and always match \D, \S, and \W, although this may
   5221        vary  for characters in the range 128-255 when locale-specific matching
   5222        is happening.  These escape sequences retain  their  original  meanings
   5223        from  before  Unicode support was available, mainly for efficiency rea-
   5224        sons. If PCRE is  compiled  with  Unicode  property  support,  and  the
   5225        PCRE_UCP  option is set, the behaviour is changed so that Unicode prop-
   5226        erties are used to determine character types, as follows:
   5227 
   5228          \d  any character that matches \p{Nd} (decimal digit)
   5229          \s  any character that matches \p{Z} or \h or \v
   5230          \w  any character that matches \p{L} or \p{N}, plus underscore
   5231 
   5232        The upper case escapes match the inverse sets of characters. Note  that
   5233        \d  matches  only decimal digits, whereas \w matches any Unicode digit,
   5234        as well as any Unicode letter, and underscore. Note also that  PCRE_UCP
   5235        affects  \b  and  \B  because  they are defined in terms of \w and \W.
   5236        Matching these sequences is noticeably slower when PCRE_UCP is set.
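
               The difference between the default and PCRE_UCP behaviour of \d
               and \w can be modelled in Python with the standard unicodedata
               module. This is a sketch under assumptions: the function names
               are hypothetical, the default case assumes the "C" locale, and
               Python's Unicode tables may be from a different Unicode version
               than those built into PCRE.

```python
import unicodedata

def matches_d(ch, ucp=False):
    if ucp:
        return unicodedata.category(ch) == "Nd"   # decimal digits only
    return ch in "0123456789"

def matches_w(ch, ucp=False):
    if ch == "_":
        return True
    if ucp:
        # Any letter (L*) or any number (N*), not just decimal digits
        return unicodedata.category(ch)[0] in "LN"
    return ch.isascii() and ch.isalnum()
```

               Superscript two (U+00B2, category No) illustrates the note
               above: under the UCP rules it satisfies matches_w but not
               matches_d.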
   5237 
   5238        The sequences \h, \H, \v, and \V are features that were added  to  Perl
   5239        at  release  5.10. In contrast to the other sequences, which match only
   5240        ASCII characters by default, these  always  match  certain  high-valued
   5241        code points, whether or not PCRE_UCP is set. The horizontal space char-
   5242        acters are:
   5243 
   5244          U+0009     Horizontal tab (HT)
   5245          U+0020     Space
   5246          U+00A0     Non-break space
   5247          U+1680     Ogham space mark
   5248          U+180E     Mongolian vowel separator
   5249          U+2000     En quad
   5250          U+2001     Em quad
   5251          U+2002     En space
   5252          U+2003     Em space
   5253          U+2004     Three-per-em space
   5254          U+2005     Four-per-em space
   5255          U+2006     Six-per-em space
   5256          U+2007     Figure space
   5257          U+2008     Punctuation space
   5258          U+2009     Thin space
   5259          U+200A     Hair space
   5260          U+202F     Narrow no-break space
   5261          U+205F     Medium mathematical space
   5262          U+3000     Ideographic space
   5263 
   5264        The vertical space characters are:
   5265 
   5266          U+000A     Linefeed (LF)
   5267          U+000B     Vertical tab (VT)
   5268          U+000C     Form feed (FF)
   5269          U+000D     Carriage return (CR)
   5270          U+0085     Next line (NEL)
   5271          U+2028     Line separator
   5272          U+2029     Paragraph separator
   5273 
   5274        In 8-bit, non-UTF-8 mode, only the characters with codepoints less than
   5275        256 are relevant.
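
               The two lists translate directly into lookup sets. The Python
               names below are hypothetical; the code points are exactly those
               listed above.

```python
# Code points matched by \h (horizontal white space)
HSPACE = {0x0009, 0x0020, 0x00A0, 0x1680, 0x180E,
          *range(0x2000, 0x200B),          # U+2000 to U+200A inclusive
          0x202F, 0x205F, 0x3000}

# Code points matched by \v (vertical white space)
VSPACE = {0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029}

def is_hspace(cp, eight_bit_non_utf=False):
    # In 8-bit non-UTF-8 mode only code points below 256 can occur
    if eight_bit_non_utf and cp > 0xFF:
        return False
    return cp in HSPACE

def is_vspace(cp, eight_bit_non_utf=False):
    if eight_bit_non_utf and cp > 0xFF:
        return False
    return cp in VSPACE
```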
   5276 
   5277    Newline sequences
   5278 
   5279        Outside  a  character class, by default, the escape sequence \R matches
   5280        any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is  equivalent
   5281        to the following:
   5282 
   5283          (?>\r\n|\n|\x0b|\f|\r|\x85)
   5284 
   5285        This  is  an  example  of an "atomic group", details of which are given
   5286        below.  This particular group matches either the two-character sequence
   5287        CR  followed  by  LF,  or  one  of  the single characters LF (linefeed,
   5288        U+000A), VT (vertical tab, U+000B), FF (form feed,  U+000C),  CR  (car-
   5289        riage  return,  U+000D),  or NEL (next line, U+0085). The two-character
   5290        sequence is treated as a single unit that cannot be split.
   5291 
   5292        In other modes, two additional characters whose codepoints are  greater
   5293        than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
   5294        rator, U+2029).  Unicode character property support is not  needed  for
   5295        these characters to be recognized.
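
               The default behaviour of \R can be modelled without a regex
               engine. In this Python sketch the function name and the wide
               parameter are hypothetical; wide=False stands for 8-bit
               non-UTF-8 mode and wide=True for the other modes.

```python
def match_newline_sequence(subject, pos, wide=True):
    # Returns the number of characters a newline sequence occupies at pos,
    # or 0 if there is none.
    if subject.startswith("\r\n", pos):
        return 2                      # CRLF is taken as one indivisible unit
    singles = "\n\x0b\x0c\r\x85"      # LF, VT, FF, CR, NEL
    if wide:
        singles += "\u2028\u2029"     # LS and PS
    if pos < len(subject) and subject[pos] in singles:
        return 1
    return 0
```

               Testing CRLF first mirrors the ordering of the alternatives in
               the atomic group: once the pair has been taken, the match is
               never retried with CR alone.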
   5296 
   5297        It is possible to restrict \R to match only CR, LF, or CRLF (instead of
   5298        the complete set  of  Unicode  line  endings)  by  setting  the  option
   5299        PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
   5300        (BSR is an abbreviation for "backslash R".) This can be made the default
   5301        when  PCRE  is  built;  if this is the case, the other behaviour can be
   5302        requested via the PCRE_BSR_UNICODE option.   It  is  also  possible  to
   5303        specify  these  settings  by  starting a pattern string with one of the
   5304        following sequences:
   5305 
   5306          (*BSR_ANYCRLF)   CR, LF, or CRLF only
   5307          (*BSR_UNICODE)   any Unicode newline sequence
   5308 
   5309        These override the default and the options given to the compiling func-
   5310        tion,  but  they  can  themselves  be  overridden by options given to a
   5311        matching function. Note that these  special  settings,  which  are  not
   5312        Perl-compatible,  are  recognized  only at the very start of a pattern,
   5313        and that they must be in upper case.  If  more  than  one  of  them  is
   5314        present,  the  last  one is used. They can be combined with a change of
   5315        newline convention; for example, a pattern can start with:
   5316 
   5317          (*ANY)(*BSR_ANYCRLF)
   5318 
   5319        They can also be combined with the (*UTF8), (*UTF16), (*UTF32),  (*UTF)
   5320        or (*UCP) special sequences. Inside a character class, \R is treated as
   5321        an unrecognized escape sequence, and  so  matches  the  letter  "R"  by
   5322        default, but causes an error if PCRE_EXTRA is set.
   5323 
   5324    Unicode character properties
   5325 
   5326        When PCRE is built with Unicode character property support, three addi-
   5327        tional escape sequences that match characters with specific  properties
   5328        are  available.   When  in 8-bit non-UTF-8 mode, these sequences are of
   5329        course limited to testing characters whose  codepoints  are  less  than
   5330        256, but they do work in this mode.  The extra escape sequences are:
   5331 
   5332          \p{xx}   a character with the xx property
   5333          \P{xx}   a character without the xx property
   5334          \X       a Unicode extended grapheme cluster
   5335 
   5336        The  property  names represented by xx above are limited to the Unicode
   5337        script names, the general category properties, "Any", which matches any
   5338        character   (including  newline),  and  some  special  PCRE  properties
   5339        (described in the next section).  Other Perl properties such as  "InMu-
   5340        sicalSymbols"  are  not  currently supported by PCRE. Note that \P{Any}
   5341        does not match any characters, so always causes a match failure.
   5342 
   5343        Sets of Unicode characters are defined as belonging to certain scripts.
   5344        A  character from one of these sets can be matched using a script name.
   5345        For example:
   5346 
   5347          \p{Greek}
   5348          \P{Han}
   5349 
   5350        Those that are not part of an identified script are lumped together  as
   5351        "Common". The current list of scripts is:
   5352 
   5353        Arabic,  Armenian, Avestan, Balinese, Bamum, Bassa_Vah, Batak, Bengali,
   5354        Bopomofo, Brahmi, Braille, Buginese, Buhid,  Canadian_Aboriginal,  Car-
   5355        ian, Caucasian_Albanian, Chakma, Cham, Cherokee, Common, Coptic, Cunei-
   5356        form, Cypriot, Cyrillic, Deseret, Devanagari, Duployan, Egyptian_Hiero-
   5357        glyphs,  Elbasan,  Ethiopic,  Georgian,  Glagolitic,  Gothic,  Grantha,
   5358        Greek, Gujarati, Gurmukhi,  Han,  Hangul,  Hanunoo,  Hebrew,  Hiragana,
   5359        Imperial_Aramaic,     Inherited,     Inscriptional_Pahlavi,    Inscrip-
   5360        tional_Parthian,  Javanese,  Kaithi,   Kannada,   Katakana,   Kayah_Li,
   5361        Kharoshthi,  Khmer,  Khojki, Khudawadi, Lao, Latin, Lepcha, Limbu, Lin-
   5362        ear_A, Linear_B, Lisu, Lycian, Lydian,  Mahajani,  Malayalam,  Mandaic,
   5363        Manichaean,      Meetei_Mayek,     Mende_Kikakui,     Meroitic_Cursive,
   5364        Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro,  Myanmar,  Nabataean,
   5365        New_Tai_Lue,   Nko,  Ogham,  Ol_Chiki,  Old_Italic,  Old_North_Arabian,
   5366        Old_Permic, Old_Persian, Old_South_Arabian, Old_Turkic, Oriya, Osmanya,
   5367        Pahawh_Hmong,    Palmyrene,    Pau_Cin_Hau,    Phags_Pa,    Phoenician,
   5368        Psalter_Pahlavi, Rejang, Runic, Samaritan,  Saurashtra,  Sharada,  Sha-
   5369        vian,  Siddham, Sinhala, Sora_Sompeng, Sundanese, Syloti_Nagri, Syriac,
   5370        Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet,  Takri,  Tamil,  Telugu,
   5371        Thaana,  Thai,  Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi,
   5372        Yi.
   5373 
   5374        Each character has exactly one Unicode general category property, spec-
   5375        ified  by a two-letter abbreviation. For compatibility with Perl, nega-
   5376        tion can be specified by including a  circumflex  between  the  opening
   5377        brace  and  the  property  name.  For  example,  \p{^Lu} is the same as
   5378        \P{Lu}.
   5379 
   5380        If only one letter is specified with \p or \P, it includes all the gen-
   5381        eral  category properties that start with that letter. In this case, in
   5382        the absence of negation, the curly brackets in the escape sequence  are
   5383        optional; these two examples have the same effect:
   5384 
   5385          \p{L}
   5386          \pL
   5387 
   5388        The following general category property codes are supported:
   5389 
   5390          C     Other
   5391          Cc    Control
   5392          Cf    Format
   5393          Cn    Unassigned
   5394          Co    Private use
   5395          Cs    Surrogate
   5396 
   5397          L     Letter
   5398          Ll    Lower case letter
   5399          Lm    Modifier letter
   5400          Lo    Other letter
   5401          Lt    Title case letter
   5402          Lu    Upper case letter
   5403 
   5404          M     Mark
   5405          Mc    Spacing mark
   5406          Me    Enclosing mark
   5407          Mn    Non-spacing mark
   5408 
   5409          N     Number
   5410          Nd    Decimal number
   5411          Nl    Letter number
   5412          No    Other number
   5413 
   5414          P     Punctuation
   5415          Pc    Connector punctuation
   5416          Pd    Dash punctuation
   5417          Pe    Close punctuation
   5418          Pf    Final punctuation
   5419          Pi    Initial punctuation
   5420          Po    Other punctuation
   5421          Ps    Open punctuation
   5422 
   5423          S     Symbol
   5424          Sc    Currency symbol
   5425          Sk    Modifier symbol
   5426          Sm    Mathematical symbol
   5427          So    Other symbol
   5428 
   5429          Z     Separator
   5430          Zl    Line separator
   5431          Zp    Paragraph separator
   5432          Zs    Space separator
   5433 
   5434        The  special property L& is also supported: it matches a character that
   5435        has the Lu, Ll, or Lt property, in other words, a letter  that  is  not
   5436        classified as a modifier or "other".
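
               The way one-letter codes, two-letter codes, and L& relate to
               one another can be shown with Python's unicodedata module,
               which reports the two-letter general category of a character.
               This is only a sketch: has_property is a hypothetical helper,
               script names are not handled, and Python's Unicode tables may
               differ in version from PCRE's.

```python
import unicodedata

def has_property(ch, name):
    category = unicodedata.category(ch)   # always two letters, e.g. "Lu"
    if name == "Any":
        return True
    if name == "L&":
        return category in ("Lu", "Ll", "Lt")
    if len(name) == 1:
        # A single letter covers every category that starts with it
        return category[0] == name
    return category == name
```

               For example, the modifier letter U+02B0 has the L property but
               not L&, because its category is Lm.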
   5437 
   5438        The  Cs  (Surrogate)  property  applies only to characters in the range
   5439        U+D800 to U+DFFF. Such characters are not valid in Unicode strings  and
   5440        so  cannot  be  tested  by  PCRE, unless UTF validity checking has been
   5441        turned    off    (see    the    discussion    of    PCRE_NO_UTF8_CHECK,
   5442        PCRE_NO_UTF16_CHECK  and PCRE_NO_UTF32_CHECK in the pcreapi page). Perl
   5443        does not support the Cs property.
   5444 
   5445        The long synonyms for  property  names  that  Perl  supports  (such  as
   5446        \p{Letter})  are  not  supported by PCRE, nor is it permitted to prefix
   5447        any of these properties with "Is".
   5448 
   5449        No character that is in the Unicode table has the Cn (unassigned) prop-
   5450        erty.  Instead, this property is assumed for any code point that is not
   5451        in the Unicode table.
   5452 
   5453        Specifying caseless matching does not affect  these  escape  sequences.
   5454        For  example,  \p{Lu}  always  matches only upper case letters. This is
   5455        different from the behaviour of current versions of Perl.
   5456 
   5457        Matching characters by Unicode property is not fast, because  PCRE  has
   5458        to  do  a  multistage table lookup in order to find a character's prop-
   5459        erty. That is why the traditional escape sequences such as \d and \w do
   5460        not use Unicode properties in PCRE by default, though you can make them
   5461        do so by setting the PCRE_UCP option or by starting  the  pattern  with
   5462        (*UCP).
   5463 
   5464    Extended grapheme clusters
   5465 
   5466        The  \X  escape  matches  any number of Unicode characters that form an
   5467        "extended grapheme cluster", and treats the sequence as an atomic group
   5468        (see  below).   Up  to and including release 8.31, PCRE matched an ear-
   5469        lier, simpler definition that was equivalent to
   5470 
   5471          (?>\PM\pM*)
   5472 
   5473        That is, it matched a character without the "mark"  property,  followed
   5474        by  zero  or  more characters with the "mark" property. Characters with
   5475        the "mark" property are typically non-spacing accents that  affect  the
   5476        preceding character.
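
               The earlier, simpler definition is easy to model. The Python
               sketch below (legacy_clusters is a hypothetical name) splits a
               string the way repeated matches of that pattern would; it does
               not implement the current extended grapheme cluster rules.

```python
import unicodedata

def legacy_clusters(text):
    clusters = []
    for ch in text:
        if not unicodedata.category(ch).startswith("M"):
            clusters.append(ch)        # a non-mark starts a new cluster
        elif clusters:
            clusters[-1] += ch         # marks extend the current cluster
        # A mark with no preceding base character matches nothing
    return clusters
```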
   5477 
   5478        This  simple definition was extended in Unicode to include more compli-
   5479        cated kinds of composite character by giving each character a  grapheme
   5480        breaking  property,  and  creating  rules  that use these properties to
   5481        define the boundaries of extended grapheme  clusters.  In  releases  of
   5482        PCRE later than 8.31, \X matches one of these clusters.
   5483 
   5484        \X  always  matches  at least one character. Then it decides whether to
   5485        add additional characters according to the following rules for ending a
   5486        cluster:
   5487 
   5488        1. End at the end of the subject string.
   5489 
   5490        2.  Do not end between CR and LF; otherwise end after any control char-
   5491        acter.
   5492 
   5493        3. Do not break Hangul (a Korean  script)  syllable  sequences.  Hangul
   5494        characters  are of five types: L, V, T, LV, and LVT. An L character may
   5495        be followed by an L, V, LV, or LVT character; an LV or V character  may
   5496        be followed by a V or T character; an LVT or T character may be followed
   5497        only by a T character.
   5498 
   5499        4. Do not end before extending characters or spacing marks.  Characters
   5500        with  the  "mark"  property  always have the "extend" grapheme breaking
   5501        property.
   5502 
   5503        5. Do not end after prepend characters.
   5504 
   5505        6. Otherwise, end the cluster.
   5506 
   5507    PCRE's additional properties
   5508 
   5509        As well as the standard Unicode properties described above,  PCRE  sup-
   5510        ports  four  more  that  make it possible to convert traditional escape
   5511        sequences such as \w and \s to use Unicode properties. PCRE uses  these
   5512        non-standard, non-Perl properties internally when PCRE_UCP is set. How-
   5513        ever, they may also be used explicitly. These properties are:
   5514 
   5515          Xan   Any alphanumeric character
   5516          Xps   Any POSIX space character
   5517          Xsp   Any Perl space character
   5518          Xwd   Any Perl "word" character
   5519 
   5520        Xan matches characters that have either the L (letter) or the  N  (num-
   5521        ber)  property. Xps matches the characters tab, linefeed, vertical tab,
   5522        form feed, or carriage return, and any other character that has  the  Z
   5523        (separator)  property.  Xsp is the same as Xps; it used to exclude ver-
   5524        tical tab, for Perl compatibility, but Perl changed, and so  PCRE  fol-
   5525        lowed  at  release  8.34.  Xwd matches the same characters as Xan, plus
   5526        underscore.
   5527 
   5528        There is another non-standard property, Xuc, which matches any  charac-
   5529        ter  that  can  be represented by a Universal Character Name in C++ and
   5530        other programming languages. These are the characters $,  @,  `  (grave
   5531        accent),  and  all  characters with Unicode code points greater than or
   5532        equal to U+00A0, except for the surrogates U+D800 to U+DFFF. Note  that
   5533        most  base  (ASCII) characters are excluded. (Universal Character Names
   5534        are of the form \uHHHH or \UHHHHHHHH where H is  a  hexadecimal  digit.
   5535        Note that the Xuc property does not match these sequences but the char-
   5536        acters that they represent.)
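
               Two of these properties are simple enough to state as
               predicates. The Python functions below are hypothetical
               helpers written from the descriptions above, not taken from
               PCRE; xps relies on Python's unicodedata tables for the Z
               property.

```python
import unicodedata

def xps(ch):
    # Tab, LF, VT, FF, CR, or any character with the Z (separator) property
    return ch in "\t\n\x0b\x0c\r" or unicodedata.category(ch).startswith("Z")

def xuc(ch):
    # $, @, grave accent, or any code point from U+00A0 up, minus surrogates
    if ch in "$@`":
        return True
    cp = ord(ch)
    return cp >= 0xA0 and not 0xD800 <= cp <= 0xDFFF
```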
   5537 
   5538    Resetting the match start
   5539 
   5540        The escape sequence \K causes any previously matched characters not  to
   5541        be included in the final matched sequence. For example, the pattern:
   5542 
   5543          foo\Kbar
   5544 
   5545        matches  "foobar",  but reports that it has matched "bar". This feature
   5546        is similar to a lookbehind assertion (described  below).   However,  in
   5547        this  case, the part of the subject before the real match does not have
   5548        to be of fixed length, as lookbehind assertions do. The use of \K  does
   5549        not  interfere  with  the setting of captured substrings.  For example,
   5550        when the pattern
   5551 
   5552          (foo)\Kbar
   5553 
   5554        matches "foobar", the first substring is still set to "foo".
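
               Python's re module has no \K, so the following sketch only
               emulates its effect: the whole pattern must match, but the
               reported match is the part after the reset point. The helper
               name and the "kept" group name are hypothetical. It also shows
               the point made above that the discarded prefix need not be of
               fixed length.

```python
import re

def search_with_reset(prefix, kept, subject):
    # Emulates prefix, then a reset of the match start, then kept
    pattern = "(?:%s)(?P<kept>%s)" % (prefix, kept)
    m = re.search(pattern, subject)
    if m is None:
        return None
    # Report only the portion after the reset point
    return (m.start("kept"), m.group("kept"))
```

               Here search_with_reset("fo+", "bar", "xfooobar") reports "bar"
               at offset 5, although the prefix "fooo" had to match first.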
   5555 
   5556        Perl documents that the use  of  \K  within  assertions  is  "not  well
   5557        defined".  In  PCRE,  \K  is  acted upon when it occurs inside positive
   5558        assertions, but is ignored in negative assertions.  Note  that  when  a
   5559        pattern  such  as (?=ab\K) matches, the reported start of the match can
   5560        be greater than the end of the match.
   5561 
   5562    Simple assertions
   5563 
   5564        The final use of backslash is for certain simple assertions. An  asser-
   5565        tion  specifies a condition that has to be met at a particular point in
   5566        a match, without consuming any characters from the subject string.  The
   5567        use  of subpatterns for more complicated assertions is described below.
   5568        The backslashed assertions are:
   5569 
   5570          \b     matches at a word boundary
   5571          \B     matches when not at a word boundary
   5572          \A     matches at the start of the subject
   5573          \Z     matches at the end of the subject
   5574                  also matches before a newline at the end of the subject
   5575          \z     matches only at the end of the subject
   5576          \G     matches at the first matching position in the subject
   5577 
   5578        Inside a character class, \b has a different meaning;  it  matches  the
   5579        backspace  character.  If  any  other  of these assertions appears in a
   5580        character class, by default it matches the corresponding literal  char-
   5581        acter  (for  example,  \B  matches  the  letter  B).  However,  if  the
   5582        PCRE_EXTRA option is set, an "invalid escape sequence" error is  gener-
   5583        ated instead.
   5584 
   5585        A  word  boundary is a position in the subject string where the current
   5586        character and the previous character do not both match \w or  \W  (i.e.
   5587        one  matches  \w  and the other matches \W), or the start or end of the
   5588        string if the first or last character matches \w,  respectively.  In  a
   5589        UTF  mode,  the  meanings  of  \w  and \W can be changed by setting the
   5590        PCRE_UCP option. When this is done, it also affects \b and \B.  Neither
   5591        PCRE  nor  Perl has a separate "start of word" or "end of word" metase-
   5592        quence. However, whatever follows \b normally determines which  it  is.
   5593        For example, the fragment \ba matches "a" at the start of a word.
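
               The word boundary definition reduces to comparing the
               characters on either side of a position. This Python sketch
               uses hypothetical function names and assumes the default,
               non-UCP meaning of \w in the "C" locale.

```python
def is_word_char(ch):
    return ch == "_" or (ch.isascii() and ch.isalnum())

def at_word_boundary(subject, pos):
    # Positions outside the string count as non-word characters
    before = pos > 0 and is_word_char(subject[pos - 1])
    after = pos < len(subject) and is_word_char(subject[pos])
    return before != after
```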
   5594 
   5595        The  \A,  \Z,  and \z assertions differ from the traditional circumflex
   5596        and dollar (described in the next section) in that they only ever match
   5597        at  the  very start and end of the subject string, whatever options are
   5598        set. Thus, they are independent of multiline mode. These  three  asser-
   5599        tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
   5600        affect only the behaviour of the circumflex and dollar  metacharacters.
   5601        However,  if the startoffset argument of pcre_exec() is non-zero, indi-
   5602        cating that matching is to start at a point other than the beginning of
   5603        the  subject,  \A  can never match. The difference between \Z and \z is
   5604        that \Z matches before a newline at the end of the string as well as at
   5605        the very end, whereas \z matches only at the end.
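
               The three assertions can be written as position tests. The
               Python functions below are hypothetical and assume that the
               newline convention is a single LF character.

```python
def at_A(subject, pos):
    return pos == 0

def at_z(subject, pos):
    return pos == len(subject)

def at_Z(subject, pos):
    # The very end, or just before a newline that ends the subject
    before_final_newline = (pos == len(subject) - 1 and subject[pos] == "\n")
    return at_z(subject, pos) or before_final_newline
```

               Because matching never moves to a position before startoffset,
               at_A cannot be true for any position tried when startoffset is
               non-zero.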
   5606 
   5607        The  \G assertion is true only when the current matching position is at
   5608        the start point of the match, as specified by the startoffset  argument
   5609        of  pcre_exec().  It  differs  from \A when the value of startoffset is
   5610        non-zero. By calling pcre_exec() multiple times with appropriate  argu-
   5611        ments, you can mimic Perl's /g option, and it is in this kind of imple-
   5612        mentation that \G can be useful.
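
               The shape of such a loop can be sketched in Python. This is an
               analogy, not the PCRE API: the pos argument of a compiled
               pattern's match() method plays the role of startoffset, and
               because match() only succeeds at pos it behaves like a pattern
               that begins with \G. The function name is hypothetical.

```python
import re

def find_all_anchored(pattern, subject):
    # Collect matches, each anchored at the current start offset
    compiled = re.compile(pattern)
    results, pos = [], 0
    while pos <= len(subject):
        m = compiled.match(subject, pos)
        if m is None:
            break                     # no match at this offset: stop
        results.append(m.group())
        # Step past an empty match so the loop always makes progress
        pos = m.end() if m.end() > pos else pos + 1
    return results
```

               On the subject "1234ab56" with a pattern for two digits, the
               loop stops after "12" and "34", whereas an unanchored global
               search would also find "56".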
   5613 
   5614        Note, however, that PCRE's interpretation of \G, as the  start  of  the
   5615        current match, is subtly different from Perl's, which defines it as the
   5616        end of the previous match. In Perl, these can  be  different  when  the
   5617        previously  matched  string was empty. Because PCRE does just one match
   5618        at a time, it cannot reproduce this behaviour.
   5619 
   5620        If all the alternatives of a pattern begin with \G, the  expression  is
   5621        anchored to the starting match position, and the "anchored" flag is set
   5622        in the compiled regular expression.
   5623 
   5624 
   5625 CIRCUMFLEX AND DOLLAR
   5626 
   5627        The circumflex and dollar  metacharacters  are  zero-width  assertions.
   5628        That  is,  they test for a particular condition being true without con-
   5629        suming any characters from the subject string.
   5630 
   5631        Outside a character class, in the default matching mode, the circumflex
   5632        character  is  an  assertion  that is true only if the current matching
   5633        point is at the start of the subject string. If the  startoffset  argu-
   5634        ment  of  pcre_exec()  is  non-zero,  circumflex can never match if the
   5635        PCRE_MULTILINE option is unset. Inside a  character  class,  circumflex
   5636        has an entirely different meaning (see below).
   5637 
   5638        Circumflex  need  not be the first character of the pattern if a number
   5639        of alternatives are involved, but it should be the first thing in  each
   5640        alternative  in  which  it appears if the pattern is ever to match that
   5641        branch. If all possible alternatives start with a circumflex, that  is,
   5642        if  the  pattern  is constrained to match only at the start of the sub-
   5643        ject, it is said to be an "anchored" pattern.  (There  are  also  other
   5644        constructs that can cause a pattern to be anchored.)
   5645 
   5646        The  dollar  character is an assertion that is true only if the current
   5647        matching point is at the end of  the  subject  string,  or  immediately
   5648        before  a newline at the end of the string (by default). Note, however,
   5649        that it does not actually match the newline. Dollar  need  not  be  the
   5650        last character of the pattern if a number of alternatives are involved,
   5651        but it should be the last item in any branch in which it appears.  Dol-
   5652        lar has no special meaning in a character class.
   5653 
   5654        The  meaning  of  dollar  can be changed so that it matches only at the
   5655        very end of the string, by setting the  PCRE_DOLLAR_ENDONLY  option  at
   5656        compile time. This does not affect the \Z assertion.
   5657 
   5658        The meanings of the circumflex and dollar characters are changed if the
   5659        PCRE_MULTILINE option is set. When  this  is  the  case,  a  circumflex
   5660        matches  immediately after internal newlines as well as at the start of
   5661        the subject string. It does not match after a  newline  that  ends  the
   5662        string.  A dollar matches before any newlines in the string, as well as
   5663        at the very end, when PCRE_MULTILINE is set. When newline is  specified
   5664        as  the  two-character  sequence CRLF, isolated CR and LF characters do
   5665        not indicate newlines.
   5666 
   5667        For example, the pattern /^abc$/ matches the subject string  "def\nabc"
   5668        (where  \n  represents a newline) in multiline mode, but not otherwise.
   5669        Consequently, patterns that are anchored in single  line  mode  because
   5670        all  branches  start  with  ^ are not anchored in multiline mode, and a
   5671        match for circumflex is  possible  when  the  startoffset  argument  of
   5672        pcre_exec()  is  non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
   5673        PCRE_MULTILINE is set.
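
               The multiline behaviour described in this section can be sketched
               with Python's re module. This is only a stand-in, not the PCRE
               API: it is assumed here that re.MULTILINE corresponds to
               PCRE_MULTILINE for these simple cases, and the variable names
               are illustrative.

```python
import re

subject = "def\nabc"

# Default mode: ^ matches only at the very start of the subject.
single_line = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches immediately after an internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

# By default, $ also matches just before a newline that ends the string,
# without consuming that newline.
before_final_newline = re.search(r"abc$", "abc\n")

print(single_line)                   # None
print(multi_line.span())             # (4, 7)
print(before_final_newline.group())  # abc
```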
   5674 
   5675        Note that the sequences \A, \Z, and \z can be used to match  the  start
   5676        and  end of the subject in both modes, and if all branches of a pattern
   5677        start with \A it is always anchored, whether or not  PCRE_MULTILINE  is
   5678        set.
   5679 
   5680 
   5681 FULL STOP (PERIOD, DOT) AND \N
   5682 
   5683        Outside a character class, a dot in the pattern matches any one charac-
   5684        ter in the subject string except (by default) a character  that  signi-
   5685        fies the end of a line.
   5686 
   5687        When  a line ending is defined as a single character, dot never matches
   5688        that character; when the two-character sequence CRLF is used, dot  does
   5689        not  match  CR  if  it  is immediately followed by LF, but otherwise it
   5690        matches all characters (including isolated CRs and LFs). When any  Uni-
   5691        code  line endings are being recognized, dot does not match CR or LF or
   5692        any of the other line ending characters.
   5693 
   5694        The behaviour of dot with regard to newlines can  be  changed.  If  the
   5695        PCRE_DOTALL  option  is  set,  a dot matches any one character, without
   5696        exception. If the two-character sequence CRLF is present in the subject
   5697        string, it takes two dots to match it.
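
               As a rough illustration, the effect of PCRE_DOTALL can be
               mimicked with Python's re.DOTALL flag. This is an assumption
               made for the sketch: Python treats only LF as a line ending
               for dot, whereas PCRE's newline convention is configurable.

```python
import re

# Without DOTALL, a dot does not match the newline character.
default_dot = re.search(r"a.b", "a\nb")

# With DOTALL, a dot matches any one character, newline included.
dotall_dot = re.search(r"a.b", "a\nb", re.DOTALL)

# CRLF is two characters, so it takes two dots to match it.
one_dot_crlf = re.fullmatch(r"a.b", "a\r\nb", re.DOTALL)
two_dots_crlf = re.fullmatch(r"a..b", "a\r\nb", re.DOTALL)
```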
   5698 
   5699        The  handling of dot is entirely independent of the handling of circum-
   5700        flex and dollar, the only relationship being  that  they  both  involve
   5701        newlines. Dot has no special meaning in a character class.
   5702 
   5703        The  escape  sequence  \N  behaves  like  a  dot, except that it is not
   5704        affected by the PCRE_DOTALL option. In  other  words,  it  matches  any
   5705        character  except  one that signifies the end of a line. Perl also uses
   5706        \N to match characters by name; PCRE does not support this.
   5707 
   5708 
   5709 MATCHING A SINGLE DATA UNIT
   5710 
   5711        Outside a character class, the escape sequence \C matches any one  data
   5712        unit,  whether or not a UTF mode is set. In the 8-bit library, one data
   5713        unit is one byte; in the 16-bit library it is a  16-bit  unit;  in  the
   5714        32-bit  library  it  is  a 32-bit unit. Unlike a dot, \C always matches
   5715        line-ending characters. The feature is provided in  Perl  in  order  to
   5716        match individual bytes in UTF-8 mode, but it is unclear how it can use-
   5717        fully be used. Because \C breaks up  characters  into  individual  data
   5718        units,  matching  one unit with \C in a UTF mode means that the rest of
   5719        the string may start with a malformed UTF character. This has undefined
   5720        results, because PCRE assumes that it is dealing with valid UTF strings
   5721        (and by default it checks this at the start of  processing  unless  the
   5722        PCRE_NO_UTF8_CHECK,  PCRE_NO_UTF16_CHECK  or PCRE_NO_UTF32_CHECK option
   5723        is used).
   5724 
   5725        PCRE does not allow \C to appear in  lookbehind  assertions  (described
   5726        below)  in  a UTF mode, because this would make it impossible to calcu-
   5727        late the length of the lookbehind.
   5728 
   5729        In general, the \C escape sequence is best avoided. However, one way of
   5730        using  it that avoids the problem of malformed UTF characters is to use
   5731        a lookahead to check the length of the next character, as in this  pat-
   5732        tern,  which  could be used with a UTF-8 string (ignore white space and
   5733        line breaks):
   5734 
   5735          (?| (?=[\x00-\x7f])(\C) |
   5736              (?=[\x80-\x{7ff}])(\C)(\C) |
   5737              (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
   5738              (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
   5739 
   5740        A group that starts with (?| resets the capturing  parentheses  numbers
   5741        in  each  alternative  (see  "Duplicate Subpattern Numbers" below). The
   5742        assertions at the start of each branch check the next  UTF-8  character
   5743        for  values  whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
   5744        character's individual bytes are then captured by the appropriate  num-
   5745        ber of groups.
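
               The code point ranges used in the four lookaheads can be checked
               against the UTF-8 encoding itself. The sketch below uses plain
               Python rather than PCRE; utf8_length is a hypothetical helper,
               and the upper bound 0x10FFFF is the Unicode limit rather than
               the 0x1fffff written in the pattern.

```python
# Boundary code points for each branch of the pattern.
boundaries = [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]

def utf8_length(code_point):
    """Return the number of bytes in the UTF-8 encoding of a code point."""
    return len(chr(code_point).encode("utf-8"))

# Each pair of boundaries falls in the 1, 2, 3, and 4 byte ranges in turn.
lengths = [utf8_length(cp) for cp in boundaries]
print(lengths)  # [1, 1, 2, 2, 3, 3, 4, 4]
```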
   5746 
   5747 
   5748 SQUARE BRACKETS AND CHARACTER CLASSES
   5749 
   5750        An opening square bracket introduces a character class, terminated by a
   5751        closing square bracket. A closing square bracket on its own is not spe-
   5752        cial by default.  However, if the PCRE_JAVASCRIPT_COMPAT option is set,
   5753        a lone closing square bracket causes a compile-time error. If a closing
   5754        square  bracket  is required as a member of the class, it should be the
   5755        first data character in the class  (after  an  initial  circumflex,  if
   5756        present) or escaped with a backslash.
   5757 
   5758        A  character  class matches a single character in the subject. In a UTF
   5759        mode, the character may be more than one  data  unit  long.  A  matched
   5760        character must be in the set of characters defined by the class, unless
   5761        the first character in the class definition is a circumflex,  in  which
   5762        case the subject character must not be in the set defined by the class.
   5763        If a circumflex is actually required as a member of the  class,  ensure
   5764        it is not the first character, or escape it with a backslash.
   5765 
   5766        For  example, the character class [aeiou] matches any lower case vowel,
   5767        while [^aeiou] matches any character that is not a  lower  case  vowel.
   5768        Note that a circumflex is just a convenient notation for specifying the
   5769        characters that are in the class by enumerating those that are  not.  A
   5770        class  that starts with a circumflex is not an assertion; it still con-
   5771        sumes a character from the subject string, and therefore  it  fails  if
   5772        the current pointer is at the end of the string.
   5773 
   5774        In UTF-8 (UTF-16, UTF-32) mode, characters with values greater than 255
   5775        (0xffff) can be included in a class as a literal string of data  units,
   5776        or by using the \x{ escaping mechanism.
   5777 
   5778        When  caseless  matching  is set, any letters in a class represent both
   5779        their upper case and lower case versions, so for  example,  a  caseless
   5780        [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not
   5781        match "A", whereas a caseful version would. In a UTF mode, PCRE  always
   5782        understands  the  concept  of case for characters whose values are less
   5783        than 128, so caseless matching is always possible. For characters  with
   5784        higher  values,  the  concept  of case is supported if PCRE is compiled
   5785        with Unicode property support, but not otherwise.  If you want  to  use
   5786        caseless  matching in a UTF mode for characters 128 and above, you must
   5787        ensure that PCRE is compiled with Unicode property support as  well  as
   5788        with UTF support.
   5789 
   5790        Characters  that  might  indicate  line breaks are never treated in any
   5791        special way  when  matching  character  classes,  whatever  line-ending
   5792        sequence  is  in  use,  and  whatever  setting  of  the PCRE_DOTALL and
   5793        PCRE_MULTILINE options is used. A class such as [^a] always matches one
   5794        of these characters.
   5795 
   5796        The  minus (hyphen) character can be used to specify a range of charac-
   5797        ters in a character  class.  For  example,  [d-m]  matches  any  letter
   5798        between  d  and  m,  inclusive.  If  a minus character is required in a
   5799        class, it must be escaped with a backslash  or  appear  in  a  position
   5800        where  it cannot be interpreted as indicating a range, typically as the
   5801        first or last character in the class, or immediately after a range. For
   5802        example,  [b-d-z] matches letters in the range b to d, a hyphen charac-
   5803        ter, or z.
   5804 
   5805        It is not possible to have the literal character "]" as the end charac-
   5806        ter  of a range. A pattern such as [W-]46] is interpreted as a class of
   5807        two characters ("W" and "-") followed by a literal string "46]", so  it
   5808        would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a
   5809        backslash it is interpreted as the end of range, so [W-\]46] is  inter-
   5810        preted  as a class containing a range followed by two other characters.
   5811        The octal or hexadecimal representation of "]" can also be used to  end
   5812        a range.
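
               The hyphen and closing bracket rules can be tried out with
               Python's re module, which is assumed here to parse these
               particular classes the same way as PCRE. The helper
               matched_chars is hypothetical.

```python
import re

def matched_chars(pattern, candidates):
    """Return the candidates that the one-character class matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

# A hyphen right after a range is a literal: b to d, "-", or "z".
after_range = matched_chars(r"[b-d-z]", "abcde-yz")

# An unescaped "]" ends the class: a class of "W" and "-", then literal "46]".
unescaped = [s for s in ("W46]", "-46]", "X46]") if re.fullmatch(r"[W-]46]", s)]

# An escaped "]" ends a range instead: W to "]", plus "4" and "6".
escaped = matched_chars(r"[W-\]46]", "VWZ[]^46")
```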
   5813 
   5814        An  error  is  generated  if  a POSIX character class (see below) or an
   5815        escape sequence other than one that defines a single character  appears
   5816        at  a  point  where  a range ending character is expected. For example,
   5817        [z-\xff] is valid, but [A-\d] and [A-[:digit:]] are not.
   5818 
   5819        Ranges operate in the collating sequence of character values. They  can
   5820        also   be  used  for  characters  specified  numerically,  for  example
   5821        [\000-\037]. Ranges can include any characters that are valid  for  the
   5822        current mode.
   5823 
   5824        If a range that includes letters is used when caseless matching is set,
   5825        it matches the letters in either case. For example, [W-c] is equivalent
   5826        to  [][\\^_`wxyzabc],  matched  caselessly,  and  in a non-UTF mode, if
   5827        character tables for a French locale are in  use,  [\xc8-\xcb]  matches
   5828        accented  E  characters  in both cases. In UTF modes, PCRE supports the
   5829        concept of case for characters with values greater than 127  only  when
   5830        it is compiled with Unicode property support.
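
               The stated equivalence for [W-c] can be checked over the ASCII
               range. The sketch uses Python's re.IGNORECASE as a stand-in
               for caseless matching in PCRE, which is an assumption; only
               code points below 128 are compared.

```python
import re

caseless_range = re.compile(r"[W-c]", re.IGNORECASE)
explicit_class = re.compile(r"[][\\^_`wxyzabc]", re.IGNORECASE)

ascii_chars = [chr(i) for i in range(128)]

# Collect the characters each class accepts when matched caselessly.
by_range = {c for c in ascii_chars if caseless_range.fullmatch(c)}
by_explicit = {c for c in ascii_chars if explicit_class.fullmatch(c)}

same_set = by_range == by_explicit
```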
   5831 
   5832        The  character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
   5833        \w, and \W may appear in a character class, and add the characters that
   5834        they  match to the class. For example, [\dABCDEF] matches any hexadeci-
   5835        mal digit. In UTF modes, the PCRE_UCP option affects  the  meanings  of
   5836        \d,  \s,  \w  and  their upper case partners, just as it does when they
   5837        appear outside a character class, as described in the section  entitled
   5838        "Generic character types" above. The escape sequence \b has a different
   5839        meaning inside a character class; it matches the  backspace  character.
   5840        The  sequences  \B,  \N,  \R, and \X are not special inside a character
   5841        class. Like any other unrecognized escape sequences, they  are  treated
   5842        as  the literal characters "B", "N", "R", and "X" by default, but cause
   5843        an error if the PCRE_EXTRA option is set.
   5844 
   5845        A circumflex can conveniently be used with  the  upper  case  character
   5846        types  to specify a more restricted set of characters than the matching
   5847        lower case type.  For example, the class [^\W_] matches any  letter  or
   5848        digit, but not underscore, whereas [\w] includes underscore. A positive
   5849        character class should be read as "something OR something OR ..." and a
   5850        negative class as "NOT something AND NOT something AND NOT ...".
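
               A short sketch of the [^\W_] idiom follows, using Python's re
               module with re.ASCII so that \w is limited to ASCII letters,
               digits, and underscore. That flag is an assumption chosen to
               resemble PCRE without PCRE_UCP.

```python
import re

letters_or_digits = re.compile(r"[^\W_]", re.ASCII)
word_chars = re.compile(r"[\w]", re.ASCII)

# NOT non-word AND NOT underscore leaves only letters and digits.
accepted = [c for c in "a7_-" if letters_or_digits.fullmatch(c)]

# The positive class [\w] still includes the underscore.
underscore_is_word = word_chars.fullmatch("_") is not None
```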
   5851 
   5852        The  only  metacharacters  that are recognized in character classes are
   5853        backslash, hyphen (only where it can be  interpreted  as  specifying  a
   5854        range),  circumflex  (only  at the start), opening square bracket (only
   5855        when it can be interpreted as introducing a POSIX class name, or for  a
   5856        special  compatibility  feature  -  see the next two sections), and the
   5857        terminating  closing  square  bracket.  However,  escaping  other  non-
   5858        alphanumeric characters does no harm.
   5859 
   5860 
   5861 POSIX CHARACTER CLASSES
   5862 
   5863        Perl supports the POSIX notation for character classes. This uses names
   5864        enclosed by [: and :] within the enclosing square brackets.  PCRE  also
   5865        supports this notation. For example,
   5866 
   5867          [01[:alpha:]%]
   5868 
   5869        matches "0", "1", any alphabetic character, or "%". The supported class
   5870        names are:
   5871 
   5872          alnum    letters and digits
   5873          alpha    letters
   5874          ascii    character codes 0 - 127
   5875          blank    space or tab only
   5876          cntrl    control characters
   5877          digit    decimal digits (same as \d)
   5878          graph    printing characters, excluding space
   5879          lower    lower case letters
   5880          print    printing characters, including space
   5881          punct    printing characters, excluding letters and digits and space
   5882          space    white space (the same as \s from PCRE 8.34)
   5883          upper    upper case letters
   5884          word     "word" characters (same as \w)
   5885          xdigit   hexadecimal digits
   5886 
   5887        The default "space" characters are HT (9), LF (10), VT (11),  FF  (12),
   5888        CR  (13),  and space (32). If locale-specific matching is taking place,
   5889        the list of space characters may be different; there may  be  fewer  or
   5890        more of them. "Space" used to be different to \s, which did not include
   5891        VT, for Perl compatibility.  However, Perl changed at release 5.18, and
   5892        PCRE  followed  at release 8.34.  "Space" and \s now match the same set
   5893        of characters.
   5894 
   5895        The name "word" is a Perl extension, and "blank"  is  a  GNU  extension
   5896        from  Perl  5.8. Another Perl extension is negation, which is indicated
   5897        by a ^ character after the colon. For example,
   5898 
   5899          [12[:^digit:]]
   5900 
   5901        matches "1", "2", or any non-digit. PCRE (and Perl) also recognize  the
   5902        POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
   5903        these are not supported, and an error is given if they are encountered.
   5904 
   5905        By default, characters with values greater than 127 do not match any of
   5906        the  POSIX character classes. However, if the PCRE_UCP option is passed
   5907        to pcre_compile(), some of the classes  are  changed  so  that  Unicode
   5908        character  properties  are  used. This is achieved by replacing certain
   5909        POSIX classes by other sequences, as follows:
   5910 
   5911          [:alnum:]  becomes  \p{Xan}
   5912          [:alpha:]  becomes  \p{L}
   5913          [:blank:]  becomes  \h
   5914          [:digit:]  becomes  \p{Nd}
   5915          [:lower:]  becomes  \p{Ll}
   5916          [:space:]  becomes  \p{Xps}
   5917          [:upper:]  becomes  \p{Lu}
   5918          [:word:]   becomes  \p{Xwd}
   5919 
   5920        Negated versions, such as [:^alpha:], use \P instead of \p. Three other
   5921        POSIX classes are handled specially in UCP mode:
   5922 
   5923        [:graph:] This  matches  characters that have glyphs that mark the page
   5924                  when printed. In Unicode property terms, it matches all char-
   5925                  acters with the L, M, N, P, S, or Cf properties, except for:
   5926 
   5927                    U+061C           Arabic Letter Mark
   5928                    U+180E           Mongolian Vowel Separator
   5929                    U+2066 - U+2069  Various "isolate"s
   5930 
   5931 
   5932        [:print:] This  matches  the  same  characters  as [:graph:] plus space
   5933                  characters that are not controls, that  is,  characters  with
   5934                  the Zs property.
   5935 
   5936        [:punct:] This matches all characters that have the Unicode P (punctua-
   5937                  tion) property, plus those characters whose code  points  are
   5938                  less than 128 that have the S (Symbol) property.
   5939 
   5940        The  other  POSIX classes are unchanged, and match only characters with
   5941        code points less than 128.
   5942 
   5943 
   5944 COMPATIBILITY FEATURE FOR WORD BOUNDARIES
   5945 
   5946        In the POSIX.2 compliant library that was included in 4.4BSD Unix,  the
   5947        ugly  syntax  [[:<:]]  and [[:>:]] is used for matching "start of word"
   5948        and "end of word". PCRE treats these items as follows:
   5949 
   5950          [[:<:]]  is converted to  \b(?=\w)
   5951          [[:>:]]  is converted to  \b(?<=\w)
   5952 
   5953        Only these exact character sequences are recognized. A sequence such as
   5954        [a[:<:]b] provokes an error for an unrecognized POSIX class name.  This
   5955        support is not compatible with Perl. It is provided to help  migrations
   5956        from other environments, and is best not used in any new patterns. Note
   5957        that \b matches at the start and the end of a word (see "Simple  asser-
   5958        tions"  above),  and in a Perl-style pattern the preceding or following
   5959        character normally shows which is wanted,  without  the  need  for  the
   5960        assertions  that  are used above in order to give exactly the POSIX be-
   5961        haviour.
   5962 
   5963 
   5964 VERTICAL BAR
   5965 
   5966        Vertical bar characters are used to separate alternative patterns.  For
   5967        example, the pattern
   5968 
   5969          gilbert|sullivan
   5970 
   5971        matches  either "gilbert" or "sullivan". Any number of alternatives may
   5972        appear, and an empty  alternative  is  permitted  (matching  the  empty
   5973        string). The matching process tries each alternative in turn, from left
   5974        to right, and the first one that succeeds is used. If the  alternatives
   5975        are  within a subpattern (defined below), "succeeds" means matching the
   5976        rest of the main pattern as well as the alternative in the subpattern.
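
               The left-to-right rule can be observed in a small sketch.
               Python's re module is used as a stand-in on the assumption
               that it tries alternatives in the same order as PCRE's
               traditional matching function.

```python
import re

# The first alternative that succeeds is used, even if a later one is longer.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern, "succeeds" includes matching the rest of the pattern,
# so the matcher moves on to "ab" when "a" cannot be followed by "c".
with_tail = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative is permitted and matches the empty string.
empty_alt = re.match(r"gilbert|", "sullivan").group()
```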
   5977 
   5978 
   5979 INTERNAL OPTION SETTING
   5980 
   5981        The settings of the  PCRE_CASELESS,  PCRE_MULTILINE,  PCRE_DOTALL,  and
   5982        PCRE_EXTENDED  options  (which are Perl-compatible) can be changed from
   5983        within the pattern by  a  sequence  of  Perl  option  letters  enclosed
   5984        between "(?" and ")".  The option letters are
   5985 
   5986          i  for PCRE_CASELESS
   5987          m  for PCRE_MULTILINE
   5988          s  for PCRE_DOTALL
   5989          x  for PCRE_EXTENDED
   5990 
   5991        For example, (?im) sets caseless, multiline matching. It is also possi-
   5992        ble to unset these options by preceding the letter with a hyphen, and a
   5993        combined  setting and unsetting such as (?im-sx), which sets PCRE_CASE-
   5994        LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and  PCRE_EXTENDED,
   5995        is  also  permitted.  If  a  letter  appears  both before and after the
   5996        hyphen, the option is unset.
   5997 
   5998        The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and  PCRE_EXTRA
   5999        can  be changed in the same way as the Perl-compatible options by using
   6000        the characters J, U and X respectively.
   6001 
   6002        When one of these option changes occurs at  top  level  (that  is,  not
   6003        inside  subpattern parentheses), the change applies to the remainder of
   6004        the pattern that follows. If the change is placed right at the start of
   6005        a pattern, PCRE extracts it into the global options (and it will there-
   6006        fore show up in data extracted by the pcre_fullinfo() function).
   6007 
   6008        An option change within a subpattern (see below for  a  description  of
   6009        subpatterns)  affects only that part of the subpattern that follows it,
   6010        so
   6011 
   6012          (a(?i)b)c
   6013 
   6014        matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
   6015        used).   By  this means, options can be made to have different settings
   6016        in different parts of the pattern. Any changes made in one  alternative
   6017        do  carry  on  into subsequent branches within the same subpattern. For
   6018        example,
   6019 
   6020          (a(?i)b|c)
   6021 
   6022        matches "ab", "aB", "c", and "C", even though  when  matching  "C"  the
   6023        first  branch  is  abandoned before the option setting. This is because
   6024        the effects of option settings happen at compile time. There  would  be
   6025        some very weird behaviour otherwise.
   6026 
   6027        Note:  There  are  other  PCRE-specific  options that can be set by the
   6028        application when the compiling or matching  functions  are  called.  In
   6029        some  cases  the  pattern can contain special leading sequences such as
   6030        (*CRLF) to override what the application  has  set  or  what  has  been
   6031        defaulted.   Details   are  given  in  the  section  entitled  "Newline
   6032        sequences" above. There are also the (*UTF8), (*UTF16), (*UTF32),  and
   6033        (*UCP)  leading sequences that can be used to set UTF and Unicode prop-
   6034        erty modes; they are equivalent to setting the  PCRE_UTF8,  PCRE_UTF16,
   6035        PCRE_UTF32  and the PCRE_UCP options, respectively. The (*UTF) sequence
   6036        is a generic version that can be used with any of the  libraries.  How-
   6037        ever,  the  application  can set the PCRE_NEVER_UTF option, which locks
   6038        out the use of the (*UTF) sequences.
   6039 
   6040 
   6041 SUBPATTERNS
   6042 
   6043        Subpatterns are delimited by parentheses (round brackets), which can be
   6044        nested.  Turning part of a pattern into a subpattern does two things:
   6045 
   6046        1. It localizes a set of alternatives. For example, the pattern
   6047 
   6048          cat(aract|erpillar|)
   6049 
   6050        matches  "cataract",  "caterpillar", or "cat". Without the parentheses,
   6051        it would match "cataract", "erpillar" or an empty string.
   6052 
   6053        2. It sets up the subpattern as  a  capturing  subpattern.  This  means
   6054        that,  when  the  whole  pattern  matches,  that portion of the subject
   6055        string that matched the subpattern is passed back to the caller via the
   6056        ovector  argument  of  the matching function. (This applies only to the
   6057        traditional matching functions; the DFA matching functions do not  sup-
   6058        port capturing.)
   6059 
   6060        Opening parentheses are counted from left to right (starting from 1) to
   6061        obtain numbers for the  capturing  subpatterns.  For  example,  if  the
   6062        string "the red king" is matched against the pattern
   6063 
   6064          the ((red|white) (king|queen))
   6065 
   6066        the captured substrings are "red king", "red", and "king", and are num-
   6067        bered 1, 2, and 3, respectively.
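
               The numbering of capturing subpatterns can be sketched as
               follows. Python's re module stands in for PCRE here; its
               groups() method is not part of the PCRE API, where captured
               substrings are returned through the ovector argument.

```python
import re

# Without parentheses the alternation would split the whole pattern;
# with them it is local to the group.
localized = [s for s in ("cataract", "caterpillar", "cat", "erpillar")
             if re.fullmatch(r"cat(aract|erpillar|)", s)]

# Groups are numbered by counting opening parentheses from left to right.
m = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
captured = m.groups()
print(captured)  # ('red king', 'red', 'king')
```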
   6068 
   6069        The fact that plain parentheses fulfil  two  functions  is  not  always
   6070        helpful.   There are often times when a grouping subpattern is required
   6071        without a capturing requirement. If an opening parenthesis is  followed
   6072        by  a question mark and a colon, the subpattern does not do any captur-
   6073        ing, and is not counted when computing the  number  of  any  subsequent
   6074        capturing  subpatterns. For example, if the string "the white queen" is
   6075        matched against the pattern
   6076 
   6077          the ((?:red|white) (king|queen))
   6078 
   6079        the captured substrings are "white queen" and "queen", and are numbered
   6080        1 and 2. The maximum number of capturing subpatterns is 65535.
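
               A minimal sketch of the effect of (?: on numbering, again
               using Python's re module as an assumed stand-in for PCRE:

```python
import re

# The (?: group still groups its alternatives but is skipped when numbering.
m = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")
captured = m.groups()
print(captured)  # ('white queen', 'queen')
```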
   6081 
   6082        As  a  convenient shorthand, if any option settings are required at the
   6083        start of a non-capturing subpattern,  the  option  letters  may  appear
   6084        between the "?" and the ":". Thus the two patterns
   6085 
   6086          (?i:saturday|sunday)
   6087          (?:(?i)saturday|sunday)
   6088 
   6089        match exactly the same set of strings. Because alternative branches are
   6090        tried from left to right, and options are not reset until  the  end  of
   6091        the  subpattern is reached, an option setting in one branch does affect
   6092        subsequent branches, so the above patterns match "SUNDAY"  as  well  as
   6093        "Saturday".
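
               The shorthand form can be sketched with Python's re module,
               which accepts the (?i:...) syntax. Only the first of the two
               patterns is shown, because recent Python versions reject an
               option setting such as (?i) that is not at the start of the
               pattern; PCRE has no such restriction.

```python
import re

scoped = re.compile(r"(?i:saturday|sunday)")

# The option set at the start of the group covers both alternatives.
both_branches = [s for s in ("Saturday", "SUNDAY", "monday")
                 if scoped.fullmatch(s)]

# The option does not extend beyond the closing parenthesis.
local_only = re.compile(r"(?i:sat)urday")
inside_group = local_only.fullmatch("SATurday")
outside_group = local_only.fullmatch("SATURDAY")
```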
   6094 
   6095 
   6096 DUPLICATE SUBPATTERN NUMBERS
   6097 
   6098        Perl 5.10 introduced a feature whereby each alternative in a subpattern
   6099        uses the same numbers for its capturing parentheses. Such a  subpattern
   6100        starts  with (?| and is itself a non-capturing subpattern. For example,
   6101        consider this pattern:
   6102 
   6103          (?|(Sat)ur|(Sun))day
   6104 
   6105        Because the two alternatives are inside a (?| group, both sets of  cap-
   6106        turing  parentheses  are  numbered one. Thus, when the pattern matches,
   6107        you can look at captured substring number  one,  whichever  alternative
   6108        matched.  This  construct  is useful when you want to capture part, but
   6109        not all, of one of a number of alternatives. Inside a (?| group, paren-
   6110        theses  are  numbered as usual, but the number is reset at the start of
   6111        each branch. The numbers of any capturing parentheses that  follow  the
   6112        subpattern  start after the highest number used in any branch. The fol-
   6113        lowing example is taken from the Perl documentation. The numbers under-
   6114        neath show in which buffer the captured content will be stored.
   6115 
   6116          # before  ---------------branch-reset----------- after
   6117          / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
   6118          # 1            2         2  3        2     3     4
   6119 
   6120        A  back  reference  to a numbered subpattern uses the most recent value
   6121        that is set for that number by any subpattern.  The  following  pattern
   6122        matches "abcabc" or "defdef":
   6123 
   6124          /(?|(abc)|(def))\1/
   6125 
   6126        In  contrast,  a subroutine call to a numbered subpattern always refers
   6127        to the first one in the pattern with the given  number.  The  following
   6128        pattern matches "abcabc" or "defabc":
   6129 
   6130          /(?|(abc)|(def))(?1)/
   6131 
   6132        If  a condition test for a subpattern's having matched refers to a non-
   6133        unique number, the test is true if any of the subpatterns of that  num-
   6134        ber have matched.
   6135 
   6136        An  alternative approach to using this "branch reset" feature is to use
   6137        duplicate named subpatterns, as described in the next section.
   6138 
   6139 
   6140 NAMED SUBPATTERNS
   6141 
   6142        Identifying capturing parentheses by number is simple, but  it  can  be
   6143        very  hard  to keep track of the numbers in complicated regular expres-
   6144        sions. Furthermore, if an  expression  is  modified,  the  numbers  may
   6145        change.  To help with this difficulty, PCRE supports the naming of sub-
   6146        patterns. This feature was not added to Perl until release 5.10. Python
   6147        had  the  feature earlier, and PCRE introduced it at release 4.0, using
   6148        the Python syntax. PCRE now supports both the Perl and the Python  syn-
   6149        tax.  Perl  allows  identically  numbered subpatterns to have different
   6150        names, but PCRE does not.
   6151 
   6152        In PCRE, a subpattern can be named in one of three  ways:  (?<name>...)
   6153        or  (?'name'...)  as in Perl, or (?P<name>...) as in Python. References
   6154        to capturing parentheses from other parts of the pattern, such as  back
   6155        references,  recursion,  and conditions, can be made by name as well as
   6156        by number.
   6157 
   6158        Names consist of up to 32 alphanumeric characters and underscores,  but
   6159        must  start  with  a  non-digit.  Named capturing parentheses are still
   6160        allocated numbers as well as names, exactly as if the  names  were  not
   6161        present.  The PCRE API provides function calls for extracting the name-
   6162        to-number translation table from a compiled pattern. There  is  also  a
   6163        convenience function for extracting a captured substring by name.
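
               As a rough illustration of how names and numbers relate, the
               sketch below uses Python's re module rather than the PCRE API.
               That engine accepts only the (?P<name>...) spelling, and the
               group names "year" and "month" are invented for the example.
               Its groupindex attribute is only loosely comparable to the
               PCRE name-to-number table.

```python
import re

# Python-syntax named groups; the names are illustrative only.
pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = pattern.match("2015-01")

# Named groups still receive ordinary numbers, as if the names were absent.
name_to_number = dict(pattern.groupindex)

# The same captured text is reachable by name or by number.
by_name = m.group("year")
by_number = m.group(1)
```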
   6164 
   6165        By  default, a name must be unique within a pattern, but it is possible
   6166        to relax this constraint by setting the PCRE_DUPNAMES option at compile
   6167        time.  (Duplicate  names are also always permitted for subpatterns with
   6168        the same number, set up as described in the previous  section.)  Dupli-
   6169        cate  names  can  be useful for patterns where only one instance of the
   6170        named parentheses can match. Suppose you want to match the  name  of  a
   6171        weekday,  either as a 3-letter abbreviation or as the full name, and in
   6172        both cases you want to extract the abbreviation. This pattern (ignoring
   6173        the line breaks) does the job:
   6174 
   6175          (?<DN>Mon|Fri|Sun)(?:day)?|
   6176          (?<DN>Tue)(?:sday)?|
   6177          (?<DN>Wed)(?:nesday)?|
   6178          (?<DN>Thu)(?:rsday)?|
   6179          (?<DN>Sat)(?:urday)?
   6180 
   6181        There  are  five capturing substrings, but only one is ever set after a
   6182        match.  (An alternative way of solving this problem is to use a "branch
   6183        reset" subpattern, as described in the previous section.)
   6184 
   6185        The  convenience  function  for extracting the data by name returns the
   6186        substring for the first (and in this example, the only)  subpattern  of
   6187        that  name  that  matched.  This saves searching to find which numbered
   6188        subpattern it was.
   6189 
   6190        If you make a back reference to  a  non-unique  named  subpattern  from
   6191        elsewhere  in the pattern, the subpatterns to which the name refers are
   6192        checked in the order in which they appear in the overall  pattern.  The
   6193        first one that is set is used for the reference. For example, this pat-
   6194        tern matches both "foofoo" and "barbar" but not "foobar" or "barfoo":
   6195 
   6196          (?:(?<n>foo)|(?<n>bar))\k<n>
   6197 
   6198 
   6199        If you make a subroutine call to a non-unique named subpattern, the one
   6200        that  corresponds  to  the first occurrence of the name is used. In the
   6201        absence of duplicate numbers (see the previous section) this is the one
   6202        with the lowest number.
   6203 
   6204        If you use a named reference in a condition test (see the section about
   6205        conditions below), either to check whether a subpattern has matched, or
   6206        to  check for recursion, all subpatterns with the same name are tested.
   6207        If the condition is true for any one of them, the overall condition  is
   6208        true.  This  is  the  same  behaviour as testing by number. For further
   6209        details of the interfaces  for  handling  named  subpatterns,  see  the
   6210        pcreapi documentation.
   6211 
   6212        Warning: You cannot use different names to distinguish between two sub-
   6213        patterns with the same number because PCRE uses only the  numbers  when
   6214        matching. For this reason, an error is given at compile time if differ-
   6215        ent names are given to subpatterns with the same number.  However,  you
   6216        can always give the same name to subpatterns with the same number, even
   6217        when PCRE_DUPNAMES is not set.
   6218 
   6219 
   6220 REPETITION
   6221 
   6222        Repetition is specified by quantifiers, which can  follow  any  of  the
   6223        following items:
   6224 
   6225          a literal data character
   6226          the dot metacharacter
   6227          the \C escape sequence
   6228          the \X escape sequence
   6229          the \R escape sequence
   6230          an escape such as \d or \pL that matches a single character
   6231          a character class
   6232          a back reference (see next section)
   6233          a parenthesized subpattern (including assertions)
   6234          a subroutine call to a subpattern (recursive or otherwise)
   6235 
   6236        The  general repetition quantifier specifies a minimum and maximum num-
   6237        ber of permitted matches, by giving the two numbers in  curly  brackets
   6238        (braces),  separated  by  a comma. The numbers must be less than 65536,
   6239        and the first must be less than or equal to the second. For example:
   6240 
   6241          z{2,4}
   6242 
   6243        matches "zz", "zzz", or "zzzz". A closing brace on its  own  is  not  a
   6244        special  character.  If  the second number is omitted, but the comma is
   6245        present, there is no upper limit; if the second number  and  the  comma
   6246        are  both omitted, the quantifier specifies an exact number of required
   6247        matches. Thus
   6248 
   6249          [aeiou]{3,}
   6250 
   6251        matches at least 3 successive vowels, but may match many more, while
   6252 
   6253          \d{8}
   6254 
   6255        matches exactly 8 digits. An opening curly bracket that  appears  in  a
   6256        position  where a quantifier is not allowed, or one that does not match
   6257        the syntax of a quantifier, is taken as a literal character. For  exam-
   6258        ple, {,6} is not a quantifier, but a literal string of four characters.
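
               The three quantifier forms above can be tried out with the
               sketch below, written for Python's re module rather than PCRE.
               The helper name and the subject strings are made up for the
               example. Python's re treats {,6} as a quantifier, unlike PCRE,
               so that case is deliberately left out.

```python
import re

def fully_matches(pattern, subject):
    """Return True if the pattern matches the whole subject."""
    return re.fullmatch(pattern, subject) is not None

# {2,4}: between two and four repetitions.
bounded = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz")
           if fully_matches(r"z{2,4}", s)]

# {3,}: at least three, with no upper limit.
open_ended = re.search(r"[aeiou]{3,}", "queueing").group(0)

# {8}: exactly eight.
exact = fully_matches(r"\d{8}", "20150101")
too_short = fully_matches(r"\d{8}", "2015010")
```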
   6259 
   6260        In UTF modes, quantifiers apply to characters rather than to individual
   6261        data units. Thus, for example, \x{100}{2} matches two characters,  each
   6262        of which is represented by a two-byte sequence in a UTF-8 string. Simi-
   6263        larly, \X{3} matches three Unicode extended grapheme clusters, each  of
   6264        which  may  be  several  data  units long (and they may be of different
   6265        lengths).
   6266 
   6267        The quantifier {0} is permitted, causing the expression to behave as if
   6268        the previous item and the quantifier were not present. This may be use-
   6269        ful for subpatterns that are referenced as subroutines  from  elsewhere
   6270        in the pattern (but see also the section entitled "Defining subpatterns
   6271        for use by reference only" below). Items other  than  subpatterns  that
   6272        have a {0} quantifier are omitted from the compiled pattern.
   6273 
   6274        For  convenience, the three most common quantifiers have single-charac-
   6275        ter abbreviations:
   6276 
   6277          *    is equivalent to {0,}
   6278          +    is equivalent to {1,}
   6279          ?    is equivalent to {0,1}
   6280 
   6281        It is possible to construct infinite loops by  following  a  subpattern
   6282        that can match no characters with a quantifier that has no upper limit,
   6283        for example:
   6284 
   6285          (a?)*
   6286 
   6287        Earlier versions of Perl and PCRE used to give an error at compile time
   6288        for  such  patterns. However, because there are cases where this can be
   6289        useful, such patterns are now accepted, but if any  repetition  of  the
   6290        subpattern  does in fact match no characters, the loop is forcibly bro-
   6291        ken.
   6292 
   6293        By default, the quantifiers are "greedy", that is, they match  as  much
   6294        as  possible  (up  to  the  maximum number of permitted times), without
   6295        causing the rest of the pattern to fail. The classic example  of  where
   6296        this gives problems is in trying to match comments in C programs. These
   6297        appear between /* and */ and within the comment,  individual  *  and  /
   6298        characters  may  appear. An attempt to match C comments by applying the
   6299        pattern
   6300 
   6301          /\*.*\*/
   6302 
   6303        to the string
   6304 
   6305          /* first comment */  not comment  /* second comment */
   6306 
   6307        fails, because it matches the entire string owing to the greediness  of
   6308        the .*  item.
   6309 
   6310        However,  if  a quantifier is followed by a question mark, it ceases to
   6311        be greedy, and instead matches the minimum number of times possible, so
   6312        the pattern
   6313 
   6314          /\*.*?\*/
   6315 
   6316        does  the  right  thing with the C comments. The meaning of the various
   6317        quantifiers is not otherwise changed,  just  the  preferred  number  of
   6318        matches.   Do  not  confuse this use of question mark with its use as a
   6319        quantifier in its own right. Because it has two uses, it can  sometimes
   6320        appear doubled, as in
   6321 
   6322          \d??\d
   6323 
   6324        which matches one digit by preference, but can match two if that is the
   6325        only way the rest of the pattern matches.
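
               The difference between greedy and minimizing repetition can be
               seen with the C comment example. This is a sketch using
               Python's re module, which behaves the same way for these
               particular patterns; it is not a PCRE API example.

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy .* runs on to the last "*/" in the subject.
greedy = re.findall(r"/\*.*\*/", subject)

# Minimizing .*? stops at the first "*/" after each "/*".
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the whole subject must match.
preferred = re.match(r"\d??\d", "12").group(0)
forced = re.fullmatch(r"\d??\d", "12").group(0)
```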
   6326 
   6327        If the PCRE_UNGREEDY option is set (an option that is not available  in
   6328        Perl),  the  quantifiers are not greedy by default, but individual ones
   6329        can be made greedy by following them with a  question  mark.  In  other
   6330        words, it inverts the default behaviour.
   6331 
   6332        When  a  parenthesized  subpattern  is quantified with a minimum repeat
   6333        count that is greater than 1 or with a limited maximum, more memory  is
   6334        required  for  the  compiled  pattern, in proportion to the size of the
   6335        minimum or maximum.
   6336 
   6337        If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
   6338        alent  to  Perl's  /s) is set, thus allowing the dot to match newlines,
   6339        the pattern is implicitly anchored, because whatever  follows  will  be
   6340        tried  against every character position in the subject string, so there
   6341        is no point in retrying the overall match at  any  position  after  the
   6342        first.  PCRE  normally treats such a pattern as though it were preceded
   6343        by \A.
   6344 
   6345        In cases where it is known that the subject  string  contains  no  new-
   6346        lines,  it  is  worth setting PCRE_DOTALL in order to obtain this opti-
   6347        mization, or alternatively using ^ to indicate anchoring explicitly.
   6348 
   6349        However, there are some cases where the optimization  cannot  be  used.
   6350        When .*  is inside capturing parentheses that are the subject of a back
   6351        reference elsewhere in the pattern, a match at the start may fail where
   6352        a later one succeeds. Consider, for example:
   6353 
   6354          (.*)abc\1
   6355 
   6356        If  the subject is "xyz123abc123" the match point is the fourth charac-
   6357        ter. For this reason, such a pattern is not implicitly anchored.
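
               The match point in this example can be checked with a short
               sketch. It uses Python's re module, so it shows only where the
               match lands and says nothing about whether any anchoring
               optimization is applied internally.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

start = m.start()       # zero-based offset, so 3 means the fourth character
captured = m.group(1)   # the text that \1 had to repeat
whole = m.group(0)
```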
   6358 
   6359        Another case where implicit anchoring is not applied is when the  lead-
   6360        ing  .* is inside an atomic group. Once again, a match at the start may
   6361        fail where a later one succeeds. Consider this pattern:
   6362 
   6363          (?>.*?a)b
   6364 
   6365        It matches "ab" in the subject "aab". The use of the backtracking  con-
   6366        trol verbs (*PRUNE) and (*SKIP) also disables this optimization.
   6367 
   6368        When a capturing subpattern is repeated, the value captured is the sub-
   6369        string that matched the final iteration. For example, after
   6370 
   6371          (tweedle[dume]{3}\s*)+
   6372 
   6373        has matched "tweedledum tweedledee" the value of the captured substring
   6374        is  "tweedledee".  However,  if there are nested capturing subpatterns,
   6375        the corresponding captured values may have been set in previous  itera-
   6376        tions. For example, after
   6377 
   6378          /(a|(b))+/
   6379 
   6380        matches "aba" the value of the second captured substring is "b".
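
               Both examples can be reproduced with the sketch below. It is
               written for Python's re module, which keeps captured values in
               the same way for these two patterns.

```python
import re

# The outer group is overwritten on each iteration; the last one wins.
m1 = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m1.group(1)

# The final iteration matches "a", so group 2 keeps the "b" captured
# during the previous iteration.
m2 = re.match(r"(a|(b))+", "aba")
outer = m2.group(1)
nested = m2.group(2)
```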
   6381 
   6382 
   6383 ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
   6384 
   6385        With  both  maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
   6386        repetition, failure of what follows normally causes the  repeated  item
   6387        to  be  re-evaluated to see if a different number of repeats allows the
   6388        rest of the pattern to match. Sometimes it is useful to  prevent  this,
   6389        either to change the nature of the match, or to cause it to fail earlier
   6390        than it otherwise might, when the author of the pattern knows there  is
   6391        no point in carrying on.
   6392 
   6393        Consider,  for  example, the pattern \d+foo when applied to the subject
   6394        line
   6395 
   6396          123456bar
   6397 
   6398        After matching all 6 digits and then failing to match "foo", the normal
   6399        action  of  the matcher is to try again with only 5 digits matching the
   6400        \d+ item, and then with  4,  and  so  on,  before  ultimately  failing.
   6401        "Atomic  grouping"  (a  term taken from Jeffrey Friedl's book) provides
   6402        the means for specifying that once a subpattern has matched, it is  not
   6403        to be re-evaluated in this way.
   6404 
   6405        If  we  use atomic grouping for the previous example, the matcher gives
   6406        up immediately on failing to match "foo" the first time.  The  notation
   6407        is a kind of special parenthesis, starting with (?> as in this example:
   6408 
   6409          (?>\d+)foo
   6410 
   6411        This  kind  of  parenthesis "locks up" the  part of the pattern it con-
   6412        tains once it has matched, and a failure further into  the  pattern  is
   6413        prevented  from  backtracking into it. Backtracking past it to previous
   6414        items, however, works as normal.
   6415 
   6416        An alternative description is that a subpattern of  this  type  matches
   6417        the  string  of  characters  that an identical standalone pattern would
   6418        match, if anchored at the current point in the subject string.
   6419 
   6420        Atomic grouping subpatterns are not capturing subpatterns. Simple cases
   6421        such as the above example can be thought of as a maximizing repeat that
   6422        must swallow everything it can. So, while both \d+ and  \d+?  are  pre-
   6423        pared  to  adjust  the number of digits they match in order to make the
   6424        rest of the pattern match, (?>\d+) can only match an entire sequence of
   6425        digits.
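
               The following sketch shows an atomic group refusing to give
               back digits. It uses Python's re module, which accepts (?>...)
               only from version 3.11; the fallback for older versions is an
               assumed emulation (a lookahead plus a back reference), not the
               same construct. The pattern \d+5, the subject "12345" and the
               group name "run" are chosen for the illustration and do not
               come from the text above.

```python
import re

# Ordinary repetition: \d+ can give back a digit so that "5" matches.
plain = re.compile(r"\d+5")

try:
    # Native atomic group; accepted by Python 3.11 and later.
    atomic = re.compile(r"(?>\d+)5")
except re.error:
    # Assumed fallback for older versions: the lookahead captures the
    # greedy run, and the back reference consumes it with no backtracking.
    atomic = re.compile(r"(?=(?P<run>\d+))(?P=run)5")

plain_result = plain.match("12345")
atomic_result = atomic.match("12345")
```

               The plain pattern matches because \d+ backs off to "1234". The
               atomic version swallows all five digits, finds nothing left
               for the "5", and fails.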
   6426 
   6427        Atomic  groups in general can of course contain arbitrarily complicated
   6428        subpatterns, and can be nested. However, when  the  subpattern  for  an
   6429        atomic group is just a single repeated item, as in the example above, a
   6430        simpler notation, called a "possessive quantifier", can be used.  This
   6431        consists  of  an  additional  + character following a quantifier. Using
   6432        this notation, the previous example can be rewritten as
   6433 
   6434          \d++foo
   6435 
   6436        Note that a possessive quantifier can be used with an entire group, for
   6437        example:
   6438 
   6439          (abc|xyz){2,3}+
   6440 
   6441        Possessive   quantifiers   are   always  greedy;  the  setting  of  the
   6442        PCRE_UNGREEDY option is ignored. They are a convenient notation for the
   6443        simpler  forms  of atomic group. However, there is no difference in the
   6444        meaning of a possessive quantifier and  the  equivalent  atomic  group,
   6445        though  there  may  be a performance difference; possessive quantifiers
   6446        should be slightly faster.
   6447 
   6448        The possessive quantifier syntax is an extension to the Perl  5.8  syn-
   6449        tax.   Jeffrey  Friedl  originated the idea (and the name) in the first
   6450        edition of his book. Mike McCloskey liked it, so implemented it when he
   6451        built  Sun's Java package, and PCRE copied it from there. It ultimately
   6452        found its way into Perl at release 5.10.
   6453 
   6454        PCRE has an optimization that automatically "possessifies" certain sim-
   6455        ple  pattern  constructs.  For  example, the sequence A+B is treated as
   6456        A++B because there is no point in backtracking into a sequence  of  A's
   6457        when B must follow.
   6458 
   6459        When  a  pattern  contains an unlimited repeat inside a subpattern that
   6460        can itself be repeated an unlimited number of  times,  the  use  of  an
   6461        atomic  group  is  the  only way to avoid some failing matches taking a
   6462        very long time indeed. The pattern
   6463 
   6464          (\D+|<\d+>)*[!?]
   6465 
   6466        matches an unlimited number of substrings that either consist  of  non-
   6467        digits,  or  digits  enclosed in <>, followed by either ! or ?. When it
   6468        matches, it runs quickly. However, if it is applied to
   6469 
   6470          aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
   6471 
   6472        it takes a long time before reporting  failure.  This  is  because  the
   6473        string  can be divided between the internal \D+ repeat and the external
   6474        * repeat in a large number of ways, and all  have  to  be  tried.  (The
   6475        example  uses  [!?]  rather than a single character at the end, because
   6476        both PCRE and Perl have an optimization that allows  for  fast  failure
   6477        when  a single character is used. They remember the last single charac-
   6478        ter that is required for a match, and fail early if it is  not  present
   6479        in  the  string.)  If  the pattern is changed so that it uses an atomic
   6480        group, like this:
   6481 
   6482          ((?>\D+)|<\d+>)*[!?]
   6483 
   6484        sequences of non-digits cannot be broken, and failure happens quickly.
   6485 
   6486 
   6487 BACK REFERENCES
   6488 
   6489        Outside a character class, a backslash followed by a digit greater than
   6490        0 (and possibly further digits) is a back reference to a capturing sub-
   6491        pattern earlier (that is, to its left) in the pattern,  provided  there
   6492        have been that many previous capturing left parentheses.
   6493 
   6494        However, if the decimal number following the backslash is less than 10,
   6495        it is always taken as a back reference, and causes  an  error  only  if
   6496        there  are  not that many capturing left parentheses in the entire pat-
   6497        tern. In other words, the parentheses that are referenced need  not  be
   6498        to  the left of the reference for numbers less than 10. A "forward back
   6499        reference" of this type can make sense when a  repetition  is  involved
   6500        and  the  subpattern to the right has participated in an earlier itera-
   6501        tion.
   6502 
   6503        It is not possible to have a numerical "forward back  reference"  to  a
   6504        subpattern  whose  number  is  10  or  more using this syntax because a
   6505        sequence such as \50 is interpreted as a character  defined  in  octal.
   6506        See the subsection entitled "Non-printing characters" above for further
   6507        details of the handling of digits following a backslash.  There  is  no
   6508        such  problem  when named parentheses are used. A back reference to any
   6509        subpattern is possible using named parentheses (see below).
   6510 
   6511        Another way of avoiding the ambiguity inherent in  the  use  of  digits
   6512        following  a  backslash  is  to use the \g escape sequence. This escape
   6513        must be followed by an unsigned number or a negative number, optionally
   6514        enclosed in braces. These examples are all identical:
   6515 
   6516          (ring), \1
   6517          (ring), \g1
   6518          (ring), \g{1}
   6519 
   6520        An  unsigned number specifies an absolute reference without the ambigu-
   6521        ity that is present in the older syntax. It is also useful when literal
   6522        digits follow the reference. A negative number is a relative reference.
   6523        Consider this example:
   6524 
   6525          (abc(def)ghi)\g{-1}
   6526 
   6527        The sequence \g{-1} is a reference to the most recently started captur-
   6528        ing subpattern before \g, that is, it is equivalent to \2 in this exam-
   6529        ple.  Similarly, \g{-2} would be equivalent to \1. The use of  relative
   6530        references  can  be helpful in long patterns, and also in patterns that
   6531        are created by  joining  together  fragments  that  contain  references
   6532        within themselves.
   6533 
   6534        A  back  reference matches whatever actually matched the capturing sub-
   6535        pattern in the current subject string, rather  than  anything  matching
   6536        the subpattern itself (see "Subpatterns as subroutines" below for a way
   6537        of doing that). So the pattern
   6538 
   6539          (sens|respons)e and \1ibility
   6540 
   6541        matches "sense and sensibility" and "response and responsibility",  but
   6542        not  "sense and responsibility". If caseful matching is in force at the
   6543        time of the back reference, the case of letters is relevant. For  exam-
   6544        ple,
   6545 
   6546          ((?i)rah)\s+\1
   6547 
   6548        matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the
   6549        original capturing subpattern is matched caselessly.
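
               The first example can be checked with the sketch below, which
               uses Python's re module. The caseless example is left out
               because Python's re does not accept an inline (?i) in the
               middle of a pattern.

```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]

# \1 must repeat the text that group 1 actually captured.
results = [pattern.fullmatch(s) is not None for s in subjects]
```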
   6550 
   6551        There are several different ways of writing back  references  to  named
   6552        subpatterns.  The  .NET syntax \k{name} and the Perl syntax \k<name> or
   6553        \k'name' are supported, as is the Python syntax (?P=name). Perl  5.10's
   6554        unified back reference syntax, in which \g can be used for both numeric
   6555        and named references, is also supported. We  could  rewrite  the  above
   6556        example in any of the following ways:
   6557 
   6558          (?<p1>(?i)rah)\s+\k<p1>
   6559          (?'p1'(?i)rah)\s+\k{p1}
   6560          (?P<p1>(?i)rah)\s+(?P=p1)
   6561          (?<p1>(?i)rah)\s+\g{p1}
   6562 
   6563        A  subpattern  that  is  referenced  by  name may appear in the pattern
   6564        before or after the reference.
   6565 
   6566        There may be more than one back reference to the same subpattern. If  a
   6567        subpattern  has  not actually been used in a particular match, any back
   6568        references to it always fail by default. For example, the pattern
   6569 
   6570          (a|(bc))\2
   6571 
   6572        always fails if it starts to match "a" rather than  "bc".  However,  if
   6573        the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer-
   6574        ence to an unset value matches an empty string.
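
               The default behaviour can be seen in the sketch below. It uses
               Python's re module, which also fails a back reference to an
               unset group; the effect of PCRE_JAVASCRIPT_COMPAT is not
               shown.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 is set to "bc", so \2 can match the second "bc".
set_group = pattern.match("bcbc")

# Group 2 never takes part when "a" is matched, so \2 fails.
unset_group = pattern.match("aa")
```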
   6575 
   6576        Because there may be many capturing parentheses in a pattern, all  dig-
   6577        its  following a backslash are taken as part of a potential back refer-
   6578        ence number.  If the pattern continues with  a  digit  character,  some
   6579        delimiter  must  be  used  to  terminate  the  back  reference.  If the
   6580        PCRE_EXTENDED option is set, this can be white  space.  Otherwise,  the
   6581        \g{ syntax or an empty comment (see "Comments" below) can be used.
   6582 
   6583    Recursive back references
   6584 
   6585        A  back reference that occurs inside the parentheses to which it refers
   6586        fails when the subpattern is first used, so, for example,  (a\1)  never
   6587        matches.   However,  such references can be useful inside repeated sub-
   6588        patterns. For example, the pattern
   6589 
   6590          (a|b\1)+
   6591 
   6592        matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
   6593        ation  of  the  subpattern,  the  back  reference matches the character
   6594        string corresponding to the previous iteration. In order  for  this  to
   6595        work,  the  pattern must be such that the first iteration does not need
   6596        to match the back reference. This can be done using alternation, as  in
   6597        the example above, or by a quantifier with a minimum of zero.
   6598 
   6599        Back  references of this type cause the group that they reference to be
   6600        treated as an atomic group.  Once the whole group has been  matched,  a
   6601        subsequent  matching  failure cannot cause backtracking into the middle
   6602        of the group.
   6603 
   6604 
   6605 ASSERTIONS
   6606 
   6607        An assertion is a test on the characters  following  or  preceding  the
   6608        current  matching  point that does not actually consume any characters.
   6609        The simple assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are
   6610        described above.
   6611 
   6612        More  complicated  assertions  are  coded as subpatterns. There are two
   6613        kinds: those that look ahead of the current  position  in  the  subject
   6614        string,  and  those  that  look  behind  it. An assertion subpattern is
   6615        matched in the normal way, except that it does not  cause  the  current
   6616        matching position to be changed.
   6617 
   6618        Assertion  subpatterns are not capturing subpatterns. If such an asser-
   6619        tion contains capturing subpatterns within it, these  are  counted  for
   6620        the  purposes  of numbering the capturing subpatterns in the whole pat-
   6621        tern. However, substring capturing is carried  out  only  for  positive
   6622        assertions. (Perl sometimes, but not always, does do capturing in nega-
   6623        tive assertions.)
   6624 
   6625        For compatibility with Perl, assertion  subpatterns  may  be  repeated;
   6626        though  it  makes  no sense to assert the same thing several times, the
   6627        side effect of capturing parentheses may  occasionally  be  useful.  In
   6628        practice, there are only three cases:
   6629 
   6630        (1)  If  the  quantifier  is  {0}, the assertion is never obeyed during
   6631        matching.  However, it may  contain  internal  capturing  parenthesized
   6632        groups that are called from elsewhere via the subroutine mechanism.
   6633 
   6634        (2) If the quantifier is {0,n} where n is greater than zero, it is treated
   6635        as if it were {0,1}. At run time, the rest  of  the  pattern  match  is
   6636        tried with and without the assertion, the order depending on the greed-
   6637        iness of the quantifier.
   6638 
   6639        (3) If the minimum repetition is greater than zero, the  quantifier  is
   6640        ignored.   The  assertion  is  obeyed just once when encountered during
   6641        matching.
   6642 
   6643    Lookahead assertions
   6644 
   6645        Lookahead assertions start with (?= for positive assertions and (?! for
   6646        negative assertions. For example,
   6647 
   6648          \w+(?=;)
   6649 
   6650        matches  a word followed by a semicolon, but does not include the semi-
   6651        colon in the match, and
   6652 
   6653          foo(?!bar)
   6654 
   6655        matches any occurrence of "foo" that is not  followed  by  "bar".  Note
   6656        that the apparently similar pattern
   6657 
   6658          (?!foo)bar
   6659 
   6660        does  not  find  an  occurrence  of "bar" that is preceded by something
   6661        other than "foo"; it finds any occurrence of "bar" whatsoever,  because
   6662        the assertion (?!foo) is always true when the next three characters are
   6663        "bar". A lookbehind assertion is needed to achieve the other effect.
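
               The three lookahead patterns can be compared with the sketch
               below. It uses Python's re module, which treats these
               assertions in the same way; the subject strings are invented
               for the example.

```python
import re

# The semicolon is required but is not part of the match.
word = re.search(r"\w+(?=;)", "x = value;").group(0)

# Only the "foo" that is not followed by "bar" is reported.
not_followed = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar still finds "bar" right after "foo": at that position the
# next three characters are "bar", so the assertion is true.
misplaced = re.search(r"(?!foo)bar", "foobar").start()
```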
   6664 
   6665        If you want to force a matching failure at some point in a pattern, the
   6666        most  convenient  way  to  do  it  is with (?!) because an empty string
   6667        always matches, so an assertion that requires there not to be an  empty
   6668        string must always fail.  The backtracking control verb (*FAIL) or (*F)
   6669        is a synonym for (?!).
   6670 
   6671    Lookbehind assertions
   6672 
   6673        Lookbehind assertions start with (?<= for positive assertions and  (?<!
   6674        for negative assertions. For example,
   6675 
   6676          (?<!foo)bar
   6677 
   6678        does  find  an  occurrence  of "bar" that is not preceded by "foo". The
   6679        contents of a lookbehind assertion are restricted  such  that  all  the
   6680        strings it matches must have a fixed length. However, if there are sev-
   6681        eral top-level alternatives, they do not all  have  to  have  the  same
   6682        fixed length. Thus
   6683 
   6684          (?<=bullock|donkey)
   6685 
   6686        is permitted, but
   6687 
   6688          (?<!dogs?|cats?)
   6689 
   6690        causes  an  error at compile time. Branches that match different length
   6691        strings are permitted only at the top level of a lookbehind  assertion.
   6692        This is an extension compared with Perl, which requires all branches to
   6693        match the same length of string. An assertion such as
   6694 
   6695          (?<=ab(c|de))
   6696 
   6697        is not permitted, because its single top-level  branch  can  match  two
   6698        different lengths, but it is acceptable to PCRE if rewritten to use two
   6699        top-level branches:
   6700 
   6701          (?<=abc|abde)
   6702 
   6703        In some cases, the escape sequence \K (see above) can be  used  instead
   6704        of a lookbehind assertion to get round the fixed-length restriction.
   6705 
   6706        The  implementation  of lookbehind assertions is, for each alternative,
   6707        to temporarily move the current position back by the fixed  length  and
   6708        then try to match. If there are insufficient characters before the cur-
   6709        rent position, the assertion fails.
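
               A minimal sketch of lookbehind behaviour follows, using
               Python's re module. That engine is stricter than PCRE about
               alternatives of different lengths, so only single fixed-length
               lookbehinds are shown; the subject strings are invented for
               the example.

```python
import re

# Only the "bar" that is not preceded by "foo" is reported.
starts = [m.start() for m in re.finditer(r"(?<!foo)bar", "foobar bar")]

# With fewer than three characters before "d", the lookbehind cannot be
# satisfied, so the positive assertion fails.
too_few = re.search(r"(?<=abc)d", "d")
enough = re.search(r"(?<=abc)d", "abcd").start()
```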
   6710 
   6711        In a UTF mode, PCRE does not allow the \C escape (which matches a  sin-
   6712        gle  data  unit even in a UTF mode) to appear in lookbehind assertions,
   6713        because it makes it impossible to calculate the length of  the  lookbe-
   6714        hind.  The \X and \R escapes, which can match different numbers of data
   6715        units, are also not permitted.
   6716 
   6717        "Subroutine" calls (see below) such as (?2) or (?&X) are  permitted  in
   6718        lookbehinds,  as  long as the subpattern matches a fixed-length string.
   6719        Recursion, however, is not supported.
   6720 
   6721        Possessive quantifiers can  be  used  in  conjunction  with  lookbehind
   6722        assertions to specify efficient matching of fixed-length strings at the
   6723        end of subject strings. Consider a simple pattern such as
   6724 
   6725          abcd$
   6726 
   6727        when applied to a long string that does  not  match.  Because  matching
   6728        proceeds from left to right, PCRE will look for each "a" in the subject
   6729        and then see if what follows matches the rest of the  pattern.  If  the
   6730        pattern is specified as
   6731 
   6732          ^.*abcd$
   6733 
   6734        the  initial .* matches the entire string at first, but when this fails
   6735        (because there is no following "a"), it backtracks to match all but the
   6736        last  character,  then all but the last two characters, and so on. Once
   6737        again the search for "a" covers the entire string, from right to  left,
   6738        so we are no better off. However, if the pattern is written as
   6739 
   6740          ^.*+(?<=abcd)
   6741 
   6742        there  can  be  no backtracking for the .*+ item; it can match only the
   6743        entire string. The subsequent lookbehind assertion does a  single  test
   6744        on  the last four characters. If it fails, the match fails immediately.
   6745        For long strings, this approach makes a significant difference  to  the
   6746        processing time.
   6747 
   6748    Using multiple assertions
   6749 
   6750        Several assertions (of any sort) may occur in succession. For example,
   6751 
   6752          (?<=\d{3})(?<!999)foo
   6753 
   6754        matches  "foo" preceded by three digits that are not "999". Notice that
   6755        each of the assertions is applied independently at the  same  point  in
   6756        the  subject  string.  First  there  is a check that the previous three
   6757        characters are all digits, and then there is  a  check  that  the  same
   6758        three characters are not "999".  This pattern does not match "foo" pre-
   6759        ceded by six characters, the first three of which are  digits  and  the
   6760        last  three  of  which  are  not  "999".  For example, it doesn't match
   6761        "123abcfoo". A pattern to do that is
   6762 
   6763          (?<=\d{3}...)(?<!999)foo
   6764 
   6765        This time the first assertion looks at the  preceding  six  characters,
   6766        checking that the first three are digits, and then the second assertion
   6767        checks that the preceding three characters are not "999".
   6768 
   6769        Assertions can be nested in any combination. For example,
   6770 
   6771          (?<=(?<!foo)bar)baz
   6772 
   6773        matches an occurrence of "baz" that is preceded by "bar" which in  turn
   6774        is not preceded by "foo", while
   6775 
   6776          (?<=\d{3}(?!999)...)foo
   6777 
   6778        is  another pattern that matches "foo" preceded by three digits and any
   6779        three characters that are not "999".
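
               The behaviour of the patterns in this section can be checked with a
               short Python script. Python's re module is used purely as a convenient
               stand-in for PCRE, on the assumption that it handles these particular
               fixed-length assertions in the same way; the variable names are
               illustrative.

```python
import re

# Two independent assertions applied at the same point in the subject.
three_back = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion now looks six characters back, the second still three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

# A negative lookbehind nested inside a positive one.
nested = re.compile(r"(?<=(?<!foo)bar)baz")

for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
    print(subject,
          bool(three_back.search(subject)),
          bool(six_back.search(subject)))

for subject in ("xyzbarbaz", "foobarbaz"):
    print(subject, bool(nested.search(subject)))
```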
   6780 
   6781 
   6782 CONDITIONAL SUBPATTERNS
   6783 
   6784        It is possible to cause the matching process to obey a subpattern  con-
   6785        ditionally  or to choose between two alternative subpatterns, depending
   6786        on the result of an assertion, or whether a specific capturing  subpat-
   6787        tern  has  already  been matched. The two possible forms of conditional
   6788        subpattern are:
   6789 
   6790          (?(condition)yes-pattern)
   6791          (?(condition)yes-pattern|no-pattern)
   6792 
   6793        If the condition is satisfied, the yes-pattern is used;  otherwise  the
   6794        no-pattern  (if  present)  is used. If there are more than two alterna-
   6795        tives in the subpattern, a compile-time error occurs. Each of  the  two
   6796        alternatives may itself contain nested subpatterns of any form, includ-
   6797        ing  conditional  subpatterns;  the  restriction  to  two  alternatives
   6798        applies only at the level of the condition. This pattern fragment is an
   6799        example where the alternatives are complex:
   6800 
   6801          (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
   6802 
   6803 
   6804        There are four kinds of condition: references  to  subpatterns,  refer-
   6805        ences to recursion, a pseudo-condition called DEFINE, and assertions.
   6806 
   6807    Checking for a used subpattern by number
   6808 
   6809        If  the  text between the parentheses consists of a sequence of digits,
   6810        the condition is true if a capturing subpattern of that number has pre-
   6811        viously  matched.  If  there is more than one capturing subpattern with
   6812        the same number (see the earlier  section  about  duplicate  subpattern
   6813        numbers),  the condition is true if any of them have matched. An alter-
   6814        native notation is to precede the digits with a plus or minus sign.  In
   6815        this  case, the subpattern number is relative rather than absolute. The
   6816        most recently opened parentheses can be referenced by (?(-1), the  next
   6817        most  recent  by (?(-2), and so on. Inside loops it can also make sense
   6818        to refer to subsequent groups. The next parentheses to be opened can be
   6819        referenced  as (?(+1), and so on. (The value zero in any of these forms
   6820        is not used; it provokes a compile-time error.)
   6821 
   6822        Consider the following pattern, which  contains  non-significant  white
   6823        space to make it more readable (assume the PCRE_EXTENDED option) and to
   6824        divide it into three parts for ease of discussion:
   6825 
   6826          ( \( )?    [^()]+    (?(1) \) )
   6827 
   6828        The first part matches an optional opening  parenthesis,  and  if  that
   6829        character is present, sets it as the first captured substring. The sec-
   6830        ond part matches one or more characters that are not  parentheses.  The
   6831        third  part  is  a conditional subpattern that tests whether or not the
   6832        first set of parentheses matched. If they did, that is, if the  subject
   6833        started  with an opening parenthesis, the condition is true, and so the
   6834        yes-pattern is executed and a closing parenthesis is  required.  Other-
   6835        wise,  since no-pattern is not present, the subpattern matches nothing.
   6836        In other words, this pattern matches  a  sequence  of  non-parentheses,
   6837        optionally enclosed in parentheses.
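
               As a sketch, the example pattern can be tried out with Python's re
               module, which is assumed here to treat a numeric condition such as
               (?(1)...) in the same way as PCRE. re.VERBOSE stands in for
               PCRE_EXTENDED, and fullmatch() is used so that the whole subject must
               be consumed.

```python
import re

# re.VERBOSE plays the role of PCRE_EXTENDED: white space in the pattern is ignored.
optional_parens = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)

for subject in ("(abc)", "abc", "(abc", "abc)"):
    # A closing parenthesis is demanded only when group 1 took part in the match.
    print(subject, bool(optional_parens.fullmatch(subject)))
```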
   6838 
   6839        If  you  were  embedding  this pattern in a larger one, you could use a
   6840        relative reference:
   6841 
   6842          ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
   6843 
   6844        This makes the fragment independent of the parentheses  in  the  larger
   6845        pattern.
   6846 
   6847    Checking for a used subpattern by name
   6848 
   6849        Perl  uses  the  syntax  (?(<name>)...) or (?('name')...) to test for a
   6850        used subpattern by name. For compatibility  with  earlier  versions  of
   6851        PCRE,  which  had this facility before Perl, the syntax (?(name)...) is
   6852        also recognized.
   6853 
   6854        Rewriting the above example to use a named subpattern gives this:
   6855 
   6856          (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
   6857 
   6858        If the name used in a condition of this kind is a duplicate,  the  test
   6859        is  applied to all subpatterns of the same name, and is true if any one
   6860        of them has matched.
   6861 
   6862    Checking for pattern recursion
   6863 
   6864        If the condition is the string (R), and there is no subpattern with the
   6865        name  R, the condition is true if a recursive call to the whole pattern
   6866        or any subpattern has been made. If digits or a name preceded by amper-
   6867        sand follow the letter R, for example:
   6868 
   6869          (?(R3)...) or (?(R&name)...)
   6870 
   6871        the condition is true if the most recent recursion is into a subpattern
   6872        whose number or name is given. This condition does not check the entire
   6873        recursion  stack.  If  the  name  used in a condition of this kind is a
   6874        duplicate, the test is applied to all subpatterns of the same name, and
   6875        is true if any one of them is the most recent recursion.
   6876 
   6877        At  "top  level",  all  these recursion test conditions are false.  The
   6878        syntax for recursive patterns is described below.
   6879 
   6880    Defining subpatterns for use by reference only
   6881 
   6882        If the condition is the string (DEFINE), and  there  is  no  subpattern
   6883        with  the  name  DEFINE,  the  condition is always false. In this case,
   6884        there may be only one alternative  in  the  subpattern.  It  is  always
   6885        skipped  if  control  reaches  this  point  in the pattern; the idea of
   6886        DEFINE is that it can be used to define subroutines that can be  refer-
   6887        enced  from elsewhere. (The use of subroutines is described below.) For
   6888        example, a pattern to match an IPv4 address  such  as  "192.168.23.245"
   6889        could be written like this (ignore white space and line breaks):
   6890 
   6891          (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
   6892          \b (?&byte) (\.(?&byte)){3} \b
   6893 
   6894        The  first  part  of the pattern is a DEFINE group inside which another
   6895        group named "byte" is defined. This matches an individual component  of
   6896        an  IPv4  address  (a number less than 256). When matching takes place,
   6897        this part of the pattern is skipped because DEFINE acts  like  a  false
   6898        condition.  The  rest of the pattern uses references to the named group
   6899        to match the four dot-separated components of an IPv4 address,  insist-
   6900        ing on a word boundary at each end.
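
               Python's re module has no DEFINE groups or (?&name) calls, so the
               following sketch approximates the idea by keeping the "byte" fragment
               in a variable and pasting it into the main pattern. The names BYTE and
               ipv4 are illustrative. Note one difference: a pasted fragment can be
               backtracked into, whereas a PCRE subroutine call is treated as an
               atomic group.

```python
import re

# Stand-in for (?(DEFINE) (?<byte> ...) ): the fragment is defined once, by name.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Stand-in for \b (?&byte) (\.(?&byte)){3} \b: the fragment is pasted in textually.
ipv4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")

for subject in ("192.168.23.245", "10.0.0.1", "256.1.1.1", "1.2.3"):
    print(subject, bool(ipv4.fullmatch(subject)))
```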
   6901 
   6902    Assertion conditions
   6903 
   6904        If  the  condition  is  not  in any of the above formats, it must be an
   6905        assertion.  This may be a positive or negative lookahead or  lookbehind
   6906        assertion.  Consider  this  pattern,  again  containing non-significant
   6907        white space, and with the two alternatives on the second line:
   6908 
   6909          (?(?=[^a-z]*[a-z])
   6910          \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
   6911 
   6912        The condition  is  a  positive  lookahead  assertion  that  matches  an
   6913        optional  sequence of non-letters followed by a letter. In other words,
   6914        it tests for the presence of at least one letter in the subject.  If  a
   6915        letter  is found, the subject is matched against the first alternative;
   6916        otherwise it is  matched  against  the  second.  This  pattern  matches
   6917        strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
   6918        letters and dd are digits.
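
               Python's re module does not accept an assertion as a condition, so the
               sketch below models one match attempt by hand: the body of the
               lookahead is tested first, and only the alternative that it selects is
               then tried. The function name match_date and the pos argument are
               hypothetical.

```python
import re

HAS_LETTER = re.compile(r"[^a-z]*[a-z]")             # body of the (?=...) condition
WITH_LETTERS = re.compile(r"\d{2}-[a-z]{3}-\d{2}")   # yes-pattern
DIGITS_ONLY = re.compile(r"\d{2}-\d{2}-\d{2}")       # no-pattern


def match_date(subject, pos=0):
    """Choose one alternative from the lookahead result, then try only that one."""
    chosen = WITH_LETTERS if HAS_LETTER.match(subject, pos) else DIGITS_ONLY
    return chosen.match(subject, pos)


for subject in ("12-abc-34", "12-34-56", "12-34-56 x"):
    print(repr(subject), bool(match_date(subject)))
```

               With the subject "12-34-56 x" the condition is true because of the
               trailing letter, so only the first alternative is tried and the attempt
               at offset 0 fails, even though the second alternative would have
               matched there.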
   6919 
   6920 
   6921 COMMENTS
   6922 
   6923        There are two ways of including comments in patterns that are processed
   6924        by PCRE. In both cases, the start of the comment must not be in a char-
   6925        acter class, nor in the middle of any other sequence of related charac-
   6926        ters  such  as  (?: or a subpattern name or number. The characters that
   6927        make up a comment play no part in the pattern matching.
   6928 
   6929        The sequence (?# marks the start of a comment that continues up to  the
   6930        next  closing parenthesis. Nested parentheses are not permitted. If the
   6931        PCRE_EXTENDED option is set, an unescaped # character also introduces a
   6932        comment,  which  in  this  case continues to immediately after the next
   6933        newline character or character sequence in the pattern.  Which  charac-
   6934        ters are interpreted as newlines is controlled by the options passed to
   6935        a compiling function or by a special sequence at the start of the  pat-
   6936        tern, as described in the section entitled "Newline conventions" above.
   6937        Note that the end of this type of comment is a literal newline sequence
   6938        in  the pattern; escape sequences that happen to represent a newline do
   6939        not count. For example, consider this  pattern  when  PCRE_EXTENDED  is
   6940        set, and the default newline convention is in force:
   6941 
   6942          abc #comment \n still comment
   6943 
   6944        On  encountering  the  # character, pcre_compile() skips along, looking
   6945        for a newline in the pattern. The sequence \n is still literal at  this
   6946        stage,  so  it does not terminate the comment. Only an actual character
   6947        with the code value 0x0a (the default newline) does so.
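
               The distinction between an escape sequence and a literal newline can
               be seen in the following Python sketch. It relies on the assumption
               that re.VERBOSE, used here in place of PCRE_EXTENDED, ends a # comment
               only at a literal newline, as PCRE does with its default newline
               convention.

```python
import re

# A raw string keeps \n as backslash + n, so the comment runs to the end of the pattern.
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# A non-raw "\n" puts a real newline (0x0a) in the pattern, which ends the comment.
literal = re.compile("abc #comment \n def", re.VERBOSE)

# The (?#...) form needs no option and ends at the next closing parenthesis.
inline = re.compile(r"abc(?# a comment)def")

print(bool(escaped.fullmatch("abc")))
print(bool(literal.fullmatch("abcdef")))
print(bool(inline.fullmatch("abcdef")))
```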
   6948 
   6949 
   6950 RECURSIVE PATTERNS
   6951 
   6952        Consider the problem of matching a string in parentheses, allowing  for
   6953        unlimited  nested  parentheses.  Without the use of recursion, the best
   6954        that can be done is to use a pattern that  matches  up  to  some  fixed
   6955        depth  of  nesting.  It  is not possible to handle an arbitrary nesting
   6956        depth.
   6957 
   6958        For some time, Perl has provided a facility that allows regular expres-
   6959        sions  to recurse (amongst other things). It does this by interpolating
   6960        Perl code in the expression at run time, and the code can refer to  the
   6961        expression itself. A Perl pattern using code interpolation to solve the
   6962        parentheses problem can be created like this:
   6963 
   6964          $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
   6965 
   6966        The (?p{...}) item interpolates Perl code at run time, and in this case
   6967        refers recursively to the pattern in which it appears.
   6968 
   6969        Obviously, PCRE cannot support the interpolation of Perl code. Instead,
   6970        it supports special syntax for recursion of  the  entire  pattern,  and
   6971        also  for  individual  subpattern  recursion. After its introduction in
   6972        PCRE and Python, this kind of  recursion  was  subsequently  introduced
   6973        into Perl at release 5.10.
   6974 
   6975        A  special  item  that consists of (? followed by a number greater than
   6976        zero and a closing parenthesis is a recursive subroutine  call  of  the
   6977        subpattern  of  the  given  number, provided that it occurs inside that
   6978        subpattern. (If not, it is a non-recursive subroutine  call,  which  is
   6979        described  in  the  next  section.)  The special item (?R) or (?0) is a
   6980        recursive call of the entire regular expression.
   6981 
   6982        This PCRE pattern solves the nested  parentheses  problem  (assume  the
   6983        PCRE_EXTENDED option is set so that white space is ignored):
   6984 
   6985          \( ( [^()]++ | (?R) )* \)
   6986 
   6987        First  it matches an opening parenthesis. Then it matches any number of
   6988        substrings which can either be a  sequence  of  non-parentheses,  or  a
   6989        recursive  match  of the pattern itself (that is, a correctly parenthe-
   6990        sized substring).  Finally there is a closing parenthesis. Note the use
   6991        of a possessive quantifier to avoid backtracking into sequences of non-
   6992        parentheses.
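
               To make the structure of this pattern concrete, here is a hand-written
               Python function that accepts the same strings when the match is
               anchored at a given offset. It is a sketch of what the pattern
               describes, not of how PCRE executes it, and the name match_parens is
               hypothetical. Each piece of the pattern is noted in a comment beside
               the code that plays its part.

```python
def match_parens(subject, pos=0):
    """Return the offset just past a balanced (...) starting at pos, else None."""
    if pos >= len(subject) or subject[pos] != "(":
        return None                               # \(
    pos += 1
    while True:                                   # ( ... | ... )*
        run_start = pos
        while pos < len(subject) and subject[pos] not in "()":
            pos += 1                              # [^()]++ takes the whole run
        if pos > run_start:
            continue
        inner_end = match_parens(subject, pos)    # (?R)
        if inner_end is None:
            break
        pos = inner_end
    if pos < len(subject) and subject[pos] == ")":
        return pos + 1                            # \)
    return None


for subject in ("(ab(cd)ef)", "()", "(a(b)", "abc"):
    print(subject, match_parens(subject))
```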
   6993 
   6994        If this were part of a larger pattern, you would not  want  to  recurse
   6995        the entire pattern, so instead you could use this:
   6996 
   6997          ( \( ( [^()]++ | (?1) )* \) )
   6998 
   6999        We  have  put the pattern into parentheses, and caused the recursion to
   7000        refer to them instead of the whole pattern.
   7001 
   7002        In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be
   7003        tricky.  This is made easier by the use of relative references. Instead
   7004        of (?1) in the pattern above you can write (?-2) to refer to the second
   7005        most  recently  opened  parentheses  preceding  the recursion. In other
   7006        words, a negative number counts capturing  parentheses  leftwards  from
   7007        the point at which it is encountered.
   7008 
   7009        It  is  also  possible  to refer to subsequently opened parentheses, by
   7010        writing references such as (?+2). However, these  cannot  be  recursive
   7011        because  the  reference  is  not inside the parentheses that are refer-
   7012        enced. They are always non-recursive subroutine calls, as described  in
   7013        the next section.
   7014 
   7015        An  alternative  approach is to use named parentheses instead. The Perl
   7016        syntax for this is (?&name); PCRE's earlier syntax  (?P>name)  is  also
   7017        supported. We could rewrite the above example as follows:
   7018 
   7019          (?<pn> \( ( [^()]++ | (?&pn) )* \) )
   7020 
   7021        If  there  is more than one subpattern with the same name, the earliest
   7022        one is used.
   7023 
   7024        This particular example pattern that we have been looking  at  contains
   7025        nested unlimited repeats, and so the use of a possessive quantifier for
   7026        matching strings of non-parentheses is important when applying the pat-
   7027        tern  to  strings  that do not match. For example, when this pattern is
   7028        applied to
   7029 
   7030          (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
   7031 
   7032        it yields "no match" quickly. However, if a  possessive  quantifier  is
   7033        not  used, the match runs for a very long time indeed because there are
   7034        so many different ways the + and * repeats can carve  up  the  subject,
   7035        and all have to be tested before failure can be reported.
   7036 
   7037        At  the  end  of a match, the values of capturing parentheses are those
   7038        from the outermost level. If you want to obtain intermediate values,  a
   7039        callout  function can be used (see below and the pcrecallout documenta-
   7040        tion). If the pattern above is matched against
   7041 
   7042          (ab(cd)ef)
   7043 
   7044        the value for the inner capturing parentheses  (numbered  2)  is  "ef",
   7045        which  is the last value taken on at the top level. If a capturing sub-
   7046        pattern is not matched at the top level, its final  captured  value  is
   7047        unset,  even  if  it was (temporarily) set at a deeper level during the
   7048        matching process.
   7049 
   7050        If there are more than 15 capturing parentheses in a pattern, PCRE  has
   7051        to  obtain extra memory to store data during a recursion, which it does
   7052        by using pcre_malloc, freeing it via pcre_free afterwards. If no memory
   7053        can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
   7054 
   7055        Do  not  confuse  the (?R) item with the condition (R), which tests for
   7056        recursion.  Consider this pattern, which matches text in  angle  brack-
   7057        ets,  allowing for arbitrary nesting. Only digits are allowed in nested
   7058        brackets (that is, when recursing), whereas any characters are  permit-
   7059        ted at the outer level.
   7060 
   7061          < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
   7062 
   7063        In  this  pattern, (?(R) is the start of a conditional subpattern, with
   7064        two different alternatives for the recursive and  non-recursive  cases.
   7065        The (?R) item is the actual recursive call.
   7066 
   7067    Differences in recursion processing between PCRE and Perl
   7068 
   7069        Recursion  processing  in PCRE differs from Perl in two important ways.
   7070        In PCRE (like Python, but unlike Perl), a recursive subpattern call  is
   7071        always treated as an atomic group. That is, once it has matched some of
   7072        the subject string, it is never re-entered, even if it contains untried
   7073        alternatives  and  there  is a subsequent matching failure. This can be
   7074        illustrated by the following pattern, which purports to match a  palin-
   7075        dromic  string  that contains an odd number of characters (for example,
   7076        "a", "aba", "abcba", "abcdcba"):
   7077 
   7078          ^(.|(.)(?1)\2)$
   7079 
   7080        The idea is that it either matches a single character, or two identical
   7081        characters  surrounding  a sub-palindrome. In Perl, this pattern works;
   7082        in PCRE it does not if the subject is  longer  than  three  characters.
   7083        Consider the subject string "abcba":
   7084 
   7085        At  the  top level, the first character is matched, but as it is not at
   7086        the end of the string, the first alternative fails; the second alterna-
   7087        tive is taken and the recursion kicks in. The recursive call to subpat-
   7088        tern 1 successfully matches the next character ("b").  (Note  that  the
   7089        beginning and end of line tests are not part of the recursion).
   7090 
   7091        Back  at  the top level, the next character ("c") is compared with what
   7092        subpattern 2 matched, which was "a". This fails. Because the  recursion
   7093        is  treated  as  an atomic group, there are now no backtracking points,
   7094        and so the entire match fails. (Perl is able, at  this  point,  to  re-
   7095        enter  the  recursion  and try the second alternative.) However, if the
   7096        pattern is written with the alternatives in the other order, things are
   7097        different:
   7098 
   7099          ^((.)(?1)\2|.)$
   7100 
   7101        This  time,  the recursing alternative is tried first, and continues to
   7102        recurse until it runs out of characters, at which point  the  recursion
   7103        fails.  But  this  time  we  do  have another alternative to try at the
   7104        higher level. That is the big difference:  in  the  previous  case  the
   7105        remaining alternative is at a deeper recursion level, which PCRE cannot
   7106        use.
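
               The effect of atomic recursion can be reproduced with a toy model in
               Python. The sketch below is not how PCRE works internally: it is a
               small generator-based matcher, hard-wired to the two palindrome
               patterns above, in which the hypothetical atomic flag decides whether
               a recursive call may be re-entered for further results (as in Perl) or
               gives up after its first result (as in PCRE). The recurse_first flag
               selects the second ordering of the alternatives.

```python
def group1(subject, pos, atomic, recurse_first):
    """Yield each offset at which group 1 could end when entered at pos."""

    def single():                       # the alternative .
        if pos < len(subject):
            yield pos + 1

    def wrapped():                      # the alternative (.)(?1)\2
        if pos < len(subject):
            first = subject[pos]        # what group 2 captures at this level
            for end in group1(subject, pos + 1, atomic, recurse_first):
                if end < len(subject) and subject[end] == first:
                    yield end + 1
                if atomic:
                    break               # never re-enter the recursive call

    branches = (wrapped, single) if recurse_first else (single, wrapped)
    for branch in branches:
        yield from branch()


def matches(subject, atomic, recurse_first=False):
    """Anchored match: some way of ending group 1 must reach the end of subject."""
    return any(end == len(subject)
               for end in group1(subject, 0, atomic, recurse_first))


print(matches("abcba", atomic=True))                       # first ordering, PCRE-like
print(matches("abcba", atomic=False))                      # first ordering, Perl-like
print(matches("abcba", atomic=True, recurse_first=True))   # second ordering, PCRE-like
```

               In the model, as in the text, the first ordering fails on "abcba" when
               recursion is atomic because the only untried alternative lies inside
               the recursive call.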
   7107 
   7108        To change the pattern so that it matches all palindromic  strings,  not
   7109        just  those  with an odd number of characters, it is tempting to change
   7110        the pattern to this:
   7111 
   7112          ^((.)(?1)\2|.?)$
   7113 
   7114        Again, this works in Perl, but not in PCRE, and for  the  same  reason.
   7115        When  a  deeper  recursion has matched a single character, it cannot be
   7116        entered again in order to match an empty string.  The  solution  is  to
   7117        separate  the two cases, and write out the odd and even cases as alter-
   7118        natives at the higher level:
   7119 
   7120          ^(?:((.)(?1)\2|)|((.)(?3)\4|.))$
   7121 
   7122        If you want to match typical palindromic phrases, the  pattern  has  to
   7123        ignore all non-word characters, which can be done like this:
   7124 
   7125          ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
   7126 
   7127        If run with the PCRE_CASELESS option, this pattern matches phrases such
   7128        as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
   7129        Perl.  Note the use of the possessive quantifier *+ to avoid backtrack-
   7130        ing into sequences of non-word characters. Without this, PCRE  takes  a
   7131        great  deal  longer  (ten  times or more) to match typical phrases, and
   7132        Perl takes so long that you think it has gone into a loop.
   7133 
   7134        WARNING: The palindrome-matching patterns above work only if  the  sub-
   7135        ject  string  does not start with a palindrome that is shorter than the
   7136        entire string.  For example, although "abcba" is correctly matched,  if
   7137        the  subject  is "ababa", PCRE finds the palindrome "aba" at the start,
   7138        then fails at top level because the end of the string does not  follow.
   7139        Once  again, it cannot jump back into the recursion to try other alter-
   7140        natives, so the entire match fails.
   7141 
   7142        The second way in which PCRE and Perl differ in  their  recursion  pro-
   7143        cessing  is in the handling of captured values. In Perl, when a subpat-
   7144        tern is called recursively or as a subroutine (see the  next  section),
   7145        it  has  no  access to any values that were captured outside the recur-
   7146        sion, whereas in PCRE these values can  be  referenced.  Consider  this
   7147        pattern:
   7148 
   7149          ^(.)(\1|a(?2))
   7150 
   7151        In  PCRE,  this  pattern matches "bab". The first capturing parentheses
   7152        match "b", then in the second group, when the back reference  \1  fails
   7153        to  match "b", the second alternative matches "a" and then recurses. In
   7154        the recursion, \1 does now match "b" and so the whole  match  succeeds.
   7155        In  Perl,  the pattern fails to match because inside the recursive call
   7156        \1 cannot access the externally set value.
   7157 
   7158 
   7159 SUBPATTERNS AS SUBROUTINES
   7160 
   7161        If the syntax for a recursive subpattern call (either by number  or  by
   7162        name)  is  used outside the parentheses to which it refers, it operates
   7163        like a subroutine in a programming language. The called subpattern  may
   7164        be  defined  before or after the reference. A numbered reference can be
   7165        absolute or relative, as in these examples:
   7166 
   7167          (...(absolute)...)...(?2)...
   7168          (...(relative)...)...(?-1)...
   7169          (...(?+1)...(relative)...
   7170 
   7171        An earlier example pointed out that the pattern
   7172 
   7173          (sens|respons)e and \1ibility
   7174 
   7175        matches "sense and sensibility" and "response and responsibility",  but
   7176        not "sense and responsibility". If instead the pattern
   7177 
   7178          (sens|respons)e and (?1)ibility
   7179 
   7180        is  used, it does match "sense and responsibility" as well as the other
   7181        two strings. Another example is  given  in  the  discussion  of  DEFINE
   7182        above.
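
               The difference between the two patterns can be sketched in Python. The
               re module has no (?1) syntax, so the sketch assumes that repeating the
               text of the group is an acceptable stand-in for a non-recursive
               subroutine call; it does not model atomic grouping. The names are
               illustrative.

```python
import re

STEM = r"sens|respons"   # the body of capturing group 1

# \1 must repeat the text that group 1 actually matched.
backreference = re.compile(rf"({STEM})e and \1ibility")

# Stand-in for (?1): the group's pattern is applied afresh at the second position.
subroutine_like = re.compile(rf"({STEM})e and (?:{STEM})ibility")

for subject in ("sense and sensibility",
                "response and responsibility",
                "sense and responsibility"):
    print(subject,
          bool(backreference.fullmatch(subject)),
          bool(subroutine_like.fullmatch(subject)))
```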
   7183 
   7184        All  subroutine  calls, whether recursive or not, are always treated as
   7185        atomic groups. That is, once a subroutine has matched some of the  sub-
   7186        ject string, it is never re-entered, even if it contains untried alter-
   7187        natives and there is  a  subsequent  matching  failure.  Any  capturing
   7188        parentheses  that  are  set  during the subroutine call revert to their
   7189        previous values afterwards.
   7190 
   7191        Processing options such as case-independence are fixed when  a  subpat-
   7192        tern  is defined, so if it is used as a subroutine, such options cannot
   7193        be changed for different calls. For example, consider this pattern:
   7194 
   7195          (abc)(?i:(?-1))
   7196 
   7197        It matches "abcabc". It does not match "abcABC" because the  change  of
   7198        processing option does not affect the called subpattern.
   7199 
   7200 
   7201 ONIGURUMA SUBROUTINE SYNTAX
   7202 
   7203        For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
   7204        name or a number enclosed either in angle brackets or single quotes, is
   7205        an  alternative  syntax  for  referencing a subpattern as a subroutine,
   7206        possibly recursively. Here are two of the examples used above,  rewrit-
   7207        ten using this syntax:
   7208 
   7209          (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
   7210          (sens|respons)e and \g'1'ibility
   7211 
   7212        PCRE  supports  an extension to Oniguruma: if a number is preceded by a
   7213        plus or a minus sign it is taken as a relative reference. For example:
   7214 
   7215          (abc)(?i:\g<-1>)
   7216 
   7217        Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are  not
   7218        synonymous.  The former is a back reference; the latter is a subroutine
   7219        call.
   7220 
   7221 
   7222 CALLOUTS
   7223 
   7224        Perl has a feature whereby using the sequence (?{...}) causes arbitrary
   7225        Perl  code to be obeyed in the middle of matching a regular expression.
   7226        This makes it possible, amongst other things, to extract different sub-
   7227        strings that match the same pair of parentheses when there is a repeti-
   7228        tion.
   7229 
   7230        PCRE provides a similar feature, but of course it cannot obey arbitrary
   7231        Perl code. The feature is called "callout". The caller of PCRE provides
   7232        an external function by putting its entry point in the global  variable
   7233        pcre_callout  (8-bit  library) or pcre[16|32]_callout (16-bit or 32-bit
   7234        library).  By default, this variable contains NULL, which disables  all
   7235        calling out.
   7236 
   7237        Within  a  regular  expression,  (?C) indicates the points at which the
   7238        external function is to be called. If you want  to  identify  different
   7239        callout  points, you can put a number less than 256 after the letter C.
   7240        The default value is zero.  For example, this pattern has  two  callout
   7241        points:
   7242 
   7243          (?C1)abc(?C2)def
   7244 
   7245        If  the PCRE_AUTO_CALLOUT flag is passed to a compiling function, call-
   7246        outs are automatically installed before each item in the pattern.  They
   7247        are  all  numbered  255. If there is a conditional group in the pattern
   7248        whose condition is an assertion, an additional callout is inserted just
   7249        before the condition. An explicit callout may also be set at this posi-
   7250        tion, as in this example:
   7251 
   7252          (?(?C9)(?=a)abc|def)
   7253 
   7254        Note that this applies only to assertion conditions, not to other types
   7255        of condition.
   7256 
   7257        During  matching, when PCRE reaches a callout point, the external func-
   7258        tion is called. It is provided with the  number  of  the  callout,  the
   7259        position  in  the pattern, and, optionally, one item of data originally
   7260        supplied by the caller of the matching function. The  callout  function
   7261        may cause matching to proceed, to backtrack, or to fail altogether.
   7262 
   7263        By  default,  PCRE implements a number of optimizations at compile time
   7264        and matching time, and one side-effect is that sometimes  callouts  are
   7265        skipped.  If  you need all possible callouts to happen, you need to set
   7266        options that disable the relevant optimizations. More  details,  and  a
   7267        complete  description  of  the  interface  to the callout function, are
   7268        given in the pcrecallout documentation.
   7269 
   7270 
   7271 BACKTRACKING CONTROL
   7272 
   7273        Perl 5.10 introduced a number of "Special Backtracking Control  Verbs",
   7274        which  are  still  described in the Perl documentation as "experimental
   7275        and subject to change or removal in a future version of Perl". It  goes
   7276        on  to  say:  "Their  usage in production code should be noted to avoid
   7277        problems during upgrades." The same remarks apply to the PCRE  features
   7278        described in this section.
   7279 
   7280        The  new verbs make use of what was previously invalid syntax: an open-
   7281        ing parenthesis followed by an asterisk. They are generally of the form
   7282        (*VERB)  or  (*VERB:NAME). Some may take either form, possibly behaving
   7283        differently depending on whether or not a name is present.  A  name  is
   7284        any sequence of characters that does not include a closing parenthesis.
   7285        The maximum length of a name is 255 in the 8-bit library and 65535 in the
   7286        16-bit  and  32-bit  libraries.  If  the name is empty, that is, if the
   7287        closing parenthesis immediately follows the colon, the effect is as  if
   7288        the  colon  were  not  there.  Any number of these verbs may occur in a
   7289        pattern.
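
               The syntax rules in this paragraph can be sketched as a scanner
               that lists the verbs in a pattern string.  This is an illustrative
               Python sketch, not PCRE code: find_verbs is a hypothetical name,
               the scan ignores escapes and \Q...\E quoting, and the length check
               counts Python characters as an approximation of the 8-bit limit.

```python
import re

# Matches (*VERB), (*VERB:NAME) and the (*:NAME) shorthand for (*MARK:NAME).
# A name is any run of characters other than ")".
VERB_RE = re.compile(r"\(\*([A-Z_]*)(?::([^)]*))?\)")

MAX_NAME_8BIT = 255   # limit quoted in the text for the 8-bit library


def find_verbs(pattern):
    """Return (verb, name) pairs; name is None when absent or empty."""
    verbs = []
    for match in VERB_RE.finditer(pattern):
        verb, name = match.group(1), match.group(2)
        if verb == "" and name is not None:
            verb = "MARK"      # (*:NAME) is another spelling of (*MARK:NAME)
        if verb == "":
            continue           # "(*)" names no verb; skipped by this sketch
        if not name:
            name = None        # an empty name acts as if the colon were absent
        elif len(name) > MAX_NAME_8BIT:
            raise ValueError("verb name longer than 255 characters")
        verbs.append((verb, name))
    return verbs


print(find_verbs("X(*MARK:A)Y|X(*:B)Z(*PRUNE:)"))
```

               Here (*:B) is reported as a MARK named B, and (*PRUNE:) is reported
               with no name, because an empty name is treated as absent.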
   7290 
   7291        Since these verbs are specifically related  to  backtracking,  most  of
   7292        them  can  be  used only when the pattern is to be matched using one of
   7293        the traditional matching functions, because these  use  a  backtracking
   7294        algorithm.  With the exception of (*FAIL), which behaves like a failing
   7295        negative assertion, the backtracking control verbs cause  an  error  if
   7296        encountered by a DFA matching function.
   7297 
   7298        The  behaviour  of  these  verbs in repeated groups, assertions, and in
   7299        subpatterns called as subroutines (whether or not recursively) is docu-
   7300        mented below.
   7301 
   7302    Optimizations that affect backtracking verbs
   7303 
   7304        PCRE  contains some optimizations that are used to speed up matching by
   7305        running some checks at the start of each match attempt. For example, it
   7306        may  know  the minimum length of matching subject, or that a particular
   7307        character must be present. When one of these optimizations bypasses the
   7308        running  of  a  match,  any  included  backtracking  verbs will not, of
   7309        course, be processed. You can suppress the start-of-match optimizations
   7310        by  setting  the  PCRE_NO_START_OPTIMIZE  option when calling pcre_com-
   7311        pile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT).
   7312        There is more discussion of this option in the section entitled "Option
   7313        bits for pcre_exec()" in the pcreapi documentation.
   7314 
   7315        Experiments with Perl suggest that it too  has  similar  optimizations,
   7316        sometimes leading to anomalous results.
   7317 
   7318    Verbs that act immediately
   7319 
   7320        The  following  verbs act as soon as they are encountered. They may not
   7321        be followed by a name.
   7322 
   7323           (*ACCEPT)
   7324 
   7325        This verb causes the match to end successfully, skipping the  remainder
   7326        of  the pattern. However, when it is inside a subpattern that is called
   7327        as a subroutine, only that subpattern is ended  successfully.  Matching
   7328        then continues at the outer level. If (*ACCEPT) is triggered in a posi-
   7329        tive assertion, the assertion succeeds; in a  negative  assertion,  the
   7330        assertion fails.
   7331 
   7332        If  (*ACCEPT)  is inside capturing parentheses, the data so far is cap-
   7333        tured. For example:
   7334 
   7335          A((?:A|B(*ACCEPT)|C)D)
   7336 
   7337        This matches "AB", "AAD", or "ACD"; when it matches "AB", "B"  is  cap-
   7338        tured by the outer parentheses.
   7339 
   7340          (*FAIL) or (*F)
   7341 
   7342        This  verb causes a matching failure, forcing backtracking to occur. It
   7343        is equivalent to (?!) but easier to read. The Perl documentation  notes
   7344        that  it  is  probably  useful only when combined with (?{}) or (??{}).
   7345        Those are, of course, Perl features that are not present in  PCRE.  The
   7346        nearest  equivalent is the callout feature, as for example in this pat-
   7347        tern:
   7348 
   7349          a+(?C)(*FAIL)
   7350 
   7351        A match with the string "aaaa" always fails, but the callout  is  taken
   7352        before each backtrack happens (in this example, 10 times).
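
               The figure of 10 can be checked by modelling the backtracking by
               hand.  The following Python sketch is a simulation of the behaviour
               described here, not a call into PCRE; count_callouts is a
               hypothetical name, and the model assumes that no start-of-match
               optimization skips any match attempt.

```python
def count_callouts(subject):
    """Model a+(?C)(*FAIL) on subject; count how often (?C) is reached."""
    callouts = 0
    for start in range(len(subject) + 1):       # unanchored: try every start
        # Greedy a+ first takes the longest run of "a" at this position.
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # Each backtrack gives back one "a"; a+ must keep at least one.
        for _kept in range(run, 0, -1):
            callouts += 1                       # (?C) is reached, then (*FAIL)
    return callouts


print(count_callouts("aaaa"))
```

               For "aaaa" the attempts starting at the four "a" characters reach
               the callout 4, 3, 2, and 1 times respectively.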
   7353 
   7354    Recording which path was taken
   7355 
   7356        There  is  one  verb  whose  main  purpose  is to track how a match was
   7357        arrived at, though it also has a  secondary  use  in  conjunction  with
   7358        advancing the match starting point (see (*SKIP) below).
   7359 
   7360          (*MARK:NAME) or (*:NAME)
   7361 
   7362        A  name  is  always  required  with  this  verb.  There  may be as many
   7363        instances of (*MARK) as you like in a pattern, and their names  do  not
   7364        have to be unique.
   7365 
   7366        When  a  match succeeds, the name of the last-encountered (*MARK:NAME),
   7367        (*PRUNE:NAME), or (*THEN:NAME) on the matching path is passed  back  to
   7368        the  caller  as  described  in  the  section  entitled  "Extra data for
   7369        pcre_exec()" in the  pcreapi  documentation.  Here  is  an  example  of
   7370        pcretest  output, where the /K modifier requests the retrieval and out-
   7371        putting of (*MARK) data:
   7372 
   7373            re> /X(*MARK:A)Y|X(*MARK:B)Z/K
   7374          data> XY
   7375           0: XY
   7376          MK: A
   7377          XZ
   7378           0: XZ
   7379          MK: B
   7380 
   7381        The (*MARK) name is tagged with "MK:" in this output, and in this exam-
   7382        ple  it indicates which of the two alternatives matched. This is a more
   7383        efficient way of obtaining this information than putting each  alterna-
   7384        tive in its own capturing parentheses.
   7385 
   7386        If  a  verb  with a name is encountered in a positive assertion that is
   7387        true, the name is recorded and passed back if it  is  the  last-encoun-
   7388        tered. This does not happen for negative assertions or failing positive
   7389        assertions.
   7390 
   7391        After a partial match or a failed match, the last encountered  name  in
   7392        the entire match process is returned. For example:
   7393 
   7394            re> /X(*MARK:A)Y|X(*MARK:B)Z/K
   7395          data> XP
   7396          No match, mark = B
   7397 
   7398        Note  that  in  this  unanchored  example the mark is retained from the
   7399        match attempt that started at the letter "X" in the subject. Subsequent
   7400        match attempts starting at "P" and then with an empty string do not get
   7401        as far as the (*MARK) item, but nevertheless do not reset it.
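
               The way the mark survives later match attempts can be modelled as
               follows.  This Python sketch simulates the behaviour described
               above for patterns made of literal branches; it does not call
               PCRE, and run_marks and PATTERN are hypothetical names.

```python
def run_marks(alternatives, subject):
    """Model an unanchored alternation of PREFIX(*MARK:NAME)SUFFIX branches.

    alternatives is a list of (prefix, mark_name, suffix) literal triples.
    Returns (matched_text_or_None, last_mark_or_None).
    """
    mark = None                                   # kept across match attempts
    for start in range(len(subject) + 1):         # bumpalong over start points
        for prefix, name, suffix in alternatives:
            if not subject.startswith(prefix, start):
                continue                          # (*MARK) is not reached
            mark = name                           # (*MARK:NAME) encountered
            if subject.startswith(suffix, start + len(prefix)):
                end = start + len(prefix) + len(suffix)
                return subject[start:end], mark
    return None, mark


PATTERN = [("X", "A", "Y"), ("X", "B", "Z")]      # X(*MARK:A)Y|X(*MARK:B)Z

for data in ("XY", "XZ", "XP"):
    print(data, run_marks(PATTERN, data))
```

               With the subject "XP" the model finds no match but still reports
               the mark B, because the attempts that start after "X" never reach
               a (*MARK) and so never overwrite it.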
   7402 
   7403        If you are interested in  (*MARK)  values  after  failed  matches,  you
   7404        should  probably  set  the PCRE_NO_START_OPTIMIZE option (see above) to
   7405        ensure that the match is always attempted.
   7406 
   7407    Verbs that act after backtracking
   7408 
   7409        The following verbs do nothing when they are encountered. Matching con-
   7410        tinues  with what follows, but if there is no subsequent match, causing
   7411        a backtrack to the verb, a failure is  forced.  That  is,  backtracking
   7412        cannot  pass  to the left of the verb. However, when one of these verbs
   7413        appears inside an atomic group or an assertion that is true, its effect
   7414        is  confined  to  that  group, because once the group has been matched,
   7415        there is never any backtracking into it. In this situation,  backtrack-
   7416        ing  can  "jump  back" to the left of the entire atomic group or asser-
   7417        tion. (Remember, as stated above, that this localization  also  applies
   7418        in subroutine calls.)
   7419 
   7420        These  verbs  differ  in exactly what kind of failure occurs when back-
   7421        tracking reaches them. The behaviour described below  is  what  happens
   7422        when  the  verb is not in a subroutine or an assertion. Subsequent sec-
   7423        tions cover these special cases.
   7424 
   7425          (*COMMIT)
   7426 
   7427        This verb, which may not be followed by a name, causes the whole  match
   7428        to fail outright if there is a later matching failure that causes back-
   7429        tracking to reach it. Even if the pattern  is  unanchored,  no  further
   7430        attempts to find a match by advancing the starting point take place. If
   7431        (*COMMIT) is the only backtracking verb that is  encountered,  once  it
   7432        has been passed pcre_exec() is committed to finding a match at the cur-
   7433        rent starting point, or not at all. For example:
   7434 
   7435          a+(*COMMIT)b
   7436 
   7437        This matches "xxaab" but not "aacaab". It can be thought of as  a  kind
   7438        of dynamic anchor, or "I've started, so I must finish." The name of the
   7439        most recently passed (*MARK) in the path is passed back when  (*COMMIT)
   7440        forces a match failure.
   7441 
   7442        If  there  is more than one backtracking verb in a pattern, a different
   7443        one that follows (*COMMIT) may be triggered first,  so  merely  passing
   7444        (*COMMIT) during a match does not always guarantee that a match must be
   7445        at this starting point.
   7446 
   7447        Note that (*COMMIT) at the start of a pattern is not  the  same  as  an
   7448        anchor,  unless  PCRE's start-of-match optimizations are turned off, as
   7449        shown in this output from pcretest:
   7450 
   7451            re> /(*COMMIT)abc/
   7452          data> xyzabc
   7453           0: abc
   7454          data> xyzabc\Y
   7455          No match
   7456 
   7457        For this pattern, PCRE knows that any match must start with "a", so the
   7458        optimization skips along the subject to "a" before applying the pattern
   7459        to the first set of data. The match attempt then succeeds. In the  sec-
   7460        ond  set of data, the escape sequence \Y is interpreted by the pcretest
   7461        program. It causes the PCRE_NO_START_OPTIMIZE option  to  be  set  when
   7462        pcre_exec() is called.  This disables the optimization that skips along
   7463        to the first character. The pattern is now applied starting at "x", and
   7464        so  the  (*COMMIT)  causes  the  match to fail without trying any other
   7465        starting points.
   7466 
   7467          (*PRUNE) or (*PRUNE:NAME)
   7468 
   7469        This verb causes the match to fail at the current starting position  in
   7470        the subject if there is a later matching failure that causes backtrack-
   7471        ing to reach it. If the pattern is unanchored, the  normal  "bumpalong"
   7472        advance  to  the next starting character then happens. Backtracking can
   7473        occur as usual to the left of (*PRUNE), before it is reached,  or  when
   7474        matching  to  the  right  of  (*PRUNE), but if there is no match to the
   7475        right, backtracking cannot cross (*PRUNE). In simple cases, the use  of
   7476        (*PRUNE)  is just an alternative to an atomic group or possessive quan-
   7477        tifier, but there are some uses of (*PRUNE) that cannot be expressed in
   7478        any  other  way. In an anchored pattern (*PRUNE) has the same effect as
   7479        (*COMMIT).
   7480 
   7481        The  behaviour  of  (*PRUNE:NAME)  is   not   the   same   as
   7482        (*MARK:NAME)(*PRUNE).   It  is  like  (*MARK:NAME)  in that the name is
   7483        remembered for  passing  back  to  the  caller.  However,  (*SKIP:NAME)
   7484        searches only for names set with (*MARK).
   7485 
   7486          (*SKIP)
   7487 
   7488        This  verb, when given without a name, is like (*PRUNE), except that if
   7489        the pattern is unanchored, the "bumpalong" advance is not to  the  next
   7490        character, but to the position in the subject where (*SKIP) was encoun-
   7491        tered. (*SKIP) signifies that whatever text was matched leading  up  to
   7492        it cannot be part of a successful match. Consider:
   7493 
   7494          a+(*SKIP)b
   7495 
   7496        If  the  subject  is  "aaaac...",  after  the first match attempt fails
   7497        (starting at the first character in the  string),  the  starting  point
   7498        skips on to start the next attempt at "c". Note that a possessive quan-
   7499        tifier does not have the same effect as this example; although it would
   7500        suppress  backtracking  during  the  first  match  attempt,  the second
   7501        attempt would start at the second character instead of skipping  on  to
   7502        "c".
   7503 
   7504          (*SKIP:NAME)
   7505 
   7506        When (*SKIP) has an associated name, its behaviour is modified. When it
   7507        is triggered, the previous path through the pattern is searched for the
   7508        most  recent  (*MARK)  that  has  the  same  name. If one is found, the
   7509        "bumpalong" advance is to the subject position that corresponds to that
   7510        (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with
   7511        a matching name is found, the (*SKIP) is ignored.
   7512 
   7513        Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME).  It
   7514        ignores names that are set by (*PRUNE:NAME) or (*THEN:NAME).
   7515 
   7516          (*THEN) or (*THEN:NAME)
   7517 
   7518        This  verb  causes  a skip to the next innermost alternative when back-
   7519        tracking reaches it. That  is,  it  cancels  any  further  backtracking
   7520        within  the  current  alternative.  Its name comes from the observation
   7521        that it can be used for a pattern-based if-then-else block:
   7522 
   7523          ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
   7524 
   7525        If the COND1 pattern matches, FOO is tried (and possibly further  items
   7526        after  the  end  of the group if FOO succeeds); on failure, the matcher
   7527        skips to the second alternative and tries COND2,  without  backtracking
   7528        into  COND1.  If that succeeds and BAR fails, COND3 is tried. If subse-
   7529        quently BAZ fails, there are no more alternatives, so there is a  back-
   7530        track  to  whatever  came  before  the  entire group. If (*THEN) is not
   7531        inside an alternation, it acts like (*PRUNE).
   7532 
   7533        The  behaviour  of  (*THEN:NAME)  is   not   the   same   as
   7534        (*MARK:NAME)(*THEN).   It  is  like  (*MARK:NAME)  in  that the name is
   7535        remembered for  passing  back  to  the  caller.  However,  (*SKIP:NAME)
   7536        searches only for names set with (*MARK).
   7537 
   7538        A  subpattern that does not contain a | character is just a part of the
   7539        enclosing alternative; it is not a nested  alternation  with  only  one
   7540        alternative.  The effect of (*THEN) extends beyond such a subpattern to
   7541        the enclosing alternative. Consider this pattern, where A, B, etc.  are
   7542        complex  pattern fragments that do not contain any | characters at this
   7543        level:
   7544 
   7545          A (B(*THEN)C) | D
   7546 
   7547        If A and B are matched, but there is a failure in C, matching does  not
   7548        backtrack into A; instead it moves to the next alternative, that is, D.
   7549        However, if the subpattern containing (*THEN) is given an  alternative,
   7550        it behaves differently:
   7551 
   7552          A (B(*THEN)C | (*FAIL)) | D
   7553 
   7554        The  effect of (*THEN) is now confined to the inner subpattern. After a
   7555        failure in C, matching moves to (*FAIL), which causes the whole subpat-
   7556        tern  to  fail  because  there are no more alternatives to try. In this
   7557        case, matching does now backtrack into A.
   7558 
   7559        Note that a conditional subpattern is  not  considered  as  having  two
   7560        alternatives,  because  only  one  is  ever used. In other words, the |
   7561        character in a conditional subpattern has a different meaning. Ignoring
   7562        white space, consider:
   7563 
   7564          ^.*? (?(?=a) a | b(*THEN)c )
   7565 
   7566        If  the  subject  is  "ba", this pattern does not match. Because .*? is
   7567        ungreedy, it initially matches zero  characters.  The  condition  (?=a)
   7568        then  fails,  the  character  "b"  is  matched, but "c" is not. At this
   7569        point, matching does not backtrack to .*? as might perhaps be  expected
   7570        from  the  presence  of  the | character. The conditional subpattern is
   7571        part of the single alternative that comprises the whole pattern, and so
   7572        the  match  fails.  (If  there was a backtrack into .*?, allowing it to
   7573        match "b", the match would succeed.)
   7574 
   7575        The verbs just described provide four different "strengths" of  control
   7576        when subsequent matching fails. (*THEN) is the weakest, carrying on the
   7577        match at the next alternative. (*PRUNE) comes next, failing  the  match
   7578        at  the  current starting position, but allowing an advance to the next
   7579        character (for an unanchored pattern). (*SKIP) is similar, except  that
   7580        the advance may be more than one character. (*COMMIT) is the strongest,
   7581        causing the entire match to fail.
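
               The difference in strength can be seen by modelling the unanchored
               pattern a+(*VERB)b and recording which starting positions are
               tried.  This Python sketch is a simulation, not a call into PCRE;
               match_a_plus_verb_b is a hypothetical name, and start-of-match
               optimizations are assumed to be off.  (*THEN) is not modelled
               separately because, outside an alternation, it acts like (*PRUNE).

```python
def match_a_plus_verb_b(subject, verb=None):
    """Model a+(*VERB)b, where verb is None, "PRUNE", "SKIP" or "COMMIT".

    Returns (span, starts): span is the (start, end) of the match or None,
    and starts lists every starting position that was tried.
    """
    starts = []
    start = 0
    while start <= len(subject):
        starts.append(start)
        run = 0                                   # length of the greedy a+ run
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        end = start + run
        if run and subject.startswith("b", end):
            return (start, end + 1), starts       # a+ then b: success
        if run and verb == "COMMIT":
            return None, starts                   # no further start points
        if run and verb == "SKIP":
            start = end                           # resume where (*SKIP) was met
        else:
            start += 1                            # normal bumpalong
    return None, starts


for verb in ("PRUNE", "SKIP", "COMMIT"):
    print(verb, match_a_plus_verb_b("aaaac", verb))
```

               In this model, for the subject "aaaac", (*PRUNE) tries every
               starting position in turn, (*SKIP) jumps from the first position
               straight to "c", and (*COMMIT) gives up after the first attempt.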
   7582 
   7583    More than one backtracking verb
   7584 
   7585        If more than one backtracking verb is present in  a  pattern,  the  one
   7586        that  is  backtracked  onto first acts. For example, consider this pat-
   7587        tern, where A, B, etc. are complex pattern fragments:
   7588 
   7589          (A(*COMMIT)B(*THEN)C|ABD)
   7590 
   7591        If A matches but B fails, the backtrack to (*COMMIT) causes the  entire
   7592        match to fail. However, if A and B match, but C fails, the backtrack to
   7593        (*THEN) causes the next alternative (ABD) to be tried.  This  behaviour
   7594        is  consistent,  but is not always the same as Perl's. It means that if
   7595        two or more backtracking verbs appear in succession, all but  the  last
   7596        of them have no effect. Consider this example:
   7597 
   7598          ...(*COMMIT)(*PRUNE)...
   7599 
   7600        If there is a matching failure to the right, backtracking onto (*PRUNE)
   7601        causes it to be triggered, and its action is taken. There can never  be
   7602        a backtrack onto (*COMMIT).
   7603 
   7604    Backtracking verbs in repeated groups
   7605 
   7606        PCRE  differs  from  Perl  in  its  handling  of  backtracking verbs in
   7607        repeated groups. For example, consider:
   7608 
   7609          /(a(*COMMIT)b)+ac/
   7610 
   7611        If the subject is "abac", Perl matches,  but  PCRE  fails  because  the
   7612        (*COMMIT) in the second repeat of the group acts.
   7613 
   7614    Backtracking verbs in assertions
   7615 
   7616        (*FAIL)  in  an assertion has its normal effect: it forces an immediate
   7617        backtrack.
   7618 
   7619        (*ACCEPT) in a positive assertion causes the assertion to succeed with-
   7620        out  any  further processing. In a negative assertion, (*ACCEPT) causes
   7621        the assertion to fail without any further processing.
   7622 
   7623        The other backtracking verbs are not treated specially if  they  appear
   7624        in  a  positive  assertion.  In  particular,  (*THEN) skips to the next
   7625        alternative in the innermost enclosing  group  that  has  alternations,
   7626        whether or not this is within the assertion.
   7627 
   7628        Negative  assertions  are,  however, different, in order to ensure that
   7629        changing a positive assertion into a  negative  assertion  changes  its
   7630        result. Backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes a neg-
   7631        ative assertion to be true, without considering any further alternative
   7632        branches in the assertion.  Backtracking into (*THEN) causes it to skip
   7633        to the next enclosing alternative within the assertion (the normal  be-
   7634        haviour),  but  if  the  assertion  does  not have such an alternative,
   7635        (*THEN) behaves like (*PRUNE).
   7636 
   7637    Backtracking verbs in subroutines
   7638 
   7639        These behaviours occur whether or not the subpattern is  called  recur-
   7640        sively.  Perl's treatment of subroutines is different in some cases.
   7641 
   7642        (*FAIL)  in  a subpattern called as a subroutine has its normal effect:
   7643        it forces an immediate backtrack.
   7644 
   7645        (*ACCEPT) in a subpattern called as a subroutine causes the  subroutine
   7646        match  to succeed without any further processing. Matching then contin-
   7647        ues after the subroutine call.
   7648 
   7649        (*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as a subroutine
   7650        cause the subroutine match to fail.
   7651 
   7652        (*THEN)  skips to the next alternative in the innermost enclosing group
   7653        within the subpattern that has alternatives. If there is no such  group
   7654        within the subpattern, (*THEN) causes the subroutine match to fail.
   7655 
   7656 
   7657 SEE ALSO
   7658 
   7659        pcreapi(3),  pcrecallout(3),  pcrematching(3),  pcresyntax(3), pcre(3),
   7660        pcre16(3), pcre32(3).
   7661 
   7662 
   7663 AUTHOR
   7664 
   7665        Philip Hazel
   7666        University Computing Service
   7667        Cambridge CB2 3QH, England.
   7668 
   7669 
   7670 REVISION
   7671 
   7672        Last updated: 14 June 2015
   7673        Copyright (c) 1997-2015 University of Cambridge.
   7674 ------------------------------------------------------------------------------
   7675 
   7676 
   7677 PCRESYNTAX(3)              Library Functions Manual              PCRESYNTAX(3)
   7678 
   7679 
   7680 
   7681 NAME
   7682        PCRE - Perl-compatible regular expressions
   7683 
   7684 PCRE REGULAR EXPRESSION SYNTAX SUMMARY
   7685 
   7686        The  full syntax and semantics of the regular expressions that are sup-
   7687        ported by PCRE are described in  the  pcrepattern  documentation.  This
   7688        document contains a quick-reference summary of the syntax.
   7689 
   7690 
   7691 QUOTING
   7692 
   7693          \x         where x is non-alphanumeric is a literal x
   7694          \Q...\E    treat enclosed characters as literal
   7695 
   7696 
   7697 CHARACTERS
   7698 
   7699          \a         alarm, that is, the BEL character (hex 07)
   7700          \cx        "control-x", where x is any ASCII character
   7701          \e         escape (hex 1B)
   7702          \f         form feed (hex 0C)
   7703          \n         newline (hex 0A)
   7704          \r         carriage return (hex 0D)
   7705          \t         tab (hex 09)
   7706          \0dd       character with octal code 0dd
   7707          \ddd       character with octal code ddd, or backreference
   7708          \o{ddd..}  character with octal code ddd..
   7709          \xhh       character with hex code hh
   7710          \x{hhh..}  character with hex code hhh..
   7711 
   7712        Note that \0dd is always an octal code, and that \8 and \9 are the lit-
   7713        eral characters "8" and "9".
   7714 
   7715 
   7716 CHARACTER TYPES
   7717 
   7718          .          any character except newline;
   7719                       in dotall mode, any character whatsoever
   7720          \C         one data unit, even in UTF mode (best avoided)
   7721          \d         a decimal digit
   7722          \D         a character that is not a decimal digit
   7723          \h         a horizontal white space character
   7724          \H         a character that is not a horizontal white space character
   7725          \N         a character that is not a newline
   7726          \p{xx}     a character with the xx property
   7727          \P{xx}     a character without the xx property
   7728          \R         a newline sequence
   7729          \s         a white space character
   7730          \S         a character that is not a white space character
   7731          \v         a vertical white space character
   7732          \V         a character that is not a vertical white space character
   7733          \w         a "word" character
   7734          \W         a "non-word" character
   7735          \X         a Unicode extended grapheme cluster
   7736 
   7737        By default, \d, \s, and \w match only ASCII characters, even  in  UTF-8
   7738        mode  or  in  the 16-bit and 32-bit libraries. However, if  locale-spe-
   7739        cific matching is happening, \s and \w may also match  characters  with
   7740        code  points  in  the range 128-255. If the PCRE_UCP option is set, the
   7741        behaviour of these escape sequences is changed to use  Unicode  proper-
   7742        ties and they match many more characters.
   7743 
   7744 
   7745 GENERAL CATEGORY PROPERTIES FOR \p and \P
   7746 
   7747          C          Other
   7748          Cc         Control
   7749          Cf         Format
   7750          Cn         Unassigned
   7751          Co         Private use
   7752          Cs         Surrogate
   7753 
   7754          L          Letter
   7755          Ll         Lower case letter
   7756          Lm         Modifier letter
   7757          Lo         Other letter
   7758          Lt         Title case letter
   7759          Lu         Upper case letter
   7760          L&         Ll, Lu, or Lt
   7761 
   7762          M          Mark
   7763          Mc         Spacing mark
   7764          Me         Enclosing mark
   7765          Mn         Non-spacing mark
   7766 
   7767          N          Number
   7768          Nd         Decimal number
   7769          Nl         Letter number
   7770          No         Other number
   7771 
   7772          P          Punctuation
   7773          Pc         Connector punctuation
   7774          Pd         Dash punctuation
   7775          Pe         Close punctuation
   7776          Pf         Final punctuation
   7777          Pi         Initial punctuation
   7778          Po         Other punctuation
   7779          Ps         Open punctuation
   7780 
   7781          S          Symbol
   7782          Sc         Currency symbol
   7783          Sk         Modifier symbol
   7784          Sm         Mathematical symbol
   7785          So         Other symbol
   7786 
   7787          Z          Separator
   7788          Zl         Line separator
   7789          Zp         Paragraph separator
   7790          Zs         Space separator
   7791 
   7792 
   7793 PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P
   7794 
   7795          Xan        Alphanumeric: union of properties L and N
   7796          Xps        POSIX space: property Z or tab, NL, VT, FF, CR
   7797          Xsp        Perl space: property Z or tab, NL, VT, FF, CR
   7798          Xuc        Universally-named character: one that can be
   7799                       represented by a Universal Character Name
   7800          Xwd        Perl word: property Xan or underscore
   7801 
   7802        Perl and POSIX space are now the same. Perl added VT to its space char-
   7803        acter set at release 5.18 and PCRE changed at release 8.34.
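
               The definitions of Xan, Xsp (and Xps), and Xwd can be written out
               as predicates over Unicode general categories.  This is a sketch
               using Python's unicodedata module rather than PCRE's own tables,
               so results may differ for characters whose category changed
               between Unicode versions; the function names are hypothetical.

```python
import unicodedata

# Tab, NL, VT, FF, CR: the control characters listed for Xps and Xsp.
_SPACE_CONTROLS = "\t\n\v\f\r"


def is_xan(ch):
    """Xan: general category L (letter) or N (number)."""
    return unicodedata.category(ch)[0] in "LN"


def is_xsp(ch):
    """Xsp and Xps: property Z, or one of tab, NL, VT, FF, CR."""
    return unicodedata.category(ch)[0] == "Z" or ch in _SPACE_CONTROLS


def is_xwd(ch):
    """Xwd: Xan or underscore."""
    return is_xan(ch) or ch == "_"


for ch in ("a", "5", "_", "\v", " "):
    print(repr(ch), is_xan(ch), is_xsp(ch), is_xwd(ch))
```

               Each function expects a single character.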
   7804 
   7805 
   7806 SCRIPT NAMES FOR \p AND \P
   7807 
   7808        Arabic, Armenian, Avestan, Balinese, Bamum, Bassa_Vah, Batak,  Bengali,
   7809        Bopomofo,  Brahmi,  Braille, Buginese, Buhid, Canadian_Aboriginal, Car-
   7810        ian, Caucasian_Albanian, Chakma, Cham, Cherokee, Common, Coptic, Cunei-
   7811        form, Cypriot, Cyrillic, Deseret, Devanagari, Duployan, Egyptian_Hiero-
   7812        glyphs,  Elbasan,  Ethiopic,  Georgian,  Glagolitic,  Gothic,  Grantha,
   7813        Greek,  Gujarati,  Gurmukhi,  Han,  Hangul,  Hanunoo, Hebrew, Hiragana,
   7814        Imperial_Aramaic,    Inherited,     Inscriptional_Pahlavi,     Inscrip-
   7815        tional_Parthian,   Javanese,   Kaithi,   Kannada,  Katakana,  Kayah_Li,
   7816        Kharoshthi, Khmer, Khojki, Khudawadi, Lao, Latin, Lepcha,  Limbu,  Lin-
   7817        ear_A,  Linear_B,  Lisu,  Lycian, Lydian, Mahajani, Malayalam, Mandaic,
   7818        Manichaean,     Meetei_Mayek,     Mende_Kikakui,      Meroitic_Cursive,
   7819        Meroitic_Hieroglyphs,  Miao,  Modi, Mongolian, Mro, Myanmar, Nabataean,
   7820        New_Tai_Lue,  Nko,  Ogham,  Ol_Chiki,  Old_Italic,   Old_North_Arabian,
   7821        Old_Permic, Old_Persian, Old_South_Arabian, Old_Turkic, Oriya, Osmanya,
   7822        Pahawh_Hmong,    Palmyrene,    Pau_Cin_Hau,    Phags_Pa,    Phoenician,
   7823        Psalter_Pahlavi,  Rejang,  Runic,  Samaritan, Saurashtra, Sharada, Sha-
   7824        vian, Siddham, Sinhala, Sora_Sompeng, Sundanese, Syloti_Nagri,  Syriac,
   7825        Tagalog,  Tagbanwa,  Tai_Le,  Tai_Tham, Tai_Viet, Takri, Tamil, Telugu,
   7826        Thaana, Thai, Tibetan, Tifinagh, Tirhuta, Ugaritic,  Vai,  Warang_Citi,
   7827        Yi.
   7828 
   7829 
   7830 CHARACTER CLASSES
   7831 
   7832          [...]       positive character class
   7833          [^...]      negative character class
   7834          [x-y]       range (can be used for hex characters)
   7835          [[:xxx:]]   positive POSIX named set
   7836          [[:^xxx:]]  negative POSIX named set
   7837 
   7838          alnum       alphanumeric
   7839          alpha       alphabetic
   7840          ascii       0-127
   7841          blank       space or tab
   7842          cntrl       control character
   7843          digit       decimal digit
   7844          graph       printing, excluding space
   7845          lower       lower case letter
   7846          print       printing, including space
   7847          punct       printing, excluding alphanumeric
   7848          space       white space
   7849          upper       upper case letter
   7850          word        same as \w
   7851          xdigit      hexadecimal digit
   7852 
   7853        In  PCRE,  POSIX character set names recognize only ASCII characters by
   7854        default, but some of them use Unicode properties if  PCRE_UCP  is  set.
   7855        You can use \Q...\E inside a character class.
   7856 
   7857 
   7858 QUANTIFIERS
   7859 
   7860          ?           0 or 1, greedy
   7861          ?+          0 or 1, possessive
   7862          ??          0 or 1, lazy
   7863          *           0 or more, greedy
   7864          *+          0 or more, possessive
   7865          *?          0 or more, lazy
   7866          +           1 or more, greedy
   7867          ++          1 or more, possessive
   7868          +?          1 or more, lazy
   7869          {n}         exactly n
   7870          {n,m}       at least n, no more than m, greedy
   7871          {n,m}+      at least n, no more than m, possessive
   7872          {n,m}?      at least n, no more than m, lazy
   7873          {n,}        n or more, greedy
   7874          {n,}+       n or more, possessive
   7875          {n,}?       n or more, lazy
   7876 
   7877 
   7878 ANCHORS AND SIMPLE ASSERTIONS
   7879 
   7880          \b          word boundary
   7881          \B          not a word boundary
   7882          ^           start of subject
   7883                       also after internal newline in multiline mode
   7884          \A          start of subject
   7885          $           end of subject
   7886                       also before newline at end of subject
   7887                       also before internal newline in multiline mode
   7888          \Z          end of subject
   7889                       also before newline at end of subject
   7890          \z          end of subject
   7891          \G          first matching position in subject
   7892 
   7893 
   7894 MATCH POINT RESET
   7895 
   7896          \K          reset start of match
   7897 
   7898        \K is honoured in positive assertions, but ignored in negative ones.
   7899 
   7900 
   7901 ALTERNATION
   7902 
   7903          expr|expr|expr...
   7904 
   7905 
   7906 CAPTURING
   7907 
   7908          (...)           capturing group
   7909          (?<name>...)    named capturing group (Perl)
   7910          (?'name'...)    named capturing group (Perl)
   7911          (?P<name>...)   named capturing group (Python)
   7912          (?:...)         non-capturing group
   7913          (?|...)         non-capturing group; reset group numbers for
   7914                           capturing groups in each alternative
   7915 
   7916 
   7917 ATOMIC GROUPS
   7918 
   7919          (?>...)         atomic, non-capturing group
   7920 
   7921 
   7922 COMMENT
   7923 
   7924          (?#....)        comment (not nestable)
   7925 
   7926 
   7927 OPTION SETTING
   7928 
   7929          (?i)            caseless
   7930          (?J)            allow duplicate names
   7931          (?m)            multiline
   7932          (?s)            single line (dotall)
   7933          (?U)            default ungreedy (lazy)
   7934          (?x)            extended (ignore white space)
   7935          (?-...)         unset option(s)
   7936 
   7937        The  following  are  recognized  only at the very start of a pattern or
   7938        after one of the newline or \R options with similar syntax.  More  than
   7939        one of them may appear.
   7940 
   7941          (*LIMIT_MATCH=d) set the match limit to d (decimal number)
   7942          (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
   7943          (*NO_AUTO_POSSESS) no auto-possessification (PCRE_NO_AUTO_POSSESS)
   7944          (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
   7945          (*UTF8)         set UTF-8 mode: 8-bit library (PCRE_UTF8)
   7946          (*UTF16)        set UTF-16 mode: 16-bit library (PCRE_UTF16)
   7947          (*UTF32)        set UTF-32 mode: 32-bit library (PCRE_UTF32)
   7948          (*UTF)          set appropriate UTF mode for the library in use
   7949          (*UCP)          set PCRE_UCP (use Unicode properties for \d etc)
   7950 
   7951        Note  that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of
   7952        the limits set by the caller of pcre_exec(), not increase them.
   7953 
   7954 
   7955 NEWLINE CONVENTION
   7956 
   7957        These are recognized only at the very start of  the  pattern  or  after
   7958        option settings with a similar syntax.
   7959 
   7960          (*CR)           carriage return only
   7961          (*LF)           linefeed only
   7962          (*CRLF)         carriage return followed by linefeed
   7963          (*ANYCRLF)      all three of the above
   7964          (*ANY)          any Unicode newline sequence
   7965 
   7966 
   7967 WHAT \R MATCHES
   7968 
   7969        These  are  recognized  only  at the very start of the pattern or after
   7970        option setting with a similar syntax.
   7971 
   7972          (*BSR_ANYCRLF)  CR, LF, or CRLF
   7973          (*BSR_UNICODE)  any Unicode newline sequence
   7974 
   7975 
   7976 LOOKAHEAD AND LOOKBEHIND ASSERTIONS
   7977 
   7978          (?=...)         positive look ahead
   7979          (?!...)         negative look ahead
   7980          (?<=...)        positive look behind
   7981          (?<!...)        negative look behind
   7982 
   7983        Each top-level branch of a look behind must be of a fixed length.
   7984 
   7985 
   7986 BACKREFERENCES
   7987 
   7988          \n              reference by number (can be ambiguous)
   7989          \gn             reference by number
   7990          \g{n}           reference by number
   7991          \g{-n}          relative reference by number
   7992          \k<name>        reference by name (Perl)
   7993          \k'name'        reference by name (Perl)
   7994          \g{name}        reference by name (Perl)
   7995          \k{name}        reference by name (.NET)
   7996          (?P=name)       reference by name (Python)
   7997 
   7998 
   7999 SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
   8000 
   8001          (?R)            recurse whole pattern
   8002          (?n)            call subpattern by absolute number
   8003          (?+n)           call subpattern by relative number
   8004          (?-n)           call subpattern by relative number
   8005          (?&name)        call subpattern by name (Perl)
   8006          (?P>name)       call subpattern by name (Python)
   8007          \g<name>        call subpattern by name (Oniguruma)
   8008          \g'name'        call subpattern by name (Oniguruma)
   8009          \g<n>           call subpattern by absolute number (Oniguruma)
   8010          \g'n'           call subpattern by absolute number (Oniguruma)
   8011          \g<+n>          call subpattern by relative number (PCRE extension)
   8012          \g'+n'          call subpattern by relative number (PCRE extension)
   8013          \g<-n>          call subpattern by relative number (PCRE extension)
   8014          \g'-n'          call subpattern by relative number (PCRE extension)
   8015 
   8016 
   8017 CONDITIONAL PATTERNS
   8018 
   8019          (?(condition)yes-pattern)
   8020          (?(condition)yes-pattern|no-pattern)
   8021 
   8022          (?(n)...        absolute reference condition
   8023          (?(+n)...       relative reference condition
   8024          (?(-n)...       relative reference condition
   8025          (?(<name>)...   named reference condition (Perl)
   8026          (?('name')...   named reference condition (Perl)
   8027          (?(name)...     named reference condition (PCRE)
   8028          (?(R)...        overall recursion condition
   8029          (?(Rn)...       specific group recursion condition
   8030          (?(R&name)...   specific recursion condition
   8031          (?(DEFINE)...   define subpattern for reference
   8032          (?(assert)...   assertion condition
   8033 
   8034 
   8035 BACKTRACKING CONTROL
   8036 
   8037        The following act as soon as they are reached:
   8038 
   8039          (*ACCEPT)       force successful match
   8040          (*FAIL)         force backtrack; synonym (*F)
   8041          (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
   8042 
   8043        The following act only when a subsequent match failure causes  a  back-
   8044        track to reach them. They all force a match failure, but they differ in
   8045        what happens afterwards. Those that advance the start-of-match point do
   8046        so only if the pattern is not anchored.
   8047 
   8048          (*COMMIT)       overall failure, no advance of starting point
   8049          (*PRUNE)        advance to next starting character
   8050          (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
   8051          (*SKIP)         advance to current matching position
   8052          (*SKIP:NAME)    advance to position corresponding to an earlier
   8053                          (*MARK:NAME); if not found, the (*SKIP) is ignored
   8054          (*THEN)         local failure, backtrack to next alternation
   8055          (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
   8056 
   8057 
   8058 CALLOUTS
   8059 
   8060          (?C)      callout
   8061          (?Cn)     callout with data n
   8062 
   8063 
   8064 SEE ALSO
   8065 
   8066        pcrepattern(3), pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3).
   8067 
   8068 
   8069 AUTHOR
   8070 
   8071        Philip Hazel
   8072        University Computing Service
   8073        Cambridge CB2 3QH, England.
   8074 
   8075 
   8076 REVISION
   8077 
   8078        Last updated: 08 January 2014
   8079        Copyright (c) 1997-2014 University of Cambridge.
   8080 ------------------------------------------------------------------------------
   8081 
   8082 
   8083 PCREUNICODE(3)             Library Functions Manual             PCREUNICODE(3)
   8084 
   8085 
   8086 
   8087 NAME
   8088        PCRE - Perl-compatible regular expressions
   8089 
   8090 UTF-8, UTF-16, UTF-32, AND UNICODE PROPERTY SUPPORT
   8091 
   8092        As well as UTF-8 support, PCRE also supports UTF-16 (from release 8.30)
   8093        and UTF-32 (from release 8.32), by means of two  additional  libraries.
   8094        They can be built as well as, or instead of, the 8-bit library.
   8095 
   8096 
   8097 UTF-8 SUPPORT
   8098 
   8099        In order to process UTF-8 strings, you must build PCRE's 8-bit library
   8100        with UTF support, and, in addition, you must call  pcre_compile()  with
   8101        the  PCRE_UTF8 option flag, or the pattern must start with the sequence
   8102        (*UTF8) or (*UTF). When either of these is the case, both  the  pattern
   8103        and  any  subject  strings  that  are matched against it are treated as
   8104        UTF-8 strings instead of strings of individual 1-byte characters.
   8105 
   8106 
   8107 UTF-16 AND UTF-32 SUPPORT
   8108 
   8109        In order to process UTF-16 or UTF-32 strings, you must build PCRE's 16-bit
   8110        or  32-bit  library  with  UTF support, and, in addition, you must call
   8111        pcre16_compile() or pcre32_compile() with the PCRE_UTF16 or  PCRE_UTF32
   8112        option flag, as appropriate. Alternatively, the pattern must start with
   8113        the sequence (*UTF16), (*UTF32), as appropriate, or (*UTF),  which  can
   8114        be used with either library. When UTF mode is set, both the pattern and
   8115        any subject strings that are matched against it are treated  as  UTF-16
   8116        or  UTF-32  strings  instead  of strings of individual 16-bit or 32-bit
   8117        characters.
   8118 
   8119 
   8120 UTF SUPPORT OVERHEAD
   8121 
   8122        If you compile PCRE with UTF support, but do not use it  at  run  time,
   8123        the  library will be a bit bigger, but the additional run time overhead
   8124        is limited to  testing  the  PCRE_UTF[8|16|32]  flag  occasionally,  so
   8125        should not be very big.
   8126 
   8127 
   8128 UNICODE PROPERTY SUPPORT
   8129 
   8130        If PCRE is built with Unicode character property support (which implies
   8131        UTF support), the escape sequences \p{..}, \P{..}, and \X can be  used.
   8132        The  available properties that can be tested are limited to the general
   8133        category properties such as Lu for an upper case letter  or  Nd  for  a
   8134        decimal number, the Unicode script names such as Arabic or Han, and the
   8135        derived properties Any and L&. Full lists are given in the pcrepattern
   8136        and  pcresyntax  documentation. Only the short names for properties are
   8137        supported. For example, \p{L}  matches  a  letter.  Its  Perl  synonym,
   8138        \p{Letter},  is  not  supported.  Furthermore, in Perl, many properties
   8139        may optionally be prefixed by "Is", for compatibility  with  Perl  5.6.
   8140        PCRE does not support this.
   8141 
   8142    Validity of UTF-8 strings
   8143 
   8144        When  you  set  the PCRE_UTF8 flag, the byte strings passed as patterns
   8145        and subjects are (by default) checked for validity on entry to the
   8146        relevant functions. The entire string is checked before any other
   8147        processing takes place. From release 7.3 of PCRE, the check is according to
   8148        the rules of RFC 3629, which are themselves derived from the Unicode
   8149        specification. Earlier releases of PCRE followed the rules of RFC 2279,
   8150        which  allows  the  full  range of 31-bit values (0 to 0x7FFFFFFF). The
   8151        current check allows only values in the range U+0 to U+10FFFF,  exclud-
   8152        ing  the  surrogate area. (From release 8.33 the so-called "non-charac-
   8153        ter" code points are no longer excluded because Unicode corrigendum  #9
   8154        makes it clear that they should not be.)
   8155 
   8156        Characters  in  the "Surrogate Area" of Unicode are reserved for use by
   8157        UTF-16, where they are used in pairs to encode codepoints  with  values
   8158        greater  than  0xFFFF. The code points that are encoded by UTF-16 pairs
   8159        are available independently in the  UTF-8  and  UTF-32  encodings.  (In
   8160        other  words,  the  whole  surrogate  thing is a fudge for UTF-16 which
   8161        unfortunately messes up UTF-8 and UTF-32.)
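
               As an illustration of how UTF-16 uses the surrogate area, the
               following sketch converts a code point above 0xFFFF to a
               surrogate pair and back. The function names are hypothetical
               helpers written for this example; they are not part of the PCRE
               API.

```c
#include <stdint.h>

/* Encode a code point in the range 0x10000..0x10FFFF as a UTF-16
   surrogate pair. Returns 1 on success, 0 if cp is outside that range.
   Hypothetical helper, not a PCRE function. */
int utf16_encode_pair(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return 0;
    cp -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate: bottom 10 bits */
    return 1;
}

/* Combine a high and a low surrogate back into one code point.
   The caller is assumed to have checked that hi is in 0xD800..0xDBFF
   and lo is in 0xDC00..0xDFFF. */
uint32_t utf16_decode_pair(uint16_t hi, uint16_t lo)
{
    return 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}
```

               For example, U+1F600 is encoded as the pair 0xD83D 0xDE00. The
               same code point needs no such pairing in UTF-8 or UTF-32.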
   8162 
   8163        If an invalid UTF-8 string is passed to PCRE, an error return is given.
   8164        At  compile  time, the only additional information is the offset to the
   8165        first byte of the failing character. The run-time functions pcre_exec()
   8166        and  pcre_dfa_exec() also pass back this information, as well as a more
   8167        detailed reason code if the caller has provided memory in which  to  do
   8168        this.
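
               The following is a minimal sketch of a validity check of the kind
               described above, following the RFC 3629 rules (no overlong forms,
               no surrogates, nothing above U+10FFFF). It is not PCRE's own
               code; utf8_first_invalid() is a hypothetical helper that, like
               the error return described above, reports the offset of the first
               byte of the failing character.

```c
#include <stddef.h>

/* Returns -1 if s[0..len) is valid UTF-8 under RFC 3629, otherwise the
   offset of the first byte of the first invalid character.
   Hypothetical helper, not a PCRE function. */
long utf8_first_invalid(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t extra;           /* continuation bytes expected */
        unsigned long cp, min;  /* decoded value, smallest legal value */

        if (c < 0x80) { i++; continue; }            /* single-byte character */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return (long)i;    /* stray continuation byte, or 0xF8..0xFF */

        if (len - i <= extra)
            return (long)i;     /* character is truncated */
        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return (long)i; /* not a continuation byte */
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }
        if (cp < min)                     return (long)i; /* overlong form */
        if (cp > 0x10FFFF)                return (long)i; /* beyond Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF) return (long)i; /* surrogate area */
        i += extra + 1;
    }
    return -1;
}
```

               In line with the rule from release 8.33, this sketch accepts the
               "non-character" code points such as U+FFFF.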
   8169 
   8170        In  some  situations, you may already know that your strings are valid,
   8171        and therefore want to skip these checks in  order  to  improve  perfor-
   8172        mance,  for  example in the case of a long subject string that is being
   8173        scanned repeatedly.  If you set the PCRE_NO_UTF8_CHECK flag at  compile
   8174        time  or  at  run  time, PCRE assumes that the pattern or subject it is
   8175        given (respectively) contains only valid UTF-8 codes. In this case,  it
   8176        does not diagnose an invalid UTF-8 string.
   8177 
   8178        Note  that  passing  PCRE_NO_UTF8_CHECK to pcre_compile() just disables
   8179        the check for the pattern; it does not also apply to  subject  strings.
   8180        If  you  want  to  disable the check for a subject string you must pass
   8181        this option to pcre_exec() or pcre_dfa_exec().
   8182 
   8183        If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, the
   8184        result is undefined and your program may crash.
   8185 
   8186    Validity of UTF-16 strings
   8187 
   8188        When you set the PCRE_UTF16 flag, the strings of 16-bit data units that
   8189        are passed as patterns and subjects are (by default) checked for valid-
   8190        ity  on entry to the relevant functions. Values other than those in the
   8191        surrogate range U+D800 to U+DFFF are independent code points. Values in
   8192        the surrogate range must be used in pairs in the correct manner.
   8193 
   8194        If  an  invalid  UTF-16  string  is  passed to PCRE, an error return is
   8195        given. At compile time, the only additional information is  the  offset
   8196        to the first data unit of the failing character. The run-time functions
   8197        pcre16_exec() and pcre16_dfa_exec() also pass back this information, as
   8198        well  as  a more detailed reason code if the caller has provided memory
   8199        in which to do this.
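
               The pairing rule for surrogates can be sketched as follows. This
               is an illustration only, not PCRE's implementation, and
               utf16_first_invalid() is a hypothetical name. It returns the
               offset of the first data unit of the failing character.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns -1 if s[0..len) is valid UTF-16, otherwise the offset (in
   16-bit data units) of the first unit of the first invalid character.
   Hypothetical helper, not a PCRE function. */
long utf16_first_invalid(const uint16_t *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        uint16_t u = s[i];
        if (u < 0xD800 || u > 0xDFFF) {   /* independent code point */
            i++;
            continue;
        }
        if (u >= 0xDC00)
            return (long)i;               /* low surrogate with no high before it */
        if (i + 1 >= len)
            return (long)i;               /* high surrogate at end of string */
        if (s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
            return (long)i;               /* high surrogate not followed by low */
        i += 2;                           /* well-formed pair */
    }
    return -1;
}
```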
   8200 
   8201        In some situations, you may already know that your strings  are  valid,
   8202        and  therefore  want  to  skip these checks in order to improve perfor-
   8203        mance. If you set the PCRE_NO_UTF16_CHECK flag at compile  time  or  at
   8204        run time, PCRE assumes that the pattern or subject it is given (respec-
   8205        tively) contains only valid UTF-16 sequences. In this case, it does not
   8206        diagnose  an  invalid  UTF-16 string.  However, if an invalid string is
   8207        passed, the result is undefined.
   8208 
   8209    Validity of UTF-32 strings
   8210 
   8211        When you set the PCRE_UTF32 flag, the strings of 32-bit data units that
   8212        are passed as patterns and subjects are (by default) checked for valid-
   8213        ity on entry to the relevant functions.  This check allows only  values
   8214        in  the  range  U+0 to U+10FFFF, excluding the surrogate area U+D800 to
   8215        U+DFFF.
   8216 
   8217        If an invalid UTF-32 string is passed  to  PCRE,  an  error  return  is
   8218        given.  At  compile time, the only additional information is the offset
   8219        to the first data unit of the failing character. The run-time functions
   8220        pcre32_exec() and pcre32_dfa_exec() also pass back this information, as
   8221        well as a more detailed reason code if the caller has  provided  memory
   8222        in which to do this.
   8223 
   8224        In  some  situations, you may already know that your strings are valid,
   8225        and therefore want to skip these checks in  order  to  improve  perfor-
   8226        mance.  If  you  set the PCRE_NO_UTF32_CHECK flag at compile time or at
   8227        run time, PCRE assumes that the pattern or subject it is given (respec-
   8228        tively) contains only valid UTF-32 sequences. In this case, it does not
   8229        diagnose an invalid UTF-32 string.  However, if an  invalid  string  is
   8230        passed, the result is undefined.
   8231 
   8232    General comments about UTF modes
   8233 
   8234        1.  Codepoints  less  than  256  can be specified in patterns by either
   8235        braced or unbraced hexadecimal escape sequences (for example, \x{b3} or
   8236        \xb3). Larger values have to use braced sequences.
   8237 
   8238        2.  Octal  numbers  up  to  \777 are recognized, and in UTF-8 mode they
   8239        match two-byte characters for values greater than \177.
   8240 
   8241        3. Repeat quantifiers apply to complete UTF characters, not to individ-
   8242        ual data units, for example: \x{100}{3}.
   8243 
   8244        4.  The dot metacharacter matches one UTF character instead of a single
   8245        data unit.
   8246 
   8247        5. The escape sequence \C can be used to match a single byte  in  UTF-8
   8248        mode,  or  a single 16-bit data unit in UTF-16 mode, or a single 32-bit
   8249        data unit in UTF-32 mode, but its use can lead to some strange  effects
   8250        because  it  breaks up multi-unit characters (see the description of \C
   8251        in the pcrepattern documentation). The use of \C is  not  supported  in
   8252        the  alternative  matching  function  pcre[16|32]_dfa_exec(), nor is it
   8253        supported in UTF mode by the JIT optimization of pcre[16|32]_exec(). If
   8254        JIT  optimization  is  requested for a UTF pattern that contains \C, it
   8255        will not succeed, and so the matching will be carried out by the normal
   8256        interpretive function.
   8257 
   8258        6.  The  character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
   8259        test characters of any code value, but, by default, the characters that
   8260        PCRE  recognizes  as digits, spaces, or word characters remain the same
   8261        set as in non-UTF mode, all with values less  than  256.  This  remains
   8262        true  even  when  PCRE  is  built  to include Unicode property support,
   8263        because to do otherwise would slow down PCRE in many common cases. Note
   8264        in  particular that this applies to \b and \B, because they are defined
   8265        in terms of \w and \W. If you really want to test for a wider sense of,
   8266        say,  "digit",  you  can  use  explicit  Unicode property tests such as
   8267        \p{Nd}. Alternatively, if you set the PCRE_UCP option, the way that the
   8268        character  escapes  work is changed so that Unicode properties are used
   8269        to determine which characters match. There are more details in the sec-
   8270        tion on generic character types in the pcrepattern documentation.
   8271 
   8272        7.  Similarly,  characters that match the POSIX named character classes
   8273        are all low-valued characters, unless the PCRE_UCP option is set.
   8274 
   8275        8. However, the horizontal and vertical white  space  matching  escapes
   8276        (\h,  \H,  \v, and \V) do match all the appropriate Unicode characters,
   8277        whether or not PCRE_UCP is set.
   8278 
   8279        9. Case-insensitive matching applies only to  characters  whose  values
   8280        are  less than 128, unless PCRE is built with Unicode property support.
   8281        A few Unicode characters such as Greek sigma have more than  two  code-
   8282        points that are case-equivalent. Up to and including PCRE release 8.31,
   8283        only one-to-one case mappings were supported, but later releases  (with
   8284        Unicode  property  support) do treat as case-equivalent all versions of
   8285        characters such as Greek sigma.
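
               Items 3 to 5 above turn on the difference between a character and
               a data unit. As a rough illustration for UTF-8, the sketch below
               counts characters in a string that is assumed to be valid UTF-8
               already; utf8_char_count() is a hypothetical helper, not a PCRE
               function.

```c
#include <stddef.h>

/* Count the characters in s[0..len), which is assumed to be valid UTF-8.
   Every byte that is not a continuation byte (10xxxxxx) starts a new
   character. Hypothetical helper, not a PCRE function. */
size_t utf8_char_count(const unsigned char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

               A subject that matches \x{100}{3} consists of the bytes C4 80 C4
               80 C4 80: six data units, but three characters, and the
               quantifier counts the characters.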
   8286 
   8287 
   8288 AUTHOR
   8289 
   8290        Philip Hazel
   8291        University Computing Service
   8292        Cambridge CB2 3QH, England.
   8293 
   8294 
   8295 REVISION
   8296 
   8297        Last updated: 27 February 2013
   8298        Copyright (c) 1997-2013 University of Cambridge.
   8299 ------------------------------------------------------------------------------
   8300 
   8301 
   8302 PCREJIT(3)                 Library Functions Manual                 PCREJIT(3)
   8303 
   8304 
   8305 
   8306 NAME
   8307        PCRE - Perl-compatible regular expressions
   8308 
   8309 PCRE JUST-IN-TIME COMPILER SUPPORT
   8310 
   8311        Just-in-time  compiling  is a heavyweight optimization that can greatly
   8312        speed up pattern matching. However, it comes at the cost of extra  pro-
   8313        cessing before the match is performed. Therefore, it is of most benefit
   8314        when the same pattern is going to be matched many times. This does  not
   8315        necessarily  mean  many calls of a matching function; if the pattern is
   8316        not anchored, matching attempts may take place many  times  at  various
   8317        positions  in  the  subject, even for a single call.  Therefore, if the
   8318        subject string is very long, it may still pay to use  JIT  for  one-off
   8319        matches.
   8320 
   8321        JIT  support  applies  only to the traditional Perl-compatible matching
   8322        function.  It does not apply when the DFA matching  function  is  being
   8323        used. The code for this support was written by Zoltan Herczeg.
   8324 
   8325 
   8326 8-BIT, 16-BIT AND 32-BIT SUPPORT
   8327 
   8328        JIT  support  is available for all of the 8-bit, 16-bit and 32-bit PCRE
   8329        libraries. To keep this documentation simple, only the 8-bit  interface
   8330        is described in what follows. If you are using the 16-bit library, sub-
   8331        stitute the  16-bit  functions  and  16-bit  structures  (for  example,
   8332        pcre16_jit_stack  instead  of  pcre_jit_stack).  If  you  are using the
   8333        32-bit library, substitute the 32-bit functions and  32-bit  structures
   8334        (for example, pcre32_jit_stack instead of pcre_jit_stack).
   8335 
   8336 
   8337 AVAILABILITY OF JIT SUPPORT
   8338 
   8339        JIT  support  is  an  optional  feature of PCRE. The "configure" option
   8340        --enable-jit (or equivalent CMake option) must  be  set  when  PCRE  is
   8341        built  if  you want to use JIT. The support is limited to the following
   8342        hardware platforms:
   8343 
   8344          ARM v5, v7, and Thumb2
   8345          Intel x86 32-bit and 64-bit
   8346          MIPS 32-bit
   8347          Power PC 32-bit and 64-bit
   8348          SPARC 32-bit (experimental)
   8349 
   8350        If --enable-jit is set on an unsupported platform, compilation fails.
   8351 
   8352        A program that is linked with PCRE 8.20 or later can tell if  JIT  sup-
   8353        port  is  available  by  calling pcre_config() with the PCRE_CONFIG_JIT
   8354        option. The result is 1 when JIT is available, and  0  otherwise.  How-
   8355        ever, a simple program does not need to check this in order to use JIT.
   8356        The normal API is implemented in a way that falls back to the interpre-
   8357        tive code if JIT is not available. For programs that need the best pos-
   8358        sible performance, there is also a "fast path"  API  that  is  JIT-spe-
   8359        cific.
   8360 
   8361        If  your program may sometimes be linked with versions of PCRE that are
   8362        older than 8.20, but you want to use JIT when it is available, you  can
   8363        test the values of PCRE_MAJOR and PCRE_MINOR, or the existence of a JIT
   8364        macro such as PCRE_CONFIG_JIT, for compile-time control of your code.
   8365 
   8366 
   8367 SIMPLE USE OF JIT
   8368 
   8369        You have to do two things to make use of the JIT support  in  the  sim-
   8370        plest way:
   8371 
   8372          (1) Call pcre_study() with the PCRE_STUDY_JIT_COMPILE option for
   8373              each compiled pattern, and pass the resulting pcre_extra block to
   8374              pcre_exec().
   8375 
   8376          (2) Use pcre_free_study() to free the pcre_extra block when it is
   8377              no longer needed, instead of just freeing it
   8378              yourself. This ensures that any JIT data is
   8379              also freed.
   8380 
   8381        For a program that may be linked with pre-8.20 versions  of  PCRE,  you
   8382        can insert
   8383 
   8384          #ifndef PCRE_STUDY_JIT_COMPILE
   8385          #define PCRE_STUDY_JIT_COMPILE 0
   8386          #endif
   8387 
   8388        so  that  no  option  is passed to pcre_study(), and then use something
   8389        like this to free the study data:
   8390 
   8391          #ifdef PCRE_CONFIG_JIT
   8392              pcre_free_study(study_ptr);
   8393          #else
   8394              pcre_free(study_ptr);
   8395          #endif
   8396 
   8397        PCRE_STUDY_JIT_COMPILE requests the JIT compiler to generate  code  for
   8398        complete  matches.  If  you  want  to  run  partial  matches  using the
   8399        PCRE_PARTIAL_HARD or  PCRE_PARTIAL_SOFT  options  of  pcre_exec(),  you
   8400        should  set  one  or  both  of the following options in addition to, or
   8401        instead of, PCRE_STUDY_JIT_COMPILE when you call pcre_study():
   8402 
   8403          PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
   8404          PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
   8405 
   8406        The JIT compiler generates different optimized code  for  each  of  the
   8407        three  modes  (normal, soft partial, hard partial). When pcre_exec() is
   8408        called, the appropriate code is run if it is available. Otherwise,  the
   8409        pattern is matched using interpretive code.
   8410 
   8411        In  some circumstances you may need to call additional functions. These
   8412        are described in the  section  entitled  "Controlling  the  JIT  stack"
   8413        below.
   8414 
   8415        If  JIT  support  is  not  available,  PCRE_STUDY_JIT_COMPILE  etc. are
   8416        ignored, and no JIT data is created. Otherwise, the compiled pattern is
   8417        passed  to the JIT compiler, which turns it into machine code that exe-
   8418        cutes much faster than the normal interpretive code.  When  pcre_exec()
   8419        is  passed  a  pcre_extra block containing a pointer to JIT code of the
   8420        appropriate mode (normal or hard/soft  partial),  it  obeys  that  code
   8421        instead  of  running  the interpreter. The result is identical, but the
   8422        compiled JIT code runs much faster.
   8423 
   8424        There are some pcre_exec() options that are not supported for JIT  exe-
   8425        cution.  There  are  also  some  pattern  items that JIT cannot handle.
   8426        Details are given below. In both cases, execution  automatically  falls
   8427        back  to  the  interpretive  code.  If you want to know whether JIT was
   8428        actually used for a particular match, you  should  arrange  for  a  JIT
   8429        callback  function  to  be  set up as described in the section entitled
   8430        "Controlling the JIT stack" below, even if you do not need to supply  a
   8431        non-default  JIT stack. Such a callback function is called whenever JIT
   8432        code is about to be obeyed. If the execution options are not right  for
   8433        JIT execution, the callback function is not obeyed.
   8434 
   8435        If  the  JIT  compiler finds an unsupported item, no JIT data is gener-
   8436        ated. You can find out if JIT execution is available after  studying  a
   8437        pattern  by  calling  pcre_fullinfo()  with the PCRE_INFO_JIT option. A
   8438        result of 1 means that JIT compilation was successful. A  result  of  0
   8439        means that JIT support is not available, or the pattern was not studied
   8440        with PCRE_STUDY_JIT_COMPILE etc., or the JIT compiler was not  able  to
   8441        handle the pattern.
   8442 
   8443        Once a pattern has been studied, with or without JIT, it can be used as
   8444        many times as you like for matching different subject strings.
   8445 
   8446 
   8447 UNSUPPORTED OPTIONS AND PATTERN ITEMS
   8448 
   8449        The only pcre_exec() options that are supported for JIT  execution  are
   8450        PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK, PCRE_NO_UTF32_CHECK, PCRE_NOT-
   8451        BOL,  PCRE_NOTEOL,  PCRE_NOTEMPTY,   PCRE_NOTEMPTY_ATSTART,   PCRE_PAR-
   8452        TIAL_HARD, and PCRE_PARTIAL_SOFT.
   8453 
   8454        The  only  unsupported  pattern items are \C (match a single data unit)
   8455        when running in a UTF mode, and a callout immediately before an  asser-
   8456        tion condition in a conditional group.
   8457 
   8458 
   8459 RETURN VALUES FROM JIT EXECUTION
   8460 
   8461        When  a  pattern  is matched using JIT execution, the return values are
   8462        the same as those given by the interpretive pcre_exec() code, with  the
   8463        addition  of  one new error code: PCRE_ERROR_JIT_STACKLIMIT. This means
   8464        that the memory used for the JIT stack was insufficient. See  "Control-
   8465        ling the JIT stack" below for a discussion of JIT stack usage. For com-
   8466        patibility with the interpretive pcre_exec() code, no  more  than  two-
   8467        thirds  of  the ovector argument is used for passing back captured sub-
   8468        strings.
   8469 
   8470        The error code PCRE_ERROR_MATCHLIMIT is returned by  the  JIT  code  if
   8471        searching  a  very large pattern tree goes on for too long, as it is in
   8472        the same circumstance when JIT is not used, but the details of  exactly
   8473        what  is  counted are not the same. The PCRE_ERROR_RECURSIONLIMIT error
   8474        code is never returned by JIT execution.
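
               As a rough sketch of how a caller might act on these return
               values, the fragment below maps a result to a short description
               and works out how many offset pairs an ovector can hold. The
               function names are hypothetical. The numeric codes are copied
               from pcre.h only so that the fragment compiles on its own; real
               code should include <pcre.h> and use the symbolic names.

```c
#include <assert.h>

/* Values as defined in pcre.h; repeated here only so that this sketch
   compiles without the library headers. */
#ifndef PCRE_ERROR_NOMATCH
#define PCRE_ERROR_NOMATCH         (-1)
#define PCRE_ERROR_MATCHLIMIT      (-8)
#define PCRE_ERROR_JIT_STACKLIMIT  (-27)
#endif

/* Hypothetical helper: describe the value returned by pcre_exec(). */
const char *describe_result(int rc)
{
  if (rc > 0) return "match";
  if (rc == 0) return "match, but ovector too small for all substrings";
  switch (rc)
    {
    case PCRE_ERROR_NOMATCH:        return "no match";
    case PCRE_ERROR_MATCHLIMIT:     return "match limit reached";
    case PCRE_ERROR_JIT_STACKLIMIT: return "JIT stack too small";
    default:                        return "other error";
    }
}

/* Only the first two-thirds of ovector is used for captured substrings,
   so an ovector of ovecsize ints has room for ovecsize/3 offset pairs. */
int capture_pairs(int ovecsize)
{
  assert(ovecsize >= 0);
  return ovecsize / 3;
}
```

               With the 30-element ovector used under EXAMPLE CODE below,
               capture_pairs(30) is 10. After PCRE_ERROR_JIT_STACKLIMIT, one
               possible reaction is to assign a larger JIT stack and run the
               match again.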
   8475 
   8476 
   8477 SAVING AND RESTORING COMPILED PATTERNS
   8478 
   8479        The code that is generated by the  JIT  compiler  is  architecture-spe-
   8480        cific,  and  is also position dependent. For those reasons it cannot be
   8481        saved (in a file or database) and restored later like the bytecode  and
   8482        other  data  of  a compiled pattern. Saving and restoring compiled pat-
   8483        terns is not something many people do. More detail about this  facility
   8484        is  given in the pcreprecompile documentation. It should be possible to
   8485        run pcre_study() on a saved and restored pattern, and thereby  recreate
   8486        the  JIT  data, but because JIT compilation uses significant resources,
   8487        it is probably not worth doing this; you might as  well  recompile  the
   8488        original pattern.
   8489 
   8490 
   8491 CONTROLLING THE JIT STACK
   8492 
   8493        When the compiled JIT code runs, it needs a block of memory to use as a
   8494        stack.  By default, it uses 32K on the  machine  stack.  However,  some
   8495        large   or   complicated  patterns  need  more  than  this.  The  error
   8496        PCRE_ERROR_JIT_STACKLIMIT is given when  there  is  not  enough  stack.
   8497        Three  functions  are provided for managing blocks of memory for use as
   8498        JIT stacks. There is further discussion about the use of JIT stacks  in
   8499        the section entitled "JIT stack FAQ" below.
   8500 
   8501        The  pcre_jit_stack_alloc() function creates a JIT stack. Its arguments
   8502        are a starting size and a maximum size, and it returns a pointer to  an
   8503        opaque  structure of type pcre_jit_stack, or NULL if there is an error.
   8504        The pcre_jit_stack_free() function can be used to free a stack that  is
   8505        no  longer  needed.  (For  the technically minded: the address space is
   8506        allocated by mmap or VirtualAlloc.)
   8507 
   8508        JIT uses far less memory for recursion than the interpretive code,  and
   8509        a  maximum  stack size of 512K to 1M should be more than enough for any
   8510        pattern.
   8511 
   8512        The pcre_assign_jit_stack() function specifies  which  stack  JIT  code
   8513        should use. Its arguments are as follows:
   8514 
   8515          pcre_extra         *extra
   8516          pcre_jit_callback  callback
   8517          void               *data
   8518 
   8519        The  extra  argument  must  be  the  result  of studying a pattern with
   8520        PCRE_STUDY_JIT_COMPILE etc. There are three cases for the values of the
   8521        other two arguments:
   8522 
   8523          (1) If callback is NULL and data is NULL, an internal 32K block
   8524              on the machine stack is used.
   8525 
   8526          (2) If callback is NULL and data is not NULL, data must be
   8527              a valid JIT stack, the result of calling pcre_jit_stack_alloc().
   8528 
   8529          (3) If callback is not NULL, it must point to a function that is
   8530              called with data as an argument at the start of matching, in
   8531              order to set up a JIT stack. If the return from the callback
   8532              function is NULL, the internal 32K stack is used; otherwise the
   8533              return value must be a valid JIT stack, the result of calling
   8534              pcre_jit_stack_alloc().
   8535 
   8536        A  callback function is obeyed whenever JIT code is about to be run; it
   8537        is not obeyed when pcre_exec() is called with options that  are  incom-
   8538        patible for JIT execution. A callback function can therefore be used to
   8539        determine whether a match operation was  executed  by  JIT  or  by  the
   8540        interpreter.
   8541 
   8542        You may safely use the same JIT stack for more than one pattern (either
   8543        by assigning directly or by callback), as long as the patterns are  all
   8544        matched  sequentially in the same thread. In a multithread application,
   8545        if you do not specify a JIT stack, or if you assign or pass  back  NULL
   8546        from  a  callback, that is thread-safe, because each thread has its own
   8547        machine stack. However, if you assign  or  pass  back  a  non-NULL  JIT
   8548        stack,  this  must  be  a  different  stack for each thread so that the
   8549        application is thread-safe.
   8550 
   8551        Strictly speaking, even more is allowed. You can assign the  same  non-
   8552        NULL  stack  to any number of patterns as long as they are not used for
   8553        matching by multiple threads at the same time.  For  example,  you  can
   8554        assign  the same stack to all compiled patterns, and use a global mutex
   8555        in the callback to wait until the stack is available for use.  However,
   8556        this is an inefficient solution, and not recommended.
   8557 
   8558        This  is a suggestion for how a multithreaded program that needs to set
   8559        up non-default JIT stacks might operate:
   8560 
   8561          During thread initialization
   8562            thread_local_var = pcre_jit_stack_alloc(...)
   8563 
   8564          During thread exit
   8565            pcre_jit_stack_free(thread_local_var)
   8566 
   8567          Use a one-line callback function
   8568            return thread_local_var
   8569 
   8570        All the functions described in this section do nothing if  JIT  is  not
   8571        available,  and  pcre_assign_jit_stack()  does nothing unless the extra
   8572        argument is non-NULL and points to  a  pcre_extra  block  that  is  the
   8573        result of a successful study with PCRE_STUDY_JIT_COMPILE etc.
   8574 
   8575 
   8576 JIT STACK FAQ
   8577 
   8578        (1) Why do we need JIT stacks?
   8579 
   8580        PCRE  (and JIT) is a recursive, depth-first engine, so it needs a stack
   8581        where the local data of the current node is pushed before checking  its
   8582        child nodes. Allocating real machine stack is difficult on some
   8583        platforms. For example, on PowerPC the stack chain must be updated
   8584        every time the stack is extended. Although this is possible, the
   8585        overhead of updating it reduces performance, so we do the recursion in memory.
   8586 
   8587        (2) Why don't we simply allocate blocks of memory with malloc()?
   8588 
   8589        Modern operating systems have a  nice  feature:  they  can  reserve  an
   8590        address space instead of allocating memory. We can safely allocate mem-
   8591        ory pages inside this address space, so the stack  could  grow  without
   8592        moving memory data (this is important because of pointers). Thus we can
   8593        allocate 1M address space, and use only a single memory  page  (usually
   8594        4K)  if  that is enough. However, we can still grow up to 1M anytime if
   8595        needed.
   8596 
   8597        (3) Who "owns" a JIT stack?
   8598 
   8599        The owner of the stack is the user program, not the JIT studied pattern
   8600        or anything else. The user program must ensure that if a stack is used
   8601        by pcre_exec() (that is, it is assigned to the pattern currently
   8602        running), that stack is not used by any other thread (to avoid
   8603        overwriting the same memory area). The best practice for multithreaded
   8604        programs is to allocate a stack for each thread, and return this stack
   8605        through the JIT callback function.
   8606 
   8607        (4) When should a JIT stack be freed?
   8608 
   8609        You can free a JIT stack at any time, as long as it will not be used by
   8610        pcre_exec()  again.  When  you  assign  the  stack to a pattern, only a
   8611        pointer is set. There is no reference counting or any other magic.  You
   8612        can  free  the  patterns  and stacks in any order, anytime. Just do not
   8613        call pcre_exec() with a pattern pointing to an already freed stack,  as
   8614        that will cause a segmentation fault. (Also, do not free a stack that
   8615        is currently in use by pcre_exec() in another thread.) You can also
   8616        replace the stack for a pattern at any time. You can even free the
   8617        previous stack before assigning a replacement.
   8618 
   8619        (5) Should I allocate/free a  stack  every  time  before/after  calling
   8620        pcre_exec()?
   8621 
   8622        No,  because  this  is  too  costly in terms of resources. However, you
   8623        could implement a scheme that releases the stack if it has not been
   8624        used for, say, two minutes. The JIT callback can help to achieve this
   8625        without keeping a list of the currently JIT studied patterns.
   8626 
   8627        (6) OK, the stack is for long term memory allocation. But what  happens
   8628        if  a pattern causes stack overflow with a stack of 1M? Is that 1M kept
   8629        until the stack is freed?
   8630 
   8631        Especially on embedded systems, it might be a good idea to release
   8632        memory sometimes without freeing the stack. There is no API for this at
   8633        the moment. A function that returns the amount of memory currently
   8634        allocated for a stack, and another that releases memory (shrinking the
   8635        stack), would probably be a good idea if someone needs this.
   8636 
   8637        (7) This is too much of a headache. Isn't there any better solution for
   8638        JIT stack handling?
   8639 
   8640        No,  thanks to Windows. If POSIX threads were used everywhere, we could
   8641        throw out this complicated API.
   8642 
   8643 
   8644 EXAMPLE CODE
   8645 
   8646        This is a single-threaded example that specifies a  JIT  stack  without
   8647        using a callback.
   8648 
   8649          int rc;
   8650          int ovector[30];
   8651          pcre *re;
   8652          pcre_extra *extra;
   8653          pcre_jit_stack *jit_stack;
   8654 
   8655          re = pcre_compile(pattern, 0, &error, &erroffset, NULL);
   8656          /* Check for errors */
   8657          extra = pcre_study(re, PCRE_STUDY_JIT_COMPILE, &error);
   8658          jit_stack = pcre_jit_stack_alloc(32*1024, 512*1024);
   8659          /* Check for error (NULL) */
   8660          pcre_assign_jit_stack(extra, NULL, jit_stack);
   8661          rc = pcre_exec(re, extra, subject, length, 0, 0, ovector, 30);
   8662          /* Check results */
   8663          pcre_free(re);
   8664          pcre_free_study(extra);
   8665          pcre_jit_stack_free(jit_stack);
   8666 
   8667 
   8668 JIT FAST PATH API
   8669 
   8670        Because  the  API  described  above falls back to interpreted execution
   8671        when JIT is not available, it is convenient for programs that are writ-
   8672        ten  for  general  use  in  many environments. However, calling JIT via
   8673        pcre_exec() does have a performance impact. Programs that  are  written
   8674        for  use  where  JIT  is known to be available, and which need the best
   8675        possible performance, can instead use a "fast path"  API  to  call  JIT
   8676        execution  directly  instead of calling pcre_exec() (obviously only for
   8677        patterns that have been successfully studied by JIT).
   8678 
   8679        The fast path function is called pcre_jit_exec(), and it takes  exactly
   8680        the  same  arguments  as pcre_exec(), plus one additional argument that
   8681        must point to a JIT stack. The JIT stack arrangements  described  above
   8682        do not apply. The return values are the same as for pcre_exec().
   8683 
   8684        When  you  call  pcre_exec(), as well as testing for invalid options, a
   8685        number of other sanity checks are performed on the arguments. For exam-
   8686        ple,  if  the  subject  pointer  is NULL, or its length is negative, an
   8687        immediate error is given. Also, unless PCRE_NO_UTF[8|16|32] is  set,  a
   8688        UTF  subject  string is tested for validity. In the interests of speed,
   8689        these checks do not happen on the JIT fast path, and if invalid data is
   8690        passed, the result is undefined.
   8691 
   8692        Bypassing  the  sanity  checks  and  the  pcre_exec() wrapping can give
   8693        speedups of more than 10%.
   8694 
   8695 
   8696 SEE ALSO
   8697 
   8698        pcreapi(3)
   8699 
   8700 
   8701 AUTHOR
   8702 
   8703        Philip Hazel (FAQ by Zoltan Herczeg)
   8704        University Computing Service
   8705        Cambridge CB2 3QH, England.
   8706 
   8707 
   8708 REVISION
   8709 
   8710        Last updated: 17 March 2013
   8711        Copyright (c) 1997-2013 University of Cambridge.
   8712 ------------------------------------------------------------------------------
   8713 
   8714 
   8715 PCREPARTIAL(3)             Library Functions Manual             PCREPARTIAL(3)
   8716 
   8717 
   8718 
   8719 NAME
   8720        PCRE - Perl-compatible regular expressions
   8721 
   8722 PARTIAL MATCHING IN PCRE
   8723 
   8724        In normal use of PCRE, if the subject string that is passed to a match-
   8725        ing function matches as far as it goes, but is too short to  match  the
   8726        entire pattern, PCRE_ERROR_NOMATCH is returned. There are circumstances
   8727        where it might be helpful to distinguish this case from other cases  in
   8728        which there is no match.
   8729 
   8730        Consider, for example, an application where a human is required to type
   8731        in data for a field with specific formatting requirements.  An  example
   8732        might be a date in the form ddmmmyy, defined by this pattern:
   8733 
   8734          ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
   8735 
   8736        If the application sees the user's keystrokes one by one, and can check
   8737        that what has been typed so far is potentially valid,  it  is  able  to
   8738        raise  an  error  as  soon  as  a  mistake  is made, by beeping and not
   8739        reflecting the character that has been typed, for example. This immedi-
   8740        ate  feedback is likely to be a better user interface than a check that
   8741        is delayed until the entire string has been entered.  Partial  matching
   8742        can  also be useful when the subject string is very long and is not all
   8743        available at once.
   8744 
   8745        PCRE supports partial matching by means of  the  PCRE_PARTIAL_SOFT  and
   8746        PCRE_PARTIAL_HARD  options,  which  can  be set when calling any of the
   8747        matching functions. For backwards compatibility, PCRE_PARTIAL is a syn-
   8748        onym  for  PCRE_PARTIAL_SOFT.  The essential difference between the two
   8749        options is whether or not a partial match is preferred to  an  alterna-
   8750        tive complete match, though the details differ between the two types of
   8751        matching function. If both options  are  set,  PCRE_PARTIAL_HARD  takes
   8752        precedence.
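
               For the keystroke-by-keystroke case described above, the return
               value of the matching function is enough to decide what to do
               with the field. The following is a minimal sketch, assuming the
               text typed so far has been matched against an anchored pattern
               with PCRE_PARTIAL_SOFT set. The enum and the function name are
               hypothetical, and the numeric codes are copied from pcre.h so
               that the fragment stands alone.

```c
/* Values as defined in pcre.h; real code should include <pcre.h>. */
#ifndef PCRE_ERROR_NOMATCH
#define PCRE_ERROR_NOMATCH  (-1)
#define PCRE_ERROR_PARTIAL  (-12)
#endif

enum field_state {
  FIELD_COMPLETE,    /* the whole pattern matched */
  FIELD_INCOMPLETE,  /* valid so far, more input is needed */
  FIELD_INVALID,     /* what has been typed cannot lead to a match */
  FIELD_ERROR        /* some other PCRE error */
};

/* Hypothetical helper: classify the result of a partial-matching call. */
enum field_state field_state_from_rc(int rc)
{
  if (rc >= 0) return FIELD_COMPLETE;
  if (rc == PCRE_ERROR_PARTIAL) return FIELD_INCOMPLETE;
  if (rc == PCRE_ERROR_NOMATCH) return FIELD_INVALID;
  return FIELD_ERROR;
}
```

               An application could reject the keystroke on FIELD_INVALID and
               enable its "accept" action only on FIELD_COMPLETE.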
   8753 
   8754        If  you  want to use partial matching with just-in-time optimized code,
   8755        you must call pcre_study(), pcre16_study() or  pcre32_study() with  one
   8756        or both of these options:
   8757 
   8758          PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
   8759          PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
   8760 
   8761        PCRE_STUDY_JIT_COMPILE  should also be set if you are going to run non-
   8762        partial matches on the same pattern. If the appropriate JIT study  mode
   8763        has not been set for a match, the interpretive matching code is used.
   8764 
   8765        Setting a partial matching option disables two of PCRE's standard opti-
   8766        mizations. PCRE remembers the last literal data unit in a pattern,  and
   8767        abandons  matching  immediately  if  it  is  not present in the subject
   8768        string. This optimization cannot be used  for  a  subject  string  that
   8769        might  match only partially. If the pattern was studied, PCRE knows the
   8770        minimum length of a matching string, and does not  bother  to  run  the
   8771        matching  function  on  shorter strings. This optimization is also dis-
   8772        abled for partial matching.
   8773 
   8774 
   8775 PARTIAL MATCHING USING pcre_exec() OR pcre[16|32]_exec()
   8776 
   8777        A  partial   match   occurs   during   a   call   to   pcre_exec()   or
   8778        pcre[16|32]_exec()  when  the end of the subject string is reached suc-
   8779        cessfully, but matching cannot continue  because  more  characters  are
   8780        needed.   However, at least one character in the subject must have been
   8781        inspected. This character need not  form  part  of  the  final  matched
   8782        string;  lookbehind  assertions and the \K escape sequence provide ways
   8783        of inspecting characters before the start of a matched  substring.  The
   8784        requirement  for  inspecting  at  least one character exists because an
   8785        empty string can always be matched; without such  a  restriction  there
   8786        would  always  be  a partial match of an empty string at the end of the
   8787        subject.
   8788 
   8789        If there are at least two slots in the offsets vector  when  a  partial
   8790        match  is returned, the first slot is set to the offset of the earliest
   8791        character that was inspected. For convenience, the second offset points
   8792        to the end of the subject so that a substring can easily be identified.
   8793        If there are at least three slots in the offsets vector, the third slot
   8794        is set to the offset of the character where matching started.
   8795 
   8796        For the majority of patterns, the contents of the first and third slots
   8797        will be the same. However, for patterns that contain lookbehind  asser-
   8798        tions, or begin with \b or \B, characters before the one where matching
   8799        started may have been inspected while carrying out the match. For exam-
   8800        ple, consider this pattern:
   8801 
   8802          /(?<=abc)123/
   8803 
   8804        This pattern matches "123", but only if it is preceded by "abc". If the
   8805        subject string is "xyzabc12", the first two  offsets  after  a  partial
   8806        match  are for the substring "abc12", because all these characters were
   8807        inspected. However, the third offset is set to 6, because that  is  the
   8808        offset where matching began.
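
               The three slots can be read back as in the sketch below. The
               struct and both functions are hypothetical helpers, not part of
               PCRE; they assume that the matching function has just returned
               PCRE_ERROR_PARTIAL and that ovecsize is the size that was
               passed to it.

```c
#include <stdio.h>

struct partial_span {
  int inspected_start;  /* ovector[0]: earliest character inspected */
  int inspected_end;    /* ovector[1]: end of the subject */
  int match_start;      /* ovector[2]: where matching started, or -1 */
};

/* Copy the partial match offsets out of ovector. Returns 0 if there are
   fewer than two slots, in which case nothing is available. */
int read_partial_span(const int *ovector, int ovecsize,
                      struct partial_span *out)
{
  if (ovecsize < 2) return 0;
  out->inspected_start = ovector[0];
  out->inspected_end = ovector[1];
  out->match_start = (ovecsize >= 3)? ovector[2] : -1;
  return 1;
}

/* Print the inspected substring and, if known, the starting offset. */
void print_partial_span(const char *subject, const struct partial_span *p)
{
  printf("inspected: %.*s\n", p->inspected_end - p->inspected_start,
    subject + p->inspected_start);
  if (p->match_start >= 0)
    printf("match started at offset %d\n", p->match_start);
}
```

               For the subject "xyzabc12" and the offsets 3, 8, and 6 from the
               example above, the inspected substring is "abc12" and the match
               start is 6.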
   8809 
   8810        What happens when a partial match is identified depends on which of the
   8811        two partial matching options is set.
   8812 
   8813    PCRE_PARTIAL_SOFT WITH pcre_exec() OR pcre[16|32]_exec()
   8814 
   8815        If PCRE_PARTIAL_SOFT is  set  when  pcre_exec()  or  pcre[16|32]_exec()
   8816        identifies a partial match, the partial match is remembered, but match-
   8817        ing continues as normal, and other  alternatives  in  the  pattern  are
   8818        tried.  If  no  complete  match  can  be  found,  PCRE_ERROR_PARTIAL is
   8819        returned instead of PCRE_ERROR_NOMATCH.
   8820 
   8821        This option is "soft" because it prefers a complete match over  a  par-
   8822        tial  match.   All the various matching items in a pattern behave as if
   8823        the subject string is potentially complete. For example, \z, \Z, and  $
   8824        match  at  the end of the subject, as normal, and for \b and \B the end
   8825        of the subject is treated as a non-alphanumeric.
   8826 
   8827        If there is more than one partial match, the first one that  was  found
   8828        provides the data that is returned. Consider this pattern:
   8829 
   8830          /123\w+X|dogY/
   8831 
   8832        If  this is matched against the subject string "abc123dog", both alter-
   8833        natives fail to match, but the end of the  subject  is  reached  during
   8834        matching,  so  PCRE_ERROR_PARTIAL is returned. The offsets are set to 3
   8835        and 9, identifying "123dog" as the first partial match that was  found.
   8836        (In  this  example, there are two partial matches, because "dog" on its
   8837        own partially matches the second alternative.)
   8838 
   8839    PCRE_PARTIAL_HARD WITH pcre_exec() OR pcre[16|32]_exec()
   8840 
   8841        If PCRE_PARTIAL_HARD is  set  for  pcre_exec()  or  pcre[16|32]_exec(),
   8842        PCRE_ERROR_PARTIAL  is  returned  as  soon as a partial match is found,
   8843        without continuing to search for possible complete matches. This option
   8844        is "hard" because it prefers an earlier partial match over a later com-
   8845        plete match. For this reason, the assumption is made that  the  end  of
   8846        the  supplied  subject  string may not be the true end of the available
   8847        data, and so, if \z, \Z, \b, \B, or $ are encountered at the end of the
   8848        subject,  the  result is PCRE_ERROR_PARTIAL, provided that at least one
   8849        character in the subject has been inspected.
   8850 
   8851        Setting PCRE_PARTIAL_HARD also affects the way UTF-8 and UTF-16 subject
   8852        strings  are checked for validity. Normally, an invalid sequence causes
   8853        the error PCRE_ERROR_BADUTF8 or PCRE_ERROR_BADUTF16.  However,  in  the
   8854        special  case  of  a  truncated  character  at  the end of the subject,
   8855        PCRE_ERROR_SHORTUTF8  or   PCRE_ERROR_SHORTUTF16   is   returned   when
   8856        PCRE_PARTIAL_HARD is set.
   8857 
   8858    Comparing hard and soft partial matching
   8859 
   8860        The  difference  between the two partial matching options can be illus-
   8861        trated by a pattern such as:
   8862 
   8863          /dog(sbody)?/
   8864 
   8865        This matches either "dog" or "dogsbody", greedily (that is, it  prefers
   8866        the  longer  string  if  possible). If it is matched against the string
   8867        "dog" with PCRE_PARTIAL_SOFT, it yields a  complete  match  for  "dog".
   8868        However, if PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL.
   8869        On the other hand, if the pattern is made ungreedy the result  is  dif-
   8870        ferent:
   8871 
   8872          /dog(sbody)??/
   8873 
   8874        In  this  case  the  result  is always a complete match because that is
   8875        found first, and matching never  continues  after  finding  a  complete
   8876        match. It might be easier to follow this explanation by thinking of the
   8877        two patterns like this:
   8878 
   8879          /dog(sbody)?/    is the same as  /dogsbody|dog/
   8880          /dog(sbody)??/   is the same as  /dog|dogsbody/
   8881 
   8882        The second pattern will never match "dogsbody", because it will  always
   8883        find the shorter match first.
   8884 
   8885 
   8886 PARTIAL MATCHING USING pcre_dfa_exec() OR pcre[16|32]_dfa_exec()
   8887 
   8888        The DFA functions move along the subject string character by character,
   8889        without backtracking, searching for  all  possible  matches  simultane-
   8890        ously.  If the end of the subject is reached before the end of the pat-
   8891        tern, there is the possibility of a partial match, again provided  that
   8892        at least one character has been inspected.
   8893 
   8894        When  PCRE_PARTIAL_SOFT  is set, PCRE_ERROR_PARTIAL is returned only if
   8895        there have been no complete matches. Otherwise,  the  complete  matches
   8896        are  returned.   However,  if PCRE_PARTIAL_HARD is set, a partial match
   8897        takes precedence over any complete matches. The portion of  the  string
   8898        that  was  inspected when the longest partial match was found is set as
   8899        the first matching string, provided there are at least two slots in the
   8900        offsets vector.
   8901 
   8902        Because  the  DFA functions always search for all possible matches, and
   8903        there is no difference between greedy and  ungreedy  repetition,  their
   8904        behaviour  is  different  from  the  standard  functions when PCRE_PAR-
   8905        TIAL_HARD is  set.  Consider  the  string  "dog"  matched  against  the
   8906        ungreedy pattern shown above:
   8907 
   8908          /dog(sbody)??/
   8909 
   8910        Whereas  the  standard functions stop as soon as they find the complete
   8911        match for "dog", the DFA functions also  find  the  partial  match  for
   8912        "dogsbody", and so return that when PCRE_PARTIAL_HARD is set.
   8913 
   8914 
   8915 PARTIAL MATCHING AND WORD BOUNDARIES
   8916 
   8917        If a pattern ends with one of the sequences \b or \B, which test for
   8918        word boundaries, partial matching with PCRE_PARTIAL_SOFT can give
   8919        counter-intuitive results. Consider this pattern:
   8920 
   8921          /\bcat\b/
   8922 
   8923        This matches "cat", provided there is a word boundary at either end. If
   8924        the subject string is "the cat", the comparison of the final "t" with a
   8925        following  character  cannot  take  place, so a partial match is found.
   8926        However, normal matching carries on, and \b matches at the end  of  the
   8927        subject  when  the  last  character is a letter, so a complete match is
   8928        found.  The  result,  therefore,  is  not   PCRE_ERROR_PARTIAL.   Using
   8929        PCRE_PARTIAL_HARD  in  this case does yield PCRE_ERROR_PARTIAL, because
   8930        then the partial match takes precedence.
   8931 
   8932 
   8933 FORMERLY RESTRICTED PATTERNS
   8934 
   8935        For releases of PCRE prior to 8.00, because of the way certain internal
   8936        optimizations   were  implemented  in  the  pcre_exec()  function,  the
   8937        PCRE_PARTIAL option (predecessor of  PCRE_PARTIAL_SOFT)  could  not  be
   8938        used  with all patterns. From release 8.00 onwards, the restrictions no
   8939        longer apply, and partial matching can be requested for any
   8940        pattern.
   8941 
   8942        Items that were formerly restricted were repeated single characters and
   8943        repeated metasequences. If PCRE_PARTIAL was set for a pattern that  did
   8944        not  conform  to  the restrictions, pcre_exec() returned the error code
   8945        PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in  use.  The
   8946        PCRE_INFO_OKPARTIAL  call  to pcre_fullinfo() to find out if a compiled
   8947        pattern can be used for partial matching now always returns 1.
   8948 
   8949 
   8950 EXAMPLE OF PARTIAL MATCHING USING PCRETEST
   8951 
   8952        If the escape sequence \P is present  in  a  pcretest  data  line,  the
   8953        PCRE_PARTIAL_SOFT  option  is  used  for  the  match.  Here is a run of
   8954        pcretest that uses the date example quoted above:
   8955 
   8956            re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
   8957          data> 25jun04\P
   8958           0: 25jun04
   8959           1: jun
   8960          data> 25dec3\P
   8961          Partial match: 25dec3
   8962          data> 3ju\P
   8963          Partial match: 3ju
   8964          data> 3juj\P
   8965          No match
   8966          data> j\P
   8967          No match
   8968 
   8969        The first data string is matched  completely,  so  pcretest  shows  the
   8970        matched  substrings.  The  remaining four strings do not match the com-
   8971        plete pattern, but the first two are partial matches. Similar output is
   8972        obtained if DFA matching is used.
   8973 
   8974        If  the escape sequence \P is present more than once in a pcretest data
   8975        line, the PCRE_PARTIAL_HARD option is set for the match.
   8976 
   8977 
   8978 MULTI-SEGMENT MATCHING WITH pcre_dfa_exec() OR pcre[16|32]_dfa_exec()
   8979 
   8980        When a partial match has been found using a DFA matching  function,  it
   8981        is  possible to continue the match by providing additional subject data
   8982        and calling the function again with the same compiled  regular  expres-
   8983        sion,  this time setting the PCRE_DFA_RESTART option. You must pass the
   8984        same working space as before, because this is where details of the pre-
   8985        vious  partial  match  are  stored.  Here is an example using pcretest,
   8986        using the \R escape sequence to set  the  PCRE_DFA_RESTART  option  (\D
   8987        specifies the use of the DFA matching function):
   8988 
   8989            re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
   8990          data> 23ja\P\D
   8991          Partial match: 23ja
   8992          data> n05\R\D
   8993           0: n05
   8994 
   8995        The  first  call has "23ja" as the subject, and requests partial match-
   8996        ing; the second call  has  "n05"  as  the  subject  for  the  continued
   8997        (restarted)  match.   Notice  that when the match is complete, only the
   8998        last part is shown; PCRE does  not  retain  the  previously  partially-
   8999        matched  string. It is up to the calling program to do that if it needs
   9000        to.
   9001 
   9002        That means that, for an unanchored pattern, if a continued match fails,
   9003        it  is  not  possible  to  try  again at a new starting point. All this
   9004        facility is capable of doing is  continuing  with  the  previous  match
   9005        attempt.  In  the previous example, if the second set of data is "ug23"
   9006        the result is no match, even though there would be a match for  "aug23"
   9007        if  the entire string were given at once. Depending on the application,
   9008        this may or may not be what you want.  The only way to allow for start-
   9009        ing  again  at  the next character is to retain the matched part of the
   9010        subject and try a new complete match.
   9011 
   9012        You can set the PCRE_PARTIAL_SOFT  or  PCRE_PARTIAL_HARD  options  with
   9013        PCRE_DFA_RESTART  to  continue partial matching over multiple segments.
   9014        This facility can be used to pass very long subject strings to the  DFA
   9015        matching functions.
   9016 
   9017 
   9018 MULTI-SEGMENT MATCHING WITH pcre_exec() OR pcre[16|32]_exec()
   9019 
   9020        From  release 8.00, the standard matching functions can also be used to
   9021        do multi-segment matching. Unlike the DFA functions, it is not possible
   9022        to  restart the previous match with a new segment of data. Instead, new
   9023        data must be added to the previous subject string, and the entire match
   9024        re-run,  starting from the point where the partial match occurred. Ear-
   9025        lier data can be discarded.
   9026 
   9027        It is best to use PCRE_PARTIAL_HARD in this situation, because it  does
   9028        not  treat the end of a segment as the end of the subject when matching
   9029        \z, \Z, \b, \B, and $. Consider  an  unanchored  pattern  that  matches
   9030        dates:
   9031 
   9032            re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
   9033          data> The date is 23ja\P\P
   9034          Partial match: 23ja
   9035 
   9036        At  this stage, an application could discard the text preceding "23ja",
   9037        add on text from the next  segment,  and  call  the  matching  function
   9038        again.  Unlike  the  DFA matching functions, the entire matching string
   9039        must always be available, and the complete matching process occurs  for
   9040        each call, so more memory and more processing time are needed.
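
               The buffer handling described above can be sketched in C as
               follows. This is a minimal illustration and not part of PCRE:
               carry_partial() is a hypothetical helper, and keep_from is
               assumed to be the start of the partial match as reported in
               offsets[0]. The caller is assumed to re-run the matching
               function on the resulting buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a PCRE function). buf holds *len bytes of
 * subject data and has room for cap bytes. keep_from is the offset at
 * which the partial match started (offsets[0]). Everything before it is
 * discarded, the partial match is moved to the front of the buffer, and
 * the next segment is appended. Returns 0 on success, or -1 if the
 * arguments are inconsistent or the result would not fit. */
int carry_partial(char *buf, size_t *len, size_t cap, size_t keep_from,
                  const char *next, size_t next_len)
{
    size_t kept;

    assert(buf != NULL && len != NULL && next != NULL);
    if (*len > cap || keep_from > *len) return -1;
    kept = *len - keep_from;
    if (next_len > cap - kept) return -1;
    memmove(buf, buf + keep_from, kept);   /* regions may overlap */
    memcpy(buf + kept, next, next_len);
    *len = kept + next_len;
    return 0;
}
```

               With the date example, the buffer holds "The date is 23ja" and
               keep_from is 12. After appending a segment that begins "n05",
               the buffer begins "23jan05" and the match can be run again
               from offset 0.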
   9041 
   9042        Note:  If  the pattern contains lookbehind assertions, or \K, or starts
   9043        with \b or \B, the string that is returned for a partial match includes
   9044        characters  that precede the start of what would be returned for a com-
   9045        plete match, because it contains all the characters that were inspected
   9046        during the partial match.
   9047 
   9048 
   9049 ISSUES WITH MULTI-SEGMENT MATCHING
   9050 
   9051        Certain types of pattern may give problems with multi-segment matching,
   9052        whichever matching function is used.
   9053 
   9054        1. If the pattern contains a test for the beginning of a line, you need
   9055        to  pass  the  PCRE_NOTBOL  option when the subject string for any call
   9056        does not start at the beginning of a line. There is also a PCRE_NOTEOL
   9057        option, but in practice when doing multi-segment matching you should be
   9058        using PCRE_PARTIAL_HARD, which includes the effect of PCRE_NOTEOL.
   9059 
   9060        2. Lookbehind assertions that have already been obeyed are catered  for
   9061        in the offsets that are returned for a partial match. However a lookbe-
   9062        hind assertion later in the pattern could require even earlier  charac-
   9063        ters   to  be  inspected.  You  can  handle  this  case  by  using  the
   9064        PCRE_INFO_MAXLOOKBEHIND    option    of    the    pcre_fullinfo()    or
   9065        pcre[16|32]_fullinfo()  functions  to  obtain the length of the longest
   9066        lookbehind in the pattern. This length  is  given  in  characters,  not
   9067        bytes.  If  you  always retain at least that many characters before the
   9068        partially matched string, all should be  well.  (Of  course,  near  the
   9069        start of the subject, fewer characters may be present; in that case all
   9070        characters should be retained.)
   9071 
   9072        From release 8.33, there is a more accurate way of deciding which char-
   9073        acters  to  retain.  Instead  of  subtracting the length of the longest
   9074        lookbehind from the  earliest  inspected  character  (offsets[0]),  the
   9075        match  start  position  (offsets[2]) should be used, and the next match
   9076        attempt started at the offsets[2] character by setting the  startoffset
   9077        argument of pcre_exec() or pcre_dfa_exec().
   9078 
   9079        For  example, if the pattern "(?<=123)abc" is partially matched against
   9080        the string "xx123a", the three offset values returned are 2, 6, and  5.
   9081        This  indicates  that  the  matching  process that gave a partial match
   9082        started at offset 5, but the characters "123a" were all inspected.  The
   9083        maximum  lookbehind  for  that pattern is 3, so taking that away from 5
   9084        shows that we need only keep "123a", and the next match attempt can  be
   9085        started at offset 3 (that is, at "a") when further characters have been
   9086        added. When the match start is not the  earliest  inspected  character,
   9087        pcretest shows it explicitly:
   9088 
   9089            re> "(?<=123)abc"
   9090          data> xx123a\P\P
   9091          Partial match at offset 5: 123a
   9092 
   9093        3.  Because a partial match must always contain at least one character,
   9094        what might be considered a partial match of an  empty  string  actually
   9095        gives a "no match" result. For example:
   9096 
   9097            re> /c(?<=abc)x/
   9098          data> ab\P
   9099          No match
   9100 
   9101        If the next segment begins "cx", a match should be found, but this will
   9102        only happen if characters from the previous segment are  retained.  For
   9103        this  reason,  a  "no  match"  result should be interpreted as "partial
   9104        match of an empty string" when the pattern contains lookbehinds.
   9105 
   9106        4. Matching a subject string that is split into multiple  segments  may
   9107        not  always produce exactly the same result as matching over one single
   9108        long string, especially when PCRE_PARTIAL_SOFT  is  used.  The  section
   9109        "Partial  Matching  and  Word Boundaries" above describes an issue that
   9110        arises if the pattern ends with \b or \B. Another  kind  of  difference
   9111        may  occur when there are multiple matching possibilities, because (for
   9112        PCRE_PARTIAL_SOFT) a partial match result is given only when there  are
   9113        no completed matches. This means that as soon as the shortest match has
   9114        been found, continuation to a new subject segment is no  longer  possi-
   9115        ble. Consider again this pcretest example:
   9116 
   9117            re> /dog(sbody)?/
   9118          data> dogsb\P
   9119           0: dog
   9120          data> do\P\D
   9121          Partial match: do
   9122          data> gsb\R\P\D
   9123           0: g
   9124          data> dogsbody\D
   9125           0: dogsbody
   9126           1: dog
   9127 
   9128        The  first  data  line passes the string "dogsb" to a standard matching
   9129        function, setting the PCRE_PARTIAL_SOFT option. Although the string  is
   9130        a  partial  match for "dogsbody", the result is not PCRE_ERROR_PARTIAL,
   9131        because the shorter string "dog" is a complete match.  Similarly,  when
   9132        the  subject  is  presented to a DFA matching function in several parts
   9133        ("do" and "gsb" being the first two) the match  stops  when  "dog"  has
   9134        been  found, and it is not possible to continue.  On the other hand, if
   9135        "dogsbody" is presented as a single string,  a  DFA  matching  function
   9136        finds both matches.
   9137 
   9138        Because  of  these  problems,  it is best to use PCRE_PARTIAL_HARD when
   9139        matching multi-segment data. The example  above  then  behaves  differ-
   9140        ently:
   9141 
   9142            re> /dog(sbody)?/
   9143          data> dogsb\P\P
   9144          Partial match: dogsb
   9145          data> do\P\D
   9146          Partial match: do
   9147          data> gsb\R\P\P\D
   9148          Partial match: gsb
   9149 
   9150        5. Patterns that contain alternatives at the top level which do not all
   9151        start with the  same  pattern  item  may  not  work  as  expected  when
   9152        PCRE_DFA_RESTART is used. For example, consider this pattern:
   9153 
   9154          1234|3789
   9155 
   9156        If  the  first  part of the subject is "ABC123", a partial match of the
   9157        first alternative is found at offset 3. There is no partial  match  for
   9158        the second alternative, because such a match does not start at the same
   9159        point in the subject string. Attempting to  continue  with  the  string
   9160        "7890"  does  not  yield  a  match because only those alternatives that
   9161        match at one point in the subject are remembered.  The  problem  arises
   9162        because  the  start  of the second alternative matches within the first
   9163        alternative. There is no problem with  anchored  patterns  or  patterns
   9164        such as:
   9165 
   9166          1234|ABCD
   9167 
   9168        where  no  string can be a partial match for both alternatives. This is
   9169        not a problem if a standard matching  function  is  used,  because  the
   9170        entire match has to be rerun each time:
   9171 
   9172            re> /1234|3789/
   9173          data> ABC123\P\P
   9174          Partial match: 123
   9175          data> 1237890
   9176           0: 3789
   9177 
   9178        Of course, instead of using PCRE_DFA_RESTART, the same technique of re-
   9179        running the entire match can also be used with the DFA  matching  func-
   9180        tions.  Another  possibility  is to work with two buffers. If a partial
   9181        match at offset n in the first buffer is followed by  "no  match"  when
   9182        PCRE_DFA_RESTART  is  used on the second buffer, you can then try a new
   9183        match starting at offset n+1 in the first buffer.
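
               For issue 2 above, the arithmetic of the release 8.33 method
               can be sketched as follows. plan_retention() and retain_plan
               are hypothetical names, not PCRE functions. The offsets
               argument is assumed to hold the three values returned for a
               partial match, and the sketch assumes one byte per character;
               in a UTF mode the lookbehind length counts characters, so
               stepping back needs more care.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: what to keep and where to restart. */
typedef struct {
    int keep_from;    /* first offset of the old buffer to retain */
    int startoffset;  /* startoffset for the next call, relative to
                         the retained data */
} retain_plan;

/* offsets[2] is the match start position of the partial match.
 * Retaining max_lookbehind characters before it covers any lookbehind
 * that the pattern may apply on the next attempt. */
retain_plan plan_retention(const int offsets[3], int max_lookbehind)
{
    retain_plan plan;

    assert(offsets != NULL && max_lookbehind >= 0);
    plan.keep_from = offsets[2] - max_lookbehind;
    if (plan.keep_from < 0)
        plan.keep_from = 0;        /* near the start: keep everything */
    plan.startoffset = offsets[2] - plan.keep_from;
    return plan;
}
```

               For the "(?<=123)abc" example, offsets of 2, 6, and 5 with a
               maximum lookbehind of 3 give keep_from 2 and startoffset 3,
               which agrees with the description in issue 2.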
   9184 
   9185 
   9186 AUTHOR
   9187 
   9188        Philip Hazel
   9189        University Computing Service
   9190        Cambridge CB2 3QH, England.
   9191 
   9192 
   9193 REVISION
   9194 
   9195        Last updated: 02 July 2013
   9196        Copyright (c) 1997-2013 University of Cambridge.
   9197 ------------------------------------------------------------------------------
   9198 
   9199 
   9200 PCREPRECOMPILE(3)          Library Functions Manual          PCREPRECOMPILE(3)
   9201 
   9202 
   9203 
   9204 NAME
   9205        PCRE - Perl-compatible regular expressions
   9206 
   9207 SAVING AND RE-USING PRECOMPILED PCRE PATTERNS
   9208 
   9209        If  you  are running an application that uses a large number of regular
   9210        expression patterns, it may be useful to store them  in  a  precompiled
   9211        form  instead  of  having to compile them every time the application is
   9212        run.  If you are not  using  any  private  character  tables  (see  the
   9213        pcre_maketables()  documentation),  this is relatively straightforward.
   9214        If you are using private tables, it is a little bit  more  complicated.
   9215        However,  if you are using the just-in-time optimization feature, it is
   9216        not possible to save and reload the JIT data.
   9217 
   9218        If you save compiled patterns to a file, you can copy them to a differ-
   9219        ent host and run them there. If the two hosts have different endianness
   9220        (byte    order),    you     should     run     the     pcre[16|32]_pat-
   9221        tern_to_host_byte_order()  function  on  the  new host before trying to
   9222        match the pattern. The matching functions return  PCRE_ERROR_BADENDIAN-
   9223        NESS if they detect a pattern with the wrong endianness.
   9224 
   9225        Compiling  regular  expressions with one version of PCRE for use with a
   9226        different version is not guaranteed to work and may cause crashes,  and
   9227        saving  and  restoring  a  compiled  pattern loses any JIT optimization
   9228        data.
   9229 
   9230 
   9231 SAVING A COMPILED PATTERN
   9232 
   9233        The value returned by pcre[16|32]_compile() points to a single block of
   9234        memory  that  holds  the  compiled pattern and associated data. You can
   9235        find   the   length   of   this   block    in    bytes    by    calling
   9236        pcre[16|32]_fullinfo() with an argument of PCRE_INFO_SIZE. You can then
   9237        save the data in any appropriate manner. Here is sample  code  for  the
   9238        8-bit  library  that  compiles  a  pattern  and writes it to a file. It
   9239        assumes that the variable fd refers to a file that is open for output:
   9240 
   9241          int erroroffset, rc;
   9242          const char *error;     /* pcre_compile() takes a const char ** */
   9243          pcre *re;
                 size_t size, written;  /* PCRE_INFO_SIZE yields a size_t */
   9244 
   9245          re = pcre_compile("my pattern", 0, &error, &erroroffset, NULL);
   9246          if (re == NULL) { ... handle errors ... }
   9247          rc = pcre_fullinfo(re, NULL, PCRE_INFO_SIZE, &size);
   9248          if (rc < 0) { ... handle errors ... }
   9249          written = fwrite(re, 1, size, fd);   /* fd is a FILE * */
   9250          if (written != size) { ... handle errors ... }
   9251 
   9252        In this example, the bytes  that  comprise  the  compiled  pattern  are
   9253        copied  exactly.  Note that this is binary data that may contain any of
   9254        the 256 possible byte  values.  On  systems  that  make  a  distinction
   9255        between binary and non-binary data, be sure that the file is opened for
   9256        binary output.
   9257 
   9258        If you want to write more than one pattern to a file, you will have  to
   9259        devise  a  way of separating them. For binary data, preceding each pat-
   9260        tern with its length is probably  the  most  straightforward  approach.
   9261        Another  possibility is to write out the data in hexadecimal instead of
   9262        binary, one pattern to a line.
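
               A minimal sketch of the length-prefix approach follows. The
               record layout (a 4-byte length in a fixed byte order, then the
               data) and the names write_record() and read_record() are
               assumptions made for illustration; PCRE does not define a file
               format. The data argument would be the pointer returned by
               pcre_compile(), with the size obtained via PCRE_INFO_SIZE.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Write one record: a 4-byte big-endian length, then the bytes.
 * The stream must be open for binary output. Returns 0 on success. */
int write_record(FILE *f, const void *data, uint32_t size)
{
    unsigned char hdr[4];

    assert(f != NULL && (data != NULL || size == 0));
    hdr[0] = (unsigned char)(size >> 24);
    hdr[1] = (unsigned char)(size >> 16);
    hdr[2] = (unsigned char)(size >> 8);
    hdr[3] = (unsigned char)size;
    if (fwrite(hdr, 1, 4, f) != 4) return -1;
    if (size > 0 && fwrite(data, 1, size, f) != size) return -1;
    return 0;
}

/* Read the next record into a malloc()ed block and store its length
 * in *size. Returns NULL at end of file or on error. The caller frees
 * the block. */
void *read_record(FILE *f, uint32_t *size)
{
    unsigned char hdr[4];
    void *data;

    assert(f != NULL && size != NULL);
    if (fread(hdr, 1, 4, f) != 4) return NULL;
    *size = ((uint32_t)hdr[0] << 24) | ((uint32_t)hdr[1] << 16) |
            ((uint32_t)hdr[2] << 8) | (uint32_t)hdr[3];
    data = malloc(*size > 0 ? *size : 1);
    if (data == NULL) return NULL;
    if (*size > 0 && fread(data, 1, *size, f) != *size) {
        free(data);
        return NULL;
    }
    return data;
}
```

               The length header is written in a fixed byte order, but the
               pattern data itself is copied as is, so the remarks about
               endianness above still apply to it.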
   9263 
   9264        Saving compiled patterns in a file is only one possible way of  storing
   9265        them  for later use. They could equally well be saved in a database, or
   9266        in the memory of some daemon process that passes them  via  sockets  to
   9267        the processes that want them.
   9268 
   9269        If the pattern has been studied, it is also possible to save the normal
   9270        study data in a similar way to the compiled pattern itself. However, if
   9271        the PCRE_STUDY_JIT_COMPILE option was used, the just-in-time data
   9272        that is created cannot be saved because it is too dependent on the
   9273        current environment. When studying generates additional information,
   9274        pcre[16|32]_study() returns  a  pointer  to  a  pcre[16|32]_extra  data
   9275        block.  Its  format  is defined in the section on matching a pattern in
   9276        the pcreapi documentation. The study_data field points  to  the  binary
   9277        study  data,  and this is what you must save (not the pcre[16|32]_extra
   9278        block itself). The length of the study data can be obtained by  calling
   9279        pcre[16|32]_fullinfo()  with an argument of PCRE_INFO_STUDYSIZE. Remem-
   9280        ber to check that  pcre[16|32]_study()  did  return  a  non-NULL  value
   9281        before trying to save the study data.
   9282 
   9283 
   9284 RE-USING A PRECOMPILED PATTERN
   9285 
   9286        Re-using  a  precompiled pattern is straightforward. Having reloaded it
   9287        into main memory,  called  pcre[16|32]_pattern_to_host_byte_order()  if
   9288        necessary,    you   pass   its   pointer   to   pcre[16|32]_exec()   or
   9289        pcre[16|32]_dfa_exec() in the usual way.
   9290 
   9291        However, if you passed a pointer to custom character  tables  when  the
   9292        pattern  was compiled (the tableptr argument of pcre[16|32]_compile()),
   9293        you  must  now  pass  a  similar  pointer  to   pcre[16|32]_exec()   or
   9294        pcre[16|32]_dfa_exec(), because the value saved with the compiled
   9295        pattern will be nonsense. A field in a pcre[16|32]_extra block
   9296        is  used  to  pass this data, as described in the section on matching a
   9297        pattern in the pcreapi documentation.
   9298 
   9299        Warning: The tables that pcre_exec() and pcre_dfa_exec()  use  must  be
   9300        the same as those that were used when the pattern was compiled. If this
   9301        is not the case, the behaviour is undefined.
   9302 
   9303        If you did not provide custom character tables  when  the  pattern  was
   9304        compiled, the pointer in the compiled pattern is NULL, which causes the
   9305        matching functions to use PCRE's internal tables. Thus, you do not need
   9306        to take any special action at run time in this case.
   9307 
   9308        If  you  saved study data with the compiled pattern, you need to create
   9309        your own pcre[16|32]_extra data block and set the study_data  field  to
   9310        point   to   the   reloaded   study   data.   You  must  also  set  the
   9311        PCRE_EXTRA_STUDY_DATA bit in the flags field  to  indicate  that  study
   9312        data  is present. Then pass the pcre[16|32]_extra block to the matching
   9313        function in the usual way. If the pattern was studied for  just-in-time
   9314        optimization,  that  data  cannot  be  saved,  and  so  is  lost  by  a
   9315        save/restore cycle.
   9316 
   9317 
   9318 COMPATIBILITY WITH DIFFERENT PCRE RELEASES
   9319 
   9320        In general, it is safest to  recompile  all  saved  patterns  when  you
   9321        update  to  a new PCRE release, though not all updates actually require
   9322        this.
   9323 
   9324 
   9325 AUTHOR
   9326 
   9327        Philip Hazel
   9328        University Computing Service
   9329        Cambridge CB2 3QH, England.
   9330 
   9331 
   9332 REVISION
   9333 
   9334        Last updated: 12 November 2013
   9335        Copyright (c) 1997-2013 University of Cambridge.
   9336 ------------------------------------------------------------------------------
   9337 
   9338 
   9339 PCREPERFORM(3)             Library Functions Manual             PCREPERFORM(3)
   9340 
   9341 
   9342 
   9343 NAME
   9344        PCRE - Perl-compatible regular expressions
   9345 
   9346 PCRE PERFORMANCE
   9347 
   9348        Two  aspects  of performance are discussed below: memory usage and pro-
   9349        cessing time. The way you express your pattern as a regular  expression
   9350        can affect both of them.
   9351 
   9352 
   9353 COMPILED PATTERN MEMORY USAGE
   9354 
   9355        Patterns  are compiled by PCRE into a reasonably efficient interpretive
   9356        code, so that most simple patterns do not  use  much  memory.  However,
   9357        there  is  one case where the memory usage of a compiled pattern can be
   9358        unexpectedly large. If a parenthesized subpattern has a quantifier with
   9359        a minimum greater than 1 and/or a limited maximum, the whole subpattern
   9360        is repeated in the compiled code. For example, the pattern
   9361 
   9362          (abc|def){2,4}
   9363 
   9364        is compiled as if it were
   9365 
   9366          (abc|def)(abc|def)((abc|def)(abc|def)?)?
   9367 
   9368        (Technical aside: It is done this way so that backtrack  points  within
   9369        each of the repetitions can be independently maintained.)
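
               The expansion can be reproduced textually with a short C
               sketch. expand_repeat() is a hypothetical helper written for
               this illustration; it builds the equivalent pattern text shown
               above and does not reflect PCRE's internal compiled form. It
               assumes that group is a parenthesized subpattern and that the
               maximum is limited.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Writes the textual equivalent of group{min,max} into out: min
 * mandatory copies followed by (max - min) nested optional copies.
 * Returns 0 on success, or -1 if out is too small. */
int expand_repeat(const char *group, int min, int max,
                  char *out, size_t outsize)
{
    size_t glen, need, pos = 0;
    int i, optional;

    assert(group != NULL && out != NULL && min >= 0 && max >= min);
    optional = max - min;
    glen = strlen(group);

    /* max copies of the group, one "?" for the innermost optional copy,
     * "(" and ")?" for each enclosing optional level, and the NUL. */
    need = (size_t)max * glen + 1;
    if (optional > 0)
        need += 1 + 3 * (size_t)(optional - 1);
    if (need > outsize) return -1;

    for (i = 0; i < min; i++) {
        memcpy(out + pos, group, glen);
        pos += glen;
    }
    for (i = 0; i < optional; i++) {
        if (i < optional - 1)
            out[pos++] = '(';      /* wrapper for an outer optional level */
        memcpy(out + pos, group, glen);
        pos += glen;
    }
    if (optional > 0)
        out[pos++] = '?';          /* innermost optional copy */
    for (i = 0; i < optional - 1; i++) {
        out[pos++] = ')';
        out[pos++] = '?';
    }
    out[pos] = '\0';
    return 0;
}
```

               The output length grows with the maximum, which is the
               effect on memory usage that this section describes.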
   9370 
   9371        For  regular expressions whose quantifiers use only small numbers, this
   9372        is not usually a problem. However, if the numbers are large,  and  par-
   9373        ticularly  if  such repetitions are nested, the memory usage can become
   9374        an embarrassment. For example, the very simple pattern
   9375 
   9376          ((ab){1,1000}c){1,3}
   9377 
   9378        uses 51K bytes when compiled using the 8-bit library. When PCRE is com-
   9379        piled  with  its  default  internal pointer size of two bytes, the size
   9380        limit on a compiled pattern is 64K data units, and this is reached with
   9381        the  above  pattern  if  the outer repetition is increased from 3 to 4.
   9382        PCRE can be compiled to use larger internal pointers  and  thus  handle
   9383        larger  compiled patterns, but it is better to try to rewrite your pat-
   9384        tern to use less memory if you can.
   9385 
   9386        One way of reducing the memory usage for such patterns is to  make  use
   9387        of PCRE's "subroutine" facility. Re-writing the above pattern as
   9388 
   9389          ((ab)(?2){0,999}c)(?1){0,2}
   9390 
   9391        reduces the memory requirements to 18K, and indeed it remains under 20K
   9392        even with the outer repetition increased to 100. However, this  pattern
   9393        is  not  exactly equivalent, because the "subroutine" calls are treated
   9394        as atomic groups into which there can be no backtracking if there is  a
   9395        subsequent  matching  failure.  Therefore,  PCRE cannot do this kind of
   9396        rewriting automatically.  Furthermore, there is a  noticeable  loss  of
   9397        speed  when executing the modified pattern. Nevertheless, if the atomic
   9398        grouping is not a problem and the loss of  speed  is  acceptable,  this
   9399        kind  of  rewriting will allow you to process patterns that PCRE cannot
   9400        otherwise handle.
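
               The rewriting shown above can be generated mechanically for
               the simple case of a {1,max} quantifier. subroutine_repeat()
               is a hypothetical helper, not a PCRE function, and it assumes
               that the caller knows the number of the capturing group. The
               caveats about atomic behaviour and speed apply to its output.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Rewrites group{1,max} as group(?N){0,max-1}, where N is the number
 * of the capturing group. group must be a parenthesized subpattern.
 * Returns 0 on success, or -1 if out is too small. */
int subroutine_repeat(const char *group, int group_number, int max,
                      char *out, size_t outsize)
{
    size_t glen;
    int n;

    assert(group != NULL && out != NULL);
    assert(group_number >= 1 && max >= 1);
    glen = strlen(group);
    assert(glen >= 2 && group[0] == '(' && group[glen - 1] == ')');

    n = snprintf(out, outsize, "%s(?%d){0,%d}", group, group_number,
                 max - 1);
    return (n >= 0 && (size_t)n < outsize) ? 0 : -1;
}
```

               Applying it to "(ab)" as group 2 with a maximum of 1000,
               wrapping the result and a following "c" in parentheses, and
               applying it again as group 1 with a maximum of 3 gives the
               rewritten pattern shown above.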
   9401 
   9402 
   9403 STACK USAGE AT RUN TIME
   9404 
   9405        When pcre_exec() or pcre[16|32]_exec() is used  for  matching,  certain
   9406        kinds  of  pattern  can  cause  it  to use large amounts of the process
   9407        stack. In some environments the default process stack is  quite  small,
   9408        and  if it runs out the result is often SIGSEGV. This issue is probably
   9409        the most frequently raised problem with PCRE.  Rewriting  your  pattern
   9410        can  often  help.  The  pcrestack documentation discusses this issue in
   9411        detail.
   9412 
   9413 
   9414 PROCESSING TIME
   9415 
   9416        Certain items in regular expression patterns are processed  more  effi-
   9417        ciently than others. It is more efficient to use a character class like
   9418        [aeiou]  than  a  set  of   single-character   alternatives   such   as
   9419        (a|e|i|o|u).  In  general,  the simplest construction that provides the
   9420        required behaviour is usually the most efficient. Jeffrey Friedl's book
   9421        contains  a  lot  of useful general discussion about optimizing regular
   9422        expressions for efficient performance. This  document  contains  a  few
   9423        observations about PCRE.
   9424 
   9425        Using  Unicode  character  properties  (the  \p, \P, and \X escapes) is
   9426        slow, because PCRE has to use a multi-stage table  lookup  whenever  it
   9427        needs  a  character's  property. If you can find an alternative pattern
   9428        that does not use character properties, it will probably be faster.
   9429 
   9430        By default, the escape sequences \b, \d, \s,  and  \w,  and  the  POSIX
   9431        character  classes  such  as  [:alpha:]  do not use Unicode properties,
   9432        partly for backwards compatibility, and partly for performance reasons.
   9433        However,  you can set PCRE_UCP if you want Unicode character properties
   9434        to be used. This can double the matching time for  items  such  as  \d,
   9435        when matched with a traditional matching function; the performance loss
   9436        is less with a DFA matching function, and in both cases  there  is  not
   9437        much difference for \b.
   9438 
   9439        When  a  pattern  begins  with .* not in parentheses, or in parentheses
   9440        that are not the subject of a backreference, and the PCRE_DOTALL option
   9441        is  set, the pattern is implicitly anchored by PCRE, since it can match
   9442        only at the start of a subject string. However, if PCRE_DOTALL  is  not
   9443        set,  PCRE  cannot  make this optimization, because the . metacharacter
   9444        does not then match a newline, and if the subject string contains  new-
   9445        lines,  the  pattern may match from the character immediately following
   9446        one of them instead of from the very start. For example, the pattern
   9447 
   9448          .*second
   9449 
   9450        matches the subject "first\nand second" (where \n stands for a  newline
   9451        character),  with the match starting at the seventh character. In order
   9452        to do this, PCRE has to retry the match starting after every newline in
   9453        the subject.
   9454 
   9455        If  you  are using such a pattern with subject strings that do not con-
   9456        tain newlines, the best performance is obtained by setting PCRE_DOTALL,
   9457        or  starting  the pattern with ^.* or ^.*? to indicate explicit anchor-
   9458        ing. That saves PCRE from having to scan along the subject looking  for
   9459        a newline to restart at.
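
               The retry behaviour can be pictured with a small sketch that
               lists the positions at which such a match has to be attempted.
               dotstar_start_points() is a hypothetical function written for
               this illustration, not PCRE's implementation, and it assumes
               that a newline is a single linefeed character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Collects the candidate start offsets for a pattern that begins with
 * .* when PCRE_DOTALL is not set: offset 0 and the offset just after
 * each newline. Returns the number of offsets stored, which is at most
 * max_starts. */
size_t dotstar_start_points(const char *subject, size_t length,
                            size_t *starts, size_t max_starts)
{
    const char *p = subject;
    const char *end = subject + length;
    size_t count = 0;

    assert(subject != NULL && starts != NULL);
    if (max_starts == 0) return 0;
    starts[count++] = 0;
    while (count < max_starts &&
           (p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
        p++;                                /* step past the newline */
        starts[count++] = (size_t)(p - subject);
    }
    return count;
}
```

               For the subject "first\nand second" this yields offsets 0 and
               6, the latter being the seventh character. A subject without
               newlines yields only offset 0, which is why explicit anchoring
               loses nothing in that case.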
   9460 
   9461        Beware  of  patterns  that contain nested indefinite repeats. These can
   9462        take a long time to run when applied to a string that does  not  match.
   9463        Consider the pattern fragment
   9464 
   9465          ^(a+)*
   9466 
   9467        This  can  match "aaaa" in 16 different ways, and this number increases
   9468        very rapidly as the string gets longer. (The * repeat can match  0,  1,
   9469        2,  3, or 4 times, and for each of those cases other than 0 or 4, the +
   9470        repeats can match different numbers of times.) When  the  remainder  of
   9471        the pattern is such that the entire match is going to fail, PCRE has in
   9472        principle to try  every  possible  variation,  and  this  can  take  an
   9473        extremely long time, even for relatively short strings.
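
               The figure of 16 can be checked by enumeration. The function
               below is an illustrative sketch, not PCRE code: count_splits()
               is a made-up name, and it counts the ways in which successive
               a+ iterations can consume part or all of a run of n "a"
               characters.

```c
#include <assert.h>

/* Counts the ways in which ^(a+)* can match at the start of a run of
 * n "a" characters. At each point the * repeat either stops, or one
 * more a+ iteration consumes between 1 and n of the remaining
 * characters. */
unsigned long count_splits(int n)
{
    unsigned long ways = 1;     /* the * repeat stops here */
    int len;

    assert(n >= 0);
    for (len = 1; len <= n; len++)   /* next a+ takes len characters */
        ways += count_splits(n - len);
    return ways;
}
```

               count_splits(4) is 16, and the count doubles with each
               additional character, which is why a failing match becomes
               slow so quickly.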
   9474 
   9475        An optimization catches some of the simpler cases, such as
   9476 
   9477          (a+)*b
   9478 
   9479        where  a  literal  character  follows. Before embarking on the standard
   9480        matching procedure, PCRE checks that there is a "b" later in  the  sub-
   9481        ject  string, and if there is not, it fails the match immediately. How-
   9482        ever, when there is no following literal this  optimization  cannot  be
   9483        used. You can see the difference by comparing the behaviour of
   9484 
   9485          (a+)*\d
   9486 
   9487        with the pattern above. The "b" pattern fails almost instantly when
   9488        applied to a whole line of "a" characters, whereas the \d pattern
   9489        takes an appreciable time with strings longer than about 20 characters.
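
               The idea behind that optimization can be sketched as a simple
               pre-check. This is only an illustration of the principle;
               worth_trying() is a hypothetical name, and PCRE's own check is
               internal to the library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the required literal character occurs somewhere in the
 * subject, so that a match attempt is worth starting, and 0 if the
 * match can be failed immediately. */
int worth_trying(const char *subject, size_t length, char required)
{
    assert(subject != NULL);
    return memchr(subject, required, length) != NULL;
}
```

               A single scan of the subject replaces the exponential search
               when the literal is absent. No such shortcut exists for \d,
               because it is not a single literal character.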
   9490 
   9491        In many cases, the solution to this kind of performance issue is to use
   9492        an atomic group or a possessive quantifier.
   9493 
   9494 
   9495 AUTHOR
   9496 
   9497        Philip Hazel
   9498        University Computing Service
   9499        Cambridge CB2 3QH, England.
   9500 
   9501 
   9502 REVISION
   9503 
   9504        Last updated: 25 August 2012
   9505        Copyright (c) 1997-2012 University of Cambridge.
   9506 ------------------------------------------------------------------------------
   9507 
   9508 
   9509 PCREPOSIX(3)               Library Functions Manual               PCREPOSIX(3)
   9510 
   9511 
   9512 
   9513 NAME
   9514        PCRE - Perl-compatible regular expressions.
   9515 
   9516 SYNOPSIS
   9517 
   9518        #include <pcreposix.h>
   9519 
   9520        int regcomp(regex_t *preg, const char *pattern,
   9521             int cflags);
   9522 
   9523        int regexec(regex_t *preg, const char *string,
   9524             size_t nmatch, regmatch_t pmatch[], int eflags);

   9525        size_t regerror(int errcode, const regex_t *preg,
   9526             char *errbuf, size_t errbuf_size);
   9527 
   9528        void regfree(regex_t *preg);
   9529 
   9530 
   9531 DESCRIPTION
   9532 
   9533        This  set  of functions provides a POSIX-style API for the PCRE regular
   9534        expression 8-bit library. See the pcreapi documentation for a  descrip-
   9535        tion  of  PCRE's native API, which contains much additional functional-
   9536        ity. There is no POSIX-style  wrapper  for  PCRE's  16-bit  and  32-bit
   9537        libraries.
   9538 
   9539        The functions described here are just wrapper functions that ultimately
   9540        call  the  PCRE  native  API.  Their  prototypes  are  defined  in  the
   9541        pcreposix.h  header  file,  and  on  Unix systems the library itself is
   9542        called libpcreposix.a, so can be accessed by adding -lpcreposix to the
   9543        command  for  linking  an application that uses them. Because the POSIX
   9544        functions call the native ones, it is also necessary to add -lpcre.
   9545 
   9546        I have implemented only those POSIX option bits that can be  reasonably
   9547        mapped  to PCRE native options. In addition, the option REG_EXTENDED is
   9548        defined with the value zero. This has no  effect,  but  since  programs
   9549        that  are  written  to  the POSIX interface often use it, this makes it
   9550        easier to slot in PCRE as a replacement library.  Other  POSIX  options
   9551        are not even defined.
   9552 
   9553        There  are also some other options that are not defined by POSIX. These
   9554        have been added at the request of users who want to make use of certain
   9555        PCRE-specific features via the POSIX calling interface.
   9556 
   9557        When  PCRE  is  called  via these functions, it is only the API that is
   9558        POSIX-like in style. The syntax and semantics of  the  regular  expres-
   9559        sions  themselves  are  still  those of Perl, subject to the setting of
   9560        various PCRE options, as described below. "POSIX-like in  style"  means
   9561        that  the  API  approximates  to  the POSIX definition; it is not fully
   9562        POSIX-compatible, and in multi-byte encoding  domains  it  is  probably
   9563        even less compatible.
   9564 
   9565        The  header for these functions is supplied as pcreposix.h to avoid any
   9566        potential clash with other POSIX  libraries.  It  can,  of  course,  be
   9567        renamed or aliased as regex.h, which is the "correct" name. It provides
   9568        two structure types, regex_t for  compiled  internal  forms,  and  reg-
   9569        match_t  for  returning  captured substrings. It also defines some con-
   9570        stants whose names start  with  "REG_";  these  are  used  for  setting
   9571        options and identifying error codes.
   9572 
   9573 
   9574 COMPILING A PATTERN
   9575 
   9576        The  function regcomp() is called to compile a pattern into an internal
   9577        form. The pattern is a C string terminated by a  binary  zero,  and  is
   9578        passed  in  the  argument  pattern. The preg argument is a pointer to a
   9579        regex_t structure that is used as a base for storing information  about
   9580        the compiled regular expression.
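
               A minimal sketch of compiling and matching through the
               POSIX-style interface follows. posix_match() is a hypothetical
               helper. The sketch includes the system regex.h so that it
               builds without PCRE, which is an assumption made for
               portability; to use the PCRE wrapper, include pcreposix.h
               instead and link with -lpcreposix and -lpcre. The pattern
               syntax then follows Perl rather than POSIX.

```c
#include <assert.h>
#include <stddef.h>
#include <regex.h>   /* assumption: replace with <pcreposix.h> for PCRE */

/* Compiles pattern with the given cflags and tests it against subject.
 * Returns 1 for a match, 0 for no match, and -1 on error. After a
 * compile error, the message is copied into errbuf if one is given. */
int posix_match(const char *pattern, const char *subject, int cflags,
                char *errbuf, size_t errbuf_size)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && subject != NULL);
    rc = regcomp(&re, pattern, cflags);
    if (rc != 0) {
        if (errbuf != NULL && errbuf_size > 0)
            regerror(rc, &re, errbuf, errbuf_size);
        return -1;
    }
    rc = regexec(&re, subject, 0, NULL, 0);   /* no substrings wanted */
    regfree(&re);
    if (rc == 0) return 1;
    return (rc == REG_NOMATCH) ? 0 : -1;
}
```

               Passing REG_EXTENDED costs nothing with the PCRE wrapper,
               where it is defined as zero, and keeps the call portable to
               other POSIX libraries. Adding REG_ICASE makes the match
               caseless.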
   9581 
   9582        The argument cflags is either zero, or contains one or more of the bits
   9583        defined by the following macros:
   9584 
   9585          REG_DOTALL
   9586 
   9587        The PCRE_DOTALL option is set when the regular expression is passed for
   9588        compilation to the native function. Note that REG_DOTALL is not part of
   9589        the POSIX standard.
   9590 
   9591          REG_ICASE
   9592 
   9593        The PCRE_CASELESS option is set when the regular expression  is  passed
   9594        for compilation to the native function.
   9595 
   9596          REG_NEWLINE
   9597 
   9598        The  PCRE_MULTILINE option is set when the regular expression is passed
   9599        for compilation to the native function. Note that this does  not  mimic
   9600        the  defined  POSIX  behaviour  for REG_NEWLINE (see the following sec-
   9601        tion).
   9602 
   9603          REG_NOSUB
   9604 
   9605        The PCRE_NO_AUTO_CAPTURE option is set when the regular  expression  is
   9606        passed for compilation to the native function. In addition, when a pat-
   9607        tern that is compiled with this flag is passed to regexec() for  match-
   9608        ing,  the  nmatch  and  pmatch  arguments  are ignored, and no captured
   9609        strings are returned.
   9610 
   9611          REG_UCP
   9612 
   9613        The PCRE_UCP option is set when the regular expression  is  passed  for
   9614        compilation  to  the  native  function. This causes PCRE to use Unicode
   9615        properties when matching \d, \w, etc., instead of just recognizing
   9616        ASCII values. Note that REG_UCP is not part of the POSIX standard.
   9617 
   9618          REG_UNGREEDY
   9619 
   9620        The  PCRE_UNGREEDY  option is set when the regular expression is passed
   9621        for compilation to the native function. Note that REG_UNGREEDY  is  not
   9622        part of the POSIX standard.
   9623 
   9624          REG_UTF8
   9625 
   9626        The  PCRE_UTF8  option is set when the regular expression is passed for
   9627        compilation to the native function. This causes the pattern itself  and
   9628        all  data  strings used for matching it to be treated as UTF-8 strings.
   9629        Note that REG_UTF8 is not part of the POSIX standard.
   9630 
   9631        In the absence of these flags, no options  are  passed  to  the  native
   9632        function. This means that the regex is compiled with PCRE default
   9633        semantics. In particular, the way it handles newline characters in  the
   9634        subject  string  is  the Perl way, not the POSIX way. Note that setting
   9635        PCRE_MULTILINE has only some of the effects specified for  REG_NEWLINE.
   9636        It  does not affect the way newlines are matched by . (they are not) or
   9637        by a negative class such as [^a] (they are).
   9638 
   9639        The yield of regcomp() is zero on success, and non-zero otherwise.  The
   9640        preg structure is filled in on success, and one member of the structure
   9641        is public: re_nsub contains the number of capturing subpatterns in  the
   9642        regular expression. Various error codes are defined in the header file.
   9643 
   9644        NOTE:  If  the  yield of regcomp() is non-zero, you must not attempt to
   9645        use the contents of the preg structure. If, for example, you pass it to
   9646        regexec(), the result is undefined and your program is likely to crash.
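
               The sequence above can be sketched in C as follows. This is only an
               illustration, not code from the PCRE distribution: count_groups() is a
               hypothetical helper, and the sketch includes the system <regex.h> so
               that it builds without PCRE installed (include pcreposix.h instead to
               use the functions described here). REG_EXTENDED is passed because the
               system implementation is assumed to need it for this pattern syntax.

```c
#include <assert.h>
#include <regex.h>   /* assumption: system header; use pcreposix.h for PCRE */
#include <stddef.h>

/* Hypothetical helper: compile pattern and return the number of capturing
 * subpatterns (re_nsub), or -1 if regcomp() reports an error. */
long count_groups(const char *pattern, int cflags)
{
    regex_t preg;
    long nsub;

    assert(pattern != NULL);

    /* A non-zero yield means preg must not be used for matching. */
    if (regcomp(&preg, pattern, cflags) != 0)
        return -1;

    nsub = (long)preg.re_nsub;   /* the one public member */
    regfree(&preg);              /* release the memory tied to preg */
    return nsub;
}
```

               A call such as count_groups("(a)(b(c))", REG_EXTENDED) should yield 3,
               while an unbalanced pattern such as "(a" should yield -1.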
   9647 
   9648 
   9649 MATCHING NEWLINE CHARACTERS
   9650 
   9651        This area is not simple, because POSIX and Perl take different views of
   9652        things.  It is not possible to get PCRE to obey  POSIX  semantics,  but
   9653        then  PCRE was never intended to be a POSIX engine. The following table
   9654        lists the different possibilities for matching  newline  characters  in
   9655        PCRE:
   9656 
   9657                                  Default   Change with
   9658 
   9659          . matches newline          no     PCRE_DOTALL
   9660          newline matches [^a]       yes    not changeable
   9661          $ matches \n at end        yes    PCRE_DOLLAR_ENDONLY
   9662          $ matches \n in middle     no     PCRE_MULTILINE
   9663          ^ matches \n in middle     no     PCRE_MULTILINE
   9664 
   9665        This is the equivalent table for POSIX:
   9666 
   9667                                  Default   Change with
   9668 
   9669          . matches newline          yes    REG_NEWLINE
   9670          newline matches [^a]       yes    REG_NEWLINE
   9671          $ matches \n at end        no     REG_NEWLINE
   9672          $ matches \n in middle     no     REG_NEWLINE
   9673          ^ matches \n in middle     no     REG_NEWLINE
   9674 
   9675        PCRE's behaviour is the same as Perl's, except that there is no equiva-
   9676        lent for PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl,  there  is
   9677        no way to stop newline from matching [^a].
   9678 
   9679        The   default  POSIX  newline  handling  can  be  obtained  by  setting
   9680        PCRE_DOTALL and PCRE_DOLLAR_ENDONLY, but there is no way to  make  PCRE
   9681        behave exactly as for the REG_NEWLINE action.
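
               The POSIX table can be checked with a small C sketch. Note the
               assumptions: posix_matches() is a hypothetical helper, and the code is
               built against the system's POSIX <regex.h>, not against pcreposix.h, so
               it exercises the POSIX column only. Built against PCRE's wrapper, the
               same calls would be expected to follow the PCRE table instead.

```c
#include <assert.h>
#include <regex.h>   /* assumption: the system's POSIX implementation */
#include <stddef.h>

/* Hypothetical helper: 1 if pattern matches subject, 0 if it does not,
 * -1 on any other error. Extra compile flags are passed in cflags. */
int posix_matches(const char *pattern, int cflags, const char *subject)
{
    regex_t preg;
    int rc;

    assert(pattern != NULL && subject != NULL);
    if (regcomp(&preg, pattern, cflags | REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&preg, subject, 0, NULL, 0);
    regfree(&preg);
    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```

               On a POSIX-conforming implementation, posix_matches("a.b", 0, "a\nb")
               yields 1 and posix_matches("a.b", REG_NEWLINE, "a\nb") yields 0, while
               "^b" against the same subject matches only when REG_NEWLINE is set.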
   9682 
   9683 
   9684 MATCHING A PATTERN
   9685 
   9686        The  function  regexec()  is  called  to  match a compiled pattern preg
   9687        against a given string, which is by default terminated by a  zero  byte
   9688        (but  see  REG_STARTEND below), subject to the options in eflags. These
   9689        can be:
   9690 
   9691          REG_NOTBOL
   9692 
   9693        The PCRE_NOTBOL option is set when calling the underlying PCRE matching
   9694        function.
   9695 
   9696          REG_NOTEMPTY
   9697 
   9698        The PCRE_NOTEMPTY option is set when calling the underlying PCRE match-
   9699        ing function. Note that REG_NOTEMPTY is not part of the POSIX standard.
   9700        However, setting this option can give more POSIX-like behaviour in some
   9701        situations.
   9702 
   9703          REG_NOTEOL
   9704 
   9705        The PCRE_NOTEOL option is set when calling the underlying PCRE matching
   9706        function.
   9707 
   9708          REG_STARTEND
   9709 
   9710        The  string  is  considered to start at string + pmatch[0].rm_so and to
   9711        have a terminating NUL located at string + pmatch[0].rm_eo (there  need
   9712        not  actually  be  a  NUL at that location), regardless of the value of
   9713        nmatch. This is a BSD extension, compatible with but not  specified  by
   9714        IEEE  Standard  1003.2  (POSIX.2),  and  should be used with caution in
   9715        software intended to be portable to other systems. Note that a non-zero
   9716        rm_so does not imply REG_NOTBOL; REG_STARTEND affects only the location
   9717        of the string, not how it is matched.
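
               A possible use of REG_STARTEND is sketched below. It is an illustration
               under stated assumptions: match_window() is a hypothetical helper, the
               header in use is assumed to define REG_STARTEND (it is an extension),
               and the system <regex.h> stands in for pcreposix.h. The sketch relies
               only on the end of the window acting as the end of the string.

```c
#include <assert.h>
#include <regex.h>   /* assumption: header that defines REG_STARTEND */
#include <stddef.h>

/* Hypothetical helper: 1 if pattern matches within subject[start..end),
 * 0 if it does not, -1 on any other error. */
int match_window(const char *pattern, const char *subject,
                 size_t start, size_t end)
{
    regex_t preg;
    regmatch_t window[1];
    int rc;

    assert(subject != NULL && start <= end);
    if (regcomp(&preg, pattern, REG_EXTENDED) != 0)
        return -1;

    /* On input, pmatch[0] delimits the part of the string to search. */
    window[0].rm_so = (regoff_t)start;
    window[0].rm_eo = (regoff_t)end;
    rc = regexec(&preg, subject, 1, window, REG_STARTEND);
    regfree(&preg);

    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```

               With the subject "xxfoo123yy", the window from 2 to 8 covers "foo123",
               so the pattern "[0-9]+$" matches there even though no NUL follows the
               digits; widening the window to 10 makes the same pattern fail.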
   9718 
   9719        If the pattern was compiled with the REG_NOSUB flag, no data about  any
   9720        matched  strings  is  returned.  The  nmatch  and  pmatch  arguments of
   9721        regexec() are ignored.
   9722 
   9723        If the value of nmatch is zero, or if the value of pmatch is NULL, no data
   9724        about any matched strings is returned.
   9725 
   9726        Otherwise, the portion of the string that was matched, and also any cap-
   9727        tured substrings, are returned via the pmatch argument, which points to
   9728        an  array  of nmatch structures of type regmatch_t, containing the mem-
   9729        bers rm_so and rm_eo. These contain the offset to the  first  character
   9730        of  each  substring and the offset to the first character after the end
   9731        of each substring, respectively. The 0th element of the vector  relates
   9732        to  the  entire portion of string that was matched; subsequent elements
   9733        relate to the capturing subpatterns of the regular  expression.  Unused
   9734        entries in the array have both structure members set to -1.
   9735 
   9736        A  successful  match  yields  a  zero  return;  various error codes are
   9737        defined in the header file, of  which  REG_NOMATCH  is  the  "expected"
   9738        failure code.
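
               Reading captured substrings out of pmatch might look like the following
               sketch. The helper copy_group() and the array size MAX_GROUPS are
               hypothetical choices made for this example, and the system <regex.h>
               is assumed in place of pcreposix.h so that the code builds without
               PCRE installed.

```c
#include <assert.h>
#include <regex.h>   /* assumption: system header; pcreposix.h has the same calls */
#include <stddef.h>
#include <string.h>

#define MAX_GROUPS 10   /* arbitrary size chosen for this sketch */

/* Hypothetical helper: match pattern against subject and copy capture
 * number `group` (0 = whole match) into out. Returns the substring
 * length, -1 for no match or an unset group, -2 for other errors. */
int copy_group(const char *pattern, const char *subject,
               size_t group, char *out, size_t outsz)
{
    regex_t preg;
    regmatch_t pmatch[MAX_GROUPS];
    size_t len;
    int rc;

    assert(out != NULL && outsz > 0);
    if (group >= MAX_GROUPS || regcomp(&preg, pattern, REG_EXTENDED) != 0)
        return -2;

    rc = regexec(&preg, subject, MAX_GROUPS, pmatch, 0);
    regfree(&preg);
    if (rc == REG_NOMATCH)
        return -1;
    if (rc != 0)
        return -2;

    /* Unused entries have both members set to -1. */
    if (pmatch[group].rm_so == -1)
        return -1;

    /* rm_eo is the offset of the first character after the substring. */
    len = (size_t)(pmatch[group].rm_eo - pmatch[group].rm_so);
    if (len >= outsz)
        return -2;
    memcpy(out, subject + pmatch[group].rm_so, len);
    out[len] = '\0';
    return (int)len;
}
```

               For the pattern "([a-z]+):([0-9]+)" and the subject "ruby:1234", group
               1 yields "ruby" and group 2 yields "1234".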
   9739 
   9740 
   9741 ERROR MESSAGES
   9742 
   9743        The regerror() function maps a non-zero errorcode from either regcomp()
   9744        or regexec() to a printable message. If preg is  not  NULL,  the  error
   9745        should have arisen from the use of that structure. A message terminated
   9746        by a binary zero is placed  in  errbuf.  The  length  of  the  message,
   9747        including  the  zero, is limited to errbuf_size. The yield of the func-
   9748        tion is the size of buffer needed to hold the whole message.
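
               Because the yield is the size needed for the whole message, regerror()
               can be called twice: once to find the size and once to fetch the text.
               The sketch below assumes that a zero-length buffer is accepted for the
               first call, which is the usual POSIX idiom; compile_error_text() is a
               hypothetical helper and the system <regex.h> stands in for pcreposix.h.

```c
#include <assert.h>
#include <regex.h>   /* assumption: system header; pcreposix.h has the same calls */
#include <stdlib.h>

/* Hypothetical helper: return a malloc'd message for a pattern that fails
 * to compile, or NULL if it compiles (or if memory cannot be obtained). */
char *compile_error_text(const char *pattern)
{
    regex_t preg;
    size_t needed;
    char *msg;
    int rc;

    assert(pattern != NULL);
    rc = regcomp(&preg, pattern, REG_EXTENDED);
    if (rc == 0) {
        regfree(&preg);
        return NULL;
    }

    /* First call: a zero-sized buffer only reports the size required. */
    needed = regerror(rc, &preg, NULL, 0);
    msg = malloc(needed);
    if (msg != NULL)
        regerror(rc, &preg, msg, needed);   /* second call fills the buffer */
    return msg;
}
```

               The caller owns the returned string and releases it with free().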
   9749 
   9750 
   9751 MEMORY USAGE
   9752 
   9753        Compiling a regular expression causes memory to be allocated and  asso-
   9754        ciated  with  the preg structure. The function regfree() frees all such
   9755        memory, after which preg may no longer be used as  a  compiled  expres-
   9756        sion.
   9757 
   9758 
   9759 AUTHOR
   9760 
   9761        Philip Hazel
   9762        University Computing Service
   9763        Cambridge CB2 3QH, England.
   9764 
   9765 
   9766 REVISION
   9767 
   9768        Last updated: 09 January 2012
   9769        Copyright (c) 1997-2012 University of Cambridge.
   9770 ------------------------------------------------------------------------------
   9771 
   9772 
   9773 PCRECPP(3)                 Library Functions Manual                 PCRECPP(3)
   9774 
   9775 
   9776 
   9777 NAME
   9778        PCRE - Perl-compatible regular expressions.
   9779 
   9780 SYNOPSIS OF C++ WRAPPER
   9781 
   9782        #include <pcrecpp.h>
   9783 
   9784 
   9785 DESCRIPTION
   9786 
   9787        The  C++  wrapper  for PCRE was provided by Google Inc. Some additional
   9788        functionality was added by Giuseppe Maxia. This brief man page was con-
   9789        structed  from  the  notes  in the pcrecpp.h file, which should be con-
   9790        sulted for further details. Note that the C++ wrapper supports only the
   9791        original  8-bit  PCRE  library. There is no 16-bit or 32-bit support at
   9792        present.
   9793 
   9794 
   9795 MATCHING INTERFACE
   9796 
   9797        The "FullMatch" operation checks that supplied text matches a  supplied
   9798        pattern  exactly.  If pointer arguments are supplied, it copies matched
   9799        sub-strings that match sub-patterns into them.
   9800 
   9801          Example: successful match
   9802             pcrecpp::RE re("h.*o");
   9803             re.FullMatch("hello");
   9804 
   9805          Example: unsuccessful match (requires full match):
   9806             pcrecpp::RE re("e");
   9807             !re.FullMatch("hello");
   9808 
   9809          Example: creating a temporary RE object:
   9810             pcrecpp::RE("h.*o").FullMatch("hello");
   9811 
   9812        You can pass in a "const char*" or a "string" for "text". The  examples
   9813        below  tend to use a const char*. You can, as in the different examples
   9814        above, store the RE object explicitly in a variable or use a  temporary
   9815        RE  object.  The  examples below use one mode or the other arbitrarily.
   9816        Either could correctly be used for any of these examples.
   9817 
   9818        You must supply extra pointer arguments to extract matched subpieces.
   9819 
   9820          Example: extracts "ruby" into "s" and 1234 into "i"
   9821             int i;
   9822             string s;
   9823             pcrecpp::RE re("(\\w+):(\\d+)");
   9824             re.FullMatch("ruby:1234", &s, &i);
   9825 
   9826          Example: does not try to extract any extra sub-patterns
   9827             re.FullMatch("ruby:1234", &s);
   9828 
   9829          Example: does not try to extract into NULL
   9830             re.FullMatch("ruby:1234", NULL, &i);
   9831 
   9832          Example: integer overflow causes failure
   9833             !re.FullMatch("ruby:1234567891234", NULL, &i);
   9834 
   9835          Example: fails because there aren't enough sub-patterns:
   9836             !pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &s);
   9837 
   9838          Example: fails because string cannot be stored in integer
   9839             !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
   9840 
   9841        The provided pointer arguments can be pointers to  any  scalar  numeric
   9842        type, or one of:
   9843 
   9844           string        (matched piece is copied to string)
   9845           StringPiece   (StringPiece is mutated to point to matched piece)
   9846           T             (where "bool T::ParseFrom(const char*, int)" exists)
   9847           NULL          (the corresponding matched sub-pattern is not copied)
   9848 
   9849        The  function returns true iff all of the following conditions are sat-
   9850        isfied:
   9851 
   9852          a. "text" matches "pattern" exactly;
   9853 
   9854          b. The number of matched sub-patterns is >= number of supplied
   9855             pointers;
   9856 
   9857          c. The "i"th argument has a suitable type for holding the
   9858             string captured as the "i"th sub-pattern. If you pass in
   9859             void * NULL for the "i"th argument, or a non-void * NULL
   9860             of the correct type, or pass fewer arguments than the
   9861             number of sub-patterns, the "i"th captured sub-pattern is
   9862             ignored.
   9863 
   9864        CAVEAT: An optional sub-pattern that does  not  exist  in  the  matched
   9865        string  is  assigned  the  empty  string. Therefore, the following will
   9866        return false (because the empty string is not a valid number):
   9867 
   9868           int number;
   9869           pcrecpp::RE("[a-z]+(\\d+)?").FullMatch("abc", &number);
   9870 
   9871        The matching interface supports at most 16 arguments per call.  If  you
   9872        need    more,    consider    using    the    more   general   interface
   9873        pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for DoMatch.
   9874 
   9875        NOTE: Do not use no_arg, which is used internally to mark the end of  a
   9876        list  of optional arguments, as a placeholder for missing arguments, as
   9877        this can lead to segfaults.
   9878 
   9879 
   9880 QUOTING METACHARACTERS
   9881 
   9882        You can use the "QuoteMeta" operation to insert backslashes before  all
   9883        potentially  meaningful  characters  in  a string. The returned string,
   9884        used as a regular expression, will exactly match the original string.
   9885 
   9886          Example:
   9887             string quoted = RE::QuoteMeta(unquoted);
   9888 
   9889        Note that it's legal to escape a character even if it  has  no  special
   9890        meaning  in  a  regular expression -- so this function does that. (This
   9891        also makes it identical to the Perl function of the same name; see
   9892        "perldoc    -f    quotemeta".)    For   example,   "1.5-2.0?"   becomes
   9893        "1\.5\-2\.0\?".
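
               The quoting rule can be illustrated with a short C sketch. This is not
               the wrapper's implementation: quote_meta_sketch() is a hypothetical
               function, and it assumes that ASCII letters, digits, underscore, and
               bytes with the top bit set are copied unchanged while every other
               character gets a backslash.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: write a backslash-quoted copy of in to out.
 * Returns the output length, or -1 if out is too small. */
long quote_meta_sketch(const char *in, char *out, size_t outsz)
{
    size_t n = 0;

    assert(in != NULL && out != NULL && outsz > 0);
    for (; *in != '\0'; ++in) {
        unsigned char c = (unsigned char)*in;
        /* Assumption: these characters pass through unquoted. */
        int plain = c >= 0x80 || c == '_' || isalnum(c);

        /* Leave room for this character and the terminating zero. */
        if (n + (plain ? 1u : 2u) >= outsz)
            return -1;
        if (!plain)
            out[n++] = '\\';
        out[n++] = (char)c;
    }
    out[n] = '\0';
    return (long)n;
}
```

               Applied to "1.5-2.0?", the sketch produces the twelve characters
               1\.5\-2\.0\? as in the example above.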
   9894 
   9895 
   9896 PARTIAL MATCHES
   9897 
   9898        You can use the "PartialMatch" operation when you want the  pattern  to
   9899        match any substring of the text.
   9900 
   9901          Example: simple search for a string:
   9902             pcrecpp::RE("ell").PartialMatch("hello");
   9903 
   9904          Example: find first number in a string:
   9905             int number;
   9906             pcrecpp::RE re("(\\d+)");
   9907             re.PartialMatch("x*100 + 20", &number);
   9908             assert(number == 100);
   9909 
   9910 
   9911 UTF-8 AND THE MATCHING INTERFACE
   9912 
   9913        By  default,  pattern  and text are plain text, one byte per character.
   9914        The UTF8 flag, passed to  the  constructor,  causes  both  pattern  and
   9915        string to be treated as UTF-8 text, still a byte stream but potentially
   9916        multiple bytes per character. In practice, the text is likelier  to  be
   9917        UTF-8  than  the pattern, but the match returned may depend on the UTF8
   9918        flag, so always use it when matching UTF-8 text. For example, "." will
   9919        match one byte normally, but with UTF8 set it matches one character,
   9920        which may occupy up to four bytes.
   9921 
   9922          Example:
   9923             pcrecpp::RE_Options options;
   9924             options.set_utf8();
   9925             pcrecpp::RE re(utf8_pattern, options);
   9926             re.FullMatch(utf8_string);
   9927 
   9928          Example: using the convenience function UTF8():
   9929             pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
   9930             re.FullMatch(utf8_string);
   9931 
   9932        NOTE: The UTF8 flag is ignored if pcre was not configured with the
   9933              --enable-utf8 flag.
   9934 
   9935 
   9936 PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE
   9937 
   9938        PCRE defines some modifiers to  change  the  behavior  of  the  regular
   9939        expression   engine.  The  C++  wrapper  defines  an  auxiliary  class,
   9940        RE_Options, as a vehicle to pass such modifiers to  a  RE  class.  Cur-
   9941        rently, the following modifiers are supported:
   9942 
   9943           modifier              description               Perl equivalent
   9944 
   9945           PCRE_CASELESS         case insensitive match      /i
   9946           PCRE_MULTILINE        multiple lines match        /m
   9947           PCRE_DOTALL           dot matches newlines        /s
   9948           PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
   9949           PCRE_EXTRA            strict escape parsing       N/A
   9950           PCRE_EXTENDED         ignore white spaces         /x
   9951           PCRE_UTF8             handles UTF8 chars          built-in
   9952           PCRE_UNGREEDY         reverses * and *?           N/A
   9953           PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)
   9954 
   9955        (*) Both Perl and PCRE allow non-capturing parentheses by means of the
   9956        "?:" modifier within the pattern itself. For example, (?:ab|cd) does not
   9957        capture, while (ab|cd) does.
   9958 
   9959        For a full account of how each modifier works, please check the PCRE
   9960        API reference page.
   9961 
   9962        For each modifier, there are two member functions whose names are derived
   9963        from the modifier name in lowercase, without the "PCRE_" prefix. For
   9964        instance, PCRE_CASELESS is handled by
   9965 
   9966          bool caseless()
   9967 
   9968        which returns true if the modifier is set, and
   9969 
   9970          RE_Options & set_caseless(bool)
   9971 
   9972        which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can
   9973        be  accessed  through  the  set_match_limit()  and match_limit() member
   9974        functions. Setting match_limit to a non-zero value will limit the  exe-
   9975        cution  of pcre to keep it from doing bad things like blowing the stack
   9976        or taking an eternity to return a result.  A  value  of  5000  is  good
   9977        enough  to stop stack blowup in a 2MB thread stack. Setting match_limit
   9978        to zero disables match limiting. Alternatively, you can call
   9979        set_match_limit_recursion(), which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION
   9980        to limit how much PCRE recurses. The match limit bounds the number of
   9981        times PCRE's internal matching function is called; the recursion limit
   9982        bounds the depth of internal recursion, and therefore the stack used.
   9983 
   9984        Normally, to pass one or more modifiers to a RE class,  you  declare  a
   9985        RE_Options object, set the appropriate options, and pass this object to
   9986        a RE constructor. Example:
   9987 
   9988           RE_Options opt;
   9989           opt.set_caseless(true);
   9990           if (RE("HELLO", opt).PartialMatch("hello world")) ...
   9991 
   9992        RE_Options has two constructors. The default constructor takes no argu-
   9993        ments and creates a set of flags that are all off. The other takes an
   9994        option_flags parameter, which is intended to ease the transfer of legacy
   9995        code from C programs. This lets you do
   9996 
   9997           RE(pattern,
   9998             RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
   9999 
   10000        However, new code is better off doing
   10001 
   10002           RE(pattern,
   10003             RE_Options().set_caseless(true).set_multiline(true))
   10004               .PartialMatch(str);
   10005 
   10006        If you are going to pass one of the most used modifiers, there are some
   10007        convenience functions that return a RE_Options class with the appropri-
   10008        ate  modifier  already  set: CASELESS(), UTF8(), MULTILINE(), DOTALL(),
   10009        and EXTENDED().
   10010 
   10011        If you need to set several options at once and do not want to go to the
   10012        trouble of declaring an RE_Options object and setting each option in a
   10013        separate statement, you can set them on the fly instead. Calls to the
   10014        set_xxxxx() member functions can be chained, since
   10015        each of them returns a reference to its class object. For  example,  to
   10016        pass  PCRE_CASELESS, PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one
   10017        statement, you may write:
   10018 
   10019           RE(" ^ xyz \\s+ .* blah$",
   10020             RE_Options()
   10021               .set_caseless(true)
   10022               .set_extended(true)
   10023               .set_multiline(true)).PartialMatch(sometext);
   10024 
   10025 
   10026 SCANNING TEXT INCREMENTALLY
   10027 
   10028        The "Consume" operation may be useful if you want to  repeatedly  match
   10029        regular expressions at the front of a string and skip over them as they
   10030        match. This requires use of the "StringPiece" type, which represents  a
   10031        sub-range  of  a  real  string.  Like RE, StringPiece is defined in the
   10032        pcrecpp namespace.
   10033 
   10034          Example: read lines of the form "var = value" from a string.
   10035             string contents = ...;                 // Fill string somehow
   10036             pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece
   10037 
   10038             string var;
   10039             int value;
   10040             pcrecpp::RE re("(\\w+) = (\\d+)\n");
   10041             while (re.Consume(&input, &var, &value)) {
   10042               ...;
   10043             }
   10044 
   10045        Each successful call  to  "Consume"  will  set  "var/value",  and  also
   10046        advance "input" so it points past the matched text.
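
                The same consume-and-advance loop can be sketched in C with the POSIX-
                style functions. This is an analogy, not the wrapper's code:
                sum_assignments() is a hypothetical function, a plain pointer plays the
                part of the StringPiece, and the system <regex.h> is assumed in place
                of pcreposix.h, which is why the pattern avoids \w and \d.

```c
#include <assert.h>
#include <regex.h>   /* assumption: system header standing in for pcreposix.h */
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sketch: consume "var = value" lines from the front of text
 * and return the sum of the values, stopping at the first line that does
 * not match. Returns -1 if the pattern cannot be compiled. */
long sum_assignments(const char *text)
{
    regex_t preg;
    regmatch_t pmatch[3];
    const char *input = text;   /* plays the role of the StringPiece */
    long total = 0;

    assert(text != NULL);
    /* The leading ^ anchors each match at the current front of input. */
    if (regcomp(&preg, "^([A-Za-z0-9_]+) = ([0-9]+)\n", REG_EXTENDED) != 0)
        return -1;

    while (regexec(&preg, input, 3, pmatch, 0) == 0) {
        /* strtol() stops at the newline that follows the digits. */
        total += strtol(input + pmatch[2].rm_so, NULL, 10);
        input += pmatch[0].rm_eo;   /* advance past the matched text */
    }
    regfree(&preg);
    return total;
}
```

                Given "a = 1\nbb = 20\nccc = 300\n" the sketch returns 321; given
                "a = 1\noops\nb = 2\n" it stops at the second line and returns 1.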
   10047 
   10048        The  "FindAndConsume"  operation  is  similar to "Consume" but does not
   10049        anchor your match at the beginning of  the  string.  For  example,  you
   10050        could extract all words from a string by repeatedly calling
   10051 
   10052          pcrecpp::RE("(\\w+)").FindAndConsume(&input, &word)
   10053 
   10054 
   10055 PARSING HEX/OCTAL/C-RADIX NUMBERS
   10056 
   10057        By default, if you pass a pointer to a numeric value, the corresponding
   10058        text is interpreted as a base-10  number.  You  can  instead  wrap  the
   10059        pointer with a call to one of the operators Hex(), Octal(), or CRadix()
   10060        to interpret the text in another base. The CRadix  operator  interprets
   10061        C-style  "0"  (base-8)  and  "0x"  (base-16)  prefixes, but defaults to
   10062        base-10.
   10063 
   10064          Example:
   10065            int a, b, c, d;
   10066            pcrecpp::RE re("(.*) (.*) (.*) (.*)");
   10067            re.FullMatch("100 40 0100 0x40",
   10068                         pcrecpp::Octal(&a), pcrecpp::Hex(&b),
   10069                         pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
   10070 
   10071        will leave 64 in a, b, c, and d.
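
                The arithmetic can be checked with the C library's strtol(), whose base
                argument behaves in the way described: 8 and 16 select a fixed radix,
                and 0 honours C-style prefixes. This is an illustration in C and makes
                no claim about how the wrapper itself is implemented; parse_radix() is
                a hypothetical helper.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: parse all of text in the given base. If text has
 * no digits or has trailing characters, return fallback instead. Base 0
 * makes strtol() honour the C-style "0" and "0x" prefixes. */
long parse_radix(const char *text, int base, long fallback)
{
    char *end;
    long value;

    assert(text != NULL);
    value = strtol(text, &end, base);
    if (end == text || *end != '\0')
        return fallback;
    return value;
}
```

                Each of parse_radix("100", 8, -1), parse_radix("40", 16, -1),
                parse_radix("0100", 0, -1), and parse_radix("0x40", 0, -1) yields 64.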
   10072 
   10073 
   10074 REPLACING PARTS OF STRINGS
   10075 
   10076        You can replace the first match of "pattern" in "str"  with  "rewrite".
   10077        Within  "rewrite",  backslash-escaped  digits (\1 to \9) can be used to
   10078        insert the text matched by the corresponding parenthesized group in the
   10079        pattern. \0 in "rewrite" refers to the entire matching text. For example:
   10080 
   10081          string s = "yabba dabba doo";
   10082          pcrecpp::RE("b+").Replace("d", &s);
   10083 
   10084        will  leave  "s" containing "yada dabba doo". The result is true if the
   10085        pattern matches and a replacement occurs, false otherwise.
   10086 
   10087        GlobalReplace is like Replace except that it replaces  all  occurrences
   10088        of  the  pattern  in  the string with the rewrite. Replacements are not
   10089        subject to re-matching. For example:
   10090 
   10091          string s = "yabba dabba doo";
   10092          pcrecpp::RE("b+").GlobalReplace("d", &s);
   10093 
   10094        will leave "s" containing "yada dada doo". It  returns  the  number  of
   10095        replacements made.
   10096 
   10097        Extract  is like Replace, except that if the pattern matches, "rewrite"
   10098        is copied into "out" (an additional argument) with substitutions.   The
   10099        non-matching  portions  of "text" are ignored. Returns true iff a match
   10100        occurred and the extraction happened successfully;  if no match occurs,
   10101        the string is left unaffected.
   10102 
   10103 
   10104 AUTHOR
   10105 
   10106        The C++ wrapper was contributed by Google Inc.
   10107        Copyright (c) 2007 Google Inc.
   10108 
   10109 
   10110 REVISION
   10111 
   10112        Last updated: 08 January 2012
   10113 ------------------------------------------------------------------------------
   10114 
   10115 
   10116 PCRESAMPLE(3)              Library Functions Manual              PCRESAMPLE(3)
   10117 
   10118 
   10119 
   10120 NAME
   10121        PCRE - Perl-compatible regular expressions
   10122 
   10123 PCRE SAMPLE PROGRAM
   10124 
   10125        A simple, complete demonstration program, to get you started with using
   10126        PCRE, is supplied in the file pcredemo.c in the  PCRE  distribution.  A
   10127        listing  of this program is given in the pcredemo documentation. If you
   10128        do not have a copy of the PCRE distribution, you can save this  listing
   10129        to re-create pcredemo.c.
   10130 
   10131        The  demonstration program, which uses the original PCRE 8-bit library,
   10132        compiles the regular expression that is its first argument, and matches
   10133        it  against  the subject string in its second argument. No PCRE options
   10134        are set, and default character tables are used. If  matching  succeeds,
   10135        the  program  outputs the portion of the subject that matched, together
   10136        with the contents of any captured substrings.
   10137 
   10138        If the -g option is given on the command line, the program then goes on
   10139        to check for further matches of the same regular expression in the same
   10140        subject string. The logic is a little bit tricky because of the  possi-
   10141        bility  of  matching an empty string. Comments in the code explain what
   10142        is going on.
   10143 
   10144        If PCRE is installed in the standard include  and  library  directories
   10145        for your operating system, you should be able to compile the demonstra-
   10146        tion program using this command:
   10147 
   10148          gcc -o pcredemo pcredemo.c -lpcre
   10149 
   10150        If PCRE is installed elsewhere, you may need to add additional  options
   10151        to  the  command line. For example, on a Unix-like system that has PCRE
   10152        installed in /usr/local, you  can  compile  the  demonstration  program
   10153        using a command like this:
   10154 
   10155          gcc -o pcredemo -I/usr/local/include pcredemo.c \
   10156              -L/usr/local/lib -lpcre
   10157 
   10158        In  a  Windows  environment, if you want to statically link the program
   10159        against a non-dll pcre.a file, you must uncomment the line that defines
   10160        PCRE_STATIC  before  including  pcre.h, because otherwise the pcre_mal-
   10161        loc()   and   pcre_free()   exported   functions   will   be   declared
   10162        __declspec(dllimport), with unwanted results.
   10163 
   10164        Once  you  have  compiled and linked the demonstration program, you can
   10165        run simple tests like this:
   10166 
   10167          ./pcredemo 'cat|dog' 'the cat sat on the mat'
   10168          ./pcredemo -g 'cat|dog' 'the dog sat on the cat'
   10169 
   10170        Note that there is a  much  more  comprehensive  test  program,  called
   10171        pcretest,  which  supports  many  more  facilities  for testing regular
   10172        expressions and all three PCRE libraries. The pcredemo program is provided
   10173        as a simple coding example.
   10174 
   10175        If  you  try to run pcredemo when PCRE is not installed in the standard
   10176        library directory, you may get an error like  this  on  some  operating
   10177        systems (e.g. Solaris):
   10178 
   10179          ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or
   10180          directory
   10181 
   10182        This is caused by the way shared library support works  on  those  sys-
   10183        tems. You need to add
   10184 
   10185          -R/usr/local/lib
   10186 
   10187        (for example) to the compile command to get round this problem.
   10188 
   10189 
   10190 AUTHOR
   10191 
   10192        Philip Hazel
   10193        University Computing Service
   10194        Cambridge CB2 3QH, England.
   10195 
   10196 
   10197 REVISION
   10198 
   10199        Last updated: 10 January 2012
   10200        Copyright (c) 1997-2012 University of Cambridge.
   10201 ------------------------------------------------------------------------------
   10202 PCRELIMITS(3)              Library Functions Manual              PCRELIMITS(3)
   10203 
   10204 
   10205 
   10206 NAME
   10207        PCRE - Perl-compatible regular expressions
   10208 
   10209 SIZE AND OTHER LIMITATIONS
   10210 
   10211        There  are some size limitations in PCRE but it is hoped that they will
   10212        never in practice be relevant.
   10213 
   10214        The maximum length of a compiled  pattern  is  approximately  64K  data
   10215        units  (bytes  for  the  8-bit  library,  16-bit  units  for the 16-bit
   10216        library, and 32-bit units for the 32-bit library) if PCRE  is  compiled
   10217        with  the default internal linkage size, which is 2 bytes for the 8-bit
   10218        and 16-bit libraries, and 4 bytes for the 32-bit library. If  you  want
   10219        to process regular expressions that are truly enormous, you can compile
   10220        PCRE with an internal linkage size of 3 or 4 (when building the  16-bit
   10221        or  32-bit  library,  3 is rounded up to 4). See the README file in the
   10222        source distribution and the pcrebuild  documentation  for  details.  In
   10223        these  cases  the limit is substantially larger.  However, the speed of
   10224        execution is slower.
   10225 
   10226        All values in repeating quantifiers must be less than 65536.
   10227 
   10228        There is no limit to the number of parenthesized subpatterns, but there
   10229        can  be  no more than 65535 capturing subpatterns. There is, however, a
   10230        limit to the depth of  nesting  of  parenthesized  subpatterns  of  all
   10231        kinds.  This  is  imposed  in order to limit the amount of system stack
   10232        used at compile time. The limit can be specified when  PCRE  is  built;
   10233        the default is 250.
   10234 
   10235        There is a limit to the number of forward references to subsequent sub-
   10236        patterns of around 200,000.  Repeated  forward  references  with  fixed
   10237        upper  limits,  for example, (?2){0,100} when subpattern number 2 is to
   10238        the right, are included in the count. There is no limit to  the  number
   10239        of backward references.
   10240 
   10241        The maximum length of a name for a named subpattern is 32  characters,
   10242        and the maximum number of named subpatterns is 10000.
   10243 
   10244        The maximum length of a  name  in  a  (*MARK),  (*PRUNE),  (*SKIP),  or
   10245        (*THEN)  verb is 255 for the 8-bit library and 65535 for the 16-bit and
   10246        32-bit libraries.
   10247 
   10248        The maximum length of a subject string is the largest  positive  number
   10249        that  an integer variable can hold. However, when using the traditional
   10250        matching function, PCRE uses recursion to handle subpatterns and indef-
   10251        inite  repetition.  This means that the available stack space may limit
   10252        the size of a subject string that can be processed by certain patterns.
   10253        For a discussion of stack issues, see the pcrestack documentation.
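
                The matching functions take the subject length as an  int,  so  a
                caller  that holds the length as a size_t may want to check it be-
                fore converting. The following is a minimal sketch of such a check.
                The  helper  name subject_length_as_int() is hypothetical and is
                not part of PCRE.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE. Converts the length of a
   NUL-terminated subject string to int. Returns 0 on success, or -1 if
   the length is larger than INT_MAX and so cannot be passed as an int. */
int subject_length_as_int(const char *subject, int *length_out)
{
    size_t n = strlen(subject);

    if (n > (size_t)INT_MAX)
        return -1;              /* too long for an int length argument */
    *length_out = (int)n;
    return 0;
}
```

                A caller would reject or split the subject when  the  helper  re-
                turns -1, rather than pass a truncated length to the matcher.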
   10254 
   10255 
   10256 AUTHOR
   10257 
   10258        Philip Hazel
   10259        University Computing Service
   10260        Cambridge CB2 3QH, England.
   10261 
   10262 
   10263 REVISION
   10264 
   10265        Last updated: 05 November 2013
   10266        Copyright (c) 1997-2013 University of Cambridge.
   10267 ------------------------------------------------------------------------------
   10268 
   10269 
   10270 PCRESTACK(3)               Library Functions Manual               PCRESTACK(3)
   10271 
   10272 
   10273 
   10274 NAME
   10275        PCRE - Perl-compatible regular expressions
   10276 
   10277 PCRE DISCUSSION OF STACK USAGE
   10278 
   10279        When  you call pcre[16|32]_exec(), it makes use of an internal function
   10280        called match(). This calls itself recursively at branch points  in  the
   10281        pattern,  in  order  to  remember the state of the match so that it can
   10282        back up and try a different alternative if  the  first  one  fails.  As
   10283        matching proceeds deeper and deeper into the tree of possibilities, the
   10284        recursion depth increases. The match() function is also called in other
   10285        circumstances,  for  example,  whenever  a parenthesized sub-pattern is
   10286        entered, and in certain cases of repetition.
   10287 
   10288        Not all calls of match() increase the recursion depth; for an item such
   10289        as  a* it may be called several times at the same level, after matching
   10290        different numbers of a's. Furthermore, in a number of cases  where  the
   10291        result  of  the  recursive call would immediately be passed back as the
   10292        result of the current call (a "tail recursion"), the function  is  just
   10293        restarted instead.
   10294 
   10295        The  above  comments apply when pcre[16|32]_exec() is run in its normal
   10296        interpretive  manner.   If   the   pattern   was   studied   with   the
   10297        PCRE_STUDY_JIT_COMPILE  option, and just-in-time compiling was success-
   10298        ful, and the options passed to pcre[16|32]_exec() were  not  incompati-
   10299        ble,  the  matching  process  uses the JIT-compiled code instead of the
   10300        match() function. In this case, the  memory  requirements  are  handled
   10301        entirely differently. See the pcrejit documentation for details.
   10302 
   10303        The  pcre[16|32]_dfa_exec()  function operates in an entirely different
   10304        way, and uses recursion only when there is a regular expression  recur-
   10305        sion or subroutine call in the pattern. This includes the processing of
   10306        assertion and "once-only" subpatterns, which are handled  like  subrou-
   10307        tine  calls.  Normally, these are never very deep, and the limit on the
   10308        complexity of pcre[16|32]_dfa_exec() is controlled  by  the  amount  of
   10309        workspace  it is given.  However, it is possible to write patterns with
   10310        runaway    infinite    recursions;    such    patterns    will    cause
   10311        pcre[16|32]_dfa_exec()  to  run  out  of stack. At present, there is no
   10312        protection against this.
   10313 
   10314        The comments that follow do NOT apply to  pcre[16|32]_dfa_exec();  they
   10315        are relevant only for pcre[16|32]_exec() without the JIT optimization.
   10316 
   10317    Reducing pcre[16|32]_exec()'s stack usage
   10318 
   10319        Each  time  that match() is actually called recursively, it uses memory
   10320        from the process stack. For certain kinds of  pattern  and  data,  very
   10321        large  amounts of stack may be needed, despite the recognition of "tail
   10322        recursion".  You can often reduce the amount of recursion,  and  there-
   10323        fore  the  amount of stack used, by modifying the pattern that is being
   10324        matched. Consider, for example, this pattern:
   10325 
   10326          ([^<]|<(?!inet))+
   10327 
   10328        It matches from wherever it starts until it encounters "<inet"  or  the
   10329        end  of  the  data,  and is the kind of pattern that might be used when
   10330        processing an XML file. Each iteration of the outer parentheses matches
   10331        either  one  character that is not "<" or a "<" that is not followed by
   10332        "inet". However, each time a  parenthesis  is  processed,  a  recursion
   10333        occurs, so this formulation uses a stack frame for each matched charac-
   10334        ter. For a long string, a lot of stack is required. Consider  now  this
   10335        rewritten pattern, which matches exactly the same strings:
   10336 
   10337          ([^<]++|<(?!inet))+
   10338 
   10339        This  uses very much less stack, because runs of characters that do not
   10340        contain "<" are "swallowed" in one item inside the parentheses.  Recur-
   10341        sion  happens  only when a "<" character that is not followed by "inet"
   10342        is encountered (and we assume this is relatively  rare).  A  possessive
   10343        quantifier  is  used  to stop any backtracking into the runs of non-"<"
   10344        characters, but that is not related to stack usage.
   10345 
   10346        This example shows that one way of avoiding stack problems when  match-
   10347        ing long subject strings is to write repeated parenthesized subpatterns
   10348        to match more than one character whenever possible.
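
                The difference between the two formulations can  be  illustrated
                with  a  toy  model  that counts how many times the outer group is
                entered while scanning a subject. This is only a  sketch  of  the
                reasoning  above;  it  does  not  call  PCRE,  and the function name
                count_group_iterations() is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Toy model, not PCRE code. Counts iterations of the outer group for
     swallow_runs == 0:  ([^<]|<(?!inet))+
     swallow_runs != 0:  ([^<]++|<(?!inet))+
   Scanning stops at "<inet" or at the end of the string. */
size_t count_group_iterations(const char *s, int swallow_runs)
{
    size_t iterations = 0;
    size_t i = 0;

    while (s[i] != '\0') {
        if (s[i] == '<') {
            if (strncmp(s + i + 1, "inet", 4) == 0)
                break;                  /* (?!inet) fails: match ends here */
            i++;                        /* a "<" not followed by "inet" */
        } else if (swallow_runs) {
            while (s[i] != '\0' && s[i] != '<')
                i++;                    /* [^<]++ takes the whole run */
        } else {
            i++;                        /* [^<] takes one character */
        }
        iterations++;                   /* one pass through the group */
    }
    return iterations;
}
```

                For the subject "aaaa<baaaa<inet" the model counts 10  iterations
                for  the  first  pattern and 3 for the second. In the interpretive
                matcher each iteration corresponds to a recursion, so  the  count
                is a rough proxy for stack depth.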
   10349 
   10350    Compiling PCRE to use heap instead of stack for pcre[16|32]_exec()
   10351 
   10352        In environments where stack memory is constrained, you  might  want  to
   10353        compile  PCRE to use heap memory instead of stack for remembering back-
   10354        up points when pcre[16|32]_exec() is running. This makes it run  a  lot
   10355        more slowly, however.  Details of how to do this are given in the pcre-
   10356        build documentation. When built in  this  way,  instead  of  using  the
   10357        stack,  PCRE obtains and frees memory by calling the functions that are
   10358        pointed to by the pcre[16|32]_stack_malloc  and  pcre[16|32]_stack_free
   10359        variables.  By default, these point to malloc() and free(), but you can
   10360        replace the pointers to cause PCRE to use your own functions. Since the
   10361        block sizes are always the same, and are always freed in reverse order,
   10362        it may be possible to implement customized  memory  handlers  that  are
   10363        more efficient than the standard functions.
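
                As an illustration of such a customized handler, the sketch  below
                keeps freed blocks on a last-in, first-out list and reuses them for
                later requests of the same size. It is one possible approach under
                stated   assumptions,  not  PCRE's  own  code:  the  names  frame_alloc(),
                frame_free() and frame_cache_release() are hypothetical, the  code
                is  not  thread-safe,  and  it needs a C11 compiler for max_align_t.

```c
#include <stddef.h>
#include <stdlib.h>

/* Header stored in front of every block. The union makes the payload
   that follows it suitably aligned for any object type. */
typedef union frame_header {
    struct {
        union frame_header *next;   /* next cached block on the free list */
        size_t size;                /* payload size that was requested */
    } info;
    max_align_t align;
} frame_header;

static frame_header *free_list = NULL;  /* most recently freed block first */

/* Same signature as malloc(): reuse the top cached block if it has the
   requested size, otherwise get a new block from malloc(). */
void *frame_alloc(size_t size)
{
    frame_header *h = free_list;

    if (h != NULL && h->info.size == size) {
        free_list = h->info.next;
    } else {
        h = malloc(sizeof(frame_header) + size);
        if (h == NULL)
            return NULL;
        h->info.size = size;
    }
    return h + 1;                   /* payload starts after the header */
}

/* Same signature as free(): keep the block for reuse instead of freeing. */
void frame_free(void *block)
{
    frame_header *h;

    if (block == NULL)
        return;
    h = (frame_header *)block - 1;
    h->info.next = free_list;
    free_list = h;
}

/* Return all cached blocks to the system, for example at shutdown. */
void frame_cache_release(void)
{
    while (free_list != NULL) {
        frame_header *h = free_list;
        free_list = h->info.next;
        free(h);
    }
}
```

                In a heap-using build of the 8-bit library, the handlers would  be
                installed  by  assigning frame_alloc to pcre_stack_malloc and
                frame_free to pcre_stack_free before matching.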
   10364 
   10365    Limiting pcre[16|32]_exec()'s stack usage
   10366 
   10367        You  can set limits on the number of times that match() is called, both
   10368        in total and recursively. If a limit  is  exceeded,  pcre[16|32]_exec()
   10369        returns  an  error code. Setting suitable limits should prevent it from
   10370        running out of stack. The default values of the limits are very  large,
   10371        and  unlikely  ever to operate. They can be changed when PCRE is built,
   10372        and they can also be set when pcre[16|32]_exec() is called. For details
   10373        of these interfaces, see the pcrebuild documentation and the section on
   10374        extra data for pcre[16|32]_exec() in the pcreapi documentation.
   10375 
   10376        As a very rough rule of thumb, you should reckon on about 500 bytes per
   10377        recursion.  Thus,  if  you  want  to limit your stack usage to 8MB, you
   10378        should set the limit at 16000 recursions. A 64MB stack,  on  the  other
   10379        hand, can support around 128000 recursions.
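
                The arithmetic behind these figures is a simple division,  sketched
                below.  The  helper  name recursion_limit_for_stack() is hypotheti-
                cal. The figures in the text correspond to decimal  megabytes;  a
                binary 8MB (8*1024*1024 bytes) gives a slightly larger result.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE. Returns how many recursions of
   frame_bytes each fit into a stack budget of stack_bytes. Returns 0 if
   frame_bytes is 0. */
unsigned long recursion_limit_for_stack(size_t stack_bytes, size_t frame_bytes)
{
    if (frame_bytes == 0)
        return 0;
    return (unsigned long)(stack_bytes / frame_bytes);
}
```

                The result is a candidate value for the recursion limit  that  is
                passed  in the extra data block, as described in the pcreapi docu-
                mentation. It is wise to leave headroom for the stack that the rest
                of the program uses.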
   10380 
   10381        In Unix-like environments, the pcretest test program has a command line
   10382        option (-S) that can be used to increase the size of its stack. As long
   10383        as  the  stack is large enough, another option (-M) can be used to find
   10384        the smallest limits that allow a particular pattern to  match  a  given
   10385        subject  string.  This is done by calling pcre[16|32]_exec() repeatedly
   10386        with different limits.
   10387 
   10388    Obtaining an estimate of stack usage
   10389 
   10390        The actual amount of stack used per recursion can  vary  quite  a  lot,
   10391        depending on the compiler that was used to build PCRE and the optimiza-
   10392        tion or debugging options that were set for it. The rule of thumb value
   10393        of  500  bytes  mentioned  above  may be larger or smaller than what is
   10394        actually needed. A better approximation can be obtained by running this
   10395        command:
   10396 
   10397          pcretest -m -C
   10398 
   10399        The  -C  option causes pcretest to output information about the options
   10400        with which PCRE was compiled. When -m is also given (before -C), infor-
   10401        mation about stack use is given in a line like this:
   10402 
   10403          Match recursion uses stack: approximate frame size = 640 bytes
   10404 
   10405        The value is approximate because some recursions need a bit more (up to
   10406        perhaps 16 more bytes).
   10407 
   10408        If the above command is given when PCRE is compiled  to  use  the  heap
   10409        instead  of  the  stack  for recursion, the value that is output is the
   10410        size of each block that is obtained from the heap.
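
                A program that captures the output of the command  could  extract
                the  figure  as  sketched  below.  The sketch assumes the line has
                exactly the form shown above, which may differ in other releases;
                the helper name parse_frame_size() is hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE. Looks for "frame size = N" in
   one line of pcretest output and returns N, or -1 if it is not found. */
long parse_frame_size(const char *line)
{
    static const char key[] = "frame size = ";
    const char *p = strstr(line, key);
    char *end;
    long value;

    if (p == NULL)
        return -1;                  /* not the line we are looking for */
    p += sizeof(key) - 1;           /* skip past the key text */
    value = strtol(p, &end, 10);
    if (end == p || value <= 0)
        return -1;                  /* no digits, or a nonsensical size */
    return value;
}
```

                The returned value can replace the 500-byte rule  of  thumb  when
                estimating how many recursions fit in a given amount of stack.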
   10411 
   10412    Changing stack size in Unix-like systems
   10413 
   10414        In Unix-like environments, there is not often a problem with the  stack
   10415        unless  very  long  strings  are  involved, though the default limit on
   10416        stack size varies from system to system. Values from 8MB  to  64MB  are
   10417        common. You can find your default limit by running the command:
   10418 
   10419          ulimit -s
   10420 
   10421        Unfortunately,  the  effect  of  running out of stack is often SIGSEGV,
   10422        though sometimes a more explicit error message is given. You  can  nor-
   10423        mally increase the limit on stack size by code such as this:
   10424 
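   10425          struct rlimit rlim;               /* needs <sys/resource.h> */
   10426          getrlimit(RLIMIT_STACK, &rlim);   /* read soft and hard limits */
   10427          rlim.rlim_cur = 100*1024*1024;    /* request a 100MB soft limit */
   10428          setrlimit(RLIMIT_STACK, &rlim);   /* fails if above the hard limit */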
   10429 
   10430        This  reads  the current limits (soft and hard) using getrlimit(), then
   10431        attempts to increase the soft limit to  100MB  using  setrlimit().  You
   10432        must do this before calling pcre[16|32]_exec().
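
                A slightly more defensive variant checks the  return  values  and
                does  not  ask  for  more  than  the hard limit allows. This is a
                sketch under the assumption of  a  POSIX  system;  the  function  name
                raise_stack_soft_limit() is hypothetical.

```c
#include <sys/resource.h>

/* Hypothetical helper. Tries to make the soft stack limit at least
   `wanted` bytes, clamped to the hard limit. Never lowers the limit.
   Returns 0 on success and -1 if getrlimit() or setrlimit() fails. */
int raise_stack_soft_limit(rlim_t wanted)
{
    struct rlimit rlim;

    if (getrlimit(RLIMIT_STACK, &rlim) != 0)
        return -1;
    if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= wanted)
        return 0;                       /* already large enough */
    if (rlim.rlim_max != RLIM_INFINITY && wanted > rlim.rlim_max)
        wanted = rlim.rlim_max;         /* cannot exceed the hard limit */
    rlim.rlim_cur = wanted;
    return setrlimit(RLIMIT_STACK, &rlim);
}
```

                Because the request is clamped, a return value of 0 does not guar-
                antee  the  full  amount;  call  getrlimit() afterwards if the exact
                soft limit matters.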
   10433 
   10434    Changing stack size in Mac OS X
   10435 
   10436        Using setrlimit(), as described above, should also work on Mac OS X. It
   10437        is also possible to set a stack size when linking a program. There is a
   10438        discussion   about   stack  sizes  in  Mac  OS  X  at  this  web  site:
   10439        http://developer.apple.com/qa/qa2005/qa1419.html.
   10440 
   10441 
   10442 AUTHOR
   10443 
   10444        Philip Hazel
   10445        University Computing Service
   10446        Cambridge CB2 3QH, England.
   10447 
   10448 
   10449 REVISION
   10450 
   10451        Last updated: 24 June 2012
   10452        Copyright (c) 1997-2012 University of Cambridge.
   10453 ------------------------------------------------------------------------------
   10454 
   10455 
   10456