<html>
<head>
<title>pcre2api specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2api man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE2 NATIVE API BASIC FUNCTIONS</a>
<li><a name="TOC2" href="#SEC2">PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS</a>
<li><a name="TOC3" href="#SEC3">PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS</a>
<li><a name="TOC4" href="#SEC4">PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS</a>
<li><a name="TOC5" href="#SEC5">PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS</a>
<li><a name="TOC6" href="#SEC6">PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS</a>
<li><a name="TOC7" href="#SEC7">PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION</a>
<li><a name="TOC8" href="#SEC8">PCRE2 NATIVE API JIT FUNCTIONS</a>
<li><a name="TOC9" href="#SEC9">PCRE2 NATIVE API SERIALIZATION FUNCTIONS</a>
<li><a name="TOC10" href="#SEC10">PCRE2 NATIVE API AUXILIARY FUNCTIONS</a>
<li><a name="TOC11" href="#SEC11">PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES</a>
<li><a name="TOC12" href="#SEC12">PCRE2 API OVERVIEW</a>
<li><a name="TOC13" href="#SEC13">STRING LENGTHS AND OFFSETS</a>
<li><a name="TOC14" href="#SEC14">NEWLINES</a>
<li><a name="TOC15" href="#SEC15">MULTITHREADING</a>
<li><a name="TOC16" href="#SEC16">PCRE2 CONTEXTS</a>
<li><a name="TOC17" href="#SEC17">CHECKING BUILD-TIME OPTIONS</a>
<li><a name="TOC18" href="#SEC18">COMPILING A PATTERN</a>
<li><a name="TOC19" href="#SEC19">COMPILATION ERROR CODES</a>
<li><a name="TOC20" href="#SEC20">JUST-IN-TIME (JIT) COMPILATION</a>
<li><a name="TOC21" href="#SEC21">LOCALE SUPPORT</a>
<li><a
name="TOC22" href="#SEC22">INFORMATION ABOUT A COMPILED PATTERN</a>
<li><a name="TOC23" href="#SEC23">INFORMATION ABOUT A PATTERN'S CALLOUTS</a>
<li><a name="TOC24" href="#SEC24">SERIALIZATION AND PRECOMPILING</a>
<li><a name="TOC25" href="#SEC25">THE MATCH DATA BLOCK</a>
<li><a name="TOC26" href="#SEC26">MATCHING A PATTERN: THE TRADITIONAL FUNCTION</a>
<li><a name="TOC27" href="#SEC27">NEWLINE HANDLING WHEN MATCHING</a>
<li><a name="TOC28" href="#SEC28">HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS</a>
<li><a name="TOC29" href="#SEC29">OTHER INFORMATION ABOUT A MATCH</a>
<li><a name="TOC30" href="#SEC30">ERROR RETURNS FROM <b>pcre2_match()</b></a>
<li><a name="TOC31" href="#SEC31">OBTAINING A TEXTUAL ERROR MESSAGE</a>
<li><a name="TOC32" href="#SEC32">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a>
<li><a name="TOC33" href="#SEC33">EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS</a>
<li><a name="TOC34" href="#SEC34">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a>
<li><a name="TOC35" href="#SEC35">CREATING A NEW STRING WITH SUBSTITUTIONS</a>
<li><a name="TOC36" href="#SEC36">DUPLICATE SUBPATTERN NAMES</a>
<li><a name="TOC37" href="#SEC37">FINDING ALL POSSIBLE MATCHES AT ONE POSITION</a>
<li><a name="TOC38" href="#SEC38">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a>
<li><a name="TOC39" href="#SEC39">SEE ALSO</a>
<li><a name="TOC40" href="#SEC40">AUTHOR</a>
<li><a name="TOC41" href="#SEC41">REVISION</a>
</ul>
<P>
<b>#include &lt;pcre2.h&gt;</b>
<br>
<br>
PCRE2 is a new API for PCRE. This document contains a description of all its
functions. See the
<a href="pcre2.html"><b>pcre2</b></a>
document for an overview of all the PCRE2 documentation.
</P>
<br><a name="SEC1" href="#TOC1">PCRE2 NATIVE API BASIC FUNCTIONS</a><br>
<P>
<b>pcre2_code *pcre2_compile(PCRE2_SPTR <i>pattern</i>, PCRE2_SIZE <i>length</i>,</b>
<b>  uint32_t <i>options</i>, int *<i>errorcode</i>, PCRE2_SIZE *<i>erroroffset,</i></b>
<b>  pcre2_compile_context *<i>ccontext</i>);</b>
<br>
<br>
<b>void pcre2_code_free(pcre2_code *<i>code</i>);</b>
<br>
<br>
<b>pcre2_match_data *pcre2_match_data_create(uint32_t <i>ovecsize</i>,</b>
<b>  pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>pcre2_match_data *pcre2_match_data_create_from_pattern(</b>
<b>  const pcre2_code *<i>code</i>, pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>int pcre2_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b>  PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
<b>  uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
<b>  pcre2_match_context *<i>mcontext</i>);</b>
<br>
<br>
<b>int pcre2_dfa_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b>  PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
<b>  uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
<b>  pcre2_match_context *<i>mcontext</i>,</b>
<b>  int *<i>workspace</i>, PCRE2_SIZE <i>wscount</i>);</b>
<br>
<br>
<b>void pcre2_match_data_free(pcre2_match_data *<i>match_data</i>);</b>
</P>
<br><a name="SEC2" href="#TOC1">PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS</a><br>
<P>
<b>PCRE2_SPTR pcre2_get_mark(pcre2_match_data *<i>match_data</i>);</b>
<br>
<br>
<b>uint32_t pcre2_get_ovector_count(pcre2_match_data *<i>match_data</i>);</b>
<br>
<br>
<b>PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *<i>match_data</i>);</b>
<br>
<br>
<b>PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *<i>match_data</i>);</b>
</P>
<br><a name="SEC3" href="#TOC1">PCRE2 NATIVE API GENERAL
CONTEXT FUNCTIONS</a><br>
<P>
<b>pcre2_general_context *pcre2_general_context_create(</b>
<b>  void *(*<i>private_malloc</i>)(PCRE2_SIZE, void *),</b>
<b>  void (*<i>private_free</i>)(void *, void *), void *<i>memory_data</i>);</b>
<br>
<br>
<b>pcre2_general_context *pcre2_general_context_copy(</b>
<b>  pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>void pcre2_general_context_free(pcre2_general_context *<i>gcontext</i>);</b>
</P>
<br><a name="SEC4" href="#TOC1">PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS</a><br>
<P>
<b>pcre2_compile_context *pcre2_compile_context_create(</b>
<b>  pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>pcre2_compile_context *pcre2_compile_context_copy(</b>
<b>  pcre2_compile_context *<i>ccontext</i>);</b>
<br>
<br>
<b>void pcre2_compile_context_free(pcre2_compile_context *<i>ccontext</i>);</b>
<br>
<br>
<b>int pcre2_set_bsr(pcre2_compile_context *<i>ccontext</i>,</b>
<b>  uint32_t <i>value</i>);</b>
<br>
<br>
<b>int pcre2_set_character_tables(pcre2_compile_context *<i>ccontext</i>,</b>
<b>  const unsigned char *<i>tables</i>);</b>
<br>
<br>
<b>int pcre2_set_max_pattern_length(pcre2_compile_context *<i>ccontext</i>,</b>
<b>  PCRE2_SIZE <i>value</i>);</b>
<br>
<br>
<b>int pcre2_set_newline(pcre2_compile_context *<i>ccontext</i>,</b>
<b>  uint32_t <i>value</i>);</b>
<br>
<br>
<b>int pcre2_set_parens_nest_limit(pcre2_compile_context *<i>ccontext</i>,</b>
<b>  uint32_t <i>value</i>);</b>
<br>
<br>
<b>int pcre2_set_compile_recursion_guard(pcre2_compile_context *<i>ccontext</i>,</b>
<b>  int (*<i>guard_function</i>)(uint32_t, void *), void *<i>user_data</i>);</b>
</P>
<br><a name="SEC5" href="#TOC1">PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS</a><br>
<P>
<b>pcre2_match_context *pcre2_match_context_create(</b>
<b>  pcre2_general_context
*<i>gcontext</i>);</b>
<br>
<br>
<b>pcre2_match_context *pcre2_match_context_copy(</b>
<b>  pcre2_match_context *<i>mcontext</i>);</b>
<br>
<br>
<b>void pcre2_match_context_free(pcre2_match_context *<i>mcontext</i>);</b>
<br>
<br>
<b>int pcre2_set_callout(pcre2_match_context *<i>mcontext</i>,</b>
<b>  int (*<i>callout_function</i>)(pcre2_callout_block *, void *),</b>
<b>  void *<i>callout_data</i>);</b>
<br>
<br>
<b>int pcre2_set_match_limit(pcre2_match_context *<i>mcontext</i>,</b>
<b>  uint32_t <i>value</i>);</b>
<br>
<br>
<b>int pcre2_set_offset_limit(pcre2_match_context *<i>mcontext</i>,</b>
<b>  PCRE2_SIZE <i>value</i>);</b>
<br>
<br>
<b>int pcre2_set_recursion_limit(pcre2_match_context *<i>mcontext</i>,</b>
<b>  uint32_t <i>value</i>);</b>
<br>
<br>
<b>int pcre2_set_recursion_memory_management(</b>
<b>  pcre2_match_context *<i>mcontext</i>,</b>
<b>  void *(*<i>private_malloc</i>)(PCRE2_SIZE, void *),</b>
<b>  void (*<i>private_free</i>)(void *, void *), void *<i>memory_data</i>);</b>
</P>
<br><a name="SEC6" href="#TOC1">PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS</a><br>
<P>
<b>int pcre2_substring_copy_byname(pcre2_match_data *<i>match_data</i>,</b>
<b>  PCRE2_SPTR <i>name</i>, PCRE2_UCHAR *<i>buffer</i>, PCRE2_SIZE *<i>bufflen</i>);</b>
<br>
<br>
<b>int pcre2_substring_copy_bynumber(pcre2_match_data *<i>match_data</i>,</b>
<b>  uint32_t <i>number</i>, PCRE2_UCHAR *<i>buffer</i>,</b>
<b>  PCRE2_SIZE *<i>bufflen</i>);</b>
<br>
<br>
<b>void pcre2_substring_free(PCRE2_UCHAR *<i>buffer</i>);</b>
<br>
<br>
<b>int pcre2_substring_get_byname(pcre2_match_data *<i>match_data</i>,</b>
<b>  PCRE2_SPTR <i>name</i>, PCRE2_UCHAR **<i>bufferptr</i>, PCRE2_SIZE *<i>bufflen</i>);</b>
<br>
<br>
<b>int pcre2_substring_get_bynumber(pcre2_match_data *<i>match_data</i>,</b>
<b>  uint32_t
<i>number</i>, PCRE2_UCHAR **<i>bufferptr</i>,</b>
<b>  PCRE2_SIZE *<i>bufflen</i>);</b>
<br>
<br>
<b>int pcre2_substring_length_byname(pcre2_match_data *<i>match_data</i>,</b>
<b>  PCRE2_SPTR <i>name</i>, PCRE2_SIZE *<i>length</i>);</b>
<br>
<br>
<b>int pcre2_substring_length_bynumber(pcre2_match_data *<i>match_data</i>,</b>
<b>  uint32_t <i>number</i>, PCRE2_SIZE *<i>length</i>);</b>
<br>
<br>
<b>int pcre2_substring_nametable_scan(const pcre2_code *<i>code</i>,</b>
<b>  PCRE2_SPTR <i>name</i>, PCRE2_SPTR *<i>first</i>, PCRE2_SPTR *<i>last</i>);</b>
<br>
<br>
<b>int pcre2_substring_number_from_name(const pcre2_code *<i>code</i>,</b>
<b>  PCRE2_SPTR <i>name</i>);</b>
<br>
<br>
<b>void pcre2_substring_list_free(PCRE2_SPTR *<i>list</i>);</b>
<br>
<br>
<b>int pcre2_substring_list_get(pcre2_match_data *<i>match_data</i>,</b>
<b>  PCRE2_UCHAR ***<i>listptr</i>, PCRE2_SIZE **<i>lengthsptr</i>);</b>
</P>
<br><a name="SEC7" href="#TOC1">PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION</a><br>
<P>
<b>int pcre2_substitute(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b>  PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
<b>  uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
<b>  pcre2_match_context *<i>mcontext</i>, PCRE2_SPTR <i>replacement</i>,</b>
<b>  PCRE2_SIZE <i>rlength</i>, PCRE2_UCHAR *<i>outputbuffer</i>,</b>
<b>  PCRE2_SIZE *<i>outlengthptr</i>);</b>
</P>
<br><a name="SEC8" href="#TOC1">PCRE2 NATIVE API JIT FUNCTIONS</a><br>
<P>
<b>int pcre2_jit_compile(pcre2_code *<i>code</i>, uint32_t <i>options</i>);</b>
<br>
<br>
<b>int pcre2_jit_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b>  PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
<b>  uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
<b>  pcre2_match_context
*<i>mcontext</i>);</b>
<br>
<br>
<b>void pcre2_jit_free_unused_memory(pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE <i>startsize</i>,</b>
<b>  PCRE2_SIZE <i>maxsize</i>, pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>void pcre2_jit_stack_assign(pcre2_match_context *<i>mcontext</i>,</b>
<b>  pcre2_jit_callback <i>callback_function</i>, void *<i>callback_data</i>);</b>
<br>
<br>
<b>void pcre2_jit_stack_free(pcre2_jit_stack *<i>jit_stack</i>);</b>
</P>
<br><a name="SEC9" href="#TOC1">PCRE2 NATIVE API SERIALIZATION FUNCTIONS</a><br>
<P>
<b>int32_t pcre2_serialize_decode(pcre2_code **<i>codes</i>,</b>
<b>  int32_t <i>number_of_codes</i>, const uint8_t *<i>bytes</i>,</b>
<b>  pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>int32_t pcre2_serialize_encode(const pcre2_code **<i>codes</i>,</b>
<b>  int32_t <i>number_of_codes</i>, uint8_t **<i>serialized_bytes</i>,</b>
<b>  PCRE2_SIZE *<i>serialized_size</i>, pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>void pcre2_serialize_free(uint8_t *<i>bytes</i>);</b>
<br>
<br>
<b>int32_t pcre2_serialize_get_number_of_codes(const uint8_t *<i>bytes</i>);</b>
</P>
<br><a name="SEC10" href="#TOC1">PCRE2 NATIVE API AUXILIARY FUNCTIONS</a><br>
<P>
<b>pcre2_code *pcre2_code_copy(const pcre2_code *<i>code</i>);</b>
<br>
<br>
<b>int pcre2_get_error_message(int <i>errorcode</i>, PCRE2_UCHAR *<i>buffer</i>,</b>
<b>  PCRE2_SIZE <i>bufflen</i>);</b>
<br>
<br>
<b>const unsigned char *pcre2_maketables(pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>int pcre2_pattern_info(const pcre2_code *<i>code</i>, uint32_t <i>what</i>, void *<i>where</i>);</b>
<br>
<br>
<b>int pcre2_callout_enumerate(const pcre2_code *<i>code</i>,</b>
<b>  int
(*<i>callback</i>)(pcre2_callout_enumerate_block *, void *),</b>
<b>  void *<i>user_data</i>);</b>
<br>
<br>
<b>int pcre2_config(uint32_t <i>what</i>, void *<i>where</i>);</b>
</P>
<br><a name="SEC11" href="#TOC1">PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES</a><br>
<P>
There are three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit code
units, respectively. However, there is just one header file, <b>pcre2.h</b>.
This contains the function prototypes and other definitions for all three
libraries. One, two, or all three can be installed simultaneously. On Unix-like
systems the libraries are called <b>libpcre2-8</b>, <b>libpcre2-16</b>, and
<b>libpcre2-32</b>, and they can also co-exist with the original PCRE libraries.
</P>
<P>
Character strings are passed to and from a PCRE2 library as a sequence of
unsigned integers in code units of the appropriate width. Every PCRE2 function
comes in three different forms, one for each library, for example:
<pre>
  <b>pcre2_compile_8()</b>
  <b>pcre2_compile_16()</b>
  <b>pcre2_compile_32()</b>
</pre>
There are also three different sets of data types:
<pre>
  <b>PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32</b>
  <b>PCRE2_SPTR8,  PCRE2_SPTR16,  PCRE2_SPTR32</b>
</pre>
The UCHAR types define unsigned code units of the appropriate widths. For
example, PCRE2_UCHAR16 is usually defined as `uint16_t'. The SPTR types are
constant pointers to the equivalent UCHAR types, that is, they are pointers to
vectors of unsigned code units.
</P>
<P>
Many applications use only one code unit width. For their convenience, macros
are defined whose names are the generic forms such as <b>pcre2_compile()</b> and
PCRE2_SPTR. These macros use the value of the macro PCRE2_CODE_UNIT_WIDTH to
generate the appropriate width-specific function and macro names.
PCRE2_CODE_UNIT_WIDTH is not defined by default.
An application must define it
to be 8, 16, or 32 before including <b>pcre2.h</b> in order to make use of the
generic names.
</P>
<P>
Applications that use more than one code unit width can be linked with more
than one PCRE2 library, but must define PCRE2_CODE_UNIT_WIDTH to be 0 before
including <b>pcre2.h</b>, and then use the real function names. Any code that is
to be included in an environment where the value of PCRE2_CODE_UNIT_WIDTH is
unknown should also use the real function names. (Unfortunately, it is not
possible in C code to save and restore the value of a macro.)
</P>
<P>
If PCRE2_CODE_UNIT_WIDTH is not defined before including <b>pcre2.h</b>, a
compiler error occurs.
</P>
<P>
When using multiple libraries in an application, you must take care when
processing any particular pattern to use only functions from a single library.
For example, if you want to run a match using a pattern that was compiled with
<b>pcre2_compile_16()</b>, you must do so with <b>pcre2_match_16()</b>, not
<b>pcre2_match_8()</b>.
</P>
<P>
In the function summaries above, and in the rest of this document and other
PCRE2 documents, functions and data types are described using their generic
names, without the 8, 16, or 32 suffix.
</P>
<br><a name="SEC12" href="#TOC1">PCRE2 API OVERVIEW</a><br>
<P>
PCRE2 has its own native API, which is described in this document. There are
also some wrapper functions for the 8-bit library that correspond to the
POSIX regular expression API, but they do not give access to all the
functionality. They are described in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
documentation. Both these APIs define a set of C function calls.
</P>
<P>
The native API C data types, function prototypes, option values, and error
codes are defined in the header file <b>pcre2.h</b>, which contains definitions
of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release numbers for the
library. Applications can use these to include support for different releases
of PCRE2.
</P>
<P>
In a Windows environment, if you want to statically link an application program
against a non-dll PCRE2 library, you must define PCRE2_STATIC before including
<b>pcre2.h</b>.
</P>
<P>
The functions <b>pcre2_compile()</b> and <b>pcre2_match()</b> are used for
compiling and matching regular expressions in a Perl-compatible manner. A
sample program that demonstrates the simplest way of using them is provided in
the file called <i>pcre2demo.c</i> in the PCRE2 source distribution. A listing
of this program is given in the
<a href="pcre2demo.html"><b>pcre2demo</b></a>
documentation, and the
<a href="pcre2sample.html"><b>pcre2sample</b></a>
documentation describes how to compile and run it.
</P>
<P>
Just-in-time compiler support is an optional feature of PCRE2 that can be built
in appropriate hardware environments. It greatly speeds up the matching
performance of many patterns. Programs can request that it be used if
available, by calling <b>pcre2_jit_compile()</b> after a pattern has been
successfully compiled by <b>pcre2_compile()</b>. This does nothing if JIT
support is not available.
</P>
<P>
More complicated programs might need to make use of the specialist functions
<b>pcre2_jit_stack_create()</b>, <b>pcre2_jit_stack_free()</b>, and
<b>pcre2_jit_stack_assign()</b> in order to control the JIT code's memory usage.
</P>
<P>
JIT matching is automatically used by <b>pcre2_match()</b> if it is available,
unless the PCRE2_NO_JIT option is set.
There is also a direct interface for JIT
matching, which gives improved performance. The JIT-specific functions are
discussed in the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation.
</P>
<P>
A second matching function, <b>pcre2_dfa_match()</b>, which is not
Perl-compatible, is also provided. This uses a different algorithm for the
matching. The alternative algorithm finds all possible matches (at a given
point in the subject), and scans the subject just once (unless there are
lookbehind assertions). However, this algorithm does not return captured
substrings. A description of the two matching algorithms and their advantages
and disadvantages is given in the
<a href="pcre2matching.html"><b>pcre2matching</b></a>
documentation. There is no JIT support for <b>pcre2_dfa_match()</b>.
</P>
<P>
In addition to the main compiling and matching functions, there are convenience
functions for extracting captured substrings from a subject string that has
been matched by <b>pcre2_match()</b>. They are:
<pre>
  <b>pcre2_substring_copy_byname()</b>
  <b>pcre2_substring_copy_bynumber()</b>
  <b>pcre2_substring_get_byname()</b>
  <b>pcre2_substring_get_bynumber()</b>
  <b>pcre2_substring_list_get()</b>
  <b>pcre2_substring_length_byname()</b>
  <b>pcre2_substring_length_bynumber()</b>
  <b>pcre2_substring_nametable_scan()</b>
  <b>pcre2_substring_number_from_name()</b>
</pre>
<b>pcre2_substring_free()</b> and <b>pcre2_substring_list_free()</b> are also
provided, to free the memory used for extracted strings.
</P>
<P>
The function <b>pcre2_substitute()</b> can be called to match a pattern and
return a copy of the subject string with substitutions for parts that were
matched.
</P>
<P>
Functions whose names begin with <b>pcre2_serialize_</b> are used for saving
compiled patterns on disc or elsewhere, and reloading them later.
</P>
<P>
Finally, there are functions for finding out information about a compiled
pattern (<b>pcre2_pattern_info()</b>) and about the configuration with which
PCRE2 was built (<b>pcre2_config()</b>).
</P>
<P>
Functions with names ending with <b>_free()</b> are used for freeing memory
blocks of various sorts. In all cases, if one of these functions is called with
a NULL argument, it does nothing.
</P>
<br><a name="SEC13" href="#TOC1">STRING LENGTHS AND OFFSETS</a><br>
<P>
The PCRE2 API uses string lengths and offsets into strings of code units in
several places. These values are always of type PCRE2_SIZE, which is an
unsigned integer type, currently always defined as <i>size_t</i>. The largest
value that can be stored in such a type (that is ~(PCRE2_SIZE)0) is reserved
as a special indicator for zero-terminated strings and unset offsets.
Therefore, the longest string that can be handled is one less than this
maximum.
<a name="newlines"></a></P>
<br><a name="SEC14" href="#TOC1">NEWLINES</a><br>
<P>
PCRE2 supports five different conventions for indicating line breaks in
strings: a single CR (carriage return) character, a single LF (linefeed)
character, the two-character sequence CRLF, any of the three preceding, or any
Unicode newline sequence. The Unicode newline sequences are the three just
mentioned, plus the single characters VT (vertical tab, U+000B), FF (form feed,
U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
(paragraph separator, U+2029).
</P>
<P>
Each of the first three conventions is used by at least one operating system as
its standard newline sequence. When PCRE2 is built, a default can be specified.
If none is specified, the default is LF, which is the Unix standard.
However, the newline
convention can be changed by an application when calling <b>pcre2_compile()</b>,
or it can be specified by special text at the start of the pattern itself; this
overrides any other settings. See the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page for details of the special character sequences.
</P>
<P>
In the PCRE2 documentation the word "newline" is used to mean "the character or
pair of characters that indicate a line break". The choice of newline
convention affects the handling of the dot, circumflex, and dollar
metacharacters, the handling of #-comments in /x mode, and, when CRLF is a
recognized line ending sequence, the match position advancement for a
non-anchored pattern. There is more detail about this in the
<a href="#matchoptions">section on <b>pcre2_match()</b> options</a>
below.
</P>
<P>
The choice of newline convention does not affect the interpretation of
the \n or \r escape sequences, nor does it affect what \R matches; this has
its own separate convention.
</P>
<br><a name="SEC15" href="#TOC1">MULTITHREADING</a><br>
<P>
In a multithreaded application it is important to keep thread-specific data
separate from data that can be shared between threads. The PCRE2 library code
itself is thread-safe: it contains no static or global variables. The API is
designed to be fairly simple for non-threaded applications while at the same
time ensuring that multithreaded applications can use it.
</P>
<P>
There are several different blocks of data that are used to pass information
between the application and the PCRE2 libraries.
</P>
<br><b>
The compiled pattern
</b><br>
<P>
A pointer to the compiled form of a pattern is returned to the user when
<b>pcre2_compile()</b> is successful. The data in the compiled pattern is fixed,
and does not change when the pattern is matched.
Therefore, it is thread-safe,
that is, the same compiled pattern can be used by more than one thread
simultaneously. For example, an application can compile all its patterns at the
start, before forking off multiple threads that use them. However, if the
just-in-time optimization feature is being used, it needs separate memory stack
areas for each thread. See the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation for more details.
</P>
<P>
In a more complicated situation, where patterns are compiled only when they are
first needed, but are still shared between threads, pointers to compiled
patterns must be protected from simultaneous writing by multiple threads, at
least until a pattern has been compiled. The pointer must be tested again once
the write lock is held, because another thread may have compiled the pattern in
the meantime. The logic can be something like this:
<pre>
  Get a read-only (shared) lock (mutex) for pointer
  if (pointer == NULL)
    {
    Get a write (unique) lock for pointer
    if (pointer == NULL) pointer = pcre2_compile(...
    }
  Release the lock
  Use pointer in pcre2_match()
</pre>
Of course, testing for compilation errors should also be included in the code.
</P>
<P>
If JIT is being used, but the JIT compilation is not being done immediately
(perhaps waiting to see if the pattern is used often enough), similar logic is
required. JIT compilation updates a pointer within the compiled code block, so
a thread must gain unique write access to the pointer before calling
<b>pcre2_jit_compile()</b>. Alternatively, <b>pcre2_code_copy()</b> can be used
to obtain a private copy of the compiled code.
</P>
<br><b>
Context blocks
</b><br>
<P>
The next main section below introduces the idea of "contexts" in which PCRE2
functions are called. A context is nothing more than a collection of parameters
that control the way PCRE2 operates.
Grouping a number of parameters together
in a context is a convenient way of passing them to a PCRE2 function without
using lots of arguments. The parameters that are stored in contexts are in some
sense "advanced features" of the API. Many straightforward applications will
not need to use contexts.
</P>
<P>
In a multithreaded application, if the parameters in a context are values that
are never changed, the same context can be used by all the threads. However, if
any thread needs to change any value in a context, it must make its own
thread-specific copy.
</P>
<br><b>
Match blocks
</b><br>
<P>
The matching functions need a block of memory for working space and for storing
the results of a match. This includes details of what was matched, as well as
additional information such as the name of a (*MARK) setting. Each thread must
provide its own copy of this memory.
</P>
<br><a name="SEC16" href="#TOC1">PCRE2 CONTEXTS</a><br>
<P>
Some PCRE2 functions have a lot of parameters, many of which are used only by
specialist applications, for example, those that use custom memory management
or non-standard character tables. To keep function argument lists at a
reasonable size, and at the same time to keep the API extensible, "uncommon"
parameters are passed to certain functions in a <b>context</b> instead of
directly. A context is just a block of memory that holds the parameter values.
Applications that do not need to adjust any of the context parameters can pass
NULL when a context pointer is required.
</P>
<P>
There are three different types of context: a general context that is relevant
for several PCRE2 operations, a compile-time context, and a match-time context.
</P>
<br><b>
The general context
</b><br>
<P>
At present, this context just contains pointers to (and data for) external
memory management functions that are called from several places in the PCRE2
library. The context is named `general' rather than specifically `memory'
because in future other fields may be added. If you do not want to supply your
own custom memory management functions, you do not need to bother with a
general context. A general context is created by:
<br>
<br>
<b>pcre2_general_context *pcre2_general_context_create(</b>
<b>  void *(*<i>private_malloc</i>)(PCRE2_SIZE, void *),</b>
<b>  void (*<i>private_free</i>)(void *, void *), void *<i>memory_data</i>);</b>
<br>
<br>
The two function pointers specify custom memory management functions, whose
prototypes are:
<pre>
  <b>void *private_malloc(PCRE2_SIZE, void *);</b>
  <b>void  private_free(void *, void *);</b>
</pre>
Whenever code in PCRE2 calls these functions, the final argument is the value
of <i>memory_data</i>. Either of the first two arguments of the creation
function may be NULL, in which case the system memory management functions
<i>malloc()</i> and <i>free()</i> are used. (This is not currently useful, as
there are no other fields in a general context, but in future there might be.)
The <i>private_malloc()</i> function is used (if supplied) to obtain memory for
storing the context, and all three values are saved as part of the context.
</P>
<P>
Whenever PCRE2 creates a data block of any kind, the block contains a pointer
to the <i>free()</i> function that matches the <i>malloc()</i> function that was
used. When the time comes to free the block, this function is called.
</P>
<P>
A general context can be copied by calling:
<br>
<br>
<b>pcre2_general_context *pcre2_general_context_copy(</b>
<b>  pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
The memory used for a general context should be freed by calling:
<br>
<br>
<b>void pcre2_general_context_free(pcre2_general_context *<i>gcontext</i>);</b>
<a name="compilecontext"></a></P>
<br><b>
The compile context
</b><br>
<P>
A compile context is required if you want to change the default values of any
of the following compile-time parameters:
<pre>
  What \R matches (Unicode newlines or CR, LF, CRLF only)
  PCRE2's character tables
  The newline character sequence
  The compile time nested parentheses limit
  The maximum length of the pattern string
  An external function for stack checking
</pre>
A compile context is also required if you are using custom memory management.
If none of these apply, just pass NULL as the context argument of
<i>pcre2_compile()</i>.
</P>
<P>
A compile context is created, copied, and freed by the following functions:
<br>
<br>
<b>pcre2_compile_context *pcre2_compile_context_create(</b>
<b>  pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>pcre2_compile_context *pcre2_compile_context_copy(</b>
<b>  pcre2_compile_context *<i>ccontext</i>);</b>
<br>
<br>
<b>void pcre2_compile_context_free(pcre2_compile_context *<i>ccontext</i>);</b>
<br>
<br>
A compile context is created with default values for its parameters. These can
be changed by calling the following functions, which return 0 on success, or
PCRE2_ERROR_BADDATA if invalid data is detected.
692 <b>int pcre2_set_bsr(pcre2_compile_context *<i>ccontext</i>,</b> 693 <b> uint32_t <i>value</i>);</b> 694 <br> 695 <br> 696 The value must be PCRE2_BSR_ANYCRLF, to specify that \R matches only CR, LF, 697 or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any Unicode line 698 ending sequence. The value is used by the JIT compiler and by the two 699 interpreted matching functions, <i>pcre2_match()</i> and 700 <i>pcre2_dfa_match()</i>. 701 <b>int pcre2_set_character_tables(pcre2_compile_context *<i>ccontext</i>,</b> 702 <b> const unsigned char *<i>tables</i>);</b> 703 <br> 704 <br> 705 The value must be the result of a call to <i>pcre2_maketables()</i>, whose only 706 argument is a general context. This function builds a set of character tables 707 in the current locale. 708 <b>int pcre2_set_max_pattern_length(pcre2_compile_context *<i>ccontext</i>,</b> 709 <b> PCRE2_SIZE <i>value</i>);</b> 710 <br> 711 <br> 712 This sets a maximum length, in code units, for the pattern string that is to be 713 compiled. If the pattern is longer, an error is generated. This facility is 714 provided so that applications that accept patterns from external sources can 715 limit their size. The default is the largest number that a PCRE2_SIZE variable 716 can hold, which is effectively unlimited. 717 <b>int pcre2_set_newline(pcre2_compile_context *<i>ccontext</i>,</b> 718 <b> uint32_t <i>value</i>);</b> 719 <br> 720 <br> 721 This specifies which characters or character sequences are to be recognized as 722 newlines. The value must be one of PCRE2_NEWLINE_CR (carriage return only), 723 PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the two-character 724 sequence CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any of the above), or 725 PCRE2_NEWLINE_ANY (any Unicode newline sequence). 
726 </P> 727 <P> 728 When a pattern is compiled with the PCRE2_EXTENDED option, the value of this 729 parameter affects the recognition of white space and the end of internal 730 comments starting with #. The value is saved with the compiled pattern for 731 subsequent use by the JIT compiler and by the two interpreted matching 732 functions, <i>pcre2_match()</i> and <i>pcre2_dfa_match()</i>. 733 <b>int pcre2_set_parens_nest_limit(pcre2_compile_context *<i>ccontext</i>,</b> 734 <b> uint32_t <i>value</i>);</b> 735 <br> 736 <br> 737 This parameter adjusts the limit, set when PCRE2 is built (default 250), on the 738 depth of parenthesis nesting in a pattern. This limit stops rogue patterns 739 using up too much system stack when being compiled. 740 <b>int pcre2_set_compile_recursion_guard(pcre2_compile_context *<i>ccontext</i>,</b> 741 <b> int (*<i>guard_function</i>)(uint32_t, void *), void *<i>user_data</i>);</b> 742 <br> 743 <br> 744 There is at least one application that runs PCRE2 in threads with very limited 745 system stack, where running out of stack is to be avoided at all costs. The 746 parenthesis limit above cannot take account of how much stack is actually 747 available. For finer control, you can supply a function that is called 748 whenever <b>pcre2_compile()</b> starts to compile a parenthesized part of a 749 pattern. This function can check the actual stack size (or anything else that 750 it wants to, of course). 751 </P> 752 <P> 753 The first argument to the guard function gives the current depth of 754 nesting, and the second is user data that is set up by the last argument of 755 <b>pcre2_set_compile_recursion_guard()</b>. The guard function should return 756 zero if all is well, or non-zero to force an error.
757 <a name="matchcontext"></a></P> 758 <br><b> 759 The match context 760 </b><br> 761 <P> 762 A match context is required if you want to change the default values of any 763 of the following match-time parameters: 764 <pre> 765 A callout function 766 The offset limit for matching an unanchored pattern 767 The limit for calling <b>match()</b> (see below) 768 The limit for calling <b>match()</b> recursively 769 </pre> 770 A match context is also required if you are using custom memory management. 771 If none of these apply, just pass NULL as the context argument of 772 <b>pcre2_match()</b>, <b>pcre2_dfa_match()</b>, or <b>pcre2_jit_match()</b>. 773 </P> 774 <P> 775 A match context is created, copied, and freed by the following functions: 776 <b>pcre2_match_context *pcre2_match_context_create(</b> 777 <b> pcre2_general_context *<i>gcontext</i>);</b> 778 <br> 779 <br> 780 <b>pcre2_match_context *pcre2_match_context_copy(</b> 781 <b> pcre2_match_context *<i>mcontext</i>);</b> 782 <br> 783 <br> 784 <b>void pcre2_match_context_free(pcre2_match_context *<i>mcontext</i>);</b> 785 <br> 786 <br> 787 A match context is created with default values for its parameters. These can 788 be changed by calling the following functions, which return 0 on success, or 789 PCRE2_ERROR_BADDATA if invalid data is detected. 790 <b>int pcre2_set_callout(pcre2_match_context *<i>mcontext</i>,</b> 791 <b> int (*<i>callout_function</i>)(pcre2_callout_block *, void *),</b> 792 <b> void *<i>callout_data</i>);</b> 793 <br> 794 <br> 795 This sets up a "callout" function, which PCRE2 will call at specified points 796 during a matching operation. Details are given in the 797 <a href="pcre2callout.html"><b>pcre2callout</b></a> 798 documentation. 799 <b>int pcre2_set_offset_limit(pcre2_match_context *<i>mcontext</i>,</b> 800 <b> PCRE2_SIZE <i>value</i>);</b> 801 <br> 802 <br> 803 The <i>offset_limit</i> parameter limits how far an unanchored search can 804 advance in the subject string. 
The default value is PCRE2_UNSET. The 805 <b>pcre2_match()</b> and <b>pcre2_dfa_match()</b> functions return 806 PCRE2_ERROR_NOMATCH if a match with a starting point before or at the given 807 offset is not found. For example, if the pattern /abc/ is matched against 808 "123abc" with an offset limit less than 3, the result is PCRE2_ERROR_NOMATCH. 809 A match can never be found if the <i>startoffset</i> argument of 810 <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b> is greater than the offset 811 limit. 812 </P> 813 <P> 814 When using this facility, you must set PCRE2_USE_OFFSET_LIMIT when calling 815 <b>pcre2_compile()</b> so that when JIT is in use, different code can be 816 compiled. If a match is started with a non-default offset limit when 817 PCRE2_USE_OFFSET_LIMIT is not set, an error is generated. 818 </P> 819 <P> 820 The offset limit facility can be used to track progress when searching large 821 subject strings. See also the PCRE2_FIRSTLINE option, which requires a match to 822 start within the first line of the subject. If this is set with an offset 823 limit, a match must occur in the first line and also within the offset limit. 824 In other words, whichever limit comes first is used. 825 <b>int pcre2_set_match_limit(pcre2_match_context *<i>mcontext</i>,</b> 826 <b> uint32_t <i>value</i>);</b> 827 <br> 828 <br> 829 The <i>match_limit</i> parameter provides a means of preventing PCRE2 from using 830 up too many resources when processing patterns that are not going to match, but 831 which have a very large number of possibilities in their search trees. The 832 classic example is a pattern that uses nested unlimited repeats. 833 </P> 834 <P> 835 Internally, <b>pcre2_match()</b> uses a function called <b>match()</b>, which it 836 calls repeatedly (sometimes recursively).
The limit set by <i>match_limit</i> is 837 imposed on the number of times this function is called during a match, which 838 has the effect of limiting the amount of backtracking that can take place. For 839 patterns that are not anchored, the count restarts from zero for each position 840 in the subject string. This limit is not relevant to <b>pcre2_dfa_match()</b>, 841 which ignores it. 842 </P> 843 <P> 844 When <b>pcre2_match()</b> is called with a pattern that was successfully 845 processed by <b>pcre2_jit_compile()</b>, the way in which matching is executed 846 is entirely different. However, there is still the possibility of runaway 847 matching that goes on for a very long time, and so the <i>match_limit</i> value 848 is also used in this case (but in a different way) to limit how long the 849 matching can continue. 850 </P> 851 <P> 852 The default value for the limit can be set when PCRE2 is built; the default 853 default is 10 million, which handles all but the most extreme cases. If the 854 limit is exceeded, <b>pcre2_match()</b> returns PCRE2_ERROR_MATCHLIMIT. A value 855 for the match limit may also be supplied by an item at the start of a pattern 856 of the form 857 <pre> 858 (*LIMIT_MATCH=ddd) 859 </pre> 860 where ddd is a decimal number. However, such a setting is ignored unless ddd is 861 less than the limit set by the caller of <b>pcre2_match()</b> or, if no such 862 limit is set, less than the default. 863 <b>int pcre2_set_recursion_limit(pcre2_match_context *<i>mcontext</i>,</b> 864 <b> uint32_t <i>value</i>);</b> 865 <br> 866 <br> 867 The <i>recursion_limit</i> parameter is similar to <i>match_limit</i>, but 868 instead of limiting the total number of times that <b>match()</b> is called, it 869 limits the depth of recursion. The recursion depth is a smaller number than the 870 total number of calls, because not all calls to <b>match()</b> are recursive. 871 This limit is of use only if it is set smaller than <i>match_limit</i>. 
872 </P> 873 <P> 874 Limiting the recursion depth limits the amount of system stack that can be 875 used, or, when PCRE2 has been compiled to use memory on the heap instead of the 876 stack, the amount of heap memory that can be used. This limit is not relevant, 877 and is ignored, when matching is done using JIT compiled code or by the 878 <b>pcre2_dfa_match()</b> function. 879 </P> 880 <P> 881 The default value for <i>recursion_limit</i> can be set when PCRE2 is built; the 882 default default is the same value as the default for <i>match_limit</i>. If the 883 limit is exceeded, <b>pcre2_match()</b> returns PCRE2_ERROR_RECURSIONLIMIT. A 884 value for the recursion limit may also be supplied by an item at the start of a 885 pattern of the form 886 <pre> 887 (*LIMIT_RECURSION=ddd) 888 </pre> 889 where ddd is a decimal number. However, such a setting is ignored unless ddd is 890 less than the limit set by the caller of <b>pcre2_match()</b> or, if no such 891 limit is set, less than the default. 892 <b>int pcre2_set_recursion_memory_management(</b> 893 <b> pcre2_match_context *<i>mcontext</i>,</b> 894 <b> void *(*<i>private_malloc</i>)(PCRE2_SIZE, void *),</b> 895 <b> void (*<i>private_free</i>)(void *, void *), void *<i>memory_data</i>);</b> 896 <br> 897 <br> 898 This function sets up two additional custom memory management functions for use 899 by <b>pcre2_match()</b> when PCRE2 is compiled to use the heap for remembering 900 backtracking data, instead of recursive function calls that use the system 901 stack. There is a discussion about PCRE2's stack usage in the 902 <a href="pcre2stack.html"><b>pcre2stack</b></a> 903 documentation. See the 904 <a href="pcre2build.html"><b>pcre2build</b></a> 905 documentation for details of how to build PCRE2. 906 </P> 907 <P> 908 Using the heap for recursion is a non-standard way of building PCRE2, for use 909 in environments that have limited stacks. 
Because of the greater use of memory 910 management, <b>pcre2_match()</b> runs more slowly. Functions that are different 911 to the general custom memory functions are provided so that special-purpose 912 external code can be used for this case, because the memory blocks are all the 913 same size. The blocks are retained by <b>pcre2_match()</b> until it is about to 914 exit so that they can be re-used when possible during the match. In the absence 915 of these functions, the normal custom memory management functions are used, if 916 supplied, otherwise the system functions. 917 </P> 918 <br><a name="SEC17" href="#TOC1">CHECKING BUILD-TIME OPTIONS</a><br> 919 <P> 920 <b>int pcre2_config(uint32_t <i>what</i>, void *<i>where</i>);</b> 921 </P> 922 <P> 923 The function <b>pcre2_config()</b> makes it possible for a PCRE2 client to 924 discover which optional features have been compiled into the PCRE2 library. The 925 <a href="pcre2build.html"><b>pcre2build</b></a> 926 documentation has more details about these optional features. 927 </P> 928 <P> 929 The first argument for <b>pcre2_config()</b> specifies which information is 930 required. The second argument is a pointer to memory into which the information 931 is placed. If NULL is passed, the function returns the amount of memory that is 932 needed for the requested information. For calls that return numerical values, 933 the value is in bytes; when requesting these values, <i>where</i> should point 934 to appropriately aligned memory. For calls that return strings, the required 935 length is given in code units, not counting the terminating zero. 936 </P> 937 <P> 938 When requesting information, the returned value from <b>pcre2_config()</b> is 939 non-negative on success, or the negative error code PCRE2_ERROR_BADOPTION if 940 the value in the first argument is not recognized. 
The following information is 941 available: 942 <pre> 943 PCRE2_CONFIG_BSR 944 </pre> 945 The output is a uint32_t integer whose value indicates what character 946 sequences the \R escape sequence matches by default. A value of 947 PCRE2_BSR_UNICODE means that \R matches any Unicode line ending sequence; a 948 value of PCRE2_BSR_ANYCRLF means that \R matches only CR, LF, or CRLF. The 949 default can be overridden when a pattern is compiled. 950 <pre> 951 PCRE2_CONFIG_JIT 952 </pre> 953 The output is a uint32_t integer that is set to one if support for just-in-time 954 compiling is available; otherwise it is set to zero. 955 <pre> 956 PCRE2_CONFIG_JITTARGET 957 </pre> 958 The <i>where</i> argument should point to a buffer that is at least 48 code 959 units long. (The exact length required can be found by calling 960 <b>pcre2_config()</b> with <b>where</b> set to NULL.) The buffer is filled with a 961 string that contains the name of the architecture for which the JIT compiler is 962 configured, for example "x86 32bit (little endian + unaligned)". If JIT support 963 is not available, PCRE2_ERROR_BADOPTION is returned, otherwise the number of 964 code units used is returned. This is the length of the string, plus one unit 965 for the terminating zero. 966 <pre> 967 PCRE2_CONFIG_LINKSIZE 968 </pre> 969 The output is a uint32_t integer that contains the number of bytes used for 970 internal linkage in compiled regular expressions. When PCRE2 is configured, the 971 value can be set to 2, 3, or 4, with the default being 2. This is the value 972 that is returned by <b>pcre2_config()</b>. However, when the 16-bit library is 973 compiled, a value of 3 is rounded up to 4, and when the 32-bit library is 974 compiled, internal linkages always use 4 bytes, so the configured value is not 975 relevant. 
976 </P> 977 <P> 978 The default value of 2 for the 8-bit and 16-bit libraries is sufficient for all 979 but the most massive patterns, since it allows the size of the compiled pattern 980 to be up to 64K code units. Larger values allow larger regular expressions to 981 be compiled by those two libraries, but at the expense of slower matching. 982 <pre> 983 PCRE2_CONFIG_MATCHLIMIT 984 </pre> 985 The output is a uint32_t integer that gives the default limit for the number of 986 internal matching function calls in a <b>pcre2_match()</b> execution. Further 987 details are given with <b>pcre2_match()</b> below. 988 <pre> 989 PCRE2_CONFIG_NEWLINE 990 </pre> 991 The output is a uint32_t integer whose value specifies the default character 992 sequence that is recognized as meaning "newline". The values are: 993 <pre> 994 PCRE2_NEWLINE_CR Carriage return (CR) 995 PCRE2_NEWLINE_LF Linefeed (LF) 996 PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF) 997 PCRE2_NEWLINE_ANY Any Unicode line ending 998 PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF 999 </pre> 1000 The default should normally correspond to the standard sequence for your 1001 operating system. 1002 <pre> 1003 PCRE2_CONFIG_PARENSLIMIT 1004 </pre> 1005 The output is a uint32_t integer that gives the maximum depth of nesting 1006 of parentheses (of any kind) in a pattern. This limit is imposed to cap the 1007 amount of system stack used when a pattern is compiled. It is specified when 1008 PCRE2 is built; the default is 250. This limit does not take into account the 1009 stack that may already be used by the calling application. For finer control 1010 over compilation stack usage, see <b>pcre2_set_compile_recursion_guard()</b>. 1011 <pre> 1012 PCRE2_CONFIG_RECURSIONLIMIT 1013 </pre> 1014 The output is a uint32_t integer that gives the default limit for the depth of 1015 recursion when calling the internal matching function in a <b>pcre2_match()</b> 1016 execution. 
Further details are given with <b>pcre2_match()</b> below. 1017 <pre> 1018 PCRE2_CONFIG_STACKRECURSE 1019 </pre> 1020 The output is a uint32_t integer that is set to one if internal recursion when 1021 running <b>pcre2_match()</b> is implemented by recursive function calls that use 1022 the system stack to remember their state. This is the usual way that PCRE2 is 1023 compiled. The output is zero if PCRE2 was compiled to use blocks of data on the 1024 heap instead of recursive function calls. 1025 <pre> 1026 PCRE2_CONFIG_UNICODE_VERSION 1027 </pre> 1028 The <i>where</i> argument should point to a buffer that is at least 24 code 1029 units long. (The exact length required can be found by calling 1030 <b>pcre2_config()</b> with <b>where</b> set to NULL.) If PCRE2 has been compiled 1031 without Unicode support, the buffer is filled with the text "Unicode not 1032 supported". Otherwise, the Unicode version string (for example, "8.0.0") is 1033 inserted. The number of code units used is returned. This is the length of the 1034 string plus one unit for the terminating zero. 1035 <pre> 1036 PCRE2_CONFIG_UNICODE 1037 </pre> 1038 The output is a uint32_t integer that is set to one if Unicode support is 1039 available; otherwise it is set to zero. Unicode support implies UTF support. 1040 <pre> 1041 PCRE2_CONFIG_VERSION 1042 </pre> 1043 The <i>where</i> argument should point to a buffer that is at least 12 code 1044 units long. (The exact length required can be found by calling 1045 <b>pcre2_config()</b> with <b>where</b> set to NULL.) The buffer is filled with 1046 the PCRE2 version string, zero-terminated. The number of code units used is 1047 returned. This is the length of the string plus one unit for the terminating 1048 zero. 
1049 <a name="compiling"></a></P> 1050 <br><a name="SEC18" href="#TOC1">COMPILING A PATTERN</a><br> 1051 <P> 1052 <b>pcre2_code *pcre2_compile(PCRE2_SPTR <i>pattern</i>, PCRE2_SIZE <i>length</i>,</b> 1053 <b> uint32_t <i>options</i>, int *<i>errorcode</i>, PCRE2_SIZE *<i>erroroffset,</i></b> 1054 <b> pcre2_compile_context *<i>ccontext</i>);</b> 1055 <br> 1056 <br> 1057 <b>void pcre2_code_free(pcre2_code *<i>code</i>);</b> 1058 <br> 1059 <br> 1060 <b>pcre2_code *pcre2_code_copy(const pcre2_code *<i>code</i>);</b> 1061 </P> 1062 <P> 1063 The <b>pcre2_compile()</b> function compiles a pattern into an internal form. 1064 The pattern is defined by a pointer to a string of code units and a length. If 1065 the pattern is zero-terminated, the length can be specified as 1066 PCRE2_ZERO_TERMINATED. The function returns a pointer to a block of memory that 1067 contains the compiled pattern and related data, or NULL if an error occurred. 1068 </P> 1069 <P> 1070 If the compile context argument <i>ccontext</i> is NULL, memory for the compiled 1071 pattern is obtained by calling <b>malloc()</b>. Otherwise, it is obtained from 1072 the same memory function that was used for the compile context. The caller must 1073 free the memory by calling <b>pcre2_code_free()</b> when it is no longer needed. 1074 </P> 1075 <P> 1076 The function <b>pcre2_code_copy()</b> makes a copy of the compiled code in new 1077 memory, using the same memory allocator as was used for the original. However, 1078 if the code has been processed by the JIT compiler (see 1079 <a href="#jitcompiling">below),</a> 1080 the JIT information cannot be copied (because it is position-dependent). 1081 The new copy can initially be used only for non-JIT matching, though it can be 1082 passed to <b>pcre2_jit_compile()</b> if required. The <b>pcre2_code_copy()</b> 1083 function provides a way for individual threads in a multithreaded application 1084 to acquire a private copy of shared compiled code. 
1085 </P> 1086 <P> 1087 NOTE: When one of the matching functions is called, pointers to the compiled 1088 pattern and the subject string are set in the match data block so that they can 1089 be referenced by the substring extraction functions. After running a match, you 1090 must not free a compiled pattern (or a subject string) until after all 1091 operations on the 1092 <a href="#matchdatablock">match data block</a> 1093 have taken place. 1094 </P> 1095 <P> 1096 The <i>options</i> argument for <b>pcre2_compile()</b> contains various bit 1097 settings that affect the compilation. It should be zero if no options are 1098 required. The available options are described below. Some of them (in 1099 particular, those that are compatible with Perl, but some others as well) can 1100 also be set and unset from within the pattern (see the detailed description in 1101 the 1102 <a href="pcre2pattern.html"><b>pcre2pattern</b></a> 1103 documentation). 1104 </P> 1105 <P> 1106 For those options that can be different in different parts of the pattern, the 1107 contents of the <i>options</i> argument specifies their settings at the start of 1108 compilation. The PCRE2_ANCHORED and PCRE2_NO_UTF_CHECK options can be set at 1109 the time of matching as well as at compile time. 1110 </P> 1111 <P> 1112 Other, less frequently required compile-time parameters (for example, the 1113 newline setting) can be provided in a compile context (as described 1114 <a href="#compilecontext">above).</a> 1115 </P> 1116 <P> 1117 If <i>errorcode</i> or <i>erroroffset</i> is NULL, <b>pcre2_compile()</b> returns 1118 NULL immediately. Otherwise, the variables to which these point are set to an 1119 error code and an offset (number of code units) within the pattern, 1120 respectively, when <b>pcre2_compile()</b> returns NULL because a compilation 1121 error has occurred. The values are not defined when compilation is successful 1122 and <b>pcre2_compile()</b> returns a non-NULL value. 
1123 </P> 1124 <P> 1125 The <b>pcre2_get_error_message()</b> function (see "Obtaining a textual error 1126 message" 1127 <a href="#geterrormessage">below)</a> 1128 provides a textual message for each error code. Compilation errors have 1129 positive error codes; UTF formatting error codes are negative. For an invalid 1130 UTF-8 or UTF-16 string, the offset is that of the first code unit of the 1131 failing character. 1132 </P> 1133 <P> 1134 Some errors are not detected until the whole pattern has been scanned; in these 1135 cases, the offset passed back is the length of the pattern. Note that the 1136 offset is in code units, not characters, even in a UTF mode. It may sometimes 1137 point into the middle of a UTF-8 or UTF-16 character. 1138 </P> 1139 <P> 1140 This code fragment shows a typical straightforward call to 1141 <b>pcre2_compile()</b>: 1142 <pre> 1143 pcre2_code *re; 1144 PCRE2_SIZE erroffset; 1145 int errorcode; 1146 re = pcre2_compile( 1147 "^A.*Z", /* the pattern */ 1148 PCRE2_ZERO_TERMINATED, /* the pattern is zero-terminated */ 1149 0, /* default options */ 1150 &errorcode, /* for error code */ 1151 &erroffset, /* for error offset */ 1152 NULL); /* no compile context */ 1153 </pre> 1154 The following names for option bits are defined in the <b>pcre2.h</b> header 1155 file: 1156 <pre> 1157 PCRE2_ANCHORED 1158 </pre> 1159 If this bit is set, the pattern is forced to be "anchored", that is, it is 1160 constrained to match only at the first matching point in the string that is 1161 being searched (the "subject string"). This effect can also be achieved by 1162 appropriate constructs in the pattern itself, which is the only way to do it in 1163 Perl. 1164 <pre> 1165 PCRE2_ALLOW_EMPTY_CLASS 1166 </pre> 1167 By default, for compatibility with Perl, a closing square bracket that 1168 immediately follows an opening one is treated as a data character for the 1169 class. 
When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the class, which 1170 therefore contains no characters and so can never match. 1171 <pre> 1172 PCRE2_ALT_BSUX 1173 </pre> 1174 This option requests alternative handling of three escape sequences, which 1175 makes PCRE2's behaviour more like ECMAscript (aka JavaScript). When it is set: 1176 </P> 1177 <P> 1178 (1) \U matches an upper case "U" character; by default \U causes a compile 1179 time error (Perl uses \U to upper case subsequent characters). 1180 </P> 1181 <P> 1182 (2) \u matches a lower case "u" character unless it is followed by four 1183 hexadecimal digits, in which case the hexadecimal number defines the code point 1184 to match. By default, \u causes a compile time error (Perl uses it to upper 1185 case the following character). 1186 </P> 1187 <P> 1188 (3) \x matches a lower case "x" character unless it is followed by two 1189 hexadecimal digits, in which case the hexadecimal number defines the code point 1190 to match. By default, as in Perl, a hexadecimal number is always expected after 1191 \x, but it may have zero, one, or two digits (so, for example, \xz matches a 1192 binary zero character followed by z). 1193 <pre> 1194 PCRE2_ALT_CIRCUMFLEX 1195 </pre> 1196 In multiline mode (when PCRE2_MULTILINE is set), the circumflex metacharacter 1197 matches at the start of the subject (unless PCRE2_NOTBOL is set), and also 1198 after any internal newline. However, it does not match after a newline at the 1199 end of the subject, for compatibility with Perl. If you want a multiline 1200 circumflex also to match after a terminating newline, you must set 1201 PCRE2_ALT_CIRCUMFLEX. 1202 <pre> 1203 PCRE2_ALT_VERBNAMES 1204 </pre> 1205 By default, for compatibility with Perl, the name in any verb sequence such as 1206 (*MARK:NAME) is any sequence of characters that does not include a closing 1207 parenthesis.
The name is not processed in any way, and it is not possible to 1208 include a closing parenthesis in the name. However, if the PCRE2_ALT_VERBNAMES 1209 option is set, normal backslash processing is applied to verb names and only an 1210 unescaped closing parenthesis terminates the name. A closing parenthesis can be 1211 included in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED 1212 option is set, unescaped whitespace in verb names is skipped and #-comments are 1213 recognized, exactly as in the rest of the pattern. 1214 <pre> 1215 PCRE2_AUTO_CALLOUT 1216 </pre> 1217 If this bit is set, <b>pcre2_compile()</b> automatically inserts callout items, 1218 all with number 255, before each pattern item. For discussion of the callout 1219 facility, see the 1220 <a href="pcre2callout.html"><b>pcre2callout</b></a> 1221 documentation. 1222 <pre> 1223 PCRE2_CASELESS 1224 </pre> 1225 If this bit is set, letters in the pattern match both upper and lower case 1226 letters in the subject. It is equivalent to Perl's /i option, and it can be 1227 changed within a pattern by a (?i) option setting. 1228 <pre> 1229 PCRE2_DOLLAR_ENDONLY 1230 </pre> 1231 If this bit is set, a dollar metacharacter in the pattern matches only at the 1232 end of the subject string. Without this option, a dollar also matches 1233 immediately before a newline at the end of the string (but not before any other 1234 newlines). The PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is 1235 set. There is no equivalent to this option in Perl, and no way to set it within 1236 a pattern. 1237 <pre> 1238 PCRE2_DOTALL 1239 </pre> 1240 If this bit is set, a dot metacharacter in the pattern matches any character, 1241 including one that indicates a newline. However, it only ever matches one 1242 character, even if newlines are coded as CRLF. Without this option, a dot does 1243 not match when the current position in the subject is at a newline. 
This option 1244 is equivalent to Perl's /s option, and it can be changed within a pattern by a 1245 (?s) option setting. A negative class such as [^a] always matches newline 1246 characters, independent of the setting of this option. 1247 <pre> 1248 PCRE2_DUPNAMES 1249 </pre> 1250 If this bit is set, names used to identify capturing subpatterns need not be 1251 unique. This can be helpful for certain types of pattern when it is known that 1252 only one instance of the named subpattern can ever be matched. There are more 1253 details of named subpatterns below; see also the 1254 <a href="pcre2pattern.html"><b>pcre2pattern</b></a> 1255 documentation. 1256 <pre> 1257 PCRE2_EXTENDED 1258 </pre> 1259 If this bit is set, most white space characters in the pattern are totally 1260 ignored except when escaped or inside a character class. However, white space 1261 is not allowed within sequences such as (?> that introduce various 1262 parenthesized subpatterns, nor within numerical quantifiers such as {1,3}. 1263 Ignorable white space is permitted between an item and a following quantifier 1264 and between a quantifier and a following + that indicates possessiveness. 1265 </P> 1266 <P> 1267 PCRE2_EXTENDED also causes characters between an unescaped # outside a 1268 character class and the next newline, inclusive, to be ignored, which makes it 1269 possible to include comments inside complicated patterns. Note that the end of 1270 this type of comment is a literal newline sequence in the pattern; escape 1271 sequences that happen to represent a newline do not count. PCRE2_EXTENDED is 1272 equivalent to Perl's /x option, and it can be changed within a pattern by a 1273 (?x) option setting. 
1274 </P> 1275 <P> 1276 Which characters are interpreted as newlines can be specified by a setting in 1277 the compile context that is passed to <b>pcre2_compile()</b> or by a special 1278 sequence at the start of the pattern, as described in the section entitled 1279 <a href="pcre2pattern.html#newlines">"Newline conventions"</a> 1280 in the <b>pcre2pattern</b> documentation. A default is defined when PCRE2 is 1281 built. 1282 <pre> 1283 PCRE2_FIRSTLINE 1284 </pre> 1285 If this option is set, an unanchored pattern is required to match before or at 1286 the first newline in the subject string, though the matched text may continue 1287 over the newline. See also PCRE2_USE_OFFSET_LIMIT, which provides a more 1288 general limiting facility. If PCRE2_FIRSTLINE is set with an offset limit, a 1289 match must occur in the first line and also within the offset limit. In other 1290 words, whichever limit comes first is used. 1291 <pre> 1292 PCRE2_MATCH_UNSET_BACKREF 1293 </pre> 1294 If this option is set, a back reference to an unset subpattern group matches an 1295 empty string (by default this causes the current matching alternative to fail). 1296 A pattern such as (\1)(a) succeeds when this option is set (assuming it can 1297 find an "a" in the subject), whereas it fails by default, for Perl 1298 compatibility. Setting this option makes PCRE2 behave more like ECMAscript (aka 1299 JavaScript). 1300 <pre> 1301 PCRE2_MULTILINE 1302 </pre> 1303 By default, for the purposes of matching "start of line" and "end of line", 1304 PCRE2 treats the subject string as consisting of a single line of characters, 1305 even if it actually contains newlines. The "start of line" metacharacter (^) 1306 matches only at the start of the string, and the "end of line" metacharacter 1307 ($) matches only at the end of the string, or before a terminating newline 1308 (except when PCRE2_DOLLAR_ENDONLY is set). Note, however, that unless 1309 PCRE2_DOTALL is set, the "any character" metacharacter (.) 
does not match at a
newline. This behaviour (for ^, $, and dot) is the same as Perl.
</P>
<P>
When PCRE2_MULTILINE is set, the "start of line" and "end of line"
constructs match immediately following or immediately before internal newlines
in the subject string, respectively, as well as at the very start and end. This
is equivalent to Perl's /m option, and it can be changed within a pattern by a
(?m) option setting. Note that the "start of line" metacharacter does not match
after a newline at the end of the subject, for compatibility with Perl.
However, you can change this by setting the PCRE2_ALT_CIRCUMFLEX option. If
there are no newlines in a subject string, or no occurrences of ^ or $ in a
pattern, setting PCRE2_MULTILINE has no effect.
<pre>
  PCRE2_NEVER_BACKSLASH_C
</pre>
This option locks out the use of \C in the pattern that is being compiled.
This escape can cause unpredictable behaviour in UTF-8 or UTF-16 modes, because
it may leave the current matching point in the middle of a multi-code-unit
character. This option may be useful in applications that process patterns from
external sources. Note that there is also a build-time option that permanently
locks out the use of \C.
<pre>
  PCRE2_NEVER_UCP
</pre>
This option locks out the use of Unicode properties for handling \B, \b, \D,
\d, \S, \s, \W, \w, and some of the POSIX character classes, as described
for the PCRE2_UCP option below. In particular, it prevents the creator of the
pattern from enabling this facility by starting the pattern with (*UCP). This
option may be useful in applications that process patterns from external
sources. The option combination PCRE2_UCP and PCRE2_NEVER_UCP causes an error.
<pre>
  PCRE2_NEVER_UTF
</pre>
This option locks out interpretation of the pattern as UTF-8, UTF-16, or
UTF-32, depending on which library is in use. In particular, it prevents the
creator of the pattern from switching to UTF interpretation by starting the
pattern with (*UTF). This option may be useful in applications that process
patterns from external sources. The combination of PCRE2_UTF and
PCRE2_NEVER_UTF causes an error.
<pre>
  PCRE2_NO_AUTO_CAPTURE
</pre>
If this option is set, it disables the use of numbered capturing parentheses in
the pattern. Any opening parenthesis that is not followed by ? behaves as if it
were followed by ?: but named parentheses can still be used for capturing (and
they acquire numbers in the usual way). There is no equivalent of this option
in Perl. Note that, if this option is set, references to capturing groups (back
references or recursion/subroutine calls) may only refer to named groups,
though the reference can be by name or by number.
<pre>
  PCRE2_NO_AUTO_POSSESS
</pre>
If this option is set, it disables "auto-possessification", which is an
optimization that, for example, turns a+b into a++b in order to avoid
backtracks into a+ that can never be successful. However, if callouts are in
use, auto-possessification means that some callouts are never taken. You can
set this option if you want the matching functions to do a full unoptimized
search and run all the callouts, but it is mainly provided for testing
purposes.
<pre>
  PCRE2_NO_DOTSTAR_ANCHOR
</pre>
If this option is set, it disables an optimization that is applied when .* is
the first significant item in a top-level branch of a pattern, and all the
other branches also start with .* or with \A or \G or ^.
The optimization is
automatically disabled for .* if it is inside an atomic group or a capturing
group that is the subject of a back reference, or if the pattern contains
(*PRUNE) or (*SKIP). When the optimization is not disabled, such a pattern is
automatically anchored if PCRE2_DOTALL is set for all the .* items and
PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match
must start either at the start of the subject or following a newline is
remembered. Like other optimizations, this can cause callouts to be skipped.
<pre>
  PCRE2_NO_START_OPTIMIZE
</pre>
This is an option whose main effect is at matching time. It does not change
what <b>pcre2_compile()</b> generates, but it does affect the output of the JIT
compiler.
</P>
<P>
There are a number of optimizations that may occur at the start of a match, in
order to speed up the process. For example, if it is known that an unanchored
match must start with a specific character, the matching code searches the
subject for that character, and fails immediately if it cannot find it, without
actually running the main matching function. This means that a special item
such as (*COMMIT) at the start of a pattern is not considered until after a
suitable starting point for the match has been found. Also, when callouts or
(*MARK) items are in use, these "start-up" optimizations can cause them to be
skipped if the pattern is never actually used. The start-up optimizations are
in effect a pre-scan of the subject that takes place before the pattern is run.
</P>
<P>
The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
possibly causing performance to suffer, but ensuring that in cases where the
result is "no match", the callouts do occur, and that items such as (*COMMIT)
and (*MARK) are considered at every possible starting position in the subject
string.
</P>
<P>
Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching operation.
Consider the pattern
<pre>
  (*COMMIT)ABC
</pre>
When this is compiled, PCRE2 records the fact that a match must start with the
character "A". Suppose the subject string is "DEFABC". The start-up
optimization scans along the subject, finds "A" and runs the first match
attempt from there. The (*COMMIT) item means that the pattern must match the
current starting position, which in this case, it does. However, if the same
match is run with PCRE2_NO_START_OPTIMIZE set, the initial scan along the
subject string does not happen. The first match attempt is run starting from
"D" and when this fails, (*COMMIT) prevents any further matches being tried, so
the overall result is "no match". There are also other start-up optimizations.
For example, a minimum length for the subject may be recorded. Consider the
pattern
<pre>
  (*MARK:A)(X|Y)
</pre>
The minimum length for a match is one character. If the subject is "ABC", there
will be attempts to match "ABC", "BC", and "C". An attempt to match an empty
string at the end of the subject does not take place, because PCRE2 knows that
the subject is now too short, and so the (*MARK) is never encountered. In this
case, the optimization does not affect the overall match result, which is still
"no match", but it does affect the auxiliary information that is returned.
<pre>
  PCRE2_NO_UTF_CHECK
</pre>
When PCRE2_UTF is set, the validity of the pattern as a UTF string is
automatically checked. There are discussions about the validity of
<a href="pcre2unicode.html#utf8strings">UTF-8 strings,</a>
<a href="pcre2unicode.html#utf16strings">UTF-16 strings,</a>
and
<a href="pcre2unicode.html#utf32strings">UTF-32 strings</a>
in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
document.
If an invalid UTF sequence is found, <b>pcre2_compile()</b> returns a negative
error code.
</P>
<P>
If you know that your pattern is valid, and you want to skip this check for
performance reasons, you can set the PCRE2_NO_UTF_CHECK option. When it is set,
the effect of passing an invalid UTF string as a pattern is undefined. It may
cause your program to crash or loop. Note that this option can also be passed
to <b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>, to suppress validity
checking of the subject string.
<pre>
  PCRE2_UCP
</pre>
This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W,
\w, and some of the POSIX character classes. By default, only ASCII characters
are recognized, but if PCRE2_UCP is set, Unicode properties are used instead to
classify characters. More details are given in the section on
<a href="pcre2pattern.html#genericchartypes">generic character types</a>
in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page. If you set PCRE2_UCP, matching one of the items it affects takes much
longer. The option is available only if PCRE2 has been compiled with Unicode
support.
<pre>
  PCRE2_UNGREEDY
</pre>
This option inverts the "greediness" of the quantifiers so that they are not
greedy by default, but become greedy if followed by "?". It is not compatible
with Perl.
It can also be set by a (?U) option setting within the pattern.
<pre>
  PCRE2_USE_OFFSET_LIMIT
</pre>
This option must be set for <b>pcre2_compile()</b> if
<b>pcre2_set_offset_limit()</b> is going to be used to set a non-default offset
limit in a match context for matches that use this pattern. An error is
generated if an offset limit is set without this option. For more details, see
the description of <b>pcre2_set_offset_limit()</b> in the
<a href="#matchcontext">section</a>
that describes match contexts. See also the PCRE2_FIRSTLINE
option above.
<pre>
  PCRE2_UTF
</pre>
This option causes PCRE2 to regard both the pattern and the subject strings
that are subsequently processed as strings of UTF characters instead of
single-code-unit strings. It is available when PCRE2 is built to include
Unicode support (which is the default). If Unicode support is not available,
the use of this option provokes an error. Details of how this option changes
the behaviour of PCRE2 are given in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
page.
</P>
<br><a name="SEC19" href="#TOC1">COMPILATION ERROR CODES</a><br>
<P>
There are over 80 positive error codes that <b>pcre2_compile()</b> may return
(via <i>errorcode</i>) if it finds an error in the pattern. There are also some
negative error codes that are used for invalid UTF strings. These are the same
as given by <b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>, and are described
in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
page. The <b>pcre2_get_error_message()</b> function (see "Obtaining a textual
error message"
<a href="#geterrormessage">below)</a>
can be called to obtain a textual error message from any error code.
<a name="jitcompiling"></a></P>
<br><a name="SEC20" href="#TOC1">JUST-IN-TIME (JIT) COMPILATION</a><br>
<P>
<b>int pcre2_jit_compile(pcre2_code *<i>code</i>, uint32_t <i>options</i>);</b>
<br>
<br>
<b>int pcre2_jit_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b>  PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
<b>  uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
<b>  pcre2_match_context *<i>mcontext</i>);</b>
<br>
<br>
<b>void pcre2_jit_free_unused_memory(pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE <i>startsize</i>,</b>
<b>  PCRE2_SIZE <i>maxsize</i>, pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>void pcre2_jit_stack_assign(pcre2_match_context *<i>mcontext</i>,</b>
<b>  pcre2_jit_callback <i>callback_function</i>, void *<i>callback_data</i>);</b>
<br>
<br>
<b>void pcre2_jit_stack_free(pcre2_jit_stack *<i>jit_stack</i>);</b>
</P>
<P>
These functions provide support for JIT compilation, which, if the just-in-time
compiler is available, further processes a compiled pattern into machine code
that executes much faster than the <b>pcre2_match()</b> interpretive matching
function. Full details are given in the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation.
</P>
<P>
JIT compilation is a heavyweight optimization. It can take some time for
patterns to be analyzed, and for one-off matches and simple patterns the
benefit of faster execution might be offset by a much slower compilation time.
Most, but not all, patterns can be optimized by the JIT compiler.
<a name="localesupport"></a></P>
<br><a name="SEC21" href="#TOC1">LOCALE SUPPORT</a><br>
<P>
PCRE2 handles caseless matching, and determines whether characters are letters,
digits, or whatever, by reference to a set of tables, indexed by character code
point. This applies only to characters whose code points are less than 256. By
default, higher-valued code points never match escapes such as \w or \d.
However, if PCRE2 is built with UTF support, all characters can be tested with
\p and \P, or, alternatively, the PCRE2_UCP option can be set when a pattern
is compiled; this causes \w and friends to use Unicode property support
instead of the built-in tables.
</P>
<P>
The use of locales with Unicode is discouraged. If you are handling characters
with code points greater than 128, you should either use Unicode support, or
use locales, but not try to mix the two.
</P>
<P>
PCRE2 contains an internal set of character tables that are used by default.
These are sufficient for many applications. Normally, the internal tables
recognize only ASCII characters. However, when PCRE2 is built, it is possible
to cause the internal tables to be rebuilt in the default "C" locale of the
local system, which may cause them to be different.
</P>
<P>
The internal tables can be overridden by tables supplied by the application
that calls PCRE2. These may be created in a different locale from the default.
As more and more applications change to using Unicode, the need for this locale
support is expected to die away.
</P>
<P>
External tables are built by calling the <b>pcre2_maketables()</b> function, in
the relevant locale. The result can be passed to <b>pcre2_compile()</b> as often
as necessary, by creating a compile context and calling
<b>pcre2_set_character_tables()</b> to set the tables pointer therein.
For
example, to build and use tables that are appropriate for the French locale
(where accented characters with values greater than 128 are treated as
letters), the following code could be used:
<pre>
  setlocale(LC_CTYPE, "fr_FR");
  tables = pcre2_maketables(NULL);
  ccontext = pcre2_compile_context_create(NULL);
  pcre2_set_character_tables(ccontext, tables);
  re = pcre2_compile(..., ccontext);
</pre>
The locale name "fr_FR" is used on Linux and other Unix-like systems; if you
are using Windows, the name for the French locale is "french". It is the
caller's responsibility to ensure that the memory containing the tables remains
available for as long as it is needed.
</P>
<P>
The pointer that is passed (via the compile context) to <b>pcre2_compile()</b>
is saved with the compiled pattern, and the same tables are used by
<b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>. Thus, for any single pattern,
compilation and matching both happen in the same locale, but different patterns
can be processed in different locales.
<a name="infoaboutpattern"></a></P>
<br><a name="SEC22" href="#TOC1">INFORMATION ABOUT A COMPILED PATTERN</a><br>
<P>
<b>int pcre2_pattern_info(const pcre2_code *<i>code</i>, uint32_t <i>what</i>, void *<i>where</i>);</b>
</P>
<P>
The <b>pcre2_pattern_info()</b> function returns general information about a
compiled pattern. For information about callouts, see the
<a href="#infoaboutcallouts">next section.</a>
The first argument for <b>pcre2_pattern_info()</b> is a pointer to the compiled
pattern. The second argument specifies which piece of information is required,
and the third argument is a pointer to a variable to receive the data.
If the
third argument is NULL, the first argument is ignored, and the function returns
the size in bytes of the variable that is required for the information
requested. Otherwise, the yield of the function is zero for success, or one of
the following negative numbers:
<pre>
  PCRE2_ERROR_NULL       the argument <i>code</i> was NULL
  PCRE2_ERROR_BADMAGIC   the "magic number" was not found
  PCRE2_ERROR_BADOPTION  the value of <i>what</i> was invalid
  PCRE2_ERROR_UNSET      the requested field is not set
</pre>
The "magic number" is placed at the start of each compiled pattern as a simple
check against passing an arbitrary memory pointer. Here is a typical call of
<b>pcre2_pattern_info()</b>, to obtain the length of the compiled pattern:
<pre>
  int rc;
  size_t length;
  rc = pcre2_pattern_info(
    re,               /* result of pcre2_compile() */
    PCRE2_INFO_SIZE,  /* what is required */
    &length);         /* where to put the data */
</pre>
The possible values for the second argument are defined in <b>pcre2.h</b>, and
are as follows:
<pre>
  PCRE2_INFO_ALLOPTIONS
  PCRE2_INFO_ARGOPTIONS
</pre>
Return a copy of the pattern's options. The third argument should point to a
<b>uint32_t</b> variable. PCRE2_INFO_ARGOPTIONS returns exactly the options that
were passed to <b>pcre2_compile()</b>, whereas PCRE2_INFO_ALLOPTIONS returns
the compile options as modified by any top-level (*XXX) option settings such as
(*UTF) at the start of the pattern itself.
</P>
<P>
For example, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EXTENDED
option, the result for PCRE2_INFO_ALLOPTIONS is PCRE2_EXTENDED and PCRE2_UTF.
Option settings such as (?i) that can change within a pattern do not affect the
result of PCRE2_INFO_ALLOPTIONS, even if they appear right at the start of the
pattern. (This was different in some earlier releases.)
</P>
<P>
A pattern compiled without PCRE2_ANCHORED is automatically anchored by PCRE2 if
the first significant item in every top-level branch is one of the following:
<pre>
  ^     unless PCRE2_MULTILINE is set
  \A    always
  \G    always
  .*    sometimes - see below
</pre>
When .* is the first significant item, anchoring is possible only when all the
following are true:
<pre>
  .* is not in an atomic group
  .* is not in a capturing group that is the subject of a back reference
  PCRE2_DOTALL is in force for .*
  Neither (*PRUNE) nor (*SKIP) appears in the pattern.
  PCRE2_NO_DOTSTAR_ANCHOR is not set.
</pre>
For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in the
options returned for PCRE2_INFO_ALLOPTIONS.
<pre>
  PCRE2_INFO_BACKREFMAX
</pre>
Return the number of the highest back reference in the pattern. The third
argument should point to a <b>uint32_t</b> variable. Named subpatterns acquire
numbers as well as names, and these count towards the highest back reference.
Back references such as \4 or \g{12} match the captured characters of the
given group, but in addition, the check that a capturing group is set in a
conditional subpattern such as (?(3)a|b) is also a back reference. Zero is
returned if there are no back references.
<pre>
  PCRE2_INFO_BSR
</pre>
The output is a uint32_t whose value indicates what character sequences the \R
escape sequence matches. A value of PCRE2_BSR_UNICODE means that \R matches
any Unicode line ending sequence; a value of PCRE2_BSR_ANYCRLF means that \R
matches only CR, LF, or CRLF.
<pre>
  PCRE2_INFO_CAPTURECOUNT
</pre>
Return the highest capturing subpattern number in the pattern. In patterns
where (?| is not used, this is also the total number of capturing subpatterns.
The third argument should point to a <b>uint32_t</b> variable.
<pre>
  PCRE2_INFO_FIRSTBITMAP
</pre>
In the absence of a single first code unit for a non-anchored pattern,
<b>pcre2_compile()</b> may construct a 256-bit table that defines a fixed set of
values for the first code unit in any match. For example, a pattern that starts
with [abc] results in a table with three bits set. When code unit values
greater than 255 are supported, the flag bit for 255 means "any code unit of
value 255 or above". If such a table was constructed, a pointer to it is
returned. Otherwise NULL is returned. The third argument should point to a
<b>const uint8_t *</b> variable.
<pre>
  PCRE2_INFO_FIRSTCODETYPE
</pre>
Return information about the first code unit of any matched string, for a
non-anchored pattern. The third argument should point to a <b>uint32_t</b>
variable. If there is a fixed first value, for example, the letter "c" from a
pattern such as (cat|cow|coyote), 1 is returned, and the character value can be
retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but
it is known that a match can occur only at the start of the subject or
following a newline in the subject, 2 is returned. Otherwise, and for anchored
patterns, 0 is returned.
<pre>
  PCRE2_INFO_FIRSTCODEUNIT
</pre>
Return the value of the first code unit of any matched string in the situation
where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0. The third
argument should point to a <b>uint32_t</b> variable. In the 8-bit library, the
value is always less than 256. In the 16-bit library the value can be up to
0xffff. In the 32-bit library in UTF-32 mode the value can be up to 0x10ffff,
and up to 0xffffffff when not using UTF-32 mode.
<pre>
  PCRE2_INFO_HASBACKSLASHC
</pre>
Return 1 if the pattern contains any instances of \C, otherwise 0. The third
argument should point to a <b>uint32_t</b> variable.
<pre>
  PCRE2_INFO_HASCRORLF
</pre>
Return 1 if the pattern contains any explicit matches for CR or LF characters,
otherwise 0. The third argument should point to a <b>uint32_t</b> variable. An
explicit match is either a literal CR or LF character, or \r or \n.
<pre>
  PCRE2_INFO_JCHANGED
</pre>
Return 1 if the (?J) or (?-J) option setting is used in the pattern, otherwise
0. The third argument should point to a <b>uint32_t</b> variable. (?J) and
(?-J) set and unset the local PCRE2_DUPNAMES option, respectively.
<pre>
  PCRE2_INFO_JITSIZE
</pre>
If the compiled pattern was successfully processed by
<b>pcre2_jit_compile()</b>, return the size of the JIT compiled code, otherwise
return zero. The third argument should point to a <b>size_t</b> variable.
<pre>
  PCRE2_INFO_LASTCODETYPE
</pre>
Return 1 if there is a rightmost literal code unit that must exist in any
matched string, other than at its start. The third argument should point to a
<b>uint32_t</b> variable. If there is no such value, 0 is returned. When 1 is
returned, the code unit value itself can be retrieved using
PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal value is
recorded only if it follows something of variable length. For example, for the
pattern /^a\d+z\d+/ the returned value is 1 (with "z" returned from
PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is 0.
<pre>
  PCRE2_INFO_LASTCODEUNIT
</pre>
Return the value of the rightmost literal code unit that must exist in any
matched string, other than at its start, if such a value has been recorded. The
third argument should point to a <b>uint32_t</b> variable.
If there is no such
value, 0 is returned.
<pre>
  PCRE2_INFO_MATCHEMPTY
</pre>
Return 1 if the pattern might match an empty string, otherwise 0. The third
argument should point to a <b>uint32_t</b> variable. When a pattern contains
recursive subroutine calls it is not always possible to determine whether or
not it can match an empty string. PCRE2 takes a cautious approach and returns 1
in such cases.
<pre>
  PCRE2_INFO_MATCHLIMIT
</pre>
If the pattern set a match limit by including an item of the form
(*LIMIT_MATCH=nnnn) at the start, the value is returned. The third argument
should point to an unsigned 32-bit integer. If no such value has been set, the
call to <b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET.
<pre>
  PCRE2_INFO_MAXLOOKBEHIND
</pre>
Return the number of characters (not code units) in the longest lookbehind
assertion in the pattern. The third argument should point to an unsigned 32-bit
integer. This information is useful when doing multi-segment matching using the
partial matching facilities. Note that the simple assertions \b and \B
require a one-character lookbehind. \A also registers a one-character
lookbehind, though it does not actually inspect the previous character. This is
to ensure that at least one character from the old segment is retained when a
new segment is processed. Otherwise, if there are no lookbehinds in the
pattern, \A might match incorrectly at the start of a new segment.
<pre>
  PCRE2_INFO_MINLENGTH
</pre>
If a minimum length for matching subject strings was computed, its value is
returned. Otherwise the returned value is 0. The value is a number of
characters, which in UTF mode may be different from the number of code units.
The third argument should point to a <b>uint32_t</b> variable.
The value is a
lower bound to the length of any matching string. There may not be any strings
of that length that do actually match, but every string that does match is at
least that long.
<pre>
  PCRE2_INFO_NAMECOUNT
  PCRE2_INFO_NAMEENTRYSIZE
  PCRE2_INFO_NAMETABLE
</pre>
PCRE2 supports the use of named as well as numbered capturing parentheses. The
names are just an additional way of identifying the parentheses, which still
acquire numbers. Several convenience functions such as
<b>pcre2_substring_get_byname()</b> are provided for extracting captured
substrings by name. It is also possible to extract the data directly, by first
converting the name to a number in order to access the correct pointers in the
output vector (described with <b>pcre2_match()</b> below). To do the conversion,
you need to use the name-to-number map, which is described by these three
values.
</P>
<P>
The map consists of a number of fixed-size entries. PCRE2_INFO_NAMECOUNT gives
the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives the size of each
entry in code units; both of these return a <b>uint32_t</b> value. The entry
size depends on the length of the longest name.
</P>
<P>
PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table. This is
a PCRE2_SPTR pointer to a block of code units. In the 8-bit library, the first
two bytes of each entry are the number of the capturing parenthesis, most
significant byte first. In the 16-bit library, the pointer points to 16-bit
code units, the first of which contains the parenthesis number. In the 32-bit
library, the pointer points to 32-bit code units, the first of which contains
the parenthesis number. The rest of the entry is the corresponding name, zero
terminated.
</P>
<P>
The names are in alphabetical order.
If (?| is used to create multiple groups
with the same number, as described in the
<a href="pcre2pattern.html#dupsubpatternnumber">section on duplicate subpattern numbers</a>
in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page, the groups may be given the same name, but there is only one entry in the
table. Different names for groups of the same number are not permitted.
</P>
<P>
Duplicate names for subpatterns with different numbers are permitted, but only
if PCRE2_DUPNAMES is set. They appear in the table in the order in which they
were found in the pattern. In the absence of (?| this is the order of
increasing number; when (?| is used this is not necessarily the case because
later subpatterns may have lower numbers.
</P>
<P>
As a simple example of the name/number table, consider the following pattern
after compilation by the 8-bit library (assume PCRE2_EXTENDED is set, so white
space - including newlines - is ignored):
<pre>
  (?<date> (?<year>(\d\d)?\d\d) - (?<month>\d\d) - (?<day>\d\d) )
</pre>
There are four named subpatterns, so the table has four entries, and each entry
in the table is eight bytes long. The table is as follows, with non-printing
bytes shown in hexadecimal, and undefined bytes shown as ??:
<pre>
  00 01 d  a  t  e  00 ??
  00 05 d  a  y  00 ?? ??
  00 04 m  o  n  t  h  00
  00 02 y  e  a  r  00 ??
</pre>
When writing code to extract data from named subpatterns using the
name-to-number map, remember that the length of the entries is likely to be
different for each compiled pattern.
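The table layout described above for the 8-bit library can be walked with a
few lines of plain C. The following is a minimal sketch; the function name
name_to_number() and the array example_table are hypothetical. In a real
program the table pointer, entry count, and entry size would be obtained with
PCRE2_INFO_NAMETABLE, PCRE2_INFO_NAMECOUNT, and PCRE2_INFO_NAMEENTRYSIZE.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: search an 8-bit name table for `name`.
   Each entry is `entry_size` bytes: a two-byte group number (most
   significant byte first) followed by the zero-terminated name.
   Returns the group number, or -1 if the name is not present. When
   duplicate names exist, the first entry found is returned. */
static int name_to_number(const uint8_t *table, uint32_t count,
                          uint32_t entry_size, const char *name)
{
    for (uint32_t i = 0; i < count; i++) {
        const uint8_t *entry = table + (size_t)i * entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* The example table from the text. The undefined (??) bytes are written
   as zero here purely for illustration. */
static const uint8_t example_table[4 * 8] = {
    0x00, 0x01, 'd', 'a', 't', 'e', 0x00, 0x00,
    0x00, 0x05, 'd', 'a', 'y', 0x00, 0x00, 0x00,
    0x00, 0x04, 'm', 'o', 'n', 't', 'h', 0x00,
    0x00, 0x02, 'y', 'e', 'a', 'r', 0x00, 0x00
};
```

A linear scan is used for simplicity. Passing the entry size as a parameter,
rather than hard-coding eight, reflects the point that it varies from one
compiled pattern to another.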
<pre>
  PCRE2_INFO_NEWLINE
</pre>
The output is a <b>uint32_t</b> with one of the following values:
<pre>
  PCRE2_NEWLINE_CR        Carriage return (CR)
  PCRE2_NEWLINE_LF        Linefeed (LF)
  PCRE2_NEWLINE_CRLF      Carriage return, linefeed (CRLF)
  PCRE2_NEWLINE_ANY       Any Unicode line ending
  PCRE2_NEWLINE_ANYCRLF   Any of CR, LF, or CRLF
</pre>
This specifies the default character sequence that will be recognized as
meaning "newline" while matching.
<pre>
  PCRE2_INFO_RECURSIONLIMIT
</pre>
If the pattern set a recursion limit by including an item of the form
(*LIMIT_RECURSION=nnnn) at the start, the value is returned. The third
argument should point to an unsigned 32-bit integer. If no such value has been
set, the call to <b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET.
<pre>
  PCRE2_INFO_SIZE
</pre>
Return the size of the compiled pattern in bytes (for all three libraries). The
third argument should point to a <b>size_t</b> variable. This value includes the
size of the general data block that precedes the code units of the compiled
pattern itself. The value that is used when <b>pcre2_compile()</b> is getting
memory in which to place the compiled pattern may be slightly larger than the
value returned by this option, because there are cases where the code that
calculates the size has to over-estimate. Processing a pattern with the JIT
compiler does not alter the value returned by this option.
<a name="infoaboutcallouts"></a></P>
<br><a name="SEC23" href="#TOC1">INFORMATION ABOUT A PATTERN'S CALLOUTS</a><br>
<P>
<b>int pcre2_callout_enumerate(const pcre2_code *<i>code</i>,</b>
<b>  int (*<i>callback</i>)(pcre2_callout_enumerate_block *, void *),</b>
<b>  void *<i>user_data</i>);</b>
<br>
<br>
A script language that supports the use of string arguments in callouts might
like to scan all the callouts in a pattern before running the match. This can
be done by calling <b>pcre2_callout_enumerate()</b>. The first argument is a
pointer to a compiled pattern, the second points to a callback function, and
the third is arbitrary user data. The callback function is called for every
callout in the pattern in the order in which they appear. Its first argument is
a pointer to a callout enumeration block, and its second argument is the
<i>user_data</i> value that was passed to <b>pcre2_callout_enumerate()</b>. The
contents of the callout enumeration block are described in the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation, which also gives further details about callouts.
</P>
<br><a name="SEC24" href="#TOC1">SERIALIZATION AND PRECOMPILING</a><br>
<P>
It is possible to save compiled patterns on disc or elsewhere, and reload them
later, subject to a number of restrictions. The functions whose names begin
with <b>pcre2_serialize_</b> are used for this purpose. They are described in
the
<a href="pcre2serialize.html"><b>pcre2serialize</b></a>
documentation.
<a name="matchdatablock"></a></P>
<br><a name="SEC25" href="#TOC1">THE MATCH DATA BLOCK</a><br>
<P>
<b>pcre2_match_data *pcre2_match_data_create(uint32_t <i>ovecsize</i>,</b>
<b>  pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>pcre2_match_data *pcre2_match_data_create_from_pattern(</b>
<b>  const pcre2_code *<i>code</i>, pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>void pcre2_match_data_free(pcre2_match_data *<i>match_data</i>);</b>
</P>
<P>
Information about a successful or unsuccessful match is placed in a match
data block, which is an opaque structure that is accessed by function calls. In
particular, the match data block contains a vector of offsets into the subject
string that define the matched part of the subject and any substrings that were
captured. This is known as the <i>ovector</i>.
</P>
<P>
Before calling <b>pcre2_match()</b>, <b>pcre2_dfa_match()</b>, or
<b>pcre2_jit_match()</b> you must create a match data block by calling one of
the creation functions above. For <b>pcre2_match_data_create()</b>, the first
argument is the number of pairs of offsets in the <i>ovector</i>. One pair of
offsets is required to identify the string that matched the whole pattern, with
another pair for each captured substring. For example, a value of 4 creates
enough space to record the matched portion of the subject plus three captured
substrings. A minimum of one pair is imposed by
<b>pcre2_match_data_create()</b>, so it is always possible to return the overall
matched string.
</P>
<P>
The second argument of <b>pcre2_match_data_create()</b> is a pointer to a
general context, which can specify custom memory management for obtaining the
memory for the match data block. If you are not using custom memory management,
pass NULL, which causes <b>malloc()</b> to be used.
</P>
<P>
For <b>pcre2_match_data_create_from_pattern()</b>, the first argument is a
pointer to a compiled pattern. The ovector is created to be exactly the right
size to hold all the substrings a pattern might capture. The second argument is
again a pointer to a general context, but in this case if NULL is passed, the
memory is obtained using the same allocator that was used for the compiled
pattern (custom or default).
</P>
<P>
A match data block can be used many times, with the same or different compiled
patterns. You can extract information from a match data block after a match
operation has finished, using functions that are described in the sections on
<a href="#matchedstrings">matched strings</a>
and
<a href="#matchotherdata">other match data</a>
below.
</P>
<P>
When a call of <b>pcre2_match()</b> fails, valid data is available in the match
block only when the error is PCRE2_ERROR_NOMATCH, PCRE2_ERROR_PARTIAL, or one
of the error codes for an invalid UTF string. Exactly what is available depends
on the error, and is detailed below.
</P>
<P>
When one of the matching functions is called, pointers to the compiled pattern
and the subject string are set in the match data block so that they can be
referenced by the extraction functions. After running a match, you must not
free a compiled pattern or a subject string until after all operations on the
match data block (for that match) have taken place.
</P>
<P>
When a match data block itself is no longer needed, it should be freed by
calling <b>pcre2_match_data_free()</b>.
</P>
<br><a name="SEC26" href="#TOC1">MATCHING A PATTERN: THE TRADITIONAL FUNCTION</a><br>
<P>
<b>int pcre2_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b>  PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
<b>  uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
<b>  pcre2_match_context *<i>mcontext</i>);</b>
</P>
<P>
The function <b>pcre2_match()</b> is called to match a subject string against a
compiled pattern, which is passed in the <i>code</i> argument. You can call
<b>pcre2_match()</b> with the same <i>code</i> argument as many times as you
like, in order to find multiple matches in the subject string or to match
different subject strings with the same pattern.
</P>
<P>
This function is the main matching facility of the library, and it operates in
a Perl-like manner. For specialist use there is also an alternative matching
function, which is described
<a href="#dfamatch">below</a>
in the section about the <b>pcre2_dfa_match()</b> function.
</P>
<P>
Here is an example of a simple call to <b>pcre2_match()</b>:
<pre>
  pcre2_match_data *match_data = pcre2_match_data_create(4, NULL);
  int rc = pcre2_match(
    re,                         /* result of pcre2_compile() */
    (PCRE2_SPTR)"some string",  /* the subject string */
    11,                         /* the length of the subject string */
    0,                          /* start at offset 0 in the subject */
    0,                          /* default options */
    match_data,                 /* the match data block */
    NULL);                      /* a match context; NULL means use defaults */
</pre>
If the subject string is zero-terminated, the length can be given as
PCRE2_ZERO_TERMINATED. A match context must be provided if certain less common
matching parameters are to be changed. For details, see the section on
<a href="#matchcontext">the match context</a>
above.
</P>
<br><b>
The string to be matched by <b>pcre2_match()</b>
</b><br>
<P>
The subject string is passed to <b>pcre2_match()</b> as a pointer in
<i>subject</i>, a length in <i>length</i>, and a starting offset in
<i>startoffset</i>. The length and offset are in code units, not characters.
That is, they are in bytes for the 8-bit library, 16-bit code units for the
16-bit library, and 32-bit code units for the 32-bit library, whether or not
UTF processing is enabled.
</P>
<P>
If <i>startoffset</i> is greater than the length of the subject,
<b>pcre2_match()</b> returns PCRE2_ERROR_BADOFFSET. When the starting offset is
zero, the search for a match starts at the beginning of the subject, and this
is by far the most common case. In UTF-8 or UTF-16 mode, the starting offset
must point to the start of a character, or to the end of the subject (in UTF-32
mode, one code unit equals one character, so all offsets are valid). Like the
pattern string, the subject may contain binary zeroes.
</P>
<P>
A non-zero starting offset is useful when searching for another match in the
same subject by calling <b>pcre2_match()</b> again after a previous success.
Setting <i>startoffset</i> differs from passing over a shortened string and
setting PCRE2_NOTBOL in the case of a pattern that begins with any kind of
lookbehind. For example, consider the pattern
<pre>
  \Biss\B
</pre>
which finds occurrences of "iss" in the middle of words. (\B matches only if
the current position in the subject is not a word boundary.) When applied to
the string "Mississipi" the first call to <b>pcre2_match()</b> finds the first
occurrence.
If <b>pcre2_match()</b> is called again with just the remainder of
the subject, namely "issipi", it does not match, because \B is always false at
the start of the subject, which is deemed to be a word boundary. However, if
<b>pcre2_match()</b> is passed the entire string again, but with
<i>startoffset</i> set to 4, it finds the second occurrence of "iss" because it
is able to look behind the starting point to discover that it is preceded by a
letter.
</P>
<P>
Finding all the matches in a subject is tricky when the pattern can match an
empty string. It is possible to emulate Perl's /g behaviour by first trying the
match again at the same offset, with the PCRE2_NOTEMPTY_ATSTART and
PCRE2_ANCHORED options, and then if that fails, advancing the starting offset
and trying an ordinary match again. There is some code that demonstrates how to
do this in the
<a href="pcre2demo.html"><b>pcre2demo</b></a>
sample program. In the most general case, you have to check to see if the
newline convention recognizes CRLF as a newline, and if so, and the current
character is CR followed by LF, advance the starting offset by two characters
instead of one.
</P>
<P>
If a non-zero starting offset is passed when the pattern is anchored, one
attempt to match at the given offset is made. This can only succeed if the
pattern does not require the match to be at the start of the subject.
<a name="matchoptions"></a></P>
<br><b>
Option bits for <b>pcre2_match()</b>
</b><br>
<P>
The unused bits of the <i>options</i> argument for <b>pcre2_match()</b> must be
zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL,
PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_JIT,
PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is
described below.
</P>
<P>
Setting PCRE2_ANCHORED at match time is not supported by the just-in-time (JIT)
compiler. If it is set, JIT matching is disabled and the normal interpretive
code in <b>pcre2_match()</b> is run. Apart from PCRE2_NO_JIT (obviously), the
remaining options are supported for JIT matching.
<pre>
  PCRE2_ANCHORED
</pre>
The PCRE2_ANCHORED option limits <b>pcre2_match()</b> to matching at the first
matching position. If a pattern was compiled with PCRE2_ANCHORED, or turned out
to be anchored by virtue of its contents, it cannot be made unanchored at
matching time. Note that setting the option at match time disables JIT
matching.
<pre>
  PCRE2_NOTBOL
</pre>
This option specifies that the first character of the subject string is not the
beginning of a line, so the circumflex metacharacter should not match before
it. Setting this without having set PCRE2_MULTILINE at compile time causes
circumflex never to match. This option affects only the behaviour of the
circumflex metacharacter. It does not affect \A.
<pre>
  PCRE2_NOTEOL
</pre>
This option specifies that the end of the subject string is not the end of a
line, so the dollar metacharacter should not match it nor (except in multiline
mode) a newline immediately before it. Setting this without having set
PCRE2_MULTILINE at compile time causes dollar never to match. This option
affects only the behaviour of the dollar metacharacter. It does not affect \Z
or \z.
<pre>
  PCRE2_NOTEMPTY
</pre>
An empty string is not considered to be a valid match if this option is set. If
there are alternatives in the pattern, they are tried. If all the alternatives
match the empty string, the entire match fails. For example, if the pattern
<pre>
  a?b?
</pre>
is applied to a string not beginning with "a" or "b", it matches an empty
string at the start of the subject. With PCRE2_NOTEMPTY set, this match is not
valid, so <b>pcre2_match()</b> searches further into the string for occurrences
of "a" or "b".
<pre>
  PCRE2_NOTEMPTY_ATSTART
</pre>
This is like PCRE2_NOTEMPTY, except that it locks out an empty string match
only at the first matching position, that is, at the start of the subject plus
the starting offset. An empty string match later in the subject is permitted.
If the pattern is anchored, such a match can occur only if the pattern contains
\K.
<pre>
  PCRE2_NO_JIT
</pre>
By default, if a pattern has been successfully processed by
<b>pcre2_jit_compile()</b>, JIT is automatically used when <b>pcre2_match()</b>
is called with options that JIT supports. Setting PCRE2_NO_JIT disables the use
of JIT; it forces matching to be done by the interpreter.
<pre>
  PCRE2_NO_UTF_CHECK
</pre>
When PCRE2_UTF is set at compile time, the validity of the subject as a UTF
string is checked by default when <b>pcre2_match()</b> is subsequently called.
If a non-zero starting offset is given, the check is applied only to that part
of the subject that could be inspected during matching, and there is a check
that the starting offset points to the first code unit of a character or to the
end of the subject. If there are no lookbehind assertions in the pattern, the
check starts at the starting offset. Otherwise, it starts at the length of the
longest lookbehind before the starting offset, or at the start of the subject
if there are not that many characters before the starting offset. Note that the
sequences \b and \B are one-character lookbehinds.
</P>
<P>
The check is carried out before any other processing takes place, and a
negative error code is returned if the check fails. There are several UTF error
codes for each code unit width, corresponding to different problems with the
code unit sequence. There are discussions about the validity of
<a href="pcre2unicode.html#utf8strings">UTF-8 strings,</a>
<a href="pcre2unicode.html#utf16strings">UTF-16 strings,</a>
and
<a href="pcre2unicode.html#utf32strings">UTF-32 strings</a>
in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
page.
</P>
<P>
If you know that your subject is valid, and you want to skip these checks for
performance reasons, you can set the PCRE2_NO_UTF_CHECK option when calling
<b>pcre2_match()</b>. You might want to do this for the second and subsequent
calls to <b>pcre2_match()</b> if you are making repeated calls to find all the
matches in a single subject string.
</P>
<P>
NOTE: When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid string
as a subject, or an invalid value of <i>startoffset</i>, is undefined. Your
program may crash or loop indefinitely.
<pre>
  PCRE2_PARTIAL_HARD
  PCRE2_PARTIAL_SOFT
</pre>
These options turn on the partial matching feature. A partial match occurs if
the end of the subject string is reached successfully, but there are not enough
subject characters to complete the match. If this happens when
PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set, matching continues by
testing any remaining alternatives. Only if no complete match can be found is
PCRE2_ERROR_PARTIAL returned instead of PCRE2_ERROR_NOMATCH. In other words,
PCRE2_PARTIAL_SOFT specifies that the caller is prepared to handle a partial
match, but only if no complete match can be found.
</P>
<P>
If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if
a partial match is found, <b>pcre2_match()</b> immediately returns
PCRE2_ERROR_PARTIAL, without considering any other alternatives. In other
words, when PCRE2_PARTIAL_HARD is set, a partial match is considered to be more
important than an alternative complete match.
</P>
<P>
There is a more detailed discussion of partial and multi-segment matching, with
examples, in the
<a href="pcre2partial.html"><b>pcre2partial</b></a>
documentation.
</P>
<br><a name="SEC27" href="#TOC1">NEWLINE HANDLING WHEN MATCHING</a><br>
<P>
When PCRE2 is built, a default newline convention is set; this is usually the
standard convention for the operating system. The default can be overridden in
a
<a href="#compilecontext">compile context</a>
by calling <b>pcre2_set_newline()</b>. It can also be overridden by starting a
pattern string with, for example, (*CRLF), as described in the
<a href="pcre2pattern.html#newlines">section on newline conventions</a>
in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page. During matching, the newline choice affects the behaviour of the dot,
circumflex, and dollar metacharacters. It may also alter the way the match
starting position is advanced after a match failure for an unanchored pattern.
</P>
<P>
When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is set as
the newline convention, and a match attempt for an unanchored pattern fails
when the current starting position is at a CRLF sequence, and the pattern
contains no explicit matches for CR or LF characters, the match position is
advanced by two characters instead of one, in other words, to after the CRLF.
</P>
<P>
The above rule is a compromise that makes the most common cases work as
expected.
For example, if the pattern is .+A (and the PCRE2_DOTALL option is
not set), it does not match the string "\r\nA" because, after failing at the
start, it skips both the CR and the LF before retrying. However, the pattern
[\r\n]A does match that string, because it contains an explicit CR or LF
reference, and so advances only by one character after the first failure.
</P>
<P>
An explicit match for CR or LF is either a literal appearance of one of those
characters in the pattern, or one of the \r or \n escape sequences. Implicit
matches such as [^X] do not count, nor does \s, even though it includes CR and
LF in the characters that it matches.
</P>
<P>
Notwithstanding the above, anomalous effects may still occur when CRLF is a
valid newline sequence and explicit \r or \n escapes appear in the pattern.
<a name="matchedstrings"></a></P>
<br><a name="SEC28" href="#TOC1">HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS</a><br>
<P>
<b>uint32_t pcre2_get_ovector_count(pcre2_match_data *<i>match_data</i>);</b>
<br>
<br>
<b>PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *<i>match_data</i>);</b>
</P>
<P>
In general, a pattern matches a certain portion of the subject, and in
addition, further substrings from the subject may be picked out by
parenthesized parts of the pattern. Following the usage in Jeffrey Friedl's
book, this is called "capturing" in what follows, and the phrase "capturing
subpattern" or "capturing group" is used for a fragment of a pattern that picks
out a substring. PCRE2 supports several other kinds of parenthesized subpattern
that do not cause substrings to be captured. The <b>pcre2_pattern_info()</b>
function can be used to find out how many capturing subpatterns there are in a
compiled pattern.
</P>
<P>
You can use auxiliary functions for accessing captured substrings
<a href="#extractbynumber">by number</a>
or
<a href="#extractbyname">by name,</a>
as described in sections below.
</P>
<P>
Alternatively, you can make direct use of the vector of PCRE2_SIZE values,
called the <b>ovector</b>, which contains the offsets of captured strings. It is
part of the
<a href="#matchdatablock">match data block.</a>
The function <b>pcre2_get_ovector_pointer()</b> returns the address of the
ovector, and <b>pcre2_get_ovector_count()</b> returns the number of pairs of
values it contains.
</P>
<P>
Within the ovector, the first in each pair of values is set to the offset of
the first code unit of a substring, and the second is set to the offset of the
first code unit after the end of a substring. These values are always code unit
offsets, not character offsets. That is, they are byte offsets in the 8-bit
library, 16-bit offsets in the 16-bit library, and 32-bit offsets in the 32-bit
library.
</P>
<P>
After a partial match (error return PCRE2_ERROR_PARTIAL), only the first pair
of offsets (that is, <i>ovector[0]</i> and <i>ovector[1]</i>) are set. They
identify the part of the subject that was partially matched. See the
<a href="pcre2partial.html"><b>pcre2partial</b></a>
documentation for details of partial matching.
</P>
<P>
After a successful match, the first pair of offsets identifies the portion of
the subject string that was matched by the entire pattern. The next pair is
used for the first capturing subpattern, and so on. The value returned by
<b>pcre2_match()</b> is one more than the highest numbered pair that has been
set. For example, if two substrings have been captured, the returned value is
3.
If there are no capturing subpatterns, the return value from a successful
match is 1, indicating that just the first pair of offsets has been set.
</P>
<P>
If a pattern uses the \K escape sequence within a positive assertion, the
reported start of a successful match can be greater than the end of the match.
For example, if the pattern (?=ab\K) is matched against "ab", the start and
end offset values for the match are 2 and 0.
</P>
<P>
If a capturing subpattern group is matched repeatedly within a single match
operation, it is the last portion of the subject that it matched that is
returned.
</P>
<P>
If the ovector is too small to hold all the captured substring offsets, as much
as possible is filled in, and the function returns a value of zero. If captured
substrings are not of interest, <b>pcre2_match()</b> may be called with a match
data block whose ovector is of minimum length (that is, one pair). However, if
the pattern contains back references and the <i>ovector</i> is not big enough to
remember the related substrings, PCRE2 has to get additional memory for use
during matching. Thus it is usually advisable to set up a match data block
containing an ovector of reasonable size.
</P>
<P>
It is possible for capturing subpattern number <i>n+1</i> to match some part of
the subject when subpattern <i>n</i> has not been used at all. For example, if
the string "abc" is matched against the pattern (a|(z))(bc) the return from the
function is 4, and subpatterns 1 and 3 are matched, but 2 is not. When this
happens, both values in the offset pairs corresponding to unused subpatterns
are set to PCRE2_UNSET.
</P>
<P>
Offset values that correspond to unused subpatterns at the end of the
expression are also set to PCRE2_UNSET. For example, if the string "abc" is
matched against the pattern (abc)(x(yz)?)?
subpatterns 2 and 3 are not matched.
The return from the function is 2, because the highest used capturing
subpattern number is 1. The offsets for the second and third capturing
subpatterns (assuming the vector is large enough, of course) are set to
PCRE2_UNSET.
</P>
<P>
Elements in the ovector that do not correspond to capturing parentheses in the
pattern are never changed. That is, if a pattern contains <i>n</i> capturing
parentheses, no more than <i>ovector[0]</i> to <i>ovector[2n+1]</i> are set by
<b>pcre2_match()</b>. The other elements retain whatever values they previously
had.
<a name="matchotherdata"></a></P>
<br><a name="SEC29" href="#TOC1">OTHER INFORMATION ABOUT A MATCH</a><br>
<P>
<b>PCRE2_SPTR pcre2_get_mark(pcre2_match_data *<i>match_data</i>);</b>
<br>
<br>
<b>PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *<i>match_data</i>);</b>
</P>
<P>
As well as the offsets in the ovector, other information about a match is
retained in the match data block and can be retrieved by the above functions in
appropriate circumstances. If they are called at other times, the result is
undefined.
</P>
<P>
After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a failure
to match (PCRE2_ERROR_NOMATCH), a (*MARK) name may be available, and
<b>pcre2_get_mark()</b> can be called. It returns a pointer to the
zero-terminated name, which is within the compiled pattern. Otherwise NULL is
returned. The length of the (*MARK) name (excluding the terminating zero) is
stored in the code unit that precedes the name. You should use this instead of
relying on the terminating zero if the (*MARK) name might contain a binary
zero.
</P>
<P>
After a successful match, the (*MARK) name that is returned is the
last one encountered on the matching path through the pattern.
After a "no
match" or a partial match, the last encountered (*MARK) name is returned. For
example, consider this pattern:
<pre>
  ^(*MARK:A)((*MARK:B)a|b)c
</pre>
When it matches "bc", the returned mark is A. The B mark is "seen" in the first
branch of the group, but it is not on the matching path. On the other hand,
when this pattern fails to match "bx", the returned mark is B.
</P>
<P>
After a successful match, a partial match, or one of the invalid UTF errors
(for example, PCRE2_ERROR_UTF8_ERR5), <b>pcre2_get_startchar()</b> can be
called. After a successful or partial match it returns the code unit offset of
the character at which the match started. For a non-partial match, this can be
different to the value of <i>ovector[0]</i> if the pattern contains the \K
escape sequence. After a partial match, however, this value is always the same
as <i>ovector[0]</i> because \K does not affect the result of a partial match.
</P>
<P>
After a UTF check failure, <b>pcre2_get_startchar()</b> can be used to obtain
the code unit offset of the invalid UTF character. Details are given in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
page.
<a name="errorlist"></a></P>
<br><a name="SEC30" href="#TOC1">ERROR RETURNS FROM <b>pcre2_match()</b></a><br>
<P>
If <b>pcre2_match()</b> fails, it returns a negative number. This can be
converted to a text string by calling the <b>pcre2_get_error_message()</b>
function (see "Obtaining a textual error message"
<a href="#geterrormessage">below).</a>
Negative error codes are also returned by other functions, and are documented
with them. The codes are given names in the header file. If UTF checking is in
force and an invalid UTF subject string is detected, one of a number of
UTF-specific negative error codes is returned.
Details are given in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
page. The following are the other errors that may be returned by
<b>pcre2_match()</b>:
<pre>
  PCRE2_ERROR_NOMATCH
</pre>
The subject string did not match the pattern.
<pre>
  PCRE2_ERROR_PARTIAL
</pre>
The subject string did not match, but it did match partially. See the
<a href="pcre2partial.html"><b>pcre2partial</b></a>
documentation for details of partial matching.
<pre>
  PCRE2_ERROR_BADMAGIC
</pre>
PCRE2 stores a 4-byte "magic number" at the start of the compiled code, to
catch the case when it is passed a junk pointer. This is the error that is
returned when the magic number is not present.
<pre>
  PCRE2_ERROR_BADMODE
</pre>
This error is given when a pattern that was compiled by the 8-bit library is
passed to a 16-bit or 32-bit library function, or vice versa.
<pre>
  PCRE2_ERROR_BADOFFSET
</pre>
The value of <i>startoffset</i> was greater than the length of the subject.
<pre>
  PCRE2_ERROR_BADOPTION
</pre>
An unrecognized bit was set in the <i>options</i> argument.
<pre>
  PCRE2_ERROR_BADUTFOFFSET
</pre>
The UTF code unit sequence that was passed as a subject was checked and found
to be valid (the PCRE2_NO_UTF_CHECK option was not set), but the value of
<i>startoffset</i> did not point to the beginning of a UTF character or the end
of the subject.
<pre>
  PCRE2_ERROR_CALLOUT
</pre>
This error is never generated by <b>pcre2_match()</b> itself. It is provided for
use by callout functions that want to cause <b>pcre2_match()</b> or
<b>pcre2_callout_enumerate()</b> to return a distinctive error code. See the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation for details.
<pre>
  PCRE2_ERROR_INTERNAL
</pre>
An unexpected internal error has occurred. This error could be caused by a bug
in PCRE2 or by overwriting of the compiled pattern.
<pre>
  PCRE2_ERROR_JIT_BADOPTION
</pre>
This error is returned when a pattern that was successfully processed by the
JIT compiler is being matched, but the matching mode (partial or complete
match) does not correspond to any JIT compilation mode. When the JIT fast path
function is used, this error may also be given for invalid options. See the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation for more details.
<pre>
  PCRE2_ERROR_JIT_STACKLIMIT
</pre>
This error is returned when a pattern that was successfully processed by the
JIT compiler is being matched, but the memory available for the just-in-time
processing stack is not large enough. See the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation for more details.
<pre>
  PCRE2_ERROR_MATCHLIMIT
</pre>
The backtracking limit was reached.
<pre>
  PCRE2_ERROR_NOMEMORY
</pre>
If a pattern contains back references, but the ovector is not big enough to
remember the referenced substrings, PCRE2 gets a block of memory at the start
of matching to use for this purpose. There are some other special cases where
extra memory is needed during matching. This error is given when memory cannot
be obtained.
<pre>
  PCRE2_ERROR_NULL
</pre>
Either the <i>code</i>, <i>subject</i>, or <i>match_data</i> argument was passed
as NULL.
<pre>
  PCRE2_ERROR_RECURSELOOP
</pre>
This error is returned when <b>pcre2_match()</b> detects a recursion loop within
the pattern. Specifically, it means that either the whole pattern or a
subpattern has been called recursively for the second time at the same position
in the subject string.
Some simple patterns that might do this are detected and
faulted at compile time, but more complicated cases, in particular mutual
recursions between two different subpatterns, cannot be detected until matching
is attempted.
<pre>
  PCRE2_ERROR_RECURSIONLIMIT
</pre>
The internal recursion limit was reached.
<a name="geterrormessage"></a></P>
<br><a name="SEC31" href="#TOC1">OBTAINING A TEXTUAL ERROR MESSAGE</a><br>
<P>
<b>int pcre2_get_error_message(int <i>errorcode</i>, PCRE2_UCHAR *<i>buffer</i>,</b>
<b>  PCRE2_SIZE <i>bufflen</i>);</b>
</P>
<P>
A text message for an error code from any PCRE2 function (compile, match, or
auxiliary) can be obtained by calling <b>pcre2_get_error_message()</b>. The code
is passed as the first argument, with the remaining two arguments specifying a
code unit buffer and its length, into which the text message is placed. Note
that the message is returned in code units of the appropriate width for the
library that is being used.
</P>
<P>
The returned message is terminated with a trailing zero, and the function
returns the number of code units used, excluding the trailing zero. If the
error number is unknown, the negative error code PCRE2_ERROR_BADDATA is
returned. If the buffer is too small, the message is truncated (but still with
a trailing zero), and the negative error code PCRE2_ERROR_NOMEMORY is returned.
None of the messages are very long; a buffer size of 120 code units is ample.
<a name="extractbynumber"></a></P>
<br><a name="SEC32" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a><br>
<P>
<b>int pcre2_substring_length_bynumber(pcre2_match_data *<i>match_data</i>,</b>
<b>  uint32_t <i>number</i>, PCRE2_SIZE *<i>length</i>);</b>
<br>
<br>
<b>int pcre2_substring_copy_bynumber(pcre2_match_data *<i>match_data</i>,</b>
<b>  uint32_t <i>number</i>, PCRE2_UCHAR *<i>buffer</i>,</b>
<b>  PCRE2_SIZE *<i>bufflen</i>);</b>
<br>
<br>
<b>int pcre2_substring_get_bynumber(pcre2_match_data *<i>match_data</i>,</b>
<b>  uint32_t <i>number</i>, PCRE2_UCHAR **<i>bufferptr</i>,</b>
<b>  PCRE2_SIZE *<i>bufflen</i>);</b>
<br>
<br>
<b>void pcre2_substring_free(PCRE2_UCHAR *<i>buffer</i>);</b>
</P>
<P>
Captured substrings can be accessed directly by using the ovector as described
<a href="#matchedstrings">above.</a>
For convenience, auxiliary functions are provided for extracting captured
substrings as new, separate, zero-terminated strings. A substring that contains
a binary zero is correctly extracted and has a further zero added on the end,
but the result is not, of course, a C string.
</P>
<P>
The functions in this section identify substrings by number. The number zero
refers to the entire matched substring, with higher numbers referring to
substrings captured by parenthesized groups. After a partial match, only
substring zero is available. An attempt to extract any other substring gives
the error PCRE2_ERROR_PARTIAL. The next section describes similar functions for
extracting captured substrings by name.
</P>
<P>
If a pattern uses the \K escape sequence within a positive assertion, the
reported start of a successful match can be greater than the end of the match.
For example, if the pattern (?=ab\K) is matched against "ab", the start and
end offset values for the match are 2 and 0. In this situation, calling these
functions with a zero substring number extracts a zero-length empty string.
</P>
<P>
You can find the length in code units of a captured substring without
extracting it by calling <b>pcre2_substring_length_bynumber()</b>. The first
argument is a pointer to the match data block, the second is the group number,
and the third is a pointer to a variable into which the length is placed. If
you just want to know whether or not the substring has been captured, you can
pass the third argument as NULL.
</P>
<P>
The <b>pcre2_substring_copy_bynumber()</b> function copies a captured substring
into a supplied buffer, whereas <b>pcre2_substring_get_bynumber()</b> copies it
into new memory, obtained using the same memory allocation function that was
used for the match data block. The first two arguments of these functions are a
pointer to the match data block and a capturing group number.
</P>
<P>
The final arguments of <b>pcre2_substring_copy_bynumber()</b> are a pointer to
the buffer and a pointer to a variable that contains its length in code units.
This is updated to contain the actual number of code units used for the
extracted substring, excluding the terminating zero.
</P>
<P>
For <b>pcre2_substring_get_bynumber()</b> the third and fourth arguments point
to variables that are updated with a pointer to the new memory and the number
of code units that comprise the substring, again excluding the terminating
zero. When the substring is no longer needed, the memory should be freed by
calling <b>pcre2_substring_free()</b>.
</P>
<P>
The return value from all these functions is zero for success, or a negative
error code.
If the pattern match failed, the match failure code is returned.
If a substring number greater than zero is used after a partial match,
PCRE2_ERROR_PARTIAL is returned. Other possible error codes are:
<pre>
  PCRE2_ERROR_NOMEMORY
</pre>
The buffer was too small for <b>pcre2_substring_copy_bynumber()</b>, or the
attempt to get memory failed for <b>pcre2_substring_get_bynumber()</b>.
<pre>
  PCRE2_ERROR_NOSUBSTRING
</pre>
There is no substring with that number in the pattern, that is, the number is
greater than the number of capturing parentheses.
<pre>
  PCRE2_ERROR_UNAVAILABLE
</pre>
The substring number, though not greater than the number of captures in the
pattern, is greater than the number of slots in the ovector, so the substring
could not be captured.
<pre>
  PCRE2_ERROR_UNSET
</pre>
The substring did not participate in the match. For example, if the pattern is
(abc)|(def) and the subject is "def", and the ovector contains at least two
capturing slots, substring number 1 is unset.
</P>
<br><a name="SEC33" href="#TOC1">EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS</a><br>
<P>
<b>int pcre2_substring_list_get(pcre2_match_data *<i>match_data</i>,</b>
<b>  PCRE2_UCHAR ***<i>listptr</i>, PCRE2_SIZE **<i>lengthsptr</i>);</b>
<br>
<br>
<b>void pcre2_substring_list_free(PCRE2_SPTR *<i>list</i>);</b>
</P>
<P>
The <b>pcre2_substring_list_get()</b> function extracts all available substrings
and builds a list of pointers to them. It also (optionally) builds a second
list that contains their lengths (in code units), excluding a terminating zero
that is added to each of them. All this is done in a single block of memory
that is obtained using the same memory allocation function that was used to get
the match data block.
</P>
<P>
This function must be called only after a successful match. If called after a
partial match, the error code PCRE2_ERROR_PARTIAL is returned.
</P>
<P>
The address of the memory block is returned via <i>listptr</i>, which is also
the start of the list of string pointers. The end of the list is marked by a
NULL pointer. The address of the list of lengths is returned via
<i>lengthsptr</i>. If your strings do not contain binary zeros and you do not
therefore need the lengths, you may supply NULL as the <i>lengthsptr</i>
argument to disable the creation of a list of lengths. The yield of the
function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the memory block
could not be obtained. When the list is no longer needed, it should be freed by
calling <b>pcre2_substring_list_free()</b>.
</P>
<P>
If this function encounters a substring that is unset, which can happen when
capturing subpattern number <i>n+1</i> matches some part of the subject, but
subpattern <i>n</i> has not been used at all, it returns an empty string. This
can be distinguished from a genuine zero-length substring by inspecting the
appropriate offset in the ovector, which contains PCRE2_UNSET for unset
substrings, or by calling <b>pcre2_substring_length_bynumber()</b>.
<a name="extractbyname"></a></P>
<br><a name="SEC34" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a><br>
<P>
<b>int pcre2_substring_number_from_name(const pcre2_code *<i>code</i>,</b>
<b>  PCRE2_SPTR <i>name</i>);</b>
<br>
<br>
<b>int pcre2_substring_length_byname(pcre2_match_data *<i>match_data</i>,</b>
<b>  PCRE2_SPTR <i>name</i>, PCRE2_SIZE *<i>length</i>);</b>
<br>
<br>
<b>int pcre2_substring_copy_byname(pcre2_match_data *<i>match_data</i>,</b>
<b>  PCRE2_SPTR <i>name</i>, PCRE2_UCHAR *<i>buffer</i>, PCRE2_SIZE *<i>bufflen</i>);</b>
<br>
<br>
<b>int pcre2_substring_get_byname(pcre2_match_data *<i>match_data</i>,</b>
<b>  PCRE2_SPTR <i>name</i>, PCRE2_UCHAR **<i>bufferptr</i>, PCRE2_SIZE *<i>bufflen</i>);</b>
<br>
<br>
<b>void pcre2_substring_free(PCRE2_UCHAR *<i>buffer</i>);</b>
</P>
<P>
To extract a substring by name, you first have to find the associated number.
For example, for this pattern:
<pre>
  (a+)b(?&lt;xxx&gt;\d+)...
</pre>
the number of the subpattern called "xxx" is 2. If the name is known to be
unique (PCRE2_DUPNAMES was not set), you can find the number from the name by
calling <b>pcre2_substring_number_from_name()</b>. The first argument is the
compiled pattern, and the second is the name. The yield of the function is the
subpattern number, PCRE2_ERROR_NOSUBSTRING if there is no subpattern of that
name, or PCRE2_ERROR_NOUNIQUESUBSTRING if there is more than one subpattern of
that name. Given the number, you can extract the substring directly, or use one
of the functions described above.
</P>
<P>
For convenience, there are also "byname" functions that correspond to the
"bynumber" functions, the only difference being that the second argument is a
name instead of a number.
If PCRE2_DUPNAMES is set and there are duplicate
names, these functions scan all the groups with the given name, and return the
first named string that is set.
</P>
<P>
If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is
returned. If all groups with the name have numbers that are greater than the
number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is returned. If there
is at least one group with a slot in the ovector, but no group is found to be
set, PCRE2_ERROR_UNSET is returned.
</P>
<P>
<b>Warning:</b> If the pattern uses the (?| feature to set up multiple
subpatterns with the same number, as described in the
<a href="pcre2pattern.html#dupsubpatternnumber">section on duplicate subpattern numbers</a>
in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page, you cannot use names to distinguish the different subpatterns, because
names are not included in the compiled code. The matching process uses only
numbers. For this reason, the use of different names for subpatterns of the
same number causes an error at compile time.
</P>
<br><a name="SEC35" href="#TOC1">CREATING A NEW STRING WITH SUBSTITUTIONS</a><br>
<P>
<b>int pcre2_substitute(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b>  PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
<b>  uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
<b>  pcre2_match_context *<i>mcontext</i>, PCRE2_SPTR <i>replacement</i>,</b>
<b>  PCRE2_SIZE <i>rlength</i>, PCRE2_UCHAR *<i>outputbuffer</i>,</b>
<b>  PCRE2_SIZE *<i>outlengthptr</i>);</b>
</P>
<P>
This function calls <b>pcre2_match()</b> and then makes a copy of the subject
string in <i>outputbuffer</i>, replacing the part that was matched with the
<i>replacement</i> string, whose length is supplied in <i>rlength</i>.
This can
be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. Matches in
which a \K item in a lookahead in the pattern causes the match to end before
it starts are not supported, and give rise to an error return.
</P>
<P>
The first seven arguments of <b>pcre2_substitute()</b> are the same as for
<b>pcre2_match()</b>, except that the partial matching options are not
permitted, and <i>match_data</i> may be passed as NULL, in which case a match
data block is obtained and freed within this function, using memory management
functions from the match context, if provided, or else those that were used to
allocate memory for the compiled code.
</P>
<P>
The <i>outlengthptr</i> argument must point to a variable that contains the
length, in code units, of the output buffer. If the function is successful, the
value is updated to contain the length of the new string, excluding the
trailing zero that is automatically added.
</P>
<P>
If the function is not successful, the value set via <i>outlengthptr</i> depends
on the type of error. For syntax errors in the replacement string, the value is
the offset in the replacement string where the error was detected. For other
errors, the value is PCRE2_UNSET by default. This includes the case of the
output buffer being too small, unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set
(see below), in which case the value is the minimum length needed, including
space for the trailing zero. Note that in order to compute the required length,
<b>pcre2_substitute()</b> has to simulate all the matching and copying, instead
of giving an error return as soon as the buffer overflows. Note also that the
length is in code units, not bytes.
</P>
<P>
In the replacement string, which is interpreted as a UTF string in UTF mode,
and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK option is set, a
dollar character is an escape character that can specify the insertion of
characters from capturing groups or (*MARK) items in the pattern. The following
forms are always recognized:
<pre>
  $$                  insert a dollar character
  $&lt;n&gt; or ${&lt;n&gt;}      insert the contents of group &lt;n&gt;
  $*MARK or ${*MARK}  insert the name of the last (*MARK) encountered
</pre>
Either a group number or a group name can be given for &lt;n&gt;. Curly brackets are
required only if the following character would be interpreted as part of the
number or name. The number may be zero to include the entire matched string.
For example, if the pattern a(b)c is matched with "=abc=" and the replacement
string "+$1$0$1+", the result is "=+babcb+=".
</P>
<P>
The facility for inserting a (*MARK) name can be used to perform simple
simultaneous substitutions, as this <b>pcre2test</b> example shows:
<pre>
  /(*:pear)apple|(*:orange)lemon/g,replace=${*MARK}
      apple lemon
   2: pear orange
</pre>
As well as the usual options for <b>pcre2_match()</b>, a number of additional
options can be set in the <i>options</i> argument.
</P>
<P>
PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject string,
replacing every matching substring. If this is not set, only the first matching
substring is replaced. If any matched substring has zero length, after the
substitution has happened, an attempt to find a non-empty match at the same
position is performed. If this is not successful, the current position is
advanced by one character except when CRLF is a valid newline sequence and the
next two characters are CR, LF. In this case, the current position is advanced
by two characters.
</P>
<P>
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output buffer is
too small. The default action is to return PCRE2_ERROR_NOMEMORY immediately. If
this option is set, however, <b>pcre2_substitute()</b> continues to go through
the motions of matching and substituting (without, of course, writing anything)
in order to compute the size of buffer that is needed. This value is passed
back via the <i>outlengthptr</i> variable, with the result of the function still
being PCRE2_ERROR_NOMEMORY.
</P>
<P>
Passing a buffer size of zero is a permitted way of finding out how much memory
is needed for a given substitution. However, this does mean that the entire
operation is carried out twice. Depending on the application, it may be more
efficient to allocate a large buffer and free the excess afterwards, instead of
using PCRE2_SUBSTITUTE_OVERFLOW_LENGTH.
</P>
<P>
PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capturing groups that do
not appear in the pattern to be treated as unset groups. This option should be
used with care, because it means that a typo in a group name or number no
longer causes the PCRE2_ERROR_NOSUBSTRING error.
</P>
<P>
PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capturing groups (including unknown
groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be treated as empty
strings when inserted as described above. If this option is not set, an attempt
to insert an unset group causes the PCRE2_ERROR_UNSET error. This option does
not influence the extended substitution syntax described below.
</P>
<P>
PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the
replacement string. Without this option, only the dollar character is special,
and only the group insertion forms listed above are valid.
When
PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
</P>
<P>
Firstly, backslash in a replacement string is interpreted as an escape
character. The usual forms such as \n or \x{ddd} can be used to specify
particular character codes, and backslash followed by any non-alphanumeric
character quotes that character. Extended quoting can be coded using \Q...\E,
exactly as in pattern strings.
</P>
<P>
There are also four escape sequences for forcing the case of inserted letters.
The insertion mechanism has three states: no case forcing, force upper case,
and force lower case. The escape sequences change the current state: \U and
\L change to upper or lower case forcing, respectively, and \E (when not
terminating a \Q quoted sequence) reverts to no case forcing. The sequences
\u and \l force the next character (if it is a letter) to upper or lower
case, respectively, and then the state automatically reverts to no case
forcing. Case forcing applies to all inserted characters, including those from
captured groups and letters within \Q...\E quoted sequences.
</P>
<P>
Note that case forcing sequences such as \U...\E do not nest. For example,
the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final \E has no
effect.
</P>
<P>
The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
flexibility to group substitution. The syntax is similar to that used by Bash:
<pre>
  ${&lt;n&gt;:-&lt;string&gt;}
  ${&lt;n&gt;:+&lt;string1&gt;:&lt;string2&gt;}
</pre>
As before, &lt;n&gt; may be a group number or a name. The first form specifies a
default value. If group &lt;n&gt; is set, its value is inserted; if not, &lt;string&gt; is
expanded and the result inserted. The second form specifies strings that are
expanded and inserted when group &lt;n&gt; is set or unset, respectively.
The first
form is just a convenient shorthand for
<pre>
  ${&lt;n&gt;:+${&lt;n&gt;}:&lt;string&gt;}
</pre>
Backslash can be used to escape colons and closing curly brackets in the
replacement strings. A change of the case forcing state within a replacement
string remains in force afterwards, as shown in this <b>pcre2test</b> example:
<pre>
  /(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo
      body
   1: hello
      somebody
   1: HELLO
</pre>
The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended
substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause unknown
groups in the extended syntax forms to be treated as unset.
</P>
<P>
If successful, <b>pcre2_substitute()</b> returns the number of replacements that
were made. This may be zero if no matches were found, and is never greater than
1 unless PCRE2_SUBSTITUTE_GLOBAL is set.
</P>
<P>
In the event of an error, a negative error code is returned. Except for
PCRE2_ERROR_NOMATCH (which is never returned), errors from <b>pcre2_match()</b>
are passed straight back.
</P>
<P>
PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring insertion,
unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.
</P>
<P>
PCRE2_ERROR_UNSET is returned for an unset substring insertion (including an
unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) when the simple
(non-extended) syntax is used and PCRE2_SUBSTITUTE_UNSET_EMPTY is not set.
</P>
<P>
PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big enough. If the
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size of buffer that is
needed is returned via <i>outlengthptr</i>. Note that this does not happen by
default.
</P>
<P>
PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in the
replacement string, with more particular errors being PCRE2_ERROR_BADREPESCAPE
(invalid escape sequence), PCRE2_ERROR_REPMISSINGBRACE (closing curly bracket
not found), PCRE2_ERROR_BADSUBSTITUTION (syntax error in extended group
substitution), and PCRE2_ERROR_BADSUBSPATTERN (the pattern match ended before it
started, which can happen if \K is used in an assertion).
</P>
<P>
As for all PCRE2 errors, a text message that describes the error can be
obtained by calling the <b>pcre2_get_error_message()</b> function (see
"Obtaining a textual error message"
<a href="#geterrormessage">above).</a>
</P>
<br><a name="SEC36" href="#TOC1">DUPLICATE SUBPATTERN NAMES</a><br>
<P>
<b>int pcre2_substring_nametable_scan(const pcre2_code *<i>code</i>,</b>
<b>  PCRE2_SPTR <i>name</i>, PCRE2_SPTR *<i>first</i>, PCRE2_SPTR *<i>last</i>);</b>
</P>
<P>
When a pattern is compiled with the PCRE2_DUPNAMES option, names for
subpatterns are not required to be unique. Duplicate names are always allowed
for subpatterns with the same number, created by using the (?| feature. Indeed,
if such subpatterns are named, they are required to use the same names.
</P>
<P>
Normally, patterns with duplicate names are such that in any one match, only
one of the named subpatterns participates. An example is shown in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation.
</P>
<P>
When duplicates are present, <b>pcre2_substring_copy_byname()</b> and
<b>pcre2_substring_get_byname()</b> return the first substring corresponding to
the given name that is set. Only if none are set is PCRE2_ERROR_UNSET
returned. The <b>pcre2_substring_number_from_name()</b> function returns the
error PCRE2_ERROR_NOUNIQUESUBSTRING when there are duplicate names.
</P>
<P>
If you want to get full details of all captured substrings for a given name,
you must use the <b>pcre2_substring_nametable_scan()</b> function. The first
argument is the compiled pattern, and the second is the name. If the third and
fourth arguments are NULL, the function returns a group number for a unique
name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.
</P>
<P>
When the third and fourth arguments are not NULL, they must be pointers to
variables that are updated by the function. After it has run, they point to the
first and last entries in the name-to-number table for the given name, and the
function returns the length of each entry in code units. In both cases,
PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name.
</P>
<P>
The format of the name table is described
<a href="#infoaboutpattern">above</a>
in the section entitled <i>Information about a compiled pattern</i>. Given all
the relevant entries for the name, you can extract each of their numbers, and
hence the captured data.
</P>
<br><a name="SEC37" href="#TOC1">FINDING ALL POSSIBLE MATCHES AT ONE POSITION</a><br>
<P>
The traditional matching function uses a similar algorithm to Perl, which stops
when it finds the first match at a given point in the subject. If you want to
find all possible matches, or the longest possible match at a given position,
consider using the alternative matching function (see below) instead. If you
cannot use the alternative function, you can kludge it up by making use of the
callout facility, which is described in the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation.
</P>
<P>
What you have to do is to insert a callout right at the end of the pattern.
When your callout function is called, extract and save the current matched
substring.
Then return 1, which forces <b>pcre2_match()</b> to backtrack and try
other alternatives. Ultimately, when it runs out of matches,
<b>pcre2_match()</b> will yield PCRE2_ERROR_NOMATCH.
<a name="dfamatch"></a></P>
<br><a name="SEC38" href="#TOC1">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a><br>
<P>
<b>int pcre2_dfa_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b>  PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
<b>  uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
<b>  pcre2_match_context *<i>mcontext</i>,</b>
<b>  int *<i>workspace</i>, PCRE2_SIZE <i>wscount</i>);</b>
</P>
<P>
The function <b>pcre2_dfa_match()</b> is called to match a subject string
against a compiled pattern, using a matching algorithm that scans the subject
string just once, and does not backtrack. This has different characteristics to
the normal algorithm, and is not compatible with Perl. Some of the features of
PCRE2 patterns are not supported. Nevertheless, there are times when this kind
of matching can be useful. For a discussion of the two matching algorithms, and
a list of features that <b>pcre2_dfa_match()</b> does not support, see the
<a href="pcre2matching.html"><b>pcre2matching</b></a>
documentation.
</P>
<P>
The arguments for the <b>pcre2_dfa_match()</b> function are the same as for
<b>pcre2_match()</b>, plus two extras. The ovector within the match data block
is used in a different way, and this is described below. The other common
arguments are used in the same way as for <b>pcre2_match()</b>, so their
description is not repeated here.
</P>
<P>
The two additional arguments provide workspace for the function. The workspace
vector should contain at least 20 elements. It is used for keeping track of
multiple paths through the pattern tree.
More workspace is needed for patterns
and subjects where there are a lot of potential matches.
</P>
<P>
Here is an example of a simple call to <b>pcre2_dfa_match()</b>:
<pre>
  int wspace[20];
  pcre2_match_data *md = pcre2_match_data_create(4, NULL);
  int rc = pcre2_dfa_match(
    re,                         /* result of pcre2_compile() */
    (PCRE2_SPTR)"some string",  /* the subject string */
    11,                         /* the length of the subject string */
    0,                          /* start at offset 0 in the subject */
    0,                          /* default options */
    md,                         /* the match data block */
    NULL,                       /* a match context; NULL means use defaults */
    wspace,                     /* working space vector */
    20);                        /* number of elements (NOT size in bytes) */
</pre>
</P>
<br><b>
Option bits for <b>pcre2_dfa_match()</b>
</b><br>
<P>
The unused bits of the <i>options</i> argument for <b>pcre2_dfa_match()</b> must
be zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL,
PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK,
PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and
PCRE2_DFA_RESTART. All but the last four of these are exactly the same as for
<b>pcre2_match()</b>, so their description is not repeated here.
<pre>
  PCRE2_PARTIAL_HARD
  PCRE2_PARTIAL_SOFT
</pre>
These have the same general effect as they do for <b>pcre2_match()</b>, but the
details are slightly different. When PCRE2_PARTIAL_HARD is set for
<b>pcre2_dfa_match()</b>, it returns PCRE2_ERROR_PARTIAL if the end of the
subject is reached and there is still at least one matching possibility that
requires additional characters. This happens even if some complete matches have
already been found.
When PCRE2_PARTIAL_SOFT is set, the return code
PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL if the end of the
subject is reached, there have been no complete matches, but there is still at
least one matching possibility. The portion of the string that was inspected
when the longest partial match was found is set as the first matching string in
both cases. There is a more detailed discussion of partial and multi-segment
matching, with examples, in the
<a href="pcre2partial.html"><b>pcre2partial</b></a>
documentation.
<pre>
  PCRE2_DFA_SHORTEST
</pre>
Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to stop as
soon as it has found one match. Because of the way the alternative algorithm
works, this is necessarily the shortest possible match at the first possible
matching point in the subject string.
<pre>
  PCRE2_DFA_RESTART
</pre>
When <b>pcre2_dfa_match()</b> returns a partial match, it is possible to call it
again, with additional subject characters, and have it continue with the same
match. The PCRE2_DFA_RESTART option requests this action; when it is set, the
<i>workspace</i> and <i>wscount</i> arguments must reference the same vector as
before because data about the match so far is left in them after a partial
match. There is more discussion of this facility in the
<a href="pcre2partial.html"><b>pcre2partial</b></a>
documentation.
</P>
<br><b>
Successful returns from <b>pcre2_dfa_match()</b>
</b><br>
<P>
When <b>pcre2_dfa_match()</b> succeeds, it may have matched more than one
substring in the subject. Note, however, that all the matches from one run of
the function start at the same point in the subject. The shorter matches are
all initial substrings of the longer matches.
For example, if the pattern
<pre>
  &lt;.*&gt;
</pre>
is matched against the string
<pre>
  This is &lt;something&gt; &lt;something else&gt; &lt;something further&gt; no more
</pre>
the three matched strings are
<pre>
  &lt;something&gt; &lt;something else&gt; &lt;something further&gt;
  &lt;something&gt; &lt;something else&gt;
  &lt;something&gt;
</pre>
On success, the yield of the function is a number greater than zero, which is
the number of matched substrings. The offsets of the substrings are returned in
the ovector, and can be extracted by number in the same way as for
<b>pcre2_match()</b>, but the numbers bear no relation to any capturing groups
that may exist in the pattern, because DFA matching does not support group
capture.
</P>
<P>
Calls to the convenience functions that extract substrings by name
return the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used after a
DFA match. The convenience functions that extract substrings by number never
return PCRE2_ERROR_NOSUBSTRING, and the meanings of some other errors are
slightly different:
<pre>
  PCRE2_ERROR_UNAVAILABLE
</pre>
The ovector is not big enough to include a slot for the given substring number.
<pre>
  PCRE2_ERROR_UNSET
</pre>
There is a slot in the ovector for this substring, but there were insufficient
matches to fill it.
</P>
<P>
The matched strings are stored in the ovector in reverse order of length; that
is, the longest matching string is first. If there were too many matches to fit
into the ovector, the yield of the function is zero, and the vector is filled
with the longest matches.
</P>
<P>
NOTE: PCRE2's "auto-possessification" optimization usually applies to character
repeats at the end of a pattern (as well as internally). For example, the
pattern "a\d+" is compiled as if it were "a\d++".
For DFA matching, this
means that only one possible match is found. If you really do want multiple
matches in such cases, either use an ungreedy repeat such as "a\d+?" or set
the PCRE2_NO_AUTO_POSSESS option when compiling.
</P>
<br><b>
Error returns from <b>pcre2_dfa_match()</b>
</b><br>
<P>
The <b>pcre2_dfa_match()</b> function returns a negative number when it fails.
Many of the errors are the same as for <b>pcre2_match()</b>, as described
<a href="#errorlist">above.</a>
There are in addition the following errors that are specific to
<b>pcre2_dfa_match()</b>:
<pre>
  PCRE2_ERROR_DFA_UITEM
</pre>
This return is given if <b>pcre2_dfa_match()</b> encounters an item in the
pattern that it does not support, for instance, the use of \C in UTF mode or
a back reference.
<pre>
  PCRE2_ERROR_DFA_UCOND
</pre>
This return is given if <b>pcre2_dfa_match()</b> encounters a condition item
that uses a back reference for the condition, or a test for recursion in a
specific group. These are not supported.
<pre>
  PCRE2_ERROR_DFA_WSSIZE
</pre>
This return is given if <b>pcre2_dfa_match()</b> runs out of space in the
<i>workspace</i> vector.
<pre>
  PCRE2_ERROR_DFA_RECURSE
</pre>
When a recursive subpattern is processed, the matching function calls itself
recursively, using private memory for the ovector and <i>workspace</i>. This
error is given if the internal ovector is not large enough. This should be
extremely rare, as a vector of size 1000 is used.
<pre>
  PCRE2_ERROR_DFA_BADRESTART
</pre>
When <b>pcre2_dfa_match()</b> is called with the <b>PCRE2_DFA_RESTART</b> option,
some plausibility checks are made on the contents of the workspace, which
should contain data about the previous partial match. If any of these checks
fail, this error is given.
</P>
<br><a name="SEC39" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2build</b>(3), <b>pcre2callout</b>(3), <b>pcre2demo</b>(3),
<b>pcre2matching</b>(3), <b>pcre2partial</b>(3), <b>pcre2posix</b>(3),
<b>pcre2sample</b>(3), <b>pcre2stack</b>(3), <b>pcre2unicode</b>(3).
</P>
<br><a name="SEC40" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><a name="SEC41" href="#TOC1">REVISION</a><br>
<P>
Last updated: 17 June 2016
<br>
Copyright &copy; 1997-2016 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>