This is Info file flex.info, produced by Makeinfo-1.55 from the input
file flex.texi.

START-INFO-DIR-ENTRY
* Flex: (flex).         A fast scanner generator.
END-INFO-DIR-ENTRY

This file documents Flex.

Copyright (c) 1990 The Regents of the University of California. All
rights reserved.

This code is derived from software contributed to Berkeley by Vern
Paxson.

The United States Government has rights in this work pursuant to
contract no. DE-AC03-76SF00098 between the United States Department of
Energy and the University of California.

Redistribution and use in source and binary forms with or without
modification are permitted provided that: (1) source distributions
retain this entire copyright notice and comment, and (2) distributions
including binaries display the following acknowledgement: "This
product includes software developed by the University of California,
Berkeley and its contributors" in the documentation or other materials
provided with the distribution and in all advertising materials
mentioning features or use of this software. Neither the name of the
University nor the names of its contributors may be used to endorse or
promote products derived from this software without specific prior
written permission.

THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.


File: flex.info,  Node: Top,  Next: Name,  Prev: (dir),  Up: (dir)

flex
****

This manual documents `flex'. It covers release 2.5.

* Menu:

* Name::                  Name
* Synopsis::              Synopsis
* Overview::              Overview
* Description::           Description
* Examples::              Some simple examples
* Format::                Format of the input file
* Patterns::              Patterns
* Matching::              How the input is matched
* Actions::               Actions
* Generated scanner::     The generated scanner
* Start conditions::      Start conditions
* Multiple buffers::      Multiple input buffers
* End-of-file rules::     End-of-file rules
* Miscellaneous::         Miscellaneous macros
* User variables::        Values available to the user
* YACC interface::        Interfacing with `yacc'
* Options::               Options
* Performance::           Performance considerations
* C++::                   Generating C++ scanners
* Incompatibilities::     Incompatibilities with `lex' and POSIX
* Diagnostics::           Diagnostics
* Files::                 Files
* Deficiencies::          Deficiencies / Bugs
* See also::              See also
* Author::                Author


File: flex.info,  Node: Name,  Next: Synopsis,  Prev: Top,  Up: Top

Name
====

flex - fast lexical analyzer generator


File: flex.info,  Node: Synopsis,  Next: Overview,  Prev: Name,  Up: Top

Synopsis
========

     flex [-bcdfhilnpstvwBFILTV78+? -C[aefFmr] -ooutput -Pprefix -Sskeleton]
     [--help --version] [FILENAME ...]


File: flex.info,  Node: Overview,  Next: Description,  Prev: Synopsis,  Up: Top

Overview
========

This manual describes `flex', a tool for generating programs that
perform pattern-matching on text.
The manual includes both tutorial
and reference sections:

Description
     a brief overview of the tool

Some Simple Examples
Format Of The Input File
Patterns
     the extended regular expressions used by flex

How The Input Is Matched
     the rules for determining what has been matched

Actions
     how to specify what to do when a pattern is matched

The Generated Scanner
     details regarding the scanner that flex produces; how to control
     the input source

Start Conditions
     introducing context into your scanners, and managing
     "mini-scanners"

Multiple Input Buffers
     how to manipulate multiple input sources; how to scan from strings
     instead of files

End-of-file Rules
     special rules for matching the end of the input

Miscellaneous Macros
     a summary of macros available to the actions

Values Available To The User
     a summary of values available to the actions

Interfacing With Yacc
     connecting flex scanners together with yacc parsers

Options
     flex command-line options, and the "%option" directive

Performance Considerations
     how to make your scanner go as fast as possible

Generating C++ Scanners
     the (experimental) facility for generating C++ scanner classes

Incompatibilities With Lex And POSIX
     how flex differs from AT&T lex and the POSIX lex standard

Diagnostics
     those error messages produced by flex (or scanners it generates)
     whose meanings might not be apparent

Files
     files used by flex

Deficiencies / Bugs
     known problems with flex

See Also
     other documentation, related tools

Author
     includes contact information


File: flex.info,  Node: Description,  Next: Examples,  Prev: Overview,  Up: Top

Description
===========

`flex' is a tool for generating "scanners": programs which
recognize lexical patterns in text.
`flex' reads the given input
files, or its standard input if no file names are given, for a
description of a scanner to generate. The description is in the form
of pairs of regular expressions and C code, called "rules". `flex'
generates as output a C source file, `lex.yy.c', which defines a
routine `yylex()'. This file is compiled and linked with the `-lfl'
library to produce an executable. When the executable is run, it
analyzes its input for occurrences of the regular expressions.
Whenever it finds one, it executes the corresponding C code.


File: flex.info,  Node: Examples,  Next: Format,  Prev: Description,  Up: Top

Some simple examples
====================

First some simple examples to get the flavor of how one uses `flex'.
The following `flex' input specifies a scanner which, whenever it
encounters the string "username", will replace it with the user's login
name:

     %%
     username    printf( "%s", getlogin() );

By default, any text not matched by a `flex' scanner is copied to
the output, so the net effect of this scanner is to copy its input file
to its output with each occurrence of "username" expanded. In this
input, there is just one rule. "username" is the PATTERN and the
"printf" is the ACTION. The "%%" marks the beginning of the rules.

Here's another simple example:

             int num_lines = 0, num_chars = 0;

     %%
     \n      ++num_lines; ++num_chars;
     .       ++num_chars;

     %%
     int main( void )
             {
             yylex();
             printf( "# of lines = %d, # of chars = %d\n",
                     num_lines, num_chars );
             return 0;
             }

This scanner counts the number of characters and the number of lines
in its input (it produces no output other than the final report on the
counts). The first line declares two globals, "num_lines" and
"num_chars", which are accessible both inside `yylex()' and in the
`main()' routine declared after the second "%%".
There are two rules,
one which matches a newline ("\n") and increments both the line count
and the character count, and one which matches any character other than
a newline (indicated by the "." regular expression).

A somewhat more complicated example:

     /* scanner for a toy Pascal-like language */

     %{
     /* need this for the calls to atoi() and atof() below */
     #include <stdlib.h>
     %}

     DIGIT    [0-9]
     ID       [a-z][a-z0-9]*

     %%

     {DIGIT}+    {
                 printf( "An integer: %s (%d)\n", yytext,
                         atoi( yytext ) );
                 }

     {DIGIT}+"."{DIGIT}*        {
                 printf( "A float: %s (%g)\n", yytext,
                         atof( yytext ) );
                 }

     if|then|begin|end|procedure|function        {
                 printf( "A keyword: %s\n", yytext );
                 }

     {ID}        printf( "An identifier: %s\n", yytext );

     "+"|"-"|"*"|"/"   printf( "An operator: %s\n", yytext );

     "{"[^}\n]*"}"     /* eat up one-line comments */

     [ \t\n]+          /* eat up whitespace */

     .           printf( "Unrecognized character: %s\n", yytext );

     %%

     int main( int argc, char **argv )
         {
         ++argv, --argc;  /* skip over program name */
         if ( argc > 0 )
                 yyin = fopen( argv[0], "r" );
         else
                 yyin = stdin;

         yylex();
         return 0;
         }

This is the beginnings of a simple scanner for a language like
Pascal. It identifies different types of TOKENS and reports on what it
has seen.

The details of this example will be explained in the following
sections.
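
For comparison, the line-and-character counter shown earlier in this
section can be written by hand in plain C. The sketch below is not what
`flex' generates; it only mirrors the two rules, and it assumes the input
is already in memory as a NUL-terminated string. The names `struct counts'
and `count_text' are made up for this illustration.

```c
/* Totals corresponding to num_lines and num_chars in the flex example. */
struct counts { int lines; int chars; };

/* Hand-written equivalent of the two rules:
 *     \n    ++num_lines; ++num_chars;
 *     .     ++num_chars;
 * Assumption for this sketch: text is a NUL-terminated string. */
struct counts count_text(const char *text)
{
    struct counts c = {0, 0};

    for (; *text != '\0'; ++text) {
        if (*text == '\n')
            ++c.lines;   /* only the newline rule bumps the line count */
        ++c.chars;       /* both rules bump the character count */
    }
    return c;
}
```

The `flex' version has the advantage that adding a third kind of token
means adding one more rule rather than restructuring the loop.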


File: flex.info,  Node: Format,  Next: Patterns,  Prev: Examples,  Up: Top

Format of the input file
========================

The `flex' input file consists of three sections, separated by a
line with just `%%' in it:

     definitions
     %%
     rules
     %%
     user code

The "definitions" section contains declarations of simple "name"
definitions to simplify the scanner specification, and declarations of
"start conditions", which are explained in a later section. Name
definitions have the form:

     name definition

The "name" is a word beginning with a letter or an underscore ('_')
followed by zero or more letters, digits, '_', or '-' (dash). The
definition is taken to begin at the first non-white-space character
following the name and continuing to the end of the line. The
definition can subsequently be referred to using "{name}", which will
expand to "(definition)". For example,

     DIGIT    [0-9]
     ID       [a-z][a-z0-9]*

defines "DIGIT" to be a regular expression which matches a single
digit, and "ID" to be a regular expression which matches a letter
followed by zero-or-more letters-or-digits. A subsequent reference to

     {DIGIT}+"."{DIGIT}*

is identical to

     ([0-9])+"."([0-9])*

and matches one-or-more digits followed by a '.' followed by
zero-or-more digits.

The RULES section of the `flex' input contains a series of rules of
the form:

     pattern   action

where the pattern must be unindented and the action must begin on the
same line.

See below for a further description of patterns and actions.

Finally, the user code section is simply copied to `lex.yy.c'
verbatim. It is used for companion routines which call or are called
by the scanner. The presence of this section is optional; if it is
missing, the second `%%' in the input file may be skipped, too.

In the definitions and rules sections, any *indented* text or text
enclosed in `%{' and `%}' is copied verbatim to the output (with the
`%{}''s removed). The `%{}''s must appear unindented on lines by
themselves.

In the rules section, any indented or %{} text appearing before the
first rule may be used to declare variables which are local to the
scanning routine and (after the declarations) code which is to be
executed whenever the scanning routine is entered. Other indented or
%{} text in the rule section is still copied to the output, but its
meaning is not well-defined and it may well cause compile-time errors
(this feature is present for `POSIX' compliance; see below for other
such features).

In the definitions section (but not in the rules section), an
unindented comment (i.e., a line beginning with "/*") is also copied
verbatim to the output up to the next "*/".


File: flex.info,  Node: Patterns,  Next: Matching,  Prev: Format,  Up: Top

Patterns
========

The patterns in the input are written using an extended set of
regular expressions. These are:

`x'
     match the character `x'

`.'
     any character (byte) except newline

`[xyz]'
     a "character class"; in this case, the pattern matches either an
     `x', a `y', or a `z'

`[abj-oZ]'
     a "character class" with a range in it; matches an `a', a `b', any
     letter from `j' through `o', or a `Z'

`[^A-Z]'
     a "negated character class", i.e., any character but those in the
     class. In this case, any character EXCEPT an uppercase letter.

`[^A-Z\n]'
     any character EXCEPT an uppercase letter or a newline

`R*'
     zero or more R's, where R is any regular expression

`R+'
     one or more R's

`R?'
     zero or one R's (that is, "an optional R")

`R{2,5}'
     anywhere from two to five R's

`R{2,}'
     two or more R's

`R{4}'
     exactly 4 R's

`{NAME}'
     the expansion of the "NAME" definition (see above)

`"[xyz]\"foo"'
     the literal string: `[xyz]"foo'

`\X'
     if X is an `a', `b', `f', `n', `r', `t', or `v', then the ANSI-C
     interpretation of \X. Otherwise, a literal `X' (used to escape
     operators such as `*')

`\0'
     a NUL character (ASCII code 0)

`\123'
     the character with octal value 123

`\x2a'
     the character with hexadecimal value `2a'

`(R)'
     match an R; parentheses are used to override precedence (see below)

`RS'
     the regular expression R followed by the regular expression S;
     called "concatenation"

`R|S'
     either an R or an S

`R/S'
     an R but only if it is followed by an S. The text matched by S is
     included when determining whether this rule is the "longest
     match", but is then returned to the input before the action is
     executed. So the action only sees the text matched by R. This
     type of pattern is called "trailing context". (There are some
     combinations of `R/S' that `flex' cannot match correctly; see
     notes in the Deficiencies / Bugs section below regarding
     "dangerous trailing context".)

`^R'
     an R, but only at the beginning of a line (i.e., when just
     starting to scan, or right after a newline has been scanned).

`R$'
     an R, but only at the end of a line (i.e., just before a newline).
     Equivalent to "R/\n".

     Note that flex's notion of "newline" is exactly whatever the C
     compiler used to compile flex interprets '\n' as; in particular,
     on some DOS systems you must either filter out \r's in the input
     yourself, or explicitly use R/\r\n for "R$".

`<S>R'
     an R, but only in start condition S (see below for discussion of
     start conditions)

`<S1,S2,S3>R'
     same, but in any of start conditions S1, S2, or S3

`<*>R'
     an R in any start condition, even an exclusive one.

`<<EOF>>'
     an end-of-file

`<S1,S2><<EOF>>'
     an end-of-file when in start condition S1 or S2

Note that inside of a character class, all regular expression
operators lose their special meaning except escape ('\') and the
character class operators, '-', ']', and, at the beginning of the
class, '^'.

The regular expressions listed above are grouped according to
precedence, from highest precedence at the top to lowest at the bottom.
Those grouped together have equal precedence. For example,

     foo|bar*

is the same as

     (foo)|(ba(r*))

since the '*' operator has higher precedence than concatenation, and
concatenation higher than alternation ('|'). This pattern therefore
matches *either* the string "foo" *or* the string "ba" followed by
zero-or-more r's. To match "foo" or zero-or-more "bar"'s, use:

     foo|(bar)*

and to match zero-or-more "foo"'s-or-"bar"'s:

     (foo|bar)*

In addition to characters and ranges of characters, character
classes can also contain character class "expressions". These are
expressions enclosed inside `[:' and `:]' delimiters (which themselves
must appear between the '[' and ']' of the character class; other
elements may occur inside the character class, too). The valid
expressions are:

     [:alnum:] [:alpha:] [:blank:]
     [:cntrl:] [:digit:] [:graph:]
     [:lower:] [:print:] [:punct:]
     [:space:] [:upper:] [:xdigit:]

These expressions all designate a set of characters equivalent to
the corresponding standard C `isXXX' function. For example,
`[:alnum:]' designates those characters for which `isalnum()' returns
true - i.e., any alphabetic or numeric.
Some systems don't provide
`isblank()', so flex defines `[:blank:]' as a blank or a tab.

For example, the following character classes are all equivalent:

     [[:alnum:]]
     [[:alpha:][:digit:]]
     [[:alpha:]0-9]
     [a-zA-Z0-9]

If your scanner is case-insensitive (the `-i' flag), then
`[:upper:]' and `[:lower:]' are equivalent to `[:alpha:]'.

Some notes on patterns:

   - A negated character class such as the example "[^A-Z]" above *will
     match a newline* unless "\n" (or an equivalent escape sequence) is
     one of the characters explicitly present in the negated character
     class (e.g., "[^A-Z\n]"). This is unlike how many other regular
     expression tools treat negated character classes, but
     unfortunately the inconsistency is historically entrenched.
     Matching newlines means that a pattern like [^"]* can match the
     entire input unless there's another quote in the input.

   - A rule can have at most one instance of trailing context (the '/'
     operator or the '$' operator). The start condition, '^', and
     "<<EOF>>" patterns can only occur at the beginning of a pattern,
     and, as well as with '/' and '$', cannot be grouped inside
     parentheses. A '^' which does not occur at the beginning of a
     rule or a '$' which does not occur at the end of a rule loses its
     special properties and is treated as a normal character.

     The following are illegal:

          foo/bar$
          <sc1>foo<sc2>bar

     Note that the first of these can be written "foo/bar\n".

     The following will result in '$' or '^' being treated as a normal
     character:

          foo|(bar$)
          foo|^bar

     If what's wanted is a "foo" or a bar-followed-by-a-newline, the
     following could be used (the special '|' action is explained
     below):

          foo      |
          bar$     /* action goes here */

     A similar trick will work for matching a foo or a
     bar-at-the-beginning-of-a-line.
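
The equivalence of the character classes listed earlier in this section
can be checked against the C library directly. The following is a small
sketch, assuming an ASCII execution character set and the default "C"
locale (in other locales `isalpha()' may accept more than a-z and A-Z).
The function name `alnum_classes_agree' is made up for this
illustration.

```c
#include <ctype.h>

/* Returns 1 if, for every unsigned char value, the three ways of
 * describing an alphanumeric character agree:
 *   [[:alnum:]]            isalnum()
 *   [[:alpha:][:digit:]]   isalpha() or isdigit()
 *   [a-zA-Z0-9]            explicit ranges (ASCII assumed)
 * Returns 0 as soon as a disagreement is found. */
int alnum_classes_agree(void)
{
    for (int c = 0; c <= 255; ++c) {
        int by_alnum  = isalnum(c) != 0;
        int by_union  = isalpha(c) != 0 || isdigit(c) != 0;
        int by_ranges = (c >= 'a' && c <= 'z') ||
                        (c >= 'A' && c <= 'Z') ||
                        (c >= '0' && c <= '9');

        if (by_alnum != by_union || by_alnum != by_ranges)
            return 0;
    }
    return 1;
}
```

Under the stated assumptions the function returns 1; a result of 0 would
indicate a locale or character set in which `[a-zA-Z0-9]' and
`[[:alnum:]]' are not interchangeable.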


File: flex.info,  Node: Matching,  Next: Actions,  Prev: Patterns,  Up: Top

How the input is matched
========================

When the generated scanner is run, it analyzes its input looking for
strings which match any of its patterns. If it finds more than one
match, it takes the one matching the most text (for trailing context
rules, this includes the length of the trailing part, even though it
will then be returned to the input). If it finds two or more matches
of the same length, the rule listed first in the `flex' input file is
chosen.

Once the match is determined, the text corresponding to the match
(called the TOKEN) is made available in the global character pointer
`yytext', and its length in the global integer `yyleng'. The ACTION
corresponding to the matched pattern is then executed (a more detailed
description of actions follows), and then the remaining input is
scanned for another match.

If no match is found, then the "default rule" is executed: the next
character in the input is considered matched and copied to the standard
output. Thus, the simplest legal `flex' input is:

     %%

which generates a scanner that simply copies its input (one
character at a time) to its output.

Note that `yytext' can be defined in two different ways: either as a
character *pointer* or as a character *array*. You can control which
definition `flex' uses by including one of the special directives
`%pointer' or `%array' in the first (definitions) section of your flex
input. The default is `%pointer', unless you use the `-l' lex
compatibility option, in which case `yytext' will be an array. The
advantage of using `%pointer' is substantially faster scanning and no
buffer overflow when matching very large tokens (unless you run out of
dynamic memory).
The disadvantage is that you are restricted in how
your actions can modify `yytext' (see the next section), and calls to
the `unput()' function destroy the present contents of `yytext', which
can be a considerable porting headache when moving between different
`lex' versions.

The advantage of `%array' is that you can then modify `yytext' to
your heart's content, and calls to `unput()' do not destroy `yytext'
(see below). Furthermore, existing `lex' programs sometimes access
`yytext' externally using declarations of the form:

     extern char yytext[];

This definition is erroneous when used with `%pointer', but correct
for `%array'.

`%array' defines `yytext' to be an array of `YYLMAX' characters,
which defaults to a fairly large value. You can change the size by
simply #define'ing `YYLMAX' to a different value in the first section
of your `flex' input. As mentioned above, with `%pointer' yytext grows
dynamically to accommodate large tokens. While this means your
`%pointer' scanner can accommodate very large tokens (such as matching
entire blocks of comments), bear in mind that each time the scanner
must resize `yytext' it also must rescan the entire token from the
beginning, so matching such tokens can prove slow. `yytext' presently
does *not* dynamically grow if a call to `unput()' results in too much
text being pushed back; instead, a run-time error results.

Also note that you cannot use `%array' with C++ scanner classes (the
`c++' option; see below).


File: flex.info,  Node: Actions,  Next: Generated scanner,  Prev: Matching,  Up: Top

Actions
=======

Each pattern in a rule has a corresponding action, which can be any
arbitrary C statement. The pattern ends at the first non-escaped
whitespace character; the remainder of the line is its action.
If the action is empty, then when the pattern is matched the input
token is simply discarded. For example, here is the specification for a
program which deletes all occurrences of "zap me" from its input:

     %%
     "zap me"

(It will copy all other characters in the input to the output since
they will be matched by the default rule.)

Here is a program which compresses multiple blanks and tabs down to
a single blank, and throws away whitespace found at the end of a line:

     %%
     [ \t]+        putchar( ' ' );
     [ \t]+$       /* ignore this token */

If the action contains a '{', then the action spans till the
balancing '}' is found, and the action may cross multiple lines.
`flex' knows about C strings and comments and won't be fooled by braces
found within them, but also allows actions to begin with `%{' and will
consider the action to be all the text up to the next `%}' (regardless
of ordinary braces inside the action).

An action consisting solely of a vertical bar ('|') means "same as
the action for the next rule." See below for an illustration.

Actions can include arbitrary C code, including `return' statements
to return a value to whatever routine called `yylex()'. Each time
`yylex()' is called it continues processing tokens from where it last
left off until it either reaches the end of the file or executes a
return.

Actions are free to modify `yytext' except for lengthening it
(adding characters to its end would overwrite later characters in
the input stream). This however does not apply when using `%array'
(see above); in that case, `yytext' may be freely modified in any way.

Actions are free to modify `yyleng' except they should not do so if
the action also includes use of `yymore()' (see below).
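
To make the behavior of the whitespace-compressing scanner shown earlier
in this section concrete, here is a hand-written C sketch of the same
transformation. It is only an illustration, not what `flex' generates: it
assumes the input is a NUL-terminated string in memory, and the function
name `squeeze_blanks' is made up. A run of blanks and tabs that is
directly followed by a newline corresponds to the `[ \t]+$' rule, which
wins over `[ \t]+' because its trailing context makes the match longer.

```c
#include <stddef.h>
#include <string.h>

/* Hand-written equivalent of:
 *     [ \t]+     putchar( ' ' );
 *     [ \t]+$    (empty action: token discarded)
 * Writes the result to out, which must have room for strlen(in) + 1
 * bytes, and returns the number of bytes written (excluding the NUL). */
size_t squeeze_blanks(const char *in, char *out)
{
    size_t n = 0;

    while (*in != '\0') {
        if (*in == ' ' || *in == '\t') {
            /* Consume the whole run, as [ \t]+ would. */
            while (*in == ' ' || *in == '\t')
                ++in;
            /* A run right before a newline is the [ \t]+$ case: drop it.
             * Any other run is replaced by a single blank. */
            if (*in != '\n')
                out[n++] = ' ';
        } else {
            out[n++] = *in++;   /* default rule: copy unmatched text */
        }
    }
    out[n] = '\0';
    return n;
}
```

Note that a run of whitespace at the very end of the input, with no
newline after it, is turned into a single blank rather than dropped,
since `$' only matches just before a newline.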

There are a number of special directives which can be included
within an action:

   - `ECHO' copies yytext to the scanner's output.

   - `BEGIN' followed by the name of a start condition places the
     scanner in the corresponding start condition (see below).

   - `REJECT' directs the scanner to proceed on to the "second best"
     rule which matched the input (or a prefix of the input). The rule
     is chosen as described above in "How the Input is Matched", and
     `yytext' and `yyleng' set up appropriately. It may either be one
     which matched as much text as the originally chosen rule but came
     later in the `flex' input file, or one which matched less text.
     For example, the following will both count the words in the input
     and call the routine special() whenever "frob" is seen:

                  int word_count = 0;
          %%

          frob        special(); REJECT;
          [^ \t\n]+   ++word_count;

     Without the `REJECT', any "frob"'s in the input would not be
     counted as words, since the scanner normally executes only one
     action per token. Multiple `REJECT's' are allowed, each one
     finding the next best choice to the currently active rule. For
     example, when the following scanner scans the token "abcd", it
     will write "abcdabcaba" to the output:

          %%
          a        |
          ab       |
          abc      |
          abcd     ECHO; REJECT;
          .|\n     /* eat up any unmatched character */

     (The first three rules share the fourth's action since they use
     the special '|' action.) `REJECT' is a particularly expensive
     feature in terms of scanner performance; if it is used in *any* of
     the scanner's actions it will slow down *all* of the scanner's
     matching. Furthermore, `REJECT' cannot be used with the `-Cf' or
     `-CF' options (see below).

     Note also that unlike the other special actions, `REJECT' is a
     *branch*; code immediately following it in the action will *not*
     be executed.

   - `yymore()' tells the scanner that the next time it matches a rule,
     the corresponding token should be *appended* onto the current
     value of `yytext' rather than replacing it. For example, given
     the input "mega-kludge" the following will write
     "mega-mega-kludge" to the output:

          %%
          mega-    ECHO; yymore();
          kludge   ECHO;

     First "mega-" is matched and echoed to the output. Then "kludge"
     is matched, but the previous "mega-" is still hanging around at
     the beginning of `yytext' so the `ECHO' for the "kludge" rule will
     actually write "mega-kludge".

     Two notes regarding use of `yymore()'. First, `yymore()' depends on
     the value of `yyleng' correctly reflecting the size of the current
     token, so you must not modify `yyleng' if you are using `yymore()'.
     Second, the presence of `yymore()' in the scanner's action entails a
     minor performance penalty in the scanner's matching speed.

   - `yyless(n)' returns all but the first N characters of the current
     token back to the input stream, where they will be rescanned when
     the scanner looks for the next match. `yytext' and `yyleng' are
     adjusted appropriately (e.g., `yyleng' will now be equal to N).
     For example, on the input "foobar" the following will write out
     "foobarbar":

          %%
          foobar    ECHO; yyless(3);
          [a-z]+    ECHO;

     An argument of 0 to `yyless' will cause the entire current input
     string to be scanned again. Unless you've changed how the scanner
     will subsequently process its input (using `BEGIN', for example),
     this will result in an endless loop.

     Note that `yyless' is a macro and can only be used in the flex
     input file, not from other source files.

   - `unput(c)' puts the character `c' back onto the input stream. It
     will be the next character scanned. The following action will
     take the current token and cause it to be rescanned enclosed in
     parentheses.

          {
          int i;
          /* Copy yytext because unput() trashes yytext */
          char *yycopy = strdup( yytext );
          unput( ')' );
          for ( i = yyleng - 1; i >= 0; --i )
              unput( yycopy[i] );
          unput( '(' );
          free( yycopy );
          }

     Note that since each `unput()' puts the given character back at
     the *beginning* of the input stream, pushing back strings must be
     done back-to-front. An important potential problem when using
     `unput()' is that if you are using `%pointer' (the default), a
     call to `unput()' *destroys* the contents of `yytext', starting
     with its rightmost character and devouring one character to the
     left with each call. If you need the value of yytext preserved
     after a call to `unput()' (as in the above example), you must
     either first copy it elsewhere, or build your scanner using
     `%array' instead (see How The Input Is Matched).

     Finally, note that you cannot put back `EOF' to attempt to mark
     the input stream with an end-of-file.

   - `input()' reads the next character from the input stream. For
     example, the following is one way to eat up C comments:

          %%
          "/*"        {
                      register int c;

                      for ( ; ; )
                          {
                          while ( (c = input()) != '*' &&
                                  c != EOF )
                              ;    /* eat up text of comment */

                          if ( c == '*' )
                              {
                              while ( (c = input()) == '*' )
                                  ;
                              if ( c == '/' )
                                  break;    /* found the end */
                              }

                          if ( c == EOF )
                              {
                              error( "EOF in comment" );
                              break;
                              }
                          }
                      }

     (Note that if the scanner is compiled using `C++', then `input()'
     is instead referred to as `yyinput()', in order to avoid a name
     clash with the `C++' stream by the name of `input'.)

   - YY_FLUSH_BUFFER flushes the scanner's internal buffer so that the
     next time the scanner attempts to match a token, it will first
     refill the buffer using `YY_INPUT' (see The Generated Scanner,
     below).
     This action is a special case of the more general
     `yy_flush_buffer()' function, described below in the section
     Multiple Input Buffers.

   - `yyterminate()' can be used in lieu of a return statement in an
     action. It terminates the scanner and returns a 0 to the
     scanner's caller, indicating "all done". By default,
     `yyterminate()' is also called when an end-of-file is encountered.
     It is a macro and may be redefined.


File: flex.info,  Node: Generated scanner,  Next: Start conditions,  Prev: Actions,  Up: Top

The generated scanner
=====================

The output of `flex' is the file `lex.yy.c', which contains the
scanning routine `yylex()', a number of tables used by it for matching
tokens, and a number of auxiliary routines and macros. By default,
`yylex()' is declared as follows:

     int yylex()
         {
         ... various definitions and the actions in here ...
         }

(If your environment supports function prototypes, then it will be
"int yylex( void )".) This definition may be changed by defining
the "YY_DECL" macro. For example, you could use:

     #define YY_DECL float lexscan( a, b ) float a, b;

to give the scanning routine the name `lexscan', returning a float,
and taking two floats as arguments. Note that if you give arguments to
the scanning routine using a K&R-style/non-prototyped function
declaration, you must terminate the definition with a semi-colon (`;').

Whenever `yylex()' is called, it scans tokens from the global input
file `yyin' (which defaults to stdin). It continues until it either
reaches an end-of-file (at which point it returns the value 0) or one
of its actions executes a `return' statement.

If the scanner reaches an end-of-file, subsequent calls are undefined
unless either `yyin' is pointed at a new input file (in which case
scanning continues from that file), or `yyrestart()' is called.
`yyrestart()' takes one argument, a `FILE *' pointer (which can be nil,
if you've set up `YY_INPUT' to scan from a source other than `yyin'),
and initializes `yyin' for scanning from that file. Essentially there
is no difference between just assigning `yyin' to a new input file or
using `yyrestart()' to do so; the latter is available for compatibility
with previous versions of `flex', and because it can be used to switch
input files in the middle of scanning. It can also be used to throw
away the current input buffer, by calling it with an argument of
`yyin'; but better is to use `YY_FLUSH_BUFFER' (see above). Note that
`yyrestart()' does *not* reset the start condition to `INITIAL' (see
Start Conditions, below).

If `yylex()' stops scanning due to executing a `return' statement in
one of the actions, the scanner may then be called again and it will
resume scanning where it left off.

By default (and for purposes of efficiency), the scanner uses
block-reads rather than simple `getc()' calls to read characters from
`yyin'. The nature of how it gets its input can be controlled by
defining the `YY_INPUT' macro. YY_INPUT's calling sequence is
"YY_INPUT(buf,result,max_size)". Its action is to place up to MAX_SIZE
characters in the character array BUF and return in the integer
variable RESULT either the number of characters read or the constant
YY_NULL (0 on Unix systems) to indicate EOF. The default YY_INPUT
reads from the global file-pointer "yyin".

A sample definition of YY_INPUT (in the definitions section of the
input file):

     %{
     #define YY_INPUT(buf,result,max_size) \
         { \
         int c = getchar(); \
         result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
         }
     %}

This definition will change the input processing to occur one
character at a time.
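
The `YY_INPUT' calling sequence can be tried out in isolation, without a
generated scanner. The sketch below is a stand-alone illustration, not
flex code: `STRING_INPUT', `src_text', `src_pos' and `drain' are names
invented here, and `YY_NULL' is defined locally only because no
generated scanner is present to supply it. The macro follows the
contract described above, placing up to MAX_SIZE characters in BUF and
storing either the count or `YY_NULL' in RESULT.

```c
#include <stddef.h>
#include <string.h>

#define YY_NULL 0   /* local stand-in; a generated scanner defines its own */

static const char *src_text = "";   /* hypothetical in-memory input source */
static size_t src_pos = 0;          /* how much of src_text has been read */

/* Same calling sequence as YY_INPUT(buf,result,max_size), but reading
 * from src_text instead of yyin. */
#define STRING_INPUT(buf, result, max_size) \
    do { \
        size_t left_ = strlen(src_text) - src_pos; \
        size_t n_ = left_ < (size_t)(max_size) ? left_ : (size_t)(max_size); \
        memcpy((buf), src_text + src_pos, n_); \
        src_pos += n_; \
        (result) = n_ ? (int)n_ : YY_NULL; \
    } while (0)

/* Plays the role of the scanner's refill loop: reads text in blocks of
 * at most chunk characters until end of input, returning the total. */
size_t drain(const char *text, size_t chunk)
{
    char buf[16];
    size_t total = 0;
    int got;

    src_text = text;
    src_pos = 0;
    if (chunk > sizeof buf)
        chunk = sizeof buf;

    for (;;) {
        STRING_INPUT(buf, got, chunk);
        if (got == YY_NULL)
            break;              /* end-of-input indication */
        total += (size_t)got;
    }
    return total;
}
```

For real scanners that read from memory, the routines
`yy_scan_string()', `yy_scan_bytes()', and `yy_scan_buffer()' mentioned
in this manual are the intended interface; the sketch only shows how the
RESULT and `YY_NULL' convention works.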
929 930 When the scanner receives an end-of-file indication from YY_INPUT, 931 it then checks the `yywrap()' function. If `yywrap()' returns false 932 (zero), then it is assumed that the function has gone ahead and set up 933 `yyin' to point to another input file, and scanning continues. If it 934 returns true (non-zero), then the scanner terminates, returning 0 to 935 its caller. Note that in either case, the start condition remains 936 unchanged; it does *not* revert to `INITIAL'. 937 938 If you do not supply your own version of `yywrap()', then you must 939 either use `%option noyywrap' (in which case the scanner behaves as 940 though `yywrap()' returned 1), or you must link with `-lfl' to obtain 941 the default version of the routine, which always returns 1. 942 943 Three routines are available for scanning from in-memory buffers 944 rather than files: `yy_scan_string()', `yy_scan_bytes()', and 945 `yy_scan_buffer()'. See the discussion of them below in the section 946 Multiple Input Buffers. 947 948 The scanner writes its `ECHO' output to the `yyout' global (default, 949 stdout), which may be redefined by the user simply by assigning it to 950 some other `FILE' pointer. 951 952 953 File: flex.info, Node: Start conditions, Next: Multiple buffers, Prev: Generated scanner, Up: Top 954 955 Start conditions 956 ================ 957 958 `flex' provides a mechanism for conditionally activating rules. Any 959 rule whose pattern is prefixed with "<sc>" will only be active when the 960 scanner is in the start condition named "sc". For example, 961 962 <STRING>[^"]* { /* eat up the string body ... */ 963 ... 964 } 965 966 will be active only when the scanner is in the "STRING" start 967 condition, and 968 969 <INITIAL,STRING,QUOTE>\. { /* handle an escape ... */ 970 ... 971 } 972 973 will be active only when the current start condition is either 974 "INITIAL", "STRING", or "QUOTE". 
975 976 Start conditions are declared in the definitions (first) section of 977 the input using unindented lines beginning with either `%s' or `%x' 978 followed by a list of names. The former declares *inclusive* start 979 conditions, the latter *exclusive* start conditions. A start condition 980 is activated using the `BEGIN' action. Until the next `BEGIN' action is 981 executed, rules with the given start condition will be active and rules 982 with other start conditions will be inactive. If the start condition 983 is *inclusive*, then rules with no start conditions at all will also be 984 active. If it is *exclusive*, then *only* rules qualified with the 985 start condition will be active. A set of rules contingent on the same 986 exclusive start condition describe a scanner which is independent of 987 any of the other rules in the `flex' input. Because of this, exclusive 988 start conditions make it easy to specify "mini-scanners" which scan 989 portions of the input that are syntactically different from the rest 990 (e.g., comments). 991 992 If the distinction between inclusive and exclusive start conditions 993 is still a little vague, here's a simple example illustrating the 994 connection between the two. The set of rules: 995 996 %s example 997 %% 998 999 <example>foo do_something(); 1000 1001 bar something_else(); 1002 1003 is equivalent to 1004 1005 %x example 1006 %% 1007 1008 <example>foo do_something(); 1009 1010 <INITIAL,example>bar something_else(); 1011 1012 Without the `<INITIAL,example>' qualifier, the `bar' pattern in the 1013 second example wouldn't be active (i.e., couldn't match) when in start 1014 condition `example'. If we just used `<example>' to qualify `bar', 1015 though, then it would only be active in `example' and not in `INITIAL', 1016 while in the first example it's active in both, because in the first 1017 example the `example' starting condition is an *inclusive* (`%s') start 1018 condition. 
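The activation rule described above can be modeled in a few lines of C. This is an illustrative sketch only, not how `flex' implements start conditions internally; the enum values, the table `sc_is_exclusive' and the function `rule_is_active' are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical start conditions; in a generated scanner these names
 * would come from the %s and %x declarations.
 */
enum { INITIAL = 0, EXAMPLE = 1, NUM_SC = 2 };

/* Non-zero means the condition was declared with %x.  INITIAL always
 * behaves like an inclusive condition.
 */
static int sc_is_exclusive[NUM_SC];

/* Returns 1 if a rule qualified with the n start conditions in
 * rule_scs is active while the scanner is in start condition current.
 * n == 0 models a rule written with no <...> prefix at all.
 */
static int rule_is_active( int current, const int *rule_scs, size_t n )
    {
    size_t i;

    assert( current >= 0 && current < NUM_SC );

    if ( n == 0 )
        /* unqualified rules are switched off only by %x conditions */
        return ! sc_is_exclusive[current];

    for ( i = 0; i < n; ++i )
        if ( rule_scs[i] == current )
            return 1;

    return 0;
    }
```

With `EXAMPLE' marked inclusive, an unqualified rule such as `bar' is reported active in both `INITIAL' and `EXAMPLE'; marking `EXAMPLE' exclusive turns that rule off there, which is why the second form of the example has to list `<INITIAL,example>' explicitly.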
1019 1020 Also note that the special start-condition specifier `<*>' matches 1021 every start condition. Thus, the above example could also have been 1022 written: 1023 1024 %x example 1025 %% 1026 1027 <example>foo do_something(); 1028 1029 <*>bar something_else(); 1030 1031 The default rule (to `ECHO' any unmatched character) remains active 1032 in start conditions. It is equivalent to: 1033 1034 <*>.|\n ECHO; 1035 1036 `BEGIN(0)' returns to the original state where only the rules with 1037 no start conditions are active. This state can also be referred to as 1038 the start-condition "INITIAL", so `BEGIN(INITIAL)' is equivalent to 1039 `BEGIN(0)'. (The parentheses around the start condition name are not 1040 required but are considered good style.) 1041 1042 `BEGIN' actions can also be given as indented code at the beginning 1043 of the rules section. For example, the following will cause the 1044 scanner to enter the "SPECIAL" start condition whenever `yylex()' is 1045 called and the global variable `enter_special' is true: 1046 1047 int enter_special; 1048 1049 %x SPECIAL 1050 %% 1051 if ( enter_special ) 1052 BEGIN(SPECIAL); 1053 1054 <SPECIAL>blahblahblah 1055 ...more rules follow... 1056 1057 To illustrate the uses of start conditions, here is a scanner which 1058 provides two different interpretations of a string like "123.456". By 1059 default it will treat it as three tokens, the integer "123", a dot 1060 ('.'), and the integer "456". 
But if the string is preceded earlier in 1061 the line by the string "expect-floats" it will treat it as a single 1062 token, the floating-point number 123.456: 1063 1064 %{ 1065 #include <stdlib.h> /* atof(), atoi() */ 1066 %} 1067 %s expect 1068 1069 %% 1070 expect-floats BEGIN(expect); 1071 1072 <expect>[0-9]+"."[0-9]+ { 1073 printf( "found a float, = %f\n", 1074 atof( yytext ) ); 1075 } 1076 <expect>\n { 1077 /* that's the end of the line, so 1078 * we need another "expect-floats" 1079 * before we'll recognize any more 1080 * numbers 1081 */ 1082 BEGIN(INITIAL); 1083 } 1084 1085 [0-9]+ { 1089 printf( "found an integer, = %d\n", 1090 atoi( yytext ) ); 1091 } 1092 1093 "." printf( "found a dot\n" ); 1094 1095 Here is a scanner which recognizes (and discards) C comments while 1096 maintaining a count of the current input line. 1097 1098 %x comment 1099 %% 1100 int line_num = 1; 1101 1102 "/*" BEGIN(comment); 1103 1104 <comment>[^*\n]* /* eat anything that's not a '*' */ 1105 <comment>"*"+[^*/\n]* /* eat up '*'s not followed by '/'s */ 1106 <comment>\n ++line_num; 1107 <comment>"*"+"/" BEGIN(INITIAL); 1108 1109 This scanner goes to a bit of trouble to match as much text as 1110 possible with each rule. In general, when attempting to write a 1111 high-speed scanner try to match as much as possible in each rule, as it's 1112 a big win. 1113 1114 Note that start condition names are really integer values and can 1115 be stored as such. Thus, the above could be extended in the following 1116 fashion: 1117 1118 %x comment foo 1119 %% 1120 int line_num = 1; 1121 int comment_caller; 1122 1123 "/*" { 1124 comment_caller = INITIAL; 1125 BEGIN(comment); 1126 } 1127 1128 ... 
1129 1130 <foo>"/*" { 1131 comment_caller = foo; 1132 BEGIN(comment); 1133 } 1134 1135 <comment>[^*\n]* /* eat anything that's not a '*' */ 1136 <comment>"*"+[^*/\n]* /* eat up '*'s not followed by '/'s */ 1137 <comment>\n ++line_num; 1138 <comment>"*"+"/" BEGIN(comment_caller); 1139 1140 Furthermore, you can access the current start condition using the 1141 integer-valued `YY_START' macro. For example, the above assignments to 1142 `comment_caller' could instead be written 1143 1144 comment_caller = YY_START; 1145 1146 Flex provides `YYSTATE' as an alias for `YY_START' (since that is 1147 what's used by AT&T `lex'). 1148 1149 Note that start conditions do not have their own name-space; %s's 1150 and %x's declare names in the same fashion as #define's. 1151 1152 Finally, here's an example of how to match C-style quoted strings 1153 using exclusive start conditions, including expanded escape sequences 1154 (but not including checking for a string that's too long): 1155 1156 %x str 1157 1158 %% 1159 char string_buf[MAX_STR_CONST]; 1160 char *string_buf_ptr; 1161 1162 \" string_buf_ptr = string_buf; BEGIN(str); 1163 1164 <str>\" { /* saw closing quote - all done */ 1165 BEGIN(INITIAL); 1166 *string_buf_ptr = '\0'; 1167 /* return string constant token type and 1168 * value to parser 1169 */ 1170 } 1171 1172 <str>\n { 1173 /* error - unterminated string constant */ 1174 /* generate error message */ 1175 } 1176 1177 <str>\\[0-7]{1,3} { 1178 /* octal escape sequence */ 1179 unsigned int result; /* "%o" stores into an unsigned int */ 1180 1181 (void) sscanf( yytext + 1, "%o", &result ); 1182 1183 if ( result > 0xff ) 1184 { /* error, constant is out-of-bounds */ } 1185 else 1186 *string_buf_ptr++ = result; 1187 } 1188 1189 <str>\\[0-9]+ { 1190 /* generate error - bad escape sequence; something 1191 * like '\48' or '\0777777' 1192 */ 1193 } 1194 1195 <str>\\n *string_buf_ptr++ = '\n'; 1196 <str>\\t *string_buf_ptr++ = '\t'; 1197 <str>\\r *string_buf_ptr++ = '\r'; 1198 <str>\\b *string_buf_ptr++ = '\b'; 1199 <str>\\f 
*string_buf_ptr++ = '\f'; 1200 1201 <str>\\(.|\n) *string_buf_ptr++ = yytext[1]; 1202 1203 <str>[^\\\n\"]+ { 1204 char *yptr = yytext; 1205 1206 while ( *yptr ) 1207 *string_buf_ptr++ = *yptr++; 1208 } 1209 1210 Often, such as in some of the examples above, you wind up writing a 1211 whole bunch of rules all preceded by the same start condition(s). Flex 1212 makes this a little easier and cleaner by introducing a notion of start 1213 condition "scope". A start condition scope is begun with: 1214 1215 <SCs>{ 1216 1217 where SCs is a list of one or more start conditions. Inside the start 1218 condition scope, every rule automatically has the prefix `<SCs>' 1219 applied to it, until a `}' which matches the initial `{'. So, for 1220 example, 1221 1222 <ESC>{ 1223 "\\n" return '\n'; 1224 "\\r" return '\r'; 1225 "\\f" return '\f'; 1226 "\\0" return '\0'; 1227 } 1228 1229 is equivalent to: 1230 1231 <ESC>"\\n" return '\n'; 1232 <ESC>"\\r" return '\r'; 1233 <ESC>"\\f" return '\f'; 1234 <ESC>"\\0" return '\0'; 1235 1236 Start condition scopes may be nested. 1237 1238 Three routines are available for manipulating stacks of start 1239 conditions: 1240 1241 `void yy_push_state(int new_state)' 1242 pushes the current start condition onto the top of the start 1243 condition stack and switches to NEW_STATE as though you had used 1244 `BEGIN new_state' (recall that start condition names are also 1245 integers). 1246 1247 `void yy_pop_state()' 1248 pops the top of the stack and switches to it via `BEGIN'. 1249 1250 `int yy_top_state()' 1251 returns the top of the stack without altering the stack's contents. 1252 1253 The start condition stack grows dynamically and so has no built-in 1254 size limitation. If memory is exhausted, program execution aborts. 1255 1256 To use start condition stacks, your scanner must include a `%option 1257 stack' directive (see Options below). 
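The behavior of the three stack routines can be pictured with a small C model. This is a sketch of the semantics described above, not the code `flex' generates: `sc_current' stands in for the scanner's active start condition, and `sc_push', `sc_pop' and `sc_top' are hypothetical stand-ins for `yy_push_state()', `yy_pop_state()' and `yy_top_state()'.

```c
#include <assert.h>
#include <stdlib.h>

/* Models the active start condition; 0 plays the role of INITIAL. */
static int sc_current = 0;

static int *sc_stack = NULL;
static size_t sc_depth = 0, sc_cap = 0;

/* Save the current start condition, then switch to new_state. */
static void sc_push( int new_state )
    {
    if ( sc_depth == sc_cap )
        {
        /* grow on demand, so there is no fixed depth limit */
        size_t cap = sc_cap ? sc_cap * 2 : 8;
        int *p = realloc( sc_stack, cap * sizeof *p );

        if ( ! p )
            abort();    /* memory exhausted: execution aborts */

        sc_stack = p;
        sc_cap = cap;
        }

    sc_stack[sc_depth++] = sc_current;
    sc_current = new_state;
    }

/* Switch back to the most recently saved start condition. */
static void sc_pop( void )
    {
    assert( sc_depth > 0 );
    sc_current = sc_stack[--sc_depth];
    }

/* Peek at the saved start condition without changing anything. */
static int sc_top( void )
    {
    assert( sc_depth > 0 );
    return sc_stack[sc_depth - 1];
    }
```

After `sc_push( 3 )' from the initial state, `sc_current' is 3 and `sc_top()' returns 0, the condition that a later `sc_pop()' will restore. The same pattern lets a rule for a nested construct return to whichever condition was active when it started, without a hand-maintained variable such as `comment_caller'.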
1258 1259 1260 File: flex.info, Node: Multiple buffers, Next: End-of-file rules, Prev: Start conditions, Up: Top 1261 1262 Multiple input buffers 1263 ====================== 1264 1265 Some scanners (such as those which support "include" files) require 1266 reading from several input streams. As `flex' scanners do a large 1267 amount of buffering, one cannot control where the next input will be 1268 read from by simply writing a `YY_INPUT' which is sensitive to the 1269 scanning context. `YY_INPUT' is only called when the scanner reaches 1270 the end of its buffer, which may be a long time after scanning a 1271 statement such as an "include" which requires switching the input 1272 source. 1273 1274 To negotiate these sorts of problems, `flex' provides a mechanism 1275 for creating and switching between multiple input buffers. An input 1276 buffer is created by using: 1277 1278 YY_BUFFER_STATE yy_create_buffer( FILE *file, int size ) 1279 1280 which takes a `FILE' pointer and a size and creates a buffer associated 1281 with the given file and large enough to hold SIZE characters (when in 1282 doubt, use `YY_BUF_SIZE' for the size). It returns a `YY_BUFFER_STATE' 1283 handle, which may then be passed to other routines (see below). The 1284 `YY_BUFFER_STATE' type is a pointer to an opaque `struct' 1285 `yy_buffer_state' structure, so you may safely initialize 1286 YY_BUFFER_STATE variables to `((YY_BUFFER_STATE) 0)' if you wish, and 1287 also refer to the opaque structure in order to correctly declare input 1288 buffers in source files other than that of your scanner. Note that the 1289 `FILE' pointer in the call to `yy_create_buffer' is only used as the 1290 value of `yyin' seen by `YY_INPUT'; if you redefine `YY_INPUT' so it no 1291 longer uses `yyin', then you can safely pass a nil `FILE' pointer to 1292 `yy_create_buffer'. 
You select a particular buffer to scan from using: 1293 1294 void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer ) 1295 1296 switches the scanner's input buffer so subsequent tokens will come 1297 from NEW_BUFFER. Note that `yy_switch_to_buffer()' may be used by 1298 `yywrap()' to set things up for continued scanning, instead of opening 1299 a new file and pointing `yyin' at it. Note also that switching input 1300 sources via either `yy_switch_to_buffer()' or `yywrap()' does *not* 1301 change the start condition. 1302 1303 void yy_delete_buffer( YY_BUFFER_STATE buffer ) 1304 1305 is used to reclaim the storage associated with a buffer. You can also 1306 clear the current contents of a buffer using: 1307 1308 void yy_flush_buffer( YY_BUFFER_STATE buffer ) 1309 1310 This function discards the buffer's contents, so the next time the 1311 scanner attempts to match a token from the buffer, it will first fill 1312 the buffer anew using `YY_INPUT'. 1313 1314 `yy_new_buffer()' is an alias for `yy_create_buffer()', provided for 1315 compatibility with the C++ use of `new' and `delete' for creating and 1316 destroying dynamic objects. 1317 1318 Finally, the `YY_CURRENT_BUFFER' macro returns a `YY_BUFFER_STATE' 1319 handle to the current buffer. 1320 1321 Here is an example of using these features for writing a scanner 1322 which expands include files (the `<<EOF>>' feature is discussed below): 1323 1324 /* the "incl" state is used for picking up the name 1325 * of an include file 1326 */ 1327 %x incl 1328 1329 %{ 1330 #define MAX_INCLUDE_DEPTH 10 1331 YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH]; 1332 int include_stack_ptr = 0; 1333 %} 1334 1335 %% 1336 include BEGIN(incl); 1337 1338 [a-z]+ ECHO; 1339 [^a-z\n]*\n? 
ECHO; 1340 1341 <incl>[ \t]* /* eat the whitespace */ 1342 <incl>[^ \t\n]+ { /* got the include file name */ 1343 if ( include_stack_ptr >= MAX_INCLUDE_DEPTH ) 1344 { 1345 fprintf( stderr, "Includes nested too deeply" ); 1346 exit( 1 ); 1347 } 1348 1349 include_stack[include_stack_ptr++] = 1350 YY_CURRENT_BUFFER; 1351 1352 yyin = fopen( yytext, "r" ); 1353 1354 if ( ! yyin ) 1355 error( ... ); 1356 1357 yy_switch_to_buffer( 1358 yy_create_buffer( yyin, YY_BUF_SIZE ) ); 1359 1360 BEGIN(INITIAL); 1361 } 1362 1363 <<EOF>> { 1364 if ( --include_stack_ptr < 0 ) 1365 { 1366 yyterminate(); 1367 } 1368 1369 else 1370 { 1371 yy_delete_buffer( YY_CURRENT_BUFFER ); 1372 yy_switch_to_buffer( 1373 include_stack[include_stack_ptr] ); 1374 } 1375 } 1376 1377 Three routines are available for setting up input buffers for 1378 scanning in-memory strings instead of files. All of them create a new 1379 input buffer for scanning the string, and return a corresponding 1380 `YY_BUFFER_STATE' handle (which you should delete with 1381 `yy_delete_buffer()' when done with it). They also switch to the new 1382 buffer using `yy_switch_to_buffer()', so the next call to `yylex()' will 1383 start scanning the string. 1384 1385 `yy_scan_string(const char *str)' 1386 scans a NUL-terminated string. 1387 1388 `yy_scan_bytes(const char *bytes, int len)' 1389 scans `len' bytes (including possibly NUL's) starting at location 1390 BYTES. 1391 1392 Note that both of these functions create and scan a *copy* of the 1393 string or bytes. (This may be desirable, since `yylex()' modifies the 1394 contents of the buffer it is scanning.) You can avoid the copy by using: 1395 1396 `yy_scan_buffer(char *base, yy_size_t size)' 1397 which scans in place the buffer starting at BASE, consisting of 1398 SIZE bytes, the last two bytes of which *must* be 1399 `YY_END_OF_BUFFER_CHAR' (ASCII NUL). These last two bytes are not 1400 scanned; thus, scanning consists of `base[0]' through 1401 `base[size-2]', inclusive. 
1402 1403 If you fail to set up BASE in this manner (i.e., forget the final 1404 two `YY_END_OF_BUFFER_CHAR' bytes), then `yy_scan_buffer()' 1405 returns a nil pointer instead of creating a new input buffer. 1406 1407 The type `yy_size_t' is an integral type to which you can cast an 1408 integer expression reflecting the size of the buffer. 1409 1410 1411 File: flex.info, Node: End-of-file rules, Next: Miscellaneous, Prev: Multiple buffers, Up: Top 1412 1413 End-of-file rules 1414 ================= 1415 1416 The special rule "<<EOF>>" indicates actions which are to be taken 1417 when an end-of-file is encountered and yywrap() returns non-zero (i.e., 1418 indicates no further files to process). The action must finish by 1419 doing one of four things: 1420 1421 - assigning `yyin' to a new input file (in previous versions of 1422 flex, after doing the assignment you had to call the special 1423 action `YY_NEW_FILE'; this is no longer necessary); 1424 1425 - executing a `return' statement; 1426 1427 - executing the special `yyterminate()' action; 1428 1429 - or, switching to a new buffer using `yy_switch_to_buffer()' as 1430 shown in the example above. 1431 1432 <<EOF>> rules may not be used with other patterns; they may only be 1433 qualified with a list of start conditions. If an unqualified <<EOF>> 1434 rule is given, it applies to *all* start conditions which do not 1435 already have <<EOF>> actions. To specify an <<EOF>> rule for only the 1436 initial start condition, use 1437 1438 <INITIAL><<EOF>> 1439 1440 These rules are useful for catching things like unclosed comments. 1441 An example: 1442 1443 %x quote 1444 %% 1445 1446 ...other rules for dealing with quotes... 
1447 1448 <quote><<EOF>> { 1449 error( "unterminated quote" ); 1450 yyterminate(); 1451 } 1452 <<EOF>> { 1453 if ( *++filelist ) 1454 yyin = fopen( *filelist, "r" ); 1455 else 1456 yyterminate(); 1457 } 1458 1459 1460 File: flex.info, Node: Miscellaneous, Next: User variables, Prev: End-of-file rules, Up: Top 1461 1462 Miscellaneous macros 1463 ==================== 1464 1465 The macro `YY_USER_ACTION' can be defined to provide an action which 1466 is always executed prior to the matched rule's action. For example, it 1467 could be #define'd to call a routine to convert yytext to lower-case. 1468 When `YY_USER_ACTION' is invoked, the variable `yy_act' gives the 1469 number of the matched rule (rules are numbered starting with 1). 1470 Suppose you want to profile how often each of your rules is matched. 1471 The following would do the trick: 1472 1473 #define YY_USER_ACTION ++ctr[yy_act] 1474 1475 where `ctr' is an array to hold the counts for the different rules. 1476 Note that the macro `YY_NUM_RULES' gives the total number of rules 1477 (including the default rule, even if you use `-s'), so a correct 1478 declaration for `ctr' is: 1479 1480 int ctr[YY_NUM_RULES]; 1481 1482 The macro `YY_USER_INIT' may be defined to provide an action which 1483 is always executed before the first scan (and before the scanner's 1484 internal initializations are done). For example, it could be used to 1485 call a routine to read in a data table or open a logging file. 1486 1487 The macro `yy_set_interactive(is_interactive)' can be used to 1488 control whether the current buffer is considered *interactive*. An 1489 interactive buffer is processed more slowly, but must be used when the 1490 scanner's input source is indeed interactive to avoid problems due to 1491 waiting to fill buffers (see the discussion of the `-I' flag below). A 1492 non-zero value in the macro invocation marks the buffer as interactive, 1493 a zero value as non-interactive. 
Note that use of this macro overrides 1494 `%option always-interactive' or `%option never-interactive' (see 1495 Options below). `yy_set_interactive()' must be invoked prior to 1496 beginning to scan the buffer that is (or is not) to be considered 1497 interactive. 1498 1499 The macro `yy_set_bol(at_bol)' can be used to control whether the 1500 current buffer's scanning context for the next token match is done as 1501 though at the beginning of a line. A non-zero macro argument makes 1502 rules anchored with '^' active, while a zero argument makes '^' rules inactive. 1503 1504 The macro `YY_AT_BOL()' returns true if the next token scanned from 1505 the current buffer will have '^' rules active, false otherwise. 1506 1507 In the generated scanner, the actions are all gathered in one large 1508 switch statement and separated using `YY_BREAK', which may be 1509 redefined. By default, it is simply a "break", to separate each rule's 1510 action from the following rule's. Redefining `YY_BREAK' allows, for 1511 example, C++ users to #define YY_BREAK to do nothing (while being very 1512 careful that every rule ends with a "break" or a "return"!) to avoid 1513 unreachable-statement warnings in cases where, because a rule's 1514 action ends with "return", the `YY_BREAK' is inaccessible. 1515 1516 1517 File: flex.info, Node: User variables, Next: YACC interface, Prev: Miscellaneous, Up: Top 1518 1519 Values available to the user 1520 ============================ 1521 1522 This section summarizes the various values available to the user in 1523 the rule actions. 1524 1525 - `char *yytext' holds the text of the current token. It may be 1526 modified but not lengthened (you cannot append characters to the 1527 end). 1528 1529 If the special directive `%array' appears in the first section of 1530 the scanner description, then `yytext' is instead declared `char 1531 yytext[YYLMAX]', where `YYLMAX' is a macro definition that you can 1532 redefine in the first section if you don't like the default value 1533 (generally 8KB). 
Using `%array' results in somewhat slower 1534 scanners, but the value of `yytext' becomes immune to calls to 1535 `input()' and `unput()', which potentially destroy its value when 1536 `yytext' is a character pointer. The opposite of `%array' is 1537 `%pointer', which is the default. 1538 1539 You cannot use `%array' when generating C++ scanner classes (the 1540 `-+' flag). 1541 1542 - `int yyleng' holds the length of the current token. 1543 1544 - `FILE *yyin' is the file which by default `flex' reads from. It 1545 may be redefined but doing so only makes sense before scanning 1546 begins or after an EOF has been encountered. Changing it in the 1547 midst of scanning will have unexpected results since `flex' 1548 buffers its input; use `yyrestart()' instead. Once scanning 1549 terminates because an end-of-file has been seen, you can assign 1550 `yyin' at the new input file and then call the scanner again to 1551 continue scanning. 1552 1553 - `void yyrestart( FILE *new_file )' may be called to point `yyin' 1554 at the new input file. The switch-over to the new file is 1555 immediate (any previously buffered-up input is lost). Note that 1556 calling `yyrestart()' with `yyin' as an argument thus throws away 1557 the current input buffer and continues scanning the same input 1558 file. 1559 1560 - `FILE *yyout' is the file to which `ECHO' actions are done. It 1561 can be reassigned by the user. 1562 1563 - `YY_CURRENT_BUFFER' returns a `YY_BUFFER_STATE' handle to the 1564 current buffer. 1565 1566 - `YY_START' returns an integer value corresponding to the current 1567 start condition. You can subsequently use this value with `BEGIN' 1568 to return to that start condition. 1569 1570 1571 File: flex.info, Node: YACC interface, Next: Options, Prev: User variables, Up: Top 1572 1573 Interfacing with `yacc' 1574 ======================= 1575 1576 One of the main uses of `flex' is as a companion to the `yacc' 1577 parser-generator. 
`yacc' parsers expect to call a routine named 1578 `yylex()' to find the next input token. The routine is supposed to 1579 return the type of the next token as well as putting any associated 1580 value in the global `yylval'. To use `flex' with `yacc', one specifies 1581 the `-d' option to `yacc' to instruct it to generate the file `y.tab.h' 1582 containing definitions of all the `%tokens' appearing in the `yacc' 1583 input. This file is then included in the `flex' scanner. For example, 1584 if one of the tokens is "TOK_NUMBER", part of the scanner might look 1585 like: 1586 1587 %{ 1588 #include "y.tab.h" 1589 %} 1590 1591 %% 1592 1593 [0-9]+ yylval = atoi( yytext ); return TOK_NUMBER; 1594 1595 1596 File: flex.info, Node: Options, Next: Performance, Prev: YACC interface, Up: Top 1597 1598 Options 1599 ======= 1600 1601 `flex' has the following options: 1602 1603 `-b' 1604 Generate backing-up information to `lex.backup'. This is a list 1605 of scanner states which require backing up and the input 1606 characters on which they do so. By adding rules one can remove 1607 backing-up states. If *all* backing-up states are eliminated and 1608 `-Cf' or `-CF' is used, the generated scanner will run faster (see 1609 the `-p' flag). Only users who wish to squeeze every last cycle 1610 out of their scanners need worry about this option. (See the 1611 section on Performance Considerations below.) 1612 1613 `-c' 1614 is a do-nothing, deprecated option included for POSIX compliance. 1615 1616 `-d' 1617 makes the generated scanner run in "debug" mode. Whenever a 1618 pattern is recognized and the global `yy_flex_debug' is non-zero 1619 (which is the default), the scanner will write to `stderr' a line 1620 of the form: 1621 1622 --accepting rule at line 53 ("the matched text") 1623 1624 The line number refers to the location of the rule in the file 1625 defining the scanner (i.e., the file that was fed to flex). 
1626 Messages are also generated when the scanner backs up, accepts the 1627 default rule, reaches the end of its input buffer (or encounters a 1628 NUL; at this point, the two look the same as far as the scanner's 1629 concerned), or reaches an end-of-file. 1630 1631 `-f' 1632 specifies "fast scanner". No table compression is done and stdio 1633 is bypassed. The result is large but fast. This option is 1634 equivalent to `-Cfr' (see below). 1635 1636 `-h' 1637 generates a "help" summary of `flex's' options to `stdout' and 1638 then exits. `-?' and `--help' are synonyms for `-h'. 1639 1640 `-i' 1641 instructs `flex' to generate a *case-insensitive* scanner. The 1642 case of letters given in the `flex' input patterns will be 1643 ignored, and tokens in the input will be matched regardless of 1644 case. The matched text given in `yytext' will have the preserved 1645 case (i.e., it will not be folded). 1646 1647 `-l' 1648 turns on maximum compatibility with the original AT&T `lex' 1649 implementation. Note that this does not mean *full* 1650 compatibility. Use of this option costs a considerable amount of 1651 performance, and it cannot be used with the `-+, -f, -F, -Cf', or 1652 `-CF' options. For details on the compatibilities it provides, see 1653 the section "Incompatibilities With Lex And POSIX" below. This 1654 option also results in the name `YY_FLEX_LEX_COMPAT' being 1655 #define'd in the generated scanner. 1656 1657 `-n' 1658 is another do-nothing, deprecated option included only for POSIX 1659 compliance. 1660 1661 `-p' 1662 generates a performance report to stderr. The report consists of 1663 comments regarding features of the `flex' input file which will 1664 cause a serious loss of performance in the resulting scanner. If 1665 you give the flag twice, you will also get comments regarding 1666 features that lead to minor performance losses. 
1667 1668 Note that the use of `REJECT', `%option yylineno' and variable 1669 trailing context (see the Deficiencies / Bugs section below) 1670 entails a substantial performance penalty; use of `yymore()', the 1671 `^' operator, and the `-I' flag entail minor performance penalties. 1672 1673 `-s' 1674 causes the "default rule" (that unmatched scanner input is echoed 1675 to `stdout') to be suppressed. If the scanner encounters input 1676 that does not match any of its rules, it aborts with an error. 1677 This option is useful for finding holes in a scanner's rule set. 1678 1679 `-t' 1680 instructs `flex' to write the scanner it generates to standard 1681 output instead of `lex.yy.c'. 1682 1683 `-v' 1684 specifies that `flex' should write to `stderr' a summary of 1685 statistics regarding the scanner it generates. Most of the 1686 statistics are meaningless to the casual `flex' user, but the 1687 first line identifies the version of `flex' (same as reported by 1688 `-V'), and the next line the flags used when generating the 1689 scanner, including those that are on by default. 1690 1691 `-w' 1692 suppresses warning messages. 1693 1694 `-B' 1695 instructs `flex' to generate a *batch* scanner, the opposite of 1696 *interactive* scanners generated by `-I' (see below). In general, 1697 you use `-B' when you are *certain* that your scanner will never 1698 be used interactively, and you want to squeeze a *little* more 1699 performance out of it. If your goal is instead to squeeze out a 1700 *lot* more performance, you should be using the `-Cf' or `-CF' 1701 options (discussed below), which turn on `-B' automatically anyway. 1702 1703 `-F' 1704 specifies that the "fast" scanner table representation should be 1705 used (and stdio bypassed). This representation is about as fast 1706 as the full table representation `(-f)', and for some sets of 1707 patterns will be considerably smaller (and for others, larger). 
1708 In general, if the pattern set contains both "keywords" and a 1709 catch-all, "identifier" rule, such as in the set: 1710 1711 "case" return TOK_CASE; 1712 "switch" return TOK_SWITCH; 1713 ... 1714 "default" return TOK_DEFAULT; 1715 [a-z]+ return TOK_ID; 1716 1717 then you're better off using the full table representation. If 1718 only the "identifier" rule is present and you then use a hash 1719 table or some such to detect the keywords, you're better off using 1720 `-F'. 1721 1722 This option is equivalent to `-CFr' (see below). It cannot be 1723 used with `-+'. 1724 1725 `-I' 1726 instructs `flex' to generate an *interactive* scanner. An 1727 interactive scanner is one that only looks ahead to decide what 1728 token has been matched if it absolutely must. It turns out that 1729 always looking one extra character ahead, even if the scanner has 1730 already seen enough text to disambiguate the current token, is a 1731 bit faster than only looking ahead when necessary. But scanners 1732 that always look ahead give dreadful interactive performance; for 1733 example, when a user types a newline, it is not recognized as a 1734 newline token until they enter *another* token, which often means 1735 typing in another whole line. 1736 1737 `Flex' scanners default to *interactive* unless you use the `-Cf' 1738 or `-CF' table-compression options (see below). That's because if 1739 you're looking for high-performance you should be using one of 1740 these options, so if you didn't, `flex' assumes you'd rather trade 1741 off a bit of run-time performance for intuitive interactive 1742 behavior. Note also that you *cannot* use `-I' in conjunction 1743 with `-Cf' or `-CF'. Thus, this option is not really needed; it 1744 is on by default for all those cases in which it is allowed. 1745 1746 You can force a scanner to *not* be interactive by using `-B' (see 1747 above). 1748 1749 `-L' 1750 instructs `flex' not to generate `#line' directives. 
Without this 1751 option, `flex' peppers the generated scanner with #line directives 1752 so error messages in the actions will be correctly located with 1753 respect to either the original `flex' input file (if the errors 1754 are due to code in the input file), or `lex.yy.c' (if the errors 1755 are `flex's' fault - you should report these sorts of errors to 1756 the email address given below). 1757 1758 `-T' 1759 makes `flex' run in `trace' mode. It will generate a lot of 1760 messages to `stderr' concerning the form of the input and the 1761 resultant non-deterministic and deterministic finite automata. 1762 This option is mostly for use in maintaining `flex'. 1763 1764 `-V' 1765 prints the version number to `stdout' and exits. `--version' is a 1766 synonym for `-V'. 1767 1768 `-7' 1769 instructs `flex' to generate a 7-bit scanner, i.e., one which can 1770 only recognize 7-bit characters in its input. The advantage of 1771 using `-7' is that the scanner's tables can be up to half the size 1772 of those generated using the `-8' option (see below). The 1773 disadvantage is that such scanners often hang or crash if their 1774 input contains an 8-bit character. 1775 1776 Note, however, that unless you generate your scanner using the 1777 `-Cf' or `-CF' table compression options, use of `-7' will save 1778 only a small amount of table space, and make your scanner 1779 considerably less portable. `Flex's' default behavior is to 1780 generate an 8-bit scanner unless you use `-Cf' or `-CF', in 1781 which case `flex' defaults to generating 7-bit scanners unless 1782 your site was always configured to generate 8-bit scanners (as 1783 will often be the case with non-USA sites). You can tell whether 1784 flex generated a 7-bit or an 8-bit scanner by inspecting the flag 1785 summary in the `-v' output as described above. 

     Note that if you use `-Cfe' or `-CFe' (those table compression
     options, but also using equivalence classes as discussed below),
     flex still defaults to generating an 8-bit scanner, since usually
     with these compression options full 8-bit tables are not much more
     expensive than 7-bit tables.

`-8'
     instructs `flex' to generate an 8-bit scanner, i.e., one which can
     recognize 8-bit characters. This flag is only needed for scanners
     generated using `-Cf' or `-CF', as otherwise flex defaults to
     generating an 8-bit scanner anyway.

     See the discussion of `-7' above for flex's default behavior and
     the tradeoffs between 7-bit and 8-bit scanners.

`-+'
     specifies that you want flex to generate a C++ scanner class. See
     the section on Generating C++ Scanners below for details.

`-C[aefFmr]'
     controls the degree of table compression and, more generally,
     trade-offs between small scanners and fast scanners.

     `-Ca' ("align") instructs flex to trade off larger tables in the
     generated scanner for faster performance because the elements of
     the tables are better aligned for memory access and computation.
     On some RISC architectures, fetching and manipulating long-words
     is more efficient than with smaller-sized units such as
     shortwords. This option can double the size of the tables used by
     your scanner.

     `-Ce' directs `flex' to construct "equivalence classes", i.e.,
     sets of characters which have identical lexical properties (for
     example, if the only appearance of digits in the `flex' input is
     in the character class "[0-9]" then the digits '0', '1', ..., '9'
     will all be put in the same equivalence class).
     Equivalence classes usually give dramatic reductions in the final
     table/object file sizes (typically a factor of 2-5) and are pretty
     cheap performance-wise (one array look-up per character scanned).

     `-Cf' specifies that the *full* scanner tables should be generated
     - `flex' should not compress the tables by taking advantage of
     similar transition functions for different states.

     `-CF' specifies that the alternate fast scanner representation
     (described above under the `-F' flag) should be used. This option
     cannot be used with `-+'.

     `-Cm' directs `flex' to construct "meta-equivalence classes",
     which are sets of equivalence classes (or characters, if
     equivalence classes are not being used) that are commonly used
     together. Meta-equivalence classes are often a big win when using
     compressed tables, but they have a moderate performance impact
     (one or two "if" tests and one array look-up per character
     scanned).

     `-Cr' causes the generated scanner to *bypass* use of the standard
     I/O library (stdio) for input. Instead of calling `fread()' or
     `getc()', the scanner will use the `read()' system call, resulting
     in a performance gain which varies from system to system, but in
     general is probably negligible unless you are also using `-Cf' or
     `-CF'. Using `-Cr' can cause strange behavior if, for example,
     you read from `yyin' using stdio prior to calling the scanner
     (because the scanner will miss whatever text your previous reads
     left in the stdio input buffer).

     `-Cr' has no effect if you define `YY_INPUT' (see The Generated
     Scanner above).

     A lone `-C' specifies that the scanner tables should be compressed
     but neither equivalence classes nor meta-equivalence classes
     should be used.

     The options `-Cf' or `-CF' and `-Cm' do not make sense together -
     there is no opportunity for meta-equivalence classes if the table
     is not being compressed. Otherwise the options may be freely
     mixed, and are cumulative.

     The default setting is `-Cem', which specifies that `flex' should
     generate equivalence classes and meta-equivalence classes. This
     setting provides the highest degree of table compression. You can
     trade off faster-executing scanners at the cost of larger tables
     with the following generally being true:

          slowest & smallest
                -Cem
                -Cm
                -Ce
                -C
                -C{f,F}e
                -C{f,F}
                -C{f,F}a
          fastest & largest

     Note that scanners with the smallest tables are usually generated
     and compiled the quickest, so during development you will usually
     want to use the default, maximal compression.

     `-Cfe' is often a good compromise between speed and size for
     production scanners.

`-ooutput'
     directs flex to write the scanner to the file `output' instead of
     `lex.yy.c'. If you combine `-o' with the `-t' option, then the
     scanner is written to `stdout' but its `#line' directives (see the
     `-L' option above) refer to the file `output'.

`-Pprefix'
     changes the default `yy' prefix used by `flex' for all
     globally-visible variable and function names to instead be PREFIX.
     For example, `-Pfoo' changes the name of `yytext' to `footext'.
     It also changes the name of the default output file from
     `lex.yy.c' to `lex.foo.c'. Here are all of the names affected:

          yy_create_buffer
          yy_delete_buffer
          yy_flex_debug
          yy_init_buffer
          yy_flush_buffer
          yy_load_buffer_state
          yy_switch_to_buffer
          yyin
          yyleng
          yylex
          yylineno
          yyout
          yyrestart
          yytext
          yywrap

     (If you are using a C++ scanner, then only `yywrap' and
     `yyFlexLexer' are affected.)
     Within your scanner itself, you can still refer to the global
     variables and functions using either version of their name; but
     externally, they have the modified name.

     This option lets you easily link together multiple `flex' programs
     into the same executable. Note, though, that using this option
     also renames `yywrap()', so you now *must* either provide your own
     (appropriately-named) version of the routine for your scanner, or
     use `%option noyywrap', as linking with `-lfl' no longer provides
     one for you by default.

`-Sskeleton_file'
     overrides the default skeleton file from which `flex' constructs
     its scanners. You'll never need this option unless you are doing
     `flex' maintenance or development.

   `flex' also provides a mechanism for controlling options within the
scanner specification itself, rather than from the flex command-line.
This is done by including `%option' directives in the first section of
the scanner specification. You can specify multiple options with a
single `%option' directive, and multiple directives in the first
section of your flex input file. Most options are given simply as
names, optionally preceded by the word "no" (with no intervening
whitespace) to negate their meaning.
A number are equivalent to flex flags or their negation:

     7bit            -7 option
     8bit            -8 option
     align           -Ca option
     backup          -b option
     batch           -B option
     c++             -+ option

     caseful or
     case-sensitive  opposite of -i (default)

     case-insensitive or
     caseless        -i option

     debug           -d option
     default         opposite of -s option
     ecs             -Ce option
     fast            -F option
     full            -f option
     interactive     -I option
     lex-compat      -l option
     meta-ecs        -Cm option
     perf-report     -p option
     read            -Cr option
     stdout          -t option
     verbose         -v option
     warn            opposite of -w option
                     (use "%option nowarn" for -w)

     array           equivalent to "%array"
     pointer         equivalent to "%pointer" (default)

   Some `%option's' provide features otherwise not available:

`always-interactive'
     instructs flex to generate a scanner which always considers its
     input "interactive". Normally, on each new input file the scanner
     calls `isatty()' in an attempt to determine whether the scanner's
     input source is interactive and thus should be read a character at
     a time. When this option is used, however, no such call is made.

`main'
     directs flex to provide a default `main()' program for the
     scanner, which simply calls `yylex()'. This option implies
     `noyywrap' (see below).

`never-interactive'
     instructs flex to generate a scanner which never considers its
     input "interactive" (again, no call is made to `isatty()'). This
     is the opposite of `always-interactive'.

`stack'
     enables the use of start condition stacks (see Start Conditions
     above).

`stdinit'
     if unset (i.e., `%option nostdinit') initializes `yyin' and
     `yyout' to nil `FILE' pointers, instead of `stdin' and `stdout'.

`yylineno'
     directs `flex' to generate a scanner that maintains the number of
     the current line read from its input in the global variable
     `yylineno'. This option is implied by `%option lex-compat'.

`yywrap'
     if unset (i.e., `%option noyywrap'), makes the scanner not call
     `yywrap()' upon an end-of-file, but simply assume that there are
     no more files to scan (until the user points `yyin' at a new file
     and calls `yylex()' again).

   `flex' scans your rule actions to determine whether you use the
`REJECT' or `yymore()' features. The `reject' and `yymore' options are
available to override its decision as to whether you use the features,
either by setting them (e.g., `%option reject') to indicate the feature
is indeed used, or unsetting them to indicate it actually is not used
(e.g., `%option noyymore').

   Three options take string-delimited values, offset with '=':

     %option outfile="ABC"

is equivalent to `-oABC', and

     %option prefix="XYZ"

is equivalent to `-PXYZ'.

   Finally,

     %option yyclass="foo"

only applies when generating a C++ scanner (`-+' option). It informs
`flex' that you have derived `foo' as a subclass of `yyFlexLexer' so
`flex' will place your actions in the member function `foo::yylex()'
instead of `yyFlexLexer::yylex()'. It also generates a
`yyFlexLexer::yylex()' member function that emits a run-time error (by
invoking `yyFlexLexer::LexerError()') if called. See Generating C++
Scanners, below, for additional information.

   A number of options are available for lint purists who want to
suppress the appearance of unneeded routines in the generated scanner.
Each of the following, if unset, results in the corresponding routine
not appearing in the generated scanner:

     input, unput
     yy_push_state, yy_pop_state, yy_top_state
     yy_scan_buffer, yy_scan_bytes, yy_scan_string

(though `yy_push_state()' and friends won't appear anyway unless you
use `%option stack').


File: flex.info, Node: Performance, Next: C++, Prev: Options, Up: Top

Performance considerations
==========================

   The main design goal of `flex' is that it generate high-performance
scanners. It has been optimized for dealing well with large sets of
rules. Aside from the effects on scanner speed of the table
compression `-C' options outlined above, there are a number of
options/actions which degrade performance. These are, from most
expensive to least:

     REJECT
     %option yylineno
     arbitrary trailing context

     pattern sets that require backing up
     %array
     %option interactive
     %option always-interactive

     '^' beginning-of-line operator
     yymore()

   with the first three all being quite expensive and the last two
being quite cheap. Note also that `unput()' is implemented as a
routine call that potentially does quite a bit of work, while
`yyless()' is a quite-cheap macro; so if just putting back some excess
text you scanned, use `yyless()'.

   `REJECT' should be avoided at all costs when performance is
important. It is a particularly expensive option.

   Getting rid of backing up is messy and often may be an enormous
amount of work for a complicated scanner. In principle, one begins by
using the `-b' flag to generate a `lex.backup' file.
For example, on the input

     %%
     foo        return TOK_KEYWORD;
     foobar     return TOK_KEYWORD;

the file looks like:

     State #6 is non-accepting -
      associated rule line numbers:
            2       3
      out-transitions: [ o ]
      jam-transitions: EOF [ \001-n  p-\177 ]

     State #8 is non-accepting -
      associated rule line numbers:
            3
      out-transitions: [ a ]
      jam-transitions: EOF [ \001-`  b-\177 ]

     State #9 is non-accepting -
      associated rule line numbers:
            3
      out-transitions: [ r ]
      jam-transitions: EOF [ \001-q  s-\177 ]

     Compressed tables always back up.

   The first few lines tell us that there's a scanner state in which it
can make a transition on an 'o' but not on any other character, and
that in that state the currently scanned text does not match any rule.
The state occurs when trying to match the rules found at lines 2 and 3
in the input file. If the scanner is in that state and then reads
something other than an 'o', it will have to back up to find a rule
which is matched. With a bit of head-scratching one can see that this
must be the state it's in when it has seen "fo". When this has
happened, if anything other than another 'o' is seen, the scanner will
have to back up to simply match the 'f' (by the default rule).

   The comment regarding State #8 indicates there's a problem when
"foob" has been scanned. Indeed, on any character other than an 'a',
the scanner will have to back up to accept "foo". Similarly, the
comment for State #9 concerns when "fooba" has been scanned and an 'r'
does not follow.

   The final comment reminds us that there's no point going to all the
trouble of removing backing up from the rules unless we're using `-Cf'
or `-CF', since there's no performance gain doing so with compressed
scanners.
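
   To make the three reported states concrete, here is a minimal
sketch in C of what "backing up" costs for the rule set above. It is a
hand-written simulation, not code that `flex' generates: the function
name `match_token' and its interface are hypothetical, and a real
scanner is table-driven. The sketch follows the "foobar" path, remembers
the last accepting position, and reports how many already-scanned
characters must be handed back to the input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of flex): simulate matching one token
 * against the rules { foo, foobar } plus the default rule that matches
 * any single character.  Returns the length of the matched token and
 * stores in *backed_up the number of scanned characters that had to be
 * given back because the scan ran past the last accepting state. */
static size_t match_token(const char *s, size_t *backed_up)
{
    static const char longest[] = "foobar";
    size_t scanned = 0;      /* characters consumed along the path  */
    size_t last_accept = 0;  /* token length at last accepting state */
    size_t token_len;

    assert(s != NULL && backed_up != NULL);

    *backed_up = 0;
    if (s[0] == '\0')
        return 0;            /* nothing to scan */

    /* Follow out-transitions while the input stays on f-o-o-b-a-r. */
    while (scanned < 6 && s[scanned] == longest[scanned]) {
        scanned++;
        if (scanned == 3 || scanned == 6)  /* "foo", "foobar" accept */
            last_accept = scanned;
    }

    /* No keyword matched: fall back to the one-character default rule. */
    token_len = last_accept > 0 ? last_accept : 1;

    /* Anything scanned beyond the accepted token must be re-read. */
    if (scanned > token_len)
        *backed_up = scanned - token_len;
    return token_len;
}
```

   Under these assumptions, the input "fooba!" yields a token of length
3 with 2 characters backed up, and "fo!" yields the single character
'f' with 1 character backed up, which corresponds to the situations
described for States #9 and #6.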

   The way to remove the backing up is to add "error" rules:

     %%
     foo         return TOK_KEYWORD;
     foobar      return TOK_KEYWORD;

     fooba       |
     foob        |
     fo          {
                 /* false alarm, not really a keyword */
                 return TOK_ID;
                 }

   Eliminating backing up among a list of keywords can also be done
using a "catch-all" rule:

     %%
     foo         return TOK_KEYWORD;
     foobar      return TOK_KEYWORD;

     [a-z]+      return TOK_ID;

   This is usually the best solution when appropriate.

   Backing up messages tend to cascade. With a complicated set of
rules it's not uncommon to get hundreds of messages. If one can
decipher them, though, it often only takes a dozen or so rules to
eliminate the backing up (though it's easy to make a mistake and have
an error rule accidentally match a valid token). A possible future
`flex' feature will be to automatically add rules to eliminate backing
up.

   It's important to keep in mind that you gain the benefits of
eliminating backing up only if you eliminate *every* instance of
backing up. Leaving just one means you gain nothing.

   VARIABLE trailing context (where both the leading and trailing parts
do not have a fixed length) entails almost the same performance loss as
`REJECT' (i.e., substantial). So when possible a rule like:

     %%
     mouse|rat/(cat|dog)   run();

is better written:

     %%
     mouse/cat|dog         run();
     rat/cat|dog           run();

or as

     %%
     mouse|rat/cat         run();
     mouse|rat/dog         run();

   Note that here the special '|' action does *not* provide any
savings, and can even make things worse (see Deficiencies / Bugs below).

   Another area where the user can increase a scanner's performance
(and one that's easier to implement) arises from the fact that the
longer the tokens matched, the faster the scanner will run.
This is because with long tokens the processing of most input
characters takes place in the (short) inner scanning loop, and does not
often have to go through the additional work of setting up the scanning
environment (e.g., `yytext') for the action. Recall the scanner for C
comments:

     %x comment
     %%
             int line_num = 1;

     "/*"         BEGIN(comment);

     <comment>[^*\n]*
     <comment>"*"+[^*/\n]*
     <comment>\n             ++line_num;
     <comment>"*"+"/"        BEGIN(INITIAL);

   This could be sped up by writing it as:

     %x comment
     %%
             int line_num = 1;

     "/*"         BEGIN(comment);

     <comment>[^*\n]*
     <comment>[^*\n]*\n      ++line_num;
     <comment>"*"+[^*/\n]*
     <comment>"*"+[^*/\n]*\n ++line_num;
     <comment>"*"+"/"        BEGIN(INITIAL);

   Now instead of each newline requiring the processing of another
action, recognizing the newlines is "distributed" over the other rules
to keep the matched text as long as possible. Note that *adding* rules
does *not* slow down the scanner! The speed of the scanner is
independent of the number of rules or (modulo the considerations given
at the beginning of this section) how complicated the rules are with
regard to operators such as '*' and '|'.

   A final example in speeding up a scanner: suppose you want to scan
through a file containing identifiers and keywords, one per line and
with no other extraneous characters, and recognize all the keywords. A
natural first approach is:

     %%
     asm      |
     auto     |
     break    |
     ... etc ...
     volatile |
     while    /* it's a keyword */

     .|\n     /* it's not a keyword */

   To eliminate the back-tracking, introduce a catch-all rule:

     %%
     asm      |
     auto     |
     break    |
     ... etc ...
     volatile |
     while    /* it's a keyword */

     [a-z]+   |
     .|\n     /* it's not a keyword */

   Now, if it's guaranteed that there's exactly one word per line, then
we can reduce the total number of matches by a half by merging in the
recognition of newlines with that of the other tokens:

     %%
     asm\n    |
     auto\n   |
     break\n  |
     ... etc ...
     volatile\n |
     while\n  /* it's a keyword */

     [a-z]+\n |
     .|\n     /* it's not a keyword */

   One has to be careful here, as we have now reintroduced backing up
into the scanner. In particular, while *we* know that there will never
be any characters in the input stream other than letters or newlines,
`flex' can't figure this out, and it will plan for possibly needing to
back up when it has scanned a token like "auto" and then the next
character is something other than a newline or a letter. Previously it
would then just match the "auto" rule and be done, but now it has no
"auto" rule, only an "auto\n" rule. To eliminate the possibility of
backing up, we could either duplicate all rules but without final
newlines, or, since we never expect to encounter such an input and
therefore don't care how it's classified, we can introduce one more
catch-all rule, this one which doesn't include a newline:

     %%
     asm\n    |
     auto\n   |
     break\n  |
     ... etc ...
     volatile\n |
     while\n  /* it's a keyword */

     [a-z]+\n |
     [a-z]+   |
     .|\n     /* it's not a keyword */

   Compiled with `-Cf', this is about as fast as one can get a `flex'
scanner to go for this particular problem.

   A final note: `flex' is slow when matching NUL's, particularly when
a token contains multiple NUL's. It's best to write rules which match
*short* amounts of text if it's anticipated that the text will often
include NUL's.

   Another final note regarding performance: as mentioned above in the
section How the Input is Matched, dynamically resizing `yytext' to
accommodate huge tokens is a slow process because it presently requires
that the (huge) token be rescanned from the beginning. Thus if
performance is vital, you should attempt to match "large" quantities of
text but not "huge" quantities, where the cutoff between the two is at
about 8K characters/token.


File: flex.info, Node: C++, Next: Incompatibilities, Prev: Performance, Up: Top

Generating C++ scanners
=======================

   `flex' provides two different ways to generate scanners for use with
C++. The first way is to simply compile a scanner generated by `flex'
using a C++ compiler instead of a C compiler. You should not encounter
any compilation errors (please report any you find to the email address
given in the Author section below). You can then use C++ code in your
rule actions instead of C code. Note that the default input source for
your scanner remains `yyin', and default echoing is still done to
`yyout'. Both of these remain `FILE *' variables and not C++ `streams'.

   You can also use `flex' to generate a C++ scanner class, using the
`-+' option (or, equivalently, `%option c++'), which is automatically
specified if the name of the flex executable ends in a `+', such as
`flex++'. When using this option, flex defaults to generating the
scanner to the file `lex.yy.cc' instead of `lex.yy.c'. The generated
scanner includes the header file `FlexLexer.h', which defines the
interface to two C++ classes.

   The first class, `FlexLexer', provides an abstract base class
defining the general scanner class interface.
It provides the following member functions:

`const char* YYText()'
     returns the text of the most recently matched token, the
     equivalent of `yytext'.

`int YYLeng()'
     returns the length of the most recently matched token, the
     equivalent of `yyleng'.

`int lineno() const'
     returns the current input line number (see `%option yylineno'), or
     1 if `%option yylineno' was not used.

`void set_debug( int flag )'
     sets the debugging flag for the scanner, equivalent to assigning to
     `yy_flex_debug' (see the Options section above). Note that you
     must build the scanner using `%option debug' to include debugging
     information in it.

`int debug() const'
     returns the current setting of the debugging flag.

   Also provided are member functions equivalent to
`yy_switch_to_buffer()', `yy_create_buffer()' (though the first
argument is an `istream*' object pointer and not a `FILE*'),
`yy_flush_buffer()', `yy_delete_buffer()', and `yyrestart()' (again,
the first argument is an `istream*' object pointer).

   The second class defined in `FlexLexer.h' is `yyFlexLexer', which is
derived from `FlexLexer'. It defines the following additional member
functions:

`yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 )'
     constructs a `yyFlexLexer' object using the given streams for
     input and output. If not specified, the streams default to `cin'
     and `cout', respectively.

`virtual int yylex()'
     performs the same role as `yylex()' does for ordinary flex
     scanners: it scans the input stream, consuming tokens, until a
     rule's action returns a value. If you derive a subclass S from
     `yyFlexLexer' and want to access the member functions and
     variables of S inside `yylex()', then you need to use `%option
     yyclass="S"' to inform `flex' that you will be using that subclass
     instead of `yyFlexLexer'.
     In this case, rather than generating `yyFlexLexer::yylex()',
     `flex' generates `S::yylex()' (and also generates a dummy
     `yyFlexLexer::yylex()' that calls `yyFlexLexer::LexerError()' if
     called).

`virtual void switch_streams(istream* new_in = 0, ostream* new_out = 0)'
     reassigns `yyin' to `new_in' (if non-nil) and `yyout' to `new_out'
     (ditto), deleting the previous input buffer if `yyin' is
     reassigned.

`int yylex( istream* new_in = 0, ostream* new_out = 0 )'
     first switches the input streams via `switch_streams( new_in,
     new_out )' and then returns the value of `yylex()'.

   In addition, `yyFlexLexer' defines the following protected virtual
functions which you can redefine in derived classes to tailor the
scanner:

`virtual int LexerInput( char* buf, int max_size )'
     reads up to `max_size' characters into BUF and returns the number
     of characters read. To indicate end-of-input, return 0
     characters. Note that "interactive" scanners (see the `-B' and
     `-I' flags) define the macro `YY_INTERACTIVE'. If you redefine
     `LexerInput()' and need to take different actions depending on
     whether or not the scanner might be scanning an interactive input
     source, you can test for the presence of this name via `#ifdef'.

`virtual void LexerOutput( const char* buf, int size )'
     writes out SIZE characters from the buffer BUF, which, while
     NUL-terminated, may also contain "internal" NUL's if the scanner's
     rules can match text with NUL's in them.

`virtual void LexerError( const char* msg )'
     reports a fatal error message. The default version of this
     function writes the message to the stream `cerr' and exits.

   Note that a `yyFlexLexer' object contains its *entire* scanning
state. Thus you can use such objects to create reentrant scanners.
You can instantiate multiple instances of the same `yyFlexLexer' class,
and you can also combine multiple C++ scanner classes together in the
same program using the `-P' option discussed above. Finally, note that
the `%array' feature is not available to C++ scanner classes; you must
use `%pointer' (the default).

   Here is an example of a simple C++ scanner:

         // An example of using the flex C++ scanner class.

     %{
     int mylineno = 0;
     %}

     string  \"[^\n"]+\"

     ws      [ \t]+

     alpha   [A-Za-z]
     dig     [0-9]
     name    ({alpha}|{dig}|\$)({alpha}|{dig}|[_.\-/$])*
     num1    [-+]?{dig}+\.?([eE][-+]?{dig}+)?
     num2    [-+]?{dig}*\.{dig}+([eE][-+]?{dig}+)?
     number  {num1}|{num2}

     %%

     {ws}    /* skip blanks and tabs */

     "/*"    {
             int c;

             while((c = yyinput()) != 0)
                 {
                 if(c == '\n')
                     ++mylineno;

                 else if(c == '*')
                     {
                     if((c = yyinput()) == '/')
                         break;
                     else
                         unput(c);
                     }
                 }
             }

     {number}  cout << "number " << YYText() << '\n';

     \n        mylineno++;

     {name}    cout << "name " << YYText() << '\n';

     {string}  cout << "string " << YYText() << '\n';

     %%

     int main( int /* argc */, char** /* argv */ )
         {
         FlexLexer* lexer = new yyFlexLexer;
         while(lexer->yylex() != 0)
             ;
         return 0;
         }

   If you want to create multiple (different) lexer classes, you use
the `-P' flag (or the `prefix=' option) to rename each `yyFlexLexer' to
some other `xxFlexLexer'.
You then can include `<FlexLexer.h>' in your other sources once per
lexer class, first renaming `yyFlexLexer' as follows:

     #undef yyFlexLexer
     #define yyFlexLexer xxFlexLexer
     #include <FlexLexer.h>

     #undef yyFlexLexer
     #define yyFlexLexer zzFlexLexer
     #include <FlexLexer.h>

   if, for example, you used `%option prefix="xx"' for one of your
scanners and `%option prefix="zz"' for the other.

   IMPORTANT: the present form of the scanning class is *experimental*
and may change considerably between major releases.


File: flex.info, Node: Incompatibilities, Next: Diagnostics, Prev: C++, Up: Top

Incompatibilities with `lex' and POSIX
======================================

   `flex' is a rewrite of the AT&T Unix `lex' tool (the two
implementations do not share any code, though), with some extensions
and incompatibilities, both of which are of concern to those who wish
to write scanners acceptable to either implementation. Flex is fully
compliant with the POSIX `lex' specification, except that when using
`%pointer' (the default), a call to `unput()' destroys the contents of
`yytext', which is counter to the POSIX specification.

   In this section we discuss all of the known areas of incompatibility
between flex, AT&T lex, and the POSIX specification.

   `flex's' `-l' option turns on maximum compatibility with the
original AT&T `lex' implementation, at the cost of a major loss in the
generated scanner's performance. We note below which incompatibilities
can be overcome using the `-l' option.

   `flex' is fully compatible with `lex' with the following exceptions:

   - The undocumented `lex' scanner internal variable `yylineno' is not
     supported unless `-l' or `%option yylineno' is used.
     `yylineno' should be maintained on a per-buffer basis, rather than
     a per-scanner (single global variable) basis. `yylineno' is not
     part of the POSIX specification.

   - The `input()' routine is not redefinable, though it may be called
     to read characters following whatever has been matched by a rule.
     If `input()' encounters an end-of-file the normal `yywrap()'
     processing is done. A "real" end-of-file is returned by `input()'
     as `EOF'.

     Input is instead controlled by defining the `YY_INPUT' macro.

     The `flex' restriction that `input()' cannot be redefined is in
     accordance with the POSIX specification, which simply does not
     specify any way of controlling the scanner's input other than by
     making an initial assignment to `yyin'.

   - The `unput()' routine is not redefinable. This restriction is in
     accordance with POSIX.

   - `flex' scanners are not as reentrant as `lex' scanners. In
     particular, if you have an interactive scanner and an interrupt
     handler which long-jumps out of the scanner, and the scanner is
     subsequently called again, you may get the following message:

          fatal flex scanner internal error--end of buffer missed

     To reenter the scanner, first use

          yyrestart( yyin );

     Note that this call will throw away any buffered input; usually
     this isn't a problem with an interactive scanner.

     Also note that flex C++ scanner classes *are* reentrant, so if
     using C++ is an option for you, you should use them instead. See
     "Generating C++ Scanners" above for details.

   - `output()' is not supported. Output from the `ECHO' macro is done
     to the file-pointer `yyout' (default `stdout').

     `output()' is not part of the POSIX specification.

   - `lex' does not support exclusive start conditions (%x), though
     they are in the POSIX specification.

   - When definitions are expanded, `flex' encloses them in
     parentheses. With lex, the following:

          NAME    [A-Z][A-Z0-9]*
          %%
          foo{NAME}?      printf( "Found it\n" );
          %%

     will not match the string "foo" because when the macro is expanded
     the rule is equivalent to "foo[A-Z][A-Z0-9]*?" and the precedence
     is such that the '?' is associated with "[A-Z0-9]*". With `flex',
     the rule will be expanded to "foo([A-Z][A-Z0-9]*)?" and so the
     string "foo" will match.

     Note that if the definition begins with `^' or ends with `$' then
     it is *not* expanded with parentheses, to allow these operators to
     appear in definitions without losing their special meanings. But
     the `<s>, /', and `<<EOF>>' operators cannot be used in a `flex'
     definition.

     Using `-l' results in the `lex' behavior of no parentheses around
     the definition.

     The POSIX specification is that the definition be enclosed in
     parentheses.

   - Some implementations of `lex' allow a rule's action to begin on a
     separate line, if the rule's pattern has trailing whitespace:

          %%
          foo|bar<space here>
            { foobar_action(); }

     `flex' does not support this feature.

   - The `lex' `%r' (generate a Ratfor scanner) option is not
     supported. It is not part of the POSIX specification.

   - After a call to `unput()', `yytext' is undefined until the next
     token is matched, unless the scanner was built using `%array'.
     This is not the case with `lex' or the POSIX specification. The
     `-l' option does away with this incompatibility.

   - The precedence of the `{}' (numeric range) operator is different.
     `lex' interprets "abc{1,3}" as "match one, two, or three
     occurrences of 'abc'", whereas `flex' interprets it as "match 'ab'
     followed by one, two, or three occurrences of 'c'". The latter is
     in agreement with the POSIX specification.

   - The precedence of the `^' operator is different. `lex' interprets
     "^foo|bar" as "match either 'foo' at the beginning of a line, or
     'bar' anywhere", whereas `flex' interprets it as "match either
     'foo' or 'bar' if they come at the beginning of a line". The
     latter is in agreement with the POSIX specification.

   - The special table-size declarations such as `%a' supported by
     `lex' are not required by `flex' scanners; `flex' ignores them.

   - The name FLEX_SCANNER is #define'd so scanners may be written for
     use with either `flex' or `lex'. Scanners also include
     `YY_FLEX_MAJOR_VERSION' and `YY_FLEX_MINOR_VERSION' indicating
     which version of `flex' generated the scanner (for example, for the
     2.5 release, these defines would be 2 and 5 respectively).

   The following `flex' features are not included in `lex' or the POSIX
specification:

     C++ scanners
     %option
     start condition scopes
     start condition stacks
     interactive/non-interactive scanners
     yy_scan_string() and friends
     yyterminate()
     yy_set_interactive()
     yy_set_bol()
     YY_AT_BOL()
     <<EOF>>
     <*>
     YY_DECL
     YY_START
     YY_USER_ACTION
     YY_USER_INIT
     #line directives
     %{}'s around actions
     multiple actions on a line

plus almost all of the flex flags. The last feature in the list refers
to the fact that with `flex' you can put multiple actions on the same
line, separated with semicolons, while with `lex', the following

     foo    handle_foo(); ++num_foos_seen;

is (rather surprisingly) truncated to

     foo    handle_foo();

   `flex' does not truncate the action. Actions that are not enclosed
in braces are simply terminated at the end of the line.
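
   The `{}' precedence item above can be checked against the POSIX
extended regular expression rules without running `flex' at all. The
sketch below is only an illustration: it uses the C library's
`regcomp()' and `regexec()' as a stand-in for a scanner, and the helper
name `ere_full_match' is hypothetical, not part of `flex' or `lex'. It
shows that an interval binds to the single preceding character, and
that parentheses are needed to repeat the whole string.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper, not part of flex: return 1 if `text` matches the
 * POSIX extended regular expression `pattern`, 0 if it does not, and -1
 * if the pattern fails to compile. Callers anchor the pattern with ^
 * and $ when they want a whole-string match. */
int ere_full_match(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);

    return rc == 0 ? 1 : 0;
}
```

   Under these assumptions, `ere_full_match("^abc{1,3}$", "abccc")'
succeeds while the same pattern fails on "abcabc"; writing the pattern
as "^(abc){1,3}$" gives the grouping that `lex' applies by default.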


File: flex.info, Node: Diagnostics, Next: Files, Prev: Incompatibilities, Up: Top

Diagnostics
===========

`warning, rule cannot be matched'
     indicates that the given rule cannot be matched because it follows
     other rules that will always match the same text as it. For
     example, in the following "foo" cannot be matched because it comes
     after an identifier "catch-all" rule:

          [a-z]+    got_identifier();
          foo       got_foo();

     Using `REJECT' in a scanner suppresses this warning.

`warning, -s option given but default rule can be matched'
     means that it is possible (perhaps only in a particular start
     condition) that the default rule (match any single character) is
     the only one that will match a particular input. Since `-s' was
     given, presumably this is not intended.

`reject_used_but_not_detected undefined'
`yymore_used_but_not_detected undefined'
     These errors can occur at compile time. They indicate that the
     scanner uses `REJECT' or `yymore()' but that `flex' failed to
     notice the fact, meaning that `flex' scanned the first two sections
     looking for occurrences of these actions and failed to find any,
     but somehow you snuck some in (via a #include file, for example).
     Use `%option reject' or `%option yymore' to indicate to flex that
     you really do use these features.

`flex scanner jammed'
     a scanner compiled with `-s' has encountered an input string which
     wasn't matched by any of its rules. This error can also occur due
     to internal problems.

`token too large, exceeds YYLMAX'
     your scanner uses `%array' and one of its rules matched a string
     longer than the `YYLMAX' constant (8K bytes by default). You
     can increase the value by #define'ing `YYLMAX' in the definitions
     section of your `flex' input.

`scanner requires -8 flag to use the character 'X''
     Your scanner specification includes recognizing the 8-bit
     character X and you did not specify the -8 flag, and your scanner
     defaulted to 7-bit because you used the `-Cf' or `-CF' table
     compression options. See the discussion of the `-7' flag for
     details.

`flex scanner push-back overflow'
     you used `unput()' to push back so much text that the scanner's
     buffer could not hold both the pushed-back text and the current
     token in `yytext'. Ideally the scanner should dynamically resize
     the buffer in this case, but at present it does not.

`input buffer overflow, can't enlarge buffer because scanner uses REJECT'
     the scanner was working on matching an extremely large token and
     needed to expand the input buffer. This doesn't work with
     scanners that use `REJECT'.

`fatal flex scanner internal error--end of buffer missed'
     This can occur in a scanner which is reentered after a long-jump
     has jumped out of (or over) the scanner's activation frame. Before
     reentering the scanner, use:

          yyrestart( yyin );

     or, as noted above, switch to using the C++ scanner class.

`too many start conditions in <> construct!'
     you listed more start conditions in a <> construct than exist (so
     you must have listed at least one of them twice).


File: flex.info, Node: Files, Next: Deficiencies, Prev: Diagnostics, Up: Top

Files
=====

`-lfl'
     library with which scanners must be linked.

`lex.yy.c'
     generated scanner (called `lexyy.c' on some systems).

`lex.yy.cc'
     generated C++ scanner class, when using `-+'.

`<FlexLexer.h>'
     header file defining the C++ scanner base class, `FlexLexer', and
     its derived class, `yyFlexLexer'.

`flex.skl'
     skeleton scanner.
     This file is only used when building flex, not
     when flex executes.

`lex.backup'
     backing-up information for `-b' flag (called `lex.bck' on some
     systems).


File: flex.info, Node: Deficiencies, Next: See also, Prev: Files, Up: Top

Deficiencies / Bugs
===================

   Some trailing context patterns cannot be properly matched and
generate warning messages ("dangerous trailing context"). These are
patterns where the ending of the first part of the rule matches the
beginning of the second part, such as "zx*/xy*", where the 'x*' matches
the 'x' at the beginning of the trailing context. (Note that the POSIX
draft states that the text matched by such patterns is undefined.)

   For some trailing context rules, parts which are actually
fixed-length are not recognized as such, leading to the above-mentioned
performance loss. In particular, parts using '|' or {n} (such as
"foo{3}") are always considered variable-length.

   Combining trailing context with the special '|' action can result in
*fixed* trailing context being turned into the more expensive *variable*
trailing context. This happens, for example, in the following:

     %%
     abc      |
     xyz/def

   Use of `unput()' invalidates `yytext' and `yyleng', unless the
`%array' directive or the `-l' option has been used.

   Pattern-matching of NUL's is substantially slower than matching
other characters.

   Dynamic resizing of the input buffer is slow, as it entails
rescanning all the text matched so far by the current (generally huge)
token.

   Due to both buffering of input and read-ahead, you cannot intermix
calls to <stdio.h> routines, such as, for example, `getchar()', with
`flex' rules and expect it to work. Call `input()' instead.
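
   The effect of read-ahead on <stdio.h> calls can be seen without a
scanner. The sketch below is a simulation under stated assumptions, not
the code `flex' generates: a single `fread()' stands in for the scanner
filling its buffer, and the helper name `char_seen_after_block_read' is
hypothetical. It reports which character a later `getc()' on the same
stream would return.

```c
#include <stdio.h>

/* Hypothetical helper, not part of flex: write `input` to a temporary
 * file, read `block_size` bytes in one block (standing in for a scanner
 * filling its buffer), then return what a direct getc() sees next.
 * Returns EOF at end of input, or EOF - 1 if the file cannot be set up. */
int char_seen_after_block_read(const char *input, size_t block_size)
{
    char buffer[64];
    FILE *fp = tmpfile();
    int c;

    if (fp == NULL)
        return EOF - 1;
    if (fputs(input, fp) == EOF) {
        fclose(fp);
        return EOF - 1;
    }
    rewind(fp);

    /* Clamp the simulated read-ahead to the local buffer size. */
    if (block_size > sizeof buffer)
        block_size = sizeof buffer;
    (void)fread(buffer, 1, block_size, fp);

    /* The stream position is now past everything the block read took,
     * regardless of how much of it a rule has matched so far. */
    c = getc(fp);
    fclose(fp);
    return c;
}
```

   With the input "foo bar\n", a 3-byte block read leaves the space for
`getc()', but a 64-byte block read leaves nothing: the text after "foo"
is already in the simulated buffer, which is why a scanner action has to
fetch it with `input()'.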

   The total number of table entries listed by the `-v' flag excludes
the number of table entries needed to determine what rule has been
matched. The number of entries is equal to the number of DFA states if
the scanner does not use `REJECT', and somewhat greater than the number
of states if it does.

   `REJECT' cannot be used with the `-f' or `-F' options.

   The `flex' internal algorithms need documentation.


File: flex.info, Node: See also, Next: Author, Prev: Deficiencies, Up: Top

See also
========

   `lex'(1), `yacc'(1), `sed'(1), `awk'(1).

   John Levine, Tony Mason, and Doug Brown: Lex & Yacc; O'Reilly and
Associates. Be sure to get the 2nd edition.

   M. E. Lesk and E. Schmidt, LEX - Lexical Analyzer Generator.

   Alfred Aho, Ravi Sethi and Jeffrey Ullman: Compilers: Principles,
Techniques and Tools; Addison-Wesley (1986). Describes the
pattern-matching techniques used by `flex' (deterministic finite
automata).


File: flex.info, Node: Author, Prev: See also, Up: Top

Author
======

   Vern Paxson, with the help of many ideas and much inspiration from
Van Jacobson. Original version by Jef Poskanzer. The fast table
representation is a partial implementation of a design done by Van
Jacobson. The implementation was done by Kevin Gong and Vern Paxson.

   Thanks to the many `flex' beta-testers, feedbackers, and
contributors, especially Francois Pinard, Casey Leedom, Stan Adermann,
Terry Allen, David Barker-Plummer, John Basrai, Nelson H.F. Beebe,
`benson@odi.com', Karl Berry, Peter A. Bigot, Simon Blanchard, Keith
Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick Christopher, Brian
Clapper, J.T. Conklin, Jason Coughlin, Bill Cox, Nick Cropper, Dave
Curtis, Scott David Daniels, Chris G.
Demetriou, Theo Deraadt, Mike
Donahue, Chuck Doucette, Tom Epperly, Leo Eskin, Chris Faylor, Chris
Flatters, Jon Forrest, Joe Gayda, Kaveh R. Ghazi, Eric Goldman,
Christopher M. Gould, Ulrich Grepel, Peer Griebel, Jan Hajic, Charles
Hemphill, NORO Hideo, Jarkko Hietaniemi, Scott Hofmann, Jeff Honig,
Dana Hudes, Eric Hughes, John Interrante, Ceriel Jacobs, Michal
Jaegermann, Sakari Jalovaara, Jeffrey R. Jones, Henry Juengst, Klaus
Kaempf, Jonathan I. Kamens, Terrence O Kane, Amir Katz,
`ken@ken.hilco.com', Kevin B. Kenny, Steve Kirsch, Winfried Koenig,
Marq Kole, Ronald Lamprecht, Greg Lee, Rohan Lenard, Craig Leres, John
Levine, Steve Liddle, Mike Long, Mohamed el Lozy, Brian Madsen, Malte,
Joe Marshall, Bengt Martensson, Chris Metcalf, Luke Mewburn, Jim
Meyering, R. Alexander Milowski, Erik Naggum, G.T. Nicol, Landon Noll,
James Nordby, Marc Nozell, Richard Ohnemus, Karsten Pahnke, Sven Panne,
Roland Pesch, Walter Pelissero, Gaumond Pierre, Esmond Pitt, Jef
Poskanzer, Joe Rahmeh, Jarmo Raiha, Frederic Raimbault, Pat Rankin,
Rick Richardson, Kevin Rodgers, Kai Uwe Rommel, Jim Roskind, Alberto
Santini, Andreas Scherer, Darrell Schiebel, Raf Schietekat, Doug
Schmidt, Philippe Schnoebelen, Andreas Schwab, Alex Siegel, Eckehard
Stolz, Jan-Erik Strvmquist, Mike Stump, Paul Stuart, Dave Tallman, Ian
Lance Taylor, Chris Thewalt, Richard M. Timoney, Jodi Tsai, Paul
Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams, Ken
Yap, Ron Zellar, Nathan Zelle, David Zuhn, and those whose names have
slipped my marginal mail-archiving skills but whose contributions are
appreciated all the same.

   Thanks to Keith Bostic, Jon Forrest, Noah Friedman, John Gilmore,
Craig Leres, John Levine, Bob Mulcahy, G.T. Nicol, Francois Pinard,
Rich Salz, and Richard Stallman for help with various distribution
headaches.

   Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
to Benson Margulies and Fred Burke for C++ support; to Kent Williams
and Tom Epperly for C++ class support; to Ove Ewerlid for support of
NUL's; and to Eric Hughes for support of multiple buffers.

   This work was primarily done when I was with the Real Time Systems
Group at the Lawrence Berkeley Laboratory in Berkeley, CA. Many thanks
to all there for the support I received.

   Send comments to `vern@ee.lbl.gov'.



Tag Table:
Node: Top1430
Node: Name2808
Node: Synopsis2933
Node: Overview3145
Node: Description4986
Node: Examples5748
Node: Format8896
Node: Patterns11637
Node: Matching18138
Node: Actions21438
Node: Generated scanner30560
Node: Start conditions34988
Node: Multiple buffers45069
Node: End-of-file rules50975
Node: Miscellaneous52508
Node: User variables55279
Node: YACC interface57651
Node: Options58542
Node: Performance78234
Node: C++87532
Node: Incompatibilities94993
Node: Diagnostics101853
Node: Files105094
Node: Deficiencies105715
Node: See also107684
Node: Author108216

End Tag Table