WHATWG

HTML 5

Draft Recommendation 7 February 2009

8.2.4 Tokenization

Implementations must act as if they used the following state machine to tokenise HTML. The state machine must start in the data state. Most states consume a single character, which may have various side-effects, and either switch the state machine to a new state to reconsume the same character, or switch it to a new state (to consume the next character), or repeat the same state (to consume the next character). Some states have more complicated behavior and can consume several characters before switching to another state.

The exact behavior of certain states depends on a content model flag that is set after certain tokens are emitted. The flag has several states: PCDATA, RCDATA, CDATA, and PLAINTEXT. Initially it must be in the PCDATA state. In the RCDATA and CDATA states, a further escape flag is used to control the behavior of the tokeniser. It is either true or false, and initially must be set to the false state. The insertion mode and the stack of open elements also affect tokenization.

The output of the tokenization step is a series of zero or more of the following tokens: DOCTYPE, start tag, end tag, comment, character, end-of-file. DOCTYPE tokens have a name, a public identifier, a system identifier, and a force-quirks flag. When a DOCTYPE token is created, its name, public identifier, and system identifier must be marked as missing (which is a distinct state from the empty string), and the force-quirks flag must be set to off (its other state is on). Start and end tag tokens have a tag name, a self-closing flag, and a list of attributes, each of which has a name and a value.
When a start or end tag token is created, its self-closing flag must be unset (its other state is that it be set), and its attributes list must be empty. Comment and character tokens have data.

When a token is emitted, it must immediately be handled by the tree construction stage. The tree construction stage can affect the state of the content model flag, and can insert additional characters into the stream. (For example, the script element can result in scripts executing and using the dynamic markup insertion APIs to insert characters into the stream being tokenised.)

When a start tag token is emitted with its self-closing flag set, if the flag is not acknowledged when it is processed by the tree construction stage, that is a parse error.

When an end tag token is emitted, the content model flag must be switched to the PCDATA state.

When an end tag token is emitted with attributes, that is a parse error.

When an end tag token is emitted with its self-closing flag set, that is a parse error.

Before each step of the tokeniser, the user agent must first check the parser pause flag. If it is true, then the tokeniser must abort the processing of any nested invocations of the tokeniser, yielding control back to the caller. If it is false, then the user agent may then check to see if either one of the scripts in the list of scripts that will execute as soon as possible or the first script in the list of scripts that will execute asynchronously, has completed loading. If one has, then it must be executed and removed from its list.

The tokeniser state machine consists of the states defined in the following subsections.

8.2.4.1 Data state

Consume the next input character:

U+0026 AMPERSAND (&)
    When the content model flag is set to one of the PCDATA or RCDATA states and the escape flag is false: switch to the character reference data state.
    Otherwise: treat it as per the "anything else" entry below.

U+002D HYPHEN-MINUS (-)
    If the content model flag is set to either the RCDATA state or the CDATA state, and the escape flag is false, and there are at least three characters before this one in the input stream, and the last four characters in the input stream, including this one, are U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+002D HYPHEN-MINUS, and U+002D HYPHEN-MINUS ("<!--"), then set the escape flag to true.

    In any case, emit the input character as a character token. Stay in the data state.

U+003C LESS-THAN SIGN (<)
    When the content model flag is set to the PCDATA state: switch to the tag open state.
    When the content model flag is set to either the RCDATA state or the CDATA state, and the escape flag is false: switch to the tag open state.
    Otherwise: treat it as per the "anything else" entry below.

U+003E GREATER-THAN SIGN (>)
    If the content model flag is set to either the RCDATA state or the CDATA state, and the escape flag is true, and the last three characters in the input stream including this one are U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN ("-->"), set the escape flag to false.

    In any case, emit the input character as a character token. Stay in the data state.

EOF
    Emit an end-of-file token.

Anything else
    Emit the input character as a character token. Stay in the data state.

8.2.4.2 Character reference data state

(This cannot happen if the content model flag is set to the CDATA state.)

Attempt to consume a character reference, with no additional allowed character.

If nothing is returned, emit a U+0026 AMPERSAND character token.

Otherwise, emit the character token that was returned.

Finally, switch to the data state.
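
The way the data state consults the content model flag and the escape flag can be illustrated with a small sketch. This is not part of the specification: the names ContentModel and run_data_state are hypothetical, the input is assumed to be a complete in-memory string, and the RCDATA/CDATA branch of the tag open state (8.2.4.3) is inlined so that the escape flag can be followed through a "<!--" ... "-->" run. The function stops as soon as some other state would take over.

```python
from enum import Enum


class ContentModel(Enum):
    PCDATA = "PCDATA"
    RCDATA = "RCDATA"
    CDATA = "CDATA"
    PLAINTEXT = "PLAINTEXT"


def run_data_state(text, model, escape=False):
    """Run the data state over `text` until another state would take over.

    Returns (emitted, next_state, escape, pos): the character tokens emitted,
    the name of the state that would run next ("end-of-file" if the input ran
    out), the final escape flag, and the index of the next unconsumed character.
    """
    raw = model in (ContentModel.RCDATA, ContentModel.CDATA)
    emitted = []
    i = 0
    while i < len(text):
        c = text[i]
        if c == "&" and not escape and model in (ContentModel.PCDATA, ContentModel.RCDATA):
            return emitted, "character reference data state", escape, i + 1
        if c == "<" and model is ContentModel.PCDATA:
            return emitted, "tag open state", escape, i + 1
        if c == "<" and raw and not escape:
            # Tag open state, RCDATA/CDATA branch, inlined for this sketch.
            if text[i + 1:i + 2] == "/":
                return emitted, "close tag open state", escape, i + 2
            emitted.append("<")  # not a close tag: "<" is ordinary text
            i += 1               # the following character is reconsumed here
            continue
        if c == "-" and raw and not escape and i >= 3 and text[i - 3:i + 1] == "<!--":
            escape = True
        elif c == ">" and raw and escape and i >= 2 and text[i - 2:i + 1] == "-->":
            escape = False
        emitted.append(c)        # "in any case" and "anything else": emit it
        i += 1
    return emitted, "end-of-file", escape, i
```

For example, with the CDATA content model, the text x<!--</b>-->y</b> is emitted as character tokens up to and including the y; the first </b> is skipped over because the escape flag is true at that point, and only the second one hands control to the close tag open state.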

8.2.4.3 Tag open state

The behavior of this state depends on the content model flag.

If the content model flag is set to the RCDATA or CDATA states
    Consume the next input character. If it is a U+002F SOLIDUS (/) character, switch to the close tag open state. Otherwise, emit a U+003C LESS-THAN SIGN character token and reconsume the current input character in the data state.

If the content model flag is set to the PCDATA state
    Consume the next input character:

    U+0021 EXCLAMATION MARK (!)
        Switch to the markup declaration open state.

    U+002F SOLIDUS (/)
        Switch to the close tag open state.

    U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
        Create a new start tag token, set its tag name to the lowercase version of the input character (add 0x0020 to the character's code point), then switch to the tag name state. (Don't emit the token yet; further details will be filled in before it is emitted.)

    U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
        Create a new start tag token, set its tag name to the input character, then switch to the tag name state. (Don't emit the token yet; further details will be filled in before it is emitted.)

    U+003E GREATER-THAN SIGN (>)
        Parse error. Emit a U+003C LESS-THAN SIGN character token and a U+003E GREATER-THAN SIGN character token. Switch to the data state.

    U+003F QUESTION MARK (?)
        Parse error. Switch to the bogus comment state.

    Anything else
        Parse error. Emit a U+003C LESS-THAN SIGN character token and reconsume the current input character in the data state.
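
The PCDATA branch of the tag open state is a dispatch on a single character, which can be sketched as a pure function. The names Step and tag_open_pcdata are hypothetical and exist only for illustration; EOF is represented as None, on the assumption that it falls under "anything else".

```python
from typing import NamedTuple, Optional


class Step(NamedTuple):
    next_state: str
    emit: str = ""                  # character tokens to emit, in order
    tag_name: Optional[str] = None  # first letter of a newly created start tag token
    parse_error: bool = False
    reconsume: bool = False         # True: the next state re-reads the same character


def tag_open_pcdata(c: Optional[str]) -> Step:
    """One step of the tag open state in PCDATA; `c` is None at EOF."""
    if c == "!":
        return Step("markup declaration open state")
    if c == "/":
        return Step("close tag open state")
    if c is not None and "A" <= c <= "Z":
        # Lowercase by adding 0x0020 to the code point, as the text describes.
        return Step("tag name state", tag_name=chr(ord(c) + 0x0020))
    if c is not None and "a" <= c <= "z":
        return Step("tag name state", tag_name=c)
    if c == ">":
        return Step("data state", emit="<>", parse_error=True)
    if c == "?":
        return Step("bogus comment state", parse_error=True)
    # Anything else: "<" was plain text after all; re-read `c` in the data state.
    return Step("data state", emit="<", parse_error=True, reconsume=True)
```

Returning a description of the step, rather than mutating tokeniser state directly, keeps the sketch easy to check against each entry of the list above.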

8.2.4.4 Close tag open state

If the content model flag is set to the RCDATA or CDATA states but no start tag token has ever been emitted by this instance of the tokeniser (fragment case), or, if the content model flag is set to the RCDATA or CDATA states and the next few characters do not match the tag name of the last start tag token emitted (compared in an ASCII case-insensitive manner), or if they do but they are not immediately followed by one of the following characters:
  * U+0009 CHARACTER TABULATION
  * U+000A LINE FEED (LF)
  * U+000C FORM FEED (FF)
  * U+0020 SPACE
  * U+003E GREATER-THAN SIGN (>)
  * U+002F SOLIDUS (/)
  * EOF

...then emit a U+003C LESS-THAN SIGN character token, a U+002F SOLIDUS character token, and switch to the data state to process the next input character.

Otherwise, if the content model flag is set to the PCDATA state, or if the next few characters do match that tag name, consume the next input character:

U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
    Create a new end tag token, set its tag name to the lowercase version of the input character (add 0x0020 to the character's code point), then switch to the tag name state. (Don't emit the token yet; further details will be filled in before it is emitted.)

U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
    Create a new end tag token, set its tag name to the input character, then switch to the tag name state. (Don't emit the token yet; further details will be filled in before it is emitted.)

U+003E GREATER-THAN SIGN (>)
    Parse error. Switch to the data state.

EOF
    Parse error. Emit a U+003C LESS-THAN SIGN character token and a U+002F SOLIDUS character token. Reconsume the EOF character in the data state.

Anything else
    Parse error. Switch to the bogus comment state.
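
The look-ahead test that the close tag open state applies in the RCDATA and CDATA states can be sketched as a predicate. The names ascii_lower and closes_last_start_tag are hypothetical. The sketch assumes that `rest` is the unconsumed input immediately after the "</" and that running out of input counts as EOF.

```python
from typing import Optional

# Characters that may immediately follow a matching tag name (EOF is handled
# separately, as an empty look-ahead).
TERMINATORS = frozenset("\t\n\f >/")


def ascii_lower(s: str) -> str:
    """Lowercase only A-Z, leaving every other character untouched."""
    return "".join(chr(ord(ch) + 0x20) if "A" <= ch <= "Z" else ch for ch in s)


def closes_last_start_tag(rest: str, last_start_tag: Optional[str]) -> bool:
    """True if `rest` begins an end tag for the last start tag token emitted."""
    if last_start_tag is None:
        return False  # fragment case: no start tag token has ever been emitted
    n = len(last_start_tag)
    if ascii_lower(rest[:n]) != ascii_lower(last_start_tag):
        return False
    following = rest[n:n + 1]
    return following == "" or following in TERMINATORS
```

When the predicate is false, the tokeniser emits "<" and "/" as character tokens and returns to the data state; when it is true, the character dispatch listed above runs. A dedicated ascii_lower is used instead of str.lower() because the comparison is ASCII case-insensitive only.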

8.2.4.5 Tag name state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Switch to the before attribute name state.

U+002F SOLIDUS (/)
    Switch to the self-closing start tag state.

U+003E GREATER-THAN SIGN (>)
    Emit the current tag token. Switch to the data state.

U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
    Append the lowercase version of the current input character (add 0x0020 to the character's code point) to the current tag token's tag name. Stay in the tag name state.

EOF
    Parse error. Emit the current tag token. Reconsume the EOF character in the data state.

Anything else
    Append the current input character to the current tag token's tag name. Stay in the tag name state.

8.2.4.6 Before attribute name state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Stay in the before attribute name state.

U+002F SOLIDUS (/)
    Switch to the self-closing start tag state.

U+003E GREATER-THAN SIGN (>)
    Emit the current tag token. Switch to the data state.

U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
    Start a new attribute in the current tag token. Set that attribute's name to the lowercase version of the current input character (add 0x0020 to the character's code point), and its value to the empty string. Switch to the attribute name state.

U+0022 QUOTATION MARK (")
U+0027 APOSTROPHE (')
U+003D EQUALS SIGN (=)
    Parse error. Treat it as per the "anything else" entry below.

EOF
    Parse error. Emit the current tag token. Reconsume the EOF character in the data state.

Anything else
    Start a new attribute in the current tag token.
    Set that attribute's name to the current input character, and its value to the empty string. Switch to the attribute name state.

8.2.4.7 Attribute name state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Switch to the after attribute name state.

U+002F SOLIDUS (/)
    Switch to the self-closing start tag state.

U+003D EQUALS SIGN (=)
    Switch to the before attribute value state.

U+003E GREATER-THAN SIGN (>)
    Emit the current tag token. Switch to the data state.

U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
    Append the lowercase version of the current input character (add 0x0020 to the character's code point) to the current attribute's name. Stay in the attribute name state.

U+0022 QUOTATION MARK (")
U+0027 APOSTROPHE (')
    Parse error. Treat it as per the "anything else" entry below.

EOF
    Parse error. Emit the current tag token. Reconsume the EOF character in the data state.

Anything else
    Append the current input character to the current attribute's name. Stay in the attribute name state.

When the user agent leaves the attribute name state (and before emitting the tag token, if appropriate), the complete attribute's name must be compared to the other attributes on the same token; if there is already an attribute on the token with the exact same name, then this is a parse error and the new attribute must be dropped, along with the value that gets associated with it (if any).

8.2.4.8 After attribute name state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Stay in the after attribute name state.

U+002F SOLIDUS (/)
    Switch to the self-closing start tag state.

U+003D EQUALS SIGN (=)
    Switch to the before attribute value state.

U+003E GREATER-THAN SIGN (>)
    Emit the current tag token. Switch to the data state.

U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
    Start a new attribute in the current tag token. Set that attribute's name to the lowercase version of the current input character (add 0x0020 to the character's code point), and its value to the empty string. Switch to the attribute name state.

U+0022 QUOTATION MARK (")
U+0027 APOSTROPHE (')
    Parse error. Treat it as per the "anything else" entry below.

EOF
    Parse error. Emit the current tag token. Reconsume the EOF character in the data state.

Anything else
    Start a new attribute in the current tag token. Set that attribute's name to the current input character, and its value to the empty string. Switch to the attribute name state.

8.2.4.9 Before attribute value state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Stay in the before attribute value state.

U+0022 QUOTATION MARK (")
    Switch to the attribute value (double-quoted) state.

U+0026 AMPERSAND (&)
    Switch to the attribute value (unquoted) state and reconsume this input character.

U+0027 APOSTROPHE (')
    Switch to the attribute value (single-quoted) state.

U+003E GREATER-THAN SIGN (>)
    Parse error. Emit the current tag token. Switch to the data state.

U+003D EQUALS SIGN (=)
    Parse error. Treat it as per the "anything else" entry below.

EOF
    Parse error. Emit the current tag token. Reconsume the character in the data state.

Anything else
    Append the current input character to the current attribute's value. Switch to the attribute value (unquoted) state.
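
The attribute states build one attribute at a time, and the duplicate-name rule stated with the attribute name state (8.2.4.7) decides whether that attribute survives. One possible way to model it is sketched below. The TagToken class and its method names are hypothetical, and the caller is assumed to have lowercased characters already, as the states require.

```python
class TagToken:
    """Illustrative tag token that applies the duplicate-attribute rule."""

    def __init__(self, name):
        self.name = name
        self.attributes = []   # [name, value] pairs kept on the token
        self.parse_errors = 0
        self._current = None   # attribute currently being built

    def start_attribute(self, first_char):
        # "Start a new attribute": name is the first character, value is empty.
        self._current = [first_char, ""]

    def append_to_name(self, ch):
        self._current[0] += ch

    def leave_attribute_name_state(self):
        # Compare the complete name with the attributes already on the token.
        if any(name == self._current[0] for name, _ in self.attributes):
            # Parse error: the new attribute is dropped. It stays detached, so
            # any value collected for it later is discarded with it.
            self.parse_errors += 1
        else:
            self.attributes.append(self._current)

    def append_to_value(self, ch):
        self._current[1] += ch


token = TagToken("div")
for name, value in [("id", "a"), ("class", "x"), ("id", "b")]:
    token.start_attribute(name[0])
    for ch in name[1:]:
        token.append_to_name(ch)
    token.leave_attribute_name_state()
    for ch in value:
        token.append_to_value(ch)
```

After the loop, only the first id attribute remains on the token and one parse error has been recorded. Keeping the dropped attribute as a detached object lets the value states run unchanged while still discarding what they collect.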

8.2.4.10 Attribute value (double-quoted) state

Consume the next input character:

U+0022 QUOTATION MARK (")
    Switch to the after attribute value (quoted) state.

U+0026 AMPERSAND (&)
    Switch to the character reference in attribute value state, with the additional allowed character being U+0022 QUOTATION MARK (").

EOF
    Parse error. Emit the current tag token. Reconsume the character in the data state.

Anything else
    Append the current input character to the current attribute's value. Stay in the attribute value (double-quoted) state.

8.2.4.11 Attribute value (single-quoted) state

Consume the next input character:

U+0027 APOSTROPHE (')
    Switch to the after attribute value (quoted) state.

U+0026 AMPERSAND (&)
    Switch to the character reference in attribute value state, with the additional allowed character being U+0027 APOSTROPHE (').

EOF
    Parse error. Emit the current tag token. Reconsume the character in the data state.

Anything else
    Append the current input character to the current attribute's value. Stay in the attribute value (single-quoted) state.

8.2.4.12 Attribute value (unquoted) state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Switch to the before attribute name state.

U+0026 AMPERSAND (&)
    Switch to the character reference in attribute value state, with no additional allowed character.

U+003E GREATER-THAN SIGN (>)
    Emit the current tag token. Switch to the data state.

U+0022 QUOTATION MARK (")
U+0027 APOSTROPHE (')
U+003D EQUALS SIGN (=)
    Parse error. Treat it as per the "anything else" entry below.

EOF
    Parse error. Emit the current tag token. Reconsume the character in the data state.

Anything else
    Append the current input character to the current attribute's value. Stay in the attribute value (unquoted) state.

8.2.4.13 Character reference in attribute value state

Attempt to consume a character reference.

If nothing is returned, append a U+0026 AMPERSAND character to the current attribute's value.

Otherwise, append the returned character token to the current attribute's value.

Finally, switch back to the attribute value state that you were in when you were switched into this state.

8.2.4.14 After attribute value (quoted) state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Switch to the before attribute name state.

U+002F SOLIDUS (/)
    Switch to the self-closing start tag state.

U+003E GREATER-THAN SIGN (>)
    Emit the current tag token. Switch to the data state.

EOF
    Parse error. Emit the current tag token. Reconsume the EOF character in the data state.

Anything else
    Parse error. Reconsume the character in the before attribute name state.

8.2.4.15 Self-closing start tag state

Consume the next input character:

U+003E GREATER-THAN SIGN (>)
    Set the self-closing flag of the current tag token. Emit the current tag token. Switch to the data state.

EOF
    Parse error. Emit the current tag token. Reconsume the EOF character in the data state.

Anything else
    Parse error. Reconsume the character in the before attribute name state.

8.2.4.16 Bogus comment state

(This can only happen if the content model flag is set to the PCDATA state.)

Consume every character up to and including the first U+003E GREATER-THAN SIGN character (>) or the end of the file (EOF), whichever comes first.
Emit a comment token whose data is the concatenation of all the characters starting from and including the character that caused the state machine to switch into the bogus comment state, up to and including the character immediately before the last consumed character (i.e. up to the character just before the U+003E or EOF character). (If the comment was started by the end of the file (EOF), the token is empty.)

Switch to the data state.

If the end of the file was reached, reconsume the EOF character.

8.2.4.17 Markup declaration open state

(This can only happen if the content model flag is set to the PCDATA state.)

If the next two characters are both U+002D HYPHEN-MINUS (-) characters, consume those two characters, create a comment token whose data is the empty string, and switch to the comment start state.

Otherwise, if the next seven characters are an ASCII case-insensitive match for the word "DOCTYPE", then consume those characters and switch to the DOCTYPE state.

Otherwise, if the insertion mode is "in foreign content" and the current node is not an element in the HTML namespace and the next seven characters are an ASCII case-sensitive match for the string "[CDATA[" (the five uppercase letters "CDATA" with a U+005B LEFT SQUARE BRACKET character before and after), then consume those characters and switch to the CDATA section state (which is unrelated to the content model flag's CDATA state).

Otherwise, this is a parse error. Switch to the bogus comment state. The next character that is consumed, if any, is the first character that will be in the comment.

8.2.4.18 Comment start state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
    Switch to the comment start dash state.

U+003E GREATER-THAN SIGN (>)
    Parse error. Emit the comment token. Switch to the data state.

EOF
    Parse error. Emit the comment token. Reconsume the EOF character in the data state.

Anything else
    Append the input character to the comment token's data. Switch to the comment state.

8.2.4.19 Comment start dash state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
    Switch to the comment end state.

U+003E GREATER-THAN SIGN (>)
    Parse error. Emit the comment token. Switch to the data state.

EOF
    Parse error. Emit the comment token. Reconsume the EOF character in the data state.

Anything else
    Append a U+002D HYPHEN-MINUS (-) character and the input character to the comment token's data. Switch to the comment state.

8.2.4.20 Comment state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
    Switch to the comment end dash state.

EOF
    Parse error. Emit the comment token. Reconsume the EOF character in the data state.

Anything else
    Append the input character to the comment token's data. Stay in the comment state.

8.2.4.21 Comment end dash state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
    Switch to the comment end state.

EOF
    Parse error. Emit the comment token. Reconsume the EOF character in the data state.

Anything else
    Append a U+002D HYPHEN-MINUS (-) character and the input character to the comment token's data. Switch to the comment state.

8.2.4.22 Comment end state

Consume the next input character:

U+003E GREATER-THAN SIGN (>)
    Emit the comment token. Switch to the data state.

U+002D HYPHEN-MINUS (-)
    Parse error. Append a U+002D HYPHEN-MINUS (-) character to the comment token's data. Stay in the comment end state.

EOF
    Parse error. Emit the comment token. Reconsume the EOF character in the data state.

Anything else
    Parse error.
    Append two U+002D HYPHEN-MINUS (-) characters and the input character to the comment token's data. Switch to the comment state.

8.2.4.23 DOCTYPE state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Switch to the before DOCTYPE name state.

Anything else
    Parse error. Reconsume the current character in the before DOCTYPE name state.

8.2.4.24 Before DOCTYPE name state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Stay in the before DOCTYPE name state.

U+003E GREATER-THAN SIGN (>)
    Parse error. Create a new DOCTYPE token. Set its force-quirks flag to on. Emit the token. Switch to the data state.

U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
    Create a new DOCTYPE token. Set the token's name to the lowercase version of the input character (add 0x0020 to the character's code point). Switch to the DOCTYPE name state.

EOF
    Parse error. Create a new DOCTYPE token. Set its force-quirks flag to on. Emit the token. Reconsume the EOF character in the data state.

Anything else
    Create a new DOCTYPE token. Set the token's name to the current input character. Switch to the DOCTYPE name state.

8.2.4.25 DOCTYPE name state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Switch to the after DOCTYPE name state.

U+003E GREATER-THAN SIGN (>)
    Emit the current DOCTYPE token. Switch to the data state.

U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
    Append the lowercase version of the input character (add 0x0020 to the character's code point) to the current DOCTYPE token's name.
    Stay in the DOCTYPE name state.

EOF
    Parse error. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Reconsume the EOF character in the data state.

Anything else
    Append the current input character to the current DOCTYPE token's name. Stay in the DOCTYPE name state.

8.2.4.26 After DOCTYPE name state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Stay in the after DOCTYPE name state.

U+003E GREATER-THAN SIGN (>)
    Emit the current DOCTYPE token. Switch to the data state.

EOF
    Parse error. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Reconsume the EOF character in the data state.

Anything else
    If the six characters starting from the current input character are an ASCII case-insensitive match for the word "PUBLIC", then consume those characters and switch to the before DOCTYPE public identifier state.

    Otherwise, if the six characters starting from the current input character are an ASCII case-insensitive match for the word "SYSTEM", then consume those characters and switch to the before DOCTYPE system identifier state.

    Otherwise, this is a parse error. Set the DOCTYPE token's force-quirks flag to on. Switch to the bogus DOCTYPE state.

8.2.4.27 Before DOCTYPE public identifier state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Stay in the before DOCTYPE public identifier state.

U+0022 QUOTATION MARK (")
    Set the DOCTYPE token's public identifier to the empty string (not missing), then switch to the DOCTYPE public identifier (double-quoted) state.

U+0027 APOSTROPHE (')
    Set the DOCTYPE token's public identifier to the empty string (not missing), then switch to the DOCTYPE public identifier (single-quoted) state.

U+003E GREATER-THAN SIGN (>)
    Parse error. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Switch to the data state.

EOF
    Parse error. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Reconsume the EOF character in the data state.

Anything else
    Parse error. Set the DOCTYPE token's force-quirks flag to on. Switch to the bogus DOCTYPE state.

8.2.4.28 DOCTYPE public identifier (double-quoted) state

Consume the next input character:

U+0022 QUOTATION MARK (")
    Switch to the after DOCTYPE public identifier state.

U+003E GREATER-THAN SIGN (>)
    Parse error. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Switch to the data state.

EOF
    Parse error. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Reconsume the EOF character in the data state.

Anything else
    Append the current input character to the current DOCTYPE token's public identifier. Stay in the DOCTYPE public identifier (double-quoted) state.

8.2.4.29 DOCTYPE public identifier (single-quoted) state

Consume the next input character:

U+0027 APOSTROPHE (')
    Switch to the after DOCTYPE public identifier state.

U+003E GREATER-THAN SIGN (>)
    Parse error. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Switch to the data state.

EOF
    Parse error. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Reconsume the EOF character in the data state.

Anything else
    Append the current input character to the current DOCTYPE token's public identifier.
    Stay in the DOCTYPE public identifier (single-quoted) state.

8.2.4.30 After DOCTYPE public identifier state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Stay in the after DOCTYPE public identifier state.

U+0022 QUOTATION MARK (")
    Set the DOCTYPE token's system identifier to the empty string (not missing), then switch to the DOCTYPE system identifier (double-quoted) state.

U+0027 APOSTROPHE (')
    Set the DOCTYPE token's system identifier to the empty string (not missing), then switch to the DOCTYPE system identifier (single-quoted) state.

U+003E GREATER-THAN SIGN (>)
    Emit the current DOCTYPE token. Switch to the data state.

EOF
    Parse error. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Reconsume the EOF character in the data state.

Anything else
    Parse error. Set the DOCTYPE token's force-quirks flag to on. Switch to the bogus DOCTYPE state.

8.2.4.31 Before DOCTYPE system identifier state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Stay in the before DOCTYPE system identifier state.

U+0022 QUOTATION MARK (")
    Set the DOCTYPE token's system identifier to the empty string (not missing), then switch to the DOCTYPE system identifier (double-quoted) state.

U+0027 APOSTROPHE (')
    Set the DOCTYPE token's system identifier to the empty string (not missing), then switch to the DOCTYPE system identifier (single-quoted) state.

U+003E GREATER-THAN SIGN (>)
    Parse error. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Switch to the data state.

EOF
    Parse error. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token.
    Reconsume the EOF character in the data state.

Anything else
    Parse error. Set the DOCTYPE token's force-quirks flag to on. Switch to the bogus DOCTYPE state.

8.2.4.32 DOCTYPE system identifier (double-quoted) state

Consume the next input character:

U+0022 QUOTATION MARK (")
    Switch to the after DOCTYPE system identifier state.

U+003E GREATER-THAN SIGN (>)
    Parse error. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Switch to the data state.

EOF
    Parse error. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Reconsume the EOF character in the data state.

Anything else
    Append the current input character to the current DOCTYPE token's system identifier. Stay in the DOCTYPE system identifier (double-quoted) state.

8.2.4.33 DOCTYPE system identifier (single-quoted) state

Consume the next input character:

U+0027 APOSTROPHE (')
    Switch to the after DOCTYPE system identifier state.

U+003E GREATER-THAN SIGN (>)
    Parse error. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Switch to the data state.

EOF
    Parse error. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Reconsume the EOF character in the data state.

Anything else
    Append the current input character to the current DOCTYPE token's system identifier. Stay in the DOCTYPE system identifier (single-quoted) state.

8.2.4.34 After DOCTYPE system identifier state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Stay in the after DOCTYPE system identifier state.

U+003E GREATER-THAN SIGN (>)
    Emit the current DOCTYPE token. Switch to the data state.

EOF
    Parse error. Set the DOCTYPE token's force-quirks flag to on.
          Emit that DOCTYPE token. Reconsume the EOF character in the data
          state.

   Anything else
          Parse error. Switch to the bogus DOCTYPE state. (This does not
          set the DOCTYPE token's force-quirks flag to on.)

  8.2.4.35 Bogus DOCTYPE state

   Consume the next input character:

   U+003E GREATER-THAN SIGN (>)
          Emit the DOCTYPE token. Switch to the data state.

   EOF
          Emit the DOCTYPE token. Reconsume the EOF character in the data
          state.

   Anything else
          Stay in the bogus DOCTYPE state.

  8.2.4.36 CDATA section state

   (This can only happen if the content model flag is set to the PCDATA
   state, and is unrelated to the content model flag's CDATA state.)

   Consume every character up to the next occurrence of the three
   character sequence U+005D RIGHT SQUARE BRACKET U+005D RIGHT SQUARE
   BRACKET U+003E GREATER-THAN SIGN (]]>), or the end of the file (EOF),
   whichever comes first. Emit a series of character tokens consisting of
   all the characters consumed except the matching three character
   sequence at the end (if one was found before the end of the file).

   Switch to the data state.

   If the end of the file was reached, reconsume the EOF character.

  8.2.4.37 Tokenizing character references

   This section defines how to consume a character reference. This
   definition is used when parsing character references in text and in
   attributes.

   The behavior depends on the identity of the next character (the one
   immediately after the U+0026 AMPERSAND character):

   U+0009 CHARACTER TABULATION
   U+000A LINE FEED (LF)
   U+000C FORM FEED (FF)
   U+0020 SPACE
   U+003C LESS-THAN SIGN
   U+0026 AMPERSAND
   EOF
   The additional allowed character, if there is one
          Not a character reference. No characters are consumed, and
          nothing is returned. (This is not an error, either.)

   U+0023 NUMBER SIGN (#)
          Consume the U+0023 NUMBER SIGN.

          The behavior further depends on the character after the U+0023
          NUMBER SIGN:

          U+0078 LATIN SMALL LETTER X
          U+0058 LATIN CAPITAL LETTER X
                 Consume the X.

                 Follow the steps below, but using the range of characters
                 U+0030 DIGIT ZERO through to U+0039 DIGIT NINE, U+0061
                 LATIN SMALL LETTER A through to U+0066 LATIN SMALL LETTER
                 F, and U+0041 LATIN CAPITAL LETTER A, through to U+0046
                 LATIN CAPITAL LETTER F (in other words, 0-9, A-F, a-f).

                 When it comes to interpreting the number, interpret it as
                 a hexadecimal number.

          Anything else
                 Follow the steps below, but using the range of characters
                 U+0030 DIGIT ZERO through to U+0039 DIGIT NINE (i.e. just
                 0-9).

                 When it comes to interpreting the number, interpret it as
                 a decimal number.

          Consume as many characters as match the range of characters
          given above.

          If no characters match the range, then don't consume any
          characters (and unconsume the U+0023 NUMBER SIGN character and,
          if appropriate, the X character). This is a parse error; nothing
          is returned.

          Otherwise, if the next character is a U+003B SEMICOLON, consume
          that too. If it isn't, there is a parse error.

          If one or more characters match the range, then take them all
          and interpret the string of characters as a number (either
          hexadecimal or decimal as appropriate).

          If that number is one of the numbers in the first column of the
          following table, then this is a parse error. Find the row with
          that number in the first column, and return a character token
          for the Unicode character given in the second column of that
          row.

          Number  Unicode character
          0x0D    U+000A LINE FEED (LF)
          0x80    U+20AC EURO SIGN ('€')
          0x81    U+FFFD REPLACEMENT CHARACTER
          0x82    U+201A SINGLE LOW-9 QUOTATION MARK ('‚')
          0x83    U+0192 LATIN SMALL LETTER F WITH HOOK ('ƒ')
          0x84    U+201E DOUBLE LOW-9 QUOTATION MARK ('„')
          0x85    U+2026 HORIZONTAL ELLIPSIS ('…')
          0x86    U+2020 DAGGER ('†')
          0x87    U+2021 DOUBLE DAGGER ('‡')
          0x88    U+02C6 MODIFIER LETTER CIRCUMFLEX ACCENT ('ˆ')
          0x89    U+2030 PER MILLE SIGN ('‰')
          0x8A    U+0160 LATIN CAPITAL LETTER S WITH CARON ('Š')
          0x8B    U+2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK ('‹')
          0x8C    U+0152 LATIN CAPITAL LIGATURE OE ('Œ')
          0x8D    U+FFFD REPLACEMENT CHARACTER
          0x8E    U+017D LATIN CAPITAL LETTER Z WITH CARON ('Ž')
          0x8F    U+FFFD REPLACEMENT CHARACTER
          0x90    U+FFFD REPLACEMENT CHARACTER
          0x91    U+2018 LEFT SINGLE QUOTATION MARK ('‘')
          0x92    U+2019 RIGHT SINGLE QUOTATION MARK ('’')
          0x93    U+201C LEFT DOUBLE QUOTATION MARK ('“')
          0x94    U+201D RIGHT DOUBLE QUOTATION MARK ('”')
          0x95    U+2022 BULLET ('•')
          0x96    U+2013 EN DASH ('–')
          0x97    U+2014 EM DASH ('—')
          0x98    U+02DC SMALL TILDE ('˜')
          0x99    U+2122 TRADE MARK SIGN ('™')
          0x9A    U+0161 LATIN SMALL LETTER S WITH CARON ('š')
          0x9B    U+203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK ('›')
          0x9C    U+0153 LATIN SMALL LIGATURE OE ('œ')
          0x9D    U+FFFD REPLACEMENT CHARACTER
          0x9E    U+017E LATIN SMALL LETTER Z WITH CARON ('ž')
          0x9F    U+0178 LATIN CAPITAL LETTER Y WITH DIAERESIS ('Ÿ')

          Otherwise, if the number is in the range 0x0000 to 0x0008,
          0x000E to 0x001F, 0x007F to 0x009F, 0xD800 to 0xDFFF, 0xFDD0 to
          0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF,
          0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE,
          0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF,
          0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE,
          0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF,
          0x10FFFE, or 0x10FFFF, or is higher than 0x10FFFF, then this is
          a parse error; return a character token for the U+FFFD
          REPLACEMENT CHARACTER character instead.

          Otherwise, return a character token for the Unicode character
          whose code point is that number.

   Anything else
          Consume the maximum number of characters possible, with the
          consumed characters matching one of the identifiers in the first
          column of the named character references table (in a
          case-sensitive manner).

          If no match can be made, then this is a parse error. No
          characters are consumed, and nothing is returned.

          If the last character matched is not a U+003B SEMICOLON (;),
          there is a parse error.

          If the character reference is being consumed as part of an
          attribute, and the last character matched is not a U+003B
          SEMICOLON (;), and the next character is in the range U+0030
          DIGIT ZERO to U+0039 DIGIT NINE, U+0041 LATIN CAPITAL LETTER A
          to U+005A LATIN CAPITAL LETTER Z, or U+0061 LATIN SMALL LETTER A
          to U+007A LATIN SMALL LETTER Z, then, for historical reasons,
          all the characters that were matched after the U+0026 AMPERSAND
          (&) must be unconsumed, and nothing is returned.

          Otherwise, return a character token for the character
          corresponding to the character reference name (as given by the
          second column of the named character references table).

          If the markup contains I'm &notit; I tell you, the character
          reference is parsed as "not", as in, I'm ¬it; I tell you. But if
          the markup was I'm &notin; I tell you, the character reference
          would be parsed as "notin;", resulting in I'm ∉ I tell you.
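
   The numeric (U+0023 NUMBER SIGN) branch described above can be
   sketched in Python as follows. This is only an illustration under
   stated assumptions, not part of the specification: the function and
   variable names (consume_numeric_reference, REMAP, is_disallowed) are
   hypothetical, the input is assumed to be a plain string, pos is
   assumed to index the character immediately after the U+0026
   AMPERSAND, and the result is reported as a (character, new position,
   parse error) tuple rather than as an emitted token.

```python
REPLACEMENT = "\uFFFD"

# Remapping table from the section above (number -> Unicode code point).
REMAP = {
    0x0D: 0x000A, 0x80: 0x20AC, 0x81: 0xFFFD, 0x82: 0x201A, 0x83: 0x0192,
    0x84: 0x201E, 0x85: 0x2026, 0x86: 0x2020, 0x87: 0x2021, 0x88: 0x02C6,
    0x89: 0x2030, 0x8A: 0x0160, 0x8B: 0x2039, 0x8C: 0x0152, 0x8D: 0xFFFD,
    0x8E: 0x017D, 0x8F: 0xFFFD, 0x90: 0xFFFD, 0x91: 0x2018, 0x92: 0x2019,
    0x93: 0x201C, 0x94: 0x201D, 0x95: 0x2022, 0x96: 0x2013, 0x97: 0x2014,
    0x98: 0x02DC, 0x99: 0x2122, 0x9A: 0x0161, 0x9B: 0x203A, 0x9C: 0x0153,
    0x9D: 0xFFFD, 0x9E: 0x017E, 0x9F: 0x0178,
}


def is_disallowed(n):
    """True if n falls in one of the ranges that yield U+FFFD."""
    if n > 0x10FFFF:
        return True
    if n <= 0x0008 or n == 0x000B or 0x000E <= n <= 0x001F:
        return True
    if 0x007F <= n <= 0x009F or 0xD800 <= n <= 0xDFFF:
        return True
    if 0xFDD0 <= n <= 0xFDEF:
        return True
    # 0xFFFE/0xFFFF and the last two code points of every higher plane.
    return (n & 0xFFFE) == 0xFFFE


def consume_numeric_reference(text, pos):
    """Return (char or None, new_pos, parse_error); text[pos] must be '#'."""
    start = pos
    if pos >= len(text) or text[pos] != "#":
        raise ValueError("pos must point at U+0023 NUMBER SIGN")
    pos += 1
    if pos < len(text) and text[pos] in "xX":
        pos += 1
        allowed, base = "0123456789abcdefABCDEF", 16
    else:
        allowed, base = "0123456789", 10
    first_digit = pos
    while pos < len(text) and text[pos] in allowed:
        pos += 1
    if pos == first_digit:
        # No digits: unconsume '#' (and the X); parse error, nothing returned.
        return None, start, True
    number = int(text[first_digit:pos], base)
    missing_semicolon = True
    if pos < len(text) and text[pos] == ";":
        pos += 1
        missing_semicolon = False
    if number in REMAP:
        return chr(REMAP[number]), pos, True
    if is_disallowed(number):
        return REPLACEMENT, pos, True
    return chr(number), pos, missing_semicolon
```

   For example, under these assumptions
   consume_numeric_reference("#x80;", 0) yields the EURO SIGN with the
   parse error flag set, while consume_numeric_reference("#x;", 0)
   returns no character and leaves the position unchanged, mirroring the
   "unconsume" step.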