      1    #8.2 Parsing HTML documents Table of contents 8.2.5 Tree construction
      2 
      3    WHATWG
      4 
      5 HTML 5
      6 
      7 Draft Recommendation  7 February 2009
      8 
     11 
     12     8.2.4 Tokenization
     13 
     14    Implementations must act as if they used the following state machine to
     15    tokenise HTML. The state machine must start in the data state. Most
     16    states consume a single character, which may have various side-effects,
     17    and then either switch the state machine to a new state to reconsume
     18    the same character, or switch it to a new state (to consume the next
     19    character), or stay in the same state (to consume the next character).
     20    Some states have more complicated behavior and can consume several
     21    characters before switching to another state.
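
The three kinds of transition described above (switch and consume the next character, stay and consume the next character, switch and reconsume the same character) can be sketched as a small driver loop. This is only an illustrative sketch, not a conforming tokeniser: the function name run_sketch is hypothetical, the content model flag is assumed to be PCDATA, and only a few branches of the data state and the tag open state are implemented. Everything else raises NotImplementedError.

```python
EOF = None  # sentinel standing in for the end-of-file "character"


def run_sketch(text):
    """Return (emitted_characters, parse_error_count) for a tiny subset
    of the data and tag open states (PCDATA assumed)."""
    out, errors = [], 0
    state, pos = "data", 0
    while True:
        # Consume the next input character (or EOF).
        ch = text[pos] if pos < len(text) else EOF
        pos += 1
        if state == "data":
            if ch is EOF:
                return "".join(out), errors  # the end-of-file token would be emitted here
            if ch == "&":
                raise NotImplementedError("character references are outside this sketch")
            if ch == "<":
                state = "tag open"  # switch; the next loop iteration consumes the next character
            else:
                out.append(ch)  # stay in the data state
        else:  # state == "tag open"
            if ch == ">":
                errors += 1
                out.append("<>")
                state = "data"
            elif ch is not EOF and ((ch.isascii() and ch.isalpha()) or ch in "!/?"):
                raise NotImplementedError("tags and comments are outside this sketch")
            else:
                errors += 1
                out.append("<")
                state = "data"
                pos -= 1  # reconsume the same character in the data state
```

The only difference between "switch" and "reconsume" in this sketch is the pos -= 1 step, which makes the next iteration read the same character again in the new state.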
     22 
     23    The exact behavior of certain states depends on a content model flag
     24    that is set after certain tokens are emitted. The flag has several
     25    states: PCDATA, RCDATA, CDATA, and PLAINTEXT. Initially it must be in
     26    the PCDATA state. In the RCDATA and CDATA states, a further escape flag
     27    is used to control the behavior of the tokeniser. It is either true or
     28    false, and initially must be set to the false state. The insertion mode
     29    and the stack of open elements also affect tokenization.
     30 
     31    The output of the tokenization step is a series of zero or more of the
     32    following tokens: DOCTYPE, start tag, end tag, comment, character,
     33    end-of-file. DOCTYPE tokens have a name, a public identifier, a system
     34    identifier, and a force-quirks flag. When a DOCTYPE token is created,
     35    its name, public identifier, and system identifier must be marked as
     36    missing (which is a distinct state from the empty string), and the
     37    force-quirks flag must be set to off (its other state is on). Start and
     38    end tag tokens have a tag name, a self-closing flag, and a list of
     39    attributes, each of which has a name and a value. When a start or end
     40    tag token is created, its self-closing flag must be unset (its other
     41    state is that it be set), and its attributes list must be empty.
     42    Comment and character tokens have data.
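
The token kinds and their initial values described above could be modelled as follows. This is a sketch under stated assumptions: the class and field names are hypothetical, None is used to represent "missing" (so that it stays distinct from the empty string), and the attribute list is modelled as a dict from name to value, which preserves insertion order in Python 3.7 and later.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DoctypeToken:
    # None stands for "missing", which is distinct from the empty string.
    name: Optional[str] = None
    public_id: Optional[str] = None
    system_id: Optional[str] = None
    force_quirks: bool = False  # "off"


@dataclass
class TagToken:
    kind: str  # "start" or "end"
    name: str = ""
    self_closing: bool = False  # unset on creation
    attributes: Dict[str, str] = field(default_factory=dict)  # empty on creation


@dataclass
class CommentToken:
    data: str = ""


@dataclass
class CharacterToken:
    data: str


@dataclass
class EOFToken:
    pass
```

Using default_factory for the attributes gives each tag token its own empty collection rather than one shared between tokens.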
     43 
     44    When a token is emitted, it must immediately be handled by the tree
     45    construction stage. The tree construction stage can affect the state of
     46    the content model flag, and can insert additional characters into the
     47    stream. (For example, the script element can result in scripts
     48    executing and using the dynamic markup insertion APIs to insert
     49    characters into the stream being tokenised.)
     50 
     51    When a start tag token is emitted with its self-closing flag set, if
     52    the flag is not acknowledged when it is processed by the tree
     53    construction stage, that is a parse error.
     54 
     55    When an end tag token is emitted, the content model flag must be
     56    switched to the PCDATA state.
     57 
     58    When an end tag token is emitted with attributes, that is a parse
     59    error.
     60 
     61    When an end tag token is emitted with its self-closing flag set, that
     62    is a parse error.
     63 
     64    Before each step of the tokeniser, the user agent must first check the
     65    parser pause flag. If it is true, then the tokeniser must abort the
     66    processing of any nested invocations of the tokeniser, yielding control
     67    back to the caller. If it is false, then the user agent may then check
     68    to see if either one of the scripts in the list of scripts that will
     69    execute as soon as possible or the first script in the list of scripts
     70    that will execute asynchronously, has completed loading. If one has,
     71    then it must be executed and removed from its list.
     72 
     73    The tokeniser state machine consists of the states defined in the
     74    following subsections.
     75 
     76       8.2.4.1 Data state
     77 
     78    Consume the next input character:
     79 
     80    U+0026 AMPERSAND (&)
     81           When the content model flag is set to one of the PCDATA or
     82           RCDATA states and the escape flag is false: switch to the
     83           character reference data state.
     84           Otherwise: treat it as per the "anything else" entry below.
     85 
     86    U+002D HYPHEN-MINUS (-)
     87           If the content model flag is set to either the RCDATA state or
     88           the CDATA state, and the escape flag is false, and there are at
     89           least three characters before this one in the input stream, and
     90           the last four characters in the input stream, including this
     91           one, are U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+002D
     92           HYPHEN-MINUS, and U+002D HYPHEN-MINUS ("<!--"), then set the
     93           escape flag to true.
     94 
     95           In any case, emit the input character as a character token. Stay
     96           in the data state.
     97 
     98    U+003C LESS-THAN SIGN (<)
     99           When the content model flag is set to the PCDATA state: switch
    100           to the tag open state.
    101           When the content model flag is set to either the RCDATA state or
    102           the CDATA state, and the escape flag is false: switch to the tag
    103           open state.
    104           Otherwise: treat it as per the "anything else" entry below.
    105 
    106    U+003E GREATER-THAN SIGN (>)
    107           If the content model flag is set to either the RCDATA state or
    108           the CDATA state, and the escape flag is true, and the last three
    109           characters in the input stream including this one are U+002D
    110           HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN
    111           ("-->"), set the escape flag to false.
    112 
    113           In any case, emit the input character as a character token. Stay
    114           in the data state.
    115 
    116    EOF
    117           Emit an end-of-file token.
    118 
    119    Anything else
    120           Emit the input character as a character token. Stay in the data
    121           state.
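
To illustrate how the escape flag evolves in the RCDATA and CDATA states, the sketch below applies only the U+002D and U+003E rules of the data state to a string and records the flag after each character. The function name is hypothetical, and the sketch assumes every character passes through the data state; it ignores the characters a real tokeniser would hand to the close tag open state after "</".

```python
def escape_flag_trace(text):
    """Return the escape flag value after each character of `text`,
    assuming the content model flag is RCDATA or CDATA throughout."""
    escape = False
    trace = []
    for i, ch in enumerate(text):
        if ch == "-" and not escape and i >= 3 and text[i - 3:i + 1] == "<!--":
            # At least three characters precede this one and the last
            # four characters are "<!--": enter the escaped region.
            escape = True
        elif ch == ">" and escape and text[i - 2:i + 1] == "-->":
            # The last three characters are "-->": leave the escaped region.
            escape = False
        trace.append(escape)
    return trace
```

While the flag is true, a U+003C LESS-THAN SIGN falls through to the "anything else" entry, so a "<b>" inside "<!-- ... -->" is emitted as plain character tokens.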
    122 
    123       8.2.4.2 Character reference data state
    124 
    125    (This cannot happen if the content model flag is set to the CDATA
    126    state.)
    127 
    128    Attempt to consume a character reference, with no additional allowed
    129    character.
    130 
    131    If nothing is returned, emit a U+0026 AMPERSAND character token.
    132 
    133    Otherwise, emit the character token that was returned.
    134 
    135    Finally, switch to the data state.
    136 
    137       8.2.4.3 Tag open state
    138 
    139    The behavior of this state depends on the content model flag.
    140 
    141    If the content model flag is set to the RCDATA or CDATA states
    142           Consume the next input character. If it is a U+002F SOLIDUS (/)
    143           character, switch to the close tag open state. Otherwise, emit a
    144           U+003C LESS-THAN SIGN character token and reconsume the current
    145           input character in the data state.
    146 
    147    If the content model flag is set to the PCDATA state
    148           Consume the next input character:
    149 
    150         U+0021 EXCLAMATION MARK (!)
    151                 Switch to the markup declaration open state.
    152 
    153         U+002F SOLIDUS (/)
    154                 Switch to the close tag open state.
    155 
    156         U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL
    157                 LETTER Z
    158                 Create a new start tag token, set its tag name to the
    159                 lowercase version of the input character (add 0x0020 to
    160                 the character's code point), then switch to the tag name
    161                 state. (Don't emit the token yet; further details will be
    162                 filled in before it is emitted.)
    163 
    164         U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
    165                 Create a new start tag token, set its tag name to the
    166                 input character, then switch to the tag name state. (Don't
    167                 emit the token yet; further details will be filled in
    168                 before it is emitted.)
    169 
    170         U+003E GREATER-THAN SIGN (>)
    171                 Parse error. Emit a U+003C LESS-THAN SIGN character token
    172                 and a U+003E GREATER-THAN SIGN character token. Switch to
    173                 the data state.
    174 
    175         U+003F QUESTION MARK (?)
    176                 Parse error. Switch to the bogus comment state.
    177 
    178         Anything else
    179                 Parse error. Emit a U+003C LESS-THAN SIGN character token
    180                 and reconsume the current input character in the data
    181                 state.
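
The "add 0x0020 to the character's code point" step used when creating tag names can be sketched as a one-character helper. The name ascii_lower is hypothetical. The sketch deliberately avoids str.lower(), which also changes non-ASCII letters, whereas the rule above applies only to U+0041 through U+005A.

```python
def ascii_lower(ch):
    """Lowercase a single character only if it is in the range
    U+0041 (A) to U+005A (Z); leave every other character unchanged."""
    if "A" <= ch <= "Z":
        return chr(ord(ch) + 0x20)
    return ch
```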
    182 
    183       8.2.4.4 Close tag open state
    184 
    185    If the content model flag is set to the RCDATA or CDATA states but no
    186    start tag token has ever been emitted by this instance of the tokeniser
    187    (fragment case), or, if the content model flag is set to the RCDATA or
    188    CDATA states and the next few characters do not match the tag name of
    189    the last start tag token emitted (compared in an ASCII case-insensitive
    190    manner), or if they do but they are not immediately followed by one of
    191    the following characters:
    192      * U+0009 CHARACTER TABULATION
    193      * U+000A LINE FEED (LF)
    194      * U+000C FORM FEED (FF)
    195      * U+0020 SPACE
    196      * U+003E GREATER-THAN SIGN (>)
    197      * U+002F SOLIDUS (/)
    198      * EOF
    199 
    200    ...then emit a U+003C LESS-THAN SIGN character token, a U+002F SOLIDUS
    201    character token, and switch to the data state to process the next input
    202    character.
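
The RCDATA/CDATA check above can be sketched as a predicate over the characters that follow "</". This is an illustrative sketch: the function and parameter names are hypothetical, rest is assumed to be the remaining input as a string, an empty remainder stands for EOF, and last_start_tag_name is None when no start tag token has been emitted (the fragment case).

```python
import string

# ASCII-only lowercase mapping, used for the case-insensitive comparison.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

# Characters that may immediately follow the tag name (EOF is handled separately).
_TERMINATORS = {"\t", "\n", "\f", " ", ">", "/"}


def is_matching_end_tag(rest, last_start_tag_name):
    """Return True if `rest` begins with the name of the last start tag
    (ASCII case-insensitively) followed by a terminator or EOF."""
    if last_start_tag_name is None:
        return False  # fragment case: no start tag token was ever emitted
    n = len(last_start_tag_name)
    candidate = rest[:n]
    if len(candidate) < n:
        return False  # input ended before the whole name could match
    if candidate.translate(_ASCII_LOWER) != last_start_tag_name.translate(_ASCII_LOWER):
        return False
    following = rest[n:n + 1]
    return following == "" or following in _TERMINATORS
```

When the predicate is false, the tokeniser emits "<" and "/" as character tokens and returns to the data state, so text such as "</b>" inside a title element stays literal text.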
    203 
    204    Otherwise, if the content model flag is set to the PCDATA state, or if
    205    the next few characters do match that tag name, consume the next input
    206    character:
    207 
    208    U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
    209           Create a new end tag token, set its tag name to the lowercase
    210           version of the input character (add 0x0020 to the character's
    211           code point), then switch to the tag name state. (Don't emit the
    212           token yet; further details will be filled in before it is
    213           emitted.)
    214 
    215    U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
    216           Create a new end tag token, set its tag name to the input
    217           character, then switch to the tag name state. (Don't emit the
    218           token yet; further details will be filled in before it is
    219           emitted.)
    220 
    221    U+003E GREATER-THAN SIGN (>)
    222           Parse error. Switch to the data state.
    223 
    224    EOF
    225           Parse error. Emit a U+003C LESS-THAN SIGN character token and a
    226           U+002F SOLIDUS character token. Reconsume the EOF character in
    227           the data state.
    228 
    229    Anything else
    230           Parse error. Switch to the bogus comment state.
    231 
    232       8.2.4.5 Tag name state
    233 
    234    Consume the next input character:
    235 
    236    U+0009 CHARACTER TABULATION
    237    U+000A LINE FEED (LF)
    238    U+000C FORM FEED (FF)
    239    U+0020 SPACE
    240           Switch to the before attribute name state.
    241 
    242    U+002F SOLIDUS (/)
    243           Switch to the self-closing start tag state.
    244 
    245    U+003E GREATER-THAN SIGN (>)
    246           Emit the current tag token. Switch to the data state.
    247 
    248    U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
    249           Append the lowercase version of the current input character (add
    250           0x0020 to the character's code point) to the current tag token's
    251           tag name. Stay in the tag name state.
    252 
    253    EOF
    254           Parse error. Emit the current tag token. Reconsume the EOF
    255           character in the data state.
    256 
    257    Anything else
    258           Append the current input character to the current tag token's
    259           tag name. Stay in the tag name state.
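
A minimal sketch of the tag name state as a loop over a string follows. The function name and the return shape (name, next state, position, parse error count) are assumptions made for illustration; the token itself is represented only by its name.

```python
WHITESPACE = "\t\n\f "


def tag_name_state(text, pos, name):
    """Run the tag name state starting at text[pos], extending `name`.
    Return (name, next_state, new_pos, parse_errors)."""
    while True:
        if pos >= len(text):
            # EOF: parse error; the tag token is emitted and EOF is
            # reconsumed in the data state.
            return name, "data", pos, 1
        ch = text[pos]
        pos += 1
        if ch in WHITESPACE:
            return name, "before attribute name", pos, 0
        if ch == "/":
            return name, "self-closing start tag", pos, 0
        if ch == ">":
            return name, "data", pos, 0  # the tag token is emitted here
        if "A" <= ch <= "Z":
            ch = chr(ord(ch) + 0x20)  # ASCII lowercase
        name += ch  # stay in the tag name state
```

For example, after the tag open state has created a token named "d" from "<D", running this sketch over "IV class" yields the name "div" and moves to the before attribute name state.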
    260 
    261       8.2.4.6 Before attribute name state
    262 
    263    Consume the next input character:
    264 
    265    U+0009 CHARACTER TABULATION
    266    U+000A LINE FEED (LF)
    267    U+000C FORM FEED (FF)
    268    U+0020 SPACE
    269           Stay in the before attribute name state.
    270 
    271    U+002F SOLIDUS (/)
    272           Switch to the self-closing start tag state.
    273 
    274    U+003E GREATER-THAN SIGN (>)
    275           Emit the current tag token. Switch to the data state.
    276 
    277    U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
    278           Start a new attribute in the current tag token. Set that
    279           attribute's name to the lowercase version of the current input
    280           character (add 0x0020 to the character's code point), and its
    281           value to the empty string. Switch to the attribute name state.
    282 
    283    U+0022 QUOTATION MARK (")
    284    U+0027 APOSTROPHE (')
    285    U+003D EQUALS SIGN (=)
    286           Parse error. Treat it as per the "anything else" entry below.
    287 
    288    EOF
    289           Parse error. Emit the current tag token. Reconsume the EOF
    290           character in the data state.
    291 
    292    Anything else
    293           Start a new attribute in the current tag token. Set that
    294           attribute's name to the current input character, and its value
    295           to the empty string. Switch to the attribute name state.
    296 
    297       8.2.4.7 Attribute name state
    298 
    299    Consume the next input character:
    300 
    301    U+0009 CHARACTER TABULATION
    302    U+000A LINE FEED (LF)
    303    U+000C FORM FEED (FF)
    304    U+0020 SPACE
    305           Switch to the after attribute name state.
    306 
    307    U+002F SOLIDUS (/)
    308           Switch to the self-closing start tag state.
    309 
    310    U+003D EQUALS SIGN (=)
    311           Switch to the before attribute value state.
    312 
    313    U+003E GREATER-THAN SIGN (>)
    314           Emit the current tag token. Switch to the data state.
    315 
    316    U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
    317           Append the lowercase version of the current input character (add
    318           0x0020 to the character's code point) to the current attribute's
    319           name. Stay in the attribute name state.
    320 
    321    U+0022 QUOTATION MARK (")
    322    U+0027 APOSTROPHE (')
    323           Parse error. Treat it as per the "anything else" entry below.
    324 
    325    EOF
    326           Parse error. Emit the current tag token. Reconsume the EOF
    327           character in the data state.
    328 
    329    Anything else
    330           Append the current input character to the current attribute's
    331           name. Stay in the attribute name state.
    332 
    333    When the user agent leaves the attribute name state (and before
    334    emitting the tag token, if appropriate), the complete attribute's name
    335    must be compared to the other attributes on the same token; if there is
    336    already an attribute on the token with the exact same name, then this
    337    is a parse error and the new attribute must be dropped, along with the
    338    value that gets associated with it (if any).
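
The duplicate-attribute rule above amounts to "first occurrence wins". A minimal sketch, assuming the attributes arrive as (name, value) pairs whose names have already been lowercased by the attribute name state; the function name is hypothetical.

```python
def collect_attributes(pairs):
    """Return (attributes, parse_errors), keeping only the first
    attribute for each name and counting each dropped duplicate."""
    attributes = {}
    parse_errors = 0
    for name, value in pairs:
        if name in attributes:
            parse_errors += 1  # duplicate: drop the attribute and its value
            continue
        attributes[name] = value
    return attributes, parse_errors
```

Because names are lowercased before the comparison, "ID" and "id" on the same tag count as the same attribute.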
    339 
    340       8.2.4.8 After attribute name state
    341 
    342    Consume the next input character:
    343 
    344    U+0009 CHARACTER TABULATION
    345    U+000A LINE FEED (LF)
    346    U+000C FORM FEED (FF)
    347    U+0020 SPACE
    348           Stay in the after attribute name state.
    349 
    350    U+002F SOLIDUS (/)
    351           Switch to the self-closing start tag state.
    352 
    353    U+003D EQUALS SIGN (=)
    354           Switch to the before attribute value state.
    355 
    356    U+003E GREATER-THAN SIGN (>)
    357           Emit the current tag token. Switch to the data state.
    358 
    359    U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
    360           Start a new attribute in the current tag token. Set that
    361           attribute's name to the lowercase version of the current input
    362           character (add 0x0020 to the character's code point), and its
    363           value to the empty string. Switch to the attribute name state.
    364 
    365    U+0022 QUOTATION MARK (")
    366    U+0027 APOSTROPHE (')
    367           Parse error. Treat it as per the "anything else" entry below.
    368 
    369    EOF
    370           Parse error. Emit the current tag token. Reconsume the EOF
    371           character in the data state.
    372 
    373    Anything else
    374           Start a new attribute in the current tag token. Set that
    375           attribute's name to the current input character, and its value
    376           to the empty string. Switch to the attribute name state.
    377 
    378       8.2.4.9 Before attribute value state
    379 
    380    Consume the next input character:
    381 
    382    U+0009 CHARACTER TABULATION
    383    U+000A LINE FEED (LF)
    384    U+000C FORM FEED (FF)
    385    U+0020 SPACE
    386           Stay in the before attribute value state.
    387 
    388    U+0022 QUOTATION MARK (")
    389           Switch to the attribute value (double-quoted) state.
    390 
    391    U+0026 AMPERSAND (&)
    392           Switch to the attribute value (unquoted) state and reconsume
    393           this input character.
    394 
    395    U+0027 APOSTROPHE (')
    396           Switch to the attribute value (single-quoted) state.
    397 
    398    U+003E GREATER-THAN SIGN (>)
    399           Parse error. Emit the current tag token. Switch to the data
    400           state.
    401 
    402    U+003D EQUALS SIGN (=)
    403           Parse error. Treat it as per the "anything else" entry below.
    404 
    405    EOF
    406           Parse error. Emit the current tag token. Reconsume the EOF
    407           character in the data state.
    408 
    409    Anything else
    410           Append the current input character to the current attribute's
    411           value. Switch to the attribute value (unquoted) state.
    412 
    413       8.2.4.10 Attribute value (double-quoted) state
    414 
    415    Consume the next input character:
    416 
    417    U+0022 QUOTATION MARK (")
    418           Switch to the after attribute value (quoted) state.
    419 
    420    U+0026 AMPERSAND (&)
    421           Switch to the character reference in attribute value state, with
    422           the additional allowed character being U+0022 QUOTATION MARK
    423           (").
    424 
    425    EOF
    426           Parse error. Emit the current tag token. Reconsume the EOF
    427           character in the data state.
    428 
    429    Anything else
    430           Append the current input character to the current attribute's
    431           value. Stay in the attribute value (double-quoted) state.
    432 
    433       8.2.4.11 Attribute value (single-quoted) state
    434 
    435    Consume the next input character:
    436 
    437    U+0027 APOSTROPHE (')
    438           Switch to the after attribute value (quoted) state.
    439 
    440    U+0026 AMPERSAND (&)
    441           Switch to the character reference in attribute value state, with
    442           the additional allowed character being U+0027 APOSTROPHE (').
    443 
    444    EOF
    445           Parse error. Emit the current tag token. Reconsume the EOF
    446           character in the data state.
    447 
    448    Anything else
    449           Append the current input character to the current attribute's
    450           value. Stay in the attribute value (single-quoted) state.
    451 
    452       8.2.4.12 Attribute value (unquoted) state
    453 
    454    Consume the next input character:
    455 
    456    U+0009 CHARACTER TABULATION
    457    U+000A LINE FEED (LF)
    458    U+000C FORM FEED (FF)
    459    U+0020 SPACE
    460           Switch to the before attribute name state.
    461 
    462    U+0026 AMPERSAND (&)
    463           Switch to the character reference in attribute value state, with
    464           no additional allowed character.
    465 
    466    U+003E GREATER-THAN SIGN (>)
    467           Emit the current tag token. Switch to the data state.
    468 
    469    U+0022 QUOTATION MARK (")
    470    U+0027 APOSTROPHE (')
    471    U+003D EQUALS SIGN (=)
    472           Parse error. Treat it as per the "anything else" entry below.
    473 
    474    EOF
    475           Parse error. Emit the current tag token. Reconsume the EOF
    476           character in the data state.
    477 
    478    Anything else
    479           Append the current input character to the current attribute's
    480           value. Stay in the attribute value (unquoted) state.
    481 
    482       8.2.4.13 Character reference in attribute value state
    483 
    484    Attempt to consume a character reference.
    485 
    486    If nothing is returned, append a U+0026 AMPERSAND character to the
    487    current attribute's value.
    488 
    489    Otherwise, append the returned character token to the current
    490    attribute's value.
    491 
    492    Finally, switch back to the attribute value state that you were in when
    493    you were switched into this state.
    494 
    495       8.2.4.14 After attribute value (quoted) state
    496 
    497    Consume the next input character:
    498 
    499    U+0009 CHARACTER TABULATION
    500    U+000A LINE FEED (LF)
    501    U+000C FORM FEED (FF)
    502    U+0020 SPACE
    503           Switch to the before attribute name state.
    504 
    505    U+002F SOLIDUS (/)
    506           Switch to the self-closing start tag state.
    507 
    508    U+003E GREATER-THAN SIGN (>)
    509           Emit the current tag token. Switch to the data state.
    510 
    511    EOF
    512           Parse error. Emit the current tag token. Reconsume the EOF
    513           character in the data state.
    514 
    515    Anything else
    516           Parse error. Reconsume the character in the before attribute
    517           name state.
    518 
    519       8.2.4.15 Self-closing start tag state
    520 
    521    Consume the next input character:
    522 
    523    U+003E GREATER-THAN SIGN (>)
    524           Set the self-closing flag of the current tag token. Emit the
    525           current tag token. Switch to the data state.
    526 
    527    EOF
    528           Parse error. Emit the current tag token. Reconsume the EOF
    529           character in the data state.
    530 
    531    Anything else
    532           Parse error. Reconsume the character in the before attribute
    533           name state.
    534 
    535       8.2.4.16 Bogus comment state
    536 
    537    (This can only happen if the content model flag is set to the PCDATA
    538    state.)
    539 
    540    Consume every character up to and including the first U+003E
    541    GREATER-THAN SIGN character (>) or the end of the file (EOF), whichever
    542    comes first. Emit a comment token whose data is the concatenation of
    543    all the characters starting from and including the character that
    544    caused the state machine to switch into the bogus comment state, up to
    545    and including the character immediately before the last consumed
    546    character (i.e. up to the character just before the U+003E or EOF
    547    character). (If the comment was started by the end of the file (EOF),
    548    the token is empty.)
    549 
    550    Switch to the data state.
    551 
    552    If the end of the file was reached, reconsume the EOF character.
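
The bogus comment state consumes in bulk rather than character by character, which can be sketched with a single search. The function name is hypothetical; start is assumed to be the index of the character that caused the switch into this state (or the first character to be consumed, when entered from the markup declaration open state).

```python
def bogus_comment(text, start):
    """Return (comment_data, new_pos) for a bogus comment beginning at
    text[start]. The '>' is consumed but is not part of the data."""
    end = text.find(">", start)
    if end == -1:
        # EOF reached first: everything up to the end is comment data,
        # and EOF is then reconsumed in the data state.
        return text[start:], len(text)
    return text[start:end], end + 1
```

For the input "<?xml version='1.0'?>", the "?" after "<" triggers the state, so the comment data begins with that "?" and ends just before the ">".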
    553 
    554       8.2.4.17 Markup declaration open state
    555 
    556    (This can only happen if the content model flag is set to the PCDATA
    557    state.)
    558 
    559    If the next two characters are both U+002D HYPHEN-MINUS (-) characters,
    560    consume those two characters, create a comment token whose data is the
    561    empty string, and switch to the comment start state.
    562 
    563    Otherwise, if the next seven characters are an ASCII case-insensitive
    564    match for the word "DOCTYPE", then consume those characters and switch
    565    to the DOCTYPE state.
    566 
    567    Otherwise, if the insertion mode is "in foreign content" and the
    568    current node is not an element in the HTML namespace and the next seven
    569    characters are an ASCII case-sensitive match for the string "[CDATA["
    570    (the five uppercase letters "CDATA" with a U+005B LEFT SQUARE BRACKET
    571    character before and after), then consume those characters and switch
    572    to the CDATA section state (which is unrelated to the content model
    573    flag's CDATA state).
    574 
    575    Otherwise, this is a parse error. Switch to the bogus comment state.
    576    The next character that is consumed, if any, is the first character
    577    that will be in the comment.
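
The lookahead performed by the markup declaration open state can be sketched as below. This is an illustration only: the function name is hypothetical, rest is the input following "<!", and the single boolean cdata_allowed is an assumed stand-in for the combined condition "the insertion mode is in foreign content and the current node is not an element in the HTML namespace".

```python
import string

# ASCII-only uppercase mapping for the case-insensitive DOCTYPE match.
_ASCII_UPPER = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)


def markup_declaration_open(rest, cdata_allowed=False):
    """Return (next_state, chars_consumed, parse_errors)."""
    if rest.startswith("--"):
        return "comment start", 2, 0
    if rest[:7].translate(_ASCII_UPPER) == "DOCTYPE":
        return "DOCTYPE", 7, 0
    if cdata_allowed and rest.startswith("[CDATA["):
        # Case-sensitive match, and only in foreign content.
        return "CDATA section", 7, 0
    # Nothing is consumed here; the next character consumed becomes the
    # first character of the bogus comment.
    return "bogus comment", 0, 1
```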
    578 
    579       8.2.4.18 Comment start state
    580 
    581    Consume the next input character:
    582 
    583    U+002D HYPHEN-MINUS (-)
    584           Switch to the comment start dash state.
    585 
    586    U+003E GREATER-THAN SIGN (>)
    587           Parse error. Emit the comment token. Switch to the data state.
    588 
    589    EOF
    590           Parse error. Emit the comment token. Reconsume the EOF character
    591           in the data state.
    592 
    593    Anything else
    594           Append the input character to the comment token's data. Switch
    595           to the comment state.
    596 
    597       8.2.4.19 Comment start dash state
    598 
    599    Consume the next input character:
    600 
    601    U+002D HYPHEN-MINUS (-)
    602           Switch to the comment end state.
    603 
    604    U+003E GREATER-THAN SIGN (>)
    605           Parse error. Emit the comment token. Switch to the data state.
    606 
    607    EOF
    608           Parse error. Emit the comment token. Reconsume the EOF character
    609           in the data state.
    610 
    611    Anything else
    612           Append a U+002D HYPHEN-MINUS (-) character and the input
    613           character to the comment token's data. Switch to the comment
    614           state.
    615 
    616       8.2.4.20 Comment state
    617 
    618    Consume the next input character:
    619 
    620    U+002D HYPHEN-MINUS (-)
    621           Switch to the comment end dash state.
    622 
    623    EOF
    624           Parse error. Emit the comment token. Reconsume the EOF character
    625           in the data state.
    626 
    627    Anything else
    628           Append the input character to the comment token's data. Stay in
    629           the comment state.
    630 
    631       8.2.4.21 Comment end dash state
    632 
    633    Consume the next input character:
    634 
    635    U+002D HYPHEN-MINUS (-)
    636           Switch to the comment end state.
    637 
    638    EOF
    639           Parse error. Emit the comment token. Reconsume the EOF character
    640           in the data state.
    641 
    642    Anything else
    643           Append a U+002D HYPHEN-MINUS (-) character and the input
    644           character to the comment token's data. Switch to the comment
    645           state.
    646 
    647       8.2.4.22 Comment end state
    648 
    649    Consume the next input character:
    650 
    651    U+003E GREATER-THAN SIGN (>)
    652           Emit the comment token. Switch to the data state.
    653 
    654    U+002D HYPHEN-MINUS (-)
    655           Parse error. Append a U+002D HYPHEN-MINUS (-) character to the
    656           comment token's data. Stay in the comment end state.
    657 
    658    EOF
    659           Parse error. Emit the comment token. Reconsume the EOF character
    660           in the data state.
    661 
    662    Anything else
    663           Parse error. Append two U+002D HYPHEN-MINUS (-) characters and
    664           the input character to the comment token's data. Switch to the
    665           comment state.
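
The five comment states (comment start, comment start dash, comment, comment end dash, comment end) can be combined into one small state machine. The sketch below is illustrative: the function name and return shape are assumptions, and text is taken to be the input immediately after "<!--".

```python
def tokenize_comment(text):
    """Return (comment_data, parse_errors, chars_consumed)."""
    data, errors, state, pos = "", 0, "start", 0
    while True:
        if pos >= len(text):
            # EOF is a parse error in every comment state; the comment
            # token is emitted with the data collected so far.
            return data, errors + 1, pos
        ch = text[pos]
        pos += 1
        if state == "start":
            if ch == "-":
                state = "start dash"
            elif ch == ">":
                return data, errors + 1, pos
            else:
                data += ch
                state = "comment"
        elif state == "start dash":
            if ch == "-":
                state = "end"
            elif ch == ">":
                return data, errors + 1, pos
            else:
                data += "-" + ch
                state = "comment"
        elif state == "comment":
            if ch == "-":
                state = "end dash"
            else:
                data += ch
        elif state == "end dash":
            if ch == "-":
                state = "end"
            else:
                data += "-" + ch
                state = "comment"
        else:  # state == "end"
            if ch == ">":
                return data, errors, pos
            if ch == "-":
                errors += 1
                data += "-"  # stay in the comment end state
            else:
                errors += 1
                data += "--" + ch
                state = "comment"
```

A "--" inside a comment is a parse error but does not end the comment: for the input "a--b-->" the sketch returns the data "a--b" with one parse error.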
    666 
    667       8.2.4.23 DOCTYPE state
    668 
    669    Consume the next input character:
    670 
    671    U+0009 CHARACTER TABULATION
    672    U+000A LINE FEED (LF)
    673    U+000C FORM FEED (FF)
    674    U+0020 SPACE
    675           Switch to the before DOCTYPE name state.
    676 
    677    Anything else
    678           Parse error. Reconsume the current character in the before
    679           DOCTYPE name state.
    680 
    681       8.2.4.24 Before DOCTYPE name state
    682 
    683    Consume the next input character:
    684 
    685    U+0009 CHARACTER TABULATION
    686    U+000A LINE FEED (LF)
    687    U+000C FORM FEED (FF)
    688    U+0020 SPACE
    689           Stay in the before DOCTYPE name state.
    690 
    691    U+003E GREATER-THAN SIGN (>)
    692           Parse error. Create a new DOCTYPE token. Set its force-quirks
    693           flag to on. Emit the token. Switch to the data state.
    694 
    695    U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
    696           Create a new DOCTYPE token. Set the token's name to the
    697           lowercase version of the input character (add 0x0020 to the
    698           character's code point). Switch to the DOCTYPE name state.
    699 
    700    EOF
    701           Parse error. Create a new DOCTYPE token. Set its force-quirks
    702           flag to on. Emit the token. Reconsume the EOF character in the
    703           data state.
    704 
    705    Anything else
    706           Create a new DOCTYPE token. Set the token's name to the current
    707           input character. Switch to the DOCTYPE name state.
    708 
    709       8.2.4.25 DOCTYPE name state
    710 
    711    Consume the next input character:
    712 
    713    U+0009 CHARACTER TABULATION
    714    U+000A LINE FEED (LF)
    715    U+000C FORM FEED (FF)
    716    U+0020 SPACE
    717           Switch to the after DOCTYPE name state.
    718 
    719    U+003E GREATER-THAN SIGN (>)
    720           Emit the current DOCTYPE token. Switch to the data state.
    721 
    722    U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
    723           Append the lowercase version of the input character (add 0x0020
    724           to the character's code point) to the current DOCTYPE token's
    725           name. Stay in the DOCTYPE name state.
    726 
    727    EOF
    728           Parse error. Set the DOCTYPE token's force-quirks flag to on.
    729           Emit that DOCTYPE token. Reconsume the EOF character in the data
    730           state.
    731 
    732    Anything else
    733           Append the current input character to the current DOCTYPE
    734           token's name. Stay in the DOCTYPE name state.
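
As an illustration only, one step of the DOCTYPE name state might be sketched in Python as follows. The `Step` tuple, the `EOF` sentinel, and the dictionary used for the token are assumptions made for this sketch and are not defined by the specification; the token's name is assumed to have been initialised already by the before DOCTYPE name state.

```python
from typing import NamedTuple

WHITESPACE = {"\t", "\n", "\f", " "}
EOF = None  # assumed sentinel for end of input (not from the specification)


class Step(NamedTuple):
    next_state: str
    emit: bool         # emit the current DOCTYPE token
    reconsume: bool    # reprocess the same input in next_state
    parse_error: bool


def ascii_lower(ch):
    """Lowercase A-Z by adding 0x0020 to the code point; leave others alone."""
    return chr(ord(ch) + 0x0020) if "A" <= ch <= "Z" else ch


def doctype_name_step(token, ch):
    """Apply one DOCTYPE name state transition to a hypothetical token dict."""
    if ch is EOF:
        token["force_quirks"] = True
        return Step("data", True, True, True)
    if ch in WHITESPACE:
        return Step("after DOCTYPE name", False, False, False)
    if ch == ">":
        return Step("data", True, False, False)
    # Covers both the A-Z branch and the "anything else" branch.
    token["name"] += ascii_lower(ch)
    return Step("DOCTYPE name", False, False, False)
```

Folding the uppercase branch and the "anything else" branch into one line is a choice of this sketch; the two branches differ only in whether 0x0020 is added.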
    735 
    736       8.2.4.26 After DOCTYPE name state
    737 
    738    Consume the next input character:
    739 
    740    U+0009 CHARACTER TABULATION
    741    U+000A LINE FEED (LF)
    742    U+000C FORM FEED (FF)
    743    U+0020 SPACE
    744           Stay in the after DOCTYPE name state.
    745 
    746    U+003E GREATER-THAN SIGN (>)
    747           Emit the current DOCTYPE token. Switch to the data state.
    748 
    749    EOF
    750           Parse error. Set the DOCTYPE token's force-quirks flag to on.
    751           Emit that DOCTYPE token. Reconsume the EOF character in the data
    752           state.
    753 
    754    Anything else
    755           If the six characters starting from the current input character
    756           are an ASCII case-insensitive match for the word "PUBLIC", then
    757           consume those characters and switch to the before DOCTYPE public
    758           identifier state.
    759 
    760           Otherwise, if the six characters starting from the current input
    761           character are an ASCII case-insensitive match for the word
    762           "SYSTEM", then consume those characters and switch to the before
    763           DOCTYPE system identifier state.
    764 
    765           Otherwise, this is a parse error. Set the DOCTYPE token's
    766           force-quirks flag to on. Switch to the bogus DOCTYPE state.
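
The keyword test in the "anything else" branch above could look roughly like this in Python. The function names are hypothetical. The sketch folds only A-Z, because Python's `str.lower()` and `str.upper()` apply Unicode case mappings and would therefore accept more than an ASCII case-insensitive match allows.

```python
def ascii_fold(s):
    """Lowercase only the ASCII letters A-Z in s."""
    return "".join(chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in s)


def match_doctype_keyword(text, pos):
    """Return 'PUBLIC' or 'SYSTEM' if the six characters at pos match, else None.

    A None result corresponds to the parse error branch (force-quirks on,
    switch to the bogus DOCTYPE state).
    """
    candidate = ascii_fold(text[pos:pos + 6])
    for keyword in ("PUBLIC", "SYSTEM"):
        if candidate == ascii_fold(keyword):
            return keyword
    return None
```

On a match the caller would advance the input position by six characters before switching state.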
    767 
    768       8.2.4.27 Before DOCTYPE public identifier state
    769 
    770    Consume the next input character:
    771 
    772    U+0009 CHARACTER TABULATION
    773    U+000A LINE FEED (LF)
    774    U+000C FORM FEED (FF)
    775    U+0020 SPACE
    776           Stay in the before DOCTYPE public identifier state.
    777 
    778    U+0022 QUOTATION MARK (")
    779           Set the DOCTYPE token's public identifier to the empty string
    780           (not missing), then switch to the DOCTYPE public identifier
    781           (double-quoted) state.
    782 
    783    U+0027 APOSTROPHE (')
    784           Set the DOCTYPE token's public identifier to the empty string
    785           (not missing), then switch to the DOCTYPE public identifier
    786           (single-quoted) state.
    787 
    788    U+003E GREATER-THAN SIGN (>)
    789           Parse error. Set the DOCTYPE token's force-quirks flag to on.
    790           Emit that DOCTYPE token. Switch to the data state.
    791 
    792    EOF
    793           Parse error. Set the DOCTYPE token's force-quirks flag to on.
    794           Emit that DOCTYPE token. Reconsume the EOF character in the data
    795           state.
    796 
    797    Anything else
    798           Parse error. Set the DOCTYPE token's force-quirks flag to on.
    799           Switch to the bogus DOCTYPE state.
    800 
    801       8.2.4.28 DOCTYPE public identifier (double-quoted) state
    802 
    803    Consume the next input character:
    804 
    805    U+0022 QUOTATION MARK (")
    806           Switch to the after DOCTYPE public identifier state.
    807 
    808    U+003E GREATER-THAN SIGN (>)
    809           Parse error. Set the DOCTYPE token's force-quirks flag to on.
    810           Emit that DOCTYPE token. Switch to the data state.
    811 
    812    EOF
    813           Parse error. Set the DOCTYPE token's force-quirks flag to on.
    814           Emit that DOCTYPE token. Reconsume the EOF character in the data
    815           state.
    816 
    817    Anything else
    818           Append the current input character to the current DOCTYPE
    819           token's public identifier. Stay in the DOCTYPE public identifier
    820           (double-quoted) state.
    821 
    822       8.2.4.29 DOCTYPE public identifier (single-quoted) state
    823 
    824    Consume the next input character:
    825 
    826    U+0027 APOSTROPHE (')
    827           Switch to the after DOCTYPE public identifier state.
    828 
    829    U+003E GREATER-THAN SIGN (>)
    830           Parse error. Set the DOCTYPE token's force-quirks flag to on.
    831           Emit that DOCTYPE token. Switch to the data state.
    832 
    833    EOF
    834           Parse error. Set the DOCTYPE token's force-quirks flag to on.
    835           Emit that DOCTYPE token. Reconsume the EOF character in the data
    836           state.
    837 
    838    Anything else
    839           Append the current input character to the current DOCTYPE
    840           token's public identifier. Stay in the DOCTYPE public identifier
    841           (single-quoted) state.
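
The double-quoted and single-quoted public identifier states differ only in the character that closes the identifier, so a single helper parameterised on the quote character can illustrate both. This is a sketch under assumptions: the function name and the outcome strings "closed", "gt", and "eof" are invented here, and the input is taken to be a complete string rather than a stream.

```python
def read_quoted_identifier(text, pos, quote):
    """Read identifier characters starting just after the opening quote.

    Returns (identifier, next_pos, outcome) where outcome is:
      'closed' - the matching quote was found (no error)
      'gt'     - '>' ended the DOCTYPE early (parse error, force-quirks on)
      'eof'    - input ended first (parse error, force-quirks on, EOF is
                 then reconsumed in the data state)
    """
    chars = []
    while pos < len(text):
        ch = text[pos]
        pos += 1
        if ch == quote:
            return "".join(chars), pos, "closed"
        if ch == ">":
            return "".join(chars), pos, "gt"
        chars.append(ch)
    return "".join(chars), pos, "eof"
```

Note that the other quote character is ordinary data inside the identifier: an apostrophe inside a double-quoted identifier is simply appended.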
    842 
    843       8.2.4.30 After DOCTYPE public identifier state
    844 
    845    Consume the next input character:
    846 
    847    U+0009 CHARACTER TABULATION
    848    U+000A LINE FEED (LF)
    849    U+000C FORM FEED (FF)
    850    U+0020 SPACE
    851           Stay in the after DOCTYPE public identifier state.
    852 
    853    U+0022 QUOTATION MARK (")
    854           Set the DOCTYPE token's system identifier to the empty string
    855           (not missing), then switch to the DOCTYPE system identifier
    856           (double-quoted) state.
    857 
    858    U+0027 APOSTROPHE (')
    859           Set the DOCTYPE token's system identifier to the empty string
    860           (not missing), then switch to the DOCTYPE system identifier
    861           (single-quoted) state.
    862 
    863    U+003E GREATER-THAN SIGN (>)
    864           Emit the current DOCTYPE token. Switch to the data state.
    865 
    866    EOF
    867           Parse error. Set the DOCTYPE token's force-quirks flag to on.
    868           Emit that DOCTYPE token. Reconsume the EOF character in the data
    869           state.
    870 
    871    Anything else
    872           Parse error. Set the DOCTYPE token's force-quirks flag to on.
    873           Switch to the bogus DOCTYPE state.
    874 
    875       8.2.4.31 Before DOCTYPE system identifier state
    876 
    877    Consume the next input character:
    878 
    879    U+0009 CHARACTER TABULATION
    880    U+000A LINE FEED (LF)
    881    U+000C FORM FEED (FF)
    882    U+0020 SPACE
    883           Stay in the before DOCTYPE system identifier state.
    884 
    885    U+0022 QUOTATION MARK (")
    886           Set the DOCTYPE token's system identifier to the empty string
    887           (not missing), then switch to the DOCTYPE system identifier
    888           (double-quoted) state.
    889 
    890    U+0027 APOSTROPHE (')
    891           Set the DOCTYPE token's system identifier to the empty string
    892           (not missing), then switch to the DOCTYPE system identifier
    893           (single-quoted) state.
    894 
    895    U+003E GREATER-THAN SIGN (>)
    896           Parse error. Set the DOCTYPE token's force-quirks flag to on.
    897           Emit that DOCTYPE token. Switch to the data state.
    898 
    899    EOF
    900           Parse error. Set the DOCTYPE token's force-quirks flag to on.
    901           Emit that DOCTYPE token. Reconsume the EOF character in the data
    902           state.
    903 
    904    Anything else
    905           Parse error. Set the DOCTYPE token's force-quirks flag to on.
    906           Switch to the bogus DOCTYPE state.
    907 
    908       8.2.4.32 DOCTYPE system identifier (double-quoted) state
    909 
    910    Consume the next input character:
    911 
    912    U+0022 QUOTATION MARK (")
    913           Switch to the after DOCTYPE system identifier state.
    914 
    915    U+003E GREATER-THAN SIGN (>)
    916           Parse error. Set the DOCTYPE token's force-quirks flag to on.
    917           Emit that DOCTYPE token. Switch to the data state.
    918 
    919    EOF
    920           Parse error. Set the DOCTYPE token's force-quirks flag to on.
    921           Emit that DOCTYPE token. Reconsume the EOF character in the data
    922           state.
    923 
    924    Anything else
    925           Append the current input character to the current DOCTYPE
    926           token's system identifier. Stay in the DOCTYPE system identifier
    927           (double-quoted) state.
    928 
    929       8.2.4.33 DOCTYPE system identifier (single-quoted) state
    930 
    931    Consume the next input character:
    932 
    933    U+0027 APOSTROPHE (')
    934           Switch to the after DOCTYPE system identifier state.
    935 
    936    U+003E GREATER-THAN SIGN (>)
    937           Parse error. Set the DOCTYPE token's force-quirks flag to on.
    938           Emit that DOCTYPE token. Switch to the data state.
    939 
    940    EOF
    941           Parse error. Set the DOCTYPE token's force-quirks flag to on.
    942           Emit that DOCTYPE token. Reconsume the EOF character in the data
    943           state.
    944 
    945    Anything else
    946           Append the current input character to the current DOCTYPE
    947           token's system identifier. Stay in the DOCTYPE system identifier
    948           (single-quoted) state.
    949 
    950       8.2.4.34 After DOCTYPE system identifier state
    951 
    952    Consume the next input character:
    953 
    954    U+0009 CHARACTER TABULATION
    955    U+000A LINE FEED (LF)
    956    U+000C FORM FEED (FF)
    957    U+0020 SPACE
    958           Stay in the after DOCTYPE system identifier state.
    959 
    960    U+003E GREATER-THAN SIGN (>)
    961           Emit the current DOCTYPE token. Switch to the data state.
    962 
    963    EOF
    964           Parse error. Set the DOCTYPE token's force-quirks flag to on.
    965           Emit that DOCTYPE token. Reconsume the EOF character in the data
    966           state.
    967 
    968    Anything else
    969           Parse error. Switch to the bogus DOCTYPE state. (This does not
    970           set the DOCTYPE token's force-quirks flag to on.)
    971 
    972       8.2.4.35 Bogus DOCTYPE state
    973 
    974    Consume the next input character:
    975 
    976    U+003E GREATER-THAN SIGN (>)
    977           Emit the DOCTYPE token. Switch to the data state.
    978 
    979    EOF
    980           Emit the DOCTYPE token. Reconsume the EOF character in the data
    981           state.
    982 
    983    Anything else
    984           Stay in the bogus DOCTYPE state.
    985 
    986       8.2.4.36 CDATA section state
    987 
    988    (This can only happen if the content model flag is set to the PCDATA
    989    state, and is unrelated to the content model flag's CDATA state.)
    990 
    991    Consume every character up to the next occurrence of the three
    992    character sequence U+005D RIGHT SQUARE BRACKET U+005D RIGHT SQUARE
    993    BRACKET U+003E GREATER-THAN SIGN (]]>), or the end of the file (EOF),
    994    whichever comes first. Emit a series of character tokens consisting of
    995    all the characters consumed except the matching three character
    996    sequence at the end (if one was found before the end of the file).
    997 
    998    Switch to the data state.
    999 
   1000    If the end of the file was reached, reconsume the EOF character.
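
A minimal Python sketch of the CDATA section state, assuming the whole input is available as a string. The function name and the returned tuple are illustrative only.

```python
def consume_cdata_section(text, pos):
    """Consume a CDATA section body starting at pos.

    Returns (characters, next_pos, reached_eof). The characters would be
    emitted as character tokens; the closing "]]>" is consumed but not emitted.
    """
    end = text.find("]]>", pos)
    if end == -1:
        # No terminator: everything up to the end of the file is emitted,
        # and the caller then reconsumes EOF in the data state.
        return text[pos:], len(text), True
    return text[pos:end], end + 3, False
```

Only the first occurrence of the three character sequence ends the section, so a lone "]]" inside the body is passed through as data.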
   1001 
   1002       8.2.4.37 Tokenizing character references
   1003 
   1004    This section defines how to consume a character reference. This
   1005    definition is used when parsing character references in text and in
   1006    attributes.
   1007 
   1008    The behavior depends on the identity of the next character (the one
   1009    immediately after the U+0026 AMPERSAND character):
   1010 
   1011    U+0009 CHARACTER TABULATION
   1012    U+000A LINE FEED (LF)
   1013    U+000C FORM FEED (FF)
   1014    U+0020 SPACE
   1015    U+003C LESS-THAN SIGN
   1016    U+0026 AMPERSAND
   1017    EOF
   1018    The additional allowed character, if there is one
   1019           Not a character reference. No characters are consumed, and
   1020           nothing is returned. (This is not an error, either.)
   1021 
   1022    U+0023 NUMBER SIGN (#)
   1023           Consume the U+0023 NUMBER SIGN.
   1024 
   1025           The behavior further depends on the character after the U+0023
   1026           NUMBER SIGN:
   1027 
   1028         U+0078 LATIN SMALL LETTER X
   1029         U+0058 LATIN CAPITAL LETTER X
   1030                 Consume the X.
   1031 
   1032                 Follow the steps below, but using the range of characters
   1033                 U+0030 DIGIT ZERO through to U+0039 DIGIT NINE, U+0061
   1034                 LATIN SMALL LETTER A through to U+0066 LATIN SMALL LETTER
   1035                 F, and U+0041 LATIN CAPITAL LETTER A, through to U+0046
   1036                 LATIN CAPITAL LETTER F (in other words, 0-9, A-F, a-f).
   1037 
   1038                 When it comes to interpreting the number, interpret it as
   1039                 a hexadecimal number.
   1040 
   1041         Anything else
   1042                 Follow the steps below, but using the range of characters
   1043                 U+0030 DIGIT ZERO through to U+0039 DIGIT NINE (i.e. just
   1044                 0-9).
   1045 
   1046                 When it comes to interpreting the number, interpret it as
   1047                 a decimal number.
   1048 
   1049           Consume as many characters as match the range of characters
   1050           given above.
   1051 
   1052           If no characters match the range, then don't consume any
   1053           characters (and unconsume the U+0023 NUMBER SIGN character and,
   1054           if appropriate, the X character). This is a parse error; nothing
   1055           is returned.
   1056 
   1057           Otherwise, if the next character is a U+003B SEMICOLON, consume
   1058           that too. If it isn't, there is a parse error.
   1059 
   1060           If one or more characters match the range, then take them all
   1061           and interpret the string of characters as a number (either
   1062           hexadecimal or decimal as appropriate).
   1063 
   1064           If that number is one of the numbers in the first column of the
   1065           following table, then this is a parse error. Find the row with
   1066           that number in the first column, and return a character token
   1067           for the Unicode character given in the second column of that
   1068           row.
   1069 
   1070           Number Unicode character
   1071           0x0D   U+000A LINE FEED (LF)
   1072           0x80   U+20AC EURO SIGN ('€')
   1073           0x81   U+FFFD REPLACEMENT CHARACTER
   1074           0x82   U+201A SINGLE LOW-9 QUOTATION MARK ('‚')
   1075           0x83   U+0192 LATIN SMALL LETTER F WITH HOOK ('ƒ')
   1076           0x84   U+201E DOUBLE LOW-9 QUOTATION MARK ('„')
   1077           0x85   U+2026 HORIZONTAL ELLIPSIS ('…')
   1078           0x86   U+2020 DAGGER ('†')
   1079           0x87   U+2021 DOUBLE DAGGER ('‡')
   1080           0x88   U+02C6 MODIFIER LETTER CIRCUMFLEX ACCENT ('ˆ')
   1081           0x89   U+2030 PER MILLE SIGN ('‰')
   1082           0x8A   U+0160 LATIN CAPITAL LETTER S WITH CARON ('Š')
   1083           0x8B   U+2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK ('‹')
   1084           0x8C   U+0152 LATIN CAPITAL LIGATURE OE ('Œ')
   1085           0x8D   U+FFFD REPLACEMENT CHARACTER
   1086           0x8E   U+017D LATIN CAPITAL LETTER Z WITH CARON ('Ž')
   1087           0x8F   U+FFFD REPLACEMENT CHARACTER
   1088           0x90   U+FFFD REPLACEMENT CHARACTER
   1089           0x91   U+2018 LEFT SINGLE QUOTATION MARK ('‘')
   1090           0x92   U+2019 RIGHT SINGLE QUOTATION MARK ('’')
   1091           0x93   U+201C LEFT DOUBLE QUOTATION MARK ('“')
   1092           0x94   U+201D RIGHT DOUBLE QUOTATION MARK ('”')
   1093           0x95   U+2022 BULLET ('•')
   1094           0x96   U+2013 EN DASH ('–')
   1095           0x97   U+2014 EM DASH ('—')
   1096           0x98   U+02DC SMALL TILDE ('˜')
   1097           0x99   U+2122 TRADE MARK SIGN ('™')
   1098           0x9A   U+0161 LATIN SMALL LETTER S WITH CARON ('š')
   1099           0x9B   U+203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK ('›')
   1100           0x9C   U+0153 LATIN SMALL LIGATURE OE ('œ')
   1101           0x9D   U+FFFD REPLACEMENT CHARACTER
   1102           0x9E   U+017E LATIN SMALL LETTER Z WITH CARON ('ž')
   1103           0x9F   U+0178 LATIN CAPITAL LETTER Y WITH DIAERESIS ('Ÿ')
   1104 
   1105           Otherwise, if the number is in the range 0x0000 to 0x0008,
   1106           0x000E to 0x001F, 0x007F to 0x009F, 0xD800 to 0xDFFF, 0xFDD0 to
   1107           0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF,
   1108           0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE,
   1109           0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF,
   1110           0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE,
   1111           0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF,
   1112           0x10FFFE, or 0x10FFFF, or is higher than 0x10FFFF, then this is
   1113           a parse error; return a character token for U+FFFD REPLACEMENT
   1114           CHARACTER instead.
   1115 
   1116           Otherwise, return a character token for the Unicode character
   1117           whose code point is that number.
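
The numeric branch above can be sketched in Python as follows. This is an illustration under assumptions, not a reference implementation: the function names, the `(character, next_pos, parse_error)` return shape, and the treatment of the input as a complete string are all choices made for the example. The noncharacter list is expressed with a bit mask, which matches every value of the form 0xFFFE or 0xFFFF in each plane.

```python
# Replacement table from the text: 0x0D plus the 0x80-0x9F block.
REPLACEMENTS = {0x0D: 0x0A}
REPLACEMENTS.update(zip(range(0x80, 0xA0), [
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178,
]))


def is_disallowed(n):
    """True for numbers that must map to U+FFFD per the ranges in the text."""
    return (n <= 0x08 or n == 0x0B or 0x0E <= n <= 0x1F
            or 0x7F <= n <= 0x9F or 0xD800 <= n <= 0xDFFF
            or 0xFDD0 <= n <= 0xFDEF
            or (n & 0xFFFE) == 0xFFFE   # 0xFFFE/0xFFFF in every plane
            or n > 0x10FFFF)


def consume_numeric_reference(text, pos):
    """text[pos] must be '#'. Returns (character or None, next_pos, parse_error)."""
    start = pos
    pos += 1                                  # consume '#'
    if text[pos:pos + 1] in ("x", "X"):
        pos += 1                              # consume the X
        digits, base = "0123456789abcdefABCDEF", 16
    else:
        digits, base = "0123456789", 10
    first_digit = pos
    while pos < len(text) and text[pos] in digits:
        pos += 1
    if pos == first_digit:
        return None, start, True              # unconsume '#' (and X)
    number = int(text[first_digit:pos], base)
    missing_semicolon = text[pos:pos + 1] != ";"
    if not missing_semicolon:
        pos += 1
    if number in REPLACEMENTS:
        return chr(REPLACEMENTS[number]), pos, True
    if is_disallowed(number):
        return "\uFFFD", pos, True
    return chr(number), pos, missing_semicolon
```

For instance, `consume_numeric_reference("#x41;", 0)` yields the letter A with no parse error, while `"#128;"` is remapped through the table to the euro sign and flagged as a parse error.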
   1118 
   1119    Anything else
   1120           Consume the maximum number of characters possible, with the
   1121           consumed characters matching one of the identifiers in the first
   1122           column of the named character references table (in a
   1123           case-sensitive manner).
   1124 
   1125           If no match can be made, then this is a parse error. No
   1126           characters are consumed, and nothing is returned.
   1127 
   1128           If the last character matched is not a U+003B SEMICOLON (;),
   1129           there is a parse error.
   1130 
   1131           If the character reference is being consumed as part of an
   1132           attribute, and the last character matched is not a U+003B
   1133           SEMICOLON (;), and the next character is in the range U+0030
   1134           DIGIT ZERO to U+0039 DIGIT NINE, U+0041 LATIN CAPITAL LETTER A
   1135           to U+005A LATIN CAPITAL LETTER Z, or U+0061 LATIN SMALL LETTER A
   1136           to U+007A LATIN SMALL LETTER Z, then, for historical reasons,
   1137           all the characters that were matched after the U+0026 AMPERSAND
   1138           (&) must be unconsumed, and nothing is returned.
   1139 
   1140           Otherwise, return a character token for the character
   1141           corresponding to the character reference name (as given by the
   1142           second column of the named character references table).
   1143 
   1144           If the markup contains I'm &notit; I tell you, the character
   1145           reference is parsed as "not", as in, I'm ¬it; I tell you. But if
   1146           the markup was I'm &notin; I tell you, the character reference
   1147           would be parsed as "notin;", resulting in I'm ∉ I tell you.
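
The longest-match rule and the attribute exception can be illustrated with a short Python sketch. The `NAMED` dictionary below is a tiny hypothetical subset standing in for the named character references table, and the function name and return shape are assumptions of the example.

```python
import string

# Hypothetical subset of the named character references table.
NAMED = {
    "amp": "&", "amp;": "&",
    "not": "\u00AC", "not;": "\u00AC",
    "notin;": "\u2209",
}
MAX_NAME_LENGTH = max(map(len, NAMED))
ASCII_ALNUM = string.ascii_letters + string.digits


def consume_named_reference(text, pos, in_attribute=False):
    """pos is just after '&'. Returns (character or None, next_pos, parse_error)."""
    # Try the longest candidate first so "notin;" wins over "not".
    for length in range(min(MAX_NAME_LENGTH, len(text) - pos), 0, -1):
        name = text[pos:pos + length]
        if name in NAMED:           # case-sensitive lookup
            break
    else:
        return None, pos, True      # no match: nothing consumed
    end = pos + length
    if name.endswith(";"):
        return NAMED[name], end, False
    next_char = text[end:end + 1]
    if in_attribute and next_char and next_char in ASCII_ALNUM:
        return None, pos, True      # historical rule: unconsume everything
    return NAMED[name], end, True   # missing semicolon is a parse error
```

With this subset, the input `notit; I tell you` matches only `not` and leaves `it;` in the stream, whereas `notin; I tell you` matches the full `notin;` entry. In an attribute value the first case returns nothing, because the character after `not` is an ASCII letter.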
   1148