Cross Reference: /external/owasp/sanitizer/lib/htmlparser-1.3/doc/tokenization.txt

Lines Matching full:character
16    states consume a single character, which may have various side-effects,
18    same character, or switches it to a new state (to consume the next
19    character), or repeats the same state (to consume the next character).
32    following tokens: DOCTYPE, start tag, end tag, comment, character,
42    Comment and character tokens have data.
78    Consume the next input character:
83           character reference data state.
95           In any case, emit the input character as a character token. Stay
113           In any case, emit the input character as a character token. Stay
120           Emit the input character as a character token. Stay in the data
123       8.2.4.2 Character reference data state
128    Attempt to consume a character reference, with no additional allowed
129    character.
131    If nothing is returned, emit a U+0026 AMPERSAND character token.
133    Otherwise, emit the character token that was returned.
142           Consume the next input character. If it is a U+002F SOLIDUS (/)
143           character, switch to the close tag open state. Otherwise, emit a
144           U+003C LESS-THAN SIGN character token and reconsume the current
145           input character in the data state.
148           Consume the next input character:
159                 lowercase version of the input character (add 0x0020 to
160                 the character's code point), then switch to the tag name
166                 input character, then switch to the tag name state. (Don't
171                 Parse error. Emit a U+003C LESS-THAN SIGN character token
172                 and a U+003E GREATER-THAN SIGN character token. Switch to
179                 Parse error. Emit a U+003C LESS-THAN SIGN character token
180                 and reconsume the current input character in the data
192      * U+0009 CHARACTER TABULATION
200    ...then emit a U+003C LESS-THAN SIGN character token, a U+002F SOLIDUS
201    character token, and switch to the data state to process the next input
202    character.
206    character:
210           version of the input character (add 0x0020 to the character's
217           character, then switch to the tag name state. (Don't emit the
225           Parse error. Emit a U+003C LESS-THAN SIGN character token and a
226           U+002F SOLIDUS character token. Reconsume the EOF character in
234    Consume the next input character:
236    U+0009 CHARACTER TABULATION
249           Append the lowercase version of the current input character (add
250           0x0020 to the character's code point) to the current tag token's
255           character in the data state.
258           Append the current input character to the current tag token's
263    Consume the next input character:
265    U+0009 CHARACTER TABULATION
280           character (add 0x0020 to the character's code point), and its
290           character in the data state.
294           attribute's name to the current input character, and its value
299    Consume the next input character:
301    U+0009 CHARACTER TABULATION
317           Append the lowercase version of the current input character (add
318           0x0020 to the character's code point) to the current attribute's
327           character in the data state.
330           Append the current input character to the current attribute's
342    Consume the next input character:
344    U+0009 CHARACTER TABULATION
362           character (add 0x0020 to the character's code point), and its
371           character in the data state.
375           attribute's name to the current input character, and its value
380    Consume the next input character:
382    U+0009 CHARACTER TABULATION
393           this input character.
406           Parse error. Emit the current tag token. Reconsume the character
410           Append the current input character to the current attribute's
415    Consume the next input character:
421           Switch to the character reference in attribute value state, with
422           the additional allowed character being U+0022 QUOTATION MARK (").
426           Parse error. Emit the current tag token. Reconsume the character
430           Append the current input character to the current attribute's
435    Consume the next input character:
441           Switch to the character reference in attribute value state, with
442           the additional allowed character being U+0027 APOSTROPHE (').
445           Parse error. Emit the current tag token. Reconsume the character
449           Append the current input character to the current attribute's
454    Consume the next input character:
456    U+0009 CHARACTER TABULATION
463           Switch to the character reference in attribute value state, with
464           no additional allowed character.
475           Parse error. Emit the current tag token. Reconsume the character
479           Append the current input character to the current attribute's
482       8.2.4.13 Character reference in attribute value state
484    Attempt to consume a character reference.
486    If nothing is returned, append a U+0026 AMPERSAND character to the
489    Otherwise, append the returned character token to the current
497    Consume the next input character:
499    U+0009 CHARACTER TABULATION
513           character in the data state.
516           Parse error. Reconsume the character in the before attribute
521    Consume the next input character:
529           character in the data state.
532           Parse error. Reconsume the character in the before attribute
540    Consume every character up to and including the first U+003E
541    GREATER-THAN SIGN character (>) or the end of the file (EOF), whichever
543    all the characters starting from and including the character that
545    and including the character immediately before the last consumed
546    character (i.e. up to the character just before the U+003E or EOF
547    character). (If the comment was started by the end of the file (EOF),
552    If the end of the file was reached, reconsume the EOF character.
571    character before and after), then consume those characters and switch
576    The next character that is consumed, if any, is the first character
581    Consume the next input character:
590           Parse error. Emit the comment token. Reconsume the EOF character
594           Append the input character to the comment token's data. Switch
599    Consume the next input character:
608           Parse error. Emit the comment token. Reconsume the EOF character
612           Append a U+002D HYPHEN-MINUS (-) character and the input
613           character to the comment token's data. Switch to the comment
618    Consume the next input character:
624           Parse error. Emit the comment token. Reconsume the EOF character
628           Append the input character to the comment token's data. Stay in
633    Consume the next input character:
639           Parse error. Emit the comment token. Reconsume the EOF character
643           Append a U+002D HYPHEN-MINUS (-) character and the input
644           character to the comment token's data. Switch to the comment
649    Consume the next input character:
655           Parse error. Append a U+002D HYPHEN-MINUS (-) character to the
659           Parse error. Emit the comment token. Reconsume the EOF character
664           the input character to the comment token's data. Switch to the
669    Consume the next input character:
671    U+0009 CHARACTER TABULATION
678           Parse error. Reconsume the current character in the before
683    Consume the next input character:
685    U+0009 CHARACTER TABULATION
697           lowercase version of the input character (add 0x0020 to the
698           character's code point). Switch to the DOCTYPE name state.
702           flag to on. Emit the token. Reconsume the EOF character in the
707           input character. Switch to the DOCTYPE name state.
711    Consume the next input character:
713    U+0009 CHARACTER TABULATION
723           Append the lowercase version of the input character (add 0x0020
724           to the character's code point) to the current DOCTYPE token's
729           Emit that DOCTYPE token. Reconsume the EOF character in the data
733           Append the current input character to the current DOCTYPE
738    Consume the next input character:
740    U+0009 CHARACTER TABULATION
751           Emit that DOCTYPE token. Reconsume the EOF character in the data
755           If the six characters starting from the current input character
761           character are an ASCII case-insensitive match for the word
770    Consume the next input character:
772    U+0009 CHARACTER TABULATION
794           Emit that DOCTYPE token. Reconsume the EOF character in the data
803    Consume the next input character:
814           Emit that DOCTYPE token. Reconsume the EOF character in the data
818           Append the current input character to the current DOCTYPE
824    Consume the next input character:
835           Emit that DOCTYPE token. Reconsume the EOF character in the data
839           Append the current input character to the current DOCTYPE
845    Consume the next input character:
847    U+0009 CHARACTER TABULATION
868           Emit that DOCTYPE token. Reconsume the EOF character in the data
877    Consume the next input character:
879    U+0009 CHARACTER TABULATION
901           Emit that DOCTYPE token. Reconsume the EOF character in the data
910    Consume the next input character:
921           Emit that DOCTYPE token. Reconsume the EOF character in the data
925           Append the current input character to the current DOCTYPE
931    Consume the next input character:
942           Emit that DOCTYPE token. Reconsume the EOF character in the data
946           Append the current input character to the current DOCTYPE
952    Consume the next input character:
954    U+0009 CHARACTER TABULATION
965           Emit that DOCTYPE token. Reconsume the EOF character in the data
974    Consume the next input character:
980           Emit the DOCTYPE token. Reconsume the EOF character in the data
991    Consume every character up to the next occurrence of the three
992    character sequence U+005D RIGHT SQUARE BRACKET U+005D RIGHT SQUARE
994    whichever comes first. Emit a series of character tokens consisting of
995    all the characters consumed except the matching three character
1000    If the end of the file was reached, reconsume the EOF character.
1002       8.2.4.37 Tokenizing character references
1004    This section defines how to consume a character reference. This
1005    definition is used when parsing character references in text and in
1008    The behavior depends on the identity of the next character (the one
1009    immediately after the U+0026 AMPERSAND character):
1011    U+0009 CHARACTER TABULATION
1018    The additional allowed character, if there is one
1019           Not a character reference. No characters are consumed, and
1025           The behavior further depends on the character after the U+0023
1053           characters (and unconsume the U+0023 NUMBER SIGN character and,
1054           if appropriate, the X character). This is a parse error; nothing
1057           Otherwise, if the next character is a U+003B SEMICOLON, consume
1066           that number in the first column, and return a character token
1067           for the Unicode character given in the second column of that
1070           Number                   Unicode character
1073           0x81   U+FFFD REPLACEMENT CHARACTER
1085           0x8D   U+FFFD REPLACEMENT CHARACTER
1087           0x8F   U+FFFD REPLACEMENT CHARACTER
1088           0x90   U+FFFD REPLACEMENT CHARACTER
1101           0x9D   U+FFFD REPLACEMENT CHARACTER
1113           a parse error; return a character token for the U+FFFD
1114           REPLACEMENT CHARACTER character instead.
1116           Otherwise, return a character token for the Unicode character
1122           column of the named character references table (in a
1128           If the last character matched is not a U+003B SEMICOLON (;),
1131           If the character reference is being consumed as part of an
1132           attribute, and the last character matched is not a U+003B
1133           SEMICOLON (;), and the next character is in the range U+0030
1140           Otherwise, return a character token for the character
1141           corresponding to the character reference name (as given by the
1142           second column of the named character references table).
1144           If the markup contains I'm &notit; I tell you, the character
1146           the markup was I'm &notin; I tell you, the character reference
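
Two rules recur in the hits above: lowercasing an uppercase letter by adding 0x0020 to its code point, and mapping certain numeric character references to U+FFFD REPLACEMENT CHARACTER. The following is a minimal sketch of those two rules only, not code from htmlparser-1.3. The names `ascii_lowercase`, `remap_numeric_reference`, and `REPLACEMENTS` are hypothetical. The table holds only the five rows visible in this listing, and the restriction of the lowercase rule to A-Z is an assumption, since the matching condition is not among the hits.

```python
# Sketch only: every name here is hypothetical and chosen for illustration.

# Rows of the replacement table that appear in the hits above
# (0x81, 0x8D, 0x8F, 0x90, 0x9D). Rows not shown in the listing are omitted.
REPLACEMENTS = {n: "\uFFFD" for n in (0x81, 0x8D, 0x8F, 0x90, 0x9D)}


def ascii_lowercase(ch: str) -> str:
    """Lowercase A-Z by adding 0x0020 to the code point.

    Assumption: only U+0041..U+005A are shifted; other characters pass through.
    """
    if "A" <= ch <= "Z":
        return chr(ord(ch) + 0x0020)
    return ch


def remap_numeric_reference(number: int) -> str:
    """Return the character for a numeric reference, using the visible rows.

    The caller is assumed to pass a valid code point; the parse-error range
    checks are not covered by this sketch.
    """
    if number in REPLACEMENTS:
        return REPLACEMENTS[number]
    return chr(number)
```

For instance, `ascii_lowercase("D")` gives `"d"`, and `remap_numeric_reference(0x8D)` gives U+FFFD, while a number outside the table is returned as its own character.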