:mod:`sgmllib` --- Simple SGML parser
=====================================

.. module:: sgmllib
   :synopsis: Only as much of an SGML parser as needed to parse HTML.
   :deprecated:

.. deprecated:: 2.6
    The :mod:`sgmllib` module has been removed in Python 3.

.. index:: single: SGML

This module defines a class :class:`SGMLParser` which serves as the basis for
parsing text files formatted in SGML (Standard Generalized Mark-up Language).
In fact, it does not provide a full SGML parser --- it only parses SGML insofar
as it is used by HTML, and the module only exists as a base for the
:mod:`htmllib` module.  Another HTML parser which supports XHTML and offers a
somewhat different interface is available in the :mod:`HTMLParser` module.


.. class:: SGMLParser()

   The :class:`SGMLParser` class is instantiated without arguments. The parser is
   hardcoded to recognize the following constructs:

   * Opening and closing tags of the form ``<tag attr="value" ...>`` and
     ``</tag>``, respectively.

   * Numeric character references of the form ``&#name;``.

   * Entity references of the form ``&name;``.

   * SGML comments of the form ``<!--text-->``.  Note that spaces, tabs, and
     newlines are allowed between the trailing ``>`` and the immediately preceding
     ``--``.

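As a rough illustration of the four constructs listed above, the following
sketch classifies a single piece of markup with regular expressions.  The
patterns and the ``classify`` helper are hypothetical simplifications written
for this example; they are not the expressions :mod:`sgmllib` itself uses, and
the character reference pattern assumes a decimal number.

```python
import re

# Hypothetical, simplified patterns written for illustration only.
PATTERNS = [
    # Whitespace is allowed between the closing "--" and ">".
    ("comment", re.compile(r"<!--(.*?)--\s*>", re.DOTALL)),
    ("endtag", re.compile(r"</([A-Za-z][-.A-Za-z0-9]*)\s*>")),
    ("starttag", re.compile(r"<([A-Za-z][-.A-Za-z0-9]*)([^<>]*)>")),
    ("charref", re.compile(r"&#([0-9]+);")),
    ("entityref", re.compile(r"&([A-Za-z][-.A-Za-z0-9]*);")),
]


def classify(text):
    """Return (kind, first captured group) if text is exactly one construct."""
    for kind, pattern in PATTERNS:
        match = pattern.match(text)
        if match and match.end() == len(text):
            return kind, match.group(1)
    return None  # plain data, or something this sketch does not recognize
```

For example, ``classify("&amp;")`` returns ``("entityref", "amp")``, while
plain text such as ``classify("hello")`` returns ``None``.
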
A single exception is defined as well:


.. exception:: SGMLParseError

   Exception raised by the :class:`SGMLParser` class when it encounters an error
   while parsing.

   .. versionadded:: 2.1

:class:`SGMLParser` instances have the following methods:


.. method:: SGMLParser.reset()

   Reset the instance.  Loses all unprocessed data.  This is called implicitly at
   instantiation time.


.. method:: SGMLParser.setnomoretags()

   Stop processing tags.  Treat all following input as literal input (CDATA).
   (This is only provided so the HTML tag ``<PLAINTEXT>`` can be implemented.)


.. method:: SGMLParser.setliteral()

   Enter literal mode (CDATA mode).


.. method:: SGMLParser.feed(data)

   Feed some text to the parser.  It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called.


.. method:: SGMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark.  This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
   the :meth:`close` method of the base class.

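The buffering behavior of :meth:`feed` and :meth:`close` can be pictured with a
small stand-in class.  ``ChunkBuffer`` and ``CountingBuffer`` are hypothetical
names invented for this sketch; the code only separates markup from text, and
it assumes that leftover input is handed over as text at end of input.  It does
not reproduce the parser's real logic.

```python
class ChunkBuffer(object):
    """Toy stand-in for the feed/close buffering; not sgmllib code."""

    def __init__(self):
        self.rawdata = ""  # text received but not yet processed
        self.events = []   # processed pieces, in input order

    def feed(self, data):
        self.rawdata += data
        self._process(final=False)

    def close(self):
        self._process(final=True)

    def _process(self, final):
        data = self.rawdata
        pos = 0
        while pos < len(data):
            if data[pos] == "<":
                end = data.find(">", pos)
                if end < 0:
                    if final:
                        # End of input: hand the leftover over as plain text.
                        self.events.append(("text", data[pos:]))
                        pos = len(data)
                    break  # otherwise keep it buffered until more data arrives
                self.events.append(("markup", data[pos:end + 1]))
                pos = end + 1
            else:
                end = data.find("<", pos)
                if end < 0:
                    end = len(data)
                self.events.append(("text", data[pos:end]))
                pos = end
        self.rawdata = data[pos:]


class CountingBuffer(ChunkBuffer):
    """A derived class that adds end-of-input processing."""

    def close(self):
        ChunkBuffer.close(self)  # always run the base class version
        self.events.append(("eof", len(self.events)))
```

Feeding ``"<p>Hello <b"`` leaves ``"<b"`` in the buffer until a later call
supplies the closing ``>`` or :meth:`close` flushes it.
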

.. method:: SGMLParser.get_starttag_text()

   Return the text of the most recently opened start tag.  This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).


.. method:: SGMLParser.handle_starttag(tag, method, attributes)

   This method is called to handle start tags for which either a :meth:`start_tag`
   or :meth:`do_tag` method has been defined.  The *tag* argument is the name of
   the tag converted to lower case, and the *method* argument is the bound method
   which should be used to support semantic interpretation of the start tag. The
   *attributes* argument is a list of ``(name, value)`` pairs containing the
   attributes found inside the tag's ``<>`` brackets.

   The *name* has been translated to lower case. Double quotes and backslashes in
   the *value* have been interpreted, as well as known character references and
   known entity references terminated by a semicolon (normally, entity references
   can be terminated by any non-alphanumerical character, but this would break the
   very common case of ``<A HREF="url?spam=1&eggs=2">`` when ``eggs`` is a valid
   entity name).

   For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, the *attributes*
   argument would be ``[('href', 'http://www.cwi.nl/')]``.  If neither a
   :meth:`start_a` nor a :meth:`do_a` method is defined, this method is not
   called and the parser calls
   ``unknown_starttag('a', [('href', 'http://www.cwi.nl/')])`` instead.  The
   base implementation simply calls *method* with *attributes* as the only
   argument.

   .. versionadded:: 2.5
      Handling of entity and character references within attribute values.

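The semicolon rule for attribute values can be illustrated with a short
sketch.  ``resolve_attr_value``, ``ENTITYDEFS`` and ``REFERENCE`` are
hypothetical names chosen for this example, and the code is a simplified
approximation of the behavior described above rather than the parser's own
implementation.

```python
import re

# The five entity names the text lists for the default entitydefs.
ENTITYDEFS = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}

# A reference is only recognized when it is terminated by a semicolon.
REFERENCE = re.compile(r"&(#[0-9]+|[A-Za-z][-.A-Za-z0-9]*);")


def resolve_attr_value(value, entitydefs=ENTITYDEFS):
    """Replace known, semicolon-terminated references in an attribute value."""

    def replace(match):
        ref = match.group(1)
        if ref.startswith("#"):
            number = int(ref[1:])
            # Range taken from the convert_charref description (0--255).
            if 0 <= number <= 255:
                return chr(number)
            return match.group(0)  # unknown: leave the text untouched
        return entitydefs.get(ref, match.group(0))

    return REFERENCE.sub(replace, value)
```

Even with a hypothetical ``eggs`` entry in the mapping, the value
``url?spam=1&eggs=2`` comes back unchanged because ``&eggs`` is not followed by
a semicolon, whereas ``a&amp;b`` becomes ``a&b``.
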

.. method:: SGMLParser.handle_endtag(tag, method)

   This method is called to handle end tags for which an :meth:`end_tag` method has
   been defined.  The *tag* argument is the name of the tag converted to lower
   case, and the *method* argument is the bound method which should be used to
   support semantic interpretation of the end tag.  If no :meth:`end_tag` method is
   defined for the closing element, this handler is not called.  The base
   implementation simply calls *method*.


.. method:: SGMLParser.handle_data(data)

   This method is called to process arbitrary data.  It is intended to be
   overridden by a derived class; the base class implementation does nothing.


.. method:: SGMLParser.handle_charref(ref)

   This method is called to process a character reference of the form ``&#ref;``.
   The base implementation uses :meth:`convert_charref` to convert the reference to
   a string.  If that method returns a string, it is passed to :meth:`handle_data`,
   otherwise ``unknown_charref(ref)`` is called to handle the error.

   .. versionchanged:: 2.5
      Use :meth:`convert_charref` instead of hard-coding the conversion.


.. method:: SGMLParser.convert_charref(ref)

   Convert a character reference to a string, or ``None``.  *ref* is the reference
   passed in as a string.  In the base implementation, *ref* must be a decimal
   number in the range 0--255.  It converts the code point found using the
   :meth:`convert_codepoint` method. If *ref* is invalid or out of range, this
   method returns ``None``.  This method is called by the default
   :meth:`handle_charref` implementation and by the attribute value parser.

   .. versionadded:: 2.5


.. method:: SGMLParser.convert_codepoint(codepoint)

   Convert a code point to a :class:`str` value.  Encodings can be handled here if
   appropriate, though the rest of :mod:`sgmllib` is oblivious to this matter.

   .. versionadded:: 2.5

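The way :meth:`handle_charref`, :meth:`convert_charref` and
:meth:`convert_codepoint` cooperate can be sketched as follows.
``CharRefSketch`` is a hypothetical stand-in that follows the descriptions
above, including the stated 0--255 range; it is not the module's source and
omits everything unrelated to character references.

```python
class CharRefSketch(object):
    """Stand-in mirroring the documented character reference flow."""

    def __init__(self):
        self.data = []     # strings passed to handle_data
        self.unknown = []  # references passed to unknown_charref

    def handle_data(self, data):
        self.data.append(data)

    def unknown_charref(self, ref):
        self.unknown.append(ref)

    def convert_codepoint(self, codepoint):
        # An encoding-aware subclass could override this hook.
        return chr(codepoint)

    def convert_charref(self, ref):
        try:
            number = int(ref)  # only decimal references are accepted
        except ValueError:
            return None
        if not 0 <= number <= 255:
            return None
        return self.convert_codepoint(number)

    def handle_charref(self, ref):
        replacement = self.convert_charref(ref)
        if replacement is None:
            self.unknown_charref(ref)
        else:
            self.handle_data(replacement)
```

In this sketch ``handle_charref("65")`` passes ``"A"`` to ``handle_data``,
while ``"8364"`` (out of range) and ``"x41"`` (not decimal) end up in
``unknown_charref``.
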

.. method:: SGMLParser.handle_entityref(ref)

   This method is called to process a general entity reference of the form
   ``&ref;`` where *ref* is the name of a general entity.  It converts *ref* by
   passing it to :meth:`convert_entityref`.  If a translation is returned, it calls
   the method :meth:`handle_data` with the translation; otherwise, it calls the
   method ``unknown_entityref(ref)``. The default :attr:`entitydefs` defines
   translations for ``&amp;``, ``&apos;``, ``&gt;``, ``&lt;``, and ``&quot;``.

   .. versionchanged:: 2.5
      Use :meth:`convert_entityref` instead of hard-coding the conversion.


.. method:: SGMLParser.convert_entityref(ref)

   Convert a named entity reference to a :class:`str` value, or ``None``.  The
   resulting value will not be parsed.  *ref* will be only the name of the entity.
   The default implementation looks for *ref* in the instance (or class) variable
   :attr:`entitydefs` which should be a mapping from entity names to corresponding
   translations.  If no translation is available for *ref*, this method returns
   ``None``.  This method is called by the default :meth:`handle_entityref`
   implementation and by the attribute value parser.

   .. versionadded:: 2.5

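The :attr:`entitydefs` lookup can be pictured with another stand-in.
``EntityRefSketch`` and ``WithNbsp`` are hypothetical classes written for this
example; the mapping holds the five default translations named above, and the
``nbsp`` entry is an illustrative addition, not a default.

```python
class EntityRefSketch(object):
    """Stand-in mirroring the documented entity reference flow."""

    # The five default translations named in the text.
    entitydefs = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}

    def __init__(self):
        self.data = []     # strings passed to handle_data
        self.unknown = []  # names passed to unknown_entityref

    def handle_data(self, data):
        self.data.append(data)

    def unknown_entityref(self, ref):
        self.unknown.append(ref)

    def convert_entityref(self, ref):
        # Returns None when no translation is available.
        return self.entitydefs.get(ref)

    def handle_entityref(self, ref):
        replacement = self.convert_entityref(ref)
        if replacement is None:
            self.unknown_entityref(ref)
        else:
            self.handle_data(replacement)


class WithNbsp(EntityRefSketch):
    # Copy the class-level mapping instead of mutating the shared one.
    entitydefs = dict(EntityRefSketch.entitydefs, nbsp=" ")
```

Extending the mapping in a subclass, as ``WithNbsp`` does, leaves the base
class untouched, so ``nbsp`` is still reported as unknown there.
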

.. method:: SGMLParser.handle_comment(comment)

   This method is called when a comment is encountered.  The *comment* argument is
   a string containing the text between the ``<!--`` and ``-->`` delimiters, but
   not the delimiters themselves.  For example, the comment ``<!--text-->`` will
   cause this method to be called with the argument ``'text'``.  The default method
   does nothing.


.. method:: SGMLParser.handle_decl(data)

   Method called when an SGML declaration is read by the parser.  In practice, the
   ``DOCTYPE`` declaration is the only thing observed in HTML, but the parser does
   not discriminate among different (or broken) declarations.  Internal subsets in
   a ``DOCTYPE`` declaration are not supported.  The *data* parameter will be the
   entire contents of the declaration inside the ``<!``...\ ``>`` markup.  The
   default implementation does nothing.


.. method:: SGMLParser.report_unbalanced(tag)

   This method is called when an end tag is found which does not correspond to any
   open element.


.. method:: SGMLParser.unknown_starttag(tag, attributes)

   This method is called to process an unknown start tag.  It is intended to be
   overridden by a derived class; the base class implementation does nothing.


.. method:: SGMLParser.unknown_endtag(tag)

   This method is called to process an unknown end tag.  It is intended to be
   overridden by a derived class; the base class implementation does nothing.


.. method:: SGMLParser.unknown_charref(ref)

   This method is called to process unresolvable numeric character references.
   Refer to :meth:`handle_charref` to determine what is handled by default.  It is
   intended to be overridden by a derived class; the base class implementation does
   nothing.


.. method:: SGMLParser.unknown_entityref(ref)

   This method is called to process an unknown entity reference.  It is intended to
   be overridden by a derived class; the base class implementation does nothing.

Apart from overriding or extending the methods listed above, derived classes may
also define methods of the following form to define processing of specific tags.
Tag names in the input stream are case independent; the *tag* occurring in
method names must be in lower case:


.. method:: SGMLParser.start_tag(attributes)
   :noindex:

   This method is called to process an opening tag *tag*.  It takes precedence over
   :meth:`do_tag`.  The *attributes* argument has the same meaning as described for
   :meth:`handle_starttag` above.


.. method:: SGMLParser.do_tag(attributes)
   :noindex:

   This method is called to process an opening tag *tag* for which no
   :meth:`start_tag` method is defined.  The *attributes* argument has the same
   meaning as described for :meth:`handle_starttag` above.


.. method:: SGMLParser.end_tag()
   :noindex:

   This method is called to process a closing tag *tag*.

Note that the parser maintains a stack of open elements for which no end tag has
been found yet.  Only tags processed by :meth:`start_tag` are pushed on this
stack.  Definition of an :meth:`end_tag` method is optional for these tags.  For
tags processed by :meth:`do_tag` or by :meth:`unknown_starttag`, no
:meth:`end_tag` method needs to be defined; if one is defined, it will not be
used.  If both :meth:`start_tag` and :meth:`do_tag` methods exist for a tag, the
:meth:`start_tag` method takes precedence.

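The naming rules and the element stack can be summarized in a toy dispatcher.
``DispatchSketch``, ``Demo`` and the two ``finish_*`` entry points are names
chosen for this sketch.  It calls the handlers directly instead of going
through :meth:`handle_starttag` and :meth:`handle_endtag`, and its treatment of
unmatched and nested end tags is an assumption rather than a statement about
the module's source.

```python
class DispatchSketch(object):
    """Toy dispatcher for the naming rules above; not the module's source."""

    def __init__(self):
        self.stack = []  # open elements that were handled by start_<tag>
        self.log = []    # record of handler calls, for inspection

    def unknown_starttag(self, tag, attributes):
        self.log.append(("unknown_starttag", tag))

    def unknown_endtag(self, tag):
        self.log.append(("unknown_endtag", tag))

    def report_unbalanced(self, tag):
        self.log.append(("report_unbalanced", tag))

    def finish_starttag(self, tag, attributes):
        tag = tag.lower()  # method names always use the lower-case tag
        method = getattr(self, "start_" + tag, None)
        if method is not None:
            self.stack.append(tag)  # only start_<tag> elements are stacked
        else:
            method = getattr(self, "do_" + tag, None)
        if method is None:
            self.unknown_starttag(tag, attributes)
        else:
            method(attributes)

    def finish_endtag(self, tag):
        tag = tag.lower()
        if tag not in self.stack:
            # Assumption: with no matching open element, the tag is reported
            # as unbalanced if an end_<tag> method exists, else as unknown.
            if hasattr(self, "end_" + tag):
                self.report_unbalanced(tag)
            else:
                self.unknown_endtag(tag)
            return
        # Assumption: elements left open inside the one being closed are
        # closed along with it.
        while self.stack:
            open_tag = self.stack.pop()
            method = getattr(self, "end_" + open_tag, None)
            if method is not None:
                method()
            else:
                self.unknown_endtag(open_tag)
            if open_tag == tag:
                break


class Demo(DispatchSketch):
    def start_a(self, attributes):
        self.log.append(("start_a", attributes))

    def end_a(self):
        self.log.append(("end_a",))

    def do_br(self, attributes):
        self.log.append(("do_br", attributes))

    def end_br(self):  # defined, but never called: <br> is not stacked
        self.log.append(("end_br",))
```

In ``Demo``, only ``a`` elements are pushed on the stack, so ``end_a`` runs for
``</a>`` while ``end_br`` is never called.
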