:mod:`htmllib` --- A parser for HTML documents
==============================================

.. module:: htmllib
   :synopsis: A parser for HTML documents.
   :deprecated:

.. deprecated:: 2.6
    The :mod:`htmllib` module has been removed in Python 3.
    Use :mod:`HTMLParser` instead in Python 2, and the equivalent,
    :mod:`html.parser`, in Python 3.


.. index::
   single: HTML
   single: hypertext

.. index::
   module: sgmllib
   module: formatter
   single: SGMLParser (in module sgmllib)

This module defines a class which can serve as a base for parsing text files
formatted in the HyperText Mark-up Language (HTML).  The class is not directly
concerned with I/O --- it must be provided with input in string form via a
method, and makes calls to methods of a "formatter" object in order to produce
output.  The :class:`~HTMLParser.HTMLParser` class is designed to be used as a base class
for other classes in order to add functionality, and allows most of its methods
to be extended or overridden.  In turn, this class is derived from and extends
the :class:`SGMLParser` class defined in module :mod:`sgmllib`.  The
:class:`~HTMLParser.HTMLParser` implementation supports the HTML 2.0 language as described
in :rfc:`1866`.  Two implementations of formatter objects are provided in the
:mod:`formatter` module; refer to the documentation for that module for
information on the formatter interface.

The following is a summary of the interface defined by
:class:`sgmllib.SGMLParser`:

* Data is fed to an instance through the :meth:`feed` method, which takes a
  string argument.  This can be called with as little or as much text at a time
  as desired; ``p.feed(a); p.feed(b)`` has the same effect as ``p.feed(a+b)``.
  When the data contains complete HTML markup constructs, these are processed
  immediately; incomplete constructs are saved in a buffer.  To force processing
  of all unprocessed data, call the :meth:`close` method.

  For example, to parse the entire contents of a file, use::

     with open('myfile.html') as f:
         parser.feed(f.read())
     parser.close()

* To define semantics for HTML tags, derive a class and define methods called
  :meth:`start_tag`, :meth:`end_tag`, or :meth:`do_tag`.  The parser will call
  these at appropriate moments: :meth:`start_tag` or :meth:`do_tag` is called
  when an opening tag of the form ``<tag ...>`` is encountered; :meth:`end_tag`
  is called when a closing tag of the form ``</tag>`` is encountered.  If an
  opening tag requires a corresponding closing tag, like ``<H1>`` ... ``</H1>``,
  the class should define the :meth:`start_tag` method; if a tag requires no
  closing tag, like ``<P>``, the class should define the :meth:`do_tag` method.

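
The incremental feeding and tag-method conventions summarized above can be
sketched in a form that also runs on Python 3, where :mod:`sgmllib` is not
available.  This is an illustrative sketch only: it is built on the replacement
:mod:`html.parser` module rather than on :class:`sgmllib.SGMLParser`, the names
``TagDispatchParser``, ``HeadingCollector`` and ``collect_events`` are
hypothetical, and the ``getattr`` lookup merely imitates the
``start_tag`` / ``end_tag`` / ``do_tag`` naming scheme.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class TagDispatchParser(HTMLParser):
    """Hypothetical helper imitating sgmllib-style tag method lookup."""

    def handle_starttag(self, tag, attrs):
        # Prefer start_<tag> (element with a closing tag), then do_<tag>.
        method = (getattr(self, "start_" + tag, None)
                  or getattr(self, "do_" + tag, None))
        if method is not None:
            method(attrs)

    def handle_endtag(self, tag):
        method = getattr(self, "end_" + tag, None)
        if method is not None:
            method()


class HeadingCollector(TagDispatchParser):
    """Records which tag methods were called, in order."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def start_h1(self, attrs):
        self.events.append("start h1")

    def end_h1(self):
        self.events.append("end h1")

    def do_p(self, attrs):
        # <p> is treated as needing no closing tag, so only do_p is defined.
        self.events.append("do p")


def collect_events(chunks):
    """Feed the chunks one at a time, then flush with close()."""
    parser = HeadingCollector()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.events
```

Feeding a document in arbitrary pieces, including pieces that split a tag in
the middle, is expected to produce the same sequence of calls as feeding it in
one string, because incomplete constructs stay buffered until more data arrives.
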
The module defines a parser class and an exception:


.. class:: HTMLParser(formatter)

   This is the basic HTML parser class.  It supports all entity names required by
   the XHTML 1.0 Recommendation (https://www.w3.org/TR/xhtml1).  It also defines
   handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements.


.. exception:: HTMLParseError

   Exception raised by the :class:`~HTMLParser.HTMLParser` class when it encounters an error
   while parsing.

   .. versionadded:: 2.4


.. seealso::

   Module :mod:`formatter`
      Interface definition for transforming an abstract flow of formatting events into
      specific output events on writer objects.

   Module :mod:`HTMLParser`
      Alternate HTML parser that offers a slightly lower-level view of the input, but
      is designed to work with XHTML, and does not implement some of the SGML syntax
      not used in "HTML as deployed" and which isn't legal for XHTML.

   Module :mod:`htmlentitydefs`
      Definition of replacement text for XHTML 1.0 entities.

   Module :mod:`sgmllib`
      Base class for :class:`~HTMLParser.HTMLParser`.


.. _html-parser-objects:

HTMLParser Objects
------------------

In addition to tag methods, the :class:`~HTMLParser.HTMLParser` class provides some
additional methods and instance variables for use within tag methods.


.. attribute:: HTMLParser.formatter

   This is the formatter instance associated with the parser.


.. attribute:: HTMLParser.nofill

   Boolean flag which should be true when whitespace should not be collapsed, or
   false when it should be.  In general, this should only be true when character
   data is to be treated as "preformatted" text, as within a ``<PRE>`` element.
   The default value is false.  This affects the operation of :meth:`handle_data`
   and :meth:`save_end`.


.. method:: HTMLParser.anchor_bgn(href, name, type)

   This method is called at the start of an anchor region.  The arguments
   correspond to the attributes of the ``<A>`` tag with the same names.  The
   default implementation maintains a list of hyperlinks (defined by the ``HREF``
   attribute for ``<A>`` tags) within the document.  The list of hyperlinks is
   available as the data attribute :attr:`anchorlist`.


.. method:: HTMLParser.anchor_end()

   This method is called at the end of an anchor region.  The default
   implementation adds a textual footnote marker using an index into the list of
   hyperlinks created by :meth:`anchor_bgn`.


.. method:: HTMLParser.handle_image(source, alt[, ismap[, align[, width[, height]]]])

   This method is called to handle images.  The default implementation simply
   passes the *alt* value to the :meth:`handle_data` method.


.. method:: HTMLParser.save_bgn()

   Begins saving character data in a buffer instead of sending it to the formatter
   object.  Retrieve the stored data via :meth:`save_end`.  Calls to the
   :meth:`save_bgn` / :meth:`save_end` pair may not be nested.


.. method:: HTMLParser.save_end()

   Ends buffering character data and returns all data saved since the preceding
   call to :meth:`save_bgn`.  If the :attr:`nofill` flag is false, whitespace is
   collapsed to single spaces.  A call to this method without a preceding call to
   :meth:`save_bgn` will raise a :exc:`TypeError` exception.


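
The buffering contract described for :meth:`save_bgn` and :meth:`save_end` can
be modelled without :mod:`htmllib` itself, which helps on interpreters where
the module no longer exists.  The class below is a standalone, hypothetical
model written from the descriptions above; it is not the :mod:`htmllib`
implementation, and the ``sent`` list is an assumed stand-in for data that
would otherwise go to the formatter.

```python
class SaveBufferModel(object):
    """Hypothetical model of the documented save_bgn()/save_end() behaviour."""

    def __init__(self):
        self.nofill = False    # false: collapse whitespace in save_end()
        self.savedata = None   # None means "not currently saving"
        self.sent = []         # stand-in for data passed to the formatter

    def handle_data(self, data):
        # While saving, data goes to the buffer instead of the formatter.
        if self.savedata is not None:
            self.savedata += data
        else:
            self.sent.append(data)

    def save_bgn(self):
        self.savedata = ""

    def save_end(self):
        if self.savedata is None:
            # Mirrors the documented error for an unmatched save_end().
            raise TypeError("save_end() called without save_bgn()")
        data, self.savedata = self.savedata, None
        if not self.nofill:
            data = " ".join(data.split())
        return data
```

A tag method for an element such as ``<TITLE>`` would typically call
``save_bgn()`` when the element opens and ``save_end()`` when it closes, then
use the returned string instead of letting the text reach the formatter.
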
:mod:`htmlentitydefs` --- Definitions of HTML general entities
==============================================================

.. module:: htmlentitydefs
   :synopsis: Definitions of HTML general entities.
.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

.. note::

   The :mod:`htmlentitydefs` module has been renamed to :mod:`html.entities` in
   Python 3.  The :term:`2to3` tool will automatically adapt imports when
   converting your sources to Python 3.

**Source code:** :source:`Lib/htmlentitydefs.py`

--------------

This module defines three dictionaries, ``name2codepoint``, ``codepoint2name``,
and ``entitydefs``. ``entitydefs`` is used by the :mod:`htmllib` module to
provide the :attr:`entitydefs` attribute of the :class:`~HTMLParser.HTMLParser` class.  The
definition provided here contains all the entities defined by XHTML 1.0 that
can be handled using simple textual substitution in the Latin-1 character set
(ISO-8859-1).


.. data:: entitydefs

   A dictionary mapping XHTML 1.0 entity definitions to their replacement text in
   ISO Latin-1.


.. data:: name2codepoint

   A dictionary that maps HTML entity names to Unicode code points.

   .. versionadded:: 2.3


.. data:: codepoint2name

   A dictionary that maps Unicode code points to HTML entity names.

   .. versionadded:: 2.3
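
As a brief illustration of the two code point dictionaries, the sketch below
looks up entities in both directions.  The helper names ``entity_for`` and
``codepoint_for`` are hypothetical, and the fallback import assumes the Python 3
rename to :mod:`html.entities` mentioned in the note above.

```python
try:
    from htmlentitydefs import name2codepoint, codepoint2name  # Python 2
except ImportError:
    from html.entities import name2codepoint, codepoint2name  # Python 3


def entity_for(codepoint):
    """Return a named entity reference if one exists, else a numeric one."""
    name = codepoint2name.get(codepoint)
    if name is None:
        return "&#%d;" % codepoint
    return "&%s;" % name


def codepoint_for(name):
    """Return the code point for an entity name, or None if it is unknown."""
    return name2codepoint.get(name)
```

For example, ``entity_for(233)`` yields ``'&eacute;'`` and
``codepoint_for('amp')`` yields ``38``; code points without a named entity fall
back to a numeric character reference.
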