:mod:`htmllib` --- A parser for HTML documents
==============================================

.. module:: htmllib
   :synopsis: A parser for HTML documents.
   :deprecated:

.. deprecated:: 2.6
   The :mod:`htmllib` module has been removed in Python 3.
   Use :mod:`HTMLParser` instead in Python 2, and the equivalent,
   :mod:`html.parser`, in Python 3.


.. index::
   single: HTML
   single: hypertext

.. index::
   module: sgmllib
   module: formatter
   single: SGMLParser (in module sgmllib)

This module defines a class which can serve as a base for parsing text files
formatted in the HyperText Mark-up Language (HTML).  The class is not directly
concerned with I/O --- it must be provided with input in string form via a
method, and makes calls to methods of a "formatter" object in order to produce
output.  The :class:`~HTMLParser.HTMLParser` class is designed to be used as a base class
for other classes in order to add functionality, and allows most of its methods
to be extended or overridden.  In turn, this class is derived from and extends
the :class:`SGMLParser` class defined in module :mod:`sgmllib`.  The
:class:`~HTMLParser.HTMLParser` implementation supports the HTML 2.0 language as described
in :rfc:`1866`.  Two implementations of formatter objects are provided in the
:mod:`formatter` module; refer to the documentation for that module for
information on the formatter interface.

The following is a summary of the interface defined by
:class:`sgmllib.SGMLParser`:

* The interface to feed data to an instance is through the :meth:`feed` method,
  which takes a string argument.  This can be called with as little or as much
  text at a time as desired; ``p.feed(a); p.feed(b)`` has the same effect as
  ``p.feed(a+b)``.  When the data contains complete HTML markup constructs, these
  are processed immediately; incomplete constructs are saved in a buffer.
  To force processing of all unprocessed data, call the :meth:`close` method.

  For example, to parse the entire contents of a file, use::

     # Read the whole file, hand it to the parser, then flush buffered data.
     with open('myfile.html') as f:
         parser.feed(f.read())
     parser.close()

* The interface to define semantics for HTML tags is very simple: derive a class
  and define methods called :meth:`start_tag`, :meth:`end_tag`, or :meth:`do_tag`.
  The parser will call these at appropriate moments: :meth:`start_tag` or
  :meth:`do_tag` is called when an opening tag of the form ``<tag ...>`` is
  encountered; :meth:`end_tag` is called when a closing tag of the form ``</tag>``
  is encountered.  If an opening tag requires a corresponding closing tag, like
  ``<H1>`` ... ``</H1>``, the class should define the :meth:`start_tag` method; if
  a tag requires no closing tag, like ``<P>``, the class should define the
  :meth:`do_tag` method.

The module defines a parser class and an exception:


.. class:: HTMLParser(formatter)

   This is the basic HTML parser class.  It supports all entity names required by
   the XHTML 1.0 Recommendation (https://www.w3.org/TR/xhtml1).  It also defines
   handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements.


.. exception:: HTMLParseError

   Exception raised by the :class:`~HTMLParser.HTMLParser` class when it encounters an error
   while parsing.

   .. versionadded:: 2.4


.. seealso::

   Module :mod:`formatter`
      Interface definition for transforming an abstract flow of formatting events into
      specific output events on writer objects.

   Module :mod:`HTMLParser`
      Alternate HTML parser that offers a slightly lower-level view of the input, but
      is designed to work with XHTML, and does not implement some of the SGML syntax
      not used in "HTML as deployed" and which isn't legal for XHTML.

   Module :mod:`htmlentitydefs`
      Definition of replacement text for XHTML 1.0 entities.

   Module :mod:`sgmllib`
      Base class for :class:`~HTMLParser.HTMLParser`.


.. _html-parser-objects:

HTMLParser Objects
------------------

In addition to tag methods, the :class:`~HTMLParser.HTMLParser` class provides some
additional methods and instance variables for use within tag methods.


.. attribute:: HTMLParser.formatter

   This is the formatter instance associated with the parser.


.. attribute:: HTMLParser.nofill

   Boolean flag which should be true when whitespace should not be collapsed, or
   false when it should be.  In general, this should only be true when character
   data is to be treated as "preformatted" text, as within a ``<PRE>`` element.
   The default value is false.  This affects the operation of :meth:`handle_data`
   and :meth:`save_end`.


.. method:: HTMLParser.anchor_bgn(href, name, type)

   This method is called at the start of an anchor region.  The arguments
   correspond to the attributes of the ``<A>`` tag with the same names.  The
   default implementation maintains a list of hyperlinks (defined by the ``HREF``
   attribute for ``<A>`` tags) within the document.  The list of hyperlinks is
   available as the data attribute :attr:`anchorlist`.


.. method:: HTMLParser.anchor_end()

   This method is called at the end of an anchor region.  The default
   implementation adds a textual footnote marker using an index into the list of
   hyperlinks created by :meth:`anchor_bgn`.


.. method:: HTMLParser.handle_image(source, alt[, ismap[, align[, width[, height]]]])

   This method is called to handle images.  The default implementation simply
   passes the *alt* value to the :meth:`handle_data` method.


.. method:: HTMLParser.save_bgn()

   Begins saving character data in a buffer instead of sending it to the formatter
   object.  Retrieve the stored data via :meth:`save_end`.
   Use of the :meth:`save_bgn` / :meth:`save_end` pair may not be nested.


.. method:: HTMLParser.save_end()

   Ends buffering character data and returns all data saved since the preceding
   call to :meth:`save_bgn`.  If the :attr:`nofill` flag is false, whitespace is
   collapsed to single spaces.  A call to this method without a preceding call to
   :meth:`save_bgn` will raise a :exc:`TypeError` exception.


:mod:`htmlentitydefs` --- Definitions of HTML general entities
==============================================================

.. module:: htmlentitydefs
   :synopsis: Definitions of HTML general entities.
.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

.. note::

   The :mod:`htmlentitydefs` module has been renamed to :mod:`html.entities` in
   Python 3.  The :term:`2to3` tool will automatically adapt imports when
   converting your sources to Python 3.

**Source code:** :source:`Lib/htmlentitydefs.py`

--------------

This module defines three dictionaries, ``name2codepoint``, ``codepoint2name``,
and ``entitydefs``.  ``entitydefs`` is used by the :mod:`htmllib` module to
provide the :attr:`entitydefs` attribute of the :class:`~HTMLParser.HTMLParser` class.  The
definition provided here contains all the entities defined by XHTML 1.0 that
can be handled using simple textual substitution in the Latin-1 character set
(ISO-8859-1).


.. data:: entitydefs

   A dictionary mapping XHTML 1.0 entity definitions to their replacement text in
   ISO Latin-1.


.. data:: name2codepoint

   A dictionary that maps HTML entity names to the Unicode code points.

   .. versionadded:: 2.3


.. data:: codepoint2name

   A dictionary that maps Unicode code points to HTML entity names.

   .. versionadded:: 2.3
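
The two code point dictionaries can be combined with :mod:`re` to convert
between named entities and Unicode characters.  The following is only a minimal
sketch, not part of the module: the helper names ``expand_entities`` and
``escape_non_ascii`` are hypothetical, and the import fallback assumes the
Python 3 module name :mod:`html.entities` mentioned in the note above.

```python
import re

try:
    # Python 2 module name, as documented in this section.
    from htmlentitydefs import codepoint2name, name2codepoint
except ImportError:
    # Python 3 name of the same module.
    from html.entities import codepoint2name, name2codepoint

try:
    unichr  # Python 2 built-in
except NameError:
    unichr = chr  # Python 3: chr() already returns a Unicode character

# Matches named entity references such as "&eacute;" (assumed pattern).
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def expand_entities(text):
    """Replace known named entities in *text* with Unicode characters.

    Hypothetical helper; unknown entity names are left untouched.
    """
    def replace(match):
        name = match.group(1)
        if name in name2codepoint:
            return unichr(name2codepoint[name])
        return match.group(0)
    return _ENTITY_RE.sub(replace, text)


def escape_non_ascii(text):
    """Replace non-ASCII characters that have an entity name with ``&name;``.

    Hypothetical helper; characters without an entity name pass through.
    """
    pieces = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if name is not None and ord(ch) > 127:
            pieces.append("&%s;" % name)
        else:
            pieces.append(ch)
    return "".join(pieces)
```

For instance, ``expand_entities(u"caf&eacute;")`` looks up ``eacute`` in
``name2codepoint`` and substitutes the character with that code point, while
``escape_non_ascii`` performs the reverse lookup through ``codepoint2name``.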