README.chromium
Name: html5lib-python
Short Name: html5lib
URL: https://github.com/html5lib/html5lib-python
Version: 01b1ebb7ce0146b8082b1a7315431aac023eb046
License: MIT

Description:
Standards-compliant library for parsing and serializing HTML documents and
fragments in Python

Local Modifications: None

README.rst
html5lib
========

.. image:: https://travis-ci.org/html5lib/html5lib-python.png?branch=master
  :target: https://travis-ci.org/html5lib/html5lib-python

html5lib is a pure-Python library for parsing HTML. It is designed to
conform to the WHATWG HTML specification, as implemented by all major
web browsers.


Usage
-----

Simple usage follows this pattern:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      document = html5lib.parse(f)

or:

.. code-block:: python

  import html5lib
  document = html5lib.parse("<p>Hello World!")

By default, the ``document`` will be an ``xml.etree`` element instance.
Whenever possible, html5lib chooses the accelerated ``ElementTree``
implementation (i.e. ``xml.etree.cElementTree`` on Python 2.x).

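As a rough illustration of working with the default tree, the sketch
below parses a small fragment and reads its text back through the
standard ``ElementTree`` API. It assumes html5lib's default behaviour of
placing HTML elements in the XHTML namespace, so tag names carry a
``{http://www.w3.org/1999/xhtml}`` prefix; the ``XHTML`` constant is
only a local helper name.

```python
import html5lib

# Namespace prefix that html5lib applies to HTML elements by default
# (assumption: namespaceHTMLElements is left at its default of True).
XHTML = "{http://www.w3.org/1999/xhtml}"

# The parser adds the implied <html>, <head> and <body> elements.
document = html5lib.parse("<p>Hello <b>World</b>!")

# The returned object is the root <html> element, so the usual
# ElementTree traversal methods are available.
paragraphs = list(document.iter(XHTML + "p"))

# itertext() yields the text of the element and of all its descendants.
text = "".join(paragraphs[0].itertext())
print(document.tag)
print(text)
```

If the namespaced tag names are inconvenient, ``html5lib.parse`` also
accepts ``namespaceHTMLElements=False``, which should yield plain tag
names such as ``p``.
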
Two other tree types are supported: ``xml.dom.minidom`` and
``lxml.etree``. To use an alternative format, specify the name of
a treebuilder:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

When using html5lib with ``urllib2`` (Python 2), pass the charset from
the HTTP headers to html5lib as follows:

.. code-block:: python

  from contextlib import closing
  from urllib2 import urlopen
  import html5lib

  with closing(urlopen("http://example.com/")) as f:
      document = html5lib.parse(f, encoding=f.info().getparam("charset"))

When using html5lib with ``urllib.request`` (Python 3), pass the charset
from the HTTP headers to html5lib as follows:

.. code-block:: python

  from urllib.request import urlopen
  import html5lib

  with urlopen("http://example.com/") as f:
      document = html5lib.parse(f, encoding=f.info().get_content_charset())

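On Python 3, the object returned by ``f.info()`` is a subclass of
``email.message.Message``, which is where ``get_content_charset()``
comes from. The following standard-library-only sketch shows what that
call yields for a given ``Content-Type`` value without making a network
request; ``charset_from_content_type`` is a hypothetical helper used
purely for illustration.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type value, or None.

    Hypothetical helper: it mimics what f.info().get_content_charset()
    reports for an HTTP response carrying this header.
    """
    message = Message()
    message["Content-Type"] = content_type
    # get_content_charset() lower-cases the value and returns None
    # when the header has no charset parameter.
    return message.get_content_charset()


with_charset = charset_from_content_type("text/html; charset=UTF-8")
without_charset = charset_from_content_type("text/html")
print(with_charset, without_charset)
```

When the header carries no charset, the result is ``None``, in which
case html5lib is left to detect the encoding from the document itself.
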
To have more control over the parser, create a parser object explicitly.
For instance, to make the parser raise exceptions on parse errors, use:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      parser = html5lib.HTMLParser(strict=True)
      document = parser.parse(f)

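To make the difference between strict and default parsing concrete, here
is a minimal sketch. It assumes that the exception raised in strict mode
is ``html5lib.html5parser.ParseError`` and that a non-strict parser
records the problems it recovers from in an ``errors`` list. The input
has no doctype, which counts as a parse error.

```python
import html5lib
from html5lib.html5parser import ParseError

SOURCE = "<p>Hello World!"  # no doctype, so the parser reports an error

# Strict mode: the first parse error is raised as an exception.
strict_parser = html5lib.HTMLParser(strict=True)
try:
    strict_parser.parse(SOURCE)
    strict_failed = False
except ParseError as exc:
    strict_failed = True
    print("strict parser rejected the input:", exc)

# Default mode: the parser recovers and records the errors instead.
lenient_parser = html5lib.HTMLParser()
document = lenient_parser.parse(SOURCE)
error_count = len(lenient_parser.errors)
print("lenient parser recorded", error_count, "error(s)")
```

Strict mode is mostly useful for validating markup you control; for
content from the wild, the default error recovery is usually what you
want.
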
When you're instantiating parser objects explicitly, pass a treebuilder
class as the ``tree`` keyword argument to use an alternative document
format:

.. code-block:: python

  import html5lib
  parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
  minidom_document = parser.parse("<p>Hello World!")

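The resulting object can then be queried with the ordinary
``xml.dom.minidom`` API. The sketch below repeats the parse and pulls
the paragraph text back out; the variable names ``paragraph`` and
``paragraph_text`` are illustrative only.

```python
import html5lib

parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
minidom_document = parser.parse("<p>Hello World!")

# Standard DOM lookup: the parser has wrapped the paragraph in the
# implied <html> and <body> elements.
paragraph = minidom_document.getElementsByTagName("p")[0]

# Join the text nodes rather than assuming there is exactly one.
paragraph_text = "".join(
    node.data
    for node in paragraph.childNodes
    if node.nodeType == node.TEXT_NODE
)
print(paragraph_text)
```
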
More documentation is available at http://html5lib.readthedocs.org/.


Installation
------------

html5lib works on CPython 2.6+, CPython 3.2+ and PyPy. To install it,
use:

.. code-block:: bash

  $ pip install html5lib


Optional Dependencies
---------------------

The following third-party libraries may be used for additional
functionality:

- ``datrie`` can be used to improve parsing performance (though in
  almost all cases the improvement is marginal);

- ``lxml`` is supported as a tree format (for both building and
  walking) under CPython (but *not* PyPy, where it is known to cause
  segfaults);

- ``genshi`` has a treewalker (but not a builder);

- ``charade`` can be used as a fallback when the character encoding
  cannot be determined; ``chardet``, from which it was forked, can also
  be used on Python 2; and

- ``ordereddict`` can be used under Python 2.6
  (``collections.OrderedDict`` is used instead on later versions) to
  serialize attributes in alphabetical order.


Bugs
----

Please report any bugs on the `issue tracker
<https://github.com/html5lib/html5lib-python/issues>`_.


Tests
-----

Unit tests require the ``nose`` library and can be run using the
``nosetests`` command in the root directory; ``ordereddict`` is
required under Python 2.6. All should pass.

Test data are contained in a separate `html5lib-tests
<https://github.com/html5lib/html5lib-tests>`_ repository and included
as a submodule, so for git checkouts the submodule must be initialized::

  $ git submodule init
  $ git submodule update

If you have all compatible Python implementations available on your
system, you can run tests on all of them using the ``tox`` utility,
which can be found on PyPI.


Questions?
----------

There's a mailing list available for support on Google Groups,
`html5lib-discuss <http://groups.google.com/group/html5lib-discuss>`_,
though you may get a quicker response asking on IRC in `#whatwg on
irc.freenode.net <http://wiki.whatwg.org/wiki/IRC>`_.
