The moving parts
================

html5lib consists of a number of components, which are responsible for
handling its features.


Tree builders
-------------

The parser reads HTML by tokenizing the content and building a tree that
the user can later access. There are three main types of trees that
html5lib can build:

* ``etree`` - this is the default; builds a tree based on ``xml.etree``,
  which can be found in the standard library. Whenever possible, the
  accelerated ``ElementTree`` implementation (i.e.
  ``xml.etree.cElementTree`` on Python 2.x) is used.

* ``dom`` - builds a tree based on ``xml.dom.minidom``.

* ``lxml.etree`` - uses lxml's implementation of the ``ElementTree``
  API. The performance gains are relatively small compared to using the
  accelerated ``ElementTree`` module.

You can specify the builder by name when using the shorthand API:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

When instantiating a parser object, you have to pass a tree builder
class in the ``tree`` keyword argument:

.. code-block:: python

  import html5lib
  # SomeTreeBuilder is a placeholder for any tree builder class
  parser = html5lib.HTMLParser(tree=SomeTreeBuilder)
  document = parser.parse("<p>Hello World!")

To get a builder class by name, use the ``getTreeBuilder`` function:

.. code-block:: python

  import html5lib
  parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
  minidom_document = parser.parse("<p>Hello World!")

The implementation of builders can be found in `html5lib/treebuilders/
<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treebuilders>`_.


Tree walkers
------------

Once a tree is ready, you can work on it either manually, or using
a tree walker, which provides a streaming view of the tree.
html5lib provides walkers for all three supported types of trees
(``etree``, ``dom`` and ``lxml``).

The implementation of walkers can be found in `html5lib/treewalkers/
<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treewalkers>`_.

Walkers make consuming HTML easier. html5lib uses them to provide you
with a couple of handy tools.


HTMLSerializer
~~~~~~~~~~~~~~

The serializer lets you write HTML back as a stream of bytes.

.. code-block:: pycon

  >>> import html5lib
  >>> element = html5lib.parse('<p xml:lang="pl">Witam wszystkich')
  >>> walker = html5lib.getTreeWalker("etree")
  >>> stream = walker(element)
  >>> s = html5lib.serializer.HTMLSerializer()
  >>> output = s.serialize(stream)
  >>> for item in output:
  ...     print("%r" % item)
  '<p'
  ' '
  'xml:lang'
  '='
  'pl'
  '>'
  'Witam wszystkich'

You can customize the serializer behaviour in a variety of ways; consult
the :class:`~html5lib.serializer.htmlserializer.HTMLSerializer`
documentation for details.


Filters
~~~~~~~

You can alter the stream content with filters provided by html5lib:

* :class:`alphabeticalattributes.Filter
  <html5lib.filters.alphabeticalattributes.Filter>` sorts attributes on
  tags to be in alphabetical order

* :class:`inject_meta_charset.Filter
  <html5lib.filters.inject_meta_charset.Filter>` sets a user-specified
  encoding in the correct ``<meta>`` tag in the ``<head>`` section of
  the document

* :class:`lint.Filter <html5lib.filters.lint.Filter>` raises
  ``LintError`` exceptions on invalid tag and attribute names, invalid
  PCDATA, etc.

* :class:`optionaltags.Filter <html5lib.filters.optionaltags.Filter>`
  removes tags from the stream which are not necessary to produce valid
  HTML

* :class:`sanitizer.Filter <html5lib.filters.sanitizer.Filter>` removes
  unsafe markup and CSS. Elements that are known to be safe are passed
  through and the rest is converted to visible text. The default
  configuration of the sanitizer follows the `WHATWG Sanitization Rules
  <http://wiki.whatwg.org/wiki/Sanitization_rules>`_.

* :class:`whitespace.Filter <html5lib.filters.whitespace.Filter>`
  collapses all whitespace characters to single spaces unless they're in
  ``<pre>`` or ``<textarea>`` tags.

To use a filter, simply wrap it around a stream:

.. code-block:: pycon

  >>> import html5lib
  >>> from html5lib.filters import sanitizer
  >>> dom = html5lib.parse("<p><script>alert('Boo!')", treebuilder="dom")
  >>> walker = html5lib.getTreeWalker("dom")
  >>> stream = walker(dom)
  >>> clean_stream = sanitizer.Filter(stream)


Tree adapters
-------------

Tree adapters translate one type of tree to another. More documentation
is pending.


Encoding discovery
------------------

Parsed trees are always Unicode. However, a large variety of input
encodings are supported. The encoding of the document is determined in
the following way:

* The encoding may be explicitly specified by passing the name of the
  encoding as the encoding parameter to the
  :meth:`~html5lib.html5parser.HTMLParser.parse` method on
  ``HTMLParser`` objects.

* If no encoding is specified, the parser will attempt to detect the
  encoding from a ``<meta>`` element in the first 512 bytes of the
  document (this is only a partial implementation of the current HTML5
  specification).

* If no encoding can be found and the chardet library is available, an
  attempt will be made to sniff the encoding from the byte pattern.

* If all else fails, the default encoding will be used. This is usually
  `Windows-1252 <http://en.wikipedia.org/wiki/Windows-1252>`_, which is
  a common fallback used by Web browsers.
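
The fallback order described above can be sketched in plain Python. This
is only an illustration of the cascade, not html5lib's actual
implementation: the ``guess_encoding`` function, its ``explicit``,
``default`` and ``use_chardet`` parameters, and the simplified
``META_CHARSET`` pattern are all hypothetical names introduced for this
example.

.. code-block:: python

  import re

  # Simplified, hypothetical pattern; real <meta> prescanning is stricter.
  META_CHARSET = re.compile(
      rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
      re.IGNORECASE,
  )


  def guess_encoding(data, explicit=None, default="windows-1252",
                     use_chardet=True):
      """Return an encoding name for ``data`` (bytes), trying each step in order."""
      # 1. An explicitly specified encoding always wins.
      if explicit:
          return explicit

      # 2. Look for a <meta> charset declaration in the first 512 bytes.
      match = META_CHARSET.search(data[:512])
      if match:
          return match.group(1).decode("ascii").lower()

      # 3. Sniff the byte pattern, but only if chardet is installed.
      if use_chardet:
          try:
              import chardet
          except ImportError:
              chardet = None
          if chardet is not None:
              sniffed = chardet.detect(data)["encoding"]
              if sniffed:
                  return sniffed.lower()

      # 4. If all else fails, use the default encoding.
      return default

For instance, ``guess_encoding(b'<meta charset="ISO-8859-2"><p>Witam')``
stops at the second step, while a document without any declaration falls
through to chardet or to the default.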


Tokenizers
----------

The part of the parser responsible for translating a raw input stream
into meaningful tokens is the tokenizer. Currently html5lib provides
two.

To set up a tokenizer, simply pass it when instantiating
a :class:`~html5lib.html5parser.HTMLParser`:

.. code-block:: python

  import html5lib
  from html5lib import sanitizer

  p = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer)
  p.parse("<p>Surprise!<script>alert('Boo!');</script>")

HTMLTokenizer
~~~~~~~~~~~~~

This is the default tokenizer, the heart of html5lib. The implementation
can be found in `html5lib/tokenizer.py
<https://github.com/html5lib/html5lib-python/blob/master/html5lib/tokenizer.py>`_.

HTMLSanitizer
~~~~~~~~~~~~~

This is a tokenizer that removes unsafe markup and CSS styles from the
input. Elements that are known to be safe are passed through and the
rest is converted to visible text. The default configuration of the
sanitizer follows the `WHATWG Sanitization Rules
<http://wiki.whatwg.org/wiki/Sanitization_rules>`_.

The implementation can be found in `html5lib/sanitizer.py
<https://github.com/html5lib/html5lib-python/blob/master/html5lib/sanitizer.py>`_.
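
To illustrate what "passed through" versus "converted to visible text"
means, here is a toy sketch built only on the standard library's
``html.parser``. It is not html5lib's sanitizer and must not be used on
untrusted input: the ``SAFE_TAGS`` allowlist, the ``ToySanitizer`` class
and the ``sanitize`` helper are hypothetical, and the sketch drops all
attributes and ignores CSS entirely.

.. code-block:: python

  from html import escape
  from html.parser import HTMLParser

  # Hypothetical allowlist; the real sanitizer's rules are far larger.
  SAFE_TAGS = {"p", "b", "i", "em", "strong"}


  class ToySanitizer(HTMLParser):
      """Keep allowlisted tags (without attributes), escape everything else."""

      def __init__(self):
          super().__init__(convert_charrefs=True)
          self.parts = []

      def handle_starttag(self, tag, attrs):
          if tag in SAFE_TAGS:
              # Safe element: pass the tag through, dropping its attributes.
              self.parts.append("<%s>" % tag)
          else:
              # Unsafe element: emit the original tag as visible text.
              self.parts.append(escape(self.get_starttag_text()))

      def handle_endtag(self, tag):
          text = "</%s>" % tag
          self.parts.append(text if tag in SAFE_TAGS else escape(text))

      def handle_data(self, data):
          # Text content is always escaped before being written out.
          self.parts.append(escape(data))


  def sanitize(markup):
      parser = ToySanitizer()
      parser.feed(markup)
      parser.close()
      return "".join(parser.parts)


  print(sanitize("<p>Surprise!<script>alert('Boo!');</script>"))

In the printed result the ``<p>`` tag survives as markup, while the
``<script>`` element appears as escaped, visible text rather than as
executable markup.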