The moving parts
================

html5lib consists of a number of components, which are responsible for
handling its features.


Tree builders
-------------

The parser reads HTML by tokenizing the content and building a tree that
the user can later access. There are three main types of trees that
html5lib can build:

* ``etree`` - this is the default; builds a tree based on ``xml.etree``,
  which can be found in the standard library. Whenever possible, the
  accelerated ``ElementTree`` implementation (i.e.
  ``xml.etree.cElementTree`` on Python 2.x) is used.

* ``dom`` - builds a tree based on ``xml.dom.minidom``.

* ``lxml`` - uses lxml's implementation of the ``ElementTree`` API
  (``lxml.etree``). The performance gains are relatively small compared
  to using the accelerated ``ElementTree`` module.

You can specify the builder by name when using the shorthand API:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

When instantiating a parser object, you have to pass a tree builder
class in the ``tree`` keyword argument:

.. code-block:: python

  import html5lib
  # SomeTreeBuilder is a placeholder for any tree builder class
  parser = html5lib.HTMLParser(tree=SomeTreeBuilder)
  document = parser.parse("<p>Hello World!")

To get a builder class by name, use the ``getTreeBuilder`` function:

.. code-block:: python

  import html5lib
  parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
  minidom_document = parser.parse("<p>Hello World!")

The implementation of builders can be found in `html5lib/treebuilders/
<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treebuilders>`_.


Tree walkers
------------

Once a tree is ready, you can work on it either manually, or using
a tree walker, which provides a streaming view of the tree. html5lib
provides walkers for all three supported types of trees (``etree``,
``dom`` and ``lxml``).
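
A minimal sketch of that streaming view, assuming the ``etree`` walker
and the token dictionaries it yields (each has a ``"type"`` key such as
``"StartTag"``, ``"EndTag"`` or ``"Characters"``). The helper name
``start_tag_names`` is illustrative and not part of html5lib:

.. code-block:: python

  import html5lib

  def start_tag_names(markup):
      """Return element names in document order, read from the walker."""
      tree = html5lib.parse(markup)             # default "etree" builder
      walker = html5lib.getTreeWalker("etree")  # walker class for that tree
      # Iterating the walker yields one token dict per node event.
      return [token["name"] for token in walker(tree)
              if token["type"] == "StartTag"]

  names = start_tag_names("<p>Hello <b>World</b>!")

The parser adds the implied ``html``, ``head`` and ``body`` elements to
the tree, so their start tags show up in the stream alongside ``p`` and
``b``.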

The implementation of walkers can be found in `html5lib/treewalkers/
<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treewalkers>`_.

Walkers make consuming HTML easier. html5lib uses them to provide you
with a couple of handy tools.


HTMLSerializer
~~~~~~~~~~~~~~

The serializer lets you write HTML back as a stream of text fragments
(or of bytes, if you pass an ``encoding`` to ``serialize``).

.. code-block:: pycon

  >>> import html5lib
  >>> element = html5lib.parse('<p xml:lang="pl">Witam wszystkich')
  >>> walker = html5lib.getTreeWalker("etree")
  >>> stream = walker(element)
  >>> s = html5lib.serializer.HTMLSerializer()
  >>> output = s.serialize(stream)
  >>> for item in output:
  ...   print("%r" % item)
  '<p'
  ' '
  'xml:lang'
  '='
  'pl'
  '>'
  'Witam wszystkich'

You can customize the serializer behaviour in a variety of ways; consult
the :class:`~html5lib.serializer.htmlserializer.HTMLSerializer`
documentation for details.
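
If you want the whole document as a single string rather than a stream,
the serializer also has a ``render`` method. A short sketch, assuming
the ``omit_optional_tags`` option is switched off so that the tags the
parser implied are written out too (the variable names are
illustrative):

.. code-block:: python

  import html5lib

  element = html5lib.parse('<p xml:lang="pl">Witam wszystkich')
  stream = html5lib.getTreeWalker("etree")(element)

  # Keep the implied <html>, <head> and <body> tags in the output.
  s = html5lib.serializer.HTMLSerializer(omit_optional_tags=False)
  html = s.render(stream)  # joins the serialized fragments into one string

With the default settings those optional tags are left out, which is why
the stream shown earlier starts directly at ``<p``.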


Filters
~~~~~~~

You can alter the stream content with filters provided by html5lib:

* :class:`alphabeticalattributes.Filter
  <html5lib.filters.alphabeticalattributes.Filter>` sorts attributes on
  tags to be in alphabetical order

* :class:`inject_meta_charset.Filter
  <html5lib.filters.inject_meta_charset.Filter>` sets a user-specified
  encoding in the correct ``<meta>`` tag in the ``<head>`` section of
  the document

* :class:`lint.Filter <html5lib.filters.lint.Filter>` raises
  ``LintError`` exceptions on invalid tag and attribute names, invalid
  PCDATA, etc.

* :class:`optionaltags.Filter <html5lib.filters.optionaltags.Filter>`
  removes tags from the stream which are not necessary to produce valid
  HTML

* :class:`sanitizer.Filter <html5lib.filters.sanitizer.Filter>` removes
  unsafe markup and CSS. Elements that are known to be safe are passed
  through and the rest is converted to visible text. The default
  configuration of the sanitizer follows the `WHATWG Sanitization Rules
  <http://wiki.whatwg.org/wiki/Sanitization_rules>`_.

* :class:`whitespace.Filter <html5lib.filters.whitespace.Filter>`
  collapses all whitespace characters to single spaces unless they're in
  ``<pre>`` or ``<textarea>`` tags.

To use a filter, simply wrap it around a stream:

.. code-block:: pycon

  >>> import html5lib
  >>> from html5lib.filters import sanitizer
  >>> dom = html5lib.parse("<p><script>alert('Boo!')", treebuilder="dom")
  >>> walker = html5lib.getTreeWalker("dom")
  >>> stream = walker(dom)
  >>> clean_stream = sanitizer.Filter(stream)
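
Conceptually, a filter is an iterable that wraps another token stream
and yields a possibly modified one. The sketch below is not one of
html5lib's filters: ``uppercase_text`` is a hypothetical generator, and
the hand-written token dictionaries only approximate the shape of what
a walker yields:

.. code-block:: python

  def uppercase_text(stream):
      """Yield every token unchanged, except text, which is upper-cased."""
      for token in stream:
          if token["type"] == "Characters":
              # Copy the token so the caller's stream is not mutated.
              token = dict(token, data=token["data"].upper())
          yield token

  # Simplified stand-in for a walker stream (assumed token shape).
  tokens = [
      {"type": "StartTag", "name": "p", "data": {}},
      {"type": "Characters", "data": "Hello World!"},
      {"type": "EndTag", "name": "p"},
  ]

  filtered = list(uppercase_text(tokens))

Because a filter of this kind only needs an iterable of tokens, filters
can be chained by wrapping one around another.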


Tree adapters
-------------

Tree adapters are used to translate one type of tree to another. More
documentation pending, sorry.


Encoding discovery
------------------

Parsed trees are always Unicode. However, a large variety of input
encodings are supported. The encoding of the document is determined in
the following way:

* The encoding may be explicitly specified by passing the name of the
  encoding as the ``encoding`` parameter to the
  :meth:`~html5lib.html5parser.HTMLParser.parse` method on
  ``HTMLParser`` objects.

* If no encoding is specified, the parser will attempt to detect the
  encoding from a ``<meta>`` element in the first 512 bytes of the
  document (this is only a partial implementation of the current HTML5
  specification).

* If no encoding can be found and the chardet library is available, an
  attempt will be made to sniff the encoding from the byte pattern.

* If all else fails, the default encoding will be used. This is usually
  `Windows-1252 <http://en.wikipedia.org/wiki/Windows-1252>`_, which is
  a common fallback used by Web browsers.
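
The order of these steps can be sketched as a small function. This is
not html5lib's implementation: ``guess_encoding`` and its ``sniffer``
parameter are hypothetical, the ``<meta>`` check is a simplified regular
expression rather than the algorithm from the specification, and the
chardet step is represented by an optional callable:

.. code-block:: python

  import re

  # Simplified: find charset=... inside a <meta> tag.
  META_CHARSET = re.compile(
      rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE)

  def guess_encoding(data, encoding=None, sniffer=None,
                     default="windows-1252"):
      """Pick an encoding name for ``data`` (bytes) in the order above."""
      if encoding is not None:       # 1. explicitly specified
          return encoding
      match = META_CHARSET.search(data[:512])
      if match:                      # 2. <meta> in the first 512 bytes
          return match.group(1).decode("ascii").lower()
      if sniffer is not None:        # 3. byte-pattern sniffing, if available
          sniffed = sniffer(data)
          if sniffed:
              return sniffed
      return default                 # 4. fallback

  page = b'<html><head><meta charset="UTF-8"></head><body></body></html>'
  from_meta = guess_encoding(page)
  explicit = guess_encoding(page, encoding="iso-8859-2")
  fallback = guess_encoding(b"<p>no declaration here")

An explicit encoding wins even when the document declares a different
one, and the default is only reached when every earlier step comes up
empty.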


Tokenizers
----------

The part of the parser responsible for translating a raw input stream
into meaningful tokens is the tokenizer. Currently html5lib provides
two.

To set up a tokenizer, simply pass it when instantiating
a :class:`~html5lib.html5parser.HTMLParser`:

.. code-block:: python

  import html5lib
  from html5lib import sanitizer

  p = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer)
  p.parse("<p>Surprise!<script>alert('Boo!');</script>")

HTMLTokenizer
~~~~~~~~~~~~~

This is the default tokenizer, the heart of html5lib. The implementation
can be found in `html5lib/tokenizer.py
<https://github.com/html5lib/html5lib-python/blob/master/html5lib/tokenizer.py>`_.

HTMLSanitizer
~~~~~~~~~~~~~

This is a tokenizer that removes unsafe markup and CSS styles from the
input. Elements that are known to be safe are passed through and the
rest is converted to visible text. The default configuration of the
sanitizer follows the `WHATWG Sanitization Rules
<http://wiki.whatwg.org/wiki/Sanitization_rules>`_.

The implementation can be found in `html5lib/sanitizer.py
<https://github.com/html5lib/html5lib-python/blob/master/html5lib/sanitizer.py>`_.