================================= =========== ====
Name                              Date        Size
================================= =========== ====
.gitignore                        06-Dec-2016 1.8K
.gitmodules                       06-Dec-2016 109
.travis.yml                       06-Dec-2016 580
AUTHORS.rst                       06-Dec-2016 651
CHANGES.rst                       06-Dec-2016 5.3K
CONTRIBUTING.rst                  06-Dec-2016 2.4K
debug-info.py                     06-Dec-2016 779
doc/                              06-Dec-2016
flake8-run.sh                     06-Dec-2016 393
html5lib/                         06-Dec-2016
LICENSE                           06-Dec-2016 1.1K
MANIFEST.in                       06-Dec-2016 149
parse.py                          06-Dec-2016 8.9K
README.chromium                   06-Dec-2016 291
README.rst                        06-Dec-2016 4.2K
requirements-install.sh           06-Dec-2016 537
requirements-optional-2.6.txt     06-Dec-2016 126
requirements-optional-cpython.txt 06-Dec-2016 143
requirements-optional.txt         06-Dec-2016 334
requirements-test.txt             06-Dec-2016 58
requirements.txt                  06-Dec-2016 4
setup.py                          06-Dec-2016 2.2K
tox.ini                           06-Dec-2016 513
utils/                            06-Dec-2016
================================= =========== ====

README.chromium

Name: html5lib-python
Short Name: html5lib
URL: https://github.com/html5lib/html5lib-python
Version: 01b1ebb7ce0146b8082b1a7315431aac023eb046
License: MIT

Description:
Standards-compliant library for parsing and serializing HTML documents and
fragments in Python

Local Modifications: None


README.rst

html5lib
========

.. image:: https://travis-ci.org/html5lib/html5lib-python.png?branch=master
  :target: https://travis-ci.org/html5lib/html5lib-python

html5lib is a pure-Python library for parsing HTML. It is designed to
conform to the WHATWG HTML specification, as implemented by all major
web browsers.


Usage
-----

Simple usage follows this pattern:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      document = html5lib.parse(f)

or:

.. code-block:: python

  import html5lib
  document = html5lib.parse("<p>Hello World!")

By default, the ``document`` will be an ``xml.etree`` element instance.
Whenever possible, html5lib chooses the accelerated ``ElementTree``
implementation (i.e. ``xml.etree.cElementTree`` on Python 2.x).

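As a minimal sketch of what the default result looks like, the snippet
below parses a fragment and reads its text back out of the resulting
``xml.etree`` element. It assumes html5lib is installed. The
``namespaceHTMLElements=False`` argument is passed only to keep the tag
names short, and the sample markup is illustrative.

```python
import html5lib

# Parse a string; with the default treebuilder the result is the root
# <html> element of an xml.etree tree.
document = html5lib.parse("<p>Hello World!", namespaceHTMLElements=False)

# The parser supplies the <html>, <head>, and <body> elements that the
# input left out, so the paragraph sits under "body".
paragraph = document.find("body/p")

print(document.tag)    # tag name of the root element
print(paragraph.text)  # text content of the <p> element
```

Without ``namespaceHTMLElements=False``, tag names carry the XHTML
namespace, so the lookup path would need namespace-qualified names.
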
Two other tree types are supported: ``xml.dom.minidom`` and
``lxml.etree``. To use an alternative format, specify the name of
a treebuilder:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

When using html5lib with ``urllib2`` (Python 2), the charset from the
HTTP response should be passed into html5lib as follows:

.. code-block:: python

  from contextlib import closing
  from urllib2 import urlopen
  import html5lib

  with closing(urlopen("http://example.com/")) as f:
      document = html5lib.parse(f, encoding=f.info().getparam("charset"))

When using html5lib with ``urllib.request`` (Python 3), the charset from
the HTTP response should be passed into html5lib as follows:

.. code-block:: python

  from urllib.request import urlopen
  import html5lib

  with urlopen("http://example.com/") as f:
      document = html5lib.parse(f, encoding=f.info().get_content_charset())

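On Python 3, the object returned by ``f.info()`` for an HTTP response
behaves like an ``email.message.Message``, which is where
``get_content_charset()`` comes from. The sketch below reproduces that
lookup offline using only the standard library. The helper name
``charset_from_content_type`` and the header values are made up for
illustration.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the lower-cased charset parameter of a Content-Type value,
    or None when the header carries no charset."""
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

When the server sends no charset, the lookup yields ``None``, so the
value passed as ``encoding`` is ``None`` in that case.
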
To have more control over the parser, create a parser object explicitly.
For instance, to make the parser raise exceptions on parse errors, use:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      parser = html5lib.HTMLParser(strict=True)
      document = parser.parse(f)

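A strict parser is mostly useful when the caller handles the exception.
The following is a rough sketch, assuming the exception class is
importable as ``html5lib.html5parser.ParseError``. The helper name
``parses_cleanly`` and both sample documents are illustrative.

```python
import html5lib
from html5lib.html5parser import ParseError


def parses_cleanly(markup):
    """Return True if a strict parser accepts the markup without raising."""
    parser = html5lib.HTMLParser(strict=True)
    try:
        parser.parse(markup)
    except ParseError:
        return False
    return True


complete = ("<!DOCTYPE html><html><head><title>t</title></head>"
            "<body><p>Hello World!</p></body></html>")

print(parses_cleanly(complete))
print(parses_cleanly("<p>Hello World!"))  # no doctype, so parsing raises
```
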
When you're instantiating parser objects explicitly, pass a treebuilder
class as the ``tree`` keyword argument to use an alternative document
format:

.. code-block:: python

  import html5lib
  parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
  minidom_document = parser.parse("<p>Hello World!")

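To show what the alternative format gives back, the sketch below reads
the paragraph text through ordinary ``xml.dom.minidom`` calls. It assumes
html5lib is installed; the lookups use only the standard DOM interface.

```python
import html5lib

parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
minidom_document = parser.parse("<p>Hello World!")

# The result is a minidom document, so standard DOM lookups apply.
paragraphs = minidom_document.getElementsByTagName("p")
paragraph = paragraphs[0]

# The paragraph's first child is the text node holding its content.
print(paragraph.firstChild.data)
```
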
More documentation is available at http://html5lib.readthedocs.org/.


Installation
------------

html5lib works on CPython 2.6+, CPython 3.2+ and PyPy.  To install it,
use:

.. code-block:: bash

    $ pip install html5lib


Optional Dependencies
---------------------

The following third-party libraries may be used for additional
functionality:

- ``datrie`` can be used to improve parsing performance (though in
  almost all cases the improvement is marginal);

- ``lxml`` is supported as a tree format (for both building and
  walking) under CPython (but *not* PyPy, where it is known to cause
  segfaults);

- ``genshi`` has a treewalker (but not a builder);

- ``charade`` can be used as a fallback when the character encoding
  cannot be determined (``chardet``, from which it was forked, can also
  be used on Python 2); and

- ``ordereddict`` can be used under Python 2.6
  (``collections.OrderedDict`` is used instead on later versions) to
  serialize attributes in alphabetical order.

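To see which of these optional libraries are present in an environment,
something like the following sketch can be used. It relies only on the
standard library (``importlib.util.find_spec``, Python 3.4+) and assumes
each package's import name matches the name listed above. The helper name
``available_optional_modules`` is hypothetical and not part of html5lib.

```python
from importlib.util import find_spec

# Import names assumed to match the package names in the list above.
OPTIONAL_MODULES = ["datrie", "lxml", "genshi", "charade", "chardet"]


def available_optional_modules(names=OPTIONAL_MODULES):
    """Map each module name to True if it can be located, else False."""
    return {name: find_spec(name) is not None for name in names}


for name, present in available_optional_modules().items():
    print(name, "available" if present else "missing")
```
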


Bugs
----

Please report any bugs on the `issue tracker
<https://github.com/html5lib/html5lib-python/issues>`_.


Tests
-----

Unit tests require the ``nose`` library and can be run using the
``nosetests`` command in the root directory; ``ordereddict`` is
required under Python 2.6. All should pass.

Test data are contained in a separate `html5lib-tests
<https://github.com/html5lib/html5lib-tests>`_ repository and included
as a submodule, thus for git checkouts they must be initialized::

  $ git submodule init
  $ git submodule update

If you have all compatible Python implementations available on your
system, you can run tests on all of them using the ``tox`` utility,
which can be found on PyPI.


Questions?
----------

There's a mailing list available for support on Google Groups,
`html5lib-discuss <http://groups.google.com/group/html5lib-discuss>`_,
though you may get a quicker response asking on IRC in `#whatwg on
irc.freenode.net <http://wiki.whatwg.org/wiki/IRC>`_.