"""A collection of modules for building different kinds of trees from
HTML documents.

To create a treebuilder for a new type of tree, you need to implement
several things:

1) A set of classes for the various types of node: Document, Doctype,
Comment, Element. These must implement the interface of
treebuilders._base.Node (although comment nodes have a different
signature for their constructor; see treebuilders.etree.Comment).
Textual content may also be implemented as another node type, or not, as
your tree implementation requires.

2) A treebuilder object (called TreeBuilder by convention) that
inherits from treebuilders._base.TreeBuilder. It has four required
attributes:
documentClass - the class to use for the root node of a document
elementClass - the class to use for HTML elements
commentClass - the class to use for comments
doctypeClass - the class to use for doctypes
It also has one required method:
getDocument - returns the root node of the complete document tree

3) If you wish to run the unit tests, you must also create a
testSerializer method on your treebuilder which accepts a node and
returns a string containing the node and its children serialized
according to the format used in the unit tests.
"""

from __future__ import absolute_import, division, unicode_literals

from ..utils import default_etree

# Maps a lower-cased tree type name to its TreeBuilder class. Only the "lxml"
# builder is stored here; "dom" and "etree" are cached in their own submodules.
treeBuilderCache = {}


def getTreeBuilder(treeType, implementation=None, **kwargs):
    """Get a TreeBuilder class for one of the tree types with built-in support

    treeType - the name of the tree type required (case-insensitive). Supported
               values are:

               "dom" - A generic builder for DOM implementations, defaulting to
                       an xml.dom.minidom-based implementation.
               "etree" - A generic builder for tree implementations exposing an
                         ElementTree-like interface, defaulting to
                         xml.etree.cElementTree if available and
                         xml.etree.ElementTree if not.
               "lxml" - An etree-based builder for lxml.etree, handling
                        limitations of lxml's implementation.

    implementation - (Currently applies to the "etree" and "dom" tree types.) A
                     module implementing the tree type, e.g.
                     xml.etree.ElementTree or xml.etree.cElementTree.

    Any extra keyword arguments are passed through to the "dom" or "etree"
    module factory. Raises ValueError for an unrecognised treeType."""

    treeType = treeType.lower()
    if treeType not in treeBuilderCache:
        if treeType == "dom":
            from . import dom
            # Come up with a sane default (pref. from the stdlib)
            if implementation is None:
                from xml.dom import minidom
                implementation = minidom
            # NEVER cache here, caching is done in the dom submodule
            return dom.getDomModule(implementation, **kwargs).TreeBuilder
        elif treeType == "lxml":
            from . import etree_lxml
            treeBuilderCache[treeType] = etree_lxml.TreeBuilder
        elif treeType == "etree":
            from . import etree
            if implementation is None:
                implementation = default_etree
            # NEVER cache here, caching is done in the etree submodule
            return etree.getETreeModule(implementation, **kwargs).TreeBuilder
        else:
            raise ValueError("""Unrecognised treebuilder "%s" """ % treeType)
    return treeBuilderCache.get(treeType)
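

# Illustrative sketch, not part of the original module.
#
# The module docstring above lists what a new treebuilder has to provide.
# The classes below outline that shape in miniature. Every name here is a
# hypothetical stand-in: a real implementation would inherit from
# treebuilders._base.Node and treebuilders._base.TreeBuilder, would give
# comments and doctypes their own classes, and would serialize in whatever
# format the unit tests expect. The indentation-based format used here is
# an assumption made only to keep the example short.


class _SketchNode(object):
    """Hypothetical node type: a name plus an ordered list of children."""

    def __init__(self, name):
        self.name = name
        self.childNodes = []

    def appendChild(self, node):
        self.childNodes.append(node)


class _SketchDocument(_SketchNode):
    """Hypothetical root node of a document tree."""

    def __init__(self):
        _SketchNode.__init__(self, "#document")


class _SketchTreeBuilder(object):
    # The four class attributes the docstring describes as required.
    # One node class is reused for three of them purely for brevity.
    documentClass = _SketchDocument
    elementClass = _SketchNode
    commentClass = _SketchNode
    doctypeClass = _SketchNode

    def __init__(self):
        self.document = self.documentClass()

    def getDocument(self):
        """Return the root node of the complete document tree."""
        return self.document

    def testSerializer(self, node, indent=0):
        """Serialize node and its children, one line per node."""
        lines = ["%s%s" % (" " * indent, node.name)]
        for child in node.childNodes:
            lines.append(self.testSerializer(child, indent + 2))
        return "\n".join(lines)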