:mod:`xml.dom.minidom` --- Minimal DOM implementation
=====================================================

.. module:: xml.dom.minidom
   :synopsis: Minimal Document Object Model (DOM) implementation.
.. moduleauthor:: Paul Prescod <paul@prescod.net>
.. sectionauthor:: Paul Prescod <paul@prescod.net>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>


.. versionadded:: 2.0

**Source code:** :source:`Lib/xml/dom/minidom.py`

--------------

:mod:`xml.dom.minidom` is a minimal implementation of the Document Object
Model interface, with an API similar to that in other languages.  It is intended
to be simpler than the full DOM and also significantly smaller.  Users who are
not already proficient with the DOM should consider using the
:mod:`xml.etree.ElementTree` module for their XML processing instead.


.. warning::

   The :mod:`xml.dom.minidom` module is not secure against
   maliciously constructed data.  If you need to parse untrusted or
   unauthenticated data see :ref:`xml-vulnerabilities`.


DOM applications typically start by parsing some XML into a DOM.  With
:mod:`xml.dom.minidom`, this is done through the parse functions::

   from xml.dom.minidom import parse, parseString

   dom1 = parse('c:\\temp\\mydata.xml')  # parse an XML file by name

   datasource = open('c:\\temp\\mydata.xml')
   dom2 = parse(datasource)  # parse an open file

   dom3 = parseString('<myxml>Some data<empty/> some more data</myxml>')

The :func:`parse` function can take either a filename or an open file object.

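As a rough illustration of the two input forms, the sketch below parses the
same bytes once from an in-memory file-like object and once by file name. The
sample XML, the temporary file, and the variable names are made up for the
example.

```python
import io
import os
import tempfile
from xml.dom.minidom import parse

# Sample data invented for this sketch.
xml_bytes = b"<inventory><item sku='a1'>Widget</item></inventory>"

# Parse from a file-like object: anything with a read() method works.
doc_from_object = parse(io.BytesIO(xml_bytes))

# Parse the same content by file name, using a temporary file.
handle, path = tempfile.mkstemp(suffix=".xml")
try:
    with os.fdopen(handle, "wb") as stream:
        stream.write(xml_bytes)
    doc_from_name = parse(path)
finally:
    os.remove(path)

# Both calls should yield a Document with the same root element.
same_root = (doc_from_object.documentElement.tagName
             == doc_from_name.documentElement.tagName)
```
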

.. function:: parse(filename_or_file[, parser[, bufsize]])

   Return a :class:`Document` from the given input. *filename_or_file* may be
   either a file name, or a file-like object. *parser*, if given, must be a SAX2
   parser object. This function will change the document handler of the parser and
   activate namespace support; other parser configuration (like setting an entity
   resolver) must have been done in advance.

If you have XML in a string, you can use the :func:`parseString` function
instead:


.. function:: parseString(string[, parser])

   Return a :class:`Document` that represents the *string*. This method creates a
   :class:`~StringIO.StringIO` object for the string and passes that on to :func:`parse`.

Both functions return a :class:`Document` object representing the content of the
document.

What the :func:`parse` and :func:`parseString` functions do is connect an XML
parser with a "DOM builder" that can accept parse events from any SAX parser and
convert them into a DOM tree.  The names of the functions are perhaps misleading,
but are easy to grasp when learning the interfaces.  The parsing of the document
will be completed before these functions return; it's simply that these
functions do not provide a parser implementation themselves.

You can also create a :class:`Document` by calling a method on a "DOM
Implementation" object.  You can get this object either by calling the
:func:`getDOMImplementation` function in the :mod:`xml.dom` package or the
:mod:`xml.dom.minidom` module. Using the implementation from the
:mod:`xml.dom.minidom` module will always return a :class:`Document` instance
from the minidom implementation, while the version from :mod:`xml.dom` may
provide an alternate implementation (this is likely if you have the `PyXML
package <http://pyxml.sourceforge.net/>`_ installed).  Once you have a
:class:`Document`, you can add child nodes to it to populate the DOM::

   from xml.dom.minidom import getDOMImplementation

   impl = getDOMImplementation()

   newdoc = impl.createDocument(None, "some_tag", None)
   top_element = newdoc.documentElement
   text = newdoc.createTextNode('Some textual content.')
   top_element.appendChild(text)

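Populating a document usually involves more than one text node. The sketch
below, with tag and attribute names invented for illustration, creates a child
element through the document's creator methods, sets an attribute on it, and
serializes the result.

```python
from xml.dom.minidom import getDOMImplementation

impl = getDOMImplementation()
newdoc = impl.createDocument(None, "messages", None)
root = newdoc.documentElement

# Nodes are created by the Document and then attached to a parent.
message = newdoc.createElement("message")
message.setAttribute("lang", "en")
message.appendChild(newdoc.createTextNode("Hello"))
root.appendChild(message)

# Serialize only the root element (no XML declaration).
serialized = root.toxml()
```
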
Once you have a DOM document object, you can access the parts of your XML
document through its properties and methods.  These properties are defined in
the DOM specification.  The main property of the document object is the
:attr:`documentElement` property.  It gives you the main element in the XML
document: the one that holds all others.  Here is an example program::

   dom3 = parseString("<myxml>Some data</myxml>")
   assert dom3.documentElement.tagName == "myxml"

When you are finished with a DOM tree, you may optionally call the
:meth:`unlink` method to encourage early cleanup of the now-unneeded
objects.  :meth:`unlink` is an :mod:`xml.dom.minidom`\ -specific
extension to the DOM API that renders the node and its descendants
essentially useless.  Otherwise, Python's garbage collector will
eventually take care of the objects in the tree.

.. seealso::

   `Document Object Model (DOM) Level 1 Specification <https://www.w3.org/TR/REC-DOM-Level-1/>`_
      The W3C recommendation for the DOM supported by :mod:`xml.dom.minidom`.


.. _minidom-objects:

DOM Objects
-----------

The definition of the DOM API for Python is given as part of the :mod:`xml.dom`
module documentation.  This section lists the differences between the API and
:mod:`xml.dom.minidom`.


.. method:: Node.unlink()

   Break internal references within the DOM so that it will be garbage collected on
   versions of Python without cyclic GC.  Even when cyclic GC is available, using
   this can make large amounts of memory available sooner, so calling this on DOM
   objects as soon as they are no longer needed is good practice.  This only needs
   to be called on the :class:`Document` object, but may be called on child nodes
   to discard children of that node.

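A small sketch of what :meth:`unlink` does to a tree, using a made-up
document: after the call, the document no longer holds its children and the
former child no longer points back at its parent.

```python
from xml.dom.minidom import parseString

doc = parseString("<root><child>text</child></root>")
child = doc.documentElement.firstChild

# Extract whatever is needed first; the tree is unusable afterwards.
child_text = child.firstChild.data

doc.unlink()

# The internal references have been broken in both directions.
children_after = len(doc.childNodes)
parent_after = child.parentNode
```
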

.. method:: Node.writexml(writer, indent="", addindent="", newl="")

   Write XML to the writer object.  The writer should have a :meth:`write` method
   which matches that of the file object interface.  The *indent* parameter is the
   indentation of the current node.  The *addindent* parameter is the incremental
   indentation to use for subnodes of the current one.  The *newl* parameter
   specifies the string to use to terminate lines.

   For the :class:`Document` node, an additional keyword argument *encoding* can
   be used to specify the encoding field of the XML header.

   .. versionchanged:: 2.1
      The optional keyword parameters *indent*, *addindent*, and *newl* were added to
      support pretty output.

   .. versionchanged:: 2.3
      For the :class:`Document` node, an additional keyword argument
      *encoding* can be used to specify the encoding field of the XML header.

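Because the writer only needs a :meth:`write` method, a very small class is
enough to capture the output. The ``ChunkCollector`` class below is a
hypothetical helper written for this sketch, not part of the module; the
example passes two spaces as *addindent* and a newline as *newl*.

```python
from xml.dom.minidom import parseString


class ChunkCollector(object):
    """Hypothetical writer: collects every string passed to write()."""

    def __init__(self):
        self.chunks = []

    def write(self, data):
        self.chunks.append(data)


doc = parseString("<root><child>text</child><empty/></root>")
writer = ChunkCollector()

# Start at no indentation, add two spaces per level, end lines with "\n".
doc.documentElement.writexml(writer, indent="", addindent="  ", newl="\n")

output = "".join(writer.chunks)
```

An open file or any other object with a compatible ``write`` method can be
passed in the same way.
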

.. method:: Node.toxml([encoding])

   Return the XML that the DOM represents as a string.

   With no argument, the XML header does not specify an encoding, and the result is
   a Unicode string if the default encoding cannot represent all characters in the
   document. Encoding this string in an encoding other than UTF-8 is likely
   incorrect, since UTF-8 is the default encoding of XML.

   With an explicit *encoding* [#]_ argument, the result is a byte string in the
   specified encoding. It is recommended that this argument is always specified. To
   avoid :exc:`UnicodeError` exceptions in case of unrepresentable text data, the
   encoding argument should be specified as "utf-8".

   .. versionchanged:: 2.3
      the *encoding* argument was introduced; see :meth:`writexml`.

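The following sketch, built on an invented one-element document containing a
non-ASCII character, shows the recommended call with an explicit ``"utf-8"``
encoding and checks that the byte string can be parsed back.

```python
from xml.dom.minidom import parseString

# Input is passed as UTF-8 bytes; the text contains a non-ASCII character.
doc = parseString(u"<note>caf\u00e9</note>".encode("utf-8"))

# With an explicit encoding the result is a byte string, and the XML
# header names that encoding.
encoded = doc.toxml("utf-8")

# The encoded output is itself a well-formed document.
round_trip = parseString(encoded)
text = round_trip.documentElement.firstChild.data
```
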

.. method:: Node.toprettyxml([indent="\\t"[, newl="\\n"[, encoding=None]]])

   Return a pretty-printed version of the document. *indent* specifies the
   indentation string and defaults to a tab character; *newl* specifies the string
   emitted at the end of each line and defaults to ``\n``.

   .. versionadded:: 2.1

   .. versionchanged:: 2.3
      the encoding argument was introduced; see :meth:`writexml`.

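As a brief sketch of the two parameters, the example below pretty-prints a
made-up document once with the default indentation and once with two spaces.

```python
from xml.dom.minidom import parseString

doc = parseString("<a><b>x</b><c/></a>")

# Default settings: one tab per nesting level, "\n" after each line.
default_pretty = doc.toprettyxml()

# Custom indentation string; newl keeps its default.
pretty = doc.toprettyxml(indent="  ")
```
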
The following standard DOM methods have special considerations with
:mod:`xml.dom.minidom`:


.. method:: Node.cloneNode(deep)

   Although this method was present in the version of :mod:`xml.dom.minidom`
   packaged with Python 2.0, it was seriously broken.  This has been corrected for
   subsequent releases.

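For reference, a minimal sketch of the difference between a shallow and a deep
clone on releases where the method behaves as the DOM specification describes.
The document content is invented for the example.

```python
from xml.dom.minidom import parseString

doc = parseString("<root><item id='1'>text</item></root>")
item = doc.documentElement.firstChild

# A shallow clone copies the element and its attributes, not its children.
shallow = item.cloneNode(False)

# A deep clone also copies the whole subtree below the element.
deep = item.cloneNode(True)

shallow_children = len(shallow.childNodes)
deep_matches = deep.toxml() == item.toxml()
```

Neither clone is attached to the tree; each has to be inserted explicitly,
for example with ``appendChild``.
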

.. _dom-example:

DOM Example
-----------

This example is a fairly realistic simple program. In this particular case, we
do not take much advantage of the flexibility of the DOM.

.. literalinclude:: ../includes/minidom-example.py


.. _minidom-and-dom:

minidom and the DOM standard
----------------------------

The :mod:`xml.dom.minidom` module is essentially a DOM 1.0-compatible DOM with
some DOM 2 features (primarily namespace features).

Usage of the DOM interface in Python is straightforward.  The following mapping
rules apply:

* Interfaces are accessed through instance objects. Applications should not
  instantiate the classes themselves; they should use the creator functions
  available on the :class:`Document` object. Derived interfaces support all
  operations (and attributes) from the base interfaces, plus any new operations.

* Operations are used as methods. Since the DOM uses only :keyword:`in`
  parameters, the arguments are passed in normal order (from left to right).
  There are no optional arguments. ``void`` operations return ``None``.

* IDL attributes map to instance attributes. For compatibility with the OMG IDL
  language mapping for Python, an attribute ``foo`` can also be accessed through
  accessor methods :meth:`_get_foo` and :meth:`_set_foo`.  ``readonly``
  attributes must not be changed; this is not enforced at runtime.

* The types ``short int``, ``unsigned int``, ``unsigned long long``, and
  ``boolean`` all map to Python integer objects.

* The type ``DOMString`` maps to Python strings. :mod:`xml.dom.minidom` supports
  either byte or Unicode strings, but will normally produce Unicode strings.
  Values of type ``DOMString`` may also be ``None`` where allowed to have the IDL
  ``null`` value by the DOM specification from the W3C.

* ``const`` declarations map to variables in their respective scope (e.g.
  ``xml.dom.minidom.Node.PROCESSING_INSTRUCTION_NODE``); they must not be changed.

* ``DOMException`` is currently not supported in :mod:`xml.dom.minidom`.
  Instead, :mod:`xml.dom.minidom` uses standard Python exceptions such as
  :exc:`TypeError` and :exc:`AttributeError`.

* :class:`NodeList` objects are implemented using Python's built-in list type.
  Starting with Python 2.2, these objects provide the interface defined in the DOM
  specification, but with earlier versions of Python they do not support the
  official API.  They are, however, much more "Pythonic" than the interface
  defined in the W3C recommendations.

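Several of the mapping rules above can be seen in a short sketch. The document
content is invented for illustration, and the checks assume a release in which
:class:`NodeList` provides the DOM ``length`` and ``item`` interface.

```python
from xml.dom.minidom import Node, parseString

doc = parseString("<list><item>a</item><item>b</item></list>")
root = doc.documentElement

# IDL attributes are instance attributes, with _get_foo accessors as well.
same_root = doc._get_documentElement() is root

# const declarations are class-level names on Node.
is_element = root.nodeType == Node.ELEMENT_NODE

# NodeList is a Python list that also offers the DOM interface.
items = doc.getElementsByTagName("item")
is_list = isinstance(items, list)
texts = [item.firstChild.data for item in items]
first_item = items.item(0)

# A DOMString that is null in the DOM is None in Python.
namespace = root.namespaceURI
```
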
The following interfaces have no implementation in :mod:`xml.dom.minidom`:

* :class:`DOMTimeStamp`

* :class:`DocumentType` (added in Python 2.1)

* :class:`DOMImplementation` (added in Python 2.1)

* :class:`CharacterData`

* :class:`CDATASection`

* :class:`Notation`

* :class:`Entity`

* :class:`EntityReference`

* :class:`DocumentFragment`

Most of these reflect information in the XML document that is not of general
utility to most DOM users.

.. rubric:: Footnotes

.. [#] The encoding string included in XML output should conform to the
   appropriate standards. For example, "UTF-8" is valid, but "UTF8" is
   not. See https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-EncodingDecl
   and https://www.iana.org/assignments/character-sets/character-sets.xhtml.