<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html">
  <style type="text/css">
<!--
TD {font-family: Verdana,Arial,Helvetica}
BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
H1 {font-family: Verdana,Arial,Helvetica}
H2 {font-family: Verdana,Arial,Helvetica}
H3 {font-family: Verdana,Arial,Helvetica}
A:link, A:visited, A:active { text-decoration: underline }
-->
  </style>
  <title>Libxml2 XmlTextReader Interface tutorial</title>
</head>

<body bgcolor="#fffacd" text="#000000">
<h1 align="center">Libxml2 XmlTextReader Interface tutorial</h1>

<p></p>

<p>This document describes the use of the XmlTextReader streaming API added
to libxml2 in version 2.5.0. This API is closely modeled after the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader</a>
and <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlReader.html">XmlReader</a>
classes of the C# language.</p>

<p>This tutorial will present the key points of this API, and working
examples using both C and the Python bindings:</p>

<p>Table of contents:</p>
<ul>
  <li><a href="#Introducti">Introduction: why a new API</a></li>
  <li><a href="#Walking">Walking a simple tree</a></li>
  <li><a href="#Extracting">Extracting information for the current
    node</a></li>
  <li><a href="#Extracting1">Extracting information for the
    attributes</a></li>
  <li><a href="#Validating">Validating a document</a></li>
  <li><a href="#Entities">Entities substitution</a></li>
  <li><a href="#L1142">Relax-NG Validation</a></li>
  <li><a href="#Mixing">Mixing the reader and tree or XPath
    operations</a></li>
</ul>

<p></p>

<h2><a name="Introducti">Introduction: why a new API</a></h2>

<p>Libxml2 <a
href="http://xmlsoft.org/html/libxml-tree.html">main API is
tree based</a>, where the parsing operation results in a document loaded
completely in memory and exposed as a tree of nodes all available at the
same time. This is very simple and quite powerful, but has the major
limitation that the size of the document that can be handled is limited by
the size of the memory available. Libxml2 also provides a <a
href="http://www.saxproject.org/">SAX</a> based API, but that version was
designed upon one of the early <a
href="http://www.jclark.com/xml/expat.html">expat</a> versions of SAX, and
SAX is also not formally defined for C. SAX basically works by registering
callbacks which are called directly by the parser as it progresses through
the document stream. The problem is that this programming model is relatively
complex, is not well standardized, cannot provide validation directly, and
makes entity, namespace and base processing relatively hard.</p>

<p>The <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader
API from C#</a> provides a far simpler programming model. The API acts as a
cursor going forward on the document stream and stopping at each node on the
way. The user's code keeps control of the progress and simply calls a
Read() function repeatedly to progress to each node in sequence in document
order. There is direct support for namespaces, xml:base and entity handling,
and adding DTD validation on top of it was relatively simple. This API is
really close to the <a href="http://www.w3.org/TR/DOM-Level-2-Core/">DOM Core
specification</a>. This provides a far more standard, easy to use and powerful
API than the existing SAX.
Moreover, integrating extension features based on
the tree seems relatively easy.</p>

<p>In a nutshell the XmlTextReader API provides a simpler, more standard and
more extensible interface to handle large documents than the existing SAX
version.</p>

<h2><a name="Walking">Walking a simple tree</a></h2>

<p>Basically the XmlTextReader API is a forward-only tree walking interface.
The basic steps are:</p>
<ol>
  <li>prepare a reader context operating on some input</li>
  <li>run a loop iterating over all nodes in the document</li>
  <li>free up the reader context</li>
</ol>

<p>Here is a basic C sample doing this:</p>
<pre>#include &lt;stdio.h&gt;
#include &lt;libxml/xmlreader.h&gt;

void processNode(xmlTextReaderPtr reader) {
    /* handling of a node in the tree */
}

/* returns 0 if the whole document was read, -1 otherwise */
int streamFile(char *filename) {
    xmlTextReaderPtr reader;
    int ret;

    reader = xmlNewTextReaderFilename(filename);
    if (reader != NULL) {
        ret = xmlTextReaderRead(reader);
        while (ret == 1) {
            processNode(reader);
            ret = xmlTextReaderRead(reader);
        }
        xmlFreeTextReader(reader);
        if (ret != 0) {
            printf("%s : failed to parse\n", filename);
            return -1;
        }
    } else {
        printf("Unable to open %s\n", filename);
        return -1;
    }
    return 0;
}</pre>

<p>A few things to notice:</p>
<ul>
  <li>the include file needed: <code>libxml/xmlreader.h</code></li>
  <li>the creation of the reader using a filename</li>
  <li>the repeated call to xmlTextReaderRead() and how any return value
    different from 1 should stop the loop</li>
  <li>that a negative return means a parsing error</li>
  <li>how xmlFreeTextReader() should be used to free up the resources used by
    the reader.</li>
</ul>

<p>Here is similar code in Python for exactly the same processing:</p>
<pre>import libxml2

def processNode(reader):
    pass

def streamFile(filename):
    try:
        reader = libxml2.newTextReaderFilename(filename)
    except:
        print("unable to open %s" % (filename))
        return

    ret = reader.Read()
    while ret == 1:
        processNode(reader)
        ret = reader.Read()

    if ret != 0:
        print("%s : failed to parse" % (filename))</pre>

<p>The only things worth adding are that the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">xmlTextReader
is abstracted as a class like in C#</a> with the same method names (but the
properties are currently accessed with methods) and that one doesn't need to
free the reader at the end of the processing. It will get garbage collected
once all references have disappeared.</p>

<h2><a name="Extracting">Extracting information for the current node</a></h2>

<p>So far the example code did not indicate how information was extracted
from the reader. It was abstracted as a call to the processNode() routine,
with the reader as the argument. At each invocation, the parser is stopped on
a given node and the reader can be used to query that node's properties. Each
<em>Property</em> is available at the C level as a function taking a single
xmlTextReaderPtr argument whose name is
<code>xmlTextReader</code><em>Property</em>. If the return type is an
<code>xmlChar *</code> string then it must be deallocated with
<code>xmlFree()</code> to avoid leaks. For the Python interface, there is a
<em>Property</em> method to the reader class that can be called on the
instance.
The list of the properties is based on the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">C#
XmlTextReader class</a> set of properties and methods:</p>
<ul>
  <li><em>NodeType</em>: The node type, 1 for start element, 15 for end of
    element, 2 for attributes, 3 for text nodes, 4 for CData sections, 5 for
    entity references, 6 for entity declarations, 7 for PIs, 8 for comments,
    9 for the document nodes, 10 for DTD/Doctype nodes, 11 for document
    fragment and 12 for notation nodes.</li>
  <li><em>Name</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#ns-qualnames">qualified
    name</a> of the node, equal to (<em>Prefix</em>:)<em>LocalName</em>.</li>
  <li><em>LocalName</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#NT-LocalPart">local name</a> of
    the node.</li>
  <li><em>Prefix</em>: a shorthand reference to the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>NamespaceUri</em>: the URI defining the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>BaseUri:</em> the base URI of the node.
    See the <a
    href="http://www.w3.org/TR/xmlbase/">XML Base W3C specification</a>.</li>
  <li><em>Depth:</em> the depth of the node in the tree, starts at 0 for the
    root node.</li>
  <li><em>HasAttributes</em>: whether the node has attributes.</li>
  <li><em>HasValue</em>: whether the node can have a text value.</li>
  <li><em>Value</em>: provides the text value of the node if present.</li>
  <li><em>IsDefault</em>: whether an Attribute node was generated from the
    default value defined in the DTD or schema (<em>not supported
    yet</em>).</li>
  <li><em>XmlLang</em>: the <a
    href="http://www.w3.org/TR/REC-xml#sec-lang-tag">xml:lang</a> scope
    within which the node resides.</li>
  <li><em>IsEmptyElement</em>: check if the current node is empty. This is a
    bit bizarre in the sense that <code>&lt;a/&gt;</code> will be considered
    empty while <code>&lt;a&gt;&lt;/a&gt;</code> will not.</li>
  <li><em>AttributeCount</em>: provides the number of attributes of the
    current node.</li>
</ul>

<p>Let's look first at a small example to put this into practice by redefining
the processNode() function in the Python example:</p>
<pre>def processNode(reader):
    print("%d %d %s %d" % (reader.Depth(), reader.NodeType(),
                           reader.Name(), reader.IsEmptyElement()))</pre>

<p>and look at the result of calling streamFile("tst.xml") for various
contents of the XML test file.</p>

<p>For the minimal document "<code>&lt;doc/&gt;</code>" we get:</p>
<pre>0 1 doc 1</pre>

<p>Only one node is found: its depth is 0, type 1 indicates an element start,
its name is "doc" and it is empty. Trying now with
"<code>&lt;doc&gt;&lt;/doc&gt;</code>" instead leads to:</p>
<pre>0 1 doc 0
0 15 doc 0</pre>

<p>The document root node is not flagged as empty anymore and both a start
and an end of element are detected.
The following document shows how
character data are reported:</p>
<pre>&lt;doc&gt;&lt;a/&gt;&lt;b&gt;some text&lt;/b&gt;
&lt;c/&gt;&lt;/doc&gt;</pre>

<p>We modify the processNode() function to also report the node Value:</p>
<pre>def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))</pre>

<p>The result of the test is:</p>
<pre>0 1 doc 0 None
1 1 a 1 None
1 1 b 0 None
2 3 #text 0 some text
1 15 b 0 None
1 3 #text 0

1 1 c 1 None
0 15 doc 0 None</pre>

<p>There are a few things to note:</p>
<ul>
  <li>the increase of the depth value (first column) as child nodes are
    explored</li>
  <li>the text node child of the b element, of type 3 and its content</li>
  <li>the text node containing the line return between elements b and c</li>
  <li>that elements have the Value None (or NULL in C)</li>
</ul>

<p>The equivalent routine for <code>processNode()</code> as used by
<code>xmllint --stream --debug</code> is the following and can be found in
the xmllint.c module in the source distribution:</p>
<pre>static void processNode(xmlTextReaderPtr reader) {
    xmlChar *name, *value;

    name = xmlTextReaderName(reader);
    if (name == NULL)
        name = xmlStrdup(BAD_CAST "--");
    value = xmlTextReaderValue(reader);

    printf("%d %d %s %d",
            xmlTextReaderDepth(reader),
            xmlTextReaderNodeType(reader),
            name,
            xmlTextReaderIsEmptyElement(reader));
    xmlFree(name);
    if (value == NULL)
        printf("\n");
    else {
        printf(" %s\n", value);
        xmlFree(value);
    }
}</pre>

<h2><a name="Extracting1">Extracting information for the attributes</a></h2>

<p>The previous examples don't indicate how attributes are processed.
The
simple test "<code>&lt;doc a="b"/&gt;</code>" provides the following
result:</p>
<pre>0 1 doc 1 None</pre>

<p>This shows that attribute nodes are not traversed by default. The
<em>HasAttributes</em> property allows detecting their presence. To check
their content the API has dedicated operations. Basically two kinds of
operations are possible:</p>
<ol>
  <li>to move the reader to the attribute nodes of the current element, in
    which case the cursor is positioned on the attribute node</li>
  <li>to directly query the element node for the attribute value</li>
</ol>

<p>In both cases the attribute can be designated either by its position in the
list of attributes (<em>MoveToAttributeNo</em> or <em>GetAttributeNo</em>) or
by its name (and namespace):</p>
<ul>
  <li><em>GetAttributeNo</em>(no): provides the value of the attribute with
    the specified index no relative to the containing element.</li>
  <li><em>GetAttribute</em>(name): provides the value of the attribute with
    the specified qualified name.</li>
  <li><em>GetAttributeNs</em>(localName, namespaceURI): provides the value of
    the attribute with the specified local name and namespace URI.</li>
  <li><em>MoveToAttributeNo</em>(no): moves the position of the current
    instance to the attribute with the specified index relative to the
    containing element.</li>
  <li><em>MoveToAttribute</em>(name): moves the position of the current
    instance to the attribute with the specified qualified name.</li>
  <li><em>MoveToAttributeNs</em>(localName, namespaceURI): moves the position
    of the current instance to the attribute with the specified local name
    and namespace URI.</li>
  <li><em>MoveToFirstAttribute</em>: moves the position of the current
    instance to the first attribute associated with the current node.</li>
  <li><em>MoveToNextAttribute</em>: moves the position of the current
    instance to the next attribute
    associated with the current node.</li>
  <li><em>MoveToElement</em>: moves the position of the current instance to
    the node that contains the current Attribute node.</li>
</ul>

<p>After modifying the processNode() function to show attributes:</p>
<pre>def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))
    if reader.NodeType() == 1: # Element
        while reader.MoveToNextAttribute():
            print("-- %d %d (%s) [%s]" % (reader.Depth(), reader.NodeType(),
                                          reader.Name(), reader.Value()))</pre>

<p>The output for the same input document reflects the attribute:</p>
<pre>0 1 doc 1 None
-- 1 2 (a) [b]</pre>

<p>There are a couple of things to note on the attribute processing:</p>
<ul>
  <li>Their depth is the one of the carrying element plus one.</li>
  <li>Namespace declarations are seen as attributes, as in DOM.</li>
</ul>

<h2><a name="Validating">Validating a document</a></h2>

<p>The libxml2 implementation adds some extra features on top of the
XmlTextReader API. The main one is the ability to DTD validate the parsed
document progressively. This is simply the activation of the associated
feature of the parser used by the reader structure.
There are a few options available,
defined as the enum xmlParserProperties in the libxml/xmlreader.h header
file:</p>
<ul>
  <li>XML_PARSER_LOADDTD: force loading the DTD (without validating)</li>
  <li>XML_PARSER_DEFAULTATTRS: force attribute defaulting (this also implies
    loading the DTD)</li>
  <li>XML_PARSER_VALIDATE: activate DTD validation (this also implies loading
    the DTD)</li>
  <li>XML_PARSER_SUBST_ENTITIES: substitute entities on the fly; entity
    reference nodes are not generated and are replaced by their expanded
    content.</li>
  <li>more settings might be added; those were the ones available at the 2.5.0
    release...</li>
</ul>

<p>The GetParserProp() and SetParserProp() methods can then be used to get
and set the values of those parser properties of the reader. For example:</p>
<pre>def parseAndValidate(file):
    reader = libxml2.newTextReaderFilename(file)
    reader.SetParserProp(libxml2.PARSER_VALIDATE, 1)
    ret = reader.Read()
    while ret == 1:
        ret = reader.Read()
    if ret != 0:
        print("Error parsing and validating %s" % (file))</pre>

<p>This routine will parse and validate the file. Error messages can be
captured by registering an error handler. See python/tests/reader2.py for
more complete Python examples. At the C level the equivalent call to activate
the validation feature is just:</p>
<pre>ret = xmlTextReaderSetParserProp(reader, XML_PARSER_VALIDATE, 1);</pre>

<p>and a return value of 0 indicates success.</p>

<h2><a name="Entities">Entities substitution</a></h2>

<p>By default the xmlReader will report entities as such and not replace them
with their content.
This default behaviour can however be overridden using:</p>

<p><code>reader.SetParserProp(libxml2.PARSER_SUBST_ENTITIES,1)</code></p>

<h2><a name="L1142">Relax-NG Validation</a></h2>

<p style="font-size: 10pt">Introduced in version 2.5.7</p>

<p>Libxml2 can now validate the document being read by the xmlReader against
Relax-NG schemas. While the Relax NG validator can't always work in a
streamable mode, only subsets which cannot be reduced to regular expressions
need to have their subtree expanded for validation. In practice it means
that, unless the schema for the top-level element content cannot be expressed
as a regexp, only chunks of the document need to be expanded in memory while
validating.</p>

<p>The steps to do so are:</p>
<ul>
  <li>create a reader working on a document as usual</li>
  <li>before any call to Read(), associate it with a Relax NG schema, either
    the preparsed schema or the URL of the schema to use</li>
  <li>errors will be reported the usual way, and the validity status can be
    obtained using the IsValid() interface of the reader, as for DTDs.</li>
</ul>

<p>Example, assuming the reader has already been created and that the schema
string contains the Relax-NG schema:</p>
<pre><code>rngp = libxml2.relaxNGNewMemParserCtxt(schema, len(schema))
rngs = rngp.relaxNGParse()
reader.RelaxNGSetSchema(rngs)
ret = reader.Read()
while ret == 1:
    ret = reader.Read()
if ret != 0:
    print("Error parsing the document")
if reader.IsValid() != 1:
    print("Document failed to validate")</code></pre>

<p>See <code>reader6.py</code> in the sources or documentation for a complete
example.</p>

<h2><a name="Mixing">Mixing the reader and tree or XPath operations</a></h2>

<p style="font-size: 10pt">Introduced in version 2.5.7</p>

<p>While the reader is a streaming interface,
its underlying implementation
is based on the DOM builder of libxml2. As a result it is relatively simple
to mix operations based on both models under some constraints. To do so the
reader has an Expand() operation which grows the subtree under the
current node. It returns a pointer to a standard node which can be
manipulated in the usual ways. The node will get all its ancestors and the
full subtree available. Usual operations like XPath queries can be used on
that reduced view of the document. Here is an example extracted from
reader5.py in the sources which extracts and prints the bibliography for the
"Dragon" compiler book from the XML 1.0 recommendation:</p>
<pre>f = open('../../test/valid/REC-xml-19980210.xml')
input = libxml2.inputBuffer(f)
reader = input.newTextReader("REC")
res = ""
while reader.Read() == 1:
    while reader.Name() == 'bibl':
        node = reader.Expand()            # expand the subtree
        if node.xpathEval("@id = 'Aho'"): # use XPath on it
            res = res + node.serialize()
        if reader.Next() != 1:            # skip the subtree
            break</pre>

<p>Note, however, that the node instance returned by the Expand() call is only
valid until the next Read() operation. The Expand() operation does not
affect the Read() ones; however, once processed, the full subtree is usually
not useful anymore, and the Next() operation allows skipping it completely and
proceeding to the successor, or returns 0 if the document end is reached.</p>

<p><a href="mailto:xml (a] gnome.org">Daniel Veillard</a></p>

<p>$Id$</p>

<p></p>
</body>
</html>