                      TagSoup - Just Keep On Truckin'

Introduction

   This is the home page of TagSoup, a SAX-compliant parser written in
   Java that, instead of parsing well-formed or valid XML, parses HTML as
   it is found in the wild: [1]poor, nasty and brutish, though quite often
   far from short. TagSoup is designed for people who have to process this
   stuff using some semblance of a rational application design. By
   providing a SAX interface, it allows standard XML tools to be applied
   to even the worst HTML. TagSoup also includes a command-line processor
   that reads HTML files and can generate either clean HTML or well-formed
   XML that is a close approximation to XHTML.

   This is also the README file packaged with TagSoup.

   TagSoup is free and Open Source software. As of version 1.2, it is
   licensed under the [2]Apache License, Version 2.0, which allows
   proprietary re-use as well as use with GPL 3.0 or GPL 2.0-or-later
   projects. (If anyone needs a GPL 2.0 license for a GPL 2.0-only
   project, feel free to ask.)

Warning: TagSoup will not build on stock Java 5.x or 6.x!

   Due to a bug in the versions of Xalan shipped with Java 5.x and 6.x,
   TagSoup will not build out of the box. You need to retrieve [3]Saxon
   6.5.5, which does not have the bug. Unpack the zipfile in an empty
   directory and copy the saxon.jar and saxon-xml-apis.jar files to
   $ANT_HOME/lib. The Ant build process for TagSoup will then notice that
   Saxon is available and use it instead.

TagSoup 1.2 released

   There are a great many changes, most of them fixes for long-standing
   bugs, in this release. Only the most important are listed here; for the
   rest, see the CHANGES file in the source distribution. Very special
   thanks to Jojo Dijamco, whose intensive efforts at debugging made this
   release a usable upgrade rather than a useless mass of undetected bugs.
     * As noted above, I have changed the license to Apache 2.0.
     * The default content model for bogons (unknown elements) is now ANY
       rather than EMPTY. This is a breaking change, which I have done
       only because there was so much demand for it. It can be undone on
       the command line with the --emptybogons switch, or programmatically
       with parser.setFeature(Parser.emptyBogonsFeature, true).
     * The processing of entity references in attribute values has finally
       been fixed to do what browsers do. That is, a reference is only
       recognized if it is properly terminated by a semicolon; otherwise
       it is treated as plain text. This means that URIs like
       foo?cdown=32&cup=42 are no longer seen as containing an instance of
       the ∪ character (whose name happens to be cup).
     * Several new switches have been added:
          + --doctype-system and --doctype-public force a DOCTYPE
            declaration to be output and allow setting the system and
            public identifiers.
          + --standalone and --version allow control of the XML
            declaration that is output. (Note that TagSoup's XML output is
            always version 1.0, even if you use --version=1.1.)
          + --norootbogons causes unknown elements not to be allowed as
            the document root element. Instead, they are made children of
            the default root element (the html element for HTML).
     * The TagSoup core now supports character entities with values above
       U+FFFF. As a consequence, the HTML schema now supports all 2,210
       standard character entities from the [4]2007-12-14 draft of XML
       Entity Definitions for Characters, except the 94 which require more
       than one Unicode character to represent.
     * The SAX events startPrefixMapping and endPrefixMapping are now
       being reported for all cases of foreign elements and attributes.
     * All bugs around newline processing on Windows should now be gone.
     * A number of content models have been loosened to allow elements to
       appear in new and non-standard (but commonly found) places.
       In particular, tables are now allowed inside paragraphs, against
       the letter of the W3C specification.
     * Since the span element is intended for fine control of appearance
       using CSS, it should never have been a restartable element. This
       very long-standing bug has now been fixed.
     * The following non-standard elements are now at least partly
       supported: bgsound, blink, canvas, comment, listing, marquee, nobr,
       rbc, rb, rp, rtc, rt, ruby, wbr, xmp.
     * In HTML output mode, boolean attributes like checked are now output
       as such, rather than in XML style as checked="checked".
     * Runs of < characters such as << and <<< are now handled correctly
       in text rather than being transformed into extremely bogus
       start-tags.

   [5]Download the TagSoup 1.2 jar file here. It's about 87K long.
   [6]Download the full TagSoup 1.2 source here. If you don't have zip,
   you can use jar to unpack it.
   [7]Download the current CHANGES file here.

TagSoup 1.1 released

   TagSoup 1.1 adds Tatu Saloranta's JAXP support for TagSoup. To use
   TagSoup within the JAXP framework (which is not something I necessarily
   recommend, but it is part of the Java XML platform), you can create a
   SAXParser by calling
   org.ccil.cowan.tagsoup.jaxp.SAXParserImpl.newInstance(). You can also
   set the system property javax.xml.parsers.SAXParserFactory to
   org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl, but be aware that doing
   this will cause all JAXP-based XML parsing to go through TagSoup, which
   is a Bad Thing if your application also reads XML documents.

What TagSoup does

   TagSoup is designed as a parser, not a whole application; it isn't
   intended to permanently clean up bad HTML, as [8]HTML Tidy does, only
   to parse it on the fly. Therefore, it does not convert presentation
   HTML to CSS or anything similar.
   It does guarantee well-structured results: tags will wind up properly
   nested, default attributes will appear appropriately, and so on.

   The semantics of TagSoup are as far as practical those of actual HTML
   browsers. In particular, never, never will it throw any sort of syntax
   error: the TagSoup motto is [9]"Just Keep On Truckin'". But there's
   much, much more. For example, if the first tag is LI, it will supply
   the application with enclosing HTML, BODY, and UL tags. Why UL? Because
   that's what browsers assume in this situation. For the same reason,
   overlapping tags are correctly restarted whenever possible: text like:
This is <B>bold, <I>bold italic, </b>italic, </i>normal text

   gets correctly rewritten as:
This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.

   By intention, TagSoup is small and fast. It does not depend on the
   existence of any framework other than SAX, and should be able to work
   with any framework that can accept SAX parsers. In particular, [10]XOM
   is known to work.

   You can replace the low-level HTML scanner with one based on Sean
   McGrath's [11]PYX format (very close to James Clark's ESIS format). You
   can also supply an AutoDetector that peeks at the incoming byte stream
   and guesses a character encoding for it. Otherwise, the platform
   default is used. If you need an autodetector of character sets,
   consider trying to adapt the [12]Mozilla one; if you succeed, let me
   know.

Note: TagSoup in Java 1.1

   If you go through the TagSoup source and replace all references to
   HashMap with Hashtable and recompile, TagSoup will work fine in Java
   1.1 VMs. Thanks to Thorbjørn Vinne for this discovery.

The TSaxon XSLT-for-HTML processor

   [13]I am also distributing [14]TSaxon, a repackaging of version 6.5.5
   of Michael Kay's Saxon XSLT version 1.0 implementation that includes
   TagSoup.
   TSaxon is a drop-in replacement for Saxon, and can be used to process
   either HTML or XML documents with XSLT stylesheets.

TagSoup as a stand-alone program

   It is possible to run TagSoup as a program by saying java -jar
   tagsoup-1.2.jar [option ...] [file ...]. Files mentioned on the command
   line will be parsed individually. If no files are specified, the
   standard input is read.

   The following options are understood:

   --files
          Output into individual files, with html extensions changed to
          xhtml. Otherwise, all output is sent to the standard output.

   --html
          Output is in clean HTML: the XML declaration is suppressed, as
          are end-tags for the known empty elements.

   --omit-xml-declaration
          The XML declaration is suppressed.

   --method=html
          End-tags for the known empty HTML elements are suppressed.

   --doctype-system=systemid
          Forces the output of a DOCTYPE declaration with the specified
          systemid.

   --doctype-public=publicid
          Forces the output of a DOCTYPE declaration with the specified
          publicid.

   --version=version
          Sets the version string in the XML declaration.

   --standalone=[yes|no]
          Sets the standalone declaration to yes or no.

   --pyx
          Output is in PYX format.

   --pyxin
          Input is in PYXoid format (need not be well-formed).

   --nons
          Namespaces are suppressed. Normally, all elements are in the
          XHTML 1.x namespace, and all attributes are in no namespace.

   --nobogons
          Bogons (unknown elements) are suppressed.

   --nodefaults
          Default attribute values are suppressed.

   --nocolons
          Explicit colons in element and attribute names are changed to
          underscores.

   --norestart
          Normally restartable elements are not restarted.

   --ignorable
          Whitespace in elements with element-only content is output.

   --emptybogons
          Bogons are given a content model of EMPTY rather than ANY.

   --any
          Bogons are given a content model of ANY rather than EMPTY
          (default).

   --norootbogons
          Don't allow bogons to be root elements; make them subordinate to
          the root.

   --lexical
          Pass through HTML comments and DOCTYPE declarations. Has no
          effect when output is in PYX format.

   --reuse
          Reuse a single instance of TagSoup parser throughout. Normally,
          a new one is instantiated for each input file.

   --nocdata
          Change the content models of the script and style elements to
          treat them as ordinary #PCDATA (text-only) elements, as in
          XHTML, rather than with the special CDATA content model.

   --encoding=encoding
          Specify the input encoding. The default is the Java platform
          default.

   --output-encoding=encoding
          Specify the output encoding. The default is the Java platform
          default.

   --help
          Print help.

   --version
          Print the version number.

SAX features and properties

   TagSoup supports the following SAX features in addition to the standard
   ones:

   http://www.ccil.org/~cowan/tagsoup/features/ignore-bogons
          A value of "true" indicates that the parser will ignore unknown
          elements.

   http://www.ccil.org/~cowan/tagsoup/features/bogons-empty
          A value of "true" indicates that the parser will give unknown
          elements a content model of EMPTY; a value of "false", a content
          model of ANY.

   http://www.ccil.org/~cowan/tagsoup/features/root-bogons
          A value of "true" indicates that the parser will allow unknown
          elements to be the root of the output document.

   http://www.ccil.org/~cowan/tagsoup/features/default-attributes
          A value of "true" indicates that the parser will return default
          attribute values for missing attributes that have default
          values.

   http://www.ccil.org/~cowan/tagsoup/features/translate-colons
          A value of "true" indicates that the parser will translate
          colons into underscores in names.

   http://www.ccil.org/~cowan/tagsoup/features/restart-elements
          A value of "true" indicates that the parser will attempt to
          restart the restartable elements.

   http://www.ccil.org/~cowan/tagsoup/features/ignorable-whitespace
          A value of "true" indicates that the parser will transmit
          whitespace in element-only content via the SAX
          ignorableWhitespace callback. Normally this is not done, because
          HTML is an SGML application and SGML suppresses such whitespace.

   http://www.ccil.org/~cowan/tagsoup/features/cdata-elements
          A value of "true" indicates that the parser will process the
          script and style elements (or any elements with type='cdata' in
          the TSSL schema) as SGML CDATA elements (that is, no markup is
          recognized except the matching end-tag).

   TagSoup supports the following SAX properties in addition to the
   standard ones:

   http://www.ccil.org/~cowan/tagsoup/properties/scanner
          Specifies the Scanner object this parser uses.

   http://www.ccil.org/~cowan/tagsoup/properties/schema
          Specifies the Schema object this parser uses.

   http://www.ccil.org/~cowan/tagsoup/properties/auto-detector
          Specifies the AutoDetector (for encoding detection) this parser
          uses.

More information

   I gave a presentation (a nocturne, so it's not on the schedule) at
   [15]Extreme Markup Languages 2004 about TagSoup, updated from the one
   presented in 2002 at the New York City XML SIG and at XML 2002. This is
   the main high-level documentation about how TagSoup works. Formats:
   [16]OpenDocument [17]Powerpoint [18]PDF.

   I also had people add [19]"evil" HTML to a large poster so that I could
   [20]clean it up; View Source is probably more useful than ordinary
   browsing.
   The original instructions were:

   SOUPE DE BALISES (BE EVIL)!
   Ecritez une balise ouvrante (sans attributs)
   ou fermante HTML ici, s.v.p.

   There is a [21]tagsoup-friends mailing list hosted at [22]Yahoo Groups.
   You can [23]join via the Web, or by sending a blank email to
   [24]tagsoup-friends-subscribe (a] yahoogroups.com. The [25]archives are
   open to all.

   Online TagSoup processing for publicly accessible HTML documents is now
   [26]available courtesy of Leigh Dodds.

References

   1. http://oregonstate.edu/instruct/phl302/texts/hobbes/leviathan-c.html
   2. http://opensource.org/licenses/apache2.0.php
   3. http://prdownloads.sourceforge.net/saxon/saxon6-5-5.zip
   4. http://www.w3.org/TR/2007/WD-xml-entity-names-20071214
   5. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup-1.2.jar
   6. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup-1.2-src.zip
   7. http://home.ccil.org/~cowan/XML/tagsoup/CHANGES
   8. http://tidy.sf.net/
   9. http://www.crumbmuseum.com/truckin.html
  10. http://www.cafeconleche.org/XOM
  11. http://gnosis.cx/publish/programming/xml_matters_17.html
  12. http://jchardet.sourceforge.net/
  13. http://www.ccil.org/~cowan
  14. http://home.ccil.org/~cowan/XML/tagsoup/tsaxon
  15. http://www.extrememarkup.com/extreme/2004
  16. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.odp
  17. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.ppt
  18. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.pdf
  19. http://home.ccil.org/~cowan/XML/tagsoup/extreme.html
  20. http://home.ccil.org/~cowan/XML/tagsoup/extreme.xhtml
  21. http://groups.yahoo.com/group/tagsoup-friends
  22. http://groups.yahoo.com/
  23. http://groups.yahoo.com/group/tagsoup-friends/join
  24. mailto:tagsoup-friends-subscribe (a] yahoogroups.com
  25. http://groups.yahoo.com/group/tagsoup-friends/messages
  26. http://xmlarmyknife.org/docs/xhtml/tagsoup/
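
Example: entity references in attribute values

   The TagSoup 1.2 release notes above say that an entity reference in an
   attribute value is only recognized if it is terminated by a semicolon,
   and is otherwise treated as plain text. The following is a minimal
   sketch of that rule, not TagSoup's actual code: the class name
   AttributeEntitySketch, the method expand, and the five-entry entity
   table are all hypothetical and chosen for illustration only. Numeric
   character references are deliberately left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the "semicolon required" rule; not TagSoup source.
class AttributeEntitySketch {
    // Tiny illustrative entity table (assumption: the real HTML schema has far more).
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp", "&");
        ENTITIES.put("lt", "<");
        ENTITIES.put("gt", ">");
        ENTITIES.put("quot", "\"");
        ENTITIES.put("cup", "\u222A"); // the union sign
    }

    // Expands known named references that end in ';' and copies everything else verbatim.
    static String expand(String value) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < value.length()) {
            char c = value.charAt(i);
            if (c == '&') {
                // Scan the candidate entity name after the ampersand.
                int j = i + 1;
                while (j < value.length() && Character.isLetterOrDigit(value.charAt(j))) {
                    j++;
                }
                boolean terminated = j < value.length() && value.charAt(j) == ';';
                String replacement = terminated ? ENTITIES.get(value.substring(i + 1, j)) : null;
                if (replacement != null) {
                    out.append(replacement);
                    i = j + 1; // skip past the semicolon
                    continue;
                }
            }
            // Not a terminated, known reference: keep the character as plain text.
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("foo?cdown=32&cup=42"));
        System.out.println(expand("a &lt; b &amp; c"));
    }
}
```

   With this sketch, the URI foo?cdown=32&cup=42 comes back unchanged
   because &cup is not followed by a semicolon, while a properly
   terminated reference such as &lt; is replaced by its character.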