1 TagSoup - Just Keep On Truckin'
2
3 Introduction
4
5 This is the home page of TagSoup, a SAX-compliant parser written in
6 Java that, instead of parsing well-formed or valid XML, parses HTML as
7 it is found in the wild: [1]poor, nasty and brutish, though quite often
8 far from short. TagSoup is designed for people who have to process this
9 stuff using some semblance of a rational application design. By
10 providing a SAX interface, it allows standard XML tools to be applied
11 to even the worst HTML. TagSoup also includes a command-line processor
12 that reads HTML files and can generate either clean HTML or well-formed
13 XML that is a close approximation to XHTML.
14
15 This is also the README file packaged with TagSoup.
16
17 TagSoup is free and Open Source software. As of version 1.2, it is
18 licensed under the [2]Apache License, Version 2.0, which allows
19 proprietary re-use as well as use with GPL 3.0 or GPL 2.0-or-later
20 projects. (If anyone needs a GPL 2.0 license for a GPL 2.0-only
21 project, feel free to ask.)
22
23 Warning: TagSoup will not build on stock Java 5.x or 6.x!
24
25 Due to a bug in the versions of Xalan shipped with Java 5.x and 6.x,
26 TagSoup will not build out of the box. You need to retrieve [3]Saxon
27 6.5.5, which does not have the bug. Unpack the zipfile in an empty
28 directory and copy the saxon.jar and saxon-xml-apis.jar files to
29 $ANT_HOME/lib. The Ant build process for TagSoup will then notice that
30 Saxon is available and use it instead.
31
32 TagSoup 1.2 released
33
34 There are a great many changes, most of them fixes for long-standing
35 bugs, in this release. Only the most important are listed here; for the
36 rest, see the CHANGES file in the source distribution. Very special
37 thanks to Jojo Dijamco, whose intensive efforts at debugging made this
38 release a usable upgrade rather than a useless mass of undetected bugs.
39 * As noted above, I have changed the license to Apache 2.0.
40 * The default content model for bogons (unknown elements) is now ANY
41 rather than EMPTY. This is a breaking change, which I have done
42 only because there was so much demand for it. It can be undone on
43 the command line with the --emptybogons switch, or programmatically
44 with parser.setFeature(Parser.emptyBogonsFeature, true).
45 * The processing of entity references in attribute values has finally
46 been fixed to do what browsers do. That is, a reference is only
47 recognized if it is properly terminated by a semicolon; otherwise
48 it is treated as plain text. This means that URIs like
49 foo?cdown=32&cup=42 are no longer seen as containing an instance of
50 the ∪ character (whose name happens to be cup).
51 * Several new switches have been added:
52 + --doctype-system and --doctype-public force a DOCTYPE
53 declaration to be output and allow setting the system and
54 public identifiers.
55 + --standalone and --version allow control of the XML
56 declaration that is output. (Note that TagSoup's XML output is
57 always version 1.0, even if you use --version=1.1.)
58 + --norootbogons causes unknown elements not to be allowed as
59 the document root element. Instead, they are made children of
60 the default root element (the html element for HTML).
61 * The TagSoup core now supports character entities with values above
62 U+FFFF. As a consequence, the HTML schema now supports all 2,210
63 standard character entities from the [4]2007-12-14 draft of XML
64 Entity Definitions for Characters, except the 94 which require more
65 than one Unicode character to represent.
66 * The SAX events startPrefixMapping and endPrefixMapping are now
67 being reported for all cases of foreign elements and attributes.
68 * All bugs around newline processing on Windows should now be gone.
69 * A number of content models have been loosened to allow elements to
70 appear in new and non-standard (but commonly found) places. In
71 particular, tables are now allowed inside paragraphs, against the
72 letter of the W3C specification.
73 * Since the span element is intended for fine control of appearance
74 using CSS, it should never have been a restartable element. This
75 very long-standing bug has now been fixed.
76 * The following non-standard elements are now at least partly
77 supported: bgsound, blink, canvas, comment, listing, marquee, nobr,
78 rbc, rb, rp, rtc, rt, ruby, wbr, xmp.
79 * In HTML output mode, boolean attributes like checked are now output
80 as such, rather than in XML style as checked="checked".
81 * Runs of < characters such as << and <<< are now handled correctly
82 in text rather than being transformed into extremely bogus
83 start-tags.
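
The semicolon rule for entity references in attribute values, described in the list above, can be illustrated with a small sketch. This is not TagSoup's own code: the class name, the four-entry entity table, and the regular expression are assumptions made for illustration, and numeric character references are left out entirely.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; this is not TagSoup's implementation.
class AttributeEntitySketch {
    // Hypothetical four-entry table; the real HTML schema defines far more.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp", "&");
        ENTITIES.put("lt", "<");
        ENTITIES.put("gt", ">");
        ENTITIES.put("cup", "\u222A");
    }

    // A name counts as a reference only when a semicolon terminates it.
    private static final Pattern REFERENCE =
        Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    static String expand(String attributeValue) {
        Matcher m = REFERENCE.matcher(attributeValue);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String known = ENTITIES.get(m.group(1));
            // Unknown names are passed through unchanged as plain text.
            String replacement = (known != null) ? known : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Unterminated "&cup" stays as it is.
        System.out.println(expand("foo?cdown=32&cup=42"));
        // Terminated "&cup;" is replaced by the union sign.
        System.out.println(expand("A &cup; B"));
    }
}
```

The point of the sketch is only that the URI from the list above survives untouched, while a properly terminated reference is expanded.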
84
85 [5]Download the TagSoup 1.2 jar file here. It's about 87K long.
86 [6]Download the full TagSoup 1.2 source here. If you don't have zip,
87 you can use jar to unpack it.
88 [7]Download the current CHANGES file here.
89
90 TagSoup 1.1 released
91
92 TagSoup 1.1 adds Tatu Saloranta's JAXP support for TagSoup. To use
93 TagSoup within the JAXP framework (which is not something I necessarily
94 recommend, but it is part of the Java XML platform), you can create a
95 SAXParser by calling
96 org.ccil.cowan.tagsoup.jaxp.SAXParserImpl.newInstance(). You can also
97 set the system property javax.xml.parsers.SAXParserFactory to
98 org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl, but be aware that doing
99 this will cause all JAXP-based XML parsing to go through TagSoup, which
100 is a Bad Thing if your application also reads XML documents.
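
One way to avoid the global side effect described above is to ask JAXP for the TagSoup factory by class name instead of setting the system property. The sketch below assumes JAXP 1.4 or later (Java 6 onward), which provides SAXParserFactory.newInstance(String, ClassLoader); the class ScopedFactorySketch and its method names are hypothetical. Only the factory class name comes from the text above.

```java
import javax.xml.parsers.FactoryConfigurationError;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper; not part of TagSoup.
class ScopedFactorySketch {
    // Factory class name taken from the text above.
    static final String TAGSOUP_FACTORY =
        "org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl";

    // Factory for HTML input only.
    // Returns null when the TagSoup jar is not on the classpath.
    static SAXParserFactory htmlFactory() {
        try {
            return SAXParserFactory.newInstance(
                TAGSOUP_FACTORY, ScopedFactorySketch.class.getClassLoader());
        } catch (FactoryConfigurationError e) {
            return null;
        }
    }

    // Factory for genuine XML: the normal JAXP lookup,
    // which the call above does not change.
    static SAXParserFactory xmlFactory() {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory;
    }
}
```

With this arrangement, code that reads XML keeps calling the default factory, and only code that explicitly asks for htmlFactory() goes through TagSoup.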
101
102 What TagSoup does
103
104 TagSoup is designed as a parser, not a whole application; it isn't
105 intended to permanently clean up bad HTML, as [8]HTML Tidy does, only
106 to parse it on the fly. Therefore, it does not convert presentation
107 HTML to CSS or anything similar. It does guarantee well-structured
108 results: tags will wind up properly nested, default attributes will
109 appear appropriately, and so on.
110
111 The semantics of TagSoup are, as far as practical, those of actual HTML
112 browsers. In particular, never, never will it throw any sort of syntax
113 error: the TagSoup motto is [9]"Just Keep On Truckin'". But there's
114 much, much more. For example, if the first tag is LI, it will supply
115 the application with enclosing HTML, BODY, and UL tags. Why UL? Because
116 that's what browsers assume in this situation. For the same reason,
117 overlapping tags are correctly restarted whenever possible. Text like:
118 This is <B>bold, <I>bold italic, </b>italic, </i>normal text
119
120 gets correctly rewritten as:
121 This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.
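
The rewriting shown above can be reproduced with a toy stack-based routine. This is a minimal sketch, not TagSoup's algorithm: the class RestartSketch is hypothetical, it understands only tags without attributes, it restarts every element rather than only the restartable ones, and it reopens elements immediately after the end-tag.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration of restarting overlapping elements; not TagSoup's code.
class RestartSketch {
    // A tag without attributes, a run of text, or a stray "<".
    private static final Pattern TOKEN =
        Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)>|[^<]+|<");

    static String repair(String input) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // innermost element first
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (m.group(2) == null) {            // character data
                out.append(m.group());
                continue;
            }
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (m.group(1).isEmpty()) {          // start-tag
                open.push(name);
                out.append('<').append(name).append('>');
            } else if (open.contains(name)) {    // end-tag of an open element
                List<String> restart = new ArrayList<>();
                String top;
                // Close everything nested inside the element being ended.
                while (!(top = open.pop()).equals(name)) {
                    out.append("</").append(top).append('>');
                    restart.add(top);
                }
                out.append("</").append(name).append('>');
                // Reopen the interrupted elements, outermost first.
                Collections.reverse(restart);
                for (String r : restart) {
                    open.push(r);
                    out.append('<').append(r).append('>');
                }
            }
            // End-tags with no matching open element are dropped.
        }
        // Close whatever is still open at end of input.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(repair(
            "This is <B>bold, <I>bold italic, </b>italic, </i>normal text"));
    }
}
```

Run on the sample input above, the sketch yields the same nesting as the rewritten line: the i element is closed before b, then reopened for the remaining italic text.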
122
123 By intention, TagSoup is small and fast. It does not depend on the
124 existence of any framework other than SAX, and should be able to work
125 with any framework that can accept SAX parsers. In particular, [10]XOM
126 is known to work.
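
Because TagSoup only requires SAX, an application can be written against the plain SAX interfaces and handed whichever XMLReader is appropriate. The sketch below is an assumption-laden example rather than anything shipped with TagSoup: the class LinkCollector and its helper methods are hypothetical, and it relies on elements being reported in the XHTML 1.x namespace with attributes in no namespace. The jdkReader() helper exists only so the sketch runs without TagSoup; with the TagSoup jar on the classpath, an instance of the TagSoup parser would presumably be passed to collect() instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical SAX handler that gathers the href values of a elements.
class LinkCollector extends DefaultHandler {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";
    final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if (XHTML_NS.equals(uri) && "a".equals(localName)) {
            // Attributes are looked up in no namespace.
            String href = atts.getValue("", "href");
            if (href != null) {
                links.add(href);
            }
        }
    }

    // Works with any namespace-aware SAX XMLReader.
    static List<String> collect(XMLReader reader, String markup) {
        LinkCollector handler = new LinkCollector();
        reader.setContentHandler(handler);
        try {
            reader.parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.links;
    }

    // Stand-in reader from the JDK; it accepts well-formed XML only.
    static XMLReader jdkReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The handler itself contains nothing specific to TagSoup, which is the design point: the choice of parser is confined to the one line that creates the XMLReader.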
127
128 You can replace the low-level HTML scanner with one based on Sean
129 McGrath's [11]PYX format (very close to James Clark's ESIS format). You
130 can also supply an AutoDetector that peeks at the incoming byte stream
131 and guesses a character encoding for it. Otherwise, the platform
132 default is used. If you need an autodetector of character sets,
133 consider trying to adapt the [12]Mozilla one; if you succeed, let me
134 know.
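
As a starting point for encoding detection, the sketch below peeks at the first bytes of a stream, looks for a byte order mark, and otherwise falls back to a caller-supplied charset. It is deliberately much simpler than a real detector such as the Mozilla one: only UTF-8 and UTF-16 marks are recognized. The class BomSniffer and its methods are hypothetical; how such a reader is plugged into an AutoDetector should be checked against the interface in the source distribution.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical byte-order-mark sniffer; not part of TagSoup.
class BomSniffer {
    // Returns a Reader whose charset is chosen from the byte order mark,
    // or the fallback charset when no mark is present.
    static Reader open(InputStream raw, Charset fallback) {
        try {
            PushbackInputStream in = new PushbackInputStream(raw, 3);
            byte[] head = new byte[3];
            int n = 0;
            while (n < 3) {                       // read up to three bytes
                int r = in.read(head, n, 3 - n);
                if (r < 0) break;
                n += r;
            }
            Charset charset = fallback;
            int bomLength = 0;
            if (n >= 3 && (head[0] & 0xFF) == 0xEF
                    && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
                charset = StandardCharsets.UTF_8;
                bomLength = 3;
            } else if (n >= 2 && (head[0] & 0xFF) == 0xFE
                    && (head[1] & 0xFF) == 0xFF) {
                charset = StandardCharsets.UTF_16BE;
                bomLength = 2;
            } else if (n >= 2 && (head[0] & 0xFF) == 0xFF
                    && (head[1] & 0xFF) == 0xFE) {
                charset = StandardCharsets.UTF_16LE;
                bomLength = 2;
            }
            // Push back everything that was not part of the mark.
            if (n > bomLength) {
                in.unread(head, bomLength, n - bomLength);
            }
            return new InputStreamReader(in, charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience wrapper used to try the sniffer on a byte array.
    static String decode(byte[] bytes, Charset fallback) {
        try (Reader reader = open(new ByteArrayInputStream(bytes), fallback)) {
            StringBuilder text = new StringBuilder();
            int c;
            while ((c = reader.read()) >= 0) {
                text.append((char) c);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```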
135
136 Note: TagSoup in Java 1.1
137
138 If you go through the TagSoup source and replace all references to
139 HashMap with Hashtable and recompile, TagSoup will work fine in Java
140 1.1 VMs. Thanks to Thorbjørn Vinne for this discovery.
141
142 The TSaxon XSLT-for-HTML processor
143
144 [13]I am also distributing [14]TSaxon, a repackaging of version 6.5.5
145 of Michael Kay's Saxon XSLT version 1.0 implementation that includes
146 TagSoup. TSaxon is a drop-in replacement for Saxon, and can be used to
147 process either HTML or XML documents with XSLT stylesheets.
148
149 TagSoup as a stand-alone program
150
151 It is possible to run TagSoup as a program by saying java -jar
152 tagsoup-1.2.jar [option ...] [file ...]. Files mentioned on the command
153 line will be parsed individually. If no files are specified, the
154 standard input is read.
155
156 The following options are understood:
157
158 --files
159 Output into individual files, with html extensions changed to
160 xhtml. Otherwise, all output is sent to the standard output.
161
162 --html
163 Output is in clean HTML: the XML declaration is suppressed, as
164 are end-tags for the known empty elements.
165
166 --omit-xml-declaration
167 The XML declaration is suppressed.
168
169 --method=html
170 End-tags for the known empty HTML elements are suppressed.
171
172 --doctype-system=systemid
173 Forces the output of a DOCTYPE declaration with the specified
174 systemid.
175
176 --doctype-public=publicid
177 Forces the output of a DOCTYPE declaration with the specified
178 publicid.
179
180 --version=version
181 Sets the version string in the XML declaration.
182
183 --standalone=[yes|no]
184 Sets the standalone declaration to yes or no.
185
186 --pyx
187 Output is in PYX format.
188
189 --pyxin
190 Input is in PYXoid format (need not be well-formed).
191
192 --nons
193 Namespaces are suppressed. Normally, all elements are in the
194 XHTML 1.x namespace, and all attributes are in no namespace.
195
196 --nobogons
197 Bogons (unknown elements) are suppressed.
198
199 --nodefaults
200 Default attribute values are suppressed.
201
202 --nocolons
203 Explicit colons in element and attribute names are changed to
204 underscores.
205
206 --norestart
207 Normally restartable elements are not restarted.
208
209 --ignorable
210 Whitespace in elements with element-only content is output.
211
212 --emptybogons
213 Bogons are given a content model of EMPTY rather than ANY.
214
215 --any
216 Bogons are given a content model of ANY rather than EMPTY
217 (default).
218
219 --norootbogons
220 Don't allow bogons to be root elements; make them subordinate to
221 the root.
222
223 --lexical
224 Pass through HTML comments and DOCTYPE declarations. Has no
225 effect when output is in PYX format.
226
227 --reuse
228 Reuse a single instance of TagSoup parser throughout. Normally,
229 a new one is instantiated for each input file.
230
231 --nocdata
232 Change the content models of the script and style elements to
233 treat them as ordinary #PCDATA (text-only) elements, as in
234 XHTML, rather than with the special CDATA content model.
235
236 --encoding=encoding
237 Specify the input encoding. The default is the Java platform
238 default.
239
240 --output-encoding=encoding
241 Specify the output encoding. The default is the Java platform
242 default.
243
244 --help
245 Print help.
246
247 --version
248 Print the version number.
249
250 SAX features and properties
251
252 TagSoup supports the following SAX features in addition to the standard
253 ones:
254
255 http://www.ccil.org/~cowan/tagsoup/features/ignore-bogons
256 A value of "true" indicates that the parser will ignore unknown
257 elements.
258
259 http://www.ccil.org/~cowan/tagsoup/features/bogons-empty
260 A value of "true" indicates that the parser will give unknown
261 elements a content model of EMPTY; a value of "false", a content
262 model of ANY.
263
264 http://www.ccil.org/~cowan/tagsoup/features/root-bogons
265 A value of "true" indicates that the parser will allow unknown
266 elements to be the root of the output document.
267
268 http://www.ccil.org/~cowan/tagsoup/features/default-attributes
269 A value of "true" indicates that the parser will return default
270 attribute values for missing attributes that have default
271 values.
272
273 http://www.ccil.org/~cowan/tagsoup/features/translate-colons
274 A value of "true" indicates that the parser will translate
275 colons into underscores in names.
276
277 http://www.ccil.org/~cowan/tagsoup/features/restart-elements
278 A value of "true" indicates that the parser will attempt to
279 restart the restartable elements.
280
281 http://www.ccil.org/~cowan/tagsoup/features/ignorable-whitespace
282 A value of "true" indicates that the parser will transmit
283 whitespace in element-only content via the SAX
284 ignorableWhitespace callback. Normally this is not done, because
285 HTML is an SGML application and SGML suppresses such whitespace.
286
287 http://www.ccil.org/~cowan/tagsoup/features/cdata-elements
288 A value of "true" indicates that the parser will process the
289 script and style elements (or any elements with type='cdata' in
290 the TSSL schema) as SGML CDATA elements (that is, no markup is
291 recognized except the matching end-tag).
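
The features above are set through the standard SAX setFeature call, using the full URI as the feature name. The helper below is a minimal sketch: the class FeatureSketch and its methods are hypothetical, and only the URI prefix and short names come from the list above. It reports whether the reader accepted the feature instead of throwing, so the same code can be pointed at a parser that does not know the TagSoup features.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical helper for setting TagSoup feature URIs on a SAX reader.
class FeatureSketch {
    // Common prefix of the feature URIs listed above.
    static final String FEATURE_BASE =
        "http://www.ccil.org/~cowan/tagsoup/features/";

    // Returns true if the reader accepted the feature,
    // false if it does not recognize or support it.
    static boolean trySetFeature(XMLReader reader, String shortName, boolean value) {
        try {
            reader.setFeature(FEATURE_BASE + shortName, value);
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return false;
        }
    }

    // Stand-in reader from the JDK, used only to exercise the helper
    // when the TagSoup jar is not available.
    static XMLReader jdkReader() {
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For example, trySetFeature(reader, "bogons-empty", true) asks for the EMPTY content model for unknown elements; an ordinary XML parser is expected to reject the request, in which case the helper returns false.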
292
293 TagSoup supports the following SAX properties in addition to the
294 standard ones:
295
296 http://www.ccil.org/~cowan/tagsoup/properties/scanner
297 Specifies the Scanner object this parser uses.
298
299 http://www.ccil.org/~cowan/tagsoup/properties/schema
300 Specifies the Schema object this parser uses.
301
302 http://www.ccil.org/~cowan/tagsoup/properties/auto-detector
303 Specifies the AutoDetector (for encoding detection) this parser
304 uses.
305
306 More information
307
308 I gave a presentation (a nocturne, so it's not on the schedule) at
309 [15]Extreme Markup Languages 2004 about TagSoup, updated from the one
310 presented in 2002 at the New York City XML SIG and at XML 2002. This is
311 the main high-level documentation about how TagSoup works. Formats:
312 [16]OpenDocument [17]Powerpoint [18]PDF.
313
314 I also had people add [19]"evil" HTML to a large poster so that I could
315 [20]clean it up; View Source is probably more useful than ordinary
316 browsing. The original instructions were:
317
318 SOUPE DE BALISES (BE EVIL)!
319 Ecritez une balise ouvrante (sans attributs)
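320 ou fermante HTML ici, s.v.p.

(In English: "Tag soup (be evil)! Please write an HTML start-tag, without attributes, or an end-tag here.")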
321
322 There is a [21]tagsoup-friends mailing list hosted at [22]Yahoo Groups.
323 You can [23]join via the Web, or by sending a blank email to
324 [24]tagsoup-friends-subscribe@yahoogroups.com. The [25]archives are
325 open to all.
326
327 Online TagSoup processing for publicly accessible HTML documents is now
328 [26]available courtesy of Leigh Dodds.
329
330 References
331
332 1. http://oregonstate.edu/instruct/phl302/texts/hobbes/leviathan-c.html
333 2. http://opensource.org/licenses/apache2.0.php
334 3. http://prdownloads.sourceforge.net/saxon/saxon6-5-5.zip
335 4. http://www.w3.org/TR/2007/WD-xml-entity-names-20071214
336 5. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup-1.2.jar
337 6. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup-1.2-src.zip
338 7. http://home.ccil.org/~cowan/XML/tagsoup/CHANGES
339 8. http://tidy.sf.net/
340 9. http://www.crumbmuseum.com/truckin.html
341 10. http://www.cafeconleche.org/XOM
342 11. http://gnosis.cx/publish/programming/xml_matters_17.html
343 12. http://jchardet.sourceforge.net/
344 13. http://www.ccil.org/~cowan
345 14. http://home.ccil.org/~cowan/XML/tagsoup/tsaxon
346 15. http://www.extrememarkup.com/extreme/2004
347 16. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.odp
348 17. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.ppt
349 18. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.pdf
350 19. http://home.ccil.org/~cowan/XML/tagsoup/extreme.html
351 20. http://home.ccil.org/~cowan/XML/tagsoup/extreme.xhtml
352 21. http://groups.yahoo.com/group/tagsoup-friends
353 22. http://groups.yahoo.com/
354 23. http://groups.yahoo.com/group/tagsoup-friends/join
355 24. mailto:tagsoup-friends-subscribe@yahoogroups.com
356 25. http://groups.yahoo.com/group/tagsoup-friends/messages
357 26. http://xmlarmyknife.org/docs/xhtml/tagsoup/
358