      1                         TagSoup - Just Keep On Truckin'
      2 
      3   Introduction
      4 
      5    This is the home page of TagSoup, a SAX-compliant parser written in
      6    Java that, instead of parsing well-formed or valid XML, parses HTML as
      7    it is found in the wild: [1]poor, nasty and brutish, though quite often
      8    far from short. TagSoup is designed for people who have to process this
      9    stuff using some semblance of a rational application design. By
     10    providing a SAX interface, it allows standard XML tools to be applied
     11    to even the worst HTML. TagSoup also includes a command-line processor
     12    that reads HTML files and can generate either clean HTML or well-formed
     13    XML that is a close approximation to XHTML.
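
Because the parser is exposed through the standard SAX XMLReader interface, code written against SAX does not need to know whether it is reading XML or HTML. The following is a minimal sketch of that idea, not code from the TagSoup distribution: ElementLister and elementNames are hypothetical names, and the helper accepts any XMLReader.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: lists element names in document order via any SAX XMLReader. */
final class ElementLister {
    static List<String> elementNames(XMLReader reader, String markup) {
        final List<String> names = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                names.add(qName); // record every start-tag the parser reports
            }
        });
        try {
            reader.parse(new InputSource(new StringReader(markup)));
        } catch (IOException | SAXException e) {
            throw new IllegalStateException("parse failed", e);
        }
        return names;
    }
}
```

With the TagSoup jar on the classpath, the reader argument would be new org.ccil.cowan.tagsoup.Parser(); any other XMLReader works the same way for well-formed XML.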
     14 
     15    This is also the README file packaged with TagSoup.
     16 
     17    TagSoup is free and Open Source software. As of version 1.2, it is
     18    licensed under the [2]Apache License, Version 2.0, which allows
     19    proprietary re-use as well as use with GPL 3.0 or GPL 2.0-or-later
     20    projects. (If anyone needs a GPL 2.0 license for a GPL 2.0-only
     21    project, feel free to ask.)
     22 
     23   Warning: TagSoup will not build on stock Java 5.x or 6.x!
     24 
     25    Due to a bug in the versions of Xalan shipped with Java 5.x and 6.x,
     26    TagSoup will not build out of the box. You need to retrieve [3]Saxon
     27    6.5.5, which does not have the bug. Unpack the zipfile in an empty
     28    directory and copy the saxon.jar and saxon-xml-apis.jar files to
     29    $ANT_HOME/lib. The Ant build process for TagSoup will then notice that
     30    Saxon is available and use it instead.
     31 
     32   TagSoup 1.2 released
     33 
     34    There are a great many changes, most of them fixes for long-standing
     35    bugs, in this release. Only the most important are listed here; for the
     36    rest, see the CHANGES file in the source distribution. Very special
     37    thanks to Jojo Dijamco, whose intensive efforts at debugging made this
     38    release a usable upgrade rather than a useless mass of undetected bugs.
     39      * As noted above, I have changed the license to Apache 2.0.
     40      * The default content model for bogons (unknown elements) is now ANY
     41        rather than EMPTY. This is a breaking change, which I have done
     42        only because there was so much demand for it. It can be undone on
     43        the command line with the --emptybogons switch, or programmatically
     44        with parser.setFeature(Parser.bogonsEmptyFeature, true).
     45      * The processing of entity references in attribute values has finally
     46        been fixed to do what browsers do. That is, a reference is only
     47        recognized if it is properly terminated by a semicolon; otherwise
     48        it is treated as plain text. This means that URIs like
     49        foo?cdown=32&cup=42 are no longer seen as containing an instance of
     50        the ∪ character (whose name happens to be cup).
     51      * Several new switches have been added:
     52           + --doctype-system and --doctype-public force a DOCTYPE
     53             declaration to be output and allow setting the system and
     54             public identifiers.
     55           + --standalone and --version allow control of the XML
     56             declaration that is output. (Note that TagSoup's XML output is
     57             always version 1.0, even if you use --version=1.1.)
     58           + --norootbogons causes unknown elements not to be allowed as
     59             the document root element. Instead, they are made children of
     60             the default root element (the html element for HTML).
     61      * The TagSoup core now supports character entities with values above
     62        U+FFFF. As a consequence, the HTML schema now supports all 2,210
     63        standard character entities from the [4]2007-12-14 draft of XML
     64        Entity Definitions for Characters, except the 94 which require more
     65        than one Unicode character to represent.
     66      * The SAX events startPrefixMapping and endPrefixMapping are now
     67        being reported for all cases of foreign elements and attributes.
     68      * All bugs around newline processing on Windows should now be gone.
     69      * A number of content models have been loosened to allow elements to
     70        appear in new and non-standard (but commonly found) places. In
     71        particular, tables are now allowed inside paragraphs, against the
     72        letter of the W3C specification.
     73      * Since the span element is intended for fine control of appearance
     74        using CSS, it should never have been a restartable element. This
     75        very long-standing bug has now been fixed.
     76      * The following non-standard elements are now at least partly
     77        supported: bgsound, blink, canvas, comment, listing, marquee, nobr,
     78        rbc, rb, rp, rtc, rt, ruby, wbr, xmp.
     79      * In HTML output mode, boolean attributes like checked are now output
     80        as such, rather than in XML style as checked="checked".
     81      * Runs of < characters such as << and <<< are now handled correctly
     82        in text rather than being transformed into extremely bogus
     83        start-tags.
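
The semicolon rule for entity references in attribute values, described in the list above, can be illustrated with a small stand-alone sketch. This is not TagSoup's implementation: AttributeEntities is a hypothetical class, its entity table holds only four names, and leaving unknown names untouched is a choice made for this sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of "a reference needs a terminating semicolon". */
final class AttributeEntities {
    // Tiny illustrative table; the real HTML schema defines far more names.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp", "&");
        ENTITIES.put("lt", "<");
        ENTITIES.put("gt", ">");
        ENTITIES.put("cup", "\u222A");
    }

    // A name counts as a reference only when it is directly followed by ';'.
    private static final Pattern REFERENCE =
            Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    static String expand(String attributeValue) {
        Matcher m = REFERENCE.matcher(attributeValue);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = ENTITIES.get(m.group(1));
            // Unknown names are copied through unchanged, like unterminated ones.
            m.appendReplacement(out, Matcher.quoteReplacement(
                    replacement != null ? replacement : m.group()));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Under this rule, expand("foo?cdown=32&cup=42") returns its argument unchanged, while expand("A &cup; B") replaces the reference with the union character.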
     84 
     85    [5]Download the TagSoup 1.2 jar file here. It's about 87K long.
     86    [6]Download the full TagSoup 1.2 source here. If you don't have zip,
     87    you can use jar to unpack it.
     88    [7]Download the current CHANGES file here.
     89 
     90   TagSoup 1.1 released
     91 
     92    TagSoup 1.1 adds Tatu Saloranta's JAXP support for TagSoup. To use
     93    TagSoup within the JAXP framework (which is not something I necessarily
     94    recommend, but it is part of the Java XML platform), you can create a
     95    SAXParser by calling
     96    org.ccil.cowan.tagsoup.jaxp.SAXParserImpl.newInstance(). You can also
     97    set the system property javax.xml.parsers.SAXParserFactory to
     98    org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl, but be aware that doing
     99    this will cause all JAXP-based XML parsing to go through TagSoup, which
    100    is a Bad Thing if your application also reads XML documents.
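
One way to avoid the global side effect of the system property is to name the factory class explicitly through the standard JAXP call SAXParserFactory.newInstance(String, ClassLoader). The sketch below assumes the TagSoup jar is on the classpath at run time; ScopedTagSoupJaxp and newHtmlParser are hypothetical names.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

/** Hypothetical helper: a TagSoup-backed SAXParser without a JVM-wide setting. */
final class ScopedTagSoupJaxp {
    static final String FACTORY_CLASS = "org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl";

    static SAXParser newHtmlParser() {
        try {
            // Naming the class here leaves the no-argument
            // SAXParserFactory.newInstance() untouched, so the rest of the
            // application keeps its ordinary XML parser.
            SAXParserFactory factory =
                    SAXParserFactory.newInstance(FACTORY_CLASS, null);
            return factory.newSAXParser();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("could not create TagSoup parser", e);
        }
    }
}
```

If the jar is missing, the JAXP call fails with a FactoryConfigurationError rather than silently falling back to another parser.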
    101 
    102   What TagSoup does
    103 
    104    TagSoup is designed as a parser, not a whole application; it isn't
    105    intended to permanently clean up bad HTML, as [8]HTML Tidy does, only
    106    to parse it on the fly. Therefore, it does not convert presentation
    107    HTML to CSS or anything similar. It does guarantee well-structured
    108    results: tags will wind up properly nested, default attributes will
    109    appear appropriately, and so on.
    110 
    111    The semantics of TagSoup are, as far as practical, those of actual HTML
    112    browsers. In particular, never, never will it throw any sort of syntax
    113    error: the TagSoup motto is [9]"Just Keep On Truckin'". But there's
    114    much, much more. For example, if the first tag is LI, it will supply
    115    the application with enclosing HTML, BODY, and UL tags. Why UL? Because
    116    that's what browsers assume in this situation. For the same reason,
    117    overlapping tags are correctly restarted whenever possible: text like:
    118 This is <B>bold, <I>bold italic, </b>italic, </i>normal text
    119 
    120    gets correctly rewritten as:
    121 This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text
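
The restart behavior shown above can be sketched with a stack of open elements: when an end-tag arrives for an element that is not innermost, everything above it is closed, the element itself is closed, and the restartable elements are reopened. This is an illustration of the technique only, not TagSoup's code. RestartSketch is a hypothetical class, its set of restartable names is an assumption, and it handles attribute-free tags only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of restarting overlapping inline elements. */
final class RestartSketch {
    // Assumption: a small hand-picked set of elements that may be restarted.
    private static final Set<String> RESTARTABLE =
            new HashSet<>(Arrays.asList("b", "i", "em", "strong", "u", "tt"));
    // Matches attribute-free start-tags and end-tags only.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)>");

    static String fixNesting(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();     // innermost element first
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            out.append(html, pos, m.start());        // copy text before the tag
            pos = m.end();
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (m.group(1).isEmpty()) {              // start-tag
                open.push(name);
                out.append('<').append(name).append('>');
            } else if (open.contains(name)) {        // end-tag of an open element
                List<String> toRestart = new ArrayList<>();
                String top;
                while (!(top = open.pop()).equals(name)) {
                    out.append("</").append(top).append('>');
                    if (RESTARTABLE.contains(top)) {
                        toRestart.add(0, top);       // keep outermost first
                    }
                }
                out.append("</").append(name).append('>');
                for (String restarted : toRestart) { // reopen in original order
                    open.push(restarted);
                    out.append('<').append(restarted).append('>');
                }
            }                                        // stray end-tags are dropped
        }
        out.append(html.substring(pos));
        while (!open.isEmpty()) {                    // close whatever is still open
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }
}
```

Applied to the first line of the example, fixNesting produces the second line.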
    122 
    123    By intention, TagSoup is small and fast. It does not depend on the
    124    existence of any framework other than SAX, and should be able to work
    125    with any framework that can accept SAX parsers. In particular, [10]XOM
    126    is known to work.
    127 
    128    You can replace the low-level HTML scanner with one based on Sean
    129    McGrath's [11]PYX format (very close to James Clark's ESIS format). You
    130    can also supply an AutoDetector that peeks at the incoming byte stream
    131    and guesses a character encoding for it. Otherwise, the platform
    132    default is used. If you need an autodetector of character sets,
    133    consider trying to adapt the [12]Mozilla one; if you succeed, let me
    134    know.
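
For readers who have not met PYX: it is a line-oriented notation in which the first character of each line gives the event type, "(" for a start-tag, "A" for an attribute, "-" for character data and ")" for an end-tag. The sketch below turns SAX events into that notation using only the JDK's own XML parser. PyxSketch is a hypothetical class, and the backslash escapes for newline and tab follow my understanding of the PYX convention.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: renders SAX events as PYX lines. */
final class PyxSketch extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        flushText();
        out.append('(').append(qName).append('\n');
        for (int i = 0; i < atts.getLength(); i++) {
            out.append('A').append(atts.getQName(i)).append(' ')
               .append(escape(atts.getValue(i))).append('\n');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);   // SAX may deliver text in pieces
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushText();
        out.append(')').append(qName).append('\n');
    }

    // Emits buffered character data as a single "-" line.
    private void flushText() {
        if (text.length() > 0) {
            out.append('-').append(escape(text.toString())).append('\n');
            text.setLength(0);
        }
    }

    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\n", "\\n").replace("\t", "\\t");
    }

    static String toPyx(String xml) {
        PyxSketch handler = new PyxSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (IOException | ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("parse failed", e);
        }
        return handler.out.toString();
    }
}
```

For the input <p class="x">hi</p> this yields four lines: (p, Aclass x, -hi and )p.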
    135 
    136   Note: TagSoup in Java 1.1
    137 
    138    If you go through the TagSoup source and replace all references to
    139    HashMap with Hashtable and recompile, TagSoup will work fine in Java
    140    1.1 VMs. Thanks to Thorbjørn Vinne for this discovery.
    141 
    142   The TSaxon XSLT-for-HTML processor
    143 
    144    [13]I am also distributing [14]TSaxon, a repackaging of version 6.5.5
    145    of Michael Kay's Saxon XSLT version 1.0 implementation that includes
    146    TagSoup. TSaxon is a drop-in replacement for Saxon, and can be used to
    147    process either HTML or XML documents with XSLT stylesheets.
    148 
    149   TagSoup as a stand-alone program
    150 
    151    It is possible to run TagSoup as a program by saying java -jar
    152    tagsoup-1.2.jar [option ...] [file ...]. Files mentioned on the command
    153    line will be parsed individually. If no files are specified, the
    154    standard input is read.
    155 
    156    The following options are understood:
    157 
    158    --files
    159           Output into individual files, with html extensions changed to
    160           xhtml. Otherwise, all output is sent to the standard output.
    161 
    162    --html
    163           Output is in clean HTML: the XML declaration is suppressed, as
    164           are end-tags for the known empty elements.
    165 
    166    --omit-xml-declaration
    167           The XML declaration is suppressed.
    168 
    169    --method=html
    170           End-tags for the known empty HTML elements are suppressed.
    171 
    172    --doctype-system=systemid
    173           Forces the output of a DOCTYPE declaration with the specified
    174           systemid.
    175 
    176    --doctype-public=publicid
    177           Forces the output of a DOCTYPE declaration with the specified
    178           publicid.
    179 
    180    --version=version
    181           Sets the version string in the XML declaration.
    182 
    183    --standalone=[yes|no]
    184           Sets the standalone declaration to yes or no.
    185 
    186    --pyx
    187           Output is in PYX format.
    188 
    189    --pyxin
    190           Input is in PYXoid format (need not be well-formed).
    191 
    192    --nons
    193           Namespaces are suppressed. Normally, all elements are in the
    194           XHTML 1.x namespace, and all attributes are in no namespace.
    195 
    196    --nobogons
    197           Bogons (unknown elements) are suppressed.
    198 
    199    --nodefaults
    200           Default attribute values are suppressed.
    201 
    202    --nocolons
    203           Explicit colons in element and attribute names are changed to
    204           underscores.
    205 
    206    --norestart
    207           Normally restartable elements are not restarted.
    208 
    209    --ignorable
    210           Whitespace in elements with element-only content is output.
    211 
    212    --emptybogons
    213           Bogons are given a content model of EMPTY rather than ANY.
    214 
    215    --any
    216           Bogons are given a content model of ANY rather than EMPTY
    217           (default).
    218 
    219    --norootbogons
    220           Don't allow bogons to be root elements; make them subordinate to
    221           the root.
    222 
    223    --lexical
    224           Pass through HTML comments and DOCTYPE declarations. Has no
    225           effect when output is in PYX format.
    226 
    227    --reuse
    228           Reuse a single instance of the TagSoup parser throughout. Normally,
    229           a new one is instantiated for each input file.
    230 
    231    --nocdata
    232           Change the content models of the script and style elements to
    233           treat them as ordinary #PCDATA (text-only) elements, as in
    234           XHTML, rather than with the special CDATA content model.
    235 
    236    --encoding=encoding
    237           Specify the input encoding. The default is the Java platform
    238           default.
    239 
    240    --output-encoding=encoding
    241           Specify the output encoding. The default is the Java platform
    242           default.
    243 
    244    --help
    245           Print help.
    246 
    247    --version
    248           Print the version number.
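
When the stand-alone program has to be driven from Java, for example in a batch job, the command line described above can be assembled and launched with ProcessBuilder. This is only a sketch: TagSoupCommand is a hypothetical class, and the jar path, options and file names are supplied by the caller.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for invoking the TagSoup command-line processor. */
final class TagSoupCommand {
    /** Assembles: java -jar jarPath [option ...] [file ...] */
    static List<String> build(String jarPath, List<String> options,
                              List<String> files) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-jar");
        command.add(jarPath);
        command.addAll(options);   // for example "--html", "--nobogons"
        command.addAll(files);     // with no files, standard input is read
        return command;
    }

    /** Runs the command; the child shares this process's stdin, stdout and stderr. */
    static int run(List<String> command) throws IOException, InterruptedException {
        return new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}
```

For instance, build("tagsoup-1.2.jar", Arrays.asList("--html", "--nobogons"), Arrays.asList("page.html")) yields the argument list for cleaning one file to standard output as HTML.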
    249 
    250   SAX features and properties
    251 
    252    TagSoup supports the following SAX features in addition to the standard
    253    ones:
    254 
    255    http://www.ccil.org/~cowan/tagsoup/features/ignore-bogons
    256           A value of "true" indicates that the parser will ignore unknown
    257           elements.
    258 
    259    http://www.ccil.org/~cowan/tagsoup/features/bogons-empty
    260           A value of "true" indicates that the parser will give unknown
    261           elements a content model of EMPTY; a value of "false", a content
    262           model of ANY.
    263 
    264    http://www.ccil.org/~cowan/tagsoup/features/root-bogons
    265           A value of "true" indicates that the parser will allow unknown
    266           elements to be the root of the output document.
    267 
    268    http://www.ccil.org/~cowan/tagsoup/features/default-attributes
    269           A value of "true" indicates that the parser will return default
    270           attribute values for missing attributes that have default
    271           values.
    272 
    273    http://www.ccil.org/~cowan/tagsoup/features/translate-colons
    274           A value of "true" indicates that the parser will translate
    275           colons into underscores in names.
    276 
    277    http://www.ccil.org/~cowan/tagsoup/features/restart-elements
    278           A value of "true" indicates that the parser will attempt to
    279           restart the restartable elements.
    280 
    281    http://www.ccil.org/~cowan/tagsoup/features/ignorable-whitespace
    282           A value of "true" indicates that the parser will transmit
    283           whitespace in element-only content via the SAX
    284           ignorableWhitespace callback. Normally this is not done, because
    285           HTML is an SGML application and SGML suppresses such whitespace.
    286 
    287    http://www.ccil.org/~cowan/tagsoup/features/cdata-elements
    288           A value of "true" indicates that the parser will process the
    289           script and style elements (or any elements with type='cdata' in
    290           the TSSL schema) as SGML CDATA elements (that is, no markup is
    291           recognized except the matching end-tag).
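
These features are set through the ordinary SAX setFeature call, using the URIs listed above. The sketch below applies one arbitrary combination of settings; TagSoupFeatureSetup and configure are hypothetical names, and the particular values are chosen only for illustration.

```java
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

/** Hypothetical helper: applies a few TagSoup-specific SAX features. */
final class TagSoupFeatureSetup {
    static final String FEATURES =
            "http://www.ccil.org/~cowan/tagsoup/features/";

    static void configure(XMLReader reader) {
        try {
            // Ignore unknown elements entirely.
            reader.setFeature(FEATURES + "ignore-bogons", true);
            // Do not report default values for missing attributes.
            reader.setFeature(FEATURES + "default-attributes", false);
            // Report whitespace in element-only content via ignorableWhitespace.
            reader.setFeature(FEATURES + "ignorable-whitespace", true);
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            throw new IllegalArgumentException(
                    "reader does not accept TagSoup features", e);
        }
    }
}
```

With the TagSoup jar on the classpath, the call would be TagSoupFeatureSetup.configure(new org.ccil.cowan.tagsoup.Parser()) before parsing; a reader that does not know these URIs is rejected with an IllegalArgumentException.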
    292 
    293    TagSoup supports the following SAX properties in addition to the
    294    standard ones:
    295 
    296    http://www.ccil.org/~cowan/tagsoup/properties/scanner
    297           Specifies the Scanner object this parser uses.
    298 
    299    http://www.ccil.org/~cowan/tagsoup/properties/schema
    300           Specifies the Schema object this parser uses.
    301 
    302    http://www.ccil.org/~cowan/tagsoup/properties/auto-detector
    303           Specifies the AutoDetector (for encoding detection) this parser
    304           uses.
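
As a starting point for an encoding detector, the sketch below guesses the character set from a leading byte-order mark and otherwise uses a caller-supplied fallback. It uses only the JDK. BomSniffer is a hypothetical class, and hooking it into the auto-detector property would additionally mean wrapping it in TagSoup's AutoDetector interface, whose exact method signature should be checked against the source.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: picks a charset by peeking at a byte-order mark. */
final class BomSniffer {
    static Reader readerFor(InputStream in, Charset fallback) {
        try {
            PushbackInputStream pushback = new PushbackInputStream(in, 3);
            byte[] head = new byte[3];
            int n = readFully(pushback, head);
            Charset charset = fallback;
            int bomLength = 0;
            if (n >= 3 && (head[0] & 0xFF) == 0xEF
                    && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
                charset = StandardCharsets.UTF_8;
                bomLength = 3;
            } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
                charset = StandardCharsets.UTF_16BE;
                bomLength = 2;
            } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
                charset = StandardCharsets.UTF_16LE;
                bomLength = 2;
            }
            // Push back the bytes after the BOM so the Reader sees only content.
            if (n > bomLength) {
                pushback.unread(head, bomLength, n - bomLength);
            }
            return new InputStreamReader(pushback, charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Convenience for trying the sketch out: drains a Reader into a String. */
    static String readAll(Reader reader) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int count;
            while ((count = reader.read(buf)) != -1) {
                sb.append(buf, 0, count);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads up to buf.length bytes, stopping early only at end of stream.
    private static int readFully(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int r = in.read(buf, total, buf.length - total);
            if (r < 0) {
                break;
            }
            total += r;
        }
        return total;
    }
}
```

A byte-order mark is only one signal; real HTML often declares its encoding in a meta element or not at all, which is why a statistical detector such as the Mozilla one remains worth adapting.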
    305 
    306   More information
    307 
    308    I gave a presentation (a nocturne, so it's not on the schedule) at
    309    [15]Extreme Markup Languages 2004 about TagSoup, updated from the one
    310    presented in 2002 at the New York City XML SIG and at XML 2002. This is
    311    the main high-level documentation about how TagSoup works. Formats:
    312    [16]OpenDocument [17]Powerpoint [18]PDF.
    313 
    314    I also had people add [19]"evil" HTML to a large poster so that I could
    315    [20]clean it up; View Source is probably more useful than ordinary
    316    browsing. The original instructions were:
    317 
    318                          SOUPE DE BALISES (BE EVIL)!
    319    Ecritez une balise ouvrante (sans attributs)
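    320    ou fermante HTML ici, s.v.p.
           (In English: "Tag soup (be evil)! Please write an HTML start-tag,
           without attributes, or an end-tag here.")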
    321 
    322    There is a [21]tagsoup-friends mailing list hosted at [22]Yahoo Groups.
    323    You can [23]join via the Web, or by sending a blank email to
    324    [24]tagsoup-friends-subscribe@yahoogroups.com. The [25]archives are
    325    open to all.
    326 
    327    Online TagSoup processing for publicly accessible HTML documents is now
    328    [26]available courtesy of Leigh Dodds.
    329 
    330 References
    331 
    332    1. http://oregonstate.edu/instruct/phl302/texts/hobbes/leviathan-c.html
    333    2. http://opensource.org/licenses/apache2.0.php
    334    3. http://prdownloads.sourceforge.net/saxon/saxon6-5-5.zip
    335    4. http://www.w3.org/TR/2007/WD-xml-entity-names-20071214
    336    5. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup-1.2.jar
    337    6. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup-1.2-src.zip
    338    7. http://home.ccil.org/~cowan/XML/tagsoup/CHANGES
    339    8. http://tidy.sf.net/
    340    9. http://www.crumbmuseum.com/truckin.html
    341   10. http://www.cafeconleche.org/XOM
    342   11. http://gnosis.cx/publish/programming/xml_matters_17.html
    343   12. http://jchardet.sourceforge.net/
    344   13. http://www.ccil.org/~cowan
    345   14. http://home.ccil.org/~cowan/XML/tagsoup/tsaxon
    346   15. http://www.extrememarkup.com/extreme/2004
    347   16. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.odp
    348   17. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.ppt
    349   18. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.pdf
    350   19. http://home.ccil.org/~cowan/XML/tagsoup/extreme.html
    351   20. http://home.ccil.org/~cowan/XML/tagsoup/extreme.xhtml
    352   21. http://groups.yahoo.com/group/tagsoup-friends
    353   22. http://groups.yahoo.com/
    354   23. http://groups.yahoo.com/group/tagsoup-friends/join
    355   24. mailto:tagsoup-friends-subscribe@yahoogroups.com
    356   25. http://groups.yahoo.com/group/tagsoup-friends/messages
    357   26. http://xmlarmyknife.org/docs/xhtml/tagsoup/
    358