= 4.3.2 (20131002) =

* Fixed a bug in which short Unicode input was improperly encoded to
  ASCII when checking whether or not it was the name of a file on
  disk. [bug=1227016]

* Fixed a crash when a short input contains data not valid in
  filenames. [bug=1232604]

* Fixed a bug that caused Unicode data put into UnicodeDammit to
  return None instead of the original data. [bug=1214983]

* Combined two tests to stop a spurious test failure when tests are
  run by nosetests. [bug=1212445]

= 4.3.1 (20130815) =

* Fixed yet another problem with the html5lib tree builder, caused by
  html5lib's tendency to rearrange the tree during
  parsing. [bug=1189267]

* Fixed a bug that caused the optimized version of find_all() to
  return nothing. [bug=1212655]

= 4.3.0 (20130812) =

* Instead of converting incoming data to Unicode and feeding it to the
  lxml tree builder in chunks, Beautiful Soup now makes successive
  guesses at the encoding of the incoming data, and tells lxml to
  parse the data as that encoding. Giving lxml more control over the
  parsing process improves performance and avoids a number of bugs and
  issues with the lxml parser which had previously required elaborate
  workarounds:

  - An issue in which lxml refuses to parse Unicode strings on some
    systems. [bug=1180527]

  - A returning bug that truncated documents longer than a (very
    small) size. [bug=963880]

  - A returning bug in which extra spaces were added to a document if
    the document defined a charset other than UTF-8. [bug=972466]

  This required a major overhaul of the tree builder architecture. If
  you wrote your own tree builder and didn't tell me, you'll need to
  modify your prepare_markup() method.

* The UnicodeDammit code that makes guesses at encodings has been
  split into its own class, EncodingDetector.
  A lot of apparently redundant code has been removed from Unicode,
  Dammit, and some undocumented features have also been removed.

* Beautiful Soup will issue a warning if instead of markup you pass it
  a URL or the name of a file on disk (a common beginner's mistake).

* A number of optimizations improve the performance of the lxml tree
  builder by about 33%, the html.parser tree builder by about 20%, and
  the html5lib tree builder by about 15%.

* All find_all calls should now return a ResultSet object. Patch by
  Aaron DeVore. [bug=1194034]

= 4.2.1 (20130531) =

* The default XML formatter will now replace ampersands even if they
  appear to be part of entities. That is, "&lt;" will become
  "&amp;lt;". The old code was left over from Beautiful Soup 3, which
  didn't always turn entities into Unicode characters.

  If you really want the old behavior (maybe because you add new
  strings to the tree, those strings include entities, and you want
  the formatter to leave them alone on output), it can be found in
  EntitySubstitution.substitute_xml_containing_entities(). [bug=1182183]

* Gave new_string() the ability to create subclasses of
  NavigableString. [bug=1181986]

* Fixed another bug by which the html5lib tree builder could create a
  disconnected tree. [bug=1182089]

* The .previous_element of a BeautifulSoup object is now always None,
  not the last element to be parsed. [bug=1182089]

* Fixed test failures when lxml is not installed. [bug=1181589]

* html5lib now supports Python 3. Fixed some Python 2-specific
  code in the html5lib test suite. [bug=1181624]

* The html.parser treebuilder can now handle numeric attributes in
  text when the hexadecimal name of the attribute starts with a
  capital X. Patch by Tim Shirley. [bug=1186242]

= 4.2.0 (20130514) =

* The Tag.select() method now supports a much wider variety of CSS
  selectors.

  - Added support for the adjacent sibling combinator (+) and the
    general sibling combinator (~). Tests by "liquider". [bug=1082144]

  - The combinators (>, +, and ~) can now combine with any supported
    selector, not just one that selects based on tag name.

  - Added limited support for the "nth-of-type" pseudo-class. Code
    by Sven Slootweg. [bug=1109952]

* The BeautifulSoup class is now aliased to "_s" and "_soup", making
  it quicker to type the import statement in an interactive session:

  from bs4 import _s
   or
  from bs4 import _soup

  The alias may change in the future, so don't use this in code you're
  going to run more than once.

* Added the 'diagnose' submodule, which includes several useful
  functions for reporting problems and doing tech support.

  - diagnose(data) tries the given markup on every installed parser,
    reporting exceptions and displaying successes. If a parser is not
    installed, diagnose() mentions this fact.

  - lxml_trace(data, html=True) runs the given markup through lxml's
    XML parser or HTML parser, and prints out the parser events as
    they happen. This helps you quickly determine whether a given
    problem occurs in lxml code or Beautiful Soup code.

  - htmlparser_trace(data) is the same thing, but for Python's
    built-in HTMLParser class.

* In an HTML document, the contents of a <script> or <style> tag will
  no longer undergo entity substitution by default. XML documents work
  the same way they did before. [bug=1085953]

* Methods like get_text() and properties like .strings now only give
  you strings that are visible in the document--no comments or
  processing instructions. [bug=1050164]

* The prettify() method now leaves the contents of <pre> tags
  alone. [bug=1095654]

* Fix a bug in the html5lib treebuilder which sometimes created
  disconnected trees.
  [bug=1039527]

* Fix a bug in the lxml treebuilder which crashed when a tag included
  an attribute from the predefined "xml:" namespace. [bug=1065617]

* Fix a bug by which keyword arguments to find_parent() were not
  being passed on. [bug=1126734]

* Stop a crash when unwisely messing with a tag that's been
  decomposed. [bug=1097699]

* Now that lxml's segfault on invalid doctype has been fixed, fixed a
  corresponding problem on the Beautiful Soup end that was previously
  invisible. [bug=984936]

* Fixed an exception when an overspecified CSS selector didn't match
  anything. Code by Stefaan Lippens. [bug=1168167]

= 4.1.3 (20120820) =

* Skipped a test under Python 2.6 and Python 3.1 to avoid a spurious
  test failure caused by the lousy HTMLParser in those
  versions. [bug=1038503]

* Raise a more specific error (FeatureNotFound) when a requested
  parser or parser feature is not installed. Raise NotImplementedError
  instead of ValueError when the user calls insert_before() or
  insert_after() on the BeautifulSoup object itself. Patch by Aaron
  DeVore. [bug=1038301]

= 4.1.2 (20120817) =

* As per PEP-8, allow searching by CSS class using the 'class_'
  keyword argument. [bug=1037624]

* Display namespace prefixes for namespaced attribute names, instead of
  the fully-qualified names given by the lxml parser. [bug=1037597]

* Fixed a crash on encoding when an attribute name contained
  non-ASCII characters.

* When sniffing encodings, if the cchardet library is installed,
  Beautiful Soup uses it instead of chardet. cchardet is much
  faster. [bug=1020748]

* Use logging.warning() instead of warnings.warn() to notify the user
  that characters were replaced with REPLACEMENT
  CHARACTER.
  [bug=1013862]

= 4.1.1 (20120703) =

* Fixed an html5lib tree builder crash which happened when html5lib
  moved a tag with a multivalued attribute from one part of the tree
  to another. [bug=1019603]

* Correctly display closing tags with an XML namespace declared. Patch
  by Andreas Kostyrka. [bug=1019635]

* Fixed a typo that made parsing significantly slower than it should
  have been, and also waited too long to close tags with XML
  namespaces. [bug=1020268]

* get_text() now returns an empty Unicode string if there is no text,
  rather than an empty bytestring. [bug=1020387]

= 4.1.0 (20120529) =

* Added experimental support for fixing Windows-1252 characters
  embedded in UTF-8 documents. (UnicodeDammit.detwingle())

* Fixed the handling of &quot; with the built-in parser. [bug=993871]

* Comments, processing instructions, document type declarations, and
  markup declarations are now treated as preformatted strings, the way
  CData blocks are. [bug=1001025]

* Fixed a bug with the lxml treebuilder that prevented the user from
  adding attributes to a tag that didn't originally have
  attributes. [bug=1002378] Thanks to Oliver Beattie for the patch.

* Fixed some edge-case bugs having to do with inserting an element
  into a tag it's already inside, and replacing one of a tag's
  children with another. [bug=997529]

* Added the ability to search for attribute values specified in
  UTF-8. [bug=1003974]

  This caused a major refactoring of the search code. All the tests
  pass, but it's possible that some searches will behave differently.

= 4.0.5 (20120427) =

* Added a new method, wrap(), which wraps an element in a tag.

* Renamed replace_with_children() to unwrap(), which is easier to
  understand and also the jQuery name of the function.

* Made encoding substitution in <meta> tags completely transparent (no
  more %SOUP-ENCODING%).

* Fixed a bug in decoding data that contained a byte-order mark, such
  as data encoded in UTF-16LE. [bug=988980]

* Fixed a bug that made the HTMLParser treebuilder generate XML
  definitions ending with two question marks instead of
  one. [bug=984258]

* Upon document generation, CData objects are no longer run through
  the formatter. [bug=988905]

* The test suite now passes when lxml is not installed, whether or not
  html5lib is installed. [bug=987004]

* Print a warning on HTMLParseErrors to let people know they should
  install a better parser library.

= 4.0.4 (20120416) =

* Fixed a bug that sometimes created disconnected trees.

* Fixed a bug with the string setter that moved a string around the
  tree instead of copying it. [bug=983050]

* Attribute values are now run through the provided output formatter.
  Previously they were always run through the 'minimal' formatter. In
  the future I may make it possible to specify different formatters
  for attribute values and strings, but for now, consistent behavior
  is better than inconsistent behavior. [bug=980237]

* Added the missing renderContents method from Beautiful Soup 3. Also
  added an encode_contents() method to go along with decode_contents().

* Give a more useful error when the user tries to run the Python 2
  version of BS under Python 3.

* UnicodeDammit can now convert Microsoft smart quotes to ASCII with
  UnicodeDammit(markup, smart_quotes_to="ascii").

= 4.0.3 (20120403) =

* Fixed a typo that caused some versions of Python 3 to convert the
  Beautiful Soup codebase incorrectly.

* Got rid of the 4.0.2 workaround for HTML documents--it was
  unnecessary and the workaround was triggering a (possibly different,
  but related) bug in lxml.
  [bug=972466]

= 4.0.2 (20120326) =

* Worked around a possible bug in lxml that prevents non-tiny XML
  documents from being parsed. [bug=963880, bug=963936]

* Fixed a bug where specifying `text` while also searching for a tag
  only worked if `text` wanted an exact string match. [bug=955942]

= 4.0.1 (20120314) =

* This is the first official release of Beautiful Soup 4. There is no
  4.0.0 release, to eliminate any possibility that packaging software
  might treat "4.0.0" as being an earlier version than "4.0.0b10".

* Brought BS up to date with the latest release of soupselect, adding
  CSS selector support for direct descendant matches and multiple CSS
  class matches.

= 4.0.0b10 (20120302) =

* Added support for simple CSS selectors, taken from the soupselect
  project.

* Fixed a crash when using html5lib. [bug=943246]

* In HTML5-style <meta charset="foo"> tags, the value of the "charset"
  attribute is now replaced with the appropriate encoding on
  output. [bug=942714]

* Fixed a bug that caused calling a tag to sometimes call find_all()
  with the wrong arguments. [bug=944426]

* For backwards compatibility, brought back the BeautifulStoneSoup
  class as a deprecated wrapper around BeautifulSoup.

= 4.0.0b9 (20120228) =

* Fixed the string representation of DOCTYPEs that have both a public
  ID and a system ID.

* Fixed the generated XML declaration.

* Renamed Tag.nsprefix to Tag.prefix, for consistency with
  NamespacedAttribute.

* Fixed a test failure that occurred on Python 3.x when chardet was
  installed.

* Made prettify() return Unicode by default, so it will look nice on
  Python 3 when passed into print().

= 4.0.0b8 (20120224) =

* All tree builders now preserve namespace information in the
  documents they parse.
  If you use the html5lib parser or lxml's XML parser, you can access
  the namespace URL for a tag as tag.namespace.

  However, there is no special support for namespace-oriented
  searching or tree manipulation. When you search the tree, you need
  to use namespace prefixes exactly as they're used in the original
  document.

* The string representation of a DOCTYPE always ends in a newline.

* Issue a warning if the user tries to use a SoupStrainer in
  conjunction with the html5lib tree builder, which doesn't support
  them.

= 4.0.0b7 (20120223) =

* Upon decoding to string, any characters that can't be represented in
  your chosen encoding will be converted into numeric XML entity
  references.

* Issue a warning if characters were replaced with REPLACEMENT
  CHARACTER during Unicode conversion.

* Restored compatibility with Python 2.6.

* The install process no longer installs docs or auxiliary text files.

* It's now possible to deepcopy a BeautifulSoup object created with
  Python's built-in HTML parser.

* About 100 unit tests that "test" the behavior of various parsers on
  invalid markup have been removed. Legitimate changes to those
  parsers caused these tests to fail, indicating that perhaps
  Beautiful Soup should not test the behavior of foreign
  libraries.

  The problematic unit tests have been reformulated as informational
  comparisons generated by the script
  scripts/demonstrate_parser_differences.py.

  This makes Beautiful Soup compatible with html5lib version 0.95 and
  future versions of HTMLParser.

= 4.0.0b6 (20120216) =

* Multi-valued attributes like "class" always have a list of values,
  even if there's only one value in the list.

* Added a number of multi-valued attributes defined in HTML5.

* Stopped generating a space before the slash that closes an
  empty-element tag.
  This may come back if I add a special XHTML mode
  (http://www.w3.org/TR/xhtml1/#C_2), but right now it's pretty
  useless.

* Passing text along with tag-specific arguments to a find* method:

   find("a", text="Click here")

  will find tags that contain the given text as their
  .string. Previously, the tag-specific arguments were ignored and
  only strings were searched.

* Fixed a bug that caused the html5lib tree builder to build a
  partially disconnected tree. Generally cleaned up the html5lib tree
  builder.

* If you restrict a multi-valued attribute like "class" to a string
  that contains spaces, Beautiful Soup will only consider it a match
  if the values correspond to that specific string.

= 4.0.0b5 (20120209) =

* Rationalized Beautiful Soup's treatment of CSS class. A tag
  belonging to multiple CSS classes is treated as having a list of
  values for the 'class' attribute. Searching for a CSS class will
  match *any* of the CSS classes.

  This actually affects all attributes that the HTML standard defines
  as taking multiple values (class, rel, rev, archive, accept-charset,
  and headers), but 'class' is by far the most common. [bug=41034]

* If you pass anything other than a dictionary as the second argument
  to one of the find* methods, it'll assume you want to use that
  object to search against a tag's CSS classes. Previously this only
  worked if you passed in a string.

* Fixed a bug that caused a crash when you passed a dictionary as an
  attribute value (possibly because you mistyped "attrs"). [bug=842419]

* Unicode, Dammit now detects the encoding in HTML 5-style <meta> tags
  like <meta charset="utf-8" />. [bug=837268]

* If Unicode, Dammit can't figure out a consistent encoding for a
  page, it will try each of its guesses again, with errors="replace"
  instead of errors="strict".
  This may mean that some data gets replaced with REPLACEMENT
  CHARACTER, but at least most of it will get turned into
  Unicode. [bug=754903]

* Patched over a bug in html5lib (?) that was crashing Beautiful Soup
  on certain kinds of markup. [bug=838800]

* Fixed a bug that wrecked the tree if you replaced an element with an
  empty string. [bug=728697]

* Improved Unicode, Dammit's behavior when you give it Unicode to
  begin with.

= 4.0.0b4 (20120208) =

* Added BeautifulSoup.new_string() to go along with BeautifulSoup.new_tag()

* BeautifulSoup.new_tag() will follow the rules of whatever
  tree-builder was used to create the original BeautifulSoup object. A
  new <p> tag will look like "<p />" if the soup object was created to
  parse XML, but it will look like "<p></p>" if the soup object was
  created to parse HTML.

* We pass in strict=False to html.parser on Python 3, greatly
  improving html.parser's ability to handle bad HTML.

* We also monkeypatch a serious bug in html.parser that made
  strict=False disastrous on Python 3.2.2.

* Replaced the "substitute_html_entities" argument with the
  more general "formatter" argument.

* Bare ampersands and angle brackets are always converted to XML
  entities unless the user prevents it.

* Added PageElement.insert_before() and PageElement.insert_after(),
  which let you put an element into the parse tree with respect to
  some other element.

* Raise an exception when the user tries to do something nonsensical
  like insert a tag into itself.


= 4.0.0b3 (20120203) =

Beautiful Soup 4 is a nearly-complete rewrite that removes Beautiful
Soup's custom HTML parser in favor of a system that lets you write a
little glue code and plug in any HTML or XML parser you want.
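
As a rough illustration of what "plug in a parser" looks like from the
caller's side, here is a minimal sketch. It assumes a current bs4
install and names the standard-library parser explicitly as
"html.parser"; the sample markup is made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this sketch.
markup = "<p>Hello, <b>world</b></p>"

# The second argument names the parser whose glue code (tree builder)
# should be used. "html.parser" is the one in the standard library, so
# no extra package is needed for this example.
soup = BeautifulSoup(markup, "html.parser")

print(soup.b.string)      # the text inside the <b> tag
print(soup.p.get_text())  # all the text inside the <p> tag
```

Swapping in another installed parser should only mean changing that
second argument; the rest of the code stays the same.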

Beautiful Soup 4.0 comes with glue code for four parsers:

 * Python's standard HTMLParser (html.parser in Python 3)
 * lxml's HTML and XML parsers
 * html5lib's HTML parser

HTMLParser is the default, but I recommend you install lxml if you
can.

For complete documentation, see the Sphinx documentation in
bs4/doc/source/. What follows is a summary of the changes from
Beautiful Soup 3.

=== The module name has changed ===

Previously you imported the BeautifulSoup class from a module also
called BeautifulSoup. To save keystrokes and make it clear which
version of the API is in use, the module is now called 'bs4':

    >>> from bs4 import BeautifulSoup

=== It works with Python 3 ===

Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
so bad that it barely worked at all. Beautiful Soup 4 works with
Python 3, and since its parser is pluggable, you don't sacrifice
quality.

Special thanks to Thomas Kluyver and Ezio Melotti for getting Python 3
support to the finish line. Ezio Melotti is also to thank for greatly
improving the HTML parser that comes with Python 3.2.

=== CDATA sections are normal text, if they're understood at all. ===

Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
markup:

 <p><![CDATA[foo]]></p> => <p></p>

A future version of html5lib will turn CDATA sections into text nodes,
but only within tags like <svg> and <math>:

 <svg><![CDATA[foo]]></svg> => <svg>foo</svg>

The default XML parser (which uses lxml behind the scenes) turns CDATA
sections into ordinary text elements:

 <p><![CDATA[foo]]></p> => <p>foo</p>

In theory it's possible to preserve the CDATA sections when using the
XML parser, but I don't see how to get it to work in practice.
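
The way an XML parser folds a CDATA section into ordinary text can be
seen without Beautiful Soup at all. The sketch below uses Python's
standard-library xml.etree.ElementTree purely as a stand-in; it is not
the lxml-backed tree builder described above, just another XML parser
that does the same folding.

```python
import xml.etree.ElementTree as ET

# Parse a document whose only content is a CDATA section.
elem = ET.fromstring("<p><![CDATA[foo]]></p>")

# The CDATA wrapper is gone; the content is plain element text.
print(elem.text)

# Serializing the element writes ordinary text, not a CDATA section.
print(ET.tostring(elem, encoding="unicode"))
```

Once parsed, nothing in the tree records that "foo" came from a CDATA
section, which is why writing it back out can't restore one.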

=== Miscellaneous other stuff ===

If the BeautifulSoup instance has .is_xml set to True, an appropriate
XML declaration will be emitted when the tree is transformed into a
string:

    <?xml version="1.0" encoding="utf-8"?>
    <markup>
     ...
    </markup>

The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree
builders set it to False. If you want to parse XHTML with an HTML
parser, you can set it manually.


= 3.2.0 =

The 3.1 series wasn't very useful, so I renamed the 3.0 series to 3.2
to make it obvious which one you should use.

= 3.1.0 =

A hybrid version that supports Python 2.4 and can be automatically
converted to run under Python 3.0. There are three
backwards-incompatible changes you should be aware of, but no new
features or deliberate behavior changes.

1. str() may no longer do what you want. This is because the meaning
of str() inverts between Python 2 and 3; in Python 2 it gives you a
byte string, in Python 3 it gives you a Unicode string.

The effect of this is that you can't pass an encoding to .__str__
anymore. Use encode() to get a byte string and decode() to get
Unicode, and you'll be ready (well, readier) for Python 3.

2. Beautiful Soup is now based on HTMLParser rather than SGMLParser,
which is gone in Python 3. There's some bad HTML that SGMLParser
handled but HTMLParser doesn't, usually to do with attribute values
that aren't closed or have brackets inside them:

  <a href="foo</a>, </a><a href="bar">baz</a>
  <a b="<a>">', '<a b="<a>"></a><a>"></a>

A later version of Beautiful Soup will allow you to plug in different
parsers to make tradeoffs between speed and the ability to handle bad
HTML.

3. In Python 3 (but not Python 2), HTMLParser converts entities within
attributes to the corresponding Unicode characters.
In Python 2 it's possible to parse this string and leave the &eacute;
intact.

 <a href="http://crummy.com?sacr&eacute;&amp;bleu">

In Python 3, the &eacute; is always converted to \xe9 during
parsing.


= 3.0.7a =

Added an import that makes BS work in Python 2.3.


= 3.0.7 =

Fixed a UnicodeDecodeError when unpickling documents that contain
non-ASCII characters.

Fixed a TypeError that occurred in some circumstances when a tag
contained no text.

Jump through hoops to avoid the use of chardet, which can be extremely
slow in some circumstances. UTF-8 documents should never trigger the
use of chardet.

Whitespace is preserved inside <pre> and <textarea> tags that contain
nothing but whitespace.

Beautiful Soup can now parse a doctype that's scoped to an XML namespace.


= 3.0.6 =

Got rid of a very old debug line that prevented chardet from working.

Added a Tag.decompose() method that completely disconnects a tree or a
subset of a tree, breaking it up into bite-sized pieces that are
easy for the garbage collector to collect.

Tag.extract() now returns the tag that was extracted.

Tag.findNext() now does something with the keyword arguments you pass
it instead of dropping them on the floor.

Fixed a Unicode conversion bug.

Fixed a bug that garbled some <meta> tags when rewriting them.


= 3.0.5 =

Soup objects can now be pickled, and copied with copy.deepcopy.

Tag.append now works properly on existing BS objects. (It wasn't
originally intended for outside use, but it can be now.) (Giles
Radford)

Passing in a nonexistent encoding will no longer crash the parser on
Python 2.4 (John Nagle).

Fixed an underlying bug in SGMLParser that thinks ASCII has 255
characters instead of 127 (John Nagle).

Entities are converted more consistently to Unicode characters.

Entity references in attribute values are now converted to Unicode
characters when appropriate. Numeric entities are always converted,
because SGMLParser always converts them outside of attribute values.

ALL_ENTITIES happens to just be the XHTML entities, so I renamed it to
XHTML_ENTITIES.

The regular expression for bare ampersands was too loose. In some
cases ampersands were not being escaped. (Sam Ruby?)

Non-breaking spaces and other special Unicode space characters are no
longer folded to ASCII spaces. (Robert Leftwich)

Information inside a TEXTAREA tag is now parsed literally, not as HTML
tags. TEXTAREA now works exactly the same way as SCRIPT. (Zephyr Fang)

= 3.0.4 =

Fixed a bug that crashed Unicode conversion in some cases.

Fixed a bug that prevented UnicodeDammit from being used as a
general-purpose data scrubber.

Fixed some unit test failures when running against Python 2.5.

When considering whether to convert smart quotes, UnicodeDammit now
looks at the original encoding in a case-insensitive way.

= 3.0.3 (20060606) =

Beautiful Soup is now usable as a way to clean up invalid XML/HTML (be
sure to pass in an appropriate value for convertEntities, or XML/HTML
entities might stick around that aren't valid in HTML/XML). The result
may not validate, but it should be good enough to not choke a
real-world XML parser. Specifically, the output of a properly
constructed soup object should always be valid as part of an XML
document, but parts may be missing if they were missing in the
original. As always, if the input is valid XML, the output will also
be valid.

= 3.0.2 (20060602) =

Previously, Beautiful Soup correctly handled attribute values that
contained embedded quotes (sometimes by escaping), but not other kinds
of XML character.
Now, it correctly handles or escapes all special XML
characters in attribute values.

I aliased methods to the 2.x names (fetch, find, findText, etc.) for
backwards compatibility purposes. Those names are deprecated and if I
ever do a 4.0 I will remove them. I will, I tell you!

Fixed a bug where the findAll method wasn't passing along any keyword
arguments.

When run from the command line, Beautiful Soup now acts as an HTML
pretty-printer, not an XML pretty-printer.

= 3.0.1 (20060530) =

Reintroduced the "fetch by CSS class" shortcut. I thought keyword
arguments would replace it, but they don't. You can't call soup('a',
class='foo') because class is a Python keyword.

If Beautiful Soup encounters a meta tag that declares the encoding,
but a SoupStrainer tells it not to parse that tag, Beautiful Soup will
no longer try to rewrite the meta tag to mention the new
encoding. Basically, this makes SoupStrainers work in real-world
applications instead of crashing the parser.

= 3.0.0 "Who would not give all else for two p" (20060528) =

This release is not backward-compatible with previous releases. If
you've got code written with a previous version of the library, go
ahead and keep using it, unless one of the features mentioned here
really makes your life easier. Since the library is self-contained,
you can include an old copy of the library in your old applications,
and use the new version for everything else.

The documentation has been rewritten and greatly expanded with many
more examples.

Beautiful Soup autodetects the encoding of a document (or uses the one
you specify), and converts it from its native encoding to
Unicode. Internally, it only deals with Unicode strings. When you
print out the document, it converts to UTF-8 (or another encoding you
specify).
[Doc reference]

It's now easy to make large-scale changes to the parse tree without
screwing up the navigation members. The methods are extract,
replaceWith, and insert. [Doc reference. See also Improving Memory
Usage with extract]

Passing True in as an attribute value gives you tags that have any
value for that attribute. You don't have to create a regular
expression. Passing None for an attribute value gives you tags that
don't have that attribute at all.

Tag objects now know whether or not they're self-closing. This avoids
the problem where Beautiful Soup thought that tags like <BR /> were
self-closing even in XML documents. You can customize the self-closing
tags for a parser object by passing them in as a list of
selfClosingTags: you don't have to subclass anymore.

There's a new built-in parser, MinimalSoup, which has most of
BeautifulSoup's HTML-specific rules, but no tag nesting rules. [Doc
reference]

You can use a SoupStrainer to tell Beautiful Soup to parse only part
of a document. This saves time and memory, often making Beautiful Soup
about as fast as a custom-built SGMLParser subclass. [Doc reference,
SoupStrainer reference]

You can (usually) use keyword arguments instead of passing a
dictionary of attributes to a search method. That is, you can replace
soup(args={"id" : "5"}) with soup(id="5"). You can still use args if
(for instance) you need to find an attribute whose name clashes with
the name of an argument to findAll. [Doc reference: **kwargs attrs]

The method names have changed to the better method names used in
Rubyful Soup. Instead of find methods and fetch methods, there are
only find methods. Instead of a scheme where you can't remember which
method finds one element and which one finds them all, we have find
and findAll. In general, if the method name mentions All or a plural
noun (eg.
findNextSiblings), then it finds many elements. Otherwise, it only
finds one element. [Doc reference]

Some of the argument names have been renamed for clarity. For instance
avoidParserProblems is now parserMassage.

Beautiful Soup no longer implements a feed method. You need to pass a
string or a filehandle into the soup constructor, rather than calling
feed after the soup has been created. There is still a feed method,
but it's the feed method implemented by SGMLParser and calling it will
bypass Beautiful Soup and cause problems.

The NavigableText class has been renamed to NavigableString. There is
no NavigableUnicodeString anymore, because every string inside a
Beautiful Soup parse tree is a Unicode string.

findText and fetchText are gone. Just pass a text argument into find
or findAll.

Null was more trouble than it was worth, so I got rid of it. Anything
that used to return Null now returns None.

Special XML constructs like comments and CDATA now have their own
NavigableString subclasses, instead of being treated as oddly-formed
data. If you parse a document that contains CDATA and write it back
out, the CDATA will still be there.

When you're parsing a document, you can get Beautiful Soup to convert
XML or HTML entities into the corresponding Unicode characters. [Doc
reference]

= 2.1.1 (20050918) =

Fixed a serious performance bug in BeautifulStoneSoup which was
causing parsing to be incredibly slow.

Corrected several entities that were previously being incorrectly
translated from Microsoft smart-quote-like characters.

Fixed a bug that was breaking text fetch.

Fixed a bug that crashed the parser when text chunks that look like
HTML tag names showed up within a SCRIPT tag.

THEAD, TBODY, and TFOOT tags are now nestable within TABLE
tags. Nested tables should parse more sensibly now.

BASE is now considered a self-closing tag.

= 2.1.0 "Game, or any other dish?" (20050504) =

Added a wide variety of new search methods which, given a starting
point inside the tree, follow a particular navigation member (like
nextSibling) over and over again, looking for Tag and NavigableText
objects that match certain criteria. The new methods are findNext,
fetchNext, findPrevious, fetchPrevious, findNextSibling,
fetchNextSiblings, findPreviousSibling, fetchPreviousSiblings,
findParent, and fetchParents. All of these use the same basic code
used by first and fetch, so you can pass your weird ways of matching
things into these methods.

The fetch method and its derivatives now accept a limit argument.

You can now pass keyword arguments when calling a Tag object as though
it were a method.

Fixed a bug that caused all hand-created tags to share a single set of
attributes.

= 2.0.3 (20050501) =

Fixed Python 2.2 support for iterators.

Fixed a bug that gave the wrong representation to tags within quote
tags like <script>.

Took some code from Mark Pilgrim that treats CDATA declarations as
data instead of ignoring them.

Beautiful Soup's setup.py will now do an install even if the unit
tests fail. It won't build a source distribution if the unit tests
fail, so I can't release a new version unless they pass.

= 2.0.2 (20050416) =

Added the unit tests in a separate module, and packaged it with
distutils.

Fixed a bug that sometimes caused renderContents() to return a Unicode
string even if there was no Unicode in the original string.

Added the done() method, which closes all of the parser's open
tags. It gets called automatically when you pass in some text to the
constructor of a parser class; otherwise you must call it yourself.
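
The behaviour described for done() can be pictured with a small model.
What follows is only a sketch in modern Python, not Beautiful Soup's
code: TinyParser, open_tags, and the tuple-based input to feed() are
hypothetical names invented for this illustration. It shows the two
paths mentioned above: done() runs automatically when input is handed
to the constructor, and must be called by hand otherwise.

```python
class TinyParser:
    """Hypothetical stand-in for a parser class; not Beautiful Soup's code."""

    def __init__(self, events=None):
        self.open_tags = []  # stack of tag names that are still open
        self.output = []     # rendered pieces, in document order
        if events is not None:
            # Input handed to the constructor: feed it, then finish up
            # automatically.
            self.feed(events)
            self.done()

    def feed(self, events):
        # Assumed input format: ("start", name), ("end", name), ("text", data).
        for kind, value in events:
            if kind == "start":
                self.open_tags.append(value)
                self.output.append("<%s>" % value)
            elif kind == "end":
                if value in self.open_tags:
                    # Close everything opened after `value`, then `value`.
                    while self.open_tags:
                        name = self.open_tags.pop()
                        self.output.append("</%s>" % name)
                        if name == value:
                            break
            else:
                self.output.append(value)

    def done(self):
        # Close every tag that is still open, innermost first.
        while self.open_tags:
            self.output.append("</%s>" % self.open_tags.pop())

    def render(self):
        return "".join(self.output)
```

With this model, TinyParser(events) yields balanced markup right away,
while TinyParser() followed by feed(events) leaves tags open until
done() is called.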

Reinstated some backwards compatibility with 1.x versions: referencing
the string member of a NavigableText object returns the NavigableText
object instead of throwing an error.

= 2.0.1 (20050412) =

Fixed a bug that caused bad results when you tried to reference a tag
name shorter than 3 characters as a member of a Tag, eg. tag.table.td.

Made sure all Tags have the 'hidden' attribute so that an attempt to
access tag.hidden doesn't spawn an attempt to find a tag named
'hidden'.

Fixed a bug in the comparison operator.

= 2.0.0 "Who cares for fish?" (20050410) =

Beautiful Soup version 1 was very useful but also pretty stupid. I
originally wrote it without noticing any of the problems inherent in
trying to build a parse tree out of ambiguous HTML tags. This version
solves all of those problems to my satisfaction. It also adds many new
clever things to make up for the removal of the stupid things.

== Parsing ==

The parser logic has been greatly improved, and the BeautifulSoup
class should much more reliably yield a parse tree that looks like
what the page author intended. For a particular class of odd edge
cases that now causes problems, there is a new class,
ICantBelieveItsBeautifulSoup.

By default, Beautiful Soup now performs some cleanup operations on
text before parsing it. This is to avoid common problems with bad
definitions and self-closing tags that crash SGMLParser. You can
provide your own set of cleanup operations, or turn it off
altogether. The cleanup operations include fixing self-closing tags
that don't close, and replacing Microsoft smart quotes and similar
characters with their HTML entity equivalents.

You can now get a pretty-print version of parsed HTML to get a visual
picture of how Beautiful Soup parses it, with the Tag.prettify()
method.

== Strings and Unicode ==

There are separate NavigableText subclasses for ASCII and Unicode
strings. These classes directly subclass the corresponding base data
types. This means you can treat NavigableText objects as strings
instead of having to call methods on them to get the strings.

str() on a Tag always returns a string, and unicode() always returns
Unicode. Previously it was inconsistent.

== Tree traversal ==

In a first() or fetch() call, the tag name or the desired value of an
attribute can now be any of the following:

 * A string (matches that specific tag or that specific attribute value)
 * A list of strings (matches any tag or attribute value in the list)
 * A compiled regular expression object (matches any tag or attribute
   value that matches the regular expression)
 * A callable object that takes the Tag object or attribute value as a
   string. It returns None/false/empty string if the given string
   doesn't match, and any other value if it does.

This is much easier to use than SQL-style wildcards (see, regular
expressions are good for something). Because of this, I took out
SQL-style wildcards. I'll put them back if someone complains, but
their removal simplifies the code a lot.

You can use fetch() and first() to search for text in the parse tree,
not just tags. There are new alias methods fetchText() and firstText()
designed for this purpose. As with searching for tags, you can pass in
a string, a regular expression object, or a method to match your text.

If you pass in something besides a map to the attrs argument of
fetch() or first(), Beautiful Soup will assume you want to match that
thing against the "class" attribute. When you're scraping
well-structured HTML, this makes your code a lot cleaner.

1.x and 2.x both let you call a Tag object as a shorthand for
fetch().
For instance, foo("bar") is a shorthand for
foo.fetch("bar"). In 2.x, you can also access a specially-named member
of a Tag object as a shorthand for first(). For instance, foo.barTag
is a shorthand for foo.first("bar"). By chaining these shortcuts you
traverse a tree in very little code: for header in
soup.bodyTag.pTag.tableTag('th'):

If an element relationship (like parent or next) doesn't apply to a
tag, it'll now show up as Null instead of None. first() will also
return Null if you ask it for a nonexistent tag. Null is an object
that's just like None, except you can do whatever you want to it and
it'll give you Null instead of throwing an error.

This lets you do tree traversals like soup.htmlTag.headTag.titleTag
without having to worry if the intermediate stages are actually
there. Previously, if there was no 'head' tag in the document, headTag
in that instance would have been None, and accessing its 'titleTag'
member would have thrown an AttributeError. Now, you can get what you
want when it exists, and get Null when it doesn't, without having to
do a lot of conditionals checking to see if every stage is None.

There are two new relations between page elements: previousSibling and
nextSibling. They reference the previous and next element at the same
level of the parse tree. For instance, if you have HTML like this:

  <p><ul><li>Foo<br /><li>Bar</ul>

The first 'li' tag has a previousSibling of Null and its nextSibling
is the second 'li' tag. The second 'li' tag has a nextSibling of Null
and its previousSibling is the first 'li' tag. The previousSibling of
the 'ul' tag is the first 'p' tag. The nextSibling of 'Foo' is the
'br' tag.

I took out the ability to use fetch() to find tags that have a
specific list of contents. See, I can't even explain it well.
It was
really difficult to use, I never used it, and I don't think anyone
else ever used it. To the extent anyone did, they can probably use
fetchText() instead. If it turns out someone needs it I'll think of
another solution.

== Tree manipulation ==

You can add new attributes to a tag, and delete attributes from a
tag. In 1.x you could only change a tag's existing attributes.

== Porting Considerations ==

There are three changes in 2.0 that break old code:

In the post-1.2 release you could pass a function into fetch(). The
function took a string, the tag name. In 2.0, the function takes the
actual Tag object.

It's no longer possible to pass in SQL-style wildcards to fetch(). Use
a regular expression instead.

The different parsing algorithm means the parse tree may not be shaped
like you expect. This will only actually affect you if your code uses
one of the affected parts. I haven't run into this problem yet while
porting my code.

= Between 1.2 and 2.0 =

This is the release to get if you want Python 1.5 compatibility.

The desired value of an attribute can now be any of the following:

 * A string
 * A string with SQL-style wildcards
 * A compiled RE object
 * A callable that returns None/false/empty string if the given value
   doesn't match, and any other value otherwise.

This is much easier to use than SQL-style wildcards (see, regular
expressions are good for something). Because of this, I no longer
recommend you use SQL-style wildcards. They may go away in a future
release to clean up the code.

Made Beautiful Soup handle processing instructions as text instead of
ignoring them.

Applied patch from Richie Hindle (richie at entrian dot com) that
makes tag.string a shorthand for tag.contents[0].string when the tag
has only one string-owning child.
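
The tag.string shorthand can be pictured with a small model. This is a
sketch in modern Python, not the patched code: Node and Text are
hypothetical classes invented for the example, and the rule in
Node.string is an assumption read off the sentence above (exactly one
child, whose own .string is used).

```python
class Text(str):
    """Hypothetical text node; its .string is simply itself."""

    @property
    def string(self):
        return self


class Node:
    """Hypothetical tag node holding a list of children in .contents."""

    def __init__(self, name, *contents):
        self.name = name
        self.contents = list(contents)

    @property
    def string(self):
        # Assumed rule: with exactly one child, defer to that child's
        # .string; otherwise there is no single string to return.
        if len(self.contents) == 1:
            return self.contents[0].string
        return None


# <p><b>hello</b></p>: each tag has one child, so the lookup chains down.
nested = Node("p", Node("b", Text("hello")))

# <p>a<i>b</i></p>: two children, so there is no single string.
mixed = Node("p", Text("a"), Node("i", Text("b")))
```

In this model nested.string is the text "hello", reached through
nested.contents[0].string, while mixed.string is None.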

Added still more nestable tags. The nestable tags thing won't work in
a lot of cases and needs to be rethought.

Fixed an edge case where searching for "%foo" would match any string
shorter than "foo".

= 1.2 "Who for such dainties would not stoop?" (20040708) =

Applied patch from Ben Last (ben at benlast dot com) that made
Tag.renderContents() correctly handle Unicode.

Made BeautifulStoneSoup even dumber by making it not implicitly close
a tag when another tag of the same type is encountered; only when an
actual closing tag is encountered. This change courtesy of Fuzzy (mike
at pcblokes dot com). BeautifulSoup still works as before.

= 1.1 "Swimming in a hot tureen" =

Added more 'nestable' tags. Changed popping semantics so that when a
nestable tag is encountered, tags are popped up to the previously
encountered nestable tag (of whatever kind). I will revert this if
enough people complain, but it should make more people's lives easier
than harder. This enhancement was suggested by Anthony Baxter (anthony
at interlink dot com dot au).

= 1.0 "So rich and green" (20040420) =

Initial release.