Beautiful Soup Documentation
============================

.. image:: 6.1.jpg
   :align: right
   :alt: "The Fish-Footman began by producing from under his arm a great letter, nearly as large as himself."

`Beautiful Soup <http://www.crummy.com/software/BeautifulSoup/>`_ is a
Python library for pulling data out of HTML and XML files. It works
with your favorite parser to provide idiomatic ways of navigating,
searching, and modifying the parse tree. It commonly saves programmers
hours or days of work.

These instructions illustrate all major features of Beautiful Soup 4,
with examples. I show you what the library is good for, how it works,
how to use it, how to make it do what you want, and what to do when it
violates your expectations.

The examples in this documentation should work the same way in Python
2.7 and Python 3.2.

You might be looking for the documentation for `Beautiful Soup 3
<http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html>`_.
If so, you should know that Beautiful Soup 3 is no longer being
developed, and that Beautiful Soup 4 is recommended for all new
projects. If you want to learn about the differences between Beautiful
Soup 3 and Beautiful Soup 4, see `Porting code to BS4`_.

This documentation has been translated into other languages by its users.

* `Korean translation <http://coreapython.hosting.paran.com/etc/beautifulsoup4.html>`_

Getting help
------------

If you have questions about Beautiful Soup, or run into problems,
`send mail to the discussion group
<https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup>`_. If
your problem involves parsing an HTML document, be sure to mention
:ref:`what the diagnose() function says <diagnose>` about
that document.

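If you've never run it: ``diagnose()`` takes your markup as a string
and prints a report on how each installed parser handles it. Here's a
minimal sketch (the filename is hypothetical)::

    from bs4.diagnose import diagnose

    # Print a report describing what each installed parser makes of
    # this document.
    data = open("bad.html").read()
    diagnose(data)
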
Quick Start
===========

Here's an HTML document I'll be using as an example throughout this
document. It's part of a story from `Alice in Wonderland`::

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

Running the "three sisters" document through Beautiful Soup gives us a
``BeautifulSoup`` object, which represents the document as a nested
data structure::

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc)

    print(soup.prettify())
    # <html>
    #  <head>
    #   <title>
    #    The Dormouse's story
    #   </title>
    #  </head>
    #  <body>
    #   <p class="title">
    #    <b>
    #     The Dormouse's story
    #    </b>
    #   </p>
    #   <p class="story">
    #    Once upon a time there were three little sisters; and their names were
    #    <a class="sister" href="http://example.com/elsie" id="link1">
    #     Elsie
    #    </a>
    #    ,
    #    <a class="sister" href="http://example.com/lacie" id="link2">
    #     Lacie
    #    </a>
    #    and
    #    <a class="sister" href="http://example.com/tillie" id="link3">
    #     Tillie
    #    </a>
    #    ; and they lived at the bottom of a well.
    #   </p>
    #   <p class="story">
    #    ...
    #   </p>
    #  </body>
    # </html>

Here are some simple ways to navigate that data structure::

    soup.title
    # <title>The Dormouse's story</title>

    soup.title.name
    # u'title'

    soup.title.string
    # u'The Dormouse's story'

    soup.title.parent.name
    # u'head'

    soup.p
    # <p class="title"><b>The Dormouse's story</b></p>

    soup.p['class']
    # ["title"]

    soup.a
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    soup.find_all('a')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.find(id="link3")
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

One common task is extracting all the URLs found within a page's <a> tags::

    for link in soup.find_all('a'):
        print(link.get('href'))
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie

Another common task is extracting all the text from a page::

    print(soup.get_text())
    # The Dormouse's story
    #
    # The Dormouse's story
    #
    # Once upon a time there were three little sisters; and their names were
    # Elsie,
    # Lacie and
    # Tillie;
    # and they lived at the bottom of a well.
    #
    # ...

Does this look like what you need? If so, read on.

Installing Beautiful Soup
=========================

If you're using a recent version of Debian or Ubuntu Linux, you can
install Beautiful Soup with the system package manager:

:kbd:`$ apt-get install python-bs4`

Beautiful Soup 4 is published through PyPI, so if you can't install it
with the system packager, you can install it with ``easy_install`` or
``pip``. The package name is ``beautifulsoup4``, and the same package
works on Python 2 and Python 3.

:kbd:`$ easy_install beautifulsoup4`

:kbd:`$ pip install beautifulsoup4`

(The ``BeautifulSoup`` package is probably `not` what you want. That's
the previous major release, `Beautiful Soup 3`_. Lots of software uses
BS3, so it's still available, but if you're writing new code you
should install ``beautifulsoup4``.)

If you don't have ``easy_install`` or ``pip`` installed, you can
`download the Beautiful Soup 4 source tarball
<http://www.crummy.com/software/BeautifulSoup/download/4.x/>`_ and
install it with ``setup.py``.

:kbd:`$ python setup.py install`

If all else fails, the license for Beautiful Soup allows you to
package the entire library with your application. You can download the
tarball, copy its ``bs4`` directory into your application's codebase,
and use Beautiful Soup without installing it at all.

I use Python 2.7 and Python 3.2 to develop Beautiful Soup, but it
should work with other recent versions.

Problems after installation
---------------------------

Beautiful Soup is packaged as Python 2 code. When you install it for
use with Python 3, it's automatically converted to Python 3 code. If
you don't install the package, the code won't be converted. There have
also been reports on Windows machines of the wrong version being
installed.

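A quick way to check what actually got installed is to ask the package
for its version (a minimal check; the version number shown is just an
example)::

    import bs4
    print(bs4.__version__)
    # 4.1.2
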
If you get the ``ImportError`` "No module named HTMLParser", your
problem is that you're running the Python 2 version of the code under
Python 3.

If you get the ``ImportError`` "No module named html.parser", your
problem is that you're running the Python 3 version of the code under
Python 2.

In both cases, your best bet is to completely remove the Beautiful
Soup installation from your system (including any directory created
when you unzipped the tarball) and try the installation again.

If you get the ``SyntaxError`` "Invalid syntax" on the line
``ROOT_TAG_NAME = u'[document]'``, you need to convert the Python 2
code to Python 3. You can do this either by installing the package:

:kbd:`$ python3 setup.py install`

or by manually running Python's ``2to3`` conversion script on the
``bs4`` directory:

:kbd:`$ 2to3-3.2 -w bs4`

.. _parser-installation:


Installing a parser
-------------------

Beautiful Soup supports the HTML parser included in Python's standard
library, but it also supports a number of third-party Python parsers.
One is the `lxml parser <http://lxml.de/>`_. Depending on your setup,
you might install lxml with one of these commands:

:kbd:`$ apt-get install python-lxml`

:kbd:`$ easy_install lxml`

:kbd:`$ pip install lxml`

Another alternative is the pure-Python `html5lib parser
<http://code.google.com/p/html5lib/>`_, which parses HTML the way a
web browser does. Depending on your setup, you might install html5lib
with one of these commands:

:kbd:`$ apt-get install python-html5lib`

:kbd:`$ easy_install html5lib`

:kbd:`$ pip install html5lib`

This table summarizes the advantages and disadvantages of each parser library:

+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| Parser               | Typical usage                              | Advantages                     | Disadvantages            |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| Python's html.parser | ``BeautifulSoup(markup, "html.parser")``   | * Batteries included           | * Not very lenient       |
|                      |                                            | * Decent speed                 |   (before Python 2.7.3   |
|                      |                                            | * Lenient (as of Python 2.7.3  |   or 3.2.2)              |
|                      |                                            |   and 3.2.2)                   |                          |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| lxml's HTML parser   | ``BeautifulSoup(markup, "lxml")``          | * Very fast                    | * External C dependency  |
|                      |                                            | * Lenient                      |                          |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| lxml's XML parser    | ``BeautifulSoup(markup, ["lxml", "xml"])`` | * Very fast                    | * External C dependency  |
|                      | ``BeautifulSoup(markup, "xml")``           | * The only currently supported |                          |
|                      |                                            |   XML parser                   |                          |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| html5lib             | ``BeautifulSoup(markup, "html5lib")``      | * Extremely lenient            | * Very slow              |
|                      |                                            | * Parses pages the same way a  | * External Python        |
|                      |                                            |   web browser does             |   dependency             |
|                      |                                            | * Creates valid HTML5          |                          |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+

If you can, I recommend you install and use lxml for speed. If you're
using a version of Python 2 earlier than 2.7.3, or a version of Python
3 earlier than 3.2.2, it's `essential` that you install lxml or
html5lib--Python's built-in HTML parser is just not very good in older
versions.

Note that if a document is invalid, different parsers will generate
different Beautiful Soup trees for it. See `Differences
between parsers`_ for details.

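To give you a taste of what that means, here's a sketch of one invalid
fragment parsed two ways. It assumes you have both lxml and html5lib
installed::

    from bs4 import BeautifulSoup

    # lxml drops the stray </p> tag entirely.
    print(BeautifulSoup("<a></p>", "lxml"))
    # <html><body><a></a></body></html>

    # html5lib repairs it into an empty paragraph, the way a browser
    # would.
    print(BeautifulSoup("<a></p>", "html5lib"))
    # <html><head></head><body><a><p></p></a></body></html>
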
Making the soup
===============

To parse a document, pass it into the ``BeautifulSoup``
constructor. You can pass in a string or an open filehandle::

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(open("index.html"))

    soup = BeautifulSoup("<html>data</html>")

First, the document is converted to Unicode, and HTML entities are
converted to Unicode characters::

    BeautifulSoup("Sacr&eacute; bleu!")
    # <html><head></head><body>Sacré bleu!</body></html>

Beautiful Soup then parses the document using the best available
parser. It will use an HTML parser unless you specifically tell it to
use an XML parser. (See `Parsing XML`_.)

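Here's a minimal sketch of asking for XML parsing explicitly; it
assumes lxml is installed, since lxml provides the "xml" feature::

    from bs4 import BeautifulSoup

    # With the "xml" parser the markup is treated as XML: it isn't
    # wrapped in <html> and <body> the way an HTML parse would be.
    soup = BeautifulSoup("<document><title>Soup of the Evening</title></document>", "xml")
    soup.title
    # <title>Soup of the Evening</title>
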
Kinds of objects
================

Beautiful Soup transforms a complex HTML document into a complex tree
of Python objects. But you'll only ever have to deal with about four
`kinds` of objects.

.. _Tag:

``Tag``
-------

A ``Tag`` object corresponds to an XML or HTML tag in the original document::

    soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
    tag = soup.b
    type(tag)
    # <class 'bs4.element.Tag'>

Tags have a lot of attributes and methods, and I'll cover most of them
in `Navigating the tree`_ and `Searching the tree`_. For now, the most
important features of a tag are its name and attributes.

Name
^^^^

Every tag has a name, accessible as ``.name``::

    tag.name
    # u'b'

If you change a tag's name, the change will be reflected in any HTML
markup generated by Beautiful Soup::

    tag.name = "blockquote"
    tag
    # <blockquote class="boldest">Extremely bold</blockquote>

Attributes
^^^^^^^^^^

A tag may have any number of attributes. The tag ``<b
class="boldest">`` has an attribute "class" whose value is
"boldest". You can access a tag's attributes by treating the tag like
a dictionary::

    tag['class']
    # u'boldest'

You can access that dictionary directly as ``.attrs``::

    tag.attrs
    # {u'class': u'boldest'}

You can add, remove, and modify a tag's attributes. Again, this is
done by treating the tag as a dictionary::

    tag['class'] = 'verybold'
    tag['id'] = 1
    tag
    # <blockquote class="verybold" id="1">Extremely bold</blockquote>

    del tag['class']
    del tag['id']
    tag
    # <blockquote>Extremely bold</blockquote>

    tag['class']
    # KeyError: 'class'
    print(tag.get('class'))
    # None

.. _multivalue:

Multi-valued attributes
&&&&&&&&&&&&&&&&&&&&&&&

HTML 4 defines a few attributes that can have multiple values. HTML 5
removes a couple of them, but defines a few more. The most common
multi-valued attribute is ``class`` (that is, a tag can have more than
one CSS class). Others include ``rel``, ``rev``, ``accept-charset``,
``headers``, and ``accesskey``. Beautiful Soup presents the value(s)
of a multi-valued attribute as a list::

    css_soup = BeautifulSoup('<p class="body strikeout"></p>')
    css_soup.p['class']
    # ["body", "strikeout"]

    css_soup = BeautifulSoup('<p class="body"></p>')
    css_soup.p['class']
    # ["body"]

If an attribute `looks` like it has more than one value, but it's not
a multi-valued attribute as defined by any version of the HTML
standard, Beautiful Soup will leave the attribute alone::

    id_soup = BeautifulSoup('<p id="my id"></p>')
    id_soup.p['id']
    # 'my id'

When you turn a tag back into a string, multiple attribute values are
consolidated::

    rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
    rel_soup.a['rel']
    # ['index']
    rel_soup.a['rel'] = ['index', 'contents']
    print(rel_soup.p)
    # <p>Back to the <a rel="index contents">homepage</a></p>

If you parse a document as XML, there are no multi-valued attributes::

    xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
    xml_soup.p['class']
    # u'body strikeout'

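Back in HTML, the fact that ``class`` comes back as a list means a
membership test is the natural way to check for one class among
several. A small illustrative check (not part of the original
examples)::

    css_soup = BeautifulSoup('<p class="body strikeout"></p>')
    # The value is a list, so test membership rather than equality.
    "strikeout" in css_soup.p['class']
    # True
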
``NavigableString``
-------------------

A string corresponds to a bit of text within a tag. Beautiful Soup
uses the ``NavigableString`` class to contain these bits of text::

    tag.string
    # u'Extremely bold'
    type(tag.string)
    # <class 'bs4.element.NavigableString'>

A ``NavigableString`` is just like a Python Unicode string, except
that it also supports some of the features described in `Navigating
the tree`_ and `Searching the tree`_. You can convert a
``NavigableString`` to a Unicode string with ``unicode()``::

    unicode_string = unicode(tag.string)
    unicode_string
    # u'Extremely bold'
    type(unicode_string)
    # <type 'unicode'>

You can't edit a string in place, but you can replace one string with
another, using :ref:`replace_with`::

    tag.string.replace_with("No longer bold")
    tag
    # <blockquote>No longer bold</blockquote>

``NavigableString`` supports most of the features described in
`Navigating the tree`_ and `Searching the tree`_, but not all of
them. In particular, since a string can't contain anything (the way a
tag may contain a string or another tag), strings don't support the
``.contents`` or ``.string`` attributes, or the ``find()`` method.

If you want to use a ``NavigableString`` outside of Beautiful Soup,
you should call ``unicode()`` on it to turn it into a normal Python
Unicode string. If you don't, your string will carry around a
reference to the entire Beautiful Soup parse tree, even when you're
done using Beautiful Soup. This is a big waste of memory.

``BeautifulSoup``
-----------------

The ``BeautifulSoup`` object itself represents the document as a
whole. For most purposes, you can treat it as a :ref:`Tag`
object. This means it supports most of the methods described in
`Navigating the tree`_ and `Searching the tree`_.

Since the ``BeautifulSoup`` object doesn't correspond to an actual
HTML or XML tag, it has no name and no attributes. But sometimes it's
useful to look at its ``.name``, so it's been given the special
``.name`` "[document]"::

    soup.name
    # u'[document]'

Comments and other special strings
----------------------------------

``Tag``, ``NavigableString``, and ``BeautifulSoup`` cover almost
everything you'll see in an HTML or XML file, but there are a few
leftover bits. The only one you'll probably ever need to worry about
is the comment::

    markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
    soup = BeautifulSoup(markup)
    comment = soup.b.string
    type(comment)
    # <class 'bs4.element.Comment'>

The ``Comment`` object is just a special type of ``NavigableString``::

    comment
    # u'Hey, buddy. Want to buy a used parser?'

But when it appears as part of an HTML document, a ``Comment`` is
displayed with special formatting::

    print(soup.b.prettify())
    # <b>
    #  <!--Hey, buddy. Want to buy a used parser?-->
    # </b>

Beautiful Soup defines classes for anything else that might show up in
an XML document: ``CData``, ``ProcessingInstruction``,
``Declaration``, and ``Doctype``. Just like ``Comment``, these classes
are subclasses of ``NavigableString`` that add something extra to the
string. Here's an example that replaces the comment with a CDATA
block::

    from bs4 import CData
    cdata = CData("A CDATA block")
    comment.replace_with(cdata)

    print(soup.b.prettify())
    # <b>
    #  <![CDATA[A CDATA block]]>
    # </b>

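Since ``Comment`` and its relatives are just subclasses of
``NavigableString``, you can fish every comment out of a document with
an ``isinstance`` check. A short sketch::

    from bs4 import BeautifulSoup, Comment

    markup = "<b><!--one--></b><i><!--two--></i>"
    soup = BeautifulSoup(markup)

    # find_all(text=...) accepts a function; this one keeps only the
    # strings that are actually Comment objects.
    soup.find_all(text=lambda text: isinstance(text, Comment))
    # [u'one', u'two']
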
Navigating the tree
===================

Here's the "Three sisters" HTML document again::

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>

    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc)

I'll use this as an example to show you how to move from one part of
a document to another.

Going down
----------

Tags may contain strings and other tags. These elements are the tag's
`children`. Beautiful Soup provides a lot of different attributes for
navigating and iterating over a tag's children.

Note that Beautiful Soup strings don't support any of these
attributes, because a string can't have children.

Navigating using tag names
^^^^^^^^^^^^^^^^^^^^^^^^^^

The simplest way to navigate the parse tree is to say the name of the
tag you want. If you want the <head> tag, just say ``soup.head``::

    soup.head
    # <head><title>The Dormouse's story</title></head>

    soup.title
    # <title>The Dormouse's story</title>

You can use this trick again and again to zoom in on a certain part
of the parse tree. This code gets the first <b> tag beneath the <body> tag::

    soup.body.b
    # <b>The Dormouse's story</b>

Using a tag name as an attribute will give you only the `first` tag by that
name::

    soup.a
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

If you need to get `all` the <a> tags, or anything more complicated
than the first tag with a certain name, you'll need to use one of the
methods described in `Searching the tree`_, such as ``find_all()``::

    soup.find_all('a')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

``.contents`` and ``.children``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A tag's children are available in a list called ``.contents``::

    head_tag = soup.head
    head_tag
    # <head><title>The Dormouse's story</title></head>

    head_tag.contents
    # [<title>The Dormouse's story</title>]

    title_tag = head_tag.contents[0]
    title_tag
    # <title>The Dormouse's story</title>
    title_tag.contents
    # [u'The Dormouse's story']

The ``BeautifulSoup`` object itself has children. In this case, the
<html> tag is the child of the ``BeautifulSoup`` object::

    len(soup.contents)
    # 1
    soup.contents[0].name
    # u'html'

A string does not have ``.contents``, because it can't contain
anything::

    text = title_tag.contents[0]
    text.contents
    # AttributeError: 'NavigableString' object has no attribute 'contents'

Instead of getting them as a list, you can iterate over a tag's
children using the ``.children`` generator::

    for child in title_tag.children:
        print(child)
    # The Dormouse's story

``.descendants``
^^^^^^^^^^^^^^^^

The ``.contents`` and ``.children`` attributes only consider a tag's
`direct` children. For instance, the <head> tag has a single direct
child--the <title> tag::

    head_tag.contents
    # [<title>The Dormouse's story</title>]

But the <title> tag itself has a child: the string "The Dormouse's
story". There's a sense in which that string is also a child of the
<head> tag. The ``.descendants`` attribute lets you iterate over `all`
of a tag's children, recursively: its direct children, the children of
its direct children, and so on::

    for child in head_tag.descendants:
        print(child)
    # <title>The Dormouse's story</title>
    # The Dormouse's story

The <head> tag has only one child, but it has two descendants: the
<title> tag and the <title> tag's child. The ``BeautifulSoup`` object
only has one direct child (the <html> tag), but it has a whole lot of
descendants::

    len(list(soup.children))
    # 1
    len(list(soup.descendants))
    # 25

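If you want to know how many of those descendants are tags rather than
strings, a throwaway check like this works (a sketch, not from the
original text)::

    from bs4 import Tag

    # Eleven of the 25 descendants are tags; the rest are strings.
    len([d for d in soup.descendants if isinstance(d, Tag)])
    # 11
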
.. _.string:

``.string``
^^^^^^^^^^^

If a tag has only one child, and that child is a ``NavigableString``,
the child is made available as ``.string``::

    title_tag.string
    # u'The Dormouse's story'

If a tag's only child is another tag, and `that` tag has a
``.string``, then the parent tag is considered to have the same
``.string`` as its child::

    head_tag.contents
    # [<title>The Dormouse's story</title>]

    head_tag.string
    # u'The Dormouse's story'

If a tag contains more than one thing, then it's not clear what
``.string`` should refer to, so ``.string`` is defined to be
``None``::

    print(soup.html.string)
    # None

.. _string-generators:

``.strings`` and ``stripped_strings``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If there's more than one thing inside a tag, you can still look at
just the strings. Use the ``.strings`` generator::

    for string in soup.strings:
        print(repr(string))
    # u"The Dormouse's story"
    # u'\n\n'
    # u"The Dormouse's story"
    # u'\n\n'
    # u'Once upon a time there were three little sisters; and their names were\n'
    # u'Elsie'
    # u',\n'
    # u'Lacie'
    # u' and\n'
    # u'Tillie'
    # u';\nand they lived at the bottom of a well.'
    # u'\n\n'
    # u'...'
    # u'\n'

These strings tend to have a lot of extra whitespace, which you can
remove by using the ``.stripped_strings`` generator instead::

    for string in soup.stripped_strings:
        print(repr(string))
    # u"The Dormouse's story"
    # u"The Dormouse's story"
    # u'Once upon a time there were three little sisters; and their names were'
    # u'Elsie'
    # u','
    # u'Lacie'
    # u'and'
    # u'Tillie'
    # u';\nand they lived at the bottom of a well.'
    # u'...'

Here, strings consisting entirely of whitespace are ignored, and
whitespace at the beginning and end of strings is removed.

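If what you actually want is one readable block of text, a common
pattern is to join the stripped strings back together. A small sketch
(output abbreviated)::

    " ".join(soup.stripped_strings)
    # u"The Dormouse's story The Dormouse's story Once upon a time ..."
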
Going up
--------

Continuing the "family tree" analogy, every tag and every string has a
`parent`: the tag that contains it.

.. _.parent:

``.parent``
^^^^^^^^^^^

You can access an element's parent with the ``.parent`` attribute. In
the example "three sisters" document, the <head> tag is the parent
of the <title> tag::

    title_tag = soup.title
    title_tag
    # <title>The Dormouse's story</title>
    title_tag.parent
    # <head><title>The Dormouse's story</title></head>

The title string itself has a parent: the <title> tag that contains
it::

    title_tag.string.parent
    # <title>The Dormouse's story</title>

The parent of a top-level tag like <html> is the ``BeautifulSoup`` object
itself::

    html_tag = soup.html
    type(html_tag.parent)
    # <class 'bs4.BeautifulSoup'>

And the ``.parent`` of a ``BeautifulSoup`` object is defined as None::

    print(soup.parent)
    # None

.. _.parents:

``.parents``
^^^^^^^^^^^^

You can iterate over all of an element's parents with
``.parents``. This example uses ``.parents`` to travel from an <a> tag
buried deep within the document, to the very top of the document::

    link = soup.a
    link
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    for parent in link.parents:
        if parent is None:
            print(parent)
        else:
            print(parent.name)
    # p
    # body
    # html
    # [document]
    # None

Going sideways
--------------

Consider a simple document like this::

    sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>")
    print(sibling_soup.prettify())
    # <html>
    #  <body>
    #   <a>
    #    <b>
    #     text1
    #    </b>
    #    <c>
    #     text2
    #    </c>
    #   </a>
    #  </body>
    # </html>

The <b> tag and the <c> tag are at the same level: they're both direct
children of the same tag. We call them `siblings`. When a document is
pretty-printed, siblings show up at the same indentation level. You
can also use this relationship in the code you write.

``.next_sibling`` and ``.previous_sibling``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can use ``.next_sibling`` and ``.previous_sibling`` to navigate
between page elements that are on the same level of the parse tree::

    sibling_soup.b.next_sibling
    # <c>text2</c>

    sibling_soup.c.previous_sibling
    # <b>text1</b>

The <b> tag has a ``.next_sibling``, but no ``.previous_sibling``,
because there's nothing before the <b> tag `on the same level of the
tree`. For the same reason, the <c> tag has a ``.previous_sibling``
but no ``.next_sibling``::

    print(sibling_soup.b.previous_sibling)
    # None
    print(sibling_soup.c.next_sibling)
    # None

The strings "text1" and "text2" are `not` siblings, because they don't
have the same parent::

    sibling_soup.b.string
    # u'text1'

    print(sibling_soup.b.string.next_sibling)
    # None

In real documents, the ``.next_sibling`` or ``.previous_sibling`` of a
tag will usually be a string containing whitespace. Going back to the
"three sisters" document::

    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

You might think that the ``.next_sibling`` of the first <a> tag would
be the second <a> tag. But actually, it's a string: the comma and
newline that separate the first <a> tag from the second::

    link = soup.a
    link
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    link.next_sibling
    # u',\n'

The second <a> tag is actually the ``.next_sibling`` of the comma::

    link.next_sibling.next_sibling
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

.. _sibling-generators:

``.next_siblings`` and ``.previous_siblings``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can iterate over a tag's siblings with ``.next_siblings`` or
``.previous_siblings``::

    for sibling in soup.a.next_siblings:
        print(repr(sibling))
    # u',\n'
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    # u' and\n'
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    # u'; and they lived at the bottom of a well.'
    # None

    for sibling in soup.find(id="link3").previous_siblings:
        print(repr(sibling))
    # u' and\n'
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    # u',\n'
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    # u'Once upon a time there were three little sisters; and their names were\n'
    # None

Going back and forth
--------------------

Take a look at the beginning of the "three sisters" document::

    <html><head><title>The Dormouse's story</title></head>
    <p class="title"><b>The Dormouse's story</b></p>

An HTML parser takes this string of characters and turns it into a
series of events: "open an <html> tag", "open a <head> tag", "open a
<title> tag", "add a string", "close the <title> tag", "open a <p>
tag", and so on. Beautiful Soup offers tools for reconstructing the
initial parse of the document.

.. _element-generators:

``.next_element`` and ``.previous_element``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``.next_element`` attribute of a string or tag points to whatever
was parsed immediately afterwards. It might be the same as
``.next_sibling``, but it's usually drastically different.

Here's the final <a> tag in the "three sisters" document. Its
``.next_sibling`` is a string: the conclusion of the sentence that was
interrupted by the start of the <a> tag::

    last_a_tag = soup.find("a", id="link3")
    last_a_tag
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

    last_a_tag.next_sibling
    # '; and they lived at the bottom of a well.'

But the ``.next_element`` of that <a> tag, the thing that was parsed
immediately after the <a> tag, is `not` the rest of that sentence:
it's the word "Tillie"::

    last_a_tag.next_element
    # u'Tillie'

That's because in the original markup, the word "Tillie" appeared
before that semicolon. The parser encountered an <a> tag, then the
word "Tillie", then the closing </a> tag, then the semicolon and rest of
the sentence. The semicolon is on the same level as the <a> tag, but the
word "Tillie" was encountered first.

The ``.previous_element`` attribute is the exact opposite of
``.next_element``. It points to whatever element was parsed
immediately before this one::

    last_a_tag.previous_element
    # u' and\n'
    last_a_tag.previous_element.next_element
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

``.next_elements`` and ``.previous_elements``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You should get the idea by now. You can use these iterators to move
forward or backward in the document as it was parsed::

    for element in last_a_tag.next_elements:
        print(repr(element))
    # u'Tillie'
    # u';\nand they lived at the bottom of a well.'
    # u'\n\n'
    # <p class="story">...</p>
    # u'...'
    # u'\n'
    # None

Searching the tree
==================

Beautiful Soup defines a lot of methods for searching the parse tree,
but they're all very similar. I'm going to spend a lot of time explaining
the two most popular methods: ``find()`` and ``find_all()``. The other
methods take almost exactly the same arguments, so I'll just cover
them briefly.

Once again, I'll be using the "three sisters" document as an example::

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>

    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc)

By passing a filter into a method like ``find_all()``, you can
zoom in on the parts of the document you're interested in.

Kinds of filters
----------------

Before talking in detail about ``find_all()`` and similar methods, I
want to show examples of different filters you can pass into these
methods. These filters show up again and again, throughout the
search API. You can use them to filter based on a tag's name,
on its attributes, on the text of a string, or on some combination of
these.

.. _a string:

A string
^^^^^^^^

The simplest filter is a string. Pass a string to a search method and
Beautiful Soup will perform a match against that exact string. This
code finds all the <b> tags in the document::

    soup.find_all('b')
    # [<b>The Dormouse's story</b>]

If you pass in a byte string, Beautiful Soup will assume the string is
encoded as UTF-8. You can avoid this by passing in a Unicode string instead.

.. _a regular expression:

A regular expression
^^^^^^^^^^^^^^^^^^^^

If you pass in a regular expression object, Beautiful Soup will filter
against that regular expression using its ``search()`` method. This code
finds all the tags whose names start with the letter "b"; in this
case, the <body> tag and the <b> tag::

    import re
    for tag in soup.find_all(re.compile("^b")):
        print(tag.name)
    # body
    # b

This code finds all the tags whose names contain the letter 't'::

    for tag in soup.find_all(re.compile("t")):
        print(tag.name)
    # html
    # title

.. _a list:

A list
^^^^^^

If you pass in a list, Beautiful Soup will allow a string match
against `any` item in that list. This code finds all the <a> tags
`and` all the <b> tags::

    soup.find_all(["a", "b"])
    # [<b>The Dormouse's story</b>,
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

.. _the value True:

``True``
^^^^^^^^

The value ``True`` matches everything it can. This code finds `all`
the tags in the document, but none of the text strings::

    for tag in soup.find_all(True):
        print(tag.name)
    # html
    # head
    # title
    # body
    # p
    # b
    # p
    # a
    # a
    # a
    # p

.. _a function:

A function
^^^^^^^^^^

If none of the other matches work for you, define a function that
takes an element as its only argument. The function should return
``True`` if the argument matches, and ``False`` otherwise.

Here's a function that returns ``True`` if a tag defines the "class"
attribute but doesn't define the "id" attribute::

    def has_class_but_no_id(tag):
        return tag.has_attr('class') and not tag.has_attr('id')

Pass this function into ``find_all()`` and you'll pick up all the <p>
tags::

    soup.find_all(has_class_but_no_id)
    # [<p class="title"><b>The Dormouse's story</b></p>,
    # <p class="story">Once upon a time there were...</p>,
    # <p class="story">...</p>]

This function only picks up the <p> tags. It doesn't pick up the <a>
tags, because those tags define both "class" and "id". It doesn't pick
up tags like <html> and <title>, because those tags don't define
"class".

Here's a function that returns ``True`` if a tag is surrounded by
string objects::

    from bs4 import NavigableString
    def surrounded_by_strings(tag):
        return (isinstance(tag.next_element, NavigableString)
                and isinstance(tag.previous_element, NavigableString))

    for tag in soup.find_all(surrounded_by_strings):
        print(tag.name)
    # p
    # a
    # a
    # a
    # p

Now we're ready to look at the search methods in detail.

``find_all()``
--------------

Signature: find_all(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`recursive
<recursive>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)

The ``find_all()`` method looks through a tag's descendants and
retrieves `all` descendants that match your filters. I gave several
examples in `Kinds of filters`_, but here are a few more::

    soup.find_all("title")
    # [<title>The Dormouse's story</title>]

    soup.find_all("p", "title")
    # [<p class="title"><b>The Dormouse's story</b></p>]

    soup.find_all("a")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.find_all(id="link2")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

    import re
    soup.find(text=re.compile("sisters"))
    # u'Once upon a time there were three little sisters; and their names were\n'

Some of these should look familiar, but others are new. What does it
mean to pass in a value for ``text``, or ``id``? Why does
``find_all("p", "title")`` find a <p> tag with the CSS class "title"?
Let's look at the arguments to ``find_all()``.

.. _name:

The ``name`` argument
^^^^^^^^^^^^^^^^^^^^^

Pass in a value for ``name`` and you'll tell Beautiful Soup to only
consider tags with certain names. Text strings will be ignored, as
will tags whose names don't match.

This is the simplest usage::

    soup.find_all("title")
    # [<title>The Dormouse's story</title>]

Recall from `Kinds of filters`_ that the value to ``name`` can be `a
string`_, `a regular expression`_, `a list`_, `a function`_, or `the value
True`_.

.. _kwargs:

The keyword arguments
^^^^^^^^^^^^^^^^^^^^^

Any argument that's not recognized will be turned into a filter on one
of a tag's attributes. If you pass in a value for an argument called ``id``,
Beautiful Soup will filter against each tag's 'id' attribute::

    soup.find_all(id='link2')
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass in a value for ``href``, Beautiful Soup will filter
against each tag's 'href' attribute::

    soup.find_all(href=re.compile("elsie"))
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

You can filter an attribute based on `a string`_, `a regular
expression`_, `a list`_, `a function`_, or `the value True`_.

This code finds all tags whose ``id`` attribute has a value,
regardless of what the value is::

    soup.find_all(id=True)
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

You can filter multiple attributes at once by passing in more than one
keyword argument::

    soup.find_all(href=re.compile("elsie"), id='link1')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Some attributes, like the data-* attributes in HTML 5, have names that
can't be used as the names of keyword arguments::

    data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
    data_soup.find_all(data-foo="value")
    # SyntaxError: keyword can't be an expression

You can use these attributes in searches by putting them into a
dictionary and passing the dictionary into ``find_all()`` as the
``attrs`` argument::

    data_soup.find_all(attrs={"data-foo": "value"})
    # [<div data-foo="value">foo!</div>]

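A keyword argument will also accept a function. Here's a sketch,
assuming (as in current versions of Beautiful Soup) that the function
is called with the attribute's value rather than the whole tag::

    import re

    def not_lacie(href):
        # href is the attribute value, or None if the tag has no href.
        return href and not re.compile("lacie").search(href)

    soup.find_all(href=not_lacie)
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
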
.. _attrs:

Searching by CSS class
^^^^^^^^^^^^^^^^^^^^^^

It's very useful to search for a tag that has a certain CSS class, but
the name of the CSS attribute, "class", is a reserved word in
Python. Using ``class`` as a keyword argument will give you a syntax
error. As of Beautiful Soup 4.1.2, you can search by CSS class using
the keyword argument ``class_``::

    soup.find_all("a", class_="sister")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

As with any keyword argument, you can pass ``class_`` a string, a regular
expression, a function, or ``True``::

    soup.find_all(class_=re.compile("itl"))
    # [<p class="title"><b>The Dormouse's story</b></p>]

    def has_six_characters(css_class):
        return css_class is not None and len(css_class) == 6

    soup.find_all(class_=has_six_characters)
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

:ref:`Remember <multivalue>` that a single tag can have multiple
values for its "class" attribute. When you search for a tag that
matches a certain CSS class, you're matching against `any` of its CSS
classes::

    css_soup = BeautifulSoup('<p class="body strikeout"></p>')
    css_soup.find_all("p", class_="strikeout")
    # [<p class="body strikeout"></p>]

    css_soup.find_all("p", class_="body")
    # [<p class="body strikeout"></p>]

You can also search for the exact string value of the ``class`` attribute::

    css_soup.find_all("p", class_="body strikeout")
    # [<p class="body strikeout"></p>]

But searching for variants of the string value won't work::

    css_soup.find_all("p", class_="strikeout body")
    # []

If you want to search for tags that match two or more CSS classes, you
should use a CSS selector::

    css_soup.select("p.strikeout.body")
    # [<p class="body strikeout"></p>]

In older versions of Beautiful Soup, which don't have the ``class_``
shortcut, you can use the ``attrs`` trick mentioned above. Create a
dictionary whose value for "class" is the string (or regular
expression, or whatever) you want to search for::

    soup.find_all("a", attrs={"class": "sister"})
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

.. _text:

The ``text`` argument
^^^^^^^^^^^^^^^^^^^^^

With ``text`` you can search for strings instead of tags. As with
``name`` and the keyword arguments, you can pass in `a string`_, `a
regular expression`_, `a list`_, `a function`_, or `the value True`_.
Here are some examples::

    soup.find_all(text="Elsie")
    # [u'Elsie']

    soup.find_all(text=["Tillie", "Elsie", "Lacie"])
    # [u'Elsie', u'Lacie', u'Tillie']

    soup.find_all(text=re.compile("Dormouse"))
    # [u"The Dormouse's story", u"The Dormouse's story"]

    def is_the_only_string_within_a_tag(s):
        """Return True if this string is the only child of its parent tag."""
        return (s == s.parent.string)

    soup.find_all(text=is_the_only_string_within_a_tag)
    # [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']

Although ``text`` is for finding strings, you can combine it with
arguments that find tags: Beautiful Soup will find all tags whose
``.string`` matches your value for ``text``. This code finds the <a>
tags whose ``.string`` is "Elsie"::

    soup.find_all("a", text="Elsie")
    # [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

.. _limit:

The ``limit`` argument
^^^^^^^^^^^^^^^^^^^^^^

``find_all()`` returns all the tags and strings that match your
filters. This can take a while if the document is large. If you don't
need `all` the results, you can pass in a number for ``limit``. This
works just like the LIMIT keyword in SQL. It tells Beautiful Soup to
stop gathering results after it's found a certain number.

There are three links in the "three sisters" document, but this code
only finds the first two::

    soup.find_all("a", limit=2)
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

.. _recursive:

The ``recursive`` argument
^^^^^^^^^^^^^^^^^^^^^^^^^^

If you call ``mytag.find_all()``, Beautiful Soup will examine all the
descendants of ``mytag``: its children, its children's children, and
so on. If you only want Beautiful Soup to consider direct children,
you can pass in ``recursive=False``. See the difference here::

    soup.html.find_all("title")
    # [<title>The Dormouse's story</title>]

    soup.html.find_all("title", recursive=False)
    # []

Here's that part of the document::

    <html>
     <head>
      <title>
       The Dormouse's story
      </title>
     </head>
    ...

The <title> tag is beneath the <html> tag, but it's not `directly`
beneath the <html> tag: the <head> tag is in the way. Beautiful Soup
finds the <title> tag when it's allowed to look at all descendants of
the <html> tag, but when ``recursive=False`` restricts it to the
<html> tag's immediate children, it finds nothing.

Beautiful Soup offers a lot of tree-searching methods (covered below),
and they mostly take the same arguments as ``find_all()``: ``name``,
``attrs``, ``text``, ``limit``, and the keyword arguments. But the
``recursive`` argument is different: ``find_all()`` and ``find()`` are
the only methods that support it. Passing ``recursive=False`` into a
method like ``find_parents()`` wouldn't be very useful.

Calling a tag is like calling ``find_all()``
--------------------------------------------

Because ``find_all()`` is the most popular method in the Beautiful
Soup search API, you can use a shortcut for it. If you treat the
``BeautifulSoup`` object or a ``Tag`` object as though it were a
function, then it's the same as calling ``find_all()`` on that
object. These two lines of code are equivalent::

    soup.find_all("a")
    soup("a")

These two lines are also equivalent::

    soup.title.find_all(text=True)
    soup.title(text=True)

``find()``
----------

Signature: find(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`recursive
<recursive>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)

The ``find_all()`` method scans the entire document looking for
results, but sometimes you only want to find one result. If you know a
document only has one <body> tag, it's a waste of time to scan the
entire document looking for more. Rather than passing in ``limit=1``
every time you call ``find_all``, you can use the ``find()``
method. These two lines of code are `nearly` equivalent::

    soup.find_all('title', limit=1)
    # [<title>The Dormouse's story</title>]

    soup.find('title')
    # <title>The Dormouse's story</title>

The only difference is that ``find_all()`` returns a list containing
the single result, and ``find()`` just returns the result.

If ``find_all()`` can't find anything, it returns an empty list. If
``find()`` can't find anything, it returns ``None``::

    print(soup.find("nosuchtag"))
    # None

Remember the ``soup.head.title`` trick from `Navigating using tag
names`_? That trick works by repeatedly calling ``find()``::

    soup.head.title
    # <title>The Dormouse's story</title>

    soup.find("head").find("title")
    # <title>The Dormouse's story</title>

``find_parents()`` and ``find_parent()``
----------------------------------------

Signature: find_parents(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)

Signature: find_parent(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)

I spent a lot of time above covering ``find_all()`` and
``find()``. The Beautiful Soup API defines ten other methods for
searching the tree, but don't be afraid. Five of these methods are
basically the same as ``find_all()``, and the other five are basically
the same as ``find()``. The only differences are in what parts of the
tree they search.

First let's consider ``find_parents()`` and
``find_parent()``. Remember that ``find_all()`` and ``find()`` work
their way down the tree, looking at a tag's descendants. These methods
do the opposite: they work their way `up` the tree, looking at a tag's
(or a string's) parents. Let's try them out, starting from a string
buried deep in the "three sisters" document::

    a_string = soup.find(text="Lacie")
    a_string
    # u'Lacie'

    a_string.find_parents("a")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

    a_string.find_parent("p")
    # <p class="story">Once upon a time there were three little sisters; and their names were
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    # and they lived at the bottom of a well.</p>

    a_string.find_parents("p", class_="title")
    # []

One of the three <a> tags is the direct parent of the string in
question, so our search finds it. One of the three <p> tags is an
indirect parent of the string, and our search finds that as
well. There's a <p> tag with the CSS class "title" `somewhere` in the
document, but it's not one of this string's parents, so we can't find
it with ``find_parents()``.

You may have made the connection between ``find_parent()`` and
``find_parents()``, and the `.parent`_ and `.parents`_ attributes
mentioned earlier. The connection is very strong. These search methods
actually use ``.parents`` to iterate over all the parents, and check
each one against the provided filter to see if it matches.

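To make that connection concrete, here's a rough sketch of what
``find_parent("p")`` is doing (a simplification, not the actual
implementation)::

    def first_parent_named(element, name):
        # Walk up the tree, stopping at the first parent tag whose
        # name matches.
        for parent in element.parents:
            if parent is not None and parent.name == name:
                return parent

    a_string = soup.find(text="Lacie")
    first_parent_named(a_string, "p")['class']
    # ["story"]
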
``find_next_siblings()`` and ``find_next_sibling()``
----------------------------------------------------

Signature: find_next_siblings(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)

Signature: find_next_sibling(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)

These methods use :ref:`.next_siblings <sibling-generators>` to
iterate over the rest of an element's siblings in the tree. The
``find_next_siblings()`` method returns all the siblings that match,
and ``find_next_sibling()`` only returns the first one::

    first_link = soup.a
    first_link
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    first_link.find_next_siblings("a")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    first_story_paragraph = soup.find("p", "story")
    first_story_paragraph.find_next_sibling("p")
    # <p class="story">...</p>

``find_previous_siblings()`` and ``find_previous_sibling()``
------------------------------------------------------------

Signature: find_previous_siblings(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)

Signature: find_previous_sibling(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)

These methods use :ref:`.previous_siblings <sibling-generators>` to iterate over an element's
siblings that precede it in the tree. The ``find_previous_siblings()``
method returns all the siblings that match, and
``find_previous_sibling()`` only returns the first one::

    last_link = soup.find("a", id="link3")
    last_link
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

    last_link.find_previous_siblings("a")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

    first_story_paragraph = soup.find("p", "story")
    first_story_paragraph.find_previous_sibling("p")
    # <p class="title"><b>The Dormouse's story</b></p>


``find_all_next()`` and ``find_next()``
---------------------------------------

Signature: find_all_next(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)

Signature: find_next(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)

These methods use :ref:`.next_elements <element-generators>` to
iterate over whatever tags and strings come after an element in the
document. The ``find_all_next()`` method returns all matches, and
``find_next()`` only returns the first match::

    first_link = soup.a
    first_link
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    first_link.find_all_next(text=True)
    # [u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
    # u';\nand they lived at the bottom of a well.', u'\n\n', u'...', u'\n']

    first_link.find_next("p")
    # <p class="story">...</p>

In the first example, the string "Elsie" showed up, even though it was
contained within the <a> tag we started from. In the second example,
the last <p> tag in the document showed up, even though it's not in
the same part of the tree as the <a> tag we started from. For these
methods, all that matters is that an element match the filter and
show up later in the document than the starting element.

``find_all_previous()`` and ``find_previous()``
-----------------------------------------------

Signature: find_all_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)

Signature: find_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)

These methods use :ref:`.previous_elements <element-generators>` to
iterate over the tags and strings that come before an element in the
document. The ``find_all_previous()`` method returns all matches, and
``find_previous()`` only returns the first match::

    first_link = soup.a
    first_link
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    first_link.find_all_previous("p")
    # [<p class="story">Once upon a time there were three little sisters; ...</p>,
    #  <p class="title"><b>The Dormouse's story</b></p>]

    first_link.find_previous("title")
    # <title>The Dormouse's story</title>

The call to ``find_all_previous("p")`` found the first paragraph in
the document (the one with class="title"), but it also found the
second paragraph, the <p> tag that contains the <a> tag we started
with. This shouldn't be too surprising: we're looking at all the tags
that show up earlier in the document than the one we started with. A
<p> tag that contains an <a> tag must have shown up before the <a>
tag it contains.
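Because matches come back nearest-first, ``find_previous()`` is handy
for questions like "which <a> tag came just before this one?" A quick
sketch of my own, using the "three sisters" document::

    last_link = soup.find("a", id="link3")
    # The nearest <a> tag earlier in the document is the second sister.
    last_link.find_previous("a")
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>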
CSS selectors
-------------

Beautiful Soup supports the most commonly-used `CSS selectors
<http://www.w3.org/TR/CSS2/selector.html>`_. Just pass a string into
the ``.select()`` method of a ``Tag`` object or the ``BeautifulSoup``
object itself.

You can find tags::

    soup.select("title")
    # [<title>The Dormouse's story</title>]

    soup.select("p:nth-of-type(3)")
    # [<p class="story">...</p>]

Find tags beneath other tags::

    soup.select("body a")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.select("html head title")
    # [<title>The Dormouse's story</title>]

Find tags `directly` beneath other tags::

    soup.select("head > title")
    # [<title>The Dormouse's story</title>]

    soup.select("p > a")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.select("p > a:nth-of-type(2)")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

    soup.select("p > #link1")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

    soup.select("body > a")
    # []

Find the siblings of tags::

    soup.select("#link1 ~ .sister")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.select("#link1 + .sister")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags by CSS class::

    soup.select(".sister")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.select("[class~=sister]")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by ID::

    soup.select("#link1")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

    soup.select("a#link2")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Test for the existence of an attribute::

    soup.select('a[href]')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by attribute value::

    soup.select('a[href="http://example.com/elsie"]')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

    soup.select('a[href^="http://example.com/"]')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.select('a[href$="tillie"]')
    # [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.select('a[href*=".com/el"]')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Match language codes::

    multilingual_markup = """
     <p lang="en">Hello</p>
     <p lang="en-us">Howdy, y'all</p>
     <p lang="en-gb">Pip-pip, old fruit</p>
     <p lang="fr">Bonjour mes amis</p>
    """
    multilingual_soup = BeautifulSoup(multilingual_markup)
    multilingual_soup.select('p[lang|=en]')
    # [<p lang="en">Hello</p>,
    #  <p lang="en-us">Howdy, y'all</p>,
    #  <p lang="en-gb">Pip-pip, old fruit</p>]

This is a convenience for users who know the CSS selector syntax. You
can do all this stuff with the Beautiful Soup API. And if CSS
selectors are all you need, you might as well use lxml directly,
because it's faster. But this lets you `combine` simple CSS selectors
with the Beautiful Soup API.
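For example, here's a small sketch of my own that uses a selector to
narrow the field to links directly inside paragraphs, then uses the
ordinary API on each result::

    for link in soup.select("p > a"):
        print(link.get("href"))
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie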
Modifying the tree
==================

Beautiful Soup's main strength is in searching the parse tree, but you
can also modify the tree and write your changes out as a new HTML or
XML document.

Changing tag names and attributes
---------------------------------

I covered this earlier, in `Attributes`_, but it bears repeating. You
can rename a tag, change the values of its attributes, add new
attributes, and delete attributes::

    soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
    tag = soup.b

    tag.name = "blockquote"
    tag['class'] = 'verybold'
    tag['id'] = 1
    tag
    # <blockquote class="verybold" id="1">Extremely bold</blockquote>

    del tag['class']
    del tag['id']
    tag
    # <blockquote>Extremely bold</blockquote>

Modifying ``.string``
---------------------

If you set a tag's ``.string`` attribute, the tag's contents are
replaced with the string you give::

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup)

    tag = soup.a
    tag.string = "New link text."
    tag
    # <a href="http://example.com/">New link text.</a>

Be careful: if the tag contained other tags, they and all their
contents will be destroyed.

``append()``
------------

You can add to a tag's contents with ``Tag.append()``. It works just
like calling ``.append()`` on a Python list::

    soup = BeautifulSoup("<a>Foo</a>")
    soup.a.append("Bar")

    soup
    # <html><head></head><body><a>FooBar</a></body></html>
    soup.a.contents
    # [u'Foo', u'Bar']

``BeautifulSoup.new_string()`` and ``.new_tag()``
-------------------------------------------------

If you need to add a string to a document, no problem--you can pass a
Python string in to ``append()``, or you can call the factory method
``BeautifulSoup.new_string()``::

    soup = BeautifulSoup("<b></b>")
    tag = soup.b
    tag.append("Hello")
    new_string = soup.new_string(" there")
    tag.append(new_string)
    tag
    # <b>Hello there</b>
    tag.contents
    # [u'Hello', u' there']

If you want to create a comment or some other subclass of
``NavigableString``, pass that class as the second argument to
``new_string()``::

    from bs4 import Comment
    new_comment = soup.new_string("Nice to see you.", Comment)
    tag.append(new_comment)
    tag
    # <b>Hello there<!--Nice to see you.--></b>
    tag.contents
    # [u'Hello', u' there', u'Nice to see you.']

(This is a new feature in Beautiful Soup 4.2.1.)

What if you need to create a whole new tag? The best solution is to
call the factory method ``BeautifulSoup.new_tag()``::

    soup = BeautifulSoup("<b></b>")
    original_tag = soup.b

    new_tag = soup.new_tag("a", href="http://www.example.com")
    original_tag.append(new_tag)
    original_tag
    # <b><a href="http://www.example.com"></a></b>

    new_tag.string = "Link text."
    original_tag
    # <b><a href="http://www.example.com">Link text.</a></b>

Only the first argument, the tag name, is required.
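Combining ``new_tag()``, ``new_string()``, and ``append()``, you can
build up a fragment from scratch. Here's a minimal sketch of my own
(the URLs are just placeholders) that assembles a list of links::

    soup = BeautifulSoup("<ul></ul>")
    for url in ["http://example.com/1", "http://example.com/2"]:
        li = soup.new_tag("li")
        link = soup.new_tag("a", href=url)
        link.string = url
        li.append(link)
        soup.ul.append(li)

    print(soup.ul.prettify())
    # <ul>
    #  <li>
    #   <a href="http://example.com/1">
    #    http://example.com/1
    #   </a>
    #  </li>
    # ...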
``insert()``
------------

``Tag.insert()`` is just like ``Tag.append()``, except the new element
doesn't necessarily go at the end of its parent's
``.contents``. It'll be inserted at whatever numeric position you
say. It works just like ``.insert()`` on a Python list::

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup)
    tag = soup.a

    tag.insert(1, "but did not endorse ")
    tag
    # <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
    tag.contents
    # [u'I linked to ', u'but did not endorse ', <i>example.com</i>]
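Since ``insert()`` takes a list index, position 0 prepends. A tiny
sketch of my own::

    soup = BeautifulSoup("<a>world</a>")
    # Insert a new string at the very front of the tag's contents.
    soup.a.insert(0, "hello, ")
    soup.a
    # <a>hello, world</a>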
``insert_before()`` and ``insert_after()``
------------------------------------------

The ``insert_before()`` method inserts a tag or string immediately
before something else in the parse tree::

    soup = BeautifulSoup("<b>stop</b>")
    tag = soup.new_tag("i")
    tag.string = "Don't"
    soup.b.string.insert_before(tag)
    soup.b
    # <b><i>Don't</i>stop</b>

The ``insert_after()`` method inserts a tag or string so that it
immediately follows something else in the parse tree::

    soup.b.i.insert_after(soup.new_string(" ever "))
    soup.b
    # <b><i>Don't</i> ever stop</b>
    soup.b.contents
    # [<i>Don't</i>, u' ever ', u'stop']

``clear()``
-----------

``Tag.clear()`` removes the contents of a tag::

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup)
    tag = soup.a

    tag.clear()
    tag
    # <a href="http://example.com/"></a>

``extract()``
-------------

``PageElement.extract()`` removes a tag or string from the tree. It
returns the tag or string that was extracted::

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup)
    a_tag = soup.a

    i_tag = soup.i.extract()

    a_tag
    # <a href="http://example.com/">I linked to</a>

    i_tag
    # <i>example.com</i>

    print(i_tag.parent)
    # None

At this point you effectively have two parse trees: one rooted at the
``BeautifulSoup`` object you used to parse the document, and one rooted
at the tag that was extracted. You can go on to call ``extract()`` on
a child of the element you extracted::

    my_string = i_tag.string.extract()
    my_string
    # u'example.com'

    print(my_string.parent)
    # None
    i_tag
    # <i></i>

``decompose()``
---------------

``Tag.decompose()`` removes a tag from the tree, then `completely
destroys it and its contents`::

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup)
    a_tag = soup.a

    soup.i.decompose()

    a_tag
    # <a href="http://example.com/">I linked to</a>

.. _replace_with:

``replace_with()``
------------------

``PageElement.replace_with()`` removes a tag or string from the tree,
and replaces it with the tag or string of your choice::

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup)
    a_tag = soup.a

    new_tag = soup.new_tag("b")
    new_tag.string = "example.net"
    a_tag.i.replace_with(new_tag)

    a_tag
    # <a href="http://example.com/">I linked to <b>example.net</b></a>

``replace_with()`` returns the tag or string that was replaced, so
that you can examine it or add it back to another part of the tree.

``wrap()``
----------

``PageElement.wrap()`` wraps an element in the tag you specify. It
returns the new wrapper::

    soup = BeautifulSoup("<p>I wish I was bold.</p>")
    soup.p.string.wrap(soup.new_tag("b"))
    # <b>I wish I was bold.</b>

    soup.p.wrap(soup.new_tag("div"))
    # <div><p><b>I wish I was bold.</b></p></div>

This method is new in Beautiful Soup 4.0.5.

``unwrap()``
------------

``Tag.unwrap()`` is the opposite of ``wrap()``. It replaces a tag with
whatever's inside that tag. It's good for stripping out markup::

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup)
    a_tag = soup.a

    a_tag.i.unwrap()
    a_tag
    # <a href="http://example.com/">I linked to example.com</a>

Like ``replace_with()``, ``unwrap()`` returns the tag
that was replaced.
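These removal methods combine naturally with the search API. As a
closing sketch for this chapter (my own example, not from the
library), here's the common trick of stripping every comment out of a
document with ``extract()``::

    from bs4 import BeautifulSoup, Comment

    soup = BeautifulSoup("<p>One<!--hidden-->two</p>")
    # find_all() with a text function matches NavigableStrings;
    # Comment is a subclass of NavigableString.
    for comment in soup.find_all(text=lambda text: isinstance(text, Comment)):
        comment.extract()
    soup.p
    # <p>Onetwo</p>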
Output
======

.. _.prettyprinting:

Pretty-printing
---------------

The ``prettify()`` method will turn a Beautiful Soup parse tree into a
nicely formatted Unicode string, with each HTML/XML tag on its own line::

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup)
    soup.prettify()
    # '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'

    print(soup.prettify())
    # <html>
    #  <head>
    #  </head>
    #  <body>
    #   <a href="http://example.com/">
    #    I linked to
    #    <i>
    #     example.com
    #    </i>
    #   </a>
    #  </body>
    # </html>

You can call ``prettify()`` on the top-level ``BeautifulSoup`` object,
or on any of its ``Tag`` objects::

    print(soup.a.prettify())
    # <a href="http://example.com/">
    #  I linked to
    #  <i>
    #   example.com
    #  </i>
    # </a>

Non-pretty printing
-------------------

If you just want a string, with no fancy formatting, you can call
``unicode()`` or ``str()`` on a ``BeautifulSoup`` object, or a ``Tag``
within it::

    str(soup)
    # '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'

    unicode(soup.a)
    # u'<a href="http://example.com/">I linked to <i>example.com</i></a>'

The ``str()`` function returns a string encoded in UTF-8. See
`Encodings`_ for other options.

You can also call ``encode()`` to get a bytestring, and ``decode()``
to get Unicode.

.. _output_formatters:

Output formatters
-----------------

If you give Beautiful Soup a document that contains HTML entities like
"&ldquo;", they'll be converted to Unicode characters::

    soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.")
    unicode(soup)
    # u'<html><head></head><body>\u201cDammit!\u201d he said.</body></html>'

If you then convert the document to a string, the Unicode characters
will be encoded as UTF-8. You won't get the HTML entities back::

    str(soup)
    # '<html><head></head><body>\xe2\x80\x9cDammit!\xe2\x80\x9d he said.</body></html>'

By default, the only characters that are escaped upon output are bare
ampersands and angle brackets. These get turned into "&amp;", "&lt;",
and "&gt;", so that Beautiful Soup doesn't inadvertently generate
invalid HTML or XML::

    soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, & Howe</p>")
    soup.p
    # <p>The law firm of Dewey, Cheatem, &amp; Howe</p>

    soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
    soup.a
    # <a href="http://example.com/?foo=val1&amp;bar=val2">A link</a>

You can change this behavior by providing a value for the
``formatter`` argument to ``prettify()``, ``encode()``, or
``decode()``. Beautiful Soup recognizes four possible values for
``formatter``.

The default is ``formatter="minimal"``. Strings will only be processed
enough to ensure that Beautiful Soup generates valid HTML/XML::

    french = "<p>Il a dit <<Sacré bleu!>></p>"
    soup = BeautifulSoup(french)
    print(soup.prettify(formatter="minimal"))
    # <html>
    #  <body>
    #   <p>
    #    Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
    #   </p>
    #  </body>
    # </html>

If you pass in ``formatter="html"``, Beautiful Soup will convert
Unicode characters to HTML entities whenever possible::

    print(soup.prettify(formatter="html"))
    # <html>
    #  <body>
    #   <p>
    #    Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
    #   </p>
    #  </body>
    # </html>
If you pass in ``formatter=None``, Beautiful Soup will not modify
strings at all on output. This is the fastest option, but it may lead
to Beautiful Soup generating invalid HTML/XML, as in these examples::

    print(soup.prettify(formatter=None))
    # <html>
    #  <body>
    #   <p>
    #    Il a dit <<Sacré bleu!>>
    #   </p>
    #  </body>
    # </html>

    link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
    print(link_soup.a.encode(formatter=None))
    # <a href="http://example.com/?foo=val1&bar=val2">A link</a>

Finally, if you pass in a function for ``formatter``, Beautiful Soup
will call that function once for every string and attribute value in
the document. You can do whatever you want in this function. Here's a
formatter that converts strings to uppercase and does absolutely
nothing else::

    def uppercase(str):
        return str.upper()

    print(soup.prettify(formatter=uppercase))
    # <html>
    #  <body>
    #   <p>
    #    IL A DIT <<SACRÉ BLEU!>>
    #   </p>
    #  </body>
    # </html>

    print(link_soup.a.prettify(formatter=uppercase))
    # <a href="HTTP://EXAMPLE.COM/?FOO=VAL1&BAR=VAL2">
    #  A LINK
    # </a>

If you're writing your own function, you should know about the
``EntitySubstitution`` class in the ``bs4.dammit`` module. This class
implements Beautiful Soup's standard formatters as class methods: the
"html" formatter is ``EntitySubstitution.substitute_html``, and the
"minimal" formatter is ``EntitySubstitution.substitute_xml``. You can
use these functions to simulate ``formatter="html"`` or
``formatter="minimal"``, but then do something extra.

Here's an example that replaces Unicode characters with HTML entities
whenever possible, but `also` converts all strings to uppercase::

    from bs4.dammit import EntitySubstitution
    def uppercase_and_substitute_html_entities(str):
        return EntitySubstitution.substitute_html(str.upper())

    print(soup.prettify(formatter=uppercase_and_substitute_html_entities))
    # <html>
    #  <body>
    #   <p>
    #    IL A DIT &lt;&lt;SACR&Eacute; BLEU!&gt;&gt;
    #   </p>
    #  </body>
    # </html>

One last caveat: if you create a ``CData`` object, the text inside
that object is always presented `exactly as it appears, with no
formatting`. Beautiful Soup will call the formatter method, just in
case you've written a custom method that counts all the strings in the
document or something, but it will ignore the return value::

    from bs4.element import CData
    soup = BeautifulSoup("<a></a>")
    soup.a.string = CData("one < three")
    print(soup.a.prettify(formatter="xml"))
    # <a>
    #  <![CDATA[one < three]]>
    # </a>
``get_text()``
--------------

If you only want the text part of a document or tag, you can use the
``get_text()`` method. It returns all the text in a document or
beneath a tag, as a single Unicode string::

    markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
    soup = BeautifulSoup(markup)

    soup.get_text()
    # u'\nI linked to example.com\n'

    soup.i.get_text()
    # u'example.com'

You can specify a string to be used to join the bits of text
together::

    soup.get_text("|")
    # u'\nI linked to |example.com|\n'

You can tell Beautiful Soup to strip whitespace from the beginning and
end of each bit of text::

    soup.get_text("|", strip=True)
    # u'I linked to|example.com'

But at that point you might want to use the :ref:`.stripped_strings <string-generators>`
generator instead, and process the text yourself::

    [text for text in soup.stripped_strings]
    # [u'I linked to', u'example.com']
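For instance (a sketch of my own), joining ``.stripped_strings`` with
a single space is a quick way to get normalized, whitespace-collapsed
text::

    " ".join(soup.stripped_strings)
    # u'I linked to example.com'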
Specifying the parser to use
============================

If you just need to parse some HTML, you can dump the markup into the
``BeautifulSoup`` constructor, and it'll probably be fine. Beautiful
Soup will pick a parser for you and parse the data. But there are a
few additional arguments you can pass in to the constructor to change
which parser is used.

The first argument to the ``BeautifulSoup`` constructor is a string or
an open filehandle--the markup you want parsed. The second argument is
`how` you'd like the markup parsed.

If you don't specify anything, you'll get the best HTML parser that's
installed. Beautiful Soup ranks lxml's parser as being the best, then
html5lib's, then Python's built-in parser. You can override this by
specifying one of the following:

* What type of markup you want to parse. Currently supported are
  "html", "xml", and "html5".

* The name of the parser library you want to use. Currently supported
  options are "lxml", "html5lib", and "html.parser" (Python's
  built-in HTML parser).

The section `Installing a parser`_ contrasts the supported parsers.

If you don't have an appropriate parser installed, Beautiful Soup will
ignore your request and pick a different parser. Right now, the only
supported XML parser is lxml. If you don't have lxml installed, asking
for an XML parser won't give you one, and asking for "lxml" won't work
either.

Differences between parsers
---------------------------

Beautiful Soup presents the same interface to a number of different
parsers, but each parser is different. Different parsers will create
different parse trees from the same document. The biggest differences
are between the HTML parsers and the XML parsers. Here's a short
document, parsed as HTML::

    BeautifulSoup("<a><b /></a>")
    # <html><head></head><body><a><b></b></a></body></html>

Since an empty <b /> tag is not valid HTML, the parser turns it into a
<b></b> tag pair.

Here's the same document parsed as XML (running this requires that you
have lxml installed). Note that the empty <b /> tag is left alone, and
that the document is given an XML declaration instead of being put
into an <html> tag::

    BeautifulSoup("<a><b /></a>", "xml")
    # <?xml version="1.0" encoding="utf-8"?>
    # <a><b/></a>

There are also differences between HTML parsers. If you give Beautiful
Soup a perfectly-formed HTML document, these differences won't
matter. One parser will be faster than another, but they'll all give
you a data structure that looks exactly like the original HTML
document.

But if the document is not perfectly-formed, different parsers will
give different results. Here's a short, invalid document parsed using
lxml's HTML parser. Note that the dangling </p> tag is simply
ignored::

    BeautifulSoup("<a></p>", "lxml")
    # <html><body><a></a></body></html>

Here's the same document parsed using html5lib::

    BeautifulSoup("<a></p>", "html5lib")
    # <html><head></head><body><a><p></p></a></body></html>

Instead of ignoring the dangling </p> tag, html5lib pairs it with an
opening <p> tag. This parser also adds an empty <head> tag to the
document.

Here's the same document parsed with Python's built-in HTML
parser::

    BeautifulSoup("<a></p>", "html.parser")
    # <a></a>

Like lxml, this parser ignores the closing </p> tag. Unlike
html5lib, this parser makes no attempt to create a well-formed HTML
document by adding a <body> tag. Unlike lxml, it doesn't even bother
to add an <html> tag.

Since the document "<a></p>" is invalid, none of these techniques is
the "correct" way to handle it. The html5lib parser uses techniques
that are part of the HTML5 standard, so it has the best claim on being
the "correct" way, but all three techniques are legitimate.

Differences between parsers can affect your script. If you're planning
on distributing your script to other people, or running it on multiple
machines, you should specify a parser in the ``BeautifulSoup``
constructor. That will reduce the chances that your users parse a
document differently from the way you parse it.

Encodings
=========

Any HTML or XML document is written in a specific encoding like ASCII
or UTF-8. But when you load that document into Beautiful Soup, you'll
discover it's been converted to Unicode::

    markup = "<h1>Sacr\xc3\xa9 bleu!</h1>"
    soup = BeautifulSoup(markup)
    soup.h1
    # <h1>Sacré bleu!</h1>
    soup.h1.string
    # u'Sacr\xe9 bleu!'

It's not magic. (That sure would be nice.) Beautiful Soup uses a
sub-library called `Unicode, Dammit`_ to detect a document's encoding
and convert it to Unicode. The autodetected encoding is available as
the ``.original_encoding`` attribute of the ``BeautifulSoup`` object::

    soup.original_encoding
    # 'utf-8'

Unicode, Dammit guesses correctly most of the time, but sometimes it
makes mistakes. Sometimes it guesses correctly, but only after a
byte-by-byte search of the document that takes a very long time. If
you happen to know a document's encoding ahead of time, you can avoid
mistakes and delays by passing it to the ``BeautifulSoup`` constructor
as ``from_encoding``.
Here's a document written in ISO-8859-8. The document is so short that
Unicode, Dammit can't get a good lock on it, and misidentifies it as
ISO-8859-7::

    markup = b"<h1>\xed\xe5\xec\xf9</h1>"
    soup = BeautifulSoup(markup)
    soup.h1
    # <h1>νεμω</h1>
    soup.original_encoding
    # 'ISO-8859-7'

We can fix this by passing in the correct ``from_encoding``::

    soup = BeautifulSoup(markup, from_encoding="iso-8859-8")
    soup.h1
    # <h1>םולש</h1>
    soup.original_encoding
    # 'iso8859-8'

In rare cases (usually when a UTF-8 document contains text written in
a completely different encoding), the only way to get Unicode may be
to replace some characters with the special Unicode character
"REPLACEMENT CHARACTER" (U+FFFD, �). If Unicode, Dammit needs to do
this, it will set the ``.contains_replacement_characters`` attribute
to ``True`` on the ``UnicodeDammit`` or ``BeautifulSoup`` object. This
lets you know that the Unicode representation is not an exact
representation of the original--some data was lost. If a document
contains �, but ``.contains_replacement_characters`` is ``False``,
you'll know that the � was there originally (as it is in this
paragraph) and doesn't stand in for missing data.

Output encoding
---------------

When you write out a document from Beautiful Soup, you get a UTF-8
document, even if the document wasn't in UTF-8 to begin with. Here's a
document written in the Latin-1 encoding::

    markup = b'''
     <html>
      <head>
       <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
      </head>
      <body>
       <p>Sacr\xe9 bleu!</p>
      </body>
     </html>
    '''

    soup = BeautifulSoup(markup)
    print(soup.prettify())
    # <html>
    #  <head>
    #   <meta content="text/html; charset=utf-8" http-equiv="Content-type" />
    #  </head>
    #  <body>
    #   <p>
    #    Sacré bleu!
    #   </p>
    #  </body>
    # </html>

Note that the <meta> tag has been rewritten to reflect the fact that
the document is now in UTF-8.

If you don't want UTF-8, you can pass an encoding into ``prettify()``::

    print(soup.prettify("latin-1"))
    # <html>
    #  <head>
    #   <meta content="text/html; charset=latin-1" http-equiv="Content-type" />
    # ...

You can also call ``encode()`` on the ``BeautifulSoup`` object, or any
element in the soup, just as if it were a Python string::

    soup.p.encode("latin-1")
    # '<p>Sacr\xe9 bleu!</p>'

    soup.p.encode("utf-8")
    # '<p>Sacr\xc3\xa9 bleu!</p>'

Any characters that can't be represented in your chosen encoding will
be converted into numeric XML entity references. Here's a document
that includes the Unicode character SNOWMAN::

    markup = u"<b>\N{SNOWMAN}</b>"
    snowman_soup = BeautifulSoup(markup)
    tag = snowman_soup.b

The SNOWMAN character can be part of a UTF-8 document (it looks like
☃), but there's no representation for that character in ISO-Latin-1 or
ASCII, so it's converted into "&#9731;" for those encodings::

    print(tag.encode("utf-8"))
    # <b>☃</b>

    print(tag.encode("latin-1"))
    # <b>&#9731;</b>

    print(tag.encode("ascii"))
    # <b>&#9731;</b>
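To put this to use, here's a minimal sketch of my own for writing a
parsed document back to disk in an explicit encoding ("output.html" is
just an illustrative filename)::

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p>Hello.</p>")
    with open("output.html", "wb") as out:
        # prettify(encoding) returns a bytestring in that encoding.
        out.write(soup.prettify("utf-8"))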
Unicode, Dammit
---------------

You can use Unicode, Dammit without using Beautiful Soup. It's useful
whenever you have data in an unknown encoding and you just want it to
become Unicode::

    from bs4 import UnicodeDammit
    dammit = UnicodeDammit("Sacr\xc3\xa9 bleu!")
    print(dammit.unicode_markup)
    # Sacré bleu!
    dammit.original_encoding
    # 'utf-8'

Unicode, Dammit's guesses will get a lot more accurate if you install
the ``chardet`` or ``cchardet`` Python libraries. The more data you
give Unicode, Dammit, the more accurately it will guess. If you have
your own suspicions as to what the encoding might be, you can pass
them in as a list::

    dammit = UnicodeDammit("Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
    print(dammit.unicode_markup)
    # Sacré bleu!
    dammit.original_encoding
    # 'latin-1'

Unicode, Dammit has two special features that Beautiful Soup doesn't
use.

Smart quotes
^^^^^^^^^^^^

You can use Unicode, Dammit to convert Microsoft smart quotes to HTML or XML
entities::

    markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

    UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup
    # u'<p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>'

    UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup
    # u'<p>I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes</p>'

You can also convert Microsoft smart quotes to ASCII quotes::

    UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii").unicode_markup
    # u'<p>I just "love" Microsoft Word\'s smart quotes</p>'

Hopefully you'll find this feature useful, but Beautiful Soup doesn't
use it. Beautiful Soup prefers the default behavior, which is to
convert Microsoft smart quotes to Unicode characters along with
everything else::

    UnicodeDammit(markup, ["windows-1252"]).unicode_markup
    # u'<p>I just \u201clove\u201d Microsoft Word\u2019s smart quotes</p>'

Inconsistent encodings
^^^^^^^^^^^^^^^^^^^^^^

Sometimes a document is mostly in UTF-8, but contains Windows-1252
characters such as (again) Microsoft smart quotes. This can happen
when a website includes data from multiple sources. You can use
``UnicodeDammit.detwingle()`` to turn such a document into pure
UTF-8. Here's a simple example::

    snowmen = (u"\N{SNOWMAN}" * 3)
    quote = (u"\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}")
    doc = snowmen.encode("utf8") + quote.encode("windows_1252")

This document is a mess. The snowmen are in UTF-8 and the quotes are
in Windows-1252. You can display the snowmen or the quotes, but not
both::

    print(doc)
    # ☃☃☃�I like snowmen!�

    print(doc.decode("windows-1252"))
    # â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”

Decoding the document as UTF-8 raises a ``UnicodeDecodeError``, and
decoding it as Windows-1252 gives you gibberish. Fortunately,
``UnicodeDammit.detwingle()`` will convert the string to pure UTF-8,
allowing you to decode it to Unicode and display the snowmen and quote
marks simultaneously::

    new_doc = UnicodeDammit.detwingle(doc)
    print(new_doc.decode("utf8"))
    # ☃☃☃“I like snowmen!”

``UnicodeDammit.detwingle()`` only knows how to handle Windows-1252
embedded in UTF-8 (or vice versa, I suppose), but this is the most
common case.
Note that you must know to call ``UnicodeDammit.detwingle()`` on your
data before passing it into ``BeautifulSoup`` or the ``UnicodeDammit``
constructor. Beautiful Soup assumes that a document has a single
encoding, whatever it might be. If you pass it a document that
contains both UTF-8 and Windows-1252, it's likely to think the whole
document is Windows-1252, and the document will come out looking like
``â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”``.

``UnicodeDammit.detwingle()`` is new in Beautiful Soup 4.1.0.
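In other words, the safe pipeline for mixed-encoding data is
detwingle first, parse second. A minimal sketch, reusing the ``doc``
bytestring from the snowman example above::

    from bs4 import BeautifulSoup, UnicodeDammit

    clean_bytes = UnicodeDammit.detwingle(doc)  # now pure UTF-8
    soup = BeautifulSoup(clean_bytes)           # parsed as a single encoding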
Parsing only part of a document
===============================

Let's say you want to use Beautiful Soup to look at a document's <a>
tags. It's a waste of time and memory to parse the entire document and
then go over it again looking for <a> tags. It would be much faster to
ignore everything that wasn't an <a> tag in the first place. The
``SoupStrainer`` class allows you to choose which parts of an incoming
document are parsed. You just create a ``SoupStrainer`` and pass it in
to the ``BeautifulSoup`` constructor as the ``parse_only`` argument.

(Note that *this feature won't work if you're using the html5lib parser*.
If you use html5lib, the whole document will be parsed, no
matter what. This is because html5lib constantly rearranges the parse
tree as it works, and if some part of the document didn't actually
make it into the parse tree, it'll crash. To avoid confusion, in the
examples below I'll be forcing Beautiful Soup to use Python's
built-in parser.)

``SoupStrainer``
----------------

The ``SoupStrainer`` class takes the same arguments as a typical
method from `Searching the tree`_: :ref:`name <name>`, :ref:`attrs
<attrs>`, :ref:`text <text>`, and :ref:`**kwargs <kwargs>`. Here are
three ``SoupStrainer`` objects::

    from bs4 import SoupStrainer

    only_a_tags = SoupStrainer("a")

    only_tags_with_id_link2 = SoupStrainer(id="link2")

    def is_short_string(string):
        return len(string) < 10

    only_short_strings = SoupStrainer(text=is_short_string)

I'm going to bring back the "three sisters" document one more time,
and we'll see what the document looks like when it's parsed with these
three ``SoupStrainer`` objects::

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
    # <a class="sister" href="http://example.com/elsie" id="link1">
    #  Elsie
    # </a>
    # <a class="sister" href="http://example.com/lacie" id="link2">
    #  Lacie
    # </a>
    # <a class="sister" href="http://example.com/tillie" id="link3">
    #  Tillie
    # </a>

    print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify())
    # <a class="sister" href="http://example.com/lacie" id="link2">
    #  Lacie
    # </a>

    print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
    # Elsie
    # ,
    # Lacie
    # and
    # Tillie
    # ...
    #

You can also pass a ``SoupStrainer`` into any of the methods covered
in `Searching the tree`_. This probably isn't terribly useful, but I
thought I'd mention it::

    soup = BeautifulSoup(html_doc)
    soup.find_all(only_short_strings)
    # [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
    #  u'\n\n', u'...', u'\n']

Troubleshooting
===============

.. _diagnose:

``diagnose()``
--------------

If you're having trouble understanding what Beautiful Soup does to a
document, pass the document into the ``diagnose()`` function. (New in
Beautiful Soup 4.2.0.) Beautiful Soup will print out a report showing
you how different parsers handle the document, and tell you if you're
missing a parser that Beautiful Soup could be using::

    from bs4.diagnose import diagnose
    data = open("bad.html").read()
    diagnose(data)

    # Diagnostic running on Beautiful Soup 4.2.0
    # Python version 2.7.3 (default, Aug 1 2012, 05:16:07)
    # I noticed that html5lib is not installed. Installing it may help.
    # Found lxml version 2.3.2.0
    #
    # Trying to parse your data with html.parser
    # Here's what html.parser did with the document:
    # ...

Just looking at the output of ``diagnose()`` may show you how to solve
the problem. Even if not, you can paste the output of ``diagnose()``
when asking for help.
Errors when parsing a document
------------------------------

There are two different kinds of parse errors. There are crashes,
where you feed a document to Beautiful Soup and it raises an
exception, usually an ``HTMLParser.HTMLParseError``. And there is
unexpected behavior, where a Beautiful Soup parse tree looks a lot
different than the document used to create it.

Almost none of these problems turn out to be problems with Beautiful
Soup. This is not because Beautiful Soup is an amazingly well-written
piece of software. It's because Beautiful Soup doesn't include any
parsing code. Instead, it relies on external parsers. If one parser
isn't working on a certain document, the best solution is to try a
different parser. See `Installing a parser`_ for details and a parser
comparison.

The most common parse errors are ``HTMLParser.HTMLParseError:
malformed start tag`` and ``HTMLParser.HTMLParseError: bad end
tag``. These are both generated by Python's built-in HTML parser
library, and the solution is to :ref:`install lxml or
html5lib. <parser-installation>`

The most common type of unexpected behavior is that you can't find a
tag that you know is in the document. You saw it going in, but
``find_all()`` returns ``[]`` or ``find()`` returns ``None``. This is
another common problem with Python's built-in HTML parser, which
sometimes skips tags it doesn't understand. Again, the solution is to
:ref:`install lxml or html5lib. <parser-installation>`

Version mismatch problems
-------------------------

* ``SyntaxError: Invalid syntax`` (on the line ``ROOT_TAG_NAME =
  u'[document]'``): Caused by running the Python 2 version of
  Beautiful Soup under Python 3, without converting the code.

* ``ImportError: No module named HTMLParser`` - Caused by running the
  Python 2 version of Beautiful Soup under Python 3.

* ``ImportError: No module named html.parser`` - Caused by running the
  Python 3 version of Beautiful Soup under Python 2.

* ``ImportError: No module named BeautifulSoup`` - Caused by running
  Beautiful Soup 3 code on a system that doesn't have BS3
  installed. Or, by writing Beautiful Soup 4 code without knowing that
  the package name has changed to ``bs4``.

* ``ImportError: No module named bs4`` - Caused by running Beautiful
  Soup 4 code on a system that doesn't have BS4 installed.

.. _parsing-xml:

Parsing XML
-----------

By default, Beautiful Soup parses documents as HTML. To parse a
document as XML, pass in "xml" as the second argument to the
``BeautifulSoup`` constructor::

    soup = BeautifulSoup(markup, "xml")

You'll need to :ref:`have lxml installed <parser-installation>`.
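One practical consequence, shown in a small sketch of my own: XML
parsing preserves the case of tag names, while HTML parsing lowercases
them (as described in the next section)::

    BeautifulSoup("<Tag>content</Tag>", "xml")
    # <?xml version="1.0" encoding="utf-8"?>
    # <Tag>content</Tag>

    BeautifulSoup("<Tag>content</Tag>", "html.parser")
    # <tag>content</tag>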
Other parser problems
---------------------

* If your script works on one computer but not another, it's probably
  because the two computers have different parser libraries
  available. For example, you may have developed the script on a
  computer that has lxml installed, and then tried to run it on a
  computer that only has html5lib installed. See `Differences between
  parsers`_ for why this matters, and fix the problem by mentioning a
  specific parser library in the ``BeautifulSoup`` constructor.

* Because `HTML tags and attributes are case-insensitive
  <http://www.w3.org/TR/html5/syntax.html#syntax>`_, all three HTML
  parsers convert tag and attribute names to lowercase. That is, the
  markup <TAG></TAG> is converted to <tag></tag>. If you want to
  preserve mixed-case or uppercase tags and attributes, you'll need to
  :ref:`parse the document as XML. <parsing-xml>`

.. _misc:

Miscellaneous
-------------

* ``UnicodeEncodeError: 'charmap' codec can't encode character
  u'\xfoo' in position bar`` (or just about any other
  ``UnicodeEncodeError``) - This is not a problem with Beautiful Soup.
  This problem shows up in two main situations. First, when you try to
  print a Unicode character that your console doesn't know how to
  display. (See `this page on the Python wiki
  <http://wiki.python.org/moin/PrintFails>`_ for help.) Second, when
  you're writing to a file and you pass in a Unicode character that's
  not supported by your default encoding. In this case, the simplest
  solution is to explicitly encode the Unicode string into UTF-8 with
  ``u.encode("utf8")``.

* ``KeyError: [attr]`` - Caused by accessing ``tag['attr']`` when the
  tag in question doesn't define the ``attr`` attribute. The most
  common errors are ``KeyError: 'href'`` and ``KeyError:
  'class'``. Use ``tag.get('attr')`` if you're not sure ``attr`` is
  defined, just as you would with a Python dictionary.

* ``AttributeError: 'ResultSet' object has no attribute 'foo'`` - This
  usually happens because you expected ``find_all()`` to return a
  single tag or string. But ``find_all()`` returns a `list` of tags
  and strings--a ``ResultSet`` object. You need to iterate over the
  list and look at the ``.foo`` of each one. Or, if you really only
  want one result, you need to use ``find()`` instead of
  ``find_all()``.

* ``AttributeError: 'NoneType' object has no attribute 'foo'`` - This
  usually happens because you called ``find()`` and then tried to
  access the ``.foo`` attribute of the result. But in your case,
  ``find()`` didn't find anything, so it returned ``None`` instead of
  returning a tag or a string. You need to figure out why your
  ``find()`` call isn't returning anything.

Improving Performance
---------------------

Beautiful Soup will never be as fast as the parsers it sits on top
of. If response time is critical, if you're paying for computer time
by the hour, or if there's any other reason why computer time is more
valuable than programmer time, you should forget about Beautiful Soup
and work directly atop `lxml <http://lxml.de/>`_.

That said, there are things you can do to speed up Beautiful Soup. If
you're not using lxml as the underlying parser, my advice is to
:ref:`start <parser-installation>`. Beautiful Soup parses documents
significantly faster using lxml than using html.parser or html5lib.

You can speed up encoding detection significantly by installing the
`cchardet <http://pypi.python.org/pypi/cchardet/>`_ library.

`Parsing only part of a document`_ won't save you much time parsing
the document, but it can save a lot of memory, and it'll make
`searching` the document much faster.
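If you want to see the parser difference on your own data, here's a
rough timing sketch of my own using the standard library's ``timeit``
("big.html" is a hypothetical large document; numbers will vary by
machine)::

    import timeit

    def parse_with(parser):
        # Time ten full parses of the document with the given parser.
        return timeit.timeit(
            "BeautifulSoup(html_doc, %r)" % parser,
            setup="from bs4 import BeautifulSoup; "
                  "html_doc = open('big.html').read()",
            number=10)

    print(parse_with("lxml"))
    print(parse_with("html.parser"))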
Beautiful Soup 3
================

Beautiful Soup 3 is the previous release series, and is no longer
being actively developed. It's currently packaged with all major Linux
distributions:

:kbd:`$ apt-get install python-beautifulsoup`

It's also published through PyPi as ``BeautifulSoup``:

:kbd:`$ easy_install BeautifulSoup`

:kbd:`$ pip install BeautifulSoup`

You can also `download a tarball of Beautiful Soup 3.2.0
<http://www.crummy.com/software/BeautifulSoup/bs3/download/3.x/BeautifulSoup-3.2.0.tar.gz>`_.

If you ran ``easy_install beautifulsoup`` or ``easy_install
BeautifulSoup``, but your code doesn't work, you installed Beautiful
Soup 3 by mistake. You need to run ``easy_install beautifulsoup4``.

`The documentation for Beautiful Soup 3 is archived online
<http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html>`_. If
your first language is Chinese, it might be easier for you to read
`the Chinese translation of the Beautiful Soup 3 documentation
<http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html>`_,
then read this document to find out about the changes made in
Beautiful Soup 4.

Porting code to BS4
-------------------

Most code written against Beautiful Soup 3 will work against Beautiful
Soup 4 with one simple change. All you should have to do is change the
package name from ``BeautifulSoup`` to ``bs4``. So this::

    from BeautifulSoup import BeautifulSoup

becomes this::

    from bs4 import BeautifulSoup

* If you get the ``ImportError`` "No module named BeautifulSoup", your
  problem is that you're trying to run Beautiful Soup 3 code, but you
  only have Beautiful Soup 4 installed.

* If you get the ``ImportError`` "No module named bs4", your problem
  is that you're trying to run Beautiful Soup 4 code, but you only
  have Beautiful Soup 3 installed.

Although BS4 is mostly backwards-compatible with BS3, most of its
methods have been deprecated and given new names for `PEP 8 compliance
<http://www.python.org/dev/peps/pep-0008/>`_. There are numerous other
renames and changes, and a few of them break backwards compatibility.

Here's what you'll need to know to convert your BS3 code and habits to BS4:

You need a parser
^^^^^^^^^^^^^^^^^

Beautiful Soup 3 used Python's ``SGMLParser``, a module that was
deprecated and removed in Python 3.0. Beautiful Soup 4 uses
``html.parser`` by default, but you can plug in lxml or html5lib and
use that instead. See `Installing a parser`_ for a comparison.

Since ``html.parser`` is not the same parser as ``SGMLParser``, it
will treat invalid markup differently. Usually the "difference" is
that ``html.parser`` crashes. In that case, you'll need to install
another parser. But sometimes ``html.parser`` just creates a different
parse tree than ``SGMLParser`` would. If this happens, you may need to
update your BS3 scraping code to deal with the new tree.
Method names
^^^^^^^^^^^^

* ``renderContents`` -> ``encode_contents``
* ``replaceWith`` -> ``replace_with``
* ``replaceWithChildren`` -> ``unwrap``
* ``findAll`` -> ``find_all``
* ``findAllNext`` -> ``find_all_next``
* ``findAllPrevious`` -> ``find_all_previous``
* ``findNext`` -> ``find_next``
* ``findNextSibling`` -> ``find_next_sibling``
* ``findNextSiblings`` -> ``find_next_siblings``
* ``findParent`` -> ``find_parent``
* ``findParents`` -> ``find_parents``
* ``findPrevious`` -> ``find_previous``
* ``findPreviousSibling`` -> ``find_previous_sibling``
* ``findPreviousSiblings`` -> ``find_previous_siblings``
* ``nextSibling`` -> ``next_sibling``
* ``previousSibling`` -> ``previous_sibling``

Some arguments to the Beautiful Soup constructor were renamed for the
same reasons:

* ``BeautifulSoup(parseOnlyThese=...)`` -> ``BeautifulSoup(parse_only=...)``
* ``BeautifulSoup(fromEncoding=...)`` -> ``BeautifulSoup(from_encoding=...)``

I renamed one method for compatibility with Python 3:

* ``Tag.has_key()`` -> ``Tag.has_attr()``

I renamed one attribute to use more accurate terminology:

* ``Tag.isSelfClosing`` -> ``Tag.is_empty_element``

I renamed three attributes to avoid using words that have special
meaning to Python. Unlike the others, these changes are *not backwards
compatible.* If you used these attributes in BS3, your code will break
on BS4 until you change them.

* ``UnicodeDammit.unicode`` -> ``UnicodeDammit.unicode_markup``
* ``Tag.next`` -> ``Tag.next_element``
* ``Tag.previous`` -> ``Tag.previous_element``

Generators
^^^^^^^^^^

I gave the generators PEP 8-compliant names, and transformed them into
properties:

* ``childGenerator()`` -> ``children``
* ``nextGenerator()`` -> ``next_elements``
* ``nextSiblingGenerator()`` -> ``next_siblings``
* ``previousGenerator()`` -> ``previous_elements``
* ``previousSiblingGenerator()`` -> ``previous_siblings``
* ``recursiveChildGenerator()`` -> ``descendants``
* ``parentGenerator()`` -> ``parents``

So instead of this::

    for parent in tag.parentGenerator():
        ...

You can write this::

    for parent in tag.parents:
        ...

(But the old code will still work.)

Some of the generators used to yield ``None`` after they were done, and
then stop. That was a bug. Now the generators just stop.

There are two new generators, :ref:`.strings and
.stripped_strings <string-generators>`. ``.strings`` yields
``NavigableString`` objects, and ``.stripped_strings`` yields Python
strings that have had whitespace stripped.
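A quick sketch of the two new generators in action, using a small
document of my own::

    soup = BeautifulSoup("<p> One </p><p> Two </p>")

    [s for s in soup.strings]
    # [u' One ', u' Two ']

    [s for s in soup.stripped_strings]
    # [u'One', u'Two']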
XML
^^^

There is no longer a ``BeautifulStoneSoup`` class for parsing XML. To
parse XML you pass in "xml" as the second argument to the
``BeautifulSoup`` constructor. For the same reason, the
``BeautifulSoup`` constructor no longer recognizes the ``isHTML``
argument.

Beautiful Soup's handling of empty-element XML tags has been
improved. Previously when you parsed XML you had to explicitly say
which tags were considered empty-element tags. The ``selfClosingTags``
argument to the constructor is no longer recognized. Instead,
Beautiful Soup considers any empty tag to be an empty-element tag. If
you add a child to an empty-element tag, it stops being an
empty-element tag.

Entities
^^^^^^^^

An incoming HTML or XML entity is always converted into the
corresponding Unicode character. Beautiful Soup 3 had a number of
overlapping ways of dealing with entities, which have been
removed. The ``BeautifulSoup`` constructor no longer recognizes the
``smartQuotesTo`` or ``convertEntities`` arguments. (`Unicode,
Dammit`_ still has ``smart_quotes_to``, but its default is now to turn
smart quotes into Unicode.) The constants ``HTML_ENTITIES``,
``XML_ENTITIES``, and ``XHTML_ENTITIES`` have been removed, since they
configure a feature (transforming some but not all entities into
Unicode characters) that no longer exists.

If you want to turn Unicode characters back into HTML entities on
output, rather than turning them into UTF-8 characters, you need to
use an :ref:`output formatter <output_formatters>`.

Miscellaneous
^^^^^^^^^^^^^

:ref:`Tag.string <.string>` now operates recursively. If tag A
contains a single tag B and nothing else, then A.string is the same as
B.string. (Previously, it was None.)

`Multi-valued attributes`_ like ``class`` have lists of strings as
their values, not strings. This may affect the way you search by CSS
class.

If you pass one of the ``find*`` methods both :ref:`text <text>` `and`
a tag-specific argument like :ref:`name <name>`, Beautiful Soup will
search for tags that match your tag-specific criteria and whose
:ref:`Tag.string <.string>` matches your value for :ref:`text
<text>`. It will `not` find the strings themselves. Previously,
Beautiful Soup ignored the tag-specific arguments and looked for
strings.

The ``BeautifulSoup`` constructor no longer recognizes the
``markupMassage`` argument. It's now the parser's responsibility to
handle markup correctly.

The rarely-used alternate parser classes like
``ICantBelieveItsBeautifulSoup`` and ``BeautifulSOAP`` have been
removed. It's now the parser's decision how to handle ambiguous
markup.

The ``prettify()`` method now returns a Unicode string, not a bytestring.