      1 Beautiful Soup Documentation
      2 ============================
      3 
      4 .. image:: 6.1.jpg
      5    :align: right
      6    :alt: "The Fish-Footman began by producing from under his arm a great letter, nearly as large as himself."
      7 
      8 `Beautiful Soup <http://www.crummy.com/software/BeautifulSoup/>`_ is a
      9 Python library for pulling data out of HTML and XML files. It works
     10 with your favorite parser to provide idiomatic ways of navigating,
     11 searching, and modifying the parse tree. It commonly saves programmers
     12 hours or days of work.
     13 
     14 These instructions illustrate all major features of Beautiful Soup 4,
     15 with examples. I show you what the library is good for, how it works,
     16 how to use it, how to make it do what you want, and what to do when it
     17 violates your expectations.
     18 
     19 The examples in this documentation should work the same way in Python
     20 2.7 and Python 3.2.
     21 
     22 You might be looking for the documentation for `Beautiful Soup 3
     23 <http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html>`_.
     24 If so, you should know that Beautiful Soup 3 is no longer being
     25 developed, and that Beautiful Soup 4 is recommended for all new
     26 projects. If you want to learn about the differences between Beautiful
     27 Soup 3 and Beautiful Soup 4, see `Porting code to BS4`_.
     28 
     29 This documentation has been translated into other languages by its users.
     30 
* `Korean <http://coreapython.hosting.paran.com/etc/beautifulsoup4.html>`_
     32 
     33 Getting help
     34 ------------
     35 
     36 If you have questions about Beautiful Soup, or run into problems,
     37 `send mail to the discussion group
     38 <https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup>`_. If
     39 your problem involves parsing an HTML document, be sure to mention
     40 :ref:`what the diagnose() function says <diagnose>` about
     41 that document.
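
When you report a problem, you can run the ``diagnose()`` function
yourself first. Here's a minimal sketch; ``diagnose()`` lives in the
``bs4.diagnose`` module and prints which parsers are installed and how
each one handles the markup you pass in:

```python
from bs4.diagnose import diagnose

# diagnose() prints the Beautiful Soup version, which parsers are
# available, and how each of them parses the document you pass in.
# Paste the markup that's giving you trouble in place of this string.
diagnose("<p>Some markup that behaves strangely</p>")
```

Include that output in your message to the discussion group.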
     42 
     43 Quick Start
     44 ===========
     45 
     46 Here's an HTML document I'll be using as an example throughout this
     47 document. It's part of a story from `Alice in Wonderland`::
     48 
     49  html_doc = """
     50  <html><head><title>The Dormouse's story</title></head>
     51  <body>
     52  <p class="title"><b>The Dormouse's story</b></p>
     53 
     54  <p class="story">Once upon a time there were three little sisters; and their names were
     55  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
     56  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
     57  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
     58  and they lived at the bottom of a well.</p>
     59 
     60  <p class="story">...</p>
     61  """
     62 
     63 Running the "three sisters" document through Beautiful Soup gives us a
     64 ``BeautifulSoup`` object, which represents the document as a nested
     65 data structure::
     66 
     67  from bs4 import BeautifulSoup
     68  soup = BeautifulSoup(html_doc)
     69 
     70  print(soup.prettify())
     71  # <html>
     72  #  <head>
     73  #   <title>
     74  #    The Dormouse's story
     75  #   </title>
     76  #  </head>
     77  #  <body>
     78  #   <p class="title">
     79  #    <b>
     80  #     The Dormouse's story
     81  #    </b>
     82  #   </p>
     83  #   <p class="story">
     84  #    Once upon a time there were three little sisters; and their names were
     85  #    <a class="sister" href="http://example.com/elsie" id="link1">
     86  #     Elsie
     87  #    </a>
     88  #    ,
     89  #    <a class="sister" href="http://example.com/lacie" id="link2">
     90  #     Lacie
     91  #    </a>
     92  #    and
 #    <a class="sister" href="http://example.com/tillie" id="link3">
     94  #     Tillie
     95  #    </a>
     96  #    ; and they lived at the bottom of a well.
     97  #   </p>
     98  #   <p class="story">
     99  #    ...
    100  #   </p>
    101  #  </body>
    102  # </html>
    103 
    104 Here are some simple ways to navigate that data structure::
    105 
    106  soup.title
    107  # <title>The Dormouse's story</title>
    108 
    109  soup.title.name
    110  # u'title'
    111 
    112  soup.title.string
    113  # u'The Dormouse's story'
    114 
    115  soup.title.parent.name
    116  # u'head'
    117 
    118  soup.p
    119  # <p class="title"><b>The Dormouse's story</b></p>
    120 
    121  soup.p['class']
 # [u'title']
    123 
    124  soup.a
    125  # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    126 
    127  soup.find_all('a')
    128  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    129  #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    130  #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    131 
    132  soup.find(id="link3")
    133  # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    134 
    135 One common task is extracting all the URLs found within a page's <a> tags::
    136 
    137  for link in soup.find_all('a'):
    138      print(link.get('href'))
    139  # http://example.com/elsie
    140  # http://example.com/lacie
    141  # http://example.com/tillie
    142 
    143 Another common task is extracting all the text from a page::
    144 
    145  print(soup.get_text())
    146  # The Dormouse's story
    147  #
    148  # The Dormouse's story
    149  #
    150  # Once upon a time there were three little sisters; and their names were
    151  # Elsie,
    152  # Lacie and
    153  # Tillie;
    154  # and they lived at the bottom of a well.
    155  #
    156  # ...
    157 
    158 Does this look like what you need? If so, read on.
    159 
    160 Installing Beautiful Soup
    161 =========================
    162 
    163 If you're using a recent version of Debian or Ubuntu Linux, you can
    164 install Beautiful Soup with the system package manager:
    165 
    166 :kbd:`$ apt-get install python-bs4`
    167 
Beautiful Soup 4 is published through PyPI, so if you can't install it
with the system package manager, you can install it with ``easy_install`` or
    170 ``pip``. The package name is ``beautifulsoup4``, and the same package
    171 works on Python 2 and Python 3.
    172 
    173 :kbd:`$ easy_install beautifulsoup4`
    174 
    175 :kbd:`$ pip install beautifulsoup4`
    176 
    177 (The ``BeautifulSoup`` package is probably `not` what you want. That's
    178 the previous major release, `Beautiful Soup 3`_. Lots of software uses
    179 BS3, so it's still available, but if you're writing new code you
    180 should install ``beautifulsoup4``.)
    181 
    182 If you don't have ``easy_install`` or ``pip`` installed, you can
    183 `download the Beautiful Soup 4 source tarball
    184 <http://www.crummy.com/software/BeautifulSoup/download/4.x/>`_ and
    185 install it with ``setup.py``.
    186 
    187 :kbd:`$ python setup.py install`
    188 
    189 If all else fails, the license for Beautiful Soup allows you to
    190 package the entire library with your application. You can download the
    191 tarball, copy its ``bs4`` directory into your application's codebase,
    192 and use Beautiful Soup without installing it at all.
    193 
    194 I use Python 2.7 and Python 3.2 to develop Beautiful Soup, but it
    195 should work with other recent versions.
    196 
    197 Problems after installation
    198 ---------------------------
    199 
    200 Beautiful Soup is packaged as Python 2 code. When you install it for
    201 use with Python 3, it's automatically converted to Python 3 code. If
    202 you don't install the package, the code won't be converted. There have
    203 also been reports on Windows machines of the wrong version being
    204 installed.
    205 
    206 If you get the ``ImportError`` "No module named HTMLParser", your
    207 problem is that you're running the Python 2 version of the code under
    208 Python 3.
    209 
    210 If you get the ``ImportError`` "No module named html.parser", your
    211 problem is that you're running the Python 3 version of the code under
    212 Python 2.
    213 
    214 In both cases, your best bet is to completely remove the Beautiful
    215 Soup installation from your system (including any directory created
    216 when you unzipped the tarball) and try the installation again.
    217 
    218 If you get the ``SyntaxError`` "Invalid syntax" on the line
    219 ``ROOT_TAG_NAME = u'[document]'``, you need to convert the Python 2
    220 code to Python 3. You can do this either by installing the package:
    221 
    222 :kbd:`$ python3 setup.py install`
    223 
    224 or by manually running Python's ``2to3`` conversion script on the
    225 ``bs4`` directory:
    226 
    227 :kbd:`$ 2to3-3.2 -w bs4`
    228 
    229 .. _parser-installation:
    230 
    231 
    232 Installing a parser
    233 -------------------
    234 
    235 Beautiful Soup supports the HTML parser included in Python's standard
    236 library, but it also supports a number of third-party Python parsers.
    237 One is the `lxml parser <http://lxml.de/>`_. Depending on your setup,
    238 you might install lxml with one of these commands:
    239 
    240 :kbd:`$ apt-get install python-lxml`
    241 
    242 :kbd:`$ easy_install lxml`
    243 
    244 :kbd:`$ pip install lxml`
    245 
    246 Another alternative is the pure-Python `html5lib parser
    247 <http://code.google.com/p/html5lib/>`_, which parses HTML the way a
    248 web browser does. Depending on your setup, you might install html5lib
    249 with one of these commands:
    250 
    251 :kbd:`$ apt-get install python-html5lib`
    252 
    253 :kbd:`$ easy_install html5lib`
    254 
    255 :kbd:`$ pip install html5lib`
    256 
    257 This table summarizes the advantages and disadvantages of each parser library:
    258 
    259 +----------------------+--------------------------------------------+--------------------------------+--------------------------+
    260 | Parser               | Typical usage                              | Advantages                     | Disadvantages            |
    261 +----------------------+--------------------------------------------+--------------------------------+--------------------------+
    262 | Python's html.parser | ``BeautifulSoup(markup, "html.parser")``   | * Batteries included           | * Not very lenient       |
    263 |                      |                                            | * Decent speed                 |   (before Python 2.7.3   |
    264 |                      |                                            | * Lenient (as of Python 2.7.3  |   or 3.2.2)              |
    265 |                      |                                            |   and 3.2.)                    |                          |
    266 +----------------------+--------------------------------------------+--------------------------------+--------------------------+
    267 | lxml's HTML parser   | ``BeautifulSoup(markup, "lxml")``          | * Very fast                    | * External C dependency  |
    268 |                      |                                            | * Lenient                      |                          |
    269 +----------------------+--------------------------------------------+--------------------------------+--------------------------+
    270 | lxml's XML parser    | ``BeautifulSoup(markup, ["lxml", "xml"])`` | * Very fast                    | * External C dependency  |
    271 |                      | ``BeautifulSoup(markup, "xml")``           | * The only currently supported |                          |
    272 |                      |                                            |   XML parser                   |                          |
    273 +----------------------+--------------------------------------------+--------------------------------+--------------------------+
    274 | html5lib             | ``BeautifulSoup(markup, "html5lib")``      | * Extremely lenient            | * Very slow              |
    275 |                      |                                            | * Parses pages the same way a  | * External Python        |
    276 |                      |                                            |   web browser does             |   dependency             |
    277 |                      |                                            | * Creates valid HTML5          |                          |
    278 +----------------------+--------------------------------------------+--------------------------------+--------------------------+
    279 
    280 If you can, I recommend you install and use lxml for speed. If you're
    281 using a version of Python 2 earlier than 2.7.3, or a version of Python
    282 3 earlier than 3.2.2, it's `essential` that you install lxml or
    283 html5lib--Python's built-in HTML parser is just not very good in older
    284 versions.
    285 
    286 Note that if a document is invalid, different parsers will generate
    287 different Beautiful Soup trees for it. See `Differences
    288 between parsers`_ for details.
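
As a quick illustration, here's a sketch of invalid markup parsed with
the standard library's parser; the stray ``</p>`` end tag is simply
ignored, while lxml or html5lib would build a different tree around the
same input:

```python
from bs4 import BeautifulSoup

# "<a></p>" is invalid: the </p> end tag has no matching start tag.
# Python's built-in parser ignores the stray end tag entirely.
soup = BeautifulSoup("<a></p>", "html.parser")
print(soup.a)
# <a></a>
```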
    289 
    290 Making the soup
    291 ===============
    292 
    293 To parse a document, pass it into the ``BeautifulSoup``
    294 constructor. You can pass in a string or an open filehandle::
    295 
    296  from bs4 import BeautifulSoup
    297 
    298  soup = BeautifulSoup(open("index.html"))
    299 
    300  soup = BeautifulSoup("<html>data</html>")
    301 
    302 First, the document is converted to Unicode, and HTML entities are
    303 converted to Unicode characters::
    304 
    305  BeautifulSoup("Sacr&eacute; bleu!")
    306  <html><head></head><body>Sacr bleu!</body></html>
    307 
    308 Beautiful Soup then parses the document using the best available
    309 parser. It will use an HTML parser unless you specifically tell it to
    310 use an XML parser. (See `Parsing XML`_.)
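
Because the "best available" parser can differ from one machine to
another, you may want to name a parser explicitly. This sketch pins the
standard library's parser so the result doesn't depend on what else is
installed:

```python
from bs4 import BeautifulSoup

# Naming the parser as the second argument makes parsing reproducible
# regardless of which third-party parsers happen to be installed.
soup = BeautifulSoup("<html>data</html>", "html.parser")
print(soup.html.name)
# html
```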
    311 
    312 Kinds of objects
    313 ================
    314 
    315 Beautiful Soup transforms a complex HTML document into a complex tree
    316 of Python objects. But you'll only ever have to deal with about four
    317 `kinds` of objects.
    318 
    319 .. _Tag:
    320 
    321 ``Tag``
    322 -------
    323 
    324 A ``Tag`` object corresponds to an XML or HTML tag in the original document::
    325 
    326  soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
    327  tag = soup.b
    328  type(tag)
    329  # <class 'bs4.element.Tag'>
    330 
    331 Tags have a lot of attributes and methods, and I'll cover most of them
    332 in `Navigating the tree`_ and `Searching the tree`_. For now, the most
    333 important features of a tag are its name and attributes.
    334 
    335 Name
    336 ^^^^
    337 
    338 Every tag has a name, accessible as ``.name``::
    339 
    340  tag.name
    341  # u'b'
    342 
    343 If you change a tag's name, the change will be reflected in any HTML
    344 markup generated by Beautiful Soup::
    345 
    346  tag.name = "blockquote"
    347  tag
    348  # <blockquote class="boldest">Extremely bold</blockquote>
    349 
    350 Attributes
    351 ^^^^^^^^^^
    352 
    353 A tag may have any number of attributes. The tag ``<b
    354 class="boldest">`` has an attribute "class" whose value is
    355 "boldest". You can access a tag's attributes by treating the tag like
    356 a dictionary::
    357 
    358  tag['class']
    359  # u'boldest'
    360 
    361 You can access that dictionary directly as ``.attrs``::
    362 
    363  tag.attrs
    364  # {u'class': u'boldest'}
    365 
    366 You can add, remove, and modify a tag's attributes. Again, this is
    367 done by treating the tag as a dictionary::
    368 
    369  tag['class'] = 'verybold'
    370  tag['id'] = 1
    371  tag
    372  # <blockquote class="verybold" id="1">Extremely bold</blockquote>
    373 
    374  del tag['class']
    375  del tag['id']
    376  tag
    377  # <blockquote>Extremely bold</blockquote>
    378 
    379  tag['class']
    380  # KeyError: 'class'
    381  print(tag.get('class'))
    382  # None
    383 
    384 .. _multivalue:
    385 
    386 Multi-valued attributes
    387 &&&&&&&&&&&&&&&&&&&&&&&
    388 
    389 HTML 4 defines a few attributes that can have multiple values. HTML 5
    390 removes a couple of them, but defines a few more. The most common
    391 multi-valued attribute is ``class`` (that is, a tag can have more than
    392 one CSS class). Others include ``rel``, ``rev``, ``accept-charset``,
    393 ``headers``, and ``accesskey``. Beautiful Soup presents the value(s)
    394 of a multi-valued attribute as a list::
    395 
    396  css_soup = BeautifulSoup('<p class="body strikeout"></p>')
    397  css_soup.p['class']
    398  # ["body", "strikeout"]
    399 
    400  css_soup = BeautifulSoup('<p class="body"></p>')
    401  css_soup.p['class']
    402  # ["body"]
    403 
    404 If an attribute `looks` like it has more than one value, but it's not
    405 a multi-valued attribute as defined by any version of the HTML
    406 standard, Beautiful Soup will leave the attribute alone::
    407 
    408  id_soup = BeautifulSoup('<p id="my id"></p>')
    409  id_soup.p['id']
    410  # 'my id'
    411 
    412 When you turn a tag back into a string, multiple attribute values are
    413 consolidated::
    414 
    415  rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
    416  rel_soup.a['rel']
    417  # ['index']
    418  rel_soup.a['rel'] = ['index', 'contents']
    419  print(rel_soup.p)
    420  # <p>Back to the <a rel="index contents">homepage</a></p>
    421 
    422 If you parse a document as XML, there are no multi-valued attributes::
    423 
    424  xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
    425  xml_soup.p['class']
    426  # u'body strikeout'
    427 
    428 
    429 
    430 ``NavigableString``
    431 -------------------
    432 
    433 A string corresponds to a bit of text within a tag. Beautiful Soup
    434 uses the ``NavigableString`` class to contain these bits of text::
    435 
    436  tag.string
    437  # u'Extremely bold'
    438  type(tag.string)
    439  # <class 'bs4.element.NavigableString'>
    440 
    441 A ``NavigableString`` is just like a Python Unicode string, except
    442 that it also supports some of the features described in `Navigating
    443 the tree`_ and `Searching the tree`_. You can convert a
    444 ``NavigableString`` to a Unicode string with ``unicode()``::
    445 
    446  unicode_string = unicode(tag.string)
    447  unicode_string
    448  # u'Extremely bold'
    449  type(unicode_string)
    450  # <type 'unicode'>
    451 
    452 You can't edit a string in place, but you can replace one string with
    453 another, using :ref:`replace_with`::
    454 
    455  tag.string.replace_with("No longer bold")
    456  tag
    457  # <blockquote>No longer bold</blockquote>
    458 
    459 ``NavigableString`` supports most of the features described in
    460 `Navigating the tree`_ and `Searching the tree`_, but not all of
    461 them. In particular, since a string can't contain anything (the way a
    462 tag may contain a string or another tag), strings don't support the
    463 ``.contents`` or ``.string`` attributes, or the ``find()`` method.
    464 
    465 If you want to use a ``NavigableString`` outside of Beautiful Soup,
    466 you should call ``unicode()`` on it to turn it into a normal Python
    467 Unicode string. If you don't, your string will carry around a
    468 reference to the entire Beautiful Soup parse tree, even when you're
    469 done using Beautiful Soup. This is a big waste of memory.
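
For example, here's a sketch of extracting a plain string before
discarding the soup (on Python 3, ``str()`` plays the role of
``unicode()``):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")

# Converting the NavigableString to a plain string drops its reference
# to the parse tree, so the tree can be garbage-collected later.
plain = str(soup.b.string)   # on Python 2, call unicode() instead
print(plain)
# Extremely bold
```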
    470 
    471 ``BeautifulSoup``
    472 -----------------
    473 
    474 The ``BeautifulSoup`` object itself represents the document as a
    475 whole. For most purposes, you can treat it as a :ref:`Tag`
    476 object. This means it supports most of the methods described in
    477 `Navigating the tree`_ and `Searching the tree`_.
    478 
    479 Since the ``BeautifulSoup`` object doesn't correspond to an actual
    480 HTML or XML tag, it has no name and no attributes. But sometimes it's
    481 useful to look at its ``.name``, so it's been given the special
    482 ``.name`` "[document]"::
    483 
    484  soup.name
    485  # u'[document]'
    486 
    487 Comments and other special strings
    488 ----------------------------------
    489 
    490 ``Tag``, ``NavigableString``, and ``BeautifulSoup`` cover almost
    491 everything you'll see in an HTML or XML file, but there are a few
    492 leftover bits. The only one you'll probably ever need to worry about
    493 is the comment::
    494 
    495  markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
    496  soup = BeautifulSoup(markup)
    497  comment = soup.b.string
    498  type(comment)
    499  # <class 'bs4.element.Comment'>
    500 
    501 The ``Comment`` object is just a special type of ``NavigableString``::
    502 
    503  comment
 # u'Hey, buddy. Want to buy a used parser?'
    505 
    506 But when it appears as part of an HTML document, a ``Comment`` is
    507 displayed with special formatting::
    508 
    509  print(soup.b.prettify())
    510  # <b>
    511  #  <!--Hey, buddy. Want to buy a used parser?-->
    512  # </b>
    513 
    514 Beautiful Soup defines classes for anything else that might show up in
    515 an XML document: ``CData``, ``ProcessingInstruction``,
    516 ``Declaration``, and ``Doctype``. Just like ``Comment``, these classes
    517 are subclasses of ``NavigableString`` that add something extra to the
    518 string. Here's an example that replaces the comment with a CDATA
    519 block::
    520 
    521  from bs4 import CData
    522  cdata = CData("A CDATA block")
    523  comment.replace_with(cdata)
    524 
    525  print(soup.b.prettify())
    526  # <b>
    527  #  <![CDATA[A CDATA block]]>
    528  # </b>
    529 
    530 
    531 Navigating the tree
    532 ===================
    533 
    534 Here's the "Three sisters" HTML document again::
    535 
    536  html_doc = """
    537  <html><head><title>The Dormouse's story</title></head>
    538 
    539  <p class="title"><b>The Dormouse's story</b></p>
    540 
    541  <p class="story">Once upon a time there were three little sisters; and their names were
    542  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    543  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    544  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    545  and they lived at the bottom of a well.</p>
    546 
    547  <p class="story">...</p>
    548  """
    549 
    550  from bs4 import BeautifulSoup
    551  soup = BeautifulSoup(html_doc)
    552 
    553 I'll use this as an example to show you how to move from one part of
    554 a document to another.
    555 
    556 Going down
    557 ----------
    558 
    559 Tags may contain strings and other tags. These elements are the tag's
    560 `children`. Beautiful Soup provides a lot of different attributes for
    561 navigating and iterating over a tag's children.
    562 
    563 Note that Beautiful Soup strings don't support any of these
    564 attributes, because a string can't have children.
    565 
    566 Navigating using tag names
    567 ^^^^^^^^^^^^^^^^^^^^^^^^^^
    568 
    569 The simplest way to navigate the parse tree is to say the name of the
    570 tag you want. If you want the <head> tag, just say ``soup.head``::
    571 
    572  soup.head
    573  # <head><title>The Dormouse's story</title></head>
    574 
    575  soup.title
    576  # <title>The Dormouse's story</title>
    577 
You can use this trick again and again to zoom in on a certain part
    579 of the parse tree. This code gets the first <b> tag beneath the <body> tag::
    580 
    581  soup.body.b
    582  # <b>The Dormouse's story</b>
    583 
    584 Using a tag name as an attribute will give you only the `first` tag by that
    585 name::
    586 
    587  soup.a
    588  # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    589 
    590 If you need to get `all` the <a> tags, or anything more complicated
    591 than the first tag with a certain name, you'll need to use one of the
    592 methods described in `Searching the tree`_, such as `find_all()`::
    593 
    594  soup.find_all('a')
    595  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    596  #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    597  #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    598 
    599 ``.contents`` and ``.children``
    600 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    601 
    602 A tag's children are available in a list called ``.contents``::
    603 
    604  head_tag = soup.head
    605  head_tag
    606  # <head><title>The Dormouse's story</title></head>
    607 
    608  head_tag.contents
 # [<title>The Dormouse's story</title>]
    610 
    611  title_tag = head_tag.contents[0]
    612  title_tag
    613  # <title>The Dormouse's story</title>
    614  title_tag.contents
    615  # [u'The Dormouse's story']
    616 
    617 The ``BeautifulSoup`` object itself has children. In this case, the
<html> tag is the child of the ``BeautifulSoup`` object::
    619 
    620  len(soup.contents)
    621  # 1
    622  soup.contents[0].name
    623  # u'html'
    624 
    625 A string does not have ``.contents``, because it can't contain
    626 anything::
    627 
    628  text = title_tag.contents[0]
    629  text.contents
    630  # AttributeError: 'NavigableString' object has no attribute 'contents'
    631 
    632 Instead of getting them as a list, you can iterate over a tag's
    633 children using the ``.children`` generator::
    634 
    635  for child in title_tag.children:
    636      print(child)
    637  # The Dormouse's story
    638 
    639 ``.descendants``
    640 ^^^^^^^^^^^^^^^^
    641 
    642 The ``.contents`` and ``.children`` attributes only consider a tag's
    643 `direct` children. For instance, the <head> tag has a single direct
    644 child--the <title> tag::
    645 
    646  head_tag.contents
    647  # [<title>The Dormouse's story</title>]
    648 
    649 But the <title> tag itself has a child: the string "The Dormouse's
    650 story". There's a sense in which that string is also a child of the
    651 <head> tag. The ``.descendants`` attribute lets you iterate over `all`
    652 of a tag's children, recursively: its direct children, the children of
    653 its direct children, and so on::
    654 
    655  for child in head_tag.descendants:
    656      print(child)
    657  # <title>The Dormouse's story</title>
    658  # The Dormouse's story
    659 
    660 The <head> tag has only one child, but it has two descendants: the
    661 <title> tag and the <title> tag's child. The ``BeautifulSoup`` object
    662 only has one direct child (the <html> tag), but it has a whole lot of
    663 descendants::
    664 
    665  len(list(soup.children))
    666  # 1
    667  len(list(soup.descendants))
    668  # 25
    669 
    670 .. _.string:
    671 
    672 ``.string``
    673 ^^^^^^^^^^^
    674 
    675 If a tag has only one child, and that child is a ``NavigableString``,
    676 the child is made available as ``.string``::
    677 
    678  title_tag.string
    679  # u'The Dormouse's story'
    680 
    681 If a tag's only child is another tag, and `that` tag has a
    682 ``.string``, then the parent tag is considered to have the same
    683 ``.string`` as its child::
    684 
    685  head_tag.contents
    686  # [<title>The Dormouse's story</title>]
    687 
    688  head_tag.string
    689  # u'The Dormouse's story'
    690 
    691 If a tag contains more than one thing, then it's not clear what
    692 ``.string`` should refer to, so ``.string`` is defined to be
    693 ``None``::
    694 
    695  print(soup.html.string)
    696  # None
    697 
    698 .. _string-generators:
    699 
``.strings`` and ``.stripped_strings``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    702 
    703 If there's more than one thing inside a tag, you can still look at
    704 just the strings. Use the ``.strings`` generator::
    705 
    706  for string in soup.strings:
    707      print(repr(string))
    708  # u"The Dormouse's story"
    709  # u'\n\n'
    710  # u"The Dormouse's story"
    711  # u'\n\n'
    712  # u'Once upon a time there were three little sisters; and their names were\n'
    713  # u'Elsie'
    714  # u',\n'
    715  # u'Lacie'
    716  # u' and\n'
    717  # u'Tillie'
    718  # u';\nand they lived at the bottom of a well.'
    719  # u'\n\n'
    720  # u'...'
    721  # u'\n'
    722 
    723 These strings tend to have a lot of extra whitespace, which you can
    724 remove by using the ``.stripped_strings`` generator instead::
    725 
    726  for string in soup.stripped_strings:
    727      print(repr(string))
    728  # u"The Dormouse's story"
    729  # u"The Dormouse's story"
    730  # u'Once upon a time there were three little sisters; and their names were'
    731  # u'Elsie'
    732  # u','
    733  # u'Lacie'
    734  # u'and'
    735  # u'Tillie'
    736  # u';\nand they lived at the bottom of a well.'
    737  # u'...'
    738 
    739 Here, strings consisting entirely of whitespace are ignored, and
    740 whitespace at the beginning and end of strings is removed.
    741 
    742 Going up
    743 --------
    744 
    745 Continuing the "family tree" analogy, every tag and every string has a
    746 `parent`: the tag that contains it.
    747 
    748 .. _.parent:
    749 
    750 ``.parent``
    751 ^^^^^^^^^^^
    752 
    753 You can access an element's parent with the ``.parent`` attribute. In
    754 the example "three sisters" document, the <head> tag is the parent
    755 of the <title> tag::
    756 
    757  title_tag = soup.title
    758  title_tag
    759  # <title>The Dormouse's story</title>
    760  title_tag.parent
    761  # <head><title>The Dormouse's story</title></head>
    762 
    763 The title string itself has a parent: the <title> tag that contains
    764 it::
    765 
    766  title_tag.string.parent
    767  # <title>The Dormouse's story</title>
    768 
    769 The parent of a top-level tag like <html> is the ``BeautifulSoup`` object
    770 itself::
    771 
    772  html_tag = soup.html
    773  type(html_tag.parent)
    774  # <class 'bs4.BeautifulSoup'>
    775 
    776 And the ``.parent`` of a ``BeautifulSoup`` object is defined as None::
    777 
    778  print(soup.parent)
    779  # None
    780 
    781 .. _.parents:
    782 
    783 ``.parents``
    784 ^^^^^^^^^^^^
    785 
    786 You can iterate over all of an element's parents with
    787 ``.parents``. This example uses ``.parents`` to travel from an <a> tag
    788 buried deep within the document, to the very top of the document::
    789 
    790  link = soup.a
    791  link
    792  # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    793  for parent in link.parents:
    794      if parent is None:
    795          print(parent)
    796      else:
    797          print(parent.name)
    798  # p
    799  # body
    800  # html
    801  # [document]
    802  # None
    803 
    804 Going sideways
    805 --------------
    806 
    807 Consider a simple document like this::
    808 
 sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>")
    810  print(sibling_soup.prettify())
    811  # <html>
    812  #  <body>
    813  #   <a>
    814  #    <b>
    815  #     text1
    816  #    </b>
    817  #    <c>
    818  #     text2
    819  #    </c>
    820  #   </a>
    821  #  </body>
    822  # </html>
    823 
    824 The <b> tag and the <c> tag are at the same level: they're both direct
    825 children of the same tag. We call them `siblings`. When a document is
    826 pretty-printed, siblings show up at the same indentation level. You
    827 can also use this relationship in the code you write.
    828 
    829 ``.next_sibling`` and ``.previous_sibling``
    830 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    831 
    832 You can use ``.next_sibling`` and ``.previous_sibling`` to navigate
    833 between page elements that are on the same level of the parse tree::
    834 
    835  sibling_soup.b.next_sibling
    836  # <c>text2</c>
    837 
    838  sibling_soup.c.previous_sibling
    839  # <b>text1</b>
    840 
    841 The <b> tag has a ``.next_sibling``, but no ``.previous_sibling``,
    842 because there's nothing before the <b> tag `on the same level of the
    843 tree`. For the same reason, the <c> tag has a ``.previous_sibling``
    844 but no ``.next_sibling``::
    845 
    846  print(sibling_soup.b.previous_sibling)
    847  # None
    848  print(sibling_soup.c.next_sibling)
    849  # None
    850 
    851 The strings "text1" and "text2" are `not` siblings, because they don't
    852 have the same parent::
    853 
    854  sibling_soup.b.string
    855  # u'text1'
    856 
    857  print(sibling_soup.b.string.next_sibling)
    858  # None
    859 
    860 In real documents, the ``.next_sibling`` or ``.previous_sibling`` of a
    861 tag will usually be a string containing whitespace. Going back to the
    862 "three sisters" document::
    863 
    864  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    865  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    866  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
    867 
    868 You might think that the ``.next_sibling`` of the first <a> tag would
    869 be the second <a> tag. But actually, it's a string: the comma and
    870 newline that separate the first <a> tag from the second::
    871 
    872  link = soup.a
    873  link
    874  # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    875 
    876  link.next_sibling
    877  # u',\n'
    878 
    879 The second <a> tag is actually the ``.next_sibling`` of the comma::
    880 
    881  link.next_sibling.next_sibling
    882  # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    883 
    884 .. _sibling-generators:
    885 
    886 ``.next_siblings`` and ``.previous_siblings``
    887 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    888 
    889 You can iterate over a tag's siblings with ``.next_siblings`` or
    890 ``.previous_siblings``::
    891 
    892  for sibling in soup.a.next_siblings:
    893      print(repr(sibling))
    894  # u',\n'
    895  # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    896  # u' and\n'
    897  # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
 # u';\nand they lived at the bottom of a well.'
    899  # None
    900 
    901  for sibling in soup.find(id="link3").previous_siblings:
    902      print(repr(sibling))
 # u' and\n'
    904  # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    905  # u',\n'
    906  # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    907  # u'Once upon a time there were three little sisters; and their names were\n'
    908  # None
    909 
    910 Going back and forth
    911 --------------------
    912 
    913 Take a look at the beginning of the "three sisters" document::
    914 
    915  <html><head><title>The Dormouse's story</title></head>
    916  <p class="title"><b>The Dormouse's story</b></p>
    917 
    918 An HTML parser takes this string of characters and turns it into a
    919 series of events: "open an <html> tag", "open a <head> tag", "open a
    920 <title> tag", "add a string", "close the <title> tag", "open a <p>
    921 tag", and so on. Beautiful Soup offers tools for reconstructing the
    922 initial parse of the document.
    923 
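The "series of events" above can be made concrete with the standard
library's ``html.parser`` module. This is only an illustration of the
event stream a parser produces, not Beautiful Soup's own machinery:

```python
from html.parser import HTMLParser

# A tiny event logger: prints each parse event as the parser sees it.
class EventLogger(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("open <%s>" % tag)

    def handle_endtag(self, tag):
        print("close </%s>" % tag)

    def handle_data(self, data):
        if data.strip():
            print("string %r" % data.strip())

EventLogger().feed("<head><title>The Dormouse's story</title></head>")
# open <head>
# open <title>
# string "The Dormouse's story"
# close </title>
# close </head>
```

Beautiful Soup listens to events like these and builds the tree as they
arrive; the navigation attributes below let you replay that order.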
    924 .. _element-generators:
    925 
    926 ``.next_element`` and ``.previous_element``
    927 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    928 
    929 The ``.next_element`` attribute of a string or tag points to whatever
    930 was parsed immediately afterwards. It might be the same as
    931 ``.next_sibling``, but it's usually drastically different.
    932 
    933 Here's the final <a> tag in the "three sisters" document. Its
    934 ``.next_sibling`` is a string: the conclusion of the sentence that was
interrupted by the start of the <a> tag::
    936 
    937  last_a_tag = soup.find("a", id="link3")
    938  last_a_tag
    939  # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    940 
    941  last_a_tag.next_sibling
 # u';\nand they lived at the bottom of a well.'
    943 
    944 But the ``.next_element`` of that <a> tag, the thing that was parsed
    945 immediately after the <a> tag, is `not` the rest of that sentence:
    946 it's the word "Tillie"::
    947 
    948  last_a_tag.next_element
    949  # u'Tillie'
    950 
    951 That's because in the original markup, the word "Tillie" appeared
    952 before that semicolon. The parser encountered an <a> tag, then the
    953 word "Tillie", then the closing </a> tag, then the semicolon and rest of
    954 the sentence. The semicolon is on the same level as the <a> tag, but the
    955 word "Tillie" was encountered first.
    956 
    957 The ``.previous_element`` attribute is the exact opposite of
    958 ``.next_element``. It points to whatever element was parsed
    959 immediately before this one::
    960 
    961  last_a_tag.previous_element
    962  # u' and\n'
    963  last_a_tag.previous_element.next_element
    964  # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    965 
    966 ``.next_elements`` and ``.previous_elements``
    967 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    968 
    969 You should get the idea by now. You can use these iterators to move
    970 forward or backward in the document as it was parsed::
    971 
    972  for element in last_a_tag.next_elements:
    973      print(repr(element))
    974  # u'Tillie'
    975  # u';\nand they lived at the bottom of a well.'
    976  # u'\n\n'
    977  # <p class="story">...</p>
    978  # u'...'
    979  # u'\n'
    980  # None
    981 
    982 Searching the tree
    983 ==================
    984 
    985 Beautiful Soup defines a lot of methods for searching the parse tree,
    986 but they're all very similar. I'm going to spend a lot of time explaining
    987 the two most popular methods: ``find()`` and ``find_all()``. The other
    988 methods take almost exactly the same arguments, so I'll just cover
    989 them briefly.
    990 
    991 Once again, I'll be using the "three sisters" document as an example::
    992 
    993  html_doc = """
    994  <html><head><title>The Dormouse's story</title></head>
    995 
    996  <p class="title"><b>The Dormouse's story</b></p>
    997 
    998  <p class="story">Once upon a time there were three little sisters; and their names were
    999  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
   1000  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
   1001  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
   1002  and they lived at the bottom of a well.</p>
   1003 
   1004  <p class="story">...</p>
   1005  """
   1006 
   1007  from bs4 import BeautifulSoup
   1008  soup = BeautifulSoup(html_doc)
   1009 
By passing a filter into a method like ``find_all()``, you can
   1011 zoom in on the parts of the document you're interested in.
   1012 
   1013 Kinds of filters
   1014 ----------------
   1015 
   1016 Before talking in detail about ``find_all()`` and similar methods, I
   1017 want to show examples of different filters you can pass into these
   1018 methods. These filters show up again and again, throughout the
   1019 search API. You can use them to filter based on a tag's name,
   1020 on its attributes, on the text of a string, or on some combination of
   1021 these.
   1022 
   1023 .. _a string:
   1024 
   1025 A string
   1026 ^^^^^^^^
   1027 
   1028 The simplest filter is a string. Pass a string to a search method and
   1029 Beautiful Soup will perform a match against that exact string. This
   1030 code finds all the <b> tags in the document::
   1031 
   1032  soup.find_all('b')
   1033  # [<b>The Dormouse's story</b>]
   1034 
   1035 If you pass in a byte string, Beautiful Soup will assume the string is
   1036 encoded as UTF-8. You can avoid this by passing in a Unicode string instead.
   1037 
   1038 .. _a regular expression:
   1039 
   1040 A regular expression
   1041 ^^^^^^^^^^^^^^^^^^^^
   1042 
If you pass in a regular expression object, Beautiful Soup will filter
against that regular expression using its ``search()`` method. This code
   1045 finds all the tags whose names start with the letter "b"; in this
   1046 case, the <body> tag and the <b> tag::
   1047 
   1048  import re
   1049  for tag in soup.find_all(re.compile("^b")):
   1050      print(tag.name)
   1051  # body
   1052  # b
   1053 
   1054 This code finds all the tags whose names contain the letter 't'::
   1055 
   1056  for tag in soup.find_all(re.compile("t")):
   1057      print(tag.name)
   1058  # html
   1059  # title
   1060 
   1061 .. _a list:
   1062 
   1063 A list
   1064 ^^^^^^
   1065 
   1066 If you pass in a list, Beautiful Soup will allow a string match
   1067 against `any` item in that list. This code finds all the <a> tags
   1068 `and` all the <b> tags::
   1069 
   1070  soup.find_all(["a", "b"])
   1071  # [<b>The Dormouse's story</b>,
   1072  #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
   1073  #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
   1074  #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
   1075 
   1076 .. _the value True:
   1077 
   1078 ``True``
   1079 ^^^^^^^^
   1080 
   1081 The value ``True`` matches everything it can. This code finds `all`
   1082 the tags in the document, but none of the text strings::
   1083 
   1084  for tag in soup.find_all(True):
   1085      print(tag.name)
   1086  # html
   1087  # head
   1088  # title
   1089  # body
   1090  # p
   1091  # b
   1092  # p
   1093  # a
   1094  # a
   1095  # a
   1096  # p
   1097 
.. _a function:
   1099 
   1100 A function
   1101 ^^^^^^^^^^
   1102 
   1103 If none of the other matches work for you, define a function that
   1104 takes an element as its only argument. The function should return
   1105 ``True`` if the argument matches, and ``False`` otherwise.
   1106 
   1107 Here's a function that returns ``True`` if a tag defines the "class"
   1108 attribute but doesn't define the "id" attribute::
   1109 
   1110  def has_class_but_no_id(tag):
   1111      return tag.has_attr('class') and not tag.has_attr('id')
   1112 
   1113 Pass this function into ``find_all()`` and you'll pick up all the <p>
   1114 tags::
   1115 
   1116  soup.find_all(has_class_but_no_id)
   1117  # [<p class="title"><b>The Dormouse's story</b></p>,
   1118  #  <p class="story">Once upon a time there were...</p>,
   1119  #  <p class="story">...</p>]
   1120 
   1121 This function only picks up the <p> tags. It doesn't pick up the <a>
   1122 tags, because those tags define both "class" and "id". It doesn't pick
   1123 up tags like <html> and <title>, because those tags don't define
   1124 "class".
   1125 
   1126 Here's a function that returns ``True`` if a tag is surrounded by
   1127 string objects::
   1128 
   1129  from bs4 import NavigableString
   1130  def surrounded_by_strings(tag):
   1131      return (isinstance(tag.next_element, NavigableString)
   1132              and isinstance(tag.previous_element, NavigableString))
   1133 
   1134  for tag in soup.find_all(surrounded_by_strings):
     print(tag.name)
   1136  # p
   1137  # a
   1138  # a
   1139  # a
   1140  # p
   1141 
   1142 Now we're ready to look at the search methods in detail.
   1143 
   1144 ``find_all()``
   1145 --------------
   1146 
   1147 Signature: find_all(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`recursive
   1148 <recursive>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
   1149 
   1150 The ``find_all()`` method looks through a tag's descendants and
   1151 retrieves `all` descendants that match your filters. I gave several
   1152 examples in `Kinds of filters`_, but here are a few more::
   1153 
   1154  soup.find_all("title")
   1155  # [<title>The Dormouse's story</title>]
   1156 
   1157  soup.find_all("p", "title")
   1158  # [<p class="title"><b>The Dormouse's story</b></p>]
   1159 
   1160  soup.find_all("a")
   1161  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
   1162  #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
   1163  #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
   1164 
   1165  soup.find_all(id="link2")
   1166  # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
   1167 
   1168  import re
   1169  soup.find(text=re.compile("sisters"))
   1170  # u'Once upon a time there were three little sisters; and their names were\n'
   1171 
   1172 Some of these should look familiar, but others are new. What does it
   1173 mean to pass in a value for ``text``, or ``id``? Why does
   1174 ``find_all("p", "title")`` find a <p> tag with the CSS class "title"?
   1175 Let's look at the arguments to ``find_all()``.
   1176 
   1177 .. _name:
   1178 
   1179 The ``name`` argument
   1180 ^^^^^^^^^^^^^^^^^^^^^
   1181 
   1182 Pass in a value for ``name`` and you'll tell Beautiful Soup to only
   1183 consider tags with certain names. Text strings will be ignored, as
will tags whose names don't match.
   1185 
   1186 This is the simplest usage::
   1187 
   1188  soup.find_all("title")
   1189  # [<title>The Dormouse's story</title>]
   1190 
   1191 Recall from `Kinds of filters`_ that the value to ``name`` can be `a
   1192 string`_, `a regular expression`_, `a list`_, `a function`_, or `the value
   1193 True`_.
   1194 
   1195 .. _kwargs:
   1196 
   1197 The keyword arguments
   1198 ^^^^^^^^^^^^^^^^^^^^^
   1199 
   1200 Any argument that's not recognized will be turned into a filter on one
   1201 of a tag's attributes. If you pass in a value for an argument called ``id``,
   1202 Beautiful Soup will filter against each tag's 'id' attribute::
   1203 
   1204  soup.find_all(id='link2')
   1205  # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
   1206 
   1207 If you pass in a value for ``href``, Beautiful Soup will filter
   1208 against each tag's 'href' attribute::
   1209 
   1210  soup.find_all(href=re.compile("elsie"))
   1211  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
   1212 
   1213 You can filter an attribute based on `a string`_, `a regular
   1214 expression`_, `a list`_, `a function`_, or `the value True`_.
   1215 
   1216 This code finds all tags whose ``id`` attribute has a value,
   1217 regardless of what the value is::
   1218 
   1219  soup.find_all(id=True)
   1220  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
   1221  #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
   1222  #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
   1223 
   1224 You can filter multiple attributes at once by passing in more than one
   1225 keyword argument::
   1226 
   1227  soup.find_all(href=re.compile("elsie"), id='link1')
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
   1229 
   1230 Some attributes, like the data-* attributes in HTML 5, have names that
   1231 can't be used as the names of keyword arguments::
   1232 
   1233  data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
   1234  data_soup.find_all(data-foo="value")
   1235  # SyntaxError: keyword can't be an expression
   1236 
   1237 You can use these attributes in searches by putting them into a
   1238 dictionary and passing the dictionary into ``find_all()`` as the
   1239 ``attrs`` argument::
   1240 
   1241  data_soup.find_all(attrs={"data-foo": "value"})
   1242  # [<div data-foo="value">foo!</div>]
   1243 
   1244 .. _attrs:
   1245 
   1246 Searching by CSS class
   1247 ^^^^^^^^^^^^^^^^^^^^^^
   1248 
   1249 It's very useful to search for a tag that has a certain CSS class, but
   1250 the name of the CSS attribute, "class", is a reserved word in
   1251 Python. Using ``class`` as a keyword argument will give you a syntax
   1252 error. As of Beautiful Soup 4.1.2, you can search by CSS class using
   1253 the keyword argument ``class_``::
   1254 
   1255  soup.find_all("a", class_="sister")
   1256  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
   1257  #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
   1258  #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
   1259 
   1260 As with any keyword argument, you can pass ``class_`` a string, a regular
   1261 expression, a function, or ``True``::
   1262 
   1263  soup.find_all(class_=re.compile("itl"))
   1264  # [<p class="title"><b>The Dormouse's story</b></p>]
   1265 
   1266  def has_six_characters(css_class):
   1267      return css_class is not None and len(css_class) == 6
   1268 
   1269  soup.find_all(class_=has_six_characters)
   1270  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
   1271  #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
   1272  #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
   1273 
   1274 :ref:`Remember <multivalue>` that a single tag can have multiple
   1275 values for its "class" attribute. When you search for a tag that
   1276 matches a certain CSS class, you're matching against `any` of its CSS
   1277 classes::
   1278 
   1279  css_soup = BeautifulSoup('<p class="body strikeout"></p>')
   1280  css_soup.find_all("p", class_="strikeout")
   1281  # [<p class="body strikeout"></p>]
   1282 
   1283  css_soup.find_all("p", class_="body")
   1284  # [<p class="body strikeout"></p>]
   1285 
   1286 You can also search for the exact string value of the ``class`` attribute::
   1287 
   1288  css_soup.find_all("p", class_="body strikeout")
   1289  # [<p class="body strikeout"></p>]
   1290 
   1291 But searching for variants of the string value won't work::
   1292 
   1293  css_soup.find_all("p", class_="strikeout body")
   1294  # []
   1295 
   1296 If you want to search for tags that match two or more CSS classes, you
   1297 should use a CSS selector::
   1298 
   1299  css_soup.select("p.strikeout.body")
   1300  # [<p class="body strikeout"></p>]
   1301 
   1302 In older versions of Beautiful Soup, which don't have the ``class_``
   1303 shortcut, you can use the ``attrs`` trick mentioned above. Create a
   1304 dictionary whose value for "class" is the string (or regular
   1305 expression, or whatever) you want to search for::
   1306 
   1307  soup.find_all("a", attrs={"class": "sister"})
   1308  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
   1309  #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
   1310  #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
   1311 
   1312 .. _text:
   1313 
   1314 The ``text`` argument
   1315 ^^^^^^^^^^^^^^^^^^^^^
   1316 
   1317 With ``text`` you can search for strings instead of tags. As with
   1318 ``name`` and the keyword arguments, you can pass in `a string`_, `a
   1319 regular expression`_, `a list`_, `a function`_, or `the value True`_.
   1320 Here are some examples::
   1321 
   1322  soup.find_all(text="Elsie")
   1323  # [u'Elsie']
   1324 
   1325  soup.find_all(text=["Tillie", "Elsie", "Lacie"])
   1326  # [u'Elsie', u'Lacie', u'Tillie']
   1327 
   1328  soup.find_all(text=re.compile("Dormouse"))
 # [u"The Dormouse's story", u"The Dormouse's story"]
   1330 
   1331  def is_the_only_string_within_a_tag(s):
   1332      """Return True if this string is the only child of its parent tag."""
   1333      return (s == s.parent.string)
   1334 
   1335  soup.find_all(text=is_the_only_string_within_a_tag)
   1336  # [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']
   1337 
   1338 Although ``text`` is for finding strings, you can combine it with
   1339 arguments that find tags: Beautiful Soup will find all tags whose
   1340 ``.string`` matches your value for ``text``. This code finds the <a>
   1341 tags whose ``.string`` is "Elsie"::
   1342 
   1343  soup.find_all("a", text="Elsie")
   1344  # [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
   1345 
   1346 .. _limit:
   1347 
   1348 The ``limit`` argument
   1349 ^^^^^^^^^^^^^^^^^^^^^^
   1350 
   1351 ``find_all()`` returns all the tags and strings that match your
   1352 filters. This can take a while if the document is large. If you don't
   1353 need `all` the results, you can pass in a number for ``limit``. This
   1354 works just like the LIMIT keyword in SQL. It tells Beautiful Soup to
   1355 stop gathering results after it's found a certain number.
   1356 
   1357 There are three links in the "three sisters" document, but this code
   1358 only finds the first two::
   1359 
   1360  soup.find_all("a", limit=2)
   1361  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
   1362  #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
   1363 
   1364 .. _recursive:
   1365 
   1366 The ``recursive`` argument
   1367 ^^^^^^^^^^^^^^^^^^^^^^^^^^
   1368 
   1369 If you call ``mytag.find_all()``, Beautiful Soup will examine all the
   1370 descendants of ``mytag``: its children, its children's children, and
   1371 so on. If you only want Beautiful Soup to consider direct children,
   1372 you can pass in ``recursive=False``. See the difference here::
   1373 
   1374  soup.html.find_all("title")
   1375  # [<title>The Dormouse's story</title>]
   1376 
   1377  soup.html.find_all("title", recursive=False)
   1378  # []
   1379 
   1380 Here's that part of the document::
   1381 
   1382  <html>
   1383   <head>
   1384    <title>
   1385     The Dormouse's story
   1386    </title>
   1387   </head>
   1388  ...
   1389 
   1390 The <title> tag is beneath the <html> tag, but it's not `directly`
   1391 beneath the <html> tag: the <head> tag is in the way. Beautiful Soup
   1392 finds the <title> tag when it's allowed to look at all descendants of
   1393 the <html> tag, but when ``recursive=False`` restricts it to the
   1394 <html> tag's immediate children, it finds nothing.
   1395 
   1396 Beautiful Soup offers a lot of tree-searching methods (covered below),
   1397 and they mostly take the same arguments as ``find_all()``: ``name``,
   1398 ``attrs``, ``text``, ``limit``, and the keyword arguments. But the
   1399 ``recursive`` argument is different: ``find_all()`` and ``find()`` are
   1400 the only methods that support it. Passing ``recursive=False`` into a
   1401 method like ``find_parents()`` wouldn't be very useful.
   1402 
   1403 Calling a tag is like calling ``find_all()``
   1404 --------------------------------------------
   1405 
   1406 Because ``find_all()`` is the most popular method in the Beautiful
   1407 Soup search API, you can use a shortcut for it. If you treat the
   1408 ``BeautifulSoup`` object or a ``Tag`` object as though it were a
   1409 function, then it's the same as calling ``find_all()`` on that
   1410 object. These two lines of code are equivalent::
   1411 
   1412  soup.find_all("a")
   1413  soup("a")
   1414 
   1415 These two lines are also equivalent::
   1416 
   1417  soup.title.find_all(text=True)
   1418  soup.title(text=True)
   1419 
   1420 ``find()``
   1421 ----------
   1422 
   1423 Signature: find(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`recursive
   1424 <recursive>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)
   1425 
   1426 The ``find_all()`` method scans the entire document looking for
   1427 results, but sometimes you only want to find one result. If you know a
   1428 document only has one <body> tag, it's a waste of time to scan the
   1429 entire document looking for more. Rather than passing in ``limit=1``
   1430 every time you call ``find_all``, you can use the ``find()``
   1431 method. These two lines of code are `nearly` equivalent::
   1432 
   1433  soup.find_all('title', limit=1)
   1434  # [<title>The Dormouse's story</title>]
   1435 
   1436  soup.find('title')
   1437  # <title>The Dormouse's story</title>
   1438 
   1439 The only difference is that ``find_all()`` returns a list containing
   1440 the single result, and ``find()`` just returns the result.
   1441 
   1442 If ``find_all()`` can't find anything, it returns an empty list. If
   1443 ``find()`` can't find anything, it returns ``None``::
   1444 
   1445  print(soup.find("nosuchtag"))
   1446  # None
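
Because ``find()`` can return ``None``, chaining attribute access on its
result can raise ``AttributeError``. A minimal defensive sketch (passing
the built-in ``html.parser`` here so it runs without extra dependencies):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p>Text</p></body></html>", "html.parser")

# soup.find("nosuchtag").get_text() would raise AttributeError,
# so check for None before using the result.
tag = soup.find("nosuchtag")
if tag is not None:
    print(tag.get_text())
else:
    print("tag not found")
# tag not found
```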
   1447 
   1448 Remember the ``soup.head.title`` trick from `Navigating using tag
   1449 names`_? That trick works by repeatedly calling ``find()``::
   1450 
   1451  soup.head.title
   1452  # <title>The Dormouse's story</title>
   1453 
   1454  soup.find("head").find("title")
   1455  # <title>The Dormouse's story</title>
   1456 
   1457 ``find_parents()`` and ``find_parent()``
   1458 ----------------------------------------
   1459 
   1460 Signature: find_parents(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
   1461 
   1462 Signature: find_parent(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)
   1463 
   1464 I spent a lot of time above covering ``find_all()`` and
   1465 ``find()``. The Beautiful Soup API defines ten other methods for
   1466 searching the tree, but don't be afraid. Five of these methods are
   1467 basically the same as ``find_all()``, and the other five are basically
   1468 the same as ``find()``. The only differences are in what parts of the
   1469 tree they search.
   1470 
   1471 First let's consider ``find_parents()`` and
   1472 ``find_parent()``. Remember that ``find_all()`` and ``find()`` work
their way down the tree, looking at a tag's descendants. These methods
   1474 do the opposite: they work their way `up` the tree, looking at a tag's
   1475 (or a string's) parents. Let's try them out, starting from a string
buried deep in the "three sisters" document::
   1477 
   1478   a_string = soup.find(text="Lacie")
   1479   a_string
   1480   # u'Lacie'
   1481 
   1482   a_string.find_parents("a")
   1483   # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
   1484 
   1485   a_string.find_parent("p")
   1486   # <p class="story">Once upon a time there were three little sisters; and their names were
   1487   #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
   1488   #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
   1489   #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
   1490   #  and they lived at the bottom of a well.</p>
   1491 
  a_string.find_parents("p", class_="title")
   1493   # []
   1494 
   1495 One of the three <a> tags is the direct parent of the string in
   1496 question, so our search finds it. One of the three <p> tags is an
   1497 indirect parent of the string, and our search finds that as
   1498 well. There's a <p> tag with the CSS class "title" `somewhere` in the
   1499 document, but it's not one of this string's parents, so we can't find
   1500 it with ``find_parents()``.
   1501 
   1502 You may have made the connection between ``find_parent()`` and
   1503 ``find_parents()``, and the `.parent`_ and `.parents`_ attributes
   1504 mentioned earlier. The connection is very strong. These search methods
   1505 actually use ``.parents`` to iterate over all the parents, and check
   1506 each one against the provided filter to see if it matches.
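
As a rough sketch of that connection (an illustration, not the
library's actual implementation, and ``find_parent_by_name`` is a name
made up for this example), matching a parent by name is just a loop
over ``.parents``:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="story"><a id="link2">Lacie</a></p>', "html.parser")

def find_parent_by_name(element, name):
    # Walk up the tree; return the first parent whose tag name matches.
    for parent in element.parents:
        if parent.name == name:
            return parent
    return None

a_string = soup.find(text="Lacie")
print(find_parent_by_name(a_string, "p"))
# <p class="story"><a id="link2">Lacie</a></p>
```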
   1507 
   1508 ``find_next_siblings()`` and ``find_next_sibling()``
   1509 ----------------------------------------------------
   1510 
   1511 Signature: find_next_siblings(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
   1512 
   1513 Signature: find_next_sibling(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)
   1514 
   1515 These methods use :ref:`.next_siblings <sibling-generators>` to
   1516 iterate over the rest of an element's siblings in the tree. The
   1517 ``find_next_siblings()`` method returns all the siblings that match,
   1518 and ``find_next_sibling()`` only returns the first one::
   1519 
   1520  first_link = soup.a
   1521  first_link
   1522  # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
   1523 
   1524  first_link.find_next_siblings("a")
   1525  # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
   1526  #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
   1527 
   1528  first_story_paragraph = soup.find("p", "story")
   1529  first_story_paragraph.find_next_sibling("p")
   1530  # <p class="story">...</p>
   1531 
   1532 ``find_previous_siblings()`` and ``find_previous_sibling()``
   1533 ------------------------------------------------------------
   1534 
   1535 Signature: find_previous_siblings(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
   1536 
   1537 Signature: find_previous_sibling(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)
   1538 
   1539 These methods use :ref:`.previous_siblings <sibling-generators>` to iterate over an element's
   1540 siblings that precede it in the tree. The ``find_previous_siblings()``
   1541 method returns all the siblings that match, and
   1542 ``find_previous_sibling()`` only returns the first one::
   1543 
   1544  last_link = soup.find("a", id="link3")
   1545  last_link
   1546  # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
   1547 
   1548  last_link.find_previous_siblings("a")
   1549  # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
   1550  #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
   1551 
   1552  first_story_paragraph = soup.find("p", "story")
   1553  first_story_paragraph.find_previous_sibling("p")
   1554  # <p class="title"><b>The Dormouse's story</b></p>
   1555 
   1556 
   1557 ``find_all_next()`` and ``find_next()``
   1558 ---------------------------------------
   1559 
   1560 Signature: find_all_next(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
   1561 
   1562 Signature: find_next(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)
   1563 
   1564 These methods use :ref:`.next_elements <element-generators>` to
   1565 iterate over the tags and strings that come after an element in the
   1566 document. The ``find_all_next()`` method returns all matches, and
   1567 ``find_next()`` only returns the first match::
   1568 
   1569  first_link = soup.a
   1570  first_link
   1571  # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
   1572 
   1573  first_link.find_all_next(text=True)
   1574  # [u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
   1575  #  u';\nand they lived at the bottom of a well.', u'\n\n', u'...', u'\n']
   1576 
   1577  first_link.find_next("p")
   1578  # <p class="story">...</p>
   1579 
   1580 In the first example, the string "Elsie" showed up, even though it was
   1581 contained within the <a> tag we started from. In the second example,
   1582 the last <p> tag in the document showed up, even though it's not in
   1583 the same part of the tree as the <a> tag we started from. For these
   1584 methods, all that matters is that an element match the filter, and
   1585 show up later in the document than the starting element.
   1586 
   1587 ``find_all_previous()`` and ``find_previous()``
   1588 -----------------------------------------------
   1589 
   1590 Signature: find_all_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
   1591 
   1592 Signature: find_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)
   1593 
   1594 These methods use :ref:`.previous_elements <element-generators>` to
   1595 iterate over the tags and strings that came before it in the
   1596 document. The ``find_all_previous()`` method returns all matches, and
   1597 ``find_previous()`` only returns the first match::
   1598 
   1599  first_link = soup.a
   1600  first_link
   1601  # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
   1602 
   1603  first_link.find_all_previous("p")
   1604  # [<p class="story">Once upon a time there were three little sisters; ...</p>,
   1605  #  <p class="title"><b>The Dormouse's story</b></p>]
   1606 
   1607  first_link.find_previous("title")
   1608  # <title>The Dormouse's story</title>
   1609 
   1610 The call to ``find_all_previous("p")`` found the first paragraph in
   1611 the document (the one with class="title"), but it also found the
   1612 second paragraph, the <p> tag that contains the <a> tag we started
   1613 with. This shouldn't be too surprising: we're looking at all the tags
   1614 that show up earlier in the document than the one we started with. A
   1615 <p> tag that contains an <a> tag must have shown up before the <a>
   1616 tag it contains.
   1617 
   1618 CSS selectors
   1619 -------------
   1620 
   1621 Beautiful Soup supports the most commonly-used `CSS selectors
   1622 <http://www.w3.org/TR/CSS2/selector.html>`_. Just pass a string into
   1623 the ``.select()`` method of a ``Tag`` object or the ``BeautifulSoup``
   1624 object itself.
   1625 
   1626 You can find tags::
   1627 
   1628  soup.select("title")
   1629  # [<title>The Dormouse's story</title>]
   1630 
   1631  soup.select("p:nth-of-type(3)")
   1632  # [<p class="story">...</p>]
   1633 
   1634 Find tags beneath other tags::
   1635 
   1636  soup.select("body a")
   1637  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
   1638  #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
   1639  #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
   1640 
   1641  soup.select("html head title")
   1642  # [<title>The Dormouse's story</title>]
   1643 
   1644 Find tags `directly` beneath other tags::
   1645 
   1646  soup.select("head > title")
   1647  # [<title>The Dormouse's story</title>]
   1648 
   1649  soup.select("p > a")
   1650  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
   1651  #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
   1652  #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
   1653 
   1654  soup.select("p > a:nth-of-type(2)")
   1655  # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
   1656 
   1657  soup.select("p > #link1")
   1658  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
   1659 
   1660  soup.select("body > a")
   1661  # []
   1662 
   1663 Find the siblings of tags::
   1664 
   1665  soup.select("#link1 ~ .sister")
   1666  # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
   1667  #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
   1668 
   1669  soup.select("#link1 + .sister")
   1670  # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
   1671 
   1672 Find tags by CSS class::
   1673 
   1674  soup.select(".sister")
   1675  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
   1676  #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
   1677  #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
   1678 
   1679  soup.select("[class~=sister]")
   1680  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
   1681  #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
   1682  #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
   1683 
   1684 Find tags by ID::
   1685 
   1686  soup.select("#link1")
   1687  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
   1688 
   1689  soup.select("a#link2")
   1690  # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
   1691 
   1692 Test for the existence of an attribute::
   1693 
   1694  soup.select('a[href]')
   1695  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
   1696  #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
   1697  #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
   1698 
   1699 Find tags by attribute value::
   1700 
   1701  soup.select('a[href="http://example.com/elsie"]')
   1702  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
   1703 
   1704  soup.select('a[href^="http://example.com/"]')
   1705  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
   1706  #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
   1707  #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
   1708 
   1709  soup.select('a[href$="tillie"]')
   1710  # [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
   1711 
   1712  soup.select('a[href*=".com/el"]')
   1713  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
   1714 
   1715 Match language codes::
   1716 
   1717  multilingual_markup = """
   1718   <p lang="en">Hello</p>
   1719   <p lang="en-us">Howdy, y'all</p>
   1720   <p lang="en-gb">Pip-pip, old fruit</p>
   1721   <p lang="fr">Bonjour mes amis</p>
   1722  """
   1723  multilingual_soup = BeautifulSoup(multilingual_markup)
   1724  multilingual_soup.select('p[lang|=en]')
   1725  # [<p lang="en">Hello</p>,
   1726  #  <p lang="en-us">Howdy, y'all</p>,
   1727  #  <p lang="en-gb">Pip-pip, old fruit</p>]
   1728 
   1729 This is a convenience for users who know the CSS selector syntax. You
   1730 can do all this stuff with the Beautiful Soup API. And if CSS
   1731 selectors are all you need, you might as well use lxml directly,
   1732 because it's faster. But this lets you `combine` simple CSS selectors
   1733 with the Beautiful Soup API.
   1734 
   1735 
   1736 Modifying the tree
   1737 ==================
   1738 
   1739 Beautiful Soup's main strength is in searching the parse tree, but you
   1740 can also modify the tree and write your changes as a new HTML or XML
   1741 document.
   1742 
   1743 Changing tag names and attributes
   1744 ---------------------------------
   1745 
   1746 I covered this earlier, in `Attributes`_, but it bears repeating. You
   1747 can rename a tag, change the values of its attributes, add new
   1748 attributes, and delete attributes::
   1749 
   1750  soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
   1751  tag = soup.b
   1752 
   1753  tag.name = "blockquote"
   1754  tag['class'] = 'verybold'
   1755  tag['id'] = 1
   1756  tag
   1757  # <blockquote class="verybold" id="1">Extremely bold</blockquote>
   1758 
   1759  del tag['class']
   1760  del tag['id']
   1761  tag
   1762  # <blockquote>Extremely bold</blockquote>
   1763 
   1764 
   1765 Modifying ``.string``
   1766 ---------------------
   1767 
   1768 If you set a tag's ``.string`` attribute, the tag's contents are
   1769 replaced with the string you give::
   1770 
   1771   markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
   1772   soup = BeautifulSoup(markup)
   1773 
   1774   tag = soup.a
   1775   tag.string = "New link text."
   1776   tag
   1777   # <a href="http://example.com/">New link text.</a>
   1778 
   1779 Be careful: if the tag contained other tags, they and all their
   1780 contents will be destroyed.
   1781 
   1782 ``append()``
   1783 ------------
   1784 
   1785 You can add to a tag's contents with ``Tag.append()``. It works just
   1786 like calling ``.append()`` on a Python list::
   1787 
   1788    soup = BeautifulSoup("<a>Foo</a>")
   1789    soup.a.append("Bar")
   1790 
   1791    soup
   1792    # <html><head></head><body><a>FooBar</a></body></html>
   1793    soup.a.contents
   1794    # [u'Foo', u'Bar']
   1795 
   1796 ``BeautifulSoup.new_string()`` and ``.new_tag()``
   1797 -------------------------------------------------
   1798 
   1799 If you need to add a string to a document, no problem--you can pass a
   1800 Python string in to ``append()``, or you can call the factory method
   1801 ``BeautifulSoup.new_string()``::
   1802 
   1803    soup = BeautifulSoup("<b></b>")
   1804    tag = soup.b
   1805    tag.append("Hello")
   1806    new_string = soup.new_string(" there")
   1807    tag.append(new_string)
   1808    tag
   1809    # <b>Hello there</b>
   1810    tag.contents
   1811    # [u'Hello', u' there']
   1812 
   1813 If you want to create a comment or some other subclass of
   1814 ``NavigableString``, pass that class as the second argument to
   1815 ``new_string()``::
   1816 
   1817    from bs4 import Comment
   1818    new_comment = soup.new_string("Nice to see you.", Comment)
   1819    tag.append(new_comment)
   1820    tag
   1821    # <b>Hello there<!--Nice to see you.--></b>
   1822    tag.contents
   1823    # [u'Hello', u' there', u'Nice to see you.']
   1824 
   1825 (This is a new feature in Beautiful Soup 4.2.1.)
   1826 
   1827 What if you need to create a whole new tag?  The best solution is to
   1828 call the factory method ``BeautifulSoup.new_tag()``::
   1829 
   1830    soup = BeautifulSoup("<b></b>")
   1831    original_tag = soup.b
   1832 
   1833    new_tag = soup.new_tag("a", href="http://www.example.com")
   1834    original_tag.append(new_tag)
   1835    original_tag
   1836    # <b><a href="http://www.example.com"></a></b>
   1837 
   1838    new_tag.string = "Link text."
   1839    original_tag
   1840    # <b><a href="http://www.example.com">Link text.</a></b>
   1841 
   1842 Only the first argument, the tag name, is required.
   1843 
   1844 ``insert()``
   1845 ------------
   1846 
   1847 ``Tag.insert()`` is just like ``Tag.append()``, except the new element
   1848 doesn't necessarily go at the end of its parent's
   1849 ``.contents``. It'll be inserted at whatever numeric position you
   1850 say. It works just like ``.insert()`` on a Python list::
   1851 
   1852   markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
   1853   soup = BeautifulSoup(markup)
   1854   tag = soup.a
   1855 
   1856   tag.insert(1, "but did not endorse ")
   1857   tag
   1858   # <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
   1859   tag.contents
   1860   # [u'I linked to ', u'but did not endorse ', <i>example.com</i>]
   1861 
   1862 ``insert_before()`` and ``insert_after()``
   1863 ------------------------------------------
   1864 
   1865 The ``insert_before()`` method inserts a tag or string immediately
   1866 before something else in the parse tree::
   1867 
   1868    soup = BeautifulSoup("<b>stop</b>")
   1869    tag = soup.new_tag("i")
   1870    tag.string = "Don't"
   1871    soup.b.string.insert_before(tag)
   1872    soup.b
   1873    # <b><i>Don't</i>stop</b>
   1874 
   1875 The ``insert_after()`` method moves a tag or string so that it
   1876 immediately follows something else in the parse tree::
   1877 
   1878    soup.b.i.insert_after(soup.new_string(" ever "))
   1879    soup.b
   1880    # <b><i>Don't</i> ever stop</b>
   1881    soup.b.contents
   1882    # [<i>Don't</i>, u' ever ', u'stop']
   1883 
   1884 ``clear()``
   1885 -----------
   1886 
   1887 ``Tag.clear()`` removes the contents of a tag::
   1888 
   1889   markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
   1890   soup = BeautifulSoup(markup)
   1891   tag = soup.a
   1892 
   1893   tag.clear()
   1894   tag
   1895   # <a href="http://example.com/"></a>
   1896 
   1897 ``extract()``
   1898 -------------
   1899 
   1900 ``PageElement.extract()`` removes a tag or string from the tree. It
   1901 returns the tag or string that was extracted::
   1902 
   1903   markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
   1904   soup = BeautifulSoup(markup)
   1905   a_tag = soup.a
   1906 
   1907   i_tag = soup.i.extract()
   1908 
   1909   a_tag
   1910   # <a href="http://example.com/">I linked to</a>
   1911 
   1912   i_tag
   1913   # <i>example.com</i>
   1914 
   1915   print(i_tag.parent)
   1916   # None
   1917 
   1918 At this point you effectively have two parse trees: one rooted at the
   1919 ``BeautifulSoup`` object you used to parse the document, and one rooted
   1920 at the tag that was extracted. You can go on to call ``extract`` on
   1921 a child of the element you extracted::
   1922 
   1923   my_string = i_tag.string.extract()
   1924   my_string
   1925   # u'example.com'
   1926 
   1927   print(my_string.parent)
   1928   # None
   1929   i_tag
   1930   # <i></i>
   1931 
   1932 
   1933 ``decompose()``
   1934 ---------------
   1935 
   1936 ``Tag.decompose()`` removes a tag from the tree, then `completely
   1937 destroys it and its contents`::
   1938 
   1939   markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
   1940   soup = BeautifulSoup(markup)
   1941   a_tag = soup.a
   1942 
   1943   soup.i.decompose()
   1944 
   1945   a_tag
   1946   # <a href="http://example.com/">I linked to</a>
   1947 
   1948 
   1949 .. _replace_with:
   1950 
   1951 ``replace_with()``
   1952 ------------------
   1953 
   1954 ``PageElement.replace_with()`` removes a tag or string from the tree,
   1955 and replaces it with the tag or string of your choice::
   1956 
   1957   markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
   1958   soup = BeautifulSoup(markup)
   1959   a_tag = soup.a
   1960 
   1961   new_tag = soup.new_tag("b")
   1962   new_tag.string = "example.net"
   1963   a_tag.i.replace_with(new_tag)
   1964 
   1965   a_tag
   1966   # <a href="http://example.com/">I linked to <b>example.net</b></a>
   1967 
   1968 ``replace_with()`` returns the tag or string that was replaced, so
   1969 that you can examine it or add it back to another part of the tree.
   1970 
   1971 ``wrap()``
   1972 ----------
   1973 
   1974 ``PageElement.wrap()`` wraps an element in the tag you specify. It
   1975 returns the new wrapper::
   1976 
   1977  soup = BeautifulSoup("<p>I wish I was bold.</p>")
   1978  soup.p.string.wrap(soup.new_tag("b"))
   1979  # <b>I wish I was bold.</b>
   1980 
   1981  soup.p.wrap(soup.new_tag("div"))
   1982  # <div><p><b>I wish I was bold.</b></p></div>
   1983 
   1984 This method is new in Beautiful Soup 4.0.5.
   1985 
   1986 ``unwrap()``
   1987 ---------------------------
   1988 
   1989 ``Tag.unwrap()`` is the opposite of ``wrap()``. It replaces a tag with
   1990 whatever's inside that tag. It's good for stripping out markup::
   1991 
   1992   markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
   1993   soup = BeautifulSoup(markup)
   1994   a_tag = soup.a
   1995 
   1996   a_tag.i.unwrap()
   1997   a_tag
   1998   # <a href="http://example.com/">I linked to example.com</a>
   1999 
   2000 Like ``replace_with()``, ``unwrap()`` returns the tag
   2001 that was replaced.
   2002 
   2003 Output
   2004 ======
   2005 
   2006 .. _.prettyprinting:
   2007 
   2008 Pretty-printing
   2009 ---------------
   2010 
   2011 The ``prettify()`` method will turn a Beautiful Soup parse tree into a
   2012 nicely formatted Unicode string, with each HTML/XML tag on its own line::
   2013 
   2014   markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
   2015   soup = BeautifulSoup(markup)
   2016   soup.prettify()
   2017   # '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'
   2018 
   2019   print(soup.prettify())
   2020   # <html>
   2021   #  <head>
   2022   #  </head>
   2023   #  <body>
   2024   #   <a href="http://example.com/">
   2025   #    I linked to
   2026   #    <i>
   2027   #     example.com
   2028   #    </i>
   2029   #   </a>
   2030   #  </body>
   2031   # </html>
   2032 
   2033 You can call ``prettify()`` on the top-level ``BeautifulSoup`` object,
   2034 or on any of its ``Tag`` objects::
   2035 
   2036   print(soup.a.prettify())
   2037   # <a href="http://example.com/">
   2038   #  I linked to
   2039   #  <i>
   2040   #   example.com
   2041   #  </i>
   2042   # </a>
   2043 
   2044 Non-pretty printing
   2045 -------------------
   2046 
   2047 If you just want a string, with no fancy formatting, you can call
   2048 ``unicode()`` or ``str()`` on a ``BeautifulSoup`` object, or a ``Tag``
   2049 within it::
   2050 
   2051  str(soup)
   2052  # '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'
   2053 
   2054  unicode(soup.a)
   2055  # u'<a href="http://example.com/">I linked to <i>example.com</i></a>'
   2056 
   2057 The ``str()`` function returns a string encoded in UTF-8. See
   2058 `Encodings`_ for other options.
   2059 
   2060 You can also call ``encode()`` to get a bytestring, and ``decode()``
   2061 to get Unicode.
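
The round trip itself is just UTF-8 encoding and decoding. Here's a sketch using an ordinary Python string rather than a real soup object, to show the kind of values ``encode()`` and ``decode()`` hand back:

```python
# Plain-string sketch of the bytes/text round trip, not a real soup object.
html_text = u'<a href="http://example.com/">I linked to <i>example.com</i></a>'

as_bytes = html_text.encode("utf-8")   # a UTF-8 bytestring, like encode()
as_text = as_bytes.decode("utf-8")     # back to Unicode, like decode()

print(isinstance(as_bytes, bytes))  # True
print(as_text == html_text)         # True
```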
   2062 
   2063 .. _output_formatters:
   2064 
   2065 Output formatters
   2066 -----------------
   2067 
   2068 If you give Beautiful Soup a document that contains HTML entities like
   2069 "&ldquo;", they'll be converted to Unicode characters::
   2070 
   2071  soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.")
   2072  unicode(soup)
   2073  # u'<html><head></head><body>\u201cDammit!\u201d he said.</body></html>'
   2074 
   2075 If you then convert the document to a string, the Unicode characters
   2076 will be encoded as UTF-8. You won't get the HTML entities back::
   2077 
   2078  str(soup)
   2079  # '<html><head></head><body>\xe2\x80\x9cDammit!\xe2\x80\x9d he said.</body></html>'
   2080 
   2081 By default, the only characters that are escaped upon output are bare
   2082 ampersands and angle brackets. These get turned into "&amp;", "&lt;",
   2083 and "&gt;", so that Beautiful Soup doesn't inadvertently generate
   2084 invalid HTML or XML::
   2085 
   2086  soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, & Howe</p>")
   2087  soup.p
   2088  # <p>The law firm of Dewey, Cheatem, &amp; Howe</p>
   2089 
   2090  soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
   2091  soup.a
   2092  # <a href="http://example.com/?foo=val1&bar=val2">A link</a>
   2093 
   2094 You can change this behavior by providing a value for the
   2095 ``formatter`` argument to ``prettify()``, ``encode()``, or
   2096 ``decode()``. Beautiful Soup recognizes four possible values for
   2097 ``formatter``.
   2098 
   2099 The default is ``formatter="minimal"``. Strings will only be processed
   2100 enough to ensure that Beautiful Soup generates valid HTML/XML::
   2101 
   2102  french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
   2103  soup = BeautifulSoup(french)
   2104  print(soup.prettify(formatter="minimal"))
   2105  # <html>
   2106  #  <body>
   2107  #   <p>
   2108  #    Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
   2109  #   </p>
   2110  #  </body>
   2111  # </html>
   2112 
   2113 If you pass in ``formatter="html"``, Beautiful Soup will convert
   2114 Unicode characters to HTML entities whenever possible::
   2115 
   2116  print(soup.prettify(formatter="html"))
   2117  # <html>
   2118  #  <body>
   2119  #   <p>
   2120  #    Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
   2121  #   </p>
   2122  #  </body>
   2123  # </html>
   2124 
   2125 If you pass in ``formatter=None``, Beautiful Soup will not modify
   2126 strings at all on output. This is the fastest option, but it may lead
   2127 to Beautiful Soup generating invalid HTML/XML, as in these examples::
   2128 
   2129  print(soup.prettify(formatter=None))
   2130  # <html>
   2131  #  <body>
   2132  #   <p>
   2133  #    Il a dit <<Sacré bleu!>>
   2134  #   </p>
   2135  #  </body>
   2136  # </html>
   2137 
   2138  link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
   2139  print(link_soup.a.encode(formatter=None))
   2140  # <a href="http://example.com/?foo=val1&bar=val2">A link</a>
   2141 
   2142 Finally, if you pass in a function for ``formatter``, Beautiful Soup
   2143 will call that function once for every string and attribute value in
   2144 the document. You can do whatever you want in this function. Here's a
   2145 formatter that converts strings to uppercase and does absolutely
   2146 nothing else::
   2147 
   2148  def uppercase(str):
   2149      return str.upper()
   2150 
   2151  print(soup.prettify(formatter=uppercase))
   2152  # <html>
   2153  #  <body>
   2154  #   <p>
   2155  #    IL A DIT <<SACRÉ BLEU!>>
   2156  #   </p>
   2157  #  </body>
   2158  # </html>
   2159 
   2160  print(link_soup.a.prettify(formatter=uppercase))
   2161  # <a href="HTTP://EXAMPLE.COM/?FOO=VAL1&BAR=VAL2">
   2162  #  A LINK
   2163  # </a>
   2164 
   2165 If you're writing your own function, you should know about the
   2166 ``EntitySubstitution`` class in the ``bs4.dammit`` module. This class
   2167 implements Beautiful Soup's standard formatters as class methods: the
   2168 "html" formatter is ``EntitySubstitution.substitute_html``, and the
   2169 "minimal" formatter is ``EntitySubstitution.substitute_xml``. You can
   2170 use these functions to simulate ``formatter="html"`` or
   2171 ``formatter="minimal"``, but then do something extra.
   2172 
   2173 Here's an example that replaces Unicode characters with HTML entities
   2174 whenever possible, but `also` converts all strings to uppercase::
   2175 
   2176  from bs4.dammit import EntitySubstitution
   2177  def uppercase_and_substitute_html_entities(str):
   2178      return EntitySubstitution.substitute_html(str.upper())
   2179 
   2180  print(soup.prettify(formatter=uppercase_and_substitute_html_entities))
   2181  # <html>
   2182  #  <body>
   2183  #   <p>
   2184  #    IL A DIT &lt;&lt;SACR&Eacute; BLEU!&gt;&gt;
   2185  #   </p>
   2186  #  </body>
   2187  # </html>
   2188 
   2189 One last caveat: if you create a ``CData`` object, the text inside
   2190 that object is always presented `exactly as it appears, with no
   2191 formatting`. Beautiful Soup will call the formatter method, just in
   2192 case you've written a custom method that counts all the strings in the
   2193 document or something, but it will ignore the return value::
   2194 
   2195  from bs4.element import CData
   2196  soup = BeautifulSoup("<a></a>")
   2197  soup.a.string = CData("one < three")
   2198  print(soup.a.prettify(formatter="xml"))
   2199  # <a>
   2200  #  <![CDATA[one < three]]>
   2201  # </a>
   2202 
   2203 
   2204 ``get_text()``
   2205 --------------
   2206 
   2207 If you only want the text part of a document or tag, you can use the
   2208 ``get_text()`` method. It returns all the text in a document or
   2209 beneath a tag, as a single Unicode string::
   2210 
   2211   markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
   2212   soup = BeautifulSoup(markup)
   2213 
   2214   soup.get_text()
   2215   # u'\nI linked to example.com\n'
   2216   soup.i.get_text()
   2217   # u'example.com'
   2218 
   2219 You can specify a string to be used to join the bits of text
   2220 together::
   2221 
   2222  soup.get_text("|")
   2223  # u'\nI linked to |example.com|\n'
   2224 
   2225 You can tell Beautiful Soup to strip whitespace from the beginning and
   2226 end of each bit of text::
   2227 
   2228  soup.get_text("|", strip=True)
   2229  # u'I linked to|example.com'
   2230 
   2231 But at that point you might want to use the :ref:`.stripped_strings <string-generators>`
   2232 generator instead, and process the text yourself::
   2233 
   2234  [text for text in soup.stripped_strings]
   2235  # [u'I linked to', u'example.com']
   2236 
   2237 Specifying the parser to use
   2238 ============================
   2239 
   2240 If you just need to parse some HTML, you can dump the markup into the
   2241 ``BeautifulSoup`` constructor, and it'll probably be fine. Beautiful
   2242 Soup will pick a parser for you and parse the data. But there are a
   2243 few additional arguments you can pass in to the constructor to change
   2244 which parser is used.
   2245 
   2246 The first argument to the ``BeautifulSoup`` constructor is a string or
   2247 an open filehandle--the markup you want parsed. The second argument is
   2248 `how` you'd like the markup parsed.
   2249 
   2250 If you don't specify anything, you'll get the best HTML parser that's
   2251 installed. Beautiful Soup ranks lxml's parser as being the best, then
   2252 html5lib's, then Python's built-in parser. You can override this by
   2253 specifying one of the following:
   2254 
   2255 * What type of markup you want to parse. Currently supported are
   2256   "html", "xml", and "html5".
   2257 
   2258 * The name of the parser library you want to use. Currently supported
   2259   options are "lxml", "html5lib", and "html.parser" (Python's
   2260   built-in HTML parser).
   2261 
   2262 The section `Installing a parser`_ contrasts the supported parsers.
   2263 
   2264 If you don't have an appropriate parser installed, Beautiful Soup will
   2265 ignore your request and pick a different parser. Right now, the only
   2266 supported XML parser is lxml. If you don't have lxml installed, asking
   2267 for an XML parser won't give you one, and asking for "lxml" won't work
   2268 either.
   2269 
   2270 Differences between parsers
   2271 ---------------------------
   2272 
   2273 Beautiful Soup presents the same interface to a number of different
   2274 parsers, but each parser is different. Different parsers will create
   2275 different parse trees from the same document. The biggest differences
   2276 are between the HTML parsers and the XML parsers. Here's a short
   2277 document, parsed as HTML::
   2278 
   2279  BeautifulSoup("<a><b /></a>")
   2280  # <html><head></head><body><a><b></b></a></body></html>
   2281 
   2282 Since an empty <b /> tag is not valid HTML, the parser turns it into a
   2283 <b></b> tag pair.
   2284 
   2285 Here's the same document parsed as XML (running this requires that you
   2286 have lxml installed). Note that the empty <b /> tag is left alone, and
   2287 that the document is given an XML declaration instead of being put
   2288 into an <html> tag::
   2289 
   2290  BeautifulSoup("<a><b /></a>", "xml")
   2291  # <?xml version="1.0" encoding="utf-8"?>
   2292  # <a><b/></a>
   2293 
   2294 There are also differences between HTML parsers. If you give Beautiful
   2295 Soup a perfectly-formed HTML document, these differences won't
   2296 matter. One parser will be faster than another, but they'll all give
   2297 you a data structure that looks exactly like the original HTML
   2298 document.
   2299 
   2300 But if the document is not perfectly-formed, different parsers will
   2301 give different results. Here's a short, invalid document parsed using
   2302 lxml's HTML parser. Note that the dangling </p> tag is simply
   2303 ignored::
   2304 
   2305  BeautifulSoup("<a></p>", "lxml")
   2306  # <html><body><a></a></body></html>
   2307 
   2308 Here's the same document parsed using html5lib::
   2309 
   2310  BeautifulSoup("<a></p>", "html5lib")
   2311  # <html><head></head><body><a><p></p></a></body></html>
   2312 
   2313 Instead of ignoring the dangling </p> tag, html5lib pairs it with an
   2314 opening <p> tag. This parser also adds an empty <head> tag to the
   2315 document.
   2316 
   2317 Here's the same document parsed with Python's built-in HTML
   2318 parser::
   2319 
   2320  BeautifulSoup("<a></p>", "html.parser")
   2321  # <a></a>
   2322 
   2323 Like lxml, this parser ignores the closing </p> tag. Unlike
   2324 html5lib, this parser makes no attempt to create a well-formed HTML
   2325 document by adding a <body> tag. Unlike lxml, it doesn't even bother
   2326 to add an <html> tag.
   2327 
   2328 Since the document "<a></p>" is invalid, none of these techniques is
   2329 the "correct" way to handle it. The html5lib parser uses techniques
   2330 that are part of the HTML5 standard, so it has the best claim on being
   2331 the "correct" way, but all three techniques are legitimate.
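
You can watch a parser make these decisions using nothing but the standard library. In this sketch (the ``TagLogger`` class is hypothetical, written only for illustration), Python's built-in ``html.parser`` module reports the dangling ``</p>`` end tag as an event; deciding what to do with an unmatched end tag is then up to whoever builds the tree:

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Record every start- and end-tag event the parser reports."""
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

logger = TagLogger()
logger.feed("<a></p>")
print(logger.events)  # [('start', 'a'), ('end', 'p')]
```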
   2332 
   2333 Differences between parsers can affect your script. If you're planning
   2334 on distributing your script to other people, or running it on multiple
   2335 machines, you should specify a parser in the ``BeautifulSoup``
   2336 constructor. That will reduce the chances that your users parse a
   2337 document differently from the way you parse it.
   2338 
   2339 Encodings
   2340 =========
   2341 
   2342 Any HTML or XML document is written in a specific encoding like ASCII
   2343 or UTF-8.  But when you load that document into Beautiful Soup, you'll
   2344 discover it's been converted to Unicode::
   2345 
   2346  markup = "<h1>Sacr\xc3\xa9 bleu!</h1>"
   2347  soup = BeautifulSoup(markup)
   2348  soup.h1
   2349  # <h1>Sacré bleu!</h1>
   2350  soup.h1.string
   2351  # u'Sacr\xe9 bleu!'
   2352 
   2353 It's not magic. (That sure would be nice.) Beautiful Soup uses a
   2354 sub-library called `Unicode, Dammit`_ to detect a document's encoding
   2355 and convert it to Unicode. The autodetected encoding is available as
   2356 the ``.original_encoding`` attribute of the ``BeautifulSoup`` object::
   2357 
   2358  soup.original_encoding
   2359  # 'utf-8'
   2360 
   2361 Unicode, Dammit guesses correctly most of the time, but sometimes it
   2362 makes mistakes. Sometimes it guesses correctly, but only after a
   2363 byte-by-byte search of the document that takes a very long time. If
   2364 you happen to know a document's encoding ahead of time, you can avoid
   2365 mistakes and delays by passing it to the ``BeautifulSoup`` constructor
   2366 as ``from_encoding``.
   2367 
   2368 Here's a document written in ISO-8859-8. The document is so short that
   2369 Unicode, Dammit can't get a good lock on it, and misidentifies it as
   2370 ISO-8859-7::
   2371 
   2372  markup = b"<h1>\xed\xe5\xec\xf9</h1>"
   2373  soup = BeautifulSoup(markup)
   2374  soup.h1
   2375  <h1>νεμω</h1>
   2376  soup.original_encoding
   2377  'ISO-8859-7'
   2378 
   2379 We can fix this by passing in the correct ``from_encoding``::
   2380 
   2381  soup = BeautifulSoup(markup, from_encoding="iso-8859-8")
   2382  soup.h1
   2383  <h1>םולש</h1>
   2384  soup.original_encoding
   2385  'iso8859-8'
   2386 
   2387 In rare cases (usually when a UTF-8 document contains text written in
   2388 a completely different encoding), the only way to get Unicode may be
   2389 to replace some characters with the special Unicode character
   2390 "REPLACEMENT CHARACTER" (U+FFFD, �). If Unicode, Dammit needs to do
   2391 this, it will set the ``.contains_replacement_characters`` attribute
   2392 to ``True`` on the ``UnicodeDammit`` or ``BeautifulSoup`` object. This
   2393 lets you know that the Unicode representation is not an exact
   2394 representation of the original--some data was lost. If a document
   2395 contains �, but ``.contains_replacement_characters`` is ``False``,
   2396 you'll know that the � was there originally (as it is in this
   2397 paragraph) and doesn't stand in for missing data.
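
The same substitution can be reproduced with Python's standard codec machinery: decoding with ``errors="replace"`` swaps each undecodable byte for U+FFFD, which is essentially the fallback described above (a stdlib illustration, not Beautiful Soup's actual code):

```python
# Valid UTF-8 text followed by two Windows-1252 smart-quote bytes,
# which are not valid UTF-8 on their own.
data = b"Sacr\xc3\xa9 bleu! \x93ouch\x94"

# errors="replace" substitutes U+FFFD for each undecodable byte.
text = data.decode("utf-8", errors="replace")
print(text)
# Sacré bleu! �ouch�
print("\N{REPLACEMENT CHARACTER}" in text)
# True
```

Checking for U+FFFD in the result is the manual equivalent of checking ``.contains_replacement_characters``.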
   2398 
   2399 Output encoding
   2400 ---------------
   2401 
   2402 When you write out a document from Beautiful Soup, you get a UTF-8
   2403 document, even if the document wasn't in UTF-8 to begin with. Here's a
   2404 document written in the Latin-1 encoding::
   2405 
   2406  markup = b'''
   2407   <html>
   2408    <head>
   2409     <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
   2410    </head>
   2411    <body>
   2412     <p>Sacr\xe9 bleu!</p>
   2413    </body>
   2414   </html>
   2415  '''
   2416 
   2417  soup = BeautifulSoup(markup)
   2418  print(soup.prettify())
   2419  # <html>
   2420  #  <head>
   2421  #   <meta content="text/html; charset=utf-8" http-equiv="Content-type" />
   2422  #  </head>
   2423  #  <body>
   2424  #   <p>
   2425  #    Sacré bleu!
   2426  #   </p>
   2427  #  </body>
   2428  # </html>
   2429 
   2430 Note that the <meta> tag has been rewritten to reflect the fact that
   2431 the document is now in UTF-8.
   2432 
   2433 If you don't want UTF-8, you can pass an encoding into ``prettify()``::
   2434 
   2435  print(soup.prettify("latin-1"))
   2436  # <html>
   2437  #  <head>
   2438  #   <meta content="text/html; charset=latin-1" http-equiv="Content-type" />
   2439  # ...
   2440 
   2441 You can also call encode() on the ``BeautifulSoup`` object, or any
   2442 element in the soup, just as if it were a Python string::
   2443 
   2444  soup.p.encode("latin-1")
   2445  # '<p>Sacr\xe9 bleu!</p>'
   2446 
   2447  soup.p.encode("utf-8")
   2448  # '<p>Sacr\xc3\xa9 bleu!</p>'
   2449 
   2450 Any characters that can't be represented in your chosen encoding will
   2451 be converted into numeric XML entity references. Here's a document
   2452 that includes the Unicode character SNOWMAN::
   2453 
   2454  markup = u"<b>\N{SNOWMAN}</b>"
   2455  snowman_soup = BeautifulSoup(markup)
   2456  tag = snowman_soup.b
   2457 
   2458 The SNOWMAN character can be part of a UTF-8 document (it looks
   2459 like ☃), but there's no representation for that character in ISO-Latin-1
   2460 or ASCII, so it's converted into "&#9731;" for those encodings::
   2461 
   2462  print(tag.encode("utf-8"))
   2463  # <b>☃</b>
   2464 
   2465  print(tag.encode("latin-1"))
   2466  # <b>&#9731;</b>
   2467 
   2468  print(tag.encode("ascii"))
   2469  # <b>&#9731;</b>
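
If you want this numeric-entity fallback outside of Beautiful Soup, Python's ``xmlcharrefreplace`` error handler does the same thing (a stdlib illustration; Beautiful Soup's own implementation may differ):

```python
snowman = "\N{SNOWMAN}"

# UTF-8 can represent the character directly...
print(snowman.encode("utf-8"))
# b'\xe2\x98\x83'

# ...but ASCII can't, so xmlcharrefreplace substitutes the numeric
# character reference, the same form Beautiful Soup falls back to.
print(snowman.encode("ascii", "xmlcharrefreplace"))
# b'&#9731;'
```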
   2470 
   2471 Unicode, Dammit
   2472 ---------------
   2473 
   2474 You can use Unicode, Dammit without using Beautiful Soup. It's useful
   2475 whenever you have data in an unknown encoding and you just want it to
   2476 become Unicode::
   2477 
   2478  from bs4 import UnicodeDammit
   2479  dammit = UnicodeDammit("Sacr\xc3\xa9 bleu!")
   2480  print(dammit.unicode_markup)
   2481  # Sacré bleu!
   2482  dammit.original_encoding
   2483  # 'utf-8'
   2484 
   2485 Unicode, Dammit's guesses will get a lot more accurate if you install
   2486 the ``chardet`` or ``cchardet`` Python libraries. The more data you
   2487 give Unicode, Dammit, the more accurately it will guess. If you have
   2488 your own suspicions as to what the encoding might be, you can pass
   2489 them in as a list::
   2490 
   2491  dammit = UnicodeDammit("Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
   2492  print(dammit.unicode_markup)
   2493  # Sacré bleu!
   2494  dammit.original_encoding
   2495  # 'latin-1'
   2496 
   2497 Unicode, Dammit has two special features that Beautiful Soup doesn't
   2498 use.
   2499 
   2500 Smart quotes
   2501 ^^^^^^^^^^^^
   2502 
   2503 You can use Unicode, Dammit to convert Microsoft smart quotes to HTML or XML
   2504 entities::
   2505 
   2506  markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"
   2507 
   2508  UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup
   2509  # u'<p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>'
   2510 
   2511  UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup
   2512  # u'<p>I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes</p>'
   2513 
   2514 You can also convert Microsoft smart quotes to ASCII quotes::
   2515 
   2516  UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii").unicode_markup
   2517  # u'<p>I just "love" Microsoft Word\'s smart quotes</p>'
   2518 
   2519 Hopefully you'll find this feature useful, but Beautiful Soup doesn't
   2520 use it. Beautiful Soup prefers the default behavior, which is to
   2521 convert Microsoft smart quotes to Unicode characters along with
   2522 everything else::
   2523 
   2524  UnicodeDammit(markup, ["windows-1252"]).unicode_markup
   2525  # u'<p>I just \u201clove\u201d Microsoft Word\u2019s smart quotes</p>'
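
This default amounts to an ordinary Windows-1252 decode, which you can reproduce with the standard library alone (for illustration):

```python
markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

# A plain windows-1252 decode turns the smart-quote bytes into the
# corresponding Unicode characters, matching Unicode, Dammit's default.
print(markup.decode("windows-1252"))
# <p>I just “love” Microsoft Word’s smart quotes</p>
```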
   2526 
   2527 Inconsistent encodings
   2528 ^^^^^^^^^^^^^^^^^^^^^^
   2529 
   2530 Sometimes a document is mostly in UTF-8, but contains Windows-1252
   2531 characters such as (again) Microsoft smart quotes. This can happen
   2532 when a website includes data from multiple sources. You can use
   2533 ``UnicodeDammit.detwingle()`` to turn such a document into pure
   2534 UTF-8. Here's a simple example::
   2535 
   2536  snowmen = (u"\N{SNOWMAN}" * 3)
   2537  quote = (u"\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}")
   2538  doc = snowmen.encode("utf8") + quote.encode("windows_1252")
   2539 
   2540 This document is a mess. The snowmen are in UTF-8 and the quotes are
   2541 in Windows-1252. You can display the snowmen or the quotes, but not
   2542 both::
   2543 
   2544  print(doc)
   2545  # ☃☃☃�I like snowmen!�
   2546 
   2547  print(doc.decode("windows-1252"))
   2548  # â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”
   2549 
   2550 Decoding the document as UTF-8 raises a ``UnicodeDecodeError``, and
   2551 decoding it as Windows-1252 gives you gibberish. Fortunately,
   2552 ``UnicodeDammit.detwingle()`` will convert the string to pure UTF-8,
   2553 allowing you to decode it to Unicode and display the snowmen and quote
   2554 marks simultaneously::
   2555 
   2556  new_doc = UnicodeDammit.detwingle(doc)
   2557  print(new_doc.decode("utf8"))
   2558  # ☃☃☃“I like snowmen!”
   2559 
   2560 ``UnicodeDammit.detwingle()`` only knows how to handle Windows-1252
   2561 embedded in UTF-8 (or vice versa, I suppose), but this is the most
   2562 common case.
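
To build some intuition, here is a rough sketch of the kind of repair ``detwingle()`` performs: keep any bytes that form valid UTF-8, and re-encode stray Windows-1252 bytes as UTF-8. This is my own approximation, not the actual Beautiful Soup implementation:

```python
def detwingle_sketch(data):
    """Rough approximation of UnicodeDammit.detwingle(), for illustration only."""
    out = bytearray()
    i = 0
    while i < len(data):
        # Try to read a run of valid UTF-8, longest candidate first.
        for length in (4, 3, 2, 1):
            chunk = data[i:i + length]
            try:
                chunk.decode("utf-8")
            except UnicodeDecodeError:
                continue
            out += chunk            # valid UTF-8: pass it through untouched
            i += len(chunk)
            break
        else:
            # Not valid UTF-8: treat the byte as Windows-1252 and re-encode.
            out += data[i:i + 1].decode("windows-1252").encode("utf-8")
            i += 1
    return bytes(out)

# The mixed-encoding document from the example above.
snowmen = u"\N{SNOWMAN}" * 3
quote = (u"\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!"
         u"\N{RIGHT DOUBLE QUOTATION MARK}")
doc = snowmen.encode("utf-8") + quote.encode("windows-1252")

print(detwingle_sketch(doc).decode("utf-8"))
# ☃☃☃“I like snowmen!”
```

The real ``detwingle()`` is more careful than this sketch (for example, about bytes that are undefined in Windows-1252), but the basic idea is the same.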
   2563 
   2564 Note that you must know to call ``UnicodeDammit.detwingle()`` on your
   2565 data before passing it into ``BeautifulSoup`` or the ``UnicodeDammit``
   2566 constructor. Beautiful Soup assumes that a document has a single
   2567 encoding, whatever it might be. If you pass it a document that
   2568 contains both UTF-8 and Windows-1252, it's likely to think the whole
   2569 document is Windows-1252, and the document will come out looking like
   2570 ``â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”``.
   2571 
   2572 ``UnicodeDammit.detwingle()`` is new in Beautiful Soup 4.1.0.
   2573 
   2574 Parsing only part of a document
   2575 ===============================
   2576 
   2577 Let's say you want to use Beautiful Soup to look at a document's <a>
   2578 tags. It's a waste of time and memory to parse the entire document and
   2579 then go over it again looking for <a> tags. It would be much faster to
   2580 ignore everything that wasn't an <a> tag in the first place. The
   2581 ``SoupStrainer`` class allows you to choose which parts of an incoming
   2582 document are parsed. You just create a ``SoupStrainer`` and pass it in
   2583 to the ``BeautifulSoup`` constructor as the ``parse_only`` argument.
   2584 
   2585 (Note that *this feature won't work if you're using the html5lib parser*.
   2586 If you use html5lib, the whole document will be parsed, no
   2587 matter what. This is because html5lib constantly rearranges the parse
   2588 tree as it works, and if some part of the document didn't actually
   2589 make it into the parse tree, it'll crash. To avoid confusion, in the
   2590 examples below I'll be forcing Beautiful Soup to use Python's
   2591 built-in parser.)
   2592 
   2593 ``SoupStrainer``
   2594 ----------------
   2595 
   2596 The ``SoupStrainer`` class takes the same arguments as a typical
   2597 method from `Searching the tree`_: :ref:`name <name>`, :ref:`attrs
   2598 <attrs>`, :ref:`text <text>`, and :ref:`**kwargs <kwargs>`. Here are
   2599 three ``SoupStrainer`` objects::
   2600 
   2601  from bs4 import SoupStrainer
   2602 
   2603  only_a_tags = SoupStrainer("a")
   2604 
   2605  only_tags_with_id_link2 = SoupStrainer(id="link2")
   2606 
   2607  def is_short_string(string):
   2608      return len(string) < 10
   2609 
   2610  only_short_strings = SoupStrainer(text=is_short_string)
   2611 
   2612 I'm going to bring back the "three sisters" document one more time,
   2613 and we'll see what the document looks like when it's parsed with these
   2614 three ``SoupStrainer`` objects::
   2615 
   2616  html_doc = """
   2617  <html><head><title>The Dormouse's story</title></head>
   2618 
   2619  <p class="title"><b>The Dormouse's story</b></p>
   2620 
   2621  <p class="story">Once upon a time there were three little sisters; and their names were
   2622  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
   2623  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
   2624  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
   2625  and they lived at the bottom of a well.</p>
   2626 
   2627  <p class="story">...</p>
   2628  """
   2629 
   2630  print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
   2631  # <a class="sister" href="http://example.com/elsie" id="link1">
   2632  #  Elsie
   2633  # </a>
   2634  # <a class="sister" href="http://example.com/lacie" id="link2">
   2635  #  Lacie
   2636  # </a>
   2637  # <a class="sister" href="http://example.com/tillie" id="link3">
   2638  #  Tillie
   2639  # </a>
   2640 
   2641  print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify())
   2642  # <a class="sister" href="http://example.com/lacie" id="link2">
   2643  #  Lacie
   2644  # </a>
   2645 
   2646  print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
   2647  # Elsie
   2648  # ,
   2649  # Lacie
   2650  # and
   2651  # Tillie
   2652  # ...
   2653  #
   2654 
   2655 You can also pass a ``SoupStrainer`` into any of the methods covered
   2656 in `Searching the tree`_. This probably isn't terribly useful, but I
   2657 thought I'd mention it::
   2658 
   2659  soup = BeautifulSoup(html_doc)
   2660  soup.find_all(only_short_strings)
   2661  # [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
   2662  #  u'\n\n', u'...', u'\n']
   2663 
   2664 Troubleshooting
   2665 ===============
   2666 
   2667 .. _diagnose:
   2668 
   2669 ``diagnose()``
   2670 --------------
   2671 
   2672 If you're having trouble understanding what Beautiful Soup does to a
   2673 document, pass the document into the ``diagnose()`` function. (New in
   2674 Beautiful Soup 4.2.0.)  Beautiful Soup will print out a report showing
   2675 you how different parsers handle the document, and tell you if you're
   2676 missing a parser that Beautiful Soup could be using::
   2677 
   2678  from bs4.diagnose import diagnose
   2679  data = open("bad.html").read()
   2680  diagnose(data)
   2681 
   2682  # Diagnostic running on Beautiful Soup 4.2.0
   2683  # Python version 2.7.3 (default, Aug  1 2012, 05:16:07)
   2684  # I noticed that html5lib is not installed. Installing it may help.
   2685  # Found lxml version 2.3.2.0
   2686  #
   2687  # Trying to parse your data with html.parser
   2688  # Here's what html.parser did with the document:
   2689  # ...
   2690 
   2691 Just looking at the output of ``diagnose()`` may show you how to solve the
   2692 problem. Even if not, you can paste the output of ``diagnose()`` when
   2693 asking for help.
   2694 
   2695 Errors when parsing a document
   2696 ------------------------------
   2697 
   2698 There are two different kinds of parse errors. There are crashes,
   2699 where you feed a document to Beautiful Soup and it raises an
   2700 exception, usually an ``HTMLParser.HTMLParseError``. And there is
   2701 unexpected behavior, where a Beautiful Soup parse tree looks very
   2702 different from the document used to create it.
   2703 
   2704 Almost none of these problems turn out to be problems with Beautiful
   2705 Soup. This is not because Beautiful Soup is an amazingly well-written
   2706 piece of software. It's because Beautiful Soup doesn't include any
   2707 parsing code. Instead, it relies on external parsers. If one parser
   2708 isn't working on a certain document, the best solution is to try a
   2709 different parser. See `Installing a parser`_ for details and a parser
   2710 comparison.
   2711 
   2712 The most common parse errors are ``HTMLParser.HTMLParseError:
   2713 malformed start tag`` and ``HTMLParser.HTMLParseError: bad end
   2714 tag``. These are both generated by Python's built-in HTML parser
   2715 library, and the solution is to :ref:`install lxml or
   2716 html5lib. <parser-installation>`
   2717 
   2718 The most common type of unexpected behavior is that you can't find a
   2719 tag that you know is in the document. You saw it going in, but
   2720 ``find_all()`` returns ``[]`` or ``find()`` returns ``None``. This is
   2721 another common problem with Python's built-in HTML parser, which
   2722 sometimes skips tags it doesn't understand.  Again, the solution is to
   2723 :ref:`install lxml or html5lib. <parser-installation>`
   2724 
   2725 Version mismatch problems
   2726 -------------------------
   2727 
   2728 * ``SyntaxError: Invalid syntax`` (on the line ``ROOT_TAG_NAME =
   2729   u'[document]'``): Caused by running the Python 2 version of
   2730   Beautiful Soup under Python 3, without converting the code.
   2731 
   2732 * ``ImportError: No module named HTMLParser`` - Caused by running the
   2733   Python 2 version of Beautiful Soup under Python 3.
   2734 
   2735 * ``ImportError: No module named html.parser`` - Caused by running the
   2736   Python 3 version of Beautiful Soup under Python 2.
   2737 
   2738 * ``ImportError: No module named BeautifulSoup`` - Caused by running
   2739   Beautiful Soup 3 code on a system that doesn't have BS3
   2740   installed. Or, by writing Beautiful Soup 4 code without knowing that
   2741   the package name has changed to ``bs4``.
   2742 
   2743 * ``ImportError: No module named bs4`` - Caused by running Beautiful
   2744   Soup 4 code on a system that doesn't have BS4 installed.
   2745 
   2746 .. _parsing-xml:
   2747 
   2748 Parsing XML
   2749 -----------
   2750 
   2751 By default, Beautiful Soup parses documents as HTML. To parse a
   2752 document as XML, pass in "xml" as the second argument to the
   2753 ``BeautifulSoup`` constructor::
   2754 
   2755  soup = BeautifulSoup(markup, "xml")
   2756 
   2757 You'll need to :ref:`have lxml installed <parser-installation>`.
   2758 
   2759 Other parser problems
   2760 ---------------------
   2761 
   2762 * If your script works on one computer but not another, it's probably
   2763   because the two computers have different parser libraries
   2764   available. For example, you may have developed the script on a
   2765   computer that has lxml installed, and then tried to run it on a
   2766   computer that only has html5lib installed. See `Differences between
   2767   parsers`_ for why this matters, and fix the problem by mentioning a
   2768   specific parser library in the ``BeautifulSoup`` constructor.
   2769 
   2770 * Because `HTML tags and attributes are case-insensitive
   2771   <http://www.w3.org/TR/html5/syntax.html#syntax>`_, all three HTML
   2772   parsers convert tag and attribute names to lowercase. That is, the
   2773   markup <TAG></TAG> is converted to <tag></tag>. If you want to
   2774   preserve mixed-case or uppercase tags and attributes, you'll need to
   2775   :ref:`parse the document as XML. <parsing-xml>`
   2776 
   2777 .. _misc:
   2778 
   2779 Miscellaneous
   2780 -------------
   2781 
   2782 * ``UnicodeEncodeError: 'charmap' codec can't encode character
   2783   u'\xfoo' in position bar`` (or just about any other
   2784   ``UnicodeEncodeError``) - This is not a problem with Beautiful Soup.
   2785   This problem shows up in two main situations. First, when you try to
   2786   print a Unicode character that your console doesn't know how to
   2787   display. (See `this page on the Python wiki
   2788   <http://wiki.python.org/moin/PrintFails>`_ for help.) Second, when
   2789   you're writing to a file and you pass in a Unicode character that's
   2790   not supported by your default encoding.  In this case, the simplest
   2791   solution is to explicitly encode the Unicode string into UTF-8 with
   2792   ``u.encode("utf8")``.
   2793 
   2794 * ``KeyError: [attr]`` - Caused by accessing ``tag['attr']`` when the
   2795   tag in question doesn't define the ``attr`` attribute. The most
   2796   common errors are ``KeyError: 'href'`` and ``KeyError:
   2797   'class'``. Use ``tag.get('attr')`` if you're not sure ``attr`` is
   2798   defined, just as you would with a Python dictionary.
   2799 
   2800 * ``AttributeError: 'ResultSet' object has no attribute 'foo'`` - This
   2801   usually happens because you expected ``find_all()`` to return a
   2802   single tag or string. But ``find_all()`` returns a *list* of tags
   2803   and strings--a ``ResultSet`` object. You need to iterate over the
   2804   list and look at the ``.foo`` of each one. Or, if you really only
   2805   want one result, you need to use ``find()`` instead of
   2806   ``find_all()``.
   2807 
   2808 * ``AttributeError: 'NoneType' object has no attribute 'foo'`` - This
   2809   usually happens because you called ``find()`` and then tried to
   2810   access the ``.foo`` attribute of the result. But in your case,
   2811   ``find()`` didn't find anything, so it returned ``None``, instead of
   2812   returning a tag or a string. You need to figure out why your
   2813   ``find()`` call isn't returning anything.
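
For the file-writing ``UnicodeEncodeError`` above, an alternative to encoding each string by hand is to open the file with an explicit encoding (a stdlib sketch; the filename is invented):

```python
import io
import os
import tempfile

text = u"Sacr\xe9 bleu! \N{SNOWMAN}"

# io.open takes an encoding argument in both Python 2 and Python 3,
# so the platform's default codec never comes into play.
path = os.path.join(tempfile.mkdtemp(), "out.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

with io.open(path, "r", encoding="utf-8") as f:
    print(f.read() == text)
# True
```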
   2814 
   2815 Improving Performance
   2816 ---------------------
   2817 
   2818 Beautiful Soup will never be as fast as the parsers it sits on top
   2819 of. If response time is critical, if you're paying for computer time
   2820 by the hour, or if there's any other reason why computer time is more
   2821 valuable than programmer time, you should forget about Beautiful Soup
   2822 and work directly atop `lxml <http://lxml.de/>`_.
   2823 
   2824 That said, there are things you can do to speed up Beautiful Soup. If
   2825 you're not using lxml as the underlying parser, my advice is to
   2826 :ref:`start <parser-installation>`. Beautiful Soup parses documents
   2827 significantly faster using lxml than using html.parser or html5lib.
   2828 
   2829 You can speed up encoding detection significantly by installing the
   2830 `cchardet <http://pypi.python.org/pypi/cchardet/>`_ library.
   2831 
   2832 `Parsing only part of a document`_ won't save you much time parsing
   2833 the document, but it can save a lot of memory, and it'll make
   2834 `searching` the document much faster.
   2835 
   2836 Beautiful Soup 3
   2837 ================
   2838 
   2839 Beautiful Soup 3 is the previous release series, and is no longer
   2840 being actively developed. It's currently packaged with all major Linux
   2841 distributions:
   2842 
   2843 :kbd:`$ apt-get install python-beautifulsoup`
   2844 
   2845 It's also published through PyPI as ``BeautifulSoup``:
   2846 
   2847 :kbd:`$ easy_install BeautifulSoup`
   2848 
   2849 :kbd:`$ pip install BeautifulSoup`
   2850 
   2851 You can also `download a tarball of Beautiful Soup 3.2.0
   2852 <http://www.crummy.com/software/BeautifulSoup/bs3/download/3.x/BeautifulSoup-3.2.0.tar.gz>`_.
   2853 
   2854 If you ran ``easy_install beautifulsoup`` or ``easy_install
   2855 BeautifulSoup``, but your code doesn't work, you installed Beautiful
   2856 Soup 3 by mistake. You need to run ``easy_install beautifulsoup4``.
   2857 
   2858 `The documentation for Beautiful Soup 3 is archived online
   2859 <http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html>`_. If
   2860 your first language is Chinese, it might be easier for you to read
   2861 `the Chinese translation of the Beautiful Soup 3 documentation
   2862 <http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html>`_,
   2863 then read this document to find out about the changes made in
   2864 Beautiful Soup 4.
   2865 
   2866 Porting code to BS4
   2867 -------------------
   2868 
   2869 Most code written against Beautiful Soup 3 will work against Beautiful
   2870 Soup 4 with one simple change. All you should have to do is change the
   2871 package name from ``BeautifulSoup`` to ``bs4``. So this::
   2872 
   2873   from BeautifulSoup import BeautifulSoup
   2874 
   2875 becomes this::
   2876 
   2877   from bs4 import BeautifulSoup
   2878 
   2879 * If you get the ``ImportError`` "No module named BeautifulSoup", your
   2880   problem is that you're trying to run Beautiful Soup 3 code, but you
   2881   only have Beautiful Soup 4 installed.
   2882 
   2883 * If you get the ``ImportError`` "No module named bs4", your problem
   2884   is that you're trying to run Beautiful Soup 4 code, but you only
   2885   have Beautiful Soup 3 installed.
   2886 
   2887 Although BS4 is mostly backwards-compatible with BS3, most of its
   2888 methods have been deprecated and given new names for `PEP 8 compliance
   2889 <http://www.python.org/dev/peps/pep-0008/>`_. There are numerous other
   2890 renames and changes, and a few of them break backwards compatibility.
   2891 
   2892 Here's what you'll need to know to convert your BS3 code and habits to BS4:
   2893 
   2894 You need a parser
   2895 ^^^^^^^^^^^^^^^^^
   2896 
   2897 Beautiful Soup 3 used Python's ``SGMLParser``, a module that was
   2898 deprecated and removed in Python 3.0. Beautiful Soup 4 uses
   2899 ``html.parser`` by default, but you can plug in lxml or html5lib and
   2900 use that instead. See `Installing a parser`_ for a comparison.
   2901 
   2902 Since ``html.parser`` is not the same parser as ``SGMLParser``, it
   2903 will treat invalid markup differently. Usually the "difference" is
   2904 that ``html.parser`` crashes. In that case, you'll need to install
   2905 another parser. But sometimes ``html.parser`` just creates a different
   2906 parse tree than ``SGMLParser`` would. If this happens, you may need to
   2907 update your BS3 scraping code to deal with the new tree.
   2908 
   2909 Method names
   2910 ^^^^^^^^^^^^
   2911 
   2912 * ``renderContents`` -> ``encode_contents``
   2913 * ``replaceWith`` -> ``replace_with``
   2914 * ``replaceWithChildren`` -> ``unwrap``
   2915 * ``findAll`` -> ``find_all``
   2916 * ``findAllNext`` -> ``find_all_next``
   2917 * ``findAllPrevious`` -> ``find_all_previous``
   2918 * ``findNext`` -> ``find_next``
   2919 * ``findNextSibling`` -> ``find_next_sibling``
   2920 * ``findNextSiblings`` -> ``find_next_siblings``
   2921 * ``findParent`` -> ``find_parent``
   2922 * ``findParents`` -> ``find_parents``
   2923 * ``findPrevious`` -> ``find_previous``
   2924 * ``findPreviousSibling`` -> ``find_previous_sibling``
   2925 * ``findPreviousSiblings`` -> ``find_previous_siblings``
   2926 * ``nextSibling`` -> ``next_sibling``
   2927 * ``previousSibling`` -> ``previous_sibling``
   2928 
   2929 Some arguments to the Beautiful Soup constructor were renamed for the
   2930 same reasons:
   2931 
   2932 * ``BeautifulSoup(parseOnlyThese=...)`` -> ``BeautifulSoup(parse_only=...)``
   2933 * ``BeautifulSoup(fromEncoding=...)`` -> ``BeautifulSoup(from_encoding=...)``
   2934 
   2935 I renamed one method for compatibility with Python 3:
   2936 
   2937 * ``Tag.has_key()`` -> ``Tag.has_attr()``
   2938 
   2939 I renamed one attribute to use more accurate terminology:
   2940 
   2941 * ``Tag.isSelfClosing`` -> ``Tag.is_empty_element``
   2942 
   2943 I renamed three attributes to avoid using words that have special
   2944 meaning to Python. Unlike the others, these changes are *not backwards
   2945 compatible.* If you used these attributes in BS3, your code will break
   2946 on BS4 until you change them.
   2947 
   2948 * ``UnicodeDammit.unicode`` -> ``UnicodeDammit.unicode_markup``
   2949 * ``Tag.next`` -> ``Tag.next_element``
   2950 * ``Tag.previous`` -> ``Tag.previous_element``
   2951 
   2952 Generators
   2953 ^^^^^^^^^^
   2954 
   2955 I gave the generators PEP 8-compliant names, and transformed them into
   2956 properties:
   2957 
   2958 * ``childGenerator()`` -> ``children``
   2959 * ``nextGenerator()`` -> ``next_elements``
   2960 * ``nextSiblingGenerator()`` -> ``next_siblings``
   2961 * ``previousGenerator()`` -> ``previous_elements``
   2962 * ``previousSiblingGenerator()`` -> ``previous_siblings``
   2963 * ``recursiveChildGenerator()`` -> ``descendants``
   2964 * ``parentGenerator()`` -> ``parents``
   2965 
   2966 So instead of this::
   2967 
   2968  for parent in tag.parentGenerator():
   2969      ...
   2970 
   2971 You can write this::
   2972 
   2973  for parent in tag.parents:
   2974      ...
   2975 
   2976 (But the old code will still work.)
   2977 
   2978 Some of the generators used to yield ``None`` after they were done, and
   2979 then stop. That was a bug. Now the generators just stop.
   2980 
   2981 There are two new generators, :ref:`.strings and
   2982 .stripped_strings <string-generators>`. ``.strings`` yields
   2983 NavigableString objects, and ``.stripped_strings`` yields Python
   2984 strings that have had whitespace stripped.
   2985 
   2986 XML
   2987 ^^^
   2988 
   2989 There is no longer a ``BeautifulStoneSoup`` class for parsing XML. To
   2990 parse XML you pass in "xml" as the second argument to the
   2991 ``BeautifulSoup`` constructor. For the same reason, the
   2992 ``BeautifulSoup`` constructor no longer recognizes the ``isHTML``
   2993 argument.
   2994 
   2995 Beautiful Soup's handling of empty-element XML tags has been
   2996 improved. Previously when you parsed XML you had to explicitly say
   2997 which tags were considered empty-element tags. The ``selfClosingTags``
   2998 argument to the constructor is no longer recognized. Instead,
   2999 Beautiful Soup considers any empty tag to be an empty-element tag. If
   3000 you add a child to an empty-element tag, it stops being an
   3001 empty-element tag.
   3002 
   3003 Entities
   3004 ^^^^^^^^
   3005 
   3006 An incoming HTML or XML entity is always converted into the
   3007 corresponding Unicode character. Beautiful Soup 3 had a number of
   3008 overlapping ways of dealing with entities, which have been
   3009 removed. The ``BeautifulSoup`` constructor no longer recognizes the
   3010 ``smartQuotesTo`` or ``convertEntities`` arguments. (`Unicode,
   3011 Dammit`_ still has ``smart_quotes_to``, but its default is now to turn
   3012 smart quotes into Unicode.) The constants ``HTML_ENTITIES``,
   3013 ``XML_ENTITIES``, and ``XHTML_ENTITIES`` have been removed, since they
   3014 configure a feature (transforming some but not all entities into
   3015 Unicode characters) that no longer exists.
   3016 
   3017 If you want to turn Unicode characters back into HTML entities on
   3018 output, rather than turning them into UTF-8 characters, you need to
   3019 use an :ref:`output formatter <output_formatters>`.
   3020 
   3021 Miscellaneous
   3022 ^^^^^^^^^^^^^
   3023 
   3024 :ref:`Tag.string <.string>` now operates recursively. If tag A
   3025 contains a single tag B and nothing else, then A.string is the same as
   3026 B.string. (Previously, it was None.)
   3027 
   3028 `Multi-valued attributes`_ like ``class`` have lists of strings as
   3029 their values, not strings. This may affect the way you search by CSS
   3030 class.
   3031 
   3032 If you pass one of the ``find*`` methods both :ref:`text <text>` `and`
   3033 a tag-specific argument like :ref:`name <name>`, Beautiful Soup will
   3034 search for tags that match your tag-specific criteria and whose
   3035 :ref:`Tag.string <.string>` matches your value for :ref:`text
   3036 <text>`. It will `not` find the strings themselves. Previously,
   3037 Beautiful Soup ignored the tag-specific arguments and looked for
   3038 strings.
   3039 
   3040 The ``BeautifulSoup`` constructor no longer recognizes the
   3041 `markupMassage` argument. It's now the parser's responsibility to
   3042 handle markup correctly.
   3043 
   3044 The rarely-used alternate parser classes like
   3045 ``ICantBelieveItsBeautifulSoup`` and ``BeautifulSOAP`` have been
   3046 removed. It's now the parser's decision how to handle ambiguous
   3047 markup.
   3048 
   3049 The ``prettify()`` method now returns a Unicode string, not a bytestring.
   3050