:mod:`urllib.robotparser` ---  Parser for robots.txt
====================================================

.. module:: urllib.robotparser
   :synopsis: Load a robots.txt file and answer questions about
              fetchability of other URLs.

.. sectionauthor:: Skip Montanaro <skip@pobox.com>

**Source code:** :source:`Lib/urllib/robotparser.py`

.. index::
   single: WWW
   single: World Wide Web
   single: URL
   single: robots.txt

--------------

This module provides a single class, :class:`RobotFileParser`, which answers
questions about whether or not a particular user agent can fetch a URL on the
Web site that published the :file:`robots.txt` file.  For more details on the
structure of :file:`robots.txt` files, see http://www.robotstxt.org/orig.html.

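For illustration only, a short :file:`robots.txt` using some of the directives
this module understands might look like the following; the values are
hypothetical, chosen to mirror the example session at the end of this page::

   User-agent: *
   Crawl-delay: 6
   Request-rate: 3/20
   Disallow: /cgi-bin/
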

.. class:: RobotFileParser(url='')

   This class provides methods to read, parse and answer questions about the
   :file:`robots.txt` file at *url*.

   .. method:: set_url(url)

      Sets the URL referring to a :file:`robots.txt` file.

   .. method:: read()

      Reads the :file:`robots.txt` URL and feeds it to the parser.

   .. method:: parse(lines)

      Parses the *lines* argument, a list of lines from a :file:`robots.txt`
      file.

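      If the file's contents are already in hand, they can be fed to the
      parser directly instead of calling :meth:`read`.  A minimal sketch, with
      rules and URLs made up for illustration::

         >>> import urllib.robotparser
         >>> rp = urllib.robotparser.RobotFileParser()
         >>> rp.parse(["User-agent: *", "Disallow: /private/"])
         >>> rp.can_fetch("*", "http://www.example.com/private/page.html")
         False
         >>> rp.can_fetch("*", "http://www.example.com/index.html")
         True
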
   .. method:: can_fetch(useragent, url)

      Returns ``True`` if the *useragent* is allowed to fetch the *url*
      according to the rules contained in the parsed :file:`robots.txt`
      file.

   .. method:: mtime()

      Returns the time the ``robots.txt`` file was last fetched.  This is
      useful for long-running web spiders that need to check for new
      ``robots.txt`` files periodically.

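      For example, a long-running crawler might re-read the file whenever the
      last fetch is older than some age of its own choosing.  A rough sketch,
      where the one-day threshold and the helper name are arbitrary::

         import time
         import urllib.robotparser

         rp = urllib.robotparser.RobotFileParser("http://www.musi-cal.com/robots.txt")
         rp.read()

         def refresh_if_stale(max_age=24 * 60 * 60):
             # mtime() reports when robots.txt was last fetched; re-reading
             # the file refreshes both the rules and that timestamp.
             if time.time() - rp.mtime() > max_age:
                 rp.read()
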
   .. method:: modified()

      Sets the time the ``robots.txt`` file was last fetched to the current
      time.

   .. method:: crawl_delay(useragent)

      Returns the value of the ``Crawl-delay`` parameter from ``robots.txt``
      for the *useragent* in question.  If there is no such parameter or it
      doesn't apply to the *useragent* specified or the ``robots.txt`` entry
      for this parameter has invalid syntax, return ``None``.

      .. versionadded:: 3.6

   .. method:: request_rate(useragent)

      Returns the contents of the ``Request-rate`` parameter from
      ``robots.txt`` in the form of a :func:`~collections.namedtuple`
      ``(requests, seconds)``.  If there is no such parameter or it doesn't
      apply to the *useragent* specified or the ``robots.txt`` entry for this
      parameter has invalid syntax, return ``None``.

      .. versionadded:: 3.6


The following example demonstrates basic use of the :class:`RobotFileParser`
class::

   >>> import urllib.robotparser
   >>> rp = urllib.robotparser.RobotFileParser()
   >>> rp.set_url("http://www.musi-cal.com/robots.txt")
   >>> rp.read()
   >>> rrate = rp.request_rate("*")
   >>> rrate.requests
   3
   >>> rrate.seconds
   20
   >>> rp.crawl_delay("*")
   6
   >>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
   False
   >>> rp.can_fetch("*", "http://www.musi-cal.com/")
   True
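
Reusing ``rp`` from the session above, the following is a sketch of one
possible way to turn the advertised limits into a pause between successive
requests; whether and how to honor these values is up to the crawler::

   import time

   delay = rp.crawl_delay("*")
   if delay is None:
       # No Crawl-delay entry: fall back to Request-rate, then to no delay.
       rrate = rp.request_rate("*")
       delay = rrate.seconds / rrate.requests if rrate else 0

   time.sleep(delay)  # wait this long between requests to the same host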