:mod:`robotparser` --- Parser for robots.txt
============================================

.. module:: robotparser
   :synopsis: Loads a robots.txt file and answers questions about
              fetchability of other URLs.
.. sectionauthor:: Skip Montanaro <skip@pobox.com>


.. index::
   single: WWW
   single: World Wide Web
   single: URL
   single: robots.txt

.. note::
   The :mod:`robotparser` module has been renamed :mod:`urllib.robotparser` in
   Python 3.  The :term:`2to3` tool will automatically adapt imports when
   converting your sources to Python 3.

This module provides a single class, :class:`RobotFileParser`, which answers
questions about whether or not a particular user agent can fetch a URL on the
Web site that published the :file:`robots.txt` file.  For more details on the
structure of :file:`robots.txt` files, see http://www.robotstxt.org/orig.html.


.. class:: RobotFileParser(url='')

   This class provides methods to read, parse and answer questions about the
   :file:`robots.txt` file at *url*.


   .. method:: set_url(url)

      Sets the URL referring to a :file:`robots.txt` file.


   .. method:: read()

      Reads the :file:`robots.txt` URL and feeds it to the parser.


   .. method:: parse(lines)

      Parses the lines argument, which should be a list of lines from a
      :file:`robots.txt` file.
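
      For example, if the contents of the file are already in hand, they can
      be fed to the parser directly.  A minimal sketch (the rules shown are
      illustrative)::

         >>> import robotparser
         >>> rp = robotparser.RobotFileParser()
         >>> rp.parse(["User-agent: *", "Disallow: /private/"])
         >>> rp.can_fetch("*", "http://www.example.com/private/page.html")
         False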


   .. method:: can_fetch(useragent, url)

      Returns ``True`` if the *useragent* is allowed to fetch the *url*
      according to the rules contained in the parsed :file:`robots.txt`
      file.
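
      For example, given rules that single out one crawler (the agent names
      and rules below are illustrative)::

         >>> import robotparser
         >>> rp = robotparser.RobotFileParser()
         >>> rp.parse(["User-agent: FigtreeBot", "Disallow: /cgi-bin/",
         ...           "", "User-agent: *", "Disallow: /"])
         >>> rp.can_fetch("FigtreeBot", "http://www.example.com/index.html")
         True
         >>> rp.can_fetch("OtherBot", "http://www.example.com/index.html")
         False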


   .. method:: mtime()

      Returns the time the ``robots.txt`` file was last fetched.  This is
      useful for long-running web spiders that need to check for new
      ``robots.txt`` files periodically.


   .. method:: modified()

      Sets the time the ``robots.txt`` file was last fetched to the current
      time.
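
      Together with :meth:`mtime`, this supports periodic re-fetching in a
      long-running spider.  A minimal sketch (the URL and the one-day refresh
      interval are illustrative)::

         import robotparser
         import time

         rp = robotparser.RobotFileParser("http://www.example.com/robots.txt")
         rp.read()
         rp.modified()      # record when the rules were fetched

         # ... later, in the crawl loop ...
         if time.time() - rp.mtime() > 24 * 60 * 60:
             rp.read()      # the file may have changed; fetch it again
             rp.modified()  # and reset the timestamp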

The following example demonstrates basic use of the :class:`RobotFileParser`
class. ::

   >>> import robotparser
   >>> rp = robotparser.RobotFileParser()
   >>> rp.set_url("http://www.musi-cal.com/robots.txt")
   >>> rp.read()
   >>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
   False
   >>> rp.can_fetch("*", "http://www.musi-cal.com/")
   True