URL Parsing With WSGI And Paste
+++++++++++++++++++++++++++++++

:author: Ian Bicking <ianb@colorstudy.com>
:revision: $Rev$
:date: $LastChangedDate$

.. contents::

Introduction and Audience
=========================

This document is intended for web framework authors and integrators,
and people who want to understand the internal architecture of Paste.

.. include:: include/contact.txt

URL Parsing
===========

.. note::

   Sometimes people use "URL", and sometimes "URI".  I think URLs are
   a subset of URIs.  But in practice you'll almost never see URIs
   that aren't URLs, and certainly not in Paste.  URIs that aren't
   URLs are abstract Identifiers that cannot necessarily be used to
   Locate the resource.  This document is *all* about locating.

Most generally, URL parsing is about taking a URL and determining what
"resource" the URL refers to.  "Resource" is a rather vague term,
intentionally.  It's really just a metaphor -- in reality there aren't
any "resources" in HTTP; there are only requests and responses.

In Paste, everything is about WSGI.  But that can seem too fancy.
There are four core things involved: the *request* (embodied in the
WSGI environment), the *response* (embodied in the
``start_response`` callback and the return iterator), the WSGI
application, and the server that calls that application.  The
application and request are objects, while the server and response are
really more like actions than concrete objects.

In this context, URL parsing is about mapping a URL to an
*application* and a *request*.  The request actually gets modified as
it moves through different parts of the system.  Two dictionary keys
in particular relate to URLs -- ``SCRIPT_NAME`` and ``PATH_INFO`` --
but any part of the environment can be modified as it is passed
through the system.

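As a small illustration of how these two keys move together, the
sketch below uses the standard library's ``wsgiref.util`` helpers
rather than anything from Paste (that substitution, and the sample
path, are assumptions made to keep the example self-contained).
Shifting a segment from ``PATH_INFO`` to ``SCRIPT_NAME`` changes the
environment, but the URL reconstructed from it stays the same::

    from wsgiref.util import (
        request_uri, setup_testing_defaults, shift_path_info)

    # A made-up request for /blog/edit/285, with nothing parsed yet
    environ = {'SCRIPT_NAME': '', 'PATH_INFO': '/blog/edit/285'}
    setup_testing_defaults(environ)  # fill in the other required keys

    url_before = request_uri(environ)
    # Move one segment from PATH_INFO over to SCRIPT_NAME
    segment = shift_path_info(environ)
    url_after = request_uri(environ)

    # The two keys changed, but together they still name the same URL
    print(segment, environ['SCRIPT_NAME'], environ['PATH_INFO'])
    print(url_before == url_after)
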
Dispatching
===========

.. note::

   WSGI isn't object oriented?  Well, if you look at it, you'll notice
   there are no objects except built-in types, so it shouldn't be a
   surprise.  Additionally, the interface and promises of the objects
   we do see are very minimal.  An application doesn't have any
   interface except one method -- ``__call__`` -- and that method
   *does* things; it doesn't give any other information.

Because WSGI is action-oriented, rather than object-oriented, it's
more important what we *do*.  "Finding" an application is probably an
intermediate step, but "running" the application is our ultimate goal,
and the only real judge of success.  An application that isn't run is
useless to us, because it doesn't have any other useful methods.

So what we're really doing is *dispatching* -- we're handing the
request and responsibility for the response off to another object
(another actor, really).  In the process we can actually retain some
control -- we can capture and transform the response, and we can
modify the request -- but that's not what the typical URL resolver will
do.

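A minimal sketch of dispatching might look like the following.  The
``Dispatcher`` class, the ``blog_app`` and ``not_found`` applications,
and the idea of keying on the first path segment are all assumptions
made for illustration; this is not how any particular Paste component
is written.  The point is only that the dispatcher chooses an
application and then *runs* it, passing along both the request and the
responsibility for the response::

    def blog_app(environ, start_response):
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return [b'blog']

    def not_found(environ, start_response):
        start_response('404 Not Found', [('Content-Type', 'text/plain')])
        return [b'Not Found']

    class Dispatcher(object):
        """Hypothetical dispatcher: pick an application, then run it."""

        def __init__(self, apps, default=not_found):
            self.apps = apps        # first path segment -> application
            self.default = default

        def __call__(self, environ, start_response):
            path = environ.get('PATH_INFO', '')
            segment = path.lstrip('/').split('/', 1)[0]
            app = self.apps.get(segment, self.default)
            # Hand off the request and the duty to respond
            return app(environ, start_response)

    app = Dispatcher({'blog': blog_app})

Note that this dispatcher leaves the environment untouched; the parsers
discussed later also update ``SCRIPT_NAME`` and ``PATH_INFO`` before
handing the request on.
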
Motivations
===========

The most obvious kind of URL parsing is finding a WSGI application.

Typically when a framework first supports WSGI or is integrated into
Paste, it is "monolithic" with respect to URLs.  That is, you define
(in Paste, or maybe in Apache) a "root" URL, and everything under that
goes into the framework.  What the framework does internally, Paste
does not know -- it probably finds internal objects to dispatch to,
but the framework is opaque to Paste.  Not just to Paste, but to
any code that isn't in that framework.

That means that we can't mix code from multiple frameworks, share
services as easily, or use WSGI middleware that doesn't apply to
the entire framework/application.

An example of a place where we might want to use an "application"
that isn't part of the framework would be uploading large files.  It's
possible to keep track of upload progress, and report that back to the
user, but frameworks typically aren't capable of this.  This is
usually because the POST request is completely read and parsed before
the framework invokes any application code.

This is resolvable in WSGI -- a WSGI application can provide its own
code to read and parse the POST request, and simultaneously report
progress (usually in a way that *another* WSGI application/request can
read, so that it can report that progress to the user).  This is an
example where you want to allow "foreign" applications to be
intermingled with framework application code.

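To make the upload idea a little more concrete, here is a rough
sketch of such a pair of applications.  Everything in it is an
assumption for illustration: the module-level ``progress`` dictionary,
the use of the query string as an upload identifier, and the chunk
size.  It also only counts bytes; a real application would parse and
store the body::

    progress = {}   # upload id -> (bytes received so far, total bytes)

    def upload_app(environ, start_response):
        """Read the POST body in chunks, recording progress as we go."""
        total = int(environ.get('CONTENT_LENGTH') or 0)
        upload_id = environ.get('QUERY_STRING', '')
        received = 0
        while received < total:
            chunk = environ['wsgi.input'].read(min(8192, total - received))
            if not chunk:
                break           # the client stopped sending early
            received += len(chunk)
            progress[upload_id] = (received, total)
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return [b'done']

    def progress_app(environ, start_response):
        """A second application that reports on the first one."""
        upload_id = environ.get('QUERY_STRING', '')
        received, total = progress.get(upload_id, (0, 0))
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return [('%d/%d' % (received, total)).encode('ascii')]

A shared dictionary like this only works when both requests are served
by the same process (for instance, a threaded server); with several
processes the progress would have to live in some external store.
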
Finding Applications
====================

OK, enough theory.  How does a URL parser work?  Well, it is both a
WSGI application and a WSGI server, in the typical "WSGI middleware"
style.  Except that it determines which application it will serve
for each request.

Let's consider Paste's ``URLParser`` (in ``paste.urlparser``).  This
class takes a directory name as its only required argument, and
instances are WSGI applications.

When a request comes in, the parser looks at ``PATH_INFO`` to see
what's left to parse.  ``SCRIPT_NAME`` represents where we are *now*;
it's the part of the URL that has been parsed.

There are a couple of special cases:

The empty string:

    URLParser serves directories.  When ``PATH_INFO`` is empty, that
    means we got a request with no trailing ``/``, like, say, ``/blog``.
    If URLParser serves the ``blog`` directory, then this won't do --
    the user is requesting the ``blog`` *page*.  We have to redirect
    them to ``/blog/``.

A single ``/``:

    So, we got a trailing ``/``.  This means we need to serve the
    "index" page.  In URLParser, this is some file named ``index``,
    though that's really an implementation detail.  You could create
    an index dynamically (like Apache's file listings), or whatever.

Otherwise we get a string like ``/path...``.  Note that ``PATH_INFO``
*must* start with a ``/``, or it must be empty.

URLParser pulls off the first part of the path.  E.g., if
``PATH_INFO`` is ``/blog/edit/285``, then the first part is ``blog``.
It appends this to ``SCRIPT_NAME``, and strips it off ``PATH_INFO``
(which becomes ``/edit/285``).

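The popping step just described, including both special cases, can be
sketched without Paste.  ``pop_segment`` is a hypothetical stand-in
written for this example, not Paste's own helper, and it assumes
``PATH_INFO`` is either empty or starts with ``/``::

    def pop_segment(environ):
        """Move the first segment of PATH_INFO onto SCRIPT_NAME.

        Returns None when PATH_INFO is empty (no trailing /), '' when
        PATH_INFO is a single /, and the segment name otherwise.
        """
        path = environ.get('PATH_INFO', '')
        if not path:
            return None
        segment, sep, rest = path[1:].partition('/')
        environ['SCRIPT_NAME'] = environ.get('SCRIPT_NAME', '') + '/' + segment
        environ['PATH_INFO'] = sep + rest   # '' or a string starting with /
        return segment

    environ = {'SCRIPT_NAME': '', 'PATH_INFO': '/blog/edit/285'}
    first = pop_segment(environ)
    # first is 'blog', SCRIPT_NAME is '/blog', PATH_INFO is '/edit/285'

The three return values line up with the three cases above: ``None``
calls for a redirect, ``''`` for the index page, and anything else is
a name to look up.
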
URLParser then searches for a file that matches "blog"; that is, it
looks for a filename which matches that name (ignoring the
extension).  It then uses the type of that file (determined by
extension) to create a WSGI application.

One case is that the file is a directory.  In that case, the
application is *another* URLParser instance, this time with the new
directory.

URLParser actually allows per-extension "plugins" -- these are just
functions that get a filename, and produce a WSGI application.  One of
these is ``make_py`` -- this function imports the module, and looks
for special symbols; if it finds a symbol ``application``, it assumes
this is a WSGI application that is ready to accept the request.  If it
finds a symbol that matches the name of the module (e.g., ``edit``),
then it assumes that is an application *factory*, meaning that when
you call it with no arguments you get a WSGI application.

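A plugin along these lines could be sketched as follows.  This is not
Paste's actual ``make_py``; it is a simplified guess at the behaviour
described above, using ``importlib`` to load the file, and returning
``None`` when neither symbol is found (an assumption, since the text
doesn't say what happens then)::

    import importlib.util
    import os

    def make_py(filename):
        """Hypothetical sketch: turn a Python file into a WSGI app."""
        name = os.path.splitext(os.path.basename(filename))[0]
        spec = importlib.util.spec_from_file_location(name, filename)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)     # run the module's code
        if hasattr(module, 'application'):
            # Already an application, ready to accept the request
            return module.application
        factory = getattr(module, name, None)
        if factory is not None:
            # A factory named after the module: call it with no arguments
            return factory()
        return None

So a file ``edit.py`` could either define ``application`` directly, or
define a callable ``edit`` that builds the application when called.
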
Another function takes "unknown" files (files for which no better
constructor exists) and creates an application that simply responds
with the contents of that file (and the appropriate ``Content-Type``).

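Such a fallback constructor might be sketched like this.  The name
``make_static`` is made up for the example, and guessing the type with
the standard ``mimetypes`` module is an assumption about how the
"appropriate" ``Content-Type`` could be chosen::

    import mimetypes

    def make_static(filename):
        """Hypothetical constructor for files with no better handler."""
        content_type = (mimetypes.guess_type(filename)[0]
                        or 'application/octet-stream')

        def application(environ, start_response):
            # Read the whole file; fine for a sketch, not for huge files
            with open(filename, 'rb') as f:
                body = f.read()
            start_response('200 OK', [
                ('Content-Type', content_type),
                ('Content-Length', str(len(body)))])
            return [body]

        return application

Like the other plugins, it takes a filename and returns a WSGI
application; nothing is read from disk until a request arrives.
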
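In any case, ``URLParser`` delegates as soon as it can.  It doesn't
parse the entire path -- it just finds the *next* application, which
in turn may delegate to yet another application.

Here's a very simple implementation of URLParser (``import_module``
and ``not_found`` are small helpers defined just for this sketch)::

    import importlib.util
    import os

    from paste import wsgilib

    def import_module(filename):
        name = os.path.splitext(os.path.basename(filename))[0]
        spec = importlib.util.spec_from_file_location(name, filename)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module

    def not_found(environ, start_response):
        start_response('404 Not Found', [('Content-Type', 'text/plain')])
        return [b'Not Found']

    class URLParser(object):
        def __init__(self, dir):
            self.dir = dir

        def __call__(self, environ, start_response):
            segment = wsgilib.path_info_pop(environ)
            if segment is None:  # No trailing /
                # Redirect to the same URL with a / appended
                start_response('301 Moved Permanently',
                               [('Location', environ['SCRIPT_NAME'] + '/')])
                return [b'']
            segment = segment or 'index'  # A single / means the index page
            for filename in os.listdir(self.dir):
                if os.path.splitext(filename)[0] == segment:
                    return self.serve_application(
                        environ, start_response, filename)
            return not_found(environ, start_response)

        def serve_application(self, environ, start_response, filename):
            basename, ext = os.path.splitext(filename)
            filename = os.path.join(self.dir, filename)
            if os.path.isdir(filename):
                return URLParser(filename)(environ, start_response)
            elif ext == '.py':
                module = import_module(filename)
                if hasattr(module, 'application'):
                    return module.application(environ, start_response)
                elif hasattr(module, basename):
                    # A factory: call it with no arguments to get the app
                    app = getattr(module, basename)()
                    return app(environ, start_response)
                return not_found(environ, start_response)
            else:
                # send_file builds an application; we still have to run it
                return wsgilib.send_file(filename)(environ, start_response)
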
Modifying The Request
=====================

Well, URLParser is one kind of parser.  But others are possible, and
aren't too hard to write.

Let's imagine a URL like ``/2004/05/01/edit``.  It's likely that
``/2004/05/01`` doesn't point to anything on file, but is really more
of a "variable" that gets passed to ``edit``.  So we can pull those
segments off and put them somewhere.  This is a good place for a WSGI
extension.  Let's put them in ``environ["app.url_date"]``.

We'll pass one other application in -- once we get the date (if any)
we need to pass the request on to an application that can actually
handle it.  This "application" might be a URLParser or similar system
(that figures out what ``/edit`` means).

::

    from paste import wsgilib

    class GrabDate(object):
        def __init__(self, subapp):
            self.subapp = subapp

        def __call__(self, environ, start_response):
            date_parts = []
            while len(date_parts) < 3:
                # Peek at the next segment without consuming it
                first, rest = wsgilib.path_info_split(environ['PATH_INFO'])
                try:
                    date_parts.append(int(first))
                except (ValueError, TypeError):
                    # Not a number (or nothing left): stop here
                    break
                # Only now move the segment from PATH_INFO to SCRIPT_NAME
                wsgilib.path_info_pop(environ)
            environ['app.url_date'] = date_parts
            return self.subapp(environ, start_response)

This is really like traditional "middleware", in that it sits between
the server and just one application.

Assuming you put this class in the ``myapp.grabdate`` module, you
could install it by adding this to your configuration::

    middleware.append('myapp.grabdate.GrabDate')

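To see what the wrapped application actually receives, here is a
dependency-free variant of the same idea, wired up by hand instead of
through the configuration.  ``grab_date`` and ``show`` are names
invented for this sketch, and the path handling is a simplified
stand-in for the Paste helpers; it stores the result under
``app.url_date``, the key named in the text::

    def grab_date(subapp):
        """Function-style variant of the idea above, without Paste."""
        def middleware(environ, start_response):
            path = environ.get('PATH_INFO', '')
            date_parts = []
            while len(date_parts) < 3:
                first, sep, rest = path[1:].partition('/')
                try:
                    date_parts.append(int(first))
                except ValueError:
                    break       # not a number: leave it in PATH_INFO
                # Move the consumed segment over to SCRIPT_NAME
                environ['SCRIPT_NAME'] = (
                    environ.get('SCRIPT_NAME', '') + '/' + first)
                path = sep + rest
            environ['PATH_INFO'] = path
            environ['app.url_date'] = date_parts
            return subapp(environ, start_response)
        return middleware

    def show(environ, start_response):
        # Stand-in for the real application: echo what was grabbed
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return [repr(environ['app.url_date']).encode('ascii')]

    app = grab_date(show)

For a request to ``/2004/05/01/edit``, the inner application sees
``PATH_INFO`` as ``/edit``, ``SCRIPT_NAME`` ending in ``/2004/05/01``,
and the three numbers in the environment.
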
Object Publishing
=================

Besides looking in the filesystem, "object publishing" is another
popular way to do URL parsing.  This is pretty easy to implement as
well -- it usually just means using ``getattr`` with the popped
segments.  But we'll implement a rough approximation of `Quixote's
<http://www.mems-exchange.org/software/quixote/>`_ URL parsing::

    from paste import wsgilib

    class ObjectApp(object):
        def __init__(self, obj):
            self.obj = obj

        def __call__(self, environ, start_response):
            next = wsgilib.path_info_pop(environ)
            if next is None:
                # This is the object, let's serve it...
                return self.publish(self.obj, environ, start_response)
            next = next or '_q_index'  # the default index method
            if (next in self.obj._q_exports
                    and getattr(self.obj, next, None)):
                return ObjectApp(getattr(self.obj, next))(
                    environ, start_response)
            next_obj = self.obj._q_traverse(next)
            if not next_obj:
                start_response('404 Not Found',
                               [('Content-Type', 'text/plain')])
                return [b'Not Found']
            return ObjectApp(next_obj)(environ, start_response)

        def publish(self, obj, environ, start_response):
            if callable(obj):
                output = str(obj())
            else:
                output = str(obj)
            start_response('200 OK', [('Content-Type', 'text/html')])
            return [output.encode('utf-8')]

The ``publish`` method is a little weak, and functions like
``_q_traverse`` aren't passed interesting information about the
request, but this is only a rough approximation of the framework.
Things to note:

* The object has standard attributes and methods -- ``_q_exports``
  (attributes that are public to the web) and ``_q_traverse``
  (a way of overriding the traversal without having an attribute for
  each possible path segment).

* The object isn't rendered until the path is completely consumed
  (when ``next`` is ``None``).  This means ``_q_traverse`` has to
  consume extra segments of the path.  In this version ``_q_traverse``
  is only given the next piece of the path; Quixote gives it the
  entire path (as a list of segments).

* ``publish`` is really a small and lame way to turn a Quixote object
  into a WSGI application.  For any serious framework you'd want to do
  a better job than what I do here.

* It would be even better if you used something like `Adaptation
  <http://www.python.org/peps/pep-0246.html>`_ to convert objects into
  applications.  This would include removing the explicit creation of
  new ``ObjectApp`` instances, which could also be a kind of fall-back
  adaptation.

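For comparison, the plainer ``getattr`` style of object publishing
mentioned at the start of this section can be sketched in a few lines.
The ``Blog`` class, the ``publish_object`` function, the ``index``
default, and the rule of hiding names that start with an underscore
are all assumptions made for this example, not part of Quixote or
Paste::

    class Blog(object):
        """Hypothetical object tree to publish."""
        def index(self):
            return 'blog index'

        def about(self):
            return 'about this blog'

    def publish_object(root):
        def application(environ, start_response):
            obj = root
            path = environ.get('PATH_INFO', '')
            for segment in path.split('/')[1:]:
                name = segment or 'index'   # a trailing / means the index
                # Never expose private attributes to the web
                if name.startswith('_'):
                    obj = None
                else:
                    obj = getattr(obj, name, None)
                if obj is None:
                    start_response('404 Not Found',
                                   [('Content-Type', 'text/plain')])
                    return [b'Not Found']
            body = obj() if callable(obj) else obj
            start_response('200 OK', [('Content-Type', 'text/plain')])
            return [str(body).encode('utf-8')]
        return application

    app = publish_object(Blog())

Unlike ``ObjectApp``, this version walks the whole path in one loop
instead of delegating to a new application for each segment, and it
does not update ``SCRIPT_NAME`` or ``PATH_INFO`` as it goes.
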
Anyway, the ``ObjectApp`` example is less complete, but maybe it will
get you thinking.
    305