URL Parsing With WSGI And Paste
+++++++++++++++++++++++++++++++

:author: Ian Bicking <ianb (a] colorstudy.com>
:revision: $Rev$
:date: $LastChangedDate$

.. contents::

Introduction and Audience
=========================

This document is intended for web framework authors and integrators,
and people who want to understand the internal architecture of Paste.

.. include:: include/contact.txt

URL Parsing
===========

.. note::

   Sometimes people use "URL", and sometimes "URI".  I think URLs are
   a subset of URIs.  But in practice you'll almost never see URIs
   that aren't URLs, and certainly not in Paste.  URIs that aren't
   URLs are abstract Identifiers that cannot necessarily be used to
   Locate the resource.  This document is *all* about locating.

Most generally, URL parsing is about taking a URL and determining what
"resource" the URL refers to.  "Resource" is a rather vague term,
intentionally.  It's really just a metaphor -- in reality there aren't
any "resources" in HTTP; there are only requests and responses.

In Paste, everything is about WSGI.  But that can seem too fancy.
There are four core things involved: the *request* (personified in the
WSGI environment), the *response* (personified in the
``start_response`` callback and the return iterator), the WSGI
application, and the server that calls that application.  The
application and request are objects, while the server and response are
really more like actions than concrete objects.

In this context, URL parsing is about mapping a URL to an
*application* and a *request*.  The request actually gets modified as
it moves through different parts of the system.  Two dictionary keys
in particular relate to URLs -- ``SCRIPT_NAME`` and ``PATH_INFO`` --
but any part of the environment can be modified as it is passed
through the system.

Dispatching
===========

.. note::

   WSGI isn't object oriented?  Well, if you look at it, you'll notice
   there are no objects except built-in types, so it shouldn't be a
   surprise.  Additionally, the interface and promises of the objects
   we do see are very minimal.  An application doesn't have any
   interface except one method -- ``__call__`` -- and that method
   *does* things, it doesn't give any other information.

Because WSGI is action-oriented, rather than object-oriented, it's
more important what we *do*.  "Finding" an application is probably an
intermediate step, but "running" the application is our ultimate goal,
and the only real judge of success.  An application that isn't run is
useless to us, because it doesn't have any other useful methods.

So what we're really doing is *dispatching* -- we're handing the
request and responsibility for the response off to another object
(another actor, really).  In the process we can actually retain some
control -- we can capture and transform the response, and we can
modify the request -- but that's not what the typical URL resolver will
do.

Motivations
===========

The most obvious kind of URL parsing is finding a WSGI application.

Typically when a framework first supports WSGI or is integrated into
Paste, it is "monolithic" with respect to URLs.  That is, you define
(in Paste, or maybe in Apache) a "root" URL, and everything under that
goes into the framework.  What the framework does internally, Paste
does not know -- it probably finds internal objects to dispatch to,
but the framework is opaque to Paste.  Not just to Paste, but to
any code that isn't in that framework.

That means that we can't mix code from multiple frameworks, or as
easily share services, or use WSGI middleware that doesn't apply to
the entire framework/application.

An example of someplace we might want to use an "application" that
isn't part of the framework would be uploading large files.  It's
possible to keep track of upload progress, and report that back to the
user, but frameworks typically aren't capable of this.  This is usually
because the POST request is completely read and parsed before the
framework invokes any application code.

This is resolvable in WSGI -- a WSGI application can provide its own
code to read and parse the POST request, and simultaneously report
progress (usually in a way that lets *another* WSGI
application/request read that progress and report it to the user).
This is an example where you want to allow "foreign" applications to
be intermingled with framework application code.

Finding Applications
====================

OK, enough theory.  How does a URL parser work?  Well, it is a WSGI
application, and a WSGI server, in the typical "WSGI middleware"
style.  Except that it determines which application it will serve
for each request.

Let's consider Paste's ``URLParser`` (in ``paste.urlparser``).  This
class takes a directory name as its only required argument, and
instances are WSGI applications.

When a request comes in, the parser looks at ``PATH_INFO`` to see
what's left to parse.  ``SCRIPT_NAME`` represents where we are *now*;
it's the part of the URL that has been parsed.

There are a couple of special cases:

The empty string:

    URLParser serves directories.  When ``PATH_INFO`` is empty, that
    means we got a request with no trailing ``/``, like, say, ``/blog``.
    If URLParser serves the ``blog`` directory, then this won't do --
    the user is requesting the ``blog`` *page*.  We have to redirect
    them to ``/blog/``.

A single ``/``:

    So, we got a trailing ``/``.  This means we need to serve the
    "index" page.
    In URLParser, this is some file named ``index``,
    though that's really an implementation detail.  You could create
    an index dynamically (like Apache's file listings), or whatever.

Otherwise we get a string like ``/path...``.  Note that ``PATH_INFO``
*must* start with a ``/``, or it must be empty.

URLParser pulls off the first part of the path.  E.g., if
``PATH_INFO`` is ``/blog/edit/285``, then the first part is ``blog``.
It appends this to ``SCRIPT_NAME``, and strips it off ``PATH_INFO``
(which becomes ``/edit/285``).

It then searches for a file that matches "blog".  In URLParser, this
means it looks for a filename which matches that name (ignoring the
extension).  It then uses the type of that file (determined by
extension) to create a WSGI application.

One case is that the file is a directory.  In that case, the
application is *another* URLParser instance, this time with the new
directory.

URLParser actually allows per-extension "plugins" -- these are just
functions that get a filename, and produce a WSGI application.  One of
these is ``make_py`` -- this function imports the module, and looks
for special symbols; if it finds a symbol ``application``, it assumes
this is a WSGI application that is ready to accept the request.  If it
finds a symbol that matches the name of the module (e.g., ``edit``),
then it assumes that is an application *factory*, meaning that when
you call it with no arguments you get a WSGI application.

Another function takes "unknown" files (files for which no better
constructor exists) and creates an application that simply responds
with the contents of that file (and the appropriate ``Content-Type``).

In any case, ``URLParser`` delegates as soon as it can.  It doesn't
parse the entire path -- it just finds the *next* application, which
in turn may delegate to yet another application.
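
To make the ``SCRIPT_NAME``/``PATH_INFO`` shuffle concrete, here is a
minimal sketch of popping one segment.  ``pop_segment`` is a
hypothetical helper written only for this illustration; it is not
Paste's own code, and it assumes ``PATH_INFO`` is either empty or
starts with ``/``, as described above.

```python
def pop_segment(environ):
    """Move the next path segment from PATH_INFO onto SCRIPT_NAME.

    Hypothetical helper for illustration only.  Returns None when
    PATH_INFO is empty, '' when it is a single '/', and otherwise the
    segment that was moved.
    """
    path = environ.get('PATH_INFO', '')
    if not path:
        return None
    # Split "/blog/edit/285" into "blog" and the remainder "/edit/285"
    segment, sep, rest = path[1:].partition('/')
    environ['SCRIPT_NAME'] = environ.get('SCRIPT_NAME', '') + '/' + segment
    environ['PATH_INFO'] = sep + rest
    return segment


environ = {'SCRIPT_NAME': '', 'PATH_INFO': '/blog/edit/285'}
first = pop_segment(environ)
# first is 'blog'; SCRIPT_NAME is now '/blog' and PATH_INFO '/edit/285'
```

Each application in the chain can repeat this step, so the part of the
URL already consumed always sits in ``SCRIPT_NAME``.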

Here's a very simple implementation of URLParser::

    import os
    from paste import wsgilib

    class URLParser(object):

        def __init__(self, dir):
            self.dir = dir

        def __call__(self, environ, start_response):
            # Moves one segment from PATH_INFO onto SCRIPT_NAME
            segment = wsgilib.path_info_pop(environ)
            if segment is None:  # No trailing /
                # Redirect to the same URL with a trailing slash
                location = environ.get('SCRIPT_NAME', '') + '/'
                start_response('301 Moved Permanently',
                               [('Location', location)])
                return []
            for filename in os.listdir(self.dir):
                if os.path.splitext(filename)[0] == segment:
                    return self.serve_application(
                        environ, start_response, filename)
            # Nothing matched: 404 Not Found
            start_response('404 Not Found',
                           [('Content-Type', 'text/plain')])
            return [b'Not Found']

        def serve_application(self, environ, start_response, filename):
            basename, ext = os.path.splitext(filename)
            filename = os.path.join(self.dir, filename)
            if os.path.isdir(filename):
                return URLParser(filename)(environ, start_response)
            elif ext == '.py':
                # import_module is a helper (not shown here) that loads
                # a module from the given file path
                module = import_module(filename)
                if hasattr(module, 'application'):
                    return module.application(environ, start_response)
                elif hasattr(module, basename):
                    # An application factory: call it with no
                    # arguments to get the WSGI application
                    app = getattr(module, basename)()
                    return app(environ, start_response)
            else:
                return wsgilib.send_file(filename)

Modifying The Request
=====================

Well, URLParser is one kind of parser.  But others are possible, and
aren't too hard to write.

Let's imagine a URL like ``/2004/05/01/edit``.  It's likely that
``/2004/05/01`` doesn't point to anything on file, but is really more
of a "variable" that gets passed to ``edit``.  So we can pull those
segments off and put them somewhere.  This is a good place for a WSGI
extension.  Let's put them in ``environ['app.date_parts']``.

We'll pass one other application in -- once we get the date (if any)
we need to pass the request on to an application that can actually
handle it.  This "application" might be a URLParser or similar system
(that figures out what ``/edit`` means).

::

    from paste import wsgilib

    class GrabDate(object):

        def __init__(self, subapp):
            self.subapp = subapp

        def __call__(self, environ, start_response):
            date_parts = []
            # Pull off up to three leading integer segments
            while len(date_parts) < 3:
                first, rest = wsgilib.path_info_split(
                    environ.get('PATH_INFO', ''))
                try:
                    date_parts.append(int(first))
                    wsgilib.path_info_pop(environ)
                except (ValueError, TypeError):
                    # Not a number, or no segment left: stop here
                    break
            environ['app.date_parts'] = date_parts
            return self.subapp(environ, start_response)

This is really like traditional "middleware", in that it sits between
the server and just one application.

Assuming you put this class in the ``myapp.grabdate`` module, you
could install it by adding this to your configuration::

    middleware.append('myapp.grabdate.GrabDate')

Object Publishing
=================

Besides looking in the filesystem, "object publishing" is another
popular way to do URL parsing.  This is pretty easy to implement as
well -- it usually just means using ``getattr`` with the popped
segments.  But we'll implement a rough approximation of `Quixote's
<http://www.mems-exchange.org/software/quixote/>`_ URL parsing::

    from paste import wsgilib

    class ObjectApp(object):

        def __init__(self, obj):
            self.obj = obj

        def __call__(self, environ, start_response):
            obj = self.obj
            next = wsgilib.path_info_pop(environ)
            if next is None:
                # This is the object, let's serve it...
                return self.publish(obj, environ, start_response)
            next = next or '_q_index'  # the default index method
            if next in obj._q_exports and getattr(obj, next, None):
                return ObjectApp(getattr(obj, next))(
                    environ, start_response)
            next_obj = obj._q_traverse(next)
            if not next_obj:
                # Do a 404
                start_response('404 Not Found',
                               [('Content-type', 'text/plain')])
                return [b'Not Found']
            return ObjectApp(next_obj)(environ, start_response)

        def publish(self, obj, environ, start_response):
            if callable(obj):
                output = str(obj())
            else:
                output = str(obj)
            start_response('200 OK', [('Content-type', 'text/html')])
            # WSGI response bodies are byte strings
            return [output.encode('utf-8')]

The ``publish`` method is a little weak, and functions like
``_q_traverse`` aren't passed interesting information about the
request, but this is only a rough approximation of the framework.
Things to note:

* The object has standard attributes and methods -- ``_q_exports``
  (attributes that are public to the web) and ``_q_traverse``
  (a way of overriding the traversal without having an attribute for
  each possible path segment).

* The object isn't rendered until the path is completely consumed
  (when ``next`` is ``None``).  This means ``_q_traverse`` has to
  consume extra segments of the path.  In this version ``_q_traverse``
  is only given the next piece of the path; Quixote gives it the
  entire path (as a list of segments).

* ``publish`` is really a small and lame way to turn a Quixote object
  into a WSGI application.  For any serious framework you'd want to do
  a better job than what I do here.

* It would be even better if you used something like `Adaptation
  <http://www.python.org/peps/pep-0246.html>`_ to convert objects into
  applications.  This would include removing the explicit creation of
  new ``ObjectApp`` instances, which could also be a kind of fall-back
  adaptation.

Anyway, this example is less complete, but maybe it will get you
thinking.
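
For comparison, the plain ``getattr`` style mentioned at the top of
this section can be sketched without Paste or Quixote.  Everything
here is hypothetical and only for illustration: ``make_publisher`` and
the ``Blog`` class are made-up names, and the rule that names starting
with ``_`` are private is an assumption standing in for something like
``_q_exports``.  Unlike the examples above, this sketch walks the
whole path in one application and does not update ``SCRIPT_NAME``.

```python
def make_publisher(root):
    """Return a WSGI app that walks attributes of `root` by path segment."""
    def app(environ, start_response):
        obj = root
        path = environ.get('PATH_INFO', '')
        segments = [s for s in path.split('/') if s]
        for name in segments:
            # Assumption: names starting with "_" are never published
            if name.startswith('_') or not hasattr(obj, name):
                start_response('404 Not Found',
                               [('Content-Type', 'text/plain')])
                return [b'Not Found']
            obj = getattr(obj, name)
        # Call methods, render everything else with str()
        body = obj() if callable(obj) else obj
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return [str(body).encode('utf-8')]
    return app


class Blog(object):
    """Made-up object tree used only to exercise the publisher."""
    title = 'My blog'

    def latest(self):
        return 'newest post'


app = make_publisher(Blog())
```

With this in place a request for ``/latest`` calls ``Blog.latest``,
and ``/title`` renders the attribute directly.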