Re-imagining WSGI and Pylons
Posted: 2009-05-19 17:55
Tags: Python, Pylons
Simon Willison wrote a blog post today about a micro-framework called djng, a lightweight stack that depends on Django. His implementation ideas seem quite similar to what Pylons already does with StackedObjectProxies, but I want to share where I've got to with the work I started 6 months ago to re-imagine Pylons. Before I get too carried away, let me start at the beginning...
During the process of writing The Definitive Guide to Pylons I came across lots of things I thought could be improved. Anyone who's worked on a large open source project will know that getting consensus for change can be difficult, particularly if you don't have concrete ideas about how to solve the problems you can see. Anyone who's written a book about an evolving product will know that simply keeping on top of the changes in the existing product is extremely difficult, and anyone who has tried to do both at the same time will realize it is close to impossible! I finished the last work on the print version of the book about 6 months ago and then started work on the code for a start-up I'm launching this year (still top secret at the moment). Rather than simply taking Pylons 0.9.7, I started with an empty Python file and added code as I needed it, with a view to using as many of the Pylons components as possible but not including any code which touched the problem areas I was aware of from writing the book.
What I've produced as a result is what I call an "enterprise micro-framework". It's an extremely simple architecture made of small components that would feel right at home in any home-made framework, but it is powerful and complete enough to run huge multi-server systems. Now that the code is more or less complete I feel in a good position to explain it, and to explain why it is the most effective architecture I know for web development.
I'll start the description in this post and then hopefully blog about more aspects of the system over the next few weeks.
Let's get started...
What's Wrong with WSGI?
Actually nothing! WSGI is a stroke of brilliance and has done a huge amount to help all the Python web framework communities. The one problem with it has been how people like me interpreted this paragraph from PJE's WSGI PEP 333:
If middleware can be both simple and robust, and WSGI is widely available in servers and frameworks, it allows for the possibility of an entirely new kind of Python web application framework: one consisting of loosely-coupled WSGI middleware components. Indeed, existing framework authors may even choose to refactor their frameworks' existing services to be provided in this way, becoming more like libraries used with WSGI, and less like monolithic frameworks. This would then allow application developers to choose "best-of-breed" components for specific functionality, rather than having to commit to all the pros and cons of a single framework.
I've used this quote in many talks myself because I felt my interpretation of it was important. It gave me a license to put any kind of service (e.g. database connections, templating set-ups etc.) into the WSGI environment along with all the CGI-like string variables like QUERY_STRING. Since the environ object is available throughout a WSGI application it was a very convenient place to have all these other objects too. Pylons uses this approach quite a lot and, although it proves to be very, very useful, it led PJE to write a post entitled WSGI Considered Harmful in which he criticised the use of Python objects in the environment. He suggests instead that such things should be dealt with by passing the WSGI environ to an object which provides a particular API, rather than putting that object in the WSGI environ. Here's a pertinent quote:
Meanwhile, you then have to pull stuff out of the environment in order to use it, doing crap like environ['some.session.service'].whatever() in order to do something that could've been written more clearly as SomeSessionService(environ).whatever(), and doesn't require you to stack a bunch of so-called "middleware" on top of your application object.
So please, end the madness now.
The problem with PJE's criticism is that, however ugly it appeared, the approach of adding services to the WSGI environ worked extraordinarily well and was really useful. There were various attempts to deal with PJE's criticisms by simply renaming the WSGI middleware components which add services to the WSGI environment as WSGI Framework Components, or WFCs. That way the term "WSGI middleware" would indeed refer only to components that operated at the HTTP level, and we could all carry on regardless.
This all might sound a bit esoteric, so for those of you who haven't used Pylons before and aren't too familiar with WSGI, here's a quick run-down of how a WSGI application using the "bung it all in the environment" approach might work:
    def get_posts_app(environ, start_response):
        # The pool was placed in environ by a WSGI Framework Component
        connection = environ['database.pool'].connect()
        status = '200 OK'
        headers = [('Content-type', 'text/html')]
        start_response(status, headers)
        return [
            '<html>\n'
            '<head><title>Blog Posts</title></head>\n'
            '<body>\n'
            '<h1>Blog Posts</h1>\n',
            get_posts(connection),
            '</body>\n'
            '</html>\n',
        ]

As the request comes in it passes through the stack of WSGI middleware and WSGI Framework Components. One of the WSGI Framework Components adds a database connection pool to the environ dictionary as the database.pool key. The request eventually gets to the WSGI application (get_posts_app), which is called with environ and start_response arguments. The environ argument contains all the request information and the services set up by WSGI Framework Components. The start_response argument is a function created by the server so that the application can tell it the HTTP status and the headers which it needs to send to the browser before the page itself is returned.
The interesting thing in this example is how the database connection is used. It is extracted from the environ dictionary and then passed to the get_posts() function (which I haven't shown) which would simply return the HTML for some blog posts using a database connection.
The application then returns the page as an iterable containing strings.
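To make the flow above concrete, here is a minimal runnable sketch of the whole round trip. FakePool, the body of get_posts(), pool_component() and call_app() are hypothetical stand-ins made up for illustration. Only the environ/start_response calling convention and wsgiref.util.setup_testing_defaults come from WSGI and the standard library. On Python 3 a WSGI application returns bytes, so the body is encoded.

```python
from wsgiref.util import setup_testing_defaults


class FakePool(object):
    """Hypothetical stand-in for a real database connection pool."""

    def connect(self):
        return "fake-connection"


def get_posts(connection):
    # Hypothetical helper: a real version would query the database.
    return "<p>First post (via %s)</p>\n" % connection


def get_posts_app(environ, start_response):
    # Pull the service out of the environment, as described above.
    connection = environ["database.pool"].connect()
    start_response("200 OK", [("Content-Type", "text/html")])
    return [get_posts(connection).encode("utf-8")]


def pool_component(app, pool):
    """A 'WSGI Framework Component': adds the pool to environ, then calls app."""

    def wrapped(environ, start_response):
        environ["database.pool"] = pool
        return app(environ, start_response)

    return wrapped


def call_app(app):
    """Call a WSGI app once with a fake request and return (status, body)."""
    environ = {}
    setup_testing_defaults(environ)
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body.decode("utf-8")


application = pool_component(get_posts_app, FakePool())
status, body = call_app(application)
```

A real server would take the place of call_app() here, and a real pool would replace FakePool.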
How Can We Make this Example Better?
Well, one approach is to get rid of start_response() and this has been discussed here and here. My interest lies elsewhere. I want to think about how we can avoid having to put things like database connection pools in the WSGI environ dictionary.
The most obvious thing which springs to mind is to pass services such as the database connection pool as arguments to the WSGI application along with environ and start_response like this:
    def get_posts_app(environ, start_response, pool):
        # The pool is now an explicit argument rather than an environ key
        connection = pool.connect()
        status = '200 OK'
        headers = [('Content-type', 'text/html')]
        start_response(status, headers)
        return [
            '<html>\n'
            '<head><title>Blog Posts</title></head>\n'
            '<body>\n'
            '<h1>Blog Posts</h1>\n',
            get_posts(connection),
            '</body>\n'
            '</html>\n',
        ]

That's looking neater already. Now imagine I use a templating system as well as a database. The new app might look like this:
    def get_posts_app(environ, start_response, pool, render):
        connection = pool.connect()
        status = '200 OK'
        headers = [('Content-type', 'text/html')]
        start_response(status, headers)
        return render('blog_posts.html', posts=get_posts(connection))

Now that's looking more like what you'd expect to see in a proper application framework. You can imagine that a real app might have more service arguments though.
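One way to keep such an application callable by an ordinary WSGI server is to bind the extra service arguments once, up front. The sketch below does that with functools.partial from the standard library. FakePool and the bodies of get_posts() and render() are hypothetical stand-ins, and this is just one possible way of wiring things up, not necessarily how any framework does it.

```python
from functools import partial


class FakePool(object):
    """Hypothetical stand-in for a real connection pool."""

    def connect(self):
        return "fake-connection"


def get_posts(connection):
    # Hypothetical helper returning post titles for the template.
    return ["First post", "Second post"]


def render(template_name, **context):
    # Hypothetical renderer: a real one would load and fill in a template.
    items = "".join("<li>%s</li>" % post for post in context["posts"])
    return [("<ul>%s</ul>" % items).encode("utf-8")]


def get_posts_app(environ, start_response, pool, render):
    connection = pool.connect()
    start_response("200 OK", [("Content-Type", "text/html")])
    return render("blog_posts.html", posts=get_posts(connection))


# Bind the extra services once; the result has the plain WSGI signature
# application(environ, start_response).
application = partial(get_posts_app, pool=FakePool(), render=render)
```

The catch is that every application with a different set of services needs its own binding.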
Two Leaps of Faith
From this point I'm going to ask you to make two leaps of faith.
Think of environ and start_response as Services
It is easy to think of pool and render as services (even though we haven't formally defined "service" yet) but I want you to also think of environ and start_response as being services. After all, environ is an object that provides information about the request, the server and the WSGI variables, and start_response() is a service that lets you set response information. You can look at it like this: it just so happens that the environ and start_response "services" have been standardised in the WSGI spec whereas pool and render haven't.
There's one problem with this "pass services as arguments" approach though: different applications need different services and we wouldn't want an API which was different for every different possible combination of services an application could use. We'll come back to this in a minute. First let's look at another problem with most web frameworks: thread-locals.
Thread-Local Hell
HTTP connections are generally very slow at getting data to and from the server, so dedicating an entire process to each request is very inefficient. Instead, the same process can handle multiple requests at the same time using threads. This is great for performance but introduces a technical challenge: if you aren't careful, different threads can change each other's data because, after all, they share the same memory. The vast majority of the time you'll never notice this problem as a web developer because the framework you are using takes care of it.
In the case of Pylons, global variables are used along with StackedObjectProxies to solve the problem. Here's some pseudo-code demonstrating the sort of approach Pylons takes:

    from pylons import request, response, pool, render

    def get_posts_app():
        connection = pool.connect()
        response.status = '200 OK'
        response.content_type = 'text/html'
        return render('blog_posts.html', posts=get_posts(connection))

As you can see, rather than passing the services as arguments to the application, you import StackedObjectProxies from Pylons and use them as global variables. Pylons ensures that whichever thread you access the "service" from, you get the correct data for that thread. This magic all happens behind the scenes using thread-locals to make programming a multithreaded web application as easy as programming any other sort of code. Other frameworks use variations of this idea, and other components used with Pylons (such as the SQLAlchemy session) have their own implementations of the same idea too.
There are some problems with the thread-local approach though:
- Thread-locals used as globals are hard to understand
- They hide what is really going on (which breaks the principle of least surprise)
- They make it very hard to use code that relies on the StackedObjectProxies outside of a web request, because the proxy objects aren't initialised until a request starts, so any code treating them as normal objects won't work.
With these thoughts in mind, why not avoid the threading problem completely by explicitly passing an object containing all the services we need from one part of an application to another? Let's call that object state, as it represents the state of the application and all its services for the particular thread which is executing. Our code instead looks like this:
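The per-thread behaviour, and the third problem in particular, can be seen with the standard library's threading.local. This sketch shows only the underlying idea and is not how StackedObjectProxy itself is implemented. The names request, current_path() and handle() are made up for illustration.

```python
import threading

# A module-level "global" that holds a separate value for each thread.
request = threading.local()


def current_path():
    # Code deep inside the application reads the global directly.
    return request.path


def handle(path, results):
    # Each "request" thread sets its own value before running app code.
    request.path = path
    results[path] = current_path()


results = {}
threads = [
    threading.Thread(target=handle, args=(path, results))
    for path in ("/posts", "/about")
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

# Outside any "request" (here, the main thread) nothing has been set up,
# so the same code fails.
try:
    current_path()
    outside_request_works = True
except AttributeError:
    outside_request_works = False
```

Each thread sees only the path it set itself, while the call from the main thread raises AttributeError because no request ever initialised the global there.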
    def get_posts_app(state):
        connection = state.pool.connect()
        state.response.status = '200 OK'
        state.response.content_type = 'text/html'
        return state.render('blog_posts.html', posts=get_posts(connection))

This new application takes just one argument, state, which is an object containing all the "services" (including the WSGI environ and the WSGI start_response() callable) and which can be customised depending on the services the application requires. It means there is a bit more typing (you have to type state in front of each service) and you have to adapt existing code to explicitly take a state argument rather than relying on the right data magically appearing in the function you are using it from. Whilst this is a bit of a hassle, the vast reduction in complexity is well, well worth it.
Having a single state argument to the application also solves the problem of different applications requiring different services.
So that's the two leaps of faith:
- Treating environ and start_response() as just one of many services
- Having an explicit state object containing all services, passed explicitly around that application as a function argument rather than a myriad of thread-local hacks
[If you've been following this for a long time, this is pretty much what I was pushing for right at the start of Pylons in this post from 2005. I just didn't describe it as a state and I didn't describe the attributes exposed as being services]
How Do You Build the State Object?
Now we've established the benefits of a state object, let's have a think about how to build it. Before we do, you need to be aware that there are actually two types of state: application state and request state. Let's look at some examples and think about the dependencies of each of these two types of state.
Application State vs Request State
Objects that get created when an application is first loaded into memory make up the application state, for example database connections, template renderers etc. Objects which are set up on each request make up the request state, for example Request and Response objects, an object which gives access to a session store, authentication information etc. Things which have request state are created at the start of a request and destroyed at the end of a request. Things which have application state are destroyed when the server is shut down.
Now in practice objects which have application state and those which have request state might be related. Think about our database connection pool again. Although the individual connections are created when the application is loaded and thus have application state, you might want to ensure that the same connection is used throughout a particular request so that if a problem occurs, all the changes from that request can be rolled back. This means that the object which manages a connection for a request has request state. This situation happens rather a lot.
Introducing Service Objects
With the distinction between application state and request state firmly established, let's think of what a "service" object might look like to provide an object as an attribute of state. (Hint: it looks a lot like Django middleware by coincidence):
    class DatabasePoolService(object):
        def __init__(self, dsn):
            # Save the input arguments so that they can be accessed if needed elsewhere
            self.dsn = dsn
            # Set up the connection pool
            self.pool = make_connection_pool(dsn)

        def start(self, state, key):
            # The request is starting, create a connection for this request
            state[key] = RequestSpecificConnection(self.pool.connect())

        def stop(self, state, key):
            # Commit the changes and release the connection back to the pool
            state[key].commit()
            state[key].release()

        def error(self, state, key):
            # An error occurred, roll back the changes and return the connection to the pool
            state[key].rollback()
            state[key].release()

An instance of this class will be created when the application is first loaded and remain in memory for the entire lifetime of the application. When it is created it initialises a connection pool which it saves as self.pool. On each request the service's start() method is called to set up any request-specific objects and add them as attributes of the state object passed as an argument. If the key argument is also passed the request-specific object should be added as that key to the state.
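The lifecycle described above can be exercised without a database. In this sketch FakeConnection, FakePoolService, run_request() and the two handlers are hypothetical stand-ins, and a plain dict is used as the state. It only demonstrates the start/stop/error protocol: a successful request is committed and a failing one is rolled back.

```python
class FakeConnection(object):
    """Hypothetical connection that records what happens to it."""

    def __init__(self, log):
        self.log = log

    def commit(self):
        self.log.append("commit")

    def rollback(self):
        self.log.append("rollback")

    def release(self):
        self.log.append("release")


class FakePoolService(object):
    def __init__(self):
        # Application state: lives as long as the service instance does.
        self.log = []

    def start(self, state, key):
        # Request state: a fresh object for each request.
        state[key] = FakeConnection(self.log)

    def stop(self, state, key):
        state[key].commit()
        state[key].release()

    def error(self, state, key):
        state[key].rollback()
        state[key].release()


def run_request(service, handler):
    state = {}
    service.start(state, "pool")
    try:
        handler(state)
    except Exception:
        service.error(state, "pool")
    else:
        service.stop(state, "pool")


def good_handler(state):
    pass


def bad_handler(state):
    raise ValueError("boom")


service = FakePoolService()
run_request(service, good_handler)
run_request(service, bad_handler)
```

Note that stop() is only called when the handler succeeds, so a connection is never committed after it has been rolled back and released.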
Using a Service Object
Now, say as a user you want to use this new database code, this is all you need to do:
    # Create the state (AttributeDict is any object that behaves like a
    # dictionary but whose keys can be accessed as attributes)
    state = AttributeDict()
    # Create a service
    database_service = DatabasePoolService('mysql://james@mysql.example.com:password/test')
    # Start the service
    database_service.start(state, 'pool')
    # Run the application
    try:
        get_posts(state)
    except Exception:
        # Handle errors
        database_service.error(state, 'pool')
    else:
        # Stop the service
        database_service.stop(state, 'pool')

If you have lots of services you might put them all in a dictionary and initialise them all at once:

    # Create the state
    state = AttributeDict()
    # Create the services
    services = {
        'pool': DatabasePoolService('mysql://james@mysql.example.com:password/test'),
        'template': TemplateService('/path/to/templates'),
    }
    # Start the services
    for key, service in services.items():
        service.start(state, key)
    # Run the application
    try:
        get_posts(state)
    except Exception:
        # Handle errors
        for key, service in services.items():
            service.error(state, key)
    else:
        # Stop the services
        for key, service in services.items():
            service.stop(state, key)

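The AttributeDict used for the state is not defined in this post. A minimal sketch of such a class, assuming nothing beyond "a dictionary whose keys can also be accessed as attributes", might look like this:

```python
class AttributeDict(dict):
    """A dict whose keys can also be read and written as attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        self[name] = value


state = AttributeDict()
state.pool = "a connection"          # attribute style
state["template"] = "a renderer"     # dictionary style
```

Raising AttributeError for missing keys keeps hasattr() and getattr() with a default working as they do for ordinary objects.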
All you need to do now is set up the environ and start_response "services" and to replace the call to get_posts(state) with some code which handles dispatch and you have the basis of an entire framework, standardised around the concept of a service:
    def create_application(dsn, template_dir):
        # Create the services once, when the application is loaded
        services = {
            'pool': DatabasePoolService(dsn),
            'template': TemplateService(template_dir),
        }

        def handle_request(environ, start_response):
            # Create the state
            state = AttributeDict()
            # Set up the application state
            state.app = services
            # Set up environ and start_response
            state.environ = environ
            state.start_response = start_response
            # Start the services
            for key, service in services.items():
                service.start(state, key)
            # Run the application
            try:
                if state.environ.get('PATH_INFO') == '/posts':
                    result = get_posts(state)
                else:
                    result = handle_404(state)
            except Exception:
                # Handle errors, then let the server report the failure
                for key, service in services.items():
                    service.error(state, key)
                raise
            # Stop the services
            for key, service in services.items():
                service.stop(state, key)
            return result

        return handle_request

By adding new service objects to this architecture you can easily recreate a framework as sophisticated as Pylons.
It is also very surprising just how many things which are currently written as WSGI middleware can actually be re-implemented much more simply as services.
As you can probably spot, handle_request is a valid WSGI application so you could easily build a WSGI middleware stack around it rather than returning it directly.
In fact, it isn't necessary to have all the services defined in one place either. In my production version of this concept services are created only for the requests that need them (making the possibility of running this entire framework as a CGI script more plausible as only those services which are required for the particular request are instantiated). In the production version the services have dependencies amongst each other, for example the error handler service requires the mail service etc. There are also tools to extract objects from the WSGI environment to turn them into services (so that you can use Beaker session store via the state object for example). You can also create services at different parts of the middleware stack so that WSGI middleware itself can use services.
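The production code isn't shown here, so the following is only a guess at the general shape of "services created on demand, with dependencies". LazyServices, the factory functions and the dependency table are all hypothetical names invented for this sketch, and it makes no attempt to detect circular dependencies.

```python
class LazyServices(object):
    """Create services on first use, building their dependencies first."""

    def __init__(self, factories, dependencies):
        self.factories = factories        # name -> callable creating the service
        self.dependencies = dependencies  # name -> list of names it needs
        self.instances = {}

    def get(self, name):
        if name not in self.instances:
            deps = [self.get(dep) for dep in self.dependencies.get(name, ())]
            self.instances[name] = self.factories[name](*deps)
        return self.instances[name]


created = []


def make_mail():
    created.append("mail")
    return "mail-service"


def make_error_handler(mail):
    created.append("error_handler")
    return "error-handler using %s" % mail


def make_template():
    created.append("template")
    return "template-service"


services = LazyServices(
    factories={
        "mail": make_mail,
        "error_handler": make_error_handler,
        "template": make_template,
    },
    dependencies={"error_handler": ["mail"]},
)

# Asking for the error handler pulls in the mail service it depends on,
# but never touches the template service.
handler = services.get("error_handler")
```

A request that only needs the error handler therefore pays for two services rather than three, which is the property that makes a CGI-style deployment more plausible.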
Note
If you try to apply the services+state approach to existing Pylons code relying on StackedObjectProxies you quickly find it hard work. Since this new API is so clean, though, it is very easy to set up attributes of the state object with the Pylons registry manager so that you can still access them as module globals when you really need to. In fact this stack can run existing Pylons 0.9.7 applications very well. One of the first things I did as a proof of concept was to see if I could get the SimpleSite tutorial from the Pylons Book to run under this stack (including SQLAlchemy) but without the PylonsApp() instance itself, and I'm pleased to say I could.
Where else can services be used?
Well that's the beauty of the services+state approach. You can use the same initialisation code anywhere you want to use the services whether in a request or not. This means there's not only no need for PylonsApp(), there's no need for environment.py, config.py and even no need to have any complex code in websetup.py. Anywhere that wants to use a service outside of a web request just sets it up in the same way and uses the state object as if it were provided in a web application. Everything is nice, clean, simple and well encapsulated. It's just so simple and I've been so close to it for so long, I can't imagine why it's taken me so long to formalise and why it isn't used more widely. [I know Aquarium used an explicit state but I'd be interested to hear of other examples.]
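As an illustration of using the same set-up code outside a web request, the start/stop/error calls can be wrapped in a context manager and used from a command-line script. This is a sketch under assumptions: service_state(), CounterService and import_posts() are hypothetical names, and a plain dict stands in for the state object.

```python
from contextlib import contextmanager


@contextmanager
def service_state(services):
    """Start every service, yield the state, then stop or report the error."""
    state = {}
    for key, service in services.items():
        service.start(state, key)
    try:
        yield state
    except Exception:
        for key, service in services.items():
            service.error(state, key)
        raise
    else:
        for key, service in services.items():
            service.stop(state, key)


class CounterService(object):
    """Hypothetical service: gives each state a list and counts outcomes."""

    def __init__(self):
        self.stopped = 0
        self.failed = 0

    def start(self, state, key):
        state[key] = []

    def stop(self, state, key):
        self.stopped += 1

    def error(self, state, key):
        self.failed += 1


def import_posts(state, titles):
    # The same kind of function a web request handler would call.
    state["posts"].extend(titles)
    return len(state["posts"])


services = {"posts": CounterService()}

# No web server, environ or start_response involved.
with service_state(services) as state:
    imported = import_posts(state, ["Hello", "World"])
```

The function being called neither knows nor cares whether the state was built by a WSGI request handler or by a script like this one.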
So What Does a Production Application Look Like?
Using this approach a production application is nothing more than a series of simple functions (no classes either), each taking a state argument, arranged into files. Services are created where a set of these functions all need to access similar functionality. If it helps, you can think of the state argument as being to a request what the self argument is to a method.
Once the actual controller code is broken down into nothing more than simple functions with a state argument, other aspects of the code become simpler. It becomes easy to have these functions called directly from a middleware component to handle something like a sign-in screen (since all the dependencies are well controlled in the state object and calling the function isn't going to have any unexpected consequences).
Summing up
As I mentioned, I've been using this stack for about 6 months now and I absolutely love it. It is genuinely a real pleasure to use. The only problem is that my success encouraged me to look at other Pylons-related problems and I've also re-invented Routes, FormEncode, solutions to the problems solved by SQLAlchemy and much more. I'm currently re-inventing templating and once that's finished there won't be any Pylons left in my Pylons-like stack. I'm not saying that's necessarily a good thing (after all, Pylons rocks!) but I have found it very refreshing trying to re-tackle every problem solved by Pylons to see if any parts can be improved. I'll try to blog about some of the more interesting experiences when I get the chance.
I'm also very excited that a similar thing is happening in the Django community, because if Django ends up with a similar refactor there would be no problem in mixing and matching Django and Pylons components without necessarily requiring either Django or Pylons, and that could be quite a future.
If you are interested in microframeworks or in the approach I've taken here please get in touch. I'm happy to post comments you email me here.