
Chapter 7: Introducing the Model and SQLAlchemy

When people think about a model layer, they often immediately think of using a relational database management system (RDBMS) such as PostgreSQL or MySQL. In fact, there are many different ways to store your data in a Pylons application, so there are many different ways to model that data. It is important to decide on the correct approach for your particular needs. Some approaches might include these:

  • Storing data in files on the filesystem

  • Storing data via a third-party storage service such as Amazon S3

  • Storing data in an object database

  • Storing data in an XML database

  • Storing data in a relational database management system

Pylons supports all of these approaches, but each has its advantages and disadvantages. If you are heavily relying on XML data, then an XML database makes sense. If you want to be able to manipulate and store Python objects that don’t need to be indexed quickly, an object database might suit your needs. If you are storing lots of binary data such as photographs or videos that don’t need to be searchable, you might store them in a third-party storage solution such as Amazon S3. And if you have large amounts of related data that needs to be quickly indexed, an RDBMS might be best.

In this chapter, I’ll cover these different approaches to storing information and then give you an in-depth look at how to use RDBMSs with SQLAlchemy in Pylons.

Storing Data in the Filesystem

There isn’t a great deal of point in storing data types such as photos, videos, and other binary data in a database because they take up a lot of space, which will slow down queries. It is much better to store binary data on the filesystem and store only key properties such as the filename or the creation date in a database.

You might be tempted to store your application’s data in your project’s data directory since it is already present and can be customized in your application’s config file. The disadvantage is that because it is already used to store temporary session, cache, and template information, other Pylons developers working on your project might be used to deleting it when they want to clear this temporary information. To avoid this problem, it is better to keep the data directory for cached information and to add a new directory for your users’ data. Let’s call ours attachments, but the location will be customizable in the config file too.

You could write code like this to load one of the files in this directory:

import os
from pylons import config

def load_file(filename):
    path = os.path.join(config['app_conf']['attachments'], filename)
    fp = open(path, 'rb')
    data = fp.read()
    fp.close()
    return data

You can save a file to the directory with a function like this:

def save_file(filename, data):
    path = os.path.join(config['app_conf']['attachments'], filename)
    fp = open(path, 'wb')
    fp.write(data)
    fp.close()

You can list all the files like this:

def list_files():
    path = config['app_conf']['attachments']
    return os.listdir(path)

For this example, you’ll need to add a new variable in your project config file’s [app:main] section:

# You could customize this to specify something like /var/lib/myapp/attachments
# if you prefer
attachments = %(here)s/attachments
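Note that nothing creates this directory for you automatically. As a minimal sketch (assuming the attachments config key above), you might add a check like the following, for example in your project’s websetup.py; the function name here is just an example:

import os
from pylons import config

def ensure_attachments_dir():
    # Create the attachments directory if it doesn't already exist
    path = config['app_conf']['attachments']
    if not os.path.exists(path):
        os.makedirs(path)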

Each Pylons project has a model directory, which is where code for interacting with the application’s data should be stored, so you could define the previous functions in model/__init__.py, for example.

You can get information about a particular file like this:

path = os.path.join(config['app_conf']['attachments'], filename)
size = os.path.getsize(path)

The os.path module, documented at http://docs.python.org/lib/module-os.path.html, has similar functions for accessing other information about files, such as getmtime(path), which returns the modification time.

For additional filesystem information, see the os.stat() function, which returns an object whose attributes correspond to the members of the stat structure, namely:

  • st_mode (protection bits)
  • st_ino (inode number)
  • st_dev (device)
  • st_nlink (number of hard links)
  • st_uid (user ID of owner)
  • st_gid (group ID of owner)
  • st_size (size of file, in bytes)
  • st_atime (time of most recent access)
  • st_mtime (time of most recent content modification)
  • st_ctime (platform dependent; the time of the most recent metadata change on Unix or the time of creation on Windows)

It can be used in two ways, as described in the module documentation at http://docs.python.org/lib/os-file-dir.html:

>>> import os
>>> statinfo = os.stat('somefile.txt')
>>> statinfo
(33188, 422511L, 769L, 1, 1032, 100, 926L, 1105022698, 1105022732, 1105022732)
>>> statinfo.st_size
926L
>>> statinfo[7]
1105022698

You might want to turn the access and modification times into Python datetime objects and then format them in a different way. Note that the modification time is at index 8 (st_mtime); index 7 is the access time:

>>> import datetime
>>> modified = datetime.datetime.fromtimestamp(statinfo[8])
>>> modified
datetime.datetime(2005, 1, 6, 14, 45, 32)
>>> modified.strftime("%Y-%m-%dT%H:%M:%S")
'2005-01-06T14:45:32'

It is sometimes useful to express in words how long ago something happened. You can do so using the time_ago_in_words() function included with WebHelpers:

>>> from webhelpers.date import time_ago_in_words
>>> time_ago_in_words(modified)
'over 2 years'

It is also useful to express a file size in human-readable terms. Here’s a helper that does just that, which you can add to your project’s lib/helpers.py file and use as h.size_to_human():

def size_to_human(size, unit=1024, round=True):
    # Repeatedly divide by the unit size, promoting the unit name each
    # time, until the value is small enough to display.
    unit_name = 'bytes'
    size = int(size)
    if size > unit:
        size = size / float(unit)
        unit_name = 'KB'
    if size > unit:
        size = size / float(unit)
        unit_name = 'MB'
    if size > unit:
        size = size / float(unit)
        unit_name = 'GB'
    size = str(size)
    # Optionally trim long fractional values down to a whole number.
    if round:
        if len(size) > 4:
            size = "%d" % float(size)
    return size + ' ' + unit_name
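For example, in an interactive session the helper behaves like this (the values follow from the arithmetic above):

>>> size_to_human(926)
'926 bytes'
>>> size_to_human(123456789)
'117 MB'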

For further reading on filesystem use, see the Python standard library documentation for the os, os.path, and shutil modules. The shutil module’s copytree() function can be particularly useful on occasion.
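For instance, copytree() recursively copies a whole directory tree; the directory names here are just examples, and the destination must not already exist:

import shutil

# Copy the attachments directory and everything beneath it
shutil.copytree('attachments', 'attachments-backup')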

Storing Data in Amazon S3

If you are building a web application to store very large amounts of information, it is possible that you might prefer to use a third-party storage service to look after your data rather than using your own hard disk space. Amazon S3 is one such service, but there are many others, including CacheFly. The basic principle of these services is that you pay for the bandwidth and storage used. If you have a startup and can’t predict in advance how popular it will be, you may struggle to predict how many servers you will need for storage. By using a third-party service, the storage problem is largely solved because you can just order more storage without needing to buy any more machines.

Amazon S3 works via an XML web services API, but a number of Python libraries provide a Python interface to these services. Here’s how you would upload and retrieve a file from Amazon S3 using a package called boto from http://code.google.com/p/boto/:

from boto.s3.connection import S3Connection
from boto.s3.key import Key
conn = S3Connection('<aws access key>', '<aws secret key>')
bucket = conn.create_bucket('pylonsbook')
k = Key(bucket)
k.key = 'foobar'
k.set_contents_from_filename('foo.png')
k.get_contents_to_filename('bar.png')

In S3, bucket names share a single global namespace rather than being unique to individual users, so you will have to find a bucket name that hasn’t already been taken rather than using pylonsbook. You can install boto using Easy Install like this:

$ easy_install "boto==1.4c"

Once the file is uploaded, your users will be able to access it directly without you needing to download it again every time it is requested, because it has a publicly accessible URL. You can visit a file uploaded with the previous code at http://s3.amazonaws.com/pylonsbook/foobar.

Amazon S3 also allows you to store metadata about files you upload. As long as you don’t need to be able to search this metadata, you might find Amazon S3 provides all the tools you need for your particular application. Here’s how you would set some metadata associated with the file:

k.set_metadata('meta1', 'This is the first metadata value')
k.set_metadata('meta2', 'This is the second metadata value')

This code associates two metadata key/value pairs with the key k. To retrieve those values later, you’d use this code:

>>> k.get_metadata('meta1')
'This is the first metadata value'
>>> k.get_metadata('meta2')
'This is the second metadata value'

Tip

To test this example, you would need to sign up for an Amazon web services account and replace the example values <aws access key> and <aws secret key> with your real Amazon keys. You will be charged for any data you store on Amazon, although for a simple test like this, the charge is very low. Just remember to delete your data if you don’t want to be continually charged for its storage each month.

Exploring Database Approaches

Storing data structures in files or via third-party storage solutions clearly isn’t the right approach for all data storage needs. Often the key requirement is to be able to search or select related sets of information. In that case, a database is a sensible way to go.

I’ll discuss the different types of databases you can use in your Pylons application.

Object Databases

If most of the data in your Pylons applications is in the form of classes, one very sensible way of storing that data is in an object database. An object database looks like a Python dictionary that is automatically saved to disk. You can store strings, numbers, dates, class instances, or even nested dictionaries and lists to create arbitrarily deep data structures. Compared to a regular Python dictionary, you have to call a few extra commands to open the database and commit changes, but reading/setting values works exactly like the normal Python operations. This avoids the complexity of converting a Python data structure to a non-Python medium (XML or RDBMS tables), and it allows you to quickly prototype a model because you can easily change and extend it.

Two object databases are available for Python: Durus and ZODB. Durus is smaller and simpler, while ZODB is the database used in large Zope applications. Durus is recommended only for databases with fewer than 1 million records.

Durus and ZODB can store only “pickleable” data types, in other words, those that can be serialized with Python’s pickle module. This includes all the standard data types including lists and dictionaries and instances of classes defined at the top level of their module. It does not include objects tied to external resources (an open file object or a database connection) or classes defined inside another class or inside a function. The Python standard library lists exactly which types can be pickled; see http://docs.python.org/lib/node317.html. Some users choose to store only built-in Python types (for example, dicts instead of class instances) to guarantee the data can always be unpickled on any Python system.

Both Durus and ZODB have a “persistent” class. Any object subclassing this will be saved and loaded separately rather than with its parent object.
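As a minimal sketch of how this looks in practice, here is an example using ZODB (assuming the ZODB3 package is installed; Durus is similar in spirit but has its own API):

from ZODB import FileStorage, DB
from persistent import Persistent
import transaction

class Page(Persistent):
    # Subclasses of Persistent are saved and loaded independently
    def __init__(self, title):
        self.title = title

storage = FileStorage.FileStorage('test.fs')  # the data file on disk
db = DB(storage)
connection = db.open()
root = connection.root()             # behaves like a Python dictionary
root['front'] = Page('Front Page')   # set values just as you would in a dict
transaction.commit()                 # changes are saved only when committed
db.close()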

The main disadvantage of object databases is that all searching is done in Python code, in for loops you write, while an RDBMS such as PostgreSQL has heavily optimized C routines for searching very quickly and with low memory overhead. Depending on the nature of your data and the types of searches you do, an RDBMS may or may not have a significant performance advantage. If you are considering using an object database, you should weigh this against the programming convenience of using the familiar and flexible Python types an object database provides.

Some users unfamiliar with object databases wonder how stable they are. Of course, this is a question you should ask about any database engine before trusting your data to it. Durus and ZODB use an append-only strategy with a simple filesystem layout to minimize the possibility of errors. Rather than overwriting objects, new versions are simply appended to the end of the file, and the old versions are abandoned. Backing up the data is a simple matter of copying the file. If the latest transaction at the end of the file gets corrupted or incompletely written, Durus and ZODB will simply truncate the file to return to the state that existed before the last transaction. Periodically the administrator runs a “pack” operation to rewrite the file without the abandoned sections, shrinking the file size.

Since the majority of Pylons developers use an RDBMS for their model, documentation on using ZODB or Durus is very thin. If an object database is an approach you’d like to consider, then these links might help:

Durus

http://www.mems-exchange.org/software/durus/, http://sluggo.scrapping.cc/python/pylons/pylons-durus.html

ZODB

http://pypi.python.org/pypi/ZODB3, http://en.wikipedia.org/wiki/ZODB (links to tutorials)

XML Databases

XML databases use XML documents as the unit of data they store and manipulate. If your Pylons application uses a lot of XML, it might make sense to store that information directly as XML in an XML database rather than storing it in another type of database. The following are the advantages of this approach:

  • You don’t need to do any conversion between the data store and the document format your application uses.

  • You can use query languages such as XPath and XQuery to quickly perform searches on documents in an optimized way.

Two XML databases you can use with Pylons are eXist and Berkeley DB XML:

eXist XML database (http://exist.sourceforge.net/):

The eXist server is written in Java but has XML-RPC and REST-style HTTP APIs that can be used from a Pylons application. Some parts of the main pylonshq.com web site currently use an eXist back end.

Oracle Berkeley DB XML (http://www.oracle.com/database/berkeley-db/xml/index.html):

This is an open source, embeddable XML database with XQuery-based access to documents stored in containers. DB XML has a Python binding that could be used to integrate it into a Pylons application. One thing to be aware of with DB XML is that its license would require that your Pylons application be released under an open source license too unless you bought a commercial license from Oracle.

Relational Database Management Systems

Despite the advantages of object databases and XML databases for certain situations, the vast majority of people choose to use an RDBMS for the data persistence layer of their applications. Most of the time when people refer to a database, they mean an RDBMS such as MySQL or PostgreSQL, among many others. In the relational model, data and relationships are represented in tables, rows, and columns that are defined and manipulated using a special language called Structured Query Language (SQL; pronounced “sequel”).
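As a quick illustration of these concepts, here is a short session using Python’s built-in sqlite3 module (available in Python 2.5 and newer); the table and values are invented for the example:

import sqlite3

conn = sqlite3.connect(':memory:')  # a throwaway in-memory database
# Tables, rows, and columns are defined and manipulated with SQL
conn.execute("CREATE TABLE page (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO page (title) VALUES (?)", ('Front Page',))
for row in conn.execute("SELECT id, title FROM page"):
    print row  # prints (1, u'Front Page')
conn.close()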

RDBMSs can be used in small, personal applications or in huge, multinational projects. Although the basic principles of how to use an RDBMS remain broadly the same in both cases, you will need a much greater understanding of how relational database management systems actually work in order to use them effectively in larger-scale projects because issues such as replication, failover, and partitioning become more important. These topics are beyond the scope of this book, but if you are interested, plenty of information is available online and in specialist books.

Object-Relational Mappers

Object-relational mappers (ORMs) are tools that map the data structures in your database (namely, the rows in each table) to objects in your Pylons application. As you manipulate the objects in the application, the ORM automatically generates the SQL necessary to manipulate the underlying data.

Using an object-relational mapper has a number of advantages:

  • They make it much easier and more convenient to work with the underlying data.

  • Your Pylons application will work on any of the database engines supported by the object-relational mapper you use.

  • They usually deal with some of the complications such as connection pools and thread safety for you.

  • They’re often easier for newcomers to learn than SQL.

Although object-relational mappers have major advantages, they are not without their weaknesses:

  • By abstracting away the SQL, you generally have less control over the database than you would have had. Tools such as SQLAlchemy make up for this by also providing you with raw SQL access for the occasions when it is needed.

  • If you don’t understand how object-relational mappers work, it is easy to write inefficient code that requires many SQL statements to be executed. (Careful reading of this chapter should prevent that problem, though.)

  • Object-relational mappers can sometimes contain quite complex code that is necessary to make the interfaces they expose so easy to use. This means that if you run into a problem, it can be hard to track it down in the source code. By choosing a popular ORM such as SQLAlchemy, the chances are that there are very few bugs and that any you do find will be dealt with quickly by the community.

Overall then, the benefits of object-relational mappers outweigh their disadvantages for the vast majority of Pylons developers.

Quite a few object-relational mappers are available for Python:

  • SQLAlchemy (http://sqlalchemy.org) is a modern object-relational mapper and Python SQL toolkit with powerful features, excellent documentation and support, and a full-featured API. It provides a full suite of well-known enterprise-level persistence patterns, is designed for efficient and high-performing database access, and exposes a simple and Pythonic API.

  • Storm (https://storm.canonical.com/) is a newer object-relational mapper from Canonical, the company behind Ubuntu Linux. It is simpler than SQLAlchemy and has thorough unit tests. Storm is particularly designed to feel very natural to Python programmers and exposes multiple databases as stores in a clean and easy-to-use fashion.

  • SQLObject (http://sqlobject.org) is a popular object-relational mapper for providing an object interface to your database, with tables as classes, rows as instances, and columns as attributes. SQLObject is fairly old now, and although it is still used in TurboGears 1 and some other older frameworks, most users now choose SQLAlchemy instead.

By far the most popular tool for use as a model in a Pylons application is SQLAlchemy, and with good reason. It is a very powerful tool that handles the vast majority of cases you are ever likely to need, has a large and helpful community behind it, and has extensive and accurate documentation. That’s not to say it is always the right tool for the job, and as you’ve seen so far in this chapter, Pylons is flexible enough to work with many different tools as a model. For the majority of cases, SQLAlchemy is a really good choice.

Setting Up SQLAlchemy

In this section, you’ll look at everything you need to install and set up in order to use SQLAlchemy.

SQLAlchemy relies on various DB-API 2.0 drivers to handle the actual connections to the RDBMS software. Before you can use SQLAlchemy in earnest, you need to download and install the DB-API 2.0 driver for the RDBMS software you want to use. Not all RDBMSs have a Python DB-API 2.0 driver, and not all Python DB-API drivers can be used with SQLAlchemy.

Table 7-1 outlines the major RDBMSs used by Pylons developers and the Python driver you need in order to be able to use them from Pylons. Other drivers are available for these RDBMSs, but at the time of writing, these are the drivers supported by SQLAlchemy.

It is worth noting that if you are using Python 2.5 or newer, you don’t need to install pysqlite, because it is already included as part of the Python standard library.

Table 7-1. Popular RDBMSs and the Corresponding DB-API Drivers (each RDBMS engine is followed by the Python DB-API 2.0 driver it requires)

PostgreSQL (http://postgresql.org)

    psycopg2 (http://initd.org/projects/psycopg2)

MySQL (http://mysql.org)

    MySQLdb module, packaged as mysql-python (http://sourceforge.net/projects/mysql-python)

SQLite (http://sqlite.org)

    pysqlite (http://initd.org/tracker/pysqlite)

Oracle (http://www.oracle.com/technology/products/database/oracle10g/index.html)

    cx_Oracle (http://www.python.net/crew/atuining/cx_Oracle/)

Microsoft SQL Server (http://microsoft.com/sql/default.aspx)

    pyodbc (recommended; http://pyodbc.sourceforge.net/), adodbapi, or pymssql

Firebird (http://www.firebirdsql.org/)

    kinterbasdb (http://kinterbasdb.sourceforge.net/)

Informix (http://www.ibm.com/software/data/informix/)

    informixdb (http://informixdb.sourceforge.net/)

If you are just looking to get started quickly, SQLite is a good choice. You can download the latest SQLite 3 binary for your platform from http://www.sqlite.org/download.html. Once you have installed it, you will be able to run the sqlite3 command to get an interactive prompt:

$ sqlite3
SQLite version 3.4.0
Enter ".help" for instructions
sqlite>

You can type .help for help or .quit to quit.

I’ll use SQLite for the examples because it is so easy to set up, but you could equally well use any of the systems in Table 7-1. SQLite also has the advantage that the Python modules it needs are already included with Python 2.5 and newer.

Installing the DB-API Driver

Once you have installed, configured, and started the RDBMS you want to use, you need to install the appropriate DB-API 2.0 driver. In the case of SQLite, this is very easy because the software is automatically included with Python 2.5 or newer. If you are using Python 2.4 or older, you will need to install the driver in the same way you would for any RDBMS.

The driver you will need for your RDBMS is listed in Table 7-1 along with the URL where you can obtain it. Most Pylons-related software is available on the Python Package Index and can be installed with the easy_install command, but if you are not running on Windows, it is usually necessary to have a build environment set up, with the Python development package and the appropriate client library for the RDBMS you want to use already installed, so that easy_install can compile the C or C++ extensions the driver requires. For example, with MySQL, you might run this:

$ easy_install mysql-python

This would download the source for the MySQLdb module (a rare example where the package name is significantly different from the module name) and compile it. To compile it successfully, you will need the client library. For example, on Debian Etch, you would need to install the libmysqlclient15-dev package and the python-dev package.

Most commonly used software that isn’t on the Python Package Index will be available through your platform’s software repository. For example, versions of MySQLdb are available for Windows that you can install with its installer, and MySQLdb is available through the repositories for Debian, Ubuntu, Fedora, and other platforms. Mac OS X users can typically download a binary for the version of their operating system too.

If you are compiling a driver from source, it is always a good idea to read the software’s README or INSTALL files and to follow the instructions carefully. Bear in mind that you might need to use an older compiler than the one that comes with your platform.

Although installing a Python database driver sounds like it might be difficult, in practice it is normally easy because you can usually find a binary version.

If you are following along using SQLite and are using Python 2.4 or older, let’s install pysqlite2:

$ easy_install pysqlite

This installs the pysqlite2 module to use in your application, but note that the package name is pysqlite even though you need to import pysqlite2 to use the module.
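You can quickly check that the driver is importable from a Python prompt; the version string shown here is just an example and will depend on the SQLite library you have installed:

>>> from pysqlite2 import dbapi2
>>> dbapi2.sqlite_version
'3.4.0'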

Installing SQLAlchemy

Installing SQLAlchemy is easy. You simply specify the version you require with Easy Install, and it will be downloaded and installed for you. At the time of this writing, the latest version is 0.5, so the examples in this book are likely to work with any version above 0.5 and below 0.6. It is always wise to read the release notes for each new version, though:

$ easy_install "SQLAlchemy>=0.5,<=0.5.99"

If you want to ensure that your application uses only the version of SQLAlchemy you tested your application on, you should specify the version explicitly:

$ easy_install "SQLAlchemy==0.5.0"

Creating a Database

Now that you have the RDBMS software up and running, an appropriate DB-API driver, and SQLAlchemy itself, you will want to create a database.

Creating a database on the command line with SQLite is simply a matter of connecting to it. The database is created if it doesn’t already exist:

$ sqlite3 test.db

You don’t actually need to create a database on the command line with SQLite because a database will automatically be created for you when you connect from SQLAlchemy.
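As a minimal sketch of this (SQLAlchemy’s engine API is covered in detail later in the chapter), connecting is enough to create the file:

from sqlalchemy import create_engine

engine = create_engine('sqlite:///test.db')    # test.db is created if missing
connection = engine.connect()                  # opening a connection creates the file
print connection.execute("SELECT 1").scalar()  # prints 1
connection.close()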

With other databases, things are a little more complex. PostgreSQL uses the createdb command, and MySQL uses a CREATE DATABASE SQL statement. Refer to your RDBMS documentation for the correct approach.
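For example, commands like the following are typical, although the exact commands and required privileges vary by installation (myapp is a placeholder name):

$ createdb myapp                             # PostgreSQL
$ mysql -u root -e "CREATE DATABASE myapp"   # MySQL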

With everything in place, let’s take a look at SQLAlchemy’s architecture.

Exploring SQLAlchemy’s Architecture

SQLAlchemy’s architecture contains a complete set of APIs, each representing one aspect of what is actually going on. Conceptually you can think of these APIs in three layers, each building on top of the previous one:
