Pylons and TurboGears 2.0
If you are a Pylons or TurboGears developer you will almost certainly have noticed the excitement in both communities about TurboGears’ announcement that version 2.0 will be built on Pylons. If this is news to you, see Mark Ramm’s and Jonathan LaCour’s posts and the mailing list thread. It is great to see such positive comments about Pylons and brilliant to see the increased collaboration that is already taking place.
People are already talking about how this affects Django and Zope and the possibility of further collaboration with those communities but in this post I just want to describe the direction Pylons was taking before the announcement and the direction it is likely to take in the future just to put the announcement in context.
At the moment Pylons provides a stable, customizable core on top of which is integrated support for all the major Python templating languages and support for a choice of database object relational mappers, JavaScript libraries, form tools etc. This is fantastic for experienced developers because it enables them to pick and choose the best tools for the job and know that Pylons will not get in their way when they try to do something unusual. The downside is that the sheer choice of options for components beyond the core framework is often bewildering for new developers who are more interested in taking the recommended approach rather than researching what is best for their particular needs.
Ben Bangert and I have been discussing this recently and decided that what we needed was a “Pylons Power Pack” which would contain all of the community-recommended best tools of the moment. There would be full support for other Power Packs which would contain the alternative choices so that Pylons would still retain its emphasis on developer choice. As part of this restructure we planned to move all of the code which could be thought of as being related to a Power Pack out of Pylons itself and into the particular Power Pack which used it thus creating a more defined Pylons core. This division between Pylons core and the Pylons Power Packs would have another advantage: the high level tools in the Power Packs are still changing rapidly and there is still a lot of innovation taking place in automating the integration of Models, Views and Controllers. By keeping core Pylons separate from the high-level tools which use it, Pylons would remain flexible enough to be able to adapt.
This is where the TurboGears announcement is very interesting. In reality TurboGears and Pylons are already fairly similar because they have a similar high-level structure and both support the same high-level tools; indeed, Pylons even borrows some of TurboGears’ APIs such as Buffet. Pylons’ low-level middleware architecture is its key strength, though, and by basing TurboGears 2.0 on Pylons the TurboGears community get full access to the flexibility and modularity WSGI provides (including the Pylons interactive debugger) whilst still having access to the high-level tools both frameworks require.
From the Pylons point of view, TurboGears 2.0 can become what was planned for the “Pylons Power Pack”, namely a powerful collection of all the best tools, nicely integrated together, well documented and ready for a beginner to get started with. This really is a win-win situation because the Pylons community gains its much-needed Power Pack and the TurboGears community gains full WSGI support. Add in the fact that now two sets of keen developers will be working to solve the same problems together rather than separately and the Pylons+TurboGears combination will be a force to be reckoned with.
Apart from helping as much as possible with the TurboGears effort I suspect Pylons will keep pretty much to its current course. Top priorities for us are:
- To create a good logo and define a brand so it is clearer what Pylons does
- To completely restructure and improve the Pylons site so that it better befits Pylons’ status (addressing all of these points)
- To better integrate and expand the Python Web Documentation wiki
- To release the new version of AuthKit
There are one or two exciting Pylons announcements to come soon too but you’ll have to wait a little longer to hear them!
Foxmarks on TechCrunch
I noticed that Foxmarks has been mentioned on TechCrunch. This is exciting because Foxmarks is yet another example of a very successful production product written in Pylons. Here is the Foxmarks announcement on the Pylons list. Congrats jj.
Seagate FreeAgent Go 120GB External Hard Drive
After playing with VMWare yesterday and successfully booting Windows XP under Debian (after re-activating it because it complained of hardware changes) I quickly realised that I was going to rapidly run out of disk space if I continued creating new disk images. Since I want my files to be available on lots of different machines it was time to bite the bullet and buy an external hard drive. I looked on dabs.com and decided on the Western Digital 160GB Passport for £69.99. Before I ordered it I thought I’d head down to PC World (something I only ever do if I can’t wait for delivery) and have a look at their range in case there was something comparable. After deciding none of the products they had were good value I gave up and came home only to read on CNET that the FreeAgent Go (one of the drives I had looked at) was actually a very fast hard drive beaten only by the Maxtor OneTouch III Mini Edition in their tests. What is more, the 120GB version is a 7200rpm drive as opposed to the 5400rpm they tested so there was a chance the version in PC World might be even faster. In the end I drove back to PC World and bought it for £69.99.
The box is very nicely styled and when you open it you are greeted with a very simple instruction manual which says “This won’t take long”. It’s right. All you do is take the drive and the cable out of their packaging and plug them in. That’s it. You are then good to go. The USB cable it comes with has two plugs: one is for power and the other is for power and data. The cable is split so that you can easily plug the drive into USB sockets which are at opposite sides of your computer. If you don’t have two USB slots free, not to worry. The drive works perfectly well without having the one with “Power” written on it plugged in because power is still supplied by the other one marked “Power+Data”. If you read the instruction leaflet (which I didn’t) you’ll also notice that the drive has a whopping 5-year warranty. Seagate must be confident of the quality of the product.
One thing that really strikes you is just how small this drive is. Physically it is only fractionally larger than my wallet and my mouse is actually longer than the drive. When it is plugged in the base of the drive glows yellow and when you are transferring data the yellow light fades and comes back slowly in a very pleasing manner.
One other point worth mentioning is that the drive comes with some software. I don’t know what this does, and frankly I’m not interested so I just created a new directory on the drive and moved everything that was on there already to that directory in case I wanted it later. As a result of moving the hidden Autorun.inf the software doesn’t auto play when you plug the drive in and it no longer has the custom drive icon. Perfect.
At this point I decided to test the drive. Since CNET suggested a 10GB transfer should be possible in 10 minutes I tried transferring 270 MP3 and video files totaling 1.21GB. I was horrified to see it took 21 minutes and 5 seconds. Something was wrong.
A quick look in device manager reveals that I have an Intel(R) 82801DB/DBM USB 2.0 Enhanced Host Controller -24CD on my IBM Thinkpad R50e. This page explains:
The 82801DB I/O Controller Hub (ICH4) contains three USB 1.1 (UHCI) controllers and one USB 2.0 (EHCI) controller, supporting up to six ports. Whether a port is controlled by one of the UHCI controllers or by the EHCI controller, and therefore whether or not it supports Hi-Speed, is up to some internal routing logic. I think the six physical ports (external to the ICH chip) can each be run either by the appropriate Universal Host Controller or the Enhanced Host Controller independently of all the others.
The EHCI driver is responsible for setting up the routing to the EHC or the UHC as appropriate. Either it’s not detecting the speed correctly initially, erroneously routing the hi-speed device to the UHC, or there’s a fault in the hardware or driver.
Sure enough, when I plugged the Power+Data cable into the other USB port the drive burst into life. Bottom line: if you do buy this drive for its speed you absolutely must make sure you have a Hi-Speed USB port to make proper use of it. Here are the real test results using the Hi-Speed USB port rather than the USB 1.1 port:
1.21GB music written to the drive in 71 seconds (about 17MB/second)
1.21GB music read from the drive in 78 seconds (about 15.5 MB/second)
The interesting thing here is that reading was slower than writing which strongly suggests that this drive is actually faster than my laptop’s internal hard disk which frankly is plenty fast enough for me! I also tried it with just the one USB cable rather than the two and I attached my USB hub, mouse, printer and keyboard to the other USB port to see if it made a difference. The results:
1.21GB music written to the drive in 73 seconds (about 16.5MB/second)
Basically it doesn’t make any difference. The two seconds could easily have been due to me not starting and stopping the clock accurately.
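The rough MB/second figures quoted for these tests can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: it assumes 1GB is counted as 1000MB (which is what the rounded figures suggest), and the function name and labels are made up for illustration.

```python
def throughput_mb_per_s(size_gb: float, seconds: float) -> float:
    """Average transfer rate in MB/s, assuming 1GB = 1000MB."""
    return size_gb * 1000 / seconds


SIZE_GB = 1.21  # size of the test set of MP3 and video files

# Timings in seconds, taken from the tests described in the post.
timings = {
    "USB 1.1 port, write": 21 * 60 + 5,
    "Hi-Speed port, write": 71,
    "Hi-Speed port, read": 78,
    "Hi-Speed port, write, one cable": 73,
}

rates = {label: throughput_mb_per_s(SIZE_GB, secs) for label, secs in timings.items()}

for label, rate in rates.items():
    print(f"{label}: {rate:.1f} MB/s")
```

On the same basis the original 21 minute transfer works out at just under 1 MB/second, which is what you would expect from a USB 1.1 port.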
One final point worth mentioning is that the drive isn’t really 120GB in size. Most hard disk manufacturers like to say 1GB=1000MB whereas in computing terms 1GB=1024MB. Consequently Windows describes the drive as 111GB capacity. This is still plenty big enough for my purposes though. The drive also comes pre-formatted as NTFS. If you are using Mac OS X or Linux you will need to reformat it to be something more appropriate or else install the FUSE NTFS driver to be able to use the drive. I’ll try that next.
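The 120GB versus 111GB difference is easy to check. Here is a quick sketch of the sum, assuming the manufacturer counts a gigabyte as 1000^3 bytes and Windows counts it as 1024^3 bytes:

```python
ADVERTISED_GB = 120

# Decimal gigabytes, as printed on the box.
bytes_total = ADVERTISED_GB * 1000**3

# Binary gigabytes, the unit Windows uses when it reports capacity.
binary_gb = bytes_total / 1024**3

print(f"{binary_gb:.2f}")
```

The result is a little under 112, which Windows displays as 111GB.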
Update (23/06/07): After investigating ntfs-3g I still think it is too new. Although the software itself is considered stable it requires more recent versions of software than are found in Debian Etch stable. Although I could install it from testing I’ve learned the hard way that installing software from testing or unstable always causes problems in the long run.
The only filesystem which is natively supported under Windows, Mac OS X and most Linux distributions is FAT32. It can handle partitions above 32GB as long as you don’t use Windows tools to format it (it would appear Microsoft would prefer you to use NTFS) but it works fine from Linux. The main limitation is that FAT32 does not support files larger than 4GB but if you can live with that then it should be a fine choice.
To reformat your drive you’ll need to install dosfstools and if you prefer GUI interfaces to do this sort of thing you can use gparted:
sudo apt-get install dosfstools gparted
To format the drive simply start gparted, select the correct drive and reformat it FAT32. To check it works disconnect and reconnect the device.
Once I’d formatted the drive I rebooted into Windows and re-ran the tests. The same transfer took 78 seconds. Slightly longer than with NTFS but not too bad. I did notice that transferring lots of small files is a bit slower and also that Windows’ time estimates are far too high when it starts copying, which gives you the impression to start with that FAT32 is a lot worse than NTFS.
Anyway, I’m very happy now. I have a drive which will work on virtually any modern computer with no drivers or filesystem tweaks and I can easily keep it with my laptop or even in my pocket and know I’ll always have access to my important files.
Update2 (25/06/2007): I’ve noticed the FreeAgent Go drive displays file sizes in property dialogs much faster than my laptop hard drive but that for small files, such as my subversion code checkouts, the FAT32 filesystem is vastly less efficient than NTFS. 236MB of small files take up 446MB of space on NTFS and 1.1GB on FAT32.
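A likely explanation for the difference is that every file occupies a whole number of clusters, so the larger the cluster size, the more space is wasted on small files. The sketch below illustrates the effect. The cluster sizes (4KB for NTFS, 32KB for FAT32) and the file sizes are assumptions chosen for illustration rather than values measured on this drive, and the calculation ignores filesystem metadata.

```python
import math


def allocated_bytes(file_sizes, cluster_size):
    """Space used on disk when every file occupies whole clusters."""
    return sum(math.ceil(size / cluster_size) * cluster_size for size in file_sizes)


# Hypothetical checkout: 10,000 files of 2,000 bytes each.
file_sizes = [2000] * 10_000

# Assumed cluster sizes; the real values depend on how the volume was formatted.
NTFS_CLUSTER = 4 * 1024
FAT32_CLUSTER = 32 * 1024

logical = sum(file_sizes)
ntfs = allocated_bytes(file_sizes, NTFS_CLUSTER)
fat32 = allocated_bytes(file_sizes, FAT32_CLUSTER)

print(logical, ntfs, fat32)
```

With these made-up numbers the same 20MB of data needs roughly twice the space with 4KB clusters and sixteen times the space with 32KB clusters.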
Update3 (20/07/2007): On Debian Etch this drive frequently becomes read-only for no apparent reason. This is easily resolved as follows:
1. Get the sdparm package:
sudo apt-get install sdparm
2. Find out your drive device (in my case /dev/sdb1):
sudo fdisk -l
3. Clear standby mode:
sdparm --clear STANDBY -6 /dev/sdb1
I must admit I don’t fully understand what this does but it looks like it is clearing a flag on the drive itself which causes it to go read only. Anyway, it does the trick for me!
London Hack Day
So I’ve been here at Alexandra Palace in North London at the Hack Day organised by Yahoo! and BBC Backstage since 9:30 yesterday morning. I managed to get about 4 hours sleep but apart from that have been hacking away on various things, meeting new people (including Jacob Kaplan-Moss from Django) and meeting people I know already. By the way wii tennis on a conference projector is truly fantastic.
Every delegate has been asked to try to hack together a project which uses the BBC or Yahoo! APIs. Our team has produced an image recognition engine called Fruitr which determines the type of fruit in a picture and recommends a recipe which uses the fruit. The comparison engine searches all images on Flickr which are tagged with fruitr and determines the picture on Flickr which best matches the picture the user uploaded. It then looks at the other tags associated with that image on Flickr to determine the type of fruit. Once the fruit type has been determined the site fetches a recipe which uses the fruit with a JSON request to another server.
As well as the web interface the engine also has an email interface so you can email your picture to recipe@fruitr.com to receive a recipe via email. This means the site also works with photos taken on a mobile phone and emailed in.
The great thing about the system is that the accuracy can be improved by other users uploading good quality fruit images to Flickr and tagging them with the fruitr tag.
Why not have a go at Fruitr.co.uk? If you need a test image I’ve uploaded a few here.
Running the Numbers - An American Self-Portrait
Came across this link via the MTV Switch blog which gives visual representations of some important statistics. Really brings them home.
Safari 3 Beta for Windows
Apple have announced a Safari 3 beta for Windows. The download is 8MB and should make it easier for people like me to develop websites that work on the Mac version of Safari. I’ve already noticed the 3aims.com site doesn’t render correctly on Safari 3 so that is something I’ll look into.
What’s interesting is that the default install also includes Bonjour (Apple’s network discovery tool) and the Apple Software Update program. Does this mark a new trend of Apple trying to launch core components on the PC?
It looks like the browser tries to use the Mac fonts but the rendering doesn’t look quite right to me.
Update: Sadly it seems to crash every time I exit, with the error “The memory cannot be read”.
Why is an Open Source Toolset a Good Choice for Prototyping?
What is Open Source?
Open source software is software which is distributed together with the source code used to build it so that developers who use the software are free to examine how it works, review it for security issues and contribute improvements. The promise of open source is better quality, higher reliability, more flexibility, lower cost, and an end to predatory vendor lock-in.
Most open source software is licensed under very permissive terms which allow you to use or distribute it free of charge in commercial projects without having to release your own work publicly. Thousands of companies worldwide support and contribute to open source software including Sun, Novell, IBM, HP and Red Hat to name just a few.
Good Quality Tools, Freely Redistributable
The open source model allows developers to pick and choose the very best components for a particular task from anywhere in the world rather than being constrained to the tools supplied by a particular vendor or having to develop the required components from scratch. It is a well known fact that most of the world’s websites run on Open Source software and that even companies such as Yahoo! and Google rely heavily on open source to provide their services. By choosing an open source approach the development team has access to many of the same sets of tools which power the world’s most popular websites. These tools give the developer a major head-start towards producing the first prototype.
Flexibility During the Prototyping
Expectations can quickly change as the prototype develops. As a consequence the set of tools which were considered the best fit for requirements at the start of the prototyping process might not be the best tools half way through. If a team had chosen a proprietary set of tools the cost of change involved with choosing a new vendor and negotiating new licenses could be prohibitive but in an open source model, since all the tools are freely available, the team is simply able to change to other tools and software as needed without becoming encumbered in licensing issues.
Choice Going Forward
At the end of the prototyping phase all the code produced by the prototyping team, together with the open source tools used, can be given to the main project team without licensing restrictions. This means that the main team is able to re-use useful portions of the code and the tools on which they rely if they choose to do so. Since there are many companies which support open source software there would be no requirement to use the same company for the main development phase as was used for the prototype. As a result, future development is not tied in to one vendor from the start. In turn this gives the project team more flexibility to choose the most appropriate partner for later stages of the project.
Interesting links:
Why Open Source Software? Look at the Numbers!
Browser usage stats
Incremental Backups using Rsync
If you simply back up the filesystem on a remote server once a day via a cron job you will be able to restore the data in the event of a hardware failure - clearly very useful! This isn’t the only reason for performing a backup though. Another possibility is that either through an accident, a virus or because of a malicious user you lose some data. If you have a backup you can simply restore the missing or modified files from the backup but it is possible that your cron job might have run before you get a chance. In these circumstances your backup will be a replica of the damaged filesystem, which will be no good at all.
The solution to this problem is to take backups of the filesystem at regular intervals, perhaps once a day or even once an hour. If you do this simply by copying all the files each time in a similar way to the one described in my last rsync article you will quickly run out of disk space on the backup server.
Luckily you can reduce disk usage by making sure files from new backups are created using a hard link to an existing copy of the same file from a previous backup wherever possible. This way, although each backup behaves like a full copy, only files which have changed since the previous backup are actually physically copied; the rest are simply hard linked to the last backup which did contain a full copy of the file.
The example below shows how hard linking works. We have already set up a directory orig with two files, then we create a copy using hard links:
james@bose:~/hard$ cp -al orig copy
The -l option links the files instead of actually copying them.
Notice that the inodes of the files in each directory are the same because they are the same physical files, and that the two directories together use barely more disk space than one of them would on its own.
james@bose:~/hard$ du -h
3.9M ./copy
4.0K ./orig
4.0M .
james@bose:~/hard$ ls -li orig/
total 3988
379977 -rw-r--r-- 2 james james 4074531 2007-06-02 22:06 test1
380075 -rw-r--r-- 2 james james 29 2007-06-02 21:59 test2
james@bose:~/hard$ ls -li copy/
total 3988
379977 -rw-r--r-- 2 james james 4074531 2007-06-02 22:06 test1
380075 -rw-r--r-- 2 james james 29 2007-06-02 21:59 test2
james@bose:~/hard$
The du program only counts hard linked files once so you can see that it actually reports copy as using more disk space than orig. If you ran du on each of the directories individually they would both report a size of 3.9M.
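The same thing can be checked from Python. This is a small sketch using only the standard library; it creates its own temporary orig and copy directories rather than touching the ones in the shell session, and it assumes a filesystem which supports hard links.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    orig = os.path.join(tmp, "orig")
    copy = os.path.join(tmp, "copy")
    os.mkdir(orig)
    os.mkdir(copy)

    # Create one real file in orig.
    source = os.path.join(orig, "test1")
    with open(source, "w") as f:
        f.write("some data\n")

    # Hard link it into copy instead of copying the data.
    linked = os.path.join(copy, "test1")
    os.link(source, linked)

    # Both names point at the same inode, and the inode now has two links.
    same_inode = os.stat(source).st_ino == os.stat(linked).st_ino
    link_count = os.stat(source).st_nlink
    print(same_inode, link_count)
```

The link count corresponds to the 2 shown in the third column of the ls -li output.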
Rsync version 2.5.6 and above supports an option called --link-dest which instructs rsync to use the link-dest directory specified (on the destination machine) as an additional hierarchy to compare destination files against when doing transfers (but only if the files are missing in the destination directory). Unchanged files are then hard linked from the link-dest directory to the destination directory rather than copied from the source location. The files must be identical in all preserved attributes (e.g. permissions, possibly ownership) in order for the files to be linked together. If the link-dest directory is a relative path, it is relative to the destination directory, not the current working directory as you might have expected. Because the link-dest directory is not consulted if a file already exists, it is usually best to run rsync on a new, empty directory.
Anyway, imagine you have already run rsync to create a backup of a remote server filesystem in a directory called backup.0. An hour later you might want to create an incremental backup using the hard linking approach just mentioned so rather than issue this command:
rsync -aHxvz --delete --progress --numeric-ids -e "ssh -c arcfour -o Compression=no -x" root@example.com:/ backup.0/
you would issue these commands:
mv backup.0 backup.1
rsync -aHxvz --delete --progress --numeric-ids -e "ssh -c arcfour -o Compression=no -x" --link-dest=../backup.1 root@example.com:/ backup.0
It is important to get the / characters correct at the end of the paths to ensure rsync copies everything as you expect. This time backup.0 will contain the latest copy of the filesystem but any files which have not changed will simply be hard linked to the corresponding file in backup.1 so the second backup takes up far less space than the first.
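If you want to script the rotation, the two commands can be assembled programmatically. The sketch below only builds and prints the commands without running them. The function name and its parameters are hypothetical, and the options are simply the ones used in this article.

```python
import shlex


def incremental_backup_commands(source, current="backup.0", previous="backup.1"):
    """Return the rotate and rsync commands as argument lists (nothing is run)."""
    rotate = ["mv", current, previous]
    rsync = [
        "rsync", "-aHxvz", "--delete", "--progress", "--numeric-ids",
        "-e", "ssh -c arcfour -o Compression=no -x",
        # Relative to the destination directory, not the working directory.
        "--link-dest=../" + previous,
        source, current,
    ]
    return [rotate, rsync]


commands = incremental_backup_commands("root@example.com:/")
for cmd in commands:
    # shlex.join quotes the ssh options so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Keeping the commands as argument lists means they could later be handed to subprocess.run without worrying about shell quoting.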
As an example here is the disk usage from the backups of one of my servers:
bose:/home/james/files/james/files/Backup# du -hsc backup.0 backup.1
4.9G backup.0
141M backup.1
5.0G total
Note: Once again although technically backup.1 is the older copy, the du command reports that backup.0 is using more disk space. As mentioned earlier this is simply because du considers the folder it comes across first to be the one that it should assign the size of a hard linked file to so backup.0 looks bigger than backup.1. Of course the only number that actually matters is the total. In this case this is 5.0G which is a lot less than the 9.8G which would be needed to store two full copies of the backups if we weren’t using hard links:
bose:/home/james/files/james/files/Backup# du -hs backup.0
4.9G backup.0
bose:/home/james/files/james/files/Backup# du -hs backup.1
4.9G backup.1
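To see why the total is smaller than the sum of the two individual figures, here is a rough Python imitation of the way du avoids counting a hard linked file twice. It is a sketch rather than a reimplementation of du: it adds up file sizes instead of disk blocks, the function name is made up, and the demo works on temporary directories it creates itself.

```python
import os
import tempfile


def disk_usage(paths):
    """Total size of the files under paths, counting each inode only once."""
    seen = set()
    total = 0
    for path in paths:
        for dirpath, _dirnames, filenames in os.walk(path):
            for name in filenames:
                st = os.lstat(os.path.join(dirpath, name))
                key = (st.st_dev, st.st_ino)
                if key not in seen:
                    seen.add(key)
                    total += st.st_size
    return total


with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "backup.0")
    second = os.path.join(tmp, "backup.1")
    os.mkdir(first)
    os.mkdir(second)

    # One 1000 byte file, hard linked into both backups.
    with open(os.path.join(first, "data"), "wb") as f:
        f.write(b"x" * 1000)
    os.link(os.path.join(first, "data"), os.path.join(second, "data"))

    each = (disk_usage([first]), disk_usage([second]))
    combined = disk_usage([first, second])
    print(each, combined)
```

Measured separately each directory reports the full 1000 bytes, but measured together the shared file is only counted once.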
The beauty of this set up is that if you ever need to restore a file from a backup it is as simple as copying it back to the server from the backup you need.
Rsync Basics
The rsync man page describes rsync as a “faster, flexible replacement for rcp”. Rsync is used for copying files between remote hosts and local filesystems although it can also be used to copy files between locations on the same filesystem.
Here is a quick rsync test where we set up a directory root with two files test1 and test2 and rsync them to another directory backup with the command rsync -a -v root/ backup/. Here’s the shell output:
james@bose:~$ mkdir root
james@bose:~$ cd root/
james@bose:~/root$ echo "Test1" > test1
james@bose:~/root$ echo "Test2" > test2
james@bose:~/root$ ls -li
total 8
33132 -rw-r--r-- 1 james james 6 2007-06-02 18:03 test1
33182 -rw-r--r-- 1 james james 6 2007-06-02 18:03 test2
james@bose:~/root$ cd ..
james@bose:~$ mkdir backup
james@bose:~$ rsync -a -v root/ backup/
building file list ... done
./
test1
test2
sent 209 bytes received 70 bytes 558.00 bytes/sec
total size is 12 speedup is 0.04
james@bose:~$ ls -li backup/
total 8
33184 -rw-r--r-- 1 james james 6 2007-06-02 18:03 test1
33187 -rw-r--r-- 1 james james 6 2007-06-02 18:03 test2
james@bose:~$
The -a option copies files in archive mode and the -v option stands for verbose and outputs information while rsync is working. You can also use -vv and -vvv to get progressively more verbose messages.
This works nicely but it is much more interesting to copy files from a remote server. Using the -e option we can specify the remote shell to use, in this case ssh. You can also specify the remote shell by setting the RSYNC_RSH environment variable instead of specifying a value with -e.
rsync -a -v -e ssh user@3aims.com: web10
This copies user’s home directory to a directory called web10 on the local machine. Bear in mind rsync will only be able to copy files which the user you specified has permissions to.
To specify a different remote directory you can put a path after the : character, for example:
rsync -a -v -e ssh user@3aims.com:/home/user web10
You might want to customise how the remote shell behaves, for example this might be a better option for the backups:
rsync -a -v -e "ssh -c arcfour -o Compression=no -x" user@3aims.com:/home/user web10
Here is what the options to -e mean:
- ssh - use ssh instead of the default of rsh
- -c arcfour - uses the weakest but fastest encryption that ssh supports
- -o Compression=no - Turns off ssh’s compression - rsync has its own if you want it which we’ll discuss in a minute
- -x - turns off ssh’s X tunneling feature (if you actually have it on by default)
If bandwidth is a problem you might want to use the -z option to have rsync compress data it sends across the network. If you are using rsync compression it makes sense not to use ssh’s compression in the way demonstrated above. Here’s the command using rsync compression:
rsync -a -v -z -e "ssh -c arcfour -o Compression=no -x" user@3aims.com:/home/user web10
If you want to test how effective each of these commands is you will need to delete the web10 directory rsync creates, otherwise rsync will only copy files which have changed. Whilst that’s normally what you want, it isn’t too useful for tests.
Finally, since rsync is very efficient it can saturate a network connection. If you still want to be able to use your network connection whilst rsync is running you can use the --bwlimit option which allows you to specify a maximum transfer rate in kilobytes per second. Due to the nature of rsync transfers, blocks of data are sent, then if rsync determines the transfer was too fast, it will wait before sending the next data block. The result is an average transfer rate equaling the specified limit. For example to limit rsync to using 100KB/sec you could do this:
rsync -a -v -z --bwlimit=100 -e "ssh -c arcfour -o Compression=no -x" user@3aims.com:/home/user web10
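The send-then-wait behaviour is easier to picture with some numbers. The function below is only an illustration of the idea and is not taken from rsync’s source: after a block has been sent it works out how long the transfer should have taken at the limit and returns the extra time to wait. It assumes a kilobyte of 1024 bytes, and the name throttle_delay is made up.

```python
def throttle_delay(bytes_sent, elapsed_seconds, limit_kb_per_s):
    """Seconds to wait so the average rate falls back to the limit."""
    # Time the transfer would have taken at exactly the limit (1KB = 1024 bytes assumed).
    minimum_time = bytes_sent / (limit_kb_per_s * 1024)
    return max(0.0, minimum_time - elapsed_seconds)


# 512KB sent in one second with a 100KB/sec limit: far too fast, so wait.
delay_fast = throttle_delay(512 * 1024, 1.0, 100)

# 50KB sent in one second: already under the limit, so no wait is needed.
delay_slow = throttle_delay(50 * 1024, 1.0, 100)

print(delay_fast, delay_slow)
```

Because the waiting happens between blocks, the instantaneous rate can still spike; it is only the average which settles at the limit.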
You might also want to use --progress so that rsync prints out a %completion and transfer speed while transferring large files (but this isn’t worth adding if you are running from a cron job). If you are performing a backup which you think you might want to restore at some point in the future you should use --numeric-ids. This tells rsync to not attempt to translate UID <> userid or GID <> groupid which is very important for avoiding permission problems when restoring. You might also want the -H option which forces rsync to maintain hardlinks on the server and the -x option which causes rsync to only copy files from one filesystem and not any other files which might be mounted as part of that directory structure. You can also use the --delete option which deletes files from the backup if they don’t exist on the server. If you use --delete the files are deleted before the copying starts.
Putting this all together the command I use to backup one of my servers looks something like this:
rsync -aHxvz --delete --progress --numeric-ids -e "ssh -c arcfour -o Compression=no -x" root@example.com:/ pauli/
Bear in mind rsync is not much good at backing up databases such as MySQL because they frequently store information in memory so although you may have a copy of the database files, when you restore them you might find the information they contain is corrupt.
For further reading about rsync have a look at the rsync man page or Kevin Korb’s rsync article.
Redirecting Standard Output to Suppress Messages in Cron Jobs
Under Linux if you want to suppress messages written to the standard output stream you can redirect them to a special file called /dev/null which discards all data written to it (but reports that the write operation succeeded), and provides no data to any process that reads from it (it returns EOF).
This functionality is really useful when specifying commands for cron scripts if you only want to be notified if errors occur because you can redirect standard output to /dev/null so that only messages to standard error will be reported.
Let’s demonstrate this. Create a file named output_test.py with the following content:
#!/usr/bin/env python
import sys
sys.stdout.write("From stdout\n")
sys.stderr.write("From stderr\n")
Run this command to change the permissions so the file can be executed:
chmod 755 output_test.py
Then you can test it:
james@bose:~$ ./output_test.py
From stdout
From stderr
As you can see, the output from both standard output and standard error is displayed. Now let’s try redirecting to /dev/null:
james@bose:~$ ./output_test.py > /dev/null
From stderr
As you can see only standard error is actually displayed. We can make use of this in the example below where some_script.py is run every 4 hours and any error information is emailed to james@example.com but any standard output is ignored:
MAILTO=james@example.com
0 */4 * * * some_script.py > /dev/null
With this example an email will only be sent if information is written to standard error.
You might also want to redirect error output to /dev/null instead (but this would normally not be recommended - error messages are there for a reason). The way to do this depends on your shell. I’m using Bash so I can do so like this:
james@bose:~$ ./output_test.py 2> /dev/null
From stdout
In this example the 2 just before > says that the standard error should be redirected to /dev/null. A 1 would refer to standard output and a 0 would refer to standard input. You can also redirect both:
james@bose:~$ ./output_test.py 1> /dev/null 2> /dev/null
james@bose:~$
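The same redirection can be done from inside a Python program with the subprocess module. This is a minimal sketch which assumes Python 3.7 or later. So that it is self-contained it runs a small inline child script equivalent to output_test.py instead of the file itself.

```python
import subprocess
import sys

# Inline stand-in for output_test.py: writes one line to each stream.
child_code = (
    "import sys\n"
    "sys.stdout.write('From stdout\\n')\n"
    "sys.stderr.write('From stderr\\n')\n"
)

result = subprocess.run(
    [sys.executable, "-c", child_code],
    stdout=subprocess.DEVNULL,  # same effect as "> /dev/null"
    stderr=subprocess.PIPE,     # keep standard error so it can be inspected
    text=True,
)

print(result.stderr, end="")
```

Passing subprocess.DEVNULL for stderr as well would discard both streams, just like the last shell example.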
Learn more about redirection at the Wikipedia page.