James Gardner: Home > Blog > 2007 > Building a custom urllib2 opener...

Building a custom urllib2 opener to retrieve a DOI

Posted: 2007-08-18 13:56
Tags: Python, Web

If you haven't come across DOIs before, they are simply unique identifiers of the form 10.XXXX/some_label, where XXXX is a four-digit code assigned to your organisation and some_label is any label you want to create a DOI for, usually the unique ID of some object you want to reference. You can then associate a URL with the DOI so that it can be resolved to a real object on the web.
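To make the format concrete, here is a minimal sketch that splits a DOI into its organisation prefix and label. The function name split_doi is made up for illustration, and the check is deliberately loose: it only assumes the "10." prefix and the first slash described above.

```python
def split_doi(doi):
    """Split '10.XXXX/some_label' into (prefix, label).

    Hypothetical helper: it only checks the overall shape described in
    this post, not whether the DOI is actually registered.
    """
    # Split on the first slash only, so the label may itself contain slashes.
    prefix, sep, label = doi.partition('/')
    if not sep or not prefix.startswith('10.') or not label:
        raise ValueError('not of the form 10.XXXX/some_label: %r' % (doi,))
    return prefix, label


print(split_doi('10.3333/test'))  # ('10.3333', 'test')
```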

The other day I was trying to write some code to programmatically discover the URL a particular DOI resolves to. I wanted to use urllib2 to post my DOI to the same URL that the DOI resolver form at the bottom of http://doi.org posts to, then inspect the HTTP response to find out where the DOI redirects to.
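As a small aside, this is roughly what the posted form body looks like once it has been URL-encoded. It is only a sketch: the import fallback is there so it also runs on Python 3, and just the hdl field is shown.

```python
try:
    from urllib import urlencode        # Python 2, as used in this post
except ImportError:
    from urllib.parse import urlencode  # Python 3

# A list of pairs keeps the field order fixed; 'hdl' is the field name
# this post's code uses for the DOI.
body = urlencode([('hdl', '10.3333/test')])
print(body)  # hdl=10.3333%2Ftest
```

Note that the slash in the DOI is percent-encoded in the body.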

Here was my first attempt:

import urllib2
import urllib

org_id = '10.3333'
label = 'test'
data = {'hdl':org_id+'/'+label, 'x':'13', 'y':'8'}

fp = urllib2.urlopen('http://dx.doi.org', urllib.urlencode(data))
print fp.headers
fp.close()

Unfortunately this doesn't work. By default urllib2 follows the HTTP redirect, so the headers printed are those of the page that was redirected to, not those of the original response that issued the redirect, which is what we wanted:

Date: Sat, 18 Aug 2007 13:04:45 GMT
Server: Apache/2.0.55 (Ubuntu) DAV/2 SVN/1.3.1 mod_python/3.1.4 \
  Python/2.4.3 PHP/5.1.2 proxy_html/2.4 mod_ssl/2.0.55 OpenSSL/0.9.8a
X-Powered-By: PHP/5.1.2
X-Pingback: http://jimmyg.org/xmlrpc.php
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8

To fix this, you need to create your own redirect handler and build an opener that uses it:

import urllib2
import urllib

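class CustomRedirectHandler(urllib2.HTTPRedirectHandler):
    # Let the parent class follow the redirect as usual, then record the
    # status code of the redirect on the final response object.
    def http_error_301(self, req, fp, code, msg, headers):
        result = urllib2.HTTPRedirectHandler.http_error_301(
            self, req, fp, code, msg, headers)
        result.status = code
        return result

    # Treat the other redirect codes the same way as a 301.
    http_error_302 = http_error_303 = http_error_307 = http_error_301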

org_id = '10.3333'
label = 'test'
data = {'hdl':org_id+'/'+label, 'x':'13', 'y':'8'}

opener = urllib2.build_opener(CustomRedirectHandler())
req = urllib2.Request('http://dx.doi.org', urllib.urlencode(data))
fp = opener.open(req)
print fp.url
fp.close()

Now everything works as expected and the URL the DOI resolves to is printed.
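If you would rather look at the original redirect response itself, as I first set out to do, one possible approach is to stop the opener following redirects at all and read the Location header from the resulting error. This is only a sketch: NoRedirectHandler and redirect_target are names invented here, the import fallback lets it run on Python 3 as well, and I am assuming the resolver answers with a 3xx status and a Location header.

```python
try:  # Python 2, as used in this post
    import urllib2 as urlreq
    from urllib2 import HTTPError
    from urllib import urlencode
except ImportError:  # Python 3, where urllib2 became urllib.request
    import urllib.request as urlreq
    from urllib.error import HTTPError
    from urllib.parse import urlencode


class NoRedirectHandler(urlreq.HTTPRedirectHandler):
    """Hypothetical handler that refuses to follow any redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means "do not follow"; the 3xx response then
        # surfaces as an HTTPError raised by the default error handler.
        return None


def redirect_target(url, fields=None):
    """Return (status, location) for the first response from url.

    location is the Location header of a 3xx response, or None if the
    server did not redirect. fields, if given, is posted as form data.
    """
    data = None
    if fields is not None:
        data = urlencode(fields).encode('ascii')
    opener = urlreq.build_opener(NoRedirectHandler())
    try:
        fp = opener.open(url, data)
    except HTTPError as e:
        if 300 <= e.code < 400:
            location = e.headers.get('Location')
            e.close()
            return e.code, location
        raise  # a genuine error such as 404 or 500
    try:
        return fp.getcode(), None
    finally:
        fp.close()
```

Calling redirect_target('http://dx.doi.org', {'hdl': '10.3333/test'}) should then give the redirect status and the URL the DOI points at, although I have not checked that against the live resolver.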

There is some more information about urllib2 and redirects at Dive Into Python. Learn more about DOIs.
