
Data Wrangling Workflow

13 Apr, 2015

I've noticed a common workflow I use when I'm faced with a new set of CSV files that I want to get the measure of for data wrangling purposes. It looks roughly like this: import everything into SQLite as TEXT, work out the real types, re-import with those types, build a summary table, and audit the assumptions behind it.

Here's a sketch of the kind of import.sql I use for the first pass (the data.csv filename and the data table name are illustrative):

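-- Tell SQLite to parse the input file as CSV.
.mode csv

-- Import data.csv into a table called data. When the table
-- doesn't already exist, .import creates it, taking the column
-- names from the CSV header row and making every column TEXT.
.import data.csv data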

The trick here is the use of the .mode csv command to allow SQLite to parse CSV format.

You can run the SQL like this:

cat import.sql | sqlite3 new.db
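With everything imported as TEXT, I can eyeball the raw values to work out what the real types should be, with something like this (the query is just an illustration):

sqlite3 new.db "SELECT * FROM data LIMIT 10;"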

I then repeat the above process, but since I now know the types, I specify them on the import so that I get proper values for maximums and minimums.
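Here's a sketch of that second pass, with made-up column names. One wrinkle: when the table already exists, .import also loads the CSV header row as an ordinary row, so it needs deleting afterwards.

-- Start again with a fresh table, declaring the types up front.
DROP TABLE IF EXISTS data;
CREATE TABLE data (
    id INTEGER,
    amount REAL,
    created TEXT
);

.mode csv
.import data.csv data

-- .import doesn't skip the header row when the table already
-- exists, so remove it.
DELETE FROM data WHERE id = 'id';

-- With numeric types declared, these compare numerically
-- rather than lexically.
SELECT MIN(amount), MAX(amount) FROM data;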

Next, I create a big summary table that holds the data in the shape I want, and I use this table for queries to explore the data and generate reports.
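A sketch, assuming the imports produced two tables called orders and customers (the names, columns, and join are all invented for illustration):

CREATE TABLE summary AS
SELECT
    o.id,
    o.created,
    o.amount,
    c.name AS customer
FROM orders o
JOIN customers c ON c.id = o.customer_id;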

Finally, I write audit queries to check that the assumptions I relied on to create the summary table hold for each of the files used to make it up.
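For instance, the hypothetical summary above assumes every order row has a matching customer. An audit query for that assumption should return no rows:

SELECT o.id
FROM orders o
LEFT JOIN customers c ON c.id = o.customer_id
WHERE c.id IS NULL;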

I caught up with David Raznick at the weekend and he explained that he follows a very similar process when working with data commercially. We might collaborate on building a tool that puts a web-based interface on this particular workflow, to make things easier for less technical wranglers.
