
Data Wrangling Workflow

13 Apr, 2015

I've noticed a common workflow I use when I'm faced with a new set of CSV files that I want to get the measure of for data wrangling purposes. It looks roughly like this: import everything into SQLite as TEXT, work out the real types, re-import with those types, build a summary table, and audit the assumptions behind it.

Here's a sketch of the kind of import.sql I use for the first pass (the data.csv filename and the data table name are illustrative):

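-- Tell SQLite to parse the input file as CSV.
.mode csv

-- Import data.csv into a table called data. When the table
-- doesn't already exist, .import creates it, taking the column
-- names from the CSV header row and making every column TEXT.
.import data.csv data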

The trick here is the use of the .mode csv command to allow SQLite to parse CSV format.

You can run the SQL like this:

cat import.sql | sqlite3 new.db
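With everything imported as TEXT, I can eyeball the raw values to work out what the real types should be, with something like this (the query is just an illustration):

sqlite3 new.db "SELECT * FROM data LIMIT 10;"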

I then repeat the above process, but since I now know the types, I specify them on the import so that I get proper values for maximums and minimums.
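Here's a sketch of that second pass, with made-up column names. One wrinkle: when the table already exists, .import also loads the CSV header row as an ordinary row, so it needs deleting afterwards.

-- Start again with a fresh table, declaring the types up front.
DROP TABLE IF EXISTS data;
CREATE TABLE data (
    id INTEGER,
    amount REAL,
    created TEXT
);

.mode csv
.import data.csv data

-- .import doesn't skip the header row when the table already
-- exists, so remove it.
DELETE FROM data WHERE id = 'id';

-- With numeric types declared, these compare numerically
-- rather than lexically.
SELECT MIN(amount), MAX(amount) FROM data;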

Next, I create a big summary table that holds the data in the shape I want, and I use this table for queries to explore the data and generate reports.
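A sketch, assuming the imports produced two tables called orders and customers (the names, columns, and join are all invented for illustration):

CREATE TABLE summary AS
SELECT
    o.id,
    o.created,
    o.amount,
    c.name AS customer
FROM orders o
JOIN customers c ON c.id = o.customer_id;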

Finally, I write audit queries to check that the assumptions I relied on to create the summary table hold for each of the files used to make it up.
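For instance, the hypothetical summary above assumes every order row has a matching customer. An audit query for that assumption should return no rows:

SELECT o.id
FROM orders o
LEFT JOIN customers c ON c.id = o.customer_id
WHERE c.id IS NULL;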

I caught up with David Raznick at the weekend and he explained that he follows a very similar process when working with data commercially. We might collaborate on building a tool that puts a web-based interface on this particular workflow, to make things easier for less technical wranglers.
