13 Apr, 2015
I've noticed a common workflow I use when I'm faced with a new set of CSV files that I want to get the measure of for data wrangling purposes. It looks roughly like this:

1. Import each CSV file into SQLite as plain text so I can poke around.
2. Work out what type each column really holds.
3. Re-import with the types specified so that maximums and minimums come out right.
4. Build a big summary table with the data in the shape I want for exploring and reporting.
5. Write audit queries to check that the assumptions behind the summary table hold for every file that feeds it.
Here's a sample SQL file, import.sql:
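The real script varies with the data; this is a minimal sketch, assuming two hypothetical files called people.csv and orders.csv:

.mode csv
-- First pass: pull each file into its own table; with no table declared
-- beforehand, .import creates one from the header row and every column
-- comes in as text
.import people.csv people_raw
.import orders.csv orders_raw
-- Poke around to see what the columns actually contain
.schema people_raw
SELECT * FROM people_raw LIMIT 5;
SELECT * FROM orders_raw LIMIT 5;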
The trick here is the use of the .mode csv command to allow SQLite to parse CSV format.
You can run the SQL like this:
cat import.sql | sqlite3 new.db
I then repeat the above process, but since I now know the types, I specify them on the import so that I get proper values for maximums and minimums (with everything imported as text, a column of numbers sorts lexicographically and '9' ends up greater than '10').
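Continuing the sketch above, with the same made-up file and column names, the second pass declares the types before importing:

-- Second pass: declare the types up front, then import into the existing
-- table (orders.csv is handled the same way)
CREATE TABLE people (
    name   TEXT,
    age    INTEGER,
    joined TEXT
);
.mode csv
.import people.csv people
-- When the table already exists, .import loads the header row as data,
-- so delete it
DELETE FROM people WHERE name = 'name';
-- Thanks to the INTEGER affinity these are now numeric, not lexicographic
SELECT MIN(age), MAX(age) FROM people;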
Next, I create a big summary table that has the data in the shape I want, and I use this table for the queries that explore the data and generate reports.
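What that table looks like depends entirely on the data; with the hypothetical files above (and assuming the orders table ended up with name, order_id and amount columns) it might be something like:

-- Hypothetical summary: one row per person, with order totals folded in
CREATE TABLE summary AS
SELECT
    p.name,
    p.age,
    COUNT(o.order_id) AS num_orders,
    SUM(o.amount)     AS total_spent
FROM people p
LEFT JOIN orders o ON o.name = p.name
GROUP BY p.name, p.age;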
Finally, I write audit queries to check that the assumptions I relied on to create the summary table actually hold for each of the files it is built from.
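For instance, the summary sketched above quietly assumes that name is unique in people and that every order matches a person; hypothetical audit queries along these lines make that explicit:

-- Audit: name should uniquely identify a row in people
SELECT name, COUNT(*) AS n FROM people GROUP BY name HAVING n > 1;

-- Audit: every order should join to exactly one person
SELECT o.order_id FROM orders o
LEFT JOIN people p ON p.name = o.name
WHERE p.name IS NULL;

-- Audit: nothing should go missing between orders and the summary
SELECT (SELECT SUM(amount) FROM orders) = (SELECT SUM(total_spent) FROM summary);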
I caught up with David Raznick at the weekend and he explained that he follows a very similar process when working with data commercially. We might collaborate on a tool that gives this workflow a web-based interface, to make things easier for less technical wranglers.