My name is Philippe Buteau and I am an analyst at my own small co-owned firm, Buteau and Lindley. Back in May, Susan Kistler briefly wrote about Google Refine on aea365 and prompted me to take a look. Since then, I have used Refine in a number of ways and thought that I would submit a more extended post elaborating on this tool for data cleaning.
Rad Resource – Google Refine: First of all, what is it? Google Refine is “a [free] power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases.” To be more explicit, it allows you to import a data set and then to clean that data set in multiple ways. If you are a programmer, Google Refine allows you to do lots more, but I am limiting my focus here to the more generally applicable function of data cleaning.
Lessons Learned – Cleaning Data: Here are three examples of ways in which I used Refine for cleaning data and a comparison to doing the same in Microsoft Excel:
- Removing erroneous rows: I imported a financial data set that included multiple subtotal rows. All I wanted was the rows that had specific categories and transactions, so that I could work with these. The subtotal rows created problems when sorting or filtering. In Refine I chose “Text Filter” from the column heading and then identified all of the rows with “Sub” in them, then deleted these rows all at once. Verdict: This as similar to what could be done in Excel, but was easily accomplished in Refine as well.
- Combining multiple similar responses within a field: Once your data is imported, select Facet – Text Facet from the pull down list for a particular column. A column representing all of the responses and how many times that response appears is generated. You then just select each one that you want to merge and give it a common name. Thus, I could combine “New York” “NY” “NY “ and “NNY” so that they were all “NY”. Alternatively, there is a smart clustering feature that tries to do this for you – guessing at what responses are similar and should be combined. You can then review its guesses and fix as needed before the clustering is actually done. Verdict: Both the hand-combining and clustering were accomplished much more easily than would be possible in Excel and the clustering tool’s guesses were surprisingly accurate and a huge time saver.
- Finding Outliers: From the column pull down list of a numeric field, select Facet – Numeric Facet. This will immediately show you a small histogram with the distribution of all of the values in that column as well as the range of values in the column. Each side of the histogram has a handle that slides back and forth. Sliding the handle to display only the most extreme values to the left or right side of the histogram filters all of the rows in the dataset so you are looking only at the ones within the constricted range of outliers. Verdict: Much faster and more intuitive than options for doing the same in Excel and the combination of viewing graphically and the fields themselves provided a richer understanding.
Lessons Learned – Undo: The history feature was a godsend. It allows you to undo mistakes and step backwards through your cleaning. I also found that it gave me the confidence to try out some things, knowing that I could undo them immediately.
Lesson Learned – Download to Desktop: Google Refine can be downloaded to your desktop so you don’t have to upload your data and you retain full control and ownership of it.
Do you have questions, concerns, kudos, or content to extend this aea365 contribution? Please add them in the comments section for this post on the aea365 webpage so that we may enrich our community of practice. aea365 is sponsored by the American Evaluation Association and provides a Tip-a-Day by and for evaluators.