AEA365 | A Tip-a-Day by and for Evaluators


My name is Samantha Theriault, and I am the Research Assistant at Randi Korn & Associates (RK&A), a research, evaluation and intentional planning company that specializes in museums and informal learning.

Lesson Learned: As a research assistant, I spend much of my time entering and processing data. Data entry and clean-up are time-consuming and challenging, and the work makes me feel like a worm turning dirt among the roots of our work. From this vantage point I watch projects grow from the ground up, so it’s even more exciting to see the final product. Preparing quantitative and qualitative data for analysis isn’t glamorous, but it is invigorating and contributes to the success of all our work. Evaluation runs on data – so keeping it organized from the beginning is vital.

Hot Tip: Set aside chunks of time to spend with data. Depending on the size of your project and your methods, data entry can take a few hours, several days, or even weeks! Prioritizing data entry – rather than “squeezing it in” between tasks – increases my comfort with the data set and puts me in a rewarding flow state. When I enter data mindfully, I notice patterns as they emerge. For example, I recently struggled to interpret a participant’s shorthand on a question about which neighborhood they live in, but noticed that others spelled out their responses in full. I matched the shorthand to the full neighborhood name (and double-checked with Google!), which I would have missed had I been entering data mindlessly or too quickly. Similarly, I might notice other trends, such as fewer responses to a certain question, which I flag for examination later.

Cool Trick: Create a “living” data entry and processing handbook. When I was first learning to process data in SPSS, I created a “cleaning up data files” checklist, and I add new tasks and tips to it each time I work on a new data set. My checklist includes recoding system-missing responses, ensuring that survey responses follow skip logic, and reminders such as “slow down and double-check your work!” Since my colleagues depend on these data sets to do their work, I include their needs in my checklist, too: spell out acronyms, label variables, and delete “working” variable labels I created while collapsing categories into single columns. I also create a log for each digital survey’s lifecycle, documenting any changes to the museum’s exhibition or program during data collection and any quirks to remember when it’s time to process the data for analysis. This detailed record-keeping is especially useful when my colleagues have specific questions about the data during analysis.
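If you script any of your cleaning in Python alongside SPSS, a few items from a checklist like this might look like the minimal pandas sketch below. The file name, column names, and missing-value codes are invented for illustration; swap in whatever your own codebook specifies.

    import pandas as pd

    # Hypothetical survey export; adapt codes and labels to your own codebook
    df = pd.read_csv("visitor_survey.csv")

    # Recode system-missing codes (e.g., -99 and 999) to true missing values
    df = df.replace({-99: pd.NA, 999: pd.NA})

    # Enforce skip logic: non-members should have no membership-length answer
    df.loc[df["is_member"] == 0, "membership_years"] = pd.NA

    # Spell out acronyms so colleagues can read the variable names
    df = df.rename(columns={"q7_nbhd": "home_neighborhood"})

    # Delete "working" columns created while collapsing categories
    df = df.drop(columns=[c for c in df.columns if c.endswith("_working")])

    df.to_csv("visitor_survey_clean.csv", index=False)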

Rad Resource: Microsoft OneNote is a password-protected digital notebook that I use to keep track of my data cleaning process (and many other elements of managing data collection and processing). I like that I can repurpose checklists and save relevant files on the same page – it’s like a 4D Moleskine to me!


The American Evaluation Association is celebrating Labor Day Week in Evaluation: Honoring the WORK of evaluation. The contributions this week are tributes to the behind-the-scenes and often underappreciated work evaluators do. Do you have questions, concerns, kudos, or content to extend this aea365 contribution? Please add them in the comments section for this post on the aea365 webpage so that we may enrich our community of practice. Would you like to submit an aea365 Tip? Please send a note of interest to aea365@eval.org. aea365 is sponsored by the American Evaluation Association and provides a Tip-a-Day by and for evaluators.

My name is Lisa R. Holliday, and I am a Data Architect with The Evaluation Group. Cleaning data is a large part of my job. For studies that need to match participants’ names or IDs, it is critical that information be entered consistently from one data collection cycle to the next. Depending on the data collection tools available, however, that is not always possible, and cleaning and matching the data can be time-consuming.

I’ve found two tools that have reduced the amount of time I spend working with these data and increased my matching accuracy.

Rad Resource 1: FuzzyMatch Macro for Excel

FuzzyMatch is a free Excel macro available from the MrExcel message board. It creates three functions:

  1. FuzzyVlookup
  2. FuzzyHLookup
  3. FuzzyPercent

FuzzyVlookup and FuzzyHLookup look up values against a master list and suggest possible matches. FuzzyPercent lets you assess the similarity between each proposed match and the master-list value. In my testing, FuzzyMatch correctly matched values that differed by as many as nineteen characters in length.

One drawback to this macro is that it can be time-consuming to run. During testing, it took Excel approximately one minute to process 942 rows using FuzzyVlookup and FuzzyPercent concurrently. Additionally, the functions are volatile (they sometimes recalculate when you save the workbook); to avoid this, copy the results and paste them back as static values after running the functions.
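If macro run time becomes a problem, or you just want to sanity-check matches outside Excel, Python’s standard library offers a rough analogue of the same idea. The sketch below is not the FuzzyMatch macro itself – it is a simple difflib lookup against a master list, with a similarity ratio standing in for FuzzyPercent; the names and the cutoff value are invented for illustration.

    import difflib

    master = ["Anderson Community Center", "Greenfield Museum", "Lakeside Library"]
    entered = ["Anderson Cmty Center", "Greenfeild Museum", "Lakeside Libary"]

    # Suggest the closest master-list value for each entered value
    for value in entered:
        matches = difflib.get_close_matches(value, master, n=1, cutoff=0.6)
        if matches:
            best = matches[0]
            similarity = difflib.SequenceMatcher(None, value, best).ratio()
            print(f"{value!r} -> {best!r} (similarity {similarity:.0%})")
        else:
            print(f"{value!r} -> no match; review by hand")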

This video demonstrates how to use FuzzyMatch.

Rad Resource 2: iugum Data Matching Software

This software performs much like the Excel macro, but it is faster and more user-friendly. There is a 14-day free trial, and subscriptions run from $99 to $400. During testing, it matched over 8,000 rows of data in less than one minute. When it cannot make an exact match, iugum suggests possible matches, and it is easy to accept or reject them. If you regularly clean data against a master list, this is a valuable tool.

This video provides an overview of iugum Data Matching Software.

We’re celebrating 2-for-1 Week here at aea365. With tremendous interest in the blog lately, we’ve had many authors eager to share their evaluation wisdom, so for one special week, readers will be treated to two blog posts per day!

I’m Jennifer Ann Morrow, a faculty member in Evaluation, Statistics, and Measurement at the University of Tennessee. I created a 12-step process that evaluators can follow to ensure their data are clean before conducting analyses.

Hot Tip: Evaluators should follow these 12 steps before conducting analyses for evaluation reports (a brief pandas sketch of several of the steps follows the list):

1. Create a data codebook

a. Datafile names, variable names and labels, value labels, citations for instrument sources, and a project diary

2. Create a data analysis plan

a. General instructions, list of datasets, evaluation questions, variables used, and specific analyses and visuals for each evaluation question

3. Perform initial frequencies – Round 1

a. Conduct frequency analyses on every variable

4. Check for coding mistakes

a. Use the frequencies from Step 3 to compare all values with what is in your codebook. Double check to make sure you have specified missing values

5. Modify and create variables

a. Reverse code any variables that need it (e.g., recode a 1–5 item so that 1 becomes 5 and 5 becomes 1), recode any variable values to match your codebook, and create any new variables (e.g., a total score) that you will use in future analyses

6. Frequencies and descriptives – Round 2

a. Rerun frequencies on every variable and conduct descriptives (e.g., mean, standard deviation, skewness, kurtosis) on every continuous variable

7. Search for outliers

a. Define what counts as an outlying score and then decide whether to delete, transform, or modify the outliers

8. Assess for normality

a. Check that your values for skewness and kurtosis are not too high, then decide whether to transform the variable, use a non-parametric equivalent, or modify your alpha level for the analysis

9. Deal with missing data

a. Check for patterns of missing data and then decide if you are going to delete cases/variables or estimate missing data

10. Examine cell sample size

a. Check for equal sample sizes in your grouping variables

11. Frequencies and descriptives – The finale

a. Run your final versions of frequencies and descriptives

12. Assumption testing

a. Conduct the appropriate assumption analyses based on the specific inferential statistics that you will be conducting.
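For evaluators who clean data in a scripting environment, several of these steps map onto just a few lines of code. Here is a minimal, hypothetical pandas sketch of Steps 3–7 and 9; the file name, column names, 1–5 scale, and cutoffs are illustrative assumptions rather than part of the 12-step process itself.

    import pandas as pd

    df = pd.read_csv("program_survey.csv")

    # Steps 3-4: initial frequencies on every variable, compared to the codebook
    for col in ["satisfaction", "gender", "site"]:
        print(df[col].value_counts(dropna=False))

    # Step 5: reverse code a 1-5 item and create a total score
    df["q4_rev"] = 6 - df["q4"]
    df["total_score"] = df[["q1", "q2", "q3", "q4_rev"]].sum(axis=1)

    # Step 6: descriptives for continuous variables
    print(df["total_score"].describe())
    print(df["total_score"].skew(), df["total_score"].kurt())

    # Step 7: flag outliers (here, |z| > 3 as one common definition)
    z = (df["total_score"] - df["total_score"].mean()) / df["total_score"].std()
    outliers = df[z.abs() > 3]

    # Step 9: inspect missing-data patterns before deciding how to handle them
    print(df.isna().sum())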

Lesson Learned: One statistics course is not enough. Utilize all the great resources that AEA offers to gain additional training in data analysis.

Rad Resources:

Want to learn more from Jennifer? Register for her upcoming AEA eStudy, The twelve steps of data cleaning: Strategies for dealing with dirty data, and her workshop, Twelve Steps of Data Cleaning: Strategies for Dealing with Dirty Evaluation Data, at Evaluation 2013 in Washington, DC.

This week, we’re featuring posts by people who will be presenting Professional Development workshops at Evaluation 2013 in Washington, DC. Click here for a complete listing of Professional Development workshops offered at Evaluation 2013.

Hello, my name is Juan Paulo Ramírez; I am an independent consultant and sole owner of “GIS and Human Dimensions, L.L.C.” How many times have you used spreadsheets or sophisticated statistical software (e.g., SAS, SPSS) to estimate frequencies for a population and asked yourself: is it really necessary to use such expensive, sophisticated software for this? Or spent hours and hours cleaning up data to make it consistent within and between records and variables? Is there a better, more efficient way to complete these tedious, time-consuming tasks? There is, and Google Refine is the answer!

Lessons Learned: Google Refine is a free desktop application (not a web service) that you install on your computer (you can download it here). It lets you efficiently calculate frequencies and cross-tabulate data from large datasets (hundreds of thousands of records), as well as clean up your data. I found that you learn Google Refine quickly by trial and error and soon discover how easy it is to get the information you need in a few steps. Google Refine has saved me days of hard work! It works with numeric, date/time, and text data and can work directly with Excel files.

The following are a few examples of how I have used Google Refine:

  1. Getting demographic frequencies (e.g., gender, age) and cross-tabulating them with economic variables (e.g., income) and location (e.g., county).
  2. Cleaning up inconsistent data, since people sometimes answer questions without any written restrictions (e.g., lengthy responses, spelling errors, blank spaces).
  3. Exploring date variables: when you select a date variable, Google Refine creates a bar chart with two ends that you can drag with your mouse to define specific time periods.
  4. Undoing mistakes: if you make a mistake, Google Refine allows you to undo everything you have done!
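If you ever need to reproduce the first two examples in a script (for instance, to document the steps for a colleague), they have close pandas equivalents. This is only an analogue with invented file and column names, not part of Google Refine itself.

    import pandas as pd

    df = pd.read_excel("community_survey.xlsx")

    # Example 1: demographic frequencies and a cross-tabulation
    print(df["gender"].value_counts())
    print(pd.crosstab(df["county"], df["income_bracket"]))

    # Example 2: clean up inconsistent free-text answers (stray spaces, case, blanks)
    df["county"] = df["county"].str.strip().str.title().replace("", pd.NA)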

Rad Resource: Three videos showing potential applications of Google Refine are available; you can watch them here. Watching the first video once was enough to convince me that this was a must-have application. I started using it right away, and it has become one of the most essential tools I use in my work.


My name is Philippe Buteau and I am an analyst at Buteau and Lindley, a small firm I co-own. Back in May, Susan Kistler briefly wrote about Google Refine on aea365 and prompted me to take a look. Since then, I have used Refine in a number of ways and thought I would submit a more extended post elaborating on this tool for data cleaning.

Rad Resource – Google Refine: First of all, what is it? Google Refine is “a [free] power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases.” More concretely, it allows you to import a data set and then clean it in multiple ways. If you are a programmer, Google Refine lets you do much more, but I am limiting my focus here to the more generally applicable task of data cleaning.

Lessons Learned – Cleaning Data: Here are three examples of ways I used Refine for cleaning data, each with a comparison to doing the same in Microsoft Excel; a rough pandas analogue of all three operations appears after the list:

  • Removing erroneous rows: I imported a financial data set that included multiple subtotal rows. All I wanted were the rows containing specific categories and transactions so that I could work with those; the subtotal rows created problems when sorting or filtering. In Refine I chose “Text Filter” from the column heading, identified all of the rows with “Sub” in them, and deleted them all at once. Verdict: This is similar to what could be done in Excel, but it was easily accomplished in Refine as well.
  • Combining multiple similar responses within a field: Once your data are imported, select Facet – Text Facet from the pull-down list for a particular column. Refine generates a list of all the responses in that column along with how many times each appears. You then select the responses you want to merge and give them a common name. Thus, I could combine “New York”, “NY”, “NY ”, and “NNY” so that they were all “NY”. Alternatively, a smart clustering feature tries to do this for you, guessing at which responses are similar and should be combined; you can review its guesses and fix them as needed before the clustering is actually applied. Verdict: Both the hand-combining and the clustering were accomplished much more easily than would be possible in Excel, and the clustering tool’s guesses were surprisingly accurate and a huge time saver.
  • Finding outliers: From the pull-down list of a numeric column, select Facet – Numeric Facet. This immediately shows a small histogram of the distribution of values in the column, along with the range of values. Each side of the histogram has a handle that slides back and forth; slide a handle toward either end and the dataset is filtered so that you see only the rows within that restricted range – the outliers. Verdict: Much faster and more intuitive than the options for doing the same in Excel, and the combination of the graphical view and the underlying values provided a richer understanding.
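For readers who work in code rather than a point-and-click tool, here is a rough pandas analogue of the three operations above. The file and column names are invented, and the hand-built recoding map is a crude stand-in for Refine’s clustering, which is considerably smarter.

    import pandas as pd

    df = pd.read_csv("transactions.csv")

    # Removing erroneous rows: drop anything whose category contains "Sub"
    df = df[~df["category"].str.contains("Sub", na=False)]

    # Combining similar responses: trim whitespace, then map variants to one label
    df["state"] = df["state"].str.strip().replace({"New York": "NY", "NNY": "NY"})

    # Finding outliers: isolate values outside a chosen numeric range
    low, high = df["amount"].quantile([0.01, 0.99])
    extremes = df[(df["amount"] < low) | (df["amount"] > high)]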

Lessons Learned – Undo: The history feature was a godsend. It allows you to undo mistakes and step backwards through your cleaning. I also found that it gave me the confidence to try out some things, knowing that I could undo them immediately.

Lesson Learned – Download to Desktop: Google Refine can be downloaded to your desktop so you don’t have to upload your data and you retain full control and ownership of it.
