Our names are Lindsey Dunn and Lauren Fluegge and we are PhD. students at The University of North Carolina at Greensboro. Recently, we have been involved with a project in which the data come from many different sources and often include hundreds of thousands of cases. We used primarily three programs to organize and merge the data (Microsoft Excel 2007, Microsoft Access 2007, and PASW 18), and include here pros and cons of each program and information about the advantages of the statistical program R.
Excel 2007:
Pros
- Sorts/filters data efficiently
- Summarizes data with Microsoft PivotTable, which automatically sorts, counts, totals or averages the data
Cons
- Analyses excluded from the formula repertoire and results formatting must be entered by hand
- Has “behind the scenes” syntax out of the user’s control which makes it difficult to reproduce procedures on different data
Hot Tip: The text to column function of Excel is useful to reformat identifiers, such as names, to match between two data files you want to merge
Access 2007:
Pros
- Relational nature of databases allows for flexible linking of data files (e.g. the user can match multiple data files by variables with different names; a smaller subset of variables of interest can be selected from the original data files to be in a newly merged data file)
- Access cannot analyze data
Cons
- Merging can be slow or freeze the program if done with large data files
- Syntax is out of the user’s control
Hot tip: Choose “Export” to move any data files to Excel for formatting and simple calculations
PASW 18 (SPSS):
Pros
- Can accomplish a plethora of analyses, decent graphics, and easy to read output
- Efficiently identifies duplicates, sorts, and filters data
- Can be controlled by user-written syntax
Cons
- Inflexible merging procedures
- Error messages are not always clear
- Expensive
Hot Tip: With SPSS, can now open files without a .sav with the “Open” command and retain the field names (rather than using the “New Query” command)
Hottest Tip: Use R statistical program
- R is more efficient than other programs because it allows you to do everything in one place!
- In R, you have control over what and how you merge (e.g. can merge multiple data files at once by several variables)
- You can manage and analyze data as effectively as the other programs in the same program
- R is free and once you have written a merging protocol it can be reused with other data files!
Rad Resources:
Useful Rad resources are listed below:
http://stat.ethz.ch/R-manual/R-patched/library/base/html/merge.html
http://stat.ethz.ch/R-manual/R-devel/library/base/html/duplicated.html
Do you have questions, concerns, kudos, or content to extend this aea365 contribution? Please add them in the comments section for this post on the aea365 webpage so that we may enrich our community of practice. Would you like to submit an aea365 Tip? Please send a note of interest to aea365@eval.org. aea365 is sponsored by the American Evaluation Association and provides a Tip-a-Day by and for evaluators.
Excel is certainly appropriate for very simple analysis like means, standard deviations, frequencies, etc, and the pivot tables can make simple frequencies/percentages a breeze. However…I distrust any more substantial analysis done in Excel like regressions) because it simply doesn’t allow you to appropriately check distributions, model fit, etc. I saw too many models that ended up being wrong because the consultants couldn’t check basic statistical assumptions in Excel – so definitely don’t use it for that!
I like R because it’s free and powerful, but find the syntax to be a bit difficult. SPSS is too limited, the new platform causes things to execute slowly, and the inability to easily export output drives me crazy. In my opinion, SAS is a much better option – easier syntax, good graphics, easy outputting, more statistical options (supports good multi-level modeling).