Hello! I’m Judy Savageau from the Center for Health Policy and Research at UMass Medical School. A recent post from Pei-Pei Lei, my colleague in our Office of Survey Research, introduced some options for statistical programming in R. I wondered whether a basic introduction to statistics might be in order for those contemplating ‘where do I begin’, ‘what statistics do I need to compute’, and ‘how do I choose the appropriate statistical test’. While most AEA365 blogs don’t cover every topic in detail, perhaps a basic 2-part introduction will help here. Analyses are very different with qualitative versus quantitative data; thus, I’ve concentrated on the quantitative side of statistical computations.
Analyses fall into 3 general categories: descriptive, bivariate, and multivariate; they’re typically computed in that order as we:
- explore our data (descriptive analyses) with frequencies, percentile distributions, and measures of ‘central tendency’ such as means and medians;
- begin to look at associations (bivariate analyses) between an independent variable (e.g., age, gender, level of education) and an outcome variable (e.g., knowledge, attitudes, skills); and
- try to identify a set of factors that might be most ‘predictive’ of the outcome of interest (multivariate analyses).
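As a rough illustration of the first, descriptive step, here is a minimal Python sketch using only the standard library. The `ages` and `education` values are made up for the example and do not come from any real survey; the variable names are likewise just placeholders.

```python
from collections import Counter
import statistics

# Hypothetical survey responses (invented purely for illustration).
ages = [23, 25, 31, 34, 34, 38, 41, 44, 62, 78]
education = ["HS", "College", "College", "HS", "Post-grad",
             "College", "HS", "College", "Post-grad", "College"]

# Frequencies for a categorical variable.
education_counts = Counter(education)

# Central tendency and spread for a continuous variable.
mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
sd_age = statistics.stdev(ages)  # sample standard deviation

# Quartiles: the 25th, 50th, and 75th percentiles.
quartiles = statistics.quantiles(ages, n=4)

print("Education frequencies:", dict(education_counts))
print("Mean age:", mean_age)
print("Median age:", median_age)
print("Standard deviation:", round(sd_age, 1))
print("Quartiles:", quartiles)
```

In this made-up sample the mean age is higher than the median, which is the kind of early hint that a few older respondents may be pulling the distribution to the right.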
The decision about what statistical test to use to describe data and its various relationships depends on the ‘nature’ of the data. Is it:
- Categorical data:
  - nominal: e.g., gender, race, ethnicity, smoking status, participation in a program (yes/no);
  - ordinal: e.g., a Likert-type scale score of 1=Strongly disagree to 5=Strongly agree, or 5 levels of education: ‘Less than high school’, ‘High school graduate/GED’, ‘Some college/Associate degree’, ‘College graduate – 4-year program’, and ‘Post-graduate (Masters or PhD degree)’;
  - interval: ordinal data in fixed/equal-sized categories; e.g., age groups in 10-year intervals or salary in $25,000 intervals; or is it:
- Continuous data:
  - for example: age, years of education, or days of school missed due to asthma exacerbations.
Of course, data are often collected in one mode and then ‘collapsed’ for particular analyses (e.g., age recoded into meaningful age groups, Likert-type scales recoded as ‘agree’/‘neutral’/‘disagree’).
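The kind of ‘collapsing’ described above might look something like the following Python sketch. The 10-year cut points, the function names `age_group` and `collapse_likert`, and the sample values are all assumptions chosen for illustration; the groupings that are meaningful in a real evaluation depend on the question being asked.

```python
def age_group(age):
    """Recode a continuous age into a 10-year group label (assumed cut points)."""
    if age < 0:
        raise ValueError("age must be non-negative")
    lower = (age // 10) * 10
    return f"{lower}-{lower + 9}"


# Assumed mapping from a 5-point Likert-type score to three categories.
LIKERT_COLLAPSE = {
    1: "disagree",
    2: "disagree",
    3: "neutral",
    4: "agree",
    5: "agree",
}


def collapse_likert(score):
    """Recode a 1-5 Likert-type score as 'agree', 'neutral', or 'disagree'."""
    return LIKERT_COLLAPSE[score]


# Hypothetical responses (invented for illustration).
ages = [23, 37, 45, 68]
likert_scores = [1, 3, 4, 5, 2]

age_groups = [age_group(a) for a in ages]
collapsed = [collapse_likert(s) for s in likert_scores]

print(age_groups)
print(collapsed)
```

Keeping the original values alongside the recoded ones is usually a good idea, since collapsing discards detail that cannot be recovered later.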
Decisions must also take into consideration whether the data are ‘normally distributed’ or whether there is ‘skewness’ in the data (e.g., most values for age fall under 45, but a small number of people are in their 60s, 70s, and 80s). Many commonly used (parametric) statistical tests have a number of underlying assumptions that must be met, often starting with the data being approximately normally distributed. Thus, one typically begins by looking descriptively at the data: frequencies and percentile distributions, means, medians, and standard deviations. Sometimes, graphing the data reveals the ‘devil in the detail’ with regard to how data are distributed. There are statistics one can compute to measure the degree of skewness in the data and to test whether a distribution is significantly different from ‘normal’. And, if the data are not normally distributed, there are several non-parametric statistics that can be computed to take this into account.
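One simple way to put a number on skewness is the moment-based coefficient (the third central moment divided by the second central moment raised to the power 3/2). The sketch below is a minimal standard-library Python version; the function name `skewness` and both data sets are made up for illustration, and statistical packages often apply a small-sample correction that this version omits.

```python
import statistics


def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (no small-sample correction)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5


# Hypothetical ages: mostly under 45, with a few people in their 60s-80s.
skewed_ages = [22, 25, 27, 30, 31, 33, 36, 40, 44, 65, 72, 81]

# A perfectly symmetric set of values for comparison.
symmetric_values = [1, 2, 3, 4, 5]

print("Skewness (ages):", round(skewness(skewed_ages), 2))
print("Skewness (symmetric):", skewness(symmetric_values))
print("Mean vs. median age:",
      round(statistics.fmean(skewed_ages), 1),
      statistics.median(skewed_ages))
```

A value near zero suggests a roughly symmetric distribution, while a positive value indicates a longer right tail, as with the older respondents here. A formal test of normality (for example, Shapiro-Wilk, available as `scipy.stats.shapiro` in the SciPy library) goes a step further than this descriptive measure.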
Tomorrow’s post will focus on bivariate and multivariate statistics. Stay tuned!
aea365 is sponsored by the American Evaluation Association and provides a Tip-a-Day by and for evaluators.