AEA365 | A Tip-a-Day by and for Evaluators


Hello! I’m William Faulkner, Director of Flux, an M&E consultancy based in New Orleans. I want to pull back the curtain on perhaps the most famous experiment in international development history – the one conducted by Washington DC’s IFPRI on Mexico’s largest anti-poverty program, PROGRESA (now Prospera).

The Down-Low:

Basically, the mainstream narrative of this evaluation ignores three things:

  • Randomization: This evaluation did not randomly assign households to treatment and control status; it only leveraged randomization. Under the “clustered matched-pairs design,” participating communities, not households, were first assigned, non-randomly, to treatment and control status.
  • Attrition: Selective sample attrition was strongly present and went unaccounted for in the analyses; a toy simulation of why this matters follows this list.
  • Contamination: Treatment communities were doubtless ‘contaminated’ by migrants from control communities. The project was even ended early because of pressure from local authorities in control communities.
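
To make the attrition point concrete, here is a minimal, hypothetical simulation (the numbers are invented and have nothing to do with the actual PROGRESA data): when the households that drop out of the follow-up survey are systematically different from those that remain, a simple treatment-control comparison no longer recovers the true effect.

```python
# Hypothetical illustration (invented numbers, not PROGRESA data):
# selective attrition from the control group biases a simple
# difference-in-means impact estimate.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true_effect = 0.0                       # suppose the program truly does nothing

baseline = rng.normal(50, 10, size=n)   # e.g., a household welfare index
treated = rng.random(n) < 0.5
outcome = baseline + true_effect * treated + rng.normal(0, 5, size=n)

# Selective attrition: the worst-off control households migrate or are lost
# to follow-up, so they are missing from the endline survey.
dropped = (~treated) & (baseline < np.quantile(baseline, 0.25))
observed = ~dropped

naive = outcome[treated & observed].mean() - outcome[~treated & observed].mean()
full = outcome[treated].mean() - outcome[~treated].mean()

print(f"estimate with selective attrition: {naive:.2f}   (program looks harmful)")
print(f"estimate with everyone observed:   {full:.2f}   (close to the true effect of 0)")
```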

The project that “proved” that experiments could be rigorously applied in developing country contexts was neither experimental nor rigorous. In other words, a blind eye was turned to the project’s pretty severe internal weaknesses. Why? Because there was an enormous and delicate opportunity to de-politicize the image of social programming at the national level, put wind in the sails of conditional cash transfers, bolster the credibility of evidence-based policy worldwide, and sustain the direct flow of cash to poor Mexicans.

So What? (Lessons Learned):

What does this case illuminate about experiments?

Let’s leave the shouting behind and get down to brass tacks. The “experiment-as-gold-standard” agenda still commands a significant swath of the networks that commission and undertake evaluations. The case for methodological pluralism, however, is neither new nor in need of immediate defense. Instead, M&E professionals should systematically target and correct overzealous representations of experiments rather than getting bogged down in theoretical discussions about what experiments can and cannot do.

Still, in 2016, we have breathless, infomercial-like articles on experiments coming out in the New York Times. This has to stop. At the same time, we absolutely must respect the admirable achievement of the randomistas: there is now space for this fascinatingly influential methodology in M&E where previously none existed.

The individuality of each evaluation project makes talking about ‘experiments’ as a whole difficult. These things are neither packaged nor predictable. As with IFPRI-Progresa, micro-decisions matter. Context matters. History matters. This case is an ideal centerpiece with which to induce a grounded, fruitful discussion of the rewards and risks of experimental evaluation.

Rad Resources:

The American Evaluation Association is celebrating the Design & Analysis of Experiments TIG Week. The contributions all week come from Experiments TIG members. Do you have questions, concerns, kudos, or content to extend this aea365 contribution? Please add them in the comments section for this post on the aea365 webpage so that we may enrich our community of practice. Would you like to submit an aea365 Tip? Please send a note of interest to aea365@eval.org. aea365 is sponsored by the American Evaluation Association and provides a Tip-a-Day by and for evaluators.

· · ·

I’m Laura Peck, recovering professor and now full-time evaluator with Abt Associates.  For many years I taught graduate Research Methods and Program Evaluation courses. One part I enjoyed most was introducing students to the concepts of causality, internal validity and the counterfactual – summarized here as hot tips.

Hot Tips:

#1:  What is causality?

Correlation is not causation.  For an intervention to cause a change in outcomes, the two must be associated and the intervention must temporally precede the change in outcomes.  These two criteria are necessary but not sufficient.  The third criterion, which completes the case for causality, is that no other plausible, rival explanation can take credit for the change in outcomes.

#2: What is internal validity?  And why is it threatened?

In evaluation parlance, these “plausible rival explanations” are known as “threats to internal validity.”  Internal validity refers to an evaluation design’s ability to establish the causal connection between intervention and impact.  As such, the threats to internal validity are those factors in the world that might explain a change in outcomes independently of the program you credit with achieving it.  For example, children mature and learn simply by exposure to the world, so how much of an improvement in their reading is due to your tutoring program as opposed to their other experiences and maturation processes?  Another example is job training that assists unemployed people:  one cannot be any less employed than being unemployed, and so “regression to the mean” implies that some people will improve (get jobs) regardless of the training.  These two “plausible rival explanations” are known as the “threats to validity” of maturation and regression artifact.  Along with selection bias and historical explanations (recession, election, national mood swings), these can claim credit for changes in outcomes observed in the world, regardless of what interventions try to do to improve conditions.
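
This tip lends itself to a quick demonstration. Below is a small, hypothetical simulation (all numbers invented) of the regression artifact: people enrolled precisely because they are unemployed at baseline show substantial “improvement” at follow-up even though no training was delivered at all.

```python
# Hypothetical sketch of the "regression artifact" threat: no program exists,
# yet people selected for being unemployed at baseline still improve.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Employability = a stable personal component plus transient luck at each wave.
stable = rng.normal(0, 1, size=n)
baseline = stable + rng.normal(0, 1, size=n)
followup = stable + rng.normal(0, 1, size=n)   # note: no intervention anywhere

employed_baseline = baseline > 0
employed_followup = followup > 0

# A training program would enroll only those unemployed at baseline.
enrolled = ~employed_baseline

print("enrollee employment rate at baseline:  0.00 (by construction)")
print(f"enrollee employment rate at follow-up: {employed_followup[enrolled].mean():.2f}")
# Roughly a third of enrollees are employed at follow-up with zero program
# effect, purely because transient bad luck at baseline tends not to repeat.
```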

#3: Why I stopped worrying and learned to love the counterfactual.

I want interventions to be able to take credit for improving outcomes, when in fact they do.  That is why I like randomization.  Randomizing individuals or classes or schools or cities to gain access to an intervention—and randomizing some not to gain access—provides a reliable “counterfactual.”  In evaluation parlance, the “counterfactual” is what would have happened in the absence of the intervention.  Having a group that is randomized out (e.g., to experience business as usual) means that it experiences the same historical, selection, regression-to-the-mean, and maturation forces as those who are randomized in.  As such, the difference between the two groups’ outcomes represents the program’s impact.
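
For readers who like to see the arithmetic, here is a minimal, hypothetical sketch (outcome scale, effect size, and noise are all invented) of the logic above: because random assignment exposes both groups to the same maturation, history, and regression-to-the-mean forces, the simple difference in mean outcomes recovers the true impact.

```python
# Minimal sketch of a randomized evaluation (all values hypothetical).
import numpy as np

rng = np.random.default_rng(7)
n = 50_000
true_impact = 2.0                        # e.g., points on a reading assessment

# Forces that affect everyone regardless of the program:
baseline = rng.normal(60, 10, size=n)    # starting reading score
maturation = rng.normal(5, 2, size=n)    # kids learn from everyday exposure

# Random assignment: a coin flip decides access to the tutoring program.
treated = rng.random(n) < 0.5

followup = baseline + maturation + true_impact * treated + rng.normal(0, 3, size=n)

# Both groups experience the same maturation/history/regression forces, so the
# treatment-control difference in means isolates the program's impact.
impact_estimate = followup[treated].mean() - followup[~treated].mean()
print(f"estimated impact: {impact_estimate:.2f}   (true impact = {true_impact})")
```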

Challenge:

As a professor, I would challenge my students to use the word “counterfactual” at social gatherings.  Try it!  You’ll be the life of the party.

Rad Resource:

For additional elaboration on these points, please read my Why Randomize? Primer.


· · ·

I am Melinda Davis, a Research Assistant Professor in Psychology at the University of Arizona, where I coordinate the Program Evaluation and Research Methods minor and serve as Editor-in-Chief for the Journal of Methods and Measurement in the Social Sciences.  In an ideal world, evaluation studies compare two groups that differ only in their treatment assignment.  Unfortunately, there are many ways that a comparison group can differ from the intervention group.

Lesson Learned: As evaluators, we conduct experiments in order to examine the effects of potentially beneficial treatments.  We need control groups in order to evaluate the effects of treatments. Participants assigned to a control group usually receive a placebo intervention or the status quo intervention (business-as-usual). Individuals who have been assigned to a treatment-as-usual control group may refuse randomization, drop out during the course of the study, or obtain the treatment on their own.  It can be quite challenging to create a plausible placebo condition, or what evaluators call the “counterfactual” condition, particularly for a social services intervention.  Participants in a placebo condition may receive a “mock” intervention that differs in the amount of time, attention, or desirability, all of which can result in differential attrition or attitudes about the effectiveness of the treatment.  At the end of a study, evaluators may not know if an observed effect is due to time spent, attention received, participant satisfaction, group differences resulting from differential dropout rates, or the active component of treatment.  Many threats to validity can appear as problems with the control group, such as maturation, selection, differential loss of respondents across groups, and selection-maturation interactions (see Shadish, Cook and Campbell, 2002).

Cool Trick: Shadish, Clark and Steiner (2008) demonstrate an elegant approach to the control group problem. While the focus of their study was not control group issues, their doubly randomized preference trial (DRPT) included a well-designed control group.  Participants were first randomized into one of two arms: in one arm they were randomly assigned to math or vocabulary instruction, whereas in the other arm they received the instruction of their choice.

The evaluators collected math and vocabulary outcomes for all participants throughout the study.  The effects of the vocabulary intervention on the vocabulary outcome, the effects of the mathematics intervention on the mathematics outcome, and changes across the treated versus untreated conditions could be compared, taking covariates into account.  This design allowed the evaluators to separate the effects of participant self-selection from the effect of treatment on the outcomes.
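
For those who find a simulation helpful, here is a rough, hypothetical sketch of the DRPT structure (it does not reproduce the Shadish, Clark, and Steiner data): a first-stage coin flip sends each participant to a randomized arm or a choice arm, and the randomized arm then serves as a benchmark for how much self-selection distorts the estimate from the choice arm.

```python
# Hypothetical sketch of a doubly randomized preference trial (DRPT).
# Stage 1: randomize participants to a "randomized" arm or a "choice" arm.
# Stage 2: randomized arm -> coin flip between math and vocabulary training;
#          choice arm     -> participants pick, and stronger math students
#                            tend to pick math (the self-selection of interest).
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

math_ability = rng.normal(0, 1, size=n)
random_arm = rng.random(n) < 0.5                       # stage 1

gets_math = np.where(                                  # stage 2
    random_arm,
    rng.random(n) < 0.5,                               # pure coin flip
    rng.random(n) < 1 / (1 + np.exp(-math_ability)),   # ability-driven choice
)

true_effect = 1.0                                      # math training on math score
math_score = math_ability + true_effect * gets_math + rng.normal(0, 1, size=n)

def diff_in_means(arm):
    treated = arm & gets_math
    control = arm & ~gets_math       # the vocabulary group serves as the control
    return math_score[treated].mean() - math_score[control].mean()

print(f"randomized-arm estimate: {diff_in_means(random_arm):.2f}  (benchmark, ~{true_effect})")
print(f"choice-arm estimate:     {diff_in_means(~random_arm):.2f}  (inflated by self-selection)")
```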

As evaluators, we benefit from being aware of potential threats to validity and of novel study designs that we can use to reduce such threats.


· · ·

I’m Allan Porowski, a Principal Associate at Abt Associates and a fan of experiments – when they’re conducted under the right circumstances. Experiments, commonly referred to as RCTs (randomized controlled trials), go through three stages: (1) a crazy start-up period, (2) a normal data collection period, and (3) a crazy analysis period.

Hot Tips:  Here are some tips to make that start-up period less crazy:

  • Don’t Fall in Love with the Method: Too often, evaluators try to force a given method to fit reality instead of using it to measure reality. Even though we may really want to conduct an RCT, it may not be appropriate. Experiments are not appropriate for new initiatives because they may not yet have excess demand for services, the necessary data collection infrastructure, an intake process that can accommodate randomization, or staff buy-in. If these criteria are not met, then the program is not ready to be tested experimentally.
  • Be Forward-Looking by Working Backwards: There’s no substitute for in-person planning sessions to hammer out evaluation details. A half day (or better yet, a full day) on-site is needed, and you’ll need a big whiteboard. It helps to start with a discussion of what the site hopes to learn and to design the study to meet those goals. Starting with the big picture and moving into the details also gets the conversation off to a more productive start than diving straight into the nuances of randomization.
  • Know Your Audience, and Let Them Know You: Don’t forget that when conducting an RCT, you are asking staff to replace professional judgment with a completely random process. That’s not an easy proposition to make. It’s really important to convey your understanding that RCTs can be disruptive, and to explain what can be done to minimize that disruption. Likewise, teach program staff to think like an evaluator. Get them involved in formulating research questions, identifying mediators, and developing hypotheses about the relationship between program services and outcomes. Keep in mind that nodding does not equal understanding. RCTs are not intuitive to most people, including many researchers, so take the time to explain study procedures in multiple ways.
  • Pressure-Test Your Sampling Frame: RCTs are often knocked for lacking generalizability, and unfortunately, that criticism is often warranted. Did you just recruit a bunch of sites that only serve left-handed kids in Boston? Recruitment is tough, but it’s even tougher to make the case that results are generalizable when your sampling frame doesn’t represent the broader population of participants your program serves.

Rad Resource:  Key Items To Get Right When Conducting a Randomized Controlled Trial in Education. Though the guide is over 10 years old, its advice is timeless.
