
Lindsay Demers on Inter-Rater Reliability

My name is Lindsay Demers. I am an evaluator and quantitative methodologist at TERC in Cambridge, MA. I have worked as the lead data analyst on several projects at TERC, Lesley University, Boston College and Brandeis University.

Hot Tip: You must calculate inter-rater reliability! Calculating inter-rater reliability (IRR) is of the utmost importance when using scoring rubrics to gather data from classroom observations, student assessments…really any data source where humans are doing the interpreting and scoring. The purpose of calculating inter-rater reliability is manifold, but here is one really important reason: you want clean, accurate, and replicable data. If you have coders who have different understandings of how to apply a scoring rubric, you’re going to end up with error-laden, inaccurate estimates of your variables of interest, either of which can lead to false conclusions.

Hot Tip: IRR calculation should be an ongoing process. Of course, IRR should be calculated at the beginning of the coding process to be sure that coders are starting at an adequate level of agreement. (Here, the “beginning of the coding process” means after coders have undergone extensive training on the coding rubric and appear ready to begin coding independently.) However, IRR should be re-calculated intermittently throughout the coding process to ensure that coders are staying consistent and have not “drifted” from one another.

Hot Tip: Percent agreement does not count as a measure of inter-rater reliability. Neither does a standard correlation coefficient. Percent agreement is insufficient because it does not take into account agreement due to chance: coders can have a very high percentage of agreement and still have very low IRR once chance is taken into account. A correlation coefficient is insufficient because two sets of ratings can change together without being equal, so a high correlation does not necessarily indicate a high level of agreement among coders.
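To make this concrete, here is a minimal sketch in plain Python (the coders, codes, and scores are hypothetical, invented purely for illustration): it computes percent agreement and Cohen's kappa by hand, then shows two perfectly correlated coders who never actually agree. Cohen's kappa is only one of several chance-corrected statistics; Gwet's handbook below covers the alternatives.

import statistics  # statistics.correlation requires Python 3.10+
from collections import Counter

def percent_agreement(r1, r2):
    """Proportion of items on which the two coders gave the same code."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    """Cohen's kappa for two coders: (p_o - p_e) / (1 - p_e)."""
    n = len(r1)
    p_o = percent_agreement(r1, r2)
    # Chance agreement expected from each coder's marginal code frequencies.
    f1, f2 = Counter(r1), Counter(r2)
    p_e = sum((f1[c] / n) * (f2[c] / n) for c in set(r1) | set(r2))
    return (p_o - p_e) / (1 - p_e)

# Two coders each code 20 observations; both mark "on-task" 18 times,
# but their two "off-task" codes never land on the same items.
coder1 = ["on-task"] * 18 + ["off-task"] * 2
coder2 = ["off-task"] * 2 + ["on-task"] * 18
print(percent_agreement(coder1, coder2))  # 0.80 -- looks impressive
print(cohens_kappa(coder1, coder2))       # about -0.11 -- worse than chance

# Correlation is not agreement either: a coder who is consistently one point
# more lenient correlates perfectly with a stricter coder yet never agrees.
scores1 = [1, 2, 3, 2, 4, 5, 3, 1]
scores2 = [s + 1 for s in scores1]
print(statistics.correlation(scores1, scores2))  # 1.0
print(percent_agreement(scores1, scores2))       # 0.0

In the first pair of coders, 80% agreement collapses to a negative kappa because nearly all of the agreement is what chance alone would predict; in the second pair, a perfect correlation coexists with zero agreement.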

Rad Resource:

  • Handbook of Inter-rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Multiple Raters, 3rd Edition by Kilem L. Gwet (2012)

This contribution is from the aea365 Tip-a-Day Alerts, by and for evaluators, from the American Evaluation Association. Please consider contributing – send a note of interest to aea365@eval.org. Want to learn more from Lindsay? She’ll be presenting as part of the Evaluation 2012 Conference Program, October 24-27 in Minneapolis, MN.
