Artificial intelligence (AI) and automated systems are increasingly interwoven with the complex systems we engage with, such as housing, health care, and local government. We’re Judah Axelrod, Alena Stern, Jamie Carter, and Sonia Torres Rodríguez, data scientists and researchers who work with and study these tools every day. We’ve seen how important it is for AI evaluations to reflect these systems’ complexities, moving beyond narrow laboratory-style assessments of model performance (“model evaluation”) to include interdisciplinary studies of their effects on communities (“holistic evaluation”). Here we share four principles to guide your model and holistic AI evaluation: efficiency, equity, explainability, and effectiveness.
Efficiency
Efficiency evaluations ask whether an AI model’s implementation will improve a system’s performance—such as accuracy or speed—over the status quo.
In model evaluation, this includes choosing appropriate performance metrics. In our work using AI to predict neighborhood change in partnership with the US Department of Housing and Urban Development, we weighed the relative harms to affected communities of incorrectly predicting change versus incorrectly predicting no change when deciding which efficiency metrics to prioritize.
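To make that concrete, here’s a rough Python sketch of comparing error types under different assumed harms. The labels, predictions, and cost weights are made up for illustration and aren’t drawn from our neighborhood-change work.

```python
# A rough sketch of harm-weighted error comparison (assumed data and weights)
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# 1 = neighborhood change, 0 = no change (hypothetical labels and predictions)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Assumption: missing a changing neighborhood (false negative) harms affected
# communities more than a false alarm (false positive)
COST_FN = 5.0
COST_FP = 1.0

print(f"Recall: {recall_score(y_true, y_pred):.2f}")
print(f"Precision: {precision_score(y_true, y_pred):.2f}")
print(f"Harm-weighted error cost: {COST_FN * fn + COST_FP * fp:.1f}")
```

Making the assumed harm weights explicit, even roughly, forces a conversation about which errors matter most before a headline accuracy number is chosen.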
Holistic evaluations of model efficiency must also measure whether narrow performance gains translate to systemic improvement. A high-performing public safety assessment model won’t improve systemic efficiency if stakeholders don’t change their behavior when presented with results.
Equity
Model equity is concerned with evaluating how a model’s performance (efficiency) or impact (effectiveness) might differ across data subgroups, along both statistical and demographic dimensions (e.g., a group’s sample size or race).
In model evaluation, equity considerations focus on similarity in efficiency across subgroups, which could involve trade-offs with overall efficiency.
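As a rough illustration of a subgroup equity check, the sketch below computes the same efficiency metric separately for each subgroup and reports the gap between them. The data, group labels, and the choice of recall as the metric are assumptions for demonstration.

```python
# A rough sketch of a subgroup equity check (assumed data and metric choice)
import pandas as pd
from sklearn.metrics import recall_score

# Hypothetical evaluation results with a demographic subgroup column
results = pd.DataFrame({
    "group":  ["A", "A", "A", "A", "B", "B", "B", "B"],
    "y_true": [1, 0, 1, 0, 1, 0, 1, 0],
    "y_pred": [1, 0, 0, 0, 1, 1, 1, 0],
})

# Compute the same efficiency metric separately for each subgroup
recalls = {}
for name, grp in results.groupby("group"):
    recalls[name] = recall_score(grp["y_true"], grp["y_pred"])
    print(f"Group {name} recall: {recalls[name]:.2f}")

# A simple disparity summary: the gap between best- and worst-served groups
print(f"Recall gap across subgroups: {max(recalls.values()) - min(recalls.values()):.2f}")
```

A large gap signals the kind of trade-off mentioned above: closing it may require accepting somewhat lower overall efficiency.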
Holistic evaluation also considers equity in who gets to make decisions about how AI is used and how AI systems affect communities.
Explainability
Explainability is concerned with how well an AI system’s decisions and outcomes are understood by stakeholders—not just its developers, but also those who interact with the system daily.
Evaluating explainability weighs trade-offs between simpler models, such as decision trees, where it’s easier to understand how individual decisions are made and which variables most influence those decisions, and more complex models that may sacrifice explainability for efficiency.
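The sketch below illustrates the more explainable end of that trade-off: a shallow decision tree whose rules and feature importances can be printed and discussed with stakeholders. The feature names and data are hypothetical, not drawn from our projects.

```python
# A rough sketch of inspecting a simpler, more explainable model
# (feature names and data are hypothetical)
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
feature_names = ["median_rent", "vacancy_rate", "permit_count", "median_income"]

# A shallow tree keeps the decision rules short enough to read and discuss
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The full decision logic, printable and shareable with stakeholders
print(export_text(tree, feature_names=feature_names))

# Which variables most influence the tree's decisions
for name, importance in zip(feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.2f}")
```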
Holistic evaluation also asks, “Explainability for whom?” and may investigate whether AI programs used participatory best practices to meaningfully engage affected communities in modeling decisions and communicate results.
Effectiveness
Perhaps the most foundational and difficult-to-evaluate metric is an AI system’s overall effectiveness. This requires moving beyond model evaluation to ask holistic evaluation questions about a model’s impact in practice, such as: What are the tangible outcomes for those who interact with a system? How does it compare to alternatives (both human and technological)? How does it fit within a broader decision-making pipeline?
Considering these four principles allows more thorough, organized model evaluation and can push developers and policymakers to step out from behind their computers and engage directly with the communities affected by AI systems.
Lessons Learned
- Explore the opacity of model-based decisions in the tenant-screening industry and how this affects rental applicants.
Rad Resources
- Urban’s ethics checkpoints can guide you on evaluating equity before, during, and after implementing a model.
- Urban’s recommendations and standards guide describes power sharing, accountability structures, and considerations of potential benefits and risks of imputed credit data for communities of color.
- AI equity toolkits can also help stakeholders engage accountably with residents and activists about public algorithms.
- Urban’s work on the equitable use of automated systems is exploring many of these questions at all levels of government. If you’re interested in partnering, we’d love to hear from you! Please reach out to Judah Axelrod at jaxelrod@urban.org.
The American Evaluation Association is hosting Urban Institute week. The contributions all this week to AEA365 come from staff at the Urban Institute, a nonprofit research organization that provides data and evidence to help advance upward mobility and equity. Do you have questions, concerns, kudos, or content to extend this AEA365 contribution? Please add them in the comments section for this post on the AEA365 webpage so that we may enrich our community of practice. Would you like to submit an AEA365 Tip? Please send a note of interest to AEA365@eval.org. AEA365 is sponsored by the American Evaluation Association and provides a Tip-a-Day by and for evaluators. The views and opinions expressed on the AEA365 blog are solely those of the original authors and other contributors. These views and opinions do not necessarily represent those of the American Evaluation Association, and/or any/all contributors to this site.