AEA’s Digital Data & Technology Working Group Week: The Shape of Evaluation to Come: From Artificial Accountability to Deep Learning by Zach Tilton

The posts for this week come from the Digital Data & Technology AEA Conference Working Group, and share how digital data and technology are factoring into reshaping evaluation.


Hello, I’m Zach Tilton, PhD candidate, evaluation consultant, and chair of the Evaluation 2022 sub-theme working group on Digital Data and Technology.

Technology and evaluation are ancient friends. The evidence of hafted stone-tipped spears from 500,000 years ago, and changes in these artifacts over time, suggests an ancient evaluative logic at play in determining the quality of a specific technology for survival. Our ancestors’ ability to think evaluatively about tech is also related to their capacity to think counterfactually—to imagine an alternative outcome in their survival if they did something different in their hunting process. Turing Award Laureate and artificial intelligence expert Judea Pearl asserts this one cognitive function is what distinguishes humans from machines—the ability to imagine alternatives, other worlds, to ask “What if?”

AI-generated image 1: a neanderthal evaluating a stone spear tip in their hand, realistic, ethereal, cinematic, unreal engine, detailed, 8k
AI-generated image 2: A community of evaluators reshaping their profession in New Orleans, highly detailed digital art

Comparing the past or present with what could have been or could be is at the core of any evaluative endeavor. And that is precisely what we are being invited to do for this year’s AEA conference from November 7-12 in New Orleans, Louisiana where we’ll consider how social justice, new actors, and technology can help us in the process of (re)shaping evaluation together.

This week we’re going to hear more about how digital data and technology are factoring into that reshaping imaginary. To set the stage, I’d like to make a connection between a personal tech-enabled avocation and our often tech-enabled vocation.

AI-generated image 3: Afrofuturist evaluator collecting data in New Orleans + concept art + Mardi Gras colors + half blueprint + futuristic metaverse + trending on artstation + Josan Gonzalez + Laurie Greasley + Dangiuz + elaborate + cinematic + golden ratio + particle dispersion + use rule of thirds + intricate details + high details

Last year, I started experimenting with text-to-image (T2I) generative art after discovering a small community of artists and researchers on Twitter who were using artificial intelligence, machine learning, and neural networks to create images. While there are many ways to make AI art, the latest generative models use deep learning, a process of iteratively improving a machine’s capacity to associate text and images, and latent diffusion, a process of locating points in a multi-dimensional space of clusters of descriptive variables and translating those combinations of variables into unique images recognizable by humans.
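The latent-diffusion intuition above—start from random noise and iteratively refine it toward something that matches the text—can be sketched in a few lines of toy Python. Everything here (the made-up "text encoding," the fixed 0.2 step size, the function name `toy_denoise`) is an illustration invented for this post, not how Midjourney or any real model actually works; real diffusion models replace the simple nudge below with a learned neural network.

```python
import random

def toy_denoise(prompt, steps=50, seed=None):
    """Illustrative only: start from pure noise, then iteratively
    nudge each latent value toward a target derived from the prompt.
    Real diffusion models learn this denoising step with a neural net."""
    rng = random.Random(seed)
    # Hypothetical "text encoding": one target value per character.
    target = [ord(ch) / 128.0 for ch in prompt]
    # Step 1: begin with random noise in a latent space.
    latent = [rng.gauss(0.0, 1.0) for _ in target]
    # Step 2: each denoising step removes a fraction of the remaining noise.
    for _ in range(steps):
        latent = [x + 0.2 * (t - x) for x, t in zip(latent, target)]
    return latent
```

Because the starting noise depends on the random seed, two runs with the same prompt begin from different points in latent space and diverge—the toy analogue of a generative model never returning exactly the same image twice.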

Rad Resources

  • Learn to make your own images like these created in Midjourney.
AI-generated image 4: evaluation practitioner collecting data in cyberpunk Mardi Gras

Though the practical applications of this technology to evaluation are limited, there are many similarities between this process and ours. For example, both evaluation and AI art processes start with noise and arrange data into a coherent message. Both are highly contextual, and due to the randomness of latent space (and evaluation contexts), generative models will never return exactly the same image for the same prompt, not unlike the same evaluation models returning different findings across different contexts. However, the similarity I most want to discuss deals with bias and harm.

Lessons Learned

Like evaluation, the tech stacks and digital data used for training models and generating AI art are not value neutral. Algorithms are codified human decisions. Training data images are produced and, at least initially, tagged by humans.

All of this means that, while a deep learning model technically teaches itself how to better evaluate and generate image outputs associated with specific natural language text, these outputs are laden with the social biases of an old world or the status quo. At best this can constrain our moral imagination, and at worst it can reify harm and violence against vulnerable groups and individuals.

For example, when I first started generating images of social researchers or evaluators for this week’s posts, the results looked mostly like me: white and male. This is a known issue with these models. I had to intentionally prompt the machine toward a more representative and diverse vision. I suspect we will need similar intentionality as we approach our conversations about reshaping evaluation together this year.

AI-generated image 5: the word evaluation shaped in clay, Mardi Gras colors.

Interestingly, harm reduction is a major reason that both of the best-performing text-to-image models, DALL-E and Imagen, are still in closed beta, not open to the public. So, just like our ancient ancestors, we need to evaluate the merit of this tech, and of any tech we might use in our daily life or evaluation practice, for our survival and, most importantly, the survival of our marginalized, racialized, and vulnerable siblings.

In keeping with the theme of reshaping together, for each blog post this week I have collaborated with a neural network (through Midjourney) and guest contributors, either directly or indirectly with their permission, to create AI-generated art to accompany central themes from their blog posts.

AI-generated image 6: Concept art of a diverse group of evaluation workers repairing canonical books + in the style of Krenz Cushart, Ashley Wood, Sam Weber, Eric Fortune, Joao Ruas, Jon Foster, Craig Mullins, Rick Berry, Nigel Quarless + cinematic + aesthetic canon of proportion + seamless
Get Involved

Join the Working Group for the Digital Data & Technology Town Hall on Thursday, August 18, 2022, from 2pm-3pm EDT titled Convergence in Digital Data and Technology: Trends and Implications for Evaluation.


The American Evaluation Association is hosting Digital Data & Technology Week with our colleagues in AEA’s Digital Data & Technology Working Group. The contributions all this week to AEA365 come from working group members. Do you have questions, concerns, kudos, or content to extend this AEA365 contribution? Please add them in the comments section for this post on the AEA365 webpage so that we may enrich our community of practice. Would you like to submit an AEA365 Tip? Please send a note of interest to AEA365@eval.org. AEA365 is sponsored by the American Evaluation Association and provides a Tip-a-Day by and for evaluators. The views and opinions expressed on the AEA365 blog are solely those of the original authors and other contributors. These views and opinions do not necessarily represent those of the American Evaluation Association, and/or any/all contributors to this site.
