How We Ensure AI Delivers for Students and Educators

Wang and Almquist: Generative AI promises to unlock new opportunities in education. We need to build systems that ensure it delivers on its promise.

Eamonn Fitzmaurice/The 74

Artificial Intelligence products are transforming how teachers work by offering support with everything from lesson planning to personalized instruction. Their potential to streamline tasks and expand access to tailored learning has rightly sparked excitement across the education ecosystem. But as more classrooms begin to use AI-powered tools, we have to ask: How do we know if these tools are any good?

The truth is, educators often don’t know. In the AI world, models are rarely evaluated on meaningful educational tasks. Common benchmarks tend to focus on narrow, close-ended tasks, not on open-ended ones that require knowledge of pedagogical best practices. And unlike traditional curricula, content generated by AI-powered tools rarely undergoes expert review, with rare exceptions.

As a result, it’s not always clear whether AI-generated lesson plans are designed with best practices in mind, whether AI-generated passages build the right skills and knowledge, and ultimately whether AI-generated feedback is effective at supporting student growth. This leaves educators as the last line of defense in determining what content is good — and leaves students at even greater risk of receiving less rigorous content. The lack of shared, transparent evaluation tools means we’re flying blind: not because AI technologies are inherently flawed, but because the infrastructure to assess them hasn’t caught up.

Evaluation is the missing layer that can help education technology expand with trust and impact.

This has been central to the partnership between Stanford University researchers and Learning Commons: ensuring AI-powered tools truly enhance education and partnering with educator experts to closely evaluate the quality of the content and the support that AI tools offer.

Together with collaborators Student Achievement Partners (SAP) and the Achievement Network (ANet), we have developed a set of automated evaluators to help educators and edtech developers assess key dimensions of instructional quality with the push of a button. One of our first focus areas has been on enabling better measurement of text complexity, a critical step in supporting strong reading development.

The reading material given to students matters. Research shows that getting text complexity right drives important outcomes for students, and getting it wrong carries real risks. Offering students a steady diet of material below their grade level actually slows reading development, while appropriately challenging texts help students grow faster.

While there are existing, openly available readability metrics that can help educators and developers estimate the difficulty of a passage, those measures are highly limited in scope and value: they do not reflect critical features, such as whether a text covers the right content, with vocabulary, syntax and other elements at the right level. Other measures of text complexity offer more comprehensive coverage but are proprietary, so they lack transparency and are not available to all.
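Open formulas such as Flesch-Kincaid illustrate the limitation: they reduce complexity to sentence length and syllable counts, with no view of content, vocabulary depth or syntax. A minimal sketch of that classic formula (the syllable counter is a deliberately crude heuristic):

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: each run of consecutive vowels counts as one syllable.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level, a widely used open readability formula.

    It uses only average sentence length and syllables per word —
    exactly why such surface metrics miss whether a text covers the
    right content at the right level.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

Two passages with identical scores under this formula can make wildly different demands on a reader, which is why more comprehensive measures matter.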

That’s why, with our partners, we are co-designing a set of autoevaluators that use computational methods to assess the level of challenge presented by reading materials. Built in collaboration with educators and researchers, these autoevaluators help developers quickly evaluate large volumes of content, offering both a score and an explanation grounded in learning science.

What sets this approach apart isn’t just the technology; it’s the collaboration behind it and its repeated use in rigorously evaluating AI-generated content. Our work begins with research on what counts as rigorous and how those judgments are made. That is the bar we need to hold education products to, in terms of quality and rigor.

We then take those learnings and turn them into thoughtful benchmark datasets that reflect the critical pieces of information needed to gauge quality and rigor and, ultimately, into autoevaluators that can dynamically measure the quality of content generated by AI. With these tools, edtech companies can test and improve their AI systems based on feedback that is both rigorous and actionable.
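To make the score-plus-explanation shape concrete, here is a purely hypothetical sketch. The names (`Evaluation`, `evaluate_text_complexity`) and the word-length proxy are illustrative assumptions, not the partnership’s actual evaluator, which would combine expert-built rubrics with far richer signals:

```python
from dataclasses import dataclass

@dataclass
class Evaluation:
    score: float       # e.g., 0.0 (well below target) to 1.0 (on grade level)
    explanation: str   # rationale a developer or educator can act on

def evaluate_text_complexity(passage: str, grade: int) -> Evaluation:
    """Illustrative stand-in for an autoevaluator.

    A real system would weigh content coverage, vocabulary and syntax
    against a benchmark dataset; this toy uses average word length only,
    to show how a score and an explanation travel together.
    """
    words = passage.split()
    avg_len = sum(len(w) for w in words) / max(1, len(words))
    # Toy proxy: longer average word length ~ more challenge for the grade.
    score = min(1.0, avg_len / (3 + grade))
    explanation = (f"Average word length {avg_len:.1f} against a "
                   f"grade-{grade} target; a production evaluator would "
                   f"also weigh vocabulary, syntax and content coverage.")
    return Evaluation(score=score, explanation=explanation)
```

The point of the pairing is the feedback loop: a bare score tells a developer that content missed the mark, while the explanation tells them why, so the next generation can improve.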

As AI technologies become more widely adopted in schools, evaluation must move from an afterthought to essential infrastructure. Just as medicine relies on clinical trials and safety protocols, edtech needs open, transparent and credible ways to assess quality. This isn’t about slowing innovation — it’s about enabling faster, more informed use of this technology, taking out some of the guesswork of what will and won’t work. It also helps developers spot blind spots early. It’s also about empowering educators, those closest to students, with tools to understand and shape what AI generates for real classrooms.

This is core to the work Learning Commons supports as part of the organization’s broader commitment to scale proven teaching and learning practices to benefit every learner. Providing open access to these tools and developing criteria along with educators have been central to this effort. When evaluation methods are co-designed and open, we believe that stakeholders can more easily adapt them to local contexts, challenge assumptions and build on shared learning.

But we’re just getting started, and we can’t do this alone.

Collaboration is key — and challenging. What’s needed now is a broad coalition of school districts, researchers, funders and developers willing to work together to create open, rigorous evaluation frameworks, datasets and tools. Whether it’s assessment of text complexity, content knowledge coherence or alignment to foundational skill standards, evaluation that enables better content must become accessible to all and not just a luxury for the well-resourced.

The call is simple: We need developers, educators, researchers and practitioners to help us as we tackle the challenge of turning critical learning science into real, usable and, most importantly, impactful products for the classroom. Connect with us, bring your expertise, your questions and your use cases. Let’s build the infrastructure together to ensure that AI in education is not only innovative but effective and grounded in the realities of the classroom.

Disclosure: Chan Zuckerberg Initiative provides financial support to The 74.
