How We Ensure AI Delivers for Students and Educators

Wang and Almquist: Generative AI promises to unlock new opportunities in education. We need to build systems that ensure it delivers on its promise.

Eamonn Fitzmaurice/The 74

Artificial Intelligence products are transforming how teachers work by offering support with everything from lesson planning to personalized instruction. Their potential to streamline tasks and expand access to tailored learning has rightly sparked excitement across the education ecosystem. But as more classrooms begin to use AI-powered tools, we have to ask: How do we know if these tools are any good?

The truth is, educators often don’t know. In the AI world, models are rarely evaluated on meaningful educational tasks. Common benchmarks tend to focus on narrow, close-ended tasks, not on open-ended ones that require knowledge of pedagogical best practices. And unlike traditional curricula, content generated by AI-powered tools rarely undergoes expert review, with rare exceptions.

As a result, it’s not always clear whether AI-generated lesson plans are designed with best practices in mind, whether AI-generated passages build the right skills and knowledge, and ultimately whether AI-generated feedback is effective at supporting student growth. This leaves educators as the last line of defense in determining what content is good — and leaves students at even greater risk of receiving less rigorous content. The lack of shared, transparent evaluation tools means we’re flying blind: not because AI technologies are inherently flawed, but because the infrastructure to assess them hasn’t caught up.

Evaluation is the missing layer that can help education technology expand with trust and impact.

This has been central to the partnership between Stanford University researchers and Learning Commons: ensuring AI-powered tools truly enhance education and partnering with educator experts to closely evaluate the quality of the content and the support that AI tools offer.

Together with collaborators Student Achievement Partners (SAP) and the Achievement Network (ANet), we have developed a set of automated evaluators to help educators and edtech developers assess key dimensions of instructional quality with the push of a button. One of our first focus areas has been on enabling better measurement of text complexity, a critical step in supporting strong reading development.

The reading material given to students matters. Research shows that getting text complexity right drives important outcomes for students, and getting it wrong carries real risks. Offering students a steady diet of material below their grade level actually slows reading development, while appropriately challenging texts help students grow faster.

While there are existing, openly available readability metrics that can help educators and developers estimate the difficulty of a passage, those measures are highly limited in scope and value: they do not reflect critical features, such as whether a text covers the right content, with vocabulary, syntax and other elements at the right level. Other measures of text complexity offer more comprehensive coverage but are proprietary, so they lack transparency and are not available to all.
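Open formulas such as Flesch-Kincaid illustrate the limitation: they reduce complexity to sentence length and syllable counts, with no view of content, vocabulary depth or syntax. A minimal sketch of that classic formula (the syllable counter is a deliberately crude heuristic):

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: each run of consecutive vowels counts as one syllable.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level, a widely used open readability formula.

    It uses only average sentence length and syllables per word —
    exactly why such surface metrics miss whether a text covers the
    right content at the right level.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

Two passages with identical scores under this formula can make wildly different demands on a reader, which is why more comprehensive measures matter.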

That’s why, with our partners, we are co-designing a set of autoevaluators that use computational methods to assess the level of challenge presented by reading materials. Built in collaboration with educators and researchers, these autoevaluators help developers quickly evaluate large volumes of content, offering both a score and an explanation grounded in learning science.

What sets this approach apart isn’t just the technology; it’s the collaboration behind it and its repeated use in rigorously evaluating AI-generated content. Our work begins with research on what counts as rigorous and how those judgments are made. That is the bar we need to hold education products to, in terms of quality and rigor.

We then take those learnings and turn them into thoughtful benchmark datasets that reflect the critical pieces of information needed to gauge quality and rigor and, ultimately, into autoevaluators that can dynamically measure the quality of content generated by AI. With these tools, edtech companies can test and improve their AI systems based on feedback that is both rigorous and actionable.
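To make the score-plus-explanation shape concrete, here is a purely hypothetical sketch. The names (`Evaluation`, `evaluate_text_complexity`) and the word-length proxy are illustrative assumptions, not the partnership’s actual evaluator, which would combine expert-built rubrics with far richer signals:

```python
from dataclasses import dataclass

@dataclass
class Evaluation:
    score: float       # e.g., 0.0 (well below target) to 1.0 (on grade level)
    explanation: str   # rationale a developer or educator can act on

def evaluate_text_complexity(passage: str, grade: int) -> Evaluation:
    """Illustrative stand-in for an autoevaluator.

    A real system would weigh content coverage, vocabulary and syntax
    against a benchmark dataset; this toy uses average word length only,
    to show how a score and an explanation travel together.
    """
    words = passage.split()
    avg_len = sum(len(w) for w in words) / max(1, len(words))
    # Toy proxy: longer average word length ~ more challenge for the grade.
    score = min(1.0, avg_len / (3 + grade))
    explanation = (f"Average word length {avg_len:.1f} against a "
                   f"grade-{grade} target; a production evaluator would "
                   f"also weigh vocabulary, syntax and content coverage.")
    return Evaluation(score=score, explanation=explanation)
```

The point of the pairing is the feedback loop: a bare score tells a developer that content missed the mark, while the explanation tells them why, so the next generation can improve.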

As AI technologies become more widely adopted in schools, evaluation must move from an afterthought to essential infrastructure. Just as medicine relies on clinical trials and safety protocols, edtech needs open, transparent and credible ways to assess quality. This isn’t about slowing innovation — it’s about enabling faster, more informed use of this technology, taking out some of the guesswork of what will and won’t work. It also helps developers spot blind spots early. It’s also about empowering educators, those closest to students, with tools to understand and shape what AI generates for real classrooms.

This is core to the work Learning Commons supports as part of the organization’s broader commitment to scale proven teaching and learning practices to benefit every learner. Providing open access to these tools and developing criteria along with educators have been central to this effort. When evaluation methods are co-designed and open, we believe that stakeholders can more easily adapt them to local contexts, challenge assumptions and build on shared learning.

But we’re just getting started, and we can’t do this alone.

Collaboration is key — and challenging. What’s needed now is a broad coalition of school districts, researchers, funders and developers willing to work together to create open, rigorous evaluation frameworks, datasets and tools. Whether it’s assessment of text complexity, content knowledge coherence or alignment to foundational skill standards, evaluation that enables better content must become accessible to all and not just a luxury for the well-resourced.

The call is simple: We need developers, educators, researchers and practitioners to help us as we tackle the challenge of turning critical learning science into real, usable and, most importantly, impactful products for the classroom. Connect with us, bring your expertise, your questions and your use cases. Let’s build the infrastructure together to ensure that AI in education is not only innovative but effective and grounded in the realities of the classroom.

Disclosure: Chan Zuckerberg Initiative provides financial support to The 74.
