It’s Time for Humanity’s Best Exam for AI

Better benchmarks can unlock the social benefits of AI technology.

The rhythm of artificial intelligence (AI) development has become unsettlingly familiar. A new model is unveiled, and with it comes a predictable flurry of media attention. One cluster of articles dissects its intricate training data and architecture; another marvels, often breathlessly, at its newfound capabilities; and a third, almost inevitably, scrutinizes its performance on a battery of standardized tests. These benchmarks have become our primary yardsticks for AI progress. Yet, they predominantly paint a picture skewed toward raw technical prowess and potential peril, leaving the public with a pervasive feeling that each impressive step forward for AI might translate into two regrettable steps back for the rest of us.

Many of these evaluations concentrate on the technical capacity of the model or its computational horsepower. Others, with growing urgency, assess the likelihood of misuse—could this advanced AI empower rogue actors to design a bioweapon or destabilize critical infrastructure through sophisticated cyberattacks? A significant portion of evaluations also measures AI against human performance in specific job tasks, fueling widespread anxieties about automation and diminished human agency. The reporting on these tests, frequently framed by alarming headlines, understandably casts AI advancements more as a societal regression than a leap forward. The very branding of prominent benchmarks, such as the ominously titled Humanity’s Last Exam, amplifies these negative connotations. That benchmark and others like it tend to measure a model’s capacity to complete bespoke tests, aid bad actors engaging in harmful conduct, or some combination of the two. It is difficult, if not impossible, to read coverage of such an assessment and come away with a hopeful, or even neutral, view of AI’s trajectory.

This is not to argue that assessing risks or understanding the deep mechanics of AI is unimportant. Vigilance and technical scrutiny are crucial components of responsible development. The current benchmarking landscape, however, is dangerously imbalanced. Those of us who recognize AI’s immense transformative potential to address some of the world’s most intractable problems—including revolutionizing medical diagnostics, accelerating climate solutions, and personalizing education for every child—currently lack a prominent, public-facing benchmark designed to track, celebrate, and encourage these positive developments.

It is time we introduce “Humanity’s Best Exam”—a benchmark that strives to capture a model’s capacity to address public policy problems and otherwise serve the general welfare.

Imagine a new form of evaluation that challenges AI systems not with abstract logic puzzles but with tangible goals vital to human flourishing. Consider a benchmark that tasks AI models with identifying early-stage diabetic retinopathy from retinal scans with over 95 percent accuracy, a leap that could surpass current screening efficacy and save millions from preventable blindness. Picture a test that spurs the design, within a single year, of three novel antibiotic compounds effective against stubborn, drug-resistant bacteria. In the realm of climate science, Humanity’s Best Exam might push AI to develop a groundbreaking, cost-effective catalyst for the direct air capture of carbon dioxide, improving efficiency by a significant margin—say, 20 percent—over existing technologies. Or it could encourage the creation of predictive models for localized flash floods that offer vulnerable regions a critical six-hour lead time with 90 percent accuracy. Or, in education, the challenge could be to generate personalized six-month learning plans for diverse student profiles in foundational STEM subjects, demonstrably elevating learning outcomes by an average of two grade levels.

The creation and widespread adoption of Humanity’s Best Exam would serve several critical, society-shaping purposes.

First, it would powerfully harness the intense competitive spirit of AI laboratories for the global good. AI developers are profoundly motivated by benchmark performance—the race to the top of the leaderboards is fierce. Channeling this potent drive toward solving clearly defined societal problems could positively redirect research priorities and resource allocation within these influential organizations.

Second, such a benchmark would be instrumental in reshaping the public discourse surrounding artificial intelligence. The narrative around any powerful new technology is inevitably shaped by the information that is most readily available and most prominently featured. If the most visible AI assessments continue to highlight dangers and disruptions, public perception will remain tinged with fear and skepticism. Humanity’s Best Exam would provide a steady stream of positive, concrete examples of AI’s potential, offering a more balanced and hopeful counter-narrative. This perspective is essential for fostering a more informed and constructive public conversation, which is, in turn, vital for democratic oversight of this transformative technology.

Finally, a benchmark focused on positive societal impact would provide invaluable guidance for policymakers, investors, and researchers. As a law professor whose research centers on accelerating AI innovation through thoughtful legal and policy reforms, I see a pressing need for clearer signals to guide governance away from reactive, fear-driven legislation and toward proactive, enabling frameworks. Humanity’s Best Exam would illuminate areas where AI is poised to deliver significant societal returns, helping policymakers to direct strategic funding more effectively and to develop supportive, rather than stifling, regulatory environments. Investors would gain a clearer view of emerging opportunities where AI can create substantial financial and social value. Researchers across numerous disciplines could more easily identify how cutting-edge AI capabilities can be leveraged within their fields, potentially sparking new collaborations and accelerating vital research.

But who would build and oversee such an ambitious undertaking, and how could we navigate the inherent challenges? The establishment of Humanity’s Best Exam would necessitate a dedicated, independent, and broadly representative multi-stakeholder governing consortium. This body should ideally include experts from leading academic institutions; established nonprofits with proven experience in managing “grand challenges,” akin to the XPrize Foundation model of hosting competitions to achieve societally beneficial breakthroughs; relevant international organizations; domain specialists from fields such as public health, environmental science, and education; ethicists; and, critically, representatives from civil society organizations to ensure public accountability. Funding could be drawn from a diverse portfolio, including major philanthropic sources, government grants earmarked for scientific and societal advancement, and perhaps even a coalition of AI laboratories and technology firms committed to socially beneficial AI development.

To address the valid concern that defining “societal benefit” can be subjective, a primary task for this consortium would be to establish a transparent and evolving framework for identifying and prioritizing challenge areas, perhaps drawing inspiration from established global agendas such as the United Nations’ Sustainable Development Goals. The specific tasks within the benchmark would need to be rigorously defined, objectively measurable, and, crucially, regularly updated by diverse expert panels. This dynamism is key to preventing the benchmark from becoming stale, to avoiding the pitfalls of “teaching to the test” in a way that stifles genuine innovation, and to ensuring continued relevance as AI capabilities and societal needs evolve. Although no benchmark can ever be entirely immune to attempts at superficial optimization, focusing on complex, real-world problems with multifaceted success criteria makes simplistic gaming far more difficult than it is on narrower, purely technical tests. Furthermore, a portion of the assessment could incorporate qualitative reviews by expert panels, evaluating the robustness, safety, ethical considerations, and real-world applicability of the proposed AI tools.

The current, almost myopic focus on AI’s potential downsides, although born of a necessary caution, is inadvertently creating an innovation ecosystem shrouded in anxiety. We are meticulously documenting every conceivable way AI could go wrong, while failing to champion, encourage, and systematically measure its profound potential to go spectacularly right.

It is time to correct this imbalance. A crucial first step would be for leading philanthropic organizations, forward-thinking academic consortia, and ethically minded AI developers to convene a foundational summit. The purpose of such a gathering would be to begin outlining the charter, initial problem sets, and robust governance structure for Humanity’s Best Exam. This is far more than a mere intellectual exercise; it is a necessary reorientation of our collective focus and a deliberate effort to harness the awesome power of artificial intelligence for the betterment of all. Let us not only brace for AI’s potential last exam but actively architect its very best.

Kevin T. Frazier

Kevin T. Frazier is the AI Innovation and Law Fellow at Texas Law.