The problem isn’t that AI is wrong. It’s that we believe it’s right — even when it’s grading its own homework.
Humanity’s Last Exam (HLE) is the gold standard for measuring AI intelligence. Published in Nature, backed by the Center for AI Safety and Scale AI, with 2,500 expert-level questions contributed by nearly 1,000 subject-matter experts across 50 countries. It’s the hardest test ever given to an AI system.
The best model in the world — Google’s Gemini 3 Pro — scores 37.5%.
That means the smartest AI we’ve ever built gets the answer wrong nearly two out of every three times.
But here’s what nobody’s talking about: the judge grading those answers is also an AI.
The Judge Is Failing the Same Test
HLE uses OpenAI’s o3-mini as its automated judge. The model reads each AI’s response, compares it to the known correct answer, and decides whether the response is right or wrong.
This sounds reasonable — until you look at what o3-mini actually does when you ask it questions.
OpenAI’s own system card shows that o3 (the full model, not even the mini version) hallucinates 51% of the time on simple factual questions. On questions about public figures, it hallucinates 33% of the time. The smaller o4-mini model — architecturally similar to the judge being used — hits a 79% error rate on simple factual questions.
So the AI industry’s most important benchmark is being graded by a model from the same family that gets basic facts wrong more often than it gets them right.
It Gets Worse
The grading task isn’t just “does this match the answer key.” HLE uses o3-mini with structured decoding to determine whether a model’s response is “semantically equivalent” to the correct answer. That means the judge has to understand whether a differently-phrased answer means the same thing as the ground truth.
This is a reasoning task. And reasoning is exactly where these models fail the most.
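To make that concrete, here is a minimal sketch of what an LLM-as-judge equivalence check looks like. The prompt wording, the JSON shape, and the ask_judge helper are illustrative assumptions rather than HLE’s actual implementation; the point is that the final right-or-wrong grade rests entirely on the judge model’s own reasoning, with no second check.

```python
import json

# Hypothetical judge prompt; HLE's real prompt and schema may differ.
JUDGE_PROMPT = """You are grading an exam answer.
Question: {question}
Ground-truth answer: {ground_truth}
Candidate response: {response}
Is the candidate response semantically equivalent to the ground truth?
Reply with JSON: {{"reasoning": "...", "correct": true or false, "confidence": 0-100}}"""


def ask_judge(prompt: str) -> str:
    """Placeholder for a call to the judge model (e.g. an o3-mini request).

    Stubbed out because the real call depends on provider credentials;
    it should return the judge's raw JSON text.
    """
    raise NotImplementedError("wire this to your judge model of choice")


def grade(question: str, ground_truth: str, response: str) -> bool:
    """Return the judge's verdict. 'Correct' here means only that the
    judge model believes the answer is correct."""
    raw = ask_judge(JUDGE_PROMPT.format(
        question=question, ground_truth=ground_truth, response=response))
    verdict = json.loads(raw)        # structured decoding enforces this shape
    return bool(verdict["correct"])  # a single model's judgment becomes the grade
```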
Consider: if o3-mini misjudges even 5% of the 2,500 questions, that’s 125 potentially wrong grades. For models scoring in the 25–38% range (625–950 correct answers), a 5% judge error rate could swing a model’s score by several percentage points — enough to completely rearrange the leaderboard.
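The arithmetic behind that swing is worth spelling out. The sketch below assumes 2,500 questions, a flat 5% judge error rate, and a worst case where every misgrade falls in the same direction; the real distribution of judge mistakes is unknown, which is itself part of the problem.

```python
TOTAL_QUESTIONS = 2500
JUDGE_ERROR_RATE = 0.05  # assumed for illustration, not a measured figure

misgraded = TOTAL_QUESTIONS * JUDGE_ERROR_RATE        # 125 questions
worst_case_swing = 100 * misgraded / TOTAL_QUESTIONS  # 5 percentage points

for reported_score in (25.0, 37.5):
    low = reported_score - worst_case_swing
    high = reported_score + worst_case_swing
    print(f"a reported {reported_score:.1f}% could reflect a true score "
          f"anywhere from {low:.1f}% to {high:.1f}%")
# A band that wide is larger than the gap between many adjacent leaderboard entries.
```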
Scale AI acknowledges this: they note that “small differences could arise from different judge models and prompts used on edge cases.” But the implications of this understatement are enormous when the entire AI industry uses these scores to market their products, attract investment, and make safety claims.
The Answer Key Is Wrong Too
As if an unreliable judge weren’t enough, the answer key itself has problems.
In July 2025, research lab FutureHouse audited the chemistry and biology questions in HLE. Their finding: 29% of the answers directly contradicted published peer-reviewed literature. An additional 19% were “nuanced, depending on assumptions or opinions.”
That means for nearly half the science questions, the “correct” answer was debatable or outright wrong — and the AI judge is comparing model responses against these flawed answers.
One example: HLE asked about the rarest noble gas on Earth as of 2002. The “correct” answer was Oganesson, a synthetic element produced in a Russian particle accelerator, whose atoms decayed within milliseconds. Only five atoms have ever been created. Multiple peer-reviewed papers argue it isn’t technically a noble gas, and it’s predicted to be a solid, not a gas. The AI judge would mark any model that correctly identified this nuance as wrong.
Scale AI conducted their own review and estimated the error rate at 18% rather than 29%, then published a revised subset. But even at 18%, if that rate held across the full exam, roughly 450 of the 2,500 questions would have unreliable ground truth, all of them graded by an unreliable AI judge.
The best AI in the world fails 63% of the time.
HLE is the hardest AI benchmark ever created — 2,500 expert-level questions across 100 subjects. No AI comes close to passing.
These scores are graded by OpenAI’s o3-mini — from a model family with a 51% hallucination rate on factual questions. An independent audit found 18–29% of HLE’s science answers contradict peer-reviewed literature (FutureHouse / Scale AI, 2025).
The Calibration Problem Compounds Everything
HLE doesn’t just measure accuracy — it measures calibration: whether a model’s stated confidence matches its actual performance. Models show calibration errors ranging from 34% to 89%.
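For readers unfamiliar with the metric: calibration error compares a model’s stated confidence with its measured accuracy. The sketch below shows one common binned RMS formulation with made-up numbers; it is illustrative, not HLE’s exact scoring code.

```python
import numpy as np

def rms_calibration_error(confidences, correct, n_bins=10):
    """Binned RMS calibration error: group answers by stated confidence,
    then compare each bin's average confidence to its actual accuracy."""
    confidences = np.asarray(confidences, dtype=float)  # stated confidence in [0, 1]
    correct = np.asarray(correct, dtype=float)          # 1.0 if graded correct, else 0.0
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    sq_err, total = 0.0, len(confidences)
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        gap = confidences[mask].mean() - correct[mask].mean()  # overconfidence gap
        sq_err += (mask.sum() / total) * gap ** 2              # weighted by bin size
    return float(np.sqrt(sq_err))

# Toy data: a model that claims ~90% confidence but is right 2 times out of 8.
stated = [0.90, 0.92, 0.88, 0.91, 0.90, 0.93, 0.89, 0.90]
graded = [1, 0, 0, 0, 0, 1, 0, 0]
print(f"calibration error: {rms_calibration_error(stated, graded):.0%}")  # about 67%
```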
But here’s the recursive problem: the confidence scores are also extracted by o3-mini. So the calibration measurements — which are supposed to tell us whether AI knows what it doesn’t know — are themselves being evaluated by an AI that doesn’t know what it doesn’t know.
GPT-4o shows an 89% calibration error on HLE: its stated confidence overshoots its actual accuracy so badly that when it says it’s 90% confident, it’s wrong far more often than it’s right. The judge evaluating that confidence is a model from a family with similar confidence-accuracy misalignment.
Why This Matters Beyond Benchmarks
This isn’t just an academic problem. HLE scores drive real-world decisions:
Investment: Venture capital firms and public markets use benchmark scores to evaluate AI companies. A few percentage points on HLE can mean billions in market cap.
Regulation: The EU AI Act and frameworks from NIST reference benchmark-based evaluation. Stanford HAI’s AI Index 2025 Annual Report cites HLE specifically. Policy decisions affecting millions of people are being informed by scores graded by an unreliable AI.
Safety claims: AI companies market their products as “expert-level” or “near-human” based on benchmark performance. If the grading is unreliable, these safety claims are built on sand.
Enterprise adoption: Companies deciding whether to deploy AI in healthcare, legal, finance, and government are influenced by benchmark results. If a model’s HLE score is wrong by even a few percentage points due to judge error, that could be the difference between “safe to deploy” and “not ready.”
The Deeper Issue
This is a perfect case study of the problem the ZH Standard was built to solve.
The entire AI evaluation pipeline — from question creation to answer grading to leaderboard publication — operates on trust. Trust that the questions are right. Trust that the judge is reliable. Trust that the scores mean what we think they mean.
At every link in that chain, the trust is misplaced.
The audited science questions have a documented 18–29% error rate. The judge comes from a model family with 51–79% hallucination rates on factual tasks. The calibration measurements are evaluated by the same unreliable system. And the scores are published as authoritative without any independent verification layer.
This is exactly the pattern we see in every AI failure case in our database: someone trusts an AI output, takes action on it, and only discovers the error after the damage is done. The $85,000 legal sanction. The $250 million stolen in Minnesota. The government report full of fake citations. The student expelled for unverified AI-generated work.
The problem isn’t that AI is wrong. It’s that we believe it’s right.
What Verification Looks Like
A deterministic verification process doesn’t trust any single output — human or AI. It cross-references every claim against authoritative source data, flags discrepancies before they propagate, and creates an immutable audit trail so any reviewer can independently verify the chain of evidence.
Applied to HLE specifically, this means: don’t trust the AI judge alone. Don’t trust the answer key alone. Verify each grade against the source literature. Hash every verification step. Make the entire grading pipeline auditable.
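As one concrete illustration of the “hash every verification step” idea, the sketch below chains each verification record to the previous one with SHA-256, so any later edit to any record breaks every link after it. The record fields and step names are hypothetical, not the ZH-1 schema.

```python
import hashlib
import json
from datetime import datetime, timezone

def append_record(trail: list[dict], step: str, payload: dict) -> dict:
    """Append a verification step, chained to the previous record's hash."""
    prev_hash = trail[-1]["hash"] if trail else "0" * 64
    record = {
        "step": step,
        "payload": payload,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prev_hash": prev_hash,
    }
    serialized = json.dumps(record, sort_keys=True).encode()
    record["hash"] = hashlib.sha256(serialized).hexdigest()
    trail.append(record)
    return record

def chain_is_intact(trail: list[dict]) -> bool:
    """Recompute every hash; a single altered field invalidates the chain."""
    prev = "0" * 64
    for record in trail:
        body = {k: v for k, v in record.items() if k != "hash"}
        recomputed = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if body["prev_hash"] != prev or recomputed != record["hash"]:
            return False
        prev = record["hash"]
    return True

# Usage: log the judge's verdict and the source it was checked against.
trail: list[dict] = []
append_record(trail, "judge_verdict", {"question_id": 1042, "verdict": "correct"})
append_record(trail, "source_check", {"question_id": 1042, "reference": "peer-reviewed paper DOI"})
assert chain_is_intact(trail)
```

Anyone holding the trail can re-run the same check; no reviewer has to trust the grader’s word, only the hashes.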
This is what ZH-1 does — not just for benchmarks, but for every AI output that matters. Legal filings. Medical decisions. Government benefits. Financial reports. Student work. Enterprise data.
Every place where someone is trusting an AI output without verification is a failure waiting to happen.
The ZH Standard by Brightstead Technologies uses a patent-pending verification process to catch AI errors before they cause damage. Every verification is logged with a SHA-256 cryptographic hash, creating a tamper-proof audit trail.
Sources
- Scale AI / CAIS, “Humanity’s Last Exam,” Nature (2025). scale.com/leaderboard
- OpenAI, “o3 and o4-mini System Card,” April 2025.
- FutureHouse, “About 30% of HLE chemistry/biology answers are likely wrong,” July 2025. futurehouse.org
- Scale AI, “Calibration of o3 and o4-mini on HLE,” 2025. scale.com/blog
- Artificial Analysis, “HLE Benchmark Leaderboard,” February 2026. artificialanalysis.ai
- Let’s Data Science, “Humanity’s Last Exam: Why Top AI Models Fail,” February 2026.
Verify your AI outputs
Don’t trust AI grading AI. Run your outputs through deterministic verification.
Try the Demo