Analysis, research, and commentary on AI verification, hallucination prevention, and deterministic compliance.
The Humanity's Last Exam benchmark claims to test what AI can't do. But when the graders are also AI, what exactly is being measured?