50% on Humanity's Last Exam Is Impressive. It's Not Evidence the Bar Exam Top 1% Is Imminent.
TexTak holds our 'AI reasoning model scores in top 1% on bar exam without special training' forecast at 62%, and today's benchmark news is the kind of evidence that looks more confirmatory than it actually is. Frontier models hitting 50% on Humanity's Last Exam — up from 8.8% a year ago — is a genuine capability signal. The SSRN finding that o3-2025 leads all LLMs on bar exam accuracy, with two-thirds of tested models clearing the human average, is more directly relevant. But 'clearing the human average' and 'top 1%' are not the same threshold, and we need to be precise about what we're actually forecasting.
Our 62% is built on three pillars: GPT-4 was already at the 90th percentile in 2023, reasoning models have shown step-change improvements on structured exams since, and two years of capability advancement suggests the remaining nine percentile points (90th to 99th) are within reach for a general-purpose model. The Humanity's Last Exam data supports the general trajectory: 8.8% to 50% in twelve months is a pace of improvement that, if it continued on bar-relevant dimensions, would close the gap. Gemini Deep Think's IMO gold performance is further evidence that structured, multi-step reasoning under time constraints is no longer a reliable moat for human experts.
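One reason the nine-point framing undersells the difficulty: percentiles compress in the tail. Here's a minimal sketch, assuming purely for illustration that scaled bar scores are roughly normal, with made-up distribution parameters, of how much raw score those last nine percentile points cost relative to the forty that came before:

```python
from scipy.stats import norm

# Illustrative assumption: scaled bar exam totals ~ Normal(270, 15).
# These parameters are hypothetical, chosen only to show the tail's shape.
mean, sd = 270, 15

p50, p90, p99 = (norm.ppf(p, loc=mean, scale=sd) for p in (0.50, 0.90, 0.99))

# The 50th -> 90th climb and the 90th -> 99th climb cover similar raw-score
# distance, even though one spans 40 percentile points and the other only 9.
print(f"50th -> 90th: +{p90 - p50:.1f} scaled points over 40 percentile points")
print(f"90th -> 99th: +{p99 - p90:.1f} scaled points over 9 percentile points")
```

Under these toy parameters, each remaining percentile point costs roughly three to four times as much raw score as the ones GPT-4 already climbed, which is why "within reach" is doing real work in the pillar above.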
Here's where we have to be honest about the limits of today's evidence. The SSRN study shows o3-2025 leading on bar exam accuracy, but doesn't report where it falls in the percentile distribution of human test-takers — it reports performance relative to the human average, not relative to the top 1% threshold. Top 1% on the bar exam requires near-perfect performance on multiple-choice components AND strong essay writing. The essay component is the real variable. Reasoning models that excel at structured problem-solving don't automatically transfer that performance to nuanced statutory interpretation written in prose under exam conditions. We haven't seen rigorous evidence that the essay gap has closed.
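To put numbers on the essay dependency: the UBE weights the MBE at 50% of the total score and the written MEE and MPT components at the other 50%, each half scaled to 200 points. A short sketch, using a hypothetical 330/400 top-1% cutoff (a placeholder; the actual cutoff varies by administration and isn't reported in the SSRN study), shows how little slack the written half leaves even when the multiple-choice half is nearly perfect:

```python
# Hypothetical 99th-percentile UBE total; the real cutoff is not reported
# in the SSRN study and varies by jurisdiction and administration.
TOP_1_PCT_TOTAL = 330  # out of 400

def required_written(mbe_scaled: float, target_total: float = TOP_1_PCT_TOTAL) -> float:
    """Written score (MEE + MPT, scaled to 200) needed to reach the target.

    The UBE weights the MBE at 50% of the total and the written components
    (MEE 30%, MPT 20%) at the other 50%, each half scaled to 200 points.
    """
    return target_total - mbe_scaled

for mbe in (160, 170, 180):
    print(f"MBE {mbe}/200 -> written half must score {required_written(mbe):.0f}/200")
```

Even at an MBE of 180, a score very few human test-takers reach, the written half still needs 150/200 under this toy cutoff, well above the human written average. That is exactly the component where the evidence is thinnest.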
The benchmark gaming concern also deserves more weight than we've given it. The bar exam is a known benchmark. Models trained on legal corpora will have encountered bar-adjacent material at scale. The SSRN paper tests 52 models and finds broad outperformance of human averages — which raises the question of whether the benchmark is measuring genuine legal reasoning capability or pattern-matching on a well-documented exam format. Humanity's Last Exam was specifically designed to resist this problem; the bar exam was not.
What would move us above 75%: a verified, independently administered bar exam result (not a benchmark dataset) showing a general-purpose model scoring above the 99th percentile on the full exam including essays, with methodology that addresses data contamination. What would drop us below 50%: plateau behavior on bar-format essay tasks specifically across the next two major reasoning model releases. We're holding 62%: confident in the direction, honest that the remaining gap is harder than the trajectory implies.
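For readers who want the update mechanics explicit, here's the odds-form arithmetic behind those tripwires. The likelihood ratios are hypothetical; this is a sketch of how far 62% has to move, not a statement of our actual model:

```python
def update(prior: float, likelihood_ratio: float) -> float:
    """Bayesian update in odds form: posterior odds = LR * prior odds."""
    prior_odds = prior / (1 - prior)
    post_odds = likelihood_ratio * prior_odds
    return post_odds / (1 + post_odds)

def required_lr(prior: float, target: float) -> float:
    """Likelihood ratio needed to move prior to target."""
    return (target / (1 - target)) / (prior / (1 - prior))

PRIOR = 0.62

print(f"LR needed to reach 75%: {required_lr(PRIOR, 0.75):.2f}")  # ~1.84
print(f"LR needed to fall to 50%: {required_lr(PRIOR, 0.50):.2f}")  # ~0.61

# Hypothetical: a verified 99th-percentile full-exam result carrying an
# LR of 3 in favor would land the forecast around 83%.
print(f"Posterior after LR=3 evidence: {update(PRIOR, 3):.0%}")
```

Note the mild asymmetry: crossing 75% takes evidence with an LR of about 1.8 in favor, while falling to 50% takes only about 1.6 against, so the downside tripwire needs slightly weaker evidence than the upside one.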