textak Editorial | AI | 4 min

The Bar Exam Doesn't Lie: AI Reasoning Is Approaching Expert Ceilings — But 'Top 1%' Is a Different Climb

textak currently places the probability of a general-purpose AI reasoning model scoring in the top 1% of bar exam takers — without specialized training — at 63%. Today's news that OpenAI's o3-2025-04-16 scored highest among 52 models on the Uniform Bar Exam, with more than two-thirds of tested models clearing the human average, is the most direct confirmation we've had that the legal reasoning capability trajectory is real. But 'highest among AI models' and 'top 1% among human test-takers' are not the same claim, and we want to be precise about where the evidence actually lands.

Monday, June 15, 2026 at 9:18 PM

The 63% reflects three things: the established baseline from GPT-4 hitting the 90th percentile in 2023, the demonstrated step-change improvements reasoning models show on structured multi-step tasks, and roughly two years of capability compounding since that benchmark. We weight the reasoning model architecture heavily here because bar exam performance, particularly on the Multistate Bar Examination (MBE) component, rewards exactly the kind of chain-of-thought legal analysis that o3-class models were built to execute. The direction of travel is unambiguous.

What today's SSRN study doesn't tell us is where o3 sits in the actual human percentile distribution. 'Highest-performing among 52 AI models' is indirect evidence: it confirms capability leadership in the field and continued improvement, but it doesn't directly answer whether that translates to top-1% human performance. The top 1% of bar takers sit above the 90th percentile by a meaningful margin: we're talking near-perfect performance on the MBE and strong written performance, not just clearing a passing threshold. GPT-4 at the 90th percentile was impressive. Getting from 90th to 99th is a qualitatively different ask.
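To give a rough sense of how far apart the 90th and 99th percentiles are, here is a minimal sketch. It assumes human scores are approximately normally distributed, which is our illustrative assumption and not something the study reports. Because no real score distribution is available here, the sketch works in standard-deviation units and does not use actual UBE scores.

```python
from statistics import NormalDist

# Assumption: human scaled scores are roughly normal. The real UBE score
# distribution is not given in this article, so we measure distance in
# standard deviations (SD) instead of inventing score values.
std_normal = NormalDist(mu=0.0, sigma=1.0)


def z_for_percentile(percentile: float) -> float:
    """Return the z-score at a given percentile (0 < percentile < 100)."""
    return std_normal.inv_cdf(percentile / 100.0)


z90 = z_for_percentile(90)
z99 = z_for_percentile(99)

print(f"90th percentile: {z90:.2f} SD above the mean")
print(f"99th percentile: {z99:.2f} SD above the mean")
print(f"Extra distance from 90th to 99th: {z99 - z90:.2f} SD")
```

Under that assumption, the 90th percentile sits about 1.28 SD above the mean and the 99th about 2.33 SD, so the last nine percentile points cover roughly as much ground as the climb from the 60th percentile to the 90th.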

Honestly, the part of this thesis that keeps us up at night is the written component. The MBE is structured, scorable, and plays to AI strengths. The Multistate Performance Test and Multistate Essay Examination involve issue-spotting under ambiguity and written advocacy that are harder to benchmark cleanly. Benchmark-adjacent training data contamination is also a genuine concern: if top-1% performance shows up, the first legitimate question will be whether it generalizes or reflects proximity to exam-format training material. We'd want to see a blind administration with novel prompts before treating any result as definitive.

What would move us above 75%: a verified percentile score from an independent administration placing an o3-class or successor model above the 99th percentile on the full UBE, including written components, with methodology published. What would drop us below 50%: evidence that written performance components consistently plateau in the 85th to 90th percentile range even as MBE scores improve, which would suggest a structural ceiling on the holistic capability claim.
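One way to read these thresholds is in terms of how strong the new evidence would need to be. The sketch below assumes a simple odds-form Bayesian update. That framing is ours for illustration, not a description of how the 63% is actually produced, and the function names are hypothetical.

```python
def odds(probability: float) -> float:
    """Convert a probability into odds in favor."""
    return probability / (1.0 - probability)


def required_likelihood_ratio(prior: float, target: float) -> float:
    """Likelihood ratio needed to move a prior probability to a target."""
    return odds(target) / odds(prior)


def update(prior: float, likelihood_ratio: float) -> float:
    """Apply an odds-form Bayesian update and return the posterior."""
    posterior_odds = odds(prior) * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


prior = 0.63  # the current estimate stated in this editorial

lr_up = required_likelihood_ratio(prior, 0.75)
lr_down = required_likelihood_ratio(prior, 0.50)

print(f"To reach 75%: evidence about {lr_up:.2f}x likelier if the thesis holds")
print(f"To fall to 50%: likelihood ratio of about {lr_down:.2f}")
```

Under this reading, moving from 63% to 75% needs evidence roughly 1.76 times likelier if the thesis is true than if it is false, and falling to 50% needs a likelihood ratio of about 0.59.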
