Reasoning models are approaching expert-level performance on professional exams. A top-1% bar exam score from a general-purpose model would mark a significant capability threshold.
True if a publicly available AI model achieves a score in the top 1% of human test-takers on the Uniform Bar Exam, as reported by the developer or by an independent evaluation. The model must be a general-purpose system, not one fine-tuned exclusively for legal tasks.
GPT-4 scored around the 90th percentile on a simulated Uniform Bar Exam in 2023, as reported by OpenAI
Reasoning models show step-change improvements on structured exams
Two years of capability advancement since GPT-4's bar exam result
As of 2026, chain-of-thought reasoning models consistently occupy the top ranks of benchmark leaderboards
o3 achieved 45.1% on ARC-AGI and set new standards across math, coding, and science benchmarks
SubQ's 12M-token architecture and Gemini 3.5 Flash's frontier-level performance indicate continued capability expansion
Top 1% requires near-perfect performance, a much higher bar than the 90th percentile: the share of test-takers allowed to outscore the model shrinks from 10% to 1%
The bar exam includes subjectively graded essay components that benchmark dominance does not address
Benchmark gaming concerns: models may have been trained on exam-adjacent data
Benchmark saturation is now confirmed: MMLU gaps between frontier models are within measurement noise, so standard capability metrics can no longer distinguish them. A top-1% bar exam score requires a genuine edge, not average frontier performance
Rapid model churn (255 releases in Q1) creates verification timing risk: which model, tested when, and by whom?
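
To put a rough number on the gap between the 90th percentile and the top 1% raised in the arguments above, the sketch below converts both thresholds to z-scores. It assumes, purely for illustration, that scaled exam scores are approximately normally distributed; the real Uniform Bar Exam distribution, its mean, and its spread are not given here, so the output is in standard-deviation units rather than exam points. The function names are hypothetical.

```python
from statistics import NormalDist


def z_for_percentile(percentile: float) -> float:
    """z-score at a percentile (0-100) of a standard normal distribution.

    ASSUMPTION: exam scores are roughly normal. This is an illustrative
    simplification, not a property taken from the source.
    """
    return NormalDist().inv_cdf(percentile / 100.0)


def tail_share(percentile: float) -> float:
    """Fraction of test-takers scoring above the given percentile."""
    return 1.0 - percentile / 100.0


z90 = z_for_percentile(90.0)  # threshold for the 90th percentile
z99 = z_for_percentile(99.0)  # threshold for the top 1%

print(f"90th percentile: {z90:.2f} SD above the mean")
print(f"99th percentile: {z99:.2f} SD above the mean")
print(f"Extra margin needed: {z99 - z90:.2f} SD")
print(f"Share of humans still ahead shrinks by a factor of "
      f"{tail_share(90.0) / tail_share(99.0):.0f}")
```

Under the normality assumption, moving from the 90th percentile to the top 1% means clearing a threshold roughly one additional standard deviation higher. How many exam points that represents depends on the actual score distribution, which this sketch does not model.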