Reasoning models are approaching expert-level performance on professional exams. A top-1% bar exam score from a general-purpose model would mark a significant capability threshold.
True if a publicly available AI model achieves a score in the top 1% of human test-takers on the Uniform Bar Exam, as reported by the developer or by independent evaluation. Must be a general-purpose system, not one fine-tuned exclusively for legal tasks.
GPT-4 reportedly scored at the 90th percentile on the UBE in 2023
Reasoning models show step-change improvements on structured exams
Two years of capability advancement since GPT-4's bar performance
A top-1% score requires near-perfect performance
Bar exam includes subjectively graded essay components, complicating percentile comparisons
Benchmark contamination concerns: models may train on exam-adjacent data