The gap between open and closed models has been narrowing.
True if an open-weights model scores within 2% of the leading closed model on MMLU, HumanEval, and GPQA.
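One way to make the resolution criterion above concrete is the minimal sketch below. It rests on assumptions not stated in the source: "within 2%" is read as 2 absolute percentage points, an open-weights model that beats the closed model counts as within the margin, and the function name `resolves_true` and all scores shown are hypothetical placeholders, not reported results.

```python
# Sketch of the resolution check, under the assumptions stated above.
BENCHMARKS = ("MMLU", "HumanEval", "GPQA")


def resolves_true(open_scores, closed_scores, tolerance=2.0):
    """Return True if the open-weights model trails the leading closed
    model by at most `tolerance` percentage points on every benchmark.

    Both arguments map benchmark name -> score in percent.
    ASSUMPTION: tolerance is in absolute points, not relative percent.
    """
    return all(
        closed_scores[name] - open_scores[name] <= tolerance
        for name in BENCHMARKS
    )


# Hypothetical scores for illustration only.
closed = {"MMLU": 89.5, "HumanEval": 91.0, "GPQA": 61.0}
open_near = {"MMLU": 88.0, "HumanEval": 90.0, "GPQA": 60.0}
open_far = {"MMLU": 88.0, "HumanEval": 90.0, "GPQA": 55.0}

print(resolves_true(open_near, closed))
print(resolves_true(open_far, closed))
```

Because the check uses `all`, a single benchmark outside the margin is enough to resolve the question as false, which matches the "and" in the criterion.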
Meta investing heavily in open-source
Training techniques closing gap
Compute costs dropping dramatically — 100x cost reduction verified
DeepSeek V4 potentially matching current frontier performance, with 90% on HumanEval
8-month gap between frontier and open-source models shrinking
Hardware advances making training accessible
Open-weight models now match GPT-4 and Claude on many benchmarks
Stanford AI Index 2025 Report confirms convergence: gap shrank from 17.5 pts to effectively zero on knowledge tasks by early 2026
Mistral Ministral 14B Reasoning rivaling models 5-10x its size — efficiency parity emerging alongside capability parity
Multiple open-source models (GLM-5.1, Kimi K2.5, DeepSeek V4 Pro) now match or exceed closed-source on MMLU, GPQA Diamond, code tasks
Frontier labs have unreleased capabilities, such as Anthropic's leaked 'Mythos', that represent step-change improvements
Frontier labs have data advantages
Post-training techniques closely held
Benchmark parity ≠ real-world parity — Stanford AI Index confirms knowledge benchmark parity but real-world deployment gaps may persist
Closed labs can maintain advantages through undisclosed model development
Poor calibration suggests fundamental limitations in current open-source approaches
Anthropic's $900B valuation and $30B funding round signal ongoing frontier investment far beyond what open-source projects can match; the capability lead may be maintained through deployment and RLHF rather than architecture alone
Benchmark saturation (MMLU >88% for all frontier models) makes parity a lower bar than it was: matching scores on saturated benchmarks is less meaningful than it appears