Open-Source Has Caught the Frontier. Now Prove It Matters Beyond Benchmarks.
textak's open-source frontier parity forecast sits at 75%, up from 72%, and today's Epoch AI data makes that number look conservative on one dimension and genuinely uncertain on another. Open-weight models have closed the benchmark gap to effectively zero on knowledge tasks and to single digits on reasoning. But our forecast was never really about benchmarks. It was about whether open-source closes the gap that actually matters: production capability, post-training quality, and real deployment impact. The benchmark story is largely settled. The harder question is only now being tested.
Let's be precise about what today's evidence proves and what it doesn't. DeepSeek V4 Pro, Kimi K2.5, GLM-5, and Qwen 3.5 are now matching or beating closed-frontier models on MMLU, on tasks adjacent to GPQA Diamond, and on coding benchmarks. The capability gap that stood at 17.5 percentage points at the end of 2023 is at or near zero on knowledge benchmarks, and the lag to parity on reasoning has compressed from 14 months in 2024 to roughly 7 months as of May 2026. This is direct evidence that our forecast's core thesis, that training techniques and compute cost reductions are closing the gap, has played out faster than most expected. The 75% probability reflects this evidence, weighed against one remaining structural concern.
That concern is what kept us below 80%, and it's worth naming clearly: today's Anthropic Mythos Preview news is exactly the kind of data that keeps us honest. Mythos is scoring 94.6% on GPQA Diamond, setting a new ceiling, just as traditional benchmarks saturate above 90%. This points to a dynamic where open-source closes the gap on last quarter's frontier while closed labs move to harder evaluations where the gap reopens. The forecast question becomes a moving-target problem. If 'matches closed frontier performance' means the frontier as of today, open-source is there. If it means the frontier as of the resolution date, the race resets each quarter.
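The two readings of 'matches closed frontier performance' can be made concrete with a small sketch. Everything here is an assumption for illustration: the function names, the one-point tolerance, and above all the scores, which are invented placeholders rather than real benchmark results.

```python
# Hypothetical best scores on a single benchmark, keyed by quarter (0 = today).
# These values are placeholders for illustration only, NOT real results.
closed_best = {0: 90.0, 1: 93.0, 2: 95.0}
open_best = {0: 84.0, 1: 90.5, 2: 93.5}


def parity_vs_fixed_frontier(open_scores, closed_scores, snapshot, at, tol=1.0):
    """Reading 1: compare open models at quarter `at` against the closed
    frontier frozen at quarter `snapshot`. `tol` is an assumed tolerance."""
    return open_scores[at] >= closed_scores[snapshot] - tol


def parity_vs_moving_frontier(open_scores, closed_scores, at, tol=1.0):
    """Reading 2: compare open models against the closed frontier
    as it stands in the same quarter `at`."""
    return open_scores[at] >= closed_scores[at] - tol


for quarter in (1, 2):
    fixed = parity_vs_fixed_frontier(open_best, closed_best, snapshot=0, at=quarter)
    moving = parity_vs_moving_frontier(open_best, closed_best, at=quarter)
    print(f"quarter {quarter}: fixed-frontier parity={fixed}, moving-frontier parity={moving}")
```

With these placeholder numbers, open models reach parity with the frozen frontier one quarter later, yet never reach the moving one. That is the resolution ambiguity in miniature.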
The strongest counterargument remains the one the 'AGAINST' camp has always made: frontier labs hold closely guarded post-training techniques, RLHF recipe variations, and unreleased capabilities. The benchmark convergence story is real, but benchmark parity, developer UX parity, and commercial deployment parity are three different things. GLM-5's 92.7% AIME performance and SWE-Bench Verified scores are impressive, and they're direct evidence of narrowing. But Anthropic's export-control episode this week also reveals something: governments are treating frontier closed models as strategically sensitive in a way they are not treating Llama or DeepSeek. That difference in institutional treatment suggests the real gap at the frontier may be wider than public benchmarks indicate.
Our 75% reflects this: strong direct evidence of benchmark parity, proximate evidence of production capability gains, and residual uncertainty about whether closed labs have step-change unreleased capabilities. What would push us to 85%? A major enterprise deployment study showing open-weight models replacing closed API subscriptions at Fortune 500 scale with equivalent output quality. That would be production parity, not benchmark parity. What would drop us below 60%? If Anthropic's Mythos or an OpenAI model demonstrably clears 70% on Humanity's Last Exam while the best open-weight models stay below 50%, the gap has reopened in a way that matters. We're watching June's Humanity's Last Exam leaderboard closely.
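For readers who want the update triggers in one place, here is a minimal sketch of them as a decision rule. The thresholds (70% and 50% on Humanity's Last Exam) and the three forecast levels come from the paragraph above. The function name, the argument names, and the choice to check the downgrade condition first when both triggers fire are our assumptions for illustration, not a formal resolution procedure.

```python
# Thresholds taken from the text above.
CLOSED_HLE_TRIGGER = 70.0  # closed model clears this on Humanity's Last Exam
OPEN_HLE_CEILING = 50.0    # best open-weight model stays below this


def forecast_signal(closed_hle, open_hle, enterprise_parity_study=False):
    """Map observed evidence to the direction of the forecast update.

    closed_hle / open_hle: best HLE scores in percent (hypothetical inputs).
    enterprise_parity_study: True if a Fortune 500-scale study shows open-weight
    models replacing closed APIs at equivalent output quality.
    Assumption: the downgrade trigger takes precedence if both fire.
    """
    if closed_hle >= CLOSED_HLE_TRIGGER and open_hle < OPEN_HLE_CEILING:
        return "downgrade: below 60%"
    if enterprise_parity_study:
        return "upgrade: toward 85%"
    return "hold: 75%"


# Example call with made-up scores, purely to show the interface.
print(forecast_signal(closed_hle=72.0, open_hle=45.0))
```

The rule is deliberately coarse: it encodes only the two named triggers and leaves every intermediate case at the current 75%.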