The Open-Source Frontier Gap Is Closing — But 'Closed to Zero' Doesn't Mean What You Think It Means
textak holds the 'open-source model matches closed frontier performance' forecast at 75%. Today's Epoch AI analysis is the most significant data point we've seen on this question — but it's also a perfect illustration of why benchmark convergence and frontier parity are different claims. We're not moving to 85%. Here's the reasoning.
Let's start with what the Epoch AI data actually proves. Open-weight models — DeepSeek V4, GLM-5, Kimi K2.5, Qwen 3.5 — have closed the gap on knowledge benchmarks to effectively zero, and reduced reasoning task gaps to single digits. On coding tasks, multiple open models are now competitive. That's real. The 17.5-percentage-point gap that existed at end of 2023 is gone on the tests that measured it. A reader could reasonably ask: isn't that the forecast? Hasn't it resolved YES?
Here's why we don't think it has, and why the forecast definition matters. The central problem is that the benchmarks showing parity (MMLU, GPQA Diamond through the 90-94% range, coding benchmarks) have saturated. News item 11 is explicit: Claude Mythos Preview, GPT-5.5, and Gemini 3.1 Pro now score around 94-96% on GPQA Diamond, and so do leading open models. When every model clusters near ceiling, the test no longer discriminates frontier from near-frontier. That's not parity; that's a measurement instrument running out of range. The discriminating evaluations have shifted: Humanity's Last Exam, FrontierMath, SWE-Bench Verified at scale, real agentic task completions. On HLE specifically, the gap between closed frontier and open models is not zero. Human domain experts average 35-40%, and neither frontier closed models nor open models have publicly cleared that bar, so nothing on this test is near ceiling yet. And on the evaluations where we can currently measure differentiation, closed labs retain an edge. Our forecast target, 'matches closed frontier performance', cannot be treated as resolved when the goalposts have moved to a harder test on which parity has not been demonstrated.
So why 75% and not higher, given all the convergence evidence? Here's our approximate decomposition. We assign roughly 85% probability that benchmark parity on standardized knowledge and coding evaluations holds: the Epoch AI data is strong, directional, and consistent across multiple model families. We assign roughly 60% that an open model achieves parity on the current hardest discriminating evaluation (HLE-class or equivalent) within two more release cycles. This is the genuinely uncertain part, because closed labs are actively pushing the frontier faster on the hard tests. We assign roughly 70% that the post-training gap (RLHF quality, instruction-following reliability, safety tuning depth) remains non-structural rather than a permanent capability moat. Weighting those three components, 75% is where we land. The number isn't arbitrary, but it's also not precise to the percentage point; treat it as a range of 70-80%.
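For readers who want to check the arithmetic, here is a minimal sketch of how the three component estimates could combine. The text does not state the weights, so the 2:1:1 weighting below is purely an illustrative assumption (it happens to reproduce 75%), and the function names are hypothetical. The sketch also shows why the components should not be read as independent conditions that must all hold: multiplying them gives a much lower number.

```python
# Component probabilities as stated in the decomposition above.
components = {
    "benchmark_parity_holds": 0.85,
    "parity_on_hardest_eval": 0.60,
    "post_training_gap_non_structural": 0.70,
}


def weighted_average(probs, weights):
    """Weighted mean of the component probabilities."""
    return sum(p * w for p, w in zip(probs, weights)) / sum(weights)


def joint_if_independent(probs):
    """Product of the components, i.e. the chance all three hold if independent."""
    result = 1.0
    for p in probs:
        result *= p
    return result


values = list(components.values())

equal_weight = weighted_average(values, [1, 1, 1])
# ASSUMPTION: double weight on the benchmark-parity component.
# This weighting is illustrative only; the source does not specify one.
tilted = weighted_average(values, [2, 1, 1])
joint = joint_if_independent(values)

print(f"equal-weight average: {equal_weight:.1%}")
print(f"2:1:1 weighted average (assumed): {tilted:.1%}")
print(f"product if all three were required: {joint:.1%}")
```

Under these assumptions the equal-weight average falls inside the stated 70-80% range, while the product lands far below it, which suggests the headline number is a blend of the components rather than a conjunction of them.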
The counterargument we take most seriously isn't about benchmark scores at all. It's about production deployment quality: latency profiles, API reliability at scale, safety filtering, multimodal integration, and update cadence. These do not show up cleanly in academic benchmarks. DeepSeek V4's agentic performance on SWE-Bench is a production-relevant signal — one specific capability slice, not a full production quality assessment. Closed labs have systematic advantages in the infrastructure layer around the model that the Epoch AI analysis doesn't capture. If the forecast resolves on benchmark scores alone, 75% is defensible. If production deployment quality is in scope — and for enterprise adoption purposes, it arguably should be — the number is probably lower. We're treating this as primarily a benchmark-scope forecast, but we're naming that as a deliberate scope limitation, not a clean resolution criterion.
What moves us? To 85%+: a top open-weight model achieves HLE scores within 5 points of the leading closed model, published with contamination controls, in the next two evaluation cycles. To below 60%: Anthropic's Mythos or an equivalent closed model demonstrates a step-change on HLE that resets the gap above 15 points on the new discriminating benchmark. The 'three-month lag' framing from Epoch AI is useful but backward-looking — it measures how quickly open models have caught up to where closed models were, not whether they'll catch the frontier as it actively moves.
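The update triggers above can be written as a simple rule. This is a simplified sketch, not a resolution procedure: the function name and argument names are hypothetical, the gap is assumed to be the leading closed model's HLE score minus the top open-weight model's score in percentage points, and the timing conditions ("next two evaluation cycles", "step-change") are left out.

```python
def forecast_band(hle_gap_points: float, contamination_controlled: bool) -> str:
    """Map an observed HLE gap to the forecast band named in the text.

    hle_gap_points: closed-frontier score minus top open-weight score,
        in percentage points (assumed definition).
    contamination_controlled: whether the open-model result was published
        with contamination controls.
    """
    if hle_gap_points > 15:
        # Closed labs reset the gap on the discriminating benchmark.
        return "below 60%"
    if hle_gap_points <= 5 and contamination_controlled:
        # Open model within 5 points, with a credible evaluation.
        return "85%+"
    # Anything in between leaves the forecast where it is.
    return "hold near 75%"


print(forecast_band(4.0, True))    # within 5 points, controls in place
print(forecast_band(4.0, False))   # within 5 points, but no controls
print(forecast_band(18.0, True))   # gap reset above 15 points
```

Note that a small gap reported without contamination controls does not trigger the upward move in this sketch, which mirrors the condition stated in the text.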