Open-Source Is 3-6 Months Behind Frontier — And That Gap Is the Whole Ballgame
TexTak's [open-source-frontier] forecast sits at 69% — up from 67% — and today's evidence is about as direct as we get in this space. Epoch AI's measured lag of 3-6 months between open-weight and closed frontier models isn't a vibe or a benchmark cherry-pick; it's a systematic measurement of the capability gap over time. DeepSeek V4 Pro matching GPT-5.5 and Claude Opus 4.7 on agentic benchmarks at 10-13x lower API cost isn't experimentation data — it's production-grade competitive pricing. GLM-4.7's reported 1.2% hallucination rate, achieved on a model trained entirely on Huawei Ascend silicon, is particularly notable because it breaks the implicit assumption that frontier-quality alignment requires frontier-lab infrastructure. We hold 69% with real conviction, but the counterargument this week got sharper, not weaker.
Let's be precise about what our 69% actually claims, because 'open-source matches closed frontier' is doing a lot of work. Our forecast resolution requires parity on capability benchmarks between a leading open-weight model and the best available closed model at time of resolution. It does not require parity on post-training alignment, UX polish, enterprise support, or safety infrastructure. That distinction matters enormously for how we interpret this week's evidence — and for how honest we're being about what we're actually tracking.
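To make that resolution bar concrete, here is a minimal Python sketch of the parity check as we've just described it. The model names, scores, and the one-point tolerance are hypothetical placeholders, not our actual resolution spec; the point is what appears in the check (capability benchmarks) and what deliberately does not (alignment, UX polish, enterprise support).

```python
# Sketch of the resolution criterion. All names, scores, and the
# tolerance below are hypothetical illustrations, not the real spec.

def resolves_yes(open_scores: dict, closed_scores: dict,
                 tolerance: float = 0.01) -> bool:
    """Parity check: the leading open-weight model must match the best
    closed model on every tracked capability benchmark, within a small
    tolerance. Alignment, UX, and support intentionally do not appear."""
    return all(
        open_scores[bench] >= closed_scores[bench] - tolerance
        for bench in closed_scores
    )

# Hypothetical benchmark scores for illustration:
open_model = {"MMLU": 0.91, "MATH": 0.87, "agentic_suite": 0.78}
closed_model = {"MMLU": 0.92, "MATH": 0.88, "agentic_suite": 0.79}

print(resolves_yes(open_model, closed_model))  # True: within the 1-point tolerance
```

The design choice worth noticing is the asymmetry: the open model only has to reach the closed model's score minus a tolerance, because the question asks about parity, not dominance.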
The Epoch AI finding is the strongest single data point we've seen on this forecast. A 3-6 month trailing gap, measured systematically across models over time, is direct evidence that the gap is structural and shrinking — not just anecdotal convergence. The DeepSeek benchmark numbers are circumstantial but directionally consistent: if a model priced 10-13x below Claude Opus 4.7 is matching it on agentic tasks, the capability differential has compressed to the point where cost, not performance, is the primary variable. That's essentially what parity looks like in a commercial context.
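A back-of-envelope sketch of why that compression matters commercially. The dollar figures and benchmark scores below are assumed for illustration; only the 10-13x price ratio comes from the reporting above.

```python
# Hypothetical prices and scores; only the 10-13x ratio is from the report.

closed_price = 15.00               # $ per million output tokens (assumed)
open_price = closed_price / 11.5   # midpoint of the reported 10-13x gap

closed_score, open_score = 0.79, 0.78  # near-parity agentic scores (assumed)

# Capability per dollar: at near-parity on capability, the price
# ratio dominates the comparison almost one-for-one.
print(closed_score / closed_price)  # ~0.053 score-points per $
print(open_score / open_price)      # ~0.598 score-points per $, ~11x better
```

At a 1-point capability deficit and an 11x price advantage, the open model wins the capability-per-dollar comparison by roughly the full price ratio. That is the arithmetic behind "cost, not performance, is the primary variable."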
Here's what keeps us up at night on this forecast, and we want to name it clearly rather than bury it: Anthropic's Mythos model. The NSA is reportedly evaluating it for cybersecurity vulnerability detection, and David Sacks described it as 'first-generation automated cyber task capability.' We don't know what Mythos can do relative to anything publicly available, but the fact that it's classified-deployment-ready and being assessed for novel capability suggests the frontier labs aren't standing still while open-source closes the gap. If Mythos and whatever OpenAI has in reserve represent genuine step-changes — not incremental improvements — then the 3-6 month lag figure may be measuring yesterday's gap, not tomorrow's. The open-source community is chasing a moving target, and the target just got harder to see.
What would move us above 80%: A publicly benchmarked open-weight model that matches or beats the then-current best closed model on MMLU, MATH, and a representative agentic task suite, with the result replicated by an independent evaluator — within the next 12 months. What would drop us below 55%: Evidence that Mythos or a comparable unreleased closed model represents a genuine architectural leap rather than incremental scaling, demonstrated by a capability gulf on tasks open models cannot approach. The 3-6 month lag figure is our primary anchor. The unreleased frontier is our primary uncertainty. We're holding 69% because the evidence on the ground is strong, but we're watching classified deployments more than benchmarks right now.
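For our own bookkeeping, the two triggers above reduce to a pair of explicit conditions. A minimal sketch, assuming hypothetical boolean inputs; the 0.80 and 0.55 levels and the 69% hold come from the text, everything else is illustrative.

```python
# Sketch of this week's update triggers. The inputs are hypothetical
# flags; the 0.80 / 0.55 / 0.69 levels are from the forecast writeup.

def next_forecast(current: float,
                  independent_open_parity: bool,
                  unreleased_step_change: bool) -> float:
    """Map the two named resolution-relevant events onto forecast moves."""
    if independent_open_parity:
        # Replicated open-weight parity on MMLU, MATH, and an agentic
        # suite within 12 months: move to at least 0.80.
        return max(current, 0.80)
    if unreleased_step_change:
        # A Mythos-class architectural leap demonstrated by a capability
        # gulf open models cannot approach: drop to at most 0.55.
        return min(current, 0.55)
    return current  # neither trigger fired: hold

print(next_forecast(0.69, False, False))  # 0.69 -- this week's position
```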