Open-Source Frontier Parity Is Already Here — The Question Is Which Dimension You're Measuring
textak places the probability of open-source models matching closed frontier performance at 75%, up from 72% three months ago. Today's evidence is the strongest single-week confirmation we've seen for this thesis: five Chinese labs shipped frontier-tier models in four weeks, the open-source lag compressed from 12 months to roughly 3 months, and the benchmark gap on standard tests has narrowed to single digits. But the Anthropic Mythos 5 shutdown bears directly on the 'which dimension' question that sits at the center of this forecast.
Let's be precise about what we're forecasting, because today's evidence sharpens the definitional fault line. Our 75% reflects a specific claim: that an open-weight model will match a closed frontier model on capability benchmarks that the AI evaluation community treats as meaningful measures of general reasoning and instruction-following. The 72%→75% move was driven by three things: DeepSeek R1's January 2025 cascade triggering sustained open-source acceleration, the compression of the frontier lag from 12 months to 3 months documented by Artificial Analysis, and the MIT-licensed GLM-5.1 running on Huawei Ascend silicon, which isn't just a benchmark result but an architectural proof that the open ecosystem can execute independently of CUDA. The 75% does not yet account for the Humanity's Last Exam results, which we'll address below.
Today's benchmark data is worth examining carefully, because it cuts in two directions simultaneously. The Humanity's Last Exam (HLE) leaderboard, a contamination-resistant test designed to resist saturation, shows Claude Fable 5 leading at 53.3%, with a persistent 50-point gap between frontier models and human domain experts. On traditional benchmarks, GPQA Diamond is at 94.3% and MATH-500 at 96%, which tells us mostly that those tests are no longer useful for distinguishing between models. The HLE result is more important: even the best closed model scores only 53.3% on the hardest saturation-resistant benchmark. If an open-source model reaches 48-50% on HLE, it would sit within roughly 3 to 5 points of that leader, which is 'matching frontier performance' by any reasonable definition of the term. The 10-point gap Chinese models currently show on top-of-pyramid performance is not trivial, but it's also not disqualifying, and the trajectory matters.
The strongest counterargument is the one we've held since this forecast launched: post-training techniques and unreleased capabilities at frontier labs represent a moat that benchmarks may not capture. The Anthropic Mythos 5 shutdown is directly relevant here. The government's export-control directive cites a discovered jailbreak technique, not a capability gap — meaning Mythos 5 exists and was apparently deployable to customers before the shutdown. If Mythos 5 represents a step-change over Fable 5's 53.3% HLE score, then the 'closed frontier' benchmark we're targeting for parity may be moving faster than open-source can track. We genuinely don't know the Mythos 5 HLE score, and that uncertainty keeps us from moving above 80%. If Mythos 5 is a 60%+ HLE system, the 3-month lag estimate may already be obsolete.
What would move us above 85%: an open-weight model scoring within 5 points of Fable 5 on HLE, or within 3 points on LiveCodeBench. What would drop us below 65%: Mythos 5 or a comparable unreleased closed model demonstrating a 15+ point HLE advantage over the best open-weight model available within 30 days of its release. We're watching the HLE leaderboard specifically as the resolution-relevant signal for this forecast, not MMLU or GPQA Diamond, which are already saturated.
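The triggers above can be written down as a simple decision rule. The Python below is a minimal sketch rather than a resolution procedure we actually run: the `Evidence` container, its field names, and the `forecast_signal` function are hypothetical, and checking the downward trigger first is an assumption, since the forecast does not say which condition wins when both apply.

```python
from dataclasses import dataclass
from typing import Optional

FABLE_5_HLE = 53.3  # closed-frontier HLE score quoted in the text, in percent


@dataclass
class Evidence:
    """Hypothetical container for the inputs the triggers depend on."""
    best_open_hle: float                    # best open-weight HLE score, in percent
    open_lcb_gap: Optional[float] = None    # closed minus open on LiveCodeBench, in points
    new_closed_hle: Optional[float] = None  # HLE of Mythos 5 or a comparable model, if known


def forecast_signal(e: Evidence) -> str:
    """Return which trigger, if any, the evidence satisfies."""
    # Downward trigger: a new closed model leads the best open-weight model
    # by 15+ HLE points. Assumes best_open_hle already reflects the best
    # open-weight model available within 30 days of the closed release.
    if e.new_closed_hle is not None and e.new_closed_hle - e.best_open_hle >= 15:
        return "below_65"
    # Upward trigger: within 5 HLE points of Fable 5, or within 3 on LiveCodeBench.
    if FABLE_5_HLE - e.best_open_hle <= 5:
        return "above_85"
    if e.open_lcb_gap is not None and e.open_lcb_gap <= 3:
        return "above_85"
    return "hold"
```

With an open-weight model at 43% on HLE (roughly the 10-point gap described earlier) and no other evidence, `forecast_signal(Evidence(best_open_hle=43.0))` falls through to `"hold"`. The same model against a hypothetical 60% Mythos 5 score would trip the downward trigger.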