Open-Source Models Have Closed the Gap. The Real Question Is What 'Parity' Actually Means.
textak currently places the open-source frontier parity forecast at 75% — up from 72% last month — and today's benchmark data from LLM Stats represents the strongest direct confirmation we've seen for this thesis. Llama 4, Qwen 3.7, and DeepSeek V3.2 now match or beat closed frontier models on multiple reasoning and coding benchmarks, with the capability gap shrinking from 18 months to roughly 6 months. But before we spike the ball, we need to be honest about what 'parity' means in this forecast — and what it doesn't.
The 75% reflects three converging signals we've been watching: Meta's sustained infrastructure investment in open-source, compute cost reductions that have hit roughly 100x over the past three years, and the training technique transfers that increasingly allow open-source labs to replicate what closed labs discovered. What moved us from 72% to 75% in the previous cycle was the Llama 4 release and early DeepSeek V3 data. Today's LLM Stats report confirming benchmark parity across reasoning and coding is direct evidence — it proves the thing is happening, not just that conditions exist for it to happen. That's a meaningful evidential distinction we try to hold ourselves to.
The counterargument we take most seriously is the one the current news actually makes stronger, not weaker. Anthropic's leaked 'Mythos' model — now confirmed enough to have drawn Trump administration export controls — represents exactly the kind of unreleased capability step-change our forecast model identified as the primary risk to the thesis. If Mythos represents a genuine discontinuity rather than incremental improvement, the gap we're measuring on public benchmarks may be measuring the wrong thing. Frontier labs have always held their best work back from public comparison; export controls on a model by name suggests it's not vaporware. This is the part of our thesis that keeps us up at night.
We're also holding a precise definition of 'parity' in this forecast: benchmark performance on publicly available reasoning and coding evaluations, not product parity, UX parity, or commercial impact parity. Those are different questions. DeepSeek accounting for 41% of Hugging Face downloads is circumstantial evidence of developer preference, not proof that enterprises are deploying open-source at the frontier. And the GPT-5.6 preview from OpenAI's Chief Scientist signals that the closed frontier isn't standing still — rapid model progression means the benchmark target keeps moving. Matching today's H100-era frontier while the closed labs are already on GPT-5.6 development cycles means the 6-month gap could widen again before it narrows.
What would move us above 85%: a third-party evaluation — SemiAnalysis-grade or equivalent — confirming that an open-source model matches a contemporaneous closed frontier model on enterprise production tasks, not just academic benchmarks, within the same 30-day window. What would drop us below 60%: verified Mythos-class performance data showing closed labs have regained an 18-month advantage on agentic or multi-step reasoning tasks that matter for commercial deployment. Right now, the benchmark data is real and the trend is real. We're staying at 75% because the Mythos variable is unquantified, not because we doubt the direction.