Open-Source Has Caught the Frontier. The Question Now Is Whether That Matters.
textak places the probability of an open-source model matching closed frontier performance at 75%, up from 72% last month, and today's news makes us think that number may still be conservative. Qwen 3.6 and DeepSeek V3.2 Special are not merely approaching frontier scores; they are beating closed models on specific tasks, with DeepSeek V3.2 Special scoring 96% on AIME against GPT-5's 94.6%. But before we declare the forecast resolved, we need to be honest about what 'parity' actually means, because today's evidence raises that question more than it answers it.
The coding and math benchmark data is about as direct as evidence gets for this forecast. Open-weight models from Chinese and European labs are not trailing by a few points in a race they are slowly closing; as of June 2026 they lead on SWE-bench, LiveCodeBench, and AIME. By independent analysis, the gap the thesis identified has narrowed to roughly three months on the capability curve. We weight this heavily because coding is the benchmark domain closest to enterprise production value, so leadership there is not a cherry-picked metric.
Here's where we have to be intellectually honest about the forecast definition, though. The [open-source-frontier] forecast is about 'matching closed frontier performance', and that phrase is doing a lot of work. The benchmark saturation data cuts both ways: if MMLU is hitting 88%+ ceilings and MATH-500 is at 96%, then 'parity on benchmarks' may be a less meaningful claim than it was 18 months ago. The real question is whether open-source models achieve parity on the dynamic evaluations that actually differentiate frontier capability now: Humanity's Last Exam, METR long-horizon tasks, and real-world agentic execution. On those, the evidence is thinner. The 37% gap between lab performance and deployment results on agentic tasks applies to all models, open and closed, but closed labs have stronger post-training pipelines and closely held RLHF techniques that are not released alongside open weights.
The Apple-Gemini announcement also complicates the picture. Apple's multi-model Extensions framework, which lets users choose between Claude, ChatGPT, and Gemini, signals that the enterprise direction of travel may not be 'one frontier model wins' but a portfolio architecture in which open-source models fill specific slots based on cost and task fit. If that's right, the forecast may resolve YES on benchmark grounds while the commercial and deployment reality looks like fragmented coexistence rather than a decisive open-source breakthrough. Those are not the same outcome, and our forecast definition should acknowledge the difference.
What would move us above 80%? Evidence that a named open-weight model achieves leading scores on METR's agentic evaluation suite or Humanity's Last Exam, not just on coding benchmarks, which now have known saturation issues. What would drop us below 60%? Anthropic releasing 'Mythos' or an equivalent model with a step-change on reasoning tasks that open-source models cannot replicate within six months. That is the scenario where frontier labs sprint ahead faster than the open ecosystem can follow.
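
To make those thresholds concrete, here is a minimal sketch of how strong the evidence would need to be. It assumes the forecast is updated in Bayesian odds form, which is an illustrative assumption and not a description of how the number is actually produced. The function names are hypothetical; the probabilities are the ones quoted in this piece.

```python
def odds(p: float) -> float:
    """Convert a probability to odds in favour."""
    return p / (1.0 - p)


def implied_likelihood_ratio(prior: float, posterior: float) -> float:
    """Likelihood ratio that moves `prior` to `posterior`,
    assuming a plain Bayesian odds update (an assumption of this sketch)."""
    return odds(posterior) / odds(prior)


# Probabilities quoted in the text.
last_month = 0.72
current = 0.75
upper_threshold = 0.80
lower_threshold = 0.60

# Last month's move from 72% to 75% implies a ratio of roughly 1.17.
print(f"72% -> 75%: {implied_likelihood_ratio(last_month, current):.2f}")

# Crossing 80% from 75% needs evidence about 1.33 times likelier under YES.
print(f"75% -> 80%: {implied_likelihood_ratio(current, upper_threshold):.2f}")

# Dropping to 60% needs evidence about twice as likely under NO (ratio 0.50).
print(f"75% -> 60%: {implied_likelihood_ratio(current, lower_threshold):.2f}")
```

Read this way, the downside trigger is the more demanding one: falling to 60% takes evidence about twice as likely under a NO resolution, while reaching 80% takes evidence only about a third more likely under YES.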