textak
textak Editorial · AI · 4 min read

The Open-Source Frontier Gap Is Closing Faster Than We Thought — And We're Raising Our Probability

textak places the probability that an open-source model matches closed frontier performance at 75%, moved up from 72% last month. This week's benchmark data is the strongest direct evidence we've seen for that thesis: DeepSeek V4-Pro-Max and Gemini 3.1 Pro are now tied at 80.6% on SWE-Bench Verified. That's not a narrowing gap — that's a closed one, at least on the coding dimension. But before we declare victory on this forecast, we need to be precise about what 'parity' actually means, because today's news also includes a complication that cuts against the thesis.

Thursday, June 11, 2026 at 9:18 PM

Let's be exact about what the evidence proves. DeepSeek V4-Pro-Max tying Gemini 3.1 Pro at 80.6% on SWE-Bench Verified is direct evidence that open-source models have achieved benchmark parity on a coding-specific, contamination-aware evaluation. Three open-weight models — DeepSeek, Qwen 3, and MiniMax M3 — now sit within 0.2 points of each other and of Gemini 3.1 Pro on this benchmark. This isn't circumstantial. It's not 'conditions exist for parity.' It is parity, on this dimension, right now.

Here's where we have to be honest: our forecast target is 'matches closed frontier performance,' and that phrase is doing a lot of work. SWE-Bench parity is meaningful, but Claude Fable 5 (Anthropic's newly released Mythos-class model) just posted 80.3% on SWE-Bench Pro, a harder, less-contaminated benchmark. More importantly, GPT-5.5 and Gemini 3.1 Pro trail it at 58.6% and 54.2% on that same benchmark, gaps of 21.7 and 26.1 points. Gemini 3.1 Pro is the model the open-weight leaders just tied on SWE-Bench Verified, so if they track it on SWE-Bench Pro as well, there is a visible gap between the best closed model and the open-source frontier on that benchmark specifically. Our 75% probability reflects benchmark convergence on standard evaluations, the 100x verified cost reduction driving open-source investment, and Meta's sustained commitment to open weights. But the Fable 5 data is a legitimate signal that at least one frontier lab retains a meaningful capability lead on harder, newer evaluations, which are exactly the ones that matter most.
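To make the size of that lead concrete, here is a minimal Python sketch that computes each model's gap to the SWE-Bench Pro leader. It uses only the three scores quoted above; the function name `gaps_to_leader` is our own illustration, not part of any benchmark tooling, and no open-weight SWE-Bench Pro scores are included because none are cited here.

```python
# SWE-Bench Pro scores as quoted in this editorial (percent of tasks resolved).
swe_bench_pro = {
    "Claude Fable 5": 80.3,
    "GPT-5.5": 58.6,
    "Gemini 3.1 Pro": 54.2,
}


def gaps_to_leader(scores):
    """Return the top-scoring model and every other model's gap to it, in points."""
    leader = max(scores, key=scores.get)
    gaps = {
        name: round(scores[leader] - score, 1)
        for name, score in scores.items()
        if name != leader
    }
    return leader, gaps


leader, gaps = gaps_to_leader(swe_bench_pro)
for name, gap in gaps.items():
    print(f"{leader} leads {name} by {gap} points")
```

On the quoted figures this gives leads of 21.7 points over GPT-5.5 and 26.1 points over Gemini 3.1 Pro.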

The strongest counterargument to our thesis isn't 'frontier labs have unreleased capabilities.' It's that benchmark saturation creates a measurement illusion: when MMLU saturates at 88%+ for everyone, open and closed models look equal because the test ran out of signal, not because the underlying capability gap closed. The industry's shift to SWE-Bench Pro and Humanity's Last Exam is a direct response to this problem — and on those harder evals, the closed-model lead is real. We're weighting this as a partial counter. It challenges the 'parity already exists' reading of our forecast, but not the '75% by resolution date' reading, because the compute cost trajectory still favors open-source catching up on the harder benchmarks.

What would move us above 85%: an open-weight model posting within 5 points of the leading closed model on SWE-Bench Pro or Humanity's Last Exam within the next two quarters. What would drop us below 60%: Anthropic or OpenAI releasing a model that resets the benchmark ceiling by 15+ points before open-source can respond. Fable 5 suggests that scenario is still possible: it leads GPT-5.5, the next-best SWE-Bench Pro score cited above, by roughly 22 points, and no open-weight score cited here comes any closer.
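The two triggers above can be written as a simple decision rule. This is a rough sketch under stated assumptions, not textak's actual forecasting model: the function name, the label strings, and the choice to check the downgrade trigger first are all ours, and the two-quarter time window is not modeled.

```python
# Thresholds quoted in the editorial; everything else here is illustrative.
UPGRADE_MAX_GAP = 5.0       # open-weight model within 5 points of the closed leader
DOWNGRADE_MIN_JUMP = 15.0   # closed model resets the benchmark ceiling by 15+ points


def forecast_signal(open_best, closed_best, closed_ceiling_jump=0.0):
    """Classify which update trigger, if any, fires on a hard benchmark.

    open_best / closed_best: best open-weight and closed scores (percent).
    closed_ceiling_jump: points by which a new closed release raised the ceiling.
    Assumption: a ceiling reset takes priority over a small gap.
    """
    if closed_ceiling_jump >= DOWNGRADE_MIN_JUMP:
        return "drop_below_60"
    if closed_best - open_best <= UPGRADE_MAX_GAP:
        return "raise_above_85"
    return "hold"


# 80.3 is Fable 5's quoted SWE-Bench Pro score; 76.0 is a hypothetical
# open-weight score chosen only to show the upgrade trigger firing.
print(forecast_signal(open_best=76.0, closed_best=80.3))
```

With the hypothetical 76.0 the gap is 4.3 points, so the rule returns "raise_above_85". At GPT-5.5's quoted 58.6% the gap is 21.7 points and the rule returns "hold".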
