DeepSeek V4's 0.3-Point MMLU Gap Is the Open-Source Frontier Story, Not the Whole Story
textak forecasts a 75% probability that an open-source model achieves verified parity with closed frontier performance — and today's benchmark data is the strongest single-day evidence package we've seen for that thesis. The MMLU gap between open and proprietary models collapsed from 17.5 points to 0.3 points in a single year. DeepSeek V4 Pro scores 80.6% on SWE-Bench Verified, outperforming GPT-5.5 and Gemini 3.1 Pro on multiple coding benchmarks. But here's where we have to be disciplined: benchmark convergence is not product parity, and today's evidence package actually contains both the strongest bull case and the strongest bear case for this forecast simultaneously.
Our 75% reflects three compounding forces: Meta's sustained open-source investment, training techniques that are genuinely closing architectural gaps, and compute cost compression that has made 100x cost reduction a verified data point rather than a projection. Today's news adds a fourth: the open-source community's ability to compete on frontier-adjacent benchmarks has reached a point where traditional differentiators like MMLU are no longer useful for separating labs. DeepSeek V4 Pro leading LiveCodeBench at 93.5% — ahead of all closed APIs including Claude Fable 5 — is direct evidence that open-source models can achieve domain-specific benchmark dominance even against models with safety-tuned Mythos-class architectures.
But we need to be precise about what 'parity' means for this forecast to resolve, because today's evidence also sharpens the counterargument in a way we have to take seriously. The frontier has migrated. Claude Fable 5 scores 80.3% on SWE-Bench Pro — a harder evaluation than SWE-Bench Verified — while DeepSeek V4 sits at 80.6% on Verified but hasn't published SWE-Bench Pro numbers. Gemini 3.5 Flash scores 76.2% on Terminal-Bench 2.1. The pattern is consistent: open-source models achieve parity or narrow leads on benchmarks the frontier recently dominated, while the frontier creates new harder benchmarks where meaningful gaps remain. Our forecast definition requires specifying which dimension of parity we're measuring — and that matters for the 75%.
Here's our working definition: 'parity' resolves when an open-source model achieves within-noise performance on the benchmark suite that the leading closed model uses as its primary public capability signal at the time of resolution. Under that definition, DeepSeek V4's acknowledgment that it trails closed-source frontier by 3-6 months is the most analytically useful data point in today's news. That gap is shrinking directionally, but it's a gap the company itself is quantifying. The 0.3-point MMLU collapse is a lagging indicator — it tells us where the frontier was, not where it is. The 37% gap between lab benchmarks and real-world agentic deployment is a different kind of warning: enterprise adoption benchmarks are diverging from lab benchmarks in ways that could make 'parity' a moving and contested target.
What would move us above 80%: an open-source model publishing SWE-Bench Pro scores within 2 points of the leading closed model within the next two model generations. What would drop us below 65%: evidence that Anthropic's Mythos-class capabilities — currently restricted to US government partners under Project Glasswing — represent a qualitative architecture gap rather than a quantitative training gap. The Glasswing restriction is the single most important unknown in our model right now. We're holding at 75% precisely because we can't price that tail.