Open-Source Has Closed the Gap — But 'Parity' Depends on Which Gap You're Measuring
TexTak holds open-source frontier parity at 69% — and today's evidence is the strongest single-day signal we've seen on this forecast. DeepSeek-V3.2 and Qwen 3.6 Plus now sit within single digits of closed frontier models on coding benchmarks, MiniMax M2.5 matches Claude Opus 4.6 on SWE-Bench Verified almost exactly (80.2% vs. 80.8%), and the AISI reports a proprietary lead of only 4-8 months. If you define parity as 'a general-purpose open-weight model that a developer can deploy today and get within rounding error of the closed frontier on measurable tasks,' you could argue that threshold has already been crossed. But we're holding at 69% rather than declaring resolution, and the reason matters.
The evidence today is as direct as we get on this forecast. MiniMax M2.5 matching Claude Opus 4.6 within 0.6 percentage points on SWE-Bench Verified is not a 'conditions are forming' data point — that's a head-to-head comparison on a production-relevant coding benchmark where the open-weight model is functionally indistinguishable from the closed one. The AISI's 4-8 month lag estimate, combined with MoE architecture convergence across every major open-source release, suggests the structural dynamics driving the gap have largely played out. We weight this evidence heavily because benchmark parity on coding and reasoning — not abstract capability claims — is the domain where enterprises make deployment decisions.
But here's the counterargument we take seriously, and today's news actually sharpens it rather than dissolving it: Claude Mythos Preview scoring 94.6% on GPQA Diamond is a genuine step-change data point that sits outside what open-source can currently reach. Mythos is not shipping as a generally available product yet — it's a preview — but it represents exactly the kind of unreleased capability that our forecast already flagged as the primary bear case. If the frontier labs are maintaining a 'Mythos-class' tier that open-source tracks with a 4-8 month lag, and those labs are accelerating their release cadence, the lag could be self-renewing rather than self-closing. The 69% reflects that tension: open-source has achieved parity on the last generation of frontier benchmarks precisely as the frontier has moved.
The forecast-defining question is which dimension of parity we're measuring. Benchmark parity on coding? Probably already met. Developer preference in enterprise? Our read is yes for cost-sensitive mid-market applications. Commercial impact — meaning a major enterprise customer switches from a closed API to an open-weight deployment and publicly reports comparable ROI? That's the dimension where we lack direct evidence. We're weighting the 69% primarily on benchmark and developer-adoption signals, discounted because the commercial-impact evidence remains circumstantial. The MoE training cost figure ($5.6M for DeepSeek V3) is the most structurally important number in today's news for this forecast — it means the cost barrier to building competitive open-weight models has effectively collapsed, removing the capital moat that previously sustained the gap.
What would move us above 80%: a second open-source model independently achieving SWE-Bench or MMLU parity with whatever closed model is current at time of measurement, combined with a Fortune 500 public deployment announcement citing open-source as the chosen infrastructure. What would drop us below 55%: Mythos launching publicly and opening a gap of more than 15 percentage points on GPQA Diamond that persists for two quarters — that would suggest the frontier has structurally re-separated rather than merely taken a temporary lead.