TexTak Editorial · AI · 5 min read

Open-Source Has Closed the Frontier Gap — And That Changes the Enterprise Calculus Permanently

TexTak forecasts a 72% probability that open-source models will match closed frontier performance — and this week's evidence is the strongest single-week confirmation we've seen since we opened this forecast. DeepSeek V4 Pro achieves 83.7% on SWE-Bench Verified and 99.4% on AIME 2026 while remaining MIT-licensed and self-hostable at near-zero marginal cost. MiniMax M2.5 hits 80.2% on SWE-Bench, matching Claude Opus 4.6. Four Chinese open-weight models reached frontier-adjacent coding and reasoning performance within a 12-day window in May 2026. The question is no longer whether the gap will close — it's whether we defined the gap correctly in the first place.

Saturday, May 16, 2026 at 7:18 PM

Our 72% reflects three compounding signals: benchmark convergence on hard tasks, the cost collapse (GPT-4-level capability now available at under $1 per million tokens), and the acceleration of releases from labs outside the US frontier tier. What moved us from 69% to 72% earlier was the trajectory of Meta's open-source investment and compute democratization. What keeps us from going higher is the counterargument we've always taken seriously: benchmark parity and product parity are different things, and frontier labs hold unreleased capabilities that aren't visible on any leaderboard.
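We do not publish our aggregation method in closed form, so treat the following as a minimal sketch of how three compounding signals can move a forecast three points: it assumes independent evidence combined additively in log-odds space, and the per-signal weights are hypothetical, chosen only to reproduce the published 69% to 72% move.

```python
import math

def log_odds(p: float) -> float:
    """Convert a probability to log-odds."""
    return math.log(p / (1 - p))

def prob(lo: float) -> float:
    """Convert log-odds back to a probability."""
    return 1 / (1 + math.exp(-lo))

# Per-signal evidence weights in log-odds units. The editorial names
# these signals but not their magnitudes; the values are hypothetical,
# picked so the total reproduces the published update.
signals = {
    "benchmark convergence on hard tasks": 0.060,
    "cost collapse (GPT-4-level under $1/M tokens)": 0.050,
    "release acceleration outside the US frontier tier": 0.034,
}

prior = 0.69
posterior = prob(log_odds(prior) + sum(signals.values()))
print(f"{prior:.0%} -> {posterior:.0%}")  # 69% -> 72%
```

The point of the sketch is structural: no single signal is decisive, but small updates compound, which is exactly the shape of the argument above.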

This week, that counterargument partially materialized. Claude Mythos Preview leads GPQA Diamond at 94.6%, and GPT-5 claims 100% on AIME 2026 — numbers that represent genuine frontier capability advantages for closed models. But here's what matters for this forecast: Anthropic's own policy paper treats Chinese open-weight model distillation as an existential competitive threat. When a frontier lab publicly lobbies for export controls to prevent capability transfer, it is confirming that open-weight models are close enough to be dangerous. That's not our inference — it's Anthropic's stated position.

We need to be precise about what 'parity' means in this forecast, because the evidence cuts differently across dimensions. On coding benchmarks: the gap is closed. On graduate-level science reasoning (GPQA Diamond): Mythos holds a meaningful lead at 94.6%. On agentic task completion: the Chinese cluster — Kimi K2.6 completing 12-hour continuous tool-use traces, DeepSeek V4 Pro on SWE-Bench — is competitive. On cost and deployability: open-source wins decisively. Our forecast targets benchmark performance parity on at least two major coding and reasoning benchmarks simultaneously, with open-weight models accessible for production deployment. That threshold appears to have been crossed this week on coding; science reasoning remains a genuine gap.
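That resolution criterion is concrete enough to encode. A hedged sketch follows, using this week's published scores: the 2-point parity band and the at_parity helper are our assumptions (the forecast text does not state a tolerance), and the GPQA Diamond open-weight entry is left empty because no open-source score is cited for it.

```python
# Scores (%) cited in this piece. GPQA Diamond has no published
# open-weight score here, so it is treated as unmet.
benchmarks = {
    "SWE-Bench Verified": {"open": 83.7, "closed": 80.2},   # DeepSeek V4 Pro vs. Claude Opus 4.6
    "AIME 2026":          {"open": 99.4, "closed": 100.0},  # DeepSeek V4 Pro vs. GPT-5's claim
    "GPQA Diamond":       {"open": None, "closed": 94.6},   # Claude Mythos Preview leads
}

PARITY_BAND = 2.0  # percentage points; our assumption, not stated in the forecast

def at_parity(scores: dict) -> bool:
    """Open weights count as parity when within PARITY_BAND of the closed best."""
    return scores["open"] is not None and scores["open"] >= scores["closed"] - PARITY_BAND

matched = [name for name, s in benchmarks.items() if at_parity(s)]
deployable = True  # MIT-licensed and self-hostable, per the scores above

resolves = len(matched) >= 2 and deployable
print(matched, resolves)  # ['SWE-Bench Verified', 'AIME 2026'] True
```

Under these assumptions the criterion resolves on coding and competition math, while the unmet GPQA entry keeps the science-reasoning gap visible, matching the reading above.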

The strongest counter we hold against our own position: the software stack problem. Open alternatives to CUDA's ecosystem run 18+ months behind hardware capability, and the problem applies doubly to Chinese open-weight models, where CUDA compatibility is not guaranteed. A model that achieves 83.7% on SWE-Bench but requires three weeks of DevOps work to deploy in a regulated enterprise environment is not actually at parity for production purposes. We're watching whether the developer ecosystem around DeepSeek V4 and the Chinese cluster closes this integration gap — if it does by Q3, we'd consider moving this forecast above 80%. If the software stack gap persists and Mythos-level science reasoning remains exclusive to closed models, we'd hold at current levels or trim back toward 68%.
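For readers tracking how we will update, the two triggers in that paragraph reduce to a short decision function. A sketch, with the thresholds taken from the text; the final branch covers mixed outcomes the text does not address and is our inference:

```python
def next_move(stack_gap_closed_by_q3: bool, science_gap_persists: bool) -> str:
    """Encode the stated update triggers for the 72% forecast."""
    if stack_gap_closed_by_q3:
        return "consider moving above 80%"
    if science_gap_persists:  # the stack gap also persists on this branch
        return "hold at 72% or trim toward 68%"
    return "hold at 72%"  # mixed outcome: no stated trigger fires

print(next_move(stack_gap_closed_by_q3=False, science_gap_persists=True))
# -> hold at 72% or trim toward 68%
```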
