The Open-Source Gap Is Closing Faster Than We Thought — And Today's Evidence Makes It Hard to Argue Otherwise
textak places the probability of open-source matching closed frontier performance at 75% — and today's evidence strengthens that position materially. DeepSeek V4 Pro now leads open-source coding leaderboards at 80 points overall and 93.5 on LiveCodeBench, trailing closed frontier leaders by just 5-10 points on most benchmarks. Meanwhile, Meta's Watermelon model — still in training — reportedly matches GPT-5.5 performance. The gap isn't closing; in some dimensions, it's effectively closed.
We weight this forecast at 75% primarily because the directional evidence is unusually consistent across multiple independent signals. DeepSeek V4 Pro's 93.5 on LiveCodeBench isn't an outlier; it's the latest data point in a trend that has run for 18 months. The 5-10 point gap that remains is real, but it's categorically different from the 20-30 point gaps that characterized the open/closed divide two years ago. At 5-10 points, the gap is within the range that benchmark variance and task-specific specialization can plausibly explain. Whether that constitutes 'parity' depends entirely on how you define it, and we've been explicit that our forecast targets benchmark performance parity, not product parity, deployment quality, or fine-tuning ecosystem depth.
The Watermelon disclosure complicates the picture in an interesting way. Meta's Chief AI Officer describing GPT-5.5-class performance in a closed briefing is indirect evidence: it tells us Meta believes they've reached this threshold, not that independent evaluation has confirmed it. We're classifying this as circumstantial: consistent with our thesis, but not direct proof. What it does suggest is that Meta is willing to spend an order of magnitude more compute to close the gap, and that institutional commitment matters as much as architectural cleverness. The 100x compute reduction trend that helped drive our original thesis is now being complemented by a willingness on the open side to simply throw massive resources at the problem.
The counterargument we take most seriously isn't benchmark gaming; it's that frontier labs like Anthropic have genuinely unreleased capabilities. Fable 5's #2 BenchLM ranking and its 80.3% SWE-Bench Pro score, which outperforms all other frontier models by double digits, are a reminder that the closed frontier keeps moving. If Anthropic's Mythos-class architecture represents a genuine step-change, then DeepSeek closing to within 5-10 points of last year's frontier is less meaningful than it appears, because 'the frontier' has already moved. This is the part of our thesis we watch most carefully: parity with a moving target requires either faster advancement on the open side or a period where closed labs plateau. We haven't seen that plateau.
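The moving-target point can be made concrete with a back-of-the-envelope sketch. It assumes, purely for illustration, that both sides improve at a constant number of benchmark points per quarter; the function name, the linear model, and the rates in the usage lines are hypothetical and are not estimates drawn from any of the results above.

```python
def quarters_to_parity(gap_points, open_gain_per_quarter,
                       closed_gain_per_quarter, tolerance=0.0):
    """Quarters until the open/closed gap shrinks to `tolerance` points.

    Assumes a simple linear model (an illustration, not a forecast method):
    each side gains a fixed number of benchmark points per quarter.
    Returns None when the open side is not gaining faster than the closed
    side, i.e. the gap never closes under this model.
    """
    if gap_points <= tolerance:
        return 0.0  # already at parity under the chosen tolerance
    closing_rate = open_gain_per_quarter - closed_gain_per_quarter
    if closing_rate <= 0:
        return None  # moving target: the frontier advances at least as fast
    return (gap_points - tolerance) / closing_rate


# Hypothetical rates: a 10-point gap, open gains 4 points/quarter, closed gains 2.
print(quarters_to_parity(10, 4, 2))  # 5.0
# If closed labs keep pace, the gap never closes under this model.
print(quarters_to_parity(10, 3, 3))  # None
```

The sketch only restates the argument in arithmetic: the gap shrinks only by the difference between the two rates of improvement, so a closed-lab plateau and faster open progress are interchangeable ways of making that difference positive.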
What would move us above 80%: a direct independent evaluation — MLPerf submission, SemiAnalysis-style system test, or equivalent — showing an open model within measurement error of the current closed leader on a multi-task suite. What would drop us below 65%: a Mythos-class or equivalent architectural breakthrough from two or more closed labs in the next two quarters that re-opens the gap to 20+ points before open labs can respond. We're currently watching Watermelon's release timeline and whether Meta submits to third-party evaluation or relies on internal benchmarks. Internal claims without external verification are where this forecast's resolution gets contested.
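One way to read "within measurement error on a multi-task suite" is as a check on the average per-task gap. The sketch below is a minimal illustration under that assumption; the task names, scores, and noise threshold are hypothetical placeholders, and a real resolution would depend on whatever methodology the independent evaluator publishes.

```python
from statistics import mean


def mean_gap(open_scores, closed_scores):
    """Average of (closed - open) over the tasks both models were scored on."""
    shared = sorted(set(open_scores) & set(closed_scores))
    if not shared:
        raise ValueError("no shared tasks to compare")
    return mean(closed_scores[task] - open_scores[task] for task in shared)


def within_measurement_error(open_scores, closed_scores, noise_points):
    """True when the average gap is no larger than the assumed noise level.

    A negative gap (open model ahead) also counts as within error.
    """
    return mean_gap(open_scores, closed_scores) <= noise_points


# Hypothetical scores on a two-task suite.
open_model = {"task_a": 90.0, "task_b": 80.0}
closed_model = {"task_a": 92.0, "task_b": 81.0}

print(mean_gap(open_model, closed_model))                       # 1.5
print(within_measurement_error(open_model, closed_model, 2.0))  # True
print(within_measurement_error(open_model, closed_model, 1.0))  # False
```

Averaging over shared tasks is a design choice, not the only defensible one: a stricter reading would require every individual task to fall within the noise threshold.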