Open-Source Has Arrived at the Frontier — With One Honest Caveat
textak places the probability of open-source matching closed frontier performance at 75%, up from 72% last cycle. Today's benchmark data is the strongest direct confirmation we've published in a single news cycle: five independent open model families have closed a gap that measured 17.5 percentage points at end-2023 to effectively zero on MMLU, MATH-500, and GPQA Diamond. DeepSeek V4 Pro matching GPT-5.5 and Claude Opus 4.7 on agentic benchmarks at 10-13x lower API cost is not a pilot result — it's a structural market condition. But benchmark parity and product parity are different things, and we want to be precise about which one we're calling.
Let's state what today's evidence actually proves and what it doesn't. The benchmark convergence data is direct evidence that open-weight models have reached knowledge and reasoning parity with closed-frontier systems on the evaluations that dominated 2023-2025. That's the forecast target as originally scoped, and on that specific dimension, the evidence is unambiguous. Five independent families — DeepSeek, Qwen, Kimi, GLM, Mistral — reaching frontier quality simultaneously is not noise. It's a structural shift driven by training technique diffusion, compute cost compression (verified 100x reduction), and Meta's sustained open-source investment. The convergence is real.
Here's the honest caveat that keeps this at 75% rather than higher: benchmark saturation cuts against the evidential weight of this data in a way we need to name directly. The same news cycle that documents open-source convergence also documents that MMLU, GPQA Diamond, and MATH-500 are approaching evaluation ceilings — GPT-5.3 Codex at 93% on MMLU, GPQA Diamond at 94.3% industry-wide. When every model scores above 88% on MMLU, 'parity on MMLU' becomes a weaker claim than it was in 2023. The forecast target is benchmark convergence, and the benchmarks themselves have lost discriminative value. We're calling convergence on metrics that may no longer measure the capability frontier.
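One rough way to put a number on "lost discriminative value" is to ask how large a score gap must be before it clears sampling noise. The sketch below is illustrative only. It assumes each question is an independent pass/fail draw (a binomial model), treats the two models' scores as independent, and takes GPQA Diamond to have 198 questions, which is an assumption about the test-set size rather than a figure from today's data. The function name `min_detectable_gap` is hypothetical.

```python
import math


def min_detectable_gap(accuracy: float, n_questions: int, z: float = 1.96) -> float:
    """Smallest accuracy gap between two models that clears sampling noise.

    Assumes each question is an independent pass/fail draw (binomial model),
    both models score near `accuracy`, and their errors are independent.
    Returns the gap as a fraction (0.05 means 5 percentage points).
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    # Variance of one model's measured accuracy under the binomial model.
    variance = accuracy * (1.0 - accuracy) / n_questions
    # The difference of two independent scores has twice that variance.
    return z * math.sqrt(2.0 * variance)


# 94.3% is the GPQA Diamond figure cited above; 198 questions is an assumption.
gpqa_gap = min_detectable_gap(0.943, 198)
print(f"GPQA Diamond: gaps under {gpqa_gap * 100:.1f} points are within noise")
```

Under these assumptions the threshold comes out at roughly 4.6 points, so a one- or two-point difference near the ceiling says little about which model is stronger. Because both models answer the same questions, a paired comparison would usually be tighter; treat this as a deliberately conservative sketch.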
The Humanity's Last Exam (HLE) data is where this gets genuinely uncomfortable for our thesis. Frontier models score ~35% against 90% for human domain experts, a gap of roughly 55 points on a benchmark specifically designed to resist saturation. Open-source models are not exempt from this gap: they have converged with closed-source models at approximately the same position below the expert ceiling. That's convergence, but convergence at a level substantially below human expert performance on hard reasoning. Whether that matters for the forecast depends on how you define 'frontier performance': parity with closed commercial models, or parity with human expert capability. Our forecast targets the former, not the latter.
The strongest counterargument remains what frontier labs hold in reserve. Anthropic's Mythos, referenced in the Carlini-White House briefing, represents exactly the kind of unreleased capability that creates asymmetric information risk in this forecast. We don't know what's in the closed labs' unreleased pipeline. Post-training techniques and RLHF data curation remain closely held, and the 6-18 month production-deployment gap that LLM Stats documents may understate the true frontier advantage on tasks that don't show up in current benchmarks. We're stopping at 75% rather than moving higher specifically because of this uncertainty. If Mythos or an equivalent represents a step-change rather than an incremental advance, the gap reopens.
What would move us above 80%: open-source models matching the closed frontier on Humanity's Last Exam or an equivalent hard-reasoning benchmark within two release cycles. What would drop us below 65%: a closed-model release demonstrating an advantage of more than 20 points on an HLE-equivalent evaluation while open-source models remain clustered near the current ~35%.