The Open-Source Frontier Is Real, but We Need to Be Honest About Which Frontier We Mean
TexTak holds open-source frontier parity at 69%, up from 67% — but after reviewing the editorial flags on our previous draft, we owe readers a more precise account of what that number actually measures. Today's evidence package — OpenAI's gpt-oss-120b, Mistral Large 3, and strong AIME scores from Ministral — is genuinely significant. The problem is that 'frontier' is doing a lot of work in our forecast definition, and the strongest counterevidence (Claude Mythos Preview at 94.6% GPQA Diamond) points to a structural tension we haven't fully reckoned with.
Let's start with the definition problem, because it's load-bearing. Our forecast reads 'open-source model matches closed frontier performance.' That's not resolvable as written. Two analysts looking at today's evidence could reach 45% and 85% without either being wrong, because they'd be measuring different things. So we're amending the target: **the forecast now tracks whether any open-weight model scores within 5 percentage points of the best publicly deployed closed model on all three of GPQA Diamond, AIME 2025, and HumanEval simultaneously**. We're explicitly excluding unreleased and preview-only models (like Mythos) from the 'closed frontier' benchmark, because no enterprise can deploy against a model it can't access. This distinction matters enormously for the probability.
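To make the amended criterion concrete, here's a minimal sketch of the resolution check. The margin and benchmark names come from the definition above; the per-benchmark comparison against the best publicly deployed closed score, and every number in it, are illustrative assumptions about how we'd operationalize the rule, not reported results.

```python
# Minimal sketch of the amended resolution check. The per-benchmark
# comparison and all scores below are illustrative assumptions.

BENCHMARKS = ("gpqa_diamond", "aime_2025", "humaneval")
PARITY_MARGIN = 5.0  # percentage points, per the amended definition

def resolves_yes(open_scores: dict, closed_scores: dict) -> bool:
    """True only if the open-weight model is within PARITY_MARGIN of the
    best publicly deployed closed score on every benchmark at once."""
    return all(
        open_scores[b] >= closed_scores[b] - PARITY_MARGIN
        for b in BENCHMARKS
    )

# Hypothetical scores: parity on two benchmarks but a 6-point GPQA gap
# still resolves NO, because the criterion is simultaneous.
open_model = {"gpqa_diamond": 78.0, "aime_2025": 85.0, "humaneval": 88.0}
closed_best = {"gpqa_diamond": 84.0, "aime_2025": 88.0, "humaneval": 92.0}
print(resolves_yes(open_model, closed_best))  # False
```

The simultaneity requirement is the point of the amendment: a single-benchmark parity headline would not resolve this question.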
With that scoped target, today's evidence is strong. gpt-oss-120b reportedly matches o4-mini on core reasoning benchmarks, and Ministral hits 85% on AIME 2025, within striking range of publicly deployed o3-mini-class models. Mistral Large 3's training run on 3,000 H200s is also worth noting: it represents meaningful compute democratization relative to GPT-4-era training runs, though we want to be careful not to overstate it. A 3,000-H200 cluster is not accessible to most open-source developers; it's accessible to well-funded European AI labs. The counterargument about resource concentration is weakened, not eliminated. What the data genuinely supports is that the set of organizations capable of training frontier-class models has expanded from ~3 to ~8-12 globally. That's real progress. It's not 'broadly accessible training.'
Now for the part of our thesis that keeps us up at night. Claude Mythos Preview at 94.6% GPQA Diamond isn't just an anomaly from a single unreleased model; it's evidence of a structural dynamic we haven't fully priced. If frontier labs systematically release models one generation behind their internal frontier, then open-source is structurally chasing a moving target that resets every 12-18 months. gpt-oss-120b matching o4-mini is real, but o4-mini may already be one generation behind OpenAI's internal frontier. The honest read is that our forecast, even with the amended definition, measures parity with the **publicly deployed** closed frontier, which is itself a trailing indicator of actual lab capability. We think that's the right target for enterprise decision-making (you can't deploy against models you can't access), but readers should understand the ceiling on what 'parity' means here.
One more thing we need to address: why does 'the strongest single-day evidence package we've seen' produce only a 2-point move? Because benchmark parity is not deployment parity, and we're moving primarily on benchmarks. We have no evidence about gpt-oss-120b's instruction-following quality in production, its refusal behavior under adversarial prompting, or its latency under real enterprise loads. These are the dimensions that actually determine whether a Fortune 500 replaces Azure OpenAI with an open-weight model. Our 69% reflects strong benchmark convergence with the publicly deployed closed frontier, offset by three drags: the structural release-lag problem (~8 points), the benchmark-to-deployment translation gap (~5 points), and residual uncertainty about whether Mythos represents the new normal for withheld capability rather than a one-off. The 2-point move is deliberate: we'd need actual production deployment evidence, not benchmark scores, to move meaningfully above 72%. What would push us above 80%: a publicly documented case of a Fortune 500 substituting an open-weight model for a closed frontier model in a production reasoning workflow. What would drop us below 55%: evidence that frontier labs are systematically releasing public models two generations behind internal capability, not one.
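For readers who want the arithmetic, here's a back-of-envelope reconstruction of that decomposition. The benchmark-implied starting point is an inferred assumption (the two quantified drags added back onto 69%), not a number we've published, and the residual Mythos uncertainty stays unquantified.

```python
# Back-of-envelope reconstruction of the 69% decomposition.
# benchmark_implied is an inferred assumption: what benchmark
# convergence alone would support, before the two quantified drags.

benchmark_implied = 82      # assumed, not a published figure
release_lag_drag = 8        # public releases trail the internal frontier
deployment_gap_drag = 5     # benchmark parity is not deployment parity

forecast = benchmark_implied - release_lag_drag - deployment_gap_drag
print(f"{forecast}%")  # 69%, with residual Mythos uncertainty folded in
```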