Open-Source Has Closed the Gap. Now the Real Question Is Whether That Gap Stays Closed.
TexTak forecasts a 69% probability that an open-source model matches closed frontier performance, and today's releases from OpenAI and Mistral are the most direct evidence we've seen that this threshold is being crossed in real time. OpenAI dropped gpt-oss-120b, which matches o4-mini on core reasoning benchmarks; Mistral 3 Large, trained on 3,000 H200s, hits frontier instruction-following performance; and Mistral's Ministral variants score 85% on AIME 2025 at much smaller parameter counts. The cumulative weight of these releases in a single news cycle is not coincidental; it reflects a structural shift in where the frontier actually sits. But the LLM Stats leaderboard update is the counterpoint that keeps this forecast from moving higher: Claude Mythos Preview at 94.6% on GPQA Diamond is the kind of capability that open-source models are not yet producing, and it suggests the frontier may be pulling away even as open-source catches up to where the frontier was six months ago.
Our 69% reflects three compounding forces: Meta's sustained open-source investment, the verified 100x reduction in compute cost that makes frontier-class training accessible outside closed labs, and the demonstrated momentum of training techniques (RLHF variants, distillation, chain-of-thought supervision) that are no longer proprietary. The OpenAI open-weight release is particularly significant because it comes from a closed lab, which means the open-source community now has a frontier-class calibration point it can build on directly. When the organization that defines the frontier hands its architecture to the world, the gap-closing dynamic accelerates nonlinearly.
But we need to be precise about what 'parity' means in this forecast, because the benchmark data today pulls in both directions simultaneously. gpt-oss-120b matching o4-mini on reasoning is real parity evidence for that specific benchmark class. Mistral 3 Large matching proprietary instruction-following performance is real parity evidence in that dimension. These are not trivial results; they represent capabilities that were closed-lab-only eighteen months ago. What they are NOT is parity with Claude Mythos Preview at 94.6% on GPQA Diamond, or with whatever Anthropic hasn't released yet. The 'Mythos' signal in the leaderboard data is the part of our thesis that gives us the most pause: if frontier labs are holding unreleased models that represent step-change improvements over what's publicly benchmarked, then open-source may be achieving parity with the published frontier while the actual frontier moves ahead.
This is the inferential gap we have to name honestly. Our forecast defines parity by benchmark convergence, but benchmark parity, developer preference, and product capability are genuinely different things. The Mistral and OpenAI releases prove benchmark convergence in specific dimensions. They do not prove that an open-source model produces equivalent outputs in production at scale, handles adversarial inputs with equivalent robustness, or matches the post-training refinement that closed labs apply but don't publish. The 69% probability is our assessment that at least one of these dimensions crosses a clear, publicly verifiable parity threshold within our forecast window — not that all dimensions converge simultaneously.
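To make the disjunctive framing concrete, here is a minimal Python sketch. The per-dimension probabilities and the independence assumption are placeholders of ours, not TexTak's actual decomposition; the point is only that a forecast defined as "at least one dimension crosses a parity threshold" sits above any single dimension's probability.

```python
# Placeholder decomposition of the parity question into dimensions.
# These numbers and the independence assumption are illustrative only;
# they are not TexTak's actual model.
dimension_probs = {
    "benchmark_convergence": 0.55,   # assumed
    "production_equivalence": 0.25,  # assumed
    "adversarial_robustness": 0.20,  # assumed
}

# P(no dimension crosses its threshold), assuming independence
p_none = 1.0
for p in dimension_probs.values():
    p_none *= 1.0 - p

p_at_least_one = 1.0 - p_none
print(f"P(at least one dimension reaches parity) = {p_at_least_one:.2f}")
# -> 0.73 with these placeholder numbers; the real 69% is a judgment call,
#    not the output of this arithmetic.
```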
What would move us above 80%: a third-party evaluation (not self-reported benchmarks) showing an open-source model matching or exceeding a closed frontier model on GPQA Diamond or an equivalent held-out professional benchmark within the next two quarters. What would drop us below 55%: evidence that Claude Mythos or an equivalent unreleased model represents a genuine architectural discontinuity rather than an incremental improvement; specifically, a closed lab posting a GPQA Diamond score above 96% while the best open-source models remain below 90% on the same benchmark. We're watching the Mythos release timeline closely.
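Those triggers can be read as a coarse decision rule. Below is a hedged sketch of that rule in Python; the numeric thresholds come from the paragraph above, while the function, field names, and the placeholder open-source score are ours and do not reflect any actual TexTak tooling.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    third_party_open_parity: bool  # independent eval shows an open model matching/exceeding a closed frontier model on GPQA Diamond or equivalent
    best_closed_gpqa: float        # best published closed-lab GPQA Diamond score (percent)
    best_open_gpqa: float          # best open-source GPQA Diamond score (percent)

def updated_forecast(current: float, e: Evidence) -> float:
    """Map the stated triggers onto coarse forecast bands (a sketch, not a calibrated model)."""
    if e.third_party_open_parity:
        return max(current, 0.80)  # "what would move us above 80%"
    if e.best_closed_gpqa > 96.0 and e.best_open_gpqa < 90.0:
        return min(current, 0.55)  # "what would drop us below 55%"
    return current

# Today's picture: Mythos Preview at 94.6% GPQA Diamond; 88.0 is a placeholder open-source score.
print(updated_forecast(0.69, Evidence(False, 94.6, 88.0)))  # 0.69 -- neither trigger has fired
```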