Mythos vs. Qwen: The Open-Source Frontier Gap Is Closing, But Not Closed
TexTak's open-source frontier parity forecast sits at 69%, and today's benchmark data is the most direct evidence we've seen in favor of that thesis. Qwen 3.5 scoring 88.4% on GPQA Diamond and DeepSeek-R1 achieving frontier reasoning at 1/20th the training cost are not circumstantial signals; they are performance data points. But Claude Mythos Preview scoring 94.6% on the same benchmark (6.2 points higher, in a model that isn't even publicly released yet) tells you exactly why we're at 69% and not 85%.
Let's be precise about what the 69% reflects and what it doesn't. Our forecast asks whether an open-source model matches closed frontier performance, and we define parity across three dimensions: benchmark scores, developer preference in head-to-head arena evaluations, and commercial deployment quality. Today's data confirms dimension one is nearly resolved. The MMLU gap, 17.5 percentage points in late 2023, had narrowed to effectively zero on knowledge benchmarks by May 2026. Qwen 3.5's 88.4% on GPQA Diamond is genuinely frontier-class. That's direct evidence the gap is closing, and it's why we moved from 67% to 69% two weeks ago.
Here's what keeps us from going higher: Mythos. The Anthropic leak, and today's benchmark confirmation, represent exactly the dynamic our 69% was designed to account for. Frontier labs are not standing still. The disclosed Mythos GPQA Diamond score of 94.6% is 6.2 points above Qwen 3.5 on the benchmark that most discriminates at the frontier. That's not a rounding error. It means the open-source models that are 'matching frontier performance' are matching last quarter's frontier, not this quarter's. The lag window, historically 6 to 18 months, hasn't collapsed to zero. It has narrowed, but it persists.
The strongest counterargument to our thesis isn't capability; it's the definition of parity itself. We've been careful to distinguish benchmark parity from product parity, and that distinction matters more now than ever. Frontier labs' post-training techniques, RLHF pipelines, safety tuning, and, critically, production infrastructure are not open-sourced. DeepSeek achieving frontier reasoning at 1/20th the training cost is a real and important data point, but training-cost parity is not deployment-quality parity. A Fortune 500 CTO choosing between Mythos-via-API and a self-hosted Qwen deployment is weighing latency, SLA guarantees, support contracts, and liability, none of which benchmark scores capture. This is the gap in our model we're most candid about.
What would move us above 80%? Two things: first, an open-source model leading, not just matching, a public Arena Elo leaderboard for more than one evaluation cycle; second, at least two Fortune 500 companies publicly switching from a closed frontier provider to an open-weight model for a production-critical workflow. What would drop us below 55%? Evidence that Mythos represents a step-change rather than an incremental lead, specifically if Anthropic's yet-unreleased capabilities extend the gap to 10+ points on GPQA Diamond with equivalent gaps on agentic benchmarks. We're watching the Mythos public release window closely. If the full model lands where the preview suggests, we'll need to reassess whether 'parity' is achievable within our forecast horizon.
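For readers who like their resolution criteria explicit, the update triggers above can be sketched as a toy decision rule. Everything here is illustrative: the function name, the requirement that both bullish conditions hold together, and the exact return values are our shorthand for the prose, not TexTak's actual forecasting model.

```python
def suggested_forecast(current: float,
                       arena_lead_cycles: int,
                       f500_switches: int,
                       gpqa_gap_points: float) -> float:
    """Toy restatement of the stated update triggers (illustrative only).

    current          -- today's probability (e.g. 0.69)
    arena_lead_cycles -- consecutive evaluation cycles an open model has LED
                         (not just matched) a public Arena Elo leaderboard
    f500_switches    -- Fortune 500 companies publicly switching a
                         production-critical workflow to an open-weight model
    gpqa_gap_points  -- closed-frontier lead over the best open model on
                         GPQA Diamond, in percentage points
    """
    # Bullish trigger: open model leads the arena for more than one cycle
    # AND at least two Fortune 500 production switches -> above 80%.
    if arena_lead_cycles > 1 and f500_switches >= 2:
        return max(current, 0.80)
    # Bearish trigger: Mythos-style step-change, a 10+ point GPQA Diamond
    # gap -> below 55%.
    if gpqa_gap_points >= 10:
        return min(current, 0.55)
    # Otherwise: no update from the current forecast.
    return current
```

With today's numbers (a 6.2-point gap, no arena lead, no public switches), the rule leaves the forecast at 0.69, which is the point: the evidence so far moves neither trigger.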