The Open-Source Catch-Up Is Real, But Architecture Matters More Than Benchmarks
TexTak places the probability of open-source models matching closed frontier performance at 69%, up from 67% this month. The UK AI Security Institute's new data showing just an 8-month lag between frontier releases and open-source parity supports our thesis — but today's leaked capabilities from Anthropic's Claude Mythos suggest the gap may be widening again in ways benchmarks don't capture.
Our 69% reflects three converging trends: Meta's massive open-source investment, the verified 100x reduction in training cost, and consistently narrowing gaps on standard benchmarks. The UK AISI report confirms what we've been tracking: open models now match closed performance within months, not years. This isn't just about compute anymore; it's about technique diffusion accelerating through academic and industry collaboration.
But here's what keeps us from going higher: today's news about Claude Mythos completing 32-step network attacks and discovering a 27-year-old OpenBSD vulnerability suggests frontier labs may have architectural breakthroughs that don't show up in standard benchmarks. The disputed o3 performance on FrontierMath (10% vs claimed 25%) isn't just measurement noise — it hints that capability assessment itself is breaking down as models become more specialized.
The strongest counterargument isn't about resources or talent; it's about architecture. If Anthropic's Mythos represents fundamentally different reasoning capabilities rather than just better post-training, then benchmark parity becomes meaningless. A model that matches GPT-4 on MMLU but can't sustain complex multi-step reasoning chains hasn't achieved real parity, regardless of what Arena rankings show.
What we might be underweighting: the possibility that frontier labs are moving beyond benchmarkable capabilities toward domain-specific reasoning architectures that open-source can't replicate without the underlying research breakthroughs. If three more frontier models demonstrate Mythos-level multi-step reasoning by Q3 while open-source remains stuck at benchmark parity, we would drop our estimate below 60%.
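To make that trigger concrete, here is a minimal Python sketch of the bookkeeping implied above. The numbers (69% today, a re-rating below 60% after three confirming frontier releases) come straight from this note; everything else, including the `ForecastState` structure, its field names, and the flat 3-point haircut per confirming release, is an illustrative assumption rather than our actual scoring model.

```python
from dataclasses import dataclass


@dataclass
class ForecastState:
    """Tracks the open-source-parity forecast described in this note."""
    estimate: float            # current probability of open-source parity
    floor_trigger: int         # confirming frontier releases that force a re-rating
    confirming_releases: int = 0

    def record_frontier_release(self, shows_step_chain_reasoning: bool) -> float:
        """Assumed rule of thumb: each frontier release demonstrating
        Mythos-level multi-step reasoning shaves ~3 points off the estimate;
        reaching the trigger count caps it below 0.60."""
        if shows_step_chain_reasoning:
            self.confirming_releases += 1
            self.estimate = max(self.estimate - 0.03, 0.0)
            if self.confirming_releases >= self.floor_trigger:
                self.estimate = min(self.estimate, 0.59)
        return self.estimate


if __name__ == "__main__":
    # Numbers from the note: 69% today, re-rate below 60% if three more
    # frontier models show Mythos-level multi-step reasoning by Q3.
    forecast = ForecastState(estimate=0.69, floor_trigger=3)
    for release in range(3):
        updated = forecast.record_frontier_release(shows_step_chain_reasoning=True)
        print(f"after confirming release {release + 1}: {updated:.2f}")
```

Running the sketch walks the estimate from 0.69 down through 0.66 and 0.63 to 0.59 on the third confirming release, which is just the stated trigger expressed as code, not a claim about how the underlying probability is actually computed.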