Open-source model matches closed frontier performance

The gap between open and closed models has been narrowing.

RESOLUTION CRITERIA

Resolves true if an open-weights model scores within 2% of the leading closed model on MMLU, HumanEval, and GPQA.
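The criterion above can be written as a small check. This is a minimal sketch, not an official resolver, and it rests on assumptions the text does not settle: "within 2%" is read as 2 percentage points of absolute score, the condition must hold on all three benchmarks at once, and an open model that beats the closed one also counts. The function name `resolves_true` and the scores in the usage lines are hypothetical placeholders, not reported results.

```python
# Benchmarks named in the resolution criteria.
BENCHMARKS = ("MMLU", "HumanEval", "GPQA")


def resolves_true(open_scores, closed_scores, margin=2.0):
    """Return True if the open model trails the closed model by at most
    `margin` percentage points on every benchmark.

    Assumption: "within 2%" means an absolute gap in score points,
    not a relative difference.
    """
    return all(
        closed_scores[name] - open_scores[name] <= margin
        for name in BENCHMARKS
    )


# Hypothetical placeholder scores, for illustration only.
closed = {"MMLU": 89.5, "HumanEval": 91.0, "GPQA": 60.0}
open_close = {"MMLU": 88.0, "HumanEval": 90.0, "GPQA": 58.5}
open_far = {"MMLU": 88.0, "HumanEval": 90.0, "GPQA": 55.0}

print(resolves_true(open_close, closed))  # all gaps are at most 2 points
print(resolves_true(open_far, closed))    # GPQA gap is 5 points
```

Under a relative reading of "2%", the comparison would instead be `open_scores[name] >= 0.98 * closed_scores[name]`, which is slightly stricter on low-scoring benchmarks such as GPQA.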

▲ FOR

Meta investing heavily in open-source

Training techniques closing gap

Compute costs dropping dramatically: 100x cost reduction verified

NEW: DeepSeek V4 potentially matching current frontier performance with 90% HumanEval

8-month gap between frontier and open-source models shrinking

Hardware advances like NVIDIA Blackwell making training accessible

Open-weight models now match GPT-4 and Claude on many benchmarks

▼ AGAINST

NEW: Frontier labs have unreleased capabilities like Anthropic's leaked 'Mythos' representing step-change improvements

Frontier labs have data advantages

Post-training techniques closely held

Benchmark parity ≠ real-world parity

Closed labs can maintain advantages through undisclosed model development

Poor calibration suggests fundamental limitations in current approaches

RECENT SIGNALS (4)

↑ Gemma 4 31B instruct model released by Google with advanced reasoning for open-source (Price Per Token)

↑ Open-Source Gap Narrows to 3 Points: GLM-5 Reaches 77.8% SWE-Bench, Matching Closed Frontier Models (BuildFastWithAI)

↑ Mistral Large 3 Joins Frontier Open-Source Ranks Trained on 3,000 NVIDIA H200 GPUs (Mistral AI)

↑ Open-Source Models Now Trail State-of-the-Art by Only 3-6 Months, Reshaping Enterprise Economics (BentoML)