Individual benchmark scores plotted by date.
| Model | Reported | Top Score | Info | Self Reported | Source |
|---|---|---|---|---|---|
| o3 mini | 30 Jan 2025 | 84.60% | - | Yes | Source |
| Qwen 3 235B A22B Thinking 2507 | - | 78.40% | 2024-11-25 | Yes | Source |
| Qwen 3 235B A22B | - | 77.10% | - | Yes | Source |
| Kimi K2 (2025-09-05) | 05 Sep 2025 | 76.40% | Pass@1 | Yes | Source |
| Qwen 3 235B A22B Instruct 2507 | - | 75.40% | 2024-11-25 | Yes | Source |
| Qwen 3 32B | - | 74.90% | - | Yes | Source |
| o3 | 16 Apr 2025 | 74.42% | High Reasoning Effort | No | Source |
| Qwen 3 Omni 30B A3B Instruct | - | 74.30% | inferred high-confidence family alias from qwen3-30b-a3b (score=0.4819; benches=8) | Yes | Source |
| Qwen 3 30B A3B Instruct 2507 | - | 74.30% | inferred version-family alias from qwen3-30b-a3b | Yes | Source |
| Qwen 3 30B A3B Thinking 2507 | - | 74.30% | inferred version-family alias from qwen3-30b-a3b | Yes | Source |
| Qwen 3 Coder 30B A3B Instruct | - | 74.30% | inferred high-confidence family alias from qwen3-30b-a3b (score=0.5007; benches=8) | Yes | Source |
| Qwen 3 Omni 30B A3B Thinking | - | 74.30% | inferred high-confidence family alias from qwen3-30b-a3b (score=0.4819; benches=8) | Yes | Source |
| Qwen 3 Omni 30B A3B Captioner | - | 74.30% | inferred family alias from qwen3-30b-a3b (score=0.4129; benches=8) | Yes | Source |
| Qwen 3 30B A3B | - | 74.30% | - | Yes | Source |
| QwQ 32B | - | 73.10% | - | Yes | Source |
| Claude Opus 4 | 21 May 2025 | 72.93% | 32k Thinking | No | Source |
| Claude Sonnet 4 | 21 May 2025 | 72.08% | 64k Thinking | No | Source |
| o4 Mini | 16 Apr 2025 | 71.52% | High Reasoning Effort | No | Source |
| Deepseek R1 (2025-05-28) | 28 May 2025 | 69.39% | - | No | Source |
| Claude 3.7 Sonnet | 24 Feb 2025 | 67.43% | 64k Thinking | No | Source |
| Deepseek R1 (2025-01-20) | 20 Jan 2025 | 65.15% | - | No | Source |
| Grok 3 Beta | 19 Feb 2025 | 62.36% | High Reasoning Effort | No | Source |
| GPT 4.5 | 27 Feb 2025 | 58.65% | - | No | Source |
| GPT 4.1 | 14 Apr 2025 | 55.90% | - | No | Source |
| Qwen 72B | - | 52.30% | inferred family alias from qwen-2.5-72b-instruct (score=0.3060; benches=14) | Yes | Source |
| o1 preview | 12 Sep 2024 | 52.30% | - | Yes | Source |
| Claude 3.5 Sonnet (2024-10-22) | 22 Oct 2024 | 51.80% | - | No | Source |
| GPT 4.1 Mini | 14 Apr 2025 | 51.57% | - | No | Source |
| Phi 4 | 12 Dec 2024 | 47.60% | - | Yes | Source |
| Phi 2 | - | 47.60% | inferred family alias from phi-4 (score=0.3100; benches=13) | Yes | Source |
| Phi 1 | - | 47.60% | inferred family alias from phi-4 (score=0.3100; benches=13) | Yes | Source |
| GPT 4.1 Nano | 14 Apr 2025 | 40.40% | - | No | Source |
| Claude 3.5 Haiku | 04 Nov 2024 | 39.51% | - | No | Source |
| Qwen 7B | - | 35.90% | inferred family alias from qwen-2.5-7b-instruct (score=0.3083; benches=14) | Yes | Source |
| Qwen 2.5 Coder 3B | - | 29.60% | inferred family alias from qwen2.5-omni-7b (score=0.3000; benches=45) | Yes | Source |
| Qwen 2.5 Omni 3B | - | 29.60% | inferred high-confidence family alias from qwen2.5-omni-7b (score=0.4933; benches=45) | Yes | Source |
| Qwen 2.5 Math PRM 7B | - | 29.60% | inferred family alias from qwen2.5-omni-7b (score=0.4092; benches=45) | Yes | Source |
| Qwen 2.5 Omni 7B | - | 29.60% | - | Yes | Source |
| Qwen 2.5 Math 7B | - | 29.60% | inferred high-confidence family alias from qwen2.5-omni-7b (score=0.4767; benches=45) | Yes | Source |
| Qwen 2.5 Coder 7B | - | 29.60% | inferred high-confidence family alias from qwen2.5-omni-7b (score=0.4700; benches=45) | Yes | Source |
| Qwen 2.5 Math 7B PRM800K | - | 29.60% | inferred family alias from qwen2.5-omni-7b (score=0.3696; benches=45) | Yes | Source |