Individual benchmark scores plotted by date.
| Model | Reported | Top Score | Info | Self Reported | Source |
|---|---|---|---|---|---|
| o3 Preview | 20 Dec 2024 | 96.70% | - | Yes | Source | |
| GPT OSS 120b | 05 Aug 2025 | 96.60% | High Reasoning Effort, With Tools | Yes | Source | |
| GPT OSS 20b | 05 Aug 2025 | 96% | High Reasoning Effort, With Tools | Yes | Source | |
| Grok 3 Mini | 18 Apr 2025 | 95.80% | - | Yes | Source | |
| Grok 3 Beta | 19 Feb 2025 | 95.80% | Reasoning | Yes | Source | |
| Grok 3 Mini Beta | 19 Feb 2025 | 95.80% | Think, Cons@64 | Yes | Source | |
| o4 Mini | 16 Apr 2025 | 93.40% | - | Yes | Source | |
| o4 mini Deep Research | 26 Jun 2025 | 93.40% | inferred modality/version alias from o4-mini | Yes | Source | |
| Grok 3 | 18 Apr 2025 | 93.30% | - | Yes | Source | |
| o3 Pro | 10 Jun 2025 | 93% | - | Yes | Source | |
| Gemini Embedding 2 Preview | 10 Mar 2026 | 92% | manual fallback alias from gemini-2.5-pro | Yes | Source | |
| Gemini 2.5 Pro Preview TTS (2025-12-10) | 10 Dec 2025 | 92% | inferred modality/version alias from gemini-2.5-pro | Yes | Source | |
| Gemini 2.5 Pro Experimental (2025-03-25) | 25 Mar 2025 | 92% | inferred alias from gemini-2.5-pro | Yes | Source | |
| Gemini 2.5 Computer Use Preview | 07 Oct 2025 | 92% | inferred family alias from gemini-2.5-pro (score=0.3960; benches=16) | Yes | Source | |
| o3 | 16 Apr 2025 | 91.60% | - | Yes | Source | |
| DeepSeek R1 (2025-05-28) | 28 May 2025 | 91.40% | - | Yes | Source | |
| GLM 4.5 | 28 Jul 2025 | 91% | - | Yes | Source | |
| GLM 4.5 Air | 28 Jul 2025 | 89.40% | - | Yes | Source | |
| Gemini Live 2.5 Flash Preview | 09 Apr 2025 | 88% | inferred high-confidence family alias from gemini-2.5-flash (score=0.5083; benches=14) | Yes | Source | |
| Gemini 2.5 Flash Image (Nano Banana) | 02 Oct 2025 | 88% | inferred modality/version alias from gemini-2.5-flash | Yes | Source | |
| Gemini 2.5 Flash Image Preview (Nano Banana) | 25 Aug 2025 | 88% | inferred modality/version alias from gemini-2.5-flash | Yes | Source | |
| Gemini 2.5 Flash Preview (2025-04-17) | 17 Apr 2025 | 88% | Thinking, Single Attempt | Yes | Source | |
| Gemini 2.5 Flash Preview TTS (2025-05-20) | 20 May 2025 | 88% | inferred modality/version alias from gemini-2.5-flash | Yes | Source | |
| Gemini 2.5 Flash Exp Native Audio Thinking Dialog | - | 88% | inferred modality/version alias from gemini-2.5-flash | Yes | Source | |
| Gemini 2.5 Flash Preview (2025-09-25) | 25 Sept 2025 | 88% | inferred alias from gemini-2.5-flash | Yes | Source | |
| Gemini 2.5 Flash Preview Native Audio Dialog | - | 88% | inferred modality/version alias from gemini-2.5-flash | Yes | Source | |
| Gemini 2.5 Flash Native Audio Preview (2025-09-23) | - | 88% | inferred modality/version alias from gemini-2.5-flash | Yes | Source | |
| Gemini 2.5 Flash Preview TTS (2025-12-10) | 10 Dec 2025 | 88% | inferred modality/version alias from gemini-2.5-flash | Yes | Source | |
| o3 mini | 30 Jan 2025 | 87.30% | - | Yes | Source | |
| MiniMax M1 80K | 16 Jun 2025 | 86% | - | Yes | - | |
| o1 pro | 19 Mar 2025 | 86% | - | Yes | Source | |
| Ministral 8B | 09 Oct 2024 | 86% | inferred alias from ministral-8b-latest | Yes | Source | |
| Qwen 3 235B A22B | - | 85.70% | - | Yes | Source | |
| MiniMax M1 40K | 16 Jun 2025 | 83.30% | - | Yes | - | |
| Qwen 3 32B | - | 81.40% | - | Yes | Source | |
| Phi 4 Reasoning Plus | 30 Apr 2025 | 81.30% | - | Yes | Source | |
| Granite 3.1 8B Instruct | - | 81.20% | inferred high-confidence family alias from granite-3.3-8b-instruct (score=0.4911; benches=14) | Yes | Source | |
| Granite 3.2 8B Instruct | - | 81.20% | inferred high-confidence family alias from granite-3.3-8b-instruct (score=0.4911; benches=14) | Yes | Source | |
| Granite Guardian 3.0 8B | - | 81.20% | inferred family alias from granite-3.3-8b-instruct (score=0.4062; benches=14) | Yes | Source | |
| Granite Speech 3.3 8B | - | 81.20% | inferred high-confidence family alias from granite-3.3-8b-instruct (score=0.5071; benches=14) | Yes | Source | |
| Granite 3.3 2B Instruct | 16 Apr 2025 | 81.20% | inferred family alias from granite-3.3-8b-instruct (score=0.3627; benches=14) | Yes | Source | |
| Granite 3.0 8B Instruct | - | 81.20% | inferred high-confidence family alias from granite-3.3-8b-instruct (score=0.4911; benches=14) | Yes | Source | |
| Granite Guardian 3.1 8B | - | 81.20% | inferred family alias from granite-3.3-8b-instruct (score=0.4062; benches=14) | Yes | Source | |
| Granite 3.2 8B Instruct Preview | - | 81.20% | inferred high-confidence family alias from granite-3.3-8b-instruct (score=0.4687; benches=14) | Yes | Source | |
| Granite Guardian 3.3 8B | - | 81.20% | inferred high-confidence family alias from granite-3.3-8b-instruct (score=0.5071; benches=14) | Yes | Source | |
| Granite 3.3 8B Instruct | 16 Apr 2025 | 81.20% | - | Yes | Source | |
| Granite Speech 3.2 8B | - | 81.20% | inferred family alias from granite-3.3-8b-instruct (score=0.4062; benches=14) | Yes | Source | |
| Qwen 3 Coder 30B A3B Instruct | - | 80.40% | inferred high-confidence family alias from qwen3-30b-a3b (score=0.5007; benches=8) | Yes | Source | |
| Qwen 3 30B A3B Thinking 2507 | - | 80.40% | inferred version-family alias from qwen3-30b-a3b | Yes | Source | |
| Qwen 3 30B A3B | - | 80.40% | - | Yes | Source | |
| Qwen 3 Omni 30B A3B Captioner | - | 80.40% | inferred family alias from qwen3-30b-a3b (score=0.4129; benches=8) | Yes | Source | |
| Qwen 3 30B A3B Instruct 2507 | - | 80.40% | inferred version-family alias from qwen3-30b-a3b | Yes | Source | |
| Qwen 3 Omni 30B A3B Thinking | - | 80.40% | inferred high-confidence family alias from qwen3-30b-a3b (score=0.4819; benches=8) | Yes | Source | |
| Qwen 3 Omni 30B A3B Instruct | - | 80.40% | inferred high-confidence family alias from qwen3-30b-a3b (score=0.4819; benches=8) | Yes | Source | |
| Claude 3.7 Sonnet | 24 Feb 2025 | 80% | 64k Thinking | Yes | Source | |
| DeepSeek R1 (2025-01-20) | 20 Jan 2025 | 79.80% | - | No | Source | |
| QwQ 32B | - | 79.50% | - | Yes | Source | |
| Kimi K1.5 | 20 Jan 2025 | 77.50% | - | No | Source | |
| Ministral 3B | 09 Oct 2024 | 77.50% | inferred alias from ministral-3b-latest | Yes | Source | |
| Phi 4 Reasoning | 30 Apr 2025 | 75.30% | - | Yes | Source | |
| o1 | 17 Dec 2024 | 74.30% | - | Yes | Source | |
| Magistral Medium 1.2 | 17 Sept 2025 | 73.60% | inferred version-family alias from magistral-medium | Yes | Source | |
| Magistral Medium 1.1 | 24 Jul 2025 | 73.60% | inferred version-family alias from magistral-medium | Yes | Source | |
| Magistral Medium 1.0 | 10 Jun 2025 | 73.60% | - | Yes | Source | |
| Gemini 2.0 Flash Thinking Exp (2024-12-19) | 19 Dec 2024 | 73.30% | inferred alias from gemini-2.0-flash-thinking | Yes | Source | |
| Gemini 2.0 Flash Thinking Exp (2025-01-21) | 21 Jan 2025 | 73.30% | inferred alias from gemini-2.0-flash-thinking | Yes | Source | |
| Magistral Small 1.0 | 10 Jun 2025 | 70.70% | Pass@1 | Yes | Source | |
| Magistral Small 1.1 | 24 Jul 2025 | 70.68% | inferred version-family alias from magistral-small-2506 | Yes | Source | |
| Magistral Small 1.2 | 17 Sept 2025 | 70.68% | inferred version-family alias from magistral-small-2506 | Yes | Source | |
| Kimi K2 (2025-09-05) | 05 Sept 2025 | 69.60% | Avg@64 | Yes | Source | |
| DeepSeek V3.1 Terminus | 22 Sept 2025 | 66.30% | inferred alias from deepseek-v3.1 | Yes | Source | |
| DeepSeek V3.1 | 21 Aug 2025 | 66.30% | Non-thinking: 66.3%, Thinking: 93.1% | Yes | Source | |
| DeepSeek V3 (2025-03-24) | 25 Mar 2025 | 59.40% | - | Yes | Source | |
| Phi 4 Mini Reasoning | 30 Apr 2025 | 57.50% | - | Yes | Source | |
| QwQ 32B Preview | - | 50% | - | Yes | Source | |
| GPT 4.1 Mini | 14 Apr 2025 | 49.60% | - | Yes | Source | |
| GPT 4.1 | 14 Apr 2025 | 48.10% | - | Yes | Source | |
| o1 preview | 12 Sept 2024 | 42% | - | Yes | Source | |
| DeepSeek V2 (2024-06-28) | 28 Jun 2024 | 39.20% | inferred family alias from deepseek-v3 (score=0.4159; benches=20) | Yes | Source | |
| DeepSeek OCR | 20 Oct 2025 | 39.20% | inferred family alias from deepseek-v3 (score=0.3000; benches=20) | Yes | Source | |
| DeepSeek V4 | - | 39.20% | inferred high-confidence family alias from deepseek-v3 (score=0.5818; benches=20) | Yes | Source | |
| DeepSeek V3 (2024-12-26) | 26 Dec 2024 | 39.20% | - | No | Source | |
| GPT 4.5 | 27 Feb 2025 | 36.70% | - | Yes | Source | |
| Gemini 2.0 Flash | 05 Feb 2025 | 32% | Single Attempt | Yes | Source | |
| GPT 4.1 Nano | 14 Apr 2025 | 29.40% | - | Yes | Source | |
| Claude 3.5 Sonnet (2024-10-22) | 22 Oct 2024 | 16% | - | Yes | - | |
| GPT 4o Transcribe Diarize | 15 Oct 2025 | 13.10% | inferred modality/version alias from gpt-4o-2024-08-06 | Yes | Source | |
| GPT 4o Realtime Preview (2024-10-01) | 01 Oct 2024 | 13.10% | inferred modality/version alias from gpt-4o-2024-08-06 | Yes | Source | |
| GPT 4o Audio (2025-06-03) | 03 Jun 2025 | 13.10% | inferred modality/version alias from gpt-4o-2024-08-06 | Yes | Source | |
| GPT 4o Audio (2024-12-17) | 17 Dec 2024 | 13.10% | inferred modality/version alias from gpt-4o-2024-08-06 | Yes | Source | |
| GPT 4o (2024-08-06) | 06 Aug 2024 | 13.10% | - | Yes | Source | |
| GPT 4o Audio (2024-10-01) | 01 Oct 2024 | 13.10% | inferred modality/version alias from gpt-4o-2024-08-06 | Yes | Source | |
| GPT 4o Realtime Preview (2025-06-03) | 03 Jun 2025 | 13.10% | inferred modality/version alias from gpt-4o-2024-08-06 | Yes | Source | |
| GPT 4o Search Preview | 11 Mar 2025 | 13.10% | inferred modality/version alias from gpt-4o-2024-08-06 | Yes | Source | |
| GPT 4o Transcribe | 20 Mar 2025 | 13.10% | inferred modality/version alias from gpt-4o-2024-08-06 | Yes | Source |
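
Many rows above carry a score copied from another model: their Info cell reads "inferred ... alias from X" or "manual fallback alias from X". The following is a minimal sketch, not part of the original page, of how a reader might separate directly reported scores from alias-inferred ones. The names `parse_row` and `ALIAS_PATTERN` are hypothetical, and the sketch assumes each row keeps the cell order Model, Reported, Top Score, Info.

```python
import re

# Assumption: alias rows are recognisable by the Info-cell wording used in the
# table above ("inferred ... alias from X" or "manual fallback alias from X").
ALIAS_PATTERN = re.compile(r"^(inferred|manual fallback)\b.*\balias from (\S+)")


def parse_row(line):
    """Split one Markdown table row into named fields (hypothetical helper)."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    model, reported, score, info = cells[:4]
    match = ALIAS_PATTERN.match(info)
    return {
        "model": model,
        "reported": reported,
        "score": float(score.rstrip("%")),
        # Identifier of the model the score was copied from, or None if direct.
        "alias_of": match.group(2) if match else None,
    }


rows = [
    "| Granite 3.3 8B Instruct | 16 Apr 2025 | 81.20% | - | Yes | Source | |",
    "| Granite Speech 3.2 8B | - | 81.20% | inferred family alias from "
    "granite-3.3-8b-instruct (score=0.4062; benches=14) | Yes | Source | |",
]
parsed = [parse_row(row) for row in rows]
direct = [row["model"] for row in parsed if row["alias_of"] is None]
print(direct)
```

Filtering on `alias_of` keeps one entry per measured score, which avoids counting the same result several times when comparing model families.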