Individual benchmark scores plotted by date.
| Model | Reported | Top Score | Info | Self Reported | Source |
|---|---|---|---|---|---|
| Qwen 3.5 27B | 24 Feb 2026 | 95% | - | Yes | Source | |
| Qwen 3.5 Flash | 23 Feb 2026 | 95% | inferred family alias from qwen3.5-27b (score=0.4147; benches=81) | Yes | Source | |
| Qwen 3.6 Plus | 01 Apr 2026 | 94.30% | - | Yes | Source | |
| o3 mini | 30 Jan 2025 | 93.90% | - | Yes | Source | |
| Qwen 3.5 122B A10B | 24 Feb 2026 | 93.40% | - | Yes | Source | |
| Qwen 3.5 397B A17B | 16 Feb 2026 | 92.60% | - | Yes | Source | |
| Llama 2 70B Chat | 20 Jun 2023 | 92.10% | inferred family alias from llama-3.3-70b-instruct (score=0.3129; benches=9) | Yes | Source | |
| Qwen 3.5 35B A3B | 24 Feb 2026 | 91.90% | - | Yes | Source | |
| Qwen 3.5 9B | 02 Mar 2026 | 91.50% | - | Yes | Source | |
| Nvidia Nemotron Nano 12B V2 | - | 90.30% | inferred high-confidence family alias from nvidia-nemotron-nano-9b-v2 (score=0.4889; benches=6) | Yes | Source | |
| Nvidia Nemotron Nano 9B V2 | - | 90.30% | - | Yes | Source | |
| Kimi K2 (2025-09-05) | 05 Sep 2025 | 89.80% | Prompt Strict | Yes | Source | |
| Qwen 3.5 4B | 02 Mar 2026 | 89.80% | - | Yes | Source | |
| Longcat Flash Cat | - | 89.65% | inferred high-confidence family alias from longcat-flash-chat (score=0.4667; benches=16) | Yes | Source | |
| Llama 3.1 Nemotron Ultra 253B v1 | 07 Apr 2025 | 89.45% | - | Yes | Source | |
| Qwen 3 Next 80B A3B Thinking | - | 88.90% | - | Yes | Source | |
| Qwen 3 235B A22B Instruct 2507 | - | 88.70% | - | Yes | Source | |
| GPT 4.5 | 27 Feb 2025 | 88.20% | - | Yes | Source | |
| Qwen 3 VL 235B A22B Thinking | - | 88.20% | - | Yes | - | |
| Qwen 3 VL 235B A22B Instruct | - | 87.80% | - | Yes | - | |
| Qwen 3 235B A22B Thinking 2507 | - | 87.80% | - | Yes | Source | |
| Qwen 3 VL 32B Thinking | - | 87.80% | - | Yes | - | |
| Qwen 3 Next 80B A3B Instruct | - | 87.60% | - | Yes | Source | |
| Kimi K1.5 | 20 Jan 2025 | 87.20% | - | Yes | Source | |
| DeepSeek V4 | - | 86.10% | inferred high-confidence family alias from deepseek-v3 (score=0.5818; benches=20) | Yes | Source | |
| DeepSeek V2 (2024-06-28) | 28 Jun 2024 | 86.10% | inferred family alias from deepseek-v3 (score=0.4159; benches=20) | Yes | Source | |
| DeepSeek OCR | 20 Oct 2025 | 86.10% | inferred family alias from deepseek-v3 (score=0.3000; benches=20) | Yes | Source | |
| Qwen 3 VL 30B A3B Instruct | - | 85.80% | - | Yes | - | |
| Phi 4 Reasoning Plus | 30 Apr 2025 | 84.90% | - | Yes | Source | |
| EXAONE 4.0 32B | 15 Jul 2025 | 84.80% | Non Reasoning | Yes | Source | |
| Qwen 3 VL 32B Instruct | - | 84.70% | - | Yes | - | |
| Qwen 72B | - | 84.10% | inferred family alias from qwen-2.5-72b-instruct (score=0.3060; benches=14) | Yes | Source | |
| Jamba Large 1.7 | 03 Jul 2025 | 84% | - | Yes | - | |
| QwQ 32B | - | 83.90% | - | Yes | Source | |
| Qwen 3 VL 8B Instruct | - | 83.70% | - | Yes | - | |
| Phi 4 Reasoning | 30 Apr 2025 | 83.40% | - | Yes | Source | |
| Qwen 3 VL Embedding 8B | - | 83.20% | inferred high-confidence family alias from qwen3-vl-8b-thinking (score=0.5232; benches=50) | Yes | - | |
| Qwen 3 VL Reranker 8B | - | 83.20% | inferred high-confidence family alias from qwen3-vl-8b-thinking (score=0.5275; benches=50) | Yes | - | |
| Qwen 3 Guard Gen 8B | - | 83.20% | inferred family alias from qwen3-vl-8b-thinking (score=0.3400; benches=50) | Yes | - | |
| Qwen 3 Guard Stream 8B | - | 83.20% | inferred family alias from qwen3-vl-8b-thinking (score=0.3371; benches=50) | Yes | - | |
| Qwen 3 8B | - | 83.20% | inferred high-confidence family alias from qwen3-vl-8b-thinking (score=0.4600; benches=50) | Yes | - | |
| Qwen 3 Embedding 8B | - | 83.20% | inferred family alias from qwen3-vl-8b-thinking (score=0.3850; benches=50) | Yes | - | |
| Qwen 3 Reranker 8B | - | 83.20% | inferred family alias from qwen3-vl-8b-thinking (score=0.3850; benches=50) | Yes | - | |
| Qwen 3 VL 8B Thinking | - | 83.20% | - | Yes | - | |
| Qwen 3 4B SafeRL | - | 82.60% | inferred family alias from qwen3-vl-4b-thinking (score=0.3850; benches=48) | Yes | - | |
| Qwen 3 4B Thinking 2507 | - | 82.60% | inferred family alias from qwen3-vl-4b-thinking (score=0.3462; benches=48) | Yes | - | |
| Qwen 3 Guard Gen 4B | - | 82.60% | inferred family alias from qwen3-vl-4b-thinking (score=0.3400; benches=48) | Yes | - | |
| Qwen 3 Embedding 4B | - | 82.60% | inferred family alias from qwen3-vl-4b-thinking (score=0.3850; benches=48) | Yes | - | |
| Qwen 3 Guard Stream 4B | - | 82.60% | inferred family alias from qwen3-vl-4b-thinking (score=0.3371; benches=48) | Yes | - | |
| Qwen 3 4B Instruct 2507 | - | 82.60% | inferred family alias from qwen3-vl-4b-thinking (score=0.3462; benches=48) | Yes | - | |
| Qwen 3 Reranker 4B | - | 82.60% | inferred family alias from qwen3-vl-4b-thinking (score=0.3850; benches=48) | Yes | - | |
| Qwen 3 VL 4B Thinking | - | 82.60% | - | Yes | - | |
| Qwen 3 VL 4B Instruct | - | 82.30% | - | Yes | - | |
| Qwen 3 VL 30B A3B Thinking | - | 81.70% | - | Yes | - | |
| GPT 4o Search Preview | 11 Mar 2025 | 81% | inferred modality/version alias from gpt-4o-2024-08-06 | Yes | Source | |
| GPT 4o Audio (2024-12-17) | 17 Dec 2024 | 81% | inferred modality/version alias from gpt-4o-2024-08-06 | Yes | Source | |
| GPT 4o Transcribe Diarize | 15 Oct 2025 | 81% | inferred modality/version alias from gpt-4o-2024-08-06 | Yes | Source | |
| GPT 4o Audio (2025-06-03) | 03 Jun 2025 | 81% | inferred modality/version alias from gpt-4o-2024-08-06 | Yes | Source | |
| GPT 4o Transcribe | 20 Mar 2025 | 81% | inferred modality/version alias from gpt-4o-2024-08-06 | Yes | Source | |
| GPT 4o Audio (2024-10-01) | 01 Oct 2024 | 81% | inferred modality/version alias from gpt-4o-2024-08-06 | Yes | Source | |
| GPT 4o Realtime Preview (2025-06-03) | 03 Jun 2025 | 81% | inferred modality/version alias from gpt-4o-2024-08-06 | Yes | Source | |
| GPT 4o (2024-08-06) | 06 Aug 2024 | 81% | - | Yes | Source | |
| GPT 4o Realtime Preview (2024-10-01) | 01 Oct 2024 | 81% | inferred modality/version alias from gpt-4o-2024-08-06 | Yes | Source | |
| Llama 3.1 Nemotron Nano 4B V1.1 | - | 79.30% | inferred high-confidence family alias from llama-3.1-nemotron-nano-8b-v1 (score=0.5523; benches=7) | Yes | Source | |
| Llama 3.1 Nemotron Nano 8B V1 | 18 Mar 2025 | 79.30% | - | Yes | Source | |
| Qwen 3.5 2B | 02 Mar 2026 | 78.60% | - | Yes | Source | |
| Jamba Mini 1.7 | 03 Jul 2025 | 76.20% | - | Yes | - | |
| Jamba Large 1.6 | 06 Mar 2025 | 75.80% | - | Yes | - | |
| Granite 3.1 8B Instruct | - | 74.82% | inferred high-confidence family alias from granite-3.3-8b-instruct (score=0.4911; benches=14) | Yes | Source | |
| Granite 3.2 8B Instruct | - | 74.82% | inferred high-confidence family alias from granite-3.3-8b-instruct (score=0.4911; benches=14) | Yes | Source | |
| Granite 3.3 2B Instruct | 16 Apr 2025 | 74.82% | inferred family alias from granite-3.3-8b-instruct (score=0.3627; benches=14) | Yes | Source | |
| Granite Guardian 3.1 8B | - | 74.82% | inferred family alias from granite-3.3-8b-instruct (score=0.4062; benches=14) | Yes | Source | |
| Granite 3.3 8B Instruct | 16 Apr 2025 | 74.82% | - | Yes | Source | |
| Granite Guardian 3.0 8B | - | 74.82% | inferred family alias from granite-3.3-8b-instruct (score=0.4062; benches=14) | Yes | Source | |
| Granite Guardian 3.3 8B | - | 74.82% | inferred high-confidence family alias from granite-3.3-8b-instruct (score=0.5071; benches=14) | Yes | Source | |
| Granite 3.0 8B Instruct | - | 74.82% | inferred high-confidence family alias from granite-3.3-8b-instruct (score=0.4911; benches=14) | Yes | Source | |
| Granite 3.2 8B Instruct Preview | - | 74.82% | inferred high-confidence family alias from granite-3.3-8b-instruct (score=0.4687; benches=14) | Yes | Source | |
| Granite Speech 3.2 8B | - | 74.82% | inferred family alias from granite-3.3-8b-instruct (score=0.4062; benches=14) | Yes | Source | |
| Granite Speech 3.3 8B | - | 74.82% | inferred high-confidence family alias from granite-3.3-8b-instruct (score=0.5071; benches=14) | Yes | Source | |
| EXAONE 4.0 1.2B | 15 Jul 2025 | 74.70% | Non Reasoning | Yes | Source | |
| Qwen 7B | - | 71.20% | inferred family alias from qwen-2.5-7b-instruct (score=0.3083; benches=14) | Yes | Source | |
| Jamba Mini 1.6 | 06 Mar 2025 | 68.30% | - | Yes | - | |
| Granite 4.0 Tiny | 02 Oct 2025 | 63% | inferred alias from granite-4.0-tiny-preview | Yes | Source | |
| Granite 4.0 Tiny Preview | 02 May 2025 | 63% | - | Yes | Source | |
| Granite 4.0 Micro | 02 Oct 2025 | 63% | inferred high-confidence family alias from granite-4.0-tiny-preview (score=0.4700; benches=12) | Yes | Source | |
| Granite 4.0 Small | 02 Oct 2025 | 63% | inferred high-confidence family alias from granite-4.0-tiny-preview (score=0.4700; benches=12) | Yes | Source | |
| Phi 1 | - | 63% | inferred family alias from phi-4 (score=0.3100; benches=13) | Yes | Source | |
| Phi 4 | 12 Dec 2024 | 63% | - | Yes | Source | |
| Phi 2 | - | 63% | inferred family alias from phi-4 (score=0.3100; benches=13) | Yes | Source | |
| Pixtral 12B | 17 Sep 2024 | 61.30% | inferred version-family alias from pixtral-12b-2409 | Yes | Source | |
| Qwen 3.5 0.8B | 02 Mar 2026 | 44% | - | Yes | Source |
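
Many rows above carry an Info note of the form "inferred ... alias from <model> (score=...; benches=...)", and each such row shows the same Top Score as the model it names. The following is a minimal Python sketch for separating those rows from directly reported ones when post-processing the table. The function name `parse_info` and the result keys are hypothetical, chosen for illustration; the table does not define what `score` and `benches` mean, so the sketch only extracts them without interpreting them.

```python
import re

# Matches notes such as:
#   "inferred high-confidence family alias from deepseek-v3 (score=0.5818; benches=20)"
#   "inferred modality/version alias from gpt-4o-2024-08-06"
# The parenthesised part is optional because some notes omit it.
ALIAS_NOTE = re.compile(
    r"^inferred\s+(?P<kind>.*?)\s*alias from\s+(?P<source>\S+)"
    r"(?:\s+\(score=(?P<score>[0-9.]+);\s*benches=(?P<benches>\d+)\))?$"
)


def parse_info(info):
    """Return alias details for an 'inferred ... alias' note, else None.

    Hypothetical helper: the key names are assumptions, not part of the
    source data.
    """
    match = ALIAS_NOTE.match(info.strip())
    if match is None:
        # Covers "-", "Prompt Strict", "Non Reasoning", and similar values.
        return None
    score = match.group("score")
    benches = match.group("benches")
    return {
        "kind": match.group("kind"),      # e.g. "high-confidence family"
        "source": match.group("source"),  # model id the row is aliased from
        "alias_score": float(score) if score is not None else None,
        "benches": int(benches) if benches is not None else None,
    }
```

Calling `parse_info` on each Info cell and keeping only the rows where it returns `None` leaves the entries that are not marked as inferred aliases.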