Individual benchmark scores plotted by date.
| Model | Reported | Top Score | Info | Self Reported | Source |
|---|---|---|---|---|---|
| Mistral Medium 3.1 | 12 Aug 2025 | 20.30 | Top-1 accuracy 32.3%; cases 703 | No | Source |
| Mistral Large 3.0 | 02 Dec 2025 | 23 | Top-1 accuracy 34.0%; cases 703 | No | Source |
| MiniMax M2.7 | 18 Mar 2026 | 39.30 | Top-1 accuracy 63.4%; cases 703 | No | Source |
| Trinity Large Thinking | 01 Apr 2026 | 41.60 | Top-1 accuracy 66.1%; cases 703 | No | Source |
| Ernie 5.0 | 22 Jan 2026 | 41.70 | Top-1 accuracy 65.3%; cases 703 | No | Source |
| Qwen 3.5 27B | 24 Feb 2026 | 45.50 | Top-1 accuracy 71.3%; cases 703 | No | Source |
| MiMo V2 Pro | 18 Mar 2026 | 45.90 | Top-1 accuracy 68.8%; cases 703 | No | Source |
| Qwen 3.5 122B A10B | 24 Feb 2026 | 51.20 | Top-1 accuracy 76.5%; cases 703 | No | Source |
| Gemma 4 31B | 02 Apr 2026 | 53 | Reasoning; Top-1 accuracy 76.1%; cases 703 | No | Source |
| Seed 2.0 Pro | 14 Feb 2026 | 57.10 | Top-1 accuracy 77.0%; cases 703 | No | Source |
| Qwen 3.6 Plus | 01 Apr 2026 | 59.50 | Top-1 accuracy 81.5%; cases 703 | No | Source |
| GPT 5.4 Mini | 17 Mar 2026 | 61.70 | xHigh reasoning; Top-1 accuracy 80.8%; cases 703 | No | Source |
| Gemini 3.1 Flash Lite Preview | 03 Mar 2026 | 63.30 | Top-1 accuracy 82.1%; cases 703 | No | Source |
| Grok 4.20 | 17 Feb 2026 | 63.80 | Reasoning; Top-1 accuracy 81.5%; cases 703 | No | Source |
| DeepSeek V3.2 | 01 Dec 2025 | 65 | Top-1 accuracy 81.8%; cases 703 | No | Source |
| Qwen 3.5 397B A17B | 16 Feb 2026 | 65.10 | Top-1 accuracy 82.4%; cases 703 | No | Source |
| Kimi K2.5 | 27 Jan 2026 | 69.40 | Thinking; Top-1 accuracy 84.8%; cases 703 | No | Source |
| GLM 5.1 | 07 Apr 2026 | 69.80 | Top-1 accuracy 85.9%; cases 703 | No | Source |
| Claude Opus 4.7 | 16 Apr 2026 | 72.80 | High reasoning; Top-1 accuracy 86.8%; cases 703 | No | Source |
| Claude Sonnet 4.6 | 17 Feb 2026 | 76.30 | High reasoning; Top-1 accuracy 88.5%; cases 703 | No | Source |
| Gemini 3.1 Pro Preview | 19 Feb 2026 | 79.40 | Top-1 accuracy 91.5%; cases 703 | No | Source |
| GPT 5.4 | 05 Mar 2026 | 80 | xHigh reasoning; Top-1 accuracy 90.9%; cases 703 | No | Source |
| Claude Opus 4.6 | 05 Feb 2026 | 80.60 | High reasoning; Top-1 accuracy 90.0%; cases 703 | No | Source |