Individual benchmark scores plotted by date.
| Model | Reported | Top Score | Info | Self Reported | Source |
|---|---|---|---|---|---|
| Gemma 3 27B | 12 Mar 2025 | 40.26 | Confab 66.3%; non-response 14.2% | No | Source | |
| GPT 4o Mini (2024-07-18) | 18 Jul 2024 | 37.21 | Confab 60.9%; non-response 13.5% | No | Source | |
| Claude 3.5 Haiku | 04 Nov 2024 | 36.74 | Confab 65.8%; non-response 7.6% | No | Source | |
| Claude 3 Haiku | 13 Mar 2024 | 34.21 | Confab 56.9%; non-response 11.5% | No | Source | |
| Nova Pro 1.0 | 04 Dec 2024 | 30.05 | Confab 54.5%; non-response 5.6% | No | Source | |
| Phi 4 | 12 Dec 2024 | 29.43 | Confab 52.5%; non-response 6.4% | No | Source | |
| GPT 4 Turbo (2024-01-25) | 25 Jan 2024 | 28.42 | Confab 26.7%; non-response 30.1% | No | Source | |
| Gemma 2 27B | 27 Jun 2024 | 27.12 | Confab 47.0%; non-response 7.2% | No | Source | |
| Gemini 2.0 Flash | 05 Feb 2025 | 26.85 | Confab 24.3%; non-response 29.4% | No | Source | |
| DeepSeek V3 (2025-03-24) | 24 Mar 2025 | 26.15 | Confab 39.1%; non-response 13.2% | No | Source | |
| Mistral Small 3.0 | 30 Jan 2025 | 25.21 | Confab 38.6%; non-response 11.8% | No | Source | |
| Minimax Text 01 | 15 Jan 2025 | 23.92 | Confab 44.6%; non-response 3.3% | No | Source | |
| Llama 3.3 70B Instruct | 06 Dec 2024 | 22.81 | Confab 17.8%; non-response 27.8% | No | Source | |
| Claude 3 Opus | 04 Mar 2024 | 22.70 | Confab 28.2%; non-response 17.2% | No | Source | |
| Llama 4 Maverick | 05 Apr 2025 | 22.58 | Confab 28.2%; non-response 16.9% | No | Source | |
| Mistral Medium 3.0 | 07 May 2025 | 21.90 | Confab 38.1%; non-response 5.7% | No | Source | |
| Mistral Large 2.0 | 24 Jul 2024 | 21.40 | Confab 32.2%; non-response 10.6% | No | Source | |
| Kimi K2 (2025-09-05) | 05 Sep 2025 | 20.38 | Confab 30.2%; non-response 10.6% | No | Source | |
| Grok 2 | 13 Aug 2024 | 20.14 | Confab 25.7%; non-response 14.5% | No | Source | |
| Claude 3.5 Sonnet (2024-10-22) | 22 Oct 2024 | 19.94 | Confab 12.9%; non-response 27.0% | No | Source | |
| Qwen 2.5 72B | - | 19.09 | Confab 32.2%; non-response 6.0% | No | Source | |
| o1 mini | 12 Sep 2024 | 18.55 | Confab 26.2%; non-response 10.9% | No | Source | |
| Gemini 2.0 Pro Exp (2025-02-05) | 05 Feb 2025 | 18.43 | Confab 15.8%; non-response 21.0% | No | Source | |
| o3 mini | 30 Jan 2025 | 17.91 | Medium reasoning; Confab 27.2%; non-response 8.6% | No | Source | |
| Llama 3.1 405B Instruct | 23 Jul 2024 | 17.62 | Confab 14.4%; non-response 20.9% | No | Source | |
| Claude Opus 4.1 | 05 Aug 2025 | 17.08 | 16K thinking; Confab 2.5%; non-response 31.7% | No | Source | |
| Gemini 2.5 Flash Preview (2025-04-17) | 17 Apr 2025 | 16.81 | 24K; Confab 4.5%; non-response 29.2% | No | Source | |
| Qwen 3 235B A22B Thinking 2507 | - | 16.77 | Thinking 2507; Confab 13.9%; non-response 19.7% | No | Source | |
| Ernie 4.5 300B A47B | - | 15.97 | Confab 25.2%; non-response 6.7% | No | Source | |
| Claude Opus 4 | 21 May 2025 | 15.92 | 16K thinking; Confab 2.5%; non-response 29.4% | No | Source | |
| o4 Mini | 16 Apr 2025 | 15.79 | High reasoning; Confab 26.7%; non-response 4.8% | No | Source | |
| GPT OSS 120b | 05 Aug 2025 | 15.65 | Medium reasoning; Confab 23.3%; non-response 8.0% | No | Source | |
| Qwen 3 235B A22B | - | 15.61 | Confab 23.3%; non-response 8.0% | No | Source | |
| QwQ 32B | - | 15.57 | 16K; Confab 25.2%; non-response 5.9% | No | Source | |
| GPT 4o (2024-08-06) | 06 Aug 2024 | 15.34 | Confab 22.3%; non-response 8.4% | No | Source | |
| Claude 3.7 Sonnet | 24 Feb 2025 | 14.71 | 16K thinking; Confab 7.9%; non-response 21.5% | No | Source | |
| DeepSeek R1 (2025-05-28) | 28 May 2025 | 14.56 | Confab 12.9%; non-response 16.2% | No | Source | |
| o3 | 16 Apr 2025 | 14.38 | High reasoning; Confab 24.8%; non-response 4.0% | No | Source | |
| o3 Pro | 10 Jun 2025 | 14.22 | Medium reasoning; Confab 23.4%; non-response 5.1% | No | Source | |
| Grok 3 Beta | 19 Feb 2025 | 14.19 | No reasoning; Confab 17.8%; non-response 10.6% | No | Source | |
| GPT 4.5 | 27 Feb 2025 | 13.64 | Confab 11.9%; non-response 15.4% | No | Source | |
| Gemini 1.5 Pro 002 | 24 Sep 2024 | 13.54 | Confab 16.8%; non-response 10.2% | No | Source | |
| GPT 5 Mini | 07 Aug 2025 | 13.28 | Medium reasoning; Confab 21.8%; non-response 4.8% | No | Source | |
| Claude Sonnet 4 | 21 May 2025 | 13.20 | 16K thinking; Confab 2.5%; non-response 23.9% | No | Source | |
| o1 preview | 12 Sep 2024 | 13.04 | Confab 18.3%; non-response 7.8% | No | Source | |
| DeepSeek R1 | - | 12.65 | Confab 17.3%; non-response 8.0% | No | Source | |
| Gemini 2.0 Flash Thinking Exp (2025-01-21) | 21 Jan 2025 | 12.43 | Confab 14.9%; non-response 10.0% | No | Source | |
| Grok 4 | 10 Jul 2025 | 12.41 | Confab 4.0%; non-response 20.9% | No | Source | |
| Qwen 3 30B A3B | - | 12.28 | Confab 12.9%; non-response 11.7% | No | Source | |
| o1 | 17 Dec 2024 | 11.74 | Medium reasoning; Confab 10.9%; non-response 12.6% | No | Source | |
| GLM 4.5 | 28 Jul 2025 | 11.30 | Confab 7.9%; non-response 14.7% | No | Source | |
| Gemini 2.5 Pro Experimental (2025-03-25) | 25 Mar 2025 | 10.80 | 2025-03-25; Confab 4.0%; non-response 17.6% | No | Source | |
| Grok 3 Mini Beta | 19 Feb 2025 | 10.80 | High reasoning; Confab 6.9%; non-response 14.7% | No | Source | |
| Gemini 2.5 Pro Preview (2025-05-06) | 06 May 2025 | 10.62 | Confab 5.9%; non-response 15.3% | No | Source | |
| GPT 5 | 07 Aug 2025 | 10.34 | Medium reasoning; Confab 10.9%; non-response 9.8% | No | Source |
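
In the rows above, each listed score sits close to the simple average of the two percentages in the Info cell. For Gemma 3 27B, (66.3 + 14.2) / 2 = 40.25 against a listed 40.26. This is an observation from the table, not a documented scoring formula. The Python sketch below parses an Info cell and computes that average so the comparison can be repeated for any row. The helper names `parse_rates` and `mean_rate` are hypothetical and used only for illustration.

```python
import re

# Matches the two percentages in an Info cell, e.g.
# "16K thinking; Confab 2.5%; non-response 31.7%".
RATE_PATTERN = re.compile(r"Confab\s+([\d.]+)%;\s*non-response\s+([\d.]+)%")


def parse_rates(info: str) -> tuple[float, float]:
    """Return (confabulation %, non-response %) parsed from an Info cell."""
    match = RATE_PATTERN.search(info)
    if match is None:
        raise ValueError(f"no Confab/non-response rates found in: {info!r}")
    return float(match.group(1)), float(match.group(2))


def mean_rate(info: str) -> float:
    """Average of the two rates. Assumption: equal weighting, inferred from the rows."""
    confab, non_response = parse_rates(info)
    return (confab + non_response) / 2


# A few rows copied from the table: (model, listed score, Info cell).
rows = [
    ("Gemma 3 27B", 40.26, "Confab 66.3%; non-response 14.2%"),
    ("Claude Opus 4.1", 17.08, "16K thinking; Confab 2.5%; non-response 31.7%"),
    ("GPT 5", 10.34, "Medium reasoning; Confab 10.9%; non-response 9.8%"),
]

for model, listed, info in rows:
    print(f"{model}: listed {listed:.2f}, mean of rates {mean_rate(info):.2f}")
```

Small gaps between the listed and computed values are expected, since the percentages in the Info cell are already rounded to one decimal place.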