Thematic Generalisation

Thematic Generalisation - Benchmark Leaderboard & Model Performance | AI Stats

Models Using This Benchmark

Organisation	Model	Reported	Top Score	Info	Self Reported	Source
Mistral	Mistral Medium 3.1	12 Aug 2025	20.30	Top-1 accuracy 32.3%; cases 703	No	Source
Mistral	Mistral Large 3.0	02 Dec 2025	23.00	Top-1 accuracy 34.0%; cases 703	No	Source
MiniMax	MiniMax M2.7	18 Mar 2026	39.30	Top-1 accuracy 63.4%; cases 703	No	Source
Arcee AI	Trinity Large Thinking	01 Apr 2026	41.60	Top-1 accuracy 66.1%; cases 703	No	Source
Baidu	Ernie 5.0	22 Jan 2026	41.70	Top-1 accuracy 65.3%; cases 703	No	Source
Qwen	Qwen 3.5 27B	24 Feb 2026	45.50	Top-1 accuracy 71.3%; cases 703	No	Source
Xiaomi	MiMo V2 Pro	18 Mar 2026	45.90	Top-1 accuracy 68.8%; cases 703	No	Source
Qwen	Qwen 3.5 122B A10B	24 Feb 2026	51.20	Top-1 accuracy 76.5%; cases 703	No	Source
Google	Gemma 4 31B	02 Apr 2026	53.00	Reasoning; Top-1 accuracy 76.1%; cases 703	No	Source
ByteDance	Seed 2.0 Pro	14 Feb 2026	57.10	Top-1 accuracy 77.0%; cases 703	No	Source
Qwen	Qwen 3.6 Plus	01 Apr 2026	59.50	Top-1 accuracy 81.5%; cases 703	No	Source
OpenAI	GPT 5.4 Mini	17 Mar 2026	61.70	xHigh reasoning; Top-1 accuracy 80.8%; cases 703	No	Source
Google	Gemini 3.1 Flash Lite Preview	03 Mar 2026	63.30	Top-1 accuracy 82.1%; cases 703	No	Source
xAI	Grok 4.20	17 Feb 2026	63.80	Reasoning; Top-1 accuracy 81.5%; cases 703	No	Source
DeepSeek	DeepSeek V3.2	01 Dec 2025	65.00	Top-1 accuracy 81.8%; cases 703	No	Source
Qwen	Qwen 3.5 397B A17B	16 Feb 2026	65.10	Top-1 accuracy 82.4%; cases 703	No	Source
Moonshot	Kimi K2.5	27 Jan 2026	69.40	Thinking; Top-1 accuracy 84.8%; cases 703	No	Source
z.AI	GLM 5.1	07 Apr 2026	69.80	Top-1 accuracy 85.9%; cases 703	No	Source
Anthropic	Claude Opus 4.7	16 Apr 2026	72.80	High reasoning; Top-1 accuracy 86.8%; cases 703	No	Source
Anthropic	Claude Sonnet 4.6	17 Feb 2026	76.30	High reasoning; Top-1 accuracy 88.5%; cases 703	No	Source
Google	Gemini 3.1 Pro Preview	19 Feb 2026	79.40	Top-1 accuracy 91.5%; cases 703	No	Source
OpenAI	GPT 5.4	05 Mar 2026	80.00	xHigh reasoning; Top-1 accuracy 90.9%; cases 703	No	Source
Anthropic	Claude Opus 4.6	05 Feb 2026	80.60	High reasoning; Top-1 accuracy 90.0%; cases 703	No	Source
