Individual benchmark scores plotted by date.
| Model | Reported | Top Score | Info | Self Reported | Source |
|---|---|---|---|---|---|
| Qwen 3 VL 235B A22B Thinking | - | 97.30% | - | Yes | - |
| Qwen 3 235B A22B Instruct 2507 | - | 95% | - | Yes | Source |
| Longcat Flash Chat | - | 89.30% | inferred high-confidence family alias from longcat-flash-chat (score=0.4667; benches=16) | Yes | Source |
| Kimi K2 (2025-09-05) | 05 Sept 2025 | 89% | Acc | Yes | Source |
| MiniMax M1 80K | 16 Jun 2025 | 86.80% | - | Yes | - |
| MiniMax M1 40K | 16 Jun 2025 | 80.10% | - | Yes | - |