o3
OpenAI
Highlights
Top benchmark results for openai/o3-2025-04-16.
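| Benchmark | Score | Rank |
|---|---|---|
| Ai2 SciArena | 1172 | #1 |
| Aider-Polyglot | 0.81 | #4 |
| AIME 2024 | 0.92 | #7 |
| AIME 2025 | 0.98 | #5 |
| ARC-AGI-1 | 0.53 | #9 |
| ARC-AGI-2 | 0.03 | #11 |
| Codeforces | 2517 | #3 |
| Confabulations | 14.38 | #15 |
| Elimination Game | 4.38 | #21 |
| EQ-Bench 3 | 1500 | #2 |
| GPQA Diamond | 0.83 | #12 |
| Graphwalks bfs <128k | 0.77 | #2 |
| Graphwalks parents <128k | 0.73 | #2 |
| Humanity's Last Exam | 0.20 | #7 |
| LisanBench | 1036 | #1 |
| LiveBench | 0.74 | #4 |
| LMArena Text | 1449 | #2 |
| LMArena WebDev | 1190 | #11 |
| NYT Connections | 0.80 | #4 |
| SimpleBench | 0.53 | #6 |
| Thematic Generalisation | 1.83 | #7 |
| VideoMME | 0.85 | #2 |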
Benchmark table
Detailed scores across tracked benchmarks.
| Benchmark | Category | Top Score | Info | Self Reported | Source |
|---|---|---|---|---|---|
| Ai2 SciArena | - | 1172 | - | No | Source |
| Aider-Polyglot | code | 0.81 | High Reasoning Effort | No | Source |
| AIME 2024 | math | 0.92 | - | Yes | Source |
| AIME 2025 | math | 0.98 | - | Yes | Source |
| ARC-AGI-1 | - | 0.53 | Medium Reasoning Effort | No | Source |
| ARC-AGI-2 | - | 0.03 | Medium Reasoning Effort | No | Source |
| BrowseComp Long Context 128k | - | 0.88 | High Reasoning Effort | Yes | Source |
| Codeforces | - | 2517 | - | Yes | Source |
| Confabulations | - | 14.38 | High Reasoning Effort | No | Source |
| Elimination Game | - | 4.38 | Medium Reasoning Effort | No | Source |
| EQ-Bench 3 | - | 1500 | - | No | Source |
| FActScore hallucination rate | hallucinations | 0.23 | High Reasoning Effort | Yes | Source |
| GPQA Diamond | general-knowledge | 0.83 | - | Yes | Source |
| Graphwalks bfs <128k | - | 0.77 | High Reasoning Effort | Yes | Source |
| Graphwalks parents <128k | - | 0.73 | High Reasoning Effort | Yes | Source |
| Humanity's Last Exam | - | 0.20 | - | Yes | Source |
| LisanBench | - | 1036 | - | No | Source |
| LiveBench | - | 0.74 | High Reasoning Effort | No | Source |
| LMArena Text | - | 1449 | - | No | Source |
| LMArena WebDev | - | 1190 | 16th June 2025 | No | Source |
| LongFact-Concepts hallucination rate | hallucinations | 0.05 | High Reasoning Effort | Yes | Source |
| LongFact-Objects hallucination rate | hallucinations | 0.07 | High Reasoning Effort | Yes | Source |
| NYT Connections | - | 0.80 | High Reasoning Effort | No | Source |
| OpenAI-MRCR: 2 needle 128k | - | 0.55 | High Reasoning Effort | Yes | Source |
| SimpleBench | - | 0.53 | High Reasoning Effort | No | Source |
| Thematic Generalisation | - | 1.83 | Medium Reasoning Effort | No | Source |
| VideoMME | - | 0.85 | High Reasoning Effort | Yes | Source |
Benchmark comparisons
Aider-Polyglot: score 0.81, rank #4 of 34 models.