GPT OSS 120b
OpenAI
Highlights
Top benchmark results for openai/gpt-oss-120b-2025-08-05.
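| Benchmark | Score | Rank |
|---|---|---|
| Aider-Polyglot | 0.44 | #26 |
| AIME 2024 | 0.97 | #2 |
| AIME 2025 | 0.98 | #6 |
| Codeforces | 2622 | #2 |
| Confabulations | 15.65 | #20 |
| EQ-Bench 3 | 1152 | #16 |
| GPQA Diamond | 0.81 | #19 |
| HealthBench | 0.58 | #2 |
| HealthBench Consensus | 0.91 | #1 |
| HealthBench Hard | 0.30 | #2 |
| Humanity's Last Exam | 0.19 | #8 |
| MathArena Apex | 0.01 | #5 |
| MMLU | 0.90 | #1 |
| MMMLU | 0.81 | #4 |
| SWE-Bench | 0.62 | #10 |
| Tau Bench (Airline) | 0.49 | #4 |
| Tau Bench (Retail) | 0.68 | #4 |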
Benchmark table
Detailed scores across tracked benchmarks.
| Benchmark | Category | Top Score | Evaluation Settings | Self-Reported | Source |
|---|---|---|---|---|---|
| Aider-Polyglot | code | 0.44 | High Reasoning Effort | Yes | Source |
| AIME 2024 | math | 0.97 | High Reasoning Effort, With Tools | Yes | Source |
| AIME 2025 | math | 0.98 | High Reasoning Effort, With Tools | Yes | Source |
| Codeforces | - | 2622 | High Reasoning Effort, With Tools | Yes | Source |
| Confabulations | - | 15.65 | Medium Reasoning Effort | No | Source |
| EQ-Bench 3 | - | 1152 | - | No | Source |
| GPQA Diamond | general-knowledge | 0.81 | High Reasoning Effort, With Tools | Yes | Source |
| HealthBench | health | 0.58 | High Reasoning Effort | Yes | Source |
| HealthBench Consensus | health | 0.91 | Medium Reasoning Effort | Yes | Source |
| HealthBench Hard | health | 0.30 | High Reasoning Effort | Yes | Source |
| Humanity's Last Exam | - | 0.19 | High Reasoning Effort, With Tools | Yes | Source |
| MathArena Apex | - | 0.01 | High Reasoning Effort | No | Source |
| MMLU | - | 0.90 | High Reasoning Effort | Yes | Source |
| MMMLU | - | 0.81 | High Reasoning Effort, Average | Yes | Source |
| SWE-Bench | code | 0.62 | High Reasoning Effort | Yes | Source |
| Tau Bench (Airline) | - | 0.49 | High Reasoning Effort | Yes | Source |
| Tau Bench (Retail) | - | 0.68 | High Reasoning Effort | Yes | Source |
Benchmark comparisons
GPQA Diamond: score 0.81, rank #19 of 107 models.