Benchmarks
GPQA
230 models
AIME 2025
149 models
MMLU-Pro
149 models
MMLU
136 models
GPQA Diamond
133 models
MMMU
120 models
SWE-Bench
118 models
Humanity's Last Exam
103 models
AIME 2024
95 models
MMMU Pro
94 models
IFEval
91 models
MATH
84 models
SimpleQA
84 models
HumanEval
80 models
LMArena Text
77 models
Aider-Polyglot
71 models
MMLU Redux
67 models
LiveCodeBench
66 models
Mathvista
66 models
AI2D
61 models
GSM8K
59 models
HMMT 2025
58 models
LiveCodeBench V6
56 models
MMStar
53 models
MathVision
52 models
Video MMMU
52 models
ARC-AGI-2
51 models
CharXiv-R
50 models
DROP
48 models
MMMLU
48 models
OSWorld
48 models
SuperGPQA
47 models
Arena Hard
46 models
RealWorldQA
45 models
ERQA
44 models
Multi-IF
44 models
ScreenSpot-Pro
44 models
ARC-AGI-1
43 models
Confabulations
43 models
DocVQA
43 models
LVBench
43 models
MVBench
43 models
OCRBench
43 models
MathVista-Mini
42 models
Tau 2 Telecom
42 models
ChartQA
41 models
LiveBench
41 models
NYT Connections
41 models
BrowseComp
39 models
CC-OCR
39 models
MMLU-ProX
39 models
Include
38 models
Terminal Bench 2.0
38 models
MMBench-V1.1
37 models
ScreenSpot
37 models
Hallusion Bench
36 models
MGSM
36 models
MMMT Bench
36 models
MMLU Pro
35 models
ODinW
35 models
PolyMATH
35 models
MuirBench
34 models
SimpleBench
34 models
Thematic Generalisation
34 models
EQ-Bench 3
33 models
MATH 500
33 models
CharXiv-D
32 models
CharadesSTA
31 models
MBPP
31 models
OCRBench-V2 (en)
31 models
Video-MME
31 models
EgoSchema
30 models
Aider-Polyglot Edit
29 models
BFCL-v3
29 models
Tau Bench (Retail)
29 models
Vibe-Eval
29 models
BIG-Bench Hard
28 models
BLINK
28 models
Graphwalks bfs <128k
28 models
Graphwalks parents <128k
28 models
MRCR
28 models
Tau 2 Retail
28 models
WritingBench
28 models
Arena-Hard v2
27 models
HumanEval+
27 models
InfoVQAtest
27 models
LiveBench 20241125
27 models
OCRBench-V2 (zh)
27 models
Tau Bench (Airline)
27 models
DocVQAtest
26 models
Elimination Game
26 models
C-Eval
25 models
Creative Writing v3
25 models
MLVU-M
25 models
SWE Bench Multilingual
25 models
HellaSwag
24 models
TruthfulQA
24 models
FACTS Grounding
23 models
Tau2 Bench
23 models
IFBench
22 models
MMMU (val)
22 models
Terminal Bench
22 models
COLLIE
21 models
LMArena WebDev
21 models
TextVQA
21 models
AA-LCR
20 models
FLEURS
20 models
SWE-Lancer
20 models
Codeforces
19 models
MultiPL-E
19 models
Tau2 Airline
19 models
LiveCodeBench V5
18 models
SWE Bench Pro
18 models
VideoMME w sub.
18 models
VideoMME w/o sub.
18 models
AlpacaEval 2.0
17 models
AndroidWorld_SR
17 models
Global-MMLU-Lite
17 models
LongBench v2
17 models
Winogrande
17 models
Ai2 SciArena
16 models
ARC-C
16 models
Global PIQA
16 models
MT-Bench
16 models
OmniDocBench 1.5
16 models
SciCode
16 models
AttaQ
15 models
BrowseComp-zh
15 models
HiddenMath
15 models
PopQA
15 models
Scale MCP Atlas
15 models
AidanBench
14 models
GDPval-AA
14 models
MathArena Apex
14 models
MMBench
14 models
Natural2Code
14 models
OpenAI-MRCR: 2 needle 128k
13 models
WMT23
13 models
BBH
12 models
LiveCodeBench Pro
12 models
MRCR v2 (8-needle)
12 models
PIQA
12 models
Toolathlon
12 models
Vending Bench 2
12 models
WMT24++
12 models
AITZ_EM
11 models
Android Control High_EM
11 models
Android Control Low_EM
11 models
BFCL-V4
11 models
ComplexFuncBench
11 models
InfoVQA
11 models
Internal API instruction following (hard)
11 models
MLVU
11 models
MMBench-Video
11 models
SimpleVQA
11 models
TempCompass
11 models
BFCL
10 models
CLUEWSC
10 models
HealthBench
10 models
LisanBench
10 models
MAXIFE
10 models
NOVA-63
10 models
XSTest
10 models
ActivityNet
9 models
CMMLU
9 models
CountBench
9 models
HealthBench Hard
9 models
MMLongBench-Doc
9 models
OCRBench V2
9 models
Seal-0
9 models
AIME 2026
8 models
AMC_2022_23
8 models
ARC-E
8 models
BrowseComp Long Context 128k
8 models
DeepPlanning
8 models
DynaMath
8 models
IMO Answer Bench
8 models
LongVideoBench
8 models
MMVet
8 models
MMVU
8 models
MobileMiniWob++_SR
8 models
Multi-SWE-Bench
8 models
Online Judgement Benchmark
8 models
OpenAI MRCR 8 Needle 128k
8 models
OpenAI-MRCR: 2 needle 128k
8 models
PerceptionTest
8 models
RefSpatialBench
8 models
Tau 2 Airline
8 models
VideoMME
8 models
VITA-Bench
8 models
ZEROBench-Sub
8 models
BabyVision
7 models
Common Voice 15
7 models
CoVoST2 en-zh
7 models
Creative Story Writing
7 models
CRPErelation
7 models
FActScore hallucination rate
7 models
Frames
7 models
FunctionalMATH
7 models
GiantSteps Tempo
7 models
LongFact-Concepts hallucination rate
7 models
LongFact-Objects hallucination rate
7 models
MBPP+
7 models
Meld
7 models
MMAU
7 models
MMAU Music
7 models
MMAU Sound
7 models
MMAU Speech
7 models
MME-RealWorld
7 models
MMT-Bench
7 models
MusicCaps
7 models
NMOS
7 models
OmniBench
7 models
OmniBench Music
7 models
PhysicsFinals
7 models
PointGrounding
7 models
TriviaQA
7 models
VLMsAreBlind
7 models
VocalSound
7 models
VoiceBench Avg
7 models
WideSearch
7 models
AGIEval
6 models
AlignBench
6 models
BFCL v2
6 models
Bird-SQL (dev)
6 models
BoolQ
6 models
BrowseComp Long Context 256k
6 models
CSimpleQA
6 models
DeepSearchQA
6 models
EvalPlus
6 models
FinanceAgent v1.1
6 models
Frontier Math
6 models
FrontierMath
6 models
MedXpertQA
6 models
OpenAI-MRCR: 2 needle 256k
6 models
RefCOCO-avg
6 models
Social IQa
6 models
TAU-Bench
6 models
V*
6 models
ZebraLogic
6 models
ZEROBench
6 models
CharXiv-Reasoning
5 models
Claw-Eval
5 models
CoVoST2
5 models
CyberGym
5 models
EmbSpatialBench
5 models
HLE-Verified
5 models
HumanEval-Mul
5 models
Hypersim
5 models
MTVQA
5 models
OpenBookQA
5 models
PhiBench
5 models
PinchBench
5 models
SUNRGBD
5 models
TheoremQA
5 models
TIR-Bench
5 models
AA-Index
4 models
Aider
4 models
ARC-AGI
4 models
CNMO 2024
4 models
FACTS
4 models
FullStackBench en
4 models
FullStackBench zh
4 models
Graphwalks BFS >128k
4 models
Graphwalks parents >128k
4 models
LingoQA
4 models
MCP-Mark
4 models
MMLU French
4 models
MMMUval
4 models
MotionBench
4 models
MRCR 1M (pointwise)
4 models
Nuscene
4 models
OpenAI MRCR 8 Needle 1m
4 models
PMC-VQA
4 models
RULER
4 models
Scale MultiChallenge
4 models
SlakeVQA
4 models
VisuLogic
4 models
WorldVQA
4 models
+ Thinking with Tracking
3 models
AA-Omniscience
3 models
AetherCode
3 models
All-Angles
3 models
ArcAGI1-Image
3 models
ArcAGI2-Image
3 models
Artificial Analysis Intelligence Index v4
3 models
BABE
3 models
BeyondAIME
3 models
BFCL Overall FC V4
3 models
BFCL_v3_MultiTurn
3 models
BIG-Bench
3 models
CGBench
3 models
ChartQAPro
3 models
CharXiv-DQ
3 models
CharXiv-RQ
3 models
CL-Bench
3 models
Codeforces(no tool)
3 models
ContPhy
3 models
CritPt
3 models
CrossVid
3 models
CruxEval-O
3 models
Cybersecurity CTFs
3 models
DA-2K
3 models
DeepPlanning v1.1 Avg Accuracy
3 models
DeepPlanning v1.1 Shopping Case Accuracy
3 models
DeepPlanning v1.1 Shopping Match Score
3 models
DeepPlanning v1.1 Travel Case Accuracy
3 models
DeepPlanning v1.1 Travel Composite Score
3 models
DeepPlanning v1.1 Travel CS Score
3 models
DeepPlanning v1.1 Travel PS Score
3 models
DeR 2 Bench
3 models
Disco-X
3 models
DUDE
3 models
EgoTempo
3 models
EMMA
3 models
Encyclo-K
3 models
FactScore
3 models
FrontierSci-olympiad
3 models
FrontierSci-research
3 models
FSC-147↓
3 models
GovReport
3 models
HallusionBench
3 models
HiPhO
3 models
HLE (no tool, text only)
3 models
HMMT Feb 2025
3 models
HMMT Nov 2025
3 models
HumanEval-Average
3 models
HumanEvalFIM-Average
3 models
τ²-Bench (telecom)
3 models
IMOAnswerBench (no tool)
3 models
InterGPS
3 models
Inverse IFEval
3 models
KORBench
3 models
LiveSports-3K
3 models
LogicVista
3 models
LongBench v2 (128k)
3 models
LongDocURL
3 models
LongFact-Concepts
3 models
LongFact-Objects
3 models
LPFQA
3 models
MARS-Bench
3 models
MathArenaApex
3 models
MathArenaApex (shortlist)
3 models
MathCanvas
3 models
MathKangaroo
3 models
MEGA MLQA
3 models
MEGA TyDi QA
3 models
MEGA UDPOS
3 models
MEGA XCOPA
3 models
MEGA XStoryCloze
3 models
Minerva ‡
3 models
MME
3 models
MME-CC
3 models
MMLongBench
3 models
MMLU Multilingual
3 models
MMSIBench (circular)
3 models
Morse-500
3 models
MRCR v2 (8-needle)
3 models
MultiChallenge (o3-mini grader)
3 models
Natural Questions
3 models
NL2Repo
3 models
OCRBenchv2
3 models
ODVBench
3 models
OmniDocBench 1.5 ↓
3 models
OVBench
3 models
OVOBench
3 models
PHYBench
3 models
PhyX (openended)
3 models
Point-Bench
3 models
POPE
3 models
ProcBench
3 models
Qasper
3 models
QMSum
3 models
RepoBench
3 models
RepoQA
3 models
SFE
3 models
SimpleQA Verified
3 models
Spider
3 models
SQuALITY
3 models
SummScreenFD
3 models
Superchem (text-only)
3 models
Terminal Bench Hard
3 models
TOMATO
3 models
TreeBench
3 models
TVBench
3 models
VibeEval
3 models
VideoEval-Pro
3 models
VideoHolmes ‡
3 models
VideoReasonBench
3 models
VideoSimpleQA
3 models
VisFactor
3 models
ViSpeak
3 models
Vistra MetricX
3 models
ViVerBench
3 models
VLMsAreBiased
3 models
VPCT
3 models
VQAv2
3 models
Wildbench
3 models
WMT24++ COMET
3 models
WMT24++ MetricX
3 models
XLRS-Bench (macro)
3 models
ZeroBench (main)
3 models
AIME
2 models
AInstein Bench
2 models
ArtifactsBench
2 models
BIObench
2 models
BrowseComp Long Context 128k
2 models
CFEval
2 models
CodeSimpleQA
2 models
DeepConsult
2 models
DeepResearchBench
2 models
Design2Code
2 models
DS-Arena-Code
2 models
DS-FIM-Eval
2 models
FACTS Benchmark Suite
2 models
FigQA
2 models
FinSearchComp
2 models
FlenQA
2 models
HealthBench Consensus
2 models
HLE-text
2 models
HLE-VL
2 models
τ²-Bench (retail)
2 models
IF
2 models
IF-Bench
2 models
LiveCodeBench Coding
2 models
LiveCodeBench(01-09)
2 models
LongCodeBench 1M
2 models
Minedojo Verified
2 models
MM-BrowseComp
2 models
MMBench_test
2 models
MMVetGPT4Turbo
2 models
MultiLF
2 models
Multilingual MMLU
2 models
NL2Repo (Pass@1)
2 models
NL2Repo-Bench
2 models
OmniMath
2 models
OpenAI-MRCR: 2 needle 1M
2 models
QVHighlights
2 models
Realkie
2 models
ResearchRubrics
2 models
ScienceQA
2 models
SpreadsheetBench Verified
2 models
SWE-Evo
2 models
Tool-Decathlon
2 models
TydiQA
2 models
USAMO 2025
2 models
USAMO25
2 models
VCR_en_easy
2 models
VIBE-Pro
2 models
VitaBench
2 models
VQAv2 (test)
2 models
WideSearch
2 models
WMT25 MQM
2 models
ACEBench
1 model
AI2 Reasoning Challenge (ARC)
1 model
AMC
1 model
AndroidWorld
1 model
APEX-Agents
1 model
Arc
1 model
Arena Chat Rank
1 model
Arena Search Rank
1 model
ARKitScenes
1 model
Artificial Analysis
1 model
Artificial Analysis Text-to-Video Rank
1 model
AutoLogi
1 model
BFCLv4
1 model
BIG-Bench Extra Hard
1 model
BigCodeBench
1 model
BioLP-Bench
1 model
BrowseComp Long Context 256k
1 model
BrowseComp-VL
1 model
CC-Bench-V2 Backend
1 model
CC-Bench-V2 Frontend
1 model
CC-Bench-V2 Repo Exploration
1 model
Chest ImaGenome Anatomy IOU
1 model
CheXpert CXR Top-5 Macro F1
1 model
CloningScenarios
1 model
CommonSenseQA
1 model
CRUX-O
1 model
CT Dataset 1 Macro Accuracy
1 model
CXR14 3-Condition Macro F1
1 model
CyBench
1 model
EQ-Bench
1 model
EyePACS Accuracy
1 model
FinSearchComp T2&T3
1 model
FinSearchComp-T3
1 model
Flame-VLM-Code
1 model
FLTEval
1 model
FLTEval Pass@16
1 model
FLTEval Pass@2
1 model
FLTEval Pass@4
1 model
FLTEval Pass@8
1 model
GDPval-MM
1 model
Global PICA
1 model
GSM8K Chat
1 model
HMMT Feb 26
1 model
HMMT Feb. 2026
1 model
ImageMining
1 model
InfographicsQA
1 model
Instruct HumanEval
1 model
IVEBench Consistency vs Kling o1
1 model
IVEBench Consistency vs Runway Aleph
1 model
IVEBench Instruction Following vs Kling o1
1 model
IVEBench Instruction Following vs Runway Aleph
1 model
IVEBench Overall vs Kling o1
1 model
IVEBench Overall vs Runway Aleph
1 model
LBPP (v2)
1 model
LiveCodeBench v5 24.12-25.2
1 model
LSAT
1 model
MASK
1 model
MathVerse-Mini
1 model
MBPP EvalPlus
1 model
MedXpertQA Accuracy
1 model
MEWC
1 model
MIABench
1 model
MIMIC CXR Top-5 Macro F1
1 model
MLE-Bench Lite
1 model
MM IF-Eval
1 model
MM-BrowserComp
1 model
MM-ClawBench
1 model
MMLU Chat
1 model
MMLU Redux 2.0
1 model
MMLU-STEM
1 model
MMMU (validation)
1 model
MMSearch
1 model
MMSearch-Plus
1 model
MRCR 1M
1 model
MRCR v2
1 model
MRI Dataset 1 Macro Accuracy
1 model
MS-CXR-T Macro Accuracy
1 model
Objectron
1 model
OctoCodingBench
1 model
OfficeQA
1 model
OfficeQA Pro
1 model
OJBench (C++)
1 model
OlympiadBench
1 model
OmniGAIA
1 model
OpenRCA
1 model
OSWorld-G
1 model
PaperBench
1 model
PathMCQA Accuracy
1 model
PolyMath-en
1 model
ProtocolQA
1 model
RoboSpatialHome
1 model
SAT Math
1 model
ScienceQA Visual
1 model
SecCodeBench
1 model
SIFO
1 model
SIFO-Multiturn
1 model
SkillsBench
1 model
SLAKE Closed-Subset Accuracy
1 model
SLAKE Tokenized F1
1 model
SuperGLUE
1 model
SWE Bench Live
1 model
SWE-Bench Multimodal
1 model
SWE-Perf
1 model
SWE-Review
1 model
SWT-Bench
1 model
TAU3-Bench
1 model
Uniform Bar Exam
1 model
US-DermMCQA Accuracy
1 model
VIBE
1 model
VIBE Android
1 model
VIBE Backend
1 model
VIBE iOS
1 model
VIBE Simulation
1 model
VIBE Web
1 model
Virology Capabilities Test
1 model
Vision2Web
1 model
VQA-RAD Closed-Subset Accuracy
1 model
VQA-RAD Tokenized F1
1 model
We-Math
1 model
WebVoyager
1 model
WMDP
1 model
WSI-Path ROUGE
1 model
XLSum English
1 model
ZClawBench
1 model
Balrog-AI
0 models
Dubesor LLM
0 models
Fiction-Live Bench
0 models
Galileo Agent
0 models
IQ Bench
0 models
MathArena
0 models
MC-Bench
0 models
METR
0 models
Misguided Attention
0 models
MLE-Bench
0 models
SEAL MultiChallenge
0 models
SmolAgents LLM
0 models
Snake-Bench
0 models
SOLO-Bench
0 models
Symflower Coding
0 models
WeirdML
0 models
XLANG Agent
0 models