MATH 500

MATH 500 - Benchmark Leaderboard & Model Performance | AI Stats

Models Using This Benchmark

Organisation	Model	Reported	Top Score	Info	Self Reported	Source
z.AI	GLM 4.5	28 Jul 2025	98.20%	-	Yes	Source
z.AI	GLM 4.5 Air	28 Jul 2025	98.10%	-	Yes	Source
Nvidia	Nvidia Nemotron Nano 12B V2	-	97.80%	inferred high-confidence family alias from nvidia-nemotron-nano-9b-v2 (score=0.4889; benches=6)	Yes	Source
Nvidia	Nvidia Nemotron Nano 9B V2	-	97.80%	-	Yes	Source
Moonshot	Kimi K2 (2025-09-05)	05 Sept 2025	97.40%	Acc	Yes	Source
Nvidia	Llama 3.1 Nemotron Ultra 253B v1	07 Apr 2025	97.00%	-	Yes	Source
MiniMax	MiniMax M1 80K	16 Jun 2025	96.80%	-	Yes	-
Nvidia	Llama 3.3 Nemotron Super 49B v1	18 Mar 2025	96.60%	-	Yes	Source
Nvidia	Llama 3.3 Nemotron Super 49B V1.5	-	96.60%	inferred version-family alias from llama-3.3-nemotron-super-49b-v1	Yes	Source
Meituan	Longcat Flash Cat	-	96.40%	inferred high-confidence family alias from longcat-flash-chat (score=0.4667; benches=16)	Yes	Source
Moonshot	Kimi K1.5	20 Jan 2025	96.20%	-	Yes	Source
MiniMax	MiniMax M1 40K	16 Jun 2025	96.00%	-	Yes	-
Nvidia	Llama 3.1 Nemotron Nano 4B V1.1	-	95.40%	inferred high-confidence family alias from llama-3.1-nemotron-nano-8b-v1 (score=0.5523; benches=7)	Yes	Source
Nvidia	Llama 3.1 Nemotron Nano 8B V1	18 Mar 2025	95.40%	-	Yes	Source
Microsoft	Phi 4 Mini Flash Reasoning	-	94.60%	inferred modality/version alias from phi-4-mini-reasoning	Yes	Source
Microsoft	Phi 4 Mini Reasoning	30 Apr 2025	94.60%	-	Yes	Source
Qwen	QwQ 32B Preview	-	90.60%	-	Yes	Source
Qwen	QwQ 32B	-	90.60%	-	Yes	Source
DeepSeek	DeepSeek OCR	20 Oct 2025	90.20%	inferred family alias from deepseek-v3 (score=0.3000; benches=20)	Yes	Source
DeepSeek	DeepSeek V4	-	90.20%	inferred high-confidence family alias from deepseek-v3 (score=0.5818; benches=20)	Yes	Source
DeepSeek	DeepSeek V2 (2024-06-28)	28 Jun 2024	90.20%	inferred family alias from deepseek-v3 (score=0.4159; benches=20)	Yes	Source
OpenAI	o1 mini	12 Sept 2024	90.00%	-	Yes	Source
IBM	Granite 3.1 8B Instruct	-	69.02%	inferred high-confidence family alias from granite-3.3-8b-instruct (score=0.4911; benches=14)	Yes	Source
IBM	Granite 3.2 8B Instruct	-	69.02%	inferred high-confidence family alias from granite-3.3-8b-instruct (score=0.4911; benches=14)	Yes	Source
IBM	Granite Guardian 3.1 8B	-	69.02%	inferred family alias from granite-3.3-8b-instruct (score=0.4062; benches=14)	Yes	Source
IBM	Granite 3.2 8B Instruct Preview	-	69.02%	inferred high-confidence family alias from granite-3.3-8b-instruct (score=0.4687; benches=14)	Yes	Source
IBM	Granite 3.3 8B Instruct	16 Apr 2025	69.02%	-	Yes	Source
IBM	Granite Guardian 3.3 8B	-	69.02%	inferred high-confidence family alias from granite-3.3-8b-instruct (score=0.5071; benches=14)	Yes	Source
IBM	Granite 3.0 8B Instruct	-	69.02%	inferred high-confidence family alias from granite-3.3-8b-instruct (score=0.4911; benches=14)	Yes	Source
IBM	Granite Guardian 3.0 8B	-	69.02%	inferred family alias from granite-3.3-8b-instruct (score=0.4062; benches=14)	Yes	Source
IBM	Granite Speech 3.2 8B	-	69.02%	inferred family alias from granite-3.3-8b-instruct (score=0.4062; benches=14)	Yes	Source
IBM	Granite Speech 3.3 8B	-	69.02%	inferred high-confidence family alias from granite-3.3-8b-instruct (score=0.5071; benches=14)	Yes	Source
IBM	Granite 3.3 2B Instruct	16 Apr 2025	69.02%	inferred family alias from granite-3.3-8b-instruct (score=0.3627; benches=14)	Yes	Source
