How AI Stats normalises AI benchmarks

What we store

Each benchmark result is stored with a benchmark identifier, a score, optional variant metadata, source links, and self-reported flags where relevant.

Benchmarks also carry category and sort-direction metadata so AI Stats can distinguish measures where higher is better from measures where lower is better.
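As an illustration of the fields described above, the records could be modelled roughly as follows. This is a minimal sketch, not the actual AI Stats schema: the class names, field names, and the `model` field are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class SortDirection(Enum):
    """Whether a larger or a smaller score counts as better."""
    HIGHER_IS_BETTER = "higher_is_better"
    LOWER_IS_BETTER = "lower_is_better"


@dataclass(frozen=True)
class Benchmark:
    """Benchmark-level metadata (hypothetical field names)."""
    benchmark_id: str
    category: str
    sort_direction: SortDirection


@dataclass(frozen=True)
class BenchmarkResult:
    """One stored result (hypothetical field names)."""
    benchmark_id: str
    model: str                            # assumption: not listed in the text above
    score: float
    variant: Optional[str] = None         # optional variant metadata
    source_links: Tuple[str, ...] = ()    # where the score was published
    self_reported: bool = False           # set when the provider reported the score
```

Keeping the sort direction on the benchmark rather than on each result means every result for that benchmark is ordered the same way.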

What normalisation means here

Normalisation in AI Stats does not mean converting all benchmarks into one universal score. It means preserving enough context to present each benchmark consistently and sort it correctly.

When benchmarks use different variants, prompts, or evaluation protocols, AI Stats keeps those differences visible instead of flattening them into a single synthetic ranking.
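One way to keep variants visible is to group results by benchmark and variant together, so that scores from different protocols never land in the same list. The sketch below assumes a hypothetical `Row` shape and made-up variant labels; it is not the AI Stats implementation.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional, Tuple


class Row(NamedTuple):
    """Hypothetical result row used only for this example."""
    model: str
    benchmark_id: str
    variant: Optional[str]
    score: float


def group_by_variant(rows: List[Row]) -> Dict[Tuple[str, Optional[str]], List[Row]]:
    """Group rows by (benchmark_id, variant) so variants are never merged."""
    groups: Dict[Tuple[str, Optional[str]], List[Row]] = defaultdict(list)
    for row in rows:
        groups[(row.benchmark_id, row.variant)].append(row)
    return dict(groups)


rows = [
    Row("model-a", "example-bench", "0-shot", 61.0),
    Row("model-b", "example-bench", "5-shot", 68.0),
    Row("model-c", "example-bench", "0-shot", 64.5),
]
groups = group_by_variant(rows)
```

Here the two 0-shot rows share a group and the 5-shot row sits in its own, so a reader can see that the scores are not directly comparable.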

Ranking logic

For benchmark tables, AI Stats respects the benchmark's declared ordering direction. A lower score can therefore rank above a higher one when the benchmark measures error rate, latency, or another lower-is-better quantity.
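A direction-aware sort can be sketched in a few lines. The function name and the `lower_is_better` parameter are assumptions for illustration, and the scores are made-up values.

```python
from typing import List, Tuple


def rank_scores(scores: List[Tuple[str, float]], lower_is_better: bool) -> List[Tuple[str, float]]:
    """Return (model, score) pairs ordered best-first for the given direction."""
    # Ascending order puts the smallest score first, which is best when lower is better;
    # otherwise reverse the order so the largest score comes first.
    return sorted(scores, key=lambda item: item[1], reverse=not lower_is_better)


# Error rate: lower is better, so 0.08 ranks above 0.12.
error_rates = rank_scores([("model-a", 0.12), ("model-b", 0.08)], lower_is_better=True)

# Accuracy: higher is better, so 80.0 ranks above 70.0.
accuracies = rank_scores([("model-a", 70.0), ("model-b", 80.0)], lower_is_better=False)
```

In both calls `model-b` comes first, even though its raw number is the smaller one in the error-rate case.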

If a score cannot be verified or lacks enough context, it may still be stored with source notes but should not be interpreted as equal to a fully verified leaderboard row.

Source quality and disclosure

Benchmark results may come from provider disclosures, benchmark organisers, or other public sources. AI Stats retains source links so readers can inspect the origin of a result.

Self-reported scores are flagged as such where the source format supports it. That flag is a transparency signal, not an automatic rejection of the score.
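The idea that the flag annotates a score rather than filtering it out can be shown with a small rendering sketch. The `Result` shape, the label wording, and the example URLs are hypothetical and chosen only for this example.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    """Hypothetical result row used only for this example."""
    model: str
    score: float
    source_url: str
    self_reported: bool


def display_rows(results: List[Result]) -> List[str]:
    """Render every result, marking self-reported ones instead of dropping them."""
    rows = []
    for r in results:
        label = f"{r.model}: {r.score}"
        if r.self_reported:
            label += " (self-reported)"
        rows.append(f"{label} [source: {r.source_url}]")
    return rows


rendered = display_rows([
    Result("model-a", 71.5, "https://example.com/provider-report", True),
    Result("model-b", 69.0, "https://example.com/leaderboard", False),
])
```

Both rows are rendered with their source link; only the first carries the self-reported marker.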

Caveats

Benchmark scores are one input to model selection, not the full answer. Real production fit also depends on cost, latency, reliability, modality support, and tooling constraints.

Two models with similar benchmark numbers may still behave very differently in your own tasks, especially when prompt style or context length changes.
