What we store
Each benchmark result is stored with a benchmark identifier, a score, optional variant metadata, source links, and self-reported flags where relevant.
Benchmarks also carry category and sort-direction metadata so AI Stats can distinguish measures where higher is better from measures where lower is better.
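As a rough illustration of the fields described above, a stored result and its benchmark metadata could be modelled as below. This is a minimal sketch, not the actual AI Stats schema: every class and field name here (`Benchmark`, `BenchmarkResult`, `SortDirection`, `model`, and so on) is hypothetical and chosen only to mirror the prose.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class SortDirection(Enum):
    """Hypothetical encoding of a benchmark's ordering direction."""
    HIGHER_IS_BETTER = "higher_is_better"
    LOWER_IS_BETTER = "lower_is_better"


@dataclass(frozen=True)
class Benchmark:
    """Benchmark-level metadata (names are assumptions, not the real schema)."""
    benchmark_id: str
    category: str
    sort_direction: SortDirection


@dataclass(frozen=True)
class BenchmarkResult:
    """One stored result row (names are assumptions, not the real schema)."""
    benchmark_id: str
    model: str                          # hypothetical: which model the score belongs to
    score: float
    variant: Optional[str] = None       # optional variant metadata
    source_links: Tuple[str, ...] = ()  # where the score was published
    self_reported: bool = False         # set when the provider reported the score itself
```

Keeping the sort direction on the benchmark rather than on each result means every row for that benchmark is ordered the same way.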
What normalisation means here
Normalisation in AI Stats does not mean converting all benchmarks into one universal score. It means preserving enough context to present each benchmark consistently and sort it correctly.
When benchmarks use different variants, prompts, or evaluation protocols, AI Stats keeps those differences visible instead of flattening them into a single synthetic ranking.
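One way to keep variants visible is to group results by benchmark and variant together, so that scores produced under different protocols never share a list. The sketch below assumes results are plain dictionaries with hypothetical `benchmark_id`, `variant`, and `score` keys. It illustrates the idea only and does not describe how AI Stats implements it.

```python
from collections import defaultdict


def group_by_variant(results):
    """Group result rows by (benchmark_id, variant).

    Rows without a variant are grouped under (benchmark_id, None), so they
    stay separate from rows that declare one.
    """
    groups = defaultdict(list)
    for row in results:
        key = (row["benchmark_id"], row.get("variant"))
        groups[key].append(row)
    return dict(groups)


# Hypothetical rows: the same benchmark evaluated under two variants.
rows = [
    {"benchmark_id": "example-bench", "variant": "0-shot", "score": 61.0},
    {"benchmark_id": "example-bench", "variant": "5-shot", "score": 70.5},
    {"benchmark_id": "example-bench", "variant": "0-shot", "score": 58.2},
]
grouped = group_by_variant(rows)
```

Here `grouped` has two keys, one per variant, so a table can be rendered for each group instead of one merged ranking.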
Ranking logic
For benchmark tables, AI Stats respects the benchmark's declared ordering direction. A lower score can therefore rank above a higher one when the benchmark measures error rate, latency, or another lower-is-better quantity.
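The direction-aware ordering described above can be sketched as a single sort whose direction depends on the benchmark. The function name `rank_rows`, the `lower_is_better` parameter, and the row keys are all assumptions made for this example. They are not taken from the AI Stats codebase.

```python
def rank_rows(rows, lower_is_better):
    """Return rows ordered best-first for one benchmark.

    For lower-is-better benchmarks (error rate, latency) the smallest score
    comes first. Otherwise the largest score comes first.
    """
    return sorted(rows, key=lambda row: row["score"], reverse=not lower_is_better)


# Hypothetical error-rate rows: the lower score should rank first.
error_rates = [
    {"model": "model-a", "score": 0.12},
    {"model": "model-b", "score": 0.07},
]
ranked = rank_rows(error_rates, lower_is_better=True)
```

With these rows, `ranked[0]` is the `model-b` row, because 0.07 is the better error rate. Passing `lower_is_better=False` would put `model-a` first.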
If a score cannot be verified or lacks enough context, it may still be stored with source notes, but it should not be treated as equivalent to a fully verified leaderboard row.
Source quality and disclosure
Benchmark results may come from provider disclosures, benchmark organisers, or other public sources. AI Stats retains source links so readers can inspect the origin of a result.
Self-reported scores are flagged as such where the source format supports it. That flag is a transparency signal, not an automatic rejection of the score.
Caveats
Benchmark scores are one input to model selection, not the full answer. Real production fit also depends on cost, latency, reliability, modality support, and tooling constraints.
Two models with similar benchmark numbers may still behave very differently on your own tasks, especially when prompt style or context length changes.