Benchmarks are one part of the story. This page shows how deployments actually feel: how fast tokens stream, what latency to expect, what uptime budgets mean in minutes, and how quantization changes performance.
Headline metrics (values rendered live on the page): across balanced FP16 routes · with routing + caching enabled · multi-region gateway SLO.
See how FP32, FP16, and INT8 feel when tokens stream back-to-back.
FP32 fidelity: research-grade output with the highest stability. Best for evals and long prompts.
FP16 balanced: balanced latency and quality. Ideal for chat and product surfaces with steady demand.
INT8 turbo: throughput-first for routing and A/B sweeps. Expect tiny perplexity drift on edge cases.
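To get a feel for what those tiers mean in wall-clock terms, here is a minimal sketch that converts the relative decode speeds quoted in the comparison further down this page (1×, 1.6×, 2.4×) into time-to-stream figures. The 40 tok/s FP32 baseline is a made-up illustration, not a measured number:

```python
# Sketch: how quantization speedups translate into wall-clock streaming time.
# BASE_TOKENS_PER_SEC is a hypothetical FP32 decode rate; the relative
# speedups come from the FP32/FP16/INT8 comparison on this page.

BASE_TOKENS_PER_SEC = 40.0  # hypothetical FP32 baseline
SPEEDUP = {"FP32": 1.0, "FP16": 1.6, "INT8": 2.4}

def stream_seconds(n_tokens: int, precision: str) -> float:
    """Seconds to stream n_tokens back-to-back at the given precision."""
    return n_tokens / (BASE_TOKENS_PER_SEC * SPEEDUP[precision])

for p in SPEEDUP:
    print(f"{p}: 500 tokens in {stream_seconds(500, p):.1f}s")
```

At these assumed rates, the same 500-token reply takes roughly 12.5 s at FP32 but just over 5 s at INT8, which is the difference the streaming demo is meant to make visible.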
Dots show request travel time from client to model and back. Shorter lanes = faster round trips.
Regional failover, p95 < 360ms
p95 < 760ms, burst handling enabled
Extra 400–600ms for cache fill & safety
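Budgets like "p95 < 360ms" are checked against observed round-trip latencies. A minimal nearest-rank percentile sketch in Python (the sample latencies below are fabricated for illustration):

```python
import math

def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of observed latencies."""
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered)) - 1  # nearest-rank, 0-indexed
    return ordered[rank]

# Fabricated round-trip samples in milliseconds.
latencies = [120, 140, 150, 160, 180, 200, 210, 230, 250, 340,
             110, 130, 145, 155, 170, 190, 220, 240, 260, 355]
print(f"p95 = {p95(latencies)} ms, within budget: {p95(latencies) < 360}")
```

Nearest-rank is the simplest estimator; production monitoring stacks usually interpolate or use streaming sketches, but the budget check is the same comparison.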
Availability targets converted into real downtime so you can pick the right SLO for each feature.
Availability target: 99.9% (≈ 8h 45m/year of potential interruption). What it means: chat surfaces where brief brownouts are acceptable. Yearly allowance: ~8.8 hours.
Availability target: 99.95% (≈ 4h 23m/year, typical for production LLM APIs). What it means: customer-facing flows with retries and fallbacks. Yearly allowance: ~4.4 hours.
Availability target: 99.99% (≈ 52m/year, continuous delivery with redundancy). What it means: billing, assistants, gateways with multi-region quorum. Yearly allowance: ~0.9 hours.
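These downtime budgets are straight arithmetic on the availability target (the figures above correspond to the standard 99.9/99.95/99.99% tiers). A minimal sketch, assuming a 365-day (8760-hour) year:

```python
# Sketch: converting an availability target into a yearly downtime budget.
HOURS_PER_YEAR = 8760  # 365 days

def downtime_budget_hours(availability_pct: float) -> float:
    """Hours of allowable downtime per year at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

for target in (99.9, 99.95, 99.99):
    h = downtime_budget_hours(target)
    print(f"{target}% -> {h:.2f} h/year ({h * 60:.0f} minutes)")
```

Each extra "nine" divides the budget by ten, which is why 99.99% effectively rules out any single-region maintenance window.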
How model size, speed, and quality shift across FP32, FP16, and INT8.
FP32 (no quantization): model size 100%, relative speed 1×. Quality shift: baseline perplexity, best calibration for evals. Best for: safety reviews, eval harnesses, research reproduction.
FP16: model size ≈ 50%, relative speed 1.6×. Quality shift: identical vocab coverage with half the memory footprint, minimal drift. Best for: interactive apps, streaming chat, steady workloads.
INT8: model size ≈ 25%, relative speed 2.4×. Quality shift: small perplexity lift on rare tokens, big gains in tokens/sec and cold starts. Best for: batch inference, agent inner loops, rapid drafts.
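The size column follows directly from bytes per parameter: 4 for FP32, 2 for FP16, 1 for INT8. A minimal sketch for a hypothetical 7B-parameter model, ignoring per-checkpoint overhead such as embeddings kept at higher precision:

```python
# Sketch: why INT8 is ~25% of FP32 size -- bytes per parameter (4, 2, 1)
# for a hypothetical 7B-parameter checkpoint.

BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1}
N_PARAMS = 7_000_000_000  # hypothetical 7B model

for precision, nbytes in BYTES_PER_PARAM.items():
    gb = N_PARAMS * nbytes / 1e9
    rel = nbytes / BYTES_PER_PARAM["FP32"] * 100
    print(f"{precision}: {gb:.0f} GB ({rel:.0f}% of FP32)")
```

The memory saving is also what improves cold starts: a 7 GB INT8 checkpoint loads from disk and into VRAM far faster than its 28 GB FP32 counterpart.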