Benchmarks are one part of the story. This page shows how deployments actually feel: how fast tokens stream, what latency to expect, what uptime budgets mean in minutes, and how quantization changes performance.
Headline metrics (values rendered live on the page): across balanced FP16 routes · with routing + caching enabled · multi-region gateway SLO.
See how FP32, FP16, and INT8 feel when tokens stream back-to-back.
FP32 fidelity: research-grade output with the highest stability. Best for evals and long prompts.
FP16 balanced: balanced latency and quality. Ideal for chat and product surfaces with steady demand.
INT8 turbo: throughput-first for routing and A/B sweeps. Expect tiny perplexity drift on edge cases.
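To get a feel for what those tiers mean in wall-clock terms, here is a minimal sketch that converts the relative decode speeds quoted in the comparison further down this page (1×, 1.6×, 2.4×) into time-to-stream figures. The 40 tok/s FP32 baseline is a made-up illustration, not a measured number:

```python
# Sketch: how quantization speedups translate into wall-clock streaming time.
# BASE_TOKENS_PER_SEC is a hypothetical FP32 decode rate; the relative
# speedups come from the FP32/FP16/INT8 comparison on this page.

BASE_TOKENS_PER_SEC = 40.0  # hypothetical FP32 baseline
SPEEDUP = {"FP32": 1.0, "FP16": 1.6, "INT8": 2.4}

def stream_seconds(n_tokens: int, precision: str) -> float:
    """Seconds to stream n_tokens back-to-back at the given precision."""
    return n_tokens / (BASE_TOKENS_PER_SEC * SPEEDUP[precision])

for p in SPEEDUP:
    print(f"{p}: 500 tokens in {stream_seconds(500, p):.1f}s")
```

At these assumed rates, the same 500-token reply takes roughly 12.5 s at FP32 but just over 5 s at INT8, which is the difference the streaming demo is meant to make visible.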
Dots show request travel time from client to model and back. Shorter lanes = faster round trips.
Regional failover, p95 < 360ms
p95 < 760ms, burst handling enabled
Extra 400–600ms for cache fill & safety
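Budgets like "p95 < 360ms" are checked against observed round-trip latencies. A minimal nearest-rank percentile sketch in Python (the sample latencies below are fabricated for illustration):

```python
import math

def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of observed latencies."""
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered)) - 1  # nearest-rank, 0-indexed
    return ordered[rank]

# Fabricated round-trip samples in milliseconds.
latencies = [120, 140, 150, 160, 180, 200, 210, 230, 250, 340,
             110, 130, 145, 155, 170, 190, 220, 240, 260, 355]
print(f"p95 = {p95(latencies)} ms, within budget: {p95(latencies) < 360}")
```

Nearest-rank is the simplest estimator; production monitoring stacks usually interpolate or use streaming sketches, but the budget check is the same comparison.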
Availability targets converted into real downtime so you can pick the right SLO for each feature.
Availability target: 99.9% (≈ 8h 45m/year of potential interruption). What it means: chat surfaces where brief brownouts are acceptable. Yearly allowance: ~8.8 hours.
Availability target: 99.95% (≈ 4h 23m/year, typical for production LLM APIs). What it means: customer-facing flows with retries and fallbacks. Yearly allowance: ~4.4 hours.
Availability target: 99.99% (≈ 52m/year, continuous delivery with redundancy). What it means: billing, assistants, gateways with multi-region quorum. Yearly allowance: ~0.9 hours.
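These downtime budgets are straight arithmetic on the availability target (the figures above correspond to the standard 99.9/99.95/99.99% tiers). A minimal sketch, assuming a 365-day (8760-hour) year:

```python
# Sketch: converting an availability target into a yearly downtime budget.
HOURS_PER_YEAR = 8760  # 365 days

def downtime_budget_hours(availability_pct: float) -> float:
    """Hours of allowable downtime per year at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

for target in (99.9, 99.95, 99.99):
    h = downtime_budget_hours(target)
    print(f"{target}% -> {h:.2f} h/year ({h * 60:.0f} minutes)")
```

Each extra "nine" divides the budget by ten, which is why 99.99% effectively rules out any single-region maintenance window.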
How model size, speed, and quality shift across FP32, FP16, and INT8.
FP32 (no quantization): model size 100%, relative speed 1×. Quality shift: baseline perplexity, best calibration for evals. Best for: safety reviews, eval harnesses, research reproduction.
FP16: model size ≈ 50%, relative speed 1.6×. Quality shift: identical vocab coverage with half the memory footprint, minimal drift. Best for: interactive apps, streaming chat, steady workloads.
INT8: model size ≈ 25%, relative speed 2.4×. Quality shift: small perplexity lift on rare tokens, big gains in tokens/sec and cold starts. Best for: batch inference, agent inner loops, rapid drafts.
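The size column follows directly from bytes per parameter: 4 for FP32, 2 for FP16, 1 for INT8. A minimal sketch for a hypothetical 7B-parameter model, ignoring per-checkpoint overhead such as embeddings kept at higher precision:

```python
# Sketch: why INT8 is ~25% of FP32 size -- bytes per parameter (4, 2, 1)
# for a hypothetical 7B-parameter checkpoint.

BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1}
N_PARAMS = 7_000_000_000  # hypothetical 7B model

for precision, nbytes in BYTES_PER_PARAM.items():
    gb = N_PARAMS * nbytes / 1e9
    rel = nbytes / BYTES_PER_PARAM["FP32"] * 100
    print(f"{precision}: {gb:.0f} GB ({rel:.0f}% of FP32)")
```

The memory saving is also what improves cold starts: a 7 GB INT8 checkpoint loads from disk and into VRAM far faster than its 28 GB FP32 counterpart.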