The cache, proving itself.

This is the real gateway, not a mockup. A paraphrase reuses a cached answer at $0; a near-identical question with the opposite answer embeds just as close, but the intent guard turns it into a correct miss instead of a confident wrong answer — the failure mode the commercial tier never measures. Flip the outage switch to see pre-first-token failover. Every number below is measured, live.

Live metrics · /api/metrics

…

hit rate: —
exact · semantic: —
guard blocks: —
$ saved: —
TTFT p50/p95: —
rescued: —

Guided tour — click in order

Each “prime” caches an answer the next step tries to reuse. Watch the verdict on every turn below.

Reuse — the savings

Correctness — the cheap catch

Correctness — the subtle catch

Resilience — the rescue

Next up: A cold question: nothing to reuse, so it streams from the model (a MISS).

Start the guided tour, or ask a question.

Public demo over cheap models (Claude Haiku + gpt-4o-mini), IP rate-limited. The cache, metrics, and rate limiter are per-instance and in-memory, so figures reflect this server’s recent traffic. A mid-stream provider failure (after the first token) is surfaced as a partial answer plus an error, never a faked seamless retry — once headers commit, clean failover is impossible, and the demo doesn’t pretend otherwise.