The cache, proving itself.
This is the real gateway, not a mockup. A paraphrase reuses a cached answer at $0; a near-identical question with the opposite answer embeds just as close, but the intent guard turns it into a correct miss instead of a confident wrong answer — the failure mode the commercial tier never measures. Flip the outage switch to see pre-first-token failover. Every number below is measured, live.
Live metrics · /api/metrics
…- hit rate
- —
- exact · semantic
- —
- guard blocks
- —
- $ saved
- —
- TTFT p50/p95
- —
- rescued
- —
Guided tour — click in order
Each “prime” caches an answer the next step tries to reuse. Watch the verdict on every turn below.
Reuse — the savings
Correctness — the cheap catch
Correctness — the subtle catch
Resilience — the rescue
Next up: A cold question: nothing to reuse, so it streams from the model (a MISS).
Start the guided tour, or ask a question.
Public demo over cheap models (Claude Haiku + gpt-4o-mini), IP rate-limited. The cache, metrics, and rate limiter are per-instance and in-memory, so figures reflect this server’s recent traffic. A mid-stream provider failure (after the first token) is surfaced as a partial answer plus an error, never a faked seamless retry — once headers commit, clean failover is impossible, and the demo doesn’t pretend otherwise.