Architecture
Every API call flows through a three-stage pipeline (Ingest, Route, Return) with eight shipped systems operating in parallel. Routing overhead stays under 5 ms, measured and returned on every response.
Your application sends a standard OpenAI- or Anthropic-compatible request. BrainstormRouter processes it through three stages before returning the response.
Stage 1: Ingest. Four subsystems activate in sequence:
Authentication & rate limiting. API key validation, tenant identification, and rate limit enforcement. Requests are scoped to your tenant — models, cache, memory, and billing are isolated.
Virtual Key Vault. Provider API keys are decrypted from AES-256-GCM storage. Per-key budget ceilings are checked before the request proceeds.
Guardrail pre-scan. Request content is scanned for PII, sensitive patterns, or custom rules. Blocked requests return immediately with structured error responses including sanitized snippets.
Semantic cache lookup. The request is vectorized and compared against your tenant's cache using pgvector HNSW indexes. Cache hits return in under 5ms, skipping Stages 2 and 3 entirely.
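The Stage 1 cache lookup can be sketched in plain Python. This is an illustration, not the shipped implementation: the real system queries pgvector HNSW indexes server-side, while this sketch does a brute-force cosine scan with a hypothetical 0.95 similarity threshold.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cache_lookup(query_vec, cache, threshold=0.95):
    # Return the cached response most similar to the query,
    # or None when nothing clears the threshold (a cache miss).
    best_score, best_response = 0.0, None
    for entry in cache:
        score = cosine(query_vec, entry["embedding"])
        if score > best_score:
            best_score, best_response = score, entry["response"]
    return best_response if best_score >= threshold else None
```

On a hit the pipeline returns the cached response immediately; on a miss the request continues to Stage 2.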
Stage 2: Route. Four subsystems select a model and dispatch the request:
Thompson Sampling selection. The router selects the best model based on historical performance data. Each model maintains a Bayesian posterior distribution over reward. The router samples from these distributions, balancing exploitation of known-good models with exploration of potentially better alternatives.
CAF identity check. The agent's cryptographic certificate is validated. SPIFFE ID and RBAC permissions are verified. If the certificate is expired or revoked, the request is rejected before it reaches a provider.
ARM budget check. The agent's budget profile is checked. If the remaining budget is below the downgrade threshold, ARM auto-selects a cheaper model. If the budget is exhausted, the request is rejected with a structured error.
Provider dispatch. The request dispatches to the chosen provider via the appropriate adapter. Circuit breakers verify provider health. If the circuit is open, the request fails over to the next-best model automatically.
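The ARM budget check and circuit-breaker failover described above can be sketched as follows. The model tiers, per-call costs, and threshold rule are hypothetical placeholders, not BrainstormRouter's actual pricing tables:

```python
MODEL_TIERS = ["large", "medium", "small"]          # hypothetical, cheapest last
COST = {"large": 3.0, "medium": 1.0, "small": 0.2}  # illustrative per-call cost

def pick_model(remaining_budget, downgrade_threshold, open_circuits=frozenset()):
    # Exhausted budget: reject with an error before any provider is called.
    if remaining_budget <= 0:
        raise RuntimeError("budget exhausted")
    # Below the downgrade threshold, restrict to models we can still afford.
    candidates = [
        m for m in MODEL_TIERS
        if remaining_budget >= downgrade_threshold or COST[m] <= remaining_budget
    ]
    # Skip providers whose circuit breaker is open; fail over to the next best.
    for model in candidates:
        if model not in open_circuits:
            return model
    raise RuntimeError("no healthy provider available")
```

With these toy numbers, `pick_model(10.0, 2.0)` dispatches to the top tier, while `pick_model(0.5, 2.0)` auto-downgrades to the cheapest model that fits the remaining budget.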
Stage 3: Return. Four subsystems process the response on its way back:
Streaming Firewall. The 7-check pipeline intercepts tokens in a sliding-window buffer. PII, governance violations, or blocked content triggers stream severing: the model keeps generating, but your client stops receiving. The agent is automatically quarantined and a SIEM alert is emitted.
Guardian Intelligence. Cost, latency, token counts, and routing metadata are recorded. Guardian calculates efficiency scores, tracks budget velocity, and identifies waste patterns. All data is available via response headers and the Insights API.
RMM Store. Relevant context from the conversation is stored in the Relational Memory Manager. pgvector-powered similarity search makes this context available for future sessions. Memory is tenant-isolated and session-scoped.
Observability + Router Update. The Thompson Sampling posterior updates based on response quality. Semantic cache stores the response. The router learns from every request — making the next one smarter.
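The Streaming Firewall's sliding-window severing can be sketched as a generator. A single SSN-style regex stands in for the full 7-check pipeline here, quarantine and SIEM alerting are omitted, and the window size is an arbitrary assumption:

```python
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # one illustrative PII rule

def firewall(tokens, window=8):
    # Hold back a sliding window of tokens; release the oldest token only
    # after the window has been scanned clean. On a match, stop yielding:
    # the model may keep generating, but the client stops receiving.
    buf = []
    for tok in tokens:
        buf.append(tok)
        if SSN.search("".join(buf)):
            return  # sever the stream before the match reaches the client
        if len(buf) > window:
            yield buf.pop(0)
    yield from buf  # stream ended clean; flush the held-back tail
```

Buffering before release means a match is caught before any part of the flagged window leaves the gateway; matches longer than the window are the trade-off of this simplified sketch.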
Thompson Sampling is a Bayesian approach to the multi-armed bandit problem. For model routing, the "arms" are available models and the "reward" is a composite score of response quality, cost, and latency.
When a new model appears or workload patterns shift, Thompson Sampling explores naturally: an uncertain posterior distribution is wide, so samples frequently land high, triggering exploration without explicit rules.
As confidence grows (narrow posterior), the router exploits known-good models for each task type. A model that consistently delivers high quality at low cost gets the lion's share of traffic for that category.
Provider pricing changes. Model quality drifts. New models launch. Thompson Sampling adapts to all of these without manual intervention — the posterior updates on every request, tracking the live landscape.
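The behavior described above can be sketched with a Beta-Bernoulli bandit. This is a minimal illustration assuming a binary success signal; the shipped router scores a composite of quality, cost, and latency rather than 0/1 reward:

```python
import random

class ThompsonRouter:
    def __init__(self, models):
        # Beta(1, 1) prior for each arm: wide and uncertain, so a newly
        # added model is explored automatically.
        self.posterior = {m: [1, 1] for m in models}

    def select(self):
        # Draw one sample from every posterior; route to the highest draw.
        draws = {m: random.betavariate(a, b)
                 for m, (a, b) in self.posterior.items()}
        return max(draws, key=draws.get)

    def update(self, model, success):
        # Bayesian update: success increments alpha, failure increments beta.
        a, b = self.posterior[model]
        self.posterior[model] = [a + success, b + (1 - success)]
```

As one arm accumulates successes its posterior narrows around a high mean and wins most draws; an arm that keeps failing concentrates around a low mean and loses traffic, which is exactly the exploit/explore balance in the table below.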
| Approach | Routing | Adapts? | Learns from outcomes? |
|---|---|---|---|
| Manual rules | "If task=X, use model=Y" | No | No |
| Round-robin | Rotate evenly | No | No |
| Epsilon-greedy | Best 90%, random 10% | Slowly | Partially |
| Thompson Sampling | Bayesian posterior | Continuously | Every request |
Ready to route?
Every response includes full routing metadata. Watch the router learn your workload in real time.