Research Project: this is a free AI research project. No warranties, SLAs, or company associations.

Request Pipeline

Your application sends a standard OpenAI or Anthropic-compatible request. BrainstormRouter processes it through three stages before returning the response.

[Pipeline diagram. A POST /v1/chat/completions request with model "auto" flows through Stage 1, Ingest (auth & rate limit, key vault decrypt, guardrail pre-scan, semantic cache lookup; a cache hit returns immediately), Stage 2, Route (Thompson Sampling selection, CAF identity check, ARM budget check, provider dispatch with circuit-breaker failover), and Stage 3, Return (streaming firewall, Guardian output, RMM store, observability). The response carries Guardian headers, cache metadata, and RMM context.]
Stage 1

Ingest & Classify

Your application sends a request. Four subsystems activate in sequence:

Authentication & rate limiting. API key validation, tenant identification, and rate limit enforcement. Requests are scoped to your tenant — models, cache, memory, and billing are isolated.
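
Per-tenant rate limiting of this kind is commonly implemented as a token bucket. A minimal sketch, assuming a token-bucket design with illustrative class and parameter names (not BrainstormRouter's actual API):

```python
import time

class TokenBucket:
    """Per-tenant rate limiter: `rate` tokens/second, burst up to `capacity`."""
    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

buckets = {}  # tenant_id -> TokenBucket, one bucket per tenant for isolation

def check_rate_limit(tenant_id: str) -> bool:
    bucket = buckets.setdefault(tenant_id, TokenBucket(rate=10.0, capacity=5))
    return bucket.allow()
```

Keying the bucket map by tenant ID is what gives the tenant-scoped enforcement described above.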

Virtual Key Vault. Provider API keys are decrypted from AES-256-GCM storage. Per-key budget ceilings are checked before the request proceeds.

Guardrail pre-scan. Request content is scanned for PII, sensitive patterns, or custom rules. Blocked requests return immediately with structured error responses including sanitized snippets.

Semantic cache lookup. The request is vectorized and compared against your tenant's cache using pgvector HNSW indexes. Cache hits return in under 5ms, skipping Stages 2 and 3 entirely.

[Flow diagram. Three ingest outcomes: a cache hit returns on the fast path (~4ms, X-BR-Cache: hit, skipping Stages 2 and 3); a cache miss that passes the guardrail continues to Stage 2 with X-BR-Cache: miss; a guardrail block returns an immediate HTTP 400 with a structured "guardrail_violation" error naming the scanner (e.g. "pii_scan"), detail (e.g. "SSN detected"), and action ("blocked").]
Stage 2

Route & Execute

Thompson Sampling selection. The router selects the best model based on historical performance data. Each model maintains a Bayesian posterior distribution over reward. The router samples from these distributions, balancing exploitation of known-good models with exploration of potentially better alternatives.

CAF identity check. The agent's cryptographic certificate is validated. SPIFFE ID and RBAC permissions are verified. If the certificate is expired or revoked, the request is rejected before it reaches a provider.
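
The real check validates X.509 certificates and SPIFFE IDs; this sketch shows only the expiry-and-revocation gate, with an in-memory set standing in for whatever revocation mechanism CAF actually uses (an assumption):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Illustrative revocation list stand-in.
REVOKED_SPIFFE_IDS = {"spiffe://example.org/agent/compromised"}

@dataclass
class AgentCert:
    spiffe_id: str
    not_after: datetime

def verify_identity(cert: AgentCert) -> bool:
    """Reject before provider dispatch if the cert is expired or revoked."""
    if cert.not_after <= datetime.now(timezone.utc):
        return False  # expired
    if cert.spiffe_id in REVOKED_SPIFFE_IDS:
        return False  # revoked
    return True
```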

ARM budget check. The agent's budget profile is checked. If the remaining budget is below the downgrade threshold, ARM auto-selects a cheaper model. If the budget is exhausted, the request is rejected with a structured error.
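
The two outcomes, downgrade and reject, can be sketched as a simple gate. The threshold value and the fallback model name here are illustrative assumptions:

```python
def arm_check(remaining_budget: float, est_cost: float, requested_model: str,
              downgrade_threshold: float = 0.10,
              cheap_model: str = "gpt-4o-mini") -> str:
    """Illustrative ARM gate: reject on exhaustion, downgrade when budget runs low."""
    if remaining_budget <= 0 or est_cost > remaining_budget:
        # Maps to the structured budget-exhausted error in the real pipeline.
        raise RuntimeError("budget_exhausted")
    if remaining_budget < downgrade_threshold:
        return cheap_model  # auto-select a cheaper model
    return requested_model
```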

Provider dispatch. The request is dispatched to the chosen provider via the appropriate adapter. Circuit breakers track provider health. If the circuit is open, the request fails over to the next-best model automatically.
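
The failover logic amounts to walking the router's preference order and skipping providers whose circuit is open. A minimal sketch (the circuit-state table is illustrative):

```python
# Illustrative circuit-breaker state per provider.
CIRCUITS = {"anthropic": "closed", "openai": "closed", "google": "open"}

def dispatch(ranked_models: list[tuple[str, str]]) -> str:
    """Try (model, provider) pairs in router-preference order; skip open circuits."""
    for model, provider in ranked_models:
        if CIRCUITS.get(provider, "closed") != "open":
            return model  # circuit closed (or half-open): send the request here
    raise RuntimeError("all_providers_unavailable")
```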

[Leaderboard panel. /v1/models/leaderboard ranks models by Thompson Sampling reward with per-model quality, latency, success rate, sample count, and posterior variance (e.g. claude-haiku-4.5: reward 0.847, quality 0.94, latency 312ms, success 99.7%), alongside per-provider circuit-breaker state (closed, open, half-open). Posteriors update on every request; the UCB1 exploration bonus decays with √(2·ln(N)/n).]
Stage 3

Return & Learn

Streaming Firewall. The 7-check pipeline intercepts tokens in a sliding window buffer. PII, governance violations, or blocked content triggers stream severing — the model keeps generating, but your client stops receiving. The agent is automatically quarantined and a SIEM alert is emitted.
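 
The sliding-window buffer is what lets the firewall catch patterns that span token boundaries. A minimal sketch of one check, assuming a character-based window (the window size and pattern here are illustrative; a real firewall would also hold back recently forwarded text rather than only stopping future tokens):

```python
import re

PII_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # illustrative SSN-shaped check
WINDOW = 64  # sliding-window size in characters (assumed)

def firewall_stream(tokens):
    """Forward tokens until a window-spanning match appears, then sever the stream."""
    buffer = ""
    for token in tokens:
        buffer = (buffer + token)[-WINDOW:]  # keep a window across token edges
        if PII_PATTERN.search(buffer):
            # Sever: the client stops receiving; the model may keep generating.
            return
        yield token
```

Note that "123-", "45-", "6789" individually look harmless; only the window sees the assembled SSN.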

Guardian Intelligence. Cost, latency, token counts, and routing metadata are recorded. Guardian calculates efficiency scores, tracks budget velocity, and identifies waste patterns. All data is available via response headers and the Insights API.

RMM Store. Relevant context from the conversation is stored in the Relational Memory Manager. pgvector-powered similarity search makes this context available for future sessions. Memory is tenant-isolated and session-scoped.

Observability + Router Update. The Thompson Sampling posterior updates based on response quality. Semantic cache stores the response. The router learns from every request — making the next one smarter.
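
One way such an update could work, assuming a Beta-Bernoulli reward model where a quality score in [0, 1] is treated as a fractional success (the document does not specify the distribution family):

```python
def update_posterior(posterior: dict, quality: float) -> dict:
    """Beta update: quality adds to alpha (success), its complement to beta (failure)."""
    return {
        "alpha": posterior["alpha"] + quality,
        "beta": posterior["beta"] + (1.0 - quality),
    }
```

Each high-quality response nudges the model's posterior mean upward and narrows it, so the router grows more confident in models that keep delivering.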

[Response-header panel (HTTP 200). Routing: X-BR-Routed-Model, X-BR-Route-Reason, X-BR-Route-Confidence, X-BR-Provider. Cost: X-BR-Estimated-Cost, X-BR-Actual-Cost, X-BR-Efficiency (e.g. predicted $0.0012, spent $0.0011, 9% under budget). Performance: Guardian overhead 3.2ms on 847ms provider latency, 0.38% overhead. Cache: X-BR-Cache, X-BR-Cache-Stored. Security: X-BR-Guardian-Status, X-BR-Guardrail-Action, X-BR-Firewall-Checks. If PII is detected mid-stream, X-BR-Guardrail-Action becomes "truncate", the agent is quarantined, a SIEM event is emitted, and the stream is severed: the model continues generating while the client stops receiving.]

Why Thompson Sampling?

Thompson Sampling is a Bayesian approach to the multi-armed bandit problem. For model routing, the "arms" are available models and the "reward" is a composite score of response quality, cost, and latency.
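
One plausible shape for such a composite reward, with illustrative weights and normalization caps (none of these values are documented; this only shows how quality, cost, and latency could be blended into a single 0-1 score):

```python
def composite_reward(quality: float, cost_usd: float, latency_ms: float,
                     w_quality: float = 0.6, w_cost: float = 0.2, w_latency: float = 0.2,
                     cost_cap: float = 0.01, latency_cap: float = 2000.0) -> float:
    """Blend quality with normalized cost and latency penalties into a 0-1 reward."""
    cost_score = max(0.0, 1.0 - min(cost_usd, cost_cap) / cost_cap)
    latency_score = max(0.0, 1.0 - min(latency_ms, latency_cap) / latency_cap)
    return w_quality * quality + w_cost * cost_score + w_latency * latency_score
```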

Explores Automatically

When a new model appears or workload patterns shift, Thompson Sampling explores naturally: the posterior distribution is wide (uncertain), so samples frequently land high, triggering exploration without explicit rules.

Exploits Efficiently

As confidence grows (narrow posterior), the router exploits known-good models for each task type. A model that consistently delivers high quality at low cost gets the lion's share of traffic for that category.

Adapts Continuously

Provider pricing changes. Model quality drifts. New models launch. Thompson Sampling adapts to all of these without manual intervention — the posterior updates on every request, tracking the live landscape.

Compared to Alternatives

Approach             Routing                      Adapts?        Learns from outcomes?
Manual rules         "If task=X, use model=Y"     No             No
Round-robin          Rotate evenly                No             No
Epsilon-greedy       Best 90%, random 10%         Slowly         Partially
Thompson Sampling    Bayesian posterior           Continuously   Every request

Ready to route?

See the pipeline in action

Every response includes full routing metadata. Watch the router learn your workload in real-time.