Architecture
Every API call flows through a three-stage pipeline (Ingest, Route, Return) with eight shipped systems operating in parallel. Routing overhead stays under 5 ms, measured and returned on every response.
Your application sends a standard OpenAI- or Anthropic-compatible request. BrainstormRouter processes it through three stages before returning the response.
Stage 1: Ingest. Four subsystems activate in sequence:
Authentication & rate limiting. API key validation, tenant identification, and rate limit enforcement. Requests are scoped to your tenant — models, cache, memory, and billing are isolated.
Virtual Key Vault. Provider API keys are decrypted from AES-256-GCM storage. Per-key budget ceilings are checked before the request proceeds.
Guardrail pre-scan. Request content is scanned for PII, sensitive patterns, or custom rules. Blocked requests return immediately with structured error responses including sanitized snippets.
Semantic cache lookup. The request is vectorized and compared against your tenant's cache using pgvector HNSW indexes. Cache hits return in under 5ms, skipping Stages 2 and 3 entirely.
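The Stage 1 cache lookup can be sketched in plain Python. This is an illustration, not the shipped implementation: the real system queries pgvector HNSW indexes server-side, while this sketch does a brute-force cosine scan with a hypothetical 0.95 similarity threshold.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cache_lookup(query_vec, cache, threshold=0.95):
    # Return the cached response most similar to the query,
    # or None when nothing clears the threshold (a cache miss).
    best_score, best_response = 0.0, None
    for entry in cache:
        score = cosine(query_vec, entry["embedding"])
        if score > best_score:
            best_score, best_response = score, entry["response"]
    return best_response if best_score >= threshold else None
```

On a hit the pipeline returns the cached response immediately; on a miss the request continues to Stage 2.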
Stage 2: Route. Four subsystems select a model and dispatch the request:
Thompson Sampling selection. The router selects the best model based on historical performance data. Each model maintains a Bayesian posterior distribution over reward. The router samples from these distributions, balancing exploitation of known-good models with exploration of potentially better alternatives.
CAF identity check. The agent's cryptographic certificate is validated. SPIFFE ID and RBAC permissions are verified. If the certificate is expired or revoked, the request is rejected before it reaches a provider.
ARM budget check. The agent's budget profile is checked. If the remaining budget is below the downgrade threshold, ARM auto-selects a cheaper model. If the budget is exhausted, the request is rejected with a structured error.
Provider dispatch. The request dispatches to the chosen provider via the appropriate adapter. Circuit breakers verify provider health. If the circuit is open, the request fails over to the next-best model automatically.
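The ARM budget check and circuit-breaker failover described above can be sketched as follows. The model tiers, per-call costs, and threshold rule are hypothetical placeholders, not BrainstormRouter's actual pricing tables:

```python
MODEL_TIERS = ["large", "medium", "small"]          # hypothetical, cheapest last
COST = {"large": 3.0, "medium": 1.0, "small": 0.2}  # illustrative per-call cost

def pick_model(remaining_budget, downgrade_threshold, open_circuits=frozenset()):
    # Exhausted budget: reject with an error before any provider is called.
    if remaining_budget <= 0:
        raise RuntimeError("budget exhausted")
    # Below the downgrade threshold, restrict to models we can still afford.
    candidates = [
        m for m in MODEL_TIERS
        if remaining_budget >= downgrade_threshold or COST[m] <= remaining_budget
    ]
    # Skip providers whose circuit breaker is open; fail over to the next best.
    for model in candidates:
        if model not in open_circuits:
            return model
    raise RuntimeError("no healthy provider available")
```

With these toy numbers, `pick_model(10.0, 2.0)` dispatches to the top tier, while `pick_model(0.5, 2.0)` auto-downgrades to the cheapest model that fits the remaining budget.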
Stage 3: Return. Four subsystems process the response on its way back:
Streaming Firewall. The 7-check pipeline intercepts tokens in a sliding-window buffer. PII, governance violations, or blocked content triggers stream severing: the model keeps generating, but your client stops receiving. The agent is automatically quarantined and a SIEM alert is emitted.
Guardian Intelligence. Cost, latency, token counts, and routing metadata are recorded. Guardian calculates efficiency scores, tracks budget velocity, and identifies waste patterns. All data is available via response headers and the Insights API.
RMM Store. Relevant context from the conversation is stored in the Relational Memory Manager. pgvector-powered similarity search makes this context available for future sessions. Memory is tenant-isolated and session-scoped.
Observability + Router Update. The Thompson Sampling posterior updates based on response quality. Semantic cache stores the response. The router learns from every request — making the next one smarter.
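The Streaming Firewall's sliding-window severing can be sketched as a generator. A single SSN-style regex stands in for the full 7-check pipeline here, quarantine and SIEM alerting are omitted, and the window size is an arbitrary assumption:

```python
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # one illustrative PII rule

def firewall(tokens, window=8):
    # Hold back a sliding window of tokens; release the oldest token only
    # after the window has been scanned clean. On a match, stop yielding:
    # the model may keep generating, but the client stops receiving.
    buf = []
    for tok in tokens:
        buf.append(tok)
        if SSN.search("".join(buf)):
            return  # sever the stream before the match reaches the client
        if len(buf) > window:
            yield buf.pop(0)
    yield from buf  # stream ended clean; flush the held-back tail
```

Buffering before release means a match is caught before any part of the flagged window leaves the gateway; matches longer than the window are the trade-off of this simplified sketch.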
Thompson Sampling is a Bayesian approach to the multi-armed bandit problem. For model routing, the "arms" are available models and the "reward" is a composite score of response quality, cost, and latency.
When a new model appears or workload patterns shift, Thompson Sampling explores naturally: an uncertain posterior distribution is wide, so samples frequently land high, triggering exploration without explicit rules.
As confidence grows (narrow posterior), the router exploits known-good models for each task type. A model that consistently delivers high quality at low cost gets the lion's share of traffic for that category.
Provider pricing changes. Model quality drifts. New models launch. Thompson Sampling adapts to all of these without manual intervention — the posterior updates on every request, tracking the live landscape.
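The behavior described above can be sketched with a Beta-Bernoulli bandit. This is a minimal illustration assuming a binary success signal; the shipped router scores a composite of quality, cost, and latency rather than 0/1 reward:

```python
import random

class ThompsonRouter:
    def __init__(self, models):
        # Beta(1, 1) prior for each arm: wide and uncertain, so a newly
        # added model is explored automatically.
        self.posterior = {m: [1, 1] for m in models}

    def select(self):
        # Draw one sample from every posterior; route to the highest draw.
        draws = {m: random.betavariate(a, b)
                 for m, (a, b) in self.posterior.items()}
        return max(draws, key=draws.get)

    def update(self, model, success):
        # Bayesian update: success increments alpha, failure increments beta.
        a, b = self.posterior[model]
        self.posterior[model] = [a + success, b + (1 - success)]
```

As one arm accumulates successes its posterior narrows around a high mean and wins most draws; an arm that keeps failing concentrates around a low mean and loses traffic, which is exactly the exploit/explore balance in the table below.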
| Approach | Routing | Adapts? | Learns from outcomes? |
|---|---|---|---|
| Manual rules | "If task=X, use model=Y" | No | No |
| Round-robin | Rotate evenly | No | No |
| Epsilon-greedy | Best 90%, random 10% | Slowly | Partially |
| Thompson Sampling | Bayesian posterior | Continuously | Every request |
Ready to route?
Every response includes full routing metadata. Watch the router learn your workload in real time.