Developer Tools

Parallelize security, logic, and style reviews instantly

A developer tools company cut code review time 55% by fanning requests across three models simultaneously.

55% reduction in review latency (15s to 6.5s)
10K+ cached code patterns, with a 78% reduction in re-analysis
3-way parallel model execution with synthesis

The Problem

Good code review requires multiple lenses. Security flaws look different from architectural problems, which look different from style violations. This dev tools company had built a code review pipeline that invoked three separate models: one specialized prompt for security scanning, another for logic/correctness, a third for style/maintainability.

But they ran them sequentially — waiting for the security check to finish before running the logic check, then style. A 1,000-line PR took 12-15 seconds. Developers context-switched. Review turnaround suffered.

The bottleneck wasn't the models; it was orchestration. Running requests one at a time is wasteful when the tasks are independent.

The Architecture

The pipeline fanned out all three model calls through BrainstormRouter in parallel. Each call used a different system prompt tailored to its review dimension, but all three ran simultaneously:

// Fan out three reviews in parallel.
// `client` is an OpenAI-compatible SDK instance pointed at the router;
// `diff` holds the PR diff, and the *_PROMPT constants hold the
// per-dimension review instructions.
const [security, logic, style] = await Promise.all([
  client.chat.completions.create({
    model: "anthropic/claude-sonnet-4-0",
    messages: [
      { role: "system", content: SECURITY_PROMPT },
      { role: "user", content: diff }
    ]
  }),
  client.chat.completions.create({
    model: "openai/gpt-4o",
    messages: [
      { role: "system", content: LOGIC_PROMPT },
      { role: "user", content: diff }
    ]
  }),
  client.chat.completions.create({
    model: "auto:floor",
    messages: [
      { role: "system", content: STYLE_PROMPT },
      { role: "user", content: diff }
    ]
  })
]);

Each review dimension used the best model for its task. Security scanning used Claude Sonnet for its precision on vulnerability patterns. Logic review used GPT-4o for its strong code comprehension. Style checking used auto:floor — the cheapest model that met quality thresholds — because style rules are well-defined and don't need premium reasoning.
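The three raw reviews were then merged into a single PR comment (the "synthesis" in "parallel model execution with synthesis"). A minimal sketch of such a merge, assuming each model returns plain-text findings; the `Review` shape and `synthesizeReviews` helper are illustrative, not the company's actual code:

```typescript
// Illustrative shape for one review dimension's output.
interface Review {
  dimension: "security" | "logic" | "style";
  findings: string;
}

// Merge the parallel reviews into one comment, ordered so the
// highest-stakes dimension (security) always appears first.
function synthesizeReviews(reviews: Review[]): string {
  const order = ["security", "logic", "style"];
  const sorted = [...reviews].sort(
    (a, b) => order.indexOf(a.dimension) - order.indexOf(b.dimension)
  );
  return sorted
    .map(
      (r) =>
        `## ${r.dimension[0].toUpperCase() + r.dimension.slice(1)} review\n` +
        r.findings
    )
    .join("\n\n");
}
```

In practice the synthesis step might itself be a fourth model call that deduplicates overlapping findings; a deterministic merge is shown here for brevity.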

Cache Amplification

The semantic cache produced outsized returns for code review. Software projects reuse patterns constantly: utility functions, error handlers, config parsers. A 200-line utility function that appears across 30 PRs is reviewed once; subsequent encounters hit the cache in milliseconds.

Over 90 days, the team accumulated 10,000+ cached code patterns. Cache hit rate on style reviews reached 45% — nearly half of all style checks returned instantly because the code patterns had been reviewed before.
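Conceptually, the cache keys each code pattern so that a trivially reformatted copy of a known utility function still hits. A minimal exact-match sketch, assuming whitespace-insensitive normalization plus a SHA-256 key; a real semantic cache would also match near-duplicates via embeddings, and none of these internals are BrainstormRouter's documented implementation:

```typescript
import { createHash } from "node:crypto";

// Derive a cache key that ignores indentation and blank lines,
// so reformatted copies of a known pattern map to the same entry.
function cacheKey(hunk: string): string {
  const normalized = hunk
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .join("\n");
  return createHash("sha256").update(normalized).digest("hex");
}

// In-memory stand-in for the cache: key -> previously generated review.
const reviewCache = new Map<string, string>();
```

On a cache hit, the pipeline returns the stored review in milliseconds instead of re-invoking a model.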

Review Pipeline Metrics

| Dimension | Model | Avg Latency | Cache Hit Rate |
| --- | --- | --- | --- |
| Security scan | Claude Sonnet | 5.8s | 22% |
| Logic review | GPT-4o | 6.2s | 18% |
| Style check | GPT-4o Mini (auto) | 2.1s | 45% |
| Parallel total | | 6.5s | |
| Sequential baseline | | 15.0s | |

Results

Total review time dropped from 15 seconds to 6.5 seconds, a 55% reduction. Parallel wall-time is set by the slowest individual review (logic, at 6.2s) plus a small synthesis overhead, not by the sum of all three.
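The wall-time arithmetic is simple: sequential latency is the sum of the three calls, while `Promise.all` resolves when the slowest call finishes. Using the per-dimension latencies from the table above:

```typescript
// Per-dimension latencies from the metrics table, in seconds.
const latencies = { security: 5.8, logic: 6.2, style: 2.1 };

// Sequential: each call waits for the previous one, so latencies add.
const sequential = Object.values(latencies).reduce((a, b) => a + b, 0);

// Parallel: wall-time is bounded by the slowest single call.
const parallel = Math.max(...Object.values(latencies));
```

This gives roughly 14.1s sequential versus 6.2s parallel; the measured figures (15.0s and 6.5s) also include network and synthesis overhead.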

Cost per review stayed flat because the style dimension moved to a cheaper model via auto:floor. The router confirmed that GPT-4o Mini produced equivalent style feedback to Sonnet on well-defined rules — so the third review dimension became 8x cheaper without quality loss.

Developers reported fewer context switches. A 6.5-second review fits within a "glance at the PR" workflow. A 15-second review doesn't: it triggers a tab switch, a Slack check, and a lost train of thought.
