A request pipeline that tries the cheap, fast, specific path first. Only the hard cases travel the full depth. Nothing ever hard-fails.
Like a hospital ER: triage handles what it can in seconds. Only the complex cases reach a specialist.
The Idea
Think of a postal sorting facility. Most packages have clear labels and get routed instantly. Ambiguous ones go to a human sorter. Only the rare, unreadable ones reach a supervisor.
A cascade router works the same way. Each layer tries to resolve the request using the fastest available path. A miss drops through to the next layer. Most requests never reach the bottom.
The Flow
In the diagram, a request token travels downward through each layer. Green branches exit early. Those requests never pay for the layers below.
Layer by Layer
The front door. Every request enters here, no exceptions.
TLS termination, health-checked load balancing across instances, request normalization. The first thing any request touches in your system.
A single overloaded entry point is the most common cause of total outages. Distributing load here is what keeps the lights on.
Have we answered this exact thing before? If yes, return immediately.
Check a fast in-memory cache, then a shared distributed cache. A hit returns immediately and the request never travels deeper. A miss falls through to L2.
The cheapest request is the one you never compute. Cache hits cut both latency and cost by orders of magnitude.
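A minimal sketch of that two-tier lookup, assuming a plain dict stands in for the shared distributed cache. `TwoTierCache`, `local_ttl`, and the promote-on-hit behavior are illustrative choices, not something the layer above prescribes.

```python
import time
from typing import Any, Optional


class TwoTierCache:
    """Check a small in-process cache first, then a shared store."""

    def __init__(self, shared: dict, local_ttl: float = 5.0):
        # key -> (expires_at, value); lives inside this process only
        self.local: dict[str, tuple[float, Any]] = {}
        # Hypothetical stand-in for a distributed cache client
        self.shared = shared
        self.local_ttl = local_ttl

    def get(self, key: str) -> Optional[Any]:
        entry = self.local.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry[1]  # tier 1 hit: no network hop at all
        value = self.shared.get(key)
        if value is not None:
            # Promote to the local tier so the next lookup stays in-process.
            self.local[key] = (time.monotonic() + self.local_ttl, value)
        return value  # None means a miss: fall through to the next layer

    def put(self, key: str, value: Any) -> None:
        self.shared[key] = value
        self.local[key] = (time.monotonic() + self.local_ttl, value)
```

The short local TTL is the trade-off to watch: it bounds how long one instance can serve a value the shared tier has already replaced.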
Figure out where this goes. Fastest match wins.
Try an O(1) exact-path lookup, then a prefix tree, then regex, then a wildcard catch-all. Stop at the first match. Matchers are ordered cheapest-first.
Ordering matchers by cost means 95% of traffic resolves on the fast path. Only rare edge cases pay for expensive regex evaluation.
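The cheapest-first ordering can be sketched as below. The route tables are made up for illustration, and a linear `startswith` scan stands in for a real prefix tree to keep the example short.

```python
import re

# Hypothetical route tables; a real router would build these at startup.
EXACT = {"/health": "health", "/users": "list_users"}
PREFIXES = [("/static/", "static_files"), ("/api/v1/", "api_v1")]
PATTERNS = [(re.compile(r"^/users/\d+$"), "get_user")]
WILDCARD = "not_found"


def resolve(path: str) -> tuple[str, str]:
    """Return (handler_name, tier) from the first matcher that hits."""
    handler = EXACT.get(path)  # O(1) dict lookup
    if handler is not None:
        return handler, "exact"
    for prefix, handler in PREFIXES:  # stand-in for a prefix tree
        if path.startswith(prefix):
            return handler, "prefix"
    for pattern, handler in PATTERNS:  # most expensive matcher goes last
        if pattern.match(path):
            return handler, "regex"
    return WILDCARD, "wildcard"  # catch-all: routing itself never fails
```

Returning the tier alongside the handler makes it easy to count how much traffic each matcher actually absorbs.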
Bouncer, bartender's tab, and the security camera.
Each request passes through composable middleware: verify identity, enforce per-client rate limits, emit structured logs. Any link can short-circuit with a 401 or 429 and end the request right here.
Rejecting a bad request at L3 is 100x cheaper than after a service has already done work. Fail fast, protect everything downstream.
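One way to picture the chain, as a sketch: each link returns `None` to pass the request along or a rejection to end it. `Request`, `Rejection`, `VALID_TOKENS`, and the fixed-window `RateLimiter` are all hypothetical names for this example.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Request:
    client_id: str
    token: Optional[str] = None


@dataclass
class Rejection:
    status: int
    reason: str


VALID_TOKENS = {"secret-token"}  # hypothetical credential store


def authenticate(req: Request) -> Optional[Rejection]:
    if req.token not in VALID_TOKENS:
        return Rejection(401, "unauthorized")
    return None


class RateLimiter:
    """Fixed-window request counter per client."""

    def __init__(self, limit: int, window_s: float = 1.0):
        self.limit = limit
        self.window_s = window_s
        self.windows: dict[str, tuple[float, int]] = {}  # client -> (start, count)

    def __call__(self, req: Request) -> Optional[Rejection]:
        now = time.monotonic()
        start, count = self.windows.get(req.client_id, (now, 0))
        if now - start >= self.window_s:
            start, count = now, 0  # window expired: start a fresh one
        if count >= self.limit:
            return Rejection(429, "rate limit exceeded")
        self.windows[req.client_id] = (start, count + 1)
        return None


def run_chain(req: Request, chain: list[Callable]) -> Optional[Rejection]:
    """Run checks in order; the first rejection short-circuits the rest."""
    for check in chain:
        rejection = check(req)
        if rejection is not None:
            return rejection
    return None
```

Because authentication runs first, an unauthenticated request is rejected before it ever touches the rate limiter's counters.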
Hand it to the right specialist. For LLM systems, pick the right model tier.
Resolve the matched route to a concrete handler. For AI systems: send to the cheap model first. If the answer's confidence falls below threshold, escalate to the expensive model. This is the cascade's payoff layer.
Match work to the cheapest capable resource. Get this right and you cut costs without touching accuracy.
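The escalation step, sketched with the two tiers injected as plain callables that return `(text, confidence)`. The `Answer` type and the 0.8 default threshold are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Callable

# A handler takes a query and returns (answer_text, confidence in [0, 1]).
Handler = Callable[[str], tuple[str, float]]


@dataclass
class Answer:
    text: str
    confidence: float
    tier: str


def dispatch(query: str, cheap: Handler, expensive: Handler,
             threshold: float = 0.8) -> Answer:
    """Try the cheap tier first; escalate only when it is not confident."""
    text, confidence = cheap(query)
    if confidence >= threshold:
        return Answer(text, confidence, "cheap")  # early exit: no escalation cost
    text, confidence = expensive(query)
    return Answer(text, confidence, "expensive")
```

The expensive handler is only ever called on the uncertain slice of traffic, which is where the cost saving comes from.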
The safety net. Something always catches.
If a service is down or confidence is still too low, degrade gracefully: retry with backoff, serve stale cache, route to a backup region, or escalate to a human queue. Never a hard 500.
Systems don't fail because nothing goes wrong. They fail because nothing catches what does. The fallback is what makes the cascade reliable, not just fast.
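A sketch of that degradation ladder: retry with exponential backoff, then stale cache, then a controlled static response. The function name, the response dict shape, and the final 503 maintenance body are assumptions; a backup region or human queue would slot in as further tiers the same way.

```python
import time
from typing import Any, Callable


def with_fallback(primary: Callable[[], Any], stale_cache: dict, key: str,
                  retries: int = 3, base_delay: float = 0.1,
                  sleep: Callable[[float], None] = time.sleep) -> dict:
    """Always return a response dict, whatever happens to the primary."""
    for attempt in range(retries):
        try:
            return {"status": 200, "body": primary(), "source": "primary"}
        except Exception:
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...
    if key in stale_cache:
        # An old answer usually beats no answer.
        return {"status": 200, "body": stale_cache[key], "source": "stale"}
    # Last tier: a deliberate, well-formed response instead of a crash.
    return {"status": 503, "body": "temporarily unavailable", "source": "static"}
```

Passing `sleep` in as a parameter keeps the backoff testable without real waiting.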
In the Wild
Send every query to a cheap model first (Haiku, sampled k=3 at temp 0.7). Measure agreement across the samples - high consistency means high confidence. Only escalate the genuinely uncertain queries to an expensive model (Sonnet). Anything still ambiguous falls through to a human review queue.
Consistency-based calibration (isotonic regression) sets the escalation threshold. No logprobs required - works on the Anthropic API as-is.
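The agreement measurement can be as simple as a majority vote over the k samples. This sketch assumes answers are comparable by normalized exact string match; free-form outputs would need a looser equivalence check. The calibration step (mapping raw agreement to a probability) is not shown.

```python
from collections import Counter


def agreement(samples: list[str]) -> tuple[str, float]:
    """Return (majority_answer, fraction of samples that agree with it)."""
    normalized = [s.strip().lower() for s in samples]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes / len(normalized)
```

With k=3 the raw score can only be 1/3, 2/3, or 1, so the escalation threshold effectively picks which of those levels count as confident.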
API Gateway Routing
A gateway serving 40k req/s resolves 92% on an exact-match O(1) table, 7% on prefix trees, under 1% on regex. Tiering the matchers keeps p99 under 5ms regardless of load.
p99 under 5ms
Multi-Region Failover
Primary region serves all traffic. On health-check failure the cascade routes to a secondary region in under 30 seconds, then to a static maintenance response if both are down. Three tiers, zero hard-down.
Under 30s failover
CDN Cascade
A request checks the edge POP first (~10ms), then a regional shield cache (~40ms), then origin (~200ms). At a 96% edge hit rate, origin sees at most 4% of total traffic, and less once the shield cache absorbs some of those misses.
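The arithmetic behind those numbers, as a small sketch. The shield hit rate is not stated above, so it is a parameter here, and the quoted latencies are treated as total response times per tier (an assumption).

```python
def cdn_cascade(edge_hit: float, shield_hit: float,
                edge_ms: float = 10.0, shield_ms: float = 40.0,
                origin_ms: float = 200.0) -> tuple[float, float]:
    """Return (origin_share, mean_latency_ms) for a three-tier CDN."""
    edge_miss = 1.0 - edge_hit
    shield_share = edge_miss * shield_hit          # edge misses the shield absorbs
    origin_share = edge_miss * (1.0 - shield_hit)  # misses that reach origin
    mean_ms = (edge_hit * edge_ms
               + shield_share * shield_ms
               + origin_share * origin_ms)
    return origin_share, mean_ms
```

With `shield_hit=0.0` this reproduces the 4% origin share, at a mean latency of about 17.6ms. With a hypothetical 50% shield hit rate, origin drops to 2% of traffic and the mean to about 14.4ms.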
96% edge hit rate
The Pattern
Each comment maps directly to a layer. The structure is the same whether you're routing HTTP requests or LLM queries.
async def cascade(request) -> Response:
    # L1 - cheapest answer is one you never compute
    if hit := await cache.get(request.key):
        return hit  # return early

    # L2 - fastest matcher wins; stop at first match
    route = (
        exact_match(request.path)
        or prefix_match(request.path)
        or regex_match(request.path)
        or WILDCARD
    )

    # L3 - fail fast: reject bad requests before they cost anything
    for check in (authenticate, rate_limit, log):
        if err := check(request):
            return err  # 401 / 429, early exit

    # L4 - match work to the cheapest capable resource
    answer = await route.cheap_handler(request)
    if answer is None or answer.confidence < THRESHOLD:  # no answer, or uncertain?
        answer = await route.expensive_handler(request)  # escalate

    # L5 - nothing ever hard-fails
    return answer or await fallback(request)

This is the infrastructure thinking behind every project on this portfolio.