System Design

Cascade Router

A request pipeline that tries the cheap, fast, specific path first. Only the hard cases travel the full depth. Nothing ever hard-fails.

Like a hospital ER: triage handles what it can in seconds. Only the complex cases reach a specialist.

The Idea

Try cheap. Escalate smart.

Think of a postal sorting facility. Most packages have clear labels and get routed instantly. Ambiguous ones go to a human sorter. Only the rare, unreadable ones reach a supervisor.

A cascade router works the same way. Each layer tries to resolve the request using the fastest available path. A miss drops through to the next layer. Most requests never reach the bottom.

Cheap-first

Layers are ordered by cost. The cheapest resolution is the one you never compute twice.

Escalate on miss

A request only reaches a more expensive layer when every cheaper option has tried and failed.

Always a fallback

Every cascade ends in a handler that never fails. Stale cache, static response, or a graceful error. Nothing drops silently.

The Flow

One request, six gates.

A request travels downward through each layer. A cache hit or a middleware rejection exits early, and those requests never pay for the layers below.

Layer by Layer

What actually happens.

L0 Request Entry

The front door. Every request enters here, no exceptions.

TLS termination, health-checked load balancing across instances, request normalization. The first thing any request touches in your system.

A single overloaded entry point is a common cause of total outages. Distributing load here is what keeps the lights on.

L1 Cache Check

Have we answered this exact thing before? If yes, return immediately.

Check a fast in-memory cache, then a shared distributed cache. A hit returns immediately and the request never travels deeper. A miss falls through to L2.

The cheapest request is the one you never compute. Cache hits cut both latency and cost by orders of magnitude.
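
The two-tier lookup can be sketched in a few lines. This is a minimal sketch, not a production cache: a per-process dict plays the in-memory tier, a second dict stands in for a shared distributed cache client, and there is no TTL or eviction. The `TwoTierCache` class and its method names are hypothetical.

```python
from typing import Dict, Optional


class TwoTierCache:
    """Local in-memory tier in front of a shared tier (both are stand-ins)."""

    def __init__(self, shared: Dict[str, str]):
        self.local: Dict[str, str] = {}   # per-process, fastest
        self.shared = shared              # stand-in for a distributed cache

    def get(self, key: str) -> Optional[str]:
        # Tier 1: local memory.
        if key in self.local:
            return self.local[key]
        # Tier 2: shared cache. Promote hits so the next lookup stays local.
        if key in self.shared:
            self.local[key] = self.shared[key]
            return self.local[key]
        return None                       # miss: the request falls through to L2

    def put(self, key: str, value: str) -> None:
        self.local[key] = value
        self.shared[key] = value


shared_store = {"GET /users/42": "cached-body"}
cache = TwoTierCache(shared_store)
first = cache.get("GET /users/42")    # served from the shared tier, then promoted
second = cache.get("GET /users/42")   # served from local memory
miss = cache.get("GET /users/99")     # None: nothing cached for this key
```

Promoting shared hits into the local tier is one design choice among several. It trades a little memory per instance for fewer network round trips on repeat keys.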

L2 Route Matching

Figure out where this goes. Fastest match wins.

Try an O(1) exact-path lookup, then a prefix tree, then regex, then a wildcard catch-all. Stop at the first match. Matchers are ordered cheapest-first.

Ordering matchers by cost means the bulk of traffic resolves on the fast path. Only rare edge cases pay for expensive regex evaluation.
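
Here is a minimal sketch of the tiered matcher. The route tables and handler names are hypothetical. A linear prefix scan stands in for a real prefix tree to keep the example short. The function also returns which tier matched, which is handy for measuring how much traffic each tier absorbs.

```python
import re
from typing import Tuple

# Hypothetical route tables, ordered cheapest matcher first.
EXACT = {"/health": "health_handler", "/users": "list_users"}
PREFIXES = [("/static/", "static_handler"), ("/api/v1/", "api_v1_handler")]
PATTERNS = [(re.compile(r"^/users/\d+$"), "get_user")]
WILDCARD = "not_found_handler"


def match_route(path: str) -> Tuple[str, str]:
    """Return (handler, tier) for the first matcher that accepts the path."""
    # Tier 1: hash lookup, O(1) on average.
    if path in EXACT:
        return EXACT[path], "exact"
    # Tier 2: prefix match (linear scan here; a trie in a real router).
    for prefix, handler in PREFIXES:
        if path.startswith(prefix):
            return handler, "prefix"
    # Tier 3: regex, the most expensive matcher, tried last.
    for pattern, handler in PATTERNS:
        if pattern.match(path):
            return handler, "regex"
    # Tier 4: catch-all, so every path resolves to something.
    return WILDCARD, "wildcard"


health = match_route("/health")          # ("health_handler", "exact")
asset = match_route("/static/app.js")    # ("static_handler", "prefix")
user = match_route("/users/42")          # ("get_user", "regex")
unknown = match_route("/nope")           # ("not_found_handler", "wildcard")
```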

L3 Middleware Chain

Bouncer, bartender's tab, and the security camera.

Each request passes through composable middleware: verify identity, enforce per-client rate limits, emit structured logs. Any link can short-circuit with a 401 or 429 and end the request right here.

Rejecting a bad request at L3 can be orders of magnitude cheaper than rejecting it after a service has already done work. Fail fast, protect everything downstream.
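
One way to sketch the short-circuiting chain: each middleware returns either `None` (continue) or a rejection that ends the request. Everything here is an assumption made for illustration: the `Request` and `Rejection` types, the token set, and a naive per-client counter with no time window in place of a real rate limiter.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence


@dataclass
class Request:
    client_id: str
    token: Optional[str] = None


@dataclass
class Rejection:
    status: int
    reason: str


Middleware = Callable[[Request], Optional[Rejection]]

VALID_TOKENS = {"secret-token"}   # hypothetical credential store
MAX_REQUESTS = 2                  # hypothetical per-client budget
request_counts: Dict[str, int] = {}
audit_log: List[str] = []


def authenticate(req: Request) -> Optional[Rejection]:
    if req.token not in VALID_TOKENS:
        return Rejection(401, "unauthorized")
    return None


def rate_limit(req: Request) -> Optional[Rejection]:
    request_counts[req.client_id] = request_counts.get(req.client_id, 0) + 1
    if request_counts[req.client_id] > MAX_REQUESTS:
        return Rejection(429, "too many requests")
    return None


def log(req: Request) -> Optional[Rejection]:
    audit_log.append(req.client_id)   # stands in for structured logging
    return None


def run_chain(req: Request, chain: Sequence[Middleware]) -> Optional[Rejection]:
    """Run middleware in order; the first rejection ends the request."""
    for middleware in chain:
        rejection = middleware(req)
        if rejection is not None:
            return rejection
    return None


CHAIN = (authenticate, rate_limit, log)
anonymous = run_chain(Request("a"), CHAIN)                # rejected with 401
ok_one = run_chain(Request("b", "secret-token"), CHAIN)   # passes
ok_two = run_chain(Request("b", "secret-token"), CHAIN)   # passes
limited = run_chain(Request("b", "secret-token"), CHAIN)  # rejected with 429
```

Because the chain stops at the first rejection, the unauthenticated request never touches the rate limiter or the log. That is the "fail fast" behavior in miniature.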

L4 Service Routing

Hand it to the right specialist. For LLM systems, pick the right model tier.

Resolve the matched route to a concrete handler. For AI systems: send to the cheap model first. If the answer's confidence falls below a threshold, escalate to the expensive model. This is the cascade's payoff layer.

Match work to the cheapest capable resource. Get this right and you cut costs with little or no loss in accuracy.

L5 Fallback Handler

The safety net. Something always catches.

If a service is down or confidence is still too low, degrade gracefully: retry with backoff, serve stale cache, route to a backup region, or escalate to a human queue. Never a hard 500.

Systems don't fail because nothing goes wrong. They fail because nothing catches what does. The fallback is what makes the cascade reliable, not just fast.
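
The degradation order can be sketched as a small function. This is a minimal sketch under stated assumptions: the primary service is any callable that may raise, a dict stands in for the stale cache, and the backup-region and human-queue tiers are left out. `with_retries` and `resolve` are hypothetical names, and the delays are tiny so the example runs quickly.

```python
import time
from typing import Callable, Dict, Optional, Tuple


def with_retries(call: Callable[[], str], attempts: int = 3,
                 base_delay: float = 0.01) -> Optional[str]:
    """Retry with exponential backoff; return None if every attempt fails."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt < attempts - 1:
                time.sleep(base_delay * 2 ** attempt)   # 10ms, 20ms, ...
    return None


def resolve(primary: Callable[[], str], stale_cache: Dict[str, str],
            key: str) -> Tuple[int, str, str]:
    """Return (status, body, source). Never raises to the caller."""
    fresh = with_retries(primary)
    if fresh is not None:
        return 200, fresh, "primary"
    if key in stale_cache:
        return 200, stale_cache[key], "stale-cache"
    # Last resort: a graceful, explicit error instead of an unhandled 500.
    return 503, "Service temporarily unavailable", "static"


def always_down() -> str:
    raise ConnectionError("service unavailable")


attempts_seen = {"n": 0}


def flaky() -> str:
    # Fails twice, then succeeds on the third attempt.
    attempts_seen["n"] += 1
    if attempts_seen["n"] < 3:
        raise ConnectionError("transient failure")
    return "fresh-body"


recovered = resolve(flaky, {}, "k")
served_stale = resolve(always_down, {"k": "old-body"}, "k")
degraded = resolve(always_down, {}, "k")
```

Returning the source alongside the body lets callers and dashboards see how often the system is running degraded. Otherwise a fallback that works well can hide a primary that has been down for hours.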

In the Wild

Where cascades earn their keep.

LLM Request Routing

Send every query to a cheap model first (Haiku, sampled k=3 at temp 0.7). Measure agreement across the samples: high consistency means high confidence. Only escalate the genuinely uncertain queries to an expensive model (Sonnet). Anything still ambiguous falls through to a human review queue.

~60x Haiku vs Sonnet cost ratio

Escalate only the uncertain 20-40%

Match Sonnet accuracy at a fraction of cost

The raw agreement score is calibrated with isotonic regression against labeled outcomes, and the escalation threshold is set on the calibrated score. No logprobs are required, so the approach works on the Anthropic API as-is.
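
The consistency check itself can be sketched without any API calls. In this minimal sketch the two models are plain callables (stubs stand in for Haiku and Sonnet), answers are compared by exact string match, and the threshold of 0.9 is a hand-picked placeholder rather than a calibrated value. The isotonic-regression step and the human review queue are left out. `agreement`, `route_query`, and `make_stub` are hypothetical names.

```python
from collections import Counter
from typing import Callable, List, Tuple

Model = Callable[[str], str]


def agreement(samples: List[str]) -> Tuple[str, float]:
    """Return the most common answer and the fraction of samples that gave it."""
    answer, votes = Counter(samples).most_common(1)[0]
    return answer, votes / len(samples)


def route_query(query: str, cheap: Model, expensive: Model,
                k: int = 3, threshold: float = 0.9) -> Tuple[str, str]:
    """Sample the cheap model k times; escalate when the samples disagree."""
    samples = [cheap(query) for _ in range(k)]
    answer, score = agreement(samples)
    if score >= threshold:
        return answer, "cheap"
    return expensive(query), "expensive"


def make_stub(answers: List[str]) -> Model:
    """Build a fake model that replays canned answers, one per call."""
    replay = iter(answers)
    return lambda query: next(replay)


# All three samples agree, so the cheap answer is kept.
easy = route_query("2 + 2?", make_stub(["4", "4", "4"]), lambda q: "four")
# Samples split 2 to 1 (score 0.67 < 0.9), so the query escalates.
hard = route_query("ambiguous?", make_stub(["a", "b", "a"]), lambda q: "c")
```

With k=3 the raw score can only be 1/3, 2/3, or 1, so a threshold of 0.9 means "escalate unless unanimous". Free-form answers would also need normalization before comparison, since exact string matching treats trivially different phrasings as disagreement.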

API Gateway Routing

A gateway serving 40k req/s resolves 92% of requests on an exact-match O(1) table, 7% on prefix trees, and under 1% on regex. Because almost all traffic stays on the cheapest matchers, p99 routing latency stays under 5ms at that load.

p99 under 5ms

Multi-Region Failover

Primary region serves all traffic. On health-check failure the cascade routes to a secondary region in under 30 seconds, then to a static maintenance response if both are down. Three tiers, zero hard-down.

Under 30s failover
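
The three-tier selection can be sketched as a priority walk over health-check results. This is a minimal sketch: the region names are hypothetical, the health map is assumed to be kept current by a separate health checker, and detection time (the "under 30 seconds" part) depends on the check interval, which is not modeled here.

```python
from typing import Dict, List

STATIC_MAINTENANCE = "static-maintenance-page"   # final tier, always available


def pick_target(health: Dict[str, bool], order: List[str]) -> str:
    """Return the first healthy region in priority order, else the static tier."""
    for region in order:
        # A region with no health record is treated as unhealthy.
        if health.get(region, False):
            return region
    return STATIC_MAINTENANCE


ORDER = ["us-east", "eu-west"]   # hypothetical primary, then secondary

normal = pick_target({"us-east": True, "eu-west": True}, ORDER)
failover = pick_target({"us-east": False, "eu-west": True}, ORDER)
both_down = pick_target({"us-east": False, "eu-west": False}, ORDER)
```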

CDN Cascade

A request checks the edge POP first (~10ms), then a regional shield cache (~40ms), then origin (~200ms). At a 96% edge hit rate, only 4% of traffic leaves the edge, and origin sees just the part of that 4% the shield also misses.

96% edge hit rate
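
A quick worked calculation shows why the tiers matter. The edge hit rate and the three latencies come from the figures above. The 75% shield hit rate is an assumption made only for this example, and each latency is treated as the total response time for a request served at that tier.

```python
EDGE_HIT = 0.96
SHIELD_HIT = 0.75          # assumed: not a figure from this page
EDGE_MS, SHIELD_MS, ORIGIN_MS = 10, 40, 200

edge_share = EDGE_HIT
shield_share = (1 - EDGE_HIT) * SHIELD_HIT        # edge miss, shield hit
origin_share = (1 - EDGE_HIT) * (1 - SHIELD_HIT)  # missed both caches

mean_latency_ms = (
    edge_share * EDGE_MS
    + shield_share * SHIELD_MS
    + origin_share * ORIGIN_MS
)

print(f"origin share: {origin_share:.1%}")         # 1.0% under these assumptions
print(f"mean latency: {mean_latency_ms:.1f} ms")   # 12.8 ms
```

Under these assumptions the mean stays close to the edge latency even though origin is 20x slower, because only about 1 request in 100 travels the full depth.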

The Pattern

What it looks like in code.

Each comment maps directly to a layer. The structure is the same whether you're routing HTTP requests or LLM queries.

Python

# Sketch only: cache, the matchers, the middleware checks, THRESHOLD,
# Response, and fallback are placeholders assumed to be defined elsewhere.
async def cascade(request) -> Response:
    # L1 - cheapest answer is one you never compute
    if hit := await cache.get(request.key):
        return hit                          # return early

    # L2 - fastest matcher wins; stop at first match
    route = (
        exact_match(request.path)
        or prefix_match(request.path)
        or regex_match(request.path)
        or WILDCARD
    )

    # L3 - fail fast: reject bad requests before they cost anything
    for check in (authenticate, rate_limit, log):
        if err := check(request):           # checks assumed synchronous
            return err                      # 401 / 429, early exit

    # L4 - match work to the cheapest capable resource
    try:
        answer = await route.cheap_handler(request)
        if answer.confidence < THRESHOLD:   # uncertain?
            answer = await route.expensive_handler(request)  # escalate
    except Exception:
        answer = None                       # service down: fall through to L5

    # L5 - nothing ever hard-fails: catch outages and still-low confidence
    if answer is None or answer.confidence < THRESHOLD:
        return await fallback(request)
    return answer

Designed for reliability,
not demos.

This is the infrastructure thinking behind every project on this portfolio.
