AI Engineer

End-to-end
AI infrastructure.

Production systems, not demos. Everyone ships the model. I build what holds it up: data pipelines, orchestration, reliability layers. The 90% that determines whether the 10% actually works. Infrastructure that compounds.


The principle

Most enterprise AI products fail
at the 60% and 30%, not the 10%.

The model is the easy part. Infrastructure, orchestration, reliability. That's where production systems actually break.

60%

Infrastructure

Data pipelines, APIs, storage, auth, deployment. The foundation everything else depends on.

30%

Orchestration

Agent coordination, async pipelines, error handling, retries. The logic that holds it together.

10%

Model

Model calls, prompts, and structured outputs. The part everyone talks about.
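As one small illustration of the orchestration layer (error handling and retries), here is a minimal sketch of an async retry wrapper with exponential backoff and jitter. The names `with_retries` and `FlakyStep`, the attempt count, and the delays are assumptions made for this example, not code from any of the projects below.

```python
import asyncio
import random


async def with_retries(call, attempts=3, base_delay=0.01):
    """Await `call()`, retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles each attempt; jitter spreads out concurrent retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)


class FlakyStep:
    """Hypothetical pipeline step that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    async def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise RuntimeError("transient upstream error")
        return "ok"
```

A step that fails twice succeeds on the third attempt; a step that keeps failing re-raises its last error instead of hiding it.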

Cascade Router: interactive token-flow simulator→

Work

Built on this principle.

Crucible · Case Study

2026

Cross-provider multi-agent LLM output verification. Three critics: GPT-4o for accuracy, Claude for logic, Gemini for completeness. They audit any output in parallel via asyncio.gather. An adjudicator synthesizes per-dimension verdicts, calibrated confidence scores, and dismissed-flag explanations. Different providers, different training data, no shared failure modes.
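The fan-out described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Crucible's actual code: `stub_critic` stands in for a real provider call, and the `Verdict` fields and the `adjudicate` rule (pass only if every dimension passes, report the lowest confidence) are hypothetical.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Verdict:
    dimension: str
    passed: bool
    confidence: float
    note: str


async def stub_critic(dimension, output):
    """Stand-in for one provider call; a real critic would call an LLM API here."""
    await asyncio.sleep(0)  # yield control, as a network call would
    flagged = "TODO" in output  # toy rule so the example is deterministic
    note = "placeholder text found" if flagged else "no issues"
    return Verdict(dimension, not flagged, 0.9, note)


def adjudicate(verdicts):
    """Toy adjudicator: combine per-dimension verdicts into one result."""
    return {
        "passed": all(v.passed for v in verdicts),
        "confidence": min(v.confidence for v in verdicts),
        "per_dimension": {v.dimension: v.passed for v in verdicts},
    }


async def audit(output):
    dimensions = ["accuracy", "logic", "completeness"]
    # All three critics run concurrently; gather returns results in input order.
    verdicts = await asyncio.gather(*(stub_critic(d, output) for d in dimensions))
    return adjudicate(verdicts)
```

Calling `asyncio.run(audit(text))` returns one dictionary with an overall verdict and a per-dimension breakdown.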

System demo: GPT-4o (accuracy critic), Claude (logic critic), and Gemini (completeness critic) analyze in parallel → 0.94 confidence · 15/15 caught

15/15 planted errors caught · 0 false positives · deterministic eval harness

Python · Claude API · GPT-4o · Gemini · FastAPI · asyncio · Docker

GitHub↗ · Write-up↗

Relay · Novel System

2026

Multi-agent system where Claude agents share extended thinking blocks with each other, not just final text outputs. The recursive loop runs Planner → Critic → Solver for N rounds; each agent receives the full reasoning chain of every prior agent before responding. The Claude-native equivalent of RecursiveMAS (arXiv 2604.25917): same core idea, no GPU access required, deployable today via the Anthropic API.
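The control flow of that loop can be sketched without any API calls. In this minimal sketch each agent is a plain callable that receives the task plus every prior reasoning block and returns its own reasoning and answer. The `relay` and `make_stub` names and the stub agents are assumptions for illustration; a real agent would call the Anthropic API with extended thinking enabled.

```python
from typing import Callable, List, Tuple

# An agent maps (task, prior reasoning chain) to (reasoning, answer).
Agent = Callable[[str, List[str]], Tuple[str, str]]


def relay(task: str, agents: List[Tuple[str, Agent]], rounds: int = 1):
    """Run the agents in order for `rounds` rounds, sharing the full chain."""
    chain: List[str] = []
    answer = ""
    for _ in range(rounds):
        for name, agent in agents:
            # Each agent sees a copy of every reasoning block produced so far.
            reasoning, answer = agent(task, list(chain))
            chain.append(f"[{name}] {reasoning}")
    return answer, chain


def make_stub(name: str) -> Agent:
    """Hypothetical agent that only reports how much reasoning it received."""
    def agent(task: str, chain: List[str]) -> Tuple[str, str]:
        return f"saw {len(chain)} prior block(s)", f"{name}-answer"
    return agent
```

With three agents and two rounds the chain ends up with six blocks, and the last solver call sees the five blocks that came before it.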

System demo: Agent 1 (thinking) → Agent 2 (reading chain) → Resolver (evaluating) → agreement ✓ · 2.9× tokens · +2.5pp accuracy

50-example GSM8K eval: Single 96%, Relay 98% · Round improvement confirmed (96% to 98%) · Proven novel 2026-05-31

MATH level 4-5 (n=200, preliminary): self-relay 65.5% vs single-agent 63.0% · read-after + disagreement escalation · 3,290 avg tokens vs 1,234 (2.7x cost)

Python · Anthropic SDK · Extended Thinking · Multi-Agent · GSM8K Eval · Vite · React

GitHub↗ · Write-up↗

Quench · Case Study

2026

Semantic caching proxy for LLM APIs. Every prompt gets embedded and checked against past answers: similar enough means instant response, no upstream call. Drop it in by changing one URL. Partitioned by model and system prompt so cross-context false positives are structurally impossible. Full observability stack: Qdrant, Prometheus, Grafana.
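A minimal in-memory sketch of the lookup described above, assuming cosine similarity and a fixed threshold. The `SemanticCache` class, the 0.95 threshold, and the hand-written vectors are illustrative assumptions; the real system uses an embedding model and Qdrant rather than a Python list. Partitioning is modeled by keying every entry on (model, system prompt), so a lookup never searches another context's entries.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        # (model, system_prompt) -> list of (embedding, answer)
        self.partitions = defaultdict(list)

    def put(self, model, system_prompt, embedding, answer):
        self.partitions[(model, system_prompt)].append((embedding, answer))

    def get(self, model, system_prompt, embedding):
        """Return the best cached answer in this partition, or None on a miss."""
        best_score, best_answer = 0.0, None
        for stored, answer in self.partitions.get((model, system_prompt), []):
            score = cosine(embedding, stored)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None
```

On a miss the proxy would forward the request upstream and call `put` with the response; the threshold trades hit rate against the risk of serving an answer to a different question.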

System demo: embed("explain RAG") → cache → HIT (15 ms) · embed("new prompt") → cache → MISS → LLM (240 ms)

90% hit rate · P95 15ms · 0 false positives · 30-60% cost reduction

Python · FastAPI · Qdrant · Prometheus · Anthropic API · Grafana · Docker

GitHub↗

Etch · Case Study

2026

Failure forensics for multi-step AI pipelines. Every pipeline run is traced end-to-end: each step captures its input, output, prompt, token count, and LLM self-confidence. When an output is flagged as bad, an LLM-as-judge scores the quality delta at every step and walks backward to find the origin. Cascade failures don't fool it: a step that received garbage and passed it along has delta ≈ 0. The step that produced garbage from good input is the root cause.
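The backward walk can be sketched as follows, assuming the judge has already scored the input and output of every step on a 0 to 1 scale. The `StepTrace` fields, the `find_root_cause` name, and both thresholds are hypothetical; the delta is taken here as input quality minus output quality, which is one reading of the description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepTrace:
    name: str
    input_quality: float   # judge score for what the step received (0..1)
    output_quality: float  # judge score for what the step produced (0..1)

    @property
    def delta(self) -> float:
        """Quality lost inside this step."""
        return self.input_quality - self.output_quality


def find_root_cause(trace: List[StepTrace],
                    good: float = 0.7,
                    min_drop: float = 0.3) -> Optional[StepTrace]:
    """Walk backward from the output to the step that turned good input bad."""
    for step in reversed(trace):
        # A step that received garbage and passed it along is skipped:
        # its input is already bad and its delta is close to zero.
        if step.input_quality >= good and step.delta >= min_drop:
            return step
    return None


# Toy trace with made-up scores: the drop happens in the middle step.
trace = [
    StepTrace("step 1", 1.00, 0.98),
    StepTrace("step 2", 0.98, 0.27),
    StepTrace("step 3", 0.27, 0.23),
]
root = find_root_cause(trace)
```

Here step 3 is not blamed even though its output is bad, because its input was already bad; the walk continues to step 2.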

System demo: step 1 (δ 0.02) → step 2 (δ 0.71) → step 3 (δ 0.04) → output ✗ · root cause: step 2 · quality-delta backtrace

4-step pipeline · quality-delta root cause · 5 failure categories · Streamlit trace explorer

Python · FastAPI · Anthropic API · SQLite · Streamlit · Docker

GitHub↗

Reckoning · Research

2026

Do LLM failure modes pool across unrelated organizations? A two-tier experiment: synthetic orgs first (YELLOW: 2/6 failure types recurring), then 25 real GDPRhub enforcement decisions across 3 DPAs. If blind spots concentrate in a small number of article families that recur regardless of organization or jurisdiction, they are structural to the model class: detectable in advance, not patchable one at a time.
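One way such a concentration figure could be computed is sketched below. Both definitions are assumptions made for illustration, not the study's pre-registered metric: a family counts as recurring if it is missed in at least two distinct groups (organizations or DPAs), and concentration is the share of all misses that fall in recurring families. The sample data is invented.

```python
from collections import defaultdict


def recurring_families(misses, min_groups=2):
    """Families that were missed in at least `min_groups` distinct groups."""
    groups_by_family = defaultdict(set)
    for group, family in misses:
        groups_by_family[family].add(group)
    return {f for f, groups in groups_by_family.items() if len(groups) >= min_groups}


def concentration(misses, families):
    """Share of all misses that fall inside the given families."""
    if not misses:
        return 0.0
    return sum(1 for _, family in misses if family in families) / len(misses)


# Invented (group, article family) pairs, purely to exercise the functions.
toy_misses = [
    ("DPA-A", "Art. 13"),
    ("DPA-B", "Art. 13"),
    ("DPA-C", "Art. 32"),
    ("DPA-A", "Art. 32"),
    ("DPA-B", "Art. 7"),
]
recurring = recurring_families(toy_misses)
share = concentration(toy_misses, recurring)
```

In this toy data two families recur across groups and account for four of the five misses.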

System demo: blind-spot families Art. 13, Art. 17, Art. 25, Art. 5, Art. 6, Art. 32 · total: 73% concentration

GREEN · 6 recurring blind-spot families · 73% concentration · 25 real enforcement decisions

Python · Claude API · GDPRhub · LLM-as-judge · Pre-registered · Structured eval

Case Study↗
