AI Engineer

End-to-end
AI infrastructure.

Production systems, not demos. Everyone ships the model. I build what holds it up: data pipelines, orchestration, reliability layers. The 90% that determines whether the 10% actually works. Infrastructure that compounds.

The principle

Most enterprise AI products fail
at the 60% and 30%, not the 10%.

The model is the easy part. Infrastructure, orchestration, reliability. That's where production systems actually break.

60% · Infrastructure
Data pipelines, APIs, storage, auth, deployment. The foundation everything else depends on.
30% · Orchestration
Agent coordination, async pipelines, error handling, retries. The logic that holds it together.
10% · AI
The model calls, prompts, and structured outputs. The part everyone talks about.
Cascade Router: interactive token-flow simulator
Work

Built on this principle.

Crucible · Case Study
2026

Cross-provider multi-agent LLM output verification. Three critics: GPT-4o for accuracy, Claude for logic, Gemini for completeness. They audit any output in parallel via asyncio.gather. An adjudicator synthesizes per-dimension verdicts, calibrated confidence scores, and dismissed-flag explanations. Different providers, different training data, no shared failure modes.
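
The parallel fan-out described above can be sketched roughly as follows. This is a minimal illustration, not Crucible's actual code: `Verdict`, `run_critic`, and `audit` are hypothetical names, the critics are local stubs that make no provider calls, and the adjudication rule (all critics must pass, report the lowest confidence) is an assumption.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Verdict:
    dimension: str     # "accuracy", "logic", or "completeness"
    passed: bool
    confidence: float  # critic's self-reported confidence in [0, 1]
    note: str


async def run_critic(dimension: str, output: str) -> Verdict:
    """Hypothetical stand-in for one provider call (no network I/O)."""
    await asyncio.sleep(0)  # yield control, as a real API call would
    flagged = dimension == "completeness" and "TODO" in output
    note = "unfinished section" if flagged else "no issues found"
    return Verdict(dimension, not flagged, 0.9, note)


async def audit(output: str) -> dict:
    dimensions = ("accuracy", "logic", "completeness")
    # All three critics run concurrently; gather returns results in input order.
    verdicts = await asyncio.gather(*(run_critic(d, output) for d in dimensions))
    # Toy adjudicator (assumed rule): pass only if every critic passes,
    # and report the lowest confidence among them.
    return {
        "passed": all(v.passed for v in verdicts),
        "confidence": min(v.confidence for v in verdicts),
        "per_dimension": {v.dimension: v for v in verdicts},
    }


report = asyncio.run(audit("The answer is 42. TODO: cite sources."))
```

Swapping each stub for a real client call would keep the same shape: one coroutine per critic, one `gather`, one synthesis step.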

System demo
GPT-4o · accuracy critic · analyzing
Claude · logic critic · analyzing
Gemini · completeness critic · analyzing
0.94 confidence
15/15 planted errors caught · 0 false positives · deterministic eval harness
Python · Claude API · GPT-4o · Gemini · FastAPI · asyncio · Docker
Relay · Novel System
2026

Multi-agent system where Claude agents share extended thinking blocks with each other, not just final text outputs. The recursive loop runs Planner → Critic → Solver for N rounds; each agent receives the full reasoning chain of every prior agent before responding. The Claude-native equivalent of RecursiveMAS (arXiv 2604.25917): same core idea, no GPU access required, deployable today via the Anthropic API.
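
The loop structure can be sketched as below. This is a hedged illustration under stated assumptions: `make_stub_agent` and `relay` are hypothetical names, and the agents are local stubs. A real agent would call a model and pass the accumulated reasoning chain in its context.

```python
from typing import Callable, List, Tuple

# One chain entry: (role, reasoning, answer).
Step = Tuple[str, str, str]
Agent = Callable[[str, List[Step]], Tuple[str, str]]


def make_stub_agent(role: str) -> Agent:
    """Hypothetical agent; a real one would call a model with the chain in context."""
    def agent(question: str, chain: List[Step]) -> Tuple[str, str]:
        reasoning = f"{role} saw {len(chain)} prior steps"
        answer = f"{role}-answer-{len(chain)}"
        return reasoning, answer
    return agent


def relay(question: str, rounds: int) -> List[Step]:
    roles = ["planner", "critic", "solver"]
    agents = {role: make_stub_agent(role) for role in roles}
    chain: List[Step] = []
    for _ in range(rounds):
        for role in roles:
            # Each agent receives every prior agent's reasoning, not just answers.
            reasoning, answer = agents[role](question, list(chain))
            chain.append((role, reasoning, answer))
    return chain


chain = relay("What is 17 * 3?", rounds=2)
```

Because the chain grows with every step, context size (and token cost) rises with the number of rounds, which is the trade-off the eval figures below quantify.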

System demo
Agent 1 · thinking…
Agent 2 · reading chain…
Resolver · evaluating…
agreement ✓ · 2.9× tokens · +2.5pp accuracy
50-example GSM8K eval: Single 96%, Relay 98% · Round improvement confirmed (96% to 98%) · Proven novel 2026-05-31
MATH level 4-5 (n=200, preliminary): self-relay 65.5% vs single-agent 63.0% · read-after + disagreement escalation · 3,290 avg tokens vs 1,234 (2.7x cost)
Python · Anthropic SDK · Extended Thinking · Multi-Agent · GSM8K Eval · Vite · React
Quench · Case Study
2026

Semantic caching proxy for LLM APIs. Every prompt gets embedded and checked against past answers: similar enough means instant response, no upstream call. Drop it in by changing one URL. Partitioned by model and system prompt so cross-context false positives are structurally impossible. Full observability stack: Qdrant, Prometheus, Grafana.
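
A minimal in-memory sketch of the partitioned lookup follows. Everything here is an assumption made for illustration: the bag-of-words `embed` stands in for a real embedding model, a dict stands in for Qdrant, the 0.8 threshold is arbitrary, and `SemanticCache` is a hypothetical name. The point is the partition key: entries stored under one (model, system prompt) pair are never searched for another.

```python
import hashlib
import math
from collections import Counter
from typing import Dict, List, Optional, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a real proxy would use an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold  # assumed value, not taken from the project
        self.partitions: Dict[str, List[Tuple[Counter, str]]] = {}

    @staticmethod
    def partition_key(model: str, system_prompt: str) -> str:
        # Hash model and system prompt together so contexts never share entries.
        raw = f"{model}\x00{system_prompt}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def put(self, model: str, system_prompt: str, prompt: str, answer: str) -> None:
        key = self.partition_key(model, system_prompt)
        self.partitions.setdefault(key, []).append((embed(prompt), answer))

    def get(self, model: str, system_prompt: str, prompt: str) -> Optional[str]:
        entries = self.partitions.get(self.partition_key(model, system_prompt), [])
        query = embed(prompt)
        scored = [(cosine(query, vector), answer) for vector, answer in entries]
        if not scored:
            return None
        score, answer = max(scored, key=lambda pair: pair[0])
        return answer if score >= self.threshold else None


cache = SemanticCache()
cache.put("model-a", "be terse", "explain RAG please", "RAG = retrieval + generation")
hit = cache.get("model-a", "be terse", "please explain RAG")
miss_other_model = cache.get("model-b", "be terse", "explain RAG please")
miss_new_prompt = cache.get("model-a", "be terse", "write a poem")
```

The same prompt under a different model misses by construction, because the lookup never leaves its own partition.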

System demo
embed("explain RAG") → cache HIT → 15ms
embed("new prompt") → cache MISS → LLM → 240ms
90% hit rate · P95 15ms · 0 false positives · 30-60% cost reduction
Python · FastAPI · Qdrant · Prometheus · Anthropic API · Grafana · Docker
Etch · Case Study
2026

Failure forensics for multi-step AI pipelines. Every pipeline run is traced end-to-end: each step captures its input, output, prompt, token count, and LLM self-confidence. When an output is flagged as bad, an LLM-as-judge scores the quality delta at every step and walks backward to find the origin. Cascade failures don't fool it: a step that received garbage and passed it along has delta ≈ 0. The step that produced garbage from good input is the root cause.
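
The backward walk can be sketched like this, assuming the judge has already scored the quality of each step's input and output on a 0 to 1 scale. `StepTrace`, `find_root_cause`, and the 0.3 drop threshold are hypothetical; the scores are made-up values chosen to reproduce the deltas in the demo.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepTrace:
    name: str
    input_quality: float   # judge score for what the step received, in [0, 1]
    output_quality: float  # judge score for what the step produced, in [0, 1]

    @property
    def delta(self) -> float:
        # Quality lost inside this step; near 0 means it passed its input along.
        return self.input_quality - self.output_quality


def find_root_cause(trace: List[StepTrace], min_drop: float = 0.3) -> Optional[StepTrace]:
    """Walk backward from the flagged output to the step that degraded quality."""
    for step in reversed(trace):
        if step.delta >= min_drop:
            return step  # good input in, bad output out
    return None  # no single step accounts for the drop


trace = [
    StepTrace("step 1", 0.95, 0.93),  # delta 0.02
    StepTrace("step 2", 0.93, 0.22),  # delta 0.71
    StepTrace("step 3", 0.22, 0.18),  # delta 0.04: received garbage, passed it on
]
root = find_root_cause(trace)
```

Step 3 has the worst output but is skipped, because its input was already bad and its delta is near zero.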

System demo
step 1 · δ 0.02
step 2 · δ 0.71
step 3 · δ 0.04
output
root cause: step 2 · quality-delta backtrace
4-step pipeline · quality-delta root cause · 5 failure categories · Streamlit trace explorer
Python · FastAPI · Anthropic API · SQLite · Streamlit · Docker
Reckoning · Research
2026

Do LLM failure modes pool across unrelated organizations? A two-tier experiment: synthetic orgs first (YELLOW: 2/6 failure types recurring), then 25 real GDPRhub enforcement decisions across 3 DPAs. If blind spots concentrate in a small number of article families that recur regardless of organization or jurisdiction, they are structural to the model class: detectable in advance, not patchable one at a time.
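
One way to make "concentration" concrete is sketched below. The definition is an assumption, since the text does not spell out the metric: a family counts as recurring if misses in it appear in at least two distinct sources, and concentration is the share of all misses that fall in recurring families. The function name and the toy data are invented for illustration and do not reproduce the reported figures.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def concentration(
    misses: Iterable[Tuple[str, str]], min_sources: int = 2
) -> Tuple[Set[str], float]:
    """misses: (source, article_family) pairs, one per observed miss."""
    sources_by_family: Dict[str, Set[str]] = defaultdict(set)
    count_by_family: Dict[str, int] = defaultdict(int)
    for source, family in misses:
        sources_by_family[family].add(source)
        count_by_family[family] += 1
    # A family is "recurring" if it shows up across enough distinct sources.
    recurring = {f for f, s in sources_by_family.items() if len(s) >= min_sources}
    total = sum(count_by_family.values())
    share = sum(count_by_family[f] for f in recurring) / total if total else 0.0
    return recurring, share


# Invented toy data: 8 misses across 3 hypothetical sources.
toy_misses = [
    ("dpa_a", "Art.13"), ("dpa_b", "Art.13"), ("dpa_c", "Art.13"),
    ("dpa_a", "Art.32"), ("dpa_c", "Art.32"),
    ("dpa_b", "Art.6"), ("dpa_a", "Art.17"), ("dpa_b", "Art.5"),
]
recurring, share = concentration(toy_misses)
```

In the toy data, two families recur and account for 5 of 8 misses; the one-off families contribute to the total but not to the concentrated share.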

System demo
Blind-spot families charted: Art. 13 · Art. 17 · Art. 25 · Art. 5 · Art. 6 · Art. 32 · total: 73% concentration
GREEN · 6 recurring blind-spot families · 73% concentration · 25 real enforcement decisions
Python · Claude API · GDPRhub · LLM-as-judge · Pre-registered · Structured eval