Production systems, not demos. Everyone ships the model. I build what holds it up: data pipelines, orchestration, reliability layers. The 90% that determines whether the 10% actually works. Infrastructure that compounds.
The model is the easy part. Infrastructure, orchestration, reliability. That's where production systems actually break.
Cross-provider multi-agent LLM output verification. Three critics: GPT-4o for accuracy, Claude for logic, Gemini for completeness. They audit any output in parallel via asyncio.gather. An adjudicator synthesizes per-dimension verdicts, calibrated confidence scores, and dismissed-flag explanations. Different providers, different training data, no shared failure modes.
Multi-agent system where Claude agents share extended thinking blockswith each other, not just final text outputs. The recursive loop runs Planner → Critic → Solver for N rounds; each agent receives the full reasoning chain of every prior agent before responding. The Claude-native equivalent of RecursiveMAS (arXiv 2604.25917): same core idea, no GPU access required, deployable today via the Anthropic API.
Semantic caching proxy for LLM APIs. Every prompt gets embedded and checked against past answers: similar enough means instant response, no upstream call. Drop it in by changing one URL. Partitioned by model and system prompt so cross-context false positives are structurally impossible. Full observability stack: Qdrant, Prometheus, Grafana.
Failure forensics for multi-step AI pipelines. Every pipeline run is traced end-to-end: each step captures its input, output, prompt, token count, and LLM self-confidence. When an output is flagged as bad, an LLM-as-judge scores the quality delta at every step and walks backward to find the origin. Cascade failures don't fool it: a step that received garbage and passed it along has delta ≈ 0. The step that produced garbage from good input is the root cause.
Do LLM failure modes pool across unrelated organizations? A two-tier experiment: synthetic orgs first (YELLOW: 2/6 failure types recurring), then 25 real GDPRhub enforcement decisions across 3 DPAs. If blind spots concentrate in a small number of article families that recur regardless of organization or jurisdiction, they are structural to the model class: detectable in advance, not patchable one at a time.