Lessons Learned from Why Agentic AI Projects Fail in Production (Deep Dive)
This is a deeper dive into my LinkedIn post on the pattern that keeps repeating itself with failed Agentic AI projects in the Enterprise....
… not because the models are bad, but because teams are trying to prompt-engineer their way out of writing actual control flow. We touched on this in the previous post but here I want to get technical about why this happens, what the actual failure modes look like, and what architecture works instead. This isn’t theoretical! Everything is based on systems we’ve built, evaluated, and shipped to production in regulated industries where “the AI decided” is not an acceptable answer for anything.
What Agentic Actually Means Architecturally
Before getting into failure modes, it’s important to be clear on a few things, because marketing folks are doing what marketing folks do and have turned “agentic” into a term that means everything and nothing at the same time.
A deterministic pipeline is a fixed code path: Input → Stage 1 → Stage 2 → Stage 3 → Output. Same input, same execution path, every time. You can unit test every stage, you can trace from input to output, you can reason about behavior. When it breaks you know exactly which stage failed because each one has a defined input and output contract. It’s boring, it’s not sexy, but it works.
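A minimal sketch of that fixed code path in Python (the stage names and data shapes are invented for illustration):

```python
# A deterministic pipeline: an ordered list of functions with defined contracts.
def run_pipeline(document, stages):
    """Run each stage in order. The execution path is identical on every run."""
    trace = []
    for stage in stages:
        document = stage(document)
        trace.append(stage.__name__)  # the trace IS the design, not an emergent property
    return document, trace

# Illustrative stages for a lab-report pipeline
def ocr(doc):
    return {**doc, "text": "CREATININE 1.8 mg/dL"}

def match(doc):
    return {**doc, "lab": "creatinine", "value": 1.8}

def score(doc):
    return {**doc, "abnormal": doc["value"] > 1.3}

result, path = run_pipeline({"pages": 3}, [ocr, match, score])
```

Because `stages` is a plain list, the execution path is a property of the code, not of any runtime decision, and each boundary is assertable in a unit test.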
An agentic system is a loop: Input → LLM reasons about what to do → selects and calls a tool → observes the result → LLM reasons again → selects another tool → repeat until the LLM decides it’s done. The LLM is the orchestrator AND the decision-maker AND the executor. Each invocation might take a different path through the tool graph, and thus the behavior is emergent, not designed.
The issue with emergent behavior: you can’t unit test it. You can’t write an assertion that says “given this input, the agent will call tools in this order.” You can only sample it statistically and hope the distribution of outcomes is acceptable. That’s fine for a research prototype and it’s also fine for your vibe coded TODO list, or personal fitness app. It’s not fine for a system processing clinical lab data in a HIPAA-regulated environment.
So the question to be answered is this: which parts of your system need to be deterministic vs. agentic? This is the single most important design decision in applied AI right now, and making that call well is a key skill to develop.
The Three Failure Modes (With Specifics)
In my experience so far, agentic systems break in production in three specific & diagnosable ways.
Failure Mode 1: Decision Drift
In an agentic loop, the LLM makes a series of decisions: which tool to call, what parameters to pass, how to interpret the result, whether to retry, when to terminate. Each of these is a probabilistic inference. Token generation is stochastic and temperature, sampling, even the order of tokens in the context window can shift the decision.
Run the same clinical lab data through a full agentic extraction loop ten times. You won’t get the same execution path ten times! The agent might read page 3 before page 1 on run 4, it might choose a different OCR provider on run 7, it might declare extraction “complete” earlier on run 9 because the phrasing in its chain-of-thought happened to trigger an end-of-task heuristic.
When we evaluated a full agentic architecture against the deterministic pipeline, accuracy dropped from 100% to roughly 70% with the same inputs! The final extraction results were usually consistent, but “usually” is not a word that survives a compliance audit. When a regulator asks whether the system behaves consistently, the answer has to be “yes,” not “statistically, most of the time.”
Failure Mode 2: Cost Multiplication at Scale
Every iteration of an agentic loop is an LLM inference call. Each call consumes input tokens (the full context window + tool results) and generates output tokens (reasoning + tool selection). The token count grows with each iteration because the context window grows.
A deterministic pipeline processes clinical lab data in a single pass through a fixed sequence: render pages at 300 DPI, send to cloud OCR, pattern-match against LOINC codes, parse values and reference ranges, score abnormalities, compute trends. Total cost: $0.02 per document, almost entirely OCR API fees. Zero LLM calls.
The full agentic alternative we evaluated required 5-15 LLM calls per document as the agent needs to: analyze the document structure, decide on an extraction strategy, execute extraction calls, validate results, handle any edge cases it encounters, and determine when it’s done. Each call includes the growing conversation context. Per-document cost: $0.07-0.17.
At enterprise scale (tens of thousands of documents) the agentic approach multiplies costs by 3.5-8.5x with no improvement in accuracy. The POC doesn’t catch this because POCs process dozens of documents and the per-unit economics are invisible until you multiply them by real volume.
Failure Mode 3: The Debugging Black Hole
This is the failure mode that burns the most engineering hours and is the hardest to explain to folks who haven’t lived it.
The deterministic pipeline has discrete stages with observable boundaries. OCR stage: input is page images, output is text blocks with per-line confidence scores. Pattern matching stage: input is raw text, output is structured lab values matched against LOINC codes. Scoring stage: input is extracted values, output is abnormality flags and clinical priority scores. When something is wrong, you check the output at each boundary. The failure is always localizable: OCR misread the value, or the pattern didn’t match a format variation, or the reference range was wrong. You fix the stage, add a test, and you’re done.
Now imagine debugging an agentic extraction that returned wrong results: open a multi-turn conversation log and find that the agent read page 1, then decided to call the table extraction tool with specific parameters, then interpreted the results, then decided to read page 3 for additional context, then attempted to validate a creatinine value, then second-guessed itself and re-extracted. Somewhere in that chain of reasoning, it misinterpreted “creatinine kinase” as “creatinine.” Why? Because the token probabilities, given the accumulated context at that point in the conversation, made that the most likely completion. There’s no root cause in the traditional sense. There’s no bug to fix; this is by design, this is how LLMs work! There’s just a probability distribution that happened to produce the wrong output on this run.
In a regulated environment, this isn’t just an engineering inconvenience. An audit trail requires a traceable chain of custody from input to output. “The model reasoned differently on this run” is not a valid audit finding. You need deterministic traceability: this value was extracted by this pattern from this text at this position with this confidence score. Agentic loops break that chain by design.
The Root Cause: Perception vs. Judgment vs. Orchestration
All three failure modes trace back to one architectural fact: the agentic framework puts the LLM in charge of three fundamentally different jobs that have fundamentally different reliability requirements.
Perception is where unstructured input becomes structured data: reading a document, classifying a document type, extracting entities, matching patterns. This is inherently fuzzy because the inputs are unstructured and variable, and in this problem domain LLMs are world-class; it’s exactly what they were designed for.
Judgment is where structured data becomes decisions: Is this value abnormal? What’s the clinical priority? Should this be flagged for urgent review? The inputs are now structured (numbers, classifications, confidence scores), and the logic should be deterministic. Same inputs, same decision, every time.
Orchestration is the control flow connecting everything: Which stage runs next? What happens on failure? When do you retry vs. escalate? This must be deterministic code, not LLM inference; control flow is not a perception task.
Most agentic frameworks mash all three together in a single LLM loop. The agent perceives AND judges AND orchestrates. Every “AND” in that sentence is a place where non-determinism creeps from the perception layer (where it’s acceptable) into the judgment and orchestration layers (where it’s not).
What the Best AI Engineers Have Been Doing for 3+ Years
The pattern that actually works in production is what the applied ML community has been calling “fuzzy classifiers with deterministic wrappers.” This isn’t new, and as I’ve learned, good AI engineers have been building this way since before the current LLM wave; the broader market is just now catching up with the agentic hype cycle.
The core idea: constrain the LLM to very specific, objective classification tasks. Don’t ask “is this code good?” or “should this patient be reviewed?” These are holistic judgment calls, and the LLM will give you a different answer depending on how you frame the question. I’ve literally asked the same model to review the same authentication middleware 15 minutes apart with two different framings and gotten completely opposite conclusions. Ask it to find security issues and it returns five concerns. Ask it if the code is ready to ship and it fluffs my ego about the elegant error handling. Again, this is not a bug; this is how probabilistic text generation works.
Instead, constrain the classification: “Does this report contain creatinine above 1.5 mg/dL? Return yes/no + confidence 0-1.” “Does this function handle all error paths? Return yes/no + list of unhandled exceptions.” “Does this document match source pattern A or B? Return classification + confidence.” These are perception tasks and they need to be narrow & objective with structured output. LLMs are reliable at this because we’ve removed the ambiguity that makes holistic judgment unreliable.
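Here’s a sketch of the deterministic side of that contract (the JSON schema, threshold, and routing labels are assumptions for illustration, not from any particular product):

```python
import json

def parse_classification(raw: str) -> dict:
    """Validate the structured output of a constrained classification prompt.
    Assumed schema: {"answer": "yes"|"no", "confidence": 0.0-1.0}."""
    result = json.loads(raw)
    if result.get("answer") not in ("yes", "no"):
        raise ValueError(f"unexpected answer: {result.get('answer')!r}")
    confidence = float(result.get("confidence", 0.0))
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return {"answer": result["answer"], "confidence": confidence}

def route(classification: dict, threshold: float = 0.85) -> str:
    """The deterministic wrapper decides what happens with the classification.
    The threshold and route names are illustrative."""
    if classification["confidence"] < threshold:
        return "human_review"  # low confidence always escalates, no exceptions
    return "auto_accept" if classification["answer"] == "yes" else "auto_reject"
```

The model answers a narrow question; the code owns the consequences. Swapping the threshold or the routing rules never requires touching a prompt.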
THEN implement the value chain of what’s good/bad OUTSIDE the LLM loop, in deterministic code. This is where the skill of context engineering (an entire discipline of its own!) comes in not just crafting prompts, but designing the boundary between what the model classifies and what the code decides.
What This Looks Like in a Real System
Let me walk through the architecture of the healthcare system we shipped to show how this separation works in practice.
Perception Layer: Cloud OCR + LOINC Pattern Matching
The system processes clinical lab data from the largest lab companies in the US. The perception pipeline has three stages:
Stage 1: Digitization. PDF pages rendered at 300 DPI, sent to cloud OCR (AWS Textract, Google Document AI, or Azure Document Intelligence, with runtime provider switching). Returns text blocks with per-line confidence scores plus table structures. This is the fuzzy part: cloud OCR is probabilistic, and confidence varies by image quality, layout, font, and format.
Stage 2: LOINC code matching. LOINC (Logical Observation Identifiers Names and Codes) is a universal standard for clinical observations. When the report includes codes like (4548-4), extraction is a deterministic regex lookup: find the parenthesized code, parse the adjacent numeric value, match to a canonical lab definition; no ambiguity. When LOINC codes aren’t present, the system falls back to name-based matching with a curated variation table and fuzzy substring logic and explicitly flags those results as needing verification. The system knows when its perception layer is less reliable and says so.
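When the code is present, the lookup really can be a regex plus a table. A simplified sketch (the two LOINC codes are real; the table, pattern, and output shape are illustrative):

```python
import re

# Illustrative subset; a real system uses a curated canonical definition set.
LOINC = {"2160-0": "creatinine", "4548-4": "hemoglobin_a1c"}

# A parenthesized LOINC code followed by the adjacent numeric value.
PATTERN = re.compile(r"\((\d{1,5}-\d)\)\D*?(\d+(?:\.\d+)?)")

def extract_by_loinc(line: str):
    """Deterministic extraction: find the code, parse the value, match the table.
    Returns None so the caller can fall back to name-based matching,
    flagged as needing verification."""
    m = PATTERN.search(line)
    if not m or m.group(1) not in LOINC:
        return None
    return {"lab": LOINC[m.group(1)], "value": float(m.group(2)), "method": "loinc"}
```

Same line in, same structured record out, every time, and the `method` field preserves the audit trail of how the value was extracted.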
Stage 3: Value parsing. Extract numeric values, units, reference ranges. Handle edge cases deterministically: values with < or > prefixes (below/above detection threshold), units in various formats (mg/dL, mIU/L, nmol/L), reference ranges expressed as upper-only, lower-only, or full ranges. All of this is parsing logic, not inference.
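Those edge cases translate directly into parsing functions. A sketch, assuming simple string formats:

```python
import re

def parse_value(raw: str):
    """Parse a lab value, handling '<'/'>' detection-threshold prefixes."""
    m = re.match(r"\s*([<>]?)\s*(\d+(?:\.\d+)?)", raw)
    if not m:
        return None
    return {"value": float(m.group(2)), "qualifier": m.group(1) or None}

def parse_range(raw: str):
    """Parse a reference range expressed as a full range ('0.7-1.3'),
    upper-only ('<1.3'), or lower-only ('>0.7'). Returns (low, high)."""
    raw = raw.strip()
    if m := re.match(r"([\d.]+)\s*-\s*([\d.]+)$", raw):
        return (float(m.group(1)), float(m.group(2)))
    if m := re.match(r"<\s*([\d.]+)$", raw):
        return (None, float(m.group(1)))
    if m := re.match(r">\s*([\d.]+)$", raw):
        return (float(m.group(1)), None)
    return None  # unrecognized format: flag for review, don't guess
```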
The entire perception layer outputs structured data: lab name, numeric value, unit, reference range, confidence score, extraction method used. That structured output is the contract between the fuzzy layer and the deterministic layer.
Judgment Layer: Deterministic Clinical Decision Logic
Everything downstream of that structured output is deterministic code. No LLM calls, no probability distributions, no prompt sensitivity.
Abnormality detection: Compare extracted value against reference range. Creatinine 1.8 with range 0.7-1.3? Abnormal.... this is a numeric comparison & it doesn’t need a model.
Trend analysis: Track values over multiple collection dates. Compute absolute and percentage change. If the change is less than 5%, the trend is “stable.” For tests where lower is better (HbA1c, LDL, creatinine, triglycerides), a negative change means “improving.” For tests where higher is better (HDL, eGFR, hemoglobin), a positive change means “improving.” Fluctuation detection counts direction reversals across measurements; two or more reversals flags the pattern. All of this lives in a deterministic function with hardcoded clinical rules drawn from published medical guidelines, not learned weights from an LLM.
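That trend rule is a pure function. A sketch of the direction logic just described (fluctuation counting omitted; the test lists and function signature are assumptions based on the text):

```python
LOWER_IS_BETTER = {"hba1c", "ldl", "creatinine", "triglycerides"}
HIGHER_IS_BETTER = {"hdl", "egfr", "hemoglobin"}

def classify_trend(test: str, first: float, last: float) -> str:
    """Deterministic trend classification: <5% change is 'stable';
    otherwise direction is mapped against the test's desired direction."""
    pct_change = (last - first) / first * 100
    if abs(pct_change) < 5:
        return "stable"
    going_down = pct_change < 0
    if test in LOWER_IS_BETTER:
        return "improving" if going_down else "worsening"
    if test in HIGHER_IS_BETTER:
        return "worsening" if going_down else "improving"
    return "changing"  # unknown directionality: flag it, don't guess
```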
Clinical priority scoring: This is the highest-stakes decision in the system. Each test gets a score from 1 to 10: worsening + abnormal = 10 (urgent attention). Worsening only = 7 (monitor closely). Abnormal + stable = 5 (continue monitoring). Fluctuating = 4 (check medication adherence). Improving + still abnormal = 3 (continue therapy). Improving + normal = 1 (good progress).
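That scoring table maps directly onto a pure function. One sketch, with the stable-and-normal case (not listed above) assumed to score 1:

```python
def priority_score(abnormal: bool, trend: str) -> int:
    """Deterministic clinical priority: same inputs, same score, every time.
    Mirrors the 1-10 table above; stable+normal defaulting to 1 is an assumption."""
    if trend == "worsening":
        return 10 if abnormal else 7   # urgent attention / monitor closely
    if trend == "fluctuating":
        return 4                       # check medication adherence
    if trend == "improving":
        return 3 if abnormal else 1    # continue therapy / good progress
    return 5 if abnormal else 1        # stable: continue monitoring / no action
```

There is no prompt that can change this mapping; changing it means a code review against clinical guidelines, which is exactly what a regulator wants to see.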
When a clinician asks “why was this flagged?” the answer is fully traceable: “Creatinine extracted at 1.8 mg/dL via LOINC code 2160-0, reference range 0.7-1.3, flagged abnormal. Three measurements over 90 days showing 12% upward trend. Priority score 10: worsening and abnormal.” Every element is deterministic & auditable with zero model reasoning to reconstruct.
Orchestration Layer: Deterministic Control Flow
Provider selection: check if the configured cloud provider is available, if not fall back to the next configured provider. This is an if-statement, not an LLM decision. Confidence routing: if OCR confidence is below threshold, flag for human review. If LOINC code extraction succeeds, use it; if not, fall back to name-based extraction and mark as needing verification. Retry logic: if cloud OCR fails, retry with exponential backoff then fall back to local regex-only extraction.
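The provider fallback really is just a loop over a configured list. A sketch with injected `is_available` and `call` callables (hypothetical signatures), which keeps the routing unit-testable without any cloud credentials:

```python
def run_ocr(page, providers, is_available, call):
    """Deterministic provider fallback chain: try each configured provider in
    order, falling through on unavailability or failure. An if-statement with
    a for-loop around it, not an LLM decision."""
    errors = {}
    for name in providers:
        if not is_available(name):
            errors[name] = "unavailable"
            continue
        try:
            return name, call(name, page)
        except Exception as exc:
            errors[name] = str(exc)  # record and fall through to the next provider
    raise RuntimeError(f"all OCR providers failed: {errors}")
```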
All of this is control flow and NONE of it benefits from LLM reasoning. An agentic system would have the LLM “decide” which provider to use, “decide” whether to retry, “decide” if results are good enough. Each of those decisions introduces non-determinism into a layer where determinism is the entire point.
The Quantitative Proof
But don’t take my word for it. We didn’t just build the deterministic architecture and assume it was better, we designed and evaluated four progressively sophisticated agentic alternatives and compared them head to head.
Level 0: Deterministic fallback chain (no AI). Predefined retry strategies: try AWS, then Azure, then GCP, then enhanced OCR settings, then pure regex. Fully deterministic, fully testable... this is the baseline and it works. The reason I include it is to make a point: this is what you’re comparing against when you start adding LLM decision-making into the loop. Everything after it introduces non-determinism, and the question is whether that non-determinism earns its cost.
Level 1: LLM-assisted strategy selection. Use Claude to analyze each document and recommend which OCR provider and extraction approach to use. The problem: we’re paying for an LLM inference call to make a decision that a single conditional handles reliably. The LLM added $0.003-0.01 per document in API costs, 1-3 seconds of latency, and non-deterministic strategy selection, with zero improvement in extraction quality.
Level 2: Full agentic loop. The LLM controls the entire extraction through tool use: reading pages, choosing OCR providers, applying extraction patterns, validating results, deciding when it’s done, resulting in 5-15 LLM calls per document. Processing time: 85-120 seconds vs the 72.7-second baseline. Cost per document: $0.07-0.17 vs $0.02. Determinism: ~70% vs 100%. Additional code: 400+ lines of tool definitions, conversation management, and error handling for mid-conversation failures. Accuracy improvement: exactly zero.
Level 3: Hybrid. Deterministic primary + agentic fallback. Architecturally the most reasonable BUT maintaining an entire agentic subsystem for a fallback path that never triggers is pure lunacy!! 450 additional lines of code, two complete systems to maintain, full test coverage required for both paths.... for zero benefit. Nah.
Every agentic option made the system worse on every metric. Not marginal but significantly worse. The full agentic loop was up to 65% slower, up to 8.5x more expensive, and 30% less deterministic with absolutely zero accuracy improvement.
The agentic approaches didn’t fail because the models were bad rather they failed because the problem didn’t need an agent. Fixed lab sources + detectable formats + LOINC codes + known reference ranges.... that’s a stability problem and agents solve variability problems. Applying the variability solution to a stability problem adds cost and risk with no upside.
Why Teams Keep Getting This Wrong
If the failure modes are this predictable and the quantitative evidence is this clear, why do teams keep building agentic systems that break in production?
First, agentic sells better. “We built an AI agent that autonomously processes your clinical documents” gets funded. “We built a deterministic pipeline with AI-powered OCR at the perception layer” does not. The exec dashboard for an agentic system looks like a sci-fi movie while the deterministic pipeline looks like a boring flowchart. This is a real problem because it means capital flows toward architectures that demo well, not architectures that survive production! You see this all over the interwebs right now. You only have to look as far as the millions of OpenClaw posts.
Second, vibe coding has conditioned a generation of developers to throw everything at the LLM. When your primary tool is an LLM + agentic harness (Claude Code), the reflex is to prompt for a holistic answer rather than decompose the problem into what the model should classify and what deterministic code should decide. That decomposition requires deep domain knowledge: you have to understand your problem space well enough to enumerate specific, objective criteria. You can’t outsource that understanding to the model. This is context engineering, not prompt engineering, and it’s a fundamentally different skill.
Third, because it requires admitting that the LLM is not as smart as it seems. When you separate perception from judgment, you’re acknowledging that the model’s “understanding” is pattern matching, not comprehension. The model almost certainly knows that a creatinine of 4.0 is clinically dangerous as that’s basic medical knowledge well-represented in training data, but will it flag it the same way every time? Will the surrounding context, the phrasing of the prompt, the other values in the report shift whether it calls it ‘critical’ vs ‘elevated’ vs ‘worth monitoring’? A deterministic comparison against a reference range + other variables gets it right every time, with zero variance. And when a regulator asks how the decision was made you point to one line of logic instead of a probability distribution.
The people still trying to prompt-engineer an LLM into making holistic judgment calls like “is this code good,” “should this patient be reviewed urgently,” “is this extraction complete” are building systems that work in demos and break in production. The LLM will wax poetic about how great the code is if you ask it that way. Constrain it to “does this function handle all error paths, return yes/no + list of unhandled exceptions” and now you have something you can build actual control flow around.
When Agentic Earns Its Place
I want to be clear that there are real problems where fully agentic architectures earn their complexity. I’ll write about that in the future based on the experiences of another solution we developed.
If the system needed to support hundreds of unpredictable formats instead of a group of semi-standardized sources, the LOINC pattern matching would break. The name variation table would be unmanageable, the input variability would exceed what deterministic rules can handle and that’s when an agentic approach at the perception layer starts earning its trade-offs: the flexibility to reason about unfamiliar document structures, adapt extraction strategies to novel formats, and handle edge cases that no reasonable set of rules would cover.
If the scope expanded to radiology narratives, pathology reports, or handwritten clinical notes, the entire perception layer would need to change as deterministic pattern matching doesn’t work on unstructured narrative text. The variability is too high, the context too important, the semantic interpretation too nuanced. A model-heavy perception layer makes sense here.
But even in those scenarios, the judgment layer stays deterministic. Clinical priority scoring, abnormality detection, trend analysis, and the entire orchestration stay in code. We can make the classifiers more flexible to handle higher variability at the perception layer, but we haven’t given the model the car keys to the decision layer.
The framework is simple: agentic solves variability problems while deterministic solves stability problems. Most production systems I’m seeing are stability problems wearing a variability costume because someone saw a compelling agentic demo and assumed that’s the right architecture for everything.
The Practitioner’s Audit
If you’re running an agentic AI project approaching production, or troubleshooting one that’s already struggling, here’s what I’d evaluate:
Map every LLM call to a layer. Is the model doing perception (classifying, extracting, pattern-matching structured output)? Or is it doing judgment (deciding, routing, scoring, escalating)? If it’s doing judgment, that’s where your production failures will come from. Move those decisions into deterministic code.
Check your orchestration. Is the LLM deciding which tool to call next, or is control flow in code? Every LLM-controlled orchestration decision is a place where decision drift accumulates. If a conditional or a state machine can make the same routing decision, use that.
Test for determinism! Run the same input 10+ times. Do you get the same execution path? The same intermediate states? The same output? If not, identify which layer is introducing variance and whether that variance is in perception (acceptable) or judgment/orchestration (not acceptable).
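A harness for that check is a few lines, assuming your system can return its execution path alongside its output (both names here are illustrative):

```python
def determinism_check(system, test_input, runs=10):
    """Run the same input N times and report whether outputs and execution
    paths are identical. 'system' is assumed to return (output, path)."""
    results = [system(test_input) for _ in range(runs)]
    distinct_outputs = {repr(output) for output, _ in results}
    distinct_paths = {tuple(path) for _, path in results}
    return {
        "deterministic": len(distinct_outputs) == 1 and len(distinct_paths) == 1,
        "distinct_outputs": len(distinct_outputs),
        "distinct_paths": len(distinct_paths),
    }
```

If `distinct_paths` is greater than 1, trace which layer introduced the variance before deciding whether it’s acceptable.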
Run the cost math at production volume. Multiply your per-document LLM token consumption by realistic monthly volume and compare against a deterministic alternative where the LLM is constrained to specific classification calls. If the agentic approach is 3x+ more expensive with no accuracy improvement, you have your answer.
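The math is trivial, which is exactly why it’s worth writing down. A sketch in integer cents, using the per-document figures from this article and a hypothetical 50,000-document monthly volume:

```python
def monthly_cost_cents(per_doc_cents: int, docs_per_month: int) -> int:
    """Cost projection in integer cents to avoid float drift."""
    return per_doc_cents * docs_per_month

volume = 50_000  # hypothetical enterprise monthly volume ("tens of thousands")

deterministic = monthly_cost_cents(2, volume)    # $0.02/doc -> $1,000/month
agentic_low = monthly_cost_cents(7, volume)      # $0.07/doc -> $3,500/month
agentic_high = monthly_cost_cents(17, volume)    # $0.17/doc -> $8,500/month

multiplier_high = agentic_high / deterministic   # the 8.5x worst case
```

The multiplier is invisible at POC volume (dozens of documents) and decisive at production volume; run it before the architecture review, not after.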
Ask the debugging question. When the system produces a wrong result, can you localize the failure to a specific stage with a defined input/output contract? Or do you need to read through conversation logs? If the latter, your debugging costs in production will dominate your engineering time.
Classify your problem. Is the core challenge high input variability (many unknown sources, unpredictable formats, novel document types)? Or is it a bounded input space with known structure and stable rules? If the latter, the deterministic approach wins. Every. Single. Time.
The Architecture That Ships
“Agents everywhere” is 100% not the answer but neither is “no agents anywhere.” The architecture that actually ships to production and survives at scale is a hybrid: deterministic pipelines for the stable guardrails, agents (or more precisely, LLM-powered fuzzy classifiers) for the genuinely high-variation perception work, and deterministic wrappers around every model output that touches a decision.
Fuzzy classification at the perception layer + deterministic wrappers at the decision layer. Control flow in code not in the LLM loop. Confidence scores that route to human escalation when the model is uncertain coupled with an audit trail that traces from input to output without requiring anyone to reconstruct a model’s “reasoning.”
The hard part isn’t building this; it’s having the domain expertise to decompose your problem correctly, to know which parts are perception tasks where the model’s flexibility adds value, and which parts are judgment tasks where determinism is non-negotiable. That decomposition requires you to understand your domain deeply enough to enumerate specific, objective criteria. You can’t outsource that to the model and you can’t prompt-engineer your way around it. You have to actually do the hard work of defining what good looks like.
And IME, that’s exactly where many enterprise AI projects are breaking down. Not at the model layer or data layer.... but at the architecture layer. At the boundary between what should be fuzzy and what should be deterministic: critically, the decision about where the LLM loop ends and control flow begins.
Get that boundary right and everything else follows. Get it wrong and no amount of model capability fixes it.
You can’t outsource your thinking to AI.
(^^^^^^^^^^^^^^^^^I will be saying this over and over, get used to it.)

