Agent Incidents Have Two Failure Types. Your Postmortem Checks for One.

5 min read · Acrein Group

When Your Agent Goes Wrong, You Need to Know Why Before You Fix It

Your agent acted. The outcome was wrong. The incident got filed, investigated, and closed. A fix shipped. But now you're sitting with your team and someone asks the question nobody answered: did the agent have bad information, or did it understand the situation correctly and still choose wrong?

You realize the postmortem never actually separated those two things.

This matters because the two failures require completely different fixes, owned by different people and measured differently. If you fix the wrong one, the same class of incident happens again.

The Two Failure Modes Inside One Incident

When an agent causes a production problem, one of two things happened at decision time.

First: the agent had complete visibility into the situation and still chose wrong. It understood what was happening and picked the incorrect action anyway. That is a decision failure. The model or the constraints that guide the model are insufficient.

Second: the agent had incomplete visibility and made a confident choice based on what it could see. It acted on partial information as though it were complete. That is a context failure. The agent was never given access to what it needed to know.

Both look like the same incident from the outside. A wrong action happened. An alert fired. Something broke.

Both feel like the same problem when you're writing the postmortem summary.

Neither is fixed by treating them as one failure.

Why Your Standard Postmortem Template Misses This

A typical incident review asks three questions: what happened, why did it happen, and what did we change to prevent it.

That structure assumes one root cause. One thing to fix. One owner to own the fix.

With agent incidents, that assumption is wrong.

A decision failure lives in the model, the constraints, the decision logic. It is an engineering problem. It belongs to whoever owns the agent's configuration and training signal.

A context failure lives in the permissions, the data access, the visibility boundaries, the integration endpoints. It is a systems and access control problem. It belongs to whoever owns the agent's integration footprint.

When your postmortem bundles both into "the agent made a bad decision," you ship a fix to one layer and hope it addresses both. Usually it addresses neither.

The team stays uncertain. The same class of wrong choice can still happen next week. And you have no confidence in whether the fix you shipped even mattered.

The Diagnostic You Need to Run Before Closing Any Incident

Before the postmortem closes, run two explicit checks.

Check one: reconstruct exactly what context the agent had access to at the moment it made the decision. Not what context you assume it had. What it actually could see.

Pull the logs. Trace the data flows. Check the permissions it ran under. Check the API responses it received. Rebuild the information landscape from the agent's perspective at that exact moment in time.
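A minimal sketch of check one, assuming the agent's inputs were logged as timestamped records. The `Observation` record and the `reconstruct_context` helper are hypothetical names chosen for illustration, not part of any particular logging system. The idea is to keep only what the agent had actually received at or before decision time, and to ignore anything that arrived later.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass(frozen=True)
class Observation:
    """One piece of data the agent actually received (hypothetical log record)."""
    seen_at: datetime   # when the agent received the value
    source: str         # e.g. "ticketing", "queue_api"
    key: str            # name of the field, e.g. "queue_depth"
    value: Any


def reconstruct_context(log: list[Observation], decided_at: datetime) -> dict[str, Any]:
    """Return the latest value per key that the agent saw at or before decided_at."""
    context: dict[str, Any] = {}
    # Process in time order so later observations overwrite earlier ones.
    for obs in sorted(log, key=lambda o: o.seen_at):
        if obs.seen_at <= decided_at:
            context[obs.key] = obs.value
    return context
```

Anything missing from the returned dictionary is something the agent could not see when it acted, no matter what your dashboards showed at the time.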

Check two: given that context, was a correct decision even possible?

This is the critical question most teams skip. With what the agent could see, could it have chosen differently and still solved the problem? Or was the outcome inevitable given the visibility it had?

If yes, a correct decision was possible, so the failure is in the decision model. The agent had what it needed and chose wrong. Fix the decision logic.

If no, a correct decision was not possible, so the failure is in context design. The agent was never given the information required to choose correctly. Fix the visibility, permissions, or data access that starved it.

Only one of these fixes addresses the actual failure.
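The two checks can be sketched as a small classifier. This is a simplification under stated assumptions: `FailureType` and `classify_failure` are hypothetical names, and "was a correct decision possible" is approximated by whether every field the reviewers judge necessary was visible to the agent. Deciding what counts as necessary is still a human call, and a field that was present but stale or wrong would need its own handling.

```python
from enum import Enum


class FailureType(Enum):
    DECISION = "decision"   # agent had what it needed and still chose wrong
    CONTEXT = "context"     # agent lacked information needed to choose correctly


def classify_failure(
    visible_fields: set[str],
    required_fields: set[str],
) -> tuple[FailureType, set[str]]:
    """Classify an incident from what the agent saw versus what it needed.

    required_fields is an assumption supplied by reviewers: the inputs
    without which a correct decision was not possible.
    """
    missing = required_fields - visible_fields
    if missing:
        # A correct decision was not possible: context failure.
        return FailureType.CONTEXT, missing
    # Everything needed was visible: the decision model is at fault.
    return FailureType.DECISION, set()
```

The returned set of missing fields doubles as the work list for whoever owns the agent's data access.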

What This Looks Like When It Breaks

An agent managing customer support tickets receives a high-priority escalation. It routes the ticket to the overflow queue and closes it. The customer sees their issue dismissed. The incident fires.

The postmortem finds that the agent had no visibility into current queue depth and no way to know that another route was available. Decision failure: no. Given what it could see, the agent made the only choice it could make. Context failure: yes. It needs access to real-time queue metrics and routing availability before it can make an intelligent choice.

Same company, different day. A different agent has complete visibility into queue depth, routing availability, and customer history. It still sends a premium support customer to overflow instead of escalating to tier-two. The incident fires again.

Postmortem finds: the agent could see everything. It understood the situation. Decision failure: yes. The routing constraints or the model's understanding of escalation criteria need to change. It is not a data access problem. It is a decision model problem.

Same surface-level incident. Two completely different root causes. Two completely different owners. Two completely different fixes.

If the first incident had been treated as a decision failure, the team would have rewritten constraints that were already correct. The real issue would have persisted.

If the second incident had been treated as a context failure, the team would have added data access the agent already had. The real issue would have persisted.

Why This Distinction Changes Everything

Once you know which failure type occurred, the rest of the incident response has a clear owner and a clear success metric.

Decision failure. The agent had what it needed. That means the decision model is the bottleneck. Ownership moves to whoever owns the model configuration, constraints, or training signal. Success looks like: the agent sees the same context and now chooses differently.

Context failure. The agent was blind to something critical. That means visibility is the bottleneck. Ownership moves to whoever manages permissions, integrations, and data access. Success looks like: the agent now has access to the information it needs, and now it chooses correctly.
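The two success criteria can be written as replay checks. This is a sketch, assuming the agent's decision step can be called as a function from a context dictionary to an action name. `Policy`, `toy_policy`, and both check functions are hypothetical stand-ins for illustration, not a real agent API.

```python
from typing import Any, Callable, Mapping

# Hypothetical: the agent's decision step, from context to an action name.
Policy = Callable[[Mapping[str, Any]], str]


def decision_fix_verified(
    patched_policy: Policy,
    incident_context: Mapping[str, Any],
    expected_action: str,
) -> bool:
    """Decision failure: same context as the incident, patched policy chooses correctly."""
    return patched_policy(incident_context) == expected_action


def context_fix_verified(
    policy: Policy,
    incident_context: Mapping[str, Any],
    added_fields: Mapping[str, Any],
    expected_action: str,
) -> bool:
    """Context failure: the unchanged policy is wrong without the new fields
    and correct once they are supplied."""
    enriched = {**incident_context, **added_fields}
    return (
        policy(incident_context) != expected_action
        and policy(enriched) == expected_action
    )


def toy_policy(ctx: Mapping[str, Any]) -> str:
    """Illustrative stand-in for an agent's routing decision."""
    if ctx.get("customer_tier") == "premium" and ctx.get("tier_two_available"):
        return "escalate_tier_two"
    return "route_overflow"
```

Replaying the original incident context this way turns the success metric into something you can run before redeploying, instead of waiting for the next incident.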

Without this distinction, postmortems produce fixes that feel right without proving they work. You ship a change. You redeploy the agent. You wait to see if the same failure happens again.

With the distinction, you know exactly what you are testing. You know whether the fix you shipped addressed the actual failure or just patched the symptoms.

The next incident will tell you whether you were right.


If your team just closed an agent incident and nobody explicitly determined which failure type occurred, go back and run the two checks. Reconstruct what context the agent had. Evaluate whether a correct action was possible with that context. Only then will you know whether your fix actually addressed what broke, or whether the same failure is still possible next week. Acrein Group runs agentic operations across its own portfolio and has learned this distinction the hard way. If you need to move past postmortem guessing to genuine operational confidence, that's where this work lives.
