Context Window Poisoning in Multi-Agent Pipelines: Why Agents Hallucinate Together

8 min read · May 27, 2026 · By Orvi

When one agent's hallucination spreads through a pipeline, no agent catches it. Here's why cascading trust causes silent failures — and how to stop it.

Most multi-agent AI coordination patterns that reach production were designed for demos, not distributed systems — and the engineering teams building them often cannot tell the difference until something blows up at 2 AM on a Tuesday with a client deliverable attached.

That sentence will annoy people. It should. But if you have spent real time debugging why an orchestrator confidently delivered a hallucinated output after five agents agreed on it, or why a pipeline that passed every eval suddenly entered an infinite loop in production, you know the annoyance is aimed correctly. These are not random failures. They are the predictable output of a set of structural conditions that the industry is replicating at scale. The multi-agent AI coordination patterns and production failures being reported now are not surprises — they are the delayed consequence of architectural decisions made under demo pressure.

This is what those decisions look like from the inside.

Why Do Multi-Agent Systems Fail at the Seams and Not the Center?

The short answer: each agent performs fine in isolation. What’s undefined — and stays undefined until load forces the question — is the interface between them.

Within a single agent, the failure surface is bounded. One model, one context window, one set of tool permissions. You can test it, observe it, characterize its failure modes. The moment you introduce a second agent that receives the first agent’s output as input, you have created an interface with no contract. There is no schema enforcement. There is no type system at the handoff. Agent A returns a string formatted to look like JSON; agent B parses it as JSON; then at 2:47 AM, agent A returns a string with a nested apostrophe in a city name, and agent B either throws an exception or, more dangerously, silently proceeds with partial state.

The exception is recoverable. The silent partial-state continuation is not — because the system logs a success.

Microsoft Research documented this failure class directly in the AutoGen framework paper (Wu et al., 2023). The framework required explicit termination callbacks and shared state contracts to prevent what the authors called “conversation derailment” — agents in a loop that are coherent, confident, and wrong. The team treated this failure mode as a design requirement because it appeared so consistently in evaluation that it could not be treated as an edge case. The fix was not a model improvement. It was an architectural primitive.

The seam is where the system lives. The center is where the team tests.

What Is the Single Pattern That Destroys Production Systems Eventually?

Cascading trust: each agent treats the previous agent’s output as verified ground truth, so errors compound rather than cancel.

Here is the exact failure sequence. An orchestrator sends a research task to a sub-agent. The sub-agent returns findings with a confident wrong assertion — a hallucinated publication date, a conflated organization name, a statistic that is directionally plausible but numerically fabricated. The orchestrator incorporates this into the working context without verification — because it was not designed to verify, it was designed to coordinate. A synthesis agent receives the corrupted context and builds on it. The final output is internally consistent, professionally formatted, and wrong at its foundation.

Every agent completed its assigned step. The task-completion metric logged a success. The pipeline itself is the problem.

This is precisely what the AgentBench benchmark (Liu et al., 2023) was designed to stress-test. Across eight distinct real-world task environments — including OS interaction, database management, and knowledge reasoning — even GPT-4, the top-performing model at time of publication, showed substantial failure rates on tasks requiring multi-step grounded reasoning. The dominant failure mode was not reasoning breakdown within a single step. It was error propagation across steps: each agent was positioned to complete its local task, and no agent had the global context required to catch what had gone wrong upstream.

The system that produces this outcome has a name: it is an evaluation architecture that rewards local optimization without measuring global correctness. Every incentive at every layer is to pass the output downstream. Challenging upstream output is not part of any agent’s job description, because job descriptions in agent pipelines are written by engineers optimizing for modularity, not for truth. That is structural, not incidental. The system produces cascading trust because cascading trust is what the system was built to produce.

Does Adding More Agents Actually Reduce Failures?

The intuitive answer is yes — redundancy means more verification. The experimental answer is no, unless the agents bring genuinely different information.

The counterargument is coherent: if you add a verification agent after every step, or run three agents in parallel and take a majority vote, you get more coverage. Three checks are better than one.

What this argument misses is that agents sharing the same base model share the same systematic failure modes. Three GPT-4o instances reviewing the same claim are not three independent auditors — they are the same prior sampled three times. Du et al. (2023) tested this directly in a structured multi-agent debate setup: language models explicitly challenged each other’s reasoning across factual and mathematical tasks. Performance improved only when agents were initialized with genuinely different framings of the problem. Without that divergence, agents converged toward agreement, not accuracy — they debated toward shared confidence rather than toward the truth.

Redundancy without diversity is latency. A verification agent that reads the orchestrator’s output and confirms it looks reasonable is not adding a check — it is adding a step. The check is only real if the verification agent has access to ground truth the original agent lacked, or if it is operating on a structurally different reasoning path. Spinning up three homogeneous agents to vote on a claim is not error correction. It is error laundering.

Which Coordination Architectures Actually Hold Under Load?

Hierarchical orchestration with explicit state contracts at every handoff. Everything else is a prototype that has not yet met its production environment.

What holds: a single orchestrator that owns the task definition and the success criteria; specialized sub-agents with narrowly scoped tool permissions; explicit schemas for every inter-agent message — not “the model will format it correctly” but actual enforced schemas; and a defined recovery path for every failure state. Not an error message. A recovery path. What does the orchestrator do when a sub-agent returns malformed output? If the answer at architecture review is “it depends on the model’s judgment,” the system is not production-ready. It is production-adjacent.

What does not hold: flat graphs where any agent can message any other agent without an orchestrator mediating state; shared write access to external resources without coordination primitives; pipelines longer than three hops without a human-in-the-loop checkpoint; dynamic agent spawning without hard resource ceilings.

Anthropic’s 2024 guidance on building effective agents draws a precise line between workflows — where agent steps are predetermined before execution — and dynamic agents, where the model decides at runtime which tools or sub-agents to call. The guidance is explicit: use predetermined workflows when reliability is the primary constraint; use dynamic orchestration only when the task genuinely cannot be decomposed in advance, and accept the expanded failure surface as a conscious architectural choice, not an implicit one.

That distinction almost never survives implementation, because a predetermined pipeline feels less impressive than an autonomous orchestrator. The demo pressure is toward autonomy. The production pressure is toward reliability. These are in direct tension, and the tension is almost never named at architecture review.

What Are the Failure Modes Nobody Warns You About?

Context window poisoning across handoffs. And exit condition drift — the structural bias of every multi-agent loop toward continuing rather than stopping.

Context poisoning is the quiet one. It does not surface as a crash or an error. What you get is systematic drift. Agent A, asked to summarize a document, produces a summary that slightly overstates a causal relationship. The overstatement is subtle — a “correlates with” that becomes a “causes.” Agent B builds recommendations on the summary’s framing. The final output is wrong in ways that require domain expertise to catch, and wrong in ways that standard automated evaluation — which has no domain expertise — cannot detect. The pipeline reports success because success was defined as task completion, not accuracy.

Exit condition drift is subtler still. Multi-agent systems, unlike single agents, have no natural terminal state unless one is explicitly defined. Every agent in the loop is optimizing to produce output that the next agent can use. The system as a whole develops a structural bias toward continuing — because continuation is what each agent is rewarded for. Loops designed to complete in three steps run for fourteen because no agent was built to say “this task is done.” They were built to say “here is my output.”

What practitioners do that papers do not recommend: they set hard step ceilings at the orchestrator level and treat any pipeline that approaches the ceiling as a diagnostic signal, not a success metric. A pipeline that regularly uses twelve steps to complete a task designed for four is telling you something about your task decomposition. The ceiling is not a guardrail — it is an instrument. Read it.

Where Is This System Headed That Most People Have Not Accounted For?

The next major wave of multi-agent production failures will not be technical — it will be organizational.

As these systems move from engineering experiments into operational infrastructure — into decision support, into automated reporting, into anything that informs resource allocation — the gap between what an agent pipeline reports and what is actually true becomes a liability question, not an engineering question. The teams that built the systems understand their failure modes. The stakeholders consuming the outputs often do not, and are not being given the context to evaluate them.

The structural incentive inside every organization deploying multi-agent AI is toward trust. Trust is fast. Distrust is expensive. Verification requires domain expertise that is often not on the consuming team. The pressure, applied at every organizational layer, is to surface confident outputs — because confident outputs are what get budgets renewed.

That incentive, at scale, is how you end up with entire operational processes running on cascading agent outputs that no individual human has verified end to end. Not because anyone made a reckless decision. Because the system made verification irrational.

The organizations that explicitly design human verification back into the pipeline — not as an override, but as a structural feature with a dedicated role and budget line — will be the ones worth learning from in three years. The rest will be the case studies.

The Book of Life Orvi · 2026

multi-agent AIAI coordinationproduction AI failuresLLM orchestrationagent architectureAI systems designagent pipelinesAI reliability