The Trajectory Integrity Problem: Why Agentic AI Systems Drift Over Time
Abstract. Agentic AI systems fail differently from single-step language model interactions. In long-horizon workflows, minor early errors can propagate and compound across subsequent reasoning and execution steps. Recent empirical evidence shows that frontier models corrupt on average 25% of document content over 20 sequential interactions through compounding error propagation across workflow steps, even when early steps appear locally successful [2]. A distinct failure mode emerges across repeated executions of the same task: in the retail domain, reliability collapses from above 60% at pass^1 to below 25% at pass^8, which exposes behavioral inconsistency independent of trajectory length [1]. Theoretical analysis further shows that, without new exogenous signals, decomposing workflows across multiple agents does not add new decision-relevant information; instead, communication constraints can introduce information-compression costs [3]. This note examines the mechanisms by which trajectory integrity degrades in long-horizon autonomous workflows. It also identifies why the most dangerous aspect of agentic drift is not that outputs become incoherent, but that they remain fluent [4].
§1. The Question
Single-step evaluations measure whether a model can complete an isolated task. Agentic systems execute sequences of interconnected actions in which intermediate outputs can influence subsequent reasoning, retrieval, planning, and tool usage [2, 4]. As trajectories grow longer, can locally reasonable steps accumulate into globally incorrect workflows? And why can systems continue producing coherent outputs even after workflow execution has begun diverging from the original objective?
§2. Scope and Definitions
Agentic AI System: A language-model-based system capable of autonomously executing multi-step workflows involving planning, retrieval, tool usage, delegation, and sequential interaction with external environments.
Trajectory: The sequential chain of intermediate states, actions, tool calls, retrieved context, and generated outputs produced during workflow execution.
Trajectory Integrity: A term defined in this note, building on [4], for the extent to which a workflow remains aligned with its original objective over long execution horizons despite intermediate uncertainty, evolving workflow state, and environmental interaction.
Recursive Context Contamination: A term introduced in this note for a failure mode in which intermediate outputs become part of the evolving workflow context, allowing early errors to propagate through subsequent reasoning, retrieval, and planning steps [2, 4, 5].
Deceptive Fluency: A term used in [4] to describe a failure mode in which a system continues producing linguistically coherent outputs despite functional misalignment between reasoning and execution. Outputs remain coherent but increasingly align with drifted intermediate workflow context rather than the original task objective.
Silent Incorrect Computation: A term used in [5] for a computational manifestation of Deceptive Fluency: the production of syntactically valid, executable code or pipelines that yield numerically plausible but logically incorrect or physically inconsistent results. Unlike fluency failures in text, Silent Incorrect Computation executes without error while producing outputs whose failure may not be immediately visible from surface inspection alone.
Silent Drift: A term introduced in this note for the gradual divergence of a workflow from its intended objective through compounding intermediate errors that produce locally plausible but globally misaligned execution trajectories [4, 5].
Trajectory Instability: A term introduced in this note for the failure mode in which repeated executions of the same task expose inconsistency in agent behavior, measured by the collapse of pass^k reliability across trials [1].
Delegation Loss: A term introduced in this note for the structural reliability cost introduced when added agent roles reprocess shared evidence under communication constraints rather than contributing new decision-relevant signals [3].
§3. Key Findings
- Reliability collapses across repeated executions. τ-bench introduces pass^k, the probability that an agent succeeds on all k repeated executions of the same task [1]. GPT-4o achieves above 60% pass^1 but below 25% pass^8 in the retail domain [1]. In other words, a system that succeeds on more than 60% of individual attempts completes the same task on all eight repeated executions less than 25% of the time. This suggests that reliability in agentic systems is constrained not only by step-level capability but also by behavioral consistency across repeated executions (i.e., Trajectory Instability).
- Step-level competence does not predict trajectory-level reliability. Microsoft Research evaluations across 19 LLMs show that frontier models corrupt on average 25% of document content over 20 sequential interactions [2]. Two findings are particularly significant: adding an agentic harness does not improve performance on the degradation measure, and strong performance in early interactions does not predict long-horizon reliability. Trajectory failure compounds over extended interaction horizons.
- The Agentic Paradox. Counterintuitively, using an agentic harness does not improve performance on DELEGATE-52 [2]. Even with tool-mediated iteration, models do not improve over the single-turn baseline. This suggests that tool-mediated orchestration alone is insufficient to prevent long-horizon document degradation.
- Multi-agent decomposition introduces structural reliability limits. Theoretical analysis in [3] shows that, without new exogenous signals, a delegated multi-agent network cannot outperform a centralized decision maker with access to the same information. Added agent roles do not introduce new decision-relevant signals; they reprocess shared evidence. Each handoff between agents can therefore incur communication and information-compression costs. Under logarithmic loss, these costs can be expressed in terms of conditional mutual information. Decomposition thus carries a structural communication cost (i.e., Delegation Loss) when agents operate over the same underlying information.
- Fluency masks workflow degradation. Agentic systems can continue generating coherent outputs even as workflow execution diverges from the intended objective [4]. This phenomenon (i.e., Deceptive Fluency) occurs when the system maintains linguistic coherence despite functional misalignment. As a result, workflows may continue producing confident, well-structured outputs even when intermediate reasoning or execution steps have already become unreliable.
- Silent drift is not visible from isolated outputs. A case study of structured agentic workflows shows that systems can produce individually plausible outputs while introducing inconsistencies that only become apparent when steps are compared against one another [5]. Workflow degradation may therefore go undetected when outputs are evaluated in isolation rather than across complete execution trajectories.
§4. Technical Deep Dive: Why Agentic Systems Drift
§A. Recursive Context Contamination
Unlike single-step prompting, agentic systems operate through sequential interactions in which intermediate outputs can influence subsequent reasoning and execution steps. Recent studies show that early errors can cascade through later planning, retrieval, and tool-use stages [4, 5]. As workflows grow longer, these effects compound [2], increasing the risk that locally plausible intermediate outputs contribute to progressively degraded execution trajectories. Early workflow divergence can thus shape subsequent system behavior while remaining difficult to detect from isolated outputs alone.
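To give a feel for how small per-step losses compound, the sketch below makes a simplifying assumption that is not taken from the cited studies: each interaction independently preserves the same fraction `r` of the document, so retention after `n` interactions is `r**n`. Under that assumption, the 25% average corruption over 20 interactions reported above corresponds to a loss of only about 1.4% per interaction. The function names are illustrative.

```python
def per_step_retention(total_retention: float, steps: int) -> float:
    """Constant per-step retention r such that r ** steps == total_retention.

    Assumes every step preserves the same fraction of content, independently
    of the others (an illustrative model, not a measured one).
    """
    if not 0.0 < total_retention <= 1.0:
        raise ValueError("total_retention must be in (0, 1]")
    if steps <= 0:
        raise ValueError("steps must be positive")
    return total_retention ** (1.0 / steps)


def retention_after(per_step: float, steps: int) -> float:
    """Fraction of content still intact after `steps` interactions."""
    return per_step ** steps


# 25% corrupted after 20 interactions means 75% retained overall.
r = per_step_retention(0.75, 20)
print(f"per-step retention: {r:.4f}")
print(f"retention after 20 steps: {retention_after(r, 20):.2f}")
```

The point of the arithmetic is that a per-step loss small enough to pass a local check can still add up to substantial corruption over a long trajectory.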
§B. Trajectory Instability and Reliability Degradation
τ-bench introduces pass^k to evaluate reliability across repeated executions of the same task [1]. Although the best-performing GPT-4o function-calling agent achieves above 60% average task success in the retail domain, reliability decreases substantially across repeated trials, with pass^8 dropping below 25%. Related work further shows that minor early errors can cascade through subsequent planning, retrieval, and execution stages [4]. Together, these findings suggest that agentic reliability depends not only on local task success but also on behavioral consistency across repeated executions of the same task, the failure mode this note identifies as Trajectory Instability.
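A pass^k-style metric can be estimated from repeated trials as sketched below. The sketch assumes the standard combinatorial estimator: for a task with `c` successes in `n` trials, the chance that `k` trials drawn without replacement all succeed is C(c, k) / C(n, k), averaged over tasks. Whether this matches the benchmark's implementation in every detail should be checked against [1]. The function names and the per-task counts are hypothetical.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the probability that k trials of one task all succeed."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and trials")
    # comb(successes, k) is 0 when successes < k, so the estimate drops to 0.
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (successes, trials) pairs."""
    return sum(pass_hat_k(c, n, k) for c, n in results) / len(results)


# Hypothetical results: four tasks, eight trials each.
results = [(8, 8), (6, 8), (4, 8), (1, 8)]
for k in (1, 2, 4, 8):
    print(k, round(benchmark_pass_hat_k(results, k), 4))
```

In this made-up example, pass^1 is about 0.59 while pass^8 is 0.25, because only the task that succeeds on every trial contributes at k = 8. Average success hides exactly the per-task inconsistency that pass^k is designed to expose.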
§C. Deceptive Fluency: The Asymmetry That Makes Drift Dangerous
Agentic systems can continue producing coherent and well-structured outputs even after workflow execution has begun diverging from the intended objective [4]. This phenomenon (i.e., Deceptive Fluency) occurs when linguistic coherence persists despite functional misalignment. As a result, trajectories may continue to appear operationally plausible even when intermediate planning, reasoning, or execution steps have already degraded. The asymmetry is what makes drift dangerous: local plausibility persists while workflow reliability deteriorates.
Similar patterns appear in computational workflows. A case study in scientific agentic systems shows that models can produce plausible but inaccurate numerical results and silent numerical errors during execution [5]. This class of failures is what §2 defines as Silent Incorrect Computation.
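The toy example below illustrates what such a failure can look like and how a domain-grounded check might surface it. It is a hypothetical case constructed for this note, not one drawn from the cited case study. A projectile-range function passes degrees to `math.sin`, which expects radians. The code runs without error and, for a 45-degree launch at 10 m/s, returns roughly 9.1 m instead of roughly 10.2 m, a value that looks plausible on its own. A physical invariant, that complementary launch angles give the same range, exposes the bug.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def range_buggy(speed_m_s: float, angle_deg: float) -> float:
    """Projectile range on flat ground.

    BUG (deliberate): the angle in degrees is passed straight to math.sin,
    which expects radians. No exception is raised.
    """
    return speed_m_s**2 * math.sin(2 * angle_deg) / G


def range_correct(speed_m_s: float, angle_deg: float) -> float:
    """Same formula with the angle converted to radians first."""
    return speed_m_s**2 * math.sin(2 * math.radians(angle_deg)) / G


def passes_symmetry_check(range_fn, speed_m_s: float = 10.0) -> bool:
    """Domain check: launch angles of 30 and 60 degrees must give equal range."""
    return math.isclose(
        range_fn(speed_m_s, 30.0), range_fn(speed_m_s, 60.0), rel_tol=1e-9
    )


print("buggy 45 deg:  ", round(range_buggy(10.0, 45.0), 2))
print("correct 45 deg:", round(range_correct(10.0, 45.0), 2))
print("buggy passes check:  ", passes_symmetry_check(range_buggy))
print("correct passes check:", passes_symmetry_check(range_correct))
```

Execution monitoring alone sees two successful runs. Only the invariant check distinguishes them.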
§D. Delegation and the Cost of Decomposition
Multi-agent architectures often decompose workflows across specialized roles such as planner, worker, critic, and reviewer. Theoretical analysis in [3] challenges the assumption that decomposition alone improves reliability: when agents share the same information boundary, added roles do not introduce new decision-relevant signals. Instead, each handoff requires information transfer through endogenous message passing, which can introduce communication cost or information compression. Multi-agent architectures can therefore fail to improve reliability over centralized systems operating on the same information. This is not because individual agents are weak, but because communication overhead (i.e., Delegation Loss) is a structural property of the architecture.
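A small numerical illustration of a compression cost under logarithmic loss follows. It relies on a standard information-theoretic identity rather than on the specific result of the cited analysis. If a downstream agent sees only a message M computed from the evidence X, the best achievable expected log loss for a target Y rises from H(Y | X) to H(Y | M), and the gap equals the conditional mutual information I(Y; X | M). The toy source, the message function, and all names are assumptions made for the example.

```python
from collections import defaultdict
from math import log2


def conditional_entropy(joint: dict) -> float:
    """H(Y | Z) in bits, for a joint distribution given as {(z, y): probability}."""
    marginal_z = defaultdict(float)
    for (z, _y), p in joint.items():
        marginal_z[z] += p
    return -sum(
        p * log2(p / marginal_z[z]) for (z, _y), p in joint.items() if p > 0
    )


def joint_with_target(observe) -> dict:
    """Joint distribution of (observe(x), y) for the toy source below."""
    joint = defaultdict(float)
    for x in range(4):   # X is uniform on {0, 1, 2, 3}
        y = x % 2        # target Y is the parity of X
        joint[(observe(x), y)] += 0.25
    return dict(joint)


# Centralized decision maker sees X itself.
h_central = conditional_entropy(joint_with_target(lambda x: x))
# Delegated agent sees only a one-bit message that discards the parity.
h_delegated = conditional_entropy(joint_with_target(lambda x: x // 2))
delegation_loss_bits = h_delegated - h_central

print("H(Y | X) =", abs(h_central))
print("H(Y | M) =", h_delegated)
print("loss in bits =", delegation_loss_bits)
```

Here the message keeps one bit of X but drops exactly the bit that determines Y, so the delegated agent loses a full bit relative to the centralized one. A message that preserved the parity would incur no loss, which is why the cost depends on what the handoff preserves and not on the number of agents as such.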
§5. Practical Taxonomy of Agentic Failure Modes
| Failure Mode | Primary Mechanism | Diagnostic Symptom | Ref |
|---|---|---|---|
| Recursive Context Contamination | Intermediate outputs become part of the evolving workflow context, allowing early errors to propagate through later reasoning, retrieval, planning, or execution steps | Early errors cascade through later workflow stages | [2, 4, 5] |
| Trajectory Instability | Repeated executions of the same task expose inconsistency in agent behavior across trajectories | pass^k decreases substantially as the number of repeated trials increases | [1] |
| Deceptive Fluency | Linguistic coherence persists despite functional misalignment between reasoning and execution | Outputs remain fluent, plausible, or well-structured even when workflow execution has degraded | [4] |
| Silent Incorrect Computation | Syntactically valid, executable code or pipelines yield numerically plausible but logically incorrect or physically inconsistent results without raising execution errors | Outputs execute without error while producing results whose incorrectness is not visible from surface inspection alone | [5] |
| Delegation Loss | Added agent roles reprocess shared evidence under communication constraints rather than adding new decision-relevant signals | Delegated systems fail to outperform centralized decision makers operating over the same information | [3] |
| Silent Drift | Locally plausible intermediate errors accumulate across multi-step workflows without obvious failure signals | Errors may not be detectable from isolated or final outputs alone; broader trajectory-level evaluation is needed | [4, 5] |
§6. Implications for AI System Design
- Do not equate short-horizon performance with workflow reliability. Short-horizon performance does not reliably predict long-horizon trajectory reliability [2] or behavioral consistency across repeated executions [1]. Evaluation for professional deployment should therefore include extended trajectory testing and repeated-execution reliability measurement (not only step-level or single-execution measures) to surface both trajectory degradation and Trajectory Instability before deployment.
- Treat intermediate workflow context as a high-risk reliability layer. Because early errors can cascade through later reasoning and execution steps (i.e., Recursive Context Contamination), intermediate outputs that influence subsequent workflow state should not be treated as automatically reliable. Agentic systems should verify intermediate outputs before they shape downstream reasoning, planning, or execution.
- Fluent outputs are not evidence of stable execution. Deceptive Fluency shows that agentic systems can remain linguistically coherent despite functional misalignment between reasoning and execution [4]. Linguistic coherence alone is therefore insufficient for operational trust; reliable agentic workflows require process correctness and grounded execution, not merely plausible completion.
- Agentic decomposition is not architecturally free. Delegating tasks across multiple agents introduces communication overhead and information-compression costs (i.e., Delegation Loss). Multi-agent designs should be chosen because agents genuinely access new information, preserve decision-relevant information, or provide non-redundant review, not because decomposition is assumed to improve quality by default.
- Evaluate trajectories, not only endpoints. A correct final output does not guarantee that the workflow remained reliable throughout execution. Evaluation should track planning, retrieval, reasoning, and execution across intermediate trajectory steps, with verification gates at each interaction unit.
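The recommendations above on verifying intermediate outputs and placing gates at each interaction unit can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed design. Each step is modeled as a function from the accepted context to a new output and is paired with a check that decides whether the output may enter the context. The names `run_with_gates`, `GatedRun`, `Step`, and `Check` are hypothetical, and the policy of halting on the first rejected output is one choice among several (retrying or escalating to a human are others).

```python
from dataclasses import dataclass, field
from typing import Callable

Step = Callable[[list[str]], str]   # reads accepted context, returns a new output
Check = Callable[[str], bool]       # True if the output may enter the context


@dataclass
class GatedRun:
    context: list[str] = field(default_factory=list)              # verified outputs only
    log: list[tuple[int, str, bool]] = field(default_factory=list)  # full trajectory


def run_with_gates(steps: list[tuple[Step, Check]], objective: str) -> GatedRun:
    """Run steps in order, admitting each output to the context only if it passes its gate."""
    run = GatedRun(context=[objective])
    for index, (step, check) in enumerate(steps):
        output = step(run.context)
        accepted = check(output)
        run.log.append((index, output, accepted))  # record every step, accepted or not
        if not accepted:
            break  # halt so that an unverified output cannot shape later steps
        run.context.append(output)
    return run


def is_total(output: str) -> bool:
    """Toy gate: the output must have the form 'total=<digits>'."""
    return output.startswith("total=") and output[6:].isdigit()


steps = [
    (lambda ctx: "total=42", is_total),
    (lambda ctx: "totl: forty-two", is_total),            # fluent but malformed
    (lambda ctx: "report based on " + ctx[-1], lambda out: True),
]
run = run_with_gates(steps, objective="sum the invoice")
print(run.context)
print(run.log)
```

The log keeps the whole trajectory, including the rejected step, which is what trajectory-level evaluation needs. The context keeps only verified outputs, which is what limits Recursive Context Contamination.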
§7. Open Questions
- Drift detection under deceptive fluency. If linguistically coherent outputs are insufficient evidence of trajectory integrity, what intermediate signals (planning consistency, retrieval behavior, execution feedback, or verification-gate failures) can reliably detect drift before errors compound across the workflow? Can such detection operate online during execution, or only through post-hoc trajectory analysis?
- Workflow context governance. Treating intermediate workflow context as untrusted until verified raises a structural question: what formal properties must an automated verification mechanism satisfy at each interaction unit to prevent Recursive Context Contamination without requiring continuous human review?
- Delegation thresholds. Since decomposition can introduce communication and information-compression costs, under what conditions do added agent roles provide enough new information or non-redundant review to justify those costs? And can this threshold be estimated before deployment rather than discovered through failure?
- Long-horizon evaluation standards. Pass^k reliability and trajectory-length degradation capture distinct failure dimensions: behavioral inconsistency across trials and compounding error propagation within workflows, respectively. Is there a formal relationship between these two dimensions that characterizes deployment readiness? Can minimum trajectory length and repetition depth be derived from reliability requirements rather than established empirically through failure?
- Silent Incorrect Computation in production pipelines. When agentic systems produce syntactically valid, executable outputs that are logically or physically incorrect, standard execution monitoring provides no failure signal. What verification primitives (formal, statistical, or domain-grounded) can detect Silent Incorrect Computation before outputs propagate downstream, and at what computational cost?
§8. References
[1] Yao et al., τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains, arXiv:2406.12045, 2024.
[2] Laban et al., LLMs Corrupt Your Documents When You Delegate, Microsoft Research, arXiv:2604.15597, 2026.
[3] Ao et al., On the Reliability Limits of LLM-Based Multi-Agent Planning, arXiv:2603.26993, 2026.
[4] Sinha et al., Beyond Fluency: Toward Reliable Trajectories in Agentic Information Retrieval, arXiv:2604.04269, 2026.
[5] Rawat et al., Plausible but Wrong: A Case Study on Agentic Failures in Astrophysical Workflows, arXiv:2604.25345, 2026.