The Tool Trust Problem: Why External Calls Do Not Automatically Ground Agentic Systems
Abstract. Note 007 showed that agentic systems drift across trajectories: intermediate states, retrieved context, tool calls, and generated outputs become part of the workflow itself. Note 008 showed that internal verification does not automatically stop this drift when the verifier shares the same information boundary as the system it evaluates. A natural response is to give agents tools: APIs, databases, search engines, calculators, code execution environments, monitoring systems, or enterprise applications. The assumption is simple: if the model can call an external system, it becomes grounded. The evidence is less reassuring. Recent evaluations show that tool use adds its own layer of reliability risk. ToolSandbox shows that realistic tool use is stateful, conversational, and interactive, and that state dependency, canonicalization, and insufficient-information tasks remain challenging even for strong models 1. In infrastructure verification, erroneous tool responses reduce a baseline ReAct agent to 27.3% task success rate, while a verification-aware system recovers performance to 50.0% 2. Edits to tool descriptions alone can shift tool usage by more than 10x, without changing the underlying tool 4. Work on multi-turn agents identifies temporal blindness: without time information, most models align only slightly above chance with human tool-use preferences, and even timestamp augmentation peaks around 65% alignment 5. Step-level evaluation further shows that end-to-end checks can mask intermediate failures: DAG-based dependency modeling adds 22 percentage points to failure-detection recall and 34 percentage points to root-cause accuracy over flat step-level evaluation 3. This note introduces the Tool Trust Problem: external calls introduce signals, but they do not automatically introduce grounding.
A tool call grounds an agent only if the selected tool is appropriate, the arguments are correct, the tool interface is reliable, the data is fresh, the output is valid, and downstream steps preserve the signal. Otherwise, tool outputs become trajectory events: they can introduce error into the workflow, shape downstream steps, and propagate through the agent's execution path. The most dangerous tool failure is not a failed API call. It is a wrong tool signal that enters the workflow as trusted evidence.
§1. The Question
Agentic systems increasingly rely on tools to act in the world. A model retrieves documents, queries an API, calls a database, executes code, invokes a calculator, or inspects system telemetry. These external calls appear to solve the grounding problem: instead of relying only on language-model priors, the agent can consult an external source.
But does tool access actually ground the system?
A tool call is not evidence by itself. It is a structured interaction between an agent, a tool description, an argument schema, an execution environment, a returned output, and a downstream reasoning step. If any part of that chain fails, the system may not become more grounded. It may become more confidently wrong.
The central question is therefore:
When an agent calls an external tool, does the call introduce reliable grounding, or does it create a new failure surface inside the trajectory?
This question matters because tool outputs do not remain isolated. Once returned, they become part of the agent's workflow state. They can influence planning, retrieval, synthesis, verification, and future tool calls. A malformed argument, stale observation, incorrect tool response, or fragile tool-selection decision can therefore become, in the terminology of Note 007, a source of Silent Drift rather than a correction to it.
§2. Scope and Definitions
Tool Trust Problem: A term introduced in this note for the structural condition in which agentic systems treat external tool calls as grounding signals even though tool selection, argument generation, execution, output validity, temporal freshness, and downstream interpretation may each fail.
Tool Signal: A term introduced in this note for any output, observation, exception, retrieval result, database response, API return value, code result, or execution trace produced by an external tool and incorporated into an agentic workflow.
Tool-Layer Failure: A term introduced in this note for failures that occur between the agent's decision to use a tool and the downstream use of the returned Tool Signal. These include wrong tool selection, malformed arguments, schema mismatch, execution failure, incorrect output, stale data, or misinterpretation of the tool result.
Schema Fragility: A term introduced in this note for the sensitivity of tool selection or tool execution to natural-language descriptions, argument schemas, tool metadata, or interface specifications. Schema Fragility is present when small changes in tool descriptions or specifications alter tool-use behavior without changing the underlying tool functionality.
Temporal Validity Failure: A term introduced in this note for a failure mode in which an agent treats a previously valid Tool Signal as still valid after the external world has changed, or redundantly re-calls a tool when the prior signal remains valid. Temporal Validity Failure reflects misalignment between the agent's static context and the time-sensitive environment in which tools operate.
Tool Redundancy: A term introduced in this note for a tool call, verification step, or environmental interaction that fails to expand the system's Information Boundary because it relies on the same data source, stale context, schema, prior tool output, or failure surface as the signal it is meant to ground.
Silent Tool Propagation: A term introduced in this note for the downstream propagation of incorrect, stale, malformed, or misinterpreted Tool Signals through later reasoning, planning, verification, or synthesis steps without an explicit failure signal.
Grounding Illusion: A term introduced in this note for the appearance of reliability created when an agent invokes an external tool, even though the tool call has not established that the selected tool was appropriate, the arguments were correct, the output was valid, or the returned signal was preserved correctly through the trajectory.
§3. Key Findings
Tool use is trajectory-level, not call-level. Lu et al. show that realistic tool use is stateful, conversational, and interactive, involving stateful tool execution, implicit dependencies between tools, user simulation, and dynamic evaluation over intermediate and final milestones 1. This matters because a tool call is not a detached lookup. It is a trajectory event. The agent must decide when to call the tool, which tool to call, how to form arguments, how to interpret the response, and how to update future workflow state. The Tool Trust Problem begins when systems treat this multi-step interaction as if it were a single grounding operation.
Erroneous tool outputs can sharply degrade agent reliability. Abuzakuk et al. show that existing agentic systems often implicitly assume that invoked tools return correct outputs, making them vulnerable to erroneous tool responses 2. In their infrastructure-verification setting, erroneous tool responses reduce a baseline ReAct agent to 27.3% task accuracy, while RIVA recovers performance to 50.0% through cross-validation, multi-perspective verification, and tool-call history tracking 2. The implication is direct: external calls do not eliminate uncertainty. They relocate uncertainty into the tool layer.
End-to-end success can hide intermediate workflow failures. Guo et al. show that agentic workflows require step-level evaluation because end-to-end outcome checks and ad-hoc trace inspection can mask intermediate failures 3. Their DAG-based dependency modeling adds 22 percentage points to failure-detection recall and 34 percentage points to root-cause accuracy over flat step-level evaluation with the same judges and rubrics 3. This supports a central mechanism of Note 009: Silent Tool Propagation may not be visible from final outputs alone when tool-derived signals shape downstream workflow steps. A workflow can appear successful while relying on degraded or misinterpreted Tool Signals.
Tool selection is fragile because tool descriptions are part of the control surface. Faghih et al. show that LLMs rely entirely on tool descriptions to decide which tools to use, and that editing tool descriptions alone can drastically increase tool usage without changing the underlying tool functionality 4. In controlled experiments, properly edited descriptions receive over 10 times more usage from GPT-4.1 and Qwen2.5-7B than original descriptions 4. This is Schema Fragility: the tool interface is not only a technical API. It is also a natural-language control surface.
Tool-derived context can become stale. Cheng et al. identify temporal blindness in multi-turn LLM agents: models operate with a stationary context and fail to account for real-world time elapsed between messages or actions 5. This leads agents either to over-rely on stale context and skip needed tool calls or to under-rely on valid context and repeat unnecessary calls 5. In the no-timestamp setting, model decisions remain close to chance-level alignment with human tool-use preferences, with the best result only slightly above 60% 5. Adding timestamps improves alignment only modestly, with the strongest model reaching roughly 65% 5. This is Temporal Validity Failure: a Tool Signal can be correct when produced and wrong when reused.
Tool-use errors have structure, not just failure rates. Kokane et al. argue that existing tool-use benchmarks often report success rates without explaining failure cases 6. ToolScan introduces diagnostic analysis for seven common error patterns in tool-use LLMs; it shows that prominent models exhibit distinct underlying tool-use errors even when aggregate performance appears similar 6. This matters because Tool-Layer Failure is not a generic category. It can be diagnosed, localized, and mitigated only if evaluation exposes where the tool-use chain failed.
§4. Technical Deep Dive: The Architecture of Tool Trust
§A. Tool Calls Are Trajectory Events
Tool calls are often treated as grounding events: the agent reaches outside the model, obtains a result, and becomes more reliable. This view is incomplete.
In an agentic system, a tool call changes the trajectory. It may update workflow state, constrain future reasoning, trigger another tool call, support a final answer, or become evidence for a verification step. ToolSandbox makes this explicit by evaluating tool use as stateful, conversational, and interactive, with implicit state dependencies and dynamic evaluation across intermediate and final milestones 1.
This means that a Tool Signal is not merely an output. It is a state transition. Once incorporated, it can shape future planning, retrieval, synthesis, and verification. Tool use therefore belongs to trajectory reliability, not only to function-calling accuracy.
The design implication is simple: do not ask only whether the tool was called. Ask whether the call improved the trajectory.
§B. External Does Not Mean Trusted
Externality is often mistaken for grounding, but in an agentic system externality and reliability are not the same property. A database, API, telemetry system, or search engine sits outside the language model, so its output appears more reliable than generated text. However, the agent still has to select the right tool, generate valid arguments, interpret the returned output, and decide how much to trust it.
RIVA exposes this failure mode directly. Abuzakuk et al. show that agents can fail when erroneous tool responses are indistinguishable from genuine infrastructure anomalies 2. In configuration-drift detection, an anomalous tool output may reflect a real cloud misconfiguration, or it may reflect a broken, stale, or inconsistent tool response. A baseline ReAct agent incorporating bad tool outputs directly into its reasoning chain collapses to 27.3% accuracy under erroneous tool responses 2.
The same structural limitation identified in Note 008 reappears at the tool layer. A verifier tool that queries the same data source, schema, API endpoint, stale context, or failure surface as the tool it is meant to validate does not provide independent grounding. It creates Tool Redundancy: additional tool procedure without a non-redundant signal.
This is the Tool Trust Problem: the agent does not simply need external signals. It needs a way to evaluate whether those signals should be trusted.
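The distinction between an independent cross-check and Tool Redundancy can be sketched in a few lines. This is a minimal illustration, not RIVA's method: the `ToolSignal` record, the `source` field, the `cross_check` function, and the tool names are all hypothetical, and the sketch assumes each signal can be labeled with the data source it reads from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSignal:
    tool: str    # hypothetical tool name
    source: str  # assumed label for the data source the tool reads from
    value: str   # the observation the tool returned


def cross_check(primary: ToolSignal, checks: list[ToolSignal]) -> str:
    """Classify a primary signal as 'corroborated', 'conflict', or 'unverified'.

    Only checks that read from a different data source count as evidence.
    A check that shares the primary's source is treated as Tool Redundancy
    and ignored.
    """
    independent = [c for c in checks if c.source != primary.source]
    if not independent:
        return "unverified"
    if all(c.value == primary.value for c in independent):
        return "corroborated"
    return "conflict"


primary = ToolSignal("list_firewall_rules", "cloud_api", "port 22 open")
same_source = ToolSignal("describe_firewall", "cloud_api", "port 22 open")
other_source = ToolSignal("probe_port", "network_scan", "port 22 closed")

print(cross_check(primary, [same_source]))                # unverified
print(cross_check(primary, [same_source, other_source]))  # conflict
```

The first call agrees with the primary signal yet still returns `unverified`, because agreement from the same source adds tool procedure without adding a non-redundant signal.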
§C. Schema Fragility Begins Before Execution
Tool failure does not begin when an API returns the wrong output. It begins earlier, when the model decides which tool to call and how to call it.
Tool descriptions, names, and argument schemas are part of the agent's control surface. Faghih et al. show that LLMs decide whether and which tools to invoke based entirely on natural-language descriptions, and that small description edits can shift tool usage dramatically 4. In their experiments, edited descriptions receive more than 10 times the usage of original descriptions from GPT-4.1 and Qwen2.5-7B, without changing the tool's underlying function 4.
This is Schema Fragility. A tool interface is not just software infrastructure. It is also prompt infrastructure. If tool descriptions can steer selection behavior without changing capability, then tool choice is not purely a rational match between task and function. It is partly mediated by language, framing, and interface specification.
A system can therefore fail before the tool is ever executed.
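One way to probe Schema Fragility is to hold the task set and the tools fixed, edit only one tool's description, and compare how often that tool is selected. The harness below is a sketch under stated assumptions: `selection_rate` and `description_sensitivity` are hypothetical helpers, and `toy_select` is a word-overlap stand-in for an LLM selector so that the example runs on its own. It does not reproduce the protocol of Faghih et al.

```python
from typing import Callable

Tool = dict[str, str]                        # {"name": ..., "description": ...}
Selector = Callable[[str, list[Tool]], str]  # (task, tools) -> chosen tool name


def selection_rate(select: Selector, tasks: list[str],
                   tools: list[Tool], target: str) -> float:
    """Fraction of tasks for which `select` picks the tool named `target`."""
    picks = [select(task, tools) for task in tasks]
    return picks.count(target) / len(tasks)


def description_sensitivity(select: Selector, tasks: list[str], tools: list[Tool],
                            target: str, edited_description: str) -> tuple[float, float]:
    """Selection rate for `target` before and after editing only its description."""
    edited = [
        {**t, "description": edited_description} if t["name"] == target else t
        for t in tools
    ]
    return (selection_rate(select, tasks, tools, target),
            selection_rate(select, tasks, edited, target))


def toy_select(task: str, tools: list[Tool]) -> str:
    """Stand-in for an LLM: pick the tool whose description shares the most words with the task."""
    words = set(task.lower().split())
    best = max(tools, key=lambda t: len(words & set(t["description"].lower().split())))
    return best["name"]


tools = [
    {"name": "search_a", "description": "search the web"},
    {"name": "search_b", "description": "look things up"},
]
tasks = ["find the latest weather report", "search for the latest news"]

before, after = description_sensitivity(
    toy_select, tasks, tools, "search_b",
    "find and search for the latest news weather report",
)
print(before, after)  # 0.0 1.0
```

In a real registry test, `toy_select` would be replaced by a call to the deployed model, and a large gap between the two rates would flag a description that steers selection independently of capability.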
§D. Tool Signals Expire
A Tool Signal is not permanently valid. Search results, market data, system telemetry, weather, inventory, permissions, infrastructure state, and user context can all change. The agent must decide whether a prior observation is still valid or whether a fresh tool call is needed.
Cheng et al. show that multi-turn LLM agents are temporally blind: they often treat dialogue context as stationary even when the external world has changed 5. This produces two symmetric failures. The agent may over-rely on stale context and skip a necessary tool call. Or it may under-rely on still-valid context and repeat an unnecessary tool call, adding latency and cost 5.
This is Temporal Validity Failure. It shows that freshness is not a property of the tool alone. It is a property of the relationship between the Tool Signal, the environment, and the time at which the signal is reused.
Grounding decays.
A system that does not know when its context has expired is not grounded. It is frozen.
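Making expiry explicit can be as simple as storing when a signal was observed and how long it is assumed to stay valid. The sketch below is illustrative only: `TimedSignal`, `ttl`, and `should_recall` are hypothetical names, and the validity horizon is an assumption the system designer must supply per data type, not something the tool reports.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class TimedSignal:
    value: str
    observed_at: datetime  # when the tool produced the observation
    ttl: timedelta         # assumed validity horizon for this kind of data

    def is_fresh(self, now: datetime) -> bool:
        """True while the elapsed time since observation is within the horizon."""
        return now - self.observed_at <= self.ttl


def should_recall(signal: Optional[TimedSignal], now: datetime) -> bool:
    """Call the tool again only if there is no prior signal or it has expired."""
    return signal is None or not signal.is_fresh(now)


t0 = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
price = TimedSignal("101.5", observed_at=t0, ttl=timedelta(minutes=1))

print(should_recall(price, t0 + timedelta(seconds=30)))  # False: reuse, avoid a redundant call
print(should_recall(price, t0 + timedelta(minutes=5)))   # True: stale, call again
print(should_recall(None, t0))                           # True: nothing to reuse
```

The two `False`/`True` cases correspond to the two symmetric failures described above: repeating a call that is still valid, and reusing an observation that has expired.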
§E. Silent Tool Propagation
Tool-layer errors are dangerous because they propagate silently. A malformed argument can return a plausible but wrong result. A stale observation can be treated as current. A fragile tool description can cause the wrong tool to be selected. A failed execution can be misread as domain evidence. Once accepted, these signals become part of the workflow state.
Guo et al. formalize agent executions as DAGs because agent workflows contain structured dependencies in which errors can propagate and compound through downstream steps 3. Their results show that DAG-based dependency modeling substantially improves failure detection and root-cause attribution over flat evaluation 3. This is the evaluation counterpart of Silent Tool Propagation: without dependency-aware evaluation, tool-layer failures can disappear into the final answer.
A tool failure that raises an exception is at least visible. A tool failure that returns a plausible answer is more dangerous. In multi-step workflows, a corrupted Tool Signal can also trigger downstream tool calls that amplify rather than correct the original error: a cascade in which one tool-layer failure becomes the input condition for the next.
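The reach of a single bad Tool Signal can be estimated by walking the workflow's dependency graph from the step that produced it. The following is a minimal sketch, not the AgentEval method: the `downstream_of` helper and the example workflow are hypothetical, and the graph is assumed to be an adjacency map from each step to the steps that consume its output.

```python
from collections import deque


def downstream_of(edges: dict[str, list[str]], start: str) -> set[str]:
    """All steps reachable from `start` in a workflow DAG, excluding `start` itself."""
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        step = queue.popleft()
        for nxt in edges.get(step, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical workflow: each key lists the steps that consume its output.
workflow = {
    "query_inventory": ["plan_order"],
    "fetch_prices": ["plan_order", "report_prices"],
    "plan_order": ["place_order"],
    "place_order": ["final_summary"],
    "report_prices": [],
}

# If query_inventory returned a plausible but wrong result, these steps inherit it.
print(sorted(downstream_of(workflow, "query_inventory")))
# ['final_summary', 'place_order', 'plan_order']
```

Note that `report_prices` is untouched by the bad inventory signal, which is exactly the kind of localization a flat, end-to-end check cannot provide.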
§F. Tool Errors Require Diagnosis, Not Just Scoring
A single success rate is not enough to understand tool reliability. Two systems can have similar final performance while failing for different reasons: wrong tool selection, wrong argument generation, invalid output format, redundant calls, missing calls, execution misinterpretation, or incomplete task decomposition.
ToolScan addresses this by characterizing tool-use errors rather than reporting only aggregate success rates 6. Kokane et al. identify seven common error patterns in tool-use LLMs and argue that benchmarks should provide diagnostic feedback on tool-use behavior 6.
This matters for system design. If tool failures have structure, then mitigation must also be structured. Wrong-tool failures require different interventions than stale-context failures. Schema Fragility requires different controls than erroneous-output validation. Silent Tool Propagation requires trajectory-level monitoring, not only better function-calling prompts.
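A small tally makes the point concrete: two systems can share a success rate while failing in different ways. The sketch below assumes each trace has already been labeled with a failure category or `None`; the `summarize` helper and the labels are hypothetical and are not ToolScan's seven error patterns.

```python
from collections import Counter


def summarize(traces: list[dict]) -> tuple[float, Counter]:
    """Return (success rate, count of each failure label) for labeled traces."""
    failures = Counter(t["failure"] for t in traces if t["failure"] is not None)
    success_rate = sum(t["failure"] is None for t in traces) / len(traces)
    return success_rate, failures


# Hypothetical labeled traces; "failure" is None when the task succeeded.
system_a = [{"failure": None}, {"failure": None},
            {"failure": "wrong_tool"}, {"failure": "wrong_tool"}]
system_b = [{"failure": None}, {"failure": None},
            {"failure": "stale_context"}, {"failure": "bad_arguments"}]

rate_a, failures_a = summarize(system_a)
rate_b, failures_b = summarize(system_b)

print(rate_a, rate_b)             # 0.5 0.5
print(failures_a == failures_b)   # False
```

Identical scores with different failure profiles call for different interventions, which is why the aggregate number alone cannot guide mitigation.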
§G. The Conditions for Genuine Grounding
Grounding becomes meaningful only when tool interaction reduces Tool Redundancy: the tool must introduce new evidence, causal feedback, state change, temporal freshness, or constraints unavailable to the model's internal generation process. The relevant question is not whether the system has tool access, a function registry, or an execution loop. It is whether the tool interaction changes the Information Boundary or applies a non-redundant environmental constraint 1, 2.
The evidence above shows why this condition is hard to satisfy. Erroneous tool responses can be indistinguishable from real environmental anomalies 2. Tool-description edits can shift selection behavior without changing tool functionality 4. Tool-derived context can expire 5. Downstream dependencies can propagate intermediate failures beyond what end-to-end evaluation reveals 3.
The unresolved gap is architectural: current systems can often produce tool execution, but they do not yet reliably provide epistemically independent, trajectory-safe grounding.
Tool access does not relocate the grounding problem to a more reliable domain. It distributes the grounding problem across a new failure surface: external to the model, but internal to the trajectory. Note 007 showed that agents can drift while appearing to execute correctly. Note 008 showed that internal verification does not automatically stop that drift. Note 009 shows that tools designed to ground agents can themselves become sources of Silent Drift. The common thread is structural: generated outputs, verification steps, and tool calls become dangerous when treated as automatically trustworthy. Grounding is not tool access. Grounding is trajectory-safe signal preservation.
§5. Practical Taxonomy of Tool-Layer Failure Modes
| Failure Mode | Primary Mechanism | Diagnostic Symptom | Ref |
|---|---|---|---|
| Grounding Illusion | Treating external tool access as evidence of grounding without validating tool selection, argument validity, output correctness, temporal freshness, or downstream interpretation. | Tool workflows increase apparent rigor while independent grounding and trajectory reliability remain unproven. | 1–5 |
| Tool Redundancy | A tool call reuses the same data source, stale context, prior output, or failure surface as the signal it is meant to ground. | Additional tool calls increase execution surface without adding an independent grounding signal. | 2, 5 |
| Erroneous Tool Signal | A tool returns incorrect or misleading output that the agent cannot distinguish from a genuine environmental observation. | The agent incorporates bad tool output into reasoning, causing missed detections, false alarms, or incorrect decisions. | 2 |
| Schema Fragility | Tool selection depends on natural-language descriptions, metadata, or argument schemas that can steer model behavior independently of tool functionality. | Small edits to tool descriptions substantially change tool usage patterns. | 4 |
| Temporal Validity Failure | The agent fails to account for elapsed real-world time when deciding whether a previous Tool Signal remains valid. | The agent over-relies on stale context or redundantly repeats tool calls. | 5 |
| Silent Tool Propagation | Incorrect, stale, malformed, or misinterpreted Tool Signals enter downstream reasoning without explicit failure signals. | Final outputs appear plausible while intermediate tool-layer errors drive downstream decisions. | 2, 3, 6 |
| Undiagnosed Tool-Use Error | Aggregate success metrics hide distinct failure patterns such as wrong tool selection, malformed arguments, invalid format, redundant calls, missing calls, or incomplete execution. | Similar success rates mask different underlying error mechanisms. | 6 |
§6. Implications for AI System Design
Do not equate tool access with grounding. A tool call is not evidence unless the system establishes that the selected tool is appropriate, the arguments are valid, the output is trustworthy and temporally fresh, and the signal is preserved correctly downstream. Otherwise, tool access creates a Grounding Illusion: the appearance of reliability without verified grounding 1–5.
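The conditions in the paragraph above can be carried as explicit state rather than assumed. The following is a sketch only: the `GroundingChecks` record and its field names are hypothetical, and how each flag gets set (cross-checks, schema validation, freshness tracking, trace evaluation) is left to the surrounding system.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GroundingChecks:
    """One flag per condition; a Tool Signal counts as grounded only if all hold."""
    tool_appropriate: bool
    arguments_valid: bool
    output_validated: bool
    temporally_fresh: bool
    preserved_downstream: bool

    def unmet(self) -> list[str]:
        """Names of the conditions that have not been established."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def is_grounded(self) -> bool:
        return not self.unmet()


# A call that executed cleanly but whose output and freshness were never checked.
checks = GroundingChecks(
    tool_appropriate=True,
    arguments_valid=True,
    output_validated=False,
    temporally_fresh=False,
    preserved_downstream=True,
)

print(checks.is_grounded())  # False
print(checks.unmet())        # ['output_validated', 'temporally_fresh']
```

Defaulting to "not grounded" until every condition is established is the design choice that counters the Grounding Illusion: a successful call alone sets none of these flags.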
Treat tool outputs as untrusted until validated. External does not mean reliable. RIVA shows that erroneous tool responses can sharply degrade agent performance when agents cannot distinguish broken tool outputs from real environmental anomalies 2. Tool Signals should therefore be validated through cross-checks, tool-call history tracking, independent constraints, or dependency-aware evaluation before they shape downstream decisions 2, 3.
Govern tool descriptions as control surfaces. Tool descriptions and schemas are not documentation only. They influence behavior. Schema Fragility means that natural-language tool metadata can alter which tools agents select, even when underlying functionality is unchanged 4. Tool registries should therefore be versioned, reviewed, tested, and monitored as part of the reliability surface.
Track temporal validity explicitly. A Tool Signal has a time horizon. The system must know when an observation was produced, how quickly the underlying environment changes, and whether reuse is still valid. Temporal Validity Failure should be treated as a first-class reliability risk in any workflow involving dynamic data. Freshness should be explicit metadata, not an assumption carried silently in context.
Evaluate tool use at the trajectory level. Tool failures propagate through dependencies. End-to-end evaluation can miss intermediate errors that drive downstream reasoning. Agentic tool evaluation should therefore track tool selection, argument generation, execution, output interpretation, and propagation across the workflow DAG 3.
Diagnose failure modes, not only success rates. A final success rate does not explain whether the system failed because it chose the wrong tool, formed bad arguments, trusted stale data, misread the output, or propagated a bad signal. ToolScan shows that tool-use errors have recurring patterns 6. Reliable systems require diagnostic evaluation, not only aggregate scoring.
These implications target different points in the tool-use chain: selection, argument generation, execution, output interpretation, freshness, and propagation. But the deeper architectural problem remains: localized controls can reduce Tool-Layer Failure without necessarily expanding the system's Information Boundary. The goal is not more tool procedure. The goal is trajectory-safe grounding.
§7. Open Questions
Can a system know when it should not trust its own tools? If a Tool Signal is not automatically evidence, what structural conditions are sufficient to establish trust? Current mechanisms address parts of the problem: cross-validation and tool-call history tracking 2, dependency-aware evaluation 3, schema robustness testing 4, and temporal freshness 5. The unresolved architectural question is whether tool trust can be formalized through mechanisms that genuinely alter the system's Information Boundary, rather than through redundant validation loops that share the same tool-layer failure surface.
How should agents reason about stale tool outputs? Cheng et al. show that agents struggle to align tool-use decisions with elapsed time 5. What metadata, memory structure, or post-training alignment is required for agents to decide when prior context remains valid and when fresh tool calls are necessary?
Can Schema Fragility be measured before deployment? If natural-language tool descriptions can shift tool usage by more than 10x 4, tool registries require pre-deployment robustness testing. What perturbation tests can identify fragile tool descriptions, ambiguous schemas, or metadata that cause inappropriate tool selection?
What is the right evaluation standard for tool-grounded agents? Should tool-grounded agents be evaluated by final task success, tool-selection accuracy, argument validity, execution correctness, output interpretation, propagation risk, or root-cause localization? A reliable standard must measure not only whether the agent called a tool, but whether the tool call improved the trajectory. This requires evaluating tool use as a causal part of the workflow, not as an isolated function-calling event.
How can Silent Tool Propagation be detected online? DAG-based evaluation can expose propagation during trace evaluation 3, but production systems also need online detection. What runtime signals can identify when a Tool Signal has become a source of downstream error before the workflow completes?
§8. References
1. Lu et al., ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities, NAACL 2025.
2. Abuzakuk et al., RIVA: Leveraging LLM Agents for Reliable Configuration Drift Detection, arXiv:2603.02345, 2026.
3. Guo et al., AgentEval: DAG-Structured Step-Level Evaluation for Agentic Workflows with Error Propagation Tracking, arXiv:2604.23581, 2026.
4. Faghih et al., Tool Preferences in Agentic LLMs are Unreliable, EMNLP 2025.
5. Cheng et al., Temporal Blindness in Multi-Turn LLM Agents: Misaligned Tool Use vs. Human Time Perception, arXiv:2510.23853, 2026.
6. Kokane et al., ToolScan: A Benchmark for Characterizing Errors in Tool-Use LLMs, arXiv:2411.13547, 2025.
This note synthesizes findings from recent research on tool use and grounding failure in agentic AI systems. The interpretations presented reflect the author's reading of the current literature.
© 2026 WhyAI Technologies, Inc.