
May 5, 2026 · 2 min read

The "Good Enough" Fallacy in Professional AI

I have spent the last year building AI for financial research. What I found did not match the confidence many organizations take from surface-level benchmarks and demos.

Over the past six weeks, I published six technical notes, each grounded in recent peer-reviewed research and focused on a specific structural failure mode in knowledge-intensive AI systems. The findings were consistent enough that I think they are worth summarizing for anyone deploying AI in professional settings today.

The short version: these systems can appear to succeed while failing at the level of evidence and reasoning.

Citations do not guarantee that a model used the cited source. In controlled studies, citation correctness and citation faithfulness can diverge. A citation can be correct, meaning it points to a genuinely relevant source, and still be unfaithful: the answer was not derived from that passage but generated from parametric memory and matched to a source after the fact.
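To make the distinction concrete, here is a minimal sketch of one possible probe: remove the cited passage and check whether the answer changes. Everything in it is an illustrative assumption, not the method used in the studies referred to above: the `AnswerFn` interface, the substring-based `supports` check, the two stub systems, and the made-up sample passages.

```python
from typing import Callable, Sequence

# Hypothetical interface: a system that answers a question given passages.
AnswerFn = Callable[[str, Sequence[str]], str]


def supports(passage: str, answer: str) -> bool:
    """Toy correctness check: the answer string literally appears in the passage."""
    return answer.lower() in passage.lower()


def citation_report(question: str, passages: Sequence[str],
                    cited_idx: int, answer_fn: AnswerFn) -> dict:
    """Report citation correctness and an ablation-based faithfulness proxy."""
    answer = answer_fn(question, passages)
    correct = supports(passages[cited_idx], answer)
    # Ablation probe: drop the cited passage and ask again.
    ablated = [p for i, p in enumerate(passages) if i != cited_idx]
    answer_without = answer_fn(question, ablated)
    return {
        "answer": answer,
        "citation_correct": correct,
        "answer_depends_on_citation": answer_without != answer,
    }


def memorizing_system(question: str, passages: Sequence[str]) -> str:
    """Stub that ignores its context, standing in for parametric memory."""
    return "4.2%"


def reading_system(question: str, passages: Sequence[str]) -> str:
    """Stub that actually reads the passages to find the answer."""
    for p in passages:
        if "rate" in p:
            return p.split()[-1]
    return "unknown"


# Made-up sample data for illustration only.
passages = ["The policy rate is 4.2%", "Weather note: sunny"]
question = "What is the policy rate?"

memo = citation_report(question, passages, 0, memorizing_system)
read = citation_report(question, passages, 0, reading_system)
print(memo)
print(read)
```

Both stubs return the same answer with a correct citation, but only the reading stub's answer changes when the cited passage is removed. The ablation result is a proxy at best: a system with redundant evidence in context could be faithful and still pass through the ablation unchanged.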

Passing recall benchmarks does not establish reasoning capability. Controlled evaluations show that models can reliably locate information in long contexts while still failing on tasks that require multi-hop tracing and evidence aggregation. The two abilities are measured differently and do not reliably co-occur.

Models do not reliably defer to fresh evidence when it conflicts with what they encoded during training. Empirical studies show this resistance is strongest precisely for the facts that change most often, which are the cases where retrieval matters most.

Standard retrieval architectures compound this further. Time-agnostic vector representations cannot distinguish between chronologically exclusive versions of the same fact. Common tokenizers fragment calendar dates into meaningless sub-tokens, creating a structural barrier to temporal reasoning before retrieval even begins.
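The point about time-agnostic representations can be illustrated with a toy example. The sketch below uses a bag-of-words vector as a stand-in for a dense embedding, which is a deliberate simplification; the documents, the `valid_from` field, and the `latest_as_of` helper are hypothetical and the figures are invented. Two versions of the same fact score identically against the query, and only explicit date metadata separates them.

```python
import math
from collections import Counter
from datetime import date


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (no notion of time)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


# Invented sample data: two chronologically exclusive versions of one fact.
docs = [
    {"text": "The facility leverage covenant is 4.0x", "valid_from": date(2023, 1, 1)},
    {"text": "The facility leverage covenant is 3.5x", "valid_from": date(2025, 1, 1)},
]
query = "What is the facility leverage covenant"

# Similarity alone cannot tell the two versions apart.
scores = [cosine(embed(query), embed(d["text"])) for d in docs]


def latest_as_of(candidates: list, as_of: date):
    """Pick the most recent version that was already valid on `as_of`."""
    eligible = [d for d in candidates if d["valid_from"] <= as_of]
    return max(eligible, key=lambda d: d["valid_from"]) if eligible else None


print(scores)
print(latest_as_of(docs, date(2024, 6, 1))["text"])
print(latest_as_of(docs, date(2025, 6, 1))["text"])
```

Real embedding models will not produce exactly tied scores, but the design point carries over: if validity dates are not stored alongside the vectors and applied as a filter or tie-breaker, the ranking has no principled way to prefer the version that was in force on the date the question is about.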

Retrieval is not always the bottleneck people assume. In multi-hop settings, systems can fail even when relevant evidence is already present in the retrieved context. The problem is often reasoning over the evidence, not merely retrieving it.
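One way to separate the two failure sources in an evaluation is to report answer accuracy conditioned on the evidence having been retrieved. The following is a minimal sketch under assumed names: the `Record` fields and the `breakdown` function are hypothetical, and the four records are invented to show the arithmetic, not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    gold_evidence: frozenset  # passage IDs needed to answer (assumed labeled)
    retrieved: frozenset      # passage IDs the retriever returned
    answer_correct: bool      # whether the final answer was judged correct


def breakdown(records: list) -> dict:
    """Split overall accuracy into retrieval coverage and reasoning accuracy."""
    if not records:
        raise ValueError("need at least one record")
    # A record is "covered" when every gold passage is in the retrieved set.
    covered = [r for r in records if r.gold_evidence <= r.retrieved]
    acc_given_evidence = (
        sum(r.answer_correct for r in covered) / len(covered)
        if covered else float("nan")
    )
    return {
        "overall_accuracy": sum(r.answer_correct for r in records) / len(records),
        "retrieval_coverage": len(covered) / len(records),
        "accuracy_given_evidence": acc_given_evidence,
    }


# Invented records for illustration only.
records = [
    Record(frozenset({"a", "b"}), frozenset({"a", "b", "c"}), True),
    Record(frozenset({"d", "e"}), frozenset({"d", "e"}), False),
    Record(frozenset({"f", "g"}), frozenset({"f", "g", "h"}), False),
    Record(frozenset({"i", "j"}), frozenset({"i"}), False),
]
stats = breakdown(records)
print(stats)
```

In this invented sample, retrieval covers three of four questions, yet only one of those three is answered correctly. A low `accuracy_given_evidence` alongside high `retrieval_coverage` points at reasoning over the evidence rather than at the retriever.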

You cannot reliably tell whether a system worked by looking only at the final output. Answer-level metrics can miss whether a response was faithfully grounded, produced from parametric memory, or simply correct for the wrong reasons.

None of these are universal failure rates. Each finding is specific to particular benchmarks, models, and experimental setups. I am not claiming that every AI system in professional settings fails this way or this often.

I am claiming something more precise: these failures are not random edge cases. They reflect recurring structural weaknesses, and the standard tools many teams use to evaluate AI systems often cannot detect them.

This matters in high-stakes settings. A wrong answer on a benchmark costs nothing. A wrong figure in a CIM, a misread covenant, or a missed regulatory update can carry real consequences. The gap between a system that works and one that merely appears to work is not an academic distinction.

Organizations that deploy AI most reliably are not those with the most confidence in their benchmarks, but those that understand what those benchmarks cannot reveal.