
The agentic AI market has a verification problem

Joel Webber

Cofounder & Engineering

Last updated: 05/26/2026

On June 16, we’re hosting our summer release event where we’ll show what verifiable, autonomous agents look like in practice—RSVP here.

Most “agentic AI” being sold right now is a co-pilot in a costume.

I don’t say that to be cynical. I say it because the word “agent” is being applied to so many different things this year that it’s stopped meaning anything, and that makes it hard to evaluate what you’re actually buying.

Gartner expects more than 40% of agentic AI projects to be canceled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls. The technology isn’t the issue. The issue is that teams committed to “agentic” before anyone agreed on what the word required, and discovered eighteen months later that what they bought couldn’t carry the weight of the job.

So before you write another line of your AI roadmap, I want to offer the test I run on every vendor.

A two-part test for whether something is actually an agent

Ask the vendor: Can the AI finish the job without me in the loop, and can I verify what it did when I check?

If the answer to either half is no, it’s not an agent. It’s a faster co-pilot. Co-pilots are useful, but they aren’t the category shift the marketing implies, and they aren’t going to move your ROI conversation past “we’re saving some time.”

The reason the test is two parts and not one is that the products that do clear the first half usually fail the second. The second half is where the value lives.

The “without me in the loop” half

This is the one most products fail outright. A lot of what’s being sold as agentic still requires the human to ask the right question, interpret the answer, and decide what to do next. The AI is doing more of the work than it used to, but it isn’t doing the job.

A real agent owns an outcome, not a task. That distinction matters. A task is “summarize this session.” An outcome is “tell me which of my A/B tests are losing and why.” The first one ends when the AI produces a paragraph. The second one ends when a human running an experimentation program has the answer they actually needed, derived from evidence and ready to act on.

The agents that are starting to earn their keep this year own the second kind. They watch the A/B test and tell you why the variant is losing—not just what happened. They monitor site health between releases and flag the regression before your support queue notices. They handle the investigative work that used to eat an analyst’s Tuesday, start to finish, while the analyst is in other meetings doing higher-leverage work.

If the human is still the one finishing the thought, you don’t have an agent yet. You have a smarter assistant, which is useful but not transformative, and not going to change your ROI math.

The verification half is the one that matters more

This is the part almost nobody is talking about, and it determines whether any of this works at scale.

When you hand a job to a teammate, you’re trusting them to make a call without you there. The reason that works in human teams is that you can ask them later what they saw, what they did, and why. They can show you. The work is auditable, and the auditability is what makes the trust possible. New employees don’t get autonomy on day one. They earn it by showing their work until that track record justifies giving them room to operate.

An autonomous agent that can’t show its work isn’t a teammate. It’s a black box that occasionally produces an answer. You can’t put that on the org chart. You can’t put it in front of a regulator. You can’t put it anywhere the cost of being wrong is real—and in any business worth automating, the cost of being wrong is always real.

This is where most agent deployments quietly stall. They work in the demo. The first time something goes sideways in production and someone asks what the agent did and why, the team can produce a confidence score and not much else. That’s the moment leadership pulls back the autonomy, and the project’s value collapses.

The question to push harder on

Not “can it do the thing.” Can it show me what it saw, what it decided, and why, at a level of detail I can actually check?

If the answer is a confidence score and a summary, that isn’t verification.

Real verification means the agent’s work is traceable to ground-truth evidence. For an agent operating on user behavior, every action it takes should be grounded in something concrete: the session it analyzed, the click that triggered the flag, the moment a user got stuck and dropped off. Not aggregated metrics or inferred patterns. The raw record of what happened, that you can replay and check.
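One way to picture that requirement is as a record the agent must emit for every action. The following is only a minimal sketch under stated assumptions: the names `Evidence`, `AgentAction`, and `unverifiable_actions`, and all of their fields, are hypothetical and do not describe any particular product's API. The point it illustrates is that a confidence score and a rationale are not enough on their own; an action counts as verifiable only if it points at a raw record someone can replay.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Evidence:
    """Pointer to a raw, replayable record (all fields are hypothetical)."""
    session_id: str    # the session the agent analyzed
    event_index: int   # position of the triggering event within that session
    description: str   # e.g. "user clicked a disabled checkout button"


@dataclass
class AgentAction:
    """One thing the agent did: what it saw, what it decided, and why."""
    decision: str
    rationale: str
    confidence: float
    evidence: List[Evidence] = field(default_factory=list)

    def is_verifiable(self) -> bool:
        # A confidence score and a summary alone do not count as verification:
        # at least one pointer to a raw record is required.
        return len(self.evidence) > 0


def unverifiable_actions(actions: List[AgentAction]) -> List[AgentAction]:
    """Return the actions a reviewer could not check against raw evidence."""
    return [action for action in actions if not action.is_verifiable()]
```

In a sketch like this, a reviewer could run `unverifiable_actions` over an agent's log before extending its autonomy. Anything the function returns is, in the terms above, a black-box answer.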

This separates agents you can deploy at scale from agents you can only demo. A lot of products on the market today don’t clear that bar, because they were built on top of analytics infrastructure that throws away most of the underlying behavioral evidence in favor of pre-defined events. You can’t verify what the data doesn’t contain.

What this looks like in practice

A few examples of agents that pass the test.

An A/B test evaluator that not only tells you the variant is underperforming, but shows you the specific user sessions where the friction occurred, so you can watch what went wrong instead of reading a summary.

A site health agent that doesn’t just alert on a metric anomaly, but identifies the deploy that caused it and links you to the broken user journeys, with the relevant sessions queued up and ready to review.

A release monitor that watches new feature adoption around the clock and surfaces the moments where users hit confusion, with replay evidence, so the product team can make the fix on Monday instead of arguing about whether the issue is real.

In each case, the agent owns the outcome rather than the task. And in each, the work is traceable back to something a human can verify in seconds. Without both halves, you don’t have an agent you can trust with autonomy.
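To make the site health example a little more concrete, here is a rough sketch of what a traceable finding could look like. Everything in it is an assumption made for illustration: the `Deploy` and `Finding` types and the `build_finding` function are hypothetical, and picking the most recent deploy before the anomaly is a simple heuristic for a suspect, not proof of cause. What matters is that the finding carries the session IDs a human would replay to confirm or reject it.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Deploy:
    deploy_id: str
    deployed_at: datetime


@dataclass(frozen=True)
class Finding:
    suspected_deploy: Optional[str]  # None if no deploy preceded the anomaly
    session_ids: List[str]           # sessions queued up for human review


def build_finding(
    anomaly_at: datetime,
    deploys: List[Deploy],
    affected_sessions: List[str],
) -> Finding:
    """Pair an anomaly with the latest prior deploy and the sessions to replay."""
    # Only deploys at or before the anomaly can be suspects.
    prior = [d for d in deploys if d.deployed_at <= anomaly_at]
    suspect = max(prior, key=lambda d: d.deployed_at, default=None)
    return Finding(
        suspected_deploy=suspect.deploy_id if suspect else None,
        session_ids=list(affected_sessions),
    )
```

The design choice worth noting is that the output is a suspect plus evidence, not a verdict. The human can check it in seconds, which is the second half of the test.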

If you’re evaluating AI this year, the question isn’t whether the demo looks impressive. The question is whether you’d let it run on Tuesday morning without watching, and whether you’d be able to defend what it did on Tuesday afternoon.

If you want to see what we’re building to help teams clear that bar, we’d love to show you at our June 16 summer release event. RSVP here.

Joel Webber is a Co-Founder and Chief Architect at Fullstory. He is based in Marietta, Georgia.
