
The agentic AI market has a verification problem

Joel Webber ✦

Cofounder & Engineering

Last updated: 05/26/2026

Most “agentic AI” being sold right now is a co-pilot in a costume.

I don’t say that to be cynical. I say it because the word “agent” is being applied to so many different things this year that it’s stopped meaning anything, and that makes it hard to evaluate what you’re actually buying.

Gartner expects more than 40% of agentic AI projects to be canceled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls. The technology isn’t the issue. The issue is that teams committed to “agentic” before anyone agreed on what the word required, and discovered eighteen months later that what they bought couldn’t carry the weight of the job.

So before you write another line of your AI roadmap, I want to offer the test I run on every vendor.

A two-part test for whether something is actually an agent

Ask the vendor: Can the AI finish the job without me in the loop, and can I verify what it did when I check?

If the answer to either half is no, it’s not an agent. It’s a faster co-pilot. Co-pilots are useful, but they aren’t the category shift the marketing implies, and they aren’t going to move your ROI conversation past “we’re saving some time.”

The reason the test is two parts and not one is that even the products that pass the first half usually fail the second. The second half is where the value lives.

The “without me in the loop” half

This is the one most products fail outright. A lot of what’s being sold as agentic still requires the human to ask the right question, interpret the answer, and decide what to do next. The AI is doing more of the work than it used to, but it isn’t doing the job.

A real agent owns an outcome, not a task. That distinction matters. A task is “summarize this session.” An outcome is “tell me which of my A/B tests are losing and why.” The first one ends when the AI produces a paragraph. The second one ends when a human running an experimentation program has the answer they actually needed, derived from evidence and ready to act on.

The agents that are starting to earn their keep this year own the second kind. They watch the A/B test and tell you why the variant is losing—not just what happened. They monitor site health between releases and flag the regression before your support queue notices. They handle the investigative work that used to eat an analyst’s Tuesday, start to finish, while the analyst is in other meetings doing higher-leverage work.

If the human is still the one finishing the thought, you don’t have an agent yet. You have a smarter assistant, which is useful but not transformative, and it isn’t going to change your ROI math.

The verification half is the one that matters more

This is the part almost nobody is talking about, and it determines whether any of this works at scale.

When you hand a job to a teammate, you’re trusting them to make a call without you there. The reason that works in human teams is that you can ask them later what they saw, what they did, and why. They can show you. The work is auditable, and the auditability is what makes the trust possible. New employees don’t get autonomy on day one. They earn it by showing their work until that track record earns them room to operate.

An autonomous agent that can’t show its work isn’t a teammate. It’s a black box that occasionally produces an answer. You can’t put that on the org chart. You can’t put it in front of a regulator. You can’t put it anywhere the cost of being wrong is real—and in any business worth automating, the cost of being wrong is always real.

This is where most agent deployments quietly stall. They work in the demo. The first time something goes sideways in production and someone asks what the agent did and why, the team can produce a confidence score and not much else. That’s the moment leadership pulls back the autonomy, and the project’s value collapses.

The question to push harder on

Not “can it do the thing.” Can it show me what it saw, what it decided, and why, at a level of detail I can actually check?

If the answer is a confidence score and a summary, that isn’t verification.

Real verification means the agent’s work is traceable to ground-truth evidence. For an agent operating on user behavior, every action it takes should be grounded in something concrete: the session it analyzed, the click that triggered the flag, the moment a user got stuck and dropped off. Not aggregated metrics or inferred patterns. The raw record of what happened, that you can replay and check.
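To make "traceable to ground-truth evidence" concrete, here is a minimal sketch of what an evidence-linked action record could look like. Everything in it is an assumption made for illustration: the names `Evidence`, `AgentAction`, and `is_verifiable`, the field layout, and the session IDs are hypothetical and do not describe any particular product's data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Evidence:
    """Pointer to one raw behavioral record a reviewer can replay (hypothetical shape)."""
    session_id: str
    event_type: str    # e.g. "click" or "drop_off"
    timestamp_ms: int  # offset into the session recording


@dataclass
class AgentAction:
    """One unit of agent work: what it decided, why, and what it saw."""
    decision: str
    rationale: str
    confidence: float
    evidence: list[Evidence] = field(default_factory=list)


def is_verifiable(action: AgentAction, known_sessions: set[str]) -> bool:
    """Return True only if the action cites evidence and every cited session
    exists in the raw record store. A confidence score alone never passes."""
    if not action.evidence:
        return False
    return all(e.session_id in known_sessions for e in action.evidence)


# Hypothetical session store and two example actions.
known_sessions = {"s-101", "s-102"}

grounded = AgentAction(
    decision="flag checkout variant B as losing",
    rationale="users stall on the shipping form",
    confidence=0.81,
    evidence=[Evidence("s-101", "drop_off", 48_200)],
)

ungrounded = AgentAction(
    decision="flag checkout variant B as losing",
    rationale="model judgment",
    confidence=0.93,
)

print(is_verifiable(grounded, known_sessions))    # True
print(is_verifiable(ungrounded, known_sessions))  # False
```

The design choice worth noticing is that the higher-confidence action is the one that fails the check. Confidence is recorded, but only replayable evidence counts toward verification.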

This separates agents you can deploy at scale from agents you can only demo. A lot of products on the market today don’t clear that bar, because they were built on top of analytics infrastructure that throws away most of the underlying behavioral evidence in favor of pre-defined events. You can’t verify what the data doesn’t contain.

What this looks like in practice

A few examples of jobs that pass the test.

An A/B test evaluator that not only tells you the variant is underperforming, but shows you the specific user sessions where the friction occurred, so you can watch what went wrong instead of reading a summary.

A site health agent that doesn’t just alert on a metric anomaly, but identifies the deploy that caused it and links you to the broken user journeys, with the relevant sessions queued up and ready to review.

A release monitor that watches new feature adoption around the clock and surfaces the moments where users hit confusion, with replay evidence, so the product team can make the fix on Monday instead of arguing about whether the issue is real.

In each case, the agent owns the outcome rather than the task. And in each, the work is traceable back to something a human can verify in seconds. Without both halves, you don’t have an agent you can trust with autonomy.

If you’re evaluating AI this year, the question isn’t whether the demo looks impressive. The question is whether you’d let it run on Tuesday morning without watching, and whether you’d be able to defend what it did on Tuesday afternoon.
