AI Agent benchmarks have become the new vanity metric in the AI industry. They look impressive in a demo but rarely predict real-world value. If you’re trying to figure out how to leverage agentic AI and “AI Agent Benchmarks” is in your Google search history, you might be asking the wrong questions entirely. Let me tell you why!
AI Agents and their potential are similar to when personal computers came on the scene in the 1980s. The technology is impressive, but businesses are still asking: Where’s the value, and what exactly do we do with it? “Which AI Agent is better?” seems like a fair question, but like many things with generative AI, it’s far more nuanced than it seems on the surface.
In this blog article, we’ll focus on why generic agent benchmarking is a misguided approach. We’ll explore the difference between traditional software and AI Agents, discuss the pitfalls of generic benchmarking, and close with a practical framework for evaluating this new technology. The singular goal of this article is to equip you with the knowledge you need to find the solution that’s right for you.
AI Agents and large language models (LLMs) are fundamentally different from traditional technology, so they demand a different approach to evaluation. Traditional software is deterministic: you do A, you get B.
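To make that concrete, here is a minimal sketch of a rule-based workflow in Python. The ticket categories, threshold, and routing labels are hypothetical, but the behavior is the point: identical inputs always produce identical outputs.

```python
# A deterministic, rule-based workflow: the same ticket always gets the
# same routing decision. (Categories and thresholds are hypothetical.)
def route_support_ticket(category: str, amount: float) -> str:
    if category == "refund" and amount <= 100:
        return "auto-approve"      # rule matched: small refunds go through
    if category == "password_reset":
        return "send-reset-link"   # rule matched: self-service reset
    return "escalate-to-human"     # no rule matched: a person steps in

# Run it a thousand times with the same inputs; you get the same answer.
print(route_support_ticket("refund", 45.00))  # always "auto-approve"
```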
In a workflow like this, predefined steps are executed to achieve measurable, repeatable goals. This is the type of technology solution businesses are used to, and for many use cases, it may still be the ideal choice. That said, it's not perfect. The resilience of these workflows is only as good as the rules engineers can build into them. If something happens outside those rules, the system fails and a human has to step in.
Unlike traditional, rule-based software, LLMs are non-deterministic: the same input can produce different outputs from one run to the next. Non-deterministic does not mean not valuable, but it does mean leaders must change how they evaluate the technology and the value it provides. Agentic systems open the door to a whole new class of use cases, but how they fit into your business is different from traditional software.
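Here is a hedged sketch of that behavior using the OpenAI Python SDK (the model name and prompt are placeholders; any chat model shows the same effect): the same request, sent three times, can come back worded three different ways.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Same prompt, three calls: responses can differ in wording, structure,
# and sometimes substance. That is non-determinism in action.
for _ in range(3):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": "Suggest a name for a bakery."}],
    )
    print(response.choices[0].message.content)
```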
Purchasing agentic technology is more akin to hiring a new, very intelligent employee. When you buy traditional software, the domain and use case are clear. The technology does exactly what it's designed to do, no more and no less. Hiring a new employee, while scoped to a specific area of the business, is far more fluid. Humans can understand information across broad domains, explain their actions and thought process, reason, and make decisions. They are intelligent yet imperfect. They collaborate within teams to achieve results. AI Agents are similar. They can mimic intelligence, reason through problems, explain their actions, and dynamically plan and act on a specific task with little instruction. And, like humans, they are not to be trusted blindly but used collaboratively as an extension of the organization. While imperfect, they can be extremely useful when used properly.
Back to our original question: are AI Agent benchmarks worthwhile? If done broadly, no. If done as an internal test for a specific use case at your company, yes. Think of AI Agents the way you would think about managing and evaluating people. How a human performs depends on many factors and varies from person to person and from task to task. View AI Agents through a similar lens.
That variability is what makes benchmarking difficult. Ask yourself: who is better, Michael Jordan or Albert Einstein? You see the problem. Better at what, exactly? That's the issue with broadly benchmarking AI Agents. It's not a simple comparison. Many variables are at play, and the answer depends entirely on context. If you're hiring a basketball coach, Jordan is the obvious choice. If you're looking for a theoretical physicist, Einstein wins. The same logic applies to AI Agents: performance depends on task, environment, and measurement criteria, all of which hinge on the specifics of your organization.
So…where does that leave us? For starters, don't put the cart before the horse. Forget benchmarks for now and step back to look at the bigger picture. First, ask yourself: is this a process AI should even be involved in? While incredible, AI Agents are still just software. They're not going to solve every problem you have, so focus on the right areas to realize the most value. The goal at this stage is to map out where Generative AI can genuinely fit into your organization.
That mapping exercise will give you a list of use cases where AI Agents will provide the most value. Once mapped, prioritize them by balancing expected impact against the work required to get each use case into production. The image below shows an example of our approach to mapping out use cases when conducting Generative AI Workshops with our clients.
Consider the two examples shown in this chart:
AI Agent customer service chatbot
Company knowledge assistant
Both use cases are valuable, but the knowledge assistant is a lighter lift with far less operational risk. This type of analysis should be done for every use case to prioritize implementation effectively. Once that exercise is complete, you’ll have a clear roadmap for integrating AI Agents. Only then should you start to evaluate the technology itself.
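There is no single right way to rank use cases, but as an illustrative sketch, a simple impact-versus-effort ratio turns a chart like this into an ordered backlog. The 1-5 scores below are made up for demonstration, not taken from a real assessment.

```python
# Illustrative impact-vs-effort scoring on 1-5 scales (numbers are made up).
use_cases = [
    {"name": "AI Agent customer service chatbot", "impact": 5, "effort": 5},
    {"name": "Company knowledge assistant",       "impact": 4, "effort": 2},
]

# Rank by impact per unit of effort: highest-value, lightest-lift first.
for uc in sorted(use_cases, key=lambda u: u["impact"] / u["effort"], reverse=True):
    print(f"{uc['name']}: {uc['impact'] / uc['effort']:.1f}")
```

With these sample numbers, the knowledge assistant (2.0) outranks the chatbot (1.0), matching the lighter-lift analysis above.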
For more info on evaluating use cases, check out Our Use Case Guide.
When evaluating AI technology, remember that AI Agents are non-deterministic. Every successful use case requires a cycle of testing, validation, and refinement until the desired level of efficacy is achieved. Any platform you consider should therefore offer transparency and flexibility.
When refining AI Agents for customers, we continually modify levers such as the prompts and instructions the agents receive, the tools they can call, the underlying models, and the context they retrieve.
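What that looks like in practice varies by platform, but here is a minimal sketch of the test-validate-refine loop. The run_agent and grade functions are hypothetical stand-ins for your agent call and your scoring logic, and the test cases are invented.

```python
# Hypothetical test cases; a real suite would draw from actual user requests.
TEST_CASES = [
    {"input": "Where is order #1234?",     "expected": "order-status lookup"},
    {"input": "I want to cancel my plan.", "expected": "cancellation flow"},
]

def evaluate(run_agent, grade, threshold: float = 0.9) -> bool:
    """Run every test case and report whether the agent meets the bar."""
    scores = [grade(run_agent(c["input"]), c["expected"]) for c in TEST_CASES]
    accuracy = sum(scores) / len(scores)
    print(f"accuracy: {accuracy:.0%}")
    # Below threshold? Adjust prompts, tools, or context, then rerun.
    return accuracy >= threshold
```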
Selecting the best AI Agent isn't about choosing the technology with the "best agents"; it's about finding the platform that enables rapid iteration and refinement so agents deliver what your business actually needs. Think back to the personal computer analogy: in the 1980s, the most successful companies weren't those that bought the best machines; they were the ones that figured out how to fit them into their business in the right way.
In short, evaluating AI Agents isn't about picking the smartest model or the most capable agent. It's about choosing a technology that enables adaptability, and a partner who knows your business. The winners in this new era will be those who learn, iterate, and integrate faster than their competitors.