Vertesia Blog

Stop Focusing on LLM Tokens and Model Benchmarks

Written by Sarah Berger | October 31, 2024

Large Language Models (LLMs) have introduced a paradigm shift for enterprise organizations across various industries. With the promise of accelerating innovation, automating content generation, and transforming customer interactions, Generative AI (GenAI) has garnered a lot of excitement. Yet, many organizations find themselves in a loop of benchmarking, fine-tuning foundational models, and analyzing token costs instead of taking the necessary steps to deliver the ROI and organizational value that warrants such efforts and investment.

While benchmarking models is potentially valuable for research, focusing on these technical details and comparisons will hinder the enterprise’s progress. A different approach is needed: concentrate on getting high-priority, high-value LLM use cases into production as quickly as possible to show real business impact. Only after demonstrating ROI should organizations then fine-tune models or explore lower-cost alternatives.

In this article, we’ll explore why enterprises should deprioritize LLM cost scrutiny and benchmarking and instead focus on fast-tracking business-critical use cases to production. We’ll also discuss why a platform approach will help you avoid the most common pitfalls, accelerate your time-to-value, and ultimately help you achieve your specific business objectives in a highly impactful and measurable way. Only then will incremental refinements be meaningful, because they can be measured against real gains rather than theoretical business metrics.

The Problem: Wasting Time & Resources on the Proof-of-Concept loop

Many enterprises start their GenAI journey focused on the data science aspect of the equation, investing in R&D to compare foundational models, test different hyperparameters, and benchmark token costs to find the optimal configuration. While this approach may sound sensible because it focuses on the components that differentiate GenAI from traditional IT projects, it often leads to analysis paralysis. The organization finds itself in an endless loop of proof-of-concept (POC) experiments without a clear path to production or business impact.

The reasons vary, but all too often teams lack the end-to-end platform these initiatives need to succeed. The tools and frameworks they use are designed to measure and tune models rather than to design, test, and deploy applications. Teams often work in fragmented, complex scripting interfaces that will not scale to a production system or accommodate model volatility and evolution. Nor were those tools designed to get a project into production with proper software development lifecycle (SDLC) controls, compliance and security rigor (e.g., Information Security approval), or user adoption (e.g., ease of application integration and performance standards that work for business users).

As a result, too much energy, time, and resources are spent on activities that don’t move the needle on tangible and quantifiable business objectives. This results in projects with no or low ROI and dwindling competitive advantage as others in the industry execute instead of explore. Ultimately, investments without ROI lose appeal, frustrate leadership, and reduce project budgets.

The underlying issue is that many teams are trying to optimize models before understanding the specific nuances of their enterprise use case. They focus on finding the “perfect” model configuration before proving the solution's utility in a real-world environment. 

Instead, a pragmatic approach is needed: prioritize delivering a working solution that drives immediate business impact. Once that’s done, you can optimize, switch models, and reduce costs as required.

Getting to production is the goal

Enterprise organizations, particularly those making significant investments in GenAI, should focus on solving high-impact business problems. This means identifying priority use cases where LLMs can unlock new revenue streams, reduce costs, or dramatically improve productivity. In short, organizations should aim to solve problems that matter most to the business.

Taking this approach requires a shift from technology-centric thinking to business-first thinking:

  1. Get to Value First, Optimization Later
    The initial focus should be showing quick wins that provide tangible value. This demonstrates to stakeholders that the investment in GenAI is worthwhile and sets the stage for further refinements.

  2. Iterate Quickly and Get Real-World Feedback
    Getting these business solutions into production quickly allows teams to collect real-world feedback. This is crucial, as it’s often impossible to fully understand an LLM’s effectiveness without observing its performance in a live environment. Once a baseline is established, organizations can experiment with different models, architectures, or token optimization strategies.

  3. Justify GenAI Spend
    By quickly implementing these business solutions, your team can truly validate the use cases and extend the business case for further investment. Like any other investment, ROI is vital during budget planning. This pragmatic approach is crucial for justifying and securing ongoing support for GenAI initiatives when budgets are tightened.

  4. Avoid Fixating on Token Costs in the Beginning
    Token costs and model size should not be the primary concern until you’ve validated that a targeted use case works and delivers value. Once a high-value use case has been identified and proven to the business sponsors, a second phase is the right time to explore more cost-effective options, such as using smaller models or optimizing inference efficiency.
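
To make the token-cost point concrete, here is a minimal back-of-the-envelope sketch in Python. Every figure in it (request volume, tokens per request, price per million tokens, monthly value, and other costs) is a hypothetical assumption chosen for illustration, not data from any real deployment, and the UseCaseEstimate class is an invented helper rather than part of any platform.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class UseCaseEstimate:
    """Hypothetical monthly figures for one LLM use case (illustrative only)."""
    monthly_requests: int
    tokens_per_request: int
    price_per_million_tokens: float   # USD, assumed blended input/output price
    monthly_value: float              # USD of savings or revenue attributed to the use case
    other_monthly_costs: float = 0.0  # USD for engineering, hosting, and support

    def token_cost(self) -> float:
        """Monthly spend on tokens in USD."""
        total_tokens = self.monthly_requests * self.tokens_per_request
        return total_tokens / 1_000_000 * self.price_per_million_tokens

    def total_cost(self) -> float:
        return self.token_cost() + self.other_monthly_costs

    def roi(self) -> float:
        """Return on investment as a ratio: (value - cost) / cost."""
        cost = self.total_cost()
        if cost <= 0:
            raise ValueError("total cost must be greater than zero")
        return (self.monthly_value - cost) / cost


# Assumed baseline: 50,000 requests a month at 2,000 tokens each.
baseline = UseCaseEstimate(
    monthly_requests=50_000,
    tokens_per_request=2_000,
    price_per_million_tokens=10.0,
    monthly_value=40_000.0,
    other_monthly_costs=9_000.0,
)

# Same use case with the token price cut in half.
cheaper_tokens = replace(baseline, price_per_million_tokens=5.0)

# Same use case, original token price, delivering 25% more value.
more_value = replace(baseline, monthly_value=50_000.0)

print(f"token cost:          ${baseline.token_cost():,.0f}")
print(f"baseline ROI:        {baseline.roi():.2f}")
print(f"half-price tokens:   {cheaper_tokens.roi():.2f}")
print(f"25% more value:      {more_value.roi():.2f}")
```

Under these assumed numbers, tokens account for $1,000 of a $10,000 monthly cost, so halving the token price moves ROI far less than a 25% gain on the value side. That is only true for inputs like these; the value figure is the one you cannot know until the use case is in production.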

The most common pitfalls of the “Test Everything” approach

Many organizations build highly sophisticated but non-production-ready architectures to test various LLMs or run ongoing benchmarking studies on continuously changing models. While this approach may provide insights into model performance in isolated scenarios, it rarely translates into actionable and measurable business value.

The key pitfalls of the “test everything” mindset include:

  1. Massive Opportunity Cost
    The time, energy, and resources invested in experimenting with models without a clear path to production equate to delays in delivering ROI and business value to the organization. The longer the delay, the higher the opportunity cost.

  2. Stakeholder Fatigue
    Repeatedly requesting more time and resources because “more data is needed” will likely breed stakeholder skepticism over time. Leaders need real business benefits and measurable outcomes, not data, models, and frameworks.

  3. Focusing on Costs Over Value
    When enterprises prioritize token costs over business outcomes, they risk underinvesting in potentially transformative use cases that would justify a higher initial spend. To avoid being “penny wise and pound foolish,” deploy quickly and measure the real business value, which gives you the most critical part of the ROI equation: the return.

  4. Limited Scalability
    Organizations in this test loop often build highly customized test environments, which make it extremely difficult to scale across multiple use cases. An over-engineered testing framework will likely become a bottleneck that prevents or significantly delays your team’s ability to deploy the project into production. Even if a project does launch on top of custom scripts written specifically for it, any future iterations, modifications, and model changes to the business solution will likely present additional challenges.

A platform-driven approach for rapid deployment

We offer a better approach. Vertesia is a platform that helps organizations rapidly deploy high-value use cases into production. It provides a streamlined way to build, test, and deploy LLM-powered tasks across diverse solutions. 

To do this well, at scale, you’ll need:

  1. Simplified Task Creation and Management
    Our customers can create, configure, and manage LLM tasks (i.e., instructions to one or more models via structured prompt segments) without the burden of building complex test environments or infrastructure.

  2. Seamless Integration of Content and Data
    Why waste more time and add risk by moving your content and data to one place when you can leverage existing systems? Composable easily integrates with existing systems and data pipelines, making it easy to gather the context-specific data required for high-impact use cases.

  3. Quick Deployment and Evaluation
    Composable allows traditional IT teams and business stakeholders to collaborate, test, and iterate rapidly, then deploy into production environments with minimal overhead. This ensures that all teams get real-world feedback and refine the solution based on actual performance, not just lab tests.

  4. Model Flexibility without Complexity
    Models change, so you need a flexible platform to make 1-click adjustments during the build, test, and deploy process without complexity. Composable enables teams to switch between models or bring in proprietary models as needed without re-engineering the entire application. This allows for easy experimentation at the right time, which we strongly recommend you do after establishing the business value.

  5. Focus on Business Metrics
    The Composable platform helps customers prioritize their business KPIs—such as reduced churn, increased sales, or improved customer satisfaction—over purely technical metrics like perplexity or token efficiency. This aligns LLM performance with business objectives and gives the teams a clear, meaningful, and measurable ROI.
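
The idea behind model flexibility (point 4 above) can be sketched in a few lines of Python. This is not the platform's API; TaskRunner, echo_model, and shout_model are hypothetical names invented for illustration, and the two backends are stand-ins for real model providers. The sketch only assumes that a task is defined once as a prompt template and that the model behind it is a swappable setting.

```python
from typing import Callable, Dict

# Assumption: a model backend is any callable mapping a prompt to a completion.
ModelBackend = Callable[[str], str]


def echo_model(prompt: str) -> str:
    """Stand-in backend; a real one would call a hosted model."""
    return f"[echo] {prompt}"


def shout_model(prompt: str) -> str:
    """Second stand-in backend with visibly different behavior."""
    return f"[shout] {prompt.upper()}"


class TaskRunner:
    """Runs a prompt template against whichever backend is currently active."""

    def __init__(self, backends: Dict[str, ModelBackend], active: str) -> None:
        self._backends = dict(backends)
        self._active = ""
        self.use(active)

    def use(self, name: str) -> None:
        """Switch the active backend; task definitions are untouched."""
        if name not in self._backends:
            raise KeyError(f"unknown model backend: {name}")
        self._active = name

    def run(self, template: str, **fields: str) -> str:
        """Fill the template and send the prompt to the active backend."""
        prompt = template.format(**fields)
        return self._backends[self._active](prompt)


SUMMARIZE = "Summarize: {text}"  # the task is defined once, independent of the model

runner = TaskRunner({"echo": echo_model, "shout": shout_model}, active="echo")
first = runner.run(SUMMARIZE, text="q3 report")

runner.use("shout")  # swap models without changing the task or the calling code
second = runner.run(SUMMARIZE, text="q3 report")

print(first)
print(second)
```

The design choice worth noting is that the calling code and the task template never mention a specific model, so experimenting with alternatives later is a configuration change rather than a rewrite.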

How do we go from model-centric to outcome-centric thinking?

Shifting from a model-centric to an outcome-centric mindset may take work. For AI project teams accustomed to R&D and benchmarking, the technical allure of digging into the latest model performance and fine-tuning strategies can be substantial. However, it’s crucial to remember that GenAI's ultimate goal is to drive positive business outcomes.

An outcome-centric approach should begin by asking the team:

  • What business problem are we solving?
  • Who is impacted by this problem? If we address this, who would this help?
  • What is the expected benefit, and which KPIs will measure it: time and dollars saved, revenue gained, or a percentage increase in customer satisfaction?
  • How can we validate that impact quickly?

Once these questions are answered, the next step is to confirm a basic version of the solution with the business and get it into production as quickly as possible. Only then should the organization explore optimizations such as experimenting with different models, reducing token costs, or deploying smaller models for efficiency.

Prove value first, optimize later

In summary, the key to GenAI's success is getting high-value use cases into production and quickly demonstrating business impact. Rather than getting bogged down by technical details and model comparisons, enterprises should focus on solving the most critical problems.

The path forward is clear: Use a production-ready AI platform, like Composable, to allow traditional development and business teams to quickly define, iterate, test, and deploy with model flexibility. This will enable teams to focus on achieving real-world business impact from day one. Once deployed, there will be time for ongoing optimization and model tuning.

By shifting focus from benchmarking and token costs to proving value in production, enterprises can unlock the full potential of Generative AI and drive transformative business outcomes. Doing so will make everyone more comfortable answering leadership’s perennial question, “What is the business value and ROI on this work?”