Vertesia Wins "AI Startup of the Year" - What This Award Really Means
This isn't just an accolade; it's a powerful validation of our mission to help organizations truly harness the power of AI at an enterprise scale.
Cloud providers impact AI behavior. Learn how Claude 4.5 performs differently and what this means for AI deployments.
What if I told you that the same model, given the same prompt and the same task, could produce different results depending on the cloud provider it runs on?
That’s exactly what happened when my colleagues and I started noticing that Claude 4.5 on AWS generated different results than on Google Vertex.
Because agent behavior isn’t fully deterministic, we couldn’t trust our subjective interpretations alone, so we designed an experiment to validate the results with real data.
Multiple people on my team were reporting the same thing: Claude 4.5 on AWS seemed to deliver a more extensive analysis than the same model on Google Vertex.
This is a big deal if you’re building production agents on top of multiple providers. If the same model behaves differently depending on where you run it, that affects reliability, evaluation, and ultimately an organization’s ability to put agents into production.
To test our hypothesis, we outlined a controlled experiment.
I wanted to isolate one variable: the provider.
Everything else needed to be identical: the same agent, the same model, the same prompt, and the same document repository. The only change was the cloud provider serving the model.
I used one of our multipurpose agents. This is a generalist agent with instructions on how to think through problems, use tools, and manage memory.
The prompt I gave each agent:
“Analyze every single one of the documents in your repository and create a report categorizing and summarizing the documents you find.”
This kind of open-ended, multi-step task is perfect for surfacing behavioral differences.
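To make the controlled setup concrete, here is a minimal sketch of how the two runs could be parameterized. The configuration keys and the run_agent() helper are hypothetical, not our platform's actual API; the point is simply that every setting except the provider is shared.

```python
# Illustrative sketch only: the field names and run_agent() are hypothetical.
# Both runs share every setting except the provider.

BASE_CONFIG = {
    "model": "claude-4.5",
    "agent": "multipurpose-agent",  # the same generalist agent definition
    "prompt": (
        "Analyze every single one of the documents in your repository and "
        "create a report categorizing and summarizing the documents you find."
    ),
    "repository": "shared-document-repository",  # identical corpus for both runs
}

runs = [
    {**BASE_CONFIG, "provider": "aws"},            # run 1: Claude 4.5 via AWS
    {**BASE_CONFIG, "provider": "google-vertex"},  # run 2: Claude 4.5 via Google Vertex
]

for config in runs:
    print(f"Launching run on {config['provider']} with otherwise identical settings")
    # run_agent(config)  # hypothetical launcher; the provider is the only variable
```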
Once each agent finished its run, I did a manual comparison and noticed some differences right away. One great thing about agents is that they can act as a second set of eyes, so to be extra certain of the results I used a third, independent agent: a Comparative Analysis Agent, running Claude 4.5 on AWS. The reason for this was simple: multiple data points told us the model on AWS provided the more extensive analysis. We heard it subjectively from several people in our organization, we confirmed it by manually reviewing the test results, and the agent comparison ultimately confirmed it as well.
For each run, this Comparative Analysis Agent received three items:
The Comparative Analysis Agent was instructed to do a structured comparison across:
In other words: not just “Which output looks nicer?”, but how each agent thought, moved, and decided its way through the task.
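For readers who want to replicate this, here is one way the structured comparison could be represented. The dataclasses below are illustrative, not a platform API; the fields simply mirror the dimensions summarized in the table that follows.

```python
# Illustrative schema for the comparison agent's structured output.
# The names are ours for this sketch; they mirror the results table below.
from dataclasses import dataclass


@dataclass
class DimensionComparison:
    metric: str         # e.g. "Documents analyzed"
    aws_result: str     # what the AWS run produced for this dimension
    google_result: str  # what the Google Vertex run produced
    winner: str         # "AWS", "Google", "Tie", or "Different"


@dataclass
class ComparisonReport:
    dimensions: list[DimensionComparison]
    aws_score: float    # overall output quality, 0-10
    google_score: float
    narrative: str      # free-text reasoning about how each agent worked

# Asking for a verdict per dimension (instead of "which output looks nicer?")
# forces the comparison agent to justify its conclusions against the run logs.
```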
Here’s the high-level summary of what came back:
| Metric | AWS Agent | Google Agent | Winner |
| --- | --- | --- | --- |
| Documents analyzed | 1,087 (100%) | 952 (87.6%) | AWS |
| Categories created | 15 major categories | 19 major categories | Tie |
| Report completeness | Complete, with verification | Truncated / incomplete | AWS |
| Self-verification | Yes – proactive | No | AWS |
| Execution approach | Direct execution | Plan-based | Different |
| Output quality (score) | 9.5 / 10 | 9.0 / 10 | AWS |
Both agents produced great results, but Claude 4.5 in the Google environment seemed to prioritize efficiency, while the AWS version immediately put its effort toward exhaustive analysis.
It is important to note that this experiment is a report on observed behavioral differences under specific testing conditions. The results should not be interpreted as a blanket condemnation of either cloud solution; both providers offer robust, high-performing AI environments. The goal of this analysis is simply to highlight that provider choice introduces subtle, yet real, behavioral variables that builders must account for.
The most interesting part wasn’t the scores. It was the philosophy each agent seemed to adopt. AWS jumped right in, evaluated, and iterated. Google planned, and then systematically executed that plan. It’s analogous to a single brain having different personalities based on the environment.
Provider: AWS
Overall philosophy: “Act first, verify later, correct as needed.”
Step-by-step behavior:
Key traits:
To sum up Claude 4.5’s behavior on AWS: the agent dove right into the task, evaluated its work, noticed it had not fully completed the user’s request, and self-corrected to analyze the remaining documents.
Provider: Google Vertex
Overall philosophy: “Plan thoroughly, then execute systematically.”
Step-by-step behavior:
Key traits:
The result: a good report with strong structure and more categories (19 vs. 15), but a less detailed view of what documents existed in the project’s repository.
While this first test was eye-opening, we didn’t stop after a single iteration. To truly understand the differences, we ran the same test two more times, giving us three analyses of the same task on three separate occasions with identical setups. The consistent findings across all three tests were:
On paper, both agents are using “Claude 4.5.” But in practice:
If you’re building AI products, this has implications.
You can’t assume that running “Model X” on cloud provider A will be behaviorally identical to running it on cloud provider B.
Even small differences in:
…can create meaningful divergence in how an agent explores, verifies, and completes work.
If I had only looked at the final reports, I would have concluded:
But when you look at the run logs and have an analysis agent compare the process, you see:
Both approaches have pros and cons. What it ultimately comes down to is the specifics of the problem you are trying to solve.
Provider selection is no longer just
It’s also:
If your product promises things like “comprehensive analysis” or “fast and efficient workflows,” these behavioral nuances matter.
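One practical way to act on that, sketched here with made-up helper names and thresholds: encode behavioral promises such as document coverage as explicit checks in your evaluation suite, and run them separately for each provider. The document counts below come from the runs described above; everything else is an assumption for illustration.

```python
# Hypothetical sketch: the helper and thresholds are ours, not a framework API.
# Example figures come from the runs above (1,087 documents total; the Google
# run covered 952, or 87.6%).

EXPECTED_DOCUMENT_COUNT = 1087
MIN_COVERAGE = 0.95  # example threshold for a "comprehensive analysis" promise


def check_coverage(documents_analyzed: int, provider: str) -> None:
    """Raise if an agent run analyzed fewer documents than the promise allows."""
    coverage = documents_analyzed / EXPECTED_DOCUMENT_COUNT
    if coverage < MIN_COVERAGE:
        raise AssertionError(
            f"{provider}: analyzed {documents_analyzed}/{EXPECTED_DOCUMENT_COUNT} "
            f"documents ({coverage:.1%}), below the {MIN_COVERAGE:.0%} threshold"
        )


for provider, analyzed in [("aws", 1087), ("google-vertex", 952)]:
    try:
        check_coverage(analyzed, provider)
        print(f"{provider}: coverage check passed")
    except AssertionError as err:
        print(f"coverage check failed -> {err}")
```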
A few lessons from this experiment you can apply right away:
We often talk about models as if they’re static, uniform entities: “Use Claude,” “Use GPT,” “Use Model X.”
In reality, what you get in production is the combination of:
This test confirmed our hypothesis that the same model can behave differently across providers. But the real lesson is this: integrating AI agents is about much more than the model itself. Many variables shape an agent’s performance, and you need the ability to isolate changes, run controlled experiments, and compare outcomes objectively. Neither provider was universally better, but the differences were real—and they have implications for how businesses approach AI. In our next post, we’ll break down exactly how we ran this experiment, the other tests we’re performing, and how we’re helping customers iterate faster and get to ROI sooner.