Vertesia Blog

Same Model, Different Brain: How Claude 4.5 Behaves on Different Clouds

Written by Jonny McFadden | December 16, 2025

What if I told you this:

  • Same model. Same task. Same data. Same agent framework.
    The only difference? Cloud provider.
  • And the behavior of the agent was meaningfully different.

That’s exactly what happened when my colleagues and I started noticing that Claude 4.5 on AWS generated different results than it did on Google Vertex.

Because agent behavior isn’t fully deterministic, we couldn’t trust our subjective interpretations alone, so we designed an experiment to validate the results with real data.

The hypothesis: “Claude behaves differently depending on where we run it”

Multiple people on my team were reporting the same thing:

  • Claude 4.5 on AWS: The agent seemed to be more thorough in its approach to the task. It provided more detailed, comprehensive results.

  • Claude 4.5 on Google Vertex: Still very good, but it was slightly less extensive in tasks requiring analysis and seemed to prioritize efficiency.

This is a big deal if you’re building production agents on top of multiple providers. If the same model behaves differently depending on where you run it, that affects reliability, evaluation, and ultimately an organization’s ability to put agents into production.

To test our hypothesis, the next step was to design a controlled experiment.

The experiment design

I wanted to isolate one variable: the provider.

Everything else needed to be identical. Here was the setup:

  • Both agents would use the same project. A project is defined as an isolated environment with a single configuration, database, set of documents, and models.
  • Both agents had access to the same documents. In this instance, there were over 1,000 documents in a single project that needed to be analyzed.
  • The instructions, or prompt, given to both agents were exactly the same: each agent was asked to create a categorization report of the documents in the project.
  • The availability of tools remained the same for both agents. Tools represent how the agent interacts with its environment; for instance, agents have tools that help them search for documents, analyze documents, create reports, and so on.
  • The test would be run using the same infrastructure. The technology that these agents were running on was identical for both tests, and they were run at the same time.

The only change, sketched in code below:

  • Agent A → Claude 4.5 via AWS
  • Agent B → Claude 4.5 via Google Vertex
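
In practice these runs go through our agent framework rather than raw SDK calls, but the shape of the setup is easy to sketch with the anthropic Python SDK, which exposes separate Bedrock and Vertex clients. The model IDs, region, and project values below are placeholders, not our actual configuration:

```python
# Minimal sketch: identical prompt and settings, only the provider changes.
# Model IDs, region, and project values are placeholders, not our actual config.
from anthropic import AnthropicBedrock, AnthropicVertex

PROMPT = (
    "Analyze every single one of the documents in your repository and create a "
    "report categorizing and summarizing the documents you find."
)

aws_client = AnthropicBedrock(aws_region="us-east-1")        # Agent A: Claude 4.5 via AWS
vertex_client = AnthropicVertex(region="us-east5",           # Agent B: Claude 4.5 via Google Vertex
                                project_id="my-gcp-project")

def run_agent(client, model_id: str) -> str:
    """Send the identical prompt through either provider and return the text reply."""
    response = client.messages.create(
        model=model_id,
        max_tokens=4096,
        messages=[{"role": "user", "content": PROMPT}],
    )
    return response.content[0].text

report_aws = run_agent(aws_client, "anthropic.claude-sonnet-4-5-20250929-v1:0")  # placeholder ID
report_vertex = run_agent(vertex_client, "claude-sonnet-4-5@20250929")           # placeholder ID
```

Everything downstream of run_agent (documents, tools, instructions) stays identical; only the client and model ID differ.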

The task

I used one of our multipurpose agents. This is a generalist agent with instructions on how to think through problems, use tools, and manage memory.

The prompt I gave each agent:

“Analyze every single one of the documents in your repository and create a report categorizing and summarizing the documents you find.”

This kind of open-ended, multi-step task is perfect for surfacing behavioral differences:

  • How does the agent search?
  • How does it structure its work?
  • Does it verify itself?
  • Does it recover from gaps?

Evaluating the runs: an agent that judges agents

Once each agent finished its run, I did a manual comparison and noticed some differences right away. The great thing about using agents is that you can have them act as a second set of eyes, so to be extra certain of the results, I used a third, independent agent: a Comparative Analysis Agent. This agent ran on Claude 4.5 on AWS, for a simple reason: multiple data points told us that the model on AWS provided the more extensive analysis. We had heard it subjectively from several people in our organization, we confirmed it by manually reviewing the test results, and the agent comparison ultimately bore it out as well.

For each run, this Comparative Analysis Agent received three items:

  • The prompt given to the agent
  • The final report the agent produced
  • The run logs showing every action, thought, decision, and tool call made by the agent

The Comparative Analysis Agent was instructed to do a structured comparison across:

  • Execution approach
  • Output quality
  • Efficiency
  • Overall effectiveness
  • Strengths and weaknesses

In other words: not just “Which output looks nicer?”, but how each agent thought, moved, and decided its way through the task.
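
Our Comparative Analysis Agent runs inside the same agent framework, but its core is simple to sketch. In the snippet below, the judge client, the model ID, and the run-log format are illustrative assumptions rather than our exact implementation:

```python
# Illustrative sketch of the comparison request the Comparative Analysis Agent works from.
# The judge client, model ID, and run-log format are assumptions, not our exact implementation.
RUBRIC = [
    "Execution approach",
    "Output quality",
    "Efficiency",
    "Overall effectiveness",
    "Strengths and weaknesses",
]

def describe_run(task_prompt: str, report: str, run_log: str) -> str:
    """Bundle the three items each run is judged on into one block of text."""
    return (
        f"Original task prompt:\n{task_prompt}\n\n"
        f"Final report produced:\n{report}\n\n"
        f"Run log (every action, thought, decision, and tool call):\n{run_log}"
    )

def compare_runs(judge_client, task_prompt: str, run_a: dict, run_b: dict) -> str:
    """Ask the judge model to contrast how two runs executed the same task."""
    criteria = "\n".join(f"- {c}" for c in RUBRIC)
    prompt = (
        "Compare how each agent executed the task, not just its final text.\n\n"
        "=== RUN A ===\n" + describe_run(task_prompt, run_a["report"], run_a["log"]) + "\n\n"
        "=== RUN B ===\n" + describe_run(task_prompt, run_b["report"], run_b["log"]) + "\n\n"
        f"Score and comment on each of the following dimensions:\n{criteria}"
    )
    response = judge_client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model ID
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```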

Results at a glance

Here’s the high-level summary of what came back:

| Metric                  | AWS Agent                   | Google Agent           | Winner    |
|-------------------------|-----------------------------|------------------------|-----------|
| Documents analyzed      | 1,087 (100%)                | 952 (87.6%)            | AWS       |
| Categories created      | 15 major categories         | 19 major categories    | Tie       |
| Report completeness     | Complete, with verification | Truncated / incomplete | AWS       |
| Self-verification       | Yes – proactive             | No                     | AWS       |
| Execution approach      | Direct execution            | Plan-based             | Different |
| Output quality (score)  | 9.5 / 10                    | 9.0 / 10               | AWS       |

Both agents produced great results, but Claude 4.5 in the Google environment seemed to prioritize efficiency, while the AWS version immediately put its effort toward exhaustive analysis.

It is important to note that this experiment is a report on observed behavioral differences under specific testing conditions. The results should not be interpreted as a blanket condemnation of either cloud solution; both providers offer robust, high-performing AI environments. The goal of this analysis is simply to highlight that provider choice introduces subtle, yet real, behavioral variables that builders must account for.

The most interesting part wasn’t the scores. It was the philosophy each agent seemed to adopt. AWS jumped right in, evaluated, and iterated. Google planned, and then systematically executed that plan. It’s analogous to a single brain having different personalities based on the environment.

How the AWS agent behaved

Provider: AWS
Overall philosophy: “Act first, verify later, correct as needed.”

Step-by-step behavior:

  • Called the tool search_documents with a limit of 1000.
  • Analyzed the returned metadata.
  • Created a comprehensive report based on those documents.
  • Performed a self-check to verify completeness.
  • Detected a gap: there were 87 documents that hadn’t been accounted for.
  • Retrieved the remaining documents.
  • Updated and expanded the report to cover 100% of the repository.

Key traits:

  • Bias toward action: It didn’t over-plan; it started working.
  • Self-verification: It explicitly checked its own work.
  • Recovery behavior: When it found missing documents, it fixed the gap without being asked.

To sum up Claude 4.5 on AWS: the agent dove right into the task, evaluated its work, noticed it had not fully completed the user’s request, and self-corrected to analyze the rest of the information.
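
Translated into code, the loop looks roughly like the sketch below. search_documents is the tool named in the run logs; the other helpers on the tools object are hypothetical stand-ins for our framework’s tooling:

```python
# Rough sketch of the act-verify-correct loop the AWS agent exhibited.
# search_documents is the tool named in the run logs; total_document_count,
# build_report, and extend_report are hypothetical helpers for illustration.

def run_exhaustive_analysis(tools):
    # 1. Act first: pull a large batch of documents and start working.
    docs = tools.search_documents(limit=1000)
    report = tools.build_report(docs)

    # 2. Self-verify: compare what was analyzed against what actually exists.
    total = tools.total_document_count()   # e.g. 1,087 in our project
    missing = total - len(docs)            # e.g. the 87-document gap

    # 3. Recover: fetch anything that was missed and fold it into the report.
    if missing > 0:
        remaining = tools.search_documents(limit=missing, offset=len(docs))
        report = tools.extend_report(report, remaining)

    return report
```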

How the Google Vertex agent behaved

Provider: Google Vertex
Overall philosophy: “Plan thoroughly, then execute systematically.”

Step-by-step behavior:

  • Called the tool search_documents with a limit of 1000.
  • Designed a structured 4-phase plan:
    • Phase 1: Analyze document types and patterns
    • Phase 2: Categorize into logical groups
    • Phase 3: Analyze content and metadata
    • Phase 4: Create a comprehensive report
  • Executed the phases in order.
  • Generated a report.

Key traits:

  • Plan-heavy: It spent more effort up front outlining how it would work.
  • Structured execution: The output reflected that plan – clear phases, logical progression.
  • Less robust recovery: It did not proactively verify whether it had seen every document and did not compensate for the 87 missing ones.

The result: a good report with strong structure and more categories (19 vs. 15), but a less complete view of what documents existed in the project’s repository.
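
For contrast, the Vertex agent’s run reads like the plan-then-execute sketch below. The phase names mirror the agent’s own plan; the helper functions are hypothetical:

```python
# Rough sketch of the plan-then-execute pattern the Vertex agent exhibited.
# The phase names mirror the agent's own plan; the helpers are hypothetical.

def run_planned_analysis(tools):
    docs = tools.search_documents(limit=1000)            # no later check for documents beyond this limit

    patterns = tools.analyze_patterns(docs)               # Phase 1: document types and patterns
    categories = tools.categorize(docs, patterns)         # Phase 2: logical groups
    enriched = tools.analyze_content(docs, categories)    # Phase 3: content and metadata
    return tools.build_report(enriched)                   # Phase 4: comprehensive report
```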

Follow-up tests

While this one test was eye-opening, we didn’t stop after a single iteration. To truly understand the differences, we ran the same test two more times, giving us three analyses of the same task on three separate occasions with identical setups. The consistent findings across all three tests were:

  • AWS uses more tokens in both its intermediate work and its final output (a quick way to measure this is sketched after this list).
  • AWS optimizes for stakeholder presentations and strategic decision-making; Google optimizes for analytical efficiency and data consumption.
  • The models showed different tool usage patterns (choosing what tools to use and when to use them).
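
The token-usage finding is straightforward to quantify because the Messages API reports usage metadata (input_tokens and output_tokens) on every response. A rough accounting sketch, assuming you collect each run’s responses:

```python
# Sketch: summing the usage metadata the Messages API attaches to each response.
# The response lists referenced in the comments below stand in for whatever you
# collect during a run; they are not real variables from our pipeline.

def total_tokens(responses) -> dict:
    """Aggregate input/output token counts across all model calls in one run."""
    return {
        "input_tokens": sum(r.usage.input_tokens for r in responses),
        "output_tokens": sum(r.usage.output_tokens for r in responses),
    }

# Per run:
#   aws_usage = total_tokens(aws_run_responses)
#   vertex_usage = total_tokens(vertex_run_responses)
```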

Why this matters

On paper, both agents are using “Claude 4.5.” But in practice:

  • The AWS version behaved like a doer.

  • The Google Vertex version behaved like a planner.

If you’re building AI products, this has implications.

1. “Same model” ≠ “same behavior”

You can’t assume that running “Model X” on cloud provider A vs cloud provider B will be behaviorally identical.

Even small differences in:

  • Default system prompts
  • Safety configuration
  • Inference settings
  • Provider-side wrappers

…can create meaningful divergence in how an agent explores, verifies, and completes work.

2. Evaluating outputs is not enough

If I had only looked at the final reports, I would have concluded:

  • “Both are good.”
  • “AWS seems a bit more verbose.”

But when you look at the run logs and have an analysis agent compare the process, you see:

  • Who is more likely to self-check their work
  • Who prioritizes efficiency and planning
  • Who leans toward longer, more expansive output

Both approaches have pros and cons. What it ultimately comes down to is the specifics of the problem you are trying to solve.
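
Concretely, those process-level signals come from the run logs rather than the report text. A sketch of the kind of metrics we pull out, with an illustrative log schema (entries carrying a type and a tool name):

```python
# Sketch: deriving behavioral metrics from a run log instead of the final report.
# The log schema (dicts with "type" and "tool" keys) is illustrative.

def behavioral_metrics(run_log: list[dict], expected_docs: int, analyzed_docs: int) -> dict:
    tool_calls = [entry for entry in run_log if entry.get("type") == "tool_call"]
    return {
        "coverage": analyzed_docs / expected_docs,      # did it cover all required inputs?
        "self_checked": any("verify" in entry.get("tool", "") for entry in tool_calls),
        "tool_call_count": len(tool_calls),             # rough proxy for how much it iterated
    }
```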

3. Provider choice becomes a product decision

Provider selection is no longer just about:

  • Latency
  • Cost
  • Regional availability

It’s also:

  • Efficiency vs. exhaustiveness
  • Tendency to plan vs. quickly execute
  • How well the model works with your agent framework

If your product promises things like “comprehensive analysis” or “fast and efficient workflows,” these behavioral nuances matter.

Practical takeaways if you’re building with agents

A few lessons from this experiment you can apply right away:

  1. Test per provider, not just per model.
    Don’t assume one model behaves the same everywhere. Design evaluations specifically for each provider you plan to support.

  2. Evaluate behavior, not just text quality.
    Look at:
    1. Did the agent cover all required inputs?
    2. Did it verify its own work?
    3. Did it recover when something was missing?

  3. Use agents to evaluate agents.
    A Comparative Analysis Agent with access to prompts, outputs, and run logs can:
    1. Surface subtle differences in strategy
    2. Highlight weaknesses you’d otherwise miss
    3. Scale your evaluation process
    4. Understand far more information and variables than a single human

  4. Design tasks that force depth.
    Open-ended, multi-step tasks (“analyze everything and summarize it comprehensively”) are better for revealing real-world behavior than toy prompts.

  5. Bake completeness checks into your agent patterns.
    No LLM or agent will be perfect. Always add a step for agents to verify their own work, or build a second, independent agent to verify the work of its “virtual colleague” (a minimal sketch follows below).
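
A minimal version of that completeness check, using hypothetical report fields:

```python
# Minimal sketch of a completeness check as an agent's final step.
# `expected_count`, `report.document_ids`, and `report.summary` are hypothetical.

def verify_completeness(report, expected_count: int) -> list[str]:
    """Return a list of problems; an empty list means the work passes."""
    issues = []
    if len(report.document_ids) < expected_count:
        missing = expected_count - len(report.document_ids)
        issues.append(f"{missing} documents were never analyzed")
    if not report.summary:
        issues.append("report is missing a summary")
    return issues

# If issues come back, loop the same agent on them or hand them to a second,
# independent reviewer agent.
```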

Closing thoughts

We often talk about models as if they’re static, uniform entities: “Use Claude,” “Use GPT,” “Use Model X.”

In reality, what you get in production is the combination of:

  • The model
  • The provider
  • The surrounding agent framework
  • The instructions and tools you wrap around it

This test confirmed our hypothesis that the same model can behave differently across providers. But the real lesson is this: integrating AI agents is about much more than the model itself. Many variables shape an agent’s performance, and you need the ability to isolate changes, run controlled experiments, and compare outcomes objectively. Neither provider was universally better, but the differences were real—and they have implications for how businesses approach AI. In our next post, we’ll break down exactly how we ran this experiment, the other tests we’re performing, and how we’re helping customers iterate faster and get to ROI sooner.