
Beyond Internal Reasoning: Why Enterprise AI Needs Macro-Reasoning

Discover how Macro-Reasoning transforms enterprise AI by enabling durable, scalable, and intelligent workflows beyond traditional context windows.


For the past two years, I’ve been building AI agent infrastructure at Vertesia. Not the kind where you string a few API calls together in a notebook and call it an agent — the kind where enterprises run autonomous workflows across millions of documents, for days at a time, with real money on the line.

We started with Temporal as the foundation — we knew from day one that if agents were going to run real enterprise workflows, they needed durability and scalability baked in, not bolted on. On top of that, we built a simple orchestrator: a loop that calls an LLM, executes tools, and repeats until the job is done. That worked. Then a customer needed the agent to process 4,000 contracts instead of one, and the context window exploded. So we built checkpointing. Then customers wanted specialist agents working in parallel — a financial analyst and a legal reviewer running simultaneously — so we built subagent coordination. Then we realized agents were drowning in system prompts full of instructions they didn’t need yet, so we built dynamic skill injection.
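
That first orchestrator is simple enough to sketch. The toy below is illustrative, not Vertesia's code: `fake_llm` stands in for a real model API, and the reply shape (a tool request or a final answer) is an assumed convention for the sketch.

```python
# Toy sketch of the basic orchestrator loop: call the model, execute the
# tool it asks for, feed the result back, repeat until the model is done.
# `fake_llm` is a stand-in for a real model API (hypothetical reply shape).

def fake_llm(messages, tools):
    # Pretend the model asks for one tool call, then produces a final answer.
    if not any(m["role"] == "tool" for m in messages):
        return {"type": "tool", "tool": "lookup", "arguments": {"key": "contract_42"}}
    return {"type": "final", "content": "done: " + messages[-1]["content"]}

def run_agent(task, tools, llm=fake_llm):
    messages = [{"role": "user", "content": task}]
    while True:
        reply = llm(messages, tools=list(tools))
        if reply["type"] == "final":
            return reply["content"]          # the model decided it is finished
        # Otherwise the model requested a tool call: execute it and feed the
        # result back so the loop can keep reasoning with fresh information.
        result = tools[reply["tool"]](**reply["arguments"])
        messages.append({"role": "assistant", "content": str(reply)})
        messages.append({"role": "tool", "content": str(result)})

print(run_agent("summarize", {"lookup": lambda key: f"data for {key}"}))
```

Everything that follows in this post grew out of the places where this loop breaks down.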

Each time, we were solving a different symptom. It took a while to see the pattern: we weren’t just patching an agent framework. We were building an operating system layer that lets an LLM think beyond its own context window.

I’m calling that pattern Macro-Reasoning, and I think it’s the missing piece in the current conversation about AI “thinking.”

The reasoning hype — and what it misses

Everyone is excited about models that “think.”

With reasoning models — OpenAI’s o-series, DeepSeek R1, Claude with extended thinking — the industry has gone all-in on what I’ll call Micro-Reasoning: the ability of an LLM to generate hidden chain-of-thought tokens to work through complex logic, math, or code before producing an answer. And it works. For bounded puzzles, it’s a genuine leap forward.

But solving a math problem in one API call is fundamentally different from running a three-week regulatory compliance audit across 12 jurisdictions.

Micro-reasoning is single-threaded. It happens inside a single request, within a finite context window, under a strict timeout. The model can plan a solution — but it cannot act on that plan while it thinks. If your data pipeline takes 10 minutes, the call times out. If the research yields 200 pages of context, the window explodes. The model was brilliant, and it still failed.

Yes, agentic tools have made real progress here. Claude Code chains LLM calls with tool use in a terminal loop. OpenAI’s Codex runs tasks in cloud containers. OpenClaw goes further with persistent memory and self-extensible skills on your local machine. Devin gives each agent a full persistent VM in the cloud.

But look at what’s still missing. Claude Code loses its reasoning state if you close the terminal — there’s no durable execution engine underneath. Codex Cloud containers are task-scoped: they run, produce a PR, and die. OpenClaw has memory that survives sessions, but the agent process itself is still ephemeral — and there’s no formal context management or checkpointing. Devin’s VM persistence is the closest to true durability, but it achieves it through infrastructure management rather than a principled workflow engine with step-level checkpointing and replay.

None of them actively manage the context window (token budgeting, image stripping, document reduction). None inject operational knowledge dynamically at the infrastructure level. None can checkpoint a reasoning process mid-step and resume it three days later with full fidelity. They’re all still, at their core, ephemeral processes that happen to be cleverer about what they persist to disk.

That works for writing code. It doesn’t work when your agent needs to process documents for three days, wait for a slow data pipeline, manage a context window that keeps filling up, and adapt its knowledge as the task evolves.

Even a single agent hitting these constraints is no longer micro-reasoning. The moment your agent needs to checkpoint its progress, sleep, wake up, manage its own memory, and dynamically load new skills mid-workflow — the reasoning has moved outside the model and into the infrastructure.

And you don’t need a million-document migration to get there. An agent editing a couple of Word documents benefits from document reduction — turning a bloated DOCX into a clean, instrumented representation the model can actually work with. An agent analyzing an Excel file benefits from dynamic skill injection — receiving spreadsheet-specific instructions precisely when it picks up that tool. Same for writing code, performing data analysis, building charts, creating long-form documents like RFP responses, or generating diagrams — each time the agent picks up a new tool, the engine loads the right operational knowledge for that task. These aren’t scale problems. They’re reasoning quality problems. The infrastructure makes a single agent working on a single deliverable smarter, not just faster.

That’s Macro-Reasoning. It works for simple use cases — editing a document, analyzing a spreadsheet, drafting an RFP response. And it scales all the way to a coordinated team across thousands of documents. The foundation is the same — durable, externalized thought. You don’t need to choose between a lightweight agent and a powerful one. The infrastructure is always there; the agent uses what it needs.

From internal monologue to operating system

Here’s the core idea. When a reasoning model micro-reasons, it simulates reality. It generates an internal thought: “I should break this contract analysis into three parts, query the database, and synthesize the results.” Good plan. But it’s still just generating text — it’s not actually doing any of those things.

Macro-reasoning externalizes the cognitive process. Instead of asking a model to construct a 50-step plan in one shot and hoping nothing goes wrong, you give the LLM an actual operating system to think with — what I’m calling an Agent OS. A single agent with durable state, context management, and dynamic skills is already macro-reasoning. It’s thinking with infrastructure, not just with tokens. Add subagents, and that single agent becomes a coordinator running a team.


Micro-Reasoning vs. Macro-Reasoning

The same primitives any operating system provides to programs — threads, IPC, shared memory, dynamic libraries — turn out to be exactly what an LLM agent needs to reason beyond a single context window. Build them together and you get a Macro-Reasoning Engine:

| OS Primitive | Agent Equivalent | What it enables |
| --- | --- | --- |
| System Calls | Tools (80+) | Acting on the world — documents, data, code, web, email |
| Threads | Subagents | Parallel specialist execution |
| IPC / Signals | Async steering | Mid-execution course correction |
| Shared Memory | Artifacts | Cross-agent knowledge sharing |
| Dynamic Libraries | Skills | Just-in-time operational knowledge |
| Process Scheduler | Durable orchestration | Workflow execution across minutes, hours, or days |
| Memory Management | Context management | Adaptive token budgeting |

This shifts the burden of “thinking” from the context window into the infrastructure. The model doesn’t need to hold everything in its head — it has a world to interact with.

At Vertesia, this is the architecture we’ve been building and running in production. Here’s what it changes — starting with what a single agent gains, then what happens when you scale to many.

1. Thought that survives across time

Here’s something most people don’t think about: every LLM-based agent today is ephemeral. The moment the process dies — a timeout, a crash, a deployment, your laptop going to sleep — all the reasoning is gone. The agent’s understanding of the problem, its intermediate findings, the 47 tool calls it already made — all lost. You start over from scratch.

This is fine for a chatbot. It’s a disaster for real work.

Durability means the agent’s reasoning state is persisted outside the process. If the server restarts at 3 AM, the agent picks up exactly where it left off at 3:01 AM. If it needs to wait for an external data pipeline that takes six hours, it doesn’t sit there burning compute — it goes to sleep and wakes up when the data arrives. If it’s been working for three days and has checkpointed its progress twelve times, every one of those checkpoints is recoverable.

We built this on Temporal from the start because we knew this was non-negotiable for enterprise. Workflows can run for minutes, hours, or days, and survive any infrastructure event in between. The agent doesn’t know or care — it just keeps reasoning.
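
Temporal provides this durability at the workflow level. The principle itself is simple enough to show in plain Python: persist the reasoning state after every step, so a restarted process resumes from its last checkpoint instead of from zero. The sketch below is a toy illustration of that principle, not Temporal and not Vertesia's implementation.

```python
import json
import os
import tempfile

# Toy illustration of durable reasoning state (not Temporal itself):
# each completed step is checkpointed to disk, so a crashed or restarted
# process resumes from the last checkpoint rather than from step 0.

def run_with_checkpoints(steps, state_file, crash_after=None):
    # Load prior progress if a checkpoint already exists.
    done = []
    if os.path.exists(state_file):
        with open(state_file) as f:
            done = json.load(f)
    for i, step in enumerate(steps):
        if i < len(done):
            continue                       # already completed before the restart
        if crash_after is not None and i == crash_after:
            raise RuntimeError("simulated crash")
        done.append(step())
        with open(state_file, "w") as f:   # checkpoint after every step
            json.dump(done, f)
    return done

path = os.path.join(tempfile.mkdtemp(), "state.json")
calls = []
steps = [lambda i=i: calls.append(i) or i for i in range(3)]
try:
    run_with_checkpoints(steps, path, crash_after=2)   # "3 AM restart"
except RuntimeError:
    pass
result = run_with_checkpoints(steps, path)             # picks up at step 2
assert result == [0, 1, 2]
assert calls == [0, 1, 2]   # steps 0 and 1 ran exactly once despite the crash
```

A real workflow engine adds replay, timers, and heartbeats on top of this idea, but the core contract is the same: completed work is never redone.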

We recently ran workflows that processed hundreds of thousands of documents, each requiring 10+ inference calls, with external data pipelines that could take minutes or hours to respond. In a traditional setup, that’s a prayer. In ours, it’s a Tuesday.

I’ve written about this in more detail in Beyond Context Windows and How We Built Truly Autonomous Agents. What macro-reasoning adds is the framing: durability isn’t just an infrastructure feature. It’s a reasoning primitive. Without it, complex thought is impossible — because complex thought takes time, and time kills ephemeral processes.

2. Knowledge that arrives just in time

Traditional agents suffer from a loading problem. Either you front-load a massive system prompt with every instruction the agent might need (“You are a helpful assistant who knows about contracts, finance, presentation design, document formatting, regulatory compliance, and…”) or you build a fragile RAG pipeline that stuffs the context window with retrieved content just in case.

Both approaches waste context. And wasted context means worse reasoning — the model is swimming through instructions it doesn’t need while potentially missing the ones it does.

Macro-reasoning treats operational knowledge as modular Skills — instruction sets that the engine injects dynamically based on what the agent is actually doing, not what it might do.

In practice: an agent starts working on a task. It selects a document management tool. At that moment — and only at that moment — it receives specific instructions on versioning, formatting constraints, metadata requirements. If it later decides to build a slide deck, it receives the presentation authoring skill instead. The knowledge the agent operates with evolves mid-workflow.

We currently run 80+ skills across document processing, data analysis, content generation, and more. The result: agents that are experts at whatever they’re doing right now, without paying the context cost of being experts at everything all the time.
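
In sketch form, skill injection is a lookup keyed on the tool the agent just selected. The `SKILLS` registry and `inject_skill` helper below are hypothetical names for illustration, not Vertesia's API.

```python
# Sketch of dynamic skill injection (illustrative names, not a real API):
# operational instructions are keyed by tool and injected into the agent's
# context only when that tool is actually selected, and only once.

SKILLS = {
    "edit_docx": "Preserve tracked changes; reference blocks by ID marker.",
    "build_slides": "One idea per slide; follow the corporate template.",
}

def inject_skill(context, tool, loaded):
    """Append the skill for `tool` the first time the agent selects it."""
    if tool in SKILLS and tool not in loaded:
        loaded.add(tool)
        context.append({"role": "system", "content": SKILLS[tool]})
    return context

context, loaded = [], set()
inject_skill(context, "edit_docx", loaded)
inject_skill(context, "edit_docx", loaded)   # second selection: no duplicate
assert len(context) == 1                     # exactly one skill in context
```

The payoff is that the system prompt stays small: the agent pays the token cost of a skill only while it is actually using the corresponding tool.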

Anthropic recently introduced Agent Skills as an open standard — essentially the same insight: procedural knowledge should be injected contextually, not front-loaded. We’ve been running this pattern in production for months, and the data confirms the intuition: focused context produces dramatically better tool usage than kitchen-sink prompts.

3. A memory that manages itself

Here’s a problem nobody talks about enough: even a single agent running a long workflow will fill up its context window. Every tool result, every intermediate finding, every image — it all accumulates. Left unchecked, you hit the wall and the agent either crashes or starts forgetting its early work.

Macro-reasoning requires active memory management — the same way an operating system manages RAM.

In our architecture, this happens at multiple layers:

  • Token budget monitoring. The system continuously tracks cumulative token usage against the model’s context window. When usage hits 75% capacity, it triggers a checkpoint — the agent summarizes its progress so far, clears the conversation history, and continues from the summary with full awareness of what it’s done. We’ve seen agents checkpoint hundreds of times across multi-day workflows with zero state loss.
  • Turn-based image stripping. Images are critical context when they’re recent, but they’re enormous token sinks. The system keeps images for the 3 most recent turns, then automatically replaces them with text placeholders. The agent remembers what the image showed; it just doesn’t carry the raw pixels forward forever.
  • Artifact reduction. When agents work with documents — especially complex formats like DOCX — the system doesn’t feed the raw file into the context window. It reduces the document to a simplified, instrumented representation: structural content with ID markers, stripped of styling metadata. The agent can read and edit the document precisely without burning tokens on font declarations and XML namespaces.
  • Large text truncation. During checkpointing, text blocks are capped at ~8,000 tokens. Binary data is either serialized efficiently or stripped with placeholders. The goal is always the same: preserve what the agent needs to know, discard how it learned it.
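
The first of these layers, the 75% budget trigger, can be sketched as follows. The 200K window, the 4-characters-per-token heuristic, and the function names are illustrative assumptions, not the production values.

```python
# Sketch of the token-budget checkpoint trigger (illustrative values):
# when cumulative usage crosses 75% of the context window, the history is
# collapsed into a model-written summary and the agent continues from it.

CONTEXT_WINDOW = 200_000   # assumed window size for the sketch
THRESHOLD = 0.75

def count_tokens(messages):
    # Crude proxy: ~4 characters per token. A real system would use the
    # model's own tokenizer here.
    return sum(len(m["content"]) for m in messages) // 4

def maybe_checkpoint(messages, summarize):
    if count_tokens(messages) < THRESHOLD * CONTEXT_WINDOW:
        return messages                       # still under budget
    summary = summarize(messages)             # model-written progress summary
    return [{"role": "system", "content": "Progress so far: " + summary}]

short = [{"role": "user", "content": "hello"}]
assert maybe_checkpoint(short, lambda ms: "s") == short   # untouched
```

Image stripping and artifact reduction plug into the same place: each one is a policy that decides what the agent needs to remember versus how it learned it.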

This isn’t glamorous work. But it’s the difference between an agent that runs for 15 minutes and one that runs for 15 days. Micro-reasoning doesn’t need memory management because it lives and dies in a single request. Macro-reasoning doesn’t have that luxury — it needs an immune system for context bloat.

4. Scaling up: the conference room, not the solo genius

Everything above works for a single agent. But complex enterprise problems are rarely solved by one brain — they’re solved by a specialized team arguing in a conference room.

Once you have macro-reasoning infrastructure for one agent, you can extend it to many. A coordinator agent doesn’t just sequentially run tools — it spawns specialized subagents with distinct mandates: a Financial Analyst to crunch the numbers, a Legal Reviewer to flag risk clauses, a Synthesizer to pull it all together. Each subagent is itself a full macro-reasoning agent — with its own durable state, context management, and skills.


The Macro-Reasoning Engine: from single agent to coordinated team

And instead of imagining a parallel process (which is what a reasoning model does internally when it “plans”), the system actually executes one:

  • True parallelism. These subagents execute simultaneously as independent durable workflows. They don’t share a context window — they share a workspace.
  • Shared blackboard. Subagents read and write to a unified artifact workspace as they work. The Financial Analyst’s preliminary findings are immediately available to the Legal Reviewer without either of them needing to fit the other’s full context in their window.
  • Async steering. Because the execution loop is non-blocking with heartbeat monitoring, the coordinator can watch its subtasks in real time. If a child agent starts going down a rabbit hole — burning tokens on an unproductive path — the coordinator sends it a signal to course-correct mid-execution. Think of it as a project manager who can tap a team member on the shoulder while they’re working, not just review their output after the fact.
  • Iterative convergence. The coordinator evaluates outputs against defined criteria. If the analysis is incomplete or contradictory, it triggers another round — cross-pollinating context between subagents before producing a final answer.
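
The coordination pattern above can be sketched with ordinary threads. This is a toy only: in production each subagent runs as an independent durable workflow, not a thread, and `coordinator` and the subagent callables are hypothetical names for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy sketch of the coordinator pattern (threads stand in for what are,
# in production, independent durable workflows). Subagents run in parallel
# and publish findings to a shared workspace (the "blackboard") rather
# than sharing a context window.

def coordinator(task, subagents):
    workspace = {}                            # shared artifact blackboard
    with ThreadPoolExecutor() as pool:
        futures = {
            name: pool.submit(agent, task, workspace)
            for name, agent in subagents.items()
        }
        for name, fut in futures.items():
            workspace[name] = fut.result()    # collect each specialist's output
    # Synthesis round: combine the specialists' findings into one report.
    return " | ".join(f"{k}: {v}" for k, v in sorted(workspace.items()))

report = coordinator("review contract", {
    "financial": lambda task, ws: f"numbers ok for '{task}'",
    "legal": lambda task, ws: f"clause 7 risky in '{task}'",
})
assert "financial" in report and "legal" in report
```

What the sketch cannot show is the durability underneath: because each subagent is its own workflow, one specialist crashing or stalling does not take the conference room down with it.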

For engineering leaders, this means you can throw genuinely complex, multi-domain problems at an agent system and get back structured, reviewed output — not a single model’s best guess.

Where this leads

Reasoning models are getting better fast. But pouring more compute into internal chain-of-thought won’t solve the structural constraints of enterprise work: tasks that span days, require multiple specialists, depend on slow external systems, and demand auditability at every step.

The industry is starting to converge on this insight. Salesforce calls their agent layer an “operating system.” Amazon’s AgentCore provides runtime, memory, and gateway primitives. LangGraph added durable state and persistence. Everyone is building pieces of it.

But the framing matters. These aren’t just engineering conveniences — they’re reasoning primitives. The ability to think across time, think in parallel, dynamically load knowledge, and actively manage memory isn’t an infrastructure feature. It’s a cognitive architecture. It’s the difference between a model that can solve a puzzle and a system that can run your business.

The future of enterprise agents isn’t about models thinking harder in a black box. It’s about giving them infrastructure where they can think out loud, in parallel, and across time.

 
