Built and shipped dimension.dev. Production system running across web, Slack, and iMessage.
- Platforms: Web · Slack · iMessage
- Role: Lead engineer, 10 months
- Integrations: 20+
01 — The Orchestration Graph
The boss agent operates as a reasoning orchestrator. It receives a user request, decides whether the work requires delegation or direct execution, and, if delegating, decomposes the request into independent tasks, each executed by a parallel sub-agent. Each sub-agent runs its own ReAct loop with scoped tools and context. The boss synthesizes all findings into the final user-facing output.
Flow:
User Request → Boss Agent (Claude Opus 4.6)
setup → compact → agent ↔ tools (ReAct loop)
"Is the work per item a tool call or a research chain?"
┌───────────────────┐          ┌──────────────────────────┐
│ Direct Execution  │          │ Delegation               │
│ Boss handles via  │          │ CREATE_TASKS +           │
│ tool calls        │          │ SPAWN_SUB_AGENT          │
└───────────────────┘          └──────────────────────────┘
↓
┌───────────┐   ┌───────────┐   ┌───────────┐
│ Sub-Agent │   │ Sub-Agent │   │ Sub-Agent │
│     1     │   │     2     │   │     3     │
└───────────┘   └───────────┘   └───────────┘
↓
Redis Scratchpad → Boss Synthesizes Output
(Shared state) (Docs, emails, PDFs)
Built on LangGraph with durable checkpointing on Postgres, Redis for fast cross-agent coordination, and NATS JetStream for reliable event-driven execution.
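The control flow above can be sketched as a plain loop — a toy stand-in for the actual LangGraph wiring. Node names mirror the flow; everything else is illustrative:

```python
def run_graph(state: dict, nodes: dict, max_iters: int = 10) -> dict:
    """Drive setup -> compact, then loop agent <-> tools until the agent
    stops requesting tool calls. Toy stand-in for the LangGraph graph."""
    state = nodes["setup"](state)
    state = nodes["compact"](state)
    for _ in range(max_iters):
        state, tool_calls = nodes["agent"](state)
        if not tool_calls:  # agent produced a final answer
            break
        state = nodes["tools"](state, tool_calls)
    return state
```

The real graph adds durable checkpoints between nodes, which is what makes interruption and resumption possible later.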
Live Example: The intelligence brief on Sycamore was produced by this system. The boss agent decomposed the request into 7 parallel research sub-agents — 5 team member deep-dives and 2 competitive landscape analyses — coordinated findings via scratchpad, and generated a styled PDF. Total execution: 4 minutes 53 seconds, $3.86 in API cost, 1.5M tokens processed.
02 — The Delegation Decision
The boss agent isn't a static router. It reasons about task complexity and picks one of three execution strategies:
01 — Direct Execution — Boss handles it
"Schedule 5 meetings with these details." Five parallel calendar API calls. Done. Sub-agent overhead would be wasteful — the work per item is a single tool call.
02 — Delegation — Parallel sub-agents
"Research 8 competitors for our market analysis." Each company needs web searches, product analysis, pricing research — an independent research chain. Boss creates 8 tasks, spawns 8 sub-agents in parallel.
03 — Explore First — Investigate, then decide
"Look into our Q3 numbers." Scope is unclear. Boss searches memory and documents first, figures out what's actually needed, then decides whether to handle directly or decompose.
The key question the boss asks itself: "Is the work per item trivial (a tool call) or substantial (a research chain)?"
When the user signals depth — "research thoroughly," "deep dive," "comprehensive analysis" — the system leans toward delegation even for fewer items.
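A rule-table caricature of that judgment — the real decision is made by the model's reasoning, not code; the signal phrases come from the text above, the function shape is illustrative:

```python
DEPTH_SIGNALS = ("research thoroughly", "deep dive", "comprehensive analysis")

def choose_strategy(request: str, per_item_work):
    """per_item_work: 'tool_call', 'research_chain', or None when the
    scope is still unclear (illustrative encoding of the boss's question)."""
    if per_item_work is None:
        return "explore"                  # investigate first, then decide
    if per_item_work == "research_chain":
        return "delegate"                 # substantial work per item
    if any(sig in request.lower() for sig in DEPTH_SIGNALS):
        return "delegate"                 # user asked for depth
    return "direct"                       # trivial work per item
```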
Sub-agents are research-only
They gather data, search, and return structured findings. The final user-facing output stays with the boss: documents, spreadsheets, presentations, PDFs, email bodies, notification text, summaries.
The boss (Opus 4.6) produces higher-quality synthesis, formatting, and tone than any sub-agent would. Delegate the research, own the output.
Task descriptions are the contract
Sub-agents can't see the conversation. They receive a self-contained task description with all relevant IDs, context, scope, and deliverable format. This forces clean decomposition. For multi-phase work, later tasks reference earlier task IDs and fetch results from a shared scratchpad instead of relaying context.
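One way to picture such a contract — field names and the example values are hypothetical; the real system passes a self-contained description string plus metadata:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskContract:
    """Everything a sub-agent will ever see about its job."""
    task_id: str
    description: str          # self-contained: IDs, context, scope
    deliverable_format: str   # e.g. "structured findings, markdown"
    depends_on: tuple = ()    # earlier task IDs, fetched from scratchpad

task = TaskContract(
    task_id="t2",
    description="Summarize pricing for competitor ACME (doc id: d-17).",
    deliverable_format="structured findings, markdown",
    depends_on=("t1",),       # phase-2 task referencing a phase-1 result
)
```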
03 — Lazy Integration Loading
Dimension connects to 20+ integrations — Gmail, Calendar, Slack, Linear, Notion, GitHub, Airtable, Dropbox, Google Drive, Granola, and more. Loading all tools upfront would bloat the context window. The solution is a three-layer lazy loading architecture.
L1 — Setup-Time — On conversation start
Fetch the user's connected integrations. Pre-load tools from active artifacts (open presentation → load slides tools). Carry forward the previous run's integrations for prompt cache hits. Platform-aware: web preserves the cache; iMessage resets each run to prevent bloat.
L2 — Mid-Conversation — Agent discovers needs
Agent reasons "I need to post to Slack" → calls ADD_INTEGRATIONS(["slack"]) → validates against user's connected integrations → tool list rebuilt, model re-bound → next ReAct iteration has Slack tools available.
L3 — Cache Optimization — Cross-iteration and cross-run
Tools sorted alphabetically for deterministic ordering. Cache breakpoint placed on last tool definition. Same tool set across iterations = prompt cache hit. Integrations persist across runs = cross-run cache hits.
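A sketch of the ordering and breakpoint step — the `cache_control` marker follows Anthropic's prompt-caching convention, but treat the exact shape here as illustrative:

```python
def bind_tools(tool_defs: dict) -> list:
    """Alphabetical order makes the serialized tool prefix byte-stable
    across iterations; the cache breakpoint goes on the last definition
    so the whole prefix is cacheable."""
    ordered = [dict(tool_defs[name]) for name in sorted(tool_defs)]
    if ordered:
        ordered[-1]["cache_control"] = {"type": "ephemeral"}
    return ordered
```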
Result: The agent operates across a massive integration surface while keeping context window usage tight and prompt cache hit rates high. Typically 3–5 integration sets loaded per conversation.
04 — The Scratchpad
Sub-agents don't communicate with each other. They don't share messages. They don't coordinate. This is by design.
Boss Agent
(Creates tasks, reads answers)
↓ creates ↑ reads
┌────┐ ┌────┐ ┌────┐
│SA 1│ │SA 2│ │SA 3│
└────┘ └────┘ └────┘
↓ fetch ↑ write
┌──────────────────────────┐
│     Redis Scratchpad     │
│ task:{thread}:{run}:{id} │
│        7-day TTL         │
└──────────────────────────┘
Lifecycle
- Boss creates tasks → scratchpad stores them as pending
- Each sub-agent fetches its task, marks running, executes independently
- Sub-agent writes its answer back to scratchpad
- Boss reads completed answers and synthesizes the final output
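The lifecycle maps onto a handful of key-value operations. A sketch against any client with redis-py's `set`/`get` shape — the key format and TTL come from the diagram, the rest is illustrative:

```python
import json

TTL_SECONDS = 7 * 24 * 3600  # the 7-day TTL from the diagram

def task_key(thread: str, run: str, task_id: str) -> str:
    return f"task:{thread}:{run}:{task_id}"

def create_task(store, thread, run, task_id, description):
    """Boss writes the task as pending."""
    key = task_key(thread, run, task_id)
    store.set(key, json.dumps({"status": "pending", "description": description}),
              ex=TTL_SECONDS)
    return key

def transition(store, key, status, answer=None):
    """Sub-agent lifecycle: pending -> running -> done (with answer)."""
    task = json.loads(store.get(key))
    task["status"] = status
    if answer is not None:
        task["answer"] = answer
    store.set(key, json.dumps(task), ex=TTL_SECONDS)
```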
Why scratchpad over message passing?
- No coordination needed — Sub-agents are embarrassingly parallel — no shared channels, no consensus protocol.
- Sub-millisecond reads — The boss can check status without blocking anything.
- Clean interfaces — Tasks are self-contained contracts — the description IS the interface.
- State isolation — Each sub-agent has its own internal message history, completely separate from the boss's conversation.
Multi-Phase Execution: Completed tasks from the same run are visible to later sub-agents. Phase 2 tasks can reference phase 1 outputs by task ID via GET_TASK, enabling staged workflows without the boss having to copy-paste findings between phases.
05 — Human-in-the-Loop as a First-Class Primitive
The system maintains a critical tool mapping per integration — send email, post to Slack, create calendar event, delete a Linear issue. These are the actions with real-world consequences. Users configure trust at three levels:
| Level | Example | Effect |
|---|---|---|
| Disable HIL entirely | "I trust the agent" | Full autonomy, no pauses |
| Allowlist integrations | "Trust all Gmail" | Every Gmail tool skips review |
| Allowlist specific tools | "Trust email send only" | Send skips review, delete still pauses |
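Resolving those three levels is a short fall-through check. A sketch with illustrative config keys and a hypothetical critical-tool map:

```python
# Hypothetical critical-tool map; the real mappings are per integration.
CRITICAL_TOOLS = {
    "gmail": {"send_email", "delete_draft"},
    "linear": {"delete_issue"},
}

def requires_review(integration: str, tool: str, trust: dict) -> bool:
    """Fall through the three trust levels, broadest first."""
    if trust.get("hil_disabled"):                          # full autonomy
        return False
    if integration in trust.get("trusted_integrations", ()):
        return False
    if tool in trust.get("trusted_tools", ()):
        return False
    return tool in CRITICAL_TOOLS.get(integration, ())
```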
What the user can do
When the agent proposes a critical action, the full state — proposed parameters, integration context, run status — is persisted to the database. The user sees what the agent wants to do and can:
- Approve — Execute as proposed
- Edit Parameters — Change recipient, reword message, adjust date
- Reject + Feedback — Agent adapts on next iteration
Freeform addressing
Users can interact with the HIL components in the UI or respond in natural language on an interrupted thread. An LLM classifies their intent and maps it to decisions.
Architecture: HIL isn't a wrapper around the agent loop. It's a graph-level interrupt primitive built into LangGraph's checkpoint system. State persists to the database across pauses. Runs can be interrupted for days or weeks — and resume cleanly with full execution context. No lost work, no replay.
06 — Real-Time Streaming Architecture
The agent streams five event types: ready (graph started), messages (LLM token chunks), custom (tool calls, sub-agent events), interrupt (HIL approval requests), and terminal events (done, error, canceled).
Direct Path — Slack & iMessage
The graph runs inside the request handler. SSE chunks stream straight to the client. Simple, low-latency, no intermediate infrastructure.
Indirect Path — Web
- /stream publishes the run request to NATS JetStream
- The handler returns an SSE response subscribed to a Redis Stream
- A consumer runs the graph, appending events via XADD
- The browser reads via XREAD BLOCK — live
Why the indirection? Browser disconnects don't kill the agent. Close your tab mid-run, the graph keeps running in the NATS consumer. The Redis Stream keeps accumulating frames. Come back, hit /view, and pick up where you left off.
Replay via Redis Streams
SSE events are stored in Redis Streams (an append-only log with cursor-based reads), not Redis pub/sub (fire-and-forget). This gives the system native replay support.
Every SSE frame gets a Redis entry ID, sent to the browser as the SSE id: field. On reconnection, the client passes back its last_event_id, and the server calls XREAD from that cursor. All missed events replay instantly, then the connection transitions to live streaming.
This survives tab switches, device sleep, and intermittent network drops — without any custom replay logic. Redis Streams and the SSE spec handle it natively.
TTL stratification keeps Redis clean: streams get a 1-hour safety-net TTL during execution, shortened to a 2-minute buffer after completion. Enough time for a client to reconnect and drain remaining events, then the stream auto-expires.
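The replay contract can be modeled in a few lines. A toy in-memory stand-in for the stream plus the reconnect path — the real system uses Redis `XADD`/`XREAD` and the SSE `Last-Event-ID` header:

```python
class MiniStream:
    """Append-only log with cursor reads — the two properties the
    replay path relies on."""
    def __init__(self):
        self.entries = []                  # [(entry_id, frame), ...]

    def append(self, frame):               # stands in for XADD
        entry_id = f"{len(self.entries) + 1}-0"
        self.entries.append((entry_id, frame))
        return entry_id

    def read_after(self, last_id):         # stands in for cursor XREAD
        if last_id is None:
            return list(self.entries)
        ids = [eid for eid, _ in self.entries]
        return self.entries[ids.index(last_id) + 1:]

def resume(stream, last_event_id):
    """On SSE reconnect: replay everything past the client's cursor,
    then the connection transitions to live streaming."""
    return stream.read_after(last_event_id)
```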
Dual Redis pattern
| Pattern | Used for | Why |
|---|---|---|
| Redis Streams | SSE event delivery | Durable, ordered, replayable |
| Redis Pub/Sub | Cancel signals | Instant, fire-and-forget |
Thread lifecycle safety: If a consumer crashes, orphan detection on the next request identifies stale busy threads (terminal run status + no active consumer), auto-resolves them, and the user proceeds normally.
07 — Context Window as a Managed Resource
Conversations don't end. A user might run a thread across weeks — dozens of runs, hundreds of tool calls, thousands of messages. Without active management, the context window fills up and the agent either crashes or degrades.
During Run — Mid-Run Compaction
Before every ReAct iteration, the system estimates current token usage. If it crosses 80% of the model's context window, a cheap summarizer (Gemini Flash) condenses older messages into a structured summary. Oldest messages are pruned, newest complete human turns preserved.
- Sonnet (256K): ~205K threshold
- Opus (1M): ~819K threshold
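The check itself is small. A sketch where the token estimator and summarizer are injected — the 80% fraction matches the thresholds above (0.8 × 256K ≈ 205K, 0.8 × 1M ≈ 819K); the function shape is illustrative:

```python
COMPACTION_FRACTION = 0.8

def maybe_compact(messages, estimate_tokens, context_window, summarize,
                  keep_tail=4):
    """Run before each ReAct iteration. Over threshold: condense older
    messages into one summary turn, keep the newest turns verbatim."""
    used = sum(estimate_tokens(m) for m in messages)
    if used < COMPACTION_FRACTION * context_window or len(messages) <= keep_tail:
        return messages
    old, tail = messages[:-keep_tail], messages[-keep_tail:]
    summary = summarize(old)   # cheap model, e.g. a Flash-class summarizer
    return [{"role": "system", "content": summary}] + tail
```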
After Run — End-Run Compaction
After a run completes, the post-processing consumer checks whether the thread exceeds a platform-specific threshold. The thresholds differ because the platforms differ.
- Web: 148K
- Slack: 64K
- iMessage: 64K
Concurrent safety
End-run compaction acquires a Redis lock before modifying thread state. If another run is already active, the compaction artifact (summary + message IDs to prune) gets queued in Redis rather than applied immediately. The queued artifact is flushed on the next /stream call before the new run starts, ensuring no data races between compaction and execution.
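A sketch of the lock-or-queue decision against any client with redis-py's `set(nx=..., ex=...)`, `delete`, and `rpush` shape — key names and the 60-second lock TTL are illustrative:

```python
import json

def end_run_compact(store, thread_id, artifact, apply_artifact):
    """Apply the compaction artifact only if we win the per-thread lock;
    otherwise queue it for the next /stream call to flush."""
    lock_key = f"compaction:lock:{thread_id}"
    if store.set(lock_key, "1", nx=True, ex=60):
        try:
            apply_artifact(artifact)        # prune messages, insert summary
            return "applied"
        finally:
            store.delete(lock_key)
    store.rpush(f"compaction:queue:{thread_id}", json.dumps(artifact))
    return "queued"
```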
Structured summarization
The summary isn't a narrative. It captures: User's Goal, Completed Actions, Failed/Blocked Actions, User Decisions & Preferences, Pending Requests, Key Context. This preserves what the agent needs for the next run — an operational snapshot, not a story.
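Rendered as a fixed-section snapshot — the section names come from the list above; the render helper is illustrative:

```python
SUMMARY_SECTIONS = (
    "User's Goal",
    "Completed Actions",
    "Failed/Blocked Actions",
    "User Decisions & Preferences",
    "Pending Requests",
    "Key Context",
)

def render_summary(findings: dict) -> str:
    """Fixed sections in a fixed order: an operational snapshot the
    next run can rely on, not free-form narrative."""
    return "\n\n".join(
        f"## {section}\n{findings.get(section, 'None')}"
        for section in SUMMARY_SECTIONS
    )
```

A stable section order also keeps the compacted prefix byte-identical across runs, which helps the prompt cache hits mentioned below.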
The Result: Conversations run indefinitely without degradation. Token costs stay bounded. Prompt cache hit rates stay high because the compressed message prefix stays stable across runs.
08 — Design Principles
01 — Hierarchical delegation with a reasoning orchestrator
The boss agent doesn't route by keyword matching or intent classification. It reasons about task complexity, decides execution strategy dynamically, and writes self-contained task contracts for sub-agents. The delegation decision — direct vs. delegate vs. explore — is a judgment call made by the orchestrator on every request.
02 — Boss-controlled execution with shared state, not message passing
The orchestrator has full control over execution order — spawning sub-agents in parallel when tasks are independent, or sequentially when outputs feed into downstream work. No coordination protocol between agents, no consensus mechanism, no message passing. The Redis scratchpad provides shared state without inter-agent communication complexity. Each agent works independently; the boss decides when to launch them and synthesizes the results.
03 — Lazy capability acquisition
Agents discover what tools they need through reasoning and search, not through pre-loaded capability sets. A conversation starts with minimal tools and acquires integrations as the work demands them. The system scales to dozens of integrations without linear context window cost.
04 — Human oversight as a system primitive
Critical actions interrupt execution, persist full state to the database, allow parameter editing, support partial and freeform addressing, and resume cleanly across hours or days. Trust is earned per-action, not granted per-session.
05 — Finite context, infinite conversations
Two-tier compaction (mid-run safety, end-run optimization), deterministic tool ordering for prompt cache hits, structured summarization, platform-aware thresholds. The context window is treated as an explicitly managed resource — not an assumed-infinite buffer.
This architecture powers dimension.dev's core conversational agent — used and loved by thousands of users daily across web, Slack, and iMessage.