Built and shipped dimension.dev. Production system running across web, Slack, and iMessage.
- Platforms: Web · Slack · iMessage
- Role: Lead engineer, 10 months
- Integrations: 20+
01 — The Orchestration Graph
The boss agent operates as a reasoning orchestrator. It receives a user request, decides whether the work requires delegation or direct execution, and, if delegating, decomposes the request into independent tasks, each executed by a parallel sub-agent. Each sub-agent runs its own ReAct loop with scoped tools and context. The boss synthesizes all findings into the final user-facing output.
Flow:
User Request → Boss Agent (Claude Opus 4.6)
setup → compact → agent ↔ tools (ReAct loop)
"Is the work per item a tool call or a research chain?"
┌───────────────────┐          ┌──────────────────────────┐
│ Direct Execution  │          │ Delegation               │
│ Boss handles via  │          │ CREATE_TASKS +           │
│ tool calls        │          │ SPAWN_SUB_AGENT          │
└───────────────────┘          └──────────────────────────┘
↓
┌───────────┐   ┌───────────┐   ┌───────────┐
│ Sub-Agent │   │ Sub-Agent │   │ Sub-Agent │
│     1     │   │     2     │   │     3     │
└───────────┘   └───────────┘   └───────────┘
↓
Redis Scratchpad → Boss Synthesizes Output
(Shared state) (Docs, emails, PDFs)
Built on LangGraph with durable checkpointing on Postgres, Redis for fast cross-agent coordination, and NATS JetStream for reliable event-driven execution.
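The control flow above can be sketched as a plain loop — a toy stand-in for the actual LangGraph wiring. Node names mirror the flow; everything else is illustrative:

```python
def run_graph(state: dict, nodes: dict, max_iters: int = 10) -> dict:
    """Drive setup -> compact, then loop agent <-> tools until the agent
    stops requesting tool calls. Toy stand-in for the LangGraph graph."""
    state = nodes["setup"](state)
    state = nodes["compact"](state)
    for _ in range(max_iters):
        state, tool_calls = nodes["agent"](state)
        if not tool_calls:  # agent produced a final answer
            break
        state = nodes["tools"](state, tool_calls)
    return state
```

The real graph adds durable checkpoints between nodes, which is what makes interruption and resumption possible later.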
Live Example: The intelligence brief on Sycamore was produced by this system. The boss agent decomposed the request into 7 parallel research sub-agents — 5 team member deep-dives and 2 competitive landscape analyses — coordinated findings via scratchpad, and generated a styled PDF. Total execution: 4 minutes 53 seconds, $3.86 in API cost, 1.5M tokens processed.
02 — The Delegation Decision
The boss agent isn't a static router. It reasons about task complexity and picks one of three execution strategies:
01 — Direct Execution — Boss handles it
"Schedule 5 meetings with these details." Five parallel calendar API calls. Done. Sub-agent overhead would be wasteful — the work per item is a single tool call.
02 — Delegation — Parallel sub-agents
"Research 8 competitors for our market analysis." Each company needs web searches, product analysis, pricing research — an independent research chain. Boss creates 8 tasks, spawns 8 sub-agents in parallel.
03 — Explore First — Investigate, then decide
"Look into our Q3 numbers." Scope is unclear. Boss searches memory and documents first, figures out what's actually needed, then decides whether to handle directly or decompose.
The key question the boss asks itself: "Is the work per item trivial (a tool call) or substantial (a research chain)?"
When the user signals depth — "research thoroughly," "deep dive," "comprehensive analysis" — the system leans toward delegation even for fewer items.
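A rule-table caricature of that judgment — the real decision is made by the model's reasoning, not code; the signal phrases come from the text above, the function shape is illustrative:

```python
DEPTH_SIGNALS = ("research thoroughly", "deep dive", "comprehensive analysis")

def choose_strategy(request: str, per_item_work):
    """per_item_work: 'tool_call', 'research_chain', or None when the
    scope is still unclear (illustrative encoding of the boss's question)."""
    if per_item_work is None:
        return "explore"                  # investigate first, then decide
    if per_item_work == "research_chain":
        return "delegate"                 # substantial work per item
    if any(sig in request.lower() for sig in DEPTH_SIGNALS):
        return "delegate"                 # user asked for depth
    return "direct"                       # trivial work per item
```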
Sub-agents are research-only
They gather data, search, and return structured findings. The final user-facing output stays with the boss: documents, spreadsheets, presentations, PDFs, email bodies, notification text, summaries.
The boss (Opus 4.6) produces higher-quality synthesis, formatting, and tone than any sub-agent would. Delegate the research, own the output.
Task descriptions are the contract
Sub-agents can't see the conversation. They receive a self-contained task description with all relevant IDs, context, scope, and deliverable format. This forces clean decomposition. For multi-phase work, later tasks reference earlier task IDs and fetch results from a shared scratchpad instead of relaying context.
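One way to picture such a contract — field names and the example values are hypothetical; the real system passes a self-contained description string plus metadata:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskContract:
    """Everything a sub-agent will ever see about its job."""
    task_id: str
    description: str          # self-contained: IDs, context, scope
    deliverable_format: str   # e.g. "structured findings, markdown"
    depends_on: tuple = ()    # earlier task IDs, fetched from scratchpad

task = TaskContract(
    task_id="t2",
    description="Summarize pricing for competitor ACME (doc id: d-17).",
    deliverable_format="structured findings, markdown",
    depends_on=("t1",),       # phase-2 task referencing a phase-1 result
)
```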
03 — Lazy Integration Loading
Dimension connects to 20+ integrations — Gmail, Calendar, Slack, Linear, Notion, GitHub, Airtable, Dropbox, Google Drive, Granola, and more. Loading all tools upfront would bloat the context window. The solution is a three-layer lazy loading architecture.
L1 — Setup-Time — On conversation start
Fetch the user's connected integrations. Pre-load tools from active artifacts (open presentation → load slides tools). Carry forward the previous run's integrations for prompt cache hits. Platform-aware: web preserves the cache; iMessage resets each run to prevent bloat.
L2 — Mid-Conversation — Agent discovers needs
Agent reasons "I need to post to Slack" → calls ADD_INTEGRATIONS(["slack"]) → validates against user's connected integrations → tool list rebuilt, model re-bound → next ReAct iteration has Slack tools available.
L3 — Cache Optimization — Cross-iteration and cross-run
Tools sorted alphabetically for deterministic ordering. Cache breakpoint placed on last tool definition. Same tool set across iterations = prompt cache hit. Integrations persist across runs = cross-run cache hits.
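A sketch of the ordering and breakpoint step — the `cache_control` marker follows Anthropic's prompt-caching convention, but treat the exact shape here as illustrative:

```python
def bind_tools(tool_defs: dict) -> list:
    """Alphabetical order makes the serialized tool prefix byte-stable
    across iterations; the cache breakpoint goes on the last definition
    so the whole prefix is cacheable."""
    ordered = [dict(tool_defs[name]) for name in sorted(tool_defs)]
    if ordered:
        ordered[-1]["cache_control"] = {"type": "ephemeral"}
    return ordered
```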
Result: The agent operates across a massive integration surface while keeping context window usage tight and prompt cache hit rates high. Typically 3–5 integration sets loaded per conversation.
04 — The Scratchpad
Sub-agents don't communicate with each other. They don't share messages. They don't coordinate. This is by design.
Boss Agent
(Creates tasks, reads answers)
↓ creates ↑ reads
┌────┐ ┌────┐ ┌────┐
│SA 1│ │SA 2│ │SA 3│
└────┘ └────┘ └────┘
↓ fetch ↑ write
┌──────────────────────────┐
│     Redis Scratchpad     │
│ task:{thread}:{run}:{id} │
│        7-day TTL         │
└──────────────────────────┘
Lifecycle
- Boss creates tasks → scratchpad stores them as pending
- Each sub-agent fetches its task, marks running, executes independently
- Sub-agent writes its answer back to scratchpad
- Boss reads completed answers and synthesizes the final output
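The lifecycle maps onto a handful of key-value operations. A sketch against any client with redis-py's `set`/`get` shape — the key format and TTL come from the diagram, the rest is illustrative:

```python
import json

TTL_SECONDS = 7 * 24 * 3600  # the 7-day TTL from the diagram

def task_key(thread: str, run: str, task_id: str) -> str:
    return f"task:{thread}:{run}:{task_id}"

def create_task(store, thread, run, task_id, description):
    """Boss writes the task as pending."""
    key = task_key(thread, run, task_id)
    store.set(key, json.dumps({"status": "pending", "description": description}),
              ex=TTL_SECONDS)
    return key

def transition(store, key, status, answer=None):
    """Sub-agent lifecycle: pending -> running -> done (with answer)."""
    task = json.loads(store.get(key))
    task["status"] = status
    if answer is not None:
        task["answer"] = answer
    store.set(key, json.dumps(task), ex=TTL_SECONDS)
```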
Why scratchpad over message passing?
- No coordination needed — Sub-agents are embarrassingly parallel — no shared channels, no consensus protocol.
- Sub-millisecond reads — The boss can check status without blocking anything.
- Clean interfaces — Tasks are self-contained contracts — the description IS the interface.
- State isolation — Each sub-agent has its own internal message history, completely separate from the boss's conversation.
Multi-Phase Execution: Completed tasks from the same run are visible to later sub-agents. Phase 2 tasks can reference phase 1 outputs by task ID via GET_TASK, enabling staged workflows without the boss having to copy-paste findings between phases.
05 — Human-in-the-Loop as a First-Class Primitive
The system maintains a critical tool mapping per integration — send email, post to Slack, create calendar event, delete a Linear issue. These are the actions with real-world consequences. Users configure trust at three levels:
| Level | Example | Effect |
|---|---|---|
| Disable HIL entirely | "I trust the agent" | Full autonomy, no pauses |
| Allowlist integrations | "Trust all Gmail" | Every Gmail tool skips review |
| Allowlist specific tools | "Trust email send only" | Send skips review, delete still pauses |
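Resolving those three levels is a short fall-through check. A sketch with illustrative config keys and a hypothetical critical-tool map:

```python
# Hypothetical critical-tool map; the real mappings are per integration.
CRITICAL_TOOLS = {
    "gmail": {"send_email", "delete_draft"},
    "linear": {"delete_issue"},
}

def requires_review(integration: str, tool: str, trust: dict) -> bool:
    """Fall through the three trust levels, broadest first."""
    if trust.get("hil_disabled"):                          # full autonomy
        return False
    if integration in trust.get("trusted_integrations", ()):
        return False
    if tool in trust.get("trusted_tools", ()):
        return False
    return tool in CRITICAL_TOOLS.get(integration, ())
```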
What the user can do
When the agent proposes a critical action, the full state — proposed parameters, integration context, run status — is persisted to the database. The user sees what the agent wants to do and can:
- Approve — Execute as proposed
- Edit Parameters — Change recipient, reword message, adjust date
- Reject + Feedback — Agent adapts on next iteration
Freeform addressing
Users can interact with the HIL components in the UI or respond in natural language on an interrupted thread. An LLM classifies their intent and maps it to decisions.
Architecture: HIL isn't a wrapper around the agent loop. It's a graph-level interrupt primitive built into LangGraph's checkpoint system. State persists to the database across pauses. Runs can be interrupted for days or weeks — and resume cleanly with full execution context. No lost work, no replay.
06 — Real-Time Streaming Architecture
The agent streams five event types: ready (graph started), messages (LLM token chunks), custom (tool calls, sub-agent events), interrupt (HIL approval requests), and terminal events (done, error, canceled).
Direct Path — Slack & iMessage
The graph runs inside the request handler. SSE chunks stream straight to the client. Simple, low-latency, no intermediate infrastructure.
Indirect Path — Web
- /stream publishes the run request to NATS JetStream
- The handler returns an SSE response subscribed to a Redis Stream
- A consumer runs the graph, appending events via XADD
- The browser reads via XREAD BLOCK — live
Why the indirection? Browser disconnects don't kill the agent. Close your tab mid-run, the graph keeps running in the NATS consumer. The Redis Stream keeps accumulating frames. Come back, hit /view, and pick up where you left off.
Replay via Redis Streams
SSE events are stored in Redis Streams (an append-only log with cursor-based reads), not Redis pub/sub (fire-and-forget). This gives the system native replay support.
Every SSE frame gets a Redis entry ID, sent to the browser as the SSE id: field. On reconnection, the client passes back its last_event_id, and the server calls XREAD from that cursor. All missed events replay instantly, then the connection transitions to live streaming.
This survives tab switches, device sleep, and intermittent network drops — without any custom replay logic. Redis Streams and the SSE spec handle it natively.
TTL stratification keeps Redis clean: streams get a 1-hour safety-net TTL during execution, shortened to a 2-minute buffer after completion. Enough time for a client to reconnect and drain remaining events, then the stream auto-expires.
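The replay contract can be modeled in a few lines. A toy in-memory stand-in for the stream plus the reconnect path — the real system uses Redis `XADD`/`XREAD` and the SSE `Last-Event-ID` header:

```python
class MiniStream:
    """Append-only log with cursor reads — the two properties the
    replay path relies on."""
    def __init__(self):
        self.entries = []                  # [(entry_id, frame), ...]

    def append(self, frame):               # stands in for XADD
        entry_id = f"{len(self.entries) + 1}-0"
        self.entries.append((entry_id, frame))
        return entry_id

    def read_after(self, last_id):         # stands in for cursor XREAD
        if last_id is None:
            return list(self.entries)
        ids = [eid for eid, _ in self.entries]
        return self.entries[ids.index(last_id) + 1:]

def resume(stream, last_event_id):
    """On SSE reconnect: replay everything past the client's cursor,
    then the connection transitions to live streaming."""
    return stream.read_after(last_event_id)
```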
Dual Redis pattern
| Pattern | Used for | Why |
|---|---|---|
| Redis Streams | SSE event delivery | Durable, ordered, replayable |
| Redis Pub/Sub | Cancel signals | Instant, fire-and-forget |
Thread lifecycle safety: If a consumer crashes, orphan detection on the next request identifies stale busy threads (terminal run status + no active consumer), auto-resolves them, and the user proceeds normally.
07 — Context Window as a Managed Resource
Conversations don't end. A user might run a thread across weeks — dozens of runs, hundreds of tool calls, thousands of messages. Without active management, the context window fills up and the agent either crashes or degrades.
During Run — Mid-Run Compaction
Before every ReAct iteration, the system estimates current token usage. If it crosses 80% of the model's context window, a cheap summarizer (Gemini Flash) condenses older messages into a structured summary. Oldest messages are pruned, newest complete human turns preserved.
- Sonnet (256K): ~205K threshold
- Opus (1M): ~819K threshold
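The check itself is small. A sketch where the token estimator and summarizer are injected — the 80% fraction matches the thresholds above (0.8 × 256K ≈ 205K, 0.8 × 1M ≈ 819K); the function shape is illustrative:

```python
COMPACTION_FRACTION = 0.8

def maybe_compact(messages, estimate_tokens, context_window, summarize,
                  keep_tail=4):
    """Run before each ReAct iteration. Over threshold: condense older
    messages into one summary turn, keep the newest turns verbatim."""
    used = sum(estimate_tokens(m) for m in messages)
    if used < COMPACTION_FRACTION * context_window or len(messages) <= keep_tail:
        return messages
    old, tail = messages[:-keep_tail], messages[-keep_tail:]
    summary = summarize(old)   # cheap model, e.g. a Flash-class summarizer
    return [{"role": "system", "content": summary}] + tail
```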
After Run — End-Run Compaction
After a run completes, the post-processing consumer checks whether the thread exceeds a platform-specific threshold. The thresholds differ because the platforms differ.
- Web: 148K
- Slack: 64K
- iMessage: 64K
Concurrent safety
End-run compaction acquires a Redis lock before modifying thread state. If another run is already active, the compaction artifact (summary + message IDs to prune) gets queued in Redis rather than applied immediately. The queued artifact is flushed on the next /stream call before the new run starts, ensuring no data races between compaction and execution.
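A sketch of the lock-or-queue decision against any client with redis-py's `set(nx=..., ex=...)`, `delete`, and `rpush` shape — key names and the 60-second lock TTL are illustrative:

```python
import json

def end_run_compact(store, thread_id, artifact, apply_artifact):
    """Apply the compaction artifact only if we win the per-thread lock;
    otherwise queue it for the next /stream call to flush."""
    lock_key = f"compaction:lock:{thread_id}"
    if store.set(lock_key, "1", nx=True, ex=60):
        try:
            apply_artifact(artifact)        # prune messages, insert summary
            return "applied"
        finally:
            store.delete(lock_key)
    store.rpush(f"compaction:queue:{thread_id}", json.dumps(artifact))
    return "queued"
```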
Structured summarization
The summary isn't a narrative. It captures: User's Goal, Completed Actions, Failed/Blocked Actions, User Decisions & Preferences, Pending Requests, Key Context. This preserves what the agent needs for the next run — an operational snapshot, not a story.
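Rendered as a fixed-section snapshot — the section names come from the list above; the render helper is illustrative:

```python
SUMMARY_SECTIONS = (
    "User's Goal",
    "Completed Actions",
    "Failed/Blocked Actions",
    "User Decisions & Preferences",
    "Pending Requests",
    "Key Context",
)

def render_summary(findings: dict) -> str:
    """Fixed sections in a fixed order: an operational snapshot the
    next run can rely on, not free-form narrative."""
    return "\n\n".join(
        f"## {section}\n{findings.get(section, 'None')}"
        for section in SUMMARY_SECTIONS
    )
```

A stable section order also keeps the compacted prefix byte-identical across runs, which helps the prompt cache hits mentioned below.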
The Result: Conversations run indefinitely without degradation. Token costs stay bounded. Prompt cache hit rates stay high because the compressed message prefix stays stable across runs.
08 — Design Principles
01 — Hierarchical delegation with a reasoning orchestrator
The boss agent doesn't route by keyword matching or intent classification. It reasons about task complexity, decides execution strategy dynamically, and writes self-contained task contracts for sub-agents. The delegation decision — direct vs. delegate vs. explore — is a judgment call made by the orchestrator on every request.
02 — Boss-controlled execution with shared state, not message passing
The orchestrator has full control over execution order — spawning sub-agents in parallel when tasks are independent, or sequentially when outputs feed into downstream work. No coordination protocol between agents, no consensus mechanism, no message passing. The Redis scratchpad provides shared state without inter-agent communication complexity. Each agent works independently; the boss decides when to launch them and synthesizes the results.
03 — Lazy capability acquisition
Agents discover what tools they need through reasoning and search, not through pre-loaded capability sets. A conversation starts with minimal tools and acquires integrations as the work demands them. The system scales to dozens of integrations without linear context window cost.
04 — Human oversight as a system primitive
Critical actions interrupt execution, persist full state to the database, allow parameter editing, support partial and freeform addressing, and resume cleanly across hours or days. Trust is earned per-action, not granted per-session.
05 — Finite context, infinite conversations
Two-tier compaction (mid-run safety, end-run optimization), deterministic tool ordering for prompt cache hits, structured summarization, platform-aware thresholds. The context window is treated as an explicitly managed resource — not an assumed-infinite buffer.
This architecture powers dimension.dev's core conversational agent — used and loved by thousands of users daily across web, Slack, and iMessage.