Production system indexing Google Drive, Gmail, Outlook, Notion, Linear, Airtable, Dropbox, and more. Consumer-driven pipelines with scale-to-zero workers and hybrid search.
- Integrations: 8+ providers
- Built: Lead engineer, 10 months
- Stack: NATS · TurboPuffer · KEDA
01 — Eight Integrations, One Architecture
Each integration has a dedicated indexing system with its own NATS consumers, fetcher, parser, and vector store schema. They share the same infrastructure (NATS JetStream, TurboPuffer, KEDA, Voyage embeddings) but handle integration-specific APIs, data models, and sync strategies independently.
| Integration | What's Indexed | Sync Strategy |
|---|---|---|
| Google Drive | Docs, Sheets, Slides, PDFs, Word, PPT, Excel, CSV, Markdown, TXT | Full sync + Drive webhook, fan-out to 3 specialized workers |
| Gmail | Emails (1 vector per message), relevance-scored | 1-year history + Pub/Sub push, VIP relationship extraction |
| Outlook | Email messages, calendar events | Full sync + Graph subscriptions, per-calendar webhooks |
| OneDrive | PDFs, Word, Excel, CSV, PPT, Markdown, TXT | Graph delta queries + subscription webhooks, fan-out to 3 workers |
| Notion | Pages (recursive block tree crawl, max 5000 blocks per page) | Full sync + per-page webhooks (created/updated/deleted/restored) |
| Linear | Issues, projects, documents, comments | Full sync + per-entity webhooks, 4 separate namespace types |
| Airtable | Base schemas and table structure metadata | Full sync + webhook, schema hashing for change detection |
| Dropbox | PDFs, Word, Excel, CSV, PPT, Markdown, TXT | Full sync + delta cursor webhook, fan-out to 3 specialized workers |
Note: The rest of this document uses Google Drive as the running example. The patterns shown here — consumer-driven ingestion, diff-based sync, scale-to-zero workers, namespace-isolated vector storage — are the same across all integrations.
02 — The Ingestion Pipeline
A user connects Google Drive. A NATS consumer fires the orchestrator (GoogleFetcher), which discovers every file, diffs against what's already indexed, classifies by MIME type, and publishes to three specialized worker queues. Each worker scales independently from zero.
┌───────────┐ ┌──────────┐
│ Full Sync │ │ Webhook │
│ (User │ │ (File │
│ connects) │ │ changed) │
└───────────┘ └──────────┘
↓
┌─────────────────────────────┐
│ NATS JetStream │
│ GoogleDriveSyncWorkspace │
│ Consumer · WebhookConsumer │
└─────────────────────────────┘
↓
┌─────────────────────────────┐
│ GoogleFetcher │
│ 1. Discovery │
│ 2. Diff │
│ 3. Classify │
│ 4. Fan-out │
└─────────────────────────────┘
↓
┌──────────┐ ┌───────────┐ ┌──────────────┐
│ PDF │ │ Text │ │ Tabular │
│ Worker │ │ Worker │ │ Worker │
│ Gemini │ │ 750m CPU │ │ ~5K row │
│ vision │ │ / 1Gi RAM │ │ batch │
│ Max 10 │ │ Max 5 │ │ streaming │
│ pods │ │ pods │ │ Max 5 pods │
└──────────┘ └───────────┘ └──────────────┘
↓
┌─────────────────────────────┐
│ TurboPuffer │
│ Namespace per Drive · │
│ User-level ACL · │
│ Voyage AI embeddings │
└─────────────────────────────┘
KEDA watches NATS queue depth and spins up pods only when there's work. All workers run on GKE spot instances (60–91% discount) with 15-second termination grace periods.
03 — Document Discovery & Fan-Out
The GoogleFetcher Lifecycle
01 — Discovery
Query Drive API (My Drive + Shared Drives), filter by MIME type, resolve shortcuts, build folder paths, extract collaborators.
02 — Diff
Query TurboPuffer for existing doc IDs. Compare and classify: to_connect (new), to_delete (removed), to_update (metadata changed).
03 — Classify & Limit
Apply tier-based caps (Free: 1K, Plus: 5K, Pro: 10K, Max: 25K files). Keep most-recently-modified. Track cost per connection.
04 — Publish
Route new docs to PDF / text / tabular NATS queues. Delete removed docs from vector store. Patch metadata-only changes in-place.
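The diff step (02) can be sketched as a pure function over the remote listing and the already-indexed ids. `DriveFile` and the timestamp-only update check are simplifying assumptions; the real fetcher compares richer metadata:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DriveFile:
    id: str
    modified_at: str  # RFC 3339 timestamp from the Drive API

def diff_files(remote, indexed):
    """Classify remote files against the vector store.

    `indexed` maps document_id -> last stored modified timestamp.
    Returns (to_connect, to_delete, to_update) as lists of ids.
    """
    remote_by_id = {f.id: f for f in remote}
    to_connect = [fid for fid in remote_by_id if fid not in indexed]
    to_delete = [fid for fid in indexed if fid not in remote_by_id]
    to_update = [
        fid for fid, f in remote_by_id.items()
        if fid in indexed and indexed[fid] != f.modified_at
    ]
    return to_connect, to_delete, to_update
```

New ids go to the worker queues, deleted ids are purged from the namespace, and updated ids get re-indexed or patched depending on the change.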
MIME Classification
| Worker Queue | Document Types |
|---|---|
| PDF | PDFs (application/pdf) |
| Text | Google Docs, Slides, Word, PowerPoint, Markdown, Plain Text |
| Tabular | Google Sheets, Excel (.xlsx/.xls), CSV |
Why Three Queues? Each document type has different resource profiles. PDFs need Gemini vision for page-by-page extraction — CPU and memory heavy. Sheets stream tab-by-tab in batches of ~5,000 rows. Docs need API-specific tree parsing. Separating them means a burst of 500 PDFs doesn't starve text processing.
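The classification itself is a MIME-type lookup. A minimal sketch, with hypothetical queue subject names and an abbreviated type list:

```python
# Hypothetical NATS subjects; the production names may differ.
PDF_QUEUE, TEXT_QUEUE, TABULAR_QUEUE = "drive.pdf", "drive.text", "drive.tabular"

MIME_TO_QUEUE = {
    "application/pdf": PDF_QUEUE,
    "application/vnd.google-apps.document": TEXT_QUEUE,
    "application/vnd.google-apps.presentation": TEXT_QUEUE,
    "text/markdown": TEXT_QUEUE,
    "text/plain": TEXT_QUEUE,
    "application/vnd.google-apps.spreadsheet": TABULAR_QUEUE,
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": TABULAR_QUEUE,
    "text/csv": TABULAR_QUEUE,
}

def route(mime_type: str):
    """Return the worker queue for a MIME type, or None to skip the file."""
    return MIME_TO_QUEUE.get(mime_type)
```

Unknown types fall through to `None` and are never published, so the workers only ever see formats they can parse.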
04 — Scale-to-Zero Workers
Zero replicas is the default state. KEDA polls each NATS JetStream consumer every 15 seconds. When pending messages exceed the lag threshold (25), KEDA scales up the deployment. When the queue drains and stays empty for 5 minutes, it scales back to zero. No idle pods burning compute.
- PDF Pods: 0→10 — Gemini Vision · heavy CPU/RAM
- Text Pods: 0→5 — 750m CPU / 1Gi RAM
- Tabular Pods: 0→5 — ~5K row batch streaming
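The scaling behavior above can be approximated as a small function: roughly one replica per `lag_threshold` pending messages, clamped to the worker's max, with zero pending meaning zero replicas. This is a rough model of KEDA's queue-length scaler, not its exact algorithm:

```python
import math

def desired_replicas(pending: int, lag_threshold: int = 25, max_replicas: int = 10) -> int:
    """Approximate KEDA queue-depth scaling: ceil(pending / threshold),
    clamped to the deployment's max. Empty queue -> scale to zero."""
    if pending == 0:
        return 0
    return min(max_replicas, math.ceil(pending / lag_threshold))
```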
Message Delivery Guarantees
| Outcome | Action | Effect |
|---|---|---|
| Success | msg.ack() | Message removed from queue |
| Bad data | msg.ack() | Skip — retrying won't fix it |
| OAuth revoked | msg.ack() | Permanent failure — user must re-auth |
| Transient error | msg.nak(delay=7200) | Retry in 2 hours |
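The table above reduces to a small decision function. The exception classes here are illustrative assumptions; the returned pair maps to nats-py's `msg.ack()` and `msg.nak(delay=...)`:

```python
RETRY_DELAY = 7200  # seconds; transient failures retry in 2 hours

class BadDataError(Exception):
    """Document can't be parsed; retrying won't help."""

class OAuthRevokedError(Exception):
    """User must re-authorize; retrying won't help."""

def decide(error):
    """Map a processing outcome to a JetStream action.

    ("ack", None) -> msg.ack(): remove from queue.
    ("nak", d)    -> msg.nak(delay=d): redeliver after d seconds.
    """
    if error is None or isinstance(error, (BadDataError, OAuthRevokedError)):
        return ("ack", None)   # success or permanent failure: drop it
    return ("nak", RETRY_DELAY)  # transient: try again later
```

Acking permanent failures is the key choice: a poison message that naks forever would pin a worker pod awake.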
Worker Isolation: Each worker type has independent resource limits and max replica counts. A burst of heavy PDFs (Gemini vision + image rendering) can't exhaust memory for lighter text processing. Kubernetes resource requests guarantee each pod gets dedicated CPU and memory.
05 — Vector Store & Multi-Tenancy
Each Google Drive (personal or shared) maps to one TurboPuffer namespace. This keeps vector search scoped — querying a user's documents doesn't scan every document in the system. The namespace identifier comes from the Drive API (drive ID for shared drives, account identifier for personal drives).
Vector Schema
| Field | Purpose |
|---|---|
| document_id | Google Drive file ID |
| user_ids | Users with access — ACL array, filtered at query time |
| embedding | Voyage AI vector (Voyage-4) |
| content | Chunk text (searchable via BM25) |
| title_ngrams | Edge n-grams (1–10 chars) for fuzzy title matching |
| folder_path | ["My Drive", "Projects", "Q1"] — folder hierarchy |
| url | webViewLink for citations in agent responses |
Dedup Strategy
Workers check if a doc exists by document_id before processing. If it exists, add current user to user_ids and skip. Prevents duplicate Gemini/embedding costs on webhook storms.
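The dedup check is a three-way decision. A minimal sketch, assuming the existence check returns the stored user_ids array (or None when the doc is absent):

```python
def dedup_action(existing_user_ids, user_id):
    """Decide between full processing and a cheap ACL patch.

    existing_user_ids is None when document_id is absent from the
    namespace, otherwise the stored user_ids array.
    Returns ("process", None), ("skip", None), or ("patch", new_acl).
    """
    if existing_user_ids is None:
        return ("process", None)              # new doc: extract + embed
    if user_id in existing_user_ids:
        return ("skip", None)                 # already indexed for this user
    return ("patch", existing_user_ids + [user_id])  # ACL-only update
```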
Disconnect Handling
When a user disconnects: if they're the only user with access, delete the document. If others still have access, remove their user ID but keep the document. Clean, zero-orphan teardown.
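The disconnect rule is the mirror image of the dedup patch, sketched per document:

```python
def disconnect_action(user_ids, user_id):
    """Per-document teardown when user_id disconnects the integration."""
    remaining = [u for u in user_ids if u != user_id]
    if not remaining:
        return ("delete", None)   # sole user: remove doc and vectors
    return ("patch", remaining)   # others keep access: shrink the ACL
```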
06 — Incremental Sync via Webhooks
After the initial full sync, Google Drive's changes API fires on views, shares, renames, moves, metadata edits, and actual content changes. The smart diff layer prevents reprocessing on every event. Without it, a popular shared folder would trigger constant re-indexing from read-only activity.
Initial Sync
- User connects Google Drive
- GoogleFetcher.main() lists ALL files, diffs, and indexes
- Register 7-day webhook channel
Incremental Sync
- Drive API detects file change
- Webhook fires → NATS consumer
- handle_webhook_sync() fetches changes since the last page token
Smart Diffing: Mutable vs Immutable
| Document Type | On Change | Why |
|---|---|---|
| Docs, Sheets, Slides | Compare updated_at → re-index if changed | Mutable — edits change content |
| PDFs, Office, text, CSV | Patch user_ids only — never re-index | Immutable once uploaded |
A PDF shared with 10 new team members triggers 10 lightweight user_ids patches, not 10 full Gemini vision re-extractions.
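The mutable/immutable split from the table above can be sketched as one dispatch function. The Google-native MIME set and the timestamp comparison are the assumptions here:

```python
# Google-native formats are editable in place; uploads are treated as immutable.
MUTABLE_MIME_TYPES = {
    "application/vnd.google-apps.document",
    "application/vnd.google-apps.spreadsheet",
    "application/vnd.google-apps.presentation",
}

def on_change_event(mime_type, stored_updated_at, remote_updated_at):
    """React to one webhook change event for an already-indexed doc."""
    if mime_type not in MUTABLE_MIME_TYPES:
        return "patch_user_ids"   # immutable upload: only the ACL can change
    if stored_updated_at != remote_updated_at:
        return "reindex"          # content edit: re-extract and re-embed
    return "skip"                 # read-only activity (view/share) fired it
```

The `"skip"` branch is what keeps a popular shared folder from re-indexing on every view.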
Channel Rotation
Google Drive webhook channels expire after 7 days. A dedicated NATS consumer (GoogleDriveRotateExpiringChannelConsumer) handles rotation automatically.
The page token is the bridge between old and new channels. It marks "changes since this point" — the new channel picks up exactly where the old one stopped. No gap in coverage.
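The rotation logic amounts to reopening a channel with the same page token near expiry. A sketch under stated assumptions: `watch` stands in for the Drive changes.watch call, and the one-day rotation window is illustrative:

```python
SEVEN_DAYS = 7 * 86_400

def maybe_rotate(channel, now, watch):
    """Rotate a webhook channel inside the last day of its lifetime.

    Carrying the saved page token over means the new channel resumes
    exactly where the old one stopped, so no change events fall in a gap.
    """
    if now < channel["expires_at"] - 86_400:
        return channel  # not in the rotation window yet
    new = watch(start_page_token=channel["page_token"])
    new["page_token"] = channel["page_token"]
    new["expires_at"] = now + SEVEN_DAYS
    return new
```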
07 — Cost Control & Real-Time Progress
Embedding costs scale linearly with document count. A free user connecting a 50,000-file corporate Drive shouldn't generate $200 in Voyage AI costs. Tier limits cap files indexed per integration, selecting the most-recently-modified so users get the most relevant content within budget.
| Plan | Total Files | Docs | Sheets | Slides |
|---|---|---|---|---|
| Free | 1,000 | 250 | 50 | 50 |
| Plus | 5,000 | 750 | 250 | 250 |
| Pro | 10,000 | 1,500 | 500 | 500 |
| Max | 25,000 | 3,000 | 1,000 | 1,000 |
Cost tracking is per-connection, not per-user. A user with two Google accounts gets independent budgets for each. If one hits its limit mid-sync, that connection stops — others continue unaffected.
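The cap itself is a sort-and-truncate over the discovered files, keeping the most recently modified. A minimal sketch:

```python
TIER_FILE_LIMITS = {"free": 1_000, "plus": 5_000, "pro": 10_000, "max": 25_000}

def apply_cap(files, cap):
    """Keep the `cap` most-recently-modified files; the rest are skipped.

    `files` are dicts with a sortable `modified_at` timestamp.
    """
    ranked = sorted(files, key=lambda f: f["modified_at"], reverse=True)
    return ranked[:cap]
```

Because the cap is applied before fan-out, skipped files never reach the workers and never incur Gemini or embedding cost.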
Phased Progress Tracking
01 — Discovery — 0–5%
GoogleFetcher lists files from the Drive API. Quick for small drives, may take minutes for large workspaces.
02 — Processing — 5–98.5%
Per-entity tracking via Ably: "142/300 docs · 28/50 sheets · 15/45 PDFs." Each worker calls update_google_background_progress() after finishing a document.
03 — Finalization — 98.5–100%
Progress capped at 98.5% until finalize_google_background_job() confirms all registered jobs are complete. Distributed join point across 3 worker pools.
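The phase boundaries can be sketched as a single mapping from worker completions into the 5–98.5% band, with 100% reserved for the finalize join point:

```python
def overall_progress(done: int, registered: int, finalized: bool) -> float:
    """Map per-entity completions into the 5-98.5% processing band.

    Discovery owns 0-5%; only finalization may report 100%.
    """
    if finalized:
        return 100.0
    if registered == 0:
        return 5.0  # discovery finished, nothing to process yet
    frac = min(done / registered, 1.0)
    return 5.0 + frac * 93.5
```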
08 — Hybrid Search Architecture
Every search runs as a multi_query — multiple retrieval legs in a single TurboPuffer call. The agent provides a vector_query (semantic) and a bm25_query (keywords). Both execute across all user namespaces for the provider in parallel.
HYBRID_SEARCH (provider, vector_query, bm25_query)
↓
┌───────────┐ ┌───────────┐ ┌───────────┐
│Namespace 1│ │Namespace 2│ │Namespace N│
│multi_query│ │multi_query│ │multi_query│
└───────────┘ └───────────┘ └───────────┘
↓
Dedup by chunk_id → Reciprocal Rank Fusion (k=60)
↓
Voyage rerank-2.5-lite → Threshold: 0.4 → Top 10
Query Legs per Provider
| Provider | Vector ANN | BM25 Leg 1 | BM25 Leg 2 |
|---|---|---|---|
| Google Drive | embedding | content | document_name |
| Gmail | embedding | subject | content |
| Notion | embedding | content | page_title |
| Linear | embedding | title | description |
| GitHub | embedding | content | path + repo_name |
| Dropbox | embedding | content | file_name |
Why RRF + Reranking? Vector misses exact keywords. BM25 misses semantics. RRF fuses both. Voyage cross-encoder scores true relevance. Below 0.4? Returns nothing rather than noise.
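Reciprocal Rank Fusion is small enough to show in full. This is the textbook formula (score summed as 1/(k + rank) across legs), not the production code:

```python
def rrf_fuse(rankings, k: int = 60):
    """Fuse ranked lists of chunk ids via Reciprocal Rank Fusion.

    score(d) = sum over legs of 1 / (k + rank_d), ranks starting at 1.
    Chunks found by several legs accumulate score and rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The fused list then goes to the Voyage cross-encoder, which assigns the true relevance scores used for the 0.4 threshold.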
Access Control: Every query applies a user_ids Contains user_id filter. Unauthorized docs never enter the ranking pipeline. All embeddings use Voyage-4.
09 — Search to Action Bridge
The agent doesn't know where a document lives. "The PRD" could be in Google Docs, Notion, or Linear. So the agent runs parallel searches across likely providers, picks the most relevant result regardless of source, and lazily loads only the tools needed to fetch full content.
"Find the PRD for Project Alpha and summarize it"
↓
┌─────────────┐ ┌──────────┐ ┌─────────────┐
│HYBRID_SEARCH│ │HYBRID │ │HYBRID_SEARCH│
│google-drive │ │SEARCH │ │linear-docs │
│3 results ✓ │ │notion │ │0 results │
│ │ │1 result ✓│ │ │
└─────────────┘ └──────────┘ └─────────────┘
↓
Results include: integrations: ["google_docs", "notion"]
↓
ADD_INTEGRATIONS → Fetch Full Content → Respond to User
Load google_docs GOOGLE_DOCS_GET(id) Summarize, cite, act
+ notion
Pointers, Not Content
Search returns metadata and chunks (enough to assess relevance), not full documents. Full content is fetched only when needed, via integration-specific tools.
Lazy Tool Loading
Results carry an integrations list telling the agent which tools to load. No tools are loaded until search proves they're needed.
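The lazy-loading decision reduces to a set difference over the `integrations` lists carried by the results. A minimal sketch:

```python
def tools_to_load(results, already_loaded):
    """Collect integration tools the search results name but that
    aren't loaded yet. Nothing loads until a result calls for it."""
    needed = set()
    for r in results:
        needed.update(r.get("integrations", []))
    return needed - set(already_loaded)
```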
This architecture powers dimension.dev's context engine — indexing 8+ integrations and serving hybrid search for thousands of users daily.