The Engineering Behind Searching Everything

How we built a system that indexes 8+ platforms and makes their data instantly searchable.

The Problem

Your documents live everywhere. Gmail. Google Drive. Notion. Linear. Airtable. GitHub. Dropbox. Outlook.

When you ask an AI agent, "Find the Q1 roadmap," the agent has no idea where it lives. Is it in Notion? Google Docs? Linear? All three?

And once you find it, how do you cite it without loading the entire 50-page document into context?

Most systems solve this by building a separate integration for each platform. Then they build a separate indexer. Then they glue them together with faith and duct tape.

We did something different.

One Architecture, Eight Integrations

Each integration (Google Drive, Gmail, Notion, etc.) has its own API quirks, permission models, and sync strategies. But underneath, they all feed a single indexing pipeline.

The pattern:

Full sync when the user connects (incremental sync via webhooks after that)
Discover all documents
Classify by type (PDF, text, spreadsheet, etc.)
Fan out to specialized workers
Index into a vector store

Here's what we index from six of those platforms:

Integration	What We Index	Sync Strategy
Google Drive	Docs, Sheets, Slides, PDFs, Word, Excel	Full sync + webhooks, 3 worker types
Gmail	Emails, VIP contact relationships	1-year history + push, LLM relationship analysis
Notion	Pages (recursive block crawl, max 5K blocks)	Full sync + per-page webhooks
Linear	Issues, projects, docs, comments	Full sync + per-entity webhooks
Airtable	Base schemas, table structures	Full sync + webhook with schema hashing
Dropbox	PDFs, Word, Excel, CSV, Markdown	Full sync + delta cursor webhook

The common infrastructure (NATS JetStream, TurboPuffer, KEDA, Voyage embeddings) handles all of them. The only difference is the integration-specific fetcher and parser.
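To make the split between shared and integration-specific code concrete, here is a minimal sketch. Every name in it (DiscoveredDoc, Fetcher, run_sync, StaticFetcher) is hypothetical and chosen for illustration; the real interfaces may look different.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Protocol


@dataclass(frozen=True)
class DiscoveredDoc:
    integration: str   # e.g. "google_drive" or "notion"
    doc_id: str
    mime_type: str
    updated_at: str    # timestamp string reported by the source platform


class Fetcher(Protocol):
    """The integration-specific part: list documents for one connection."""

    def discover(self, connection_id: str) -> Iterable[DiscoveredDoc]: ...


def run_sync(fetcher: Fetcher, connection_id: str,
             publish: Callable[[DiscoveredDoc], None]) -> int:
    """The shared part: hand every discovered document to the pipeline."""
    count = 0
    for doc in fetcher.discover(connection_id):
        publish(doc)  # in production this would publish to a queue
        count += 1
    return count


class StaticFetcher:
    """Toy fetcher that returns a fixed list, used only to run the sketch."""

    def __init__(self, docs):
        self._docs = list(docs)

    def discover(self, connection_id: str):
        return iter(self._docs)
```

Under this shape, adding a platform means writing one new class with a discover method; run_sync and everything downstream stay unchanged.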

The Ingestion Pipeline: From Files to Searchable Vectors

When a user connects Google Drive, here's what happens:

User connects Drive
        ↓
NATS consumer fires GoogleFetcher
        ↓
Fetcher discovers all files
        ↓
Diffs against what's already indexed
        ↓
Classifies by MIME type
        ↓
Publishes to 3 worker queues (PDF, Text, Tabular)
        ↓
Workers scale from 0 to N based on queue depth
        ↓
Each worker extracts content and generates embeddings
        ↓
Vectors land in TurboPuffer with user ACL
        ↓
Agent can search across all documents instantly

The clever part is worker specialization. PDFs need Gemini Vision to extract text and images, which is expensive and CPU-heavy. Spreadsheets need row-by-row parsing, which is memory-hungry. Plain text is cheap.

If 500 PDFs shared one queue with thousands of text files and spreadsheets, the slow PDF extractions would starve the cheap work queued behind them.

Instead: three separate NATS queues, three independent worker pools, independent resource limits.
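The routing step can be sketched as a lookup from MIME type to queue. The queue names below are made up for illustration, and sending every unrecognized type to the text queue is an assumption; a real classifier would probably skip unsupported files.

```python
# Hypothetical queue names, one per worker pool.
PDF_QUEUE = "index.pdf"
TEXT_QUEUE = "index.text"
TABULAR_QUEUE = "index.tabular"

TABULAR_MIME_TYPES = {
    "application/vnd.google-apps.spreadsheet",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "text/csv",
}


def queue_for(mime_type: str) -> str:
    """Pick the worker queue for a single file."""
    if mime_type == "application/pdf":
        return PDF_QUEUE
    if mime_type in TABULAR_MIME_TYPES:
        return TABULAR_QUEUE
    return TEXT_QUEUE  # assumption: everything else is treated as text


def fan_out(files):
    """Group (file_id, mime_type) pairs by destination queue."""
    batches = {PDF_QUEUE: [], TEXT_QUEUE: [], TABULAR_QUEUE: []}
    for file_id, mime_type in files:
        batches[queue_for(mime_type)].append(file_id)
    return batches
```

Because each queue has its own consumers, a backlog in the PDF batch never delays the other two.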

Scale-to-Zero: Only Pay for Work You're Actually Doing

Here's the infrastructure cost problem most companies don't solve: Most of the time, you're not indexing anything. A pod sitting idle costs money.

Our solution: KEDA (Kubernetes Event-Driven Autoscaling) watches the NATS queue depth.

When there's work:

PDF pods: 0 → 10 replicas (Gemini Vision, heavy CPU/RAM)
Text pods: 0 → 5 replicas (750m CPU, 1Gi RAM each)
Tabular pods: 0 → 5 replicas (batch streaming, ~5K rows at a time)

KEDA polls every 15 seconds. If the queue exceeds 25 pending messages, it scales up. If it stays empty for 5 minutes, it scales back to zero.
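As a rough model of that behavior (not KEDA's actual algorithm), the replica count can be thought of as one pod per 25 pending messages, capped at the pool maximum. The function names are illustrative, and the one-pod-per-threshold rule is an assumption based on how target-value autoscaling usually works.

```python
import math


def desired_replicas(pending: int, max_replicas: int, threshold: int = 25) -> int:
    """Simplified model: one replica per `threshold` pending messages."""
    if pending <= 0:
        return 0
    return min(max_replicas, math.ceil(pending / threshold))


def should_scale_to_zero(seconds_empty: float, cooldown: float = 300.0) -> bool:
    """The pool drops to zero once the queue has been empty for the cooldown."""
    return seconds_empty >= cooldown
```

With these numbers, 60 pending text messages would ask for 3 of the 5 text pods, and 500 pending PDFs would hit the 10-pod cap.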

All workers run on GKE spot instances (60–91% discount). They have 15-second termination grace periods for preemption.

Result: We don't pay for idle compute beyond the 5-minute cooldown. Every pod exists only because there's something to process.

The Vector Store: Namespaces, Not Monoliths

A naive approach: dump every document from every user into one massive vector index. Query returns results across all users. Oops, you just gave user A access to user B's documents.

Instead: every Google Drive maps to one TurboPuffer namespace.

Personal Drive → namespace based on account ID
Shared Drive → namespace based on drive ID
Each namespace has user-level ACL arrays

When an agent searches for "the PRD," the query filters by user_ids Contains user_id before ranking. Unauthorized documents never enter the pipeline.
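A minimal in-memory sketch of the two ideas above: picking a namespace per drive, and filtering on the ACL before anything is ranked. The namespace naming scheme is invented for illustration, and visible_to stands in for the filter that the vector store would apply server-side.

```python
from typing import Optional


def namespace_for(account_id: str, shared_drive_id: Optional[str] = None) -> str:
    """Shared drives key on the drive ID, personal drives on the account ID."""
    if shared_drive_id is not None:
        return f"gdrive-shared-{shared_drive_id}"   # hypothetical format
    return f"gdrive-personal-{account_id}"          # hypothetical format


def visible_to(rows, user_id: str):
    """Stand-in for the `user_ids Contains user_id` filter."""
    return [row for row in rows if user_id in row["user_ids"]]
```

Filtering first means ranking only ever sees rows the caller is allowed to read.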

Each vector stores:

document_id — Drive file ID
user_ids — ACL: who has access
embedding — Voyage AI vector (Voyage-4, 1024 dims)
content — Chunk text (searchable via BM25)
title_ngrams — Fuzzy title matching (1–10 char edge n-grams)
folder_path — ["My Drive", "Projects", "Q1"] hierarchy
url — webViewLink for citations
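The title_ngrams field is what lets a partial title like "road" match "Roadmap". Here is one way to generate 1 to 10 character edge n-grams; lowercasing and splitting on whitespace are assumptions, since the actual tokenizer is not described here.

```python
from typing import List


def edge_ngrams(title: str, min_len: int = 1, max_len: int = 10) -> List[str]:
    """Return the prefixes of each word in the title, shortest first."""
    grams = []
    for word in title.lower().split():
        longest = min(max_len, len(word))
        for n in range(min_len, longest + 1):
            grams.append(word[:n])
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(grams))


print(edge_ngrams("Q1 Roadmap"))
# ['q', 'q1', 'r', 'ro', 'roa', 'road', 'roadm', 'roadma', 'roadmap']
```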

Smart Incremental Sync: Only Re-Index What Changed

Initial full sync is expensive. A 50,000-file corporate Drive takes minutes.

But most of the time, a user isn't connecting new drives. They're editing existing files.

Google Drive's changes API fires on views, shares, renames, moves, metadata edits, and content changes. Without smart diffing, a popular shared folder would trigger constant re-indexing from activity that never touches content, such as people simply viewing a file.

The solution: Mutable vs Immutable documents

Mutable (Google Docs, Sheets, Slides): Compare updated_at → re-index if changed
Immutable (PDFs, Office files, CSV): Never re-index content. Only patch user ACLs.

A PDF shared with 10 new team members triggers 10 lightweight ACL patches. Not 10 full Gemini Vision re-extractions.
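The decision for each incoming change can be sketched as a small function. The action names and the argument layout are hypothetical; the three MIME types are Google's native Docs, Sheets, and Slides types.

```python
MUTABLE_MIME_TYPES = {
    "application/vnd.google-apps.document",
    "application/vnd.google-apps.spreadsheet",
    "application/vnd.google-apps.presentation",
}


def plan_action(mime_type, indexed_updated_at, current_updated_at,
                indexed_user_ids, current_user_ids) -> str:
    """Return "index", "reindex", "patch_acl", or "skip" for one change."""
    if indexed_updated_at is None:
        return "index"        # never seen before: full extraction
    if mime_type in MUTABLE_MIME_TYPES and current_updated_at != indexed_updated_at:
        return "reindex"      # mutable content changed
    if set(indexed_user_ids) != set(current_user_ids):
        return "patch_acl"    # same content, different audience
    return "skip"             # a view or other no-op event
```

An immutable file can only ever produce "index" once, then "patch_acl" or "skip", which is what keeps expensive extraction off the hot path.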

Google Drive webhook channels expire after 7 days. A dedicated NATS consumer handles rotation: it gets the page token (marks "changes since here"), registers a new channel, and picks up exactly where the old one left off.

No gaps in coverage.
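A rotation step might look like the sketch below. It assumes `service` is a Drive v3 client built elsewhere with google-api-python-client, and that the pipeline has stored the page token it last processed; rotate_channel and the webhook URL are illustrative names, not the production code.

```python
import time
import uuid

SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000


def rotate_channel(service, page_token: str, webhook_url: str) -> dict:
    """Register a fresh changes channel that resumes from `page_token`."""
    body = {
        "id": str(uuid.uuid4()),   # channel IDs must be unique
        "type": "web_hook",
        "address": webhook_url,
        # Requested expiry in epoch milliseconds; Drive may shorten it.
        "expiration": int(time.time() * 1000) + SEVEN_DAYS_MS,
    }
    return service.changes().watch(pageToken=page_token, body=body).execute()
```

Reusing the stored token rather than asking for a new start token is what closes the gap: any change that landed between the old channel expiring and the new one registering is still returned on the next fetch.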

Cost Control: Tier-Based Indexing

Embedding costs scale linearly with documents. A free user connecting a 50,000-file corporate Drive shouldn't generate $200 in Voyage AI costs.

So we impose tier limits and keep the most relevant files:

Plan	Total Files	Docs	Sheets	Slides
Free	1,000	250	250	250
Plus	5,000	750	250	250
Pro	10,000	1,500	500	500
Max	25,000	3,000	1,000	1,000

Cost tracking is per-connection, not per-user. A user with two Google accounts gets independent budgets for each.

During sync, we apply the tier cap and keep files that were modified most recently. Users get the most relevant content within budget.
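One plausible way to apply the caps is to walk files from newest to oldest and stop admitting a type once its limit is reached. The limits below come from the table above; how the per-type and total caps interact, and the "docs"/"sheets"/"slides"/"other" labels, are assumptions made for this sketch.

```python
from collections import Counter

TIER_LIMITS = {
    "free": {"total": 1_000, "docs": 250, "sheets": 250, "slides": 250},
    "plus": {"total": 5_000, "docs": 750, "sheets": 250, "slides": 250},
    "pro": {"total": 10_000, "docs": 1_500, "sheets": 500, "slides": 500},
    "max": {"total": 25_000, "docs": 3_000, "sheets": 1_000, "slides": 1_000},
}


def apply_tier_cap(files, plan: str):
    """files: iterable of (file_id, kind, modified_at). Returns kept file IDs."""
    limits = TIER_LIMITS[plan]
    kept, per_kind = [], Counter()
    newest_first = sorted(files, key=lambda f: f[2], reverse=True)
    for file_id, kind, _modified_at in newest_first:
        if len(kept) >= limits["total"]:
            break
        cap = limits.get(kind)  # kinds without a cap only count toward the total
        if cap is not None and per_kind[kind] >= cap:
            continue
        per_kind[kind] += 1
        kept.append(file_id)
    return kept
```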

Hybrid Search: Why Vectors Alone Aren't Enough

A pure vector search (semantic search) misses exact keywords.

"I need the Q2 roadmap." Vector search might return a document about 2024 strategy (semantically similar but wrong) and miss the exact Q2 document.

A pure BM25 search (keyword matching) misses semantics.

"Find docs about our technical debt." BM25 returns exact matches for "technical" and "debt" but misses documents that discuss infrastructure problems without using those words.

We do both:

Agent: "Find the Q2 roadmap"
        ↓
Search across all user namespaces
        ├─ Vector leg: "Find semantically similar documents"
        ├─ BM25 leg 1: "Find docs with 'Q2' in title"
        └─ BM25 leg 2: "Find docs with 'roadmap' in content"
        ↓
Merge results → Dedup
        ↓
Reciprocal Rank Fusion (combine rank lists)
        ↓
Voyage cross-encoder reranks (0–1 score)
        ↓
Threshold: 0.4 (returning nothing is better than returning garbage)
        ↓
Return top 10

RRF (Reciprocal Rank Fusion) is the trick that fuses both. It doesn't average scores, which aren't comparable between a vector leg and a BM25 leg. It combines ranked lists, so a document ranked near the top of any one list gets a high fused score.

The cross-encoder (Voyage rerank-2.5-lite) is the final gate. If it scores a result below 0.4, we return nothing instead of noise.
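The fusion and the final gate fit in a few lines. This sketch uses the standard RRF formula, where each list contributes 1 / (k + rank) to a document's score; k = 60 is the conventional default and an assumption here. The reranker is passed in as a plain function so the example runs without any external service.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k: int = 60):
    """Fuse ranked lists of document IDs; duplicates merge by ID."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def gate(candidates, rerank_score, threshold: float = 0.4, top_k: int = 10):
    """Keep candidates the reranker scores at or above the threshold."""
    scored = [(rerank_score(doc_id), doc_id) for doc_id in candidates]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in kept[:top_k]]


fused = rrf_fuse([["a", "b", "c"], ["b", "a"], ["b", "d"]])
print(fused)  # ['b', 'a', 'd', 'c']
```

Document "b" wins because it appears high in all three lists, even though it tops only two of them. If the gate rejects every candidate, the result is an empty list rather than a weak match.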

Real-Time Progress Tracking

A user connects their Google Drive. 50,000 files. The indexing could take minutes.

We show them live progress:

Stage 1 — Discovery (0–5%)
"Listing your files..."

Stage 2 — Processing (5–98.5%)
"142/300 docs · 28/50 sheets · 15/45 PDFs"

Each worker calls update_google_background_progress() after finishing a document. Real-time updates via Ably.

Stage 3 — Finalization (98.5–100%)
Progress is capped at 98.5% until all 3 worker pools confirm they're done. Distributed join point. No "100% done but still processing."
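The three stages map onto a single percentage roughly like this. The function and its arguments are hypothetical; only the 5% and 98.5% boundaries and the three-pool join come from the description above.

```python
def progress_percent(discovered: bool, processed: int, total: int,
                     pools_finished: int, pool_count: int = 3) -> float:
    """Map pipeline state to the percentage shown to the user."""
    if not discovered:
        return 0.0  # still listing files; the UI shows the 0-5% stage
    if pools_finished >= pool_count:
        return 100.0  # every worker pool has confirmed completion
    fraction = processed / total if total else 1.0
    # Scale into the 5-98.5% band and hold at the cap until the join.
    return min(98.5, 5.0 + fraction * (98.5 - 5.0))
```

Even when every document has been counted as processed, the value stays at 98.5 until the last pool reports in, which is what prevents a premature 100%.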

From Search to Action

The agent doesn't know where "the PRD" lives.

Here's the flow:

Agent: "Find the Q1 PRD and summarize it"
        ↓
Runs parallel searches:
├─ HYBRID_SEARCH(google_drive, ...) → 3 results ✓
├─ HYBRID_SEARCH(notion, ...) → 1 result ✓
└─ HYBRID_SEARCH(linear, ...) → 0 results

Results tell agent: "Found in google_docs and notion"
        ↓
Agent calls ADD_INTEGRATIONS(["google_docs", "notion"])
        ↓
Agent fetches full content (not just vectors/chunks)
        ↓
Agent summarizes and cites

This is the lazy tool loading pattern. Results carry an integrations list. Tools are only loaded if search proves they're needed.

The search returns metadata and chunks (enough to assess relevance), not full documents. Full content is fetched only on demand.
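In code, the pattern is a parallel search followed by a conditional tool load. The sketch below takes the search and loader as injected functions; search_then_load, the "integration" key on each hit, and both callables are assumptions for illustration rather than the agent's real API.

```python
import asyncio


async def search_then_load(query, sources, hybrid_search, add_integrations):
    """Search all sources in parallel, then load only the tools that had hits."""
    per_source = await asyncio.gather(
        *(hybrid_search(source, query) for source in sources)
    )
    hits = [hit for results in per_source for hit in results]
    # Distinct integrations, in the order they first appeared.
    needed = list(dict.fromkeys(hit["integration"] for hit in hits))
    if needed:
        add_integrations(needed)
    return hits, needed
```

A source that returns nothing, like Linear in the flow above, never causes its tools to be loaded.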

Why This Matters

Most teams build integration A, then integration B, then duct tape them together. Each one is a special case.

We built one indexing architecture that scales horizontally. Add a new platform? Wire up a fetcher, a parser, and a NATS consumer. The rest works.

Specialized workers keep costs down (don't make PDFs compete with text).
Scale-to-zero means idle capacity costs nothing.
Namespaced vectors mean multi-tenant is the default.
Hybrid search means you find what you're actually looking for.
Lazy tool loading means the agent only loads what it needs.

The result: an agent that can search across 8+ platforms, find the right document regardless of where it lives, and cite it correctly.

This powers Dimension's indexing and search layer across 8+ integrations for thousands of users daily.