Context Engine: Indexing Architecture

April 22, 2026

Production system indexing Google Drive, Gmail, Outlook, Notion, Linear, Airtable, Dropbox, and more. Consumer-driven pipelines with scale-to-zero workers and hybrid search.

  • Integrations: 8+ providers
  • Built: Lead engineer, 10 months
  • Stack: NATS · TurboPuffer · KEDA

01 — Eight Integrations, One Architecture

Each integration has a dedicated indexing system with its own NATS consumers, fetcher, parser, and vector store schema. They share the same infrastructure (NATS JetStream, TurboPuffer, KEDA, Voyage embeddings) but handle integration-specific APIs, data models, and sync strategies independently.

IntegrationWhat's IndexedSync Strategy
Google DriveDocs, Sheets, Slides, PDFs, Word, PPT, Excel, CSV, Markdown, TXTFull sync + Drive webhook, fan-out to 3 specialized workers
GmailEmails (1 vector per message), relevance-scored1-year history + Pub/Sub push, VIP relationship extraction
OutlookEmail messages, calendar eventsFull sync + Graph subscriptions, per-calendar webhooks
OneDrivePDFs, Word, Excel, CSV, PPT, Markdown, TXTGraph delta queries + subscription webhooks, fan-out to 3 workers
NotionPages (recursive block tree crawl, max 5000 blocks per page)Full sync + per-page webhooks (created/updated/deleted/restored)
LinearIssues, projects, documents, commentsFull sync + per-entity webhooks, 4 separate namespace types
AirtableBase schemas and table structure metadataFull sync + webhook, schema hashing for change detection
DropboxPDFs, Word, Excel, CSV, PPT, Markdown, TXTFull sync + delta cursor webhook, fan-out to 3 specialized workers

Note: The rest of this document uses Google Drive as the running example. The patterns shown here — consumer-driven ingestion, diff-based sync, scale-to-zero workers, namespace-isolated vector storage — are the same across all integrations.


02 — The Ingestion Pipeline

A user connects Google Drive. A NATS consumer fires the orchestrator (GoogleFetcher), which discovers every file, diffs against what's already indexed, classifies by MIME type, and publishes to three specialized worker queues. Each worker scales independently from zero.

  ┌───────────┐  ┌──────────┐
   Full Sync    Webhook  
   (User        (File    
   connects)    changed) 
  └───────────┘  └──────────┘
         
  ┌─────────────────────────────┐
   NATS JetStream              
   GoogleDriveSyncWorkspace    
   Consumer · WebhookConsumer  
  └─────────────────────────────┘
         
  ┌─────────────────────────────┐
   GoogleFetcher               
   1. Discovery                
   2. Diff                     
   3. Classify                 
   4. Fan-out                  
  └─────────────────────────────┘
         
  ┌──────────┐ ┌───────────┐ ┌──────────────┐
   PDF        Text        Tabular      
   Worker     Worker      Worker       
   Gemini     750m CPU    ~5K row      
   vision     / 1Gi RAM   batch        
   Max 10     Max 5       streaming    
   pods       pods        Max 5 pods   
  └──────────┘ └───────────┘ └──────────────┘
         
  ┌─────────────────────────────┐
   TurboPuffer                 
   Namespace per Drive ·       
   User-level ACL ·            
   Voyage AI embeddings        
  └─────────────────────────────┘

KEDA watches NATS queue depth and spins pods only when there's work. All workers run on GKE spot instances (60–91% discount) with 15-second termination grace periods.


03 — Document Discovery & Fan-Out

The GoogleFetcher Lifecycle

01 — Discovery
Query Drive API (My Drive + Shared Drives), filter by MIME type, resolve shortcuts, build folder paths, extract collaborators.

02 — Diff
Query TurboPuffer for existing doc IDs. Compare and classify: to_connect (new), to_delete (removed), to_update (metadata changed).

03 — Classify & Limit
Apply tier-based caps (Free: 1K, Plus: 5K, Pro: 10K, Max: 25K files). Keep most-recently-modified. Track cost per connection.

04 — Publish
Route new docs to PDF / text / tabular NATS queues. Delete removed docs from vector store. Patch metadata-only changes in-place.

MIME Classification

Worker QueueDocument Types
PDFapplication/pdf
TextGoogle Docs, Slides, Word, PowerPoint, Markdown, Plain Text
TabularGoogle Sheets, Excel (.xlsx/.xls), CSV

Why Three Queues? Each document type has different resource profiles. PDFs need Gemini vision for page-by-page extraction — CPU and memory heavy. Sheets stream tab-by-tab in batches of ~5,000 rows. Docs need API-specific tree parsing. Separating them means a burst of 500 PDFs doesn't starve text processing.


04 — Scale-to-Zero Workers

Zero replicas is the default state. KEDA polls each NATS JetStream consumer every 15 seconds. When pending messages exceed the lag threshold (25), KEDA scales up the deployment. When the queue drains and stays empty for 5 minutes, it scales back to zero. No idle pods burning compute.

  • PDF Pods: 0→10 — Gemini Vision · heavy CPU/RAM
  • Text Pods: 0→5 — 750m CPU / 1Gi RAM
  • Tabular Pods: 0→5 — ~5K row batch streaming

Message Delivery Guarantees

OutcomeActionEffect
Successmsg.ack()Message removed from queue
Bad datamsg.ack()Skip — retrying won't fix it
OAuth revokedmsg.ack()Permanent failure — user must re-auth
Transient errormsg.nak(delay=7200)Retry in 2 hours

Worker Isolation: Each worker type has independent resource limits and max replica counts. A burst of heavy PDFs (Gemini vision + image rendering) can't exhaust memory for lighter text processing. Kubernetes resource requests guarantee each pod gets dedicated CPU and memory. All workers run on GKE spot instances with 15-second termination grace periods for preemption.


05 — Vector Store & Multi-Tenancy

Each Google Drive (personal or shared) maps to one TurboPuffer namespace. This keeps vector search scoped — querying a user's documents doesn't scan every document in the system. The namespace identifier comes from the Drive API (drive ID for shared drives, account identifier for personal drives).

Vector Schema

FieldPurpose
document_idGoogle Drive file ID
user_idsUsers with access — ACL array, filtered at query time
embeddingVoyage AI vector (Voyage-4)
contentChunk text (searchable via BM25)
title_ngramsEdge n-grams (1–10 chars) for fuzzy title matching
folder_path["My Drive", "Projects", "Q1"] — folder hierarchy
urlwebViewLink for citations in agent responses

Dedup Strategy

Workers check if a doc exists by document_id before processing. If it exists, add current user to user_ids and skip. Prevents duplicate Gemini/embedding costs on webhook storms.

Disconnect Handling

When a user disconnects: if they're the only user with access, delete the document. If others still have access, remove their user ID but keep the document. Clean, zero-orphan teardown.


06 — Incremental Sync via Webhooks

After the initial full sync, Google Drive's changes API fires on views, shares, renames, moves, metadata edits, and actual content changes. The smart diff layer prevents reprocessing on every event. Without it, a popular shared folder would trigger constant re-indexing from read-only activity.

Initial Sync

  • User connects Google Drive
  • GoogleFetcher.main()
  • List ALL files, diff, index
  • Register 7-day webhook channel

Incremental Sync

  • Drive API detects file change
  • Webhook fires → NATS consumer
  • handle_webhook_sync()
  • Fetch changes since last page token

Smart Diffing: Mutable vs Immutable

Document TypeOn ChangeWhy
Docs, Sheets, SlidesCompare updated_at → re-index if changedMutable — edits change content
PDFs, Office, text, CSVPatch user_ids only — never re-indexImmutable once uploaded

A PDF shared with 10 new team members triggers 10 lightweight user_ids patches, not 10 full Gemini vision re-extractions.

Channel Rotation

Google Drive webhook channels expire after 7 days. A dedicated NATS consumer (GoogleDriveRotateExpiringChannelConsumer) handles rotation automatically.

The page token is the bridge between old and new channels. It marks "changes since this point" — the new channel picks up exactly where the old one stopped. No gap in coverage.


07 — Cost Control & Real-Time Progress

Embedding costs scale linearly with document count. A free user connecting a 50,000-file corporate Drive shouldn't generate $200 in Voyage AI costs. Tier limits cap files indexed per integration, selecting the most-recently-modified so users get the most relevant content within budget.

PlanTotal FilesDocsSheetsSlides
Free1,0002505050
Plus5,000750250250
Pro10,0001,500500500
Max25,0003,0001,0001,000

Cost tracking is per-connection, not per-user. A user with two Google accounts gets independent budgets for each. If one hits its limit mid-sync, that connection stops — others continue unaffected.

Phased Progress Tracking

01 — Discovery — 0–5%
GoogleFetcher lists files from the Drive API. Quick for small drives, may take minutes for large workspaces.

02 — Processing — 5–98.5%
Per-entity tracking via Ably: "142/300 docs · 28/50 sheets · 15/45 PDFs." Each worker calls update_google_background_progress() after finishing a document.

03 — Finalization — 98.5–100%
Progress capped at 98.5% until finalize_google_background_job() confirms all registered jobs are complete. Distributed join point across 3 worker pools.


08 — Hybrid Search Architecture

Every search runs as a multi_query — multiple retrieval legs in a single TurboPuffer call. The agent provides a vector_query (semantic) and a bm25_query (keywords). Both execute across all user namespaces for the provider in parallel.

  HYBRID_SEARCH (provider, vector_query, bm25_query)
                    
  ┌───────────┐ ┌───────────┐ ┌───────────┐
  │Namespace 1│ │Namespace 2│ │Namespace N│
  │multi_query│ │multi_query│ │multi_query│
  └───────────┘ └───────────┘ └───────────┘
                    
  Dedup by chunk_id  Reciprocal Rank Fusion (k=60)
                    
  Voyage rerank-2.5-lite  Threshold: 0.4  Top 10

Query Legs per Provider

ProviderVector ANNBM25 Leg 1BM25 Leg 2
Google Driveembeddingcontentdocument_name
Gmailembeddingsubjectcontent
Notionembeddingcontentpage_title
Linearembeddingtitledescription
GitHubembeddingcontentpath + repo_name
Dropboxembeddingcontentfile_name

Why RRF + Reranking? Vector misses exact keywords. BM25 misses semantics. RRF fuses both. Voyage cross-encoder scores true relevance. Below 0.4? Returns nothing rather than noise.

Access Control: Every query filters by user_ids Contains user_id. Unauthorized docs never enter the ranking pipeline. All embeddings use Voyage-4.


09 — Search to Action Bridge

The agent doesn't know where a document lives. "The PRD" could be in Google Docs, Notion, or Linear. So the agent runs parallel searches across likely providers, picks the most relevant result regardless of source, and lazily loads only the tools needed to fetch full content.

  "Find the PRD for Project Alpha and summarize it"
                    
  ┌─────────────┐ ┌──────────┐ ┌─────────────┐
  │HYBRID_SEARCH│ │HYBRID     │HYBRID_SEARCH│
  │google-drive  │SEARCH     │linear-docs  
  │3 results    │notion     │0 results    
                │1 result ✓│              
  └─────────────┘ └──────────┘ └─────────────┘
                    
  Results include: integrations: ["google_docs", "notion"]
                    
  ADD_INTEGRATIONS      Fetch Full Content     Respond to User
  Load google_docs       GOOGLE_DOCS_GET(id)     Summarize, cite, act
  + notion

Pointers, Not Content

Search returns metadata and chunks (enough to assess relevance), not full documents. Full content is fetched only when needed, via integration-specific tools.

Lazy Tool Loading

Results carry an integrations list telling the agent which tools to load. No tools are loaded until search proves they're needed.


This architecture powers dimension.dev's context engine — indexing 8+ integrations and serving hybrid search for thousands of users daily.