IndexFoundry-MCP
Deterministic Vector Index Factory - An MCP server for automated, auditable vector database creation from any content source, with deployable project-based workflows.
Tools don't think, they act.
Every tool in this server is:
- Deterministic: Same inputs → same outputs
- Idempotent: Re-running produces identical artifacts (unless force: true)
- Auditable: Every operation produces manifests, hashes, and logs
- Composable: Tools can be run independently or chained
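The idempotency rule above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory record of completed steps and a hypothetical runStep helper; neither is part of IndexFoundry's actual API, and the real server persists artifacts on disk rather than in a Map.

```typescript
// Hypothetical record of completed steps, keyed by a caller-supplied step key.
const completedSteps = new Map<string, string>();

interface StepOutcome {
  artifact: string;
  skipped: boolean; // true when an earlier result was reused
}

// Re-running a step returns the stored artifact unless force is set.
function runStep(key: string, produce: () => string, force = false): StepOutcome {
  const cached = completedSteps.get(key);
  if (cached !== undefined && !force) {
    return { artifact: cached, skipped: true };
  }
  const artifact = produce();
  completedSteps.set(key, artifact);
  return { artifact, skipped: false };
}
```

Calling runStep twice with the same key reuses the first artifact; passing force as true recomputes it, which mirrors the force: true escape hatch described above.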
Architecture
IndexFoundry provides two complementary workflows:
1. Run-Based Pipeline (Fine-Grained Control)
Individual pipeline runs with isolated artifacts, suitable for experimentation and detailed auditing.
2. Project-Based Workflow (Deployable RAG Applications)
Self-contained projects that generate deployment-ready repositories with MCP server, Dockerfile, and Railway configuration.
Pipeline Phases (Run-Based)
Connect     → Extract     → Normalize   → Index       → Serve
   ↓             ↓             ↓             ↓             ↓
raw/          extracted/    normalized/   indexed/      served/
Phase 1: Connect
Fetch content from URLs, sitemaps, folders, or PDFs. Every artifact gets a content hash.
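As a rough illustration of content hashing, the sketch below derives a content-addressed path of the form raw/<sha256>.<ext> from an artifact's bytes. The artifactPath helper is hypothetical, and the exact naming scheme is an assumption based on the raw/<sha256>.pdf paths used in the examples in this README.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: name an artifact after the SHA-256 of its content,
// so identical bytes always map to the same path.
function artifactPath(content: string | Uint8Array, extension: string): string {
  const digest = createHash("sha256").update(content).digest("hex");
  return `raw/${digest}.${extension}`;
}
```

Because the path depends only on the bytes, fetching the same document twice yields the same file name, which makes duplicates and changes easy to detect.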
Phase 2: Extract
Convert raw bytes to text using pinned extractors (pdfminer, cheerio, etc.).
Phase 3: Normalize
Chunk text deterministically, enrich metadata (no LLM), and deduplicate.
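To make the chunking and deduplication steps concrete, here is a minimal sketch. It uses fixed-size windows with overlap as a simpler stand-in for the recursive strategy, and exact-match deduplication only; the chunkFixed and dedupeExact names and the Chunk shape are assumptions for illustration, not the server's actual implementation.

```typescript
interface Chunk {
  index: number; // position of the chunk in the output list
  start: number; // character offset in the source text
  text: string;
}

// Fixed-window chunking: each window holds at most maxChars characters and
// shares overlapChars characters with the previous window.
function chunkFixed(text: string, maxChars: number, overlapChars: number): Chunk[] {
  if (maxChars <= 0 || overlapChars < 0 || overlapChars >= maxChars) {
    throw new RangeError("require maxChars > 0 and 0 <= overlapChars < maxChars");
  }
  const step = maxChars - overlapChars;
  const chunks: Chunk[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push({ index: chunks.length, start, text: text.slice(start, start + maxChars) });
    if (start + maxChars >= text.length) break; // this window reached the end
  }
  return chunks;
}

// Exact dedupe: keep the first occurrence of each distinct chunk text.
function dedupeExact(chunks: Chunk[]): Chunk[] {
  const seen = new Set<string>();
  return chunks.filter((chunk) => {
    if (seen.has(chunk.text)) return false;
    seen.add(chunk.text);
    return true;
  });
}
```

For example, chunkFixed("abcdefghij", 4, 1) yields the texts "abcd", "defg", and "ghij". The function has no randomness or clock dependence, so the same text and parameters always produce the same chunks. Note that slice counts UTF-16 code units, not user-visible characters.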
Phase 4: Index
Generate embeddings with a pinned model, upsert to vector DB.
Phase 5: Serve
Generate OpenAPI spec and optionally start a retrieval API.
Quick Start
# Install dependencies
npm install
# Build
npm run build
# Run on stdio (for Claude Desktop, Cline, etc.)
npm start
# Run as HTTP server
npm run start:http
Workflow Options
Option 1: Run-Based Pipeline (Detailed Control)
Use individual pipeline tools for fine-grained control over each phase:
// Create a new run
const runId = crypto.randomUUID();
await client.callTool("indexfoundry_connect_folder", {
run_id: runId,
path: "/path/to/documents",
glob: "**/*.pdf"
});
// Extract PDF content
await client.callTool("indexfoundry_extract_pdf", {
run_id: runId,
pdf_path: "raw/<sha256>.pdf",
mode: "layout"
});
// Chunk the extracted text
await client.callTool("indexfoundry_normalize_chunk", {
run_id: runId,
input_paths: ["extracted/<sha256>.pages.jsonl"],
strategy: "recursive",
max_chars: 1500,
overlap_chars: 150
});
// Generate embeddings
await client.callTool("indexfoundry_index_embed", {
run_id: runId,
model: {
provider: "openai",
model_name: "text-embedding-3-small",
api_key_env: "OPENAI_API_KEY"
}
});
// Upsert to vector DB
await client.callTool("indexfoundry_index_upsert", {
run_id: runId,
provider: "local",
connection: { collection: "my_docs" }
});
Option 2: Project-Based Workflow (Deployable RAG)
Create a self-contained, deployable RAG application:
// Create a new project
await client.callTool("indexfoundry_project_create", {
project_id: "my-rag-app",
name: "My RAG Search",
description: "Searchable knowledge base for documentation",
embedding_model: {
provider: "openai",
model_name: "text-embedding-3-small",
api_key_env: "OPENAI_API_KEY"
},
chunk_config: {
strategy: "recursive",
max_chars: 1500,
overlap_chars: 150
}
});
// Add data sources
await client.callTool("indexfoundry_project_add_source", {
project_id: "my-rag-app",
url: "https://docs.example.com",
source_name: "Documentation Site",
tags: ["docs", "api"]
});
// Build the vector database
await client.callTool("indexfoundry_project_build", {
project_id: "my-rag-app"
});
// Query the built index
await client.callTool("indexfoundry_project_query", {
project_id: "my-rag-app",
query: "How do I configure authentication?",
mode: "hybrid",
top_k: 5
});
// Export for deployment
await client.callTool("indexfoundry_project_export", {
project_id: "my-rag-app",
server_name: "my-rag-server",
include_http: true,
railway_config: true
});
After export, the project directory contains a complete deployable repository:
- Dockerfile - Container configuration
- railway.toml - Railway deployment config
- src/index.ts - Generated MCP server with search tools
- README.md - Project-specific documentation
Push to GitHub and deploy:
cd projects/my-rag-app
git init
git add .
git commit -m "Initial RAG application"
git branch -M main
git remote add origin <your-repo-url>
git push -u origin main
# Then connect to Railway and deploy
Tool Overview
Run-Based Pipeline Tools
Connect Phase
- indexfoundry_connect_url - Fetch a single URL with domain allowlisting
- indexfoundry_connect_sitemap - Crawl a sitemap with URL filtering
- indexfoundry_connect_folder - Load local files with glob patterns
- indexfoundry_connect_pdf - Fetch PDF with metadata extraction
Extract Phase
- indexfoundry_extract_pdf - PDF to text (layout/plain/OCR modes)
- indexfoundry_extract_html - HTML to clean text with structure preservation
- indexfoundry_extract_document - Generic document extraction (markdown, txt, CSV, JSON)
Normalize Phase
- indexfoundry_normalize_chunk - Split text into chunks (recursive/paragraph/heading/page/sentence/fixed)
- indexfoundry_normalize_enrich - Add metadata (language detection, regex tags, section classification)
- indexfoundry_normalize_dedupe - Remove duplicates (exact/simhash/minhash)
Index Phase
- indexfoundry_index_embed - Generate embeddings (OpenAI/Cohere/sentence-transformers/local)
- indexfoundry_index_upsert - Write to vector DB (Pinecone/Weaviate/Qdrant/Milvus/Chroma/local)
- indexfoundry_index_build_profile - Configure retrieval (top_k, hybrid search, reranking)
Serve Phase
- indexfoundry_serve_openapi - Generate OpenAPI 3.1 specification
- indexfoundry_serve_start - Start HTTP search API server
- indexfoundry_serve_stop - Stop running API server
- indexfoundry_serve_status - Get server status
- indexfoundry_serve_query - Query running server directly
Run Utilities
- indexfoundry_run_status - Get detailed status of a run
- indexfoundry_run_list - List all runs with filtering
- indexfoundry_run_diff - Compare two runs (config, chunks, timing)
- indexfoundry_run_cleanup - Delete old runs with retention policies
Project-Based Workflow Tools
Project Management
- indexfoundry_project_create - Create a new project with embedding and chunk config
- indexfoundry_project_list - List all projects with optional statistics
- indexfoundry_project_get - Get project details, manifest, and sources
- indexfoundry_project_delete - Delete a project (requires confirm: true)
Source Management
- indexfoundry_project_add_source - Add data source (url/sitemap/folder/pdf) with tags
Build & Query
- indexfoundry_project_build - Process all pending sources (fetch, chunk, embed, upsert)
- indexfoundry_project_query - Search project's vector database (semantic/keyword/hybrid)
Deployment
- indexfoundry_project_export - Generate deployment files (Dockerfile, MCP server, railway.toml)
Directory Structures
Run-Based Structure
runs/<run_id>/
├── manifest.json # Master audit trail
├── config.json # Frozen config
├── raw/ # Fetched artifacts
├── extracted/ # Text extraction
├── normalized/ # Chunks
├── indexed/ # Embeddings
├── served/ # API artifacts
└── logs/ # Event logs
Project-Based Structure
projects/<project_id>/
├── project.json # Project manifest (embedding config, stats)
├── sources.jsonl # Source records (url/sitemap/folder/pdf)
├── data/
│ ├── chunks.jsonl # Indexed chunks
│ └── vectors.jsonl # Generated embeddings
├── runs/ # Per-source build runs
├── src/
│ └── index.ts # Generated MCP server
├── Dockerfile # Container configuration
├── railway.toml # Railway deployment config
├── package.json # Server dependencies
├── tsconfig.json # TypeScript config
└── README.md # Project documentation
Configuration
Environment Variables
# Run-based pipeline
INDEXFOUNDRY_RUNS_DIR=./runs # Where to store runs
# Embeddings
OPENAI_API_KEY=sk-... # For OpenAI embeddings
EMBEDDING_API_KEY=sk-... # Generic env variable (configurable per project)
# Server
PORT=3000 # For HTTP transport
TRANSPORT=stdio # stdio or http
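A minimal sketch of how these variables might be read and validated at startup follows. The loadConfig helper and ServerConfig shape are hypothetical, and treating the example values above (./runs, 3000, stdio) as defaults is an assumption rather than documented behavior.

```typescript
type Transport = "stdio" | "http";

interface ServerConfig {
  runsDir: string;
  port: number;
  transport: Transport;
}

// Accept only the two transports listed above; fall back to stdio when unset (assumed default).
function parseTransport(value: string | undefined): Transport {
  if (value === undefined || value === "stdio") return "stdio";
  if (value === "http") return "http";
  throw new Error(`TRANSPORT must be "stdio" or "http", got "${value}"`);
}

// Hypothetical loader: pass process.env in a real Node.js process.
function loadConfig(env: Record<string, string | undefined>): ServerConfig {
  const port = Number(env.PORT ?? "3000");
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`PORT must be an integer between 1 and 65535, got "${env.PORT}"`);
  }
  return {
    runsDir: env.INDEXFOUNDRY_RUNS_DIR ?? "./runs",
    port,
    transport: parseTransport(env.TRANSPORT),
  };
}
```

Failing fast on a malformed PORT or TRANSPORT surfaces configuration mistakes before any run artifacts are written.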
Project Configuration
Projects store configuration in project.json:
{
"project_id": "my-rag",
"name": "My RAG Search",
"embedding_model": {
"provider": "openai",
"model_name": "text-embedding-3-small",
"api_key_env": "OPENAI_API_KEY"
},
"chunk_config": {
"strategy": "recursive",
"max_chars": 1500,
"overlap_chars": 150
}
}
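The shape of project.json can be expressed as TypeScript types with a small validator, sketched below. The interfaces mirror only the fields shown in the example above; the parseProjectConfig helper and the rule that overlap_chars must be smaller than max_chars are assumptions, and the real manifest may carry additional fields.

```typescript
interface EmbeddingModel {
  provider: string;
  model_name: string;
  api_key_env: string; // name of the environment variable holding the key, not the key itself
}

interface ChunkConfig {
  strategy: string;
  max_chars: number;
  overlap_chars: number;
}

interface ProjectConfig {
  project_id: string;
  name: string;
  embedding_model: EmbeddingModel;
  chunk_config: ChunkConfig;
}

// Hypothetical validator: parse project.json text and report every problem at once.
function parseProjectConfig(json: string): ProjectConfig {
  const parsed: unknown = JSON.parse(json);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("project.json must contain a JSON object");
  }
  const raw = parsed as Partial<ProjectConfig>;
  const problems: string[] = [];

  if (typeof raw.project_id !== "string" || raw.project_id.length === 0) {
    problems.push("project_id must be a non-empty string");
  }
  if (typeof raw.name !== "string") {
    problems.push("name must be a string");
  }

  const model = raw.embedding_model;
  if (!model || typeof model.provider !== "string" ||
      typeof model.model_name !== "string" || typeof model.api_key_env !== "string") {
    problems.push("embedding_model needs provider, model_name, and api_key_env strings");
  }

  const chunk = raw.chunk_config;
  if (!chunk || typeof chunk.strategy !== "string" ||
      typeof chunk.max_chars !== "number" || typeof chunk.overlap_chars !== "number") {
    problems.push("chunk_config needs strategy, max_chars, and overlap_chars");
  } else if (chunk.overlap_chars >= chunk.max_chars) {
    problems.push("overlap_chars must be smaller than max_chars"); // assumed constraint
  }

  if (problems.length > 0) {
    throw new Error(`Invalid project.json: ${problems.join("; ")}`);
  }
  return raw as ProjectConfig;
}
```

Storing only the variable name in api_key_env keeps the secret itself out of the exported repository.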
Example Usage
Run-Based Pipeline Example
// Create a new run
const runId = crypto.randomUUID();
// Connect: fetch from folder
await client.callTool("indexfoundry_connect_folder", {
run_id: runId,
path: "/path/to/documents",
glob: "**/*.pdf"
});
// Extract: PDF to text
await client.callTool("indexfoundry_extract_pdf", {
run_id: runId,
pdf_path: "raw/<sha256>.pdf",
mode: "layout"
});
// Normalize: chunk text
await client.callTool("indexfoundry_normalize_chunk", {
run_id: runId,
input_paths: ["extracted/<sha256>.pages.jsonl"],
strategy: "recursive",
max_chars: 1500,
overlap_chars: 150
});
// Index: generate embeddings
await client.callTool("indexfoundry_index_embed", {
run_id: runId,
model: {
provider: "openai",
model_name: "text-embedding-3-small",
api_key_env: "OPENAI_API_KEY"
}
});
// Upsert to local vector DB
await client.callTool("indexfoundry_index_upsert", {
run_id: runId,
provider: "local",
connection: { collection: "my_docs" }
});
// Serve: start HTTP API
await client.callTool("indexfoundry_serve_start", {
run_id: runId,
port: 8080
});
Project-Based Workflow Example
// Create a deployable RAG project
await client.callTool("indexfoundry_project_create", {
project_id: "my-docs-rag",
name: "Company Documentation Search",
description: "Searchable knowledge base for internal docs",
embedding_model: {
provider: "openai",
model_name: "text-embedding-3-small",
api_key_env: "OPENAI_API_KEY"
},
chunk_config: {
strategy: "recursive",
max_chars: 1500,
overlap_chars: 150
}
});
// Add multiple sources
await client.callTool("indexfoundry_project_add_source", {
project_id: "my-docs-rag",
url: "https://docs.company.com",
source_name: "Main Docs",
tags: ["docs", "internal"]
});
await client.callTool("indexfoundry_project_add_source", {
project_id: "my-docs-rag",
folder_path: "/path/to/pdfs",
source_name: "Policy Documents",
tags: ["policy", "pdf"]
});
// Build the vector database
await client.callTool("indexfoundry_project_build", {
project_id: "my-docs-rag"
});
// Query the index
const results = await client.callTool("indexfoundry_project_query", {
project_id: "my-docs-rag",
query: "What is the vacation policy?",
mode: "hybrid",
top_k: 5,
filter_tags: ["policy"]
});
// Export for deployment
await client.callTool("indexfoundry_project_export", {
project_id: "my-docs-rag",
server_name: "docs-search-server",
server_description: "Internal documentation search API",
include_http: true,
railway_config: true
});
After export, the project directory contains a deployable repository:
cd projects/my-docs-rag
git init
git add .
git commit -m "Initial RAG application"
git branch -M main
git remote add origin <your-repo-url>
git push -u origin main
# Deploy on Railway
Development
# Development with watch mode
npm run dev
# Run tests (single run)
npm test
# Run tests (watch mode)
npm run test:watch
# Lint
npm run lint
# Test with MCP Inspector
npm run inspector
Testing
The MCP server has been validated with end-to-end testing:
- ✅ Project creation, listing, and retrieval
- ✅ Source addition (URL, folder, PDF, sitemap)
- ✅ Build pipeline (fetch → chunk → embed → upsert)
- ✅ Vector search with semantic, keyword, and hybrid modes
- ✅ Deployment file generation (Dockerfile, railway.toml, MCP server)
Deployment
Railway Deployment
- Create and export a project:
await client.callTool("indexfoundry_project_export", {
project_id: "my-rag",
railway_config: true
});
- Push to GitHub and connect to Railway
- Railway automatically detects railway.toml and deploys
Docker Deployment
cd projects/my-rag
docker build -t my-rag-server .
docker run -p 8080:8080 -e OPENAI_API_KEY=sk-... my-rag-server
Determinism Guarantees
- Sorted inputs: File lists sorted before processing
- Stable IDs: Chunk IDs derived from content + position
- Content hashes: SHA256 on every artifact
- Pinned versions: Extractor versions locked in config
- No randomness: No sampling, shuffling, or non-deterministic algorithms
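Two of the guarantees above, sorted inputs and stable IDs, can be sketched as follows. The sortedInputs and chunkId helpers are hypothetical, and the exact fields hashed into an ID (source hash, position, content) are an assumption about what "content + position" means here.

```typescript
import { createHash } from "node:crypto";

// Sort a copy of the file list. The default sort compares UTF-16 code units,
// so the order does not depend on the machine's locale.
function sortedInputs(paths: string[]): string[] {
  return [...paths].sort();
}

// Derive a chunk ID from the source hash, the chunk position, and the chunk text.
// The same inputs always give the same 64-character hex ID.
function chunkId(sourceHash: string, position: number, content: string): string {
  return createHash("sha256")
    .update(`${sourceHash}\n${position}\n`, "utf8")
    .update(content, "utf8")
    .digest("hex");
}
```

Because the ID depends only on its inputs, two runs over unchanged content produce the same IDs, which is what makes a comparison such as indexfoundry_run_diff meaningful at the chunk level.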
License
PROPRIETARY SOFTWARE LICENSE
