Crawl4AI RAG MCP Server
A high-performance Retrieval-Augmented Generation (RAG) system using Crawl4AI for web content extraction, sqlite-vec for vector storage, and MCP integration for AI assistants.
Summary
This system provides a production-ready RAG solution that combines:
- Crawl4AI for intelligent web content extraction with markdown conversion
- SQLite with sqlite-vec for vector storage and semantic search
- RAM Database Mode for 10-50x faster query performance
- MCP Server for AI assistant integration (LM Studio, Claude Desktop, etc.)
- REST API for bidirectional communication and remote access
- Security Layer with input sanitization and domain blocking
Quick Start
Option 1: Local Development
- Clone and set up:

```bash
git clone https://github.com/Rob-P-Smith/mcpragcrawl4ai.git
cd mcpragcrawl4ai
python3 -m venv .venv
source .venv/bin/activate  # Linux/Mac
pip install -r requirements.txt
```
- Start the Crawl4AI service:

```bash
docker run -d --name crawl4ai -p 11235:11235 unclecode/crawl4ai:latest
```
- Configure environment (these variables are read at startup; see the sketch after these steps):

```bash
# Create .env file
cat > .env << EOF
IS_SERVER=true
USE_MEMORY_DB=true
LOCAL_API_KEY=dev-api-key
CRAWL4AI_URL=http://localhost:11235
EOF
```
- Run the MCP server:

```bash
python3 core/rag_processor.py
```
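The variables in the .env above configure server mode, RAM database mode, the local API key, and the Crawl4AI endpoint. As a minimal sketch of how they might be consumed at startup (assuming python-dotenv is available; the actual loader in core/rag_processor.py may differ):

```python
# Hypothetical sketch of reading the .env values above; the real
# startup code may load and validate them differently.
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current working directory

IS_SERVER = os.getenv("IS_SERVER", "false").lower() == "true"
USE_MEMORY_DB = os.getenv("USE_MEMORY_DB", "false").lower() == "true"
LOCAL_API_KEY = os.getenv("LOCAL_API_KEY", "")
CRAWL4AI_URL = os.getenv("CRAWL4AI_URL", "http://localhost:11235")

print(f"server={IS_SERVER} ram_db={USE_MEMORY_DB} crawler={CRAWL4AI_URL}")
```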
Option 2: Docker Server Deployment
- Deploy the full server (REST API + MCP):

```bash
cd mcpragcrawl4ai
docker compose -f deployments/server/docker-compose.yml up -d
```
- Test the deployment:

```bash
curl http://localhost:8080/health
```
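For scripted deployments it can be handy to poll the health endpoint until the stack is up. A small sketch, assuming only that /health returns HTTP 200 once the server is ready (the response body format is not specified here):

```python
# Poll the deployment's health endpoint until it responds, or time out.
import time
import requests  # pip install requests

def wait_for_health(url: str = "http://localhost:8080/health",
                    timeout: float = 60.0) -> bool:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if requests.get(url, timeout=2).status_code == 200:
                return True
        except requests.ConnectionError:
            pass  # container may still be starting
        time.sleep(2)
    return False

if __name__ == "__main__":
    print("healthy" if wait_for_health() else "timed out")
```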
See the Deployment Guide for complete deployment options.
Architecture
Core Components
- MCP Server (core/rag_processor.py) - JSON-RPC 2.0 protocol handler (sketched after this list)
- RAG Database (core/data/storage.py) - SQLite + sqlite-vec vector storage with RAM mode support
- Content Cleaner (core/data/content_cleaner.py) - Navigation removal and quality filtering
- Sync Manager (core/data/sync_manager.py) - RAM database differential sync with virtual table support
- Crawler (core/operations/crawler.py) - Web crawling with DFS algorithm and content extraction
- Defense Layer (core/data/dbdefense.py) - Input sanitization and security
- REST API (api/api.py) - FastAPI server with 15+ endpoints
- Auth System (api/auth.py) - API key authentication and rate limiting
- Recrawl Utility (core/utilities/recrawl_utility.py) - Batch URL recrawling via API with concurrent processing
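For orientation, this is the general shape of the JSON-RPC 2.0 loop the MCP Server component runs over stdin/stdout. It is a simplified sketch, not the actual core/rag_processor.py code; the dispatch table and method name are illustrative only:

```python
# Simplified sketch of an MCP-style JSON-RPC 2.0 loop over stdio.
# Method names and dispatch here are illustrative, not the real handler.
import json
import sys

def handle(request: dict) -> dict:
    method = request.get("method")
    if method == "ping":  # stand-in for the real tool/method dispatch
        result = {}
    else:
        return {"jsonrpc": "2.0", "id": request.get("id"),
                "error": {"code": -32601, "message": f"Unknown method: {method}"}}
    return {"jsonrpc": "2.0", "id": request.get("id"), "result": result}

for line in sys.stdin:  # one JSON-RPC message per line
    if line.strip():
        response = handle(json.loads(line))
        sys.stdout.write(json.dumps(response) + "\n")
        sys.stdout.flush()
```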
Database Schema
- crawled_content - Web content with markdown, embeddings, and metadata
- content_vectors - Vector embeddings (sqlite-vec vec0 virtual table with rowid support; see the sketch below)
- sessions - User session tracking for temporary content
- blocked_domains - Domain blocklist with wildcard patterns
- _sync_tracker - Change tracking for RAM database differential sync (memory mode only)
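As a minimal sketch of how a vec0 virtual table like content_vectors can be created and queried with the sqlite-vec Python bindings (the exact schema and query shape used by storage.py may differ):

```python
# Create a vec0 virtual table and run a KNN query with sqlite-vec.
# Table and column names mirror the schema above but are a sketch only.
import sqlite3
import sqlite_vec  # pip install sqlite-vec

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vec.load(db)
db.enable_load_extension(False)

db.execute("CREATE VIRTUAL TABLE content_vectors USING vec0(embedding float[384])")
db.execute("INSERT INTO content_vectors(rowid, embedding) VALUES (?, ?)",
           (1, sqlite_vec.serialize_float32([0.1] * 384)))

rows = db.execute(
    "SELECT rowid, distance FROM content_vectors "
    "WHERE embedding MATCH ? ORDER BY distance LIMIT 5",
    (sqlite_vec.serialize_float32([0.1] * 384),),
).fetchall()
print(rows)
```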
Technology Stack
- Python 3.11+ with asyncio for concurrent operations
- SQLite with sqlite-vec extension for vector similarity search
- SentenceTransformers (all-MiniLM-L6-v2) for embedding generation (example below)
- langdetect for language detection and filtering
- FastAPI for REST API with automatic OpenAPI documentation
- Crawl4AI for intelligent web content extraction with fit_markdown
- Docker for containerized deployment
- aiohttp for async HTTP requests in utilities
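Embedding generation with the model named above uses the standard sentence-transformers API; each text maps to a 384-dimensional vector, matching the vec0 column width:

```python
# Generate 384-dimensional embeddings with all-MiniLM-L6-v2.
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["list comprehensions in Python",
                           "async programming best practices"])
print(embeddings.shape)  # (2, 384)
```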
Documentation
For detailed documentation, see:
- Deployment Guide - Comprehensive deployment options
- Installation Guide - Setup and configuration
- API Documentation - REST API reference
- Quick Start Guide - Get started quickly
- Troubleshooting - Common issues and solutions
- Full Documentation - Complete documentation index
Key Features
Performance
- RAM Database Mode: In-memory SQLite with differential sync for 10-50x faster queries (sketched below)
- Vector Search: 384-dimensional embeddings using all-MiniLM-L6-v2 for semantic search
- Batch Crawling: High-performance batch processing with retry logic and progress tracking
- Content Optimization: 70-80% storage reduction through intelligent cleaning and filtering
- Efficient Storage: fit_markdown conversion and content chunking for optimal retrieval
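Conceptually, RAM database mode works against an in-memory copy of the database and periodically writes changes back to disk. A minimal sketch of the load/persist halves using SQLite's backup API; the project's sync_manager.py does differential (changed-rows-only) sync rather than these full copies:

```python
# Load a disk database into RAM and persist it back via sqlite3's
# backup API. The real sync_manager.py syncs only changed rows.
import sqlite3

def load_into_memory(path: str) -> sqlite3.Connection:
    disk = sqlite3.connect(path)
    mem = sqlite3.connect(":memory:")
    disk.backup(mem)   # full copy: disk -> RAM
    disk.close()
    return mem

def persist(mem: sqlite3.Connection, path: str) -> None:
    disk = sqlite3.connect(path)
    mem.backup(disk)   # full copy: RAM -> disk
    disk.close()
```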
Functionality
- Deep Crawling: DFS-based multi-page crawling with depth and page limits (see the sketch after this list)
- Content Cleaning: Automatic removal of navigation, boilerplate, and low-quality content
- Language Filtering: Automatic detection and filtering of non-English content
- Semantic Search: Vector similarity search with tag filtering and deduplication
- Target Search: Intelligent search with automatic tag expansion
- Content Management: Full CRUD operations with retention policies and session management
- Batch Recrawling: Concurrent URL recrawling via API with rate limiting and progress tracking
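The deep-crawl behavior can be pictured as a depth-limited DFS over same-site links. A simplified synchronous sketch; the real core/operations/crawler.py is async and fetches pages through the Crawl4AI service, so the link-fetching function here is an injected stand-in:

```python
# Depth-limited DFS over page links. fetch_links is an illustrative
# stand-in for the real fetch/extract calls via the Crawl4AI service.
from urllib.parse import urljoin, urlparse

def deep_crawl(start_url: str, fetch_links, max_depth: int = 2,
               max_pages: int = 50) -> list[str]:
    """fetch_links(url) -> list of hrefs; injected so the sketch stays testable."""
    visited, stack, results = set(), [(start_url, 0)], []
    while stack and len(results) < max_pages:
        url, depth = stack.pop()  # LIFO pop -> depth-first order
        if url in visited:
            continue
        visited.add(url)
        results.append(url)
        if depth < max_depth:
            for href in fetch_links(url):
                link = urljoin(url, href)
                # stay on the starting site
                if urlparse(link).netloc == urlparse(start_url).netloc:
                    stack.append((link, depth + 1))
    return results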
Security
- Input Sanitization: Comprehensive SQL injection defense and input validation
- Domain Blocking: Wildcard-based domain blocking with social media and adult content filters (sketched below)
- API Authentication: API key-based authentication with rate limiting
- Safe Crawling: Automatic detection and blocking of forbidden content
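Wildcard domain blocking can be implemented with simple pattern matching against a URL's hostname. A sketch assuming blocklist entries like *.example.com; the example patterns are hypothetical and dbdefense.py may use different matching semantics:

```python
# Check a URL's hostname against wildcard blocklist patterns.
# Pattern semantics here are illustrative, not dbdefense.py's exact rules.
from fnmatch import fnmatch
from urllib.parse import urlparse

BLOCKED = ["facebook.com", "*.facebook.com", "*.tiktok.com"]

def is_blocked(url: str) -> bool:
    host = (urlparse(url).hostname or "").lower()
    return any(fnmatch(host, pattern) for pattern in BLOCKED)

assert is_blocked("https://www.facebook.com/some/page")
assert not is_blocked("https://docs.python.org/3/")
```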
Integration
- MCP Server: Full MCP protocol support for AI assistant integration
- REST API: Complete REST API with 15+ endpoints for all operations
- Bidirectional Mode: Server mode (hosts the API) and client mode (forwards requests to a remote server)
- Docker Deployment: Production-ready containerized deployment
Quick Usage Examples
Via MCP (in LM Studio/Claude Desktop)

```python
crawl_and_remember("https://docs.python.org/3/tutorial/", tags="python, tutorial")
search_memory("list comprehensions", tags="python", limit=5)
target_search("async programming best practices", initial_limit=5, expanded_limit=20)
get_database_stats()
```
Via REST API
```bash
# Crawl and store content
curl -X POST http://localhost:8080/api/v1/crawl/store \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.python.org/3/tutorial/", "tags": "python, tutorial"}'

# Semantic search
curl -X POST http://localhost:8080/api/v1/search \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query": "list comprehensions", "tags": "python", "limit": 5}'

# Get database stats
curl http://localhost:8080/api/v1/stats \
  -H "Authorization: Bearer YOUR_API_KEY"
```
Via Python Client
```python
import asyncio
from api.api import Crawl4AIClient

async def main():
    client = Crawl4AIClient("http://localhost:8080", "YOUR_API_KEY")
    result = await client.crawl_and_store("https://example.com", tags="example")
    search_results = await client.search("python tutorials", limit=10)
    stats = await client.get_database_stats()

asyncio.run(main())  # await is only valid inside an async function
```
Performance Metrics
With RAM database mode enabled:
- Search queries: 20-50ms (vs 200-500ms in disk mode)
- Batch crawling: 2,000+ URLs successfully processed
- Database size: 215MB (2,296 pages, 8,196 embeddings)
- Sync overhead: <100ms per differential sync (triggered after 5s of idle time, or every 5min at the latest)
- Sync reliability: 100% success rate with virtual table support
- Memory usage: ~500MB for full in-memory database
- Storage optimization: 70-80% reduction through content cleaning
