Crawl4AI RAG MCP Server
A high-performance Retrieval-Augmented Generation (RAG) system using Crawl4AI for web content extraction, sqlite-vec for vector storage, and MCP integration for AI assistants.
Summary
This system provides a production-ready RAG solution that combines:
- Crawl4AI for intelligent web content extraction with markdown conversion
- SQLite with sqlite-vec for vector storage and semantic search
- RAM Database Mode for 10-50x faster query performance
- MCP Server for AI assistant integration (LM Studio, Claude Desktop, etc.)
- REST API for bidirectional communication and remote access
- Security Layer with input sanitization and domain blocking
Quick Start
Option 1: Local Development
- Clone and set up:

```bash
git clone https://github.com/Rob-P-Smith/mcpragcrawl4ai.git
cd mcpragcrawl4ai
python3 -m venv .venv
source .venv/bin/activate  # Linux/Mac
pip install -r requirements.txt
```
- Start the Crawl4AI service:

```bash
docker run -d --name crawl4ai -p 11235:11235 unclecode/crawl4ai:latest
```
- Configure environment (these variables are read at startup; see the sketch after these steps):

```bash
# Create .env file
cat > .env << EOF
IS_SERVER=true
USE_MEMORY_DB=true
LOCAL_API_KEY=dev-api-key
CRAWL4AI_URL=http://localhost:11235
EOF
```
- Run the MCP server:

```bash
python3 core/rag_processor.py
```
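The variables in the .env above configure server mode, RAM database mode, the local API key, and the Crawl4AI endpoint. As a minimal sketch of how they might be consumed at startup (assuming python-dotenv is available; the actual loader in core/rag_processor.py may differ):

```python
# Hypothetical sketch of reading the .env values above; the real
# startup code may load and validate them differently.
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current working directory

IS_SERVER = os.getenv("IS_SERVER", "false").lower() == "true"
USE_MEMORY_DB = os.getenv("USE_MEMORY_DB", "false").lower() == "true"
LOCAL_API_KEY = os.getenv("LOCAL_API_KEY", "")
CRAWL4AI_URL = os.getenv("CRAWL4AI_URL", "http://localhost:11235")

print(f"server={IS_SERVER} ram_db={USE_MEMORY_DB} crawler={CRAWL4AI_URL}")
```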
Option 2: Docker Server Deployment
- Deploy the full server (REST API + MCP):

```bash
cd mcpragcrawl4ai
docker compose -f deployments/server/docker-compose.yml up -d
```
- Test the deployment:

```bash
curl http://localhost:8080/health
```
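For scripted deployments it can be handy to poll the health endpoint until the stack is up. A small sketch, assuming only that /health returns HTTP 200 once the server is ready (the response body format is not specified here):

```python
# Poll the deployment's health endpoint until it responds, or time out.
import time
import requests  # pip install requests

def wait_for_health(url: str = "http://localhost:8080/health",
                    timeout: float = 60.0) -> bool:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if requests.get(url, timeout=2).status_code == 200:
                return True
        except requests.ConnectionError:
            pass  # container may still be starting
        time.sleep(2)
    return False

if __name__ == "__main__":
    print("healthy" if wait_for_health() else "timed out")
```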
See the Deployment Guide for complete deployment options.
Architecture
Core Components
- MCP Server (core/rag_processor.py) - JSON-RPC 2.0 protocol handler (sketched after this list)
- RAG Database (core/data/storage.py) - SQLite + sqlite-vec vector storage with RAM mode support
- Content Cleaner (core/data/content_cleaner.py) - Navigation removal and quality filtering
- Sync Manager (core/data/sync_manager.py) - RAM database differential sync with virtual table support
- Crawler (core/operations/crawler.py) - Web crawling with DFS algorithm and content extraction
- Defense Layer (core/data/dbdefense.py) - Input sanitization and security
- REST API (api/api.py) - FastAPI server with 15+ endpoints
- Auth System (api/auth.py) - API key authentication and rate limiting
- Recrawl Utility (core/utilities/recrawl_utility.py) - Batch URL recrawling via API with concurrent processing
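For orientation, this is the general shape of the JSON-RPC 2.0 loop the MCP Server component runs over stdin/stdout. It is a simplified sketch, not the actual core/rag_processor.py code; the dispatch table and method name are illustrative only:

```python
# Simplified sketch of an MCP-style JSON-RPC 2.0 loop over stdio.
# Method names and dispatch here are illustrative, not the real handler.
import json
import sys

def handle(request: dict) -> dict:
    method = request.get("method")
    if method == "ping":  # stand-in for the real tool/method dispatch
        result = {}
    else:
        return {"jsonrpc": "2.0", "id": request.get("id"),
                "error": {"code": -32601, "message": f"Unknown method: {method}"}}
    return {"jsonrpc": "2.0", "id": request.get("id"), "result": result}

for line in sys.stdin:  # one JSON-RPC message per line
    if line.strip():
        response = handle(json.loads(line))
        sys.stdout.write(json.dumps(response) + "\n")
        sys.stdout.flush()
```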
Database Schema
- crawled_content - Web content with markdown, embeddings, and metadata
- content_vectors - Vector embeddings (sqlite-vec vec0 virtual table with rowid support; see the sketch below)
- sessions - User session tracking for temporary content
- blocked_domains - Domain blocklist with wildcard patterns
- _sync_tracker - Change tracking for RAM database differential sync (memory mode only)
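As a minimal sketch of how a vec0 virtual table like content_vectors can be created and queried with the sqlite-vec Python bindings (the exact schema and query shape used by storage.py may differ):

```python
# Create a vec0 virtual table and run a KNN query with sqlite-vec.
# Table and column names mirror the schema above but are a sketch only.
import sqlite3
import sqlite_vec  # pip install sqlite-vec

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vec.load(db)
db.enable_load_extension(False)

db.execute("CREATE VIRTUAL TABLE content_vectors USING vec0(embedding float[384])")
db.execute("INSERT INTO content_vectors(rowid, embedding) VALUES (?, ?)",
           (1, sqlite_vec.serialize_float32([0.1] * 384)))

rows = db.execute(
    "SELECT rowid, distance FROM content_vectors "
    "WHERE embedding MATCH ? ORDER BY distance LIMIT 5",
    (sqlite_vec.serialize_float32([0.1] * 384),),
).fetchall()
print(rows)
```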
Technology Stack
- Python 3.11+ with asyncio for concurrent operations
- SQLite with sqlite-vec extension for vector similarity search
- SentenceTransformers (all-MiniLM-L6-v2) for embedding generation (example below)
- langdetect for language detection and filtering
- FastAPI for REST API with automatic OpenAPI documentation
- Crawl4AI for intelligent web content extraction with fit_markdown
- Docker for containerized deployment
- aiohttp for async HTTP requests in utilities
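Embedding generation with the model named above uses the standard sentence-transformers API; each text maps to a 384-dimensional vector, matching the vec0 column width:

```python
# Generate 384-dimensional embeddings with all-MiniLM-L6-v2.
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["list comprehensions in Python",
                           "async programming best practices"])
print(embeddings.shape)  # (2, 384)
```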
Documentation
For detailed documentation, see:
- Deployment Guide - Comprehensive deployment options
- Installation Guide - Setup and configuration
- API Documentation - REST API reference
- Quick Start Guide - Get started quickly
- Troubleshooting - Common issues and solutions
- Full Documentation - Complete documentation index
Key Features
Performance
- RAM Database Mode: In-memory SQLite with differential sync for 10-50x faster queries (sketched below)
- Vector Search: 384-dimensional embeddings using all-MiniLM-L6-v2 for semantic search
- Batch Crawling: High-performance batch processing with retry logic and progress tracking
- Content Optimization: 70-80% storage reduction through intelligent cleaning and filtering
- Efficient Storage: fit_markdown conversion and content chunking for optimal retrieval
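Conceptually, RAM database mode works against an in-memory copy of the database and periodically writes changes back to disk. A minimal sketch of the load/persist halves using SQLite's backup API; the project's sync_manager.py does differential (changed-rows-only) sync rather than these full copies:

```python
# Load a disk database into RAM and persist it back via sqlite3's
# backup API. The real sync_manager.py syncs only changed rows.
import sqlite3

def load_into_memory(path: str) -> sqlite3.Connection:
    disk = sqlite3.connect(path)
    mem = sqlite3.connect(":memory:")
    disk.backup(mem)   # full copy: disk -> RAM
    disk.close()
    return mem

def persist(mem: sqlite3.Connection, path: str) -> None:
    disk = sqlite3.connect(path)
    mem.backup(disk)   # full copy: RAM -> disk
    disk.close()
```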
Functionality
- Deep Crawling: DFS-based multi-page crawling with depth and page limits (see the sketch after this list)
- Content Cleaning: Automatic removal of navigation, boilerplate, and low-quality content
- Language Filtering: Automatic detection and filtering of non-English content
- Semantic Search: Vector similarity search with tag filtering and deduplication
- Target Search: Intelligent search with automatic tag expansion
- Content Management: Full CRUD operations with retention policies and session management
- Batch Recrawling: Concurrent URL recrawling via API with rate limiting and progress tracking
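The deep-crawl behavior can be pictured as a depth-limited DFS over same-site links. A simplified synchronous sketch; the real core/operations/crawler.py is async and fetches pages through the Crawl4AI service, so the link-fetching function here is an injected stand-in:

```python
# Depth-limited DFS over page links. fetch_links is an illustrative
# stand-in for the real fetch/extract calls via the Crawl4AI service.
from urllib.parse import urljoin, urlparse

def deep_crawl(start_url: str, fetch_links, max_depth: int = 2,
               max_pages: int = 50) -> list[str]:
    """fetch_links(url) -> list of hrefs; injected so the sketch stays testable."""
    visited, stack, results = set(), [(start_url, 0)], []
    while stack and len(results) < max_pages:
        url, depth = stack.pop()  # LIFO pop -> depth-first order
        if url in visited:
            continue
        visited.add(url)
        results.append(url)
        if depth < max_depth:
            for href in fetch_links(url):
                link = urljoin(url, href)
                # stay on the starting site
                if urlparse(link).netloc == urlparse(start_url).netloc:
                    stack.append((link, depth + 1))
    return results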
Security
- Input Sanitization: Comprehensive SQL injection defense and input validation
- Domain Blocking: Wildcard-based domain blocking with social media and adult content filters (sketched below)
- API Authentication: API key-based authentication with rate limiting
- Safe Crawling: Automatic detection and blocking of forbidden content
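Wildcard domain blocking can be implemented with simple pattern matching against a URL's hostname. A sketch assuming blocklist entries like *.example.com; the example patterns are hypothetical and dbdefense.py may use different matching semantics:

```python
# Check a URL's hostname against wildcard blocklist patterns.
# Pattern semantics here are illustrative, not dbdefense.py's exact rules.
from fnmatch import fnmatch
from urllib.parse import urlparse

BLOCKED = ["facebook.com", "*.facebook.com", "*.tiktok.com"]

def is_blocked(url: str) -> bool:
    host = (urlparse(url).hostname or "").lower()
    return any(fnmatch(host, pattern) for pattern in BLOCKED)

assert is_blocked("https://www.facebook.com/some/page")
assert not is_blocked("https://docs.python.org/3/")
```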
Integration
- MCP Server: Full MCP protocol support for AI assistant integration
- REST API: Complete REST API with 15+ endpoints for all operations
- Bidirectional Mode: Server mode (hosts the API) and client mode (forwards requests to a remote server)
- Docker Deployment: Production-ready containerized deployment
Quick Usage Examples
Via MCP (in LM Studio/Claude Desktop)

```python
crawl_and_remember("https://docs.python.org/3/tutorial/", tags="python, tutorial")
search_memory("list comprehensions", tags="python", limit=5)
target_search("async programming best practices", initial_limit=5, expanded_limit=20)
get_database_stats()
```
Via REST API
```bash
# Crawl and store content
curl -X POST http://localhost:8080/api/v1/crawl/store \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.python.org/3/tutorial/", "tags": "python, tutorial"}'

# Semantic search
curl -X POST http://localhost:8080/api/v1/search \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query": "list comprehensions", "tags": "python", "limit": 5}'

# Get database stats
curl http://localhost:8080/api/v1/stats \
  -H "Authorization: Bearer YOUR_API_KEY"
```
Via Python Client
```python
import asyncio
from api.api import Crawl4AIClient

async def main():
    client = Crawl4AIClient("http://localhost:8080", "YOUR_API_KEY")
    result = await client.crawl_and_store("https://example.com", tags="example")
    search_results = await client.search("python tutorials", limit=10)
    stats = await client.get_database_stats()

asyncio.run(main())  # await is only valid inside an async function
```
Performance Metrics
With RAM database mode enabled:
- Search queries: 20-50ms (vs 200-500ms in disk mode)
- Batch crawling: 2,000+ URLs successfully processed
- Database size: 215MB (2,296 pages, 8,196 embeddings)
- Sync overhead: <100ms per differential sync (triggered after 5s of idle time, or every 5min at the latest)
- Sync reliability: 100% success rate with virtual table support
- Memory usage: ~500MB for full in-memory database
- Storage optimization: 70-80% reduction through content cleaning
