crawl-mcp-server
A comprehensive MCP (Model Context Protocol) server providing 11 powerful tools for web crawling and search. Transform web content into clean, LLM-optimized Markdown or search the web with SearXNG integration.
✨ Features
- 🔍 SearXNG Web Search - Search the web with automatic browser management
- 📄 4 Crawling Tools - Extract and convert web content to Markdown
- 🚀 Auto-Browser-Launch - Search tools automatically manage browser lifecycle
- 📦 11 Total Tools - Complete toolkit for web interaction
- 💾 Built-in Caching - SHA-256 based caching with graceful fallbacks (see the sketch after this list)
- ⚡ Concurrent Processing - Handle up to 50 URLs per batch with configurable concurrency
- 🎯 LLM-Optimized Output - Clean Markdown perfect for AI consumption
- 🛡️ Robust Error Handling - Graceful failure with detailed error messages
- 🧪 Comprehensive Testing - Full CI/CD with performance benchmarks
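The 💾 Built-in Caching bullet above refers to SHA-256 keyed lookups with graceful fallbacks. A minimal sketch of what such a scheme can look like (the cache and function names are illustrative assumptions, not the server's actual internals):

```typescript
import { createHash } from "node:crypto";

// Hypothetical in-memory store; the server may use a different cache backend.
const cache = new Map<string, string>();

// Derive a stable cache key from the URL via SHA-256.
function cacheKey(url: string): string {
  return createHash("sha256").update(url).digest("hex");
}

async function fetchMarkdownCached(
  url: string,
  fetchMarkdown: (u: string) => Promise<string>,
): Promise<string> {
  const key = cacheKey(url);
  const hit = cache.get(key);
  if (hit !== undefined) return hit;          // cache hit: skip the network
  const markdown = await fetchMarkdown(url);  // cache miss: do the real crawl
  try {
    cache.set(key, markdown);                 // best-effort write
  } catch {
    // Graceful fallback: a cache failure never breaks the crawl result.
  }
  return markdown;
}
```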
📦 Installation
Method 1: npm (Recommended)
npm install crawl-mcp-server
Method 2: Direct from Git
# Install latest from GitHub
npm install git+https://github.com/Git-Fg/searchcrawl-mcp-server.git
# Or specific branch
npm install git+https://github.com/Git-Fg/searchcrawl-mcp-server.git#main
# Or from a fork
npm install git+https://github.com/YOUR_FORK/searchcrawl-mcp-server.git
Method 3: Clone and Build
git clone https://github.com/Git-Fg/searchcrawl-mcp-server.git
cd crawl-mcp-server
npm install
npm run build
Method 4: npx (No Installation)
# Run directly without installing
npx git+https://github.com/Git-Fg/searchcrawl-mcp-server.git
🔧 Setup for Claude Desktop
Option 1: npx from Git (Recommended)
Add to your Claude Desktop configuration file:
**macOS: ~/Library/Application Support/Claude/claude_desktop_config.json** (on Linux: ~/.config/Claude/claude_desktop_config.json)
{
"mcpServers": {
"crawl-server": {
"command": "npx",
"args": [
"git+https://github.com/Git-Fg/searchcrawl-mcp-server.git"
],
"env": {
"NODE_ENV": "production"
}
}
}
}
**Windows: %APPDATA%\Claude\claude_desktop_config.json**
{
"mcpServers": {
"crawl-server": {
"command": "npx",
"args": [
"git+https://github.com/Git-Fg/searchcrawl-mcp-server.git"
],
"env": {
"NODE_ENV": "production"
}
}
}
}
Option 2: Local Installation
If you've installed locally:
{
"mcpServers": {
"crawl-server": {
"command": "node",
"args": [
"/path/to/crawl-mcp-server/dist/index.js"
],
"env": {}
}
}
}
Option 3: Custom Path
To point at a specific install location (for example, a global npm install):
{
"mcpServers": {
"crawl-server": {
"command": "node",
"args": [
"/usr/local/lib/node_modules/crawl-mcp-server/dist/index.js"
],
"env": {}
}
}
}
After configuration, restart Claude Desktop.
🔧 Setup for Other MCP Clients
Claude CLI
# Using npx
claude mcp add crawl-server npx git+https://github.com/Git-Fg/searchcrawl-mcp-server.git
# Using local installation
claude mcp add crawl-server node /path/to/crawl-mcp-server/dist/index.js
Zed Editor
Add to ~/.config/zed/settings.json:
{
"assistant": {
"mcp": {
"servers": {
"crawl-server": {
"command": "node",
"args": ["/path/to/crawl-mcp-server/dist/index.js"]
}
}
}
}
}
VSCode with Copilot Chat
{
"mcpServers": {
"crawl-server": {
"command": "node",
"args": ["/path/to/crawl-mcp-server/dist/index.js"]
}
}
}
🚀 Quick Start
Using MCP Inspector (Testing)
# Install MCP Inspector globally
npm install -g @modelcontextprotocol/inspector
# Run the server
node dist/index.js
# In another terminal, test tools
npx @modelcontextprotocol/inspector --cli node dist/index.js --method tools/list
Development Mode
# Watch mode (auto-rebuild on changes)
npm run dev
# Build TypeScript
npm run build
# Run tests
npm run test:run
📚 Available Tools
Search & Browser Tools (7 tools)
1. search_searx
Search the web using SearXNG with automatic browser management.
// Example call
{
"query": "TypeScript MCP server",
"maxResults": 10,
"category": "general",
"timeRange": "week",
"language": "en"
}
Parameters:
- query (string, required): Search query
- maxResults (number, default: 20): Results to return (1-50)
- category (enum, default: general): one of general, images, videos, news, map, music, it, science
- timeRange (enum, optional): one of day, week, month, year
- language (string, default: en): Language code
Returns: JSON with search results array, URLs, and metadata
2. launch_chrome_cdp
Launch system Chrome with remote debugging for advanced SearX usage.
{
"headless": true,
"port": 9222,
"userDataDir": "/path/to/profile"
}
Parameters:
- headless (boolean, default: true): Run Chrome headless
- port (number, default: 9222): Remote debugging port
- userDataDir (string, optional): Custom Chrome profile
3. connect_cdp
Connect to remote CDP browser (Browserbase, etc.).
{
"cdpWsUrl": "http://localhost:9222"
}
Parameters:
- cdpWsUrl (string, required): CDP WebSocket URL or HTTP endpoint
4. launch_local
Launch bundled Chromium for SearX search.
{
"headless": true,
"userAgent": "custom user agent string"
}
Parameters:
- headless (boolean, default: true): Run headless
- userAgent (string, optional): Custom user agent
5. chrome_status
Check Chrome CDP status and health.
{}
Returns: Running status, health, endpoint URL, and PID
6. close
Close browser session (keeps Chrome CDP running).
{}
7. shutdown_chrome_cdp
Shutdown Chrome CDP and cleanup resources.
{}
Crawling Tools (4 tools)
1. crawl_read ⭐ (Simple & Fast)
Quick single-page extraction to Markdown.
{
"url": "https://example.com/article",
"options": {
"timeout": 30000
}
}
Best for:
- ✅ News articles
- ✅ Blog posts
- ✅ Documentation pages
- ✅ Simple content extraction
Returns: Clean Markdown content
2. crawl_read_batch ⭐ (Multiple URLs)
Process 1-50 URLs concurrently.
{
"urls": [
"https://example.com/article1",
"https://example.com/article2",
"https://example.com/article3"
],
"options": {
"maxConcurrency": 5,
"timeout": 30000,
"maxResults": 10
}
}
Best for:
- ✅ Processing multiple articles
- ✅ Building content aggregates
- ✅ Bulk content extraction
Returns: Array of Markdown results with summary statistics
3. crawl_fetch_markdown
Controlled single-page extraction with full option control.
{
"url": "https://example.com/article",
"options": {
"timeout": 30000
}
}
Best for:
- ✅ Advanced crawling options
- ✅ Custom timeout control
- ✅ Detailed extraction
4. crawl_fetch
Multi-page crawling with intelligent link extraction.
{
"url": "https://example.com",
"options": {
"pages": 5,
"maxConcurrency": 3,
"sameOriginOnly": true,
"timeout": 30000,
"maxResults": 20
}
}
Best for:
- ✅ Crawling entire sites
- ✅ Link-based discovery
- ✅ Multi-page scraping
Features:
- Extracts links from starting page
- Crawls discovered pages
- Concurrent processing
- Same-origin filtering (configurable)
💡 Usage Examples
Example 1: Search + Crawl Workflow
// Step 1: Search for topics
{
"tool": "search_searx",
"arguments": {
"query": "TypeScript best practices 2024",
"maxResults": 5
}
}
// Step 2: Extract URLs from results
// (Parse the search results to get URLs)
// Step 3: Crawl selected articles
{
"tool": "crawl_read_batch",
"arguments": {
"urls": [
"https://example.com/article1",
"https://example.com/article2",
"https://example.com/article3"
]
}
}
Example 2: Batch Content Extraction
{
"tool": "crawl_read_batch",
"arguments": {
"urls": [
"https://news.site/article1",
"https://news.site/article2",
"https://news.site/article3"
],
"options": {
"maxConcurrency": 10,
"timeout": 30000,
"maxResults": 3
}
}
}
Example 3: Site Crawling
{
"tool": "crawl_fetch",
"arguments": {
"url": "https://docs.example.com",
"options": {
"pages": 10,
"maxConcurrency": 5,
"sameOriginOnly": true,
"timeout": 30000,
"maxResults": 10
}
}
}
🎯 Tool Selection Guide
| Use Case | Recommended Tool | Complexity |
|---|---|---|
| Single article | crawl_read | Simple |
| Multiple articles | crawl_read_batch | Simple |
| Advanced options | crawl_fetch_markdown | Medium |
| Site crawling | crawl_fetch | Complex |
| Web search | search_searx | Simple |
| Research workflow | search_searx → crawl_read | Medium |
🏗️ Architecture
Core Components
┌─────────────────────────────────────────┐
│ crawl-mcp-server │
├─────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────┐ │
│ │ MCP Server Core │ │
│ │ - 11 registered tools │ │
│ │ - STDIO/HTTP transport │ │
│ └──────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────────┐ │
│ │ @just-every/crawl │ │
│ │ - HTML → Markdown │ │
│ │ - Mozilla Readability │ │
│ │ - Concurrent crawling │ │
│ └──────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────────┐ │
│ │ Playwright (Browser) │ │
│ │ - SearXNG integration │ │
│ │ - Auto browser management │ │
│ │ - Anti-detection │ │
│ └──────────────────────────────┘ │
│ │
└─────────────────────────────────────────┘
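For readers unfamiliar with the MCP SDK, here is a rough sketch of what the "MCP Server Core" layer above does: register a tool with a Zod schema and connect over STDIO (a simplified illustration, not the project's exact source):

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "crawl-mcp-server", version: "1.0.0" });

// Register a crawl_read-style tool; the real handler delegates to @just-every/crawl.
server.tool(
  "crawl_read",
  "Quick single-page extraction to Markdown",
  {
    url: z.string().url(),
    options: z.object({ timeout: z.number().optional() }).optional(),
  },
  async ({ url }) => {
    const markdown = `# Placeholder for ${url}`; // real code: fetch the page and convert to Markdown
    return { content: [{ type: "text", text: markdown }] };
  },
);

// STDIO transport for local MCP clients; the HTTP transport is wired up separately.
await server.connect(new StdioServerTransport());
```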
Technology Stack
- Runtime: Node.js 18+
- Language: TypeScript 5.7
- Framework: MCP SDK (@modelcontextprotocol/sdk)
- Crawling: @just-every/crawl
- Browser: Playwright Core
- Validation: Zod
- Transport: STDIO (local) + HTTP (remote)
Data Flow
Client Request
↓
MCP Protocol
↓
Tool Handler
↓
┌─────────────────────┐
│ Crawl/Search │
│ @just-every/crawl │ → HTML content
│ or SearXNG │ → Search results
└─────────────────────┘
↓
HTML → Markdown
↓
Result Formatting
↓
MCP Response
↓
Client
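The same flow seen from the client side, as a minimal sketch using the MCP SDK's stdio client (assumes a local build at dist/index.js; response shapes follow the standard MCP tool-call format):

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Spawn the server over STDIO.
const transport = new StdioClientTransport({ command: "node", args: ["dist/index.js"] });
const client = new Client({ name: "example-client", version: "1.0.0" });
await client.connect(transport);

// List the registered tools, then run one crawl.
const { tools } = await client.listTools();
console.log(tools.map((t) => t.name));

const result = await client.callTool({
  name: "crawl_read",
  arguments: { url: "https://example.com/article" },
});
console.log(result.content);

await client.close();
```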
🧪 Testing
Run Test Suite
# All unit tests
npm run test:run
# Performance benchmarks
npm run test:performance
# Full CI suite
npm run test:ci
# Individual tool test
npx @modelcontextprotocol/inspector --cli node dist/index.js \
--method tools/call \
--tool-name crawl_read \
--tool-arg url="https://example.com"
Test Coverage
- ✅ All 11 tools tested
- ✅ Error handling validated
- ✅ Performance benchmarks
- ✅ Integration workflows
- ✅ Multi-Node support (Node 18, 20, 22)
CI/CD Pipeline
┌────────────────────────────────────┐
│ GitHub Actions │
├────────────────────────────────────┤
│ 1. Test (Matrix: Node 18,20,22) │
│ 2. Integration Tests (PR only) │
│ 3. Performance Tests (main) │
│ 4. Security Scan │
│ 5. Coverage Report │
└────────────────────────────────────┘
🔧 Development
Prerequisites
- Node.js 18 or higher
- npm or yarn
Setup
# Clone the repository
git clone https://github.com/Git-Fg/searchcrawl-mcp-server.git
cd crawl-mcp-server
# Install dependencies
npm install
# Build TypeScript
npm run build
# Run in development mode (watch)
npm run dev
Development Commands
# Build project
npm run build
# Watch mode (auto-rebuild)
npm run dev
# Run tests
npm run test:run
# Lint code
npm run lint
# Type check
npm run typecheck
# Clean build artifacts
npm run clean
Project Structure
crawl-mcp-server/
├── src/
│ ├── index.ts # Main server (11 tools)
│ ├── types.ts # TypeScript interfaces
│ └── cdp.ts # Chrome CDP manager
├── test/
│ ├── run-tests.ts # Unit test suite
│ ├── performance.ts # Performance tests
│ └── config.ts # Test configuration
├── dist/ # Compiled JavaScript
├── .github/workflows/ # CI/CD pipeline
└── package.json
📊 Performance
Benchmarks
| Operation | Avg Duration | Max Memory |
|---|---|---|
| crawl_read | ~1500ms | 32MB |
| crawl_read_batch (2 URLs) | ~2500ms | 64MB |
| search_searx | ~4000ms | 128MB |
| crawl_fetch | ~2000ms | 48MB |
| tools/list | ~100ms | 8MB |
Performance Features
- ✅ Concurrent request processing (up to 20)
- ✅ Built-in caching (SHA-256)
- ✅ Automatic timeout management
- ✅ Memory optimization
- ✅ Resource cleanup
🛡️ Error Handling
All tools include comprehensive error handling:
- Network errors: Graceful degradation with error messages
- Timeout handling: Configurable timeouts
- Partial failures: Batch operations continue on individual failures
- Structured errors: Clear error codes and messages
- Recovery: Automatic retries where appropriate
Example error response:
{
"content": [
{
"type": "text",
"text": "Error: Failed to fetch https://example.com: Timeout after 30000ms"
}
],
"structuredContent": {
"error": "Network timeout",
"url": "https://example.com",
"code": "TIMEOUT"
}
}
🔐 Security
- No API keys required for basic crawling
- Respect robots.txt (configurable)
- User agent rotation
- Rate limiting (built-in via concurrency limits)
- Input validation (Zod schemas)
- Dependency scanning (npm audit, Snyk)
🌐 Transport Modes
STDIO (Default)
For local MCP clients:
node dist/index.js
HTTP
For remote access:
TRANSPORT=http PORT=3000 node dist/index.js
Server runs on: http://localhost:3000/mcp
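A minimal client-side sketch for this mode, assuming the server exposes MCP's streamable HTTP transport at /mcp (adjust the transport import if your SDK version uses the older SSE transport instead):

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StreamableHTTPClientTransport } from "@modelcontextprotocol/sdk/client/streamableHttp.js";

// Connect to the endpoint started with: TRANSPORT=http PORT=3000 node dist/index.js
const transport = new StreamableHTTPClientTransport(new URL("http://localhost:3000/mcp"));
const client = new Client({ name: "http-example", version: "1.0.0" });
await client.connect(transport);

console.log(await client.listTools());
await client.close();
```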
📝 Configuration
Environment Variables
# Transport mode (stdio or http)
TRANSPORT=stdio
# HTTP port (when TRANSPORT=http)
PORT=3000
# Node environment
NODE_ENV=production
Tool Configuration
Each tool accepts an options object:
{
"timeout": 30000, // Request timeout (ms)
"maxConcurrency": 5, // Concurrent requests (1-20)
"maxResults": 10, // Limit results (1-50)
"respectRobots": false, // Respect robots.txt
"sameOriginOnly": true // Only same-origin URLs
}
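To illustrate what maxConcurrency means in practice, here is a small worker-pool sketch that keeps at most that many requests in flight (a sketch of the concept only, not the server's implementation; per-URL error handling is omitted):

```typescript
// Process URLs with at most `maxConcurrency` requests in flight at once.
async function processBatch<T>(
  urls: string[],
  worker: (url: string) => Promise<T>,
  maxConcurrency = 5,
): Promise<T[]> {
  const results: T[] = new Array(urls.length);
  let next = 0;

  // Each runner repeatedly pulls the next unclaimed URL from the shared index.
  async function run(): Promise<void> {
    while (next < urls.length) {
      const i = next++;
      results[i] = await worker(urls[i]);
    }
  }

  await Promise.all(Array.from({ length: Math.min(maxConcurrency, urls.length) }, run));
  return results;
}
```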
🤝 Contributing
1. Fork the repository
2. Create a feature branch: git checkout -b feature/amazing-feature
3. Make changes and add tests
4. Run tests: npm run test:ci
5. Commit: git commit -m 'Add amazing feature'
6. Push: git push origin feature/amazing-feature
7. Open a Pull Request
Development Guidelines
- Follow TypeScript strict mode
- Add tests for new features
- Update documentation
- Run linting: npm run lint
- Ensure CI passes
📄 License
MIT License - see LICENSE file
🙏 Acknowledgments
- @just-every/crawl - Web crawling
- Model Context Protocol - MCP specification
- SearXNG - Search aggregator
- Playwright - Browser automation
📞 Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: your-email@example.com
🚀 What's Next?
- Add DuckDuckGo search support
- Implement content filtering
- Add screenshot capabilities
- Support for authenticated content
- PDF extraction
- Real-time monitoring
Built with ❤️ using TypeScript, MCP, and modern web technologies.
