Semantic Search MCP Server
A local Model Context Protocol (MCP) server that enables AI agents to perform semantic search over codebases using natural language queries. The server converts queries into efficient text search patterns (grep/ripgrep) and verifies relevance before returning results.
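At a high level the flow is: an LLM translates the natural-language query into candidate search patterns, the server runs them with ripgrep, and the LLM then verifies each hit before results are returned. The sketch below illustrates that loop under assumptions; the `llm_propose_patterns` and `llm_is_relevant` helpers are hypothetical stand-ins for the server's model calls, not its actual API:

```python
import json
import subprocess

def llm_propose_patterns(query: str) -> list[str]:
    # Placeholder: the real server asks an LLM to turn the query into regexes.
    return [word for word in query.split() if len(word) > 3]

def llm_is_relevant(query: str, hit: dict) -> bool:
    # Placeholder: the real server asks an LLM to verify each candidate hit.
    return True

def ripgrep(pattern: str, repo_path: str) -> list[dict]:
    """Run ripgrep with JSON output and collect (file, line, text) matches."""
    proc = subprocess.run(
        ["rg", "--json", "--max-count", "5", pattern, repo_path],
        capture_output=True, text=True,
    )
    hits = []
    for line in proc.stdout.splitlines():
        event = json.loads(line)
        if event.get("type") == "match":
            data = event["data"]
            hits.append({
                "file": data["path"]["text"],
                "line": data["line_number"],
                "text": data["lines"]["text"].strip(),
            })
    return hits

def semantic_search(query: str, repo_path: str) -> list[dict]:
    patterns = llm_propose_patterns(query)                             # 1. query -> patterns
    candidates = [h for p in patterns for h in ripgrep(p, repo_path)]  # 2. grep candidates
    return [h for h in candidates if llm_is_relevant(query, h)]       # 3. verify relevance
```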
Quick Setup
Installation
pip install -e .
Environment Variables
Set the following environment variables:
- `REPO_PATH` - Path to the repository to search (defaults to the current directory)
- `SEARCHER_TYPE` - Searcher implementation to use (default: `sgr_gemini_flash_lite`)
API Keys (choose one based on your searcher type):
- For Claude-based searchers: `CLAUDE_API_KEY` or `ANTHROPIC_API_KEY`
- For Gemini-based searchers: `GOOGLE_API_KEY`, `GEMINI_API_KEY`, `AI_STUDIO`, or `VERTEX_AI_API_KEY`
- For OpenAI-based searchers: `OPENAI_API_KEY`
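How these variables are consumed is up to the searcher, but a plausible resolution order looks like the following (a sketch, not the server's actual code; the fallback order among the Gemini key names is an assumption):

```python
import os

# REPO_PATH falls back to the current directory; SEARCHER_TYPE to the default.
repo_path = os.environ.get("REPO_PATH", os.getcwd())
searcher_type = os.environ.get("SEARCHER_TYPE", "sgr_gemini_flash_lite")

# For Gemini-based searchers, use the first key present in the environment.
gemini_key = next(
    (os.environ[name] for name in
     ("GOOGLE_API_KEY", "GEMINI_API_KEY", "AI_STUDIO", "VERTEX_AI_API_KEY")
     if name in os.environ),
    None,
)
if gemini_key is None:
    raise RuntimeError("No Gemini API key found in the environment")
```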
Available Searchers
SGR (Schema-Guided Reasoning) searchers - Production-ready implementations:
- `sgr_gemini_flash_lite` - Default, recommended (Gemini Flash Lite)
- `sgr_gemini_flash` - SGR with Gemini Flash
- `sgr_gemini_pro` - SGR with Gemini Pro
- `sgr_gpt4o` - SGR with GPT-4o
- `sgr_gpt4o_mini` - SGR with GPT-4o Mini
Note: Other searcher types (ripgrep_claude, agent_claude, agent_gemini_flash_lite, etc.) are experimental implementations from earlier development phases and are not recommended for production use.
Running the MCP Server
Important: The MCP server is not meant to be run directly in a terminal. It communicates over STDIO using the JSON-RPC protocol and must be launched by an IDE or MCP client.
Cursor Configuration
Add to your cursor-mcp-config.json:
{
  "mcpServers": {
    "qure-semantic-search": {
      "command": "/path/to/.venv/bin/qure-semantic-search-mcp",
      "env": {
        "REPO_PATH": "/path/to/your/repo"
      }
    }
  }
}
After configuring, restart Cursor. The server will be automatically launched when you use the semantic_search tool in Cursor's AI chat.
Note: If you see JSON parsing errors when running the command directly in a terminal, this is expected: the server requires an MCP client (such as Cursor) to communicate with it over the JSON-RPC protocol.
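If you want to confirm the binary works outside an IDE, you can hand-feed it a single JSON-RPC message over STDIO. The smoke test below is a sketch: it assumes the standard MCP newline-delimited stdio transport, and the `protocolVersion` value may need updating for your client:

```python
import json
import os
import subprocess

# Launch the server exactly as an MCP client would.
proc = subprocess.Popen(
    ["/path/to/.venv/bin/qure-semantic-search-mcp"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
    env={**os.environ, "REPO_PATH": "/path/to/your/repo"},
)

# Standard MCP initialize request (JSON-RPC 2.0, one message per line).
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
        "protocolVersion": "2024-11-05",
        "capabilities": {},
        "clientInfo": {"name": "smoke-test", "version": "0.0.1"},
    },
}
proc.stdin.write(json.dumps(request) + "\n")
proc.stdin.flush()

# A healthy server answers with its capabilities and server info.
print(proc.stdout.readline())
proc.terminate()
```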
Evaluation
Running Evaluation
Standard mode (single run per query):
python -m eval.run_eval
Stability mode (10 runs per query to measure consistency):
python -m eval.run_eval --stability
Stability mode with custom runs (e.g., 20 runs per query):
python -m eval.run_eval --stability --runs 20
Evaluate all searchers (compares different searcher implementations):
python -m eval.run_all_searchers --stability
Additional options:
- `--verbose` / `-v` - Print detailed per-query statistics
- `--single-dataset` - Use only the main dataset (exclude the easy dataset)
- `--output <path>` - Export results to a JSON file
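These flags compose; for example, a verbose 20-run stability evaluation that also exports its results (`results.json` is an illustrative output path):

python -m eval.run_eval --stability --runs 20 --verbose --output results.json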
Datasets
The evaluation uses two datasets:
- Main dataset (`data/dataset.jsonl`) - 12 challenging examples across different codebases (Django, Gin, CodeQL, QGIS, etc.) with non-trivial queries where simple keyword matching fails.
- Easy dataset (`data/dataset_easy.jsonl`) - 14 simpler examples designed for faster evaluation and testing. These queries are more straightforward but still require semantic understanding.
By default, both datasets are used together (26 queries total). Use --single-dataset to evaluate only the main dataset.
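Each JSONL line pairs a query with its ground truth. The authoritative schema is whatever data/dataset.jsonl contains; the field names and values below are hypothetical, shown only to illustrate the shape implied by the metrics (expected files and required substrings):

```json
{"query": "where is request authentication handled?", "repo": "path/to/test/repo", "expected_files": ["app/auth/middleware.py"], "required_substrings": ["def authenticate"]}
```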
Metrics
For detailed metric definitions and the mathematical proof of perfection, see METRICS_LOGIC.md.
Quick Summary:
- Precision@K = TP / (TP + FP) - Fraction of returned results that are relevant
- Recall@K = TP / (TP + FN) - Fraction of all relevant items that were returned
- F1@K = Harmonic mean of Precision and Recall
- File Discovery Rate = Files Found / Files Expected
- Substring Coverage = Substrings Found / Substrings Required
The Logic Test: If all metrics score 1.0, the solution is mathematically perfect (see proof in METRICS_LOGIC.md).
See eval/metrics.py for detailed implementations.
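As a quick reference, the headline metrics reduce to a few lines. This is a minimal re-implementation for illustration only; eval/metrics.py is authoritative:

```python
def precision_recall_f1(returned: set[str], relevant: set[str]) -> tuple[float, float, float]:
    """Precision@K, Recall@K, and F1@K over sets of result identifiers."""
    tp = len(returned & relevant)                          # true positives
    precision = tp / len(returned) if returned else 0.0    # TP / (TP + FP)
    recall = tp / len(relevant) if relevant else 0.0       # TP / (TP + FN)
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

def file_discovery_rate(found: set[str], expected: set[str]) -> float:
    """Files Found / Files Expected."""
    return len(found & expected) / len(expected) if expected else 1.0

def substring_coverage(text: str, required: list[str]) -> float:
    """Substrings Found / Substrings Required."""
    return sum(s in text for s in required) / len(required) if required else 1.0

# Example: 2 of 3 returned files are relevant, out of 3 relevant overall.
p, r, f1 = precision_recall_f1({"a.py", "b.py", "c.py"}, {"a.py", "b.py", "d.py"})
# p = 2/3, r = 2/3, f1 = 2/3
```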
Performance Results
Evaluation results for sgr_gemini_flash_lite searcher (10 runs per query, 26 queries total):
Overall Performance
| Metric | Value | Stability |
|---|---|---|
| Precision@10 | 0.30 ± 0.38 | ⚠ High variance (CV=127%) |
| Recall@10 | 0.31 ± 0.41 | ⚠ High variance (CV=133%) |
| F1@10 | 0.29 ± 0.38 | ⚠ High variance (CV=130%) |
| Success Rate@10 | 0.40 ± 0.46 | ⚠ High variance (CV=114%) |
| File Discovery Rate | 0.61 ± 0.40 | ⚠ Moderate variance (CV=66%) |
| Substring Coverage | 0.35 ± 0.39 | ⚠ High variance (CV=111%) |
| Avg Latency | 20.6s ± 7.9s | Range: 9.6s - 38.3s |
| Stability Score | 73.9% | 16/26 stable queries (61.5%) |
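In the table above, CV is the coefficient of variation, the standard deviation divided by the mean: for Precision@10, 0.38 / 0.30 ≈ 127%, meaning the spread across runs exceeds the average score itself.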
Dataset Breakdown
Easy Dataset (14 examples)
- Precision@10: 0.40 ± 0.44
- Recall@10: 0.46 ± 0.49
- F1@10: 0.42 ± 0.45
- File Discovery Rate: 0.92 ± 0.13 ✓ (Good stability)
- Avg Latency: 15.0s ± 4.8s
- Stability Score: 85.9% ✓ (Good stability)
Main Dataset (12 examples)
- Precision@10: 0.17 ± 0.25
- Recall@10: 0.13 ± 0.18
- F1@10: 0.14 ± 0.20
- File Discovery Rate: 0.26 ± 0.30
- Avg Latency: 27.2s ± 5.3s
- Stability Score: 60.0% ⚠ (Moderate stability)
Notes
- High variance in metrics is expected due to LLM non-determinism and the complexity of semantic search queries
- File Discovery Rate shows better stability, especially on easier queries (0.92 on the easy dataset)
- Latency varies significantly (9-38s) depending on query complexity and codebase size
- Results are evaluated on non-trivial queries where simple keyword matching fails
Project Structure
- `src/` - Core MCP server and searcher implementations
- `eval/` - Evaluation scripts and metrics
- `data/` - Evaluation datasets and test repositories
- `scripts/` - Utility scripts for testing and debugging
Documentation
- METRICS_LOGIC.md - Mathematical justification for metric selection and proof of perfection
- KNOWN_ISSUES.md - Current limitations, known problems, and workarounds
- FUTURE_ROADMAP.md - Planned improvements and mitigation strategies
