Semantic Search MCP Server
A local Model Context Protocol (MCP) server that enables AI agents to perform semantic search over codebases using natural language queries. The server converts queries into efficient text search patterns (grep/ripgrep) and verifies relevance before returning results.
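At a high level the flow is: an LLM translates the natural-language query into candidate search patterns, the server runs them with ripgrep, and the LLM then verifies each hit before results are returned. The sketch below illustrates that loop under assumptions; the `llm_propose_patterns` and `llm_is_relevant` helpers are hypothetical stand-ins for the server's model calls, not its actual API:

```python
import json
import subprocess

def llm_propose_patterns(query: str) -> list[str]:
    # Placeholder: the real server asks an LLM to turn the query into regexes.
    return [word for word in query.split() if len(word) > 3]

def llm_is_relevant(query: str, hit: dict) -> bool:
    # Placeholder: the real server asks an LLM to verify each candidate hit.
    return True

def ripgrep(pattern: str, repo_path: str) -> list[dict]:
    """Run ripgrep with JSON output and collect (file, line, text) matches."""
    proc = subprocess.run(
        ["rg", "--json", "--max-count", "5", pattern, repo_path],
        capture_output=True, text=True,
    )
    hits = []
    for line in proc.stdout.splitlines():
        event = json.loads(line)
        if event.get("type") == "match":
            data = event["data"]
            hits.append({
                "file": data["path"]["text"],
                "line": data["line_number"],
                "text": data["lines"]["text"].strip(),
            })
    return hits

def semantic_search(query: str, repo_path: str) -> list[dict]:
    patterns = llm_propose_patterns(query)                             # 1. query -> patterns
    candidates = [h for p in patterns for h in ripgrep(p, repo_path)]  # 2. grep candidates
    return [h for h in candidates if llm_is_relevant(query, h)]       # 3. verify relevance
```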
Quick Setup
Installation
pip install -e .
Environment Variables
Set the following environment variables:
- `REPO_PATH` - Path to the repository to search (defaults to the current directory)
- `SEARCHER_TYPE` - Searcher implementation to use (default: `sgr_gemini_flash_lite`)
API Keys (choose one based on your searcher type):
- For Claude-based searchers: `CLAUDE_API_KEY` or `ANTHROPIC_API_KEY`
- For Gemini-based searchers: `GOOGLE_API_KEY`, `GEMINI_API_KEY`, `AI_STUDIO`, or `VERTEX_AI_API_KEY`
- For OpenAI-based searchers: `OPENAI_API_KEY`
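How these variables are consumed is up to the searcher, but a plausible resolution order looks like the following (a sketch, not the server's actual code; the fallback order among the Gemini key names is an assumption):

```python
import os

# REPO_PATH falls back to the current directory; SEARCHER_TYPE to the default.
repo_path = os.environ.get("REPO_PATH", os.getcwd())
searcher_type = os.environ.get("SEARCHER_TYPE", "sgr_gemini_flash_lite")

# For Gemini-based searchers, use the first key present in the environment.
gemini_key = next(
    (os.environ[name] for name in
     ("GOOGLE_API_KEY", "GEMINI_API_KEY", "AI_STUDIO", "VERTEX_AI_API_KEY")
     if name in os.environ),
    None,
)
if gemini_key is None:
    raise RuntimeError("No Gemini API key found in the environment")
```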
Available Searchers
SGR (Schema-Guided Reasoning) searchers - Production-ready implementations:
- `sgr_gemini_flash_lite` - Default, recommended (Gemini Flash Lite)
- `sgr_gemini_flash` - SGR with Gemini Flash
- `sgr_gemini_pro` - SGR with Gemini Pro
- `sgr_gpt4o` - SGR with GPT-4o
- `sgr_gpt4o_mini` - SGR with GPT-4o Mini
Note: Other searcher types (ripgrep_claude, agent_claude, agent_gemini_flash_lite, etc.) are experimental implementations from earlier development phases and are not recommended for production use.
Running the MCP Server
Important: The MCP server is not meant to be run directly in a terminal. It communicates over STDIO using the JSON-RPC protocol and must be launched by an IDE or MCP client.
Cursor Configuration
Add to your cursor-mcp-config.json:
{
  "mcpServers": {
    "qure-semantic-search": {
      "command": "/path/to/.venv/bin/qure-semantic-search-mcp",
      "env": {
        "REPO_PATH": "/path/to/your/repo"
      }
    }
  }
}
After configuring, restart Cursor. The server will be automatically launched when you use the semantic_search tool in Cursor's AI chat.
Note: If you see JSON parsing errors when running the command directly in a terminal, this is expected: the server requires an MCP client (such as Cursor) to communicate with it over the JSON-RPC protocol.
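If you want to confirm the binary works outside an IDE, you can hand-feed it a single JSON-RPC message over STDIO. The smoke test below is a sketch: it assumes the standard MCP newline-delimited stdio transport, and the `protocolVersion` value may need updating for your client:

```python
import json
import os
import subprocess

# Launch the server exactly as an MCP client would.
proc = subprocess.Popen(
    ["/path/to/.venv/bin/qure-semantic-search-mcp"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
    env={**os.environ, "REPO_PATH": "/path/to/your/repo"},
)

# Standard MCP initialize request (JSON-RPC 2.0, one message per line).
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
        "protocolVersion": "2024-11-05",
        "capabilities": {},
        "clientInfo": {"name": "smoke-test", "version": "0.0.1"},
    },
}
proc.stdin.write(json.dumps(request) + "\n")
proc.stdin.flush()

# A healthy server answers with its capabilities and server info.
print(proc.stdout.readline())
proc.terminate()
```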
Evaluation
Running Evaluation
Standard mode (single run per query):
python -m eval.run_eval
Stability mode (10 runs per query to measure consistency):
python -m eval.run_eval --stability
Stability mode with custom runs (e.g., 20 runs per query):
python -m eval.run_eval --stability --runs 20
Evaluate all searchers (compares different searcher implementations):
python -m eval.run_all_searchers --stability
Additional options:
- `--verbose` / `-v` - Print detailed per-query statistics
- `--single-dataset` - Use only the main dataset (exclude the easy dataset)
- `--output <path>` - Export results to a JSON file
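These flags compose; for example, a verbose 20-run stability evaluation that also exports its results (`results.json` is an illustrative output path):

python -m eval.run_eval --stability --runs 20 --verbose --output results.json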
Datasets
The evaluation uses two datasets:
- Main dataset (`data/dataset.jsonl`) - 12 challenging examples across different codebases (Django, Gin, CodeQL, QGIS, etc.) with non-trivial queries where simple keyword matching fails.
- Easy dataset (`data/dataset_easy.jsonl`) - 14 simpler examples designed for faster evaluation and testing. These queries are more straightforward but still require semantic understanding.
By default, both datasets are used together (26 queries total). Use --single-dataset to evaluate only the main dataset.
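Each JSONL line pairs a query with its ground truth. The authoritative schema is whatever data/dataset.jsonl contains; the field names and values below are hypothetical, shown only to illustrate the shape implied by the metrics (expected files and required substrings):

```json
{"query": "where is request authentication handled?", "repo": "path/to/test/repo", "expected_files": ["app/auth/middleware.py"], "required_substrings": ["def authenticate"]}
```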
Metrics
For detailed metric definitions and the mathematical proof of perfection, see METRICS_LOGIC.md.
Quick Summary:
- Precision@K = TP / (TP + FP) - Fraction of returned results that are relevant
- Recall@K = TP / (TP + FN) - Fraction of all relevant items that were returned
- F1@K = Harmonic mean of Precision and Recall
- File Discovery Rate = Files Found / Files Expected
- Substring Coverage = Substrings Found / Substrings Required
The Logic Test: If all metrics score 1.0, the solution is mathematically perfect (see proof in METRICS_LOGIC.md).
See eval/metrics.py for detailed implementations.
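As a quick reference, the headline metrics reduce to a few lines. This is a minimal re-implementation for illustration only; eval/metrics.py is authoritative:

```python
def precision_recall_f1(returned: set[str], relevant: set[str]) -> tuple[float, float, float]:
    """Precision@K, Recall@K, and F1@K over sets of result identifiers."""
    tp = len(returned & relevant)                          # true positives
    precision = tp / len(returned) if returned else 0.0    # TP / (TP + FP)
    recall = tp / len(relevant) if relevant else 0.0       # TP / (TP + FN)
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

def file_discovery_rate(found: set[str], expected: set[str]) -> float:
    """Files Found / Files Expected."""
    return len(found & expected) / len(expected) if expected else 1.0

def substring_coverage(text: str, required: list[str]) -> float:
    """Substrings Found / Substrings Required."""
    return sum(s in text for s in required) / len(required) if required else 1.0

# Example: 2 of 3 returned files are relevant, out of 3 relevant overall.
p, r, f1 = precision_recall_f1({"a.py", "b.py", "c.py"}, {"a.py", "b.py", "d.py"})
# p = 2/3, r = 2/3, f1 = 2/3
```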
Performance Results
Evaluation results for sgr_gemini_flash_lite searcher (10 runs per query, 26 queries total):
Overall Performance
| Metric | Value | Stability |
|---|---|---|
| Precision@10 | 0.30 ± 0.38 | ⚠ High variance (CV=127%) |
| Recall@10 | 0.31 ± 0.41 | ⚠ High variance (CV=133%) |
| F1@10 | 0.29 ± 0.38 | ⚠ High variance (CV=130%) |
| Success Rate@10 | 0.40 ± 0.46 | ⚠ High variance (CV=114%) |
| File Discovery Rate | 0.61 ± 0.40 | ⚠ Moderate variance (CV=66%) |
| Substring Coverage | 0.35 ± 0.39 | ⚠ High variance (CV=111%) |
| Avg Latency | 20.6s ± 7.9s | Range: 9.6s - 38.3s |
| Stability Score | 73.9% | 16/26 stable queries (61.5%) |
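In the table above, CV is the coefficient of variation, the standard deviation divided by the mean: for Precision@10, 0.38 / 0.30 ≈ 127%, meaning the spread across runs exceeds the average score itself.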
Dataset Breakdown
Easy Dataset (14 examples)
- Precision@10: 0.40 ± 0.44
- Recall@10: 0.46 ± 0.49
- F1@10: 0.42 ± 0.45
- File Discovery Rate: 0.92 ± 0.13 ✓ (Good stability)
- Avg Latency: 15.0s ± 4.8s
- Stability Score: 85.9% ✓ (Good stability)
Main Dataset (12 examples)
- Precision@10: 0.17 ± 0.25
- Recall@10: 0.13 ± 0.18
- F1@10: 0.14 ± 0.20
- File Discovery Rate: 0.26 ± 0.30
- Avg Latency: 27.2s ± 5.3s
- Stability Score: 60.0% ⚠ (Moderate stability)
Notes
- High variance in metrics is expected due to LLM non-determinism and the complexity of semantic search queries
- File Discovery Rate shows better stability, especially on easier queries (0.92 on the easy dataset)
- Latency varies significantly (9-38s) depending on query complexity and codebase size
- Results are evaluated on non-trivial queries where simple keyword matching fails
Project Structure
- `src/` - Core MCP server and searcher implementations
- `eval/` - Evaluation scripts and metrics
- `data/` - Evaluation datasets and test repositories
- `scripts/` - Utility scripts for testing and debugging
Documentation
- METRICS_LOGIC.md - Mathematical justification for metric selection and proof of perfection
- KNOWN_ISSUES.md - Current limitations, known problems, and workarounds
- FUTURE_ROADMAP.md - Planned improvements and mitigation strategies
