MCP Data Catalog
A Model Context Protocol (MCP) server that provides AI assistants with structured access to tabular datasets from CSV files. Query, filter, and retrieve data through a clean, type-safe interface.
Features (MVP)
✅ 4 MCP Tools
- `list_datasets` - List all available datasets
- `describe_dataset` - Get schema and field information
- `query_dataset` - Query with filters, projections, and limits
- `get_by_id` - Retrieve a specific row by lookup key
✅ Type-Safe Schema
- String, number, boolean, and enum field types
- Field validation and type checking
- Required field enforcement
✅ Filtering (MVP)
- `eq` (equals)
- `contains` (case-insensitive substring)
- `and` (multiple conditions)
✅ Smart Limits
- Per-dataset row limits
- Truncation indicators
- Configurable defaults
✅ Hot Reload
- Config changes apply automatically (1-3ms)
- No server restart needed
- Invalid configs are rejected safely
✅ Stable MVP
- Hexagonal architecture
- Comprehensive test coverage
- Type-safe implementation
- Production-quality error handling
Quick Start
1. Install
npm install
npm run build
2. Configure
Copy an example configuration:
cp examples/config/typical.json config/datasets.json
Or create your own:
{
"datasets": [
{
"id": "users",
"name": "User Directory",
"schema": {
"fields": [
{ "name": "id", "type": "number", "required": true },
{ "name": "name", "type": "string", "required": true },
{ "name": "email", "type": "string", "required": true },
{ "name": "role", "type": "enum", "values": ["admin", "user", "guest"] }
],
"visibleFields": ["id", "name", "email", "role"]
},
"source": {
"type": "csv",
"path": "./data/users.csv"
},
"lookupKey": "id",
"limits": {
"maxRows": 100,
"defaultRows": 20
}
}
]
}
3. Run
npm run dev
The server starts on stdio and exposes 4 MCP tools.
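Beyond `npm run dev`, a quick way to exercise the server is a short client script. The following is a minimal sketch using the official TypeScript SDK; the paths and the `smoke-test` name are placeholders you should adjust for your setup:

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

async function main() {
  // Spawn the catalog server as a child process and talk to it over stdio.
  const transport = new StdioClientTransport({
    command: "node",
    args: ["dist/index.js"],
    env: { CONFIG_PATH: "config/datasets.json" },
  });

  const client = new Client({ name: "smoke-test", version: "0.0.1" });
  await client.connect(transport);

  // Call the list_datasets tool with no arguments.
  const result = await client.callTool({ name: "list_datasets", arguments: {} });
  console.log(JSON.stringify(result, null, 2));

  await client.close();
}

main().catch(console.error);
```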
Usage
MCP Client Configuration
Add to your MCP client config (e.g., Claude Desktop):
{
"mcpServers": {
"catalog": {
"command": "node",
"args": ["/path/to/catalog-mcp/dist/index.js"],
"env": {
"CONFIG_PATH": "/path/to/config/datasets.json"
}
}
}
}
Note: Replace `/path/to/...` with your actual local file paths. The MCP server runs as a Node.js process and reads the `CONFIG_PATH` environment variable at startup.
Available Tools
1. list_datasets
List all configured datasets.
Request:
{
"tool": "list_datasets"
}
Response:
{
"datasets": [
{
"id": "users",
"name": "User Directory"
},
{
"id": "products",
"name": "Product Catalog"
}
]
}
2. describe_dataset
Get detailed schema information for a dataset.
Request:
{
"tool": "describe_dataset",
"arguments": {
"datasetId": "users"
}
}
Response:
{
"id": "users",
"name": "User Directory",
"fields": [
{
"name": "id",
"type": "number",
"required": true
},
{
"name": "name",
"type": "string",
"required": true
},
{
"name": "role",
"type": "enum",
"values": ["admin", "user", "guest"]
}
],
"lookupKey": "id",
"limits": {
"maxRows": 100,
"defaultRows": 20
}
}
3. query_dataset
Query a dataset with optional filters, field projection, and limits.
Simple Query:
{
"tool": "query_dataset",
"arguments": {
"datasetId": "users",
"limit": 10
}
}
With Filter:
{
"tool": "query_dataset",
"arguments": {
"datasetId": "users",
"filters": {
"field": "role",
"op": "eq",
"value": "admin"
},
"limit": 10
}
}
With Multiple Filters:
{
"tool": "query_dataset",
"arguments": {
"datasetId": "users",
"filters": {
"and": [
{ "field": "role", "op": "eq", "value": "admin" },
{ "field": "name", "op": "contains", "value": "smith" }
]
}
}
}
With Field Projection:
{
"tool": "query_dataset",
"arguments": {
"datasetId": "users",
"fields": ["id", "name"],
"limit": 10
}
}
Response:
{
"rows": [
{ "id": 1, "name": "Alice Smith", "email": "alice@example.com", "role": "admin" },
{ "id": 6, "name": "Frank Miller", "email": "frank@example.com", "role": "admin" }
],
"fields": ["id", "name", "email", "role"],
"rowsReturned": 2,
"totalRows": 2,
"truncated": false
}
4. get_by_id
Retrieve a single row by its lookup key.
Request:
{
"tool": "get_by_id",
"arguments": {
"datasetId": "users",
"id": "1"
}
}
Response:
{
"row": {
"id": 1,
"name": "Alice Smith",
"email": "alice@example.com",
"role": "admin"
}
}
If not found:
{
"row": null
}
Filter Operators
The MVP supports three operators:
eq (equals)
Exact match (case-sensitive for strings).
{
"field": "role",
"op": "eq",
"value": "admin"
}
contains (substring)
Case-insensitive substring search.
{
"field": "name",
"op": "contains",
"value": "smith"
}
and (conjunction)
All conditions must be true.
{
"and": [
{ "field": "role", "op": "eq", "value": "admin" },
{ "field": "active", "op": "eq", "value": true }
]
}
Filter Schema
All filter expressions follow this canonical JSON structure:
Simple filter:
{
"field": "fieldName",
"op": "eq" | "contains",
"value": any
}
Compound filter (and):
{
"and": [
{ "field": "...", "op": "...", "value": ... },
{ "field": "...", "op": "...", "value": ... }
]
}
Post-MVP: Additional operators (`ne`, `gt`, `gte`, `lt`, `lte`, `in`, `or`) are planned and will use the same structure.
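For intuition, here is a minimal sketch of how the MVP operators (`eq`, `contains`, `and`) could be evaluated against an in-memory row. It is illustrative only, not the project's actual implementation:

```typescript
type SimpleFilter = { field: string; op: "eq" | "contains"; value: unknown };
type Filter = SimpleFilter | { and: Filter[] };

// Returns true when the row satisfies the filter expression.
function matches(row: Record<string, unknown>, filter: Filter): boolean {
  if ("and" in filter) {
    // Conjunction: every sub-filter must match.
    return filter.and.every((f) => matches(row, f));
  }
  const actual = row[filter.field];
  switch (filter.op) {
    case "eq":
      // Exact match; case-sensitive for strings.
      return actual === filter.value;
    case "contains":
      // Case-insensitive substring search on string values.
      return (
        typeof actual === "string" &&
        typeof filter.value === "string" &&
        actual.toLowerCase().includes(filter.value.toLowerCase())
      );
  }
}
```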
Configuration
Dataset Structure
{
"datasets": [
{
"id": "string", // Unique identifier
"name": "string", // Display name
"schema": {
"fields": [ // Field definitions
{
"name": "string", // Field name (matches CSV header)
"type": "string", // string | number | boolean | enum
"required": boolean, // Optional, default: false
"values": ["..."] // Required for enum type
}
],
"visibleFields": ["..."] // Fields accessible in queries
},
"source": {
"type": "csv", // Only CSV in MVP
"path": "string" // Relative path to CSV file
},
"lookupKey": "string", // Optional, field name for get_by_id
"limits": {
"maxRows": number, // Hard limit
"defaultRows": number // Default when not specified
}
}
]
}
Field Types
| Type | Description | Example Values |
|---|---|---|
| `string` | Text data | `"Alice"`, `"alice@example.com"` |
| `number` | Numeric data | `42`, `99.99`, `-5` |
| `boolean` | True/false | `true`, `false` |
| `enum` | Predefined values | `"admin"` (must be in the `values` array) |
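The project validates with Zod (see Credits), and these field definitions map naturally onto a discriminated union on `type`. The sketch below shows that shape; it is an illustration, not necessarily the exact schema the server uses:

```typescript
import { z } from "zod";

const base = {
  name: z.string().min(1),
  required: z.boolean().optional(), // defaults to false when omitted
};

// "type" discriminates the union; only enum fields carry a values array.
const fieldSchema = z.discriminatedUnion("type", [
  z.object({ ...base, type: z.literal("string") }),
  z.object({ ...base, type: z.literal("number") }),
  z.object({ ...base, type: z.literal("boolean") }),
  z.object({
    ...base,
    type: z.literal("enum"),
    values: z.array(z.string()).nonempty(), // enum must list its values
  }),
]);

type FieldDefinition = z.infer<typeof fieldSchema>;
```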
Configuration Format
Configuration uses JSON format. This is the primary and recommended format for the MVP.
Note: YAML support may be added in future versions, but JSON remains the canonical format.
Configuration Validation
The server validates configuration on startup and rejects invalid configs:
✅ Checks performed:
- All required fields present
- Field types are valid
- Enum fields have non-empty `values` arrays
- `visibleFields` reference existing fields
- `lookupKey` references an existing field
- Dataset IDs are unique
- Limits are valid (positive, maxRows ≥ defaultRows)
- CSV files exist and are readable
Fail-Fast: Invalid configuration prevents server startup with clear error messages.
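Some of these checks are cross-field and cannot be expressed per field. Here is a hedged sketch of two of them; the shapes and names are assumptions for illustration, not the project's actual validator:

```typescript
// Assumed minimal shape of a dataset entry, for illustration only.
interface DatasetConfig {
  id: string;
  schema: { fields: { name: string }[]; visibleFields: string[] };
  limits: { maxRows: number; defaultRows: number };
}

function validateDataset(ds: DatasetConfig): string[] {
  const errors: string[] = [];
  const fieldNames = new Set(ds.schema.fields.map((f) => f.name));

  // Every visible field must be defined in the schema.
  for (const name of ds.schema.visibleFields) {
    if (!fieldNames.has(name)) {
      errors.push(`${ds.id}: visibleFields references unknown field "${name}"`);
    }
  }

  // Limits must be positive and consistent (maxRows >= defaultRows).
  if (ds.limits.defaultRows <= 0 || ds.limits.maxRows < ds.limits.defaultRows) {
    errors.push(`${ds.id}: limits must satisfy 0 < defaultRows <= maxRows`);
  }
  return errors;
}
```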
CSV File Format
Requirements
- Header row with column names (first row)
- Column names must match field definitions (case-sensitive)
- Data types must match field types
- UTF-8 encoding
- Standard CSV format (comma-delimited)
Example
id,name,email,role,active
1,Alice Smith,alice@example.com,admin,true
2,Bob Johnson,bob@example.com,user,true
3,Charlie Brown,charlie@example.com,guest,false
Type Formatting
Boolean: Must be true or false (lowercase)
active
true
false
Number: Integers or decimals
price,quantity
99.99,5
149.99,10
Enum: Must match one of the configured values
role
admin
user
guest
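CSV cells always arrive as strings, so each value must be coerced to its declared field type before validation. A minimal sketch of that coercion under the rules above (illustrative, not the actual loader):

```typescript
type FieldType = "string" | "number" | "boolean" | "enum";

// Coerce a raw CSV cell to the declared type; throws on a mismatch.
function coerce(raw: string, type: FieldType, enumValues?: string[]): unknown {
  switch (type) {
    case "string":
      return raw;
    case "number": {
      const n = raw.trim() === "" ? NaN : Number(raw);
      if (Number.isNaN(n)) throw new Error(`not a number: "${raw}"`);
      return n;
    }
    case "boolean":
      if (raw === "true") return true;
      if (raw === "false") return false;
      throw new Error(`expected lowercase true/false, got "${raw}"`);
    case "enum":
      if (!enumValues?.includes(raw)) {
        throw new Error(`"${raw}" is not one of: ${enumValues?.join(", ")}`);
      }
      return raw;
  }
}
```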
Hot Reload
Configuration changes are detected automatically:
- Edit `config/datasets.json`
- Save the file
- Changes apply in 1-3ms (catalog swap only)
- Invalid changes are rejected (keeps current config)
Watch the logs:
[INFO] Config reloaded successfully in 2ms
How it works:
- Config file is watched for changes
- On change: validates new config
- If valid: atomically swaps to new catalog
- If invalid: preserves current state, logs error
No server restart needed!
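The validate-then-swap pattern keeps a known-good catalog live at all times. Below is a minimal sketch using Chokidar (the watcher the project ships with); `validateConfig` and `buildCatalog` are hypothetical stand-ins for the real config pipeline:

```typescript
import { watch } from "chokidar";
import { readFile } from "node:fs/promises";

// Hypothetical stand-ins for the real config pipeline.
declare function validateConfig(raw: unknown): object; // throws on invalid input
declare function buildCatalog(config: object): object;

let currentCatalog: object = {};
export const getCatalog = () => currentCatalog;

export function watchConfig(path: string): void {
  watch(path).on("change", async () => {
    try {
      const raw: unknown = JSON.parse(await readFile(path, "utf8"));
      const config = validateConfig(raw); // reject invalid configs by throwing
      currentCatalog = buildCatalog(config); // atomic reference swap
      console.log("[INFO] Config reloaded successfully");
    } catch (err) {
      // Keep serving the previous catalog; surface the problem in the logs.
      console.error("[ERROR] Reload rejected, keeping current config:", err);
    }
  });
}
```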
AI Usage Guidelines
This server is optimized for local, schema-aware access to CSV-backed reference data: the kind of data I use for project design, exploration, documentation aggregation, and hobby systems. For high-volume or mission-critical production workloads, you would typically pair LLMs with a dedicated database-backed MCP server and keep this catalog focused on lightweight, structured datasets close to the agent.
When designing datasets, keep the following guidelines in mind.
Dataset Design for AI
Keep datasets focused:
- Small, single-purpose tables work better than large multi-purpose sheets
- Separate reference data (IDs, names, codes) from descriptive content
- Break complex domains into multiple related datasets
Optimize for token efficiency:
- Use `visibleFields` to expose only necessary columns
- Keep field names short but meaningful
- Prefer IDs and codes over long text fields for filtering
Design for stable querying:
- Use consistent, stable identifiers (numeric IDs, SKUs, codes)
- Avoid relying on free-text names as lookup keys
- Normalize categorical data (use enums, not free text)
Structure for filtering:
- Tag-based fields enable flexible queries (`status`, `category`, `type`)
- Use enums for controlled vocabularies
- Boolean flags for common filters (`active`, `published`, `available`)
Example patterns:
- Index dataset: IDs, names, tags, status (small, frequently queried)
- Detail dataset: Full records with all fields (queried by ID)
- Reference dataset: Lookup tables, enums, measurement scales (small, stable)
Architecture
This project follows Hexagonal Architecture for clean separation of concerns:
┌─────────────────────────────────────────────────────┐
│ Primary Adapters (MCP) │
│ ↓ │
│ Primary Ports (Tools) │
│ ↓ │
│ Use Cases (Business Logic) │
│ ↓ │
│ Domain Layer (Pure Logic) │
│ ↓ │
│ Secondary Ports (Storage) │
│ ↓ │
│ Secondary Adapters (CSV, Config) │
└─────────────────────────────────────────────────────┘
Key Principles:
- Domain layer has zero external dependencies
- Dependencies point inward (adapters → domain)
- Business logic is isolated and testable
- Easy to extend with new adapters
See docs/dev/mcp-data-catalog.md for detailed architecture documentation.
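As an illustration of the layering, a secondary port and a use case might look like the sketch below. The names are hypothetical, not the project's actual interfaces:

```typescript
// Secondary port: the domain depends on this interface,
// never on a concrete CSV or database implementation.
interface DatasetRepository {
  listDatasets(): Promise<{ id: string; name: string }[]>;
  loadRows(datasetId: string): Promise<Record<string, unknown>[]>;
}

// Use case: pure orchestration over the port, easy to unit-test
// with an in-memory fake repository.
class QueryDataset {
  constructor(private readonly repo: DatasetRepository) {}

  async execute(datasetId: string, limit: number) {
    const rows = await this.repo.loadRows(datasetId);
    return { rows: rows.slice(0, limit), truncated: rows.length > limit };
  }
}

// A CSV-backed adapter would live at the edge and satisfy the port,
// keeping file I/O out of the domain layer.
```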
Examples
The examples/ directory contains:
Configurations
- `minimal.json` - Single dataset, basic features
- `typical.json` - Multiple datasets, common patterns
- `advanced.json` - Complex scenarios with many features
Datasets
- `minimal.csv` - 5 rows, 2 columns
- `sample-users.csv` - 10 users with roles
- `sample-products.csv` - 15 products with categories
- `employees.csv` - 15 employees with departments
- `inventory.csv` - 20 inventory items
- `orders.csv` - 20 customer orders
Try them:
cp examples/config/typical.json config/datasets.json
npm run dev
See examples/README.md for detailed documentation.
Development
Setup
npm install
npm run build
Run in Development Mode
npm run dev
Run Tests
# All tests
npm test
# With coverage
npm run test:coverage
# Watch mode
npm run test:watch
Build for Production
npm run build
Output is written to the `dist/` directory.
Testing
Test Coverage:
- Comprehensive test suite with high coverage
- Unit tests for domain logic and use cases
- Integration tests for MCP tools and hot reload
- Both statement and branch coverage tracked
Test Structure:
test/
├── unit/
│ ├── domain/ # Domain services and value objects
│ └── use-cases/ # Use case orchestration
└── integration/
├── mcp-tools.test.ts # All 4 MCP tools
├── config-hot-reload.test.ts # Hot reload behavior
├── csv-hot-visibility.test.ts # Visible fields
└── limits-and-truncation.test.ts # Row limits
Performance
Characteristics:
- Config reload: 1-3ms (catalog swap only)
- CSV load: 5-10ms per file (varies with size)
- Query execution: 1-2ms for in-memory operations
- Memory: O(n) where n = dataset size
CSV Loading Behavior:
- CSV files are read on-demand for each query
- No in-memory caching of CSV data
- This keeps memory usage low but includes file I/O in query latency
- Config catalog is cached; only CSV data is loaded per-query
Scalability:
- Suitable for datasets up to ~100K rows
- Query latency includes file read time (~5-10ms per CSV)
- For high-performance or large datasets, use database backends (post-MVP)
- Consider dataset design: multiple small CSVs better than one large CSV
Roadmap
✅ MVP Complete (All 6 Phases)
- Hexagonal architecture
- 4 MCP tools (list, describe, query, get_by_id)
- MVP filter operators (eq, contains, and)
- Type validation (string, number, boolean, enum)
- Hot reload support
- Comprehensive test coverage
- Error handling and logging
- Complete documentation (README, dev guide, examples)
🚀 Post-MVP Features
- Additional filter operators (ne, gt, gte, lt, lte, in, or)
- Sorting (ORDER BY)
- Complex types (arrays, nested objects)
- Multiple data sources (PostgreSQL, SQLite, JSON)
- Aggregations (COUNT, SUM, AVG, etc.)
- Full-text search
- Caching layer for performance
- GraphQL-style query language
See docs/project-plans/project-plan-v1.final.md for details.
Documentation
- Project Overview - Development approach and engineering narrative
- Developer Guide - Architecture and internals
- Configuration Reference - Complete config documentation
- Examples - Sample datasets and configs
- Project Plan - MVP scope and roadmap
- Phase Execution - Implementation tracking
Contributing
Contributions welcome! This project follows:
- Hexagonal architecture - Keep domain pure
- Test-driven development - Write tests first
- Type safety - Leverage TypeScript
- Clean code - Follow existing patterns
See .clinerules/core-rules.md for architectural guidelines.
Troubleshooting
Server won't start
Check configuration:
node dist/index.js
# Look for validation errors in output
Common issues:
- CSV file path incorrect
- Field type mismatch
- Missing required fields
- Duplicate dataset IDs
Queries return no results
Verify dataset:
{
"tool": "describe_dataset",
"arguments": { "datasetId": "your-dataset" }
}
Check:
- Dataset ID is correct
- CSV file has data
- Filters match field types
- Field names are in `visibleFields`
Hot reload not working
Verify file watching:
- Config file path is correct
- File system permissions allow reading
- Check server logs for reload confirmation
License
MIT
Credits
Built with:
- TypeScript
- Model Context Protocol SDK
- Zod - Schema validation
- Chokidar - File watching
- Jest - Testing
Project Status
Version: 1.0.0-mvp
Status: Stable MVP
Last Updated: 2025-11-30
Note: Test and performance numbers in badges reflect the state at release. Designed for production-style workloads, but validate performance and fit for your specific environment.
All 6 phases complete:
- ✅ Phase 1: Skeleton & Config
- ✅ Phase 2: Core Use Cases
- ✅ Phase 3: Hot Reload
- ✅ Phase 4: MCP Adapter
- ✅ Phase 5: Hardening & Testing
- ✅ Phase 6: Documentation
See docs/execution/master-checklist.md for detailed progress.
