Crawl4AI RAG
Web Crawling and RAG Capabilities for AI Agents and AI Coding Assistants
A powerful implementation of the Model Context Protocol (MCP) integrated with Crawl4AI and Supabase for providing AI agents and AI coding assistants with advanced web crawling and RAG capabilities.
With this MCP server, you can scrape anything and then use that knowledge anywhere for RAG.
The primary goal is to bring this MCP server into Archon as I evolve it into a knowledge engine that AI coding assistants can use to build AI agents. This first version of the Crawl4AI/RAG MCP server will be improved considerably soon, especially by making it more configurable so you can use different embedding models and run everything locally with Ollama.
This MCP server provides tools that enable AI agents to crawl websites, store content in a vector database (Supabase), and perform RAG over the crawled content. It follows the best practices for building MCP servers based on the Mem0 MCP server template I provided on my channel previously.
The server includes several advanced RAG strategies that can be enabled to enhance retrieval quality: contextual embeddings, hybrid search, agentic RAG, reranking, and a knowledge graph.
See the Configuration section below for details on how to enable and configure these strategies.
The Crawl4AI RAG MCP server is just the beginning. Here's where we're headed:
Integration with Archon: Building this system directly into Archon to create a comprehensive knowledge engine for AI coding assistants to build better AI agents.
Multiple Embedding Models: Expanding beyond OpenAI to support a variety of embedding models, including the ability to run everything locally with Ollama for complete control and privacy.
Advanced RAG Strategies: Implementing sophisticated retrieval techniques like contextual retrieval, late chunking, and others to move beyond basic "naive lookups" and significantly enhance the power and precision of the RAG system, especially as it integrates with Archon.
Enhanced Chunking Strategy: Implementing a Context 7-inspired chunking approach that focuses on examples and creates distinct, semantically meaningful sections for each chunk, improving retrieval precision.
Performance Optimization: Increasing crawling and indexing speed to make it more realistic to "quickly" index new documentation to then leverage it within the same prompt in an AI coding assistant.
The server provides essential web crawling and search tools:
crawl_single_page: Quickly crawl a single web page and store its content in the vector database
smart_crawl_url: Intelligently crawl a full website based on the type of URL provided (sitemap, llms-full.txt, or a regular webpage that needs to be crawled recursively)
get_available_sources: Get a list of all available sources (domains) in the database
perform_rag_query: Search for relevant content using semantic search with optional source filtering
search_code_examples (requires USE_AGENTIC_RAG=true): Search specifically for code examples and their summaries from crawled documentation. This tool provides targeted code snippet retrieval for AI coding assistants.
Knowledge graph tools (require USE_KNOWLEDGE_GRAPH=true):
parse_github_repository: Parse a GitHub repository into a Neo4j knowledge graph, extracting classes, methods, functions, and their relationships for hallucination detection
check_ai_script_hallucinations: Analyze Python scripts for AI hallucinations by validating imports, method calls, and class usage against the knowledge graph
query_knowledge_graph: Explore and query the Neo4j knowledge graph with commands like repos, classes, methods, and custom Cypher queries
Clone this repository:
git clone https://github.com/coleam00/mcp-crawl4ai-rag.git
cd mcp-crawl4ai-rag
Build the Docker image:
docker build -t mcp/crawl4ai-rag --build-arg PORT=8051 .
Create a .env file based on the configuration section below
To install with uv instead of Docker, clone this repository:
git clone https://github.com/coleam00/mcp-crawl4ai-rag.git
cd mcp-crawl4ai-rag
Install uv if you don't have it:
pip install uv
Create and activate a virtual environment:
uv venv
.venv\Scripts\activate # on Mac/Linux: source .venv/bin/activate
Install dependencies:
uv pip install -e .
crawl4ai-setup
Create a .env file based on the configuration section below
Before running the server, you need to set up the database with the pgvector extension:
Go to the SQL Editor in your Supabase dashboard (create a new project first if necessary)
Create a new query and paste the contents of crawled_pages.sql
Run the query to create the necessary tables and functions
To enable AI hallucination detection and repository analysis features, you need to set up Neo4j:
The easiest way to get Neo4j running locally is with the Local AI Package - a curated collection of local AI services including Neo4j:
Clone the Local AI Package:
git clone https://github.com/coleam00/local-ai-packaged.git
cd local-ai-packaged
Start Neo4j: Follow the instructions in the Local AI Package repository to start Neo4j with Docker Compose
Default connection details:
URI: bolt://localhost:7687
Username: neo4j
Alternatively, install Neo4j directly:
Install Neo4j Desktop: Download from neo4j.com/download
Create a new database and set a password for the neo4j user
Note your connection details:
URI: bolt://localhost:7687 (default)
Username: neo4j (default)
Create a .env file in the project root with the following variables:
# MCP Server Configuration
HOST=0.0.0.0
PORT=8051
TRANSPORT=sse
# OpenAI API Configuration
OPENAI_API_KEY=your_openai_api_key
# LLM for summaries and contextual embeddings
MODEL_CHOICE=gpt-4.1-nano
# RAG Strategies (set to "true" or "false", default to "false")
USE_CONTEXTUAL_EMBEDDINGS=false
USE_HYBRID_SEARCH=false
USE_AGENTIC_RAG=false
USE_RERANKING=false
USE_KNOWLEDGE_GRAPH=false
# Supabase Configuration
SUPABASE_URL=your_supabase_project_url
SUPABASE_SERVICE_KEY=your_supabase_service_key
# Neo4j Configuration (required for knowledge graph functionality)
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_neo4j_password
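As a rough illustration of how the strategy flags above could be consumed, here is a minimal sketch that reads them from the environment and treats anything other than "true" as disabled. The function names `read_rag_flags` and `missing_neo4j_settings` are hypothetical and are not taken from the server's source; only the variable names come from the .env example.

```python
import os

# Flag names copied from the .env example; everything else here is illustrative.
RAG_FLAGS = (
    "USE_CONTEXTUAL_EMBEDDINGS",
    "USE_HYBRID_SEARCH",
    "USE_AGENTIC_RAG",
    "USE_RERANKING",
    "USE_KNOWLEDGE_GRAPH",
)
NEO4J_KEYS = ("NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD")


def read_rag_flags(env=None):
    """Return {flag: bool}. A missing or non-"true" value counts as False."""
    env = os.environ if env is None else env
    return {
        name: env.get(name, "false").strip().lower() == "true"
        for name in RAG_FLAGS
    }


def missing_neo4j_settings(env=None):
    """List the Neo4j variables that are unset while the knowledge graph is on."""
    env = os.environ if env is None else env
    if not read_rag_flags(env)["USE_KNOWLEDGE_GRAPH"]:
        return []
    return [key for key in NEO4J_KEYS if not env.get(key)]


if __name__ == "__main__":
    print(read_rag_flags())
    print("Missing Neo4j settings:", missing_neo4j_settings())
```

Checking the Neo4j variables only when USE_KNOWLEDGE_GRAPH is on mirrors the comment in the .env example that they are required for knowledge graph functionality.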
The Crawl4AI RAG MCP server supports five RAG strategies that can be enabled independently, one per USE_* flag in the configuration above:
USE_CONTEXTUAL_EMBEDDINGS: When enabled, this strategy enhances each chunk's embedding with additional context from the entire document. The system passes both the full document and the specific chunk to an LLM (configured via MODEL_CHOICE) to generate enriched context that gets embedded alongside the chunk content.
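The contextual embedding step described above can be sketched roughly as follows. This is not the server's actual code: the prompt wording, the `max_doc_chars` cutoff, the `---` separator, and the function names are all assumptions, and the LLM call is passed in as a plain callable so the sketch stays independent of any particular client library.

```python
from typing import Callable


def build_context_prompt(full_document: str, chunk: str, max_doc_chars: int = 25000) -> str:
    """Build a prompt that shows the LLM both the document and one chunk.

    max_doc_chars is an assumed safety cutoff for very long documents.
    """
    return (
        "<document>\n" + full_document[:max_doc_chars] + "\n</document>\n"
        "<chunk>\n" + chunk + "\n</chunk>\n"
        "Write a short context that situates this chunk within the document, "
        "to improve search retrieval of the chunk. Answer with the context only."
    )


def contextualize_chunk(full_document: str, chunk: str, generate: Callable[[str], str]) -> str:
    """Return the text to embed: generated context followed by the chunk.

    `generate` is any function that sends a prompt to the model named by
    MODEL_CHOICE and returns its reply. On failure the plain chunk is used.
    """
    try:
        context = generate(build_context_prompt(full_document, chunk)).strip()
    except Exception:
        return chunk
    return f"{context}\n---\n{chunk}" if context else chunk
```

Falling back to the unmodified chunk keeps indexing going when the model call fails, at the cost of a less informative embedding for that chunk.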
USE_HYBRID_SEARCH: Combines traditional keyword search with semantic vector search to provide more comprehensive results. The system performs both searches in parallel and intelligently merges results, prioritizing documents that appear in both result sets.
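One simple way to merge the two result sets so that documents found by both searches come first is sketched below. The server's real merge logic may differ; the `id` key and the function name `merge_hybrid_results` are assumptions made for illustration.

```python
def merge_hybrid_results(vector_hits, keyword_hits, limit=5):
    """Merge two best-first hit lists, putting hits found by both searches first.

    Each hit is a dict with at least an "id" key (an assumed shape).
    """
    keyword_ids = {hit["id"] for hit in keyword_hits}
    merged, seen = [], set()

    def take(hits, keep):
        # Append hits that pass `keep` and have not been added yet.
        for hit in hits:
            if hit["id"] not in seen and keep(hit):
                seen.add(hit["id"])
                merged.append(hit)

    take(vector_hits, lambda hit: hit["id"] in keyword_ids)  # found by both
    take(vector_hits, lambda hit: True)                      # vector-only
    take(keyword_hits, lambda hit: True)                     # keyword-only
    return merged[:limit]


vector = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
keyword = [{"id": "c"}, {"id": "d"}, {"id": "a"}]
print([hit["id"] for hit in merge_hybrid_results(vector, keyword)])
# prints ['a', 'c', 'b', 'd']
```

Here "a" and "c" appear in both lists, so they are returned ahead of the vector-only "b" and the keyword-only "d".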
USE_AGENTIC_RAG: Enables specialized code example extraction and storage. When crawling documentation, the system identifies code blocks (≥300 characters), extracts them with surrounding context, generates summaries, and stores them in a separate vector database table specifically designed for code search. It provides the search_code_examples tool that AI agents can use to find specific code implementations.
USE_RERANKING: Applies cross-encoder reranking to search results after initial retrieval. Uses a lightweight cross-encoder model (cross-encoder/ms-marco-MiniLM-L-6-v2) to score each result against the original query, then reorders results by relevance.
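The reranking step can be sketched as scoring (query, text) pairs and sorting by score. In the sketch below the scorer is injected, so the logic can be read and run without downloading a model; `load_cross_encoder_scorer` shows how the sentence-transformers `CrossEncoder` named above would plug in. The `content` key, the `rerank_score` field, and both function names are assumptions rather than the server's actual names.

```python
def rerank(query, results, score_pairs, text_key="content"):
    """Return copies of `results` sorted by cross-encoder score, best first.

    `score_pairs` takes a list of [query, text] pairs and returns one score
    per pair, which matches the calling convention of CrossEncoder.predict.
    """
    if not results:
        return []
    pairs = [[query, result[text_key]] for result in results]
    scores = score_pairs(pairs)
    ranked = [dict(result, rerank_score=float(score))
              for result, score in zip(results, scores)]
    ranked.sort(key=lambda result: result["rerank_score"], reverse=True)
    return ranked


def load_cross_encoder_scorer(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
    """Load the model lazily so the rest of the sketch has no heavy dependency."""
    from sentence_transformers import CrossEncoder  # requires sentence-transformers
    return CrossEncoder(model_name).predict


# Usage, assuming sentence-transformers is installed:
#   scorer = load_cross_encoder_scorer()
#   top = rerank("how do I set the neo4j password", hits, scorer)[:5]
```

Because the cross-encoder reads the query and each candidate together, it is slower than vector search, which is why it is applied only to the results that initial retrieval has already narrowed down.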
USE_KNOWLEDGE_GRAPH: Enables AI hallucination detection and repository analysis using Neo4j knowledge graphs. When enabled, the system can parse GitHub repositories into a graph database and validate AI-generated code against real repository structures. It provides three tools: parse_github_repository for indexing codebases, check_ai_script_hallucinations for validating AI-generated code, and query_knowledge_graph for exploring indexed repositories.
For general documentation RAG:
USE_CONTEXTUAL_EMBEDDINGS=false
USE_HYBRID_SEARCH=true
USE_AGENTIC_RAG=false
USE_RERANKING=true
For AI coding assistant with code examples:
USE_CONTEXTUAL_EMBEDDINGS=true
USE_HYBRID_SEARCH=true
USE_AGENTIC_RAG=true
USE_RERANKING=true
USE_KNOWLEDGE_GRAPH=false
For AI coding assistant with hallucination detection:
USE_CONTEXTUAL_EMBEDDINGS=true
USE_HYBRID_SEARCH=true
USE_AGENTIC_RAG=true
USE_RERANKING=true
USE_KNOWLEDGE_GRAPH=true
For fast, basic RAG:
USE_CONTEXTUAL_EMBEDDINGS=false
USE_HYBRID_SEARCH=true
USE_AGENTIC_RAG=false
USE_RERANKING=false
USE_KNOWLEDGE_GRAPH=false
With Docker:
docker run --env-file .env -p 8051:8051 mcp/crawl4ai-rag
With uv:
uv run src/crawl4ai_mcp.py
The server will start and listen on the configured host and port.
Once you have the server running with SSE transport, you can connect to it using this configuration:
{ "mcpServers": { "crawl4ai-rag": { "transport": "sse", "url": "http://localhost:8051/sse" } } }
Note for Windsurf users: Use serverUrl instead of url in your configuration:
{ "mcpServers": { "crawl4ai-rag": { "transport": "sse", "serverUrl": "http://localhost:8051/sse" } } }
Note for Docker users: Use host.docker.internal instead of localhost if your client is running in a different container. This will apply if you are using this MCP server within n8n!
Note for Claude Code users:
claude mcp add-json crawl4ai-rag '{"type":"http","url":"http://localhost:8051/sse"}' --scope user
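The Windsurf and Docker notes above change only the key name and the host of the SSE configuration. As a small sketch, the variants can be generated from one helper; `sse_client_config` and its parameters are hypothetical names used for illustration and are not part of the server or of any MCP client.

```python
import json


def sse_client_config(port=8051, client_in_container=False, windsurf=False,
                      name="crawl4ai-rag"):
    """Build the SSE client configuration shown above as a Python dict."""
    # A client in another container cannot reach the server via localhost.
    host = "host.docker.internal" if client_in_container else "localhost"
    # Windsurf expects "serverUrl" where other clients expect "url".
    url_key = "serverUrl" if windsurf else "url"
    return {
        "mcpServers": {
            name: {"transport": "sse", url_key: f"http://{host}:{port}/sse"}
        }
    }


if __name__ == "__main__":
    print(json.dumps(sse_client_config(windsurf=True), indent=2))
```

The default port matches PORT=8051 from the .env example; change it if the server is configured differently.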
For stdio transport, add this server to your MCP configuration for Claude Desktop, Windsurf, or any other MCP client:
{
  "mcpServers": {
    "crawl4ai-rag": {
      "command": "python",
      "args": ["path/to/crawl4ai-mcp/src/crawl4ai_mcp.py"],
      "env": {
        "TRANSPORT": "stdio",
        "OPENAI_API_KEY": "your_openai_api_key",
        "SUPABASE_URL": "your_supabase_url",
        "SUPABASE_SERVICE_KEY": "your_supabase_service_key",
        "USE_KNOWLEDGE_GRAPH": "false",
        "NEO4J_URI": "bolt://localhost:7687",
        "NEO4J_USER": "neo4j",
        "NEO4J_PASSWORD": "your_neo4j_password"
      }
    }
  }
}
To run the stdio server through Docker instead (the image name matches the one built during installation):
{
  "mcpServers": {
    "crawl4ai-rag": {
      "command": "docker",
      "args": [
        "run", "--rm", "-i",
        "-e", "TRANSPORT",
        "-e", "OPENAI_API_KEY",
        "-e", "SUPABASE_URL",
        "-e", "SUPABASE_SERVICE_KEY",
        "-e", "USE_KNOWLEDGE_GRAPH",
        "-e", "NEO4J_URI",
        "-e", "NEO4J_USER",
        "-e", "NEO4J_PASSWORD",
        "mcp/crawl4ai-rag"
      ],
      "env": {
        "TRANSPORT": "stdio",
        "OPENAI_API_KEY": "your_openai_api_key",
        "SUPABASE_URL": "your_supabase_url",
        "SUPABASE_SERVICE_KEY": "your_supabase_service_key",
        "USE_KNOWLEDGE_GRAPH": "false",
        "NEO4J_URI": "bolt://localhost:7687",
        "NEO4J_USER": "neo4j",
        "NEO4J_PASSWORD": "your_neo4j_password"
      }
    }
  }
}
The knowledge graph system stores repository code structure in Neo4j with the following components:
Scripts in the knowledge_graphs/ folder:
parse_repo_into_neo4j.py: Clones and analyzes GitHub repositories, extracting Python classes, methods, functions, and imports into Neo4j nodes and relationships
ai_script_analyzer.py: Parses Python scripts using AST to extract imports, class instantiations, method calls, and function usage
knowledge_graph_validator.py: Validates AI-generated code against the knowledge graph to detect hallucinations (non-existent methods, incorrect parameters, etc.)
hallucination_reporter.py: Generates comprehensive reports about detected hallucinations with confidence scores and recommendations
query_knowledge_graph.py: Interactive CLI tool for exploring the knowledge graph (functionality now integrated into MCP tools)
The Neo4j database stores code structure as:
Nodes:
Repository: GitHub repositories
File: Python files within repositories
Class: Python classes with methods and attributes
Method: Class methods with parameter information
Function: Standalone functions
Attribute: Class attributes
Relationships:
Repository -[:CONTAINS]-> File
File -[:DEFINES]-> Class
File -[:DEFINES]-> Function
Class -[:HAS_METHOD]-> Method
Class -[:HAS_ATTRIBUTE]-> Attribute
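To make the validation idea concrete, here is a minimal sketch of checking a script's method calls against known class methods. It is not the code in knowledge_graphs/: a plain dict stands in for the Class -[:HAS_METHOD]-> Method relationships that would normally be read from Neo4j, and the `Crawler` class, its methods, and the function name are hypothetical. Only Python's standard `ast` module is used.

```python
import ast

# In-memory stand-in for Class -[:HAS_METHOD]-> Method edges (hypothetical data).
KNOWN_METHODS = {"Crawler": {"start", "stop"}}


def find_unknown_method_calls(source, known_methods):
    """Return (class, method, line) for calls to methods the graph does not list.

    Only handles the simple pattern `x = SomeClass(...)` followed by
    `x.method(...)`; it does not track scopes or reassignment.
    """
    tree = ast.parse(source)

    # Pass 1: map variable names to the known class they were instantiated from.
    instances = {}
    for node in ast.walk(tree):
        if (isinstance(node, ast.Assign)
                and isinstance(node.value, ast.Call)
                and isinstance(node.value.func, ast.Name)
                and node.value.func.id in known_methods):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    instances[target.id] = node.value.func.id

    # Pass 2: flag method calls on those variables that the class does not define.
    problems = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id in instances):
            class_name = instances[node.func.value.id]
            if node.func.attr not in known_methods[class_name]:
                problems.append((class_name, node.func.attr, node.lineno))
    return sorted(problems, key=lambda problem: problem[2])


script = "c = Crawler()\nc.start()\nc.teleport()\n"
print(find_unknown_method_calls(script, KNOWN_METHODS))
# prints [('Crawler', 'teleport', 3)]
```

A real validator would also need to resolve imports, check parameters, and handle attribute chains, which is where the richer graph schema above comes in.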
Use the parse_github_repository tool to clone and analyze open-source repositories
Use the check_ai_script_hallucinations tool to validate AI-generated Python scripts
Use the query_knowledge_graph tool to explore available repositories, classes, and methods
This implementation provides a foundation for building more complex MCP servers with web crawling capabilities. To build your own:
Add your own tools with the @mcp.tool() decorator
Use the utils.py file for any helper functions you need