# Documentation Search Engine
Toolset to crawl websites, generate Markdown docs, and make them searchable via MCP server.
This project provides a toolset to crawl websites, generate Markdown documentation, and make that documentation searchable via a Model Context Protocol (MCP) server, designed for integration with tools like Cursor.
## Features

- **Web Crawler (`crawler_cli`):**
  - Crawls websites with `crawl4ai`, starting from a given URL.
  - Saves the generated Markdown to `./storage/` by default.
- **MCP Server (`mcp_server`):**
  - Loads Markdown files from the `./storage/` directory.
  - Generates semantic embeddings with `sentence-transformers` (`multi-qa-mpnet-base-dot-v1`).
  - Uses a pickle cache (`storage/document_chunks_cache.pkl`) to store processed chunks and embeddings.
    - Cache hit: if the `.md` files in `./storage/` haven't changed, the server loads directly from the cache, resulting in much faster startup times.
    - Cache invalidation: the cache is rebuilt if any `.md` file in `./storage/` is modified, added, or removed since the cache was last created.
  - Exposes MCP tools via `fastmcp` for clients like Cursor:
    - `list_documents`: Lists available crawled documents.
    - `get_document_headings`: Retrieves the heading structure for a document.
    - `search_documentation`: Performs semantic search over document chunks using vector similarity.
  - Uses the `stdio` transport for use within Cursor.

## Workflow

1. **Crawl:** Use the `crawler_cli` tool to crawl a website and generate a `.md` file in `./storage/`.
2. **Serve:** Start the `mcp_server` (typically managed by an MCP client like Cursor). The server loads, chunks, and embeds the `.md` files in `./storage/`.
3. **Query:** Use the MCP tools (`list_documents`, `search_documentation`, etc.) to query the crawled content.

## Setup

This project uses `uv` for dependency management and execution.
1. **Install `uv`:** Follow the instructions on the uv website.

2. **Clone the repository:**

   ```bash
   git clone https://github.com/alizdavoodi/MCPDocSearch.git
   cd MCPDocSearch
   ```

3. **Install dependencies:**

   ```bash
   uv sync
   ```

   This command creates a virtual environment (usually `.venv`) and installs all dependencies listed in `pyproject.toml`.
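To confirm the environment resolved correctly, you can run a quick import check. This snippet is not part of the project; it simply queries the installed versions of the distributions listed under Dependencies below.

```python
# check_env.py -- optional sanity check that the key dependencies are installed
from importlib.metadata import version, PackageNotFoundError

for dist in ("crawl4ai", "fastmcp", "sentence-transformers", "torch", "typer"):
    try:
        print(f"{dist}: {version(dist)}")
    except PackageNotFoundError:
        print(f"{dist}: NOT INSTALLED")
```

Run it with `uv run python check_env.py` from the project root.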
## Running the Crawler

Run the crawler using the `crawl.py` script or directly via `uv run`.

**Basic example:**

```bash
uv run python crawl.py https://docs.example.com
```

This will crawl `https://docs.example.com` with default settings and save the output to `./storage/docs.example.com.md`.

**Example with options:**

```bash
uv run python crawl.py https://docs.another.site \
  --output ./storage/custom_name.md \
  --max-depth 2 \
  --keyword "API" --keyword "Reference" \
  --exclude-pattern "*blog*"
```

**View all options:**

```bash
uv run python crawl.py --help
```
Key options include:

- `--output` / `-o`: Specify the output file path.
- `--max-depth` / `-d`: Set crawl depth (must be between 1 and 5).
- `--include-pattern` / `--exclude-pattern`: Filter URLs to crawl.
- `--keyword` / `-k`: Keywords for relevance scoring during the crawl.
- `--remove-links` / `--keep-links`: Control HTML cleaning.
- `--cache-mode`: Control `crawl4ai` caching (`DEFAULT`, `BYPASS`, `FORCE_REFRESH`).
- `--wait-for`: Wait for a specific time (seconds) or CSS selector before capturing content (e.g., `5` or `'css:.content'`). Useful for pages with delayed loading.
- `--js-code`: Execute custom JavaScript on the page before capturing content.
- `--page-load-timeout`: Set the maximum time (seconds) to wait for a page to load.
- `--wait-for-js-render` / `--no-wait-for-js-render`: Enable a specific script to better handle JavaScript-heavy Single Page Applications (SPAs) by scrolling and clicking potential "load more" buttons. Automatically sets a default wait time if `--wait-for` is not specified.

### Crawling a Specific Subsection

Sometimes you might want to crawl only a specific subsection of a documentation site. This often requires some trial and error with `--include-pattern` and `--max-depth`:
- `--include-pattern`: Restricts the crawler to only follow links whose URLs match the given pattern(s). Use wildcards (`*`) for flexibility.
- `--max-depth`: Controls how many "clicks" away from the starting URL the crawler will go. A depth of 1 means it only crawls pages directly linked from the start URL. A depth of 2 means it crawls those pages and pages linked from them (if they also match include patterns), and so on.

**Example: Crawling only the Pulsar Admin API section**
Suppose you want only the content under `https://pulsar.apache.org/docs/4.0.x/admin-api-*`.

1. Start the crawl at `https://pulsar.apache.org/docs/4.0.x/admin-api-overview/`.
2. Restrict the crawl to URLs containing `admin-api`: `--include-pattern "*admin-api*"`.
3. Start with a `--max-depth` of `2` and increase it if needed.
4. Run with `-v` to see which URLs are being visited or skipped, which helps debug the patterns and depth.

```bash
uv run python crawl.py https://pulsar.apache.org/docs/4.0.x/admin-api-overview/ -v --include-pattern "*admin-api*" --max-depth 2
```
Check the output file (`./storage/pulsar.apache.org.md` by default in this case). If pages are missing, try increasing `--max-depth` to `3`. If too many unrelated pages are included, make the `--include-pattern` more specific or add `--exclude-pattern` rules.
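You can also sanity-check your patterns offline before starting a long crawl. The sketch below assumes glob-style wildcard semantics (as suggested by the `*` wildcards above); the crawler's actual matching rules may differ, so treat it as a rough pre-check rather than a guarantee.

```python
# pattern_check.py -- rough offline check of include/exclude patterns
# Assumes fnmatch-style glob matching; the crawler's real rules may differ.
from fnmatch import fnmatch

include = ["*admin-api*"]
exclude = ["*blog*"]

candidates = [
    "https://pulsar.apache.org/docs/4.0.x/admin-api-overview/",
    "https://pulsar.apache.org/docs/4.0.x/admin-api-clusters/",
    "https://pulsar.apache.org/docs/4.0.x/concepts-overview/",
]

for url in candidates:
    included = any(fnmatch(url, p) for p in include)
    excluded = any(fnmatch(url, p) for p in exclude)
    verdict = "CRAWL" if included and not excluded else "SKIP"
    print(f"{verdict:<6}{url}")
```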
## Running the MCP Server

The MCP server is designed to be run by an MCP client like Cursor via the `stdio` transport. The command to run the server is:

```bash
python -m mcp_server.main
```

However, it needs to be run from the project's root directory (`MCPDocSearch`) so that Python can find the `mcp_server` module.
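For orientation, a `stdio` MCP server built with `fastmcp` looks roughly like the sketch below. This is not the project's actual `mcp_server.main`; the tool bodies are placeholders, and only the tool names match the ones listed under Features.

```python
# sketch_server.py -- minimal fastmcp server over stdio (illustrative only)
from fastmcp import FastMCP

mcp = FastMCP("doc-query-server")

@mcp.tool()
def list_documents() -> list[str]:
    """List available crawled documents (placeholder implementation)."""
    return ["docs.example.com.md"]

@mcp.tool()
def search_documentation(query: str, top_k: int = 5) -> list[str]:
    """Semantic search over document chunks (placeholder implementation)."""
    # A real implementation would score pre-computed embeddings against the query.
    return [f"placeholder result for {query!r}"][:top_k]

if __name__ == "__main__":
    mcp.run()  # defaults to the stdio transport
```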
The MCP server generates embeddings locally the first time it runs or whenever the source Markdown files in `./storage/` change. This process involves loading a machine learning model and processing all the text chunks.
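The search itself comes down to embedding the query with the same model and comparing it against the pre-computed chunk embeddings. The sketch below shows that general approach with `sentence-transformers` and the `multi-qa-mpnet-base-dot-v1` model named above; the chunking and scoring details of the real server may differ, and the naive heading split here is only an illustration.

```python
# search_sketch.py -- illustrative semantic search over Markdown chunks
# Not the project's data_loader/mcp_tools code; assumes ./storage/ already holds crawled .md files.
import re
from pathlib import Path
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")

# Naive heading-based chunking: split each Markdown file before heading lines.
chunks = []
for md_file in Path("./storage").glob("*.md"):
    for chunk in re.split(r"\n(?=#{1,6} )", md_file.read_text(encoding="utf-8")):
        if chunk.strip():
            chunks.append({"document": md_file.name, "text": chunk.strip()})

corpus_embeddings = model.encode([c["text"] for c in chunks], convert_to_tensor=True)

query = "how to install"
query_embedding = model.encode(query, convert_to_tensor=True)

# multi-qa-mpnet-base-dot-v1 is tuned for dot-product scoring.
scores = util.dot_score(query_embedding, corpus_embeddings)[0]
for score, idx in zip(*scores.topk(min(3, len(chunks)))):
    print(f"{score:.2f}  {chunks[idx]['document']}  {chunks[idx]['text'][:60]!r}")
```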
To use this server with Cursor, create a `.cursor/mcp.json` file in the root of this project (`MCPDocSearch/.cursor/mcp.json`) with the following content:

```json
{
  "mcpServers": {
    "doc-query-server": {
      "command": "uv",
      "args": [
        "--directory",
        // IMPORTANT: Replace with the ABSOLUTE path to this project directory on your machine
        "/path/to/your/MCPDocSearch",
        "run",
        "python",
        "-m",
        "mcp_server.main"
      ],
      "env": {}
    }
  }
}
```
Explanation:

- `"doc-query-server"`: A name for the server within Cursor.
- `"command": "uv"`: Specifies `uv` as the command runner.
- `"args"`:
  - `"--directory", "/path/to/your/MCPDocSearch"`: Crucially, tells `uv` to change its working directory to your project root before running the command. Replace `/path/to/your/MCPDocSearch` with the actual absolute path on your system.
  - `"run", "python", "-m", "mcp_server.main"`: The command `uv` will execute within the correct directory and virtual environment.

After saving this file and restarting Cursor, the "doc-query-server" should become available in Cursor's MCP settings and usable by the Agent (e.g., `@doc-query-server search documentation for "how to install"`).
For Claude for Desktop, follow its official MCP documentation to set up the server.
## Dependencies

Key libraries used:

- `crawl4ai`: Core web crawling functionality.
- `fastmcp`: MCP server implementation.
- `sentence-transformers`: Generating text embeddings.
- `torch`: Required by `sentence-transformers`.
- `typer`: Building the crawler CLI.
- `uv`: Project and environment management.
- `beautifulsoup4` (via `crawl4ai`): HTML parsing.
- `rich`: Enhanced terminal output.

## Architecture

The project follows this basic flow:
1. **`crawler_cli`:** You run this tool, providing a starting URL and options.
2. **Crawling (`crawl4ai`):** The tool uses `crawl4ai` to fetch web pages, following links based on configured rules (depth, patterns).
3. **Cleaning (`crawler_cli/markdown.py`):** Optionally, HTML content is cleaned (removing navigation, links) using BeautifulSoup.
4. **Markdown conversion (`crawl4ai`):** Cleaned HTML is converted to Markdown.
5. **Storage (`./storage/`):** The generated Markdown content is saved to a file in the `./storage/` directory.
6. **`mcp_server` startup:** When the MCP server starts (usually via Cursor's config), it runs `mcp_server/data_loader.py`.
7. **Cache check (`.pkl`):** If the cache is valid, chunks and embeddings are loaded from it. Otherwise, the server reads the `.md` files from `./storage/`.
8. **Chunking and embedding:** Document chunks are embedded with `sentence-transformers` and stored in memory (and saved to cache).
9. **MCP tools (`mcp_server/mcp_tools.py`):** The server exposes tools (`list_documents`, `search_documentation`, etc.) via `fastmcp`.
10. **Search:** `search_documentation` uses the pre-computed embeddings to find relevant chunks based on semantic similarity to the query.

## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Contributing

Contributions are welcome! Please feel free to open an issue or submit a pull request.
## Security Note

This project uses Python's `pickle` module to cache processed data (`storage/document_chunks_cache.pkl`). Unpickling data from untrusted sources can be insecure. Ensure that the `./storage/` directory is only writable by trusted users/processes.
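As a rough illustration of the cache-invalidation behaviour described under Features, the sketch below rebuilds a pickle cache whenever the set of `.md` files or their modification times changes. The file name matches the project's cache file, but the cache layout and the validity check are assumptions, not the server's actual implementation.

```python
# cache_sketch.py -- illustrative pickle cache keyed to the .md files in ./storage/
# The validity check (file names + mtimes) is an assumption, not the project's real logic.
import pickle
from pathlib import Path

STORAGE = Path("./storage")
CACHE_FILE = STORAGE / "document_chunks_cache.pkl"

def storage_fingerprint() -> dict[str, float]:
    """Map each .md file to its modification time."""
    return {p.name: p.stat().st_mtime for p in STORAGE.glob("*.md")}

def load_or_rebuild(build_chunks):
    """Return cached chunks if ./storage/ is unchanged, otherwise rebuild and re-cache."""
    fingerprint = storage_fingerprint()
    if CACHE_FILE.exists():
        with CACHE_FILE.open("rb") as f:
            cached = pickle.load(f)  # only safe because ./storage/ is trusted
        if cached.get("fingerprint") == fingerprint:
            return cached["chunks"]  # cache hit: fast startup
    chunks = build_chunks()  # cache miss: re-chunk and re-embed the documents
    with CACHE_FILE.open("wb") as f:
        pickle.dump({"fingerprint": fingerprint, "chunks": chunks}, f)
    return chunks
```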