
Pulse Fetch

STDIO web scraping MCP server with anti-bot bypass and intelligent caching
Haven't heard about MCP yet? The easiest way to keep up-to-date is to read our weekly newsletter at PulseMCP.
This is an MCP (Model Context Protocol) Server that pulls specific resources from the open internet into context, designed for agent-building frameworks and MCP clients that lack built-in fetch capabilities.
Pulse Fetch is purpose-built for extracting clean, structured content from web pages while minimizing token usage and providing reliable access to protected content through advanced anti-bot bypassing capabilities.
This project is built and maintained by PulseMCP.
- Clean content extraction: Strips out HTML noise using Mozilla's Readability algorithm to minimize token usage during MCP Tool calls.
- Intelligent caching: Automatically caches scraped content as MCP Resources. Subsequent requests for the same URL return cached content instantly without network calls, dramatically improving performance.
- Anti-bot bypass: Integrates with Firecrawl and BrightData APIs to reliably work around anti-scraping technology.
- Smart strategy selection: Automatically learns and applies the best scraping method for specific URL patterns, improving performance over time.
- LLM-optimized: Offers MCP Prompts and descriptive Tool design for better LLM interaction reliability.
- Flexible formats: Supports multiple output formats including clean markdown, HTML, screenshots, and structured data extraction.
- Intelligent extraction: Extract specific information using natural language queries powered by LLMs.
This server is built and tested on macOS with Claude Desktop. It should work with other MCP clients as well.
| Tool Name | Description |
| --- | --- |
| `scrape` | Scrape a single webpage with advanced content extraction options and multiple output formats. |
The `scrape` tool handles all web content extraction needs and automatically bypasses anti-bot protection when necessary; a minimal client-side call sketch follows the list below.

- Use `forceRescrape: true` to bypass the cache and get fresh content when you know the page has changed
- Use `saveResult: false` to disable both caching and resource saving (not recommended)
- Use the `maxChars` and `startIndex` parameters to handle large content that exceeds token limits
- Use the `extract` parameter with natural language queries to extract specific information from pages (requires LLM configuration)
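For illustration, here is a minimal sketch of calling `scrape` from TypeScript, assuming the official `@modelcontextprotocol/sdk` client; the parameter names come from the documentation below, but the surrounding setup is just one way to wire things up.

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Launch pulse-fetch over STDIO and connect a client to it
const transport = new StdioClientTransport({
  command: "npx",
  args: ["-y", "@pulsemcp/pulse-fetch"],
});
const client = new Client({ name: "example-client", version: "1.0.0" });
await client.connect(transport);

// Scrape a page, capping output size and paginating from the start
const result = await client.callTool({
  name: "scrape",
  arguments: {
    url: "https://example.com/article",
    maxChars: 50_000,
    startIndex: 0,
  },
});
```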
User: "Get the main content from this article: https://example.com/article"
Assistant: I'll extract the content from that article for you.
[Uses scrape tool with onlyMainContent: true]
I've extracted the main article content. The article is titled "Example Article Title" and contains approximately 2,500 words discussing...
User: "Extract the product information from this e-commerce page: https://shop.example.com/product/123"
Assistant: I'll extract the structured product data from that page.
[Uses scrape tool with format: 'extract' and appropriate schema]
I've extracted the product information:
- Product Name: Example Product
- Price: $99.99
- Rating: 4.5/5 stars
- Description: High-quality example product...
User: "This page is blocking me with CAPTCHA. Can you get the content from https://protected.example.com/data"
Assistant: I'll extract the content from that protected page for you.
[Uses scrape tool with automatic anti-bot bypass]
I successfully bypassed the protection and extracted the content from the page using BrightData's Web Unlocker capabilities.
User: "Get the content from https://example.com/article again"
Assistant: I'll retrieve that content for you.
[Uses scrape tool - automatically returns cached content]
I've retrieved the content from cache (originally scraped 2 hours ago). The article contains...
User: "Actually, I think that article was updated. Can you get the latest version?"
Assistant: I'll fetch a fresh copy of the article for you.
[Uses scrape tool with forceRescrape: true]
I've fetched the latest version of the article. I can see it was indeed updated with new information about...
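In client code, the caching behavior above might look like the following sketch (reusing the connected `client` from the earlier example):

```typescript
const url = "https://example.com/article";

// First call hits the network and caches the result as an MCP Resource
const first = await client.callTool({ name: "scrape", arguments: { url } });

// Second call for the same URL is served from cache, with no network call
const cached = await client.callTool({ name: "scrape", arguments: { url } });

// forceRescrape bypasses the cache when the page is known to have changed
const fresh = await client.callTool({
  name: "scrape",
  arguments: { url, forceRescrape: true },
});
```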
If our Design Principles resonate with you, you should consider using our server.
The official reference implementation of `fetch` is the closest alternative. However:

- `fetch` has no mechanisms for bypassing anti-scraping technology, meaning attempts may randomly fail. We integrate with third-party services for reliable access.
- `fetch` is maintained by volunteers, so bugs or edge cases are less likely to be addressed quickly.
- Pulse Fetch caches responses as Resources, allowing easy inspection and re-use of Tool call outcomes.
- Pulse Fetch has more descriptive Tool design that more reliably triggers and completes desired tasks.

Most other alternatives fall short on one or more of these vectors.
| Environment Variable | Description | Required | Default Value | Example |
| --- | --- | --- | --- | --- |
| `FIRECRAWL_API_KEY` | API key for Firecrawl service to bypass anti-bot measures | No | N/A | `fc-abc123...` |
| `BRIGHTDATA_BEARER_TOKEN` | Bearer token for BrightData Web Unlocker service | No | N/A | `Bearer bd_abc123...` |
| `PULSE_FETCH_STRATEGY_CONFIG_PATH` | Path to markdown file containing scraping strategy configuration | No | OS temp dir | `/path/to/scraping-strategies.md` |
| `OPTIMIZE_FOR` | Optimization strategy for scraping: `cost` or `speed` | No | `cost` | `speed` |
| `MCP_RESOURCE_STORAGE` | Storage backend for saved resources: `memory` or `filesystem` | No | `memory` | `filesystem` |
| `MCP_RESOURCE_FILESYSTEM_ROOT` | Directory for filesystem storage (only used with `filesystem` type) | No | `/tmp/pulse-fetch/resources` | `/home/user/mcp-resources` |
The extract feature provides an alternative to MCP's native sampling capability for clients that don't support it. When configured, it enables intelligent information extraction from scraped content using LLMs. If neither LLM configuration nor MCP sampling is available, the `extract` parameter will not be shown in the tool.
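As a sketch, an extract-enabled call might look like this, again reusing the connected `client` from the earlier example (the query string is free-form natural language):

```typescript
const result = await client.callTool({
  name: "scrape",
  arguments: {
    url: "https://shop.example.com/product/123",
    extract: "product name, price, and rating",
  },
});
```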
| Environment Variable | Description | Required | Default Value | Example |
| --- | --- | --- | --- | --- |
| `LLM_PROVIDER` | LLM provider: `anthropic`, `openai`, `openai-compatible` | No | N/A | `anthropic` |
| `LLM_API_KEY` | API key for the chosen LLM provider | No | N/A | `sk-abc123...` |
| `LLM_API_BASE_URL` | Base URL for OpenAI-compatible providers | No | N/A | `https://api.together.xyz/v1` |
| `LLM_MODEL` | Specific model to use for extraction | No | See defaults below | `gpt-4-turbo` |
Default Models:

- Anthropic: `claude-sonnet-4-20250514` (Claude Sonnet 4, latest and most capable)
- OpenAI: `gpt-4.1-mini` (GPT-4.1 Mini, latest and most capable)

You'll need Node.js installed on your machine to run the local version.
macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
Windows: `%APPDATA%\Claude\claude_desktop_config.json`
Add this configuration to your Claude Desktop config file:
Minimal configuration (uses native fetch only):
{ "mcpServers": { "pulse-fetch": { "command": "npx", "args": ["-y", "@pulsemcp/pulse-fetch"] } } }
Full configuration (with all optional environment variables):
{ "mcpServers": { "pulse-fetch": { "command": "npx", "args": ["-y", "@pulsemcp/pulse-fetch"], "env": { "FIRECRAWL_API_KEY": "your-firecrawl-api-key", "BRIGHTDATA_BEARER_TOKEN": "your-brightdata-bearer-token", "PULSE_FETCH_STRATEGY_CONFIG_PATH": "/path/to/your/scraping-strategies.md", "OPTIMIZE_FOR": "cost", "MCP_RESOURCE_STORAGE": "filesystem", "MCP_RESOURCE_FILESYSTEM_ROOT": "/path/to/resource/storage" } } } }
To set up the local version:
```bash
cd pulse-fetch/local
npm install
npm run build
```
For a hosted solution, refer to Pulse Fetch (Remote).
```
pulse-fetch/
├── local/                 # Local server implementation
│   ├── src/
│   │   └── index.ts       # Main entry point
│   ├── build/             # Compiled output
│   └── package.json
├── shared/                # Shared business logic
│   ├── src/
│   │   ├── tools.ts       # Tool implementations
│   │   ├── resources.ts   # Resource implementations
│   │   └── types.ts       # Shared types
│   └── package.json
└── remote/                # Remote server (planned)
    └── README.md
```
```bash
# Build shared module first
cd shared
npm install
npm run build

# Run local server in development
cd ../local
npm install
npm run dev
```
This project includes comprehensive testing capabilities:
```bash
# Install all dependencies
npm run install-all

# Run tests (if implemented)
npm test

# Run linting
npm run lint

# Auto-fix linting issues
npm run lint:fix

# Format code
npm run format

# Check formatting
npm run format:check
```
The project uses ESLint and Prettier for code quality and consistency:
```bash
# Check for linting issues
npm run lint

# Auto-fix linting issues
npm run lint:fix

# Format all code
npm run format

# Check if code is properly formatted
npm run format:check
```
Scrape a single webpage with advanced options for content extraction.
Content Cleaning
By default (`cleanScrape: true`), the tool automatically cleans scraped content:

Disable cleaning (`cleanScrape: false`) only when:
Parameters:

- `url` (string, required): URL to scrape
- `timeout` (number): Maximum time to wait for page load
- `maxChars` (number): Maximum characters to return (default: 100,000)
- `startIndex` (number): Character index to start output from (for pagination)
- `saveResult` (boolean): Save result as MCP Resource (default: true)
- `forceRescrape` (boolean): Force fresh scrape even if cached (default: false)
- `cleanScrape` (boolean): Clean HTML content by converting to semantic Markdown (default: true)
- `extract` (string): Natural language query for intelligent content extraction (requires LLM configuration)

`screenshot` and `screenshot-full-page` in `scrape` tool

Enhanced scraping parameters:
- `includeHtmlTags`: HTML tags to include in output
- `excludeHtmlTags`: HTML tags to exclude from output
- `customUserAgent`: Custom User-Agent string
- `ignoreRobotsTxt`: Whether to ignore robots.txt restrictions
- `proxyUrl`: Optional proxy URL
- `headers`: Custom headers for requests
- `followLinks`: Follow related links on the page

Interactive capabilities:
Image processing:
- `imageStartIndex`: Starting position for image collection
- `raw`: Return raw content instead of processed markdown
- `imageMaxCount`: Maximum images to process per request
- `imageMaxHeight`/`imageMaxWidth`: Image dimension limits
- `imageQuality`: JPEG quality (1-100)
- `enableFetchImages`: Enable image fetching and processing

MIT
The pulse-fetch MCP server includes an intelligent strategy system that automatically selects the best scraping method for different websites.

The `OPTIMIZE_FOR` environment variable controls the order and selection of scraping strategies:

- `COST` (default): Optimizes for the lowest cost by trying native fetch first, then Firecrawl, then BrightData (`native → firecrawl → brightdata`)
- `SPEED`: Optimizes for faster results by skipping native fetch and starting with more powerful scrapers (`firecrawl → brightdata`, skipping native entirely)

Example configuration:
```bash
export OPTIMIZE_FOR=SPEED  # For faster, more reliable scraping
export OPTIMIZE_FOR=COST   # For cost-effective scraping (default)
```
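Conceptually, strategy selection behaves like the sketch below. This is illustrative only, not the server's actual code; the scraper functions are stand-ins.

```typescript
type Strategy = "native" | "firecrawl" | "brightdata";

// SPEED skips native fetch entirely; COST tries the cheapest option first
const order: Strategy[] =
  (process.env.OPTIMIZE_FOR ?? "cost").toLowerCase() === "speed"
    ? ["firecrawl", "brightdata"]
    : ["native", "firecrawl", "brightdata"];

async function scrapeWithFallback(
  url: string,
  scrapers: Record<Strategy, (u: string) => Promise<string>>
): Promise<{ strategy: Strategy; content: string }> {
  for (const strategy of order) {
    try {
      return { strategy, content: await scrapers[strategy](url) };
    } catch {
      // Fall through to the next, more powerful strategy
    }
  }
  throw new Error(`All strategies failed for ${url}`);
}
```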
- `native`: Fast native fetch using the Node.js fetch API (best for simple pages)
- `firecrawl`: Enhanced content extraction using the Firecrawl API (good for complex layouts)
- `brightdata`: Anti-bot bypass using BrightData Web Unlocker (for protected content)

The configuration is stored in a markdown table. By default, it's automatically created in your OS temp directory (e.g., `/tmp/pulse-fetch/scraping-strategies.md` on Unix systems). You can customize the location by setting the `PULSE_FETCH_STRATEGY_CONFIG_PATH` environment variable.
The table has three columns:

- `prefix`: URL prefix to match (e.g., `reddit.com` or `reddit.com/r/`)
- `default_strategy`: the strategy to use (`native`, `firecrawl`, or `brightdata`)
- `notes`: free-form explanation of why the rule exists
| prefix        | default_strategy | notes                                               |
| ------------- | ---------------- | --------------------------------------------------- |
| reddit.com/r/ | brightdata       | Reddit requires anti-bot bypass for subreddit pages |
| reddit.com    | firecrawl        | General Reddit pages work well with Firecrawl       |
| github.com    | native           | GitHub pages are simple and work with native fetch  |
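For reference, rows in such a table could be parsed with something like the sketch below; the server's actual parser may differ.

```typescript
interface StrategyRule {
  prefix: string;
  strategy: string;
  notes: string;
}

function parseStrategyTable(markdown: string): StrategyRule[] {
  return markdown
    .split("\n")
    .filter((line) => line.trim().startsWith("|") && !line.includes("---"))
    .slice(1) // drop the header row
    .map((line) => {
      const [prefix, strategy, notes] = line
        .split("|")
        .slice(1, 4)
        .map((cell) => cell.trim());
      return { prefix, strategy, notes };
    });
}
```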
Prefix matching follows these rules (see the sketch below):

- `github.com` matches `github.com`, `www.github.com`, and `subdomain.github.com`
- `reddit.com/r/` matches `reddit.com/r/programming` but not `reddit.com/user/test`
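These matching rules could be implemented roughly as follows (an illustrative sketch, not the server's actual code):

```typescript
function matchesPrefix(url: URL, prefix: string): boolean {
  const [host, ...pathParts] = prefix.split("/");
  // Hostname matches exactly or as a subdomain (www.github.com, api.github.com)
  const hostOk = url.hostname === host || url.hostname.endsWith("." + host);
  if (!hostOk) return false;
  // Path prefixes like "r/" must match the start of the URL path
  return pathParts.length === 0 || url.pathname.startsWith("/" + pathParts.join("/"));
}

matchesPrefix(new URL("https://www.github.com/torvalds"), "github.com");  // true
matchesPrefix(new URL("https://reddit.com/user/test"), "reddit.com/r/"); // false
```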
When scraping a new URL:
The system extracts URL patterns by removing the last path segment, as sketched below:

- `yelp.com/biz/dolly-san-francisco` → `yelp.com/biz/`
- `reddit.com/r/programming/comments/123` → `reddit.com/r/programming/comments/`
- `example.com/blog/2024/article` → `example.com/blog/2024/`
- `stackoverflow.com/questions/123456` → `stackoverflow.com/questions/`
For single-segment URLs or root URLs, only the hostname is saved. Query parameters and fragments are ignored during pattern extraction.
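A sketch of that pattern extraction (illustrative, not the server's actual code):

```typescript
function extractPattern(rawUrl: string): string {
  // URL.pathname excludes query parameters and fragments automatically
  const url = new URL(rawUrl);
  const segments = url.pathname.split("/").filter(Boolean);
  if (segments.length <= 1) {
    // Single-segment or root URLs: save only the hostname
    return url.hostname;
  }
  segments.pop(); // remove the last path segment
  return `${url.hostname}/${segments.join("/")}/`;
}

extractPattern("https://yelp.com/biz/dolly-san-francisco"); // "yelp.com/biz/"
extractPattern("https://example.com/about?utm=x");          // "example.com"
```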
The system uses an abstraction layer for config storage:

- Uses `PULSE_FETCH_STRATEGY_CONFIG_PATH` if set
- Otherwise falls back to the OS temp directory (`/tmp/pulse-fetch/scraping-strategies.md`)

You can swap the storage backend by providing a different `StrategyConfigFactory` when creating the MCP server.
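The exact `StrategyConfigFactory` interface lives in the shared module; as a hypothetical sketch, a custom backend might look like this:

```typescript
// Hypothetical shape; the real interface may differ
interface StrategyConfigStore {
  read(): Promise<string>; // markdown table contents
  write(contents: string): Promise<void>;
}

// An in-memory backend, e.g. for tests
function inMemoryConfigFactory(): StrategyConfigStore {
  let contents = "";
  return {
    read: async () => contents,
    write: async (next) => {
      contents = next;
    },
  };
}
```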
Pulse Fetch stores scraped content as MCP Resources for caching and later retrieval. The storage system supports multiple tiers to preserve content at different processing stages.
Resources are saved in three separate stages: raw, cleaned, and extracted.
When using filesystem storage (`MCP_RESOURCE_STORAGE=filesystem`), files are organized into subdirectories:
```
/tmp/pulse-fetch/resources/
├── raw/
│   └── example.com_article_20250701_123456.md
├── cleaned/
│   └── example.com_article_20250701_123456.md
└── extracted/
    └── example.com_article_20250701_123456.md
```
Each stage shares the same filename for easy correlation. The extracted files include the extraction prompt in their metadata for full traceability.
Memory storage uses a similar structure with URIs like:

- `memory://raw/example.com_article_20250701_123456`
- `memory://cleaned/example.com_article_20250701_123456`
- `memory://extracted/example.com_article_20250701_123456`
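Cached resources can be listed and read back over MCP. A minimal sketch, reusing the connected `client` from the earlier example:

```typescript
// Enumerate everything the server has cached
const { resources } = await client.listResources();

// Read one stage of a scrape back into context
const article = await client.readResource({
  uri: "memory://cleaned/example.com_article_20250701_123456",
});
```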
The extract feature enables intelligent information extraction from scraped web content using LLMs. It serves as an alternative to MCP's native sampling capability for clients that don't support it.
The extract functionality provides two ways to extract information: direct LLM configuration via environment variables, or MCP sampling when the client supports it.
When neither option is available, the tool will work without extraction capabilities, returning raw scraped content only.
When you provide an `extract` parameter with a natural language query, the tool will:
The implementation supports three provider types:
- Anthropic (Native): Direct integration using Anthropic's SDK
- OpenAI: Direct integration with OpenAI's API
- OpenAI-Compatible: Support for any provider with OpenAI-compatible endpoints
Configure the extract feature using the environment variables described in the LLM Configuration section above.
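A rough sketch of how provider selection might work, based on those environment variables (the real client constructors in the shared module may differ):

```typescript
type LlmClient = { extract(content: string, query: string): Promise<string> };

// The real classes live in the shared module; constructor shapes here are assumptions
declare const AnthropicClient: new (apiKey: string, model: string) => LlmClient;
declare const OpenAIClient: new (apiKey: string, model: string) => LlmClient;
declare const OpenAICompatibleClient: new (
  apiKey: string,
  baseUrl: string,
  model?: string
) => LlmClient;

function makeLlmClient(): LlmClient | undefined {
  const { LLM_PROVIDER, LLM_API_KEY, LLM_API_BASE_URL, LLM_MODEL } = process.env;
  if (!LLM_PROVIDER || !LLM_API_KEY) return undefined; // extract stays hidden
  switch (LLM_PROVIDER) {
    case "anthropic":
      return new AnthropicClient(LLM_API_KEY, LLM_MODEL ?? "claude-sonnet-4-20250514");
    case "openai":
      return new OpenAIClient(LLM_API_KEY, LLM_MODEL ?? "gpt-4.1-mini");
    case "openai-compatible":
      return new OpenAICompatibleClient(LLM_API_KEY, LLM_API_BASE_URL ?? "", LLM_MODEL);
    default:
      return undefined;
  }
}
```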
User: "Get the author and publication date from this article: https://example.com/article"
Assistant: I'll extract that information from the article.
[Uses scrape tool with extract: "author name and publication date"]
The article was written by John Doe and published on March 15, 2024.
User: "Extract all product specifications from this page: https://shop.example.com/laptop"
Assistant: I'll extract the detailed specifications from that product page.
[Uses scrape tool with extract: "all technical specifications including processor, RAM, storage, display details, ports, and dimensions"]
Here are the laptop specifications:
- Processor: Intel Core i7-13700H
- RAM: 16GB DDR5
- Storage: 512GB NVMe SSD
...
- `AnthropicClient`: Native Anthropic API integration
- `OpenAIClient`: OpenAI API integration
- `OpenAICompatibleClient`: Flexible client for any OpenAI-compatible endpoint