
Scientific Paper Harvester
MCP server providing real-time access to scientific papers from 6 major academic sources
A comprehensive Model Context Protocol (MCP) server that provides LLMs with real-time access to scientific papers from 6 major academic sources: arXiv, OpenAlex, PMC (PubMed Central), Europe PMC, bioRxiv/medRxiv, and CORE.
npm install
npm run build
To use this server with an MCP client (like Claude Desktop), add the following to your MCP client configuration:
Option 1: Using npx (recommended for AI tools like Claude)
{
  "mcpServers": {
    "scientific-papers": {
      "command": "npx",
      "args": ["-y", "@futurelab-studio/latest-science-mcp@latest"]
    }
  }
}
Option 2: Global installation
npm install -g @futurelab-studio/latest-science-mcp
Then configure:
{
  "mcpServers": {
    "scientific-papers": {
      "command": "latest-science-mcp"
    }
  }
}
# List arXiv categories
node dist/cli.js list-categories --source=arxiv
# List OpenAlex concepts
node dist/cli.js list-categories --source=openalex
# List PMC biomedical categories
node dist/cli.js list-categories --source=pmc
# List Europe PMC life science categories
node dist/cli.js list-categories --source=europepmc
# List bioRxiv/medRxiv categories (includes both servers)
node dist/cli.js list-categories --source=biorxiv
# List CORE academic categories
node dist/cli.js list-categories --source=core

# Get latest AI papers from arXiv
node dist/cli.js fetch-latest --source=arxiv --category=cs.AI --count=10
# Get latest biology papers from bioRxiv
node dist/cli.js fetch-latest --source=biorxiv --category="biorxiv:biology" --count=5
# Get latest immunology papers from PMC
node dist/cli.js fetch-latest --source=pmc --category=immunology --count=3
# Get latest papers from CORE by subject
node dist/cli.js fetch-latest --source=core --category=computer_science --count=5
# Search by concept name (OpenAlex)
node dist/cli.js fetch-latest --source=openalex --category="machine learning" --count=3

# Get top 20 cited papers in machine learning since 2024
node dist/cli.js fetch-top-cited --concept="machine learning" --since=2024-01-01 --count=20
# Get top cited papers by concept ID
node dist/cli.js fetch-top-cited --concept=C41008148 --since=2023-06-01 --count=10

# Search by keywords across all fields
node dist/cli.js search-papers --source=arxiv --query="machine learning" --count=10
# Search by paper title
node dist/cli.js search-papers --source=openalex --query="neural networks" --field=title --count=5
# Search by author name
node dist/cli.js search-papers --source=europepmc --query="John Smith" --field=author --count=10
# Search full-text content sorted by citations
node dist/cli.js search-papers --source=core --query="climate change" --field=fulltext --sortBy=citations --count=20

# Get arXiv paper by ID
node dist/cli.js fetch-content --source=arxiv --id=2401.12345
# Get bioRxiv paper by DOI
node dist/cli.js fetch-content --source=biorxiv --id="10.1101/2021.01.01.425001"
# Get PMC paper by ID
node dist/cli.js fetch-content --source=pmc --id=PMC8245678
# Get CORE paper by ID
node dist/cli.js fetch-content --source=core --id=12345678
# Show text content with preview
node dist/cli.js fetch-content --source=arxiv --id=2401.12345 --show-text --text-preview=500
list_categories
Lists available categories/concepts from any data source.
Parameters:
source: "arxiv" | "openalex" | "pmc" | "europepmc" | "biorxiv" | "core"
Returns:
A list of categories, each with id, name, and an optional description.
Example:
{ "name": "list_categories", "arguments": { "source": "biorxiv" } }
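A client can also build this request programmatically. The sketch below is a minimal illustration, not part of the server's API: the names SOURCES, isSource, and listCategoriesCall are hypothetical, and the only facts taken from this README are the six source values and the shape of the example request.

```typescript
// The six source values documented for list_categories.
const SOURCES = ["arxiv", "openalex", "pmc", "europepmc", "biorxiv", "core"] as const;
type Source = (typeof SOURCES)[number];

// Hypothetical shape of a tool call, mirroring the JSON example above.
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// Type guard: true only for one of the documented source names.
function isSource(value: string): value is Source {
  return (SOURCES as readonly string[]).indexOf(value) !== -1;
}

// Builds a list_categories call, rejecting unknown sources before any request is sent.
function listCategoriesCall(source: string): ToolCall {
  if (!isSource(source)) {
    throw new Error(`Unknown source "${source}"; expected one of: ${SOURCES.join(", ")}`);
  }
  return { name: "list_categories", arguments: { source } };
}

console.log(JSON.stringify(listCategoriesCall("biorxiv")));
// {"name":"list_categories","arguments":{"source":"biorxiv"}}
```

Checking the source on the client side is a design choice in this sketch; the server may apply its own validation as well.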
fetch_latest
Fetches the latest papers from any source for a given category with metadata only (no text extraction).
Parameters:
source: "arxiv" | "openalex" | "pmc" | "europepmc" | "biorxiv" | "core"
category: Category ID or concept name (varies by source)
count: Number of papers to fetch (default: 50, max: 200)
Category Examples by Source:
arXiv: "cs.AI", "physics.gen-ph", "math.CO"
OpenAlex: "artificial intelligence", "machine learning", "C41008148"
PMC: "immunology", "genetics", "neuroscience"
Europe PMC: "biology", "medicine", "cancer"
bioRxiv/medRxiv: "biorxiv:neuroscience", "medrxiv:psychiatry"
CORE: "computer_science", "mathematics", "physics"
Returns:
Paper objects with metadata only (text: "") - use fetch_content for full text
fetch_top_cited
Fetches the top cited papers from OpenAlex for a given concept since a specific date.
Parameters:
concept: Concept name or OpenAlex concept ID
since: Start date in YYYY-MM-DD format
count: Number of papers to fetch (default: 50, max: 200)
search_papers
Searches for papers across multiple academic sources with field-specific search and sorting options.
Parameters:
source: "arxiv" | "openalex" | "europepmc" | "core"
query: Search query string (max 1500 characters)
field: "all" | "title" | "abstract" | "author" | "fulltext" (default: "all")
count: Number of results to return (default: 50, max: 200)
sortBy: "relevance" | "date" | "citations" (default: "relevance")
Search Capabilities by Source:
Example Queries:
Keywords: "machine learning", "climate change"
Exact phrases: "artificial intelligence" (use quotes for exact phrases)
Boolean operators: "deep learning AND neural networks" (arXiv supports this)
Author names: "John Smith", "Smith J"
Returns:
Paper objects with metadata only (text: "") - use fetch_content for full text
fetch_content
Fetches full metadata and text content for a specific paper by ID with complete text extraction.
Parameters:
source: Any of the 6 supported sources
id: Paper ID (format varies by source)
ID Formats by Source:
arXiv: "2401.12345", "cs/0601001", "1234.5678v2"
OpenAlex: "W2741809807" or numeric 2741809807
PMC: "PMC8245678" or "12345678"
Europe PMC: "PMC8245678", "12345678", or DOI
bioRxiv/medRxiv: "10.1101/2021.01.01.425001" or "2021.01.01.425001"
CORE: "12345678"
All tools return paper objects with the following structure:
{
  id: string;                      // Paper ID
  title: string;                   // Paper title
  authors: string[];               // List of author names
  date: string;                    // Publication date (ISO format)
  pdf_url?: string;                // PDF URL (if available)
  text: string;                    // Extracted full text content
  textTruncated?: boolean;         // Warning: text was truncated due to size limits
  textExtractionFailed?: boolean;  // Warning: text extraction failed
}
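The two warning flags and the empty text returned by metadata-only tools mean a client usually has to decide how much of a paper it really received. A minimal sketch of that check follows; the Paper interface repeats the structure above, while the TextStatus type, the textStatus helper, and the sample object are hypothetical and exist only for illustration.

```typescript
// Paper structure as documented above.
interface Paper {
  id: string;
  title: string;
  authors: string[];
  date: string;
  pdf_url?: string;
  text: string;
  textTruncated?: boolean;
  textExtractionFailed?: boolean;
}

type TextStatus = "failed" | "truncated" | "metadata-only" | "complete";

// Classifies a paper by how much text is usable, checking the warning flags first.
function textStatus(paper: Paper): TextStatus {
  if (paper.textExtractionFailed) return "failed";
  if (paper.textTruncated) return "truncated";
  if (paper.text === "") return "metadata-only";
  return "complete";
}

// Made-up sample resembling a fetch_latest result (metadata only, empty text).
const sample: Paper = {
  id: "2401.12345",
  title: "Placeholder title",
  authors: ["A. Author"],
  date: "2024-01-01",
  text: "",
};

console.log(textStatus(sample)); // metadata-only
```

A result classified as "metadata-only" is the case where a follow-up fetch_content call is needed to obtain the full text.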
Each source has specialized text extraction approaches:
arxiv.org/html with ar5iv.labs.arxiv.org fallback
Advanced DOI resolver with multiple fallback strategies:
Respectful API usage with per-source rate limiting:
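One common way to implement per-source rate limiting is to enforce a minimum gap between consecutive requests to the same source. The sketch below shows that idea under stated assumptions: the PerSourceRateLimiter class, its reserve method, and the interval values are hypothetical placeholders, not this server's actual implementation or limits.

```typescript
// Minimum-interval rate limiter keyed by source name.
class PerSourceRateLimiter {
  // Earliest time (ms) at which the next request to each source may start.
  private nextAllowed = new Map<string, number>();
  private intervals: Record<string, number>;
  private now: () => number;

  // `intervals` maps a source to its minimum gap in ms; `now` is injectable for testing.
  constructor(intervals: Record<string, number>, now: () => number = Date.now) {
    this.intervals = intervals;
    this.now = now;
  }

  // Reserves the next slot for `source` and returns how long the caller should wait (ms).
  reserve(source: string): number {
    const interval = this.intervals[source] ?? 0;
    const current = this.now();
    const earliest = Math.max(current, this.nextAllowed.get(source) ?? 0);
    this.nextAllowed.set(source, earliest + interval);
    return earliest - current;
  }
}

// Placeholder intervals and a frozen clock, for illustration only.
let fakeNow = 0;
const limiter = new PerSourceRateLimiter({ arxiv: 3000, core: 1000 }, () => fakeNow);

console.log(limiter.reserve("arxiv")); // 0
console.log(limiter.reserve("arxiv")); // 3000
console.log(limiter.reserve("core")); // 0
```

In real use, the caller would sleep for the returned number of milliseconds before sending the request, so that bursts to one source are spread out without slowing requests to the others.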
For enhanced CORE access, set the CORE_API_KEY environment variable:
export CORE_API_KEY="your-api-key"
# Run all tests
npm test

# Run integration tests
npm run test -- tests/integration

# Run end-to-end workflow tests
npm run test -- tests/e2e

# Run performance benchmarks
npm run test -- tests/integration/performance.test.ts
Source | Papers | Disciplines | Full-Text | Citation Data | Preprints | Search |
---|---|---|---|---|---|---|
arXiv | 2.3M+ | STEM | HTML ✓ | Limited | ✓ | ✓✓✓ |
OpenAlex | 200M+ | All | Variable | ✓✓✓ | ✓ | ✓✓✓ |
PMC | 7M+ | Biomedical | XML/HTML ✓ | Limited | ✗ | Limited |
Europe PMC | 40M+ | Life Sciences | HTML ✓ | Limited | ✓ | ✓✓✓ |
bioRxiv/medRxiv | 500K+ | Bio/Medical | HTML ✓ | Limited | ✓✓✓ | Limited |
CORE | 200M+ | All | PDF/HTML ✓ | Limited | ✓ | ✓✓✓ |
npm run build
# Test specific sources
node dist/cli.js list-categories --source=arxiv
node dist/cli.js fetch-latest --source=biorxiv --category="biorxiv:biology" --count=3
node dist/cli.js fetch-content --source=core --id=12345678

# Test search functionality
node dist/cli.js search-papers --source=arxiv --query="artificial intelligence" --count=5
node dist/cli.js search-papers --source=openalex --query="quantum computing" --field=title --count=3

# Run performance benchmarks
npm run test -- tests/integration/performance.test.ts

# Test memory usage
npm run test -- --reporter=verbose
Comprehensive error handling for all sources:
Set the CORE_API_KEY environment variable for enhanced CORE access.
Use appropriate count parameters (smaller for faster responses).
Use fetch_latest for discovery, fetch_content for detailed reading.
License: MIT
Ready to explore the world's scientific knowledge? Start with any of the 6 sources and discover papers across all academic disciplines! 🔬📚