PDF Reader MCP Server

Empower your AI agents with the ability to securely read and extract information from PDF files using the Model Context Protocol (MCP).

✨ Features

📄 Extract text content from PDF files (full document or specific pages)
🖼️ Extract embedded images from PDF pages as base64-encoded data
📐 Preserve content order - Text and images returned in exact document layout order (NEW v1.2.0)
📊 Get metadata (author, title, creation date, etc.)
🔢 Count pages in PDF documents
🌐 Support for both local files and URLs
🛡️ Secure - Confines file access to project root directory
⚡ Fast - Parallel processing for maximum performance
🔄 Batch processing - Handle multiple PDFs in a single request
📦 Multiple deployment options - npm or Smithery

🆕 Recent Updates (October 2025)

v1.2.0 - Content Ordering (Latest)

✅ Y-Coordinate Based Ordering: Text and images returned in exact document order
✅ Natural Reading Flow: Content parts preserve the layout sequence as it appears in PDF
✅ Intelligent Grouping: Automatically groups text items on the same line
✅ Optimized for AI: Enables AI models to understand content in natural reading order

v1.1.0 - Image Extraction

✅ Image Extraction: Extract embedded images from PDF pages as base64-encoded data
✅ Performance Optimization: Parallel page processing for 5-10x speedup
✅ Deep Refactoring: Modular architecture with 98.9% test coverage (91 tests)

Previous Updates

✅ Fixed critical bugs: Buffer/Uint8Array compatibility for PDF.js v5.x
✅ Fixed schema validation: Resolved exclusiveMinimum issue affecting Windsurf, Mistral API, and other tools
✅ Improved metadata extraction: Robust fallback handling for PDF.js compatibility
✅ Updated dependencies: All packages updated to latest versions
✅ Migrated to Biome: 50x faster linting and formatting with unified tooling

📦 Installation

Option 1: Using Smithery (Easiest)

Install automatically for Claude Desktop:

npx -y @smithery/cli install @sylphxltd/pdf-reader-mcp --client claude

Option 2: Using npm/pnpm (Recommended)

Install the package:

pnpm add @sylphx/pdf-reader-mcp
# or
npm install @sylphx/pdf-reader-mcp

Configure your MCP client (e.g., Claude Desktop, Cursor):

{
  "mcpServers": {
    "pdf-reader-mcp": {
      "command": "npx",
      "args": ["@sylphx/pdf-reader-mcp"]
    }
  }
}

Important: Make sure your MCP client sets its working directory (cwd) to your project root, because all PDF paths are resolved relative to it.

Option 3: Local Development Build

git clone https://github.com/sylphlab/pdf-reader-mcp.git
cd pdf-reader-mcp
pnpm install
pnpm run build

Then configure your MCP client to use node dist/index.js.

🚀 Quick Start

Once configured, your AI agent can read PDFs using the read_pdf tool:

Example 1: Extract text from specific pages

{
  "sources": [
    {
      "path": "documents/report.pdf",
      "pages": [1, 2, 3]
    }
  ],
  "include_metadata": true
}

Example 2: Get metadata and page count only

{
  "sources": [{ "path": "documents/report.pdf" }],
  "include_metadata": true,
  "include_page_count": true,
  "include_full_text": false
}

Example 3: Read from URL

{
  "sources": [
    {
      "url": "https://example.com/document.pdf"
    }
  ],
  "include_full_text": true
}

Example 4: Process multiple PDFs

{
  "sources": [
    { "path": "doc1.pdf", "pages": "1-5" },
    { "path": "doc2.pdf" },
    { "url": "https://example.com/doc3.pdf" }
  ],
  "include_full_text": true
}

Example 5: Extract images from PDF

{
  "sources": [
    {
      "path": "presentation.pdf",
      "pages": [1, 2, 3]
    }
  ],
  "include_images": true,
  "include_full_text": true
}

Response includes:

Text content from each page
Embedded images as base64-encoded data with metadata (width, height, format)
Each image includes page number and index

Note: Image extraction works best with JPEG and PNG images. Large PDFs with many images may produce large responses.
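The request shapes used in the examples above can be summarized as TypeScript types. This is a sketch for readers, not the server's actual schema: the field names come from the examples, while the optionality of each field, the rule that a source carries exactly one of path or url, and the helper name checkReadPdfArgs are assumptions made for illustration.

```typescript
// Field names are taken from the examples above. Optionality and the
// "exactly one of path/url" rule are assumptions, not the server's schema.
type PageSpec = number[] | string;

interface PdfSource {
  path?: string;
  url?: string;
  pages?: PageSpec;
}

interface ReadPdfArgs {
  sources: PdfSource[];
  include_full_text?: boolean;
  include_metadata?: boolean;
  include_page_count?: boolean;
  include_images?: boolean;
}

// Hypothetical helper: returns a list of human-readable problems.
// An empty list means the arguments look well-formed under the assumptions above.
function checkReadPdfArgs(args: ReadPdfArgs): string[] {
  const problems: string[] = [];
  if (args.sources.length === 0) {
    problems.push("sources must not be empty");
  }
  args.sources.forEach((source, i) => {
    const hasPath = typeof source.path === "string";
    const hasUrl = typeof source.url === "string";
    if (hasPath === hasUrl) {
      problems.push(`sources[${i}] needs exactly one of "path" or "url"`);
    }
  });
  return problems;
}
```

Running a check like this before sending a request can catch a missing path or url on the client side instead of waiting for a server error.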

📖 Usage Guide

Page Specification

You can specify pages in multiple ways:

Array of page numbers: [1, 3, 5] (1-based indexing)
Range string: "1-10" (extracts pages 1 through 10)
Multiple ranges: "1-5,10-15,20" (commas separate ranges and individual pages)
Omit for all pages: Don't include the pages field to extract all pages
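To make the accepted forms concrete, here is a minimal sketch of how a page specification could be expanded into a list of 1-based page numbers. The function name parsePages is hypothetical, and the choices to deduplicate, sort, and reject reversed ranges are assumptions; the server's own parser may behave differently.

```typescript
// Hypothetical helper: expands [1, 3, 5] or "1-5,10-15,20" into page numbers.
// Deduplicating and sorting the result is an assumption of this sketch.
function parsePages(spec: number[] | string): number[] {
  const pages = new Set<number>();
  const add = (n: number): void => {
    if (!Number.isInteger(n) || n < 1) {
      throw new Error(`Invalid page number: ${n}`);
    }
    pages.add(n);
  };

  if (Array.isArray(spec)) {
    spec.forEach(add);
  } else {
    for (const part of spec.split(",")) {
      // Matches "20" or "10-15", allowing surrounding whitespace.
      const match = /^\s*(\d+)\s*(?:-\s*(\d+)\s*)?$/.exec(part);
      if (!match) {
        throw new Error(`Invalid page range: "${part}"`);
      }
      const start = Number(match[1]);
      const end = match[2] === undefined ? start : Number(match[2]);
      if (end < start) {
        throw new Error(`Invalid page range: "${part}"`);
      }
      for (let n = start; n <= end; n++) add(n);
    }
  }
  return Array.from(pages).sort((a, b) => a - b);
}
```

For example, parsePages("1-5,10-15,20") yields twelve page numbers: 1 through 5, 10 through 15, and 20.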

Working with Large PDFs

For large PDF files (>20 MB), extract specific pages instead of the full document:

{
  "sources": [
    {
      "path": "large-document.pdf",
      "pages": "1-10"
    }
  ]
}

This helps keep responses within AI model context limits and reduces processing time.
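One way to work through a large document is to request the page count first (include_page_count, as in Example 2) and then read the file in fixed-size batches. The sketch below only generates the range strings; the helper name pageBatches and the batch size are illustrative assumptions.

```typescript
// Hypothetical helper: splits a document of pageCount pages into range
// strings such as "1-10", suitable for the "pages" field of a source.
function pageBatches(pageCount: number, batchSize: number): string[] {
  if (!Number.isInteger(pageCount) || pageCount < 1) {
    throw new Error("pageCount must be a positive integer");
  }
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error("batchSize must be a positive integer");
  }
  const ranges: string[] = [];
  for (let start = 1; start <= pageCount; start += batchSize) {
    const end = Math.min(start + batchSize - 1, pageCount);
    // A batch holding a single page is written as "21" rather than "21-21".
    ranges.push(start === end ? `${start}` : `${start}-${end}`);
  }
  return ranges;
}
```

For a 25-page document and a batch size of 10, this produces "1-10", "11-20", and "21-25", each of which can be sent as a separate request.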

Image Extraction

Extract embedded images from PDF pages as base64-encoded data:

{
  "sources": [{ "path": "document.pdf" }],
  "include_images": true
}

Image data format:

{
  "images": [
    {
      "page": 1,
      "index": 0,
      "width": 800,
      "height": 600,
      "format": "rgb",
      "data": "base64-encoded-image-data..."
    }
  ]
}

Supported formats:

✅ RGB - Standard color images (most common)
✅ RGBA - Images with transparency
✅ Grayscale - Black and white images
✅ Works with JPEG, PNG, and other embedded formats

Important considerations:

🔸 Image extraction increases response size significantly
🔸 Useful for AI models with vision capabilities
🔸 Set include_images: false (default) to extract text only
🔸 Combine with pages parameter to limit extraction scope
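To get a feel for how much images add to a response, the arithmetic can be sketched as follows. This assumes the data field holds raw, uncompressed pixels, which the "rgb" format value suggests but this document does not confirm; the format keys "rgba" and "grayscale" and both function names are likewise assumptions for illustration.

```typescript
// Bytes per pixel for each pixel layout. Only "rgb" appears in the example
// above; the other keys are assumed names for the listed formats.
const CHANNELS = { rgb: 3, rgba: 4, grayscale: 1 } as const;
type PixelFormat = keyof typeof CHANNELS;

// Size of the raw pixel buffer, assuming one byte per channel.
function rawImageBytes(width: number, height: number, format: PixelFormat): number {
  return width * height * CHANNELS[format];
}

// Base64 encodes every 3 bytes as 4 characters, padding the final group.
function base64Length(byteCount: number): number {
  return Math.ceil(byteCount / 3) * 4;
}
```

Under these assumptions, the 800 x 600 RGB image from the example would occupy 1,440,000 bytes raw and about 1,920,000 base64 characters, which is why limiting extraction with the pages parameter matters.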

Content Ordering (NEW in v1.2.0)

Text and images are now returned in exact document order!

The server uses Y-coordinates from PDF.js to preserve the natural reading flow of the document. This means AI models receive content parts in the same sequence as they appear on the page.

Example document layout:

Page 1:
  [Heading text]
  [Image: Chart]
  [Description text]
  [Image: Photo A]
  [Image: Photo B]
  [Conclusion text]

Content parts returned:

[
  { type: "text", text: "Heading text" },
  { type: "image", data: "base64..." },  // Chart
  { type: "text", text: "Description text" },
  { type: "image", data: "base64..." },  // Photo A
  { type: "image", data: "base64..." },  // Photo B
  { type: "text", text: "Conclusion text" }
]

Benefits:

✅ AI understands context between text and images
✅ Natural reading flow preserved
✅ Better comprehension for complex documents
✅ Automatic line grouping for multi-line text blocks

When is ordering applied?

Automatically enabled when include_images: true
Works with both specific pages and full document extraction
Content on each page is independently sorted by Y-position
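The ordering idea can be illustrated with a short sketch: sort the items of one page by Y-coordinate, then merge text items that sit on the same line. This is not the server's implementation. The Positioned type, the orderContent name, and the 2-unit line tolerance are assumptions; the sketch also assumes PDF user-space coordinates, where the origin is at the bottom-left and a larger Y is higher on the page.

```typescript
// Hypothetical input: items of one page with their positions.
type Positioned =
  | { type: "text"; text: string; x: number; y: number }
  | { type: "image"; data: string; x: number; y: number };
type TextItem = Extract<Positioned, { type: "text" }>;

type ContentPart = { type: "text"; text: string } | { type: "image"; data: string };

function orderContent(items: Positioned[], lineTolerance = 2): ContentPart[] {
  // Larger y is higher on the page, so sort descending to read top to bottom.
  const sorted = items.slice().sort((a, b) => b.y - a.y);
  const parts: ContentPart[] = [];
  let line: TextItem[] = [];

  // Emits the text items collected for the current line, left to right.
  const flushLine = (): void => {
    if (line.length === 0) return;
    line.sort((a, b) => a.x - b.x);
    parts.push({ type: "text", text: line.map((t) => t.text).join(" ") });
    line = [];
  };

  for (const item of sorted) {
    if (item.type === "image") {
      flushLine();
      parts.push({ type: "image", data: item.data });
    } else {
      // Start a new line when the vertical gap exceeds the tolerance.
      if (line.length > 0 && Math.abs(line[0].y - item.y) > lineTolerance) {
        flushLine();
      }
      line.push(item);
    }
  }
  flushLine();
  return parts;
}
```

The tolerance exists because text fragments on one visual line often have slightly different baselines; without it, each fragment would become its own part.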

Security: Relative Paths Only

Important: The server only accepts relative paths for security reasons. Absolute paths are blocked to prevent unauthorized file system access.

✅ Good: "path": "documents/report.pdf"
❌ Bad: "path": "/Users/john/documents/report.pdf"

Solution: Configure the cwd (current working directory) in your MCP client settings.
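The kind of check described here can be sketched as a string-only normalization that rejects absolute paths and any path that climbs above the root. This is a simplified illustration under stated assumptions, not the server's validation code: the function name normalizeRelativePath is hypothetical, backslashes are treated as separators to stay conservative, and symbolic links are not considered. Real Node.js code would more typically build on the resolve and relative functions of the built-in path module.

```typescript
// Hypothetical helper: returns a cleaned relative path, or throws if the
// request is absolute or would escape the project root via "..".
function normalizeRelativePath(requested: string): string {
  // Leading slash, leading backslash, or a drive letter such as "C:".
  if (/^([/\\]|[A-Za-z]:)/.test(requested)) {
    throw new Error("Absolute paths are not allowed; use a path relative to the project root");
  }
  const kept: string[] = [];
  for (const segment of requested.split(/[/\\]+/)) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") {
      // Popping past the first segment would leave the project root.
      if (kept.length === 0) {
        throw new Error("Path escapes the project root");
      }
      kept.pop();
    } else {
      kept.push(segment);
    }
  }
  if (kept.length === 0) {
    throw new Error("Path must name a file");
  }
  return kept.join("/");
}
```

With this sketch, "documents/report.pdf" passes unchanged, while "/Users/john/documents/report.pdf" and "../secret.pdf" are both rejected.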

🔧 Troubleshooting

Issue: "No tools" showing up

Solution: Clear the npm cache and run the latest version:

npm cache clean --force
npx @sylphx/pdf-reader-mcp@latest

Restart your MCP client completely after updating.

Issue: "File not found" errors

Causes:

Using absolute paths (not allowed for security)
Incorrect working directory

Solution: Use relative paths and configure cwd in your MCP client:

{
  "mcpServers": {
    "pdf-reader-mcp": {
      "command": "npx",
      "args": ["@sylphx/pdf-reader-mcp"],
      "cwd": "/path/to/your/project"
    }
  }
}

Issue: Cursor/Claude Code compatibility

Solution: Update to the latest version (all recent compatibility issues have been fixed):

npm install @sylphx/pdf-reader-mcp@latest

Then restart your editor completely.

⚡ Performance

Benchmarks on a standard PDF file:

Operation                   Ops/sec   Speed
Handle Non-Existent File    ~12,933   Fastest
Get Full Text               ~5,575
Get Specific Page           ~5,329
Get Multiple Pages          ~5,242
Get Metadata & Page Count   ~4,912    Slowest

Performance varies based on PDF complexity and system resources.
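Throughput figures are easier to compare with other latencies once converted to time per operation. The conversion is simple arithmetic; the function name msPerOp is only illustrative, and the result is a mean that says nothing about variance between runs.

```typescript
// Converts a throughput in operations per second to mean milliseconds per operation.
function msPerOp(opsPerSec: number): number {
  if (!(opsPerSec > 0)) {
    throw new Error("opsPerSec must be positive");
  }
  return 1000 / opsPerSec;
}

// Figures from the benchmark table above.
const fullTextMs = msPerOp(5575);
const metadataMs = msPerOp(4912);
```

Applied to the table, full-text extraction averages roughly 0.18 ms per call and metadata plus page count roughly 0.20 ms.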

See Performance Documentation for details.

🏗️ Architecture

Tech Stack

Runtime: Node.js 22+
PDF Processing: PDF.js (pdfjs-dist)
Validation: Zod with JSON Schema generation
Protocol: Model Context Protocol (MCP) SDK
Build: TypeScript
Testing: Vitest with 100% coverage goal
Code Quality: Biome (linting + formatting)
CI/CD: GitHub Actions

Design Principles

Security First: Strict path validation and sandboxing
Simple Interface: Single tool handles all PDF operations
Structured Output: Predictable JSON format for AI parsing
Performance: Efficient caching and lazy loading
Reliability: Comprehensive error handling and validation

See Design Philosophy for more details.

🧪 Development

Prerequisites

Node.js >= 22.0.0
pnpm (recommended) or npm

Setup

git clone https://github.com/sylphlab/pdf-reader-mcp.git
cd pdf-reader-mcp
pnpm install

Available Scripts

pnpm run build        # Build TypeScript to dist/
pnpm run watch        # Build in watch mode
pnpm run test         # Run tests
pnpm run test:watch   # Run tests in watch mode
pnpm run test:cov     # Run tests with coverage
pnpm run check        # Run Biome (lint + format check)
pnpm run check:fix    # Fix Biome issues automatically
pnpm run lint         # Lint with Biome
pnpm run format       # Format with Biome
pnpm run typecheck    # TypeScript type checking
pnpm run benchmark    # Run performance benchmarks
pnpm run validate     # Full validation (check + test)

Testing

We maintain high test coverage using Vitest:

pnpm run test         # Run all tests
pnpm run test:cov     # Run with coverage report

All tests must pass before merging. Current: 31/31 tests passing ✅

Code Quality

The project uses Biome for fast, unified linting and formatting:

pnpm run check        # Check code quality
pnpm run check:fix    # Auto-fix issues

Contributing

We welcome contributions! Please:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Make your changes and ensure tests pass
Run pnpm run check:fix to format code
Commit using Conventional Commits
Open a Pull Request

See CONTRIBUTING.md for detailed guidelines.

📚 Documentation

Full Documentation - Complete guides and API reference
Getting Started Guide - Quick start guide
API Reference - Detailed API documentation
Design Philosophy - Architecture and design decisions
Performance - Benchmarks and optimization
Comparison - How it compares to alternatives

🗺️ Roadmap

~~Image extraction from PDFs~~ ✅ Completed (v1.1.0)
~~Performance optimizations for parallel processing~~ ✅ Completed (v1.1.0)
Annotation extraction support
OCR integration for scanned PDFs
Streaming support for very large files
Enhanced caching mechanisms
PDF form field extraction

🤝 Support & Community

Issues: GitHub Issues
Discussions: GitHub Discussions
Contributing: CONTRIBUTING.md

If you find this project useful, please:

⭐ Star the repository
👀 Watch for updates
🐛 Report bugs
💡 Suggest features
🔀 Contribute code

📄 License

This project is licensed under the MIT License.

Made with ❤️ by Sylphx