
VisionAgent
MCP server providing computer vision and document analysis tools via VisionAgent APIs
Beta – v0.1
This project is early access and subject to breaking changes until v1.0.
Modern LLM “agents” call external tools through the Model Context Protocol (MCP). VisionAgent MCP is a lightweight, side-car MCP server that runs locally on STDIN/STDOUT, translating each tool call from an MCP-compatible client (Claude Desktop, Cursor, Cline, etc.) into an authenticated HTTPS request to Landing AI’s VisionAgent REST APIs. The response JSON, plus any images or masks, is streamed back to the model so that you can issue natural-language computer-vision and document-analysis commands from your editor without writing custom REST code or loading an extra SDK.
https://github.com/user-attachments/assets/2017fa01-0e7f-411c-a417-9f79562627b7
| Capability | Description |
|---|---|
| agentic-document-analysis | Parse PDFs / images to extract text, tables, charts, and diagrams, taking layouts and other visual cues into account. Web Version here. |
| text-to-object-detection | Detect free-form prompts (“all traffic lights”) using OWLv2 / CountGD / Florence-2 / Agentic Object Detection (Web Version here); outputs bounding boxes. |
| text-to-instance-segmentation | Pixel-perfect masks via Florence-2 + Segment-Anything-v2 (SAM-2). |
| activity-recognition | Recognise multiple activities in video with start/end timestamps. |
| depth-pro | High-resolution monocular depth estimation for single images. |
Run npm run generate-tools whenever VisionAgent releases new endpoints. The script fetches the latest OpenAPI spec and regenerates the local tool map automatically.
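Conceptually, the regeneration step is a small transform over the OpenAPI document. The sketch below is illustrative only; the function name and the exact tool-definition shape are assumptions, not the repo's real code:

```typescript
// Hypothetical sketch of the generate-tools step: walk the OpenAPI
// spec and turn each operation into an MCP-style tool definition.
interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: object;
}

function specToTools(spec: {
  paths?: Record<string, Record<string, any>>;
}): ToolDefinition[] {
  const tools: ToolDefinition[] = [];
  for (const [path, methods] of Object.entries(spec.paths ?? {})) {
    for (const [method, op] of Object.entries(methods)) {
      tools.push({
        // operationId becomes the tool name; fall back to "method path"
        name: op.operationId ?? `${method} ${path}`,
        description: op.summary ?? "",
        inputSchema:
          op.requestBody?.content?.["application/json"]?.schema ?? {},
      });
    }
  }
  return tools;
}
```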
If you do not have a VisionAgent API key, create an account and obtain your API key.
# 1 Install
npm install -g vision-tools-mcp

# 2 Configure your MCP client with the following settings:
{
  "mcpServers": {
    "VisionAgent": {
      "command": "npx",
      "args": ["vision-tools-mcp"],
      "env": {
        "VISION_AGENT_API_KEY": "<YOUR_API_KEY>",
        "OUTPUT_DIRECTORY": "/path/to/output/directory",
        "IMAGE_DISPLAY_ENABLED": "true"
      }
    }
  }
}

(Set IMAGE_DISPLAY_ENABLED to "false" if your client cannot display images; see below.)
Detect all traffic lights in /path/to/mcp/vision-agent-mcp/assets/street.png
If your client supports inline resources, you’ll see bounding-box overlays; otherwise, the PNG is saved to your output directory, and the chat shows its path.
| Software | Minimum Version |
|---|---|
| Node.js | 20 (LTS) |
| VisionAgent account | Any paid or free tier (needs API key) |
| MCP client | Claude Desktop / Cursor / Cline / etc. |
| ENV var | Required | Default | Purpose |
|---|---|---|---|
| VISION_AGENT_API_KEY | Yes | — | Landing AI auth token. |
| OUTPUT_DIRECTORY | No | — | Where rendered images / masks / depth maps are stored. |
| IMAGE_DISPLAY_ENABLED | No | true | Set to false to skip rendering. |
Example .mcp.json (for VS Code / Cursor):

{
  "mcpServers": {
    "VisionAgent": {
      "command": "npx",
      "args": ["vision-tools-mcp"],
      "env": {
        "VISION_AGENT_API_KEY": "912jkefief09jfjkMfoklwOWdp9293jefklwfweLQWO9jfjkMfoklwDK",
        "OUTPUT_DIRECTORY": "/Users/me/documents/mcp/test",
        "IMAGE_DISPLAY_ENABLED": "false"
      }
    }
  }
}
For MCP clients without image display capabilities, like Cursor, set IMAGE_DISPLAY_ENABLED to false. For MCP clients with image display capabilities, like Claude Desktop, set IMAGE_DISPLAY_ENABLED to true to visualize tool outputs. Generally, MCP clients that support resources (see this list: https://modelcontextprotocol.io/clients) will support image display.
| Scenario | Prompt (after uploading file) |
|---|---|
| Invoice extraction | “Extract vendor, invoice date & total from this PDF using agentic-document-analysis.” |
| Pedestrian recognition | “Locate every pedestrian in street.jpg via text-to-object-detection.” |
| Agricultural segmentation | “Segment all tomatoes in kitchen.png with text-to-instance-segmentation.” |
| Activity recognition (video) | “Identify activities occurring in match.mp4 via activity-recognition.” |
| Depth estimation | “Produce a depth map for selfie.png using depth-pro.” |
Request flow:

1. A human prompt arrives in an MCP-capable client (Cursor, Claude).
2. The client sends a JSON tool call to VisionAgent MCP (this repo).
3. The MCP server issues an HTTPS request to the Landing AI VisionAgent Cloud APIs.
4. The APIs return a JSON / media blob.
5. The server writes previews to local disk and returns the path / data.
6. The rendered PNG / JSON goes back to the client.
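On the wire, the tool call between client and server is an MCP JSON-RPC message over stdio. A hedged example of a request (the tools/call method name comes from the MCP specification; the tool arguments shown are illustrative, not the exact API contract):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "text-to-object-detection",
    "arguments": {
      "prompt": "all traffic lights",
      "imagePath": "/path/to/street.png"
    }
  }
}
```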
Here’s how to dive into the code, add new endpoints, or troubleshoot issues.
Clone the repository:
git clone https://github.com/landing-ai/vision-agent-mcp.git
Navigate into the project directory:
cd vision-agent-mcp
Install dependencies:
npm install
Build the project:
npm run build
VISION_AGENT_API_KEY: required API key for VisionAgent authentication.
OUTPUT_DIRECTORY: optional directory for saving processed outputs (supports relative and absolute paths).
IMAGE_DISPLAY_ENABLED: set to "true" to enable image visualization features.

After building, configure your MCP client with the following settings:
{
  "mcpServers": {
    "VisionAgent": {
      "command": "node",
      "args": ["/path/to/build/index.js"],
      "env": {
        "VISION_AGENT_API_KEY": "<YOUR_API_KEY>",
        "OUTPUT_DIRECTORY": "../../output",
        "IMAGE_DISPLAY_ENABLED": "true"
      }
    }
  }
}
Note: Replace /path/to/build/index.js with the actual path to your built index.js file, and set your environment variables as needed. The IMAGE_DISPLAY_ENABLED guidance above applies here as well: false for clients without image display (e.g., Cursor), true for clients that can render images (e.g., Claude Desktop).
| Script | Purpose |
|---|---|
| npm run build | Compile TypeScript → build/ (adds executable bit). |
| npm run start | Build and run (node build/index.js). |
| npm run typecheck | Type-only check (tsc --noEmit). |
| npm run generate-tools | Fetch latest OpenAPI and regenerate toolDefinitionMap.ts. |
| npm run build:all | Convenience: npm run build + npm run generate-tools. |
Pro Tip: If you modify any files under src/ or want to pick up new endpoints from VisionAgent, run npm run build:all to recompile and regenerate tool definitions.
vision-agent-mcp/
├── .eslintrc.json          # ESLint config (optional)
├── .gitignore              # Ignore node_modules, build/, .env, etc.
├── jest.config.js          # Placeholder for future unit tests
├── mcp-va.md               # Draft docs (incomplete)
├── package.json            # npm metadata, scripts, dependencies
├── package-lock.json       # Lockfile
├── tsconfig.json           # TypeScript compiler config
├── .env                    # Your environment variables (not committed)
│
├── src/                    # TypeScript source code
│   ├── generateTools.ts    # Dev script: fetch OpenAPI → generate MCP tool definitions (Zod schemas)
│   ├── index.ts            # Entry point: load .env, start MCP server, handle signals
│   ├── toolDefinitionMap.ts # Auto-generated MCP tool definitions (don’t edit by hand)
│   ├── toolUtils.ts        # Helpers to build MCP tool objects (metadata, descriptions)
│   ├── types.ts            # Core TS interfaces (MCP, environment config, etc.)
│   │
│   ├── server/             # MCP server logic
│   │   ├── index.ts        # Create & start the MCP server (Server + Stdio transport)
│   │   ├── handlers.ts     # `handleListTools` & `handleCallTool` implementations
│   │   ├── visualization.ts # Post-process & save image/video outputs (masks, boxes, depth maps)
│   │   └── config.ts       # Load & validate .env, export SERVER_CONFIG & EnvConfig
│   │
│   ├── utils/              # Generic utilities
│   │   ├── file.ts         # File handling (base64 encode images/PDFs, read streams)
│   │   └── http.ts         # Axios wrappers & error formatting
│   │
│   └── validation/         # Zod schema generation & argument validation
│       └── schema.ts       # Convert JSON Schema → Zod, validate incoming tool args
│
├── build/                  # Compiled JavaScript (generated after `npm run build`)
│   ├── index.js
│   ├── generateTools.js
│   ├── toolDefinitionMap.js
│   └── …                   # Mirror of `src/` structure
│
├── output/                 # Runtime artifacts (bounding boxes, masks, depth maps, etc.)
│
└── assets/                 # Static assets (e.g., demo.gif)
    └── demo.gif
src/generateTools.ts
- Fetches https://api.va.landing.ai/openapi.json (VisionAgent’s public OpenAPI spec).
- Regenerates toolDefinitionMap.ts with a Map<string, McpToolDefinition>.
- Run it via npm run generate-tools.

src/toolDefinitionMap.ts
- Auto-generated MCP tool definitions; do not edit by hand.
src/server/handlers.ts
- Implements handleListTools: returns [ { name, description, inputSchema } ].
- Implements handleCallTool:
  - Validates arguments with Zod.
  - For file-path parameters (imagePath, pdfPath), reads & base64-encodes the files via src/utils/file.ts.
  - If IMAGE_DISPLAY_ENABLED=true, calls src/server/visualization.ts to save PNGs/JSON.

src/server/visualization.ts
- Saves rendered outputs (bounding boxes, masks, depth maps) to OUTPUT_DIRECTORY
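The handleCallTool flow described above can be sketched dependency-free as follows. The real handler validates with Zod; this stand-in only checks required parameters, and the result shape is a simplified assumption:

```typescript
// Dependency-free sketch of a handleCallTool-style flow: validate the
// arguments, then dispatch. Names and shapes here are illustrative.
interface ToolResult {
  isError?: boolean;
  content: { type: "text"; text: string }[];
}

function handleCallToolSketch(
  name: string,
  args: Record<string, unknown>,
  required: string[],
): ToolResult {
  for (const param of required) {
    if (args[param] === undefined) {
      return {
        isError: true,
        content: [
          {
            type: "text",
            text: `Validation error: missing required parameter '${param}'`,
          },
        ],
      };
    }
  }
  // ...the real handler would base64-encode file paths and call the API here...
  return { content: [{ type: "text", text: `called ${name}` }] };
}
```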
.

src/utils/file.ts
- readFileAsBase64(path: string): Promise<string>: reads any binary (image, PDF, video) and returns base64.
- loadFileStream(path: string): returns a Node.js stream for large file uploads.

src/utils/http.ts
- Uses the base URL https://api.va.landing.ai.
- Adds the Authorization: Bearer ${VISION_AGENT_API_KEY} header.

src/validation/schema.ts
- buildZodSchema(jsonSchema: any): ZodObject, used by generateTools.ts.
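The real buildZodSchema converts JSON Schema into a Zod object. As a dependency-free illustration of the same idea, this sketch turns a JSON Schema fragment into a plain validation function (an assumption for teaching purposes, not the repo's code):

```typescript
// Sketch: compile a JSON Schema fragment into a validator that
// returns a list of error messages (empty list = valid).
type Validator = (value: Record<string, unknown>) => string[];

function buildValidator(schema: {
  required?: string[];
  properties?: Record<string, { type?: string }>;
}): Validator {
  return (value) => {
    const errors: string[] = [];
    for (const key of schema.required ?? []) {
      if (value[key] === undefined) {
        errors.push(`missing required parameter '${key}'`);
      }
    }
    for (const [key, prop] of Object.entries(schema.properties ?? {})) {
      if (
        value[key] !== undefined &&
        prop.type === "string" &&
        typeof value[key] !== "string"
      ) {
        errors.push(`'${key}' must be a string`);
      }
    }
    return errors;
  };
}
```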

src/index.ts
- Loads dotenv (reads .env).
- Validates required env vars (VISION_AGENT_API_KEY).
- Imports the generated toolDefinitionMap.
- Creates an MCP Server (from @modelcontextprotocol/sdk/server) with StdioServerTransport.
- Wires ListTools → handleListTools and CallTool → handleCallTool.
- Logs startup info: vision-tools-api MCP Server (v0.1.0) running on stdio, proxying to https://api.va.landing.ai
- Listens for SIGINT/SIGTERM to gracefully shut down.
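The env-validation step at startup can be sketched like this; the helper name and config shape are assumptions for illustration, not the repo's actual config.ts exports:

```typescript
// Sketch of startup env validation: fail fast when the API key is
// missing, and apply the documented default for IMAGE_DISPLAY_ENABLED.
interface EnvConfig {
  apiKey: string;
  outputDir?: string;
  imageDisplay: boolean;
}

function validateEnv(env: Record<string, string | undefined>): EnvConfig {
  const apiKey = env.VISION_AGENT_API_KEY;
  if (!apiKey) {
    throw new Error("VISION_AGENT_API_KEY is required");
  }
  return {
    apiKey,
    outputDir: env.OUTPUT_DIRECTORY,
    // IMAGE_DISPLAY_ENABLED defaults to true; only "false" disables it
    imageDisplay: env.IMAGE_DISPLAY_ENABLED !== "false",
  };
}
```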
Validation Errors If you send invalid or missing parameters, the server returns:
{ "id": 3, "error": { "code": -32602, "message": "Validation error: missing required parameter 'imagePath'" } }
Network Errors Axios errors (timeouts, 5xx) are caught and returned as:
{ "id": 4, "error": { "code": -32000, "message": "VisionAgent API error: 502 Bad Gateway" } }
Internal Exceptions Uncaught exceptions in handlers produce:
{ "id": 5, "error": { "code": -32603, "message": "Internal error: Unexpected token in JSON at position 345" } }
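The error payloads above share one shape, so they can be assembled by a single helper. A minimal sketch (the helper name is illustrative):

```typescript
// Build a JSON-RPC 2.0 style error payload. Standard codes:
// -32602 invalid params, -32603 internal error; -32000 falls in the
// implementation-defined server-error range.
interface JsonRpcError {
  id: number;
  error: { code: number; message: string };
}

function buildError(id: number, code: number, message: string): JsonRpcError {
  return { id, error: { code, message } };
}
```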
- Check that VISION_AGENT_API_KEY is correct and active.
- Check that api.va.landing.ai isn’t blocked by a proxy/VPN.
- The local tool map may be stale. Run:
npm run generate-tools
npm start
The code uses the Blob & FormData APIs that are available natively in Node 20. Upgrade via nvm install 20 (macOS/Linux) or download from nodejs.org on Windows.
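Because Blob and FormData are globals in Node 20, file uploads need no extra dependency. A minimal sketch of building a multipart body (the field names are illustrative, not the exact API contract):

```typescript
// Assemble a multipart form with Node's built-in Blob and FormData.
const form = new FormData();
form.append(
  "image",
  new Blob([new Uint8Array([137, 80, 78, 71])]), // stand-in PNG bytes
  "street.png",
);
form.append("prompt", "all traffic lights");
// The form can then be posted with the built-in fetch, e.g.:
// await fetch(url, { method: "POST", body: form,
//   headers: { Authorization: `Bearer ${apiKey}` } });
```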
For other issues, refer to the MCP documentation: https://modelcontextprotocol.io/quickstart/user
Also note that specific clients have their own helpful documentation. For example, if you are using the OpenAI Agents SDK, refer to their documentation here: https://openai.github.io/openai-agents-python/mcp/
We love PRs!
- Create a branch: git checkout -b feature/my-feature.
- Run npm run typecheck (no errors).
- Keep your OUTPUT_DIRECTORY only on your machine (do not commit generated artifacts).

Made with ❤️ by the LandingAI Team.