
Key-Value Extractor
Extracts key-value pairs from unstructured text with type safety and multiple output formats.
Version: 0.3.1
This MCP server extracts key-value pairs from arbitrary, noisy, or unstructured text using an LLM (GPT-4.1-mini) and pydantic-ai. It ensures type safety and supports multiple output formats (JSON, YAML, TOML). The server accepts any input and always attempts to structure the data as far as possible; however, perfect extraction is not guaranteed.
While many Large Language Model (LLM) services offer structured output capabilities, this MCP server provides distinct advantages for key-value extraction, especially from challenging real-world text.
- `/extract_json`: Extracts type-safe key-value pairs in JSON format from input text.
- `/extract_yaml`: Extracts type-safe key-value pairs in YAML format from input text.
- `/extract_toml`: Extracts type-safe key-value pairs in TOML format from input text.
Note:
Input Tokens | Input Characters (approx.) | Measured Processing Time (sec) | Model Configuration |
---|---|---|---|
200 | ~400 | ~15 | gpt-4.1-mini |
Actual processing time may vary significantly depending on API response, network conditions, and model load. Even short texts may take 15 seconds or more.
The server has been tested with various inputs, including:
Below is a flowchart representing the processing flow of the key-value extraction pipeline as implemented in `server.py`:

    flowchart TD
        A[Input Text] --> B[Step 0: Preprocessing with spaCy Lang Detect then NER]
        B --> C[Step 1: Key-Value Extraction - LLM]
        C --> D[Step 2: Type Annotation - LLM]
        D --> E[Step 3: Type Evaluation - LLM]
        E --> F[Step 4: Type Normalization - Static Rules + LLM]
        F --> G[Step 5: Final Structuring with Pydantic]
        G --> H[Output in JSON/YAML/TOML]
This server uses spaCy with automatic language detection to extract named entities from the input text before passing it to the LLM. Supported languages are Japanese (`ja_core_news_md`), English (`en_core_web_sm`), and Chinese (Simplified/Traditional, `zh_core_web_sm`).
The language of the input text is automatically detected using `langdetect`.
If the detected language is not Japanese, English, or Chinese, the server returns an error: `Unsupported lang detected`.
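The model selection described above can be sketched as a small lookup. This is a minimal illustration, not the server's actual code: the model names and the error message come from this README, while `SPACY_MODELS` and `select_spacy_model` are hypothetical names. The language code is assumed to arrive in the form `langdetect` reports it (for example `zh-cn` or `zh-tw` for Chinese).

```python
# Model names and the error text are taken from this README;
# SPACY_MODELS and select_spacy_model are hypothetical names.
SPACY_MODELS = {
    "ja": "ja_core_news_md",
    "en": "en_core_web_sm",
    "zh": "zh_core_web_sm",
}


def select_spacy_model(lang_code: str) -> str:
    """Map a detected language code to a spaCy model name."""
    # langdetect reports Chinese as "zh-cn" or "zh-tw", so only the
    # primary subtag before the hyphen is used for the lookup.
    primary = lang_code.lower().split("-")[0]
    try:
        return SPACY_MODELS[primary]
    except KeyError:
        raise ValueError(f"Unsupported lang detected: {lang_code}") from None


# In a real pipeline the code would come from langdetect, e.g.:
#   from langdetect import detect
#   lang_code = detect(input_text)
print(select_spacy_model("zh-tw"))
```

Collapsing `zh-cn` and `zh-tw` to one key mirrors the statement that Simplified and Traditional Chinese share `zh_core_web_sm`.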
The appropriate spaCy model is automatically downloaded and loaded as needed. No manual installation is required.
The extracted phrase list is included in the LLM prompt as follows:
[Preprocessing Candidate Phrases (spaCy NER)] The following is a list of phrases automatically extracted from the input text using spaCy's detected language model. These phrases represent detected entities such as names, dates, organizations, locations, numbers, etc. This list is for reference only and may contain irrelevant or incorrect items. The LLM uses its own judgment and considers the entire input text to flexibly infer the most appropriate key-value pairs.
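As a rough illustration of how such a phrase list might be attached to a prompt, the sketch below builds the section as plain text. The header string is quoted from the prompt above; the function name `build_candidate_section`, the bullet layout, and the de-duplication step are assumptions, not the server's confirmed behavior.

```python
HEADER = "[Preprocessing Candidate Phrases (spaCy NER)]"
NOTE = "This list is for reference only and may contain irrelevant or incorrect items."


def build_candidate_section(phrases: list[str]) -> str:
    """Render NER phrases as a reference-only block for an LLM prompt."""
    seen = set()
    unique = []
    for phrase in phrases:
        phrase = phrase.strip()
        # Skip blanks and repeats while keeping first-seen order.
        if phrase and phrase not in seen:
            seen.add(phrase)
            unique.append(phrase)
    lines = [HEADER, NOTE]
    lines += [f"- {phrase}" for phrase in unique]
    return "\n".join(lines)


print(build_candidate_section(["Tanaka", " Sato ", "Tanaka", ""]))
```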
This project's key-value extraction pipeline consists of multiple steps. Each step's details are as follows:
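- Step 0: Preprocessing with spaCy (language detection, then NER). Uses the spaCy model for the detected language (`ja_core_news_md`, `en_core_web_sm`, `zh_core_web_sm`) to extract named entities.
- Step 1: Key-Value Extraction (LLM). Example: `key: person, value: ["Tanaka", "Sato"]`
- Step 2: Type Annotation (LLM). Example: `key: person, value: ["Tanaka", "Sato"] -> list[str]`
- Step 3: Type Evaluation (LLM)
- Step 4: Type Normalization (static rules + LLM)
- Step 5: Final Structuring with Pydantic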
This pipeline is designed to accommodate future list format support and Pydantic schema extensions.
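To make the "static rules" half of type normalization more concrete, here is a minimal sketch. The specific rules (stripping thousands separators, accepting `yes`/`no` as booleans, splitting comma-separated strings into lists) are assumptions chosen to match the sample output in this README, such as `89,800` becoming `89800`; the server's real rules may differ, and `normalize_value` is a hypothetical name.

```python
import re


def normalize_value(raw, type_name: str):
    """Apply simple static rules for an annotated type (illustrative only)."""
    if type_name == "list[str]":
        # Accept either a real list or a comma-separated string.
        items = raw if isinstance(raw, list) else re.split(r"\s*,\s*", str(raw))
        return [str(item).strip() for item in items if str(item).strip()]

    text = str(raw).strip()
    if type_name == "int":
        # "89,800" -> 89800: drop thousands separators before parsing.
        return int(text.replace(",", ""))
    if type_name == "float":
        return float(text.replace(",", ""))
    if type_name == "bool":
        lowered = text.lower()
        if lowered in {"true", "yes", "1"}:
            return True
        if lowered in {"false", "no", "0"}:
            return False
        raise ValueError(f"Cannot interpret {raw!r} as bool")
    if type_name == "str":
        return text
    raise ValueError(f"Unknown type annotation: {type_name}")


print(normalize_value("89,800", "int"))
print(normalize_value(["Tanaka", "Sato"], "list[str]"))
```

Values that static rules cannot handle would, per the flowchart, fall through to the LLM; that part is not shown here.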
- In TOML output, simple arrays (e.g., `items = ["A", "B"]`) can be represented natively, but arrays of objects (dicts) or deeply nested structures are not represented directly in this server's TOML output.
- Arrays of objects (e.g., `[{"name": "A"}, {"name": "B"}]`) are stored as "JSON strings" in TOML values.

Input:
Thank you for your order (Order Number: ORD-98765). Product: High-Performance Laptop, Price: 89,800 JPY (tax excluded), Delivery: May 15-17. Shipping address: 1-2-3 Shinjuku, Shinjuku-ku, Tokyo, Apartment 101. Phone: 090-1234-5678. Payment: Credit Card (VISA, last 4 digits: 1234). For changes, contact [email protected].
Output (JSON):
    {
      "order_number": "ORD-98765",
      "product_name": "High-Performance Laptop",
      "price": 89800,
      "price_currency": "JPY",
      "tax_excluded": true,
      "delivery_start_date": "20240515",
      "delivery_end_date": "20240517",
      "shipping_address": "1-2-3 Shinjuku, Shinjuku-ku, Tokyo, Apartment 101",
      "phone_number": "090-1234-5678",
      "payment_method": "Credit Card",
      "card_type": "VISA",
      "card_last4": "1234",
      "customer_support_email": "[email protected]"
    }
Output (YAML):
    order_number: ORD-98765
    product_name: High-Performance Laptop
    price: 89800
    price_currency: JPY
    tax_excluded: true
    delivery_start_date: '20240515'
    delivery_end_date: '20240517'
    shipping_address: 1-2-3 Shinjuku, Shinjuku-ku, Tokyo, Apartment 101
    phone_number: 090-1234-5678
    payment_method: Credit Card
    card_type: VISA
    card_last4: '1234'
    customer_support_email: [email protected]
Output (TOML, simple case):
    order_number = "ORD-98765"
    product_name = "High-Performance Laptop"
    price = 89800
    price_currency = "JPY"
    tax_excluded = true
    delivery_start_date = "20240515"
    delivery_end_date = "20240517"
    shipping_address = "1-2-3 Shinjuku, Shinjuku-ku, Tokyo, Apartment 101"
    phone_number = "090-1234-5678"
    payment_method = "Credit Card"
    card_type = "VISA"
    card_last4 = "1234"
Output (TOML, complex case):
    items = '[{"name": "A", "qty": 2}, {"name": "B", "qty": 5}]'
    addresses = '[{"city": "Tokyo", "zip": "160-0022"}, {"city": "Osaka", "zip": "530-0001"}]'
Note: Arrays of objects or nested structures are stored as JSON strings in TOML.
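The JSON-string fallback shown in the complex case can be sketched with the standard library alone. This is an illustrative writer, not the server's serializer: `to_toml` and its helpers are hypothetical names, keys are assumed to be valid bare TOML keys, and JSON string escaping is used as an approximation of TOML basic-string escaping.

```python
import json


def _is_scalar(value) -> bool:
    return isinstance(value, (str, int, float, bool))


def _toml_scalar(value) -> str:
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    # Assumption: JSON escaping is close enough to TOML basic strings here.
    return json.dumps(value, ensure_ascii=False)


def _toml_json_string(value) -> str:
    text = json.dumps(value, ensure_ascii=False)
    # Prefer a TOML literal string (single quotes, no escape processing);
    # fall back to a basic string when the JSON itself contains a quote mark.
    if "'" not in text:
        return f"'{text}'"
    return json.dumps(text, ensure_ascii=False)


def to_toml(data: dict) -> str:
    """Write flat key-value pairs; nested values become JSON strings."""
    lines = []
    for key, value in data.items():
        if _is_scalar(value):
            rendered = _toml_scalar(value)
        elif isinstance(value, list) and all(_is_scalar(i) for i in value):
            rendered = "[" + ", ".join(_toml_scalar(i) for i in value) + "]"
        else:
            # Arrays of objects and nested dicts are stored as JSON strings.
            rendered = _toml_json_string(value)
        lines.append(f"{key} = {rendered}")
    return "\n".join(lines)


print(to_toml({
    "tags": ["A", "B"],
    "items": [{"name": "A", "qty": 2}, {"name": "B", "qty": 5}],
}))
```

A consumer reading such a file would call `json.loads` on the `items` value to recover the original list of objects.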
extract_json
- Arguments: `input_text` (string): Input string containing noisy or unstructured data.
- Returns: `{ "success": True, "result": ... }` or `{ "success": False, "error": ... }`
- Example: `{ "success": true, "result": { "foo": 1, "bar": "baz" } }`

extract_yaml
- Arguments: `input_text` (string): Input string containing noisy or unstructured data.
- Returns: `{ "success": True, "result": ... }` or `{ "success": False, "error": ... }`
- Example: `{ "success": true, "result": "foo: 1\nbar: baz" }`

extract_toml
- Arguments: `input_text` (string): Input string containing noisy or unstructured data.
- Returns: `{ "success": True, "result": ... }` or `{ "success": False, "error": ... }`
- Example: `{ "success": true, "result": "foo = 1\nbar = \"baz\"" }`
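All three tools share the same response envelope, so a caller can handle them with one helper. The sketch below assumes the response has already been decoded into a Python dict; `unwrap_result` and `ExtractionError` are hypothetical names, and how the response is obtained from an MCP client is out of scope here.

```python
import json


class ExtractionError(Exception):
    """Raised when a tool response reports success = false."""


def unwrap_result(response: dict):
    """Return the `result` field, or raise with the reported `error`."""
    if response.get("success"):
        return response["result"]
    raise ExtractionError(response.get("error", "unknown error"))


# Example envelope in the shape documented for extract_json.
payload = json.loads('{ "success": true, "result": { "foo": 1, "bar": "baz" } }')
print(unwrap_result(payload))
```

For `extract_yaml` and `extract_toml`, the returned `result` is a string that still needs to be parsed by a YAML or TOML library if structured access is required.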
To install kv-extractor-mcp-server for Claude Desktop automatically via Smithery:
npx -y @smithery/cli install @KunihiroS/kv-extractor-mcp-server --client claude
- OpenAI API key (set as `OPENAI_API_KEY` in `settings.json` under `env`)

To run the server manually:

    python server.py
When running this MCP Server, you must explicitly specify the log output mode and (if enabled) the absolute log file path via command-line arguments.
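The argument rule above can be sketched with `argparse`. This is an assumption about how such validation might look, not the server's actual startup code: only the `--log` and `--logfile` flags and the absolute-path requirement come from this README, and `parse_log_args` is a hypothetical name.

```python
import argparse
import os


def parse_log_args(argv):
    """Validate --log / --logfile; exits with an error message if invalid."""
    parser = argparse.ArgumentParser(prog="kv-extractor-mcp-server")
    parser.add_argument("--log", choices=["on", "off"], required=True)
    parser.add_argument("--logfile", default=None)
    args = parser.parse_args(argv)

    if args.log == "on":
        # Logging requires an explicit absolute path; relative paths
        # and a missing --logfile are both rejected.
        if not args.logfile:
            parser.error("--logfile is required when --log=on")
        if not os.path.isabs(args.logfile):
            parser.error("--logfile must be an absolute path")
    return args


args = parse_log_args(["--log=on", "--logfile=/workspace/logs/kv-extractor-mcp-server.log"])
print(args.log, args.logfile)
```

`parser.error` prints a message and exits with a non-zero status, which matches the documented behavior that the server does not start when the arguments are missing or invalid.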
- `--log=off`: Disable all logging (no logs are written)
- `--log=on --logfile=/absolute/path/to/logfile.log`: Enable logging and write logs to the specified absolute file path

Example (logging disabled):

    "kv-extractor-mcp-server": {
      "command": "pipx",
      "args": ["run", "kv-extractor-mcp-server", "--log=off"],
      "env": {
        "OPENAI_API_KEY": "{apikey}"
      }
    }

Example (logging enabled):

    "kv-extractor-mcp-server": {
      "command": "pipx",
      "args": ["run", "kv-extractor-mcp-server", "--log=on", "--logfile=/workspace/logs/kv-extractor-mcp-server.log"],
      "env": {
        "OPENAI_API_KEY": "{apikey}"
      }
    }
Note:
- When logging is enabled, logs are written only to the specified absolute file path. Relative paths or omission of `--logfile` will cause an error.
- When logging is disabled, no logs are output.
- If the required arguments are missing or invalid, the server will not start and will print an error message.
- The log file must be accessible and writable by the MCP Server process.
- If you have trouble running this server, a cached older version of kv-extractor-mcp-server may be the cause. Try pinning the latest version explicitly (set `x.y.z` to the latest version) with the setting below.

    "kv-extractor-mcp-server": {
      "command": "pipx",
      "args": ["run", "kv-extractor-mcp-server==x.y.z", "--log=off"],
      "env": {
        "OPENAI_API_KEY": "{apikey}"
      }
    }
GPL-3.0-or-later
KunihiroS (and contributors)