
Archive Agent
Open-source semantic file tracker with OCR and AI-powered search capabilities
🍀 Collaborators welcome
You are invited to contribute to this open source project!
Feel free to file issues and submit pull requests anytime.
Archive Agent is an open-source semantic file tracker with OCR + AI search.
🤓 Watch me explain this on YouTube
Looking for the CLI command reference? 👉 Run Archive Agent
Looking for the MCP tool reference? 👉 MCP Tools
📷 Screenshot of command-line interface (CLI) using Typer:
📷 Screenshot of graphical user interface (GUI) using Streamlit:
Please install these requirements first:
Archive Agent has been tested with these configurations:
If you're using Archive Agent with another setup, please let me know and I'll add it here!
This should work on any Linux distribution derived from Ubuntu (e.g. Linux Mint).
To install Archive Agent in the current directory of your choice, run this once:
git clone https://github.com/shredEngineer/Archive-Agent
cd Archive-Agent
poetry install
poetry run python -m spacy download xx_sent_ud_sm
sudo apt install -y pandoc python3-tk
chmod +x *.sh
echo "alias archive-agent='$(pwd)/archive-agent.sh'" >> ~/.bashrc && source ~/.bashrc
This will create a global archive-agent command for the current user.
📌 Note: Complete Qdrant server setup before using the archive-agent command.
To update your Archive Agent installation, run this in the installation root:
git pull
poetry install
💡 Good to know: To update the Qdrant docker image, run this:
docker stop archive-agent-qdrant-server
docker pull qdrant/qdrant
./ensure-qdrant.sh
🚨 IMPORTANT: To manage Docker without root, run this once and reboot:
sudo usermod -aG docker $USER
To launch Qdrant with persistent storage and auto-restart, run this once:
./ensure-qdrant.sh
This will download the Qdrant docker image on the first run.
📌 Note: In case you need to stop the Qdrant Docker image, run this:
docker stop archive-agent-qdrant-server
Archive Agent currently supports these file types:
- .txt, .md
- .html, .htm
- .odt, .docx (including images)
- .pdf (including images, see note below)
- .jpg, .jpeg, .png, .gif, .webp, .bmp
📌 Note: There are different OCR strategies supported by Archive Agent:
Strict OCR strategy:
Relaxed OCR strategy:
💡 Good to know: You will be prompted to choose an OCR strategy at startup; see: Run Archive Agent.
Ultimately, Archive Agent decodes everything to text like this:
Using Pandoc for documents, PyMuPDF4LLM for PDFs, Pillow for images.
📌 Note: Unsupported files are tracked but not processed.
Archive Agent processes decoded text like this:
💡 Good to know: This smart chunking improves the accuracy and effectiveness of the retrieval.
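As a rough illustration of the block step only: the settings include a chunk_lines_block key ("Number of lines per block for chunking"). The sketch below assumes this means grouping consecutive lines into fixed-size blocks before they are handed to the chunking model. The function name split_into_blocks is hypothetical and not taken from the Archive Agent code base.

```python
from typing import List


def split_into_blocks(text: str, lines_per_block: int) -> List[List[str]]:
    """Group the lines of `text` into consecutive blocks of at most `lines_per_block` lines."""
    if lines_per_block < 1:
        raise ValueError("lines_per_block must be at least 1")
    lines = text.splitlines()
    # Step through the line list in strides of `lines_per_block`;
    # the last block may be shorter than the others.
    return [
        lines[start:start + lines_per_block]
        for start in range(0, len(lines), lines_per_block)
    ]
```

For example, a five-line text with lines_per_block=2 yields three blocks, the last one holding a single line.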
Archive Agent retrieves chunks related to your question like this:
Archive Agent answers your question using retrieved chunks like this:
The LLM's answer is structured to be multi-faceted, making Archive Agent a helpful assistant.
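To make the retrieval step more concrete, here is a minimal sketch in plain Python, assuming chunks are ranked by cosine similarity to the question embedding, filtered by a minimum score, and capped at a maximum count (compare the qdrant_score_min and qdrant_chunks_max settings). In Archive Agent the similarity search is presumably delegated to Qdrant; the names cosine_similarity and retrieve are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query_vec: Sequence[float],
    chunks: Sequence[Tuple[str, Sequence[float]]],
    score_min: float,
    chunks_max: int,
) -> List[Tuple[str, float]]:
    """Return up to `chunks_max` (text, score) pairs with score >= `score_min`, best first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in chunks]
    kept = [item for item in scored if item[1] >= score_min]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:chunks_max]
```

Raising score_min trades recall for precision, while chunks_max bounds how much context is passed on to the query model.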
Archive Agent uses patterns to select your files:
- For example: /home/user/Documents/*.txt (or ~/Documents/*.txt).
- Use ** to match any files and zero or more directories, subdirectories, and symbolic links to directories.

There are included patterns and excluded patterns:
This approach gives you the best control over the specific files or file types to track.
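The following sketch shows one way such patterns could be resolved with Python's standard glob module. It assumes that a file is selected when it matches at least one included pattern and no excluded pattern; the function name resolve_patterns is hypothetical, and the actual resolution logic in Archive Agent may differ.

```python
import glob
import os
from typing import Iterable, Set


def resolve_patterns(included: Iterable[str], excluded: Iterable[str]) -> Set[str]:
    """Return files matched by any included pattern but by no excluded pattern."""

    def expand(patterns: Iterable[str]) -> Set[str]:
        files: Set[str] = set()
        for pattern in patterns:
            # expanduser() turns "~/Documents" into an absolute path;
            # recursive=True lets "**" match zero or more directories.
            for path in glob.glob(os.path.expanduser(pattern), recursive=True):
                if os.path.isfile(path):
                    files.add(os.path.abspath(path))
        return files

    return expand(included) - expand(excluded)
```

For instance, resolve_patterns(["~/Documents/**/*.txt"], ["~/Documents/private/*"]) would select every .txt file below ~/Documents except those directly inside the private folder.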
Archive Agent lets you choose between different AI providers:
Remote APIs (higher performance and costs, less privacy):
Local APIs (lower performance and costs, best privacy):
💡 Good to know: You will be prompted to choose an AI provider at startup; see: Run Archive Agent.
📌 Note: You can customize the specific models used by the AI provider in the Archive Agent settings. However, you cannot change the AI provider of an existing profile, as the embeddings will be incompatible; to choose a different AI provider, create a new profile instead.
If the OpenAI provider is selected, Archive Agent requires the OpenAI API key.
To export your OpenAI API key, replace sk-... with your actual key and run this once:
echo "export OPENAI_API_KEY='sk-...'" >> ~/.bashrc && source ~/.bashrc
This will persist the export for the current user.
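If you want to verify from Python that the export is visible to child processes, a minimal sketch could look like this. The helper name get_openai_api_key is hypothetical; it only reads the OPENAI_API_KEY environment variable named above and does not contact OpenAI.

```python
import os


def get_openai_api_key() -> str:
    """Return the key exported as OPENAI_API_KEY, or raise a clear error."""
    key = os.environ.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it and reload your shell."
        )
    return key
```

Failing early with an explicit message is usually easier to debug than an authentication error from a later API call.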
💡 Good to know: OpenAI won't use your data for training.
If the Ollama provider is selected, Archive Agent requires Ollama running at http://localhost:11434.
With the default Archive Agent Settings, these Ollama models are expected to be installed:
ollama pull llama3.1:8b            # for chunk/query
ollama pull llava:7b-v1.6          # for vision
ollama pull nomic-embed-text:v1.5  # for embed
💡 Good to know: Ollama also works without a GPU. At least 32 GiB RAM is recommended for smooth performance.
If the LM Studio provider is selected, Archive Agent requires LM Studio running at http://localhost:1234.
With the default Archive Agent Settings, these LM Studio models are expected to be installed:
meta-llama-3.1-8b-instruct            # for chunk/query
llava-v1.5-7b                         # for vision
text-embedding-nomic-embed-text-v1.5  # for embed
💡 Good to know: LM Studio also works without a GPU. At least 32 GiB RAM is recommended for smooth performance.
To show the list of supported commands, run this:
archive-agent
To switch to a new or existing profile, run this:
archive-agent switch "My Other Profile"
📌 Note: Always use quotes for the profile name argument, or skip it to get an interactive prompt.
💡 Good to know: Profiles are useful to manage independent Qdrant collections and Archive Agent settings.
To add one or more included patterns, run this:
archive-agent include "~/Documents/*.txt"
📌 Note: Always use quotes for the pattern argument (to prevent your shell's wildcard expansion), or skip it to get an interactive prompt.
To add one or more excluded patterns, run this:
archive-agent exclude "~/Documents/*.txt"
📌 Note: Always use quotes for the pattern argument (to prevent your shell's wildcard expansion), or skip it to get an interactive prompt.
To remove one or more previously included / excluded patterns, run this:
archive-agent remove "~/Documents/*.txt"
📌 Note: Always use quotes for the pattern argument (to prevent your shell's wildcard expansion), or skip it to get an interactive prompt.
To show the list of included / excluded patterns, run this:
archive-agent patterns
To resolve all patterns and track changes to your files, run this:
archive-agent track
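Conceptually, tracking compares file metadata between runs. The sketch below assumes that a file counts as changed when its size or modification time differs from the previous snapshot, and it reuses the added / changed / removed labels reported by Archive Agent. The function names snapshot and diff_snapshots are hypothetical and not part of the Archive Agent code base.

```python
import os
from typing import Dict, Iterable, Tuple

Meta = Tuple[int, float]  # (size in bytes, modification time)


def snapshot(paths: Iterable[str]) -> Dict[str, Meta]:
    """Record size and mtime for every given file path."""
    result: Dict[str, Meta] = {}
    for path in paths:
        stat = os.stat(path)
        result[path] = (stat.st_size, stat.st_mtime)
    return result


def diff_snapshots(old: Dict[str, Meta], new: Dict[str, Meta]) -> Dict[str, str]:
    """Label each path that differs between two snapshots."""
    changes: Dict[str, str] = {}
    for path, meta in new.items():
        if path not in old:
            changes[path] = "added"
        elif old[path] != meta:
            changes[path] = "changed"
    for path in old:
        if path not in new:
            changes[path] = "removed"
    return changes
```

Comparing size and mtime is cheap but not content-aware: a file rewritten with identical size and timestamp would go unnoticed.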
To show the list of tracked files, run this:
archive-agent list
📌 Note: Don't forget to track your files first.
To show the list of changed files, run this:
archive-agent diff
📌 Note: Don't forget to track your files first.
To sync changes to your files with the Qdrant database, run this:
archive-agent commit
💡 Good to know: Changes are triggered by:
📌 Note: Don't forget to track your files first.
To track and then commit in one go, run this:
archive-agent update
archive-agent search "Which files mention donuts?"
Lists files relevant to the question.
📌 Note: Always use quotes for the question argument, or skip it to get an interactive prompt.
archive-agent query "Which files mention donuts?"
Answers your question using RAG.
📌 Note: Always use quotes for the question argument, or skip it to get an interactive prompt.
To launch the Archive Agent GUI in your browser, run this:
archive-agent gui
📌 Note: Press CTRL+C in the console to close the GUI server.
To start the Archive Agent MCP server, run this:
archive-agent mcp
📌 Note: Press CTRL+C in the console to close the MCP server.
💡 Good to know: Use these MCP configurations to let your IDE or AI extension automate Archive Agent:
- .vscode/mcp.json for GitHub Copilot agent mode (VS Code)
- .roo/mcp.json for Roo Code (VS Code extension)

Archive Agent exposes these tools via MCP:
MCP tool | Equivalent CLI command(s) | Argument(s) | Description |
---|---|---|---|
get_patterns | patterns | None | Get the list of included / excluded patterns. |
get_files_tracked | track and then list | None | Get the list of tracked files. |
get_files_changed | track and then diff | None | Get the list of changed files. |
get_search_result | search | question | Get the list of files relevant to the question. |
get_answer_rag | query | question | Get answer to question using RAG. |
📌 Note: These commands are read-only, preventing the AI from changing your Qdrant database.
💡 Good to know: Just type #get_answer_rag (e.g.) in your IDE or AI extension to call the tool directly.
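Under the hood, an MCP client invokes a tool by sending a JSON-RPC 2.0 request with the method tools/call. The sketch below only builds such a message for one of the tools listed above; it is an illustration, not a full client. The helper name build_tool_call is hypothetical, and the transport and the initialization handshake (normally handled by your IDE or an MCP client library) are left out.

```python
import json
from typing import Any, Dict


def build_tool_call(request_id: int, tool: str, arguments: Dict[str, Any]) -> str:
    """Serialize an MCP 'tools/call' request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# Example: ask the RAG tool a question (argument name taken from the table above).
request = build_tool_call(1, "get_answer_rag", {"question": "Which files mention donuts?"})
```

Tools without arguments, such as get_patterns, would be called with an empty arguments dictionary.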
Archive Agent settings are organized as profile folders in ~/.archive-agent-settings/.
E.g., the default profile is located in ~/.archive-agent-settings/default/.
The currently used profile is stored in ~/.archive-agent-settings/profile.json.
Each profile folder contains these files:
config.json:
Key | Description |
---|---|
config_version | Config version |
ocr_strategy | OCR strategy in DecoderSettings.py |
ai_provider | AI provider in ai_provider_registry.py |
ai_server_url | AI server URL |
ai_model_chunk | AI model used for chunking |
ai_model_embed | AI model used for embedding |
ai_model_query | AI model used for queries |
ai_model_vision | AI model used for vision ("" disables vision) |
ai_vector_size | Vector size of embeddings (used for Qdrant collection) |
ai_temperature_query | Temperature of the query model |
qdrant_server_url | URL of the Qdrant server |
qdrant_collection | Name of the Qdrant collection |
qdrant_score_min | Minimum similarity score of retrieved chunks (0 to 1) |
qdrant_chunks_max | Maximum number of retrieved chunks |
chunk_lines_block | Number of lines per block for chunking |
mcp_server_port | MCP server port (default 8008) |
watchlist.json: Managed via the include / exclude / remove / track / commit / update commands.

📌 Note: To delete a profile, simply delete the folder. This will not delete the Qdrant collection.
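For scripting purposes, a profile's config.json can be inspected with a few lines of Python. This is a minimal sketch that assumes the folder layout described above and that config.json is a flat JSON object with the keys from the table. The helper name load_profile_config is hypothetical, and the file is only read, never written.

```python
import json
from pathlib import Path
from typing import Any, Dict

# Settings root as described above.
SETTINGS_ROOT = Path.home() / ".archive-agent-settings"


def load_profile_config(profile: str = "default", root: Path = SETTINGS_ROOT) -> Dict[str, Any]:
    """Load config.json from the given profile folder."""
    config_path = root / profile / "config.json"
    with config_path.open(encoding="utf-8") as f:
        return json.load(f)
```

For example, load_profile_config()["qdrant_collection"] would return the collection name of the default profile, provided that profile exists.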
The Qdrant database is stored in ~/.archive-agent-qdrant-storage/.
📌 Note: This folder is created by the Qdrant Docker image running as root.
💡 Good to know: Visit your Qdrant dashboard to manage collections and snapshots.
Archive Agent was written from scratch for educational purposes (on either end of the software).
To get started, check out these epic modules:
archive_agent/core/ContextManager.py
archive_agent/config/ConfigManager.py
archive_agent/__main__.py
archive_agent/core/CommitManager.py
archive_agent/util/CliManager.py
archive_agent/core/GuiManager.py
archive_agent/ai/AiManager.py
archive_agent/ai_provider/ai_provider_registry.py
If you miss something or spot bad patterns, feel free to contribute and refactor!
To run unit tests, check types, and check style, run this:
./audit.sh
(Some remaining type errors need to be fixed…)
To enable the PDF image debugger window, run this in your current shell:
export ARCHIVE_AGENT_IMAGE_DEBUGGER=1
📌 Note: PDF image debugger windows must be closed manually in order to proceed.
While track initially reports a file as added, subsequent track calls report it as changed.
Removing and restoring a tracked file in the tracking phase is currently not handled properly:
- Removing a tracked file sets {size=0, mtime=0, diff=removed}.
- Restoring the file sets {size=X, mtime=Y, diff=added}.
- Because size and mtime were cleared, we lost the information to detect a restored file.

Copyright © 2025 Dr.-Ing. Paul Wilhelm <[email protected]>
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
See LICENSE for details.