Overview
A desktop application — Rust backend, Tauri 2 shell, Svelte 5 frontend — that parses documents locally and extracts text plus images with their surrounding context preserved. SQLite database for fast search, watch folder for automation, cross-platform (Linux and Windows).
The point is not parsing. The point is locality. Cloud OCR and document AI services exist; this exists for teams whose documents legally cannot leave the host.
Who it's for
The target users are organisations where document confidentiality is a regulatory or contractual requirement, not a preference:
- Law firms — contract review, pleadings analysis, opinion drafts. Attorney-client privilege precludes cloud AI.
- Medical research and clinical labs — patient records, study protocols, trial data. GDPR and patient consent don't extend to overseas cloud LLMs.
- Patent attorneys — pre-filing applications. Even an embedding leaked to a model's training set is enough to lose novelty.
- Engineering and R&D teams — unpatented IP, pre-publication research, internal designs.
- Public-sector operators — classified, restricted, or sovereignty-controlled material.
- Corporations with M&A pipelines or strategic IP — board reports, due diligence materials, financial models.
Features
- Multi-format parsing — PDF, DOCX, DOC, TXT, RTF.
- Image extraction with surrounding context — every image carries 200 chars of preceding and following text, a position marker, optional OCR, and an optional AI description. Charts and diagrams stay anchored to the prose that explains them.
- Document type classification — automatic detection of document types (configurable; ships with example classifiers for Polish legal documents: umowa, pozew, ustawa — contract, lawsuit, statute).
- Watch folder — drop files into a monitored directory; they are parsed and indexed automatically.
- SQLite database — fast cross-document search without a server. Single file, easy backup, no extra service.
- Modern UI — dark theme, drag-and-drop, responsive. Built with Svelte 5; reactive, small bundle.
- Cross-platform binary — Linux and Windows. Tauri keeps the package small (~10–20 MB) and the runtime fast.
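The 200-characters-of-context rule from the image-extraction feature above can be sketched as a small Rust helper. The function name and the byte-offset interface are assumptions for illustration, not the project's actual API; the only detail taken from the source is the fixed-size window on each side of the image position:

```rust
/// Return up to `window` bytes of text on each side of `offset`,
/// snapped to UTF-8 character boundaries so slicing never panics.
/// (Hypothetical helper; the real extractor's interface is not published.)
fn surrounding_context(text: &str, offset: usize, window: usize) -> (String, String) {
    // Snap the anchor itself down to a valid boundary first.
    let mut mid = offset.min(text.len());
    while !text.is_char_boundary(mid) {
        mid -= 1;
    }
    let mut start = mid.saturating_sub(window);
    while !text.is_char_boundary(start) {
        start -= 1;
    }
    let mut end = (mid + window).min(text.len());
    while !text.is_char_boundary(end) {
        end += 1;
    }
    (text[start..mid].to_string(), text[mid..end].to_string())
}
```

Near the start or end of a document the window simply shrinks, which matches the "surrounding context" behaviour described above.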
Output structure
Each processed document creates a self-contained directory:
processed/<document-id>/
├── document.md # Human-readable markdown
├── document.json # Structured data for AI ingestion
├── images/
│ ├── img_001.png # Extracted images
│ ├── img_001.json # Image metadata + surrounding context
│ └── thumb_001.png # Thumbnail
└── original.pdf # Original file copy (audit trail)
Two parallel formats: document.md for humans and
document.json for AI ingestion. Images live alongside,
each with its own metadata sidecar describing the textual context it
was extracted from. The original file is preserved for audit.
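The sidecar's shape can be illustrated with a Rust struct. Every field name here is an assumption inferred from the description above (the schema is explicitly unstable until v1.0), and a real build would use serde rather than this hand-rolled serialization:

```rust
/// Sketch of an img_NNN.json sidecar. Field names are assumptions,
/// not the project's locked schema.
struct ImageMetadata {
    image: String,                  // e.g. "img_001.png"
    page: u32,                      // position marker
    context_before: String,         // ~200 chars of preceding text
    context_after: String,          // ~200 chars of following text
    ocr_text: Option<String>,       // optional OCR
    ai_description: Option<String>, // optional AI description
}

impl ImageMetadata {
    /// Minimal JSON output using only std, for illustration.
    fn to_json(&self) -> String {
        fn esc(s: &str) -> String {
            s.replace('\\', "\\\\").replace('"', "\\\"")
        }
        fn opt(o: &Option<String>) -> String {
            match o {
                Some(s) => format!("\"{}\"", esc(s)),
                None => "null".to_string(),
            }
        }
        format!(
            "{{\"image\":\"{}\",\"page\":{},\"context_before\":\"{}\",\"context_after\":\"{}\",\"ocr_text\":{},\"ai_description\":{}}}",
            esc(&self.image),
            self.page,
            esc(&self.context_before),
            esc(&self.context_after),
            opt(&self.ocr_text),
            opt(&self.ai_description)
        )
    }
}
```

The point of the sidecar is that a chart's metadata travels with the chart: a consumer can read img_001.json without parsing document.json at all.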
Ecosystem fit
Document Processor is a producer; the rest of BuildOnAI consumes what it produces.
- → Consciousness Server — ingest a folder of documents, push each parsed result as training records and notes into shared memory. Your team's archive becomes queryable by any agent.
- → Cortex — reads document.json as context and answers questions against the parsed corpus locally. "Show me every contract clause longer than 12 months." No cloud LLM, no data leaving the host.
- → Key Server — distribute Document Processor to multiple law-firm or research workstations; each fetches its API token from the vault rather than having one hardcoded in .env.
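A downstream consumer such as Consciousness Server or Cortex only needs the directory layout from the output-structure section. A minimal Rust sketch of corpus discovery, with names assumed and error handling simplified (this is not the project's actual ingestion API):

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Find every processed/<id>/document.json so a local consumer can load
/// the parsed corpus. Layout follows the "Output structure" section.
fn collect_corpus(root: &Path) -> std::io::Result<Vec<PathBuf>> {
    let mut docs = Vec::new();
    for entry in fs::read_dir(root)? {
        let doc = entry?.path().join("document.json");
        if doc.is_file() {
            docs.push(doc);
        }
    }
    docs.sort(); // deterministic ingestion order
    Ok(docs)
}
```

Because every document directory is self-contained, ingestion is just a directory walk; no database connection to Document Processor itself is required.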
Install (build from source)
Pre-1.0 ships from source. Pre-built installers for Linux and Windows arrive with v1.0.
# Linux (Ubuntu/Debian) — system deps
sudo apt install -y libwebkit2gtk-4.1-dev libappindicator3-dev \
librsvg2-dev patchelf libssl-dev
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Clone and build
git clone https://github.com/build-on-ai/document-processor.git
cd document-processor
npm install
npm run tauri build

Windows note: install WebView2 and Visual Studio Build Tools first. Tauri uses the system WebView; nothing is bundled.
Status
pre-1.0 — working and in active use, but not yet production-hardened. The public release is intentionally a clean cut from the private development trajectory; no leaked private data, no embedded customer IDs.
What "pre-1.0" means in practice:
- API may change without a deprecation cycle. Names of fields in document.json, paths under processed/, plugin hooks — all subject to revision until v1.0.
- Edge-case PDFs may fail or produce noisy output. Unusual layouts, heavy scans, files over 100 MB — works on most documents, but failure modes are real.
- No pre-built installers yet. Build from source for now (instructions above). Linux .AppImage + .deb and Windows .msi arrive with v1.0.
- OCR pipeline is generic. Scanned documents pass through basic OCR; vertical-tuned OCR (legal forms, medical records) is on the v1.0 roadmap.
- Safe to use on real documents — output goes to a local directory, the original file is preserved as-is, no destructive operations on input.
If you're trialling Document Processor in a regulated workflow, the honest call is: parse a representative subset first, verify the output format meets your needs, then expand. v1.0 will lock the schema.
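"Verify the output format" can start as a trivial completeness check against the output-structure section. A hedged sketch: only the three fixed-name artifacts are checked, because the preserved original's extension varies by input format:

```rust
use std::path::Path;

/// Spot-check one processed/<id>/ directory: the fixed-name artifacts
/// from the "Output structure" section must exist. (Sketch only; schema
/// validation of document.json would come on top of this.)
fn output_is_complete(dir: &Path) -> bool {
    ["document.md", "document.json", "images"]
        .iter()
        .all(|name| dir.join(name).exists())
}
```

Run something like this over the representative subset before scaling up; once v1.0 locks the schema, the check can be tightened to validate document.json field by field.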
Roadmap to v1.0:
- Regression test corpus across vendor PDFs (Adobe, Foxit, scanned, OCR'd).
- Vertical-specific classification packs — legal, medical, scientific publications, patents.
- Privacy-preserving OCR pipeline tuned for confidential workloads.
- Document diff workflow — compare two revisions, highlight changes, extract clauses that moved between versions.
- Pre-built installers (Linux .AppImage + .deb, Windows .msi).
Next steps
- Key Server →
- View on GitHub
- Security posture →