Overview
A desktop application — Rust backend, Tauri 2 shell, Svelte 5 frontend — that parses documents locally and extracts text plus images with their surrounding context preserved. SQLite database for fast search, watch folder for automation, cross-platform (Linux and Windows).
The point is not parsing. The point is locality. Cloud OCR and document AI services exist; this exists for teams whose documents legally cannot leave the host.
Who it's for
The target users are organisations where document confidentiality is a regulatory or contractual requirement, not a preference:
- Law firms — contract review, pleadings analysis, opinion drafts. Attorney-client privilege precludes cloud AI.
- Medical research and clinical labs — patient records, study protocols, trial data. GDPR and patient consent don't extend to overseas cloud LLMs.
- Patent attorneys — pre-filing applications. Even an embedding leaked to a model's training set is enough to lose novelty.
- Engineering and R&D teams — unpatented IP, pre-publication research, internal designs.
- Public-sector operators — classified, restricted, or sovereignty-controlled material.
- Corporations with M&A pipelines or strategic IP — board reports, due diligence materials, financial models.
Features
- Multi-format parsing — PDF, DOCX, DOC, TXT, RTF.
- Image extraction with surrounding context — every image carries 200 chars of preceding and following text, a position marker, optional OCR, and an optional AI description. Charts and diagrams stay anchored to the prose that explains them.
- Document type classification — automatic detection of document types (configurable; ships with example classifiers for Polish legal documents: umowa, pozew, ustawa — contract, lawsuit, statute).
- Watch folder — drop files into a monitored directory; they are parsed and indexed automatically.
- SQLite database — fast cross-document search without a server. Single file, easy backup, no extra service.
- Modern UI — dark theme, drag-and-drop, responsive. Built with Svelte 5; reactive, small bundle.
- Cross-platform binary — Linux and Windows. Tauri keeps the package small (~10–20 MB) and the runtime fast.
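The 200-characters-of-context rule from the image-extraction feature above can be sketched as a small Rust helper. The function name and the byte-offset interface are assumptions for illustration, not the project's actual API; the only detail taken from the source is the fixed-size window on each side of the image position:

```rust
/// Return up to `window` bytes of text on each side of `offset`,
/// snapped to UTF-8 character boundaries so slicing never panics.
/// (Hypothetical helper; the real extractor's interface is not published.)
fn surrounding_context(text: &str, offset: usize, window: usize) -> (String, String) {
    // Snap the anchor itself down to a valid boundary first.
    let mut mid = offset.min(text.len());
    while !text.is_char_boundary(mid) {
        mid -= 1;
    }
    let mut start = mid.saturating_sub(window);
    while !text.is_char_boundary(start) {
        start -= 1;
    }
    let mut end = (mid + window).min(text.len());
    while !text.is_char_boundary(end) {
        end += 1;
    }
    (text[start..mid].to_string(), text[mid..end].to_string())
}
```

Near the start or end of a document the window simply shrinks, which matches the "surrounding context" behaviour described above.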
Output structure
Each processed document creates a self-contained directory:
processed/<document-id>/
├── document.md # Human-readable markdown
├── document.json # Structured data for AI ingestion
├── images/
│ ├── img_001.png # Extracted images
│ ├── img_001.json # Image metadata + surrounding context
│ └── thumb_001.png # Thumbnail
└── original.pdf # Original file copy (audit trail)
Two parallel formats: document.md for humans and
document.json for AI ingestion. Images live alongside,
each with its own metadata sidecar describing the textual context it
was extracted from. The original file is preserved for audit.
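The sidecar's shape can be illustrated with a Rust struct. Every field name here is an assumption inferred from the description above (the schema is explicitly unstable until v1.0), and a real build would use serde rather than this hand-rolled serialization:

```rust
/// Sketch of an img_NNN.json sidecar. Field names are assumptions,
/// not the project's locked schema.
struct ImageMetadata {
    image: String,                  // e.g. "img_001.png"
    page: u32,                      // position marker
    context_before: String,         // ~200 chars of preceding text
    context_after: String,          // ~200 chars of following text
    ocr_text: Option<String>,       // optional OCR
    ai_description: Option<String>, // optional AI description
}

impl ImageMetadata {
    /// Minimal JSON output using only std, for illustration.
    fn to_json(&self) -> String {
        fn esc(s: &str) -> String {
            s.replace('\\', "\\\\").replace('"', "\\\"")
        }
        fn opt(o: &Option<String>) -> String {
            match o {
                Some(s) => format!("\"{}\"", esc(s)),
                None => "null".to_string(),
            }
        }
        format!(
            "{{\"image\":\"{}\",\"page\":{},\"context_before\":\"{}\",\"context_after\":\"{}\",\"ocr_text\":{},\"ai_description\":{}}}",
            esc(&self.image),
            self.page,
            esc(&self.context_before),
            esc(&self.context_after),
            opt(&self.ocr_text),
            opt(&self.ai_description)
        )
    }
}
```

The point of the sidecar is that a chart's metadata travels with the chart: a consumer can read img_001.json without parsing document.json at all.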
Ecosystem fit
Document Processor is a producer; the rest of BuildOnAI consumes what it produces.
- → Consciousness Server — ingest a folder of documents, push each parsed result as training records and notes into shared memory. Your team's archive becomes queryable by any agent.
- → Cortex — reads document.json as context and answers questions against the parsed corpus locally. "Show me every contract clause longer than 12 months." No cloud LLM, no data leaving the host.
- → Key Server — distribute Document Processor to multiple law-firm or research workstations; each fetches its API token from the vault rather than having one hardcoded in .env.
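A downstream consumer such as Consciousness Server or Cortex only needs the directory layout from the output-structure section. A minimal Rust sketch of corpus discovery, with names assumed and error handling simplified (this is not the project's actual ingestion API):

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Find every processed/<id>/document.json so a local consumer can load
/// the parsed corpus. Layout follows the "Output structure" section.
fn collect_corpus(root: &Path) -> std::io::Result<Vec<PathBuf>> {
    let mut docs = Vec::new();
    for entry in fs::read_dir(root)? {
        let doc = entry?.path().join("document.json");
        if doc.is_file() {
            docs.push(doc);
        }
    }
    docs.sort(); // deterministic ingestion order
    Ok(docs)
}
```

Because every document directory is self-contained, ingestion is just a directory walk; no database connection to Document Processor itself is required.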
Install (build from source)
Pre-1.0 ships from source. Pre-built installers for Linux and Windows arrive with v1.0.
# Linux (Ubuntu/Debian) — system deps
sudo apt install -y libwebkit2gtk-4.1-dev libappindicator3-dev \
librsvg2-dev patchelf libssl-dev
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Clone and build
git clone https://github.com/build-on-ai/document-processor.git
cd document-processor
npm install
npm run tauri build

Windows note: install WebView2 and Visual Studio Build Tools first. Tauri uses the system WebView; nothing is bundled.
Status
pre-1.0 — working and in active use, but not yet production-hardened. The public release is intentionally a clean cut from the private development trajectory; no leaked private data, no embedded customer IDs.
What "pre-1.0" means in practice:
- API may change without a deprecation cycle. Names of fields in document.json, paths under processed/, plugin hooks — all subject to revision until v1.0.
- Edge-case PDFs may fail or produce noisy output. Unusual layouts, heavy scans, files over 100 MB — works on most documents, but failure modes are real.
- No pre-built installers yet. Build from source for now (instructions above). Linux .AppImage + .deb and Windows .msi arrive with v1.0.
- OCR pipeline is generic. Scanned documents pass through basic OCR; vertical-tuned OCR (legal forms, medical records) is on the v1.0 roadmap.
- Safe to use on real documents — output goes to a local directory, the original file is preserved as-is, no destructive operations on input.
If you're trialling Document Processor in a regulated workflow, the honest call is: parse a representative subset first, verify the output format meets your needs, then expand. v1.0 will lock the schema.
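"Verify the output format" can start as a trivial completeness check against the output-structure section. A hedged sketch: only the three fixed-name artifacts are checked, because the preserved original's extension varies by input format:

```rust
use std::path::Path;

/// Spot-check one processed/<id>/ directory: the fixed-name artifacts
/// from the "Output structure" section must exist. (Sketch only; schema
/// validation of document.json would come on top of this.)
fn output_is_complete(dir: &Path) -> bool {
    ["document.md", "document.json", "images"]
        .iter()
        .all(|name| dir.join(name).exists())
}
```

Run something like this over the representative subset before scaling up; once v1.0 locks the schema, the check can be tightened to validate document.json field by field.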
Roadmap to v1.0:
- Regression test corpus across vendor PDFs (Adobe, Foxit, scanned, OCR'd).
- Vertical-specific classification packs — legal, medical, scientific publications, patents.
- Privacy-preserving OCR pipeline tuned for confidential workloads.
- Document diff workflow — compare two revisions, highlight changes, extract clauses that moved between versions.
- Pre-built installers (Linux .AppImage + .deb, Windows .msi).
Next steps
- Key Server →
- View on GitHub
- Security posture →