Prerequisites
- A working ecosystem from the Quickstart (CS on :3032, semantic-search on :3037).
- Document Processor installed (Tauri desktop app, available for Linux / macOS / Windows).
- One document parsed. Drag a PDF onto the Document Processor window; wait for the green check.
What Document Processor produces
Each parsed document lands in ~/.local/share/document-processor/exports/<date>_<slug>/ (Linux paths shown; macOS uses ~/Library/Application Support/document-processor/; Windows uses %APPDATA%\document-processor\). The shape:
# Document Processor writes a tree per parsed document:
~/.local/share/document-processor/exports/
└── 2026-04-27_supplier-contract/
    ├── document.json   # structured metadata (kind, title, dates, parties)
    ├── chunks.jsonl    # semantic chunks, one JSON per line
    ├── images/         # extracted figures with surrounding-context captions
    │   ├── fig-001.png
    │   └── fig-001.txt # "Figure 1: revenue breakdown 2025..."
    └── full-text.txt   # raw text fallback
You hand the directory off to the next stage. The chunks.jsonl file is the workhorse: one retrievable unit per line, with section headings preserved.
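For orientation, a single chunks.jsonl line looks roughly like the sketch below. Only the heading and text fields are what the ingest script in the next section relies on; any other fields are illustrative assumptions, not the Document Processor's actual schema.
# one line of chunks.jsonl (illustrative; fields beyond "heading"/"text" are assumed)
{"heading": "7. Late delivery", "text": "If delivery slips by more than 30 days, the supplier owes...", "index": 12}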
Ingest into Consciousness Server
Each chunk becomes a training record. The type field is required: for prose chunks use explanation; for clause-like or sectional content use architecture. Tag every record with the document id so you can scope searches later.
import json, requests
from pathlib import Path

CS = "http://127.0.0.1:3032"
EXPORT = Path.home() / ".local/share/document-processor/exports/2026-04-27_supplier-contract"

meta = json.loads((EXPORT / "document.json").read_text())

# Each chunk becomes a training record. The "type" field is REQUIRED.
# For document content, use "explanation" (a self-contained chunk of
# meaning) or "architecture" (a structural section like a contract clause).
with (EXPORT / "chunks.jsonl").open() as f:
    for line in f:
        chunk = json.loads(line)
        requests.post(f"{CS}/api/memory/training", json={
            "agent": "doc-pipeline",
            "type": "explanation",
            "goal": f"ingest:{meta['title']}",
            "instruction": "search-retrievable chunk",
            "input": chunk["heading"] or "",
            "output": chunk["text"],
            "tags": [meta["kind"], "doc:" + meta["id"]],
        }).raise_for_status()
print(f"Ingested {meta['title']} — {meta['chunk_count']} chunks indexed.") CS embeds each record into ChromaDB via Ollama on the host. Index size grows linearly with chunk count; embeddings are ~1.5 KB each, so a 10 000-chunk corpus is roughly 15 MB plus the ChromaDB overhead. All of it on local disk.
Retrieve by meaning, not filename
Once ingested, an agent finds the right clause without knowing what file it lived in:
# Now an agent can find that contract by meaning, not filename.
hits = requests.post("http://127.0.0.1:3037/api/search", json={
    "query": "what penalty applies if delivery slips by 30 days",
    "limit": 5,
    "filters": {"tags": ["doc:" + meta["id"]]},
}).json()

for h in hits["results"]:
    print(f"score={h['score']:.2f} {h['snippet'][:120]}")

filters.tags narrows the search to one document; remove the filter for a corpus-wide query. The score is cosine similarity (0..1).
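For a corpus-wide query, the same call simply omits the filter. A minimal sketch; the 0.35 cutoff is an arbitrary illustration, not a recommended threshold:
# corpus-wide: same endpoint, no "filters"; keep only the stronger hits
hits = requests.post("http://127.0.0.1:3037/api/search", json={
    "query": "what penalty applies if delivery slips by 30 days",
    "limit": 5,
}).json()
strong = [h for h in hits["results"] if h["score"] > 0.35]  # cutoff is illustrative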
Make it continuous
Document Processor is a desktop app: every parsed document appears as a new directory under exports/. A small directory watcher turns "I parsed a file" into "the chunks are searchable" without any UI work:
# A directory-watcher daemon turns Document Processor's exports/ into
# a continuous ingest feed. A systemd path unit or inotify both work;
# here's the inotify version, which fires once document.json is fully written:
inotifywait -m -r -e close_write ~/.local/share/document-processor/exports/ |
while read -r dir _ name; do
    if [ "$name" = "document.json" ]; then
        python ingest.py "$dir"
    fi
done
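The one-off script earlier hardcodes its export directory; for the watcher to drive it, ingest.py has to accept that directory as an argument instead. A minimal sketch, assuming the same endpoint and chunk fields as before:
# ingest.py: same loop as the one-off script, export directory taken from argv
import json, sys, requests
from pathlib import Path

CS = "http://127.0.0.1:3032"
EXPORT = Path(sys.argv[1])

meta = json.loads((EXPORT / "document.json").read_text())
with (EXPORT / "chunks.jsonl").open() as f:
    for line in f:
        chunk = json.loads(line)
        requests.post(f"{CS}/api/memory/training", json={
            "agent": "doc-pipeline",
            "type": "explanation",
            "goal": f"ingest:{meta['title']}",
            "instruction": "search-retrievable chunk",
            "input": chunk["heading"] or "",
            "output": chunk["text"],
            "tags": [meta["kind"], "doc:" + meta["id"]],
        }).raise_for_status()
print(f"Ingested {meta['title']}: {meta['chunk_count']} chunks indexed.")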
Wrap the watcher loop in a systemd user unit (~/.config/systemd/user/doc-ingest.service) so it survives reboots. From the operator's point of view, dropping a PDF on Document Processor now also drops it into the corpus.
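A minimal sketch of that unit, assuming the watcher loop is saved as an executable script at ~/bin/doc-ingest-watch.sh (the script path is an assumption, not a convention of Document Processor):
# ~/.config/systemd/user/doc-ingest.service
[Unit]
Description=Watch Document Processor exports and ingest into CS

[Service]
ExecStart=%h/bin/doc-ingest-watch.sh
Restart=on-failure

[Install]
WantedBy=default.target
Enable it with systemctl --user enable --now doc-ingest.service.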
Next steps
- Write a custom agent → one that searches the corpus and answers questions over it.
- Switch on signed requests → before exposing the corpus to multiple agents.
- Document Processor product page → for the parser internals (image-with-context extraction, classifier, MIME-type handling).