Prerequisites
- A working ecosystem from the Quickstart (CS on :3032, semantic-search on :3037).
- Document Processor installed (Tauri desktop app, available for Linux / macOS / Windows).
- One document parsed. Drag a PDF onto the Document Processor window; wait for the green check.
What Document Processor produces
Each parsed document lands in ~/.local/share/document-processor/exports/<date>_<slug>/ (Linux paths shown; macOS uses ~/Library/Application Support/document-processor/; Windows uses %APPDATA%\document-processor\). The shape:
# Document Processor writes a tree per parsed document:
~/.local/share/document-processor/exports/
└── 2026-04-27_supplier-contract/
    ├── document.json   # structured metadata (kind, title, dates, parties)
    ├── chunks.jsonl    # semantic chunks, one JSON per line
    ├── images/         # extracted figures with surrounding-context captions
    │   ├── fig-001.png
    │   └── fig-001.txt # "Figure 1: revenue breakdown 2025..."
    └── full-text.txt   # raw text fallback
You hand the directory off to the next stage. The chunks.jsonl file is the workhorse: one retrievable unit per line, with section headings preserved.
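For orientation, a single chunks.jsonl line looks roughly like the sketch below. Only the heading and text fields are what the ingest script in the next section relies on; any other fields are illustrative assumptions, not the Document Processor's actual schema.
# one line of chunks.jsonl (illustrative; fields beyond "heading"/"text" are assumed)
{"heading": "7. Late delivery", "text": "If delivery slips by more than 30 days, the supplier owes...", "index": 12}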
Ingest into Consciousness Server
Each chunk becomes a training record. The type field is required: for prose chunks use explanation; for clause-like or sectional content use architecture. Tag every record with the document id so you can scope searches later.
import json, requests
from pathlib import Path

CS = "http://127.0.0.1:3032"
EXPORT = Path.home() / ".local/share/document-processor/exports/2026-04-27_supplier-contract"

meta = json.loads((EXPORT / "document.json").read_text())

# Each chunk becomes a training record. The "type" field is REQUIRED.
# For document content, use "explanation" (a self-contained chunk of
# meaning) or "architecture" (a structural section like a contract clause).
with (EXPORT / "chunks.jsonl").open() as f:
    for line in f:
        chunk = json.loads(line)
        requests.post(f"{CS}/api/memory/training", json={
            "agent": "doc-pipeline",
            "type": "explanation",
            "goal": f"ingest:{meta['title']}",
            "instruction": "search-retrievable chunk",
            "input": chunk["heading"] or "",
            "output": chunk["text"],
            "tags": [meta["kind"], "doc:" + meta["id"]],
        }).raise_for_status()
print(f"Ingested {meta['title']} — {meta['chunk_count']} chunks indexed.") CS embeds each record into ChromaDB via Ollama on the host. Index size grows linearly with chunk count; embeddings are ~1.5 KB each, so a 10 000-chunk corpus is roughly 15 MB plus the ChromaDB overhead. All of it on local disk.
Retrieve by meaning, not filename
Once ingested, an agent finds the right clause without knowing what file it lived in:
# Now an agent can find that contract by meaning, not filename.
hits = requests.post("http://127.0.0.1:3037/api/search", json={
    "query": "what penalty applies if delivery slips by 30 days",
    "limit": 5,
    "filters": {"tags": ["doc:" + meta["id"]]},
}).json()

for h in hits["results"]:
    print(f"score={h['score']:.2f} {h['snippet'][:120]}")

filters.tags narrows the search to one document; remove the filter for a corpus-wide query. The score is cosine similarity (0..1).
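For a corpus-wide query, the same call simply omits the filter. A minimal sketch; the 0.35 cutoff is an arbitrary illustration, not a recommended threshold:
# corpus-wide: same endpoint, no "filters"; keep only the stronger hits
hits = requests.post("http://127.0.0.1:3037/api/search", json={
    "query": "what penalty applies if delivery slips by 30 days",
    "limit": 5,
}).json()
strong = [h for h in hits["results"] if h["score"] > 0.35]  # cutoff is illustrative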
Make it continuous
Document Processor is a desktop app: every parsed document appears as a new directory under exports/. A small directory watcher turns "I parsed a file" into "the chunks are searchable" without any UI work:
# A directory-watcher daemon turns Document Processor's exports/ into
# a continuous ingest feed. A systemd path unit or inotify both work;
# here's the inotify version, which fires once document.json is fully written:
inotifywait -m -r -e close_write ~/.local/share/document-processor/exports/ |
while read -r dir _ name; do
    if [ "$name" = "document.json" ]; then
        python ingest.py "$dir"
    fi
done
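The one-off script earlier hardcodes its export directory; for the watcher to drive it, ingest.py has to accept that directory as an argument instead. A minimal sketch, assuming the same endpoint and chunk fields as before:
# ingest.py: same loop as the one-off script, export directory taken from argv
import json, sys, requests
from pathlib import Path

CS = "http://127.0.0.1:3032"
EXPORT = Path(sys.argv[1])

meta = json.loads((EXPORT / "document.json").read_text())
with (EXPORT / "chunks.jsonl").open() as f:
    for line in f:
        chunk = json.loads(line)
        requests.post(f"{CS}/api/memory/training", json={
            "agent": "doc-pipeline",
            "type": "explanation",
            "goal": f"ingest:{meta['title']}",
            "instruction": "search-retrievable chunk",
            "input": chunk["heading"] or "",
            "output": chunk["text"],
            "tags": [meta["kind"], "doc:" + meta["id"]],
        }).raise_for_status()
print(f"Ingested {meta['title']}: {meta['chunk_count']} chunks indexed.")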
Wrap the watcher loop in a systemd user unit (~/.config/systemd/user/doc-ingest.service) so it survives reboots. From the operator's point of view, dropping a PDF on Document Processor now also drops it into the corpus.
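A minimal sketch of that unit, assuming the watcher loop is saved as an executable script at ~/bin/doc-ingest-watch.sh (the script path is an assumption, not a convention of Document Processor):
# ~/.config/systemd/user/doc-ingest.service
[Unit]
Description=Watch Document Processor exports and ingest into CS

[Service]
ExecStart=%h/bin/doc-ingest-watch.sh
Restart=on-failure

[Install]
WantedBy=default.target
Enable it with systemctl --user enable --now doc-ingest.service.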
Next steps
- Write a custom agent → one that searches the corpus and answers questions over it.
- Switch on signed requests → before exposing the corpus to multiple agents.
- Document Processor product page → for the parser internals (image-with-context extraction, classifier, MIME-type handling).