PDF Brain Ingest v2 (ADR-0234)
Staged artifact-chain pipeline with durable NAS storage, nomic embeddings, and workload queue orchestration.
Pipeline v2 Architecture
PDF (source, immutable)
→ Stage 1: CONVERT — opendataloader-pdf → {docId}.md (NAS artifact)
→ Stage 2: CLASSIFY + SUMMARIZE — taxonomy + LLM summary → {docId}.meta.json (NAS artifact)
→ Stage 3: CHUNK — markdown-native headings, no overlap → {docId}.chunks.jsonl (NAS artifact)
→ Stage 4: INDEX — upsert to docs + docs_chunks_v2 (nomic-embed-text-v1.5, 768-dim)
Key properties:
- Durable: artifacts on NAS RAID5, survive reboots/crashes
- Resumable: each stage checks for its existing artifact and skips the work if it is present
- Recoverable: any stage can be re-run from existing artifacts without re-running PDF extraction
- Observable: OTEL event per stage per book
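The resumable behaviour described above can be sketched as a small skip-if-present wrapper. This is an illustration under stated assumptions, not the pipeline's actual code: `run_stage`, `produce`, and `force` are hypothetical names, and the write-to-temp-then-rename step is an added assumption so that a crash mid-write does not leave a partial artifact that a later run would skip.

```python
import os
from pathlib import Path


def run_stage(artifact: Path, produce, force: bool = False):
    """Run one stage unless its artifact already exists.

    `produce` is a zero-argument callable returning the artifact text.
    Returns (artifact_path, ran), where `ran` is False if the stage was skipped.
    """
    if artifact.exists() and not force:
        return artifact, False  # resumable: artifact present, skip the work

    artifact.parent.mkdir(parents=True, exist_ok=True)

    # Write to a sibling temp file, then rename over the target.
    # os.replace is atomic when both paths are on the same filesystem.
    tmp = artifact.with_name(artifact.name + ".tmp")
    tmp.write_text(produce(), encoding="utf-8")
    os.replace(tmp, artifact)
    return artifact, True
```

Passing `force=True` re-runs a single stage from its upstream artifacts, which corresponds to the recoverable property. An OTEL event could be emitted on both the skip path and the run path.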