pdf-brain-ingest


PDF Brain Ingest v2 (ADR-0234)

Staged artifact-chain pipeline with durable NAS storage, nomic embeddings, and workload queue orchestration.

Pipeline v2 Architecture

PDF (source, immutable)
  → Stage 1: CONVERT — opendataloader-pdf → {docId}.md (NAS artifact)
  → Stage 2: CLASSIFY + SUMMARIZE — taxonomy + LLM summary → {docId}.meta.json (NAS artifact)
  → Stage 3: CHUNK — markdown-native headings, no overlap → {docId}.chunks.jsonl (NAS artifact)
  → Stage 4: INDEX — upsert to docs + docs_chunks_v2 (nomic-embed-text-v1.5, 768-dim)
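
Stage 3 can be illustrated with a minimal sketch of heading-based chunking with no overlap. This is not the skill's actual implementation: the function names `chunk_markdown` and `to_jsonl` and the record fields (`doc_id`, `index`, `heading`, `text`) are assumptions made for illustration. Each line of the markdown lands in exactly one chunk, and `#` lines inside fenced code blocks are not treated as headings.

```python
import json
import re

# ATX-style markdown heading: one to six '#' characters, then the title text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_markdown(doc_id: str, markdown: str) -> list[dict]:
    """Split markdown into heading-delimited chunks with no overlap.

    Hypothetical record shape; the real chunks.jsonl schema may differ.
    """
    chunks: list[dict] = []
    heading = ""          # text before the first heading gets an empty heading
    body: list[str] = []
    in_fence = False      # True while inside a ``` fenced code block

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append(
                {"doc_id": doc_id, "index": len(chunks), "heading": heading, "text": text}
            )

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush()                      # close the previous section
            heading = match.group(2)
            body = [line]                # the heading line starts the new chunk
        else:
            body.append(line)
    flush()
    return chunks


def to_jsonl(chunks: list[dict]) -> str:
    """Serialize chunks as one JSON object per line."""
    return "\n".join(json.dumps(chunk) for chunk in chunks)
```

Because sections never share lines, re-chunking the same `{docId}.md` yields the same chunk boundaries, which keeps a later index upsert deterministic.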

Key properties:

  • Durable: artifacts on NAS RAID5, survive reboots/crashes
  • Resumable: each stage checks for existing artifacts, skips if present
  • Recoverable: re-run any stage from existing artifacts without re-extracting
  • Observable: OTEL event per stage per book
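
The "Resumable" property can be sketched as a check-then-skip wrapper around each stage. This is a minimal sketch under stated assumptions: `artifact_paths` and `run_stage` are hypothetical helpers, only the file suffixes (`.md`, `.meta.json`, `.chunks.jsonl`) come from the pipeline description above, and the write-to-temp-then-rename step is an added precaution (not confirmed by the source) so that a crash mid-write does not leave a partial file that looks like a finished artifact.

```python
from pathlib import Path
from typing import Callable


def artifact_paths(root: Path, doc_id: str) -> dict[str, Path]:
    """Map each artifact-producing stage to its file under a storage root.

    Suffixes follow the pipeline description; the flat layout is an assumption.
    """
    return {
        "convert": root / f"{doc_id}.md",
        "classify": root / f"{doc_id}.meta.json",
        "chunk": root / f"{doc_id}.chunks.jsonl",
    }


def run_stage(artifact: Path, produce: Callable[[], str]) -> bool:
    """Run produce() only when the artifact is missing.

    Returns True if the stage did work, False if it was skipped.
    """
    if artifact.exists():
        return False  # existing artifact: skip the stage
    artifact.parent.mkdir(parents=True, exist_ok=True)
    tmp = artifact.with_name(artifact.name + ".tmp")
    tmp.write_text(produce(), encoding="utf-8")
    tmp.replace(artifact)  # rename so the final name only appears when complete
    return True
```

Under this sketch, re-running a single stage means deleting its artifact and calling `run_stage` again; earlier artifacts are left in place and reused.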
First seen: Feb 27, 2026
pdf-brain-ingest — joelhooks/joelclaw