large-document-processing
Large Document Processing & Intelligent Text Chunking
⚠️ Repo Reality Check (read this first)
The real components are:

- Top-level pipeline: `LargeDocumentProcessor`
- Structure-aware parser: `AdvancedDocumentParser`
- Streaming OCR with progress: `EnhancedOCRProcessor`
- Chunker: `IntelligentTextChunker` (see the intelligent-text-chunking skill)
- Training data generation: `AITrainingDataGenerator`
- Setup helper: `scripts/setup_large_document_processing.py`

The NWT EPUB parser exposes only `get_verse(book_num, chapter, verse)` (`nwt_epub_parser.py`); there is no `get_chapter`/`get_book`. See the bible-epub-processing skill.

Source data lives under `config/data/` (NOT a top-level `data/`).

Always wrap chunking calls with `protect_scripture_references` / `restore_scripture_references` from `src/utils/scripture_parser.py` when input may contain Bible references.
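
The protect/chunk/restore pattern can be sketched as follows. This is a minimal, self-contained illustration and not the repo's code: the signatures of the real helpers in `src/utils/scripture_parser.py` are not shown here, so `protect_refs_standin`, `restore_refs_standin`, the `(text, mapping)` return shape, the `__REFn__` placeholder format, the reference regex, and the naive `chunk_words` splitter are all assumptions made for this example.

```python
import re

# ASSUMPTION: a deliberately simple pattern for references such as
# "John 3:16" or "1 Peter 5:7". The real parser is likely more thorough.
_REF = re.compile(r"\b(?:[1-3] )?[A-Z][a-z]+ \d+:\d+(?:-\d+)?")


def protect_refs_standin(text):
    """Hypothetical stand-in for protect_scripture_references.

    Replaces each reference with a whitespace-free placeholder and
    returns (protected_text, mapping). The real helper may differ.
    """
    mapping = {}

    def _swap(match):
        token = f"__REF{len(mapping)}__"
        mapping[token] = match.group(0)
        return token

    return _REF.sub(_swap, text), mapping


def restore_refs_standin(text, mapping):
    """Hypothetical stand-in for restore_scripture_references."""
    for token, reference in mapping.items():
        text = text.replace(token, reference)
    return text


def chunk_words(text, max_words):
    """Naive chunker (illustration only): split every max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def chunk_with_protected_refs(text, max_words):
    """Protect references, chunk, then restore them in every chunk."""
    protected, mapping = protect_refs_standin(text)
    chunks = chunk_words(protected, max_words)
    return [restore_refs_standin(chunk, mapping) for chunk in chunks]


if __name__ == "__main__":
    sample = "Compare John 3:16 with 1 Peter 5:7 today"
    print(chunk_words(sample, 2))
    # ['Compare John', '3:16 with', '1 Peter', '5:7 today']
    print(chunk_with_protected_refs(sample, 2))
    # ['Compare John 3:16', 'with 1 Peter 5:7', 'today']
```

Without protection, the naive splitter cuts "John 3:16" across two chunks. With the placeholder in place, each reference travels as a single token. When adapting this to the repo, swap the stand-ins for the real helpers and check their actual signatures first.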