Large Document Processing & Intelligent Text Chunking

⚠️ Repo Reality Check (read this first)

The real components are:

Top-level pipeline: LargeDocumentProcessor

Structure-aware parser: AdvancedDocumentParser

Streaming OCR with progress: EnhancedOCRProcessor

Chunker: IntelligentTextChunker — see the intelligent-text-chunking skill.

Training data generation: AITrainingDataGenerator

Setup helper: scripts/setup_large_document_processing.py

The NWT EPUB parser exposes only get_verse(book_num, chapter, verse) (nwt_epub_parser.py) — there is no get_chapter / get_book. See the bible-epub-processing skill.

Source data lives under config/data/ (NOT a top-level data/).

Always wrap chunking calls with protect_scripture_references / restore_scripture_references from src/utils/scripture_parser.py when input may contain Bible references.
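The protect/restore wrapping described above can be sketched roughly as follows. The real signatures in src/utils/scripture_parser.py are not shown in this document, so the two functions below are hypothetical stand-ins that only borrow the names. The sketch assumes that protect returns the rewritten text plus a placeholder mapping, and that restore takes both back. The regex, naive_chunk, and chunk_safely are also illustrative and not part of the repo. In real code, import the helpers from the module and check their actual return values first.

```python
import re

# HYPOTHETICAL stand-ins for the helpers in src/utils/scripture_parser.py.
# The real functions may use different signatures and return types.
_REF = re.compile(r"\b(?:[1-3]\s)?[A-Z][a-z]+\.?\s\d+:\d+(?:-\d+)?")


def protect_scripture_references(text):
    """Swap each reference for a single-token placeholder (assumed behavior)."""
    mapping = {}

    def _swap(match):
        token = f"__REF{len(mapping)}__"
        mapping[token] = match.group(0)
        return token

    return _REF.sub(_swap, text), mapping


def restore_scripture_references(text, mapping):
    """Put the original references back in place of the placeholders."""
    for token, ref in mapping.items():
        text = text.replace(token, ref)
    return text


def naive_chunk(text, max_words=5):
    """Illustrative word-count chunker; stands in for any real chunker."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def chunk_safely(text, chunker=naive_chunk):
    """Protect references, chunk, then restore references in every chunk."""
    protected, mapping = protect_scripture_references(text)
    return [restore_scripture_references(chunk, mapping)
            for chunk in chunker(protected)]


if __name__ == "__main__":
    sample = "God is love, as 1 John 4:8 says plainly."
    print(naive_chunk(sample))   # reference is split across two chunks
    print(chunk_safely(sample))  # reference stays inside one chunk
```

Because each reference collapses to one whitespace-free token before chunking, a splitter that breaks on words or sentences cannot cut through it. The same wrapper shape should apply to any chunker that takes a string and returns a list of strings.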

Overview