Document Intelligence Promotion

A pipeline that extracts each document in a single pass, then runs multi-stage post-processing on the result.

Note: This pipeline uses pdfplumber to extract one document at a time; it is not meant for batch work. For batch text extraction across the corpus, call pdftotext via subprocess (see the pdf/pdftotext-poppler sub-skill).
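
The split between the two tools can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names (extract_single, pdftotext_command, extract_batch) are hypothetical, and it assumes pdfplumber is installed and the poppler pdftotext binary is on PATH.

```python
import subprocess
from pathlib import Path
from typing import Dict, List


def extract_single(pdf_path: str) -> str:
    """Extract text from one PDF with pdfplumber (hypothetical helper)."""
    import pdfplumber  # third-party; imported lazily so batch use does not need it

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can return None for a page with no text, hence `or ""`.
        return "\n".join((page.extract_text() or "") for page in pdf.pages)


def pdftotext_command(pdf_path: str, layout: bool = True) -> List[str]:
    """Build the pdftotext argument list; "-" sends the text to stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # keep the physical layout of the page
    cmd += [pdf_path, "-"]
    return cmd


def extract_batch(pdf_dir: str) -> Dict[str, str]:
    """Run pdftotext over every *.pdf in a directory (hypothetical helper)."""
    texts = {}
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        result = subprocess.run(
            pdftotext_command(str(pdf)),
            capture_output=True,
            text=True,
            check=True,  # raise if pdftotext exits with an error
        )
        texts[pdf.name] = result.stdout
    return texts
```

Building the command as a list, rather than a shell string, avoids quoting problems with file names that contain spaces.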

Architecture

PDF/DOCX → parser (single read) → manifest.yaml
                                       ↓
                            deep_extract.py (post-processors):
                            ├── table_exporter.py → CSV files
                            ├── worked_example_parser.py → pytest files
                            └── chart_extractor.py → images + metadata YAML
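
The flow in the diagram (one parse, one manifest, several post-processors reading from it) could look roughly like the sketch below. The manifest keys ("tables", "rows"), the registry, and the function names are assumptions made for illustration; the real manifest.yaml schema and the real deep_extract.py interface are not shown in this document. Only the table-export stage is sketched.

```python
import csv
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical manifest shape: {"tables": [{"rows": [[...], ...]}, ...]}
Manifest = Dict[str, list]


def export_tables(manifest: Manifest, out_dir: Path) -> List[Path]:
    """Write each table listed in the manifest to its own CSV file."""
    written = []
    for index, table in enumerate(manifest.get("tables", [])):
        path = out_dir / f"table_{index}.csv"
        with path.open("w", newline="") as handle:
            csv.writer(handle).writerows(table["rows"])
        written.append(path)
    return written


# Registry of post-processors; worked-example and chart stages would be
# added here with the same (manifest, out_dir) signature.
POST_PROCESSORS: Dict[str, Callable[[Manifest, Path], List[Path]]] = {
    "tables": export_tables,
}


def deep_extract(manifest: Manifest, out_dir: Path) -> Dict[str, List[Path]]:
    """Run every registered post-processor against an already-parsed manifest."""
    out_dir.mkdir(parents=True, exist_ok=True)
    return {
        name: processor(manifest, out_dir)
        for name, processor in POST_PROCESSORS.items()
    }
```

Because every stage reads the manifest instead of reopening the source file, the PDF or DOCX is parsed only once, which matches the "single read" step in the diagram.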

Repository

vamseeachanta/w…pace-hub

First Seen

Jun 1, 2026