Extracting PDF Text for LLMs
This skill provides tools and guidance for extracting text from PDFs in formats suitable for language model consumption.
Quick Decision Guide
| PDF Type | Best Approach | Script |
|---|---|---|
| Simple text PDF | PyMuPDF | scripts/extract_pymupdf.py |
| PDF with tables | pdfplumber | scripts/extract_pdfplumber.py |
| Scanned/image PDF (local) | pytesseract | scripts/extract_with_ocr.py |
| Complex layout, highest accuracy | Mistral OCR API | scripts/extract_mistral_ocr.py |
| End-to-end RAG pipeline | marker-pdf | pip install marker-pdf |
Recommended Workflow
- Try PyMuPDF first: it is the fastest option and handles most text-based PDFs well.
- If tables come out mangled, switch to pdfplumber.
- If the PDF is scanned or image-based, use the Mistral OCR API (best accuracy) or local OCR with pytesseract (free but slower).
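
The workflow above can be sketched as a small decision helper. This is a minimal illustration, not the logic of the bundled scripts: the names `looks_scanned` and `choose_next_step`, the `MIN_CHARS_PER_PAGE` threshold, and the "almost no text layer means scanned" heuristic are all assumptions made for this example. Only `fitz.open` and `page.get_text()` are PyMuPDF calls.

```python
from typing import List

# Assumed threshold (not from the skill): pages averaging fewer characters
# than this are treated as having no usable text layer.
MIN_CHARS_PER_PAGE = 50


def extract_pymupdf(path: str) -> List[str]:
    """Return the text of each page using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the helpers below need no dependencies

    with fitz.open(path) as doc:
        return [page.get_text() for page in doc]


def looks_scanned(page_texts: List[str], min_chars: int = MIN_CHARS_PER_PAGE) -> bool:
    """Heuristic: a PDF whose pages carry almost no text is probably scanned."""
    if not page_texts:
        return True
    average = sum(len(text.strip()) for text in page_texts) / len(page_texts)
    return average < min_chars


def choose_next_step(page_texts: List[str], tables_mangled: bool = False) -> str:
    """Map the result of a first PyMuPDF pass to the next tool to try."""
    if looks_scanned(page_texts):
        return "ocr"  # Mistral OCR API or local pytesseract
    if tables_mangled:
        return "pdfplumber"
    return "pymupdf"  # the first pass was good enough
```

A typical call would be `choose_next_step(extract_pymupdf("report.pdf"))`, where `report.pdf` is a placeholder path. Whether tables are mangled is left as a manual flag here, since detecting that reliably depends on the document.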