PDF Extraction Guide
Extract text, tables, figures, and metadata from academic PDFs using Python libraries, with strategies for handling multi-column layouts, mathematical content, and scanned documents.
PDF Extraction Tools Comparison
| Tool | Text | Tables | Figures | Layout | OCR | Speed |
|---|---|---|---|---|---|---|
| PyMuPDF (fitz) | Excellent | Manual | Yes | Blocks | No (add with OCR engine) | Fast |
| pdfplumber | Good | Excellent | No | Tables focus | No | Medium |
| PyPDF2 / pypdf | Basic | No | No | No | No | Fast |
| tabula-py | No | Excellent | No | No | No | Medium |
| GROBID | Structured | Yes | References | Academic layout | No | Slow (ML-based) |
| Nougat (Meta) | Excellent | Yes | Yes | Academic layout | Built-in | Slow (GPU) |
| Marker | Excellent | Yes | Yes | Multi-column | Built-in | Medium |
| pdf2image + Tesseract | Via OCR | Via OCR | Via OCR | No | Yes | Slow |
PyMuPDF (fitz) — Fast Text Extraction
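A minimal sketch of block-based extraction with PyMuPDF, assuming a simple two-column academic layout. The helper names `order_blocks_two_column` and `extract_pages` are hypothetical, and the "split at the page midline" rule is an illustrative assumption rather than anything PyMuPDF provides. It relies on `page.get_text("blocks")`, which returns tuples of the form `(x0, y0, x1, y1, text, block_no, block_type)`, where `block_type` 0 marks a text block.

```python
from typing import List, Sequence, Tuple

# Shape of one item returned by PyMuPDF's page.get_text("blocks"):
# (x0, y0, x1, y1, text, block_no, block_type); block_type 0 = text, 1 = image.
Block = Tuple[float, float, float, float, str, int, int]


def order_blocks_two_column(blocks: Sequence[Block], page_width: float) -> List[str]:
    """Return the text of text blocks in reading order.

    Assumption (illustrative): the page has two columns split at the
    horizontal midline. Blocks are read left column first, then right
    column, top to bottom within each column.
    """
    midline = page_width / 2

    def reading_key(block: Block) -> Tuple[int, float, float]:
        center_x = (block[0] + block[2]) / 2
        column = 0 if center_x < midline else 1
        return (column, block[1], block[0])

    # Keep only non-empty text blocks; image blocks are skipped.
    text_blocks = [b for b in blocks if b[6] == 0 and b[4].strip()]
    return [b[4].strip() for b in sorted(text_blocks, key=reading_key)]


def extract_pages(pdf_path: str) -> List[str]:
    """Extract one string per page, with blocks joined in reading order."""
    import fitz  # PyMuPDF; imported here so the helper above has no dependency

    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            blocks = page.get_text("blocks")
            ordered = order_blocks_two_column(blocks, page.rect.width)
            pages.append("\n\n".join(ordered))
    return pages
```

Calling `extract_pages("paper.pdf")` (a placeholder path) would return one string per page. Full-width elements such as titles or wide figure captions are not handled specially by the midline rule, so single-column pages or mixed layouts would likely need a different ordering key.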