PDF Extraction Guide

Extract text, tables, figures, and metadata from academic PDFs using Python libraries, with strategies for handling multi-column layouts, mathematical content, and scanned documents.

PDF Extraction Tools Comparison

| Tool | Text | Tables | Figures | Layout | OCR | Speed |
|---|---|---|---|---|---|---|
| PyMuPDF (fitz) | Excellent | Manual | Yes | Blocks | No (add with OCR engine) | Fast |
| pdfplumber | Good | Excellent | No | Tables focus | No | Medium |
| PyPDF2 / pypdf | Basic | No | No | No | No | Fast |
| Tabula-py | No | Excellent | No | No | No | Medium |
| GROBID | Structured | Yes | References | Academic layout | No | Slow (ML-based) |
| Nougat (Meta) | Excellent | Yes | Yes | Academic layout | Built-in | Slow (GPU) |
| Marker | Excellent | Yes | Yes | Multi-column | Built-in | Medium |
| pdf2image + Tesseract | Via OCR | Via OCR | Via OCR | No | Yes | Slow |

PyMuPDF (fitz) — Fast Text Extraction
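
As a minimal sketch of block-based text extraction with PyMuPDF, the example below reads each page with `page.get_text("blocks")` and joins the text blocks in a simple top-to-bottom order. The helper names `blocks_to_text` and `extract_pages` are hypothetical and chosen for illustration; the plain vertical sort is an assumption that suits single-column pages and would likely need column detection for multi-column layouts.

```python
from typing import Iterable, List, Tuple

# Shape of a block returned by PyMuPDF's page.get_text("blocks"):
# (x0, y0, x1, y1, text, block_no, block_type)
Block = Tuple[float, float, float, float, str, int, int]


def blocks_to_text(blocks: Iterable[Block]) -> str:
    """Join text blocks top-to-bottom, then left-to-right (hypothetical helper)."""
    # block_type 0 is text, 1 is an image; keep text only.
    text_blocks = [b for b in blocks if b[6] == 0]
    # Naive reading order: sort by top edge, then left edge.
    # Assumption: single-column page. Multi-column pages need column grouping.
    text_blocks.sort(key=lambda b: (round(b[1]), round(b[0])))
    return "\n".join(b[4].strip() for b in text_blocks if b[4].strip())


def extract_pages(pdf_path: str) -> List[str]:
    """Return one string of extracted text per page (hypothetical helper)."""
    import fitz  # PyMuPDF; imported here so blocks_to_text works without it

    with fitz.open(pdf_path) as doc:
        return [blocks_to_text(page.get_text("blocks")) for page in doc]
```

A call such as `extract_pages("paper.pdf")` (the file name is a placeholder) would return a list with one entry per page. Working from blocks rather than `page.get_text("text")` keeps the bounding boxes available, which is what the "Blocks" entry in the Layout column above refers to.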

First Seen
Apr 2, 2026