PDF Text Extractor Skill

Overview

This skill extracts text from PDF files using PyMuPDF (fitz), with intelligent chunking, page tracking, and metadata preservation. It handles large PDF collections through batch processing and error recovery.
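As a rough illustration of page-tracked chunking, here is a minimal sketch. It is not this skill's actual implementation: the function names `extract_pages` and `chunk_pages`, the record layout, and the `max_chars` parameter are all assumptions made for the example. The only PyMuPDF calls used are `fitz.open` and `Page.get_text`.

```python
from typing import Dict, List


def extract_pages(pdf_path: str) -> List[Dict]:
    """Return one record per page: 1-based page number and its plain text."""
    import fitz  # PyMuPDF; imported lazily so chunk_pages works without it

    records = []
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc):
            records.append({"page": index + 1, "text": page.get_text()})
    return records


def chunk_pages(pages: List[Dict], max_chars: int = 1000) -> List[Dict]:
    """Greedily pack paragraphs into chunks, remembering the pages they span.

    Hypothetical helper: a paragraph longer than max_chars is kept whole
    and becomes its own oversized chunk rather than being split.
    """
    chunks: List[Dict] = []
    parts: List[str] = []
    size = 0
    first_page = last_page = 0

    def flush() -> None:
        chunks.append(
            {"text": "\n".join(parts), "page_start": first_page, "page_end": last_page}
        )

    for record in pages:
        for paragraph in record["text"].split("\n\n"):
            paragraph = paragraph.strip()
            if not paragraph:
                continue
            # Close the current chunk if this paragraph would overflow it.
            if parts and size + len(paragraph) + 1 > max_chars:
                flush()
                parts, size = [], 0
            if not parts:
                first_page = record["page"]
            parts.append(paragraph)
            size += len(paragraph) + 1  # +1 accounts for the joining newline
            last_page = record["page"]

    if parts:
        flush()
    return chunks
```

A call such as `chunk_pages(extract_pages("document.pdf"), max_chars=2000)` would then yield chunks that each carry `page_start` and `page_end`, so downstream results can be traced back to pages.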

RECOMMENDED WORKFLOW: For all PDF documents, first convert to markdown using OpenAI Codex (see pdf skill), then process the structured markdown. This skill is best used for:

Batch processing where Codex conversion is impractical
Legacy workflows requiring direct PDF extraction
Cases where raw text is sufficient
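For the batch case, error recovery might look roughly like the sketch below: one unreadable file is recorded and skipped instead of aborting the whole run. The name `process_batch` and its return shape are hypothetical, and the extractor is passed in as a callable so the example does not assume any particular function from this skill.

```python
from typing import Callable, Dict, Iterable, Tuple


def process_batch(
    paths: Iterable[str], extract: Callable[[str], object]
) -> Tuple[Dict[str, object], Dict[str, str]]:
    """Run extract on every path, collecting results and failures separately."""
    results: Dict[str, object] = {}
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            results[path] = extract(path)
        except Exception as exc:
            # Record the error and continue with the remaining files.
            failures[path] = f"{type(exc).__name__}: {exc}"
    return results, failures
```

Returning failures alongside results lets the caller retry or report the problem files after the batch finishes.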

Quick Start

Recommended Approach (with Codex conversion):

# 1. Convert PDF to markdown first (see pdf skill)
from pdf_skill import pdf_to_markdown_codex

md_path = pdf_to_markdown_codex("document.pdf")
