PDF Processing

Overview

Generate, manipulate, and extract data from PDF documents. This skill covers the Python PDF ecosystem: pypdf for merging, splitting, and metadata; pdfplumber for text and table extraction; reportlab for generation; and pytesseract for OCR. It also covers strategies for form filling, watermarking, and complex document assembly.

Apply this skill whenever PDFs need to be created, parsed, transformed, or combined through code.

Multi-Phase Process

Phase 1: Requirements

Determine operation type (generate, extract, manipulate)
Identify input PDF characteristics (scanned, digital, forms)
Define output requirements (format, quality, size)
Plan data pipeline (source data to PDF or PDF to data)
Assess volume and performance requirements

STOP — Do NOT select a library until the operation type and input characteristics are clear.
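
As a rough illustration of this gate, the sketch below refuses to name a library until the requirements are filled in. The `Requirements` class, the enums, and `suggest_library` are hypothetical names introduced here, and the mapping simply mirrors the library roles listed in the Overview. It also assumes that generation from source data has no input PDF to characterize; real projects often combine several libraries.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Operation(Enum):
    GENERATE = "generate"
    EXTRACT = "extract"
    MANIPULATE = "manipulate"


class InputKind(Enum):
    SCANNED = "scanned"
    DIGITAL = "digital"
    FORMS = "forms"


@dataclass
class Requirements:
    """Hypothetical container for the Phase 1 answers; None means 'not yet known'."""
    operation: Optional[Operation] = None
    input_kind: Optional[InputKind] = None


def suggest_library(req: Requirements) -> str:
    """Return a starting-point library name, or raise if Phase 1 is incomplete."""
    if req.operation is None:
        raise ValueError("operation type is not yet clear")

    # Assumption: generating from source data involves no input PDF,
    # so input characteristics are not required for this branch.
    if req.operation is Operation.GENERATE:
        return "reportlab"

    if req.input_kind is None:
        raise ValueError("input characteristics are not yet clear")

    if req.operation is Operation.EXTRACT:
        # Scanned pages carry no text layer, so OCR comes first.
        if req.input_kind is InputKind.SCANNED:
            return "pytesseract"
        return "pdfplumber"

    # Remaining case: MANIPULATE (merge, split, metadata).
    return "pypdf"
```

Calling `suggest_library(Requirements())` raises `ValueError`, which is the point: the choice is deferred until both questions have answers.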
