extracting-pdf-text

Extracting PDF Text for LLMs

This skill provides tools and guidance for extracting text from PDFs in formats suitable for language model consumption.

Quick Decision Guide

| PDF Type | Best Approach | Script |
| --- | --- | --- |
| Simple text PDF | PyMuPDF | scripts/extract_pymupdf.py |
| PDF with tables | pdfplumber | scripts/extract_pdfplumber.py |
| Scanned/image PDF (local) | pytesseract | scripts/extract_with_ocr.py |
| Complex layout, highest accuracy | Mistral OCR API | scripts/extract_mistral_ocr.py |
| End-to-end RAG pipeline | marker-pdf | pip install marker-pdf |
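
As a rough illustration of the "PDF with tables" row, the sketch below pulls tables with pdfplumber and renders each one as pipe-delimited text that a language model can read. It is not the contents of scripts/extract_pdfplumber.py, which is not shown here; the helper names `rows_to_pipe_table` and `extract_tables_as_text` are hypothetical, and treating the first row as the header is an assumption.

```python
from typing import List, Optional, Sequence


def rows_to_pipe_table(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render one table (rows of cells, where a cell may be None) as pipe-delimited text.

    Assumption: the first row is the header row.
    """
    # Normalise cells: None becomes "", embedded newlines become spaces.
    cleaned = [[(cell or "").replace("\n", " ").strip() for cell in row] for row in rows]
    if not cleaned:
        return ""
    # Pad ragged rows so every row has the same number of columns.
    width = max(len(row) for row in cleaned)
    padded = [row + [""] * (width - len(row)) for row in cleaned]
    header, *body = padded
    lines = ["| " + " | ".join(header) + " |"]
    lines.append("| " + " | ".join(["---"] * width) + " |")
    lines.extend("| " + " | ".join(row) + " |" for row in body)
    return "\n".join(lines)


def extract_tables_as_text(path: str) -> List[str]:
    """Return every table found in the PDF, one rendered string per table."""
    import pdfplumber  # imported lazily so rows_to_pipe_table works without it

    rendered = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # extract_tables() returns a list of tables; each table is a list of rows.
            for table in page.extract_tables():
                rendered.append(rows_to_pipe_table(table))
    return rendered
```

Calling `extract_tables_as_text("report.pdf")` (a placeholder path) would return one string per detected table. Keeping the rendering in a pure function makes it easy to swap in another output format, such as CSV, without touching the extraction code.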

Recommended Workflow

  1. Try PyMuPDF first - fastest, handles most text-based PDFs well
  2. If tables are mangled - switch to pdfplumber
  3. If scanned/image-based - use Mistral OCR API (best accuracy) or local OCR (free but slower)
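
The workflow above can be sketched as a small fallback chain. This is a minimal sketch under stated assumptions, not the skill's own implementation: the function names, the `tables_mangled` flag (a judgement the caller makes after inspecting the output), and the 20-characters-per-page threshold for guessing that a PDF is scanned are all hypothetical. The OCR step is passed in as a callable so that either the Mistral OCR API or a local OCR routine can be plugged in.

```python
from typing import Callable, List, Optional

PageExtractor = Callable[[str], List[str]]


def looks_scanned(page_texts: List[str], min_chars_per_page: int = 20) -> bool:
    """Heuristic (an assumption): pages with almost no extractable text are probably images."""
    if not page_texts:
        return True
    average = sum(len(text.strip()) for text in page_texts) / len(page_texts)
    return average < min_chars_per_page


def extract_with_pymupdf(path: str) -> List[str]:
    """Step 1: fast extraction with PyMuPDF, one string per page."""
    import fitz  # PyMuPDF; imported lazily so the pure helpers work without it

    with fitz.open(path) as doc:
        return [page.get_text() for page in doc]


def extract_with_pdfplumber(path: str) -> List[str]:
    """Step 2: slower extraction that tends to preserve table layout better."""
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return [page.extract_text() or "" for page in pdf.pages]


def extract_text(
    path: str,
    tables_mangled: bool = False,
    primary: PageExtractor = extract_with_pymupdf,
    table_fallback: PageExtractor = extract_with_pdfplumber,
    ocr: Optional[PageExtractor] = None,
) -> List[str]:
    """Try the fast extractor first, then fall back as the workflow describes."""
    pages = primary(path)
    if looks_scanned(pages):
        # Step 3: no usable text layer, so OCR is required.
        if ocr is None:
            raise RuntimeError("PDF looks scanned; pass an OCR callable via `ocr`")
        return ocr(path)
    if tables_mangled:
        return table_fallback(path)
    return pages
```

A typical call would be `extract_text("paper.pdf")` first, followed by `extract_text("paper.pdf", tables_mangled=True)` if the tables in the result are unreadable.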

Repository
letta-ai/skills
First Seen
Jan 24, 2026