pdf-text-extractor
PDF Text Extractor Skill
Overview
This skill extracts text from PDF files using PyMuPDF (fitz), with intelligent chunking, page tracking, and metadata preservation. It handles large PDF collections through batch processing and error recovery.
RECOMMENDED WORKFLOW: For all PDF documents, first convert to markdown using OpenAI Codex (see pdf skill), then process the structured markdown. This skill is best used for:
- Batch processing where Codex conversion is impractical
- Legacy workflows requiring direct PDF extraction
- Cases where raw text is sufficient
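For the direct-extraction cases listed above, the following is a minimal sketch of page-tracked extraction, chunking, and per-file error recovery. It is not this skill's actual implementation: the function names (`extract_pages`, `chunk_pages`, `extract_batch`) and the fixed-size character chunking are assumptions made for illustration. Only `fitz.open` and `page.get_text()` come from PyMuPDF itself.

```python
from __future__ import annotations

from typing import Iterable, Iterator


def extract_pages(pdf_path: str) -> list[tuple[int, str]]:
    """Return (1-based page number, text) pairs for one PDF."""
    # PyMuPDF is imported lazily so the chunker below can be used without it.
    import fitz

    with fitz.open(pdf_path) as doc:
        return [(i, page.get_text()) for i, page in enumerate(doc, start=1)]


def chunk_pages(
    pages: Iterable[tuple[int, str]], max_chars: int = 1000
) -> Iterator[dict]:
    """Split each page into chunks of at most max_chars, keeping the page number.

    Fixed-size character chunking is an assumption made for this sketch.
    """
    for page_number, text in pages:
        text = text.strip()
        # Pages with no text produce no chunks.
        for start in range(0, len(text), max_chars):
            yield {"page": page_number, "text": text[start:start + max_chars]}


def extract_batch(
    paths: Iterable[str], max_chars: int = 1000
) -> tuple[list[dict], dict[str, str]]:
    """Process many PDFs; a failing file is recorded and the batch continues."""
    chunks: list[dict] = []
    errors: dict[str, str] = {}
    for path in paths:
        try:
            for chunk in chunk_pages(extract_pages(path), max_chars):
                chunks.append({"source": path, **chunk})
        except Exception as exc:  # unreadable, missing, or corrupt file
            errors[path] = str(exc)
    return chunks, errors
```

Calling `extract_batch(["a.pdf", "b.pdf"])` returns the collected chunks, each tagged with its source file and page number, plus a mapping from each failed path to its error message, so one bad file does not stop the rest of the batch.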
Quick Start
Recommended Approach (with Codex conversion):
# 1. Convert PDF to markdown first (see pdf skill)
from pdf_skill import pdf_to_markdown_codex
md_path = pdf_to_markdown_codex("document.pdf")