robust-pdf-read

Robust PDF Text Extraction

Problem

Standard file-reading tools (e.g., read_file) often fail to extract text from PDF documents. Instead of parsed text, they may return:

  • Raw binary data
  • Base64-encoded images
  • Garbled characters or null bytes

This occurs because PDF is a complex binary format, not plain text. Parsing PDFs with Python libraries (such as PyMuPDF) in sandboxed environments may also fail because of missing dependencies or environment restrictions.

Solution

Use the pdftotext command-line utility (part of poppler-utils) via run_shell. This tool is often pre-installed in Linux environments and reliably extracts text from PDFs that contain a text layer.
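
As a rough illustration, the same call can be sketched in Python with subprocess instead of run_shell. The helper names pdftotext_command and extract_pdf_text, the -layout flag, and the 60-second timeout are assumptions made for this sketch, not part of the skill itself. The equivalent shell command would be `pdftotext -layout input.pdf -`, where the trailing `-` sends the text to standard output.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path: str) -> list[str]:
    """Build the pdftotext argument list (hypothetical helper).

    -layout keeps the physical page layout; the final "-" writes to stdout.
    """
    return ["pdftotext", "-layout", pdf_path, "-"]


def extract_pdf_text(pdf_path: str, timeout: float = 60.0) -> str:
    """Run pdftotext on pdf_path and return the extracted text.

    Raises RuntimeError if pdftotext is not on PATH, and
    subprocess.CalledProcessError if pdftotext exits with an error.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; it ships with poppler-utils")
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        check=True,
        timeout=timeout,  # assumed limit, adjust for large documents
    )
    # pdftotext emits UTF-8 by default; replace any undecodable bytes.
    return result.stdout.decode("utf-8", errors="replace")
```

Checking for the binary first gives a clear error message in environments where poppler-utils is not installed, rather than an opaque FileNotFoundError.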

Procedure

1. Detect Extraction Failure

When attempting to read a PDF:

  • Check the content returned by read_file.
  • If the content contains null bytes (\x00), appears as base64, or is clearly binary/garbled, assume standard reading has failed.
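
The checks above could be approximated with a small heuristic like the one below. This is only a sketch: the function name looks_like_failed_pdf_read, the 64-character minimum for the base64 test, the 10% threshold for unprintable characters, and the extra check for a raw "%PDF-" header are all assumptions chosen for illustration and may need tuning.

```python
import base64
import binascii
import re


def looks_like_failed_pdf_read(content: str) -> bool:
    """Heuristically decide whether content is raw PDF data, not text."""
    if not content:
        return False  # nothing to judge; empty results are not covered here

    # Null bytes never appear in properly extracted text.
    if "\x00" in content:
        return True

    # Assumed extra check: an unparsed PDF file starts with "%PDF-".
    if content.lstrip().startswith("%PDF-"):
        return True

    # Base64: one long run of base64 characters (line breaks allowed)
    # that also decodes cleanly. Ordinary prose contains spaces and fails.
    compact = content.replace("\r", "").replace("\n", "").strip()
    if len(compact) >= 64 and re.fullmatch(r"[A-Za-z0-9+/]+={0,2}", compact):
        try:
            base64.b64decode(compact, validate=True)
            return True
        except (binascii.Error, ValueError):
            pass

    # Garbled: many replacement or non-printable characters.
    bad = sum(
        1
        for ch in content
        if ch == "\ufffd" or (not ch.isprintable() and ch not in "\n\r\t")
    )
    return bad / len(content) > 0.10  # assumed threshold
```

When the function returns True, the read_file result can be discarded and the PDF passed to pdftotext instead.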
Repository
hkuds/openspace