Robust PDF Text Extraction

Problem

Standard file-reading tools (e.g., read_file) often fail to extract text from PDF documents. Instead of parsed text, they may return:

Raw binary data
Base64-encoded images
Garbled characters or null bytes

This occurs because PDF is a complex binary format, not plain text. Parsing PDFs with Python libraries (such as PyMuPDF) in sandboxed environments may also fail because of missing dependencies or environment restrictions.

Solution

Use the pdftotext command-line utility (part of poppler-utils) via run_shell. It is commonly pre-installed in Linux environments and reliably extracts text from PDFs that contain a text layer. Scanned, image-only PDFs have no embedded text for it to extract.
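
As a rough illustration of the call, the sketch below wraps pdftotext in Python. It is a minimal sketch under assumptions, not a definitive implementation: the interface of run_shell is not specified here, so the standard subprocess module stands in for it, and the function names build_pdftotext_command and extract_pdf_text are hypothetical. The -layout flag (preserve the physical layout) and the trailing "-" (write the text to standard output instead of a file) are standard pdftotext options.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path: str) -> list[str]:
    """Build the argument list for pdftotext (hypothetical helper)."""
    # "-layout" keeps the physical page layout; "-" sends the text to stdout.
    return ["pdftotext", "-layout", pdf_path, "-"]


def extract_pdf_text(pdf_path: str, timeout: float = 60.0) -> str:
    """Run pdftotext and return the extracted text (hypothetical helper)."""
    # Fail early with a clear message if poppler-utils is not installed.
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils")
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
        timeout=timeout,
    )
    # Decode leniently so one bad byte does not discard the whole document.
    return result.stdout.decode("utf-8", errors="replace")
```

The equivalent shell command, which is what would be passed to run_shell, is: pdftotext -layout input.pdf -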

Procedure

1. Detect Extraction Failure

When attempting to read a PDF:

Check the content returned by read_file.
If the content contains null bytes (\x00), looks like base64, or is clearly binary or garbled, assume standard reading has failed and fall back to pdftotext.
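
The checks above can be sketched as a small heuristic. This is only an illustration under assumptions: the function name looks_like_failed_extraction, the 200-character minimum for the base64 test, and the 10% threshold for unprintable characters are all hypothetical choices, not values from this document. The check for a leading %PDF- header is an extra signal added here, since raw PDF files begin with that marker.

```python
import re

# Characters allowed in standard base64, with optional "=" padding at the end.
_BASE64_RE = re.compile(r"[A-Za-z0-9+/]+={0,2}")


def looks_like_failed_extraction(content: str, max_bad_ratio: float = 0.1) -> bool:
    """Return True if content looks like binary, base64, or garbled output."""
    if not content:
        # Empty output is not covered by these heuristics; treated as "not failed".
        return False
    # Null bytes never appear in properly extracted text.
    if "\x00" in content:
        return True
    # A raw PDF starts with this header, so the file was returned unparsed.
    if content.lstrip().startswith("%PDF-"):
        return True
    # A long run made only of base64 characters suggests an encoded blob.
    compact = "".join(content.split())
    if len(compact) >= 200 and _BASE64_RE.fullmatch(compact):
        return True
    # Count replacement characters and control characters other than whitespace.
    bad = sum(
        1
        for ch in content
        if ch == "\ufffd" or (ord(ch) < 32 and ch not in "\t\n\r")
    )
    return bad / len(content) > max_bad_ratio
```

When the function returns True, the content from read_file should be discarded in favor of pdftotext output. The thresholds are deliberately loose and may need tuning for a given set of documents.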
