arxiv-paper-processor

arXiv Paper Processor

Overview

The arXiv Paper Processor skill provides a complete pipeline for downloading, parsing, and analyzing arXiv papers programmatically. While the arXiv API provides metadata, researchers often need to work with the full text—extracting sections, equations, figures, and references for deeper analysis.

This skill covers the entire processing chain: retrieving papers by ID or search query, downloading PDF and LaTeX source files, extracting structured content, and producing analysis-ready outputs. It is particularly valuable for researchers conducting large-scale literature analysis, building training datasets from academic text, or automating evidence extraction for systematic reviews.

The pipeline handles common challenges in academic PDF processing including multi-column layouts, mathematical notation, table extraction, and reference parsing. It integrates with tools like GROBID for PDF parsing and can work directly with arXiv LaTeX sources for higher-fidelity extraction.

Paper Retrieval and Download

Fetching by arXiv ID

The most reliable method is to fetch papers by their arXiv identifier:

import urllib.request
import feedparser
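
A minimal sketch of fetching a paper by ID, assuming the public arXiv API query endpoint at `export.arxiv.org/api/query` and its Atom response format. It uses the standard-library XML parser in place of `feedparser` so that it runs without extra dependencies. The function names `build_query_url`, `parse_entry`, and `fetch_paper` are illustrative and not part of the skill itself.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by the API response


def build_query_url(arxiv_id):
    """Build the API URL for a single arXiv identifier (hypothetical helper)."""
    return ARXIV_API + "?" + urllib.parse.urlencode({"id_list": arxiv_id})


def parse_entry(atom_xml):
    """Extract title, abstract, authors, and PDF link from an Atom response."""
    root = ET.fromstring(atom_xml)
    entry = root.find(ATOM + "entry")
    if entry is None:
        raise ValueError("no entry found in API response")

    # The PDF link is assumed to be the <link> element with title="pdf".
    pdf_url = None
    for link in entry.findall(ATOM + "link"):
        if link.get("title") == "pdf":
            pdf_url = link.get("href")

    return {
        # Titles are often wrapped across lines, so collapse the whitespace.
        "title": " ".join(entry.findtext(ATOM + "title", "").split()),
        "summary": entry.findtext(ATOM + "summary", "").strip(),
        "authors": [a.findtext(ATOM + "name", "")
                    for a in entry.findall(ATOM + "author")],
        "pdf_url": pdf_url,
    }


def fetch_paper(arxiv_id, timeout=30.0):
    """Query the API for one ID and return the parsed metadata (needs network)."""
    with urllib.request.urlopen(build_query_url(arxiv_id), timeout=timeout) as resp:
        return parse_entry(resp.read())
```

Calling `fetch_paper("1706.03762")` would return a dictionary of metadata for that paper. The `pdf_url` field, when present, can then be passed to `urllib.request.urlretrieve` to download the PDF.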
First Seen
Mar 31, 2026