grobid-pdf-parsing

Installation
SKILL.md

GROBID PDF Parsing Guide

Overview

Academic PDFs are the primary format for distributing research, yet extracting structured data from them remains challenging. PDFs encode visual layout, not semantic structure -- headings, paragraphs, equations, tables, and citations are all just positioned text and graphics. GROBID (GeneRation Of BIbliographic Data) is the leading open-source tool for parsing academic PDFs into structured XML/TEI format, extracting metadata, body text, references, and figures with high accuracy.

GROBID is used by major academic platforms including CORE, ResearchGate, and others for large-scale document processing. It combines machine learning models (CRF and deep learning) with heuristic rules to handle the diverse formatting of academic papers across publishers and disciplines.

This guide covers installing and running GROBID, using its REST API for batch processing, extracting specific elements (metadata, references, body sections), and integrating GROBID output into downstream workflows such as knowledge bases, systematic reviews, and literature analysis pipelines.

Installation

Docker (Recommended)

# Pull the latest GROBID image
docker pull grobid/grobid:0.8.1

# Run GROBID server
Related skills
Installs
2
GitHub Stars
217
First Seen
Apr 13, 2026