Repository Harvesting Guide

A skill for harvesting metadata from open access repositories using OAI-PMH (the Open Archives Initiative Protocol for Metadata Harvesting). Covers protocol fundamentals, building harvesters in Python, handling resumption tokens for large collections, parsing metadata formats (Dublin Core, MARC, METS), selective harvesting by date and set, and integrating harvested data into research workflows.

OAI-PMH Protocol Fundamentals

What Is OAI-PMH

OAI-PMH is a standardized, HTTP-based protocol that lets a harvester collect metadata records from repository systems. It underpins much of library interoperability and is supported by most institutional repository platforms and by many preprint servers and digital libraries.
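
To make the request model concrete, here is a minimal sketch of how an OAI-PMH request is formed: a repository base URL plus a `verb` query parameter, where the verb is one of the six request types defined by OAI-PMH 2.0. The base URL `https://repository.example.org/oai` and the helper name `build_request_url` are illustrative assumptions, not a real endpoint or library function.

```python
from urllib.parse import urlencode

# The six request types (verbs) defined by OAI-PMH 2.0.
OAI_VERBS = (
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
)


def build_request_url(base_url: str, verb: str, **params: str) -> str:
    """Return an OAI-PMH GET request URL for the given verb and arguments.

    Hypothetical helper for illustration; it only assembles the URL and
    does not contact the repository.
    """
    if verb not in OAI_VERBS:
        raise ValueError(f"Unknown OAI-PMH verb: {verb}")
    query = {"verb": verb, **params}
    return f"{base_url}?{urlencode(query)}"


# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://repository.example.org/oai"

identify_url = build_request_url(BASE_URL, "Identify")
records_url = build_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
print(identify_url)
print(records_url)
```

Keeping URL construction separate from the network call makes it easy to log or test the exact requests a harvester will send before pointing it at a live repository.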

OAI-PMH Architecture:

Data Providers (repositories):
  - Expose metadata through a standardized HTTP interface
  - Must support unqualified Dublin Core (metadataPrefix oai_dc) as the minimum metadata format
  - May support additional formats (MARC, MODS, DataCite, etc.)
  - Examples: arXiv, PubMed Central, and institutional repositories
    running DSpace or EPrints
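
Because every data provider must expose Dublin Core, a harvester can rely on the `oai_dc` format as a common baseline. The following is a minimal sketch of parsing such a response with Python's standard library. The sample XML, its identifier, and the function name `parse_records` are invented for illustration; only the three namespace URIs come from the OAI-PMH and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by OAI-PMH 2.0 responses in the oai_dc format.
NAMESPACES = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample response; a real harvester would fetch this over HTTP.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2024-01-01T00:00:00Z</responseDate>
  <request verb="GetRecord">https://repository.example.org/oai</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:repository.example.org:1234</identifier>
        <datestamp>2023-12-31</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Title</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:creator>Roe, Richard</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
"""


def parse_records(xml_text: str) -> list[dict]:
    """Extract identifier, datestamp, titles, and creators from each record."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NAMESPACES):
        header = record.find("oai:header", NAMESPACES)
        # Deleted records carry a header but no metadata element.
        dc = record.find("oai:metadata/oai_dc:dc", NAMESPACES)
        records.append({
            "identifier": header.findtext("oai:identifier", namespaces=NAMESPACES),
            "datestamp": header.findtext("oai:datestamp", namespaces=NAMESPACES),
            "titles": [e.text for e in dc.findall("dc:title", NAMESPACES)] if dc is not None else [],
            "creators": [e.text for e in dc.findall("dc:creator", NAMESPACES)] if dc is not None else [],
        })
    return records


records = parse_records(SAMPLE_RESPONSE)
print(records)
```

Dublin Core elements are repeatable, so the sketch keeps titles and creators as lists rather than single values.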
First Seen
Apr 2, 2026