Repository Harvesting Guide
A skill for harvesting metadata from open access repositories using OAI-PMH (the Open Archives Initiative Protocol for Metadata Harvesting). It covers protocol fundamentals, building harvesters in Python, handling resumption tokens for large collections, parsing metadata formats (Dublin Core, MARC, METS), selective harvesting by date and set, and integrating harvested data into research workflows.
OAI-PMH Protocol Fundamentals
What Is OAI-PMH
OAI-PMH is a standardized, HTTP-based protocol that lets a client (the harvester) collect metadata records from repository systems. It is a long-standing foundation of library interoperability and is supported by most institutional repository platforms and by many preprint servers and digital libraries.
OAI-PMH Architecture:
Data Providers (repositories):
- Expose metadata through a standardized HTTP interface
- Must support unqualified Dublin Core (oai_dc) as the minimum metadata format
- May support additional formats (MARC, MODS, DataCite, etc.)
- Examples: arXiv, PubMed Central, and institutional repositories running DSpace or EPrints
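
The data provider interface described above can be illustrated with a minimal sketch that builds a request URL and reads Dublin Core titles from a response. This is an illustration under stated assumptions, not a complete harvester: the base URL `https://repository.example.org/oai`, the helper names `build_request_url` and `parse_titles`, and the inline `SAMPLE_RESPONSE` are hypothetical. Only the verb names, the `oai_dc` prefix, and the XML namespaces come from the OAI-PMH 2.0 and Dublin Core specifications. No network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, hand-written response standing in for a real repository reply.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A Hypothetical Record</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def build_request_url(base_url, verb, **params):
    """Return an OAI-PMH request URL; arguments set to None are dropped."""
    query = {"verb": verb}
    query.update({k: v for k, v in params.items() if v is not None})
    return f"{base_url}?{urlencode(query)}"


def parse_titles(xml_text):
    """Map each record identifier to its list of Dublin Core titles."""
    root = ET.fromstring(xml_text)
    titles_by_id = {}
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        titles_by_id[identifier] = [t.text for t in record.iterfind(".//dc:title", NS)]
    return titles_by_id


if __name__ == "__main__":
    # Hypothetical endpoint; a real harvester would fetch this URL over HTTP.
    print(build_request_url("https://repository.example.org/oai",
                            "ListRecords", metadataPrefix="oai_dc"))
    print(parse_titles(SAMPLE_RESPONSE))
```

Keeping URL construction separate from parsing means the parsing step can be checked against saved responses before pointing the harvester at a live repository.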