paper-fetch

Fetch the PDF for a paper given a DOI (or title). Tries multiple sources in priority order and stops at the first hit.

Resolution order

Unpaywall — https://api.unpaywall.org/v2/{doi}?email=$UNPAYWALL_EMAIL, read best_oa_location.url_for_pdf (skipped if UNPAYWALL_EMAIL not set)
Semantic Scholar — https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}?fields=openAccessPdf,externalIds
arXiv — if externalIds.ArXiv present, https://arxiv.org/pdf/{arxiv_id}.pdf
PubMed Central OA — if PMCID present, https://www.ncbi.nlm.nih.gov/pmc/articles/{pmcid}/pdf/
bioRxiv / medRxiv — if DOI prefix is 10.1101, query https://api.biorxiv.org/details/{server}/{doi} for the latest version PDF URL
Publisher direct (institutional mode only — PAPER_FETCH_INSTITUTIONAL=1) — DOI-prefix → publisher PDF template (Nature / Science / Wiley / Springer / ACS / PNAS / NEJM / Sage / T&F / Elsevier). The caller's own subscription IP / cookies / EZproxy are what authorize the fetch; unauthorized responses fail the %PDF check and fall through to step 7.
Sci-Hub mirrors (on by default; disable with PAPER_FETCH_NO_SCIHUB=1) — last-resort fallback. Tries the mirror list in PAPER_FETCH_SCIHUB_MIRRORS (or built-in defaults sci-hub.ru, sci-hub.st, sci-hub.su, sci-hub.box, sci-hub.red, sci-hub.al, sci-hub.mk, sci-hub.ee) in order; on full miss, scrapes https://www.sci-hub.pub/ once per process for fresh mirrors. CAPTCHA / missing-paper pages have no PDF iframe and fall through silently.
Otherwise → report failure with title/authors so the user can request via ILL

CloakBrowser fallback (download layer, opt-in — PAPER_FETCH_CLOAK=1). This is not a separate source: it sits at the download chokepoint, so it applies to any of the sources above. When a resolved PDF URL is blocked by Cloudflare — HTTP 403/429, or a "Just a moment…" HTML interstitial served in place of the file — and the operator opted in, the URL is retried through CloakBrowser (a stealth Chromium that passes the JS challenge) via the cloak_pdf.py companion. Bytes it returns are re-validated through the same %PDF magic-byte + 50 MB checks; on success the result carries via: "cloak". Off by default, fails closed (missing CloakBrowser → silent fall-through), and the agent cannot opt in — see CloakBrowser access below.

If only a title is given, pass it directly via --title "<title>". Resolution chain: