Transformers for Biomedical NLP

Overview

Hugging Face Transformers provides a unified API to load, run, and fine-tune more than 500 biomedical language models. The key biomedical models are BioBERT (BERT further pre-trained on PubMed abstracts and PMC full text), PubMedBERT (trained from scratch on PubMed), BioGPT (generative, trained on PubMed), and BioMedLM. On biomedical NER, relation extraction, and question answering, these domain-specific models generally outperform general-purpose BERT. The pipeline() abstraction handles tokenization, inference, and postprocessing in one call. Fine-tuning on task-specific labeled data (e.g., BC5CDR for chemical/disease NER) typically takes under an hour on a single GPU, depending on model and dataset size. The datasets library provides direct access to standard biomedical benchmarks.
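
To make the "postprocessing" step more concrete, the sketch below shows one way word-level BIO tags from an NER model can be grouped into entity spans. It is plain Python written for illustration, not the library's own implementation. The function name `merge_bio_tags`, the example sentence, and the `Chemical`/`Disease` labels (in the style of BC5CDR) are assumptions. pipeline() performs comparable grouping internally when an `aggregation_strategy` is set on a token-classification pipeline.

```python
from typing import List, Optional, Tuple


def merge_bio_tags(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group word-level BIO tags into (entity_text, entity_type) pairs."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    entities: List[Tuple[str, str]] = []
    current_words: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the entity that is currently open, if there is one.
        nonlocal current_words, current_type
        if current_words and current_type is not None:
            entities.append((" ".join(current_words), current_type))
        current_words, current_type = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            flush()
            current_words, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open entity.
            current_words.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity:
            # close whatever is open and skip this token.
            flush()
    flush()
    return entities


# Hypothetical example sentence and tags, hand-written for illustration.
tokens = ["Aspirin", "reduces", "risk", "of", "myocardial", "infarction", "."]
tags = ["B-Chemical", "O", "O", "O", "B-Disease", "I-Disease", "O"]
print(merge_bio_tags(tokens, tags))
# [('Aspirin', 'Chemical'), ('myocardial infarction', 'Disease')]
```

In this sketch a stray `I-` tag with no matching open entity is dropped rather than promoted to a new entity. That is a design choice made here for simplicity, and other decoders handle this case differently.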

When to Use

Extracting gene names, disease mentions, drug names, or chemical entities from biomedical abstracts (NER)
Classifying abstracts by topic, sentiment of clinical outcomes, or PICO elements for systematic reviews
Answering specific questions from biomedical literature using extractive QA (BioASQ format)
Generating hypotheses or summaries from biomedical text using BioGPT or BioMedLM
Fine-tuning a pre-trained biomedical model on a custom labeled dataset (e.g., your lab's annotations)
Embedding biomedical sentences for semantic similarity search across literature
For fast, rule-augmented NER, use spaCy with the scispaCy en_core_sci_lg model instead; for dependency parsing, use Stanza instead
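
For the semantic-similarity use case listed above, a common recipe is to mean-pool a model's token embeddings (ignoring padding) into one sentence vector and then rank candidates by cosine similarity. The sketch below shows only that pooling and ranking arithmetic in plain Python. The function names are hypothetical, and the tiny hand-written vectors stand in for real model output, which would normally come from an encoder such as PubMedBERT.

```python
import math
from typing import List, Sequence, Tuple


def mean_pool(token_vectors: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask value is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query: Sequence[float],
                       candidates: Sequence[Sequence[float]]) -> List[Tuple[int, float]]:
    """Return (candidate_index, score) pairs, most similar first."""
    scored = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-dimensional "embeddings"; the third token is padding and is ignored.
sentence_vector = mean_pool([[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]], [1, 1, 0])
print(sentence_vector)  # [2.0, 4.0]
```

With real embeddings, the same ranking step would be applied to one pooled vector per abstract or sentence. For large collections, an approximate nearest-neighbor index is usually preferable to scoring every candidate.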
