PyTDC (Therapeutics Data Commons)

Overview

PyTDC is the Python interface to Therapeutics Data Commons, an open-science platform providing AI-ready datasets and benchmarks for drug discovery. It organizes therapeutics data into three categories: single-instance prediction (molecular and protein properties), multi-instance prediction (such as drug-target interactions), and generation (molecule design, retrosynthesis). Datasets come with standardized splits and evaluation metrics, and generation tasks are scored with molecular oracles.
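
As a rough illustration of loading a dataset with a standardized split, the sketch below assumes the PyTDC package is installed (pip install PyTDC) and that its tdc.single_pred.ADME loader and get_split method behave as described in the PyTDC documentation. The dataset name Caco2_Wang is just one example, and the helper names load_adme_split and split_sizes are hypothetical, introduced here for illustration only.

```python
def load_adme_split(name="Caco2_Wang", method="scaffold", seed=1):
    """Load an ADME dataset and return its train/valid/test split.

    Assumption: PyTDC is installed. The import is kept inside the
    function so this module can be imported without it.
    """
    from tdc.single_pred import ADME  # lazy import; first use downloads data

    data = ADME(name=name)
    # get_split is expected to return a dict keyed by 'train', 'valid', 'test'
    return data.get_split(method=method, seed=seed)


def split_sizes(split):
    """Return the number of rows in each partition of a split dict."""
    return {part: len(rows) for part, rows in split.items()}
```

Calling split_sizes(load_adme_split()) would report how many molecules land in each partition. The first call needs network access because the dataset is downloaded on demand.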

When to Use

Loading curated ADME, toxicity, or bioactivity datasets for ML model training
Benchmarking drug discovery models with standardized 5-seed evaluation protocols
Predicting drug-target or drug-drug interactions with proper cold-split evaluation
Generating novel molecules and scoring them with molecular oracles (QED, SA, DRD2, GSK3B)
Accessing scaffold-based or temporal train/test splits for pharmaceutical ML
Converting molecular representations (SMILES to PyG graphs, ECFP fingerprints, SELFIES)
For chemical database queries (compound search, bioactivity), use chembl-database-bioactivity instead
For molecular featurization beyond format conversion, use molfeat instead
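
To make the idea of a cold split from the list above concrete, here is a minimal, library-free sketch: every drug in the test set is withheld from training entirely, so the model is evaluated on compounds it has never seen. The function name cold_split and the (drug, target, label) record layout are assumptions made for this example. It is not PyTDC's own implementation, which may differ in detail.

```python
import random


def cold_split(pairs, frac_test=0.2, seed=0):
    """Split (drug, target, label) records so test drugs never appear in training.

    Illustrative only; not PyTDC's implementation.
    """
    # Collect unique drugs in a deterministic order before shuffling
    drugs = sorted({drug for drug, _, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(drugs)

    # Hold out a fraction of the drugs (at least one) for testing
    n_test = max(1, round(len(drugs) * frac_test))
    test_drugs = set(drugs[:n_test])

    train = [p for p in pairs if p[0] not in test_drugs]
    test = [p for p in pairs if p[0] in test_drugs]
    return train, test
```

Splitting by entity rather than by record is the design choice that matters here: a random split over interaction records would let the same drug appear on both sides and overstate how well a model generalizes to new compounds.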
