tooluniverse-epigenomics
Genomics and Epigenomics Data Processing
⚠️ TOP-OF-MIND RULE: long-format methylation CSV — count ROWS, not unique positions
When the input is a long-format methylation CSV (one row per (sample, CpG_position) pair,
e.g. columns Pos, Chromosome, MethylationPercentage), "how many sites are
removed when filtering" almost always means rows removed, NOT unique-position
removals. The two answers differ by a factor of ≈ n_samples.
| Question phrasing | What it means |
|---|---|
| "how many sites are removed when filtering …" | rows removed (= samples × positions failing the filter) |
| "how many unique CpG sites pass filter" | unique positions (dedupe by Pos then filter) |
❌ WRONG: filtered = df.drop_duplicates(["Pos"]).query("MethylationPercentage < 10 or MethylationPercentage > 90") then len(filtered) → counts unique positions (typically 100–1500)
✅ RIGHT: filtered = df.query("MethylationPercentage < 10 or MethylationPercentage > 90") then len(df) - len(filtered) → counts rows (typically 10k–30k)
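The difference between the two counts can be checked on a toy table. The sketch below is illustrative only: the data are made up, and the Sample column name is an assumption (the source names only Pos, Chromosome, and MethylationPercentage). It applies the same filter expression both ways and shows that the row-level count and the unique-position count disagree.

```python
import io

import pandas as pd

# Hypothetical toy data: 3 samples x 4 CpG positions in long format.
# The "Sample" column is an assumed name, not taken from the source.
csv_text = """Sample,Chromosome,Pos,MethylationPercentage
S1,chr1,100,5
S1,chr1,200,50
S1,chr1,300,95
S1,chr1,400,40
S2,chr1,100,8
S2,chr1,200,55
S2,chr1,300,92
S2,chr1,400,60
S3,chr1,100,3
S3,chr1,200,45
S3,chr1,300,97
S3,chr1,400,91
"""
df = pd.read_csv(io.StringIO(csv_text))

# Rows matching this expression are the ones that pass the filter.
keep = "MethylationPercentage < 10 or MethylationPercentage > 90"

# Row-level count: filter the full long-format table, then subtract.
filtered = df.query(keep)
rows_removed = len(df) - len(filtered)

# Unique-position count: dedupe first (keeps only the first sample's row
# per position), then filter. This answers a different question.
unique_df = df.drop_duplicates(["Chromosome", "Pos"])
unique_kept = len(unique_df.query(keep))

print(f"rows removed: {rows_removed}")
print(f"unique positions passing: {unique_kept}")
```

Deduplicating on (Chromosome, Pos) rather than Pos alone is a choice made here to avoid merging identical coordinates on different chromosomes. Note also that deduplication keeps one arbitrary sample's value per position, so the unique-position result depends on row order.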