GTARS: Fast Genomic Token Arithmetic and BED File Processing

Overview

GTARS is a Python library with a Rust-backed core for high-performance genomic interval operations. It provides BED file I/O, set-theoretic interval operations (intersection, union, merge, complement, subtract), genomic region tokenization against a reference universe, and utilities for building consensus universe BED files. GTARS is designed for workflows that process hundreds to thousands of BED files efficiently, serving as a preprocessing engine for ML pipelines (including geniml) and general bioinformatics pipelines.
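To make the set-theoretic operations named above concrete, here is a minimal pure-Python sketch of merge, intersection, and complement on a single chromosome. It does not use the GTARS API: the function names (`merge`, `intersect`, `complement`) and the `chrom_length` parameter are illustrative assumptions, and a Rust-backed implementation would be expected to do the same work far faster. Intervals follow the BED convention of half-open `[start, end)` coordinates.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), as in BED


def merge(intervals: List[Interval]) -> List[Interval]:
    """Collapse overlapping or bookended intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Two-pointer sweep over two merged interval lists."""
    a, b = merge(a), merge(b)
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def complement(intervals: List[Interval], chrom_length: int) -> List[Interval]:
    """Return the gaps not covered by `intervals` within [0, chrom_length)."""
    out: List[Interval] = []
    cursor = 0
    for start, end in merge(intervals):
        if cursor < start:
            out.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < chrom_length:
        out.append((cursor, chrom_length))
    return out
```

Union of two sets can be expressed as `merge(a + b)`, and subtraction as `intersect(a, complement(b, chrom_length))`. A real workflow would apply these per chromosome.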

When to Use

Read and write large BED files efficiently, leveraging Rust-backed parsing for speed over pure Python alternatives
Compute genomic interval intersections, merges, complements, or subtractions between pairs or sets of BED files
Tokenize a collection of genomic regions against a fixed universe vocabulary for ML input preparation
Build consensus universe BED files from a collection of sample BED files
Count overlap statistics between two BED files without launching bedtools processes
Preprocess ATAC-seq, ChIP-seq, or ENCODE peak files before feeding them into geniml or other ML tools
For full BED/BAM/SAM reading with CIGAR-level detail, use pysam-genomic-files instead
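The tokenization use case above can be illustrated with a short pure-Python sketch. It is not the GTARS API: `build_index`, `tokenize`, and the choice of using a region's position in the universe as its token ID are assumptions made for illustration. Each query region is mapped to the IDs of every universe region it overlaps, and regions on chromosomes absent from the universe produce no tokens.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Region = Tuple[str, int, int]          # (chrom, start, end), half-open
Index = Dict[str, List[Tuple[int, int, int]]]


def build_index(universe: List[Region]) -> Index:
    """Group universe regions by chromosome; token ID = position in `universe`."""
    by_chrom: Index = defaultdict(list)
    for token_id, (chrom, start, end) in enumerate(universe):
        by_chrom[chrom].append((start, end, token_id))
    for regions in by_chrom.values():
        regions.sort()  # sort by start so the scan below can stop early
    return by_chrom


def tokenize(regions: List[Region], index: Index) -> List[int]:
    """Map each query region to the IDs of the universe regions it overlaps."""
    tokens: List[int] = []
    for chrom, start, end in regions:
        for u_start, u_end, token_id in index.get(chrom, []):
            if u_start >= end:
                break  # every later universe region starts past the query
            if u_end > start:
                tokens.append(token_id)
    return tokens


universe = [("chr1", 0, 100), ("chr1", 200, 300), ("chr2", 0, 50)]
index = build_index(universe)
tokens = tokenize([("chr1", 50, 250), ("chr2", 60, 70), ("chr3", 0, 10)], index)
```

The linear scan keeps the sketch short. For universes with many regions per chromosome, an interval tree or binary search over sorted starts would be the usual replacement.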
