Roary Pan-Genome Pipeline

Overview

Roary is a high-throughput pan-genome pipeline for prokaryotes. It takes per-sample GFF3 annotations (typically from Prokka or Bakta) and produces a clustered gene presence/absence matrix across the entire input set. It first reduces redundancy with iterative CD-HIT clustering, then runs an all-vs-all BLASTP on the reduced set of representative sequences, and finally applies MCL graph clustering to define orthologous gene families. Each gene family is assigned to one of four frequency classes according to the share of input genomes that carry it: core (≥ 99 %), soft-core (95 % to < 99 %), shell (15 % to < 95 %), or cloud (< 15 %). Roary can optionally build a concatenated core-gene alignment suitable for phylogenetic inference.
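The frequency classes can be illustrated with a short sketch. It is not Roary's own code; it only applies the cutoffs quoted above to a toy presence/absence table. The function names `classify_gene` and `partition_pangenome` and the input layout (gene name mapped to the set of genomes carrying it) are assumptions made for illustration.

```python
from typing import Dict, List, Set


def classify_gene(n_present: int, n_genomes: int) -> str:
    """Assign a frequency class from the number of genomes carrying a gene.

    Integer arithmetic avoids floating-point surprises at the boundaries.
    Cutoffs follow the text: core >= 99 %, soft-core 95 to < 99 %,
    shell 15 to < 95 %, cloud < 15 %.
    """
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")
    if not 0 <= n_present <= n_genomes:
        raise ValueError("n_present must be between 0 and n_genomes")
    if n_present * 100 >= 99 * n_genomes:
        return "core"
    if n_present * 100 >= 95 * n_genomes:
        return "soft-core"
    if n_present * 100 >= 15 * n_genomes:
        return "shell"
    return "cloud"


def partition_pangenome(
    presence: Dict[str, Set[str]], n_genomes: int
) -> Dict[str, List[str]]:
    """Group gene names by frequency class (hypothetical input layout)."""
    classes: Dict[str, List[str]] = {
        "core": [], "soft-core": [], "shell": [], "cloud": []
    }
    for gene, genomes in presence.items():
        classes[classify_gene(len(genomes), n_genomes)].append(gene)
    return classes


# Toy data: 100 hypothetical genomes named g0 .. g99.
all_genomes = [f"g{i}" for i in range(100)]
toy_presence = {
    "geneA": set(all_genomes),       # present in 100 of 100
    "geneB": set(all_genomes[:96]),  # present in 96 of 100
    "geneC": set(all_genomes[:50]),  # present in 50 of 100
    "geneD": set(all_genomes[:3]),   # present in 3 of 100
}
toy_classes = partition_pangenome(toy_presence, n_genomes=100)
```

With small genome sets the classes collapse: for 20 genomes, a gene must be present in all 20 to count as core, because 19 of 20 is 95 %.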

When to Use

Computing a pan-genome from a set of bacterial isolate annotations (10–10,000 genomes)
Producing a gene_presence_absence.csv matrix for downstream GWAS, accessory-gene mining, or core-gene phylogenetics
Building a concatenated core-gene multi-FASTA alignment for maximum-likelihood or Bayesian phylogenetic trees
Generating a pan-genome reference FASTA to use as a non-redundant gene catalog
Comparative genomics across closely related strains, where orthologous proteins are expected to share ≥ 95 % amino-acid identity (Roary's default BLASTP threshold)
Use Panaroo instead when assemblies are highly fragmented or annotations are noisy (Panaroo aggressively cleans annotation errors)
Use PIRATE instead when paralog-aware clustering with multiple identity thresholds is needed
Use PPanGGOLiN instead when graph-based, statistically grounded gene-family partitioning is preferred over fixed-frequency cutoffs
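For the use cases above, a run can be launched from Python by assembling the command line first. The sketch below only builds the argument list and does not execute anything. The helper `build_roary_command` and its defaults are hypothetical. The flags shown (`-p` threads, `-f` output directory, `-i` minimum BLASTP identity, `-cd` core-definition percentage, `-e` core-gene alignment, `-n` MAFFT-based fast alignment) reflect Roary's help text as commonly documented; confirm them with `roary -h` for the installed version.

```python
from typing import List, Sequence


def build_roary_command(
    gff_files: Sequence[str],
    out_dir: str = "roary_out",    # hypothetical default output directory
    threads: int = 4,
    min_identity: int = 95,        # BLASTP percent identity cutoff
    core_percent: int = 99,        # share of isolates required for "core"
    core_alignment: bool = False,  # also build the core-gene alignment
) -> List[str]:
    """Assemble a Roary argument list without running it."""
    if not gff_files:
        raise ValueError("at least one GFF3 file is required")
    cmd = [
        "roary",
        "-p", str(threads),
        "-f", out_dir,
        "-i", str(min_identity),
        "-cd", str(core_percent),
    ]
    if core_alignment:
        # -e requests the core-gene alignment; -n switches it to MAFFT.
        cmd += ["-e", "-n"]
    cmd += list(gff_files)
    return cmd


# Hypothetical input file names, for illustration only.
example_cmd = build_roary_command(
    ["isolate1.gff", "isolate2.gff"], threads=8, core_alignment=True
)

# To execute (requires Roary on PATH):
#   import subprocess
#   subprocess.run(example_cmd, check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems when GFF paths contain spaces.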
