etetoolkit
ETE Toolkit: Phylogenetic Tree Analysis and Visualization
Overview
ETE Toolkit (ETE3) is a Python framework for phylogenetic tree exploration, manipulation, and publication-quality visualization. It supports reading and writing Newick, NHX, PhyloXML, and NeXML formats, rich node annotation, programmatic tree traversal, NCBI taxonomy integration, and a flexible rendering engine for customizable tree figures. ETE3 is widely used in comparative genomics, phylogenomics, and evolutionary biology workflows.
When to Use
- Parse phylogenetic trees from Newick, NHX, PhyloXML, or NeXML files and programmatically traverse or modify topology
- Annotate tree nodes with metadata (bootstrap values, gene names, taxonomic ranks, expression data) for visualization or downstream analysis
- Render publication-quality tree figures with custom node shapes, colors, branch widths, and face decorations using TreeStyle and NodeStyle
- Map NCBI taxonomy IDs to lineage information, validate species names, or build taxonomy-aware trees
- Compute evolutionary statistics: branch lengths, tree distances (Robinson-Foulds), LCA queries, monophyly tests
- Build PhyloTree objects for comparative genomics: annotating gene duplication and speciation events, and inferring orthologs and paralogs
- Prune, reroot, or ultrametricize trees programmatically before passing to downstream tools (BEAST, IQ-TREE, etc.)
- For sequence alignment prior to tree building, use biopython-molecular-biology instead
Prerequisites