plink2-gwas-analysis
PLINK2 — GWAS and Population Genetics
Overview
PLINK2 is the high-performance successor to PLINK 1.9, designed for genome-wide association studies (GWAS) and population genetics analysis on large cohorts. It reads genotype data in PLINK binary (.bed/.bim/.fam), VCF, and BGEN formats, and performs sample and variant quality control (QC), kinship estimation, principal component analysis (PCA), and linear/logistic regression association testing. PLINK2 is 10–100× faster than PLINK 1.9 on most tasks due to multithreading and optimized I/O. Its output files are compatible with downstream visualization (Manhattan/QQ plots) and meta-analysis tools.
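As a rough illustration of the downstream hand-off, the sketch below reads a tab-delimited association results table and keeps the additive-model rows that pass a significance threshold, returning the -log10(P) values a Manhattan plot would use. It assumes the header contains `ID`, `TEST`, and `P` columns, as in typical `--glm` output; the exact column set varies with PLINK2 version and options, so check your own file's header. The function name `read_additive_hits` and the toy rows are hypothetical and are not real results.

```python
import csv
import io
import math


def read_additive_hits(handle, p_threshold=5e-8):
    """Return (variant_id, p, -log10 p) for additive-test rows with p <= threshold.

    Assumes a tab-delimited table whose header includes ID, TEST and P columns.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    hits = []
    for row in reader:
        # Skip covariate rows (e.g. PC1, AGE); keep only the additive genotype test.
        if row.get("TEST", "ADD") != "ADD":
            continue
        try:
            p = float(row["P"])
        except (KeyError, TypeError, ValueError):
            continue  # missing column or a non-numeric value such as "NA"
        # The p > 0 guard also drops NaN and avoids log10(0).
        if 0 < p <= p_threshold:
            hits.append((row["ID"], p, -math.log10(p)))
    return hits


if __name__ == "__main__":
    # Synthetic toy table for illustration only.
    toy = (
        "#CHROM\tPOS\tID\tTEST\tP\n"
        "1\t100\trs1\tADD\t1e-9\n"
        "1\t100\trs1\tPC1\t1e-12\n"
        "1\t200\trs2\tADD\t0.3\n"
        "1\t300\trs3\tADD\tNA\n"
    )
    for variant_id, p, neg_log10_p in read_additive_hits(io.StringIO(toy)):
        print(variant_id, p, round(neg_log10_p, 2))
```

Looking columns up by header name rather than by position keeps the parser tolerant of the extra columns that different output options add.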
When to Use
- Running GWAS on a case-control or quantitative trait cohort after genotyping array QC
- Performing sample QC: missingness, heterozygosity outliers, sex check, cryptic relatedness
- Computing genome-wide LD pruning for PCA or relatedness estimation
- Running PCA on genotype data to identify population stratification
- Converting between PLINK binary, VCF, and BGEN formats
- Filtering variants by MAF, HWE, missingness, or INFO score in VCF/imputed data
- For biobank-scale GWAS (>100k samples) that requires mixed-model association to control for population structure, use regenie or SAIGE instead
- For VCF-specific population genetics statistics, VCFtools is an alternative
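Several of the use cases above (variant and sample QC, LD pruning, PCA) chain naturally into one short pipeline. The following is a minimal sketch that only builds the PLINK2 command lines as argument lists; it does not run them. The file prefixes (`cohort`, `cohort_qc`), the helper names, and every threshold value are illustrative assumptions, not recommendations from this document; choose thresholds for your own study design.

```python
import shlex


def plink2_cmd(*args, threads=4):
    """Return a PLINK2 argv list; nothing is executed here."""
    return ["plink2", "--threads", str(threads), *args]


def build_qc_pca_commands(bfile="cohort", out_prefix="cohort_qc"):
    """Build three commands: QC filtering, LD pruning, PCA on pruned variants."""
    # Step 1: missingness, MAF and HWE filters, written as a new binary fileset.
    # All thresholds below are placeholder values (assumptions).
    qc = plink2_cmd(
        "--bfile", bfile,
        "--geno", "0.02",   # drop variants with more than 2% missing calls
        "--mind", "0.02",   # drop samples with more than 2% missing calls
        "--maf", "0.01",    # drop variants with minor allele frequency below 1%
        "--hwe", "1e-6",    # drop variants failing Hardy-Weinberg at p < 1e-6
        "--make-bed",
        "--out", out_prefix,
    )
    # Step 2: LD pruning (50-variant window, step of 5, r^2 threshold 0.2).
    prune = plink2_cmd(
        "--bfile", out_prefix,
        "--indep-pairwise", "50", "5", "0.2",
        "--out", out_prefix + "_ld",
    )
    # Step 3: PCA restricted to the pruned-in variants, keeping 10 components.
    pca = plink2_cmd(
        "--bfile", out_prefix,
        "--extract", out_prefix + "_ld.prune.in",
        "--pca", "10",
        "--out", out_prefix + "_pca",
    )
    return [qc, prune, pca]


if __name__ == "__main__":
    # Print shell-quoted commands for review before running them.
    for cmd in build_qc_pca_commands():
        print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once PLINK2 is installed and the input fileset exists; printing first makes it easy to review the exact commands.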
Prerequisites