VCF Variant Filtering Guide

Overview

Raw VCF files produced by variant callers (GATK HaplotypeCaller, bcftools mpileup, DeepVariant, etc.) contain a mixture of true variants and artifacts from sequencing errors, alignment issues, and low-coverage regions. Computing summary statistics -- Ts/Tv ratio, variant counts, allele frequency distributions -- on unfiltered data yields unreliable results because false-positive calls disproportionately inflate transversion counts and depress the Ts/Tv ratio. This guide covers how to detect whether a VCF is raw, how to apply appropriate quality filters, when filtering is not appropriate, and how to interpret the resulting statistics correctly.
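To make the Ts/Tv argument concrete, here is a minimal sketch of how the ratio can be computed from (REF, ALT) pairs. The function name `ts_tv_ratio` and the input format are assumptions made for illustration, not part of any particular tool; in practice the pairs would come from the REF and ALT columns of biallelic SNV records.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T);
# every other single-base substitution is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = {"A", "C", "G", "T"}


def ts_tv_ratio(snvs):
    """Return Ts/Tv for an iterable of (ref, alt) pairs (hypothetical helper).

    Records that are not simple single-base substitutions are skipped.
    Returns NaN when there are no transversions.
    """
    ts = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # skip indels, MNVs, symbolic alleles, non-variants
        if (ref, alt) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    return ts / tv if tv else float("nan")


clean = [("A", "G"), ("C", "T"), ("A", "C")]
# Two extra transversions stand in for false-positive calls.
noisy = clean + [("A", "T"), ("G", "C")]
print(ts_tv_ratio(clean), ts_tv_ratio(noisy))
```

In this toy input, adding transversion-type artifacts lowers the ratio, which is the effect described above for unfiltered data.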

Key Concepts

VCF Quality Scores (QUAL Field)

The QUAL column in a VCF file is a Phred-scaled quality score for the variant call: QUAL = -10 * log10(P), where P is the probability that the call is wrong (that is, that the site is not actually polymorphic). A QUAL score of 30 therefore means a 1-in-1000 chance the call is wrong; a QUAL of 20 means 1-in-100. Variant callers assign QUAL scores based on read evidence, base qualities, and mapping qualities. Low-QUAL variants (below 20-30) are enriched for sequencing errors and alignment artifacts. Filtering on QUAL is one of the simplest and most widely used first-pass quality control steps.
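The Phred relationship and a first-pass QUAL filter can be sketched in a few lines of plain Python. The helper names (`qual_to_error_prob`, `error_prob_to_qual`, `passes_qual`) and the default threshold of 30 are assumptions chosen for illustration; the only format detail relied on is that QUAL is the sixth tab-separated column of a VCF data line and that a missing value is written as `.`.

```python
import math


def qual_to_error_prob(qual):
    """Convert a Phred-scaled QUAL to the probability the call is wrong."""
    return 10 ** (-qual / 10)


def error_prob_to_qual(prob):
    """Convert an error probability (0 < prob <= 1) to a Phred-scaled score."""
    return -10 * math.log10(prob)


def passes_qual(vcf_line, min_qual=30.0):
    """Return True if a VCF data line has QUAL >= min_qual.

    Assumes a tab-separated data line (not a '#' header line).
    Records with missing QUAL ('.') are treated as failing, which is
    an illustrative policy choice rather than a universal rule.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    qual = fields[5]  # columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
    if qual == ".":
        return False
    return float(qual) >= min_qual


record = "chr1\t100\t.\tA\tG\t45.2\tPASS\t."
print(qual_to_error_prob(30), passes_qual(record))
```

For real datasets a dedicated tool or library is usually a better choice than line splitting; this sketch only shows what the threshold means.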

Common QUAL thresholds and their interpretation:
