Vaex DataFrames

Overview

Vaex is a high-performance Python library for lazy, out-of-core DataFrame operations on datasets too large to fit in RAM. By combining memory-mapped files with lazy evaluation, it can compute statistics at rates of up to about a billion rows per second, enabling interactive exploration and analysis without loading the full dataset into memory.

When to Use

Processing tabular datasets larger than available RAM (10 GB to terabytes)
Fast statistical aggregations on massive datasets (mean, std, quantiles at billion-row scale)
Creating visualizations (heatmaps, histograms) of large datasets without sampling
Building ML preprocessing pipelines (scaling, encoding, PCA) on big data
Converting between data formats (CSV to HDF5/Arrow for fast repeated access)
Feature engineering with virtual columns that consume zero additional memory
Working with astronomical catalogs, financial time series, or large scientific datasets
For in-memory speed on data that fits in RAM, use Polars instead
For distributed multi-node computing, use Dask instead
