exploratory-data-analysis
Exploratory Data Analysis for Scientific Data
Overview
Exploratory data analysis (EDA) is the systematic examination of scientific data files to understand their structure, content, quality, and characteristics before formal analysis. This guide covers methods for detecting file types, selecting an appropriate analysis approach, assessing data quality, and generating reports for each of the scientific data categories listed below.
Key Concepts
Scientific Data Type Categories
| Category | Common Formats | Typical Analysis | Key Libraries |
|---|---|---|---|
| Tabular | CSV, TSV, XLSX, Parquet | Summary statistics, distributions, correlations, missing values | pandas, polars |
| Sequence | FASTA, FASTQ, SAM/BAM | Length distribution, quality scores, GC content, alignment stats | Biopython, pysam |
| Image/Microscopy | TIFF, ND2, CZI, DICOM | Dimensions (XYZCT), intensity stats, metadata, calibration | tifffile, aicsimageio, nd2reader |
| Spectral | mzML, SPC, JCAMP, FID | Peak detection, baseline, S/N ratio, resolution | pymzml, nmrglue, pyteomics |
| Structural | PDB, CIF, MOL, SDF | Atom counts, bond validation, B-factors, completeness | Biopython, RDKit, MDAnalysis |
| Array/Tensor | NPY, HDF5, Zarr, NetCDF | Shape, dtype, value range, NaN/Inf check, chunk structure | numpy, h5py, zarr, xarray |
| Omics | H5AD, MTX, VCF, BED | Feature/sample counts, sparsity, annotation completeness | scanpy, pyranges, cyvcf2 |
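The table above can serve as a lookup for a first-pass file type check. The following is a minimal sketch, not a definitive implementation: the names `CATEGORY_BY_EXTENSION`, `LIBRARIES_BY_CATEGORY`, and `detect_category` are hypothetical, and the extension spellings (for example `.dcm` for DICOM or `.jdx` for JCAMP) are assumptions about common naming conventions rather than anything the table specifies.

```python
from pathlib import Path

# Extension -> category, following the table above. Illustrative only:
# the extension spellings are assumed conventions, and the list is not exhaustive.
CATEGORY_BY_EXTENSION = {
    ".csv": "Tabular", ".tsv": "Tabular", ".xlsx": "Tabular", ".parquet": "Tabular",
    ".fasta": "Sequence", ".fastq": "Sequence", ".sam": "Sequence", ".bam": "Sequence",
    ".tif": "Image/Microscopy", ".tiff": "Image/Microscopy", ".nd2": "Image/Microscopy",
    ".czi": "Image/Microscopy", ".dcm": "Image/Microscopy",
    ".mzml": "Spectral", ".spc": "Spectral", ".jdx": "Spectral", ".fid": "Spectral",
    ".pdb": "Structural", ".cif": "Structural", ".mol": "Structural", ".sdf": "Structural",
    ".npy": "Array/Tensor", ".h5": "Array/Tensor", ".hdf5": "Array/Tensor",
    ".zarr": "Array/Tensor", ".nc": "Array/Tensor",
    ".h5ad": "Omics", ".mtx": "Omics", ".vcf": "Omics", ".bed": "Omics",
}

# Category -> libraries, copied from the "Key Libraries" column.
LIBRARIES_BY_CATEGORY = {
    "Tabular": ["pandas", "polars"],
    "Sequence": ["BioPython", "pysam"],
    "Image/Microscopy": ["tifffile", "aicsimageio", "nd2reader"],
    "Spectral": ["pymzml", "nmrglue", "pyteomics"],
    "Structural": ["BioPython", "RDKit", "MDAnalysis"],
    "Array/Tensor": ["numpy", "h5py", "zarr", "xarray"],
    "Omics": ["scanpy", "pyranges", "cyvcf2"],
}

# Compression suffixes to skip so that "reads.fastq.gz" resolves to ".fastq".
COMPRESSION_SUFFIXES = {".gz", ".bz2", ".xz"}


def detect_category(path):
    """Guess the data category of a file from its extension alone."""
    suffixes = [s.lower() for s in Path(path).suffixes]
    # Drop trailing compression suffixes before looking up the extension.
    while suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes.pop()
    if not suffixes:
        return "Unknown"
    return CATEGORY_BY_EXTENSION.get(suffixes[-1], "Unknown")


if __name__ == "__main__":
    for name in ["counts.csv", "reads.fastq.gz", "stack.ome.tiff", "notes.txt"]:
        category = detect_category(name)
        libraries = LIBRARIES_BY_CATEGORY.get(category, [])
        print(f"{name}: {category} {libraries}")
```

An extension is only a hint: files can be misnamed, and some formats (a Zarr store, for instance, is typically a directory) do not fit a single-file lookup. A more careful check would also inspect the file's content before choosing an analysis approach.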