Pandas
Overview
Pandas is a Python library for loading, cleaning, transforming, and analyzing tabular data. It provides DataFrames for structured data manipulation, supports CSV, Excel, SQL, JSON, and Parquet formats, and offers powerful groupby aggregation, merge/join operations, time series resampling, and method chaining for building analysis pipelines.
Instructions
- When loading data, use `pd.read_parquet()` for large datasets (faster, smaller, type-preserving), `pd.read_csv()` with explicit `dtype` for CSVs, and `pd.read_sql()` for database queries.
- When cleaning data, handle missing values with `fillna()` or `dropna()`, deduplicate with `drop_duplicates()`, use string methods (`.str.strip()`, `.str.lower()`) for text cleaning, and convert types explicitly with `astype()` and `pd.to_datetime()`.
- When transforming data, use `assign()` for computed columns, `pipe()` for method chaining, `melt()` and `pivot_table()` for reshaping, and `pd.cut()`/`pd.qcut()` for binning.
- When aggregating, use `groupby().agg()` with named aggregation for readable column names, `transform()` to broadcast results back to original shape, and `resample()` for time-based grouping.
- When merging, use `pd.merge()` with explicit `how` and `validate` parameters to catch data quality issues at merge time, and `pd.concat()` for stacking DataFrames.
- When optimizing performance, use `category` dtype for low-cardinality strings, vectorized operations over `.apply()`, and Parquet for storage; for datasets over 10 GB, consider Polars or DuckDB.
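The merging, aggregating, and performance guidelines above can be sketched together as follows. This is only an illustration: the tables `orders` and `customers` and all of their column names and values are invented here and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical tables; every name and value here is an assumption for illustration.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 10, 20, 30],
    "amount": [25.0, 75.0, 40.0, 60.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 20, 30],
    "region": ["north", "south", "north"],
})

# Low-cardinality strings can be stored more compactly as a category dtype.
customers["region"] = customers["region"].astype("category")

# Explicit `how` and `validate`: many_to_one raises pandas.errors.MergeError
# if customer_id turns out to be duplicated in `customers`.
merged = pd.merge(
    orders, customers, on="customer_id", how="left", validate="many_to_one"
)

# Named aggregation gives readable output column names.
summary = (
    merged.groupby("region", observed=True)
    .agg(total_amount=("amount", "sum"), order_count=("order_id", "count"))
    .reset_index()
)

# transform() broadcasts the per-group total back to the original row shape.
merged["region_total"] = (
    merged.groupby("region", observed=True)["amount"].transform("sum")
)
# Vectorized division instead of a row-wise .apply().
merged["share_of_region"] = merged["amount"] / merged["region_total"]
```

Here `summary` has one row per region, while `merged` keeps one row per order with the regional total attached, which is the difference between `agg()` and `transform()`.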
Examples
Example 1: Clean and analyze a sales dataset
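The body of this example is not included here. A minimal sketch of what such a workflow might look like is given below, using a small inline CSV in place of a real file. Every column name and value (`order_date`, `region`, `product`, `units`, `unit_price`) is a made-up assumption, and dropping rows with missing `units` is one possible policy, not necessarily the right one for real data.

```python
import io

import pandas as pd

# Hypothetical raw data with a duplicate row, inconsistent casing,
# stray whitespace, and one missing value.
raw_csv = """order_date,region,product,units,unit_price
2024-01-05, North ,Widget,2,10.0
2024-01-05, North ,Widget,2,10.0
2024-01-20,south,Gadget,,20.0
2024-02-03,South,Widget,5,10.0
2024-02-14,north,Gadget,1,20.0
"""

# Load with explicit dtypes for the text and price columns.
sales = pd.read_csv(
    io.StringIO(raw_csv),
    dtype={"region": "string", "product": "string", "unit_price": "float64"},
)

# Clean in a single method chain.
clean = (
    sales
    .assign(
        region=lambda d: d["region"].str.strip().str.lower(),
        order_date=lambda d: pd.to_datetime(d["order_date"]),
    )
    .drop_duplicates()
    .dropna(subset=["units"])  # assumption: rows without units are unusable
    .assign(
        units=lambda d: d["units"].astype("int64"),
        revenue=lambda d: d["units"] * d["unit_price"],
    )
)

# Analyze: revenue per region with named aggregation.
by_region = (
    clean.groupby("region")
    .agg(total_revenue=("revenue", "sum"), order_count=("revenue", "count"))
    .reset_index()
)

# Analyze: monthly revenue using time-based grouping ("MS" = month start).
monthly = clean.set_index("order_date")["revenue"].resample("MS").sum()
```

With real data the `io.StringIO` wrapper would be replaced by a file path, and Parquet would be the preferred format for larger inputs.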