scikit-learn
Overview
Scikit-learn is a Python machine learning library that provides a consistent API for the full ML workflow: data preprocessing (scaling, encoding, imputation), modeling (classification, regression, clustering), hyperparameter tuning (grid search, randomized search), cross-validation, and pipeline construction. Fitted models and pipelines can be serialized with joblib for production deployment.
Instructions
- When preprocessing data, use `ColumnTransformer` to apply different transformers to numeric and categorical columns (`StandardScaler`, `OneHotEncoder`, `SimpleImputer`), always within a `Pipeline` to prevent data leakage.
- When choosing models, start with fast baselines (`LogisticRegression`, `RandomForestClassifier`) and use `HistGradientBoostingClassifier` for strong tabular performance, since it handles missing values natively and is typically much faster than `GradientBoostingClassifier` on larger datasets.
- When evaluating, use `cross_val_score` with 5-fold CV instead of a single train/test split, and use `classification_report()` instead of accuracy alone, since accuracy can be misleading on imbalanced datasets.
- When tuning hyperparameters, use `RandomizedSearchCV` when the search space exceeds 100 combinations (faster than exhaustive `GridSearchCV`), and choose the splitter to match the data: `StratifiedKFold` for classification with imbalanced classes, `TimeSeriesSplit` for temporally ordered data.
- When building pipelines, chain preprocessing and model steps with `Pipeline` to ensure transformers are fit only on training data, then serialize the full pipeline with `joblib.dump()` for deployment.
- When selecting features, use `permutation_importance()` for model-agnostic measurement, `SelectKBest` for statistical filtering, or `feature_importances_` from tree-based models.
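The preprocessing, pipeline, and evaluation guidance above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not a reference implementation: the column names (`age`, `income`, `region`), the label rule, and the choice of F1 as the scoring metric are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data; the column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "region": rng.choice(["north", "south", "east"], n),
})
X.loc[X.index % 10 == 0, "income"] = np.nan  # inject some missing values
y = (X["age"] + rng.normal(0, 5, n) > 40).astype(int).to_numpy()

numeric_cols = ["age", "income"]
categorical_cols = ["region"]

# Different transformers for numeric and categorical columns.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Preprocessing lives inside the pipeline, so each CV fold fits the
# imputer, scaler, and encoder on that fold's training portion only.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
print("F1 per fold:", scores.round(3))
print("Mean F1:", round(scores.mean(), 3))
```

Passing the whole pipeline to `cross_val_score`, rather than transforming `X` first, is what keeps statistics from the validation folds out of the fitted transformers.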
Examples
Example 1: Build a customer churn prediction pipeline
Related skills