scikit-learn

Overview

Scikit-learn is a Python machine learning library that provides a consistent estimator API (fit, predict, transform) for the full ML workflow: data preprocessing (scaling, encoding, imputation), modeling (classification, regression, clustering), model selection (cross-validation and hyperparameter tuning via grid search or randomized search), and pipeline construction. Fitted estimators and pipelines can be serialized with joblib for production deployment.

Instructions

  • When preprocessing data, use ColumnTransformer to apply different transformers to numeric and categorical columns (for example SimpleImputer and StandardScaler for numeric columns, SimpleImputer and OneHotEncoder for categorical ones), always within a Pipeline to prevent data leakage.
  • When choosing models, start with fast baselines (LogisticRegression, RandomForestClassifier), then try HistGradientBoostingClassifier, which is often a strong choice for tabular data: it handles missing values natively and is much faster than GradientBoostingClassifier on large datasets.
  • When evaluating, use cross_val_score with 5-fold CV rather than a single train/test split, and report per-class precision, recall, and F1 with classification_report() rather than accuracy alone, since accuracy is misleading on imbalanced datasets.
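  • When tuning hyperparameters, use RandomizedSearchCV once the search space exceeds roughly 100 combinations (it samples a fixed number of candidates, so it is faster than exhaustive GridSearchCV), and pick the splitter to match the data: StratifiedKFold for classification, TimeSeriesSplit for time-ordered data.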
  • When building pipelines, chain preprocessing and model steps with Pipeline to ensure transformers fit only on training data, then serialize the full pipeline with joblib.dump() for deployment.
  • When selecting features, use permutation_importance() for a model-agnostic measurement (ideally on held-out data), SelectKBest for univariate statistical filtering, or feature_importances_ from tree-based models that expose it.

Examples

Example 1: Build a customer churn prediction pipeline
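
The body of this example is not present in the source, so what follows is only a sketch of what such a pipeline might look like, built from the instructions above. The column names (tenure_months, monthly_charges, contract, payment_method), the synthetic data, and the file name churn_pipeline.joblib are all hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical churn table: column names and values are invented.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 72, size=n).astype(float),
    "monthly_charges": rng.uniform(20, 120, size=n),
    "contract": rng.choice(["monthly", "yearly", "two_year"], size=n),
    "payment_method": rng.choice(["card", "bank", "check"], size=n),
})
# Synthetic label: short tenure and monthly contracts churn more often.
logit = 1.5 - 0.05 * df["tenure_months"] + 1.0 * (df["contract"] == "monthly")
df["churned"] = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)
# Knock out some numeric values so the imputer has something to do.
df.loc[rng.choice(n, size=20, replace=False), "monthly_charges"] = np.nan

numeric_cols = ["tenure_months", "monthly_charges"]
categorical_cols = ["contract", "payment_method"]
X, y = df[numeric_cols + categorical_cols], df["churned"]

# Separate preprocessing branches for numeric and categorical columns.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])
churn_model = Pipeline([
    ("preprocess", preprocess),
    ("classify", LogisticRegression(max_iter=1000)),
])

# Each fold refits the imputers, scaler, and encoder on its training part only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(churn_model, X, y, cv=cv, scoring="roc_auc")
print("ROC AUC per fold:", scores.round(3))

# Fit on all rows, then persist preprocessing and model as one artifact.
churn_model.fit(X, y)
model_path = os.path.join(tempfile.mkdtemp(), "churn_pipeline.joblib")
joblib.dump(churn_model, model_path)
restored = joblib.load(model_path)
same_predictions = bool((restored.predict(X) == churn_model.predict(X)).all())
print("restored pipeline matches:", same_predictions)
```

Because the saved object is the whole Pipeline, the serving side can pass raw rows with the same columns straight to predict(). joblib files are pickle-based, so load them only from trusted sources and with a scikit-learn version compatible with the one used for training.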

First Seen
Mar 17, 2026