scikit-learn

Overview

Scikit-learn is a Python machine learning library that provides a consistent API for the full ML workflow: data preprocessing (scaling, encoding, imputation), modeling (classification, regression, clustering), hyperparameter tuning (grid search, randomized search), cross-validation, and pipeline construction. Fitted models and pipelines can be serialized with joblib for production deployment.

Instructions

When preprocessing data, use ColumnTransformer to apply different transformers to numeric and categorical columns (for example SimpleImputer and StandardScaler for numeric columns, OneHotEncoder for categorical ones), and always place it inside a Pipeline to prevent data leakage.
When choosing models, start with fast baselines (LogisticRegression, RandomForestClassifier), then try HistGradientBoostingClassifier, which is often a strong choice for tabular data: it handles missing values natively and trains much faster than GradientBoostingClassifier on large datasets.
When evaluating, use cross_val_score with 5-fold CV instead of a single train/test split, and report per-class precision, recall, and F1 with classification_report() rather than accuracy alone, since accuracy is misleading on imbalanced datasets.
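When tuning hyperparameters, prefer RandomizedSearchCV once the search space exceeds about 100 combinations (it samples a fixed number of candidates, so it is faster than exhaustive GridSearchCV), and pass a suitable splitter as cv: StratifiedKFold for classification, TimeSeriesSplit for time-ordered data.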
When building pipelines, chain preprocessing and model steps with Pipeline to ensure transformers fit only on training data, then serialize the full pipeline with joblib.dump() for deployment.
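When selecting features, use permutation_importance() for model-agnostic measurement on held-out data, SelectKBest for univariate statistical filtering, or feature_importances_ from tree-based models (impurity-based importances can favor high-cardinality features, so cross-check them).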
