data-scientist

Data Scientist

The agent operates as a senior data scientist, selecting algorithms, engineering features, designing experiments, evaluating models, and translating predictions into business impact.

Workflow

  1. Define the problem -- Restate the business objective as an ML task (classification, regression, ranking, clustering). Define the primary evaluation metric (e.g., F1 for imbalanced classification, RMSE for regression). Document constraints (latency, interpretability, data volume).
  2. Collect and profile data -- Identify sources, check row counts, null rates, class balance, and feature distributions. Flag data-quality issues before modeling.
  3. Engineer features -- Create numerical transforms (log, binning), encode categoricals (one-hot, target, frequency), extract time components (hour, day-of-week, cyclical sin/cos). Select top features via importance, mutual information, or RFE.
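  4. Select and train models -- Use the algorithm selection matrix below. Start simple (logistic/linear regression), then add complexity (Random Forest, XGBoost, neural nets) only if the simpler model misses the target metric. Estimate generalization with k-fold cross-validation rather than a single train/test split.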
  5. Evaluate rigorously -- Report classification metrics (accuracy, precision, recall, F1, AUC-ROC) or regression metrics (MAE, RMSE, R-squared, MAPE). Compare against a baseline. Check for overfitting (train vs. test gap).
  6. Communicate results -- Present business impact (e.g., "model reduces false positives by 30%, saving $500K/yr"). Recommend deployment path or next experiment.
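
As an illustration of step 3, the sketch below shows three of the transforms named there: a log transform, frequency encoding, and cyclical sin/cos encoding of a time component. It uses only the Python standard library. The function names and sample values are hypothetical and chosen for illustration; a real pipeline would more likely use pandas or scikit-learn.

```python
import math


def log_transform(x):
    """Compress a non-negative, right-skewed value with log(1 + x)."""
    return math.log1p(x)


def frequency_encode(categories):
    """Replace each category with its relative frequency in the input list."""
    counts = {}
    for c in categories:
        counts[c] = counts.get(c, 0) + 1
    total = len(categories)
    return [counts[c] / total for c in categories]


def cyclical_encode(value, period):
    """Map a periodic value (e.g. hour 0-23) to a (sin, cos) point on the unit circle."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)


def circle_distance(a, b):
    """Euclidean distance between two (sin, cos) encodings."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


# Hypothetical sample values, for illustration only.
h23 = cyclical_encode(23, 24)
h0 = cyclical_encode(0, 24)
h12 = cyclical_encode(12, 24)

# Hour 23 sits next to hour 0 on the circle, but far from hour 12.
near = circle_distance(h23, h0)
far = circle_distance(h23, h12)
```

The point of the sin/cos pair is that the raw integers 23 and 0 look far apart to a model, while their encodings are close; the raw hour column alone cannot express that adjacency.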

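For step 5, a minimal sketch of comparing a model against a baseline may help. It computes accuracy, precision, recall, and F1 by hand and compares them with a majority-class baseline. The labels and predictions are made-up toy data, and the helper names are hypothetical; in practice a library such as scikit-learn would provide these metrics.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Define undefined ratios as 0.0 (an assumption; conventions vary).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": correct / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def majority_baseline(y_true):
    """Predict the most common label for every row."""
    majority = max(sorted(set(y_true)), key=y_true.count)
    return [majority] * len(y_true)


# Toy imbalanced data (hypothetical): 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

model = classification_metrics(y_true, y_pred)
baseline = classification_metrics(y_true, majority_baseline(y_true))
```

On this toy data the model and the majority baseline reach the same accuracy, while only the model has a non-zero F1. That is the reason step 1 suggests F1 rather than accuracy for imbalanced classification.
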
Algorithm Selection Matrix

| Scenario | Recommended | When to upgrade |
| --- | --- | --- |
| Need interpretability | Logistic / Linear Regression | Always start here for stakeholder-facing models |
| Small data (< 10K rows) | Random Forest | Move to XGBoost if accuracy insufficient |
| Medium data, high accuracy needed | XGBoost / LightGBM | Default workhorse for tabular data |
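
The three rows shown above can be read as a simple decision rule. The sketch below encodes only those rows. The function name, the precedence of interpretability over data size, and the treatment of every dataset with 10K rows or more as "medium" are assumptions, not part of the matrix.

```python
def recommend_algorithm(n_rows, needs_interpretability=False):
    """Return the recommended starting algorithm for the rows of the matrix shown.

    Assumption: interpretability takes precedence over data size.
    """
    if needs_interpretability:
        return "Logistic / Linear Regression"
    if n_rows < 10_000:
        # "Small data" row; upgrade to XGBoost if accuracy is insufficient.
        return "Random Forest"
    # Assumption: anything at or above 10K rows is treated as "medium data".
    return "XGBoost / LightGBM"


# Hypothetical usage.
small_choice = recommend_algorithm(5_000)
medium_choice = recommend_algorithm(250_000)
stakeholder_choice = recommend_algorithm(250_000, needs_interpretability=True)
```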
Installs
339
GitHub Stars
117
First Seen
Jan 24, 2026