launch

Installation
SKILL.md

Launch: pre-flight checklist for long ML training jobs

Long training jobs are expensive to fail. A 12-hour run that crashes on epoch 3 from a missing dataset path or a default workers=8 against an NFS mount is a full day lost. This skill walks five quick checks before you commit the GPUs.

The agentic Stop hook in this plugin will route here from reason when an assistant tries to launch a run without going through the checklist.

When to run

The user just asked to:

  • launch / kick off / start / fire up a training run
  • restart a run that died
  • kill a current run (also runs the cleanup half of the checklist)
  • review a launch command before submitting

Or the user is about to run any of: python train.py, accelerate launch, torchrun, deepspeed, sbatch train.sh, tmux new-session ... python ... train, wandb sweep.

The checklist

Related skills
Installs
6
GitHub Stars
192
First Seen
13 days ago