Launch: pre-flight checklist for long ML training jobs

Long training jobs are expensive to fail. A 12-hour run that crashes on epoch 3 from a missing dataset path or a default workers=8 against an NFS mount is a full day lost. This skill walks five quick checks before you commit the GPUs.

The agentic Stop hook in this plugin will route here from reason when an assistant tries to launch a run without going through the checklist.

When to run

The user just asked to:

launch / kick off / start / fire up a training run
restart a run that died
kill a current run (also runs the cleanup half of the checklist)
review a launch command before submitting

Or the user is about to run any of: python train.py, accelerate launch, torchrun, deepspeed, sbatch train.sh, tmux new-session ... python ... train, wandb sweep.

launch

Launch: pre-flight checklist for long ML training jobs

When to run

The checklist