launch
Installation
SKILL.md
Launch: pre-flight checklist for long ML training jobs
Long training jobs are expensive to fail. A 12-hour run that crashes on epoch 3 from a missing dataset path or a default workers=8 against an NFS mount is a full day lost. This skill walks five quick checks before you commit the GPUs.
The agentic Stop hook in this plugin will route here from reason when an assistant tries to launch a run without going through the checklist.
When to run
The user just asked to:
- launch / kick off / start / fire up a training run
- restart a run that died
- kill a current run (also runs the cleanup half of the checklist)
- review a launch command before submitting
Or the user is about to run any of: python train.py, accelerate launch, torchrun, deepspeed, sbatch train.sh, tmux new-session ... python ... train, wandb sweep.