launch
Launch: pre-flight checklist for long ML training jobs
Long training jobs are expensive to fail. A 12-hour run that crashes on epoch 3 from a missing dataset path or a default workers=8 against an NFS mount is a full day lost. This skill walks five quick checks before you commit the GPUs.
The agentic Stop hook in this plugin will route here from reason when an assistant tries to launch a run without going through the checklist.
When to run
The user just asked to:
- launch / kick off / start / fire up a training run
- restart a run that died
- kill a current run (also runs the cleanup half of the checklist)
- review a launch command before submitting
Or the user is about to run any of: python train.py, accelerate launch, torchrun, deepspeed, sbatch train.sh, tmux new-session ... python ... train, wandb sweep.
The checklist
More from fcakyon/phd-skills
paper-writing
>
8reviewer-defense
>
7dataset-curation
>
7reproduce
End-to-end paper reproduction from arxiv URL through smoke runs to replication experiments. Handles missing or partial official code, missing training scripts, missing hyperparameters, and private datasets via similar-public-dataset substitution. Use when the user asks to reproduce, implement, replicate, or re-run a paper from scratch, or pastes an arxiv URL with reproduction intent.
7experiment-design
>
7paper-verification
>
7