launch-nemo-rl — running NeMo-RL recipes on Kubernetes via nrl-k8s
This is the playbook for the nrl-k8s CLI at infra/nrl_k8s/. Follow it when the user asks to launch, iterate on, or debug a NeMo-RL recipe on a Kubernetes cluster. Verify current state (kubectl, git log, the recipe and infra files) before acting: the cluster is shared, and a wrong action is costly.
1. One command, two modes
There is a single top-level submission command: nrl-k8s run. It has two lifecycle modes.
| Mode | Invocation | When to use | Cluster after? |
|---|---|---|---|
| Ephemeral (default) | nrl-k8s run | One-shot. KubeRay applies a RayJob, runs, tears the cluster down. Best for most runs. | No (auto) |
| Long-lived | nrl-k8s run --raycluster | Dev loop. Reuses a matching live cluster, applies if absent, warns + reuses on drift (pass --recreate to replace). Then submits daemons and training. First-choice for iteration. | Yes |
Ask: Do I need this cluster after the run? If yes, use --raycluster. Otherwise use the default (ephemeral).
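The decision rule above can be sketched as a small shell helper. This is an illustration only: `nrl_cmd` is a hypothetical function, not part of nrl-k8s, and it only prints the command line instead of running it. It uses just the flags named in the table (`--raycluster`, `--recreate`); any recipe or infra arguments a real invocation needs are not shown in this section and are left out here.

```shell
# Hypothetical helper (not part of nrl-k8s): print the submission command
# for the chosen lifecycle mode without executing anything.
#   $1  "yes" if the cluster is needed after the run, anything else otherwise
#   $2  "yes" to replace a drifted long-lived cluster (optional, default "no")
nrl_cmd() {
  keep_cluster="$1"
  recreate="${2:-no}"

  # Default mode: ephemeral RayJob, torn down automatically after the run.
  cmd="nrl-k8s run"

  if [ "$keep_cluster" = "yes" ]; then
    # Long-lived mode: reuse or create a cluster that survives the run.
    cmd="$cmd --raycluster"
    if [ "$recreate" = "yes" ]; then
      # Assumption: --recreate is passed alongside --raycluster,
      # since the table mentions it only for the long-lived mode.
      cmd="$cmd --recreate"
    fi
  fi

  printf '%s\n' "$cmd"
}
```

For example, `nrl_cmd no` prints `nrl-k8s run`, and `nrl_cmd yes` prints `nrl-k8s run --raycluster`. Printing rather than executing keeps the sketch safe to try on a shared cluster.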
The rest of the CLI is observability / stage-by-stage control: