launch-nemo-rl — running NeMo-RL recipes on Kubernetes via nrl-k8s

This is the playbook for the nrl-k8s CLI at infra/nrl_k8s/. Follow it when the user asks to launch, iterate on, or debug a NeMo-RL recipe on a Kubernetes cluster. Verify current state (kubectl, git log, the recipe and infra files) before acting: the cluster is shared, and a wrong action is costly.
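As a rough illustration of the "verify current state" step, the sketch below bundles a few read-only checks into one function. The function name `preflight` is hypothetical and not part of nrl-k8s. It assumes that the KubeRay CRDs are installed (so `rayclusters` and `rayjobs` are valid resource types), that the current kube context and namespace point at the shared cluster, and that it is run from inside the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of nrl-k8s): read-only checks before acting.
preflight() {
  # Which cluster are we about to touch?
  kubectl config current-context
  # Existing Ray workloads in the current namespace (assumes KubeRay CRDs).
  kubectl get rayclusters,rayjobs
  # Recent history and uncommitted changes to the recipe and infra files.
  git log --oneline -5
  git status --short
}
```

Calling `preflight` mutates nothing on the cluster or in the working tree; it only prints state to review before running any nrl-k8s command.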

1. One command, two modes

There is a single top-level submission command: nrl-k8s run. It has two lifecycle modes.

Mode	Invocation	When to use	Cluster kept after run?
Ephemeral (default)	`nrl-k8s run`	One-shot. KubeRay applies a RayJob, runs it, and tears the cluster down afterward. Best for most runs.	No (torn down automatically)
Long-lived	`nrl-k8s run --raycluster`	Dev loop. Reuses a matching live cluster, applies one if absent, and warns but still reuses the cluster on drift (pass `--recreate` to replace it). Then submits daemons and training. First choice for iteration.	Yes

Ask: Do I need this cluster after the run? If yes, use --raycluster. Otherwise use the default (ephemeral).
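The decision above can be sketched as a small shell helper that prints the invocation instead of running it. The function `nrl_run_cmd` and its yes/no arguments are hypothetical, introduced only for illustration. Only `nrl-k8s run`, `--raycluster`, and `--recreate` come from this playbook. Recipe arguments are left out because they are not shown in this section. The sketch also assumes `--recreate` is meaningful only together with `--raycluster`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the nrl-k8s invocation for a lifecycle choice.
#   $1: "yes" if the cluster is needed after the run, otherwise "no"
#   $2: "yes" to replace a drifted cluster (assumed to apply only with --raycluster)
nrl_run_cmd() {
  local keep_cluster="$1"
  local recreate="${2:-no}"
  local cmd="nrl-k8s run"            # default: ephemeral mode
  if [ "$keep_cluster" = "yes" ]; then
    cmd="$cmd --raycluster"          # long-lived mode
    if [ "$recreate" = "yes" ]; then
      cmd="$cmd --recreate"          # replace instead of warn + reuse
    fi
  fi
  printf '%s\n' "$cmd"
}
```

For example, `nrl_run_cmd yes` prints `nrl-k8s run --raycluster`, while `nrl_run_cmd no` prints the plain ephemeral invocation. Printing instead of executing keeps the sketch safe to try on a shared cluster.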

The rest of the CLI is observability / stage-by-stage control: