launch-nemo-rl — running NeMo-RL recipes on Kubernetes via nrl-k8s

This is the playbook for the nrl-k8s CLI at infra/nrl_k8s/. Follow it when the user asks to launch, iterate on, or debug a NeMo-RL recipe on a Kubernetes cluster. Before acting, verify the current state (kubectl, git log, the recipe and infra files): the cluster is shared, and a wrong action is costly.
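As a rough illustration of that verification step, the read-only checks might look like the sketch below. The helper name preflight_checks is hypothetical, and the resource names assume the KubeRay CRDs (rayclusters, rayjobs) are installed on the cluster; adjust namespaces and paths to the actual setup.

```shell
# Read-only pre-flight sketch. preflight_checks is a hypothetical helper name,
# not part of nrl-k8s. Nothing here modifies the cluster or the repo.
preflight_checks() {
  if ! command -v kubectl >/dev/null 2>&1; then
    echo "kubectl not found on PATH" >&2
    return 1
  fi
  # Which kube context am I pointed at?
  kubectl config current-context || return 1
  # Which Ray resources already exist in the current namespace?
  # (Assumes the KubeRay CRDs are installed.)
  kubectl get rayclusters,rayjobs || return 1
  # What changed recently, and is the working tree clean?
  git log --oneline -5 || return 1
  git status --short
}
```

Run preflight_checks from the repository root and read the output before submitting anything.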

1. One command, two modes

There is a single top-level submission command: nrl-k8s run. It has two lifecycle modes.

| Mode | Invocation | When to use | Cluster after? |
| --- | --- | --- | --- |
| Ephemeral (default) | nrl-k8s run | One-shot. KubeRay applies a RayJob, runs, tears the cluster down. Best for most runs. | No (auto) |
| Long-lived | nrl-k8s run --raycluster | Dev loop. Reuses a matching live cluster, applies if absent, warns + reuses on drift (pass --recreate to replace). Then submits daemons and training. First-choice for iteration. | Yes |

Ask: Do I need this cluster after the run? If yes, use --raycluster. Otherwise use the default (ephemeral).
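
The decision rule and the long-lived behavior from the table can be restated as a small sketch. Both helper names (nrl_run_cmd, raycluster_action) are hypothetical and only print text; they model the behavior described here and are not the CLI's implementation. Any recipe or config arguments that nrl-k8s run takes are omitted, and the text does not say how --recreate interacts with a matching or absent cluster, so the sketch only honors it on drift.

```shell
# Hypothetical helpers that restate the rules above. They print, never touch a cluster.

# Decision rule: need the cluster after the run? yes -> long-lived, else ephemeral.
# Recipe/config arguments for nrl-k8s run are intentionally left out.
nrl_run_cmd() {
  case "$1" in
    yes) echo "nrl-k8s run --raycluster" ;;
    *)   echo "nrl-k8s run" ;;
  esac
}

# Long-lived mode as the table describes it.
# $1 = cluster state: absent | matching | drifted
# $2 = "recreate" if --recreate was passed (assumption: only relevant on drift)
raycluster_action() {
  case "$1:$2" in
    absent:*)         echo "apply" ;;
    matching:*)       echo "reuse" ;;
    drifted:recreate) echo "replace" ;;
    drifted:*)        echo "warn-and-reuse" ;;
    *)                echo "unknown state: $1" >&2; return 1 ;;
  esac
}
```

For example, nrl_run_cmd yes prints the long-lived invocation, and raycluster_action drifted prints warn-and-reuse.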

The rest of the CLI is observability / stage-by-stage control:
