tao-run-on-kubernetes

Kubernetes

Submits TAO container jobs as Kubernetes Jobs. Works on any cluster reachable through a kubeconfig (EKS / GKE / AKS / on-prem) or through an in-cluster service account (when the SDK runs inside a pod).
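To make the two credential paths concrete, here is a minimal sketch of how a script might tell which one is available. It is not the SDK's own logic: the function name `detect_k8s_auth`, the `SA_TOKEN_PATH` override, and the choice to prefer a kubeconfig over in-cluster credentials are all assumptions made for illustration. The default token path and the `KUBERNETES_SERVICE_HOST` variable are the standard ones Kubernetes provides inside a pod.

```shell
# Hypothetical helper: report which Kubernetes credential source is usable.
# Prints "kubeconfig", "in-cluster", or "none".
detect_k8s_auth() {
  # Standard service account token location inside a pod
  # (SA_TOKEN_PATH is a hypothetical override, handy for testing).
  sa_token="${SA_TOKEN_PATH:-/var/run/secrets/kubernetes.io/serviceaccount/token}"

  # KUBECONFIG may be a colon-separated list; only the first entry is checked here.
  kubeconfig="${KUBECONFIG:-${HOME:-}/.kube/config}"
  kubeconfig="${kubeconfig%%:*}"

  if [ -f "$kubeconfig" ]; then
    echo "kubeconfig"
  elif [ -n "${KUBERNETES_SERVICE_HOST:-}" ] && [ -f "$sa_token" ]; then
    echo "in-cluster"
  else
    echo "none"
  fi
}
```

Calling `detect_k8s_auth` on a workstation with `~/.kube/config` in place should print `kubeconfig`; inside a pod without a kubeconfig it should print `in-cluster`.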

Jobs run as a single pod by default. To opt into multi-node distributed training, set num_nodes > 1; this uses an Indexed Job plus a headless Service (see Multi-node training below).
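For orientation, the Indexed Job plus headless Service pattern can be sketched as a pair of manifests. This is an illustrative sketch of the general Kubernetes pattern, not the manifest the SDK generates: the function name `render_multinode_manifest`, the container name `worker`, and the one-GPU-per-pod limit are assumptions, and the image is whatever the caller passes in.

```shell
# Hypothetical helper: print a headless Service and an Indexed Job for N workers.
# Usage: render_multinode_manifest <job-name> <num-nodes> <image>
render_multinode_manifest() {
  job_name="$1"
  num_nodes="$2"
  image="$3"
  cat <<EOF
apiVersion: v1
kind: Service
metadata:
  name: ${job_name}
spec:
  clusterIP: None            # headless: DNS resolves to individual pod IPs
  selector:
    job-name: ${job_name}    # label the Job controller adds to its pods
---
apiVersion: batch/v1
kind: Job
metadata:
  name: ${job_name}
spec:
  completionMode: Indexed    # each pod gets a stable index 0..N-1
  completions: ${num_nodes}
  parallelism: ${num_nodes}
  template:
    spec:
      subdomain: ${job_name} # must match the Service name for per-pod DNS
      restartPolicy: Never
      containers:
      - name: worker
        image: ${image}
        resources:
          limits:
            nvidia.com/gpu: 1
EOF
}
```

With this layout each pod sees its rank in the `JOB_COMPLETION_INDEX` environment variable, and the pod with index 0 is reachable at `<job-name>-0.<job-name>` inside the namespace, which is a natural rendezvous address for distributed training. The output could be piped to `kubectl apply -f -` to try it on a test cluster.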

Preflight

Four checks: GPU host runtime ready, SDK installed, cluster reachable, GPU Operator/device plugin present.
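As a rough illustration of the last two checks (cluster reachable, GPU resources advertised by the device plugin), the following sketch uses plain `kubectl`. It assumes `kubectl` is on the PATH and already pointed at the target cluster; the function names `sum_gpu_counts` and `check_cluster_and_gpus` are hypothetical and are not part of the skill's own scripts.

```shell
# Sum whitespace-separated GPU counts read from stdin; prints 0 for empty input.
sum_gpu_counts() {
  awk '{ for (i = 1; i <= NF; i++) total += $i } END { print total + 0 }'
}

# Hypothetical preflight: fail fast if the cluster is unreachable or no node
# advertises the nvidia.com/gpu resource.
check_cluster_and_gpus() {
  command -v kubectl >/dev/null 2>&1 || {
    echo "kubectl not found on PATH" >&2
    return 1
  }
  kubectl cluster-info >/dev/null 2>&1 || {
    echo "cluster not reachable with the current credentials" >&2
    return 1
  }
  # Nodes without the device plugin simply contribute nothing to this list.
  gpus="$(kubectl get nodes \
    -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}' | sum_gpu_counts)"
  if [ "$gpus" -eq 0 ]; then
    echo "no allocatable nvidia.com/gpu found; is the GPU Operator or device plugin installed?" >&2
    return 1
  fi
  echo "allocatable GPUs: $gpus"
}
```

Running `check_cluster_and_gpus` before submitting a job gives an early, readable failure instead of a pod stuck in Pending.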

# 0. GPU node host runtime.
# Run this on each self-managed GPU worker node or in the node image build.
# Set TAO_K8S_SKIP_NODE_RUNTIME_CHECK=1 only when using managed GPU nodes whose
# driver/toolkit lifecycle is owned by the cloud provider or GPU Operator policy.
if [ "${TAO_K8S_SKIP_NODE_RUNTIME_CHECK:-0}" != "1" ]; then
  TAO_SKILL_BANK_ROOT="${TAO_SKILL_BANK_ROOT:-$PWD}"
  SETUP_SCRIPT="${TAO_SKILL_BANK_ROOT}/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh"
Repository: nvidia/skills
First Seen: Jun 8, 2026