gke-ai-troubleshooting-jobset-interruption
GKE JobSet Interruption Troubleshooting
Use this skill to systematically diagnose and resolve JobSet interruptions, restarts, and preemptions on GKE clusters hosting large-scale AI/ML workloads.
⚠️ Prerequisites
- JobSet metrics package must be enabled in `kube-state-metrics` for your cluster (see KSM JobSet Metrics).
- Cloud Logging and Cloud Monitoring must be enabled for the Google Cloud Project.
🔍 Diagnostic Workflow
Step 0: Context Acquisition & Time Window Definition
To begin troubleshooting, acquire the following context from the user: