gke-ai-troubleshooting-jobset-interruption


GKE JobSet Interruption Troubleshooting

Use this skill to systematically diagnose and resolve JobSet interruptions, restarts, and preemptions on GKE clusters hosting large-scale AI/ML workloads.

⚠️ Prerequisites

  • The JobSet metrics package must be enabled in kube-state-metrics on the cluster (see KSM JobSet Metrics).
  • Cloud Logging and Cloud Monitoring must be enabled for the Google Cloud project.
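
One way to partially verify the second prerequisite is to confirm that the Cloud Logging and Cloud Monitoring APIs are enabled on the project. The sketch below is illustrative and not part of the skill itself: it assumes the `gcloud` CLI is installed and authenticated, and the helper names `missing_services` and `list_enabled_services` are hypothetical. It checks API enablement only; it does not confirm that logging and monitoring are turned on for the GKE cluster, nor that JobSet metrics are enabled in kube-state-metrics.

```python
import subprocess

# API service names for Cloud Logging and Cloud Monitoring.
REQUIRED_SERVICES = ("logging.googleapis.com", "monitoring.googleapis.com")


def missing_services(enabled_output: str, required=REQUIRED_SERVICES) -> list[str]:
    """Return the required services that are absent from the given output.

    `enabled_output` is newline-separated service names, one per line.
    """
    enabled = {line.strip() for line in enabled_output.splitlines() if line.strip()}
    return [name for name in required if name not in enabled]


def list_enabled_services(project_id: str) -> str:
    """List enabled services via gcloud (assumes gcloud is installed and authenticated)."""
    result = subprocess.run(
        [
            "gcloud", "services", "list", "--enabled",
            "--project", project_id,
            "--format=value(config.name)",
        ],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

For example, `missing_services(list_enabled_services("my-project"))` returns an empty list when both APIs are enabled; `my-project` is a placeholder project ID.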

🔍 Diagnostic Workflow

Step 0: Context Acquisition & Time Window Definition

To begin troubleshooting, acquire the following context from the user:

gke-ai-troubleshooting-jobset-interruption — googlecloudplatform/gke-mcp