gke-ai-troubleshooting-tpu-connection-failure-vbar-oom
TPU Connection Failure and VBAR OOM Troubleshooting
Use this skill to systematically diagnose and prevent vbar_control_agent
segfaults and Out-Of-Memory (OOM) errors on TPU v6e nodes.
⚠️ Prerequisites
- Cloud Logging must be enabled for the project.
- Access to the project and cluster via gcloud or an equivalent tool.
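As a rough way to confirm both prerequisites before starting, the sketch below builds two gcloud commands: one that describes the cluster (checks access) and one that reads a single recent log entry (checks that Cloud Logging is readable). The project, cluster, and location values are hypothetical placeholders, and the `resource.type="k8s_node"` filter is an assumption about where node logs land; an empty result does not by itself prove that logging is disabled.

```python
import shlex


def prerequisite_check_commands(project_id, cluster, location):
    """Return gcloud commands (as argument lists) that exercise both prerequisites.

    Nothing is executed here; the caller decides how to run the commands.
    """
    return [
        # Access check: fails if the caller cannot see the cluster.
        ["gcloud", "container", "clusters", "describe", cluster,
         "--location", location, "--project", project_id,
         "--format", "value(status)"],
        # Logging check: tries to read one recent node log entry.
        # The k8s_node filter is an assumption, not taken from this skill.
        ["gcloud", "logging", "read", 'resource.type="k8s_node"',
         "--project", project_id, "--limit", "1", "--freshness", "1h",
         "--format", "json"],
    ]


if __name__ == "__main__":
    # Hypothetical identifiers; replace with real values.
    for cmd in prerequisite_check_commands("my-project", "my-cluster", "us-central1"):
        print(shlex.join(cmd))
```

Printing the commands rather than running them keeps the check reviewable; they can be pasted into a shell or passed to `subprocess.run` once the values are filled in.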
🔍 Diagnostic Workflow
Step 0: Context Acquisition & Time Window Definition
To begin troubleshooting, acquire the following context from the user: