gke-ai-troubleshooting-tpu-connection-failure-vbar-oom


TPU Connection Failure and VBAR OOM Troubleshooting

Use this skill to systematically diagnose and prevent vbar_control_agent segfaults and out-of-memory (OOM) errors on TPU v6e nodes.

⚠️ Prerequisites

  • Cloud Logging must be enabled for the project.
  • Access to the project and cluster via gcloud or an equivalent tool.

🔍 Diagnostic Workflow

Step 0: Context Acquisition & Time Window Definition

To begin troubleshooting, acquire the following context from the user:
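
For the time-window part of this step, a minimal Python sketch is given here. It turns a user-reported incident time into a UTC window and a Cloud Logging filter string. The helper names, the 30-minute padding, and the choice to search the k8s_node resource type for the text "vbar_control_agent" are illustrative assumptions, not part of this skill; adjust them to whatever context the user actually provides.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical helpers for illustration only; names and defaults are assumptions.

def build_time_window(incident_time, before_minutes=30, after_minutes=30):
    """Return (start, end) as RFC 3339 UTC strings around the incident time."""
    if incident_time.tzinfo is None:
        # Assumption: naive timestamps supplied by the user are already in UTC.
        incident_time = incident_time.replace(tzinfo=timezone.utc)
    incident_utc = incident_time.astimezone(timezone.utc)
    start = incident_utc - timedelta(minutes=before_minutes)
    end = incident_utc + timedelta(minutes=after_minutes)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return start.strftime(fmt), end.strftime(fmt)


def build_log_filter(cluster_name, start, end, search_term="vbar_control_agent"):
    """Compose a Cloud Logging query restricted to one cluster and time window."""
    clauses = [
        'resource.type="k8s_node"',                         # node-level logs (assumed location)
        f'resource.labels.cluster_name="{cluster_name}"',
        f'timestamp>="{start}"',
        f'timestamp<="{end}"',
        f'"{search_term}"',                                 # bare string: match in any field
    ]
    return " AND ".join(clauses)


if __name__ == "__main__":
    # Placeholder cluster name and incident time; replace with the user's values.
    start, end = build_time_window(datetime(2026, 5, 7, 12, 0, tzinfo=timezone.utc))
    print(build_log_filter("example-cluster", start, end))
```

The resulting string can be pasted into Logs Explorer or passed to whichever log-reading tool is available. Keeping the window explicit and in UTC avoids ambiguity when the user reports times in a local time zone.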

First Seen
May 7, 2026
gke-ai-troubleshooting-tpu-connection-failure-vbar-oom — googlecloudplatform/gke-mcp