vastai-incident-runbook


Vast.ai Incident Runbook

Overview

Rapid incident response procedures for Vast.ai-related outages.

Prerequisites

  • Access to Vast.ai dashboard and status page
  • kubectl access to production cluster
  • Prometheus/Grafana access
  • Communication channels (Slack, PagerDuty)
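
As a rough illustration, the command-line side of this checklist could be verified before an incident with a small preflight script. This is only a sketch: the helper name `check_tools` is hypothetical, and it checks only that a binary such as `kubectl` is on the PATH, not that it is authorized against the production cluster. Dashboard, Grafana, and chat access still need to be confirmed by hand.

```python
import shutil
from typing import Dict, Iterable


def check_tools(names: Iterable[str]) -> Dict[str, bool]:
    """Report whether each named executable is found on PATH.

    Hypothetical helper: it confirms a binary exists, nothing more.
    """
    return {name: shutil.which(name) is not None for name in names}


if __name__ == "__main__":
    # kubectl is the only CLI tool named in the prerequisites.
    for tool, found in check_tools(["kubectl"]).items():
        print(f"{tool}: {'found' if found else 'MISSING'}")
```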

Severity Levels

Level  Definition        Response Time      Examples
P1     Complete outage   < 15 min           Vast.ai API unreachable
P2     Degraded service  < 1 hour           High latency, partial failures
P3     Minor impact      < 4 hours          Webhook delays, non-critical errors
P4     No user impact    Next business day  Monitoring gaps
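
The severity levels above could be encoded so that tooling can compute a response deadline from the detection time. The following is a minimal sketch, not part of the runbook itself: the names `Severity`, `classify`, and `response_deadline` are hypothetical, the three boolean inputs are a simplification of real triage, and "next business day" is assumed to mean 09:00 on the next weekday, ignoring holidays and time zones.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Severity:
    level: str
    definition: str
    # None stands for "next business day" (P4).
    response_window: Optional[timedelta]


SEVERITIES = {
    "P1": Severity("P1", "Complete outage", timedelta(minutes=15)),
    "P2": Severity("P2", "Degraded service", timedelta(hours=1)),
    "P3": Severity("P3", "Minor impact", timedelta(hours=4)),
    "P4": Severity("P4", "No user impact", None),
}


def classify(api_reachable: bool, degraded: bool, minor_impact: bool) -> Severity:
    """Pick the most severe level that matches the observed symptoms."""
    if not api_reachable:
        return SEVERITIES["P1"]
    if degraded:
        return SEVERITIES["P2"]
    if minor_impact:
        return SEVERITIES["P3"]
    return SEVERITIES["P4"]


def response_deadline(sev: Severity, detected_at: datetime) -> datetime:
    """Return the latest acceptable response time for an incident."""
    if sev.response_window is not None:
        return detected_at + sev.response_window
    # Assumption: next business day starts at 09:00, weekends skipped.
    day = detected_at + timedelta(days=1)
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day += timedelta(days=1)
    return day.replace(hour=9, minute=0, second=0, microsecond=0)
```

For example, `classify(api_reachable=False, degraded=False, minor_impact=False)` yields the P1 entry, and passing it to `response_deadline` with the detection timestamp gives a deadline 15 minutes later.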
First Seen
Mar 4, 2026