training-check

Training Check

Periodically read WandB metrics during training to catch problems early. Do not wait until training finishes to discover it was a waste of GPU time.

Context: $ARGUMENTS

Constants

  • WANDB_ENTITY and WANDB_PROJECT: read from CLAUDE.md or passed as an argument (format: entity/project/run_id)
  • CHECK_INTERVAL: starts at 10 minutes, then gradually increases if consistently healthy: 10 min → 20 min → 30 min → 60 min (cap)
  • REVIEWER_MODEL = gpt-5.4 — used via Codex MCP for ambiguous cases only
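The constants above can be sketched in Python as follows. This is a minimal illustration, not the skill's actual implementation: the names `RunRef`, `parse_run_path`, and `next_interval` are hypothetical, and the policy of stepping up after every healthy check and resetting to 10 minutes after an unhealthy one is an assumption (the text only says the interval grows "if consistently healthy").

```python
from typing import NamedTuple

# Backoff schedule from the Constants list, in minutes; the last entry is the cap.
CHECK_INTERVALS_MIN = (10, 20, 30, 60)


class RunRef(NamedTuple):
    """Hypothetical container for the three parts of a run path."""
    entity: str
    project: str
    run_id: str


def parse_run_path(path: str) -> RunRef:
    """Split an 'entity/project/run_id' argument into its three parts."""
    parts = path.strip().strip("/").split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected entity/project/run_id, got {path!r}")
    return RunRef(*parts)


def next_interval(current_min: int, healthy: bool) -> int:
    """Assumed policy: step up after a healthy check, reset to 10 min otherwise."""
    if not healthy:
        return CHECK_INTERVALS_MIN[0]
    for candidate in CHECK_INTERVALS_MIN:
        if candidate > current_min:
            return candidate
    return CHECK_INTERVALS_MIN[-1]  # already at the 60-minute cap
```

For example, `next_interval(20, healthy=True)` moves to the 30-minute step, while any unhealthy check drops back to the 10-minute starting interval.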

When to Use

  • After training is confirmed running (session alive, loss decreasing for the first few steps)
  • Set up via CronCreate to fire periodically during training
  • This skill checks training QUALITY, not process HEALTH. Process health (session alive, GPU utilization) is watchdog.py's job.
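As a rough illustration of what a quality check (as opposed to a process-health check) might look like, the sketch below pulls one metric through the WandB public API and flags two obvious problems. The metric key `train/loss`, the helper names, the window size, and the two heuristics are all assumptions made for this example; the skill's real criteria may differ.

```python
import math
from typing import List, Sequence


def fetch_metric(run_path: str, key: str = "train/loss") -> List[float]:
    """Fetch logged values of one metric for a run given as entity/project/run_id."""
    import wandb  # imported lazily so the pure check below works without wandb

    run = wandb.Api().run(run_path)
    # history() returns a sampled view; pandas=False yields a list of dicts.
    rows = run.history(keys=[key], pandas=False)
    # float() also handles values serialized as strings such as "NaN".
    return [float(row[key]) for row in rows if row.get(key) is not None]


def loss_problems(losses: Sequence[float], window: int = 5) -> List[str]:
    """Return human-readable problems found in a loss curve (assumed heuristics)."""
    if not losses:
        return ["no loss values logged yet"]
    if any(not math.isfinite(x) for x in losses):
        return ["loss contains NaN or inf"]
    problems = []
    # Compare the mean of the first and last `window` points once enough exist.
    if len(losses) >= 2 * window:
        early = sum(losses[:window]) / window
        recent = sum(losses[-window:]) / window
        if recent >= early:
            problems.append("recent loss is not below early loss")
    return problems
```

An empty result from `loss_problems(fetch_metric("entity/project/run_id"))` would count as a healthy check under these assumptions; anything else is worth a closer look before more GPU time is spent.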

Workflow

More from shaun-z/auto-claude-code-research-in-sleep

First Seen
Apr 1, 2026