pipeline-test-runner

Pipeline Test Runner

Purpose

Validate that generated pipeline skills actually work by running them against real targets. Chain validation (done by chain-composer and scripts/artifact-utils.py validate-chain) checks type compatibility between steps. This skill checks execution: does the skill produce valid artifacts when given real input? The distinction matters because a chain can be type-valid yet still produce empty content, crash on domain-specific inputs, or time out because of overly complex research phases.
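The three execution failure modes named above (empty content, crash, timeout) can be told apart with a small wrapper around a subprocess call. The following is a minimal sketch, not the skill's actual implementation: `check_execution`, its parameters, and the outcome labels are hypothetical names introduced for illustration, and it assumes a skill can be exercised as a command that writes a single artifact file.

```python
import subprocess
import sys
from pathlib import Path


def check_execution(cmd, artifact_path, timeout_s=60):
    """Run a command and classify the outcome (hypothetical helper).

    Returns one of the illustrative labels:
    "timed_out", "crashed", "empty_artifact", "ok".
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # The command ran longer than the allowed budget.
        return "timed_out"

    if proc.returncode != 0:
        # Non-zero exit code: the command crashed or reported an error.
        return "crashed"

    path = Path(artifact_path)
    if not path.exists() or not path.read_bytes().strip():
        # The command exited cleanly but produced no usable content.
        return "empty_artifact"

    return "ok"


if __name__ == "__main__":
    # Trivial demonstration: a command that exits cleanly but writes nothing.
    print(check_execution([sys.executable, "-c", "pass"], "missing.txt"))
```

A type-level check would accept all four cases, since the declared artifact type never changes. Only running the command shows which case actually occurred.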

This skill is Phase 5 of the pipeline orchestrator's 7-phase flow. Its output feeds directly into pipeline-retro (Phase 6), which traces failures back to the generator using the Three-Layer Pattern.

Operator Context

Hardcoded Behaviors (Always Apply)

  • CLAUDE.md Compliance: Read and follow repository CLAUDE.md files before execution. Project instructions override default skill behaviors.
  • Pipeline Spec Required: Input MUST include the Pipeline Spec JSON (same spec consumed by pipeline-scaffolder). The spec defines what subdomains exist, what skills were generated, and what scripts/references each skill expects. WHY: Without the spec, the test runner doesn't know what to test or what "success" looks like.
  • Per-Subdomain Results: Every subdomain gets its own result (PASS/PARTIAL/FAIL/TIMEOUT). Never aggregate into a single pass/fail that hides individual failures. WHY: The retro skill (Phase 6) needs per-subdomain failure traces to fix the correct generator component. A blanket "FAIL" tells the retro nothing about which subdomain or which chain step broke.
  • No Production Targets: Test against repo files, fixtures, or synthetic inputs only. Never invoke skills against live/external systems. WHY: Test runs happen during pipeline generation -- they must be safe, repeatable, and free of side effects.
  • Artifact Validation via Script: Always use scripts/artifact-utils.py validate-manifest for manifest validation rather than manual JSON inspection. WHY: The script implements the canonical validation rules from the ADR. Manual checks will drift from the spec over time.
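The per-subdomain rule above can be sketched as a small data structure. This is an illustrative sketch under stated assumptions: the four status values come from the text, but `SubdomainResult`, its fields (`failed_step`, `detail`), the `failures` helper, and the subdomain names in the usage example are hypothetical and not taken from the skill.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Status(Enum):
    # The four result values listed under "Per-Subdomain Results".
    PASS = "PASS"
    PARTIAL = "PARTIAL"
    FAIL = "FAIL"
    TIMEOUT = "TIMEOUT"


@dataclass
class SubdomainResult:
    subdomain: str
    status: Status
    failed_step: Optional[str] = None  # hypothetical: chain step that broke
    detail: str = ""                   # hypothetical: short failure note


def failures(results: List[SubdomainResult]) -> List[SubdomainResult]:
    """Return every non-PASS result individually, never a single verdict."""
    return [r for r in results if r.status is not Status.PASS]


if __name__ == "__main__":
    results = [
        SubdomainResult("alpha", Status.PASS),
        SubdomainResult("beta", Status.FAIL, failed_step="research",
                        detail="empty artifact"),
        SubdomainResult("gamma", Status.TIMEOUT, failed_step="compile"),
    ]
    for r in failures(results):
        print(r.subdomain, r.status.value, r.failed_step)
```

Keeping one record per subdomain, with the failing step attached, preserves the trace that a later phase would need to locate the broken component. Collapsing the list into one boolean would discard it.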

Default Behaviors (ON unless disabled)

  • Communication Style: Report facts without self-congratulation. Show a per-subdomain results table rather than narrative descriptions.
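One possible way to produce the per-subdomain results table is a plain-text renderer like the sketch below. The function name `render_results_table`, the column headings, and the example rows are assumptions made for illustration. The source does not specify a table layout.

```python
def render_results_table(rows):
    """Render (subdomain, result, notes) tuples as an aligned text table.

    Hypothetical helper: the column set is an assumption, not a spec.
    """
    headers = ("Subdomain", "Result", "Notes")
    table = [headers] + [tuple(str(c) for c in row) for row in rows]

    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in table) for i in range(len(headers))]

    def fmt(row):
        cells = (cell.ljust(width) for cell, width in zip(row, widths))
        return " | ".join(cells).rstrip()

    separator = "-+-".join("-" * w for w in widths)
    lines = [fmt(table[0]), separator] + [fmt(r) for r in table[1:]]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_results_table([
        ("alpha", "PASS", ""),
        ("beta", "FAIL", "empty artifact"),
    ]))
```

Each subdomain occupies exactly one row, so a failing subdomain stays visible next to the passing ones.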
First Seen
Mar 23, 2026