Watch — read a video like a PDF

You don't have a video input. This skill bundles a Python script that fetches the timestamped transcript (and optionally frames) so you can answer questions about video content.

Default mode is transcript-only: no video download and no frame extraction, just captions pulled via yt-dlp in a few seconds. That covers nearly every YouTube video and is the right default for research, summarization, and content analysis.
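
As a rough illustration of what a captions-only fetch can look like, here is a minimal sketch built on yt-dlp's subtitle flags. It is not the bundled script's implementation: the function names, the output template, and the choice of English VTT captions are assumptions made for the example.

```python
import subprocess
from pathlib import Path


def build_caption_command(url: str, out_dir: Path, lang: str = "en") -> list[str]:
    """Build a yt-dlp command that saves captions without downloading media."""
    return [
        "yt-dlp",
        "--skip-download",     # captions only, no video or audio download
        "--write-subs",        # uploader-provided captions, if any
        "--write-auto-subs",   # auto-generated captions as a fallback
        "--sub-langs", lang,   # assumed default: English
        "--sub-format", "vtt",
        "-o", str(out_dir / "%(id)s.%(ext)s"),  # hypothetical output template
        url,
    ]


def fetch_captions(url: str, out_dir: Path, lang: str = "en") -> list[Path]:
    """Run yt-dlp and return the .vtt files it wrote (needs network access)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_caption_command(url, out_dir, lang), check=True)
    return sorted(out_dir.glob("*.vtt"))
```

Keeping the command builder separate from the subprocess call makes the flag list easy to inspect or log before anything touches the network.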

Opt into --with-frames only when the visual layer matters: debugging a screen recording, breaking down a thumbnail/hook visually, reading on-screen text, analyzing UI behavior.

Requirements

Locally installed:

yt-dlp: brew install yt-dlp (macOS) / pipx install yt-dlp (Linux) / winget install yt-dlp.yt-dlp (Windows)
ffmpeg + ffprobe: brew install ffmpeg (macOS) / sudo apt install ffmpeg (Debian/Ubuntu) / winget install Gyan.FFmpeg (Windows). Only required for --with-frames and the Whisper audio fallback. Transcript-only on a captioned YouTube video does not need ffmpeg.

If the user is missing a required tool, tell them the install command and stop. Do not auto-install.
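
The check-and-stop rule above could be sketched as follows. This is a hedged example, not the bundled script's code: the helper names are hypothetical, and for simplicity it only treats --with-frames as requiring ffmpeg and ffprobe, ignoring the Whisper audio fallback.

```python
import shutil
import sys

# Install hints taken from the requirements list above.
INSTALL_HINTS = {
    "yt-dlp": "brew install yt-dlp (macOS) / pipx install yt-dlp (Linux) / "
              "winget install yt-dlp.yt-dlp (Windows)",
    "ffmpeg": "brew install ffmpeg (macOS) / sudo apt install ffmpeg (Debian/Ubuntu) / "
              "winget install Gyan.FFmpeg (Windows)",
}
# The requirements list gives one install command for both ffmpeg and ffprobe.
INSTALL_HINTS["ffprobe"] = INSTALL_HINTS["ffmpeg"]


def required_tools(with_frames: bool) -> list[str]:
    """Transcript-only needs yt-dlp; frame extraction also needs ffmpeg and ffprobe."""
    tools = ["yt-dlp"]
    if with_frames:
        tools += ["ffmpeg", "ffprobe"]
    return tools


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def check_or_exit(with_frames: bool = False) -> None:
    """Print install commands for missing tools and stop; never auto-install."""
    missing = missing_tools(required_tools(with_frames))
    if missing:
        for tool in missing:
            print(f"Missing {tool}. Install with: {INSTALL_HINTS[tool]}", file=sys.stderr)
        sys.exit(1)
```

Calling check_or_exit(with_frames=True) before any work starts surfaces every missing tool at once, so the user gets all install commands in a single message.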

How to invoke

The script lives next to this SKILL.md. Invoke it with the absolute path or ${CLAUDE_SKILL_DIR}/watch.py: