youtube-video-analyzer
SKILL.md
YouTube Video Analyzer — Multimodal
This skill performs deep analysis of YouTube videos through both information channels:
- Audio channel: Transcript with timestamps (what is SAID)
- Visual channel: Frame extraction + image analysis (what is SHOWN)
Most YouTube skills only extract transcripts. This skill closes that gap by synchronizing extracted frames with the spoken content, so a step-by-step guide can pair an instruction such as "click the blue button" with the frame that shows which button is meant.
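To make the synchronization idea concrete, here is a minimal Python sketch of one way to pair transcript segments with frames. It is an illustration under stated assumptions, not necessarily how this skill does it: the `Segment` type, the `align` and `nearest_frame` helpers, and the choice to match on each segment's midpoint are all hypothetical. It assumes the transcript segments and frame timestamps are already available in seconds.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class Segment:
    """One timestamped transcript segment (hypothetical shape)."""
    start: float  # seconds from the start of the video
    end: float
    text: str


def nearest_frame(frame_times: list[float], t: float) -> float:
    """Return the frame timestamp closest to t. frame_times must be sorted."""
    if not frame_times:
        raise ValueError("no frames to match against")
    i = bisect_left(frame_times, t)
    # Only the neighbors on either side of the insertion point can be closest.
    candidates = frame_times[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda ft: abs(ft - t))


def align(segments: list[Segment], frame_times: list[float]) -> list[tuple[Segment, float]]:
    """Pair each segment with the frame nearest to the segment's midpoint."""
    ordered = sorted(frame_times)
    return [(s, nearest_frame(ordered, (s.start + s.end) / 2)) for s in segments]


if __name__ == "__main__":
    segments = [
        Segment(12.0, 15.0, "Click the blue button"),
        Segment(30.0, 34.0, "Now open settings"),
    ]
    # Assumed: one frame was extracted every 10 seconds.
    frames = [0.0, 10.0, 20.0, 30.0, 40.0]
    for seg, frame_time in align(segments, frames):
        print(f"{seg.text} -> frame at {frame_time}s")
    # Click the blue button -> frame at 10.0s
    # Now open settings -> frame at 30.0s
```

Matching on the midpoint is only one possible choice. Matching on the segment start, or taking every frame that falls inside the segment, may suit fast-moving screen recordings better.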
Workflow Overview
YouTube URL
|
+---> 1. Get metadata (title, duration, video ID)
|
+---> 2. Extract transcript (yt-dlp --dump-json + curl)
|         -> Timestamped segments
|
+---> 3. Extract frames (yt-dlp + ffmpeg)