YouTube Video Analyzer — Multimodal

This skill performs deep analysis of YouTube videos through both information channels:

Audio channel: Transcript with timestamps (what is SAID)
Visual channel: Frame extraction + image analysis (what is SHOWN)

Most YouTube skills extract only the transcript and miss what appears on screen. This skill closes that gap by aligning extracted frames with the spoken content, so a step-by-step guide can pair an instruction such as "click the blue button" with the frame that shows which button is meant.
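One way to picture the alignment between the two channels is the minimal sketch below. It assumes the transcript is already available as a list of segments with a start time in seconds, and that each extracted frame has a known timestamp. The names Segment, segment_for_frame, and align are hypothetical and chosen for illustration; they are not part of this skill or of any library.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the beginning of the video
    text: str     # what is said from `start` until the next segment begins


def segment_for_frame(segments, frame_time):
    """Return the segment spoken at frame_time.

    `segments` must be sorted by start time. Returns None when the frame
    precedes the first segment.
    """
    starts = [s.start for s in segments]
    # Index of the last segment whose start is <= frame_time.
    i = bisect_right(starts, frame_time) - 1
    return segments[i] if i >= 0 else None


def align(segments, frame_times):
    """Pair each frame timestamp with the segment spoken at that moment."""
    ordered = sorted(segments, key=lambda s: s.start)
    return [(t, segment_for_frame(ordered, t)) for t in frame_times]
```

For example, with segments starting at 0.0, 4.5, and 9.0 seconds, a frame taken at 5.0 seconds is paired with the segment that starts at 4.5. The sketch treats each segment as lasting until the next one begins; real transcripts may also carry explicit durations, which would allow gaps of silence to be left unpaired.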

Workflow Overview

YouTube URL
    |
    +---> 1. Get metadata (title, duration, video ID)
    |
    +---> 2. Extract transcript (yt-dlp --dump-json + curl)
    |         -> Timestamped segments
    |
    +---> 3. Extract frames (yt-dlp + ffmpeg)
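The three steps above can be sketched as small Python helpers that build the command lines and read the metadata JSON. This is an illustration under stated assumptions, not the skill's actual implementation: the function names are hypothetical, the video is assumed to be already downloaded to a local file before frames are extracted, and the caption lookup assumes the "subtitles" and "automatic_captions" dictionaries that yt-dlp includes in its JSON output.

```python
import json


def metadata_command(url):
    # `yt-dlp --dump-json` prints a JSON description of the video
    # instead of downloading it.
    return ["yt-dlp", "--dump-json", url]


def parse_metadata(raw_json):
    """Keep only the fields step 1 needs: video ID, title, duration."""
    info = json.loads(raw_json)
    return {
        "id": info.get("id"),
        "title": info.get("title"),
        "duration": info.get("duration"),  # seconds; may be absent
    }


def caption_url(raw_json, lang="en", ext="vtt"):
    """Find a caption track URL that can then be fetched with curl.

    Assumption: manual subtitles are preferred over automatic captions.
    Returns None when no track matches the language and format.
    """
    info = json.loads(raw_json)
    for key in ("subtitles", "automatic_captions"):
        for track in (info.get(key) or {}).get(lang, []):
            if track.get("ext") == ext and "url" in track:
                return track["url"]
    return None


def frame_command(video_path, seconds, out_path):
    # Seek to `seconds` in the local file and write a single frame.
    return ["ffmpeg", "-y", "-ss", f"{seconds:.3f}", "-i", video_path,
            "-frames:v", "1", out_path]
```

Each command list can be passed to subprocess.run(cmd, capture_output=True, text=True, check=True). Keeping the command construction separate from execution makes the helpers easy to check without network access.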
