video-understanding


What this does

Turns a source video into an understanding index an agent (or a downstream stage) can read:

  1. Scene detection → scenes.json (cut points, durations) + junk-scene filtering.
  2. Frame extraction: sampled frames for the visual analysis.
  3. ASR → asr_result.json (timestamped dialogue) via MiMo mimo-v2.5-asr.
  4. Silence detection → silence_periods.json (quiet windows, has_speech flag).
  5. VLM analysis → vlm_analysis.json (per-scene description, depth analysis, frame_facts).
  6. Timeline fusion + brief → timeline_fusion.json, asr_writing_chunks.json, agent_narration_brief.md.
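
To make the timeline fusion in step 6 more concrete, here is a minimal sketch of one way scene boundaries and timestamped dialogue could be joined by time overlap. The function name `fuse_timeline` and every field name (`id`, `start`, `end`, `text`, `dialogue`, `has_dialogue`) are assumptions for illustration; the actual schemas of scenes.json, asr_result.json, and timeline_fusion.json are not described here and may differ.

```python
from typing import Dict, List


def fuse_timeline(scenes: List[Dict], asr_segments: List[Dict]) -> List[Dict]:
    """Attach each ASR segment to every scene it overlaps in time.

    Hypothetical schema: scenes carry id/start/end, segments carry start/end/text.
    """
    fused = []
    for scene in scenes:
        overlapping = [
            seg["text"]
            for seg in asr_segments
            # Two intervals overlap when each one starts before the other ends.
            if seg["start"] < scene["end"] and seg["end"] > scene["start"]
        ]
        fused.append({
            "scene_id": scene["id"],
            "start": scene["start"],
            "end": scene["end"],
            "dialogue": overlapping,
            "has_dialogue": bool(overlapping),
        })
    return fused


# Made-up sample data, times in seconds.
scenes = [
    {"id": 0, "start": 0.0, "end": 4.0},
    {"id": 1, "start": 4.0, "end": 9.5},
]
asr_segments = [
    {"start": 1.0, "end": 3.0, "text": "Hello."},
    {"start": 3.5, "end": 5.0, "text": "Let's begin."},
]
timeline = fuse_timeline(scenes, asr_segments)
```

In this sketch a line of dialogue that straddles a cut is listed under both scenes, which keeps each scene's entry readable on its own at the cost of some duplication.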

Stateless: a reusable stage is skipped only when both its output and its provenance sidecar match the current source video and every output-affecting setting. Pass --force to recompute.
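
The skip rule can be illustrated with a small sketch. It assumes a sidecar is a JSON file stored next to the stage output and holding a SHA-256 fingerprint of the source video plus the settings. The helper names (`fingerprint`, `sidecar_path`, `record_provenance`, `can_skip`), the `.provenance.json` suffix, and the `threshold` setting are all hypothetical; the skill's real sidecar format is not documented here.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def fingerprint(video: Path, settings: dict) -> str:
    """Hash the video bytes together with the output-affecting settings."""
    digest = hashlib.sha256()
    # Reads the whole file at once; a real tool would likely hash in chunks.
    digest.update(video.read_bytes())
    digest.update(json.dumps(settings, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


def sidecar_path(output: Path) -> Path:
    """Hypothetical naming scheme: scenes.json -> scenes.json.provenance.json."""
    return output.with_name(output.name + ".provenance.json")


def record_provenance(output: Path, video: Path, settings: dict) -> None:
    """Write the fingerprint that produced this output."""
    payload = {"fingerprint": fingerprint(video, settings)}
    sidecar_path(output).write_text(json.dumps(payload))


def can_skip(output: Path, video: Path, settings: dict, force: bool = False) -> bool:
    """A stage is reusable only if output and sidecar exist and still match."""
    if force:
        return False
    sidecar = sidecar_path(output)
    if not (output.exists() and sidecar.exists()):
        return False
    recorded = json.loads(sidecar.read_text())
    return recorded.get("fingerprint") == fingerprint(video, settings)


with tempfile.TemporaryDirectory() as tmp:
    video = Path(tmp) / "source.mp4"
    video.write_bytes(b"placeholder video bytes")
    output = Path(tmp) / "scenes.json"
    settings = {"threshold": 0.3}

    before_first_run = can_skip(output, video, settings)
    output.write_text("[]")
    record_provenance(output, video, settings)
    after_first_run = can_skip(output, video, settings)
    after_setting_change = can_skip(output, video, {"threshold": 0.5})
    with_force = can_skip(output, video, settings, force=True)
```

Keeping the fingerprint beside the output, rather than in a central cache, is what lets the pipeline stay stateless in this sketch: each stage decides for itself from the files on disk.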

Requirements

# ffmpeg: brew install ffmpeg | apt install ffmpeg | choco install ffmpeg
export MIMO_API_KEY=***          # one key drives ASR (mimo-v2.5-asr) + VLM (mimo-v2.5)
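
Before running the pipeline, a preflight check along these lines can confirm both requirements are met. This is a sketch, not part of the skill: the function `missing_requirements` and its injectable `env` and `which` parameters are hypothetical, added only so the check can be exercised without touching the real environment.

```python
import os
import shutil


def missing_requirements(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    # shutil.which returns None when the executable is not on PATH.
    if which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    # Treat an empty value the same as an unset variable.
    if not env.get("MIMO_API_KEY"):
        problems.append("MIMO_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for problem in missing_requirements():
        print("missing:", problem)
```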
video-understanding — worldwonderer/video-recap-skills