VTake Local Workflow

VTake converts a local input video into a card-based composition. The agent designs the cards (timing and content), writes each card's HTML directly in the conversation, assembles the cards into a single composition HTML file, and renders that file to MP4 via hyperframes. There is no fixed archetype list and no prescribed card structure; the cards emerge from what the transcript actually says.
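
Because each card carries timing, a quick consistency check on the plan can be useful before any HTML is written. The following is a minimal sketch, not VTake's own code: the Card fields (start, end, content), the validate_cards helper, and the rule that cards should not overlap are all assumptions made for illustration, since the workflow only says that a card has timing and content.

```python
from dataclasses import dataclass


@dataclass
class Card:
    # Hypothetical fields: the workflow only says a card has timing and content.
    start: float  # seconds from the beginning of the video
    end: float    # seconds; expected to be greater than start
    content: str  # short description of what the card shows


def validate_cards(cards, duration):
    """Return a list of problems; an empty list means the plan is consistent.

    Assumes (hypothetically) that cards must lie within the video's duration
    and must not overlap each other.
    """
    problems = []
    ordered = sorted(cards, key=lambda c: c.start)
    for i, card in enumerate(ordered):
        if card.start < 0 or card.end > duration:
            problems.append(f"card {i} falls outside 0..{duration}s")
        if card.end <= card.start:
            problems.append(f"card {i} has non-positive length")
        if i > 0 and card.start < ordered[i - 1].end:
            problems.append(f"card {i} overlaps card {i - 1}")
    return problems
```

If overlapping cards are actually allowed in a given composition, the overlap check can simply be dropped; the bounds and length checks still apply.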

Inspectable intermediate files in the work directory:

metadata.json — duration / width / height / fps
audio.mp3 — extracted audio
transcript.json — segments + words with timestamps
storyboard.json — lightweight card outline (the agent's plan)
public/cards/card-XX.html — one HTML fragment per card
public/index.html — final assembled composition
output.mp4 — rendered video
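
Since every stage leaves a file behind, the work directory itself shows how far a run has progressed. The sketch below uses only the file names listed above; the inspect_work_dir helper and its return shape are hypothetical and are not part of VTake.

```python
from pathlib import Path

# Artifact names copied from the list above, in the order they are listed.
ARTIFACTS = [
    "metadata.json",
    "audio.mp3",
    "transcript.json",
    "storyboard.json",
    "public/index.html",
    "output.mp4",
]


def inspect_work_dir(work_dir):
    """Report which artifacts exist and which card fragments have been written.

    Returns (status, cards): status maps each artifact name to True/False,
    and cards is a sorted list of card file names under public/cards.
    """
    root = Path(work_dir)
    status = {name: (root / name).is_file() for name in ARTIFACTS}
    cards_dir = root / "public" / "cards"
    cards = []
    if cards_dir.is_dir():
        cards = sorted(p.name for p in cards_dir.glob("card-*.html"))
    return status, cards
```

Calling inspect_work_dir on a work directory and printing the result gives a quick view of which step to inspect or rerun next.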

