video-caption-generator

Turn Spoken Words Into Embedded, Searchable Captions

Many captioning tools hand you a raw transcript and leave placement, timing, and styling up to you. The video-caption-generator skill takes a different approach: you describe what you want in plain language, and the system handles transcription, alignment, and rendering in one continuous step.

The skill works through a back-and-forth conversation. You can specify caption style, font size, line length, speaker labels, or language preferences simply by typing your intent. If the first pass doesn't match your expectations, you refine it with a follow-up message rather than digging through export menus or re-uploading the file. This iterative model means the output improves with each exchange rather than locking you into a single automated decision.

Under the hood, the OpenClaw agent coordinates audio extraction, speech-to-text processing, timing synchronization, and caption rendering into a single pipeline. The agent interprets your conversational instructions and translates them into precise rendering parameters, so the gap between what you ask for and what you receive stays narrow. Whether you are adding accessibility subtitles to a lecture recording or styling captions for a social media reel, the same conversational interface handles both without requiring separate tools or manual track editing.
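
To make the last pipeline stage concrete, here is a minimal sketch of how timed transcript segments could be turned into caption text with a line-length limit and optional speaker labels. It is an illustration only: the `Segment` type, the function names, and the choice of SRT as the output format are assumptions, not the skill's actual internals or its real output format.

```python
import textwrap
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """One transcribed utterance, times in seconds (hypothetical type)."""
    start: float
    end: float
    text: str
    speaker: Optional[str] = None


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: List[Segment], max_line_length: int = 42) -> str:
    """Render segments as SRT text, wrapping each caption at max_line_length."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        # Prefix the speaker label when one is available.
        text = f"{seg.speaker}: {seg.text}" if seg.speaker else seg.text
        lines = textwrap.wrap(text, width=max_line_length)
        timing = f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}"
        blocks.append(f"{index}\n{timing}\n" + "\n".join(lines))
    # SRT cues are separated by a blank line.
    return "\n\n".join(blocks) + "\n"
```

In this sketch, a conversational request such as "shorter lines, with speaker names" would map to a smaller `max_line_length` and populated `speaker` fields, which is the kind of instruction-to-parameter translation the paragraph above describes.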

Environment Variables

| Variable | Required | Default |
|---|---|---|
| NEMO_TOKEN | No | Auto-generated (100 free credits, expires in 7 days, revocable via Settings → API Tokens) |
| NEMO_API_URL | No | https://mega-api-prod.nemovideo.ai |
| NEMO_WEB_URL | No | https://nemovideo.com |
| NEMO_CLIENT_ID | No | Auto-generated UUID, persisted to ~/.config/nemovideo/client_id (UUID only, no secrets) |
| SKILL_SOURCE | No | Auto-detected from install path, fallback unknown |
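
The defaults listed above could be resolved roughly as follows. This is a sketch under assumptions: the function names and the returned dictionary keys are hypothetical, the use of a version 4 UUID is a guess (the source only says "UUID"), and install-path detection for SKILL_SOURCE is skipped in favor of the documented fallback. It does not obtain a token; it only reports whether one is set.

```python
import os
import uuid
from pathlib import Path

DEFAULT_API_URL = "https://mega-api-prod.nemovideo.ai"
DEFAULT_WEB_URL = "https://nemovideo.com"
CLIENT_ID_PATH = Path.home() / ".config" / "nemovideo" / "client_id"


def load_client_id(env=os.environ, path=CLIENT_ID_PATH) -> str:
    """Return NEMO_CLIENT_ID, a previously persisted ID, or a new one."""
    explicit = env.get("NEMO_CLIENT_ID")
    if explicit:
        return explicit
    if path.exists():
        stored = path.read_text().strip()
        if stored:
            return stored
    # Assumption: a random (version 4) UUID; the file holds the UUID only.
    client_id = str(uuid.uuid4())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(client_id)
    return client_id


def load_config(env=os.environ, client_id_path=CLIENT_ID_PATH) -> dict:
    """Resolve every variable from the table, applying the listed defaults."""
    return {
        # None means no token is set and one still has to be obtained.
        "token": env.get("NEMO_TOKEN"),
        "api_url": env.get("NEMO_API_URL", DEFAULT_API_URL),
        "web_url": env.get("NEMO_WEB_URL", DEFAULT_WEB_URL),
        "client_id": load_client_id(env, client_id_path),
        # Install-path detection is omitted here; only the fallback is shown.
        "skill_source": env.get("SKILL_SOURCE", "unknown"),
    }
```

Persisting the client ID means repeated runs present the same value, which matters because the X-Client-Id header is required when a token has to be requested.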

If NEMO_TOKEN is not set, get one (requires X-Client-Id header):

Installs: 5
First Seen: Apr 16, 2026