Turn Spoken Words Into Embedded, Searchable Captions

Most captioning tools hand you a raw transcript and leave placement, timing, and styling up to you. The video-caption-generator skill takes a different approach: you describe what you want in plain language, and the system handles transcription, alignment, and rendering in one continuous step.

The skill works through a back-and-forth conversation. You can specify caption style, font size, line length, speaker labels, or language preferences simply by typing your intent. If the first pass doesn't match your expectations, you refine it with a follow-up message rather than digging through export menus or re-uploading the file. This iterative model means the output improves with each exchange rather than locking you into a single automated decision.

Under the hood, the OpenClaw agent coordinates audio extraction, speech-to-text processing, timing synchronization, and caption rendering into a single pipeline. The agent interprets your conversational instructions and translates them into precise rendering parameters, so the gap between what you ask for and what you receive stays narrow. Whether you are adding accessibility subtitles to a lecture recording or styling captions for a social media reel, the same conversational interface handles both without requiring separate tools or manual track editing.
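
To make the last two pipeline stages more concrete, here is a minimal sketch of how timed transcript segments might be turned into caption text while honoring a line-length and speaker-label preference. It is an illustration only: the SubRip (SRT) output format, the segment dictionary shape, and the names `srt_timestamp`, `render_srt`, `max_line_length`, and `speaker_labels` are all assumptions made for this example, not the skill's actual rendering parameters.

```python
import textwrap


def srt_timestamp(seconds):
    """Format a time offset in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def render_srt(segments, max_line_length=42, speaker_labels=False):
    """Render timed segments as SRT text.

    Each segment is a hypothetical dict with "start" and "end" (seconds),
    "text", and an optional "speaker" key.
    """
    blocks = []
    for index, seg in enumerate(segments, start=1):
        text = seg["text"].strip()
        if speaker_labels and seg.get("speaker"):
            text = f'{seg["speaker"]}: {text}'
        # Wrap on word boundaries so no caption line exceeds the limit
        # (a single word longer than the limit would be split).
        lines = textwrap.wrap(text, width=max_line_length)
        timing = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append("\n".join([str(index), timing, *lines]))
    return "\n\n".join(blocks) + "\n"
```

In this sketch, a follow-up request such as "shorter lines, and show who is speaking" would amount to calling `render_srt` again with a smaller `max_line_length` and `speaker_labels=True`, without repeating the transcription step.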

Environment Variables

| Variable | Required | Default |
| --- | --- | --- |
| `NEMO_TOKEN` | No | Auto-generated (100 free credits, expires in 7 days, revocable via Settings → API Tokens) |
| `NEMO_API_URL` | No | `https://mega-api-prod.nemovideo.ai` |
| `NEMO_WEB_URL` | No | `https://nemovideo.com` |
| `NEMO_CLIENT_ID` | No | Auto-generated UUID, persisted to `~/.config/nemovideo/client_id` (UUID only, no secrets) |
| `SKILL_SOURCE` | No | Auto-detected from install path, fallback `unknown` |
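
The defaults in the table can be read as a small resolution routine. The following is a rough sketch of that logic, not the skill's own code: the function names `resolve_config` and `load_client_id` and the returned dictionary keys are hypothetical. Install-path detection for `SKILL_SOURCE` is not described here, so the sketch only applies the `unknown` fallback, and it leaves the token as `None` when `NEMO_TOKEN` is unset rather than guessing how one is issued.

```python
import os
import uuid
from pathlib import Path

# Default endpoints, copied from the table above.
DEFAULTS = {
    "NEMO_API_URL": "https://mega-api-prod.nemovideo.ai",
    "NEMO_WEB_URL": "https://nemovideo.com",
}


def load_client_id(env, path):
    """Return NEMO_CLIENT_ID from env, else the persisted UUID, else a new one."""
    if env.get("NEMO_CLIENT_ID"):
        return env["NEMO_CLIENT_ID"]
    if path.exists():
        stored = path.read_text().strip()
        if stored:
            return stored
    # Nothing configured or persisted yet: generate a UUID and store only that.
    client_id = str(uuid.uuid4())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(client_id + "\n")
    return client_id


def resolve_config(env=None, client_id_path=None):
    """Resolve settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    if client_id_path is None:
        client_id_path = Path.home() / ".config" / "nemovideo" / "client_id"
    return {
        # None signals that a token still has to be obtained.
        "token": env.get("NEMO_TOKEN") or None,
        "api_url": env.get("NEMO_API_URL") or DEFAULTS["NEMO_API_URL"],
        "web_url": env.get("NEMO_WEB_URL") or DEFAULTS["NEMO_WEB_URL"],
        "client_id": load_client_id(env, Path(client_id_path)),
        # Assumption: install-path detection is omitted; only the fallback is shown.
        "skill_source": env.get("SKILL_SOURCE") or "unknown",
    }
```

Calling `resolve_config()` with no arguments reads `os.environ` and the default client ID path; passing a plain dictionary and a temporary path keeps the routine easy to exercise without touching the real home directory.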

If `NEMO_TOKEN` is not set, get one (requires the `X-Client-Id` header):