hyperframes-media
Summary
Generate speech, transcribe audio with timestamps, and remove video backgrounds for transparent overlays.
- Three CLI commands (tts, transcribe, remove-background) that each download and cache their own model on first run; no API keys required
- Text-to-speech supports 54 multilingual voices (American, British, Spanish, French, Hindi, Italian, Japanese, Portuguese, Mandarin) with speed control; auto-detects language from voice prefix
- Transcription produces word-level timestamps in normalized JSON; supports multiple input formats (audio, video, SRT/VTT, OpenAI responses) with configurable Whisper model sizes and explicit language selection to prevent silent translation errors
- Background removal outputs VP9 WebM with alpha channel (or ProRes/PNG) for transparent overlays; the optional --background-output flag creates a hole-cut inverse layer for compositing text or graphics between subject and background
SKILL.md
HyperFrames Media
Create the audio and media assets a composition needs — voiceover (TTS), background music + sound effects, transcription, captions, background removal — then consume and animate that data in HTML. For placing assets into compositions, see hyperframes-core.
The audio engine — one source for TTS · BGM · SFX
Workflows do NOT hand-roll audio or vendor a copy. There is one engine — scripts/audio.mjs — that takes a neutral audio_request.json and writes audio_meta.json (plus assets under assets/voice|bgm|sfx):
# <MEDIA_DIR> = this skill's directory
node <MEDIA_DIR>/scripts/audio.mjs --request ./audio_request.json --hyperframes . --out ./audio_meta.json
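If a workflow needs to call the engine from a Node script rather than a shell, the command above could be wrapped roughly as follows. This is a minimal sketch, not part of the skill: `buildAudioArgs` and `runAudioEngine` are hypothetical helper names, and the only flags used are the three shown in the command (`--request`, `--hyperframes`, `--out`). The structure of `audio_request.json` is not described here, so the sketch does not try to create one.

```javascript
// Hypothetical wrapper around the documented command line.
// Only --request, --hyperframes and --out come from the source; the
// function names and option object are illustrative assumptions.
const { spawnSync } = require("node:child_process");
const path = require("node:path");

// Build the argument list for `node <MEDIA_DIR>/scripts/audio.mjs ...`.
function buildAudioArgs(
  mediaDir,
  { request = "./audio_request.json", hyperframes = ".", out = "./audio_meta.json" } = {}
) {
  return [
    path.join(mediaDir, "scripts", "audio.mjs"),
    "--request", request,
    "--hyperframes", hyperframes,
    "--out", out,
  ];
}

// Run the engine with the current Node binary and fail loudly on errors.
function runAudioEngine(mediaDir, options) {
  const result = spawnSync(process.execPath, buildAudioArgs(mediaDir, options), {
    stdio: "inherit",
  });
  if (result.error) throw result.error;
  if (result.status !== 0) {
    throw new Error(`audio.mjs exited with status ${result.status}`);
  }
}
```

Keeping the argument construction separate from the process spawn makes it possible to check the command line without having the engine installed.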
All three capabilities degrade on ONE switch — whether a HeyGen credential is present (resolved from $HEYGEN_API_KEY / $HYPERFRAMES_API_KEY / ~/.heygen, not the CLI):