hyperframes-media

Summary

Generate speech, transcribe audio with timestamps, and remove video backgrounds for transparent overlays.

  • Three CLI commands (tts, transcribe, remove-background) that each download and cache their own model on first run; no API keys required
  • Text-to-speech supports 54 voices across nine languages and accents (American and British English, Spanish, French, Hindi, Italian, Japanese, Portuguese, Mandarin) with speed control; the language is auto-detected from the voice prefix
  • Transcription produces word-level timestamps in normalized JSON; supports multiple input formats (audio, video, SRT/VTT, OpenAI responses) with configurable Whisper model sizes and explicit language selection to prevent silent translation errors
  • Background removal outputs VP9 WebM with alpha channel (or ProRes/PNG) for transparent overlays; optional --background-output flag creates a hole-cut inverse layer for compositing text or graphics between subject and background
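
As an illustration of consuming the word-level timestamps described above, the following sketch groups words into caption cues. The field names (text, start, end, in seconds) are an assumption made for this example and may not match the actual normalized JSON; groupWords, maxWords, and maxGap are hypothetical names, not part of the skill.

```javascript
// Hypothetical input shape: the real normalized JSON may use other field names.
const sampleWords = [
  { text: "Hello", start: 0.0, end: 0.4 },
  { text: "world", start: 0.45, end: 0.9 },
  { text: "this", start: 1.8, end: 2.0 },
  { text: "is", start: 2.05, end: 2.2 },
  { text: "a", start: 2.25, end: 2.3 },
  { text: "test", start: 2.35, end: 2.8 },
  { text: "now", start: 2.85, end: 3.1 },
];

// Group words into caption cues. A new cue starts when the current cue
// already holds maxWords words, or when the pause since the previous
// word is longer than maxGap seconds.
function groupWords(words, { maxWords = 4, maxGap = 0.6 } = {}) {
  const cues = [];
  let current = null;
  for (const w of words) {
    const gap = current ? w.start - current.end : 0;
    if (!current || current.words.length >= maxWords || gap > maxGap) {
      current = { start: w.start, end: w.end, words: [] };
      cues.push(current);
    }
    current.words.push(w.text);
    current.end = w.end;
  }
  return cues.map((c) => ({
    start: c.start,
    end: c.end,
    text: c.words.join(" "),
  }));
}

const cues = groupWords(sampleWords);
console.log(cues.map((c) => c.text));
```

Each cue carries its own start and end time, so an HTML composition could show or animate one cue at a time.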
SKILL.md

HyperFrames Media

Create the audio and media assets a composition needs — voiceover (TTS), background music + sound effects, transcription, captions, background removal — then consume and animate that data in HTML. For placing assets into compositions, see hyperframes-core.

The audio engine — one source for TTS · BGM · SFX

Workflows do NOT hand-roll audio or vendor a copy of the engine. There is one engine, scripts/audio.mjs, which takes a neutral audio_request.json and writes audio_meta.json (plus assets under assets/voice, assets/bgm, and assets/sfx):

# <MEDIA_DIR> = this skill's directory
node <MEDIA_DIR>/scripts/audio.mjs --request ./audio_request.json --hyperframes . --out ./audio_meta.json
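
A workflow script could call the engine rather than reimplement it. The sketch below wraps the command shown above using only Node's standard library; buildAudioArgs and runAudioEngine are hypothetical helper names, and the structure of audio_meta.json is not assumed beyond it being valid JSON.

```javascript
const { execFileSync } = require("node:child_process");
const fs = require("node:fs");
const path = require("node:path");

// Build the argument list for the documented command line.
// mediaDir is the skill's directory (<MEDIA_DIR> above).
function buildAudioArgs(mediaDir, { request, hyperframes, out }) {
  return [
    path.join(mediaDir, "scripts", "audio.mjs"),
    "--request", request,
    "--hyperframes", hyperframes,
    "--out", out,
  ];
}

// Run the engine, then read back the metadata file it wrote.
// execFileSync throws if the engine exits with a non-zero status.
function runAudioEngine(mediaDir, opts) {
  execFileSync("node", buildAudioArgs(mediaDir, opts), { stdio: "inherit" });
  return JSON.parse(fs.readFileSync(opts.out, "utf8"));
}

// Example call (commented out because it needs the skill installed):
// const meta = runAudioEngine("/path/to/hyperframes-media", {
//   request: "./audio_request.json",
//   hyperframes: ".",
//   out: "./audio_meta.json",
// });
```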

All three capabilities degrade on ONE switch: whether a HeyGen credential is present. The credential is resolved from $HEYGEN_API_KEY, $HYPERFRAMES_API_KEY, or ~/.heygen, not from the CLI.
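
A minimal sketch of such a credential check, under two stated assumptions: the sources are consulted in the order listed, and the mere existence of ~/.heygen counts as a credential (its format is not described here). hasHeygenCredential is a hypothetical name, not the engine's actual function.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Returns true when any of the three documented sources is present.
// env and home are parameters so the check can be exercised without
// touching the real environment.
function hasHeygenCredential(env = process.env, home = os.homedir()) {
  // Assumption: a non-empty value in either variable is enough.
  if (env.HEYGEN_API_KEY || env.HYPERFRAMES_API_KEY) return true;
  // Assumption: any existing ~/.heygen (file or directory) counts.
  return fs.existsSync(path.join(home, ".heygen"));
}

console.log(hasHeygenCredential() ? "credential found" : "no credential");
```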

First Seen
May 5, 2026
hyperframes-media — heygen-com/hyperframes