StepFun stepaudio-2.5-tts

Generate Chinese or Japanese speech with stepaudio-2.5-tts (released 2026-04, verified 2026-04-23). This is contextual TTS: emotion and prosody are set through a natural-language description, not fixed labels.

Companion: for transcription with stepaudio-2.5-asr (the sibling model), use the stepfun-asr skill. The two models share an API key but use different endpoints with different request body shapes.

Why this skill exists: StepAudio 2.5 has two non-obvious pitfalls that can cost hours if you don't know about them:

stepaudio-2.5-tts rejects voice_label (the step-tts-2 way). Emotion and prosody now go through instruction (a natural-language description, ≤200 chars) and inline cues written in () parentheses inside the text itself.
Censorship is stricter: text containing 死, 消失, or sensitive political terms returns censorship_block. Rewrite options are listed in references/migration_from_v2.md.
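
The two pitfalls above can be caught locally before a request is sent. The following is a minimal sketch, not the bundled scripts: only the names instruction, voice_label, and censorship_block and the 200-character limit come from this document. The helper names, the "model" and "input" body fields, the example cue inside the parentheses, and the assumption that the limit counts Unicode code points are all hypothetical.

```python
MAX_INSTRUCTION_CHARS = 200   # limit stated in this document
RISKY_TERMS = ("死", "消失")   # the two examples named above; not an exhaustive list


def build_tts_body(text, instruction=None, **extra):
    """Build a request body dict (hypothetical shape) for stepaudio-2.5-tts.

    Raises ValueError for the mistakes described above instead of letting
    the API reject the request.
    """
    if "voice_label" in extra:
        # step-tts-2 style field; stepaudio-2.5-tts rejects it.
        raise ValueError("voice_label is not accepted; use instruction instead")
    if instruction is not None and len(instruction) > MAX_INSTRUCTION_CHARS:
        # Assumption: the limit is counted in code points, as len() does.
        raise ValueError("instruction exceeds 200 characters")

    # "model" and "input" are assumed field names, not confirmed here.
    body = {"model": "stepaudio-2.5-tts", "input": text}
    if instruction:
        body["instruction"] = instruction
    body.update(extra)
    return body


def find_risky_terms(text):
    """Return the known risky terms present in text, in RISKY_TERMS order."""
    return [term for term in RISKY_TERMS if term in text]


if __name__ == "__main__":
    # The cue inside () is an illustrative guess at the inline syntax.
    body = build_tts_body("(轻声)晚安。", instruction="Warm, slow bedtime voice")
    print(body)
    print(find_risky_terms("他消失了"))
```

A non-empty result from find_risky_terms is only a hint that the text may trigger censorship_block; sensitive political terms are not covered by this list, so an empty result does not guarantee acceptance.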

Config and auth

The API key lives in $STEPFUN_API_KEY (preferred) or ${CLAUDE_PLUGIN_DATA}/config.json (fallback for cross-session persistence). All bundled scripts try the environment variable first, then the config file.
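
The env-first, config-second lookup can be sketched as follows. This is an illustration of the order described above, not the bundled scripts' code: the function name resolve_api_key and the "api_key" field inside config.json are assumptions, since the config file's schema is not shown here.

```python
import json
import os
from pathlib import Path


def resolve_api_key(environ=None):
    """Return the StepFun API key, or None if it cannot be found.

    Lookup order: $STEPFUN_API_KEY, then ${CLAUDE_PLUGIN_DATA}/config.json.
    """
    env = os.environ if environ is None else environ

    # 1. Preferred source: the environment variable.
    key = env.get("STEPFUN_API_KEY")
    if key:
        return key

    # 2. Fallback: the persisted config file.
    data_dir = env.get("CLAUDE_PLUGIN_DATA")
    if not data_dir:
        return None
    config_path = Path(data_dir) / "config.json"
    if not config_path.is_file():
        return None
    config = json.loads(config_path.read_text(encoding="utf-8"))
    if not isinstance(config, dict):
        return None
    # "api_key" is a hypothetical field name.
    return config.get("api_key")


if __name__ == "__main__":
    print("key found" if resolve_api_key() else "no key configured")
```

Passing a plain dict as environ makes the lookup easy to exercise without touching the real environment.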

First-time setup (one-liner):

stepfun-tts
