openai-whisper
Purpose
This skill enables local audio transcription using the OpenAI Whisper model, processing files directly on the device for privacy and speed. It's ideal for converting speech to text, with multi-language support, word-level timestamps, and, with additional tooling, speaker diarization.
When to Use
Use this skill for tasks involving local audio files, such as transcribing interviews, podcasts, or meetings, when offline processing is needed to avoid network dependencies or data-privacy exposure. Apply it in workflows that require accurate timestamps or speaker identification, like content creation or analysis.
Key Capabilities
- Transcription: Converts audio to text in over 90 languages; specify the language via the `--language` flag (e.g., `--language en` for English).
- Timestamps: Outputs word-level timings; enable with `--word_timestamps True` to get each word's start and end in seconds.
- Speaker Diarization: Identifies speakers in audio; Whisper has no built-in diarization flag, so this requires additional setup, such as pairing the output with the pyannote.audio library.
- Multi-Format Support: Handles input formats like MP3, WAV, or FLAC; outputs JSON or SRT (via `--output_format`) for easy parsing.
- Model Selection: Choose from the tiny, base, small, medium, or large models; larger models improve accuracy but increase compute needs (e.g., `--model medium`). A combined CLI example follows this list.
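A minimal CLI invocation combining these flags might look like the following; the file name `meeting.mp3` and the `transcripts/` output directory are placeholders:

```bash
# Transcribe with the medium model, forcing English and word-level timestamps;
# writes meeting.json into transcripts/ for downstream parsing.
whisper meeting.mp3 \
  --model medium \
  --language en \
  --word_timestamps True \
  --output_format json \
  --output_dir transcripts
```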
Usage Patterns
Always run Whisper in a Python environment with the openai-whisper package installed; ffmpeg must also be available, since Whisper uses it to decode audio. For basic transcription, point the CLI at an audio file and pass options from the list above, or call the Python API directly, as sketched below. In pipelines, pipe the JSON or SRT output into other tools, like text-analysis skills. For speaker diarization, install the extra dependencies (e.g., pyannote.audio) first. Example 1: Transcribe a short audio clip for note-taking. Example 2: Process a multi-speaker recording for meeting summaries.
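For the Python-side pattern, here is a minimal sketch; the file name `interview.mp3` and the model size are illustrative assumptions:

```python
import whisper

# Load a model; "medium" trades accuracy against compute, per the list above.
model = whisper.load_model("medium")

# Transcribe with word-level timestamps; fp16=False silences the
# half-precision warning on CPU-only machines.
result = model.transcribe(
    "interview.mp3",  # placeholder path
    language="en",
    word_timestamps=True,
    fp16=False,
)

# result["text"] holds the full transcript; result["segments"] carries
# timing detail, with per-word entries when word_timestamps is enabled.
for segment in result["segments"]:
    for word in segment.get("words", []):
        print(f'{word["start"]:7.2f}-{word["end"]:7.2f}  {word["word"]}')
```

Note that nothing in this output identifies speakers; for diarization, a separate pipeline such as pyannote.audio assigns speaker labels to time ranges, which can then be joined against these word timings.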