openai-whisper
Purpose
This skill enables local audio transcription using the OpenAI Whisper model, processing files directly on the device for privacy and speed. It's ideal for converting speech to text, with multi-language support, word-level timestamps, and, with additional tooling, speaker diarization.
When to Use
Use this skill for tasks involving local audio files, such as transcribing interviews, podcasts, or meetings, when offline processing is needed to avoid network dependencies or data-privacy exposure. Apply it in workflows that require accurate timestamps or speaker identification, like content creation or analysis.
Key Capabilities
- Transcription: Converts audio to text in over 90 languages; specify the language via the `--language` flag (e.g., `--language en` for English).
- Timestamps: Outputs word-level timings; enable with `--word_timestamps True` to get each word's start and end in seconds.
- Speaker Diarization: Identifies speakers in audio; Whisper has no built-in diarization flag, so this requires additional setup, such as pairing the output with the pyannote.audio library.
- Multi-Format Support: Handles input formats like MP3, WAV, or FLAC; outputs JSON or SRT (via `--output_format`) for easy parsing.
- Model Selection: Choose from the tiny, base, small, medium, or large models; larger models improve accuracy but increase compute needs (e.g., `--model medium`). A combined CLI example follows this list.
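A minimal CLI invocation combining these flags might look like the following; the file name `meeting.mp3` and the `transcripts/` output directory are placeholders:

```bash
# Transcribe with the medium model, forcing English and word-level timestamps;
# writes meeting.json into transcripts/ for downstream parsing.
whisper meeting.mp3 \
  --model medium \
  --language en \
  --word_timestamps True \
  --output_format json \
  --output_dir transcripts
```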
Usage Patterns
Always run Whisper in a Python environment with the openai-whisper package installed; ffmpeg must also be available, since Whisper uses it to decode audio. For basic transcription, point the CLI at an audio file and pass options from the list above, or call the Python API directly, as sketched below. In pipelines, pipe the JSON or SRT output into other tools, like text-analysis skills. For speaker diarization, install the extra dependencies (e.g., pyannote.audio) first. Example 1: Transcribe a short audio clip for note-taking. Example 2: Process a multi-speaker recording for meeting summaries.
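For the Python-side pattern, here is a minimal sketch; the file name `interview.mp3` and the model size are illustrative assumptions:

```python
import whisper

# Load a model; "medium" trades accuracy against compute, per the list above.
model = whisper.load_model("medium")

# Transcribe with word-level timestamps; fp16=False silences the
# half-precision warning on CPU-only machines.
result = model.transcribe(
    "interview.mp3",  # placeholder path
    language="en",
    word_timestamps=True,
    fp16=False,
)

# result["text"] holds the full transcript; result["segments"] carries
# timing detail, with per-word entries when word_timestamps is enabled.
for segment in result["segments"]:
    for word in segment.get("words", []):
        print(f'{word["start"]:7.2f}-{word["end"]:7.2f}  {word["word"]}')
```

Note that nothing in this output identifies speakers; for diarization, a separate pipeline such as pyannote.audio assigns speaker labels to time ranges, which can then be joined against these word timings.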