ai-multimodal

AI Multimodal Processing Skill

Process audio, images, video, and documents, and generate images, using Google Gemini's multimodal API. The skill provides a single interface for multimedia content understanding and generation.

Core Capabilities

Audio Processing

Transcription with timestamps (up to 9.5 hours)
Audio summarization and analysis
Speech understanding and speaker identification
Music and environmental sound analysis
Text-to-speech generation with controllable voice
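
As a rough illustration of the 9.5-hour limit and timestamped transcripts listed above, the sketch below checks whether a clip fits in one request, estimates its token cost, and converts transcript timestamps to seconds. It makes no API calls. The helper names are hypothetical, and the figure of 32 tokens per second of audio is an assumption recalled from Gemini's documentation that should be verified before relying on it.

```python
import math

# Cap taken from the capability list above (9.5 hours of audio per request).
MAX_AUDIO_SECONDS = int(9.5 * 3600)

# ASSUMPTION: 32 tokens per second of audio. Verify against current Gemini docs.
TOKENS_PER_AUDIO_SECOND = 32


def fits_single_request(duration_seconds: float) -> bool:
    """Return True if the clip is within the 9.5-hour cap."""
    return 0 <= duration_seconds <= MAX_AUDIO_SECONDS


def estimate_audio_tokens(duration_seconds: float) -> int:
    """Estimate the token cost of a clip under the assumed per-second rate."""
    if duration_seconds < 0:
        raise ValueError("duration must be non-negative")
    return math.ceil(duration_seconds * TOKENS_PER_AUDIO_SECOND)


def parse_timestamp(timestamp: str) -> int:
    """Convert an 'MM:SS' or 'HH:MM:SS' transcript timestamp to seconds."""
    parts = [int(part) for part in timestamp.strip().split(":")]
    if not 2 <= len(parts) <= 3:
        raise ValueError(f"unsupported timestamp format: {timestamp!r}")
    seconds = 0
    for part in parts:
        seconds = seconds * 60 + part
    return seconds


if __name__ == "__main__":
    one_hour = 3600
    print(fits_single_request(one_hour))
    print(estimate_audio_tokens(one_hour))
    print(parse_timestamp("02:30"))
```

Clips longer than the cap would need to be split before upload; how to chunk them (fixed windows, silence detection) is left open here.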

Image Understanding

Image captioning and description
Object detection with bounding boxes (Gemini 2.0+)
Pixel-level segmentation (Gemini 2.5+)
Visual question answering
Multi-image comparison (up to 3,600 images)
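
For the bounding-box capability above, a minimal sketch of the post-processing step may help. It assumes the model returns each box as `[ymin, xmin, ymax, xmax]` with coordinates normalized to a 0-1000 range, which is the convention described in Gemini's image-understanding documentation; check that against the model version in use. `PixelBox` and `to_pixel_box` are hypothetical names introduced for illustration, and no API call is made.

```python
from typing import NamedTuple, Sequence


class PixelBox(NamedTuple):
    """Axis-aligned box in pixel coordinates (hypothetical helper type)."""
    left: int
    top: int
    right: int
    bottom: int


def to_pixel_box(box_2d: Sequence[float], image_width: int, image_height: int) -> PixelBox:
    """Scale a normalized box to pixels.

    ASSUMPTION: box_2d is [ymin, xmin, ymax, xmax] on a 0-1000 scale.
    """
    if len(box_2d) != 4:
        raise ValueError("box_2d must contain exactly four values")
    ymin, xmin, ymax, xmax = box_2d
    return PixelBox(
        left=round(xmin * image_width / 1000),
        top=round(ymin * image_height / 1000),
        right=round(xmax * image_width / 1000),
        bottom=round(ymax * image_height / 1000),
    )


if __name__ == "__main__":
    # A box covering the middle of a 640x480 image.
    print(to_pixel_box([250, 100, 750, 900], image_width=640, image_height=480))
```

Note the y-before-x ordering: swapping the axes is an easy mistake when passing the result to drawing libraries that expect `(left, top, right, bottom)`.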

Installs: 4
Repository: samhvw8/dotfiles
GitHub Stars: 12
First Seen: Jan 21, 2026
Security Audits: Gen Agent Trust Hub (Pass)