gemini-3-multimodal
Gemini 3 Pro Multimodal Input Processing
A guide to processing multimodal inputs with Gemini 3 Pro: image understanding, video analysis, audio processing, and PDF document extraction. This skill covers INPUT processing (analyzing media) only; see gemini-3-image-generation for OUTPUT (generating images).
Overview
Gemini 3 Pro provides native multimodal capabilities for understanding and analyzing various media types. This skill covers all input processing operations with granular control over quality, performance, and token consumption.
Key Capabilities
- Image Understanding: Object detection, OCR, visual Q&A, code from screenshots
- Video Processing: Up to 1 hour of video, frame analysis, OCR
- Audio Processing: Up to 9.5 hours of audio, speech understanding
- PDF Documents: Native PDF support, multi-page analysis, text extraction
- Media Resolution Control: Low/medium/high resolution for token optimization
- Token Optimization: Granular control over processing costs
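The capabilities above can be turned into a small pre-flight check before any media is sent. The sketch below is illustrative only and does not call the Gemini API: the function name `plan_input`, the extension table, and the policy for choosing a resolution level are all assumptions made for this example. Only the duration limits (1 hour of video, 9.5 hours of audio) and the low/medium/high levels come from the list above.

```python
from pathlib import Path

# Duration limits from the capability list above, in seconds.
MAX_DURATION_SECONDS = {
    "video": 1 * 60 * 60,          # up to 1 hour of video
    "audio": int(9.5 * 60 * 60),   # up to 9.5 hours of audio
}

# Hypothetical extension-to-modality table (an assumption, not an official list).
MODALITY_BY_EXTENSION = {
    ".png": "image", ".jpg": "image", ".jpeg": "image", ".webp": "image",
    ".mp4": "video", ".mov": "video", ".webm": "video",
    ".mp3": "audio", ".wav": "audio", ".flac": "audio",
    ".pdf": "pdf",
}


def plan_input(path, duration_seconds=None, needs_fine_detail=False):
    """Classify a media file and suggest a resolution level for it.

    Returns a dict with the detected modality and a suggested
    media resolution ("low", "medium", "high", or None for audio).
    """
    ext = Path(path).suffix.lower()
    modality = MODALITY_BY_EXTENSION.get(ext)
    if modality is None:
        raise ValueError(f"unsupported media type: {ext!r}")

    # Reject media that exceeds the documented duration limit.
    limit = MAX_DURATION_SECONDS.get(modality)
    if limit is not None and duration_seconds is not None:
        if duration_seconds > limit:
            raise ValueError(
                f"{modality} is {duration_seconds}s long; limit is {limit}s"
            )

    # Assumed policy: resolution does not apply to audio; fine detail
    # (small text, dense diagrams) gets "high"; video defaults to "low"
    # because it has many frames; everything else gets "medium".
    if modality == "audio":
        resolution = None
    elif needs_fine_detail:
        resolution = "high"
    elif modality == "video":
        resolution = "low"
    else:
        resolution = "medium"

    return {"modality": modality, "media_resolution": resolution}
```

For example, `plan_input("scan.png", needs_fine_detail=True)` suggests the high level for an OCR-style task, while `plan_input("talk.mp4", duration_seconds=4000)` raises `ValueError` because the clip is longer than one hour. The returned level would still need to be mapped onto whatever setting the client library actually exposes.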