Multimodal Models
Pre-trained models for vision, audio, and cross-modal tasks.
Model Overview
| Model | Modality | Task |
|---|---|---|
| CLIP | Image + Text | Zero-shot classification, similarity |
| Whisper | Audio → Text | Transcription, translation |
| Stable Diffusion | Text → Image | Image generation, editing |
CLIP (Vision-Language)
Zero-shot image classification: CLIP scores an image against arbitrary text labels supplied at inference time, so it needs no training on those specific labels.
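The scoring step behind zero-shot classification can be sketched in plain Python. This is a minimal illustration, not CLIP's actual implementation: the three-dimensional embeddings are made-up placeholders standing in for the outputs of CLIP's image and text encoders, and `zero_shot_classify`, `l2_normalize`, and the `logit_scale` default of 100 are names and values assumed here for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_classify(image_embedding, label_embeddings, logit_scale=100.0):
    """Return a probability per label from image/text embedding similarity.

    logit_scale is an assumed temperature; it sharpens the softmax.
    """
    image = l2_normalize(image_embedding)
    labels = list(label_embeddings)

    # Cosine similarity between the image and each label prompt, scaled to logits.
    logits = []
    for label in labels:
        text = l2_normalize(label_embeddings[label])
        cosine = sum(a * b for a, b in zip(image, text))
        logits.append(logit_scale * cosine)

    # Numerically stable softmax over the candidate labels.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


# Hypothetical toy embeddings; real ones would come from the model's encoders.
image_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "a photo of a cat": [1.0, 0.0, 0.1],
    "a photo of a dog": [0.1, 1.0, 0.0],
    "a photo of a car": [0.0, 0.2, 1.0],
}

probs = zero_shot_classify(image_embedding, label_embeddings)
best = max(probs, key=probs.get)
print(best)
```

Because the labels are ordinary text prompts, changing the candidate classes only means changing the keys of `label_embeddings` and re-encoding them; nothing is retrained.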