multimodal-llm

Multimodal LLM Patterns

Integrate vision, audio, and video generation capabilities from leading multimodal models. Covers image analysis, document understanding, real-time voice agents, speech-to-text, text-to-speech, and AI video generation (Kling v3, Sora 2, Veo 3.1 std/lite/fast tiers, Runway Gen-4.5 via gen4_turbo).
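For the image-analysis pattern, a minimal sketch of building a vision request in the Anthropic Messages API content-block format (the `build_image_message` helper name is hypothetical; the model ID is taken from the pinned table below):

```python
import base64


def build_image_message(image_bytes: bytes, media_type: str, prompt: str) -> dict:
    """Build a Messages API request body that pairs an image with a text prompt.

    The image travels as a base64 content block followed by the text block,
    so the model sees the image before the question about it.
    """
    return {
        "model": "claude-opus-4-7",  # pinned ID from the table below
        "max_tokens": 1024,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,  # e.g. "image/png"
                            "data": base64.b64encode(image_bytes).decode("ascii"),
                        },
                    },
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }
```

The same body can then be posted with any HTTP client or the official SDK; only the payload shape is shown here.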

Canonical model IDs (pinned against yonatan-hq/platform/apps/api/app/config.py):

| Provider | Model IDs |
|----------|-----------|
| Anthropic | `claude-opus-4-7` (latest), `claude-opus-4-6`, `claude-sonnet-4-6`, `claude-haiku-4-5-20251001` |
| OpenAI | `gpt-5.2` (current flagship) |
| Google | `gemini-3.1-pro-preview` (flagship), `gemini-3.1-flash-lite-preview` (cost) |
| Veo | `veo-3.1-generate-preview` / `veo-3.1-lite-generate-preview` / `veo-3.1-fast-generate-preview` |
| Kling | `kling-v3` (`model_name` field in Kling API) |
| Runway | `gen4_turbo` (product label: Gen-4.5) |
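The pinned IDs above can be mirrored in a small registry so callers never hard-code a model string twice. A sketch, assuming the first ID per provider is the flagship (the `MODEL_IDS` and `default_model` names are illustrative, not part of the actual config module):

```python
# Registry mirroring the pinned canonical IDs; keys are lowercase provider names.
MODEL_IDS: dict[str, list[str]] = {
    "anthropic": [
        "claude-opus-4-7",
        "claude-opus-4-6",
        "claude-sonnet-4-6",
        "claude-haiku-4-5-20251001",
    ],
    "openai": ["gpt-5.2"],
    "google": ["gemini-3.1-pro-preview", "gemini-3.1-flash-lite-preview"],
    "veo": [
        "veo-3.1-generate-preview",
        "veo-3.1-lite-generate-preview",
        "veo-3.1-fast-generate-preview",
    ],
    "kling": ["kling-v3"],
    "runway": ["gen4_turbo"],
}


def default_model(provider: str) -> str:
    """Return the flagship (first-listed) model ID for a provider.

    Raises KeyError for unknown providers, so typos fail loudly.
    """
    return MODEL_IDS[provider][0]
```

Pinning against one registry keeps skill code and `config.py` from drifting apart when IDs rotate.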

Quick Reference

| Category | Rules | Impact | When to Use |
|----------|-------|--------|-------------|