ag2-multimodal-input

Multimodal inputs

When to use

Use this skill when the user wants the agent to process non-text input: an image to describe, audio to transcribe, video to summarise, or a PDF or other document to extract from. The same factory pattern works across providers, but the input types each provider supports vary.
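To decide which of these cases a given file falls into, one option is to check its MIME type before handing it to the agent. The sketch below uses only Python's standard `mimetypes` module. `classify_modality` is a hypothetical helper and not part of AG2, and the four categories simply mirror the list above.

```python
import mimetypes


def classify_modality(filename: str) -> str:
    """Guess whether a file is an image, audio, video, or document input.

    Hypothetical helper for illustration only; it is not an AG2 API.
    Returns "unknown" when the type cannot be guessed or is not one of
    the non-text inputs listed above.
    """
    mime, _encoding = mimetypes.guess_type(filename)
    if mime is None:
        return "unknown"

    # The part before the slash ("image", "audio", "video", ...) is enough
    # for the three media categories.
    top_level = mime.split("/", 1)[0]
    if top_level in ("image", "audio", "video"):
        return top_level

    # PDFs are the only document type named in the text above.
    if mime == "application/pdf":
        return "document"

    return "unknown"
```

A caller could use the result to pick the matching input type, or to reject the file early when the chosen provider does not support that modality.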

60-second recipe

from autogen.beta import Agent
from autogen.beta.config import GeminiConfig
from autogen.beta.events import ImageInput

# An agent backed by a Gemini model and instructed to describe images.
agent = Agent(
    "vision",
    "You describe images.",
    config=GeminiConfig(model="gemini-3-flash-preview"),
)
First Seen
May 8, 2026
ag2-multimodal-input — ag2ai/ag2-skills