ag2-multimodal-input
Multimodal inputs
When to use
The user wants the agent to process non-text input: an image to describe, audio to transcribe, video to summarise, or a PDF or other document to extract content from. The same factory pattern works across providers, but support for each input type varies by provider.
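Because the four input kinds above are handled differently, it can help to decide which kind a file is before building the request. The following is a minimal sketch using only the Python standard library. The `classify_input` helper and its category names are hypothetical illustrations, not part of AG2, and the sketch assumes that a file's extension is a reliable guide to its type.

```python
import mimetypes


def classify_input(path: str) -> str:
    """Map a file path to a coarse input kind.

    Hypothetical helper (not an AG2 API): returns "image", "audio",
    "video", "document", or "unsupported" based on the guessed MIME type.
    """
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        # Unknown extension: let the caller decide how to handle it.
        return "unsupported"
    major = mime.split("/", 1)[0]
    if major in ("image", "audio", "video"):
        return major
    if mime == "application/pdf":
        return "document"
    return "unsupported"


if __name__ == "__main__":
    for name in ("photo.png", "clip.mp3", "movie.mp4", "report.pdf"):
        print(name, "->", classify_input(name))
```

A classification like this only says what the file is. Whether the chosen provider accepts that kind of input still has to be checked against that provider's documentation.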
60-second recipe
from autogen.beta import Agent
from autogen.beta.config import GeminiConfig
from autogen.beta.events import ImageInput

agent = Agent(
    "vision",
    "You describe images.",
    config=GeminiConfig(model="gemini-3-flash-preview"),
)