Multimodal inputs

When to use

The user wants the agent to process non-text input: an image to describe, audio to transcribe, video to summarise, or a PDF or other document to extract content from. The same factory pattern works across providers, but the input types each provider accepts vary.
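Before picking a provider, it can help to work out which of the four input kinds a given file belongs to. The following is a minimal sketch using only Python's standard `mimetypes` module. The function name `classify_modality`, the `DOCUMENT_TYPES` set, and the four labels are hypothetical helpers for illustration. They are not part of `autogen.beta`.

```python
import mimetypes

# Hypothetical set of MIME types treated as "document" input (assumption:
# only PDF is listed here; extend it to match what your provider accepts).
DOCUMENT_TYPES = {"application/pdf"}


def classify_modality(path: str) -> str:
    """Return "image", "audio", "video", or "document" for a file path.

    The decision is based on the file extension only; the file is not opened.
    Raises ValueError when the type is unknown or is none of the four kinds.
    """
    mime, _encoding = mimetypes.guess_type(path)
    if mime is None:
        raise ValueError(f"cannot determine the type of {path!r}")

    # The part before the slash is the top-level media type, e.g. "image".
    top_level = mime.split("/", 1)[0]
    if top_level in ("image", "audio", "video"):
        return top_level
    if mime in DOCUMENT_TYPES:
        return "document"
    raise ValueError(f"{mime} is not a supported non-text input type")
```

A caller could use the returned label to choose the matching input type, or to reject a file early when the chosen provider does not accept that kind of input.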

60-second recipe

from autogen.beta import Agent
from autogen.beta.config import GeminiConfig
from autogen.beta.events import ImageInput

# Create an agent backed by a Gemini model.
agent = Agent(
    "vision",  # agent name
    "You describe images.",  # instructions given to the agent
    config=GeminiConfig(model="gemini-3-flash-preview"),
)

Repository

ag2ai/ag2-skills

First Seen

May 8, 2026
