multimodal-rag

Multimodal RAG

Build retrieval-augmented generation systems that handle images, text, and mixed content.

Overview

  • Image + text retrieval (product search, documentation)
  • Cross-modal search (text query -> image results)
  • Multimodal document processing (PDFs with charts)
  • Visual question answering with context
  • Image similarity and deduplication
  • Hybrid search pipelines

Architecture Approaches

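Approach               | Pros                 | Cons             | Best For
-----------------------|----------------------|------------------|-------------------
Joint Embedding (CLIP) | Direct comparison    | Limited context  | Pure image search
Caption-based          | Works with text LLMs | Lossy conversion | Existing text RAG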
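
To make the difference between the two approaches concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the skill's own implementation: the vectors in IMAGE_VECTORS are hand-written stand-ins for what a joint encoder such as CLIP would produce, the captions in CAPTIONS are hypothetical output of a captioning model, and the function names joint_embedding_search and caption_search are invented for this example. The caption path uses simple word overlap where a real text RAG pipeline would likely use text embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Assumption: toy 3-d vectors standing in for real joint-encoder output.
IMAGE_VECTORS = {
    "red_shoe.jpg": [0.9, 0.1, 0.0],
    "sales_chart.png": [0.1, 0.9, 0.2],
    "blue_jacket.jpg": [0.7, 0.0, 0.6],
}

# Assumption: captions a captioning model might have produced.
CAPTIONS = {
    "red_shoe.jpg": "a red running shoe on a white background",
    "sales_chart.png": "bar chart of quarterly sales by region",
    "blue_jacket.jpg": "a blue waterproof jacket",
}


def joint_embedding_search(query_vector, image_vectors, top_k=2):
    """Joint embedding: compare the query vector directly with image vectors."""
    scored = [(cosine(query_vector, vec), name) for name, vec in image_vectors.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


def caption_search(query, captions, top_k=2):
    """Caption-based: retrieve images through their text captions only."""
    query_words = set(query.lower().split())
    scored = []
    for name, caption in captions.items():
        overlap = len(query_words & set(caption.lower().split()))
        if overlap:
            scored.append((overlap, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


if __name__ == "__main__":
    # Assumption: a text query already encoded into the same space as the images.
    print(joint_embedding_search([1.0, 0.0, 0.1], IMAGE_VECTORS))
    print(caption_search("quarterly sales chart", CAPTIONS))
```

The joint path never looks at text describing the image, which is why it allows direct comparison but carries limited context. The caption path can plug into an existing text retriever, but it only finds what the caption happened to mention, which is the lossy conversion named in the table.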

More from yonatangross/orchestkit

First Seen
Jan 22, 2026