glmv-caption
SKILL.md
GLM-V Caption Skill
Generate captions for images, videos, and documents using the ZhiPu GLM-V multimodal model.
When to Use
- Describe, caption, summarize, or interpret image/video/document content
- User mentions "describe this image", "caption", "summarize this video", "图片描述", "视频摘要", "文档解读", "看图说话"
- Extract visual or textual information from media files
- Compare multiple images
- User provides an image/video/file and asks what's in it
Supported Input Types
| Type | Formats | Max Size | Max Count | Base64 |
|---|---|---|---|---|
| Image | jpg, png, jpeg | 5MB / 6000×6000px | 50 | ✅ |
| Video | mp4, mkv, mov | 200MB | — | ❌ |
| File | pdf, docx, txt, xlsx, pptx, jsonl | — | 50 | ❌ |