Model Comparator

Overview

This skill helps engineering and product teams make structured, evidence-based decisions about which AI model or LLM to use for a given task. It compares models across nine dimensions: benchmark performance, real-world task capability, inference cost per token, latency (time to first token and throughput), context window size, multimodal capabilities, fine-tuning availability, licensing, and data privacy. It provides frameworks for structured comparison, cost modeling at scale, and task-specific head-to-head evaluation, so that decisions rest on production-relevant evidence rather than marketing benchmarks.
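As an illustration of cost modeling at scale, the sketch below estimates monthly spend from per-token pricing and expected traffic. Everything in it is an assumption made for the example: the `ModelPricing` class, the `monthly_cost` helper, the placeholder model names, and all prices and traffic figures. Real prices should come from each provider's current pricing page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPricing:
    """Hypothetical per-model pricing in USD per million tokens."""
    name: str
    input_usd_per_mtok: float
    output_usd_per_mtok: float


def monthly_cost(pricing: ModelPricing, requests_per_day: int,
                 input_tokens: int, output_tokens: int, days: int = 30) -> float:
    """Estimate monthly spend from average tokens per request."""
    per_request = (
        input_tokens * pricing.input_usd_per_mtok
        + output_tokens * pricing.output_usd_per_mtok
    ) / 1_000_000
    return per_request * requests_per_day * days


# Placeholder candidates; the prices are illustrative, not real quotes.
candidates = [
    ModelPricing("model-a", input_usd_per_mtok=3.00, output_usd_per_mtok=15.00),
    ModelPricing("model-b", input_usd_per_mtok=0.50, output_usd_per_mtok=1.50),
]

if __name__ == "__main__":
    # Assumed workload: 100k requests/day, 1500 input and 300 output tokens each.
    for model in candidates:
        cost = monthly_cost(model, requests_per_day=100_000,
                            input_tokens=1500, output_tokens=300)
        print(f"{model.name}: ${cost:,.2f} per month")
```

A model like this only covers token cost. Self-hosted options would also need hardware, operations, and utilization estimates, which are outside this sketch.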

When to Use

Choosing between frontier models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, etc.) for a new product feature
Deciding whether to use a proprietary API or a self-hosted open-source model
Selecting an embedding model for a RAG (retrieval-augmented generation) pipeline
Evaluating cost-quality tradeoffs for a high-volume production use case
Justifying a model switch to stakeholders with data
Comparing models for latency-sensitive applications (real-time chat, autocomplete)
Assessing model capabilities for a specialized domain (medical, legal, code, multilingual)

When NOT to Use

Building evaluation infrastructure from scratch (use the eval-designer skill)
Fine-tuning or training a model on custom data (use the model training skills)
Comparing internal model versions (use the eval-designer skill with your specific metrics)
Choosing between ML frameworks such as TensorFlow and PyTorch (that is an infrastructure decision, not a model selection)
