hebrew-llm-eval-suite
Hebrew LLM Eval Suite
Problem
Israeli product teams pick LLMs blind. No standardized Hebrew benchmark lets a PM spend an afternoon comparing Claude, GPT, DictaLM, and AI21 Jamba on their actual use case. The HuggingFace Open Hebrew LLM Leaderboard exists, but it is built for base models and few-shot prompts, not for API-hosted chat models. DictaLM publishes benchmark results, but only for its own suite. Teams end up guessing, testing informally, or trusting marketing claims. The result is a costly model switch after launch, or a Hebrew product shipped on a model that silently fails for native speakers.
Instructions
Step 1: Pick the right benchmark set for your task
Different benchmarks test different things. Choose the smallest set that covers your actual use case.