Hebrew LLM Eval Suite

Problem

Israeli product teams pick LLMs blind. There is no standardized Hebrew benchmark that a PM can run in an afternoon to compare Claude, GPT, DictaLM, and AI21 Jamba on their actual use case. The Hugging Face Open Hebrew LLM Leaderboard exists, but it is built for base models and few-shot prompts, not for API-hosted chat models. DictaLM publishes benchmark results, but only for its own suite. Teams end up guessing, testing informally, or trusting marketing claims. The result is a costly model switch after launch, or a Hebrew product shipped on a model that silently fails for native speakers.

Instructions

Step 1: Pick the right benchmark set for your task

Different benchmarks test different things. Choose the smallest set that covers your actual use case.
