datasets
Generate Evaluation Datasets
You are a senior evaluation engineer helping the user create a realistic, high-quality evaluation dataset. Your goal is to produce data that is indistinguishable from real production traffic — not generic, not sanitized, not robotic.
NON-NEGOTIABLE: every row must look like it came from THIS bot's actual users
Before you write a single row, ask yourself: "Would a real user of THIS specific bot — given its system prompt, persona, and domain — ever send this message?" If the answer is "no" or "not really", do not include the row.
This is the most commonly failed criterion of this skill. Examples of what is automatically wrong (a concrete sketch of good and bad rows follows the list):
- A tweet-style emoji bot getting "What is the capital of France?" or "Explain photosynthesis" — real users of a fun emoji bot send "lol roast my Monday outfit 🫠", "hot take on cilantro??", "describe my mood in 3 emojis", not high-school trivia.
- A customer support bot getting "Tell me about quantum computing" — real users send "WHERE IS MY ORDER #4521 ITS BEEN 2 WEEKS", "refund pls — package arrived smashed".
- A SQL assistant getting "Hi how are you?" — real users paste schemas and ask "join orders to users where signup_date > 2024".
- A RAG knowledge-base bot getting questions whose answers are obviously not in its corpus, with no negative-case framing — real users mostly ask things the docs cover, with a sprinkle of off-topic.
The "what if it's a general-purpose chatbot?" excuse is invalid: read its system prompt. Even general bots have a tone, a length budget, an emoji policy, a refusal policy. Match THAT.
If you find yourself reaching for "What is the capital of [country]?", "Explain [scientific concept]", "What is [historical event]?", or "Tell me about [generic topic]" — stop, re-read the system prompt, and pick something a real user of this bot would say.
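One way to enforce this mechanically is a last-pass lint over the generated rows. The sketch below is a crude, assumption-laden example: the regex ban-list and the function name are made up for illustration and are not part of any LangWatch API. Treat it as a first filter before the human "would a real user send this?" check, not a replacement for it.

```python
import re

# Hedged sketch: a crude ban-list for generic trivia templates.
# Patterns and function name are illustrative, not a LangWatch API.
GENERIC_TEMPLATES = [
    r"^what is the capital of\b",
    r"^explain \w+",            # "Explain photosynthesis"
    r"^what is the \w+ (war|revolution|treaty)\b",
    r"^tell me about\b",
]

def flag_generic_rows(rows: list[dict]) -> list[dict]:
    """Return rows whose input matches a banned generic template."""
    return [
        row for row in rows
        if any(re.search(pattern, row["input"].strip().lower())
               for pattern in GENERIC_TEMPLATES)
    ]

# Anything flagged gets rewritten against the system prompt, not shipped.
assert flag_generic_rows([{"input": "What is the capital of France?"}])
```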
More from langwatch/skills
- evaluations (51): Set up comprehensive evaluations for your AI agent with LangWatch — experiments (batch testing), evaluators (scoring functions), datasets, online evaluation (production monitoring), and guardrails (real-time blocking). Supports both code (SDK) and platform (CLI) approaches. Use when the user wants to evaluate, test, benchmark, monitor, or safeguard their agent.
- scenarios (50): Test your AI agent with simulation-based scenarios. Covers writing scenario test code (Scenario SDK), creating platform scenarios via the `langwatch` CLI, and red teaming for security vulnerabilities. Auto-detects whether to use code or platform approach based on context.
- tracing (46): Add LangWatch tracing and observability to your code. Use for both onboarding (instrument an entire codebase) and targeted operations (add tracing to a specific function or module). Supports Python and TypeScript with all major frameworks.
- level-up (38): Take your AI agent to the next level with full LangWatch integration. Adds tracing, prompt versioning, evaluation experiments, and simulation tests in one go. Use when the user wants comprehensive observability, testing, and prompt management for their agent.
- prompts (37): Version and manage your agent's prompts with LangWatch Prompts CLI. Use for both onboarding (set up prompt versioning for an entire codebase) and targeted operations (version a specific prompt, create a new prompt version). Supports Python and TypeScript.
- analytics (32): Analyze your AI agent's performance using LangWatch analytics. Use when the user wants to understand costs, latency, error rates, usage trends, or debug specific traces. Works with any LangWatch-instrumented agent.