training-data-curation
Training Data Curation Guidelines
Best practices for gathering and preparing training data for LLM fine-tuning.
Data Quality Principles
Quality over quantity. Llama 2 used only 27,540 high-quality SFT examples and outperformed models trained on larger noisy datasets [1]. Focus on clean, diverse, well-formatted data.
Garbage in, garbage out. The model will learn patterns from your data—including errors, biases, and formatting issues. Inspect samples manually before training.
Match the target distribution. Training data should reflect the tasks and style you want the model to perform. If you want formal responses, don't train on casual chat data.
Format Requirements
Supervised Fine-Tuning (SFT)
Use the messages format (OpenAI/Anthropic/Tinker standard) [5]:
More from sundial-org/skills
icml-reviewer
|
82cs-research-methodology
Conduct a literature review and develop a CS research proposal. Use when asked to review a research area, find gaps in existing work, and propose a novel research contribution. The output is a research proposal identifying an assumption to challenge (the "bit flip") and how to validate it.
58ai-co-scientist
Transform Claude Code into an AI Scientist that orchestrates research workflows using tree-based hypothesis exploration. Triggers on "research project", "scientific experiment", "run experiments", "AI scientist", "tree search experimentation", "systematic study".
54tinker
Fine-tune LLMs using the Tinker API. Covers supervised fine-tuning, reinforcement learning, LoRA training, vision-language models, and both high-level Cookbook patterns and low-level API usage.
53project-referee
Critiques ML conference papers with reviewer-style feedback. Use when users want to anticipate reviewer concerns, identify weaknesses, check claim-evidence gaps, or find missing citations.
51commit-splitter
Split large sets of uncommitted changes into logical, well-organized commits. Use when the user has many uncommitted changes and wants structured commits, or proactively suggest when detecting a large diff that would benefit from splitting.
49