Training Data Curation Guidelines

Best practices for gathering and preparing training data for LLM fine-tuning.

Data Quality Principles

Quality over quantity. Llama 2's SFT stage used only 27,540 high-quality examples and outperformed models trained on larger, noisier datasets [1]. Focus on clean, diverse, well-formatted data.

Garbage in, garbage out. The model will learn patterns from your data—including errors, biases, and formatting issues. Inspect samples manually before training.
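A manual inspection pass can be as simple as printing a random sample of records. The sketch below assumes the dataset is stored as JSONL (one JSON object per line); the function name and file path are illustrative, not part of any specific tool.

```python
import json
import random

def spot_check(path: str, n: int = 20, seed: int = 0) -> list:
    """Load a JSONL dataset and print a random sample for manual review."""
    with open(path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    random.seed(seed)
    sample = random.sample(records, min(n, len(records)))
    for rec in sample:
        # Truncate very long records so the sample stays skimmable.
        print(json.dumps(rec, indent=2)[:500])
        print("-" * 40)
    return sample
```

Fixing the seed makes the review reproducible, so two curators looking at the same file see the same sample.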

Match the target distribution. Training data should reflect the tasks and style you want the model to perform. If you want formal responses, don't train on casual chat data.

Format Requirements

Supervised Fine-Tuning (SFT)

Use the messages format (OpenAI/Anthropic/Tinker standard) [5]:
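A minimal sketch of one SFT record in the messages format, with a few structural checks worth running before a record enters the training set. The field names follow the common role/content convention; the validation rules shown (roles limited to system/user/assistant, final turn from the assistant) are typical conventions, not requirements of any one framework.

```python
import json

# One SFT training example: a conversation as a list of role/content
# turns. The "system" turn is optional.
example = {
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "What is supervised fine-tuning?"},
        {"role": "assistant", "content": "Training a model on curated prompt-response pairs."},
    ]
}

VALID_ROLES = {"system", "user", "assistant"}

def validate(record: dict) -> None:
    """Cheap structural checks on a messages-format record."""
    msgs = record["messages"]
    assert msgs, "empty conversation"
    assert all(m["role"] in VALID_ROLES for m in msgs), "unknown role"
    # The last turn is the target the model is trained to produce.
    assert msgs[-1]["role"] == "assistant", "last turn must be assistant"

validate(example)
# Datasets are typically stored one JSON object per line (JSONL).
line = json.dumps(example)
```

Validating structure up front is cheap and catches formatting issues (missing roles, truncated turns) before they silently become training signal.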
