training-data-curation
Training Data Curation Guidelines
Best practices for gathering and preparing training data for LLM fine-tuning.
Data Quality Principles
Quality over quantity. Llama 2 used only 27,540 high-quality SFT examples and outperformed models trained on larger noisy datasets [1]. Focus on clean, diverse, well-formatted data.
Garbage in, garbage out. The model will learn patterns from your data—including errors, biases, and formatting issues. Inspect samples manually before training.
Match the target distribution. Training data should reflect the tasks and style you want the model to perform. If you want formal responses, don't train on casual chat data.
Format Requirements
Supervised Fine-Tuning (SFT)
Use the messages format (OpenAI/Anthropic/Tinker standard) [5]:
More from m4n5ter/skills
ipynb-notebooks
面向 .ipynb Notebook(Jupyter / JupyterLab / Google Colab / VS Code)的创建、审阅、重构与展示。涵盖工程化目录结构、token 高效处理、演示/分享模式、以及 uv/venv 可复现工作流。
16jj-vcs
面向 Jujutsu(jj) 版本控制的使用、工作流、revset/fileset 语法、Git 互操作与配置排错指导。用于解答 jj 命令与概念差异、迁移 Git 流程到 jj、处理冲突/回滚、配置与远程书签相关问题。
1docx
全面的文档创建、编辑和分析,支持修订(tracked changes)、批注、格式保留和文本提取。当需要处理专业文档(.docx 文件)用于:(1)创建新文档,(2)修改或编辑内容,(3)处理修订,(4)添加批注,或任何其他文档任务时使用。
1xlsx
全面的电子表格创建、编辑和分析,支持公式、格式设置、数据分析和可视化。当需要处理电子表格(.xlsx, .xlsm, .csv, .tsv 等)以进行以下操作时使用:(1) 创建带有公式和格式的新电子表格,(2) 读取或分析数据,(3) 在保留公式的同时修改现有电子表格,(4) 电子表格中的数据分析和可视化,或 (5) 重新计算公式
1agent-browser
Automates browser interactions for web testing, form filling, screenshots, and data extraction. Use when the user needs to navigate websites, interact with web pages, fill forms, take screenshots, test web applications, or extract information from web pages.
1tinker-training-cost
Calculate training costs for Tinker fine-tuning jobs. Use when estimating costs for Tinker LLM training, counting tokens in datasets, or comparing Tinker model training prices. Tokenizes datasets using the correct model tokenizer and provides accurate cost estimates.
1