Understanding RLHF
Reinforcement Learning from Human Feedback (RLHF) is a technique for aligning language models with human preferences. Rather than relying solely on next-token prediction, RLHF uses human judgment to guide model behavior toward helpful, harmless, and honest outputs.
Table of Contents
- Core Concepts
- The RLHF Pipeline
- Preference Data
- Instruction Tuning
- Reward Modeling
- Policy Optimization
- Direct Alignment Algorithms
- Challenges
- Best Practices
- References