DPO
Direct Preference Optimization (DPO) is an LLM fine-tuning method that bypasses the complexity of Reinforcement Learning from Human Feedback (RLHF), directly optimizing models for human preferences with a simple classification loss.
DPO is a streamlined, single-stage approach to aligning Large Language Models (LLMs) with human feedback. It eliminates the two-stage RLHF pipeline, specifically removing the separate, often unstable, reward model and the complex Proximal Policy Optimization (PPO) training loop. The core innovation, introduced by Rafailov et al. in a 2023 paper, is a reparameterization of the RLHF objective: the reward is expressed in terms of the policy itself, so the policy can be optimized directly with a simple binary cross-entropy loss on preference pairs (chosen versus rejected responses). The resulting training procedure is more stable, computationally lightweight, and matches or exceeds PPO-based RLHF performance on key tasks such as sentiment control and summarization.
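The binary cross-entropy objective described above can be sketched for a single preference pair. This is a minimal illustration, not a production implementation: the function name and scalar inputs are hypothetical, and in practice the log-probabilities would be summed token log-likelihoods computed by the policy and a frozen reference model over whole responses.

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) preference pair.

    Each argument is the total log-probability of a response under the
    trainable policy or the frozen reference model. beta controls how
    strongly the policy is penalized for drifting from the reference.
    """
    # Implicit reward of each response: beta * log(pi / pi_ref)
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # Binary cross-entropy on the margin: -log sigmoid(margin).
    # The loss shrinks as the policy prefers the chosen response more
    # (relative to the reference) than the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy has not yet moved away from the reference, both log-ratios are zero and the loss starts at `log 2`; gradient descent then pushes the chosen response's relative likelihood up and the rejected one's down.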