Unsloth RL Training

Unsloth offers a lightweight interface for reinforcement learning on top of popular Llama-class models. It focuses on efficient PPO-style fine-tuning, where the model is updated from reward signals rather than from fixed target completions as in supervised fine-tuning.

Key Points

  • Wrap a base model with FastLanguageModel and provide a reward function (a heuristic sketch follows this list).
  • Use proximal policy optimization (PPO) steps to update the policy.
  • Log metrics such as reward, KL divergence, and token usage for debugging.
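
As a concrete illustration of the first point, the sketch below scores a translation prompt with a simple keyword heuristic. The function name and scoring rule are illustrative only, not part of the Unsloth API; in practice the reward may instead come from a learned reward model.

# Hypothetical heuristic reward: higher score when the generated text looks French.
# Any callable mapping (prompt, generated text) -> float can serve as the reward signal.
def heuristic_reward(prompt: str, output: str) -> float:
    french_markers = ["bonjour", "monde", "merci", "bonne"]
    hits = sum(1 for marker in french_markers if marker in output.lower())
    return hits / len(french_markers)  # reward scaled to [0, 1]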

Minimal Example

from unsloth import FastLanguageModel
from unsloth.ppo import PPOTrainer, RLConfig

# Load a 4-bit quantized base model and its tokenizer.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# A small batch and a single PPO epoch per batch keep the example lightweight.
config = RLConfig(batch_size=4, lr=1e-5, ppo_epochs=1)
trainer = PPOTrainer(model, tokenizer, config)

# One rollout: sample a completion, score it, and take a PPO update step.
prompt = "Translate to French: Hello world"
outputs = trainer.generate(prompt)
reward = heuristic_reward(prompt, outputs)  # any reward function, e.g. the sketch above
trainer.step(prompt, outputs, reward)
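
The single rollout above extends naturally to a short loop. The sketch below is illustrative only: it reuses the trainer.generate and trainer.step calls from the minimal example (and so inherits the same interface assumptions) and simply prints the reward at each step, in line with the logging advice in Key Points.

# Illustrative loop over a few prompts, reusing the trainer built above.
prompts = [
    "Translate to French: Hello world",
    "Translate to French: Good morning",
    "Translate to French: Thank you",
]

for step_idx, prompt in enumerate(prompts):
    outputs = trainer.generate(prompt)             # sample a completion
    reward = heuristic_reward(prompt, outputs)     # score it
    trainer.step(prompt, outputs, reward)          # PPO update on this rollout
    print(f"step={step_idx} reward={reward:.3f}")  # watch for reward collapse or spikes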

Notes

  • Reward models can be learned or heuristic.
  • Training often alternates between supervised fine-tuning and RL steps.
  • Monitor training stability; clip or penalize the KL divergence so the policy does not drift too far from the reference model (a sketch follows this list).
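
The KL point can be made concrete with a small reward-shaping sketch. The coefficient and clip threshold below are illustrative assumptions, not Unsloth defaults; the idea is simply to subtract a capped KL penalty from the raw reward so the policy cannot drift arbitrarily far from the reference model.

# Hypothetical KL-penalized reward shaping (values are illustrative).
KL_COEF = 0.1    # weight of the KL penalty
KL_CLIP = 10.0   # cap on the per-step KL contribution

def shaped_reward(raw_reward: float, kl_divergence: float) -> float:
    penalty = KL_COEF * min(max(kl_divergence, 0.0), KL_CLIP)
    return raw_reward - penalty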

Further Reading