Unsloth RL Training

Unsloth offers a lightweight interface for reinforcement learning on top of popular Llama-class models. It focuses on efficient PPO-style fine-tuning, where the model is updated from reward signals rather than from fixed target completions as in supervised fine-tuning.

Key Points

  • Wrap a base model with FastLanguageModel and provide a reward function (a heuristic sketch follows this list).
  • Use proximal policy optimization (PPO) steps to update the policy.
  • Log metrics such as reward, KL divergence, and token usage for debugging.
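
As a concrete illustration of the first point, the sketch below scores a translation prompt with a simple keyword heuristic. The function name and scoring rule are illustrative only, not part of the Unsloth API; in practice the reward may instead come from a learned reward model.

# Hypothetical heuristic reward: higher score when the generated text looks French.
# Any callable mapping (prompt, generated text) -> float can serve as the reward signal.
def heuristic_reward(prompt: str, output: str) -> float:
    french_markers = ["bonjour", "monde", "merci", "bonne"]
    hits = sum(1 for marker in french_markers if marker in output.lower())
    return hits / len(french_markers)  # reward scaled to [0, 1]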

Minimal Example

from unsloth import FastLanguageModel
from unsloth.ppo import PPOTrainer, RLConfig

# Load a 4-bit quantized base model and its tokenizer.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# A small batch and a single PPO epoch per batch keep the example lightweight.
config = RLConfig(batch_size=4, lr=1e-5, ppo_epochs=1)
trainer = PPOTrainer(model, tokenizer, config)

# One rollout: sample a completion, score it, and take a PPO update step.
prompt = "Translate to French: Hello world"
outputs = trainer.generate(prompt)
reward = heuristic_reward(prompt, outputs)  # any reward function, e.g. the sketch above
trainer.step(prompt, outputs, reward)
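
The single rollout above extends naturally to a short loop. The sketch below is illustrative only: it reuses the trainer.generate and trainer.step calls from the minimal example (and so inherits the same interface assumptions) and simply prints the reward at each step, in line with the logging advice in Key Points.

# Illustrative loop over a few prompts, reusing the trainer built above.
prompts = [
    "Translate to French: Hello world",
    "Translate to French: Good morning",
    "Translate to French: Thank you",
]

for step_idx, prompt in enumerate(prompts):
    outputs = trainer.generate(prompt)             # sample a completion
    reward = heuristic_reward(prompt, outputs)     # score it
    trainer.step(prompt, outputs, reward)          # PPO update on this rollout
    print(f"step={step_idx} reward={reward:.3f}")  # watch for reward collapse or spikes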

Notes

  • Reward models can be learned or heuristic.
  • Training often alternates between supervised fine-tuning and RL steps.
  • Monitor training stability; clip or penalize the KL divergence so the policy does not drift too far from the reference model (a sketch follows this list).
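
The KL point can be made concrete with a small reward-shaping sketch. The coefficient and clip threshold below are illustrative assumptions, not Unsloth defaults; the idea is simply to subtract a capped KL penalty from the raw reward so the policy cannot drift arbitrarily far from the reference model.

# Hypothetical KL-penalized reward shaping (values are illustrative).
KL_COEF = 0.1    # weight of the KL penalty
KL_CLIP = 10.0   # cap on the per-step KL contribution

def shaped_reward(raw_reward: float, kl_divergence: float) -> float:
    penalty = KL_COEF * min(max(kl_divergence, 0.0), KL_CLIP)
    return raw_reward - penalty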

Further Reading