Alignment & Safety

Methods for aligning model behavior with human preferences and operational constraints.

Topics

  • RLHF/DPO and preference data
  • Guardrails, policy engines, allow/deny lists
  • Red teaming and jailbreak resilience
  • Privacy, security, and compliance
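As a pointer into the first topic, DPO (Direct Preference Optimization) trains directly on preference pairs instead of fitting a separate reward model. A minimal sketch of its per-pair loss, assuming you already have summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model (function and argument names here are illustrative, not from any particular library):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair.

    Each argument is the summed token log-probability of a response:
    `logp_*` under the policy being trained, `ref_logp_*` under the
    frozen reference model. `beta` scales the implicit reward.
    """
    # Implicit rewards: log-probability ratio between policy and reference
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(margin)), written as log1p(exp(-margin)) for stability
    margin = chosen_reward - rejected_reward
    return math.log1p(math.exp(-margin))
```

When the policy matches the reference model, the margin is zero and the loss is log 2; pushing probability toward the chosen response drives the loss toward zero.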