Evaluation

Measuring model quality, covering both task performance and safety.

Topics

  • Task metrics (accuracy, F1, BLEU, ROUGE)
  • Human eval vs automated eval
  • Prompt robustness and adversarial testing
  • Safety, bias, and fairness checks
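
As an illustration of the first topic, here is a minimal sketch of two of the listed task metrics, accuracy and F1, written in plain Python. The function names, the `positive` parameter, and the binary-label setup are assumptions made for this example; they do not come from any particular library, and a real evaluation would more likely rely on an established metrics package.

```python
from typing import Hashable, Sequence


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of predictions that exactly match the reference labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_binary(
    y_true: Sequence[Hashable],
    y_pred: Sequence[Hashable],
    positive: Hashable = 1,  # hypothetical parameter: the label treated as "positive"
) -> float:
    """Harmonic mean of precision and recall for one positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined),
        # so this sketch reports F1 as 0.0 by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, with references `[1, 0, 1, 1, 0]` and predictions `[1, 0, 0, 1, 1]`, three of five labels match, so accuracy is 0.6; there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and F1 is 2/3. BLEU and ROUGE compare generated text against references by n-gram overlap and are better taken from a maintained implementation than re-derived by hand.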