Evaluation

Measuring model quality for tasks and safety.

Topics

- Task metrics (accuracy, F1, BLEU, ROUGE)
- Human eval vs automated eval
- Prompt robustness and adversarial testing
- Safety, bias, and fairness checks
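
As a starting point for the task-metrics topic, accuracy and F1 can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names `accuracy` and `f1_score`, the `positive` parameter, and the sample labels are all assumptions made for the example, and it covers only the binary, single-label case. BLEU and ROUGE need tokenization and n-gram matching and are usually taken from an existing library.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that exactly match the reference labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot score an empty set of examples")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for the `positive` class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    if tp == 0:
        # No true positives: report 0.0 rather than dividing by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical labels, invented for this example only.
    y_true = [1, 0, 1, 1, 0, 1]
    y_pred = [1, 0, 0, 1, 1, 1]
    print(f"accuracy = {accuracy(y_true, y_pred):.3f}")  # accuracy = 0.667
    print(f"F1       = {f1_score(y_true, y_pred):.3f}")  # F1       = 0.750
```

Accuracy and F1 diverge most on imbalanced data, where a model that always predicts the majority class can score high accuracy while its F1 on the minority class is zero, so it may be worth reporting both.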