Phare: A Safety Probe for Large Language Models
Pierre Le Jeune, Benoît Malézieux, Weixuan Xiao, Matteo Dora
Ensuring the safety of large language models (LLMs) is critical for responsible deployment, yet existing evaluations often prioritize performance over identifying failure modes. We introduce Phare, a multilingual diagnostic framework to probe and evaluate LLM behavior across three critical dimensions: hallucination and reliability, social biases, and harmful content generation. Our evaluation of 17 state-of-the-art LLMs reveals patterns of systematic vulnerabilities across all safety dimensions, including sycophancy, prompt sensitivity, and stereotype reproduction. By highlighting these specific failure modes rather than simply ranking models, Phare provides researchers and practitioners with actionable insights to build more robust, aligned, and trustworthy language systems.