Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA
Sher Badshah, Hassan Sajjad
The emergence of Large Language Models (LLMs) as chat assistants capable of generating human-like conversations has amplified the need for robust evaluation methods, particularly for open-ended tasks. Conventional metrics such as Exact Match (EM) and F1, while useful, are inadequate for capturing the full semantics and contextual depth of such generative outputs. We propose a reference-guided verdict method that automates the evaluation process by leveraging multiple LLMs as judges. Through experiments on free-form question-answering tasks, we demonstrate that combining multiple models improves the reliability and accuracy of evaluations, particularly in tasks where a single model may struggle. The results indicate a strong correlation with human evaluations, establishing the proposed method as a reliable alternative to conventional metrics.
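To make the setup concrete, the sketch below shows one way a reference-guided, multi-judge verdict could be implemented: each judge LLM sees the question, the reference answer, and the candidate answer, and the final verdict is a majority vote over the judges' individual verdicts. This is a minimal illustration, not the paper's exact protocol; the `Judge` interface, the prompt wording, and the binary correct/incorrect aggregation are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical judge interface: each judge takes a prompt string and
# returns its raw text verdict (e.g., "correct" / "incorrect").
Judge = Callable[[str], str]

# Illustrative prompt; the paper's actual judging prompt may differ.
JUDGE_PROMPT = (
    "You are evaluating a free-form QA answer.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Is the candidate answer correct with respect to the reference? "
    "Reply with exactly one word: correct or incorrect."
)

def reference_guided_verdict(
    question: str,
    reference: str,
    candidate: str,
    judges: List[Judge],
) -> str:
    """Ask each judge LLM for a verdict and return the majority vote."""
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    )
    verdicts = []
    for judge in judges:
        raw = judge(prompt).strip().lower()
        # Normalize free-form replies to a binary label.
        verdicts.append("correct" if raw.startswith("correct") else "incorrect")
    return Counter(verdicts).most_common(1)[0][0]

# Toy usage with stub judges standing in for real LLM API calls.
if __name__ == "__main__":
    stub_judges: List[Judge] = [
        lambda p: "correct",
        lambda p: "Correct.",
        lambda p: "incorrect",
    ]
    print(reference_guided_verdict(
        question="Who wrote Hamlet?",
        reference="William Shakespeare",
        candidate="It was written by Shakespeare.",
        judges=stub_judges,
    ))  # -> "correct"
```

In practice each `Judge` would wrap a call to a different LLM, so the majority vote aggregates independent model judgments rather than repeated samples from one model, which is what the abstract credits for the improved reliability.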