Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA
Sher Badshah, Hassan Sajjad
The emergence of Large Language Models (LLMs) as chat assistants capable of generating human-like conversations has amplified the need for robust evaluation methods, particularly for open-ended tasks. Conventional metrics such as Exact Match (EM) and F1, while useful, are inadequate for capturing the full semantics and contextual depth of such generative outputs. We propose a reference-guided verdict method that automates the evaluation process by leveraging multiple LLMs as judges. Through experiments on free-form question-answering tasks, we demonstrate that combining multiple models improves the reliability and accuracy of evaluations, particularly in tasks where a single model may struggle. The results indicate a strong correlation with human evaluations, establishing the proposed method as a reliable alternative to conventional metrics.
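To make the setup concrete, the sketch below shows one way a reference-guided, multi-judge verdict could be implemented: each judge LLM sees the question, the reference answer, and the candidate answer, and the final verdict is a majority vote over the judges' individual verdicts. This is a minimal illustration, not the paper's exact protocol; the `Judge` interface, the prompt wording, and the binary correct/incorrect aggregation are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical judge interface: each judge takes a prompt string and
# returns its raw text verdict (e.g., "correct" / "incorrect").
Judge = Callable[[str], str]

# Illustrative prompt; the paper's actual judging prompt may differ.
JUDGE_PROMPT = (
    "You are evaluating a free-form QA answer.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Is the candidate answer correct with respect to the reference? "
    "Reply with exactly one word: correct or incorrect."
)

def reference_guided_verdict(
    question: str,
    reference: str,
    candidate: str,
    judges: List[Judge],
) -> str:
    """Ask each judge LLM for a verdict and return the majority vote."""
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    )
    verdicts = []
    for judge in judges:
        raw = judge(prompt).strip().lower()
        # Normalize free-form replies to a binary label.
        verdicts.append("correct" if raw.startswith("correct") else "incorrect")
    return Counter(verdicts).most_common(1)[0][0]

# Toy usage with stub judges standing in for real LLM API calls.
if __name__ == "__main__":
    stub_judges: List[Judge] = [
        lambda p: "correct",
        lambda p: "Correct.",
        lambda p: "incorrect",
    ]
    print(reference_guided_verdict(
        question="Who wrote Hamlet?",
        reference="William Shakespeare",
        candidate="It was written by Shakespeare.",
        judges=stub_judges,
    ))  # -> "correct"
```

In practice each `Judge` would wrap a call to a different LLM, so the majority vote aggregates independent model judgments rather than repeated samples from one model, which is what the abstract credits for the improved reliability.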