MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models
He Zhang and Wenqian Cui and Haoning Xu and Xiaohui Li and Lei Zhu and Shaohua Ma and Irwin King
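Full-Duplex Speech Language Models (FD-SLMs) enable real-time, overlapping conversational interactions, offering a more dynamic user experience than traditional half-duplex models. However, existing benchmarks focus mainly on evaluating single-round interactions and conversational features, overlooking the complexities of multi-round communication and critical capabilities such as instruction following and safety. Evaluating FD-SLMs in multi-round settings poses significant challenges, including blurred turn boundaries in communication and context inconsistency during model inference. To address these gaps, we introduce MTR-DuplexBench, a novel benchmark that segments continuous full-duplex dialogues into discrete turns, enabling comprehensive turn-by-turn evaluation of FD-SLMs across dialogue quality, conversational dynamics, instruction following, and safety. Experimental results show that current FD-SLMs struggle to maintain consistent performance across multiple rounds and evaluation dimensions, highlighting the necessity and effectiveness of the proposed benchmark. The benchmark and code will be made available in the future.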
Full-Duplex Speech Language Models (FD-SLMs) enable real-time, overlapping conversational interactions, offering a more dynamic user experience compared to traditional half-duplex models. However, existing benchmarks primarily focus on evaluating single-round interactions and conversational features, neglecting the complexities of multi-round communication and critical capabilities such as instruction following and safety. Evaluating FD-SLMs in multi-round settings poses significant challenges, ...