Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio
Xinlu He, Jacob Whitehill
Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recognizing words and attributing them to individual speakers, particularly in overlapping speech. Recent advances have driven a shift from cascaded systems to end-to-end (E2E) architectures, which reduce error propagation and better exploit the synergy between speech content and speaker identity. Despite rapid progress in E2E multi-speaker ASR, the field lacks a comprehensive review of recent developments. This survey provides a systematic taxonomy of E2E neural approaches for multi-speaker ASR, highlighting recent advances and comparative analysis. Specifically, we analyze: (1) architectural paradigms (SIMO vs. SISO) for pre-segmented audio, including their distinct characteristics and trade-offs; (2) recent architectural and algorithmic improvements based on these two paradigms; and (3) extensions to long-form speech, including segmentation strategies and speaker-consistent hypothesis stitching. In addition, we (4) evaluate and compare methods across standard benchmarks. Finally, we discuss open challenges and future research directions toward building robust and scalable multi-speaker ASR.