SPUR: A Plug-and-Play Framework for Integrating Spatial Audio Understanding and Reasoning into Large Audio-Language Models
S Sakshi and Vaibhavi Lokegaonkar and Neil Zhang and Ramani Duraiswami and Sreyan Ghosh and Dinesh Manocha and Lie Lu
Spatial perception is central to auditory intelligence, enabling accurate understanding of real-world acoustic scenes and advancing human-level perception of the world around us. While recent large audio-language models (LALMs) show strong reasoning over complex audio, most operate on monaural inputs and lack the ability to capture spatial cues such as direction, elevation, and distance. We introduce SPUR, a lightweight, plug-in approach that equips LALMs with spatial perception through minimal architectural changes. SPUR consists of: (i) a first-order Ambisonics (FOA) encoder that maps the (W, X, Y, Z) channels to rotation-aware, listener-centric spatial features, integrated into the target LALM through a multimodal adapter; and (ii) SPUR-Set, a spatial question-answering dataset combining open-source FOA recordings with controlled simulations, emphasizing relative direction, elevation, distance, and overlap for supervised spatial reasoning. Fine-tuning our model on SPUR-Set consistently improves spatial QA and multi-speaker attribution while preserving general audio understanding. SPUR provides a simple recipe for transforming monaural LALMs into spatially aware models. Extensive ablations validate the effectiveness of our approach.
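To make the FOA spatial cues concrete, the following is a minimal sketch of how direction information can be read out of the (W, X, Y, Z) channels using the classical time-averaged acoustic intensity vector. This is an illustrative signal-processing baseline, not SPUR's learned encoder; the function name and the plane-wave encoding convention are assumptions for the example.

```python
import numpy as np

def foa_direction_cues(w, x, y, z):
    """Estimate azimuth/elevation (degrees) from first-order Ambisonics
    (W, X, Y, Z) channels via the time-averaged acoustic intensity vector.
    Illustrative classical cue extraction, not SPUR's learned encoder."""
    # Active intensity components: pressure (W) times the velocity-aligned
    # directional channels (X, Y, Z), averaged over time.
    ix, iy, iz = np.mean(w * x), np.mean(w * y), np.mean(w * z)
    azimuth = np.degrees(np.arctan2(iy, ix))                  # left/right angle
    elevation = np.degrees(np.arctan2(iz, np.hypot(ix, iy)))  # up/down angle
    return azimuth, elevation

# Synthetic plane wave arriving from azimuth 45 deg, elevation 0
# (idealized encoding: X = cos(az) * s, Y = sin(az) * s, Z = 0).
t = np.linspace(0.0, 1.0, 16000)
s = np.sin(2 * np.pi * 440.0 * t)
az = np.radians(45.0)
w, x, y, z = s, s * np.cos(az), s * np.sin(az), np.zeros_like(s)
print(foa_direction_cues(w, x, y, z))  # azimuth near 45.0, elevation near 0.0
```

A learned encoder, as described above, would replace this hand-crafted readout with trainable features while consuming the same four-channel input.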