SpikCommander: A High-performance Spiking Transformer with Multi-view Learning for Efficient Speech Command Recognition
Jiaqi Wang, Liutao Yu, Xiongri Shen, Sihang Guo, Chenlin Zhou, Leilei Zhao, Yi Zhong, Zhiguo Zhang, Zhengyu Ma
Spiking neural networks (SNNs) offer a promising path toward energy-efficient speech command recognition (SCR) by leveraging their event-driven processing paradigm. However, existing SNN-based SCR methods often struggle to capture rich temporal dependencies and contextual information from speech due to limited temporal modeling and binary spike-based representations. To address these challenges, we first introduce the multi-view spiking temporal-aware self-attention (MSTASA) module, which combines effective spiking temporal-aware attention with a multi-view learning framework to model complementary temporal dependencies in speech commands. Building on MSTASA, we further propose SpikCommander, a fully spike-driven transformer architecture that integrates MSTASA with a spiking contextual refinement channel MLP (SCR-MLP) to jointly enhance temporal context modeling and channel-wise feature integration. We evaluate our method on three benchmark datasets: the Spiking Heidelberg Dataset (SHD), Spiking Speech Commands (SSC), and Google Speech Commands V2 (GSC). Extensive experiments show that SpikCommander consistently outperforms state-of-the-art (SOTA) SNN methods with fewer parameters under comparable time steps, highlighting its effectiveness and efficiency for robust speech command recognition.
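The abstract does not give the exact MSTASA formulation, but the general idea of spike-based multi-view self-attention can be illustrated with a minimal sketch. The sketch below assumes a common spiking-transformer recipe (binary spikes from a Heaviside threshold, softmax-free attention over spike tensors, and view fusion by summation plus re-thresholding); all function names, the number of views, and the fusion rule are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def heaviside(x, thresh=1.0):
    # Binary spike generation: fire (1) when input crosses the threshold.
    return (x >= thresh).astype(np.float32)

def spiking_attention(s_q, s_k, s_v):
    # Softmax-free attention over binary spike tensors, as is common in
    # spiking transformers: integer-valued spike correlations, then
    # re-binarization so every activation remains a spike.
    attn = s_q @ s_k.T                                  # (T, T) spike correlations
    return heaviside(attn @ s_v, thresh=attn.shape[-1] * 0.5)

def multi_view_attention(x, n_views=3, thresh=1.0, seed=0):
    # Hypothetical multi-view wrapper: each view gets its own random
    # Q/K/V projections; view outputs are fused by summation and
    # re-thresholded back to binary spikes.
    rng = np.random.default_rng(seed)
    t, d = x.shape
    fused = np.zeros((t, d), dtype=np.float32)
    for _ in range(n_views):
        w_q, w_k, w_v = (rng.standard_normal((d, d)) * 0.5 for _ in range(3))
        s_q = heaviside(x @ w_q, thresh)
        s_k = heaviside(x @ w_k, thresh)
        s_v = heaviside(x @ w_v, thresh)
        fused += spiking_attention(s_q, s_k, s_v)
    return heaviside(fused, thresh=1.0)                 # binary output spikes

# Toy input: T=8 time steps, D=16 feature channels.
x = np.random.default_rng(1).standard_normal((8, 16)).astype(np.float32)
y = multi_view_attention(x)
print(y.shape)  # (8, 16); all entries are 0.0 or 1.0
```

The key property the sketch preserves is that every tensor passed between stages is binary, so the matrix products reduce to accumulate-only operations, which is the source of the energy efficiency claimed for spike-driven transformers.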