
ADI-20: Arabic Dialect Identification dataset and models

Haroun Elleuch, Salima Mdhaffar, Yannick Estève, Fethi Bougares

arXiv

November 13, 2025

We present ADI-20, an extension of the previously published ADI-17 Arabic Dialect Identification (ADI) dataset. ADI-20 covers the dialects of all Arabic-speaking countries. It comprises 3,556 hours of speech from 19 Arabic dialects in addition to Modern Standard Arabic (MSA). We used this dataset to train and evaluate various state-of-the-art ADI systems. We explored fine-tuning pre-trained ECAPA-TDNN-based models, as well as Whisper encoder blocks coupled with an attention pooling layer and a dense classification layer. We investigated the effect of (i) training data size and (ii) the model's number of parameters on identification performance. Our results show only a small decrease in F1 score when using just 30% of the original training data. We open-source our collected data and trained models to enable reproduction of our work and to support further research in ADI.
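The classification head mentioned in the abstract (attention pooling followed by a dense layer) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the pooling variant, so the code assumes a common single-vector scoring form, uses random numbers in place of trained parameters, and uses random frame vectors as stand-ins for Whisper encoder outputs. All names (`attention_pool`, `dense`, `classify`, `score_vec`) are hypothetical.

```python
import math
import random

NUM_CLASSES = 20  # 19 dialects + MSA, as stated in the abstract


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, score_vec):
    """Collapse a (T x D) frame sequence into one D-dim vector.

    Each frame gets a scalar score (dot product with score_vec, an
    assumed scoring form); softmax over time turns the scores into
    weights, and the output is the weighted sum of the frames.
    """
    scores = [sum(f * w for f, w in zip(frame, score_vec)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames))
              for d in range(dim)]
    return pooled, weights


def dense(vec, weight_rows, bias):
    """Single linear layer: one output logit per row of weight_rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight_rows, bias)]


def classify(frames, score_vec, weight_rows, bias):
    """Attention pooling followed by a dense layer and softmax."""
    pooled, _ = attention_pool(frames, score_vec)
    return softmax(dense(pooled, weight_rows, bias))


# Toy inputs: random values stand in for encoder outputs and trained weights.
rng = random.Random(0)
T, D = 50, 8  # illustrative sizes, far smaller than a real encoder
frames = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)]
score_vec = [rng.gauss(0, 1) for _ in range(D)]
weight_rows = [[rng.gauss(0, 0.1) for _ in range(D)] for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

probs = classify(frames, score_vec, weight_rows, bias)
predicted_class = max(range(NUM_CLASSES), key=lambda i: probs[i])
```

The pooling step is what lets a variable-length utterance feed a fixed-size classifier: whatever the number of frames, the dense layer always sees a single D-dimensional vector.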

Computation and Language
