North Africans in NLP @ NAACL 2022
Schedule
Mode: Remote
Date: Sunday, July 10
7:45–8:00 | Opening remarks |
8:00–9:00 | 1st keynote |
9:00–11:00 | Poster session |
11:00–11:15 | Break |
11:15–12:15 | 2nd keynote |
12:15–12:30 | Break |
12:30–13:30 | Panel discussion |
13:30–14:00 | Closing remarks (and possible open Q&A session) |
Accepted Contributions
Tunisian dialectal speech recognition model
Abir Messaoudi, Hatem Haddad
The Tunisian dialect (TD) is considered an under-resourced language due to the lack of available data. Code-switching is the main characteristic of the Tunisian dialect: Tunisians alternate between two or more languages, namely Modern Standard Arabic (MSA) and the local Tunisian dialect, which is influenced mainly by French and English. Interest in building Automatic Speech Recognition (ASR) systems has increased, since most Tunisians use voice on a daily basis to perform voice search queries, send voice messages, and interact with voice assistants.
GOUD.MA: A News Article Dataset For Summarization In Moroccan Darija
Abderrahmane Issam and Khalil Mrini
Moroccan Darija is a vernacular spoken by over 30 million people, primarily in Morocco. Despite a high number of speakers, it remains a low-resource language. In this paper, we introduce GOUD.MA: a dataset of over 158k news articles for automatic summarization in code-switched Moroccan Darija. We analyze the dataset and find that it requires a high level of abstractive reasoning. We fine-tune the Arabic-language BERT (AraBERT) and the language models for the Moroccan (DarijaBERT) and Algerian (DziriBERT) national vernaculars for summarization on GOUD.MA. The results show that GOUD.MA is a challenging summarization benchmark dataset. We release our dataset publicly in an effort to encourage the diversity of evaluation tasks to improve language modeling in Moroccan Darija.