North Africans in NLP @ NAACL 2022

Schedule

Mode: Remote

Date: Sunday, July 10

7:45–8:00	Opening remarks
8:00–9:00	1st keynote
9:00–11:00	Poster session
11:00–11:15	Break
11:15–12:15	2nd keynote
12:15–12:30	Break
12:30–13:30	Panel discussion
13:30–14:00	Closing remarks (and possible open Q&A session)

Accepted Contributions

Tunisian dialectal speech recognition model
Abir Messaoudi, Hatem Haddad

The Tunisian dialect (TD) is considered an under-resourced language due to the lack of available data. Code-switching is its main characteristic: Tunisians alternate between two or more languages, namely Modern Standard Arabic (MSA) and the local Tunisian dialect, which is influenced mainly by French and English. Interest in building Automatic Speech Recognition (ASR) systems has increased, since most Tunisians use voice on a daily basis to perform search queries, send voice messages, and interact with voice assistants.

GOUD.MA: A News Article Dataset For Summarization In Moroccan Darija
Abderrahmane Issam and Khalil Mrini

Moroccan Darija is a vernacular spoken by over 30 million people, primarily in Morocco. Despite its high number of speakers, it remains a low-resource language. In this paper, we introduce GOUD.MA: a dataset of over 158k news articles for automatic summarization in code-switched Moroccan Darija. We analyze the dataset and find that it requires a high level of abstractive reasoning. We fine-tune the Arabic-language BERT (AraBERT) and the language models for the Moroccan (DarijaBERT) and Algerian (DziriBERT) national vernaculars for summarization on GOUD.MA. The results show that GOUD.MA is a challenging summarization benchmark dataset. We release our dataset publicly in an effort to encourage the diversity of evaluation tasks to improve language modeling in Moroccan Darija.
