The Arabic language is a family of diverse dialects spread from Oman to Morocco. Across this region, Modern Standard Arabic (MSA) is the official shared language of culture, media, and education; however, it is not the form of Arabic that native speakers use on a daily basis, whether offline or online. Arabic dialects vary phonologically, morphologically, lexically, and, to a degree, syntactically. Dialect identification is a useful enabling technology that can help with user profiling and task adaptation. Arabic dialects are typically grouped coarsely into four or so regions: Levant, Gulf, Egypt, and Maghreb. In this talk, we present recent results on fine-grained dialect identification of text input, using 25 city-level and 21 country-level dialect labels. We discuss data collection, annotation, and automatic classification methods within the Multi-Arabic Dialect Applications and Resources (MADAR) project, and we report on a large shared task on this challenge that we organized in 2019.
Nizar Habash is an Associate Professor and Program Head of Computer Science at New York University Abu Dhabi (NYUAD). He is also the director of the Computational Approaches to Modeling Language (CAMeL) Lab. Professor Habash specializes in natural language processing and computational linguistics. Before joining NYUAD, he was a research scientist at Columbia University's Center for Computational Learning Systems. He received his PhD in Computer Science from the University of Maryland, College Park. His research includes extensive work on machine translation, morphological analysis, and computational modeling of Arabic and its dialects. Professor Habash has been a principal investigator or co-investigator on over 20 grants. He has over 150 publications, including a book entitled "Introduction to Arabic Natural Language Processing". His website is www.nizarhabash.com.