Abstract
Social media (SM) platforms such as Twitter provide large quantities of real-time data that can be leveraged during mass emergencies. Developing tools to support crisis-affected communities requires available datasets, which often do not exist for low-resource languages. This paper introduces Kawarith, a multi-dialect Arabic Twitter corpus for crisis events, comprising more than a million Arabic tweets collected during 22 crises that occurred between 2018 and 2020 and involved several types of hazard. Exploration of this content revealed the most discussed topics and information types, and the paper presents a labelled dataset from seven emergency events that serves as a gold standard for several tasks in crisis informatics research. Using annotated data from the same event, a BERT model is fine-tuned to classify tweets into different categories in the multilabel setting. Results show that BERT-based models yield good performance on this task even with small amounts of task-specific training data.
Original language | English |
---|---|
Title of host publication | Proceedings of the Sixth Arabic Natural Language Processing Workshop |
Editors | Nizar Habash, Houda Bouamor, Hazem Hajj, Walid Magdy, Wajdi Zaghouani, Fethi Bougares, Nadi Tomeh, Ibrahim Abu Farha, Samia Touileb |
Publisher | Association for Computational Linguistics, ACL |
Pages | 42-52 |
Number of pages | 11 |
ISBN (Electronic) | 9781954085091 |
Publication status | Published - 19 Apr 2021 |
Event | 6th Arabic Natural Language Processing Workshop, WANLP 2021 - Virtual, Kyiv, Ukraine; Duration: 19 Apr 2021 → 19 Apr 2021 |
Conference
Conference | 6th Arabic Natural Language Processing Workshop, WANLP 2021 |
---|---|
Country/Territory | Ukraine |
City | Kyiv |
Period | 19/04/21 → 19/04/21 |
Bibliographical note
Publisher Copyright: © WANLP 2021 - 6th Arabic Natural Language Processing Workshop
ASJC Scopus subject areas
- Language and Linguistics
- Computational Theory and Mathematics
- Software
- Linguistics and Language