Kawarith: An Arabic Twitter Corpus for Crisis Events

Alaa Alharbi, Mark Lee

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Social media (SM) platforms such as Twitter provide large quantities of real-time data that can be leveraged during mass emergencies. Developing tools to support crisis-affected communities requires available datasets, which often do not exist for low-resource languages. This paper introduces Kawarith, a multi-dialect Arabic Twitter corpus for crisis events, comprising more than a million Arabic tweets collected during 22 crises that occurred between 2018 and 2020 and involved several types of hazard. Exploration of this content revealed the most discussed topics and information types, and the paper presents a labelled dataset from seven emergency events that serves as a gold standard for several tasks in crisis informatics research. Using annotated data from the same event, a BERT model is fine-tuned to classify tweets into different categories in the multilabel setting. Results show that BERT-based models yield good performance on this task even with small amounts of task-specific training data.

Original language: English
Title of host publication: Proceedings of the Sixth Arabic Natural Language Processing Workshop
Editors: Nizar Habash, Houda Bouamor, Hazem Hajj, Walid Magdy, Wajdi Zaghouani, Fethi Bougares, Nadi Tomeh, Ibrahim Abu Farha, Samia Touileb
Publisher: Association for Computational Linguistics, ACL
Pages: 42-52
Number of pages: 11
ISBN (Electronic): 9781954085091
Publication status: Published - 19 Apr 2021
Event: 6th Arabic Natural Language Processing Workshop, WANLP 2021 - Virtual, Kyiv, Ukraine
Duration: 19 Apr 2021 → 19 Apr 2021

Conference

Conference: 6th Arabic Natural Language Processing Workshop, WANLP 2021
Country/Territory: Ukraine
City: Kyiv
Period: 19/04/21 → 19/04/21

Bibliographical note

Publisher Copyright:
© WANLP 2021 - 6th Arabic Natural Language Processing Workshop

ASJC Scopus subject areas

  • Language and Linguistics
  • Computational Theory and Mathematics
  • Software
  • Linguistics and Language
