Abstract
Twitter and other social media platforms offer users the chance to share their ideas via short posts. While the easy exchange of ideas has value, these microblogs can be exploited by people who want to spread hatred, and such individuals can share negative views about an individual, race, or group with millions of people at the click of a button. There is thus an urgent need to establish a method that can automatically identify hate speech and offensive language. To contribute to this development, a shared task on detecting offensive language in Arabic was held during the OSACT4 workshop. A key challenge was the distinctive language used on social media, which gives rise to the out-of-vocabulary (OOV) problem. The use of different Arabic dialects exacerbates this problem. To deal with the issues associated with OOV, we generated a character-level embeddings model, which was trained on a large, carefully collected dataset. Embeddings at this level can handle OOV words effectively because they learn vectors for character n-grams, or parts of words. The proposed systems were ranked 7th and 8th for Subtasks A and B, respectively.
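
The idea of building a vector for an unseen word from its character n-grams can be illustrated with a minimal sketch. This is not the model from the paper, whose architecture and training setup are not described in this record. It assumes a fastText-style scheme (boundary markers `<` and `>`, n-grams of length 3 to 5, averaging), and the function names `char_ngrams`, `ngram_vector`, and `word_vector` are hypothetical. In place of learned n-gram vectors, it uses deterministic hash-derived values purely as a stand-in.

```python
import hashlib


def char_ngrams(word, n_min=3, n_max=5):
    """Return the character n-grams of a word wrapped in boundary markers."""
    marked = f"<{word}>"
    grams = [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]
    # Very short words may yield no n-grams; fall back to the marked word.
    return grams or [marked]


def ngram_vector(gram, dim=8):
    """Stand-in for a learned n-gram vector (assumption: hash-derived values).

    A trained model would look this vector up in an embedding table instead.
    """
    digest = hashlib.sha256(gram.encode("utf-8")).digest()
    return [(b - 127.5) / 127.5 for b in digest[:dim]]


def word_vector(word, dim=8):
    """Average the n-gram vectors, so any word, seen or unseen, gets a vector."""
    vectors = [ngram_vector(g, dim) for g in char_ngrams(word)]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


if __name__ == "__main__":
    # Two related surface forms share several n-grams, so their averaged
    # vectors draw on overlapping components.
    shared = set(char_ngrams("playing")) & set(char_ngrams("played"))
    print(sorted(shared))
    print(word_vector("كلمة"))
```

Because every word decomposes into n-grams, a misspelled or dialectal form that never appeared in training still receives a vector composed from the pieces it shares with known words.
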
Original language | English |
---|---|
Title of host publication | Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection |
Subtitle of host publication | at LREC 2020 - Language Resources and Evaluation Conference |
Editors | Hend Al-Khalifa, Walid Magdy, Kareem Darwish, Tamer Elsayed, Hamdy Mubarak |
Publisher | European Language Resources Association (ELRA) |
Pages | 91-96 |
ISBN (Print) | 9791095546511 |
Publication status | Published - 12 May 2020 |
Event | 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection (OSACT4 2020) - Marseille, France. Duration: 11 May 2020 → 16 May 2020 |
Conference
Conference | 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection (OSACT4 2020) |
---|---|
Country/Territory | France |
City | Marseille |
Period | 11/05/20 → 16/05/20 |
Keywords
- character-level embeddings
- word-level embeddings
- Arabic offensive language detection