Combining Character and Word Embeddings for the Detection of Offensive Language in Arabic

Abdullah Alharbi; Mark Lee

Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Twitter and other social media platforms offer users the chance to share their ideas via short posts. While the easy exchange of ideas has value, these microblogs can be leveraged by people who want to share hatred, and such individuals can share negative views about an individual, race, or group with millions of people at the click of a button. There is thus an urgent need to establish a method that can automatically identify hate speech and offensive language. To contribute to this development, during the OSACT4 workshop, a shared task was undertaken to detect offensive language in Arabic. A key challenge was the uniqueness of the language used on social media, prompting the out-of-vocabulary (OOV) problem. In addition, the use of different dialects in Arabic exacerbates this problem. To deal with the issues associated with OOV, we generated a character-level embeddings model, which was trained on a massive data collected carefully. This level of embeddings can work effectively in resolving the problem of OOV words through its ability to learn the vectors of character n-grams or parts of words. The proposed systems were ranked 7th and 8th for Subtasks A and B, respectively.
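The abstract's point about character n-grams can be illustrated with a minimal sketch. It is not the authors' model: the n-gram range, the hashing scheme, the table size, and the randomly initialised vectors are all assumptions made for illustration, and the names `char_ngrams`, `word_vector`, and `NGRAM_TABLE` are hypothetical. It only shows why a word never seen in training can still receive a vector, namely by averaging the vectors of its character n-grams.

```python
import random
import zlib

DIM = 8          # embedding size (illustrative assumption)
BUCKETS = 1000   # number of hashed n-gram slots (illustrative assumption)


def char_ngrams(word, n_min=3, n_max=5):
    """Return the character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


# Stand-in for a trained table: one random vector per hash bucket.
# A real model would learn these vectors from a large corpus.
rng = random.Random(0)
NGRAM_TABLE = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(BUCKETS)]


def word_vector(word):
    """Average the vectors of the word's n-grams; also works for unseen words."""
    grams = char_ngrams(word)
    rows = [NGRAM_TABLE[zlib.crc32(g.encode("utf-8")) % BUCKETS] for g in grams]
    return [sum(col) / len(rows) for col in zip(*rows)]
```

Because the lookup is keyed on n-grams rather than whole words, a misspelled or dialectal variant shares most of its n-grams, and therefore most of its vector, with the standard form.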

Original language	English
Title of host publication	Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection
Subtitle of host publication	at LREC 2020 - Language Resources and Evaluation Conference
Editors	Hend Al-Khalifa, Walid Magdy, Kareem Darwish, Tamer Elsayed, Hamdy Mubarak
Publisher	European Language Resources Association (ELRA)
Pages	91-96
ISBN (Print)	9791095546511
Publication status	Published - 12 May 2020
Event	4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection (OSACT4 2020) - Marseille, France. Duration: 11 May 2020 → 16 May 2020

Conference

Conference	4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection (OSACT4 2020)
Country/Territory	France
City	Marseille
Period	11/05/20 → 16/05/20

Keywords

character-level embeddings
word-level embeddings
Arabic offensive language detection

Access to Document

https://www.aclweb.org/anthology/2020.osact-1.15.pdf
Licence: Creative Commons: Attribution-NonCommercial (CC BY-NC)

Cite this

Alharbi, A., & Lee, M. (2020). Combining Character and Word Embeddings for the Detection of Offensive Language in Arabic. In H. Al-Khalifa, W. Magdy, K. Darwish, T. Elsayed, & H. Mubarak (Eds.), Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection: at LREC 2020 - Language Resources and Evaluation Conference (pp. 91-96). European Language Resources Association (ELRA). https://www.aclweb.org/anthology/2020.osact-1.15.pdf

Alharbi, Abdullah ; Lee, Mark. / Combining Character and Word Embeddings for the Detection of Offensive Language in Arabic. Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection: at LREC 2020 - Language Resources and Evaluation Conference. editor / Hend Al-Khalifa ; Walid Magdy ; Kareem Darwish ; Tamer Elsayed ; Hamdy Mubarak. European Language Resources Association (ELRA), 2020. pp. 91-96

@inproceedings{59f058bcb09b4d1c919f21ca5862fb0b,
  title = "Combining Character and Word Embeddings for the Detection of Offensive Language in Arabic",
  abstract = "Twitter and other social media platforms offer users the chance to share their ideas via short posts. While the easy exchange of ideas has value, these microblogs can be leveraged by people who want to share hatred, and such individuals can share negative views about an individual, race, or group with millions of people at the click of a button. There is thus an urgent need to establish a method that can automatically identify hate speech and offensive language. To contribute to this development, during the OSACT4 workshop, a shared task was undertaken to detect offensive language in Arabic. A key challenge was the uniqueness of the language used on social media, prompting the out-of-vocabulary (OOV) problem. In addition, the use of different dialects in Arabic exacerbates this problem. To deal with the issues associated with OOV, we generated a character-level embeddings model, which was trained on a massive data collected carefully. This level of embeddings can work effectively in resolving the problem of OOV words through its ability to learn the vectors of character n-grams or parts of words. The proposed systems were ranked 7th and 8th for Subtasks A and B, respectively.",
  keywords = "character-level embeddings, word-level embeddings, Arabic offensive language detection",
  author = "Abdullah Alharbi and Mark Lee",
  year = "2020",
  month = may,
  day = "12",
  language = "English",
  isbn = "9791095546511",
  pages = "91--96",
  editor = "Hend Al-Khalifa and Walid Magdy and Kareem Darwish and Tamer Elsayed and Hamdy Mubarak",
  booktitle = "Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection",
  publisher = "European Language Resources Association (ELRA)",
  note = "4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection (OSACT4 2020) ; Conference date: 11-05-2020 Through 16-05-2020",
}

Alharbi, A & Lee, M 2020, Combining Character and Word Embeddings for the Detection of Offensive Language in Arabic. in H Al-Khalifa, W Magdy, K Darwish, T Elsayed & H Mubarak (eds), Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection: at LREC 2020 - Language Resources and Evaluation Conference. European Language Resources Association (ELRA), pp. 91-96, 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection (OSACT4 2020), Marseille, France, 11/05/20. <https://www.aclweb.org/anthology/2020.osact-1.15.pdf>

Combining Character and Word Embeddings for the Detection of Offensive Language in Arabic. / Alharbi, Abdullah; Lee, Mark.
Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection: at LREC 2020 - Language Resources and Evaluation Conference. ed. / Hend Al-Khalifa; Walid Magdy; Kareem Darwish; Tamer Elsayed; Hamdy Mubarak. European Language Resources Association (ELRA), 2020. p. 91-96.

TY - GEN
T1 - Combining Character and Word Embeddings for the Detection of Offensive Language in Arabic
AU - Alharbi, Abdullah
AU - Lee, Mark
PY - 2020/5/12
Y1 - 2020/5/12
N2 - Twitter and other social media platforms offer users the chance to share their ideas via short posts. While the easy exchange of ideas has value, these microblogs can be leveraged by people who want to share hatred, and such individuals can share negative views about an individual, race, or group with millions of people at the click of a button. There is thus an urgent need to establish a method that can automatically identify hate speech and offensive language. To contribute to this development, during the OSACT4 workshop, a shared task was undertaken to detect offensive language in Arabic. A key challenge was the uniqueness of the language used on social media, prompting the out-of-vocabulary (OOV) problem. In addition, the use of different dialects in Arabic exacerbates this problem. To deal with the issues associated with OOV, we generated a character-level embeddings model, which was trained on a massive data collected carefully. This level of embeddings can work effectively in resolving the problem of OOV words through its ability to learn the vectors of character n-grams or parts of words. The proposed systems were ranked 7th and 8th for Subtasks A and B, respectively.
AB - Twitter and other social media platforms offer users the chance to share their ideas via short posts. While the easy exchange of ideas has value, these microblogs can be leveraged by people who want to share hatred, and such individuals can share negative views about an individual, race, or group with millions of people at the click of a button. There is thus an urgent need to establish a method that can automatically identify hate speech and offensive language. To contribute to this development, during the OSACT4 workshop, a shared task was undertaken to detect offensive language in Arabic. A key challenge was the uniqueness of the language used on social media, prompting the out-of-vocabulary (OOV) problem. In addition, the use of different dialects in Arabic exacerbates this problem. To deal with the issues associated with OOV, we generated a character-level embeddings model, which was trained on a massive data collected carefully. This level of embeddings can work effectively in resolving the problem of OOV words through its ability to learn the vectors of character n-grams or parts of words. The proposed systems were ranked 7th and 8th for Subtasks A and B, respectively.
KW - character-level embeddings
KW - word-level embeddings
KW - Arabic offensive language detection
M3 - Conference contribution
SN - 9791095546511
SP - 91
EP - 96
BT - Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection
A2 - Al-Khalifa, Hend
A2 - Magdy, Walid
A2 - Darwish, Kareem
A2 - Elsayed, Tamer
A2 - Mubarak, Hamdy
PB - European Language Resources Association (ELRA)
T2 - 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection (OSACT4 2020)
Y2 - 11 May 2020 through 16 May 2020
ER -

Alharbi A, Lee M. Combining Character and Word Embeddings for the Detection of Offensive Language in Arabic. In Al-Khalifa H, Magdy W, Darwish K, Elsayed T, Mubarak H, editors, Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection: at LREC 2020 - Language Resources and Evaluation Conference. European Language Resources Association (ELRA). 2020. p. 91-96