Automatic building of Arabic multi dialect text corpora by bootstrapping dialect words

Khalid Almeman; Mark Lee

doi:10.1109/ICCSPA.2013.6487247

Khalid Almeman^*, Mark Lee

^*Corresponding author for this work

Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

19 Citations (Scopus)

Abstract

The principal objective of this work is to build multi-dialect Arabic text corpora using a web corpus as a resource. A survey was conducted to identify distinct words and phrases that are common to one specific dialect and not used in other dialects, so that a text corpus for that dialect can be downloaded. From this experiment we obtained 48M tokens from different Arabic dialects. These dialects were categorised into four main groups: Gulf, Levantine, Egyptian and North African, yielding 14.5M, 10.4M, 13M and 10.1M tokens respectively. The total number of distinct types across all the corpora is 2M. In this paper we describe how the corpora were constructed using these distinct words.
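As a rough illustration of the idea in the abstract, the following Python sketch assigns a text to a dialect only when it contains seed words from exactly one dialect, then counts tokens and types. It is not the authors' implementation: the seed words are made-up placeholders rather than the words from the paper's survey, whitespace tokenisation is an assumption, and the names `SEED_WORDS`, `assign_dialect` and `corpus_stats` are hypothetical. The last lines only check that the four per-dialect token counts reported above add up to the stated 48M.

```python
from collections import Counter

# Hypothetical seed lists: placeholders, NOT the words from the paper's survey.
SEED_WORDS = {
    "gulf": {"seed_g1", "seed_g2"},
    "egyptian": {"seed_e1"},
}


def assign_dialect(text, seeds=SEED_WORDS):
    """Return the one dialect whose seed words occur in text, else None."""
    tokens = set(text.split())  # assumption: plain whitespace tokenisation
    hits = [dialect for dialect, words in seeds.items() if tokens & words]
    # Texts matching no dialect, or more than one, are left unassigned.
    return hits[0] if len(hits) == 1 else None


def corpus_stats(texts):
    """Return (tokens, types): running words and distinct words."""
    counts = Counter(tok for text in texts for tok in text.split())
    return sum(counts.values()), len(counts)


# Per-dialect token counts reported in the abstract, in millions.
REPORTED_TOKENS_M = {
    "Gulf": 14.5,
    "Levantine": 10.4,
    "Egyptian": 13.0,
    "North African": 10.1,
}
total_tokens_m = round(sum(REPORTED_TOKENS_M.values()), 1)
```

Discarding texts that match several dialects is one simple way to keep each corpus specific to a single dialect; the paper may handle such cases differently.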

Original language	English
Title of host publication	2013 1st International Conference on Communications, Signal Processing and Their Applications, ICCSPA 2013
DOIs	https://doi.org/10.1109/ICCSPA.2013.6487247
Publication status	Published - 16 Apr 2013
Event	2013 1st International Conference on Communications, Signal Processing and Their Applications, ICCSPA 2013 - Sharjah, United Arab Emirates. Duration: 12 Feb 2013 → 14 Feb 2013

Conference

Conference	2013 1st International Conference on Communications, Signal Processing and Their Applications, ICCSPA 2013
Country/Territory	United Arab Emirates
City	Sharjah
Period	12/02/13 → 14/02/13

Keywords

Automatic Building
Multi Dialect
Text Corpora

ASJC Scopus subject areas

Computer Networks and Communications
Signal Processing

Access to Document

10.1109/ICCSPA.2013.6487247

Cite this

@inproceedings{036d1948f91846da8a0466df711807a3,
  title = "Automatic building of Arabic multi dialect text corpora by bootstrapping dialect words",
  abstract = "The principal objective of this work is to build multi-dialect Arabic text corpora using a web corpus as a resource. A survey was conducted to identify distinct words and phrases that are common to one specific dialect and not used in other dialects, so that a text corpus for that dialect can be downloaded. From this experiment we obtained 48M tokens from different Arabic dialects. These dialects were categorised into four main groups: Gulf, Levantine, Egyptian and North African, yielding 14.5M, 10.4M, 13M and 10.1M tokens respectively. The total number of distinct types across all the corpora is 2M. In this paper we describe how the corpora were constructed using these distinct words.",
  keywords = "Automatic Building, Multi Dialect, Text Corpora",
  author = "Khalid Almeman and Mark Lee",
  year = "2013",
  month = apr,
  day = "16",
  doi = "10.1109/ICCSPA.2013.6487247",
  language = "English",
  isbn = "9781467328210",
  booktitle = "2013 1st International Conference on Communications, Signal Processing and Their Applications, ICCSPA 2013",
  note = "2013 1st International Conference on Communications, Signal Processing and Their Applications, ICCSPA 2013 ; Conference date: 12-02-2013 Through 14-02-2013",
}

Almeman, K & Lee, M 2013, Automatic building of Arabic multi dialect text corpora by bootstrapping dialect words. In 2013 1st International Conference on Communications, Signal Processing and Their Applications, ICCSPA 2013, 6487247, 2013 1st International Conference on Communications, Signal Processing and Their Applications, ICCSPA 2013, Sharjah, United Arab Emirates, 12/02/13. https://doi.org/10.1109/ICCSPA.2013.6487247

TY - GEN
T1 - Automatic building of Arabic multi dialect text corpora by bootstrapping dialect words
AU - Almeman, Khalid
AU - Lee, Mark
PY - 2013/4/16
Y1 - 2013/4/16
AB - The principal objective of this work is to build multi-dialect Arabic text corpora using a web corpus as a resource. A survey was conducted to identify distinct words and phrases that are common to one specific dialect and not used in other dialects, so that a text corpus for that dialect can be downloaded. From this experiment we obtained 48M tokens from different Arabic dialects. These dialects were categorised into four main groups: Gulf, Levantine, Egyptian and North African, yielding 14.5M, 10.4M, 13M and 10.1M tokens respectively. The total number of distinct types across all the corpora is 2M. In this paper we describe how the corpora were constructed using these distinct words.
KW - Automatic Building
KW - Multi Dialect
KW - Text Corpora
UR - http://www.scopus.com/inward/record.url?scp=84876036099&partnerID=8YFLogxK
U2 - 10.1109/ICCSPA.2013.6487247
DO - 10.1109/ICCSPA.2013.6487247
M3 - Conference contribution
AN - SCOPUS:84876036099
SN - 9781467328210
BT - 2013 1st International Conference on Communications, Signal Processing and Their Applications, ICCSPA 2013
T2 - 2013 1st International Conference on Communications, Signal Processing and Their Applications, ICCSPA 2013
Y2 - 12 February 2013 through 14 February 2013
ER -
