Automatic building of Arabic multi dialect text corpora by bootstrapping dialect words

Khalid Almeman, Mark Lee

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

19 Citations (Scopus)

Abstract

The principal objective of this work is to build multi-dialect Arabic text corpora using a web corpus as a resource. A survey was conducted to identify distinctive words and phrases that are used in one specific dialect only and not in the others, so that they could be used to download a text corpus for that dialect. This experiment yielded 48M tokens from different Arabic dialects. The dialects were grouped into four main categories: Gulf, Levantine, Egyptian and North African, with 14.5M, 10.4M, 13M and 10.1M tokens obtained respectively. The total number of distinct types across all the corpora is 2M. In this paper we describe how the corpora were constructed using these distinctive words.
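The idea summarised in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the whitespace tokenisation, the "most seed hits wins" labelling rule and the placeholder words are all assumptions made for illustration. It keeps only the candidate words that occur in a single dialect's list, uses them to label a text, and counts tokens and distinct types.

```python
from collections import Counter


def exclusive_words(word_lists):
    """Keep, for each dialect, only the words found in no other dialect's list."""
    # Count in how many dialect lists each word occurs.
    seen_in = Counter(w for words in word_lists.values() for w in set(words))
    return {
        dialect: {w for w in words if seen_in[w] == 1}
        for dialect, words in word_lists.items()
    }


def label_text(text, seeds):
    """Return the dialect whose seed words occur most often in text, else None.

    Assumption: whitespace tokenisation; ties go to the first dialect listed.
    """
    tokens = text.split()
    hits = {d: sum(t in s for t in tokens) for d, s in seeds.items()}
    if not hits:
        return None
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None


def token_type_counts(texts):
    """Return (number of tokens, number of distinct types) over all texts."""
    tokens = [t for text in texts for t in text.split()]
    return len(tokens), len(set(tokens))


# Hypothetical placeholder words, not taken from the paper's survey.
candidates = {"gulf": ["g1", "shared"], "egyptian": ["e1", "shared"]}
seeds = exclusive_words(candidates)
```

In this sketch a word such as `shared`, which appears in more than one list, is dropped from every seed set, so only words that point to a single dialect are used for labelling.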

Original language: English
Title of host publication: 2013 1st International Conference on Communications, Signal Processing and Their Applications, ICCSPA 2013
Publication status: Published - 16 Apr 2013
Event: 2013 1st International Conference on Communications, Signal Processing and Their Applications, ICCSPA 2013 - Sharjah, United Arab Emirates
Duration: 12 Feb 2013 to 14 Feb 2013

Conference

Conference: 2013 1st International Conference on Communications, Signal Processing and Their Applications, ICCSPA 2013
Country/Territory: United Arab Emirates
City: Sharjah
Period: 12/02/13 to 14/02/13

Keywords

  • Automatic Building
  • Multi Dialect
  • Text Corpora

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Signal Processing
