Abstract
The principal objective of this work is to build multi-dialect Arabic text corpora using a web corpus as a resource. A survey was conducted to identify words and phrases that are distinctive of a single dialect and not used in the others; these were then used to download a text corpus for that specific dialect. From this experiment we obtained 48M tokens from different Arabic dialects. The dialects were grouped into four main categories, Gulf, Levantine, Egyptian and North African, yielding 14.5M, 10.4M, 13M and 10.1M tokens respectively. Across all the corpora there are 2M distinct types. In this paper we describe how the corpora were constructed using these distinctive words.
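The idea of assigning text to a dialect by its distinctive words can be illustrated with a minimal sketch. This is not the authors' implementation: the seed words in `SEED_WORDS` are a few commonly cited dialect markers chosen here for illustration (they are not the paper's survey list), and the function names `dialect_of`, `build_corpora` and `token_and_type_counts` are hypothetical. The sketch assumes whitespace tokenisation and keeps a document only when the seed words of exactly one dialect occur in it.

```python
# Illustrative seed words only; NOT the survey list used in the paper.
SEED_WORDS = {
    "gulf": {"وايد"},
    "levantine": {"هلق"},
    "egyptian": {"دلوقتي"},
    "north_african": {"بزاف"},
}


def dialect_of(text):
    """Return the one dialect whose seed words occur in text, else None."""
    tokens = set(text.split())  # assumption: plain whitespace tokenisation
    hits = [d for d, words in SEED_WORDS.items() if tokens & words]
    # Documents matching no dialect, or more than one, are left unassigned.
    return hits[0] if len(hits) == 1 else None


def build_corpora(documents):
    """Group documents into one list per dialect, skipping unassigned ones."""
    corpora = {d: [] for d in SEED_WORDS}
    for doc in documents:
        d = dialect_of(doc)
        if d is not None:
            corpora[d].append(doc)
    return corpora


def token_and_type_counts(docs):
    """Return (number of tokens, number of distinct types) for a corpus."""
    tokens = [t for doc in docs for t in doc.split()]
    return len(tokens), len(set(tokens))
```

Calling `token_and_type_counts` on each list returned by `build_corpora` gives per-dialect token and type figures of the kind reported above, under the same tokenisation assumption.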
Original language | English |
---|---|
Title of host publication | 2013 1st International Conference on Communications, Signal Processing and Their Applications, ICCSPA 2013 |
Publication status | Published - 16 Apr 2013 |
Event | 2013 1st International Conference on Communications, Signal Processing and Their Applications, ICCSPA 2013 - Sharjah, United Arab Emirates. Duration: 12 Feb 2013 → 14 Feb 2013 |
Conference
Conference | 2013 1st International Conference on Communications, Signal Processing and Their Applications, ICCSPA 2013 |
---|---|
Country/Territory | United Arab Emirates |
City | Sharjah |
Period | 12/02/13 → 14/02/13 |
Keywords
- Automatic Building
- Multi Dialect
- Text Corpora
ASJC Scopus subject areas
- Computer Networks and Communications
- Signal Processing