Multi dialect Arabic speech parallel corpora

Khalid Almeman; Mark Lee; Ali Abdulrahman Almiman

doi:10.1109/ICCSPA.2013.6487288

Corresponding author: Khalid Almeman

Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

15 Citations (Scopus)

Abstract

This paper describes the building of a multi-dialect Arabic speech parallel corpus. It is designed to encompass four main varieties: Modern Standard Arabic (MSA) and the Gulf, Egyptian and Levantine dialects. We chose a specific linguistic domain to work in: travel and tourism. Parallel prompts were written for the four varieties, which involved 1291 recordings for MSA and 1069 recordings for the other dialects. The recordings were conducted with the consent of 52 participants. We obtained about 32 hours of speech. After the segmentation stage, we obtained a total of 67,132 speech files. These are the first Arabic parallel text and speech corpora, and they will be open source for researchers.

Original language	English
Title of host publication	2013 1st International Conference on Communications, Signal Processing and Their Applications, ICCSPA 2013
DOIs	https://doi.org/10.1109/ICCSPA.2013.6487288
Publication status	Published - 16 Apr 2013
Event	2013 1st International Conference on Communications, Signal Processing and Their Applications, ICCSPA 2013, Sharjah, United Arab Emirates. Duration: 12 Feb 2013 → 14 Feb 2013

Conference

Conference	2013 1st International Conference on Communications, Signal Processing and Their Applications, ICCSPA 2013
Country/Territory	United Arab Emirates
City	Sharjah
Period	12/02/13 → 14/02/13

Keywords

Arabic Dialects
Multi-Dialect
Parallel
Speech Corpora

ASJC Scopus subject areas

Computer Networks and Communications
Signal Processing

Cite this

@inproceedings{90c91d2638e543b68e48e627ef49f2ac,

title = "Multi dialect Arabic speech parallel corpora",

abstract = "This paper describes the building of a multi-dialect Arabic speech parallel corpus. It is designed to encompass four main varieties: Modern Standard Arabic (MSA) and the Gulf, Egyptian and Levantine dialects. We chose a specific linguistic domain to work in: travel and tourism. Parallel prompts were written for the four varieties, which involved 1291 recordings for MSA and 1069 recordings for the other dialects. The recordings were conducted with the consent of 52 participants. We obtained about 32 hours of speech. After the segmentation stage, we obtained a total of 67,132 speech files. These are the first Arabic parallel text and speech corpora, and they will be open source for researchers.",

keywords = "Arabic Dialects, Multi-Dialect, Parallel, Speech Corpora",

author = "Khalid Almeman and Mark Lee and Almiman, {Ali Abdulrahman}",

year = "2013",

month = apr,

day = "16",

doi = "10.1109/ICCSPA.2013.6487288",

language = "English",

isbn = "9781467328210",

booktitle = "2013 1st International Conference on Communications, Signal Processing and Their Applications, ICCSPA 2013",

note = "2013 1st International Conference on Communications, Signal Processing and Their Applications, ICCSPA 2013 ; Conference date: 12-02-2013 Through 14-02-2013",

}

Almeman, K, Lee, M & Almiman, AA 2013, Multi dialect Arabic speech parallel corpora. in 2013 1st International Conference on Communications, Signal Processing and Their Applications, ICCSPA 2013., 6487288, 2013 1st International Conference on Communications, Signal Processing and Their Applications, ICCSPA 2013, Sharjah, United Arab Emirates, 12/02/13. https://doi.org/10.1109/ICCSPA.2013.6487288

TY - GEN

T1 - Multi dialect Arabic speech parallel corpora

AU - Almeman, Khalid

AU - Lee, Mark

AU - Almiman, Ali Abdulrahman

PY - 2013/4/16

Y1 - 2013/4/16

N2 - This paper describes the building of a multi-dialect Arabic speech parallel corpus. It is designed to encompass four main varieties: Modern Standard Arabic (MSA) and the Gulf, Egyptian and Levantine dialects. We chose a specific linguistic domain to work in: travel and tourism. Parallel prompts were written for the four varieties, which involved 1291 recordings for MSA and 1069 recordings for the other dialects. The recordings were conducted with the consent of 52 participants. We obtained about 32 hours of speech. After the segmentation stage, we obtained a total of 67,132 speech files. These are the first Arabic parallel text and speech corpora, and they will be open source for researchers.

KW - Arabic Dialects

KW - Multi-Dialect

KW - Parallel

KW - Speech Corpora

UR - http://www.scopus.com/inward/record.url?scp=84876045124&partnerID=8YFLogxK

U2 - 10.1109/ICCSPA.2013.6487288

DO - 10.1109/ICCSPA.2013.6487288

M3 - Conference contribution

AN - SCOPUS:84876045124

SN - 9781467328210

BT - 2013 1st International Conference on Communications, Signal Processing and Their Applications, ICCSPA 2013

T2 - 2013 1st International Conference on Communications, Signal Processing and Their Applications, ICCSPA 2013

Y2 - 12 February 2013 through 14 February 2013

ER -
