Analysis of a low-dimensional bottleneck neural network representation of speech for modelling speech dynamics

Linxue Bai; Peter Jancovic; Martin Russell; Phil Weber

doi:10.21437/Interspeech.2015-208

Electronic, Electrical and Systems Engineering

Research output: Contribution to conference (unpublished) › Paper › peer-review

Abstract

This paper presents an analysis of a low-dimensional representation of speech for modelling speech dynamics, extracted using bottleneck neural networks. The input to the neural network is a set of spectral feature vectors. We explore the effect of various designs and training of the network, such as varying the size of context in the input layer, size of the bottleneck and other hidden layers, and using input reconstruction or phone posteriors as targets. Experiments are performed on TIMIT. The bottleneck features are employed in a conventional HMM-based phoneme recognition system, with recognition accuracy of 70.6% on the core test achieved using only 9-dimensional features. We also analyse how the bottleneck features fit the assumptions of dynamic models of speech. Specifically, we employ the continuous-state hidden Markov model (CS-HMM), which considers speech as a sequence of dwell and transition regions. We demonstrate that the bottleneck features preserve well the trajectory continuity over time and can provide a suitable representation for CS-HMM.

Original language	English
Pages	583-587
DOIs	https://doi.org/10.21437/Interspeech.2015-208
Publication status	Published - Sept 2015
Event	Interspeech 2015 - Dresden, Germany
Duration	6 Sept 2015 → 10 Sept 2015

Conference

Conference	Interspeech 2015
Country/Territory	Germany
City	Dresden
Period	6/09/15 → 10/09/15

Access to Document

10.21437/Interspeech.2015-208
Licence: None: All rights reserved

Speech Recognition by Synthesis (SRbS)
Russell, M. & Jancovic, P.
1/10/12 → 31/10/16
Project: Other Government Departments

Cite this

@conference{7b1b56e20af04785ab74fa5fb6a98da3,
  title = "Analysis of a low-dimensional bottleneck neural network representation of speech for modelling speech dynamics",
  abstract = "This paper presents an analysis of a low-dimensional representation of speech for modelling speech dynamics, extracted using bottleneck neural networks. The input to the neural network is a set of spectral feature vectors. We explore the effect of various designs and training of the network, such as varying the size of context in the input layer, size of the bottleneck and other hidden layers, and using input reconstruction or phone posteriors as targets. Experiments are performed on TIMIT. The bottleneck features are employed in a conventional HMM-based phoneme recognition system, with recognition accuracy of 70.6\% on the core test achieved using only 9-dimensional features. We also analyse how the bottleneck features fit the assumptions of dynamic models of speech. Specifically, we employ the continuous-state hidden Markov model (CS-HMM), which considers speech as a sequence of dwell and transition regions. We demonstrate that the bottleneck features preserve well the trajectory continuity over time and can provide a suitable representation for CS-HMM.",
  author = "Linxue Bai and Peter Jancovic and Martin Russell and Phil Weber",
  booktitle = "Interspeech 2015",
  year = "2015",
  month = sep,
  doi = "10.21437/Interspeech.2015-208",
  language = "English",
  pages = "583--587",
  note = "Interspeech 2015 ; Conference date: 06-09-2015 Through 10-09-2015",
}

TY  - CONF
T1  - Analysis of a low-dimensional bottleneck neural network representation of speech for modelling speech dynamics
AU  - Bai, Linxue
AU  - Jancovic, Peter
AU  - Russell, Martin
AU  - Weber, Phil
PY  - 2015/9
Y1  - 2015/9
N2  - This paper presents an analysis of a low-dimensional representation of speech for modelling speech dynamics, extracted using bottleneck neural networks. The input to the neural network is a set of spectral feature vectors. We explore the effect of various designs and training of the network, such as varying the size of context in the input layer, size of the bottleneck and other hidden layers, and using input reconstruction or phone posteriors as targets. Experiments are performed on TIMIT. The bottleneck features are employed in a conventional HMM-based phoneme recognition system, with recognition accuracy of 70.6% on the core test achieved using only 9-dimensional features. We also analyse how the bottleneck features fit the assumptions of dynamic models of speech. Specifically, we employ the continuous-state hidden Markov model (CS-HMM), which considers speech as a sequence of dwell and transition regions. We demonstrate that the bottleneck features preserve well the trajectory continuity over time and can provide a suitable representation for CS-HMM.
AB  - This paper presents an analysis of a low-dimensional representation of speech for modelling speech dynamics, extracted using bottleneck neural networks. The input to the neural network is a set of spectral feature vectors. We explore the effect of various designs and training of the network, such as varying the size of context in the input layer, size of the bottleneck and other hidden layers, and using input reconstruction or phone posteriors as targets. Experiments are performed on TIMIT. The bottleneck features are employed in a conventional HMM-based phoneme recognition system, with recognition accuracy of 70.6% on the core test achieved using only 9-dimensional features. We also analyse how the bottleneck features fit the assumptions of dynamic models of speech. Specifically, we employ the continuous-state hidden Markov model (CS-HMM), which considers speech as a sequence of dwell and transition regions. We demonstrate that the bottleneck features preserve well the trajectory continuity over time and can provide a suitable representation for CS-HMM.
U2  - 10.21437/Interspeech.2015-208
DO  - 10.21437/Interspeech.2015-208
M3  - Paper
SP  - 583
EP  - 587
T2  - Interspeech 2015
Y2  - 6 September 2015 through 10 September 2015
ER  -
