Phone classification using a non-linear manifold with broad phone class dependent DNNs

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution

Authors

Abstract

Most state-of-the-art automatic speech recognition (ASR) systems use a single deep neural network (DNN) to map the acoustic space to the decision space. However, different phonetic classes employ different production mechanisms and are best described by different types of features. Hence it may be advantageous to replace this single DNN with several phone class dependent DNNs. The appropriate mathematical formalism for this is a manifold. This paper assesses the use of a nonlinear manifold structure with multiple DNNs for phone classification. The system has two levels. The first comprises a set of broad phone class (BPC) dependent DNN-based mappings and the second level is a fusion network. Various ways of designing and training the networks in both levels are assessed, including varying the size of hidden layers, the use of the bottleneck or softmax outputs as input to the fusion network, and the use of different broad class definitions. Phone classification experiments are performed on TIMIT. The results show that using the BPC-dependent DNNs provides small but significant improvements in phone classification accuracy relative to a single global DNN. The paper concludes with visualisations of the structures learned by the local and global DNNs and discussion of their interpretations.
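
The two-level structure described in the abstract can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the layer sizes, the number of broad phone classes, the tanh activations, and every name below (`init_layer`, `classify`, `N_BPC`, and so on) are hypothetical placeholders, and the weights are random rather than trained. The sketch uses the bottleneck variant, in which each BPC-dependent network contributes a small bottleneck vector and the fusion network maps the concatenation of those vectors to phone posteriors.

```python
import math
import random

# All sizes are illustrative assumptions, not values taken from the paper.
N_FEATURES = 12    # dimension of one acoustic feature vector
N_BPC = 3          # number of broad phone classes (one local network each)
N_HIDDEN = 8       # hidden layer width
N_BOTTLENECK = 4   # bottleneck width of each local network
N_PHONES = 10      # number of phone classes at the output


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for one fully connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer, use_tanh=True):
    """Apply one fully connected layer, optionally followed by tanh."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(weights, biases)]
    return [math.tanh(v) for v in out] if use_tanh else out


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def build_system(rng):
    """Level 1: one small network per broad phone class. Level 2: fusion network."""
    local_nets = [(init_layer(N_FEATURES, N_HIDDEN, rng),
                   init_layer(N_HIDDEN, N_BOTTLENECK, rng))
                  for _ in range(N_BPC)]
    fusion_net = (init_layer(N_BPC * N_BOTTLENECK, N_HIDDEN, rng),
                  init_layer(N_HIDDEN, N_PHONES, rng))
    return local_nets, fusion_net


def classify(x, local_nets, fusion_net):
    """Concatenate the bottleneck outputs of all local networks, then fuse."""
    fused_input = []
    for hidden_layer, bottleneck_layer in local_nets:
        hidden = dense(x, hidden_layer)
        fused_input.extend(dense(hidden, bottleneck_layer))
    hidden = dense(fused_input, fusion_net[0])
    return softmax(dense(hidden, fusion_net[1], use_tanh=False))


rng = random.Random(0)
local_nets, fusion_net = build_system(rng)
frame = [rng.gauss(0.0, 1.0) for _ in range(N_FEATURES)]  # stand-in feature vector
posteriors = classify(frame, local_nets, fusion_net)
print(len(posteriors), round(sum(posteriors), 6))
```

The softmax variant mentioned in the abstract would presumably replace each bottleneck vector with the local network's class posteriors before concatenation; how the networks are actually trained is described in the paper itself and is not shown here.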

Details

Original language: English
Title of host publication: Proceedings of Interspeech 2017
Publication status: Published - 24 Aug 2017
Event: Interspeech 2017: Situated Interaction - Stockholm University, Stockholm, Sweden
Duration: 20 Aug 2017 to 24 Aug 2017
http://www.interspeech2017.org/

Conference

Conference: Interspeech 2017
Country: Sweden
City: Stockholm
Period: 20/08/17 to 24/08/17