Interpretation of low dimensional neural network bottleneck features in terms of human perception and production

Phil Weber, Linxue Bai, Martin Russell, Peter Jancovic, Stephen Houghton

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Citations (Scopus)

Abstract

Low-dimensional ‘bottleneck’ features extracted from neural networks have been shown to give phoneme recognition accuracy similar to that obtained with higher-dimensional MFCCs, using GMM-HMM models. Such features have also been shown to satisfy the assumptions about speech trajectory dynamics made by dynamic models of speech such as Continuous-State HMMs. However, little is understood about how networks derive these features, or whether and how they can be interpreted in terms of human speech perception and production.
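As background, the sketch below shows one common way such bottleneck features are obtained: a feed-forward phoneme classifier is trained on stacked MFCC frames with one very narrow hidden layer, whose activations are then read out as the features. The layer sizes, activations, and PyTorch framing are illustrative assumptions, not the exact network described in the paper.

```python
import torch
import torch.nn as nn

class BottleneckNet(nn.Module):
    """Phoneme classifier with a low-dimensional 'bottleneck' hidden layer.

    All dimensions are illustrative assumptions: input is a stacked window
    of MFCC frames (here 9 frames x 39 coefficients), the bottleneck has
    3 units, and the output covers a hypothetical 40-phoneme inventory.
    """
    def __init__(self, input_dim=39 * 9, bottleneck_dim=3, n_phones=40):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512), nn.Sigmoid(),
            nn.Linear(512, 512), nn.Sigmoid(),
            nn.Linear(512, bottleneck_dim),   # the 3-D bottleneck layer
        )
        self.classifier = nn.Sequential(
            nn.Sigmoid(),
            nn.Linear(bottleneck_dim, n_phones),  # phoneme posteriors
        )

    def forward(self, x):
        z = self.encoder(x)            # bottleneck activations = the features
        return self.classifier(z), z

net = BottleneckNet()
frames = torch.randn(8, 39 * 9)        # a batch of stacked MFCC windows
logits, bottleneck = net(frames)
print(bottleneck.shape)                # torch.Size([8, 3])
```

After training the classifier on phoneme labels, the classifier head is discarded and the bottleneck activations serve as low-dimensional features, e.g. as observations for a GMM-HMM recogniser.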
We analyse three-dimensional bottleneck features. We show that for vowels, their spatial representation is very close to the familiar F1:F2 vowel quadrilateral. For other classes of phonemes, the features can similarly be related to phonetic and acoustic spatial representations presented in the literature. This suggests that these networks derive representations specific to particular phonetic categories, with properties similar to those used by human perception. The representation of the full set of phonemes in the bottleneck space is consistent with a hypothesized comprehensive model of speech perception, and with perceptual models such as prototype theory.
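A minimal sketch of the kind of analysis the abstract describes: scatter-plotting per-frame bottleneck features grouped by vowel label, to see whether their layout resembles the F1:F2 vowel quadrilateral. The feature array, vowel labels, and choice of dimensions below are placeholders, not the paper's data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical inputs: `bottleneck` would be an (N, 3) array of features
# for vowel frames and `labels` the vowel symbol of each frame. Random
# placeholders stand in for them here.
rng = np.random.default_rng(0)
vowels = ["iy", "ae", "aa", "uw"]
bottleneck = rng.normal(size=(400, 3))      # placeholder features
labels = rng.choice(vowels, size=400)       # placeholder labels

for v in vowels:
    pts = bottleneck[labels == v]
    plt.scatter(pts[:, 0], pts[:, 1], s=8, label=v)
    plt.annotate(v, pts[:, :2].mean(axis=0))    # mark the vowel centroid
plt.xlabel("bottleneck dim 1")
plt.ylabel("bottleneck dim 2")
plt.legend()
plt.show()
```

With real trained features, corner vowels such as /iy/, /ae/, /aa/ and /uw/ would be expected to occupy the corners of a quadrilateral, mirroring their positions in F1:F2 space.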
Original language: English
Title of host publication: Proceedings of Interspeech
Publisher: ISCA
Number of pages: 5
DOIs
Publication status: Accepted/In press - 10 Jun 2016
Event: Interspeech 2016 - San Francisco, United States
Duration: 8 Sept 2016 → 12 Sept 2016
http://interspeech2016.org/

Conference

Conference: Interspeech 2016
Country/Territory: United States
City: San Francisco
Period: 8/09/16 → 12/09/16
Internet address: http://interspeech2016.org/
