Gaze-contingent automatic speech recognition

Neil Cooke, Martin Russell

Research output: Contribution to journal › Article

7 Citations (Scopus)

Abstract

There has been progress in improving speech recognition by using a tightly coupled modality such as lip movement, and in using additional input interfaces to improve the recognition of commands in multimodal human-computer interfaces such as speech- and pen-based systems. However, there has been little work that attempts to improve the recognition of spontaneous, conversational speech by adding information from a loosely coupled modality. This study investigated that idea by integrating information from gaze into an automatic speech recognition (ASR) system. A probabilistic framework for multimodal recognition was formalised and applied to the specific case of integrating gaze and speech. Gaze-contingent ASR systems were developed from a baseline ASR system by redistributing language model probability mass according to visual attention. These systems were tested on a corpus of matched eye movement and related spontaneous, conversational British English speech segments (n = 1355) for a visually based, goal-driven task. The best performing systems had word error rates similar to the baseline ASR system and showed an increase in keyword spotting accuracy. The core elements of this work may be useful for developing robust speech-centric multimodal decoding systems.
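The abstract's central mechanism is the redistribution of language model probability mass according to visual attention. The following is a minimal, illustrative sketch of that general idea, not the authors' implementation: unigram probabilities for words associated with the currently fixated object are boosted and the distribution is renormalised. The vocabulary, the gaze-to-word mapping, and the boost factor are assumptions made purely for illustration.

```python
def gaze_contingent_lm(lm_probs, gaze_related_words, boost=2.0):
    """Redistribute language-model probability mass towards words that are
    associated with the object currently under visual attention, then
    renormalise so the probabilities sum to one.

    Note: a hypothetical sketch of the general technique, not the system
    described in the paper (which rescales a full language model within an
    ASR decoder rather than a standalone unigram table).
    """
    boosted = {
        word: p * (boost if word in gaze_related_words else 1.0)
        for word, p in lm_probs.items()
    }
    total = sum(boosted.values())
    return {word: p / total for word, p in boosted.items()}


if __name__ == "__main__":
    # Baseline unigram probabilities (illustrative values only).
    lm_probs = {"move": 0.10, "the": 0.20, "red": 0.05, "square": 0.03, "left": 0.07}
    # Suppose eye tracking indicates the user is fixating a red square.
    fixated = {"red", "square"}
    print(gaze_contingent_lm(lm_probs, fixated))
```

In a full system the reweighted probabilities would feed the decoder's language model rather than be inspected directly, but the sketch shows how gaze can bias recognition towards words relevant to the attended object.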
Original language: English
Pages (from-to): 369-380
Number of pages: 12
Journal: IET Signal Processing
Volume: 2
Issue number: 4
DOIs
Publication status: Published - 1 Jan 2008
