Attention guided learnable time-domain filterbanks for speech depression detection

Wenju Yang; Jiankang Liu; Peng Cao; Rongxin Zhu; Yang Wang; Jian K. Liu; Fei Wang; Xizhe Zhang

doi:10.1016/j.neunet.2023.05.041

Wenju Yang, Jiankang Liu, Peng Cao^*, Rongxin Zhu, Yang Wang, Jian K. Liu, Fei Wang^*, Xizhe Zhang^*

^*Corresponding author for this work

Computer Science

Research output: Contribution to journal › Article › peer-review

Abstract

Depression, as a global mental health problem, lacks effective screening methods that can help with early detection and treatment. This paper aims to facilitate large-scale screening of depression by focusing on the speech depression detection (SDD) task. Currently, direct modeling on the raw signal yields a large number of parameters, and existing deep learning-based SDD models mainly use fixed Mel-scale spectral features as input. However, these features are not designed for depression detection, and the manual settings limit the exploration of fine-grained feature representations. In this paper, we learn effective representations of the raw signals from an interpretable perspective. Specifically, we present a joint learning framework with attention-guided learnable time-domain filterbanks for depression classification (DALF), which combines a depression filterbanks features learning (DFBL) module and a multi-scale spectral attention learning (MSSA) module. DFBL produces biologically meaningful acoustic features by employing learnable time-domain filters, and MSSA guides the learnable filters to better retain the useful frequency sub-bands. We collect a new dataset, the Neutral Reading-based Audio Corpus (NRAC), to facilitate research in depression analysis, and we evaluate the performance of DALF on the NRAC and the public DAIC-woz datasets. The experimental results demonstrate that our method outperforms state-of-the-art SDD methods, with an F1 of 78.4% on the DAIC-woz dataset. In particular, DALF achieves F1 scores of 87.3% and 81.7% on two parts of the NRAC dataset. By analyzing the filter coefficients, we find that the most important frequency range identified by our method is 600–700 Hz, which corresponds to the Mandarin vowels /e/ and /ê/ and can be considered an effective biomarker for the SDD task. Taken together, our DALF model provides a promising approach to depression detection.

Original language	English
Pages (from-to)	135-149
Number of pages	15
Journal	Neural Networks
Volume	165
Early online date	26 May 2023
DOIs	https://doi.org/10.1016/j.neunet.2023.05.041
Publication status	Published - Aug 2023

Bibliographical note

Acknowledgments:
This study is funded by the National Key Research and Development Program (2022YFC2405603 to Xizhe Zhang), National Natural Science Foundation of China (62076059 to Peng Cao, 62176129 to Xizhe Zhang), Science Project of Liaoning Province, China (2021-MS-105 to Peng Cao), National Science Fund for Distinguished Young Scholars (81725005 to Fei Wang), the National Natural Science Foundation Regional Innovation and Development Joint Fund (U20A6005 to Fei Wang), Jiangsu Provincial Key Research and Development Program, China (BE2021617 to Fei Wang).

Keywords

Speech depression detection
Filterbanks
Time–frequency analysis
Interpretability
Affective computing

Access to Document

10.1016/j.neunet.2023.05.041
Licence: None: All rights reserved

Cite this

@article{053e4812a8784318ad93c480f546f1d4,

title = "Attention guided learnable time-domain filterbanks for speech depression detection",

abstract = "Depression, as a global mental health problem, is lacking effective screening methods that can help with early detection and treatment. This paper aims to facilitate the large-scale screening of depression by focusing on the speech depression detection (SDD) task. Currently, direct modeling on the raw signal yields a large number of parameters, and the existing deep learning-based SDD models mainly use the fixed Mel-scale spectral features as input. However, these features are not designed for depression detection, and the manual settings limit the exploration of fine-grained feature representations. In this paper, we learn the effective representations of the raw signals from an interpretable perspective. Specifically, we present a joint learning framework with attention-guided learnable time-domain filterbanks for depression classification (DALF), which collaborates with the depression filterbanks features learning (DFBL) module and multi-scale spectral attention learning (MSSA) module. DFBL is capable of producing biologically meaningful acoustic features by employing learnable time-domain filters, and MSSA is used to guide the learnable filters to better retain the useful frequency sub-bands. We collect a new dataset, the Neutral Reading-based Audio Corpus (NRAC), to facilitate the research in depression analysis, and we evaluate the performance of DALF on the NRAC and the public DAIC-woz datasets. The experimental results demonstrate that our method outperforms the state-of-the-art SDD methods with an F1 of 78.4% on the DAIC-woz dataset. In particular, DALF achieves F1 scores of 87.3% and 81.7% on two parts of the NRAC dataset. By analyzing the filter coefficients, we find that the most important frequency range identified by our method is 600–700Hz, which corresponds to the Mandarin vowels /e/ and /{\^e}/ and can be considered as an effective biomarker for the SDD task. 
Taken together, our DALF model provides a promising approach to depression detection.",

keywords = "Speech depression detection, Filterbanks, Time–frequency analysis, Interpretability, Affective computing",

author = "Wenju Yang and Jiankang Liu and Peng Cao and Rongxin Zhu and Yang Wang and Liu, {Jian K.} and Fei Wang and Xizhe Zhang",

note = "Acknowledgments: This study is funded by the National Key Research and Development Program (2022YFC2405603 to Xizhe Zhang), National Natural Science Foundation of China (62076059 to Peng Cao, 62176129 to Xizhe Zhang), Science Project of Liaoning Province, China (2021-MS-105 to Peng Cao), National Science Fund for Distinguished Young Scholars (81725005 to Fei Wang), the National Natural Science Foundation Regional Innovation and Development Joint Fund (U20A6005 to Fei Wang), Jiangsu Provincial Key Research and Development Program, China (BE2021617 to Fei Wang).",

year = "2023",

month = aug,

doi = "10.1016/j.neunet.2023.05.041",

language = "English",

volume = "165",

pages = "135--149",

journal = "Neural Networks",

issn = "0893-6080",

publisher = "Elsevier",

}

TY - JOUR

T1 - Attention guided learnable time-domain filterbanks for speech depression detection

AU - Yang, Wenju

AU - Liu, Jiankang

AU - Cao, Peng

AU - Zhu, Rongxin

AU - Wang, Yang

AU - Liu, Jian K.

AU - Wang, Fei

AU - Zhang, Xizhe

N1 - Acknowledgments: This study is funded by the National Key Research and Development Program (2022YFC2405603 to Xizhe Zhang), National Natural Science Foundation of China (62076059 to Peng Cao, 62176129 to Xizhe Zhang), Science Project of Liaoning Province, China (2021-MS-105 to Peng Cao), National Science Fund for Distinguished Young Scholars (81725005 to Fei Wang), the National Natural Science Foundation Regional Innovation and Development Joint Fund (U20A6005 to Fei Wang), Jiangsu Provincial Key Research and Development Program, China (BE2021617 to Fei Wang).

PY - 2023/8

Y1 - 2023/8

AB - Depression, as a global mental health problem, is lacking effective screening methods that can help with early detection and treatment. This paper aims to facilitate the large-scale screening of depression by focusing on the speech depression detection (SDD) task. Currently, direct modeling on the raw signal yields a large number of parameters, and the existing deep learning-based SDD models mainly use the fixed Mel-scale spectral features as input. However, these features are not designed for depression detection, and the manual settings limit the exploration of fine-grained feature representations. In this paper, we learn the effective representations of the raw signals from an interpretable perspective. Specifically, we present a joint learning framework with attention-guided learnable time-domain filterbanks for depression classification (DALF), which collaborates with the depression filterbanks features learning (DFBL) module and multi-scale spectral attention learning (MSSA) module. DFBL is capable of producing biologically meaningful acoustic features by employing learnable time-domain filters, and MSSA is used to guide the learnable filters to better retain the useful frequency sub-bands. We collect a new dataset, the Neutral Reading-based Audio Corpus (NRAC), to facilitate the research in depression analysis, and we evaluate the performance of DALF on the NRAC and the public DAIC-woz datasets. The experimental results demonstrate that our method outperforms the state-of-the-art SDD methods with an F1 of 78.4% on the DAIC-woz dataset. In particular, DALF achieves F1 scores of 87.3% and 81.7% on two parts of the NRAC dataset. By analyzing the filter coefficients, we find that the most important frequency range identified by our method is 600–700Hz, which corresponds to the Mandarin vowels /e/ and /ê/ and can be considered as an effective biomarker for the SDD task. Taken together, our DALF model provides a promising approach to depression detection.

KW - Speech depression detection

KW - Filterbanks

KW - Time–frequency analysis

KW - Interpretability

KW - Affective computing

U2 - 10.1016/j.neunet.2023.05.041

DO - 10.1016/j.neunet.2023.05.041

M3 - Article

SN - 0893-6080

VL - 165

SP - 135

EP - 149

JO - Neural Networks

JF - Neural Networks

ER -
