Attention guided learnable time-domain filterbanks for speech depression detection

Wenju Yang, Jiankang Liu, Peng Cao*, Rongxin Zhu, Yang Wang, Jian K. Liu, Fei Wang*, Xizhe Zhang*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Depression is a global mental health problem that still lacks effective screening methods to support early detection and treatment. This paper aims to facilitate the large-scale screening of depression by focusing on the speech depression detection (SDD) task. Direct modeling of the raw signal yields a large number of parameters, so existing deep learning-based SDD models mainly use fixed Mel-scale spectral features as input. However, these features are not designed for depression detection, and their manual settings limit the exploration of fine-grained feature representations. In this paper, we learn effective representations of the raw signal from an interpretable perspective. Specifically, we present a joint learning framework with attention-guided learnable time-domain filterbanks for depression classification (DALF), which combines the depression filterbanks features learning (DFBL) module and the multi-scale spectral attention learning (MSSA) module. DFBL produces biologically meaningful acoustic features using learnable time-domain filters, and MSSA guides the learnable filters to better retain useful frequency sub-bands. We collect a new dataset, the Neutral Reading-based Audio Corpus (NRAC), to facilitate research in depression analysis, and we evaluate the performance of DALF on the NRAC and the public DAIC-WOZ datasets. The experimental results demonstrate that our method outperforms state-of-the-art SDD methods with an F1 score of 78.4% on the DAIC-WOZ dataset. In particular, DALF achieves F1 scores of 87.3% and 81.7% on the two parts of the NRAC dataset. By analyzing the filter coefficients, we find that the most important frequency range identified by our method is 600–700 Hz, which corresponds to the Mandarin vowels /e/ and /ê/ and can be considered an effective biomarker for the SDD task. Taken together, our DALF model provides a promising approach to depression detection.
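For readers who want a concrete picture of what a learnable time-domain filterbank guided by sub-band attention can look like, the following is a minimal PyTorch sketch. It is not the authors' DALF implementation: it assumes a SincNet-style band-pass parameterization as a stand-in for DFBL and a squeeze-and-excitation style weighting over sub-bands as a stand-in for the attention guidance; the class names (LearnableSincFilterbank, SubBandAttention) and all hyperparameters are illustrative.

```python
# Minimal sketch (not the authors' code): learnable band-pass filters applied
# directly to the waveform, followed by attention weights over the sub-bands.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LearnableSincFilterbank(nn.Module):
    """Bank of band-pass filters whose low cutoff and bandwidth (in Hz) are
    trainable, so the band edges can adapt to depression-relevant frequencies."""

    def __init__(self, n_filters=40, kernel_size=251, sample_rate=16000):
        super().__init__()
        self.kernel_size = kernel_size
        self.sample_rate = sample_rate
        # Illustrative initialization: cutoffs spread roughly uniformly up to Nyquist.
        self.low_hz = nn.Parameter(torch.linspace(30, sample_rate / 2 - 200, n_filters))
        self.band_hz = nn.Parameter(torch.full((n_filters,), 100.0))
        n = torch.arange(kernel_size) - (kernel_size - 1) / 2
        self.register_buffer("t", n / sample_rate)                 # time axis in seconds
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):                                          # x: (batch, 1, samples)
        low = torch.abs(self.low_hz)
        high = torch.clamp(low + torch.abs(self.band_hz), max=self.sample_rate / 2)
        t = self.t.unsqueeze(0)                                    # (1, kernel_size)

        def lowpass(fc):                                           # sinc low-pass at cutoff fc
            return 2 * fc.unsqueeze(1) * torch.sinc(2 * fc.unsqueeze(1) * t)

        # Band-pass impulse response = difference of two low-passes, windowed.
        kernels = (lowpass(high) - lowpass(low)) * self.window     # (n_filters, kernel_size)
        kernels = kernels / (kernels.abs().max(dim=1, keepdim=True).values + 1e-8)
        return F.conv1d(x, kernels.unsqueeze(1), padding=self.kernel_size // 2)


class SubBandAttention(nn.Module):
    """Squeeze-and-excitation style weights over the filterbank channels,
    standing in for the attention that guides which sub-bands are retained."""

    def __init__(self, n_filters=40, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(n_filters, n_filters // reduction), nn.ReLU(),
            nn.Linear(n_filters // reduction, n_filters), nn.Sigmoid(),
        )

    def forward(self, feats):                                      # feats: (batch, n_filters, frames)
        weights = self.fc(feats.mean(dim=-1))                      # (batch, n_filters)
        return feats * weights.unsqueeze(-1), weights


if __name__ == "__main__":
    wav = torch.randn(2, 1, 16000)                                 # two 1-second dummy clips
    feats = LearnableSincFilterbank()(wav)
    weighted, w = SubBandAttention()(feats.abs())
    print(weighted.shape, w.shape)                                 # torch.Size([2, 40, 16000]) torch.Size([2, 40])
```

Because the trainable parameters here are band edges in Hz rather than arbitrary convolution weights, the learned filters can be read back directly as frequency sub-bands; this is the kind of interpretability the abstract refers to when it reports the 600–700 Hz range as the most informative band.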
Original language: English
Pages (from-to): 135-149
Number of pages: 15
Journal: Neural Networks
Volume: 165
Early online date: 26 May 2023
DOIs
Publication status: Published - Aug 2023

Bibliographical note

Acknowledgments:
This study was funded by the National Key Research and Development Program (2022YFC2405603 to Xizhe Zhang), the National Natural Science Foundation of China (62076059 to Peng Cao; 62176129 to Xizhe Zhang), the Science Project of Liaoning Province, China (2021-MS-105 to Peng Cao), the National Science Fund for Distinguished Young Scholars (81725005 to Fei Wang), the National Natural Science Foundation Regional Innovation and Development Joint Fund (U20A6005 to Fei Wang), and the Jiangsu Provincial Key Research and Development Program, China (BE2021617 to Fei Wang).

Keywords

  • Speech depression detection
  • Filterbanks
  • Time–frequency analysis
  • Interpretability
  • Affective computing
