Abstract
Depression is a global mental health problem that lacks effective screening methods for early detection and treatment. This paper aims to facilitate large-scale depression screening by focusing on the speech depression detection (SDD) task. Currently, modeling the raw signal directly requires a large number of parameters, and existing deep learning-based SDD models mainly use fixed Mel-scale spectral features as input. However, these features are not designed for depression detection, and their manual settings limit the exploration of fine-grained feature representations. In this paper, we learn effective representations of the raw signals from an interpretable perspective. Specifically, we present a joint learning framework with attention-guided learnable time-domain filterbanks for depression classification (DALF), which combines a depression filterbanks features learning (DFBL) module with a multi-scale spectral attention learning (MSSA) module. DFBL produces biologically meaningful acoustic features by employing learnable time-domain filters, and MSSA guides the learnable filters to better retain the useful frequency sub-bands. We collect a new dataset, the Neutral Reading-based Audio Corpus (NRAC), to facilitate research in depression analysis, and we evaluate DALF on NRAC and the public DAIC-woz dataset. The experimental results demonstrate that our method outperforms state-of-the-art SDD methods, with an F1 score of 78.4% on the DAIC-woz dataset. In particular, DALF achieves F1 scores of 87.3% and 81.7% on two parts of the NRAC dataset. By analyzing the filter coefficients, we find that the most important frequency range identified by our method is 600–700 Hz, which corresponds to the Mandarin vowels /e/ and /ê/ and can be considered an effective biomarker for the SDD task. Taken together, our DALF model provides a promising approach to depression detection.
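To make the idea of a time-domain filter whose pass band can be read from its coefficients more concrete, the following is a minimal illustrative sketch, not the DALF implementation. The abstract does not state how the filters are parameterised, so the Gabor form (a Gaussian window multiplied by a cosine carrier), the 16 kHz sample rate, and the helper names `gabor_filter` and `peak_frequency` are all assumptions made for illustration. In a learnable setting, `center_hz` and `bandwidth_hz` would be trainable parameters; here they are fixed at values inside the 600–700 Hz band mentioned above.

```python
import numpy as np


def gabor_filter(center_hz, bandwidth_hz, sample_rate=16000, length=401):
    """Real Gabor band-pass impulse response (hypothetical parameterisation).

    A Gaussian window multiplied by a cosine carrier. center_hz sets the
    centre of the pass band and bandwidth_hz its spread in frequency.
    """
    # Time axis in seconds, centred on zero.
    t = (np.arange(length) - length // 2) / sample_rate
    # A narrower band in frequency needs a wider Gaussian in time.
    sigma = 1.0 / (2.0 * np.pi * bandwidth_hz)
    kernel = np.exp(-0.5 * (t / sigma) ** 2) * np.cos(2.0 * np.pi * center_hz * t)
    # Normalise to unit energy so filters are comparable.
    return kernel / np.linalg.norm(kernel)


def peak_frequency(kernel, sample_rate=16000, n_fft=4096):
    """Return the frequency (Hz) at which the filter's magnitude response peaks."""
    spectrum = np.abs(np.fft.rfft(kernel, n=n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    return float(freqs[np.argmax(spectrum)])


# Build one filter centred in the 600-700 Hz band and inspect its coefficients.
kernel = gabor_filter(center_hz=650.0, bandwidth_hz=100.0)
peak = peak_frequency(kernel)

# Filtering is a plain 1-D convolution of the waveform with the kernel.
time = np.arange(16000) / 16000.0
tone_in_band = np.sin(2.0 * np.pi * 650.0 * time)
tone_out_of_band = np.sin(2.0 * np.pi * 3000.0 * time)
energy_in = float(np.sum(np.convolve(tone_in_band, kernel, mode="same") ** 2))
energy_out = float(np.sum(np.convolve(tone_out_of_band, kernel, mode="same") ** 2))
```

Under these assumptions, the peak of the magnitude response falls inside the band the filter was centred on, and a tone inside that band retains far more energy after filtering than a tone well outside it. Inspecting filter coefficients in this way is one plausible route to the kind of frequency-range analysis the abstract reports.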
Original language | English |
---|---|
Pages (from-to) | 135-149 |
Number of pages | 15 |
Journal | Neural Networks |
Volume | 165 |
Early online date | 26 May 2023 |
Publication status | Published - Aug 2023 |
Bibliographical note
Acknowledgments: This study is funded by the National Key Research and Development Program (2022YFC2405603 to Xizhe Zhang), the National Natural Science Foundation of China (62076059 to Peng Cao, 62176129 to Xizhe Zhang), the Science Project of Liaoning Province, China (2021-MS-105 to Peng Cao), the National Science Fund for Distinguished Young Scholars (81725005 to Fei Wang), the National Natural Science Foundation Regional Innovation and Development Joint Fund (U20A6005 to Fei Wang), and the Jiangsu Provincial Key Research and Development Program, China (BE2021617 to Fei Wang).
Keywords
- Speech depression detection
- Filterbanks
- Time–frequency analysis
- Interpretability
- Affective computing