Towards Instantaneous Mental Health Screening From Voice Using Machine and Deep Learning

The World Health Organization (WHO) has identified mental health disorders as a serious global epidemic. In the US, mental health disorders affect up to a quarter of the population and are the leading cause of disability, responsible for 18.7% of all years of life lost to disability and premature mortality. Although early detection is crucial to improving prognosis, mental illness remains largely undiagnosed. Given the recent ubiquity of voice clips captured by digital assistants and smartphones, there are now tremendous opportunities for a disruptive transformation in mental health screening; in one scenario, screening could be seamlessly integrated into virtual assistants and smartphone applications. However, several challenges must be overcome to achieve accurate mental health screening from voice. First, emotion cues are not evenly distributed within a voice clip, and a considerable part of the voice content may be neutral signal. Second, due to privacy concerns, audio datasets with mental health labels contain few participants, causing current classification models to suffer from low performance. Finally, we must account for the multimodal nature of voice: both verbal and non-verbal content must be considered, which makes the problem particularly hard, as voice clips containing the same sentence can mean different things depending on tone.

We have tackled these challenges in several stages. First, to address non-uniform cues, we introduce the Sub-Clip Classification Boosting (SCB) framework, a multi-step methodology for audio classification from non-textual features. Applied to available emotion benchmark datasets, SCB achieves state-of-the-art results, with classification accuracy of up to 88% when distinguishing between seven emotional states. SCB features a highly effective sub-clip boosting methodology that, unlike traditional boosting over feature subsets, operates at the sub-instance level: multiple sub-instance classifications increase the likelihood that an emotion cue will be found within a voice clip, even if its location varies between speakers.
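The sub-clip idea behind SCB can be illustrated with a short sketch: split each voice clip into overlapping sub-clips, classify each sub-clip from non-textual features, and pool the per-sub-clip scores so that a localized emotion cue can determine the clip-level label. The library choices (librosa, a scikit-learn-style classifier with predict_proba), the extract_features helper, and the simple score averaging are illustrative assumptions; the actual SCB boosting procedure is more involved than this.

# Minimal sketch of sub-clip classification with score pooling.
# Assumptions (not from the thesis): librosa for audio loading, mean
# MFCCs as the per-sub-clip feature vector, and any scikit-learn-style
# classifier exposing predict_proba() and classes_.
import numpy as np
import librosa


def split_into_subclips(path, subclip_sec=1.0, hop_sec=0.5, sr=16000):
    """Cut one voice clip into overlapping sub-clips."""
    y, sr = librosa.load(path, sr=sr)
    win, hop = int(subclip_sec * sr), int(hop_sec * sr)
    return [y[s:s + win] for s in range(0, max(len(y) - win, 1), hop)]


def extract_features(subclip, sr=16000):
    """Illustrative non-textual features: mean MFCCs for one sub-clip."""
    mfcc = librosa.feature.mfcc(y=subclip, sr=sr, n_mfcc=20)
    return mfcc.mean(axis=1)


def classify_clip(path, clf, sr=16000):
    """Classify every sub-clip, then pool the per-class scores so that a
    strong emotion cue anywhere in the clip can drive the clip label."""
    subclips = split_into_subclips(path, sr=sr)
    feats = np.stack([extract_features(s, sr) for s in subclips])
    probs = clf.predict_proba(feats)      # shape: (n_subclips, n_classes)
    pooled = probs.mean(axis=0)           # average pooling; max is another option
    return clf.classes_[int(np.argmax(pooled))]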
Second, we collected and evaluated retrospective and voice data from smartphones to detect depression and suicidal ideation using machine learning. The retrospective data include human behaviors and characteristics such as GPS movement data, social media posts, and call frequency logs, while the voice data contain recordings of participants reading a scripted sentence. To accomplish these tasks we developed two iterative frameworks, Moodable and EMU, each comprising a smartphone app for data collection and screening along with the backend infrastructure for data storage and analysis. Initial results show that voice clips are the most promising modality for mental health screening. Specifically, using the Moodable dataset with baseline models, we achieve an F1 score of up to 0.766 for severe depression and F1 scores between 0.578 and 0.848 for suicidal ideation, depending on severity. However, baseline models prove inadequate for detecting milder depression (F1 score 0.575), a limitation addressed in the following stage.

Third, we extended Sub-Clip Classification Boosting from emotion detection to depression detection from voice and expanded the algorithm into the Sliding Window Sub-clip Pooling (SWUP) variant, which features more sophisticated sub-clipping and pooling techniques. We applied SWUP to the data collected with the Moodable and EMU frameworks, as well as to audio from a corpus of clinical interviews. Our experimental results show consistent improvement over the initial baselines; notably, even for milder depression, which baseline models failed to detect accurately, we achieve F1 scores of 0.735 on the Moodable dataset, 0.717 on the EMU dataset, and 0.631 on DAIC-WOZ.

Finally, we introduce Audio-Assisted BERT (AudiBERT), a novel deep learning framework that leverages the multimodal nature of human voice. To alleviate the small-data problem, AudiBERT integrates pretrained audio and text representation models that generate high-quality embeddings for their respective modalities, and augments them with a dual self-attention mechanism applied to both modalities. When applied to unimodal scripted audio from our Moodable and EMU datasets, AudiBERT achieves F1 scores of up to 0.846 for depression detection and 0.769 for anxiety detection. Applied to depression classification on multimodal voice clips, AudiBERT consistently achieves promising performance, improving F1 scores by 6% to 30% over state-of-the-art audio and text models across the 15 thematic question datasets that together make up the DAIC-WOZ corpus of interviews. Using answers to medically targeted and general wellness questions, our framework achieves F1 scores of up to 0.90 and 0.86, respectively, demonstrating the feasibility of depression screening from informal dialogue.

Together, these steps constitute a comprehensive body of research spanning several aspects of automated mental health screening: emotion detection, anxiety detection, suicidal ideation detection, and, most extensively, depression detection. The results confirm the feasibility of mental health screening from voice clips, especially when using multimodal deep learning models that exploit transfer learning to overcome the small-data problem. This research has the potential for broad impact by expanding mental health screening to voice-enabled devices, alleviating a severe global need.
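To make the final stage concrete, the following is a minimal PyTorch sketch of the fusion pattern described above: a pretrained text encoder and a pretrained audio encoder each produce embeddings, each modality is refined by its own self-attention layer, and the concatenated representation feeds a binary depression classifier. The specific encoders (bert-base-uncased via Hugging Face transformers, torchaudio's Wav2Vec2 base bundle), the attention head counts, the pooling, and the layer sizes are illustrative assumptions, not the actual AudiBERT architecture.

# Minimal sketch of dual-attention multimodal fusion for depression
# classification. Encoder choices and dimensions are assumptions made
# for illustration; they are not the thesis's AudiBERT configuration.
import torch
import torch.nn as nn
import torchaudio
from transformers import BertModel


class DualAttentionFusion(nn.Module):
    def __init__(self, text_dim=768, audio_dim=768, hidden=256):
        super().__init__()
        # Pretrained encoders supply transfer learning for small datasets.
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.wav2vec = torchaudio.pipelines.WAV2VEC2_BASE.get_model()
        # One self-attention block per modality ("dual self-attention").
        self.text_attn = nn.MultiheadAttention(text_dim, num_heads=4, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(audio_dim, num_heads=4, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + audio_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2)
        )

    def forward(self, input_ids, attention_mask, waveform):
        # Text tokens -> contextual embeddings -> self-attention -> mean pool.
        t = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        t, _ = self.text_attn(t, t, t)
        # Raw waveform -> frame-level audio embeddings -> self-attention -> mean pool.
        a, _ = self.wav2vec.extract_features(waveform)
        a = a[-1]                                  # features from the last transformer layer
        a, _ = self.audio_attn(a, a, a)
        fused = torch.cat([t.mean(dim=1), a.mean(dim=1)], dim=-1)
        return self.classifier(fused)              # logits: [not depressed, depressed]

In practice the text branch would receive input_ids and attention_mask from a BERT tokenizer and the audio branch a batch of 16 kHz mono waveforms; the pretrained encoders are what provide the transfer learning that compensates for small labeled mental health datasets.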

Identifier
  • etd-6161
Year
  • 2021
Date created
  • 2021-03-19
Last modified
  • 2023-11-07

Permanent link to this page: https://digital.wpi.edu/show/3x816q57f