Certificate of Completion
This acknowledges that Sarasi Jayasekara has completed the May-Summer 2024 Deep Learning Boot Camp.
Director: Roman Holowinsky, PhD
Date: September 06, 2024
Team Project: A Vocal-Cue Interpreter for Minimally-Verbal Individuals
Team: Julian Rosen, Alessandro Malusà, Rahul Krishna, Atharva Patil, Monalisa Dutta, Sarasi Jayasekara
The ReCANVo dataset consists of ~7,000 audio recordings of vocalizations from 8 minimally-verbal individuals, most of whom have developmental disabilities. The recordings were made in real-world settings and were categorized on the spot by the speaker's caregiver based on context, non-verbal cues, and familiarity with the speaker. The labels come from several pre-defined categories (e.g., selftalk, frustrated, delighted, request), and caregivers could also specify custom categories. Our goal was to train, for each individual, a model that accurately predicts these labels and improves upon previous work.
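As a rough sketch of the per-individual setup, the snippet below groups recordings by speaker and splits each speaker's data into train and test sets, stratified by label. The metadata file name and column names ("speaker", "label") are placeholders for illustration, not the actual ReCANVo schema.

```python
# Minimal sketch of the per-individual split; file and column names are
# hypothetical, not the actual ReCANVo metadata layout.
import pandas as pd
from sklearn.model_selection import train_test_split

meta = pd.read_csv("recanvo_metadata.csv")  # placeholder metadata file

splits = {}
for speaker, group in meta.groupby("speaker"):
    # One model per individual: split that speaker's recordings into
    # train/test, stratified by label so rare categories appear in both.
    train, test = train_test_split(
        group, test_size=0.2, stratify=group["label"], random_state=0
    )
    splits[speaker] = (train, test)
```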
We trained several combinations of models of the form "Feature Extractor + Classifier". To extract features from the audio, we used two pre-trained deep models (HuBERT and AST) as well as mel spectrograms. As classifiers, we used a 4-layer convolutional neural network (for mel spectrograms), neural networks with fully-connected layers (for features from the deep models), and other architectures.
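Below is a minimal sketch of one such combination: frozen pre-trained HuBERT embeddings feeding a small fully-connected classification head. The Hugging Face checkpoint, the mean-pooling over time frames, the layer sizes, and the number of labels are illustrative assumptions, not the exact configuration used in the project; the mel-spectrogram + CNN pipeline would be structured analogously.

```python
# Sketch of one "Feature Extractor + Classifier" combination: frozen
# pre-trained HuBERT features plus a small fully-connected head.
# Checkpoint, pooling, and layer sizes are illustrative guesses.
import torch
import torch.nn as nn
from transformers import AutoFeatureExtractor, HubertModel

extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

class VocalizationClassifier(nn.Module):
    def __init__(self, hidden_dim=768, num_labels=5):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, num_labels),
        )

    def forward(self, waveform):
        # waveform: 1-D tensor of 16 kHz audio samples
        inputs = extractor(
            waveform.numpy(), sampling_rate=16000, return_tensors="pt"
        )
        with torch.no_grad():  # keep the feature extractor frozen
            hidden = hubert(**inputs).last_hidden_state  # (1, T, 768)
        pooled = hidden.mean(dim=1)  # average over time frames
        return self.head(pooled)

model = VocalizationClassifier(num_labels=5)
logits = model(torch.randn(16000))  # one second of dummy audio
```

Only the classification head is trained here; whether to fine-tune the feature extractor as well is a separate design choice.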