Frequency domain features and their robustness for spoken emotion classification with CNNs
Abstract:
Speech Emotion Recognition (SER) is an important task because emotion is a key dimension of human communication. However, SER systems are highly sensitive to environmental noise. Convolutional neural networks (CNNs) are among the state-of-the-art machine learning techniques used to address this ongoing problem. The literature indicates that frequency domain features, obtained via the Fourier transform and related techniques, outperform other feature types such as time domain features. Research also shows that different features and their combinations have varying impacts on the performance and noise robustness of machine learning models, but this aspect has not been thoroughly investigated for SER. The aim of this research is to study the performance of CNN-based SER systems using several different frequency domain features and their combinations.
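To make the abstract concrete, the sketch below shows how two common frequency domain features (a log-mel spectrogram and MFCCs) can be computed from an utterance and concatenated as a combined input for a CNN. It is a minimal illustration only; the library (librosa), the parameter values (sampling rate, FFT size, hop length, numbers of mel bands and coefficients), and the specific feature combination are assumptions and are not prescribed by the proposal.

```python
import numpy as np
import librosa


def extract_frequency_features(wav_path,
                               sr=16000,      # assumed sampling rate
                               n_fft=1024,    # assumed STFT window size
                               hop_length=256,
                               n_mels=64,
                               n_mfcc=40):
    """Compute example frequency domain features for one utterance (illustrative only)."""
    y, _ = librosa.load(wav_path, sr=sr)

    # Log-mel spectrogram: short-time Fourier transform mapped onto a mel filter bank.
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)

    # MFCCs: discrete cosine transform of the log-mel energies.
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc, n_fft=n_fft, hop_length=hop_length)

    # One simple way to combine feature types: stack them along the
    # frequency/feature axis (both share the same number of time frames
    # because the same hop length is used), yielding a 2-D "image" that
    # a CNN can consume as a single-channel input.
    combined = np.concatenate([log_mel, mfcc], axis=0)
    return combined
```

A CNN-based SER system would then treat each resulting (features x time) matrix as a one-channel image; which individual features or combinations yield the best accuracy and noise robustness is exactly the question the proposed research investigates.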
Committee:
- Madhur Behl (Chair)
- John Stankovic (Advisor)
- Yangfeng Ji
- Brad Campbell