Authors : Juhwan Yoon, Seyun Um, Woo-Jin Chung, Hong-Goo Kang
Year : 2024
Publisher / Conference : International Conference on Electronics, Information, and Communication (ICEIC)
Research area : Speech Signal Processing, Etc
Presentation/Publication date : 2024.01.29
Presentation : Poster
We propose SC-ERM, a novel deep learning-based model for speech emotion recognition that focuses on speaker-centric learning. The model effectively estimates emotions and generalizes to unseen speakers. It exploits speaker-specific emotion characteristics in two steps: first, it extracts emotion representations with an emotion encoder; second, it applies speaker-centric learning by incorporating speaker style embeddings as a condition through a speaker mask generator. We evaluate the model on an emotional speech dataset and find that it recognizes emotional states with outstanding performance. Notably, it achieves a 9.2% relative improvement in accuracy over the baseline when classifying emotions for speakers not seen during training. Overall, our model demonstrates promising performance in accurately identifying emotions across a range of emotional expressions, irrespective of the speakers involved.
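The two-step conditioning described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the dimensions, the linear mask generator, and the elementwise gating are all hypothetical choices standing in for the learned emotion encoder and speaker mask generator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- the paper does not specify sizes.
EMO_DIM, SPK_DIM = 16, 8

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Step 1: emotion encoder output (stand-in for a learned representation
# extracted from an utterance).
emotion_repr = rng.standard_normal(EMO_DIM)

# Step 2: speaker mask generator -- here a single linear map from the
# speaker style embedding to a soft (0, 1) mask over emotion features.
speaker_embed = rng.standard_normal(SPK_DIM)
W = rng.standard_normal((EMO_DIM, SPK_DIM)) * 0.1
mask = sigmoid(W @ speaker_embed)

# Speaker-centric conditioning: gate the emotion representation so a
# downstream emotion classifier sees speaker-adapted features.
conditioned = emotion_repr * mask

print(conditioned.shape)  # (16,)
```

In the actual model both modules are trained jointly, so the mask learns which emotion features are informative for a given speaker's style; the sketch only shows the data flow of the conditioning step.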