Multi-Scale Attention-Guided CNN-BiLSTM Framework for Emotion Recognition in Multimodal Video Data
Abstract
Emotion is a mental state connected to human behaviour, thoughts, and the degree of positive or negative experience, and it still lacks a precise definition. By allowing AI systems to accurately comprehend and empathetically respond to human emotions, emotion recognition has the potential to transform human-machine interaction and pave the way for more sophisticated, emotionally intelligent computers. The central research problem is building models that accurately infer emotions from multimodal data; this calls for large, diverse video datasets to capture complex emotional cues and fine-tuned CNNs to detect subtle variations in speech audio. This study introduces a novel multimodal emotion recognition method that combines the voice and video modalities to infer emotional states. An attention-based CNN-BiLSTM model handles the video component, with its bidirectional layers providing deep semantic understanding of temporal context. The outputs of the two modalities are blended through an attention-based fusion mechanism that balances their respective contributions. The proposed methodology is evaluated on two datasets: the YouTube dataset and the SAVEE dataset. The results show higher efficacy than current frameworks, enabling accurate emotion recognition and supporting several noteworthy applications in industry.
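The pipeline the abstract describes (per-frame CNN features, a bidirectional LSTM over time, soft attention pooling, and attention-weighted fusion of the video and audio branches) can be sketched as follows. This is a minimal illustrative sketch, not the authors' published configuration: the layer sizes, the attention formulation, the audio branch (a simple 1-D CNN over MFCC-like features), and the two-way fusion gate are all assumptions.

```python
# Minimal PyTorch sketch of an attention-guided CNN-BiLSTM with
# attention-based multimodal fusion. All hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionPool(nn.Module):
    """Soft attention over a sequence of hidden states -> single vector."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h):                      # h: (batch, time, dim)
        w = F.softmax(self.score(h), dim=1)    # attention weights over time
        return (w * h).sum(dim=1)              # weighted sum: (batch, dim)


class CNNBiLSTM(nn.Module):
    """Per-frame CNN features -> BiLSTM -> attention pooling."""

    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        # Lightweight stand-in for a frame-level CNN feature extractor.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
            nn.Flatten(), nn.Linear(16 * 4 * 4, feat_dim), nn.ReLU(),
        )
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.attn = AttentionPool(2 * hidden)

    def forward(self, frames):                 # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        f = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        h, _ = self.bilstm(f)                  # (batch, time, 2*hidden)
        return self.attn(h)                    # (batch, 2*hidden)


class MultimodalEmotionNet(nn.Module):
    """Fuses video and audio embeddings with learned attention weights."""

    def __init__(self, n_classes=7, hidden=64, audio_dim=40):
        super().__init__()
        self.video = CNNBiLSTM(hidden=hidden)
        # Assumed audio branch: a 1-D CNN over MFCC-style feature frames.
        self.audio = nn.Sequential(
            nn.Conv1d(audio_dim, 2 * hidden, 5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        self.gate = nn.Linear(4 * hidden, 2)   # one score per modality
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, frames, audio):          # audio: (batch, audio_dim, time)
        v, a = self.video(frames), self.audio(audio)
        w = F.softmax(self.gate(torch.cat([v, a], dim=-1)), dim=-1)
        fused = w[:, :1] * v + w[:, 1:] * a    # attention-weighted fusion
        return self.head(fused)


if __name__ == "__main__":
    model = MultimodalEmotionNet()
    logits = model(torch.randn(2, 8, 3, 32, 32), torch.randn(2, 40, 100))
    print(logits.shape)                        # torch.Size([2, 7])
```

The fusion gate produces a softmax weight per modality, so the network can learn to lean on whichever branch is more informative for a given clip; any comparable weighting scheme would fit the same description in the abstract.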
Copyright (c) 2026 ITEGAM-JETIA

This work is licensed under a Creative Commons Attribution 4.0 International License.