Multi-Scale Attention-Guided CNN-BiLSTM Framework for Emotion Recognition in Multimodal Video Data

  • J. Biju, Assistant Professor, Division of Data Science and Cyber Security, Karunya Institute of Technology and Sciences, Coimbatore, India. https://orcid.org/0009-0009-1628-7871
  • Lavanya K, Assistant Professor, Artificial Intelligence & Data Science, Akshaya College of Engineering and Technology, Kinathukadavu, Coimbatore, India. https://orcid.org/0000-0002-1116-5738
  • J. Raja, Associate Professor, Computer Science and Engineering, School of Computing, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Chennai, India. https://orcid.org/0000-0003-2183-8585
  • M. Kiruthiga Devi, Assistant Professor (Sr.G), Department of Computer Science and Engineering, SRM Institute of Science and Technology, Vadapalani, India. https://orcid.org/0000-0003-2949-6102
  • Payala Krishnanjaneyulu, Assistant Professor, Koneru Lakshmaiah Education Foundation, Bowrampet, Hyderabad, India. https://orcid.org/0000-0003-4767-0178

Abstract

Emotion is a mental state connected to human behaviour, thoughts, and the degree of positive or negative experience, and it still lacks a precise definition. By enabling AI systems to comprehend human emotions accurately and respond to them empathetically, emotion recognition has the potential to transform human-machine interaction and open the door to more sophisticated, emotionally intelligent computers. The central research problem is building models that read emotions accurately from multimodal data; this calls for large, diverse video datasets that capture complex emotional cues and fine-tuned CNNs that detect subtle changes in speech. This study introduces a novel multimodal emotion detection method that combines the voice and video modalities to infer emotional states correctly. An attention-based CNN-Bi-LSTM model handles the video component, and its bidirectional layers provide deep semantic understanding of the frame sequence. An attention-based fusion process then blends the outputs of both modalities, balancing their respective contributions. The proposed methodology is evaluated on two datasets: the Carnegie Mellon University YouTube dataset and the SAVEE dataset. The results show higher efficacy than existing frameworks. This technology enables accurate emotion recognition and supports a number of noteworthy applications in the field.
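Since the abstract describes the architecture only at a high level, the following is a minimal PyTorch sketch of the two components it names: a per-frame CNN feeding a bidirectional LSTM with temporal attention for the video stream, and an attention-based fusion step that weights the audio and video embeddings before classification. All layer sizes, the placeholder audio embedding, and the seven-class output are illustrative assumptions, not the paper's reported configuration.

    # Minimal sketch of an attention-guided CNN-BiLSTM video branch plus
    # attention-based modality fusion. Dimensions are assumptions for
    # illustration; the abstract does not give the exact architecture.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VideoCNNBiLSTM(nn.Module):
        """Per-frame CNN features -> BiLSTM -> temporal attention pooling."""
        def __init__(self, feat_dim=128, hidden=64):
            super().__init__()
            # Small per-frame CNN standing in for the paper's multi-scale CNN.
            self.cnn = nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(32, feat_dim),
            )
            self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                                  bidirectional=True)
            self.attn = nn.Linear(2 * hidden, 1)   # temporal attention scores

        def forward(self, frames):                 # frames: (B, T, 3, H, W)
            B, T = frames.shape[:2]
            x = self.cnn(frames.flatten(0, 1)).view(B, T, -1)
            h, _ = self.bilstm(x)                  # (B, T, 2*hidden)
            w = F.softmax(self.attn(h), dim=1)     # (B, T, 1) frame weights
            return (w * h).sum(dim=1)              # attention-pooled video vector

    class AttentionFusionClassifier(nn.Module):
        """Weights the audio and video embeddings by learned attention scores."""
        def __init__(self, dim=128, n_classes=7):
            super().__init__()
            self.score = nn.Linear(dim, 1)         # one score per modality
            self.head = nn.Linear(dim, n_classes)

        def forward(self, audio_vec, video_vec):   # each (B, dim)
            stacked = torch.stack([audio_vec, video_vec], dim=1)  # (B, 2, dim)
            alpha = F.softmax(self.score(stacked), dim=1)         # modality weights
            fused = (alpha * stacked).sum(dim=1)                  # (B, dim)
            return self.head(fused)                               # class logits

    if __name__ == "__main__":
        video = VideoCNNBiLSTM()
        fuser = AttentionFusionClassifier()
        frames = torch.randn(2, 8, 3, 64, 64)      # 2 clips, 8 frames each
        audio = torch.randn(2, 128)                # placeholder audio embeddings
        logits = fuser(audio, video(frames))
        print(logits.shape)                        # torch.Size([2, 7])

The key design point the abstract emphasises is that fusion is learned rather than fixed: the softmax over modality scores lets the model lean on video when facial cues dominate and on audio when vocal cues are more informative.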

 

Published
2026-04-27
How to Cite
Biju, J., K, L., Raja, J., Kiruthiga Devi, M., & Krishnanjaneyulu, P. (2026). Multi-Scale Attention-Guided CNN-BiLSTM Framework for Emotion Recognition in Multimodal Video Data. ITEGAM-JETIA, 12(58), 741-754. https://doi.org/10.5935/jetia.v12i58.2986