Multi-Scale Attention-Guided CNN-BiLSTM Framework for Emotion Recognition in Multimodal Video Data
Abstract
Emotion is a mental state connected to human behaviour, thoughts, and the degree of positive or negative experience, and it still lacks a precise definition. By allowing AI systems to accurately comprehend and empathetically respond to human emotions, emotion recognition has the potential to transform human-machine interaction and pave the way for more sophisticated, emotionally intelligent computers. The central research problem is building models that accurately infer emotions from multimodal data; this calls for large, diverse video datasets to capture complex emotional cues and fine-tuned CNNs to detect subtle variations in speech audio. This study introduces a novel multimodal emotion recognition method that combines the voice and video modalities to infer emotional states. An attention-based CNN-BiLSTM model handles the video component, with its bidirectional layers providing deep semantic understanding of temporal context. The outputs of the two modalities are blended through an attention-based fusion mechanism that balances their respective contributions. The proposed methodology is evaluated on two datasets: the YouTube dataset and the SAVEE dataset. The results show higher efficacy than current frameworks, enabling accurate emotion recognition and supporting several noteworthy applications in industry.
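The pipeline the abstract describes (per-frame CNN features, a bidirectional LSTM over time, soft attention pooling, and attention-weighted fusion of the video and audio branches) can be sketched as follows. This is a minimal illustrative sketch, not the authors' published configuration: the layer sizes, the attention formulation, the audio branch (a simple 1-D CNN over MFCC-like features), and the two-way fusion gate are all assumptions.

```python
# Minimal PyTorch sketch of an attention-guided CNN-BiLSTM with
# attention-based multimodal fusion. All hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionPool(nn.Module):
    """Soft attention over a sequence of hidden states -> single vector."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h):                      # h: (batch, time, dim)
        w = F.softmax(self.score(h), dim=1)    # attention weights over time
        return (w * h).sum(dim=1)              # weighted sum: (batch, dim)


class CNNBiLSTM(nn.Module):
    """Per-frame CNN features -> BiLSTM -> attention pooling."""

    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        # Lightweight stand-in for a frame-level CNN feature extractor.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
            nn.Flatten(), nn.Linear(16 * 4 * 4, feat_dim), nn.ReLU(),
        )
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.attn = AttentionPool(2 * hidden)

    def forward(self, frames):                 # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        f = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        h, _ = self.bilstm(f)                  # (batch, time, 2*hidden)
        return self.attn(h)                    # (batch, 2*hidden)


class MultimodalEmotionNet(nn.Module):
    """Fuses video and audio embeddings with learned attention weights."""

    def __init__(self, n_classes=7, hidden=64, audio_dim=40):
        super().__init__()
        self.video = CNNBiLSTM(hidden=hidden)
        # Assumed audio branch: a 1-D CNN over MFCC-style feature frames.
        self.audio = nn.Sequential(
            nn.Conv1d(audio_dim, 2 * hidden, 5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        self.gate = nn.Linear(4 * hidden, 2)   # one score per modality
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, frames, audio):          # audio: (batch, audio_dim, time)
        v, a = self.video(frames), self.audio(audio)
        w = F.softmax(self.gate(torch.cat([v, a], dim=-1)), dim=-1)
        fused = w[:, :1] * v + w[:, 1:] * a    # attention-weighted fusion
        return self.head(fused)


if __name__ == "__main__":
    model = MultimodalEmotionNet()
    logits = model(torch.randn(2, 8, 3, 32, 32), torch.randn(2, 40, 100))
    print(logits.shape)                        # torch.Size([2, 7])
```

The fusion gate produces a softmax weight per modality, so the network can learn to lean on whichever branch is more informative for a given clip; any comparable weighting scheme would fit the same description in the abstract.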
Copyright (c) 2026 ITEGAM-JETIA

This work is licensed under a Creative Commons Attribution 4.0 International License.