A Multimodal Graph Contrastive Learning Approach for Human Activity Recognition Using Deep Learning Techniques
Abstract
Deep learning techniques for Human Activity Recognition (HAR) have achieved remarkable improvements in recent years in recognizing complex action classes in real-world contexts. This research advances a unified deep learning framework, the Hybrid Dense Temporal Transformer Network (HDTTN), that captures spatial, temporal, and semantic information for better human activity detection. We introduce DenseNet to improve spatial feature extraction from visual inputs, Temporal Convolutional Networks (TCNs) to learn short-term motion patterns, and Transformer encoders to learn the long-range temporal dependencies that are crucial for processing complex and subtle activities. The study employs an early multimodal feature fusion strategy to further enhance representational coherence, making it easier to incorporate heterogeneous cues at the feature level and to learn dynamic multimodal representations. Moreover, a hybrid optimization approach is integrated for parameter fine-tuning to improve efficiency, reduce overfitting, and boost model robustness. The proposed HDTTN framework is shown to be effective on the large-scale and challenging Kinetics dataset, which contains a wide range of unconstrained human activities. Experimental results show that the proposed model achieves 93% accuracy, outperforming several existing state-of-the-art baseline approaches. Qualitative and quantitative analyses further validate HDTTN's ability to identify intricate and nuanced activities across a multitude of environments.
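
The abstract does not specify the exact architecture, so the following PyTorch code is only a minimal sketch of the described pipeline under stated assumptions: the class name HDTTNSketch, the densenet121 variant, the layer sizes, two input modalities (e.g. RGB and optical flow), and fusion by concatenation followed by a linear projection are all illustrative choices, not the authors' configuration. It shows how per-frame DenseNet features, early multimodal fusion, a dilated-convolution TCN for short-term motion, and a Transformer encoder for long-range temporal dependencies could be composed.

import torch
import torch.nn as nn
from torchvision.models import densenet121

class HDTTNSketch(nn.Module):
    def __init__(self, num_classes=400, feat_dim=256, num_modalities=2):
        super().__init__()
        # Spatial branch: a DenseNet backbone applied frame by frame,
        # with its classifier replaced by a projection to feat_dim.
        backbone = densenet121(weights=None)
        backbone.classifier = nn.Linear(backbone.classifier.in_features, feat_dim)
        self.spatial = backbone
        # Early fusion (assumed scheme): concatenate per-frame features
        # from all modalities, then project back to a shared embedding.
        self.fuse = nn.Linear(feat_dim * num_modalities, feat_dim)
        # TCN: dilated 1-D convolutions over time for short-term motion.
        self.tcn = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1, dilation=1),
            nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(),
        )
        # Transformer encoder for long-range temporal dependencies.
        enc_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4,
                                               batch_first=True)
        self.temporal = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, clips):
        # clips: list of (B, T, 3, H, W) tensors, one per modality.
        feats = []
        for clip in clips:
            b, t = clip.shape[:2]
            f = self.spatial(clip.flatten(0, 1))         # (B*T, feat_dim)
            feats.append(f.view(b, t, -1))               # (B, T, feat_dim)
        x = self.fuse(torch.cat(feats, dim=-1))          # early feature fusion
        x = self.tcn(x.transpose(1, 2)).transpose(1, 2)  # short-term motion
        x = self.temporal(x)                             # long-range dependencies
        return self.head(x.mean(dim=1))                  # clip-level logits

For a Kinetics-style setup with two streams, clips would be a two-element list of (B, T, 3, 224, 224) tensors; averaging the Transformer outputs over time before the classification head is one common way to obtain a clip-level prediction.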
Copyright (c) 2026 ITEGAM-JETIA

This work is licensed under a Creative Commons Attribution 4.0 International License.