Inter-Cluster Distance-Based SMOTE Modification for Enhanced Diabetes Classification

Abstract

Diabetes is a major global health challenge, and early diagnosis plays an important role in preventing serious complications. Medical datasets, however, are often class-imbalanced: non-diabetes cases far outnumber diabetes cases. This imbalance biases machine learning models toward the majority class and degrades prediction performance on the minority class. The commonly used oversampling method SMOTE (Synthetic Minority Over-sampling Technique) selects the points from which new synthetic samples are formed at random, which often yields less representative synthetic data and lowers model performance. This research proposes a modification of SMOTE based on inter-cluster distance to address this problem. The approach uses the distances between cluster centroids within the minority class to generate synthetic data that is more representative. The methodology comprises data preprocessing (missing-value imputation, normalization, and class balancing with the modified SMOTE), followed by classification with the Random Forest algorithm. Performance was evaluated using accuracy, precision, recall, and F1-score. The proposed approach achieved 99.7% on each of the four metrics, far surpassing previous studies that used standard oversampling methods. These results show that the inter-cluster distance-based SMOTE modification is effective in overcoming class imbalance and producing more representative synthetic data.
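
To make the contrast in the abstract concrete, the following is a minimal sketch in plain Python. The first function is the standard SMOTE interpolation step, which places a synthetic point at a random position between a minority sample and one of its minority-class neighbors. The second function is only an illustrative guess at what a centroid-guided variant could look like: the abstract does not spell out the sampling rule, so the choice of "nearest other cluster centroid" as the interpolation target, the function names, and the toy data are all assumptions and should not be read as the authors' algorithm.

```python
import random


def smote_sample(x, neighbor, rng):
    """Standard SMOTE step: a random point on the segment from x to a
    minority-class neighbor."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]


def centroid(points):
    """Coordinate-wise mean of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


def centroid_guided_sample(x, own_cluster, clusters, rng):
    """ASSUMED variant (not taken from the paper): interpolate x toward the
    centroid of the minority cluster whose centroid is closest to the
    centroid of x's own cluster."""
    own_c = centroid(clusters[own_cluster])
    other_centroids = [
        centroid(c) for i, c in enumerate(clusters) if i != own_cluster
    ]
    target = min(other_centroids, key=lambda c: euclidean(own_c, c))
    lam = rng.random()
    return [a + lam * (t - a) for a, t in zip(x, target)]


# Toy minority-class clusters (made-up values, for illustration only).
toy_clusters = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[4.0, 4.0], [6.0, 6.0]],
    [[20.0, 20.0], [22.0, 22.0]],
]

rng = random.Random(0)
plain = smote_sample([0.0, 0.0], [2.0, 4.0], rng)
guided = centroid_guided_sample([1.0, 1.0], 0, toy_clusters, rng)
print(plain, guided)
```

In this sketch the cluster assignment is given by hand; in practice it would come from a clustering step on the minority class, which the abstract implies but does not describe.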

Published
2025-02-21
How to Cite
Nurzari, I., Sari, E., Harris, D., Priyatno, A., & Rusnedy, H. (2025). Inter-Cluster Distance-Based SMOTE Modification for Enhanced Diabetes Classification. ITEGAM-JETIA, 11(51). https://doi.org/10.5935/jetia.v11i51.1453
Section
Articles