Explainable Diabetes Prediction via Hybrid Data Preprocessing and Ensemble Learning
Published in: International Journal of Web Research, Vol. 8, Issue 4
Publication year: 1404 (Solar Hijri)
Document type: Journal article
Language: English
This paper is 16 pages long and is available in PDF format.
National scientific document ID:
JR_IJWR-8-4_005
Indexing date: 3 Aban 1404
Abstract:
Accurate and early prediction of diabetes is crucial for initiating prompt treatment and minimizing the risk of long-term health issues. This study introduces a comprehensive machine learning model aimed at improving diabetes prediction by leveraging two clinical datasets: the PIMA Indians Diabetes Dataset and the Early-Stage Diabetes Dataset. The pipeline tackles common challenges in medical data, such as missing values, class imbalance, and feature relevance, through a series of advanced preprocessing steps, including class-specific imputation, engineered feature construction, and SMOTETomek resampling. To identify the most informative predictors, a hybrid feature selection strategy is employed, integrating recursive elimination, Random Forest-based importance, and gradient boosting. Model training uses Random Forest and Gradient Boosting classifiers, which are fine-tuned and combined through weighted ensemble averaging to boost predictive performance. The resulting model achieves 93.33% accuracy on the PIMA dataset and 98.44% accuracy on the Early-Stage dataset, outperforming previously reported approaches. To enhance transparency and clinical applicability, both local (LIME) and global (SHAP) explainability methods are applied, highlighting clinically relevant features. Furthermore, probability calibration is performed to ensure that predicted risk scores align with true outcome frequencies, increasing trust in the model’s use for clinical decision support. Overall, the proposed model offers a robust, interpretable, and clinically reliable solution for early-stage diabetes prediction.
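Two of the steps named in the abstract, class-specific imputation and weighted ensemble averaging, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the function names, the use of the median as the fill value, the toy feature values, and the 3:1 model weights are all assumptions made for illustration, since the abstract does not give these details.

```python
from statistics import median


def impute_by_class(rows, labels):
    """Fill missing entries (None) in each column with the median of that
    column computed only over rows that share the same class label.

    Assumption: the median is the fill statistic. Raises StatisticsError
    if a class has no observed value at all in some column.
    """
    n_cols = len(rows[0])
    medians = {}
    for label in set(labels):
        for col in range(n_cols):
            observed = [r[col] for r, y in zip(rows, labels)
                        if y == label and r[col] is not None]
            medians[(label, col)] = median(observed)
    return [
        [medians[(y, c)] if v is None else v for c, v in enumerate(r)]
        for r, y in zip(rows, labels)
    ]


def weighted_average_proba(probas, weights):
    """Combine per-model class-probability vectors into one vector by a
    weighted average. `probas` holds one vector per model; `weights`
    holds one non-negative weight per model."""
    total = sum(weights)
    n_classes = len(probas[0])
    return [
        sum(w * p[i] for w, p in zip(weights, probas)) / total
        for i in range(n_classes)
    ]


# Made-up toy data: two columns (glucose-like, BMI-like), None = missing.
toy_rows = [[100, None], [120, 30.0], [None, 40.0], [160, 35.0], [180, None]]
toy_labels = [0, 0, 1, 1, 1]
imputed = impute_by_class(toy_rows, toy_labels)

# Hypothetical outputs of two classifiers for one patient, with
# hypothetical weights 3 and 1 (the paper's weights are not stated here).
combined = weighted_average_proba([[0.2, 0.8], [0.4, 0.6]], [3, 1])
```

In this sketch each missing value is filled from patients with the same label, so the fill value reflects the class rather than the whole population, and the combined probability vector still sums to one because the weights are normalized by their total.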
Keywords:
Authors
Ghazaleh Kakavand Teimoory
Data Mining Laboratory, Department of Computer Engineering, Faculty of Engineering, Alzahra University, Tehran, Iran.
Mohammad Reza Keyvanpour
Department of Computer Engineering, Faculty of Engineering, Alzahra University, Tehran, Iran.
Maryam Ghaebi
Data Mining Laboratory, Department of Computer Engineering, Faculty of Engineering, Alzahra University, Tehran, Iran.
References of this paper: