SMOTE-TomekLink Super-Learner Ensemble Model (STL-SLEM) for the Prediction of Parkinson’s Disease
Keywords:
Parkinson's Disease, Prediction Models, Super Learner, SMOTE-TomekLink
ABSTRACT
Parkinson's Disease (PD) is a progressive neurodegenerative disorder that affects millions of people worldwide. Early detection and prediction of PD can significantly improve patient outcomes by enabling timely intervention and personalized treatment. Over the years, many PD prediction models have been developed using machine learning algorithms. Some of these models overfit because the available PD datasets are small and class-imbalanced. Hence, this work developed a Super Learner Ensemble Model (SLEM) that aggregates several machine learning model configurations to overcome over-fitting and thereby enhance PD prediction performance. Two datasets were used: a Parkinson's disease dataset obtained from Kaggle, and a local dataset from the Federal Medical Center, Abeokuta, Nigeria, used to validate the developed model. The Kaggle dataset consists of 195 biomedical voice measurements taken from 31 people, each measured several times; 23 of the 31 have PD and 8 do not. The local dataset consists of 13 people, 9 with PD and 4 without. Because the acquired dataset is class-imbalanced, the Synthetic Minority Over-sampling Technique combined with Tomek links (SMOTE-TomekLink) was adopted to resample it for class balance. For computational efficiency, six base learners were used to build the Super Learner model: Logistic Regression (LR), Decision Tree (DT), Naïve Bayes (NB), Adaptive Boosting (AB), Bagging Ensemble (BE), and Random Forest (RF). The performance of each base model and of the Super Learner ensemble model was measured using the following metrics: Accuracy, Precision, Recall, F1-Score, Matthews Correlation Coefficient (MCC), and Balanced Accuracy Score (BAS).
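To illustrate the resampling step described above, the following is a minimal pure-Python sketch of SMOTE followed by Tomek-link removal. It is not the implementation used in this work. The function names (nearest, smote, remove_tomek_links), the Euclidean distance, the neighbour count k, the choice to drop both members of each Tomek link, and the toy data are all assumptions made for illustration.

```python
import math
import random


def nearest(idx, points, candidates, k=1):
    """Return up to k indices from `candidates` closest to points[idx]."""
    others = [j for j in candidates if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]


def smote(X, y, minority, k=3, rng=None):
    """Add synthetic minority samples until both classes have equal size.

    Each synthetic point lies on the segment between an original minority
    sample and one of its k nearest minority neighbours.
    Assumes at least two minority samples.
    """
    rng = rng or random.Random(0)  # fixed seed: an illustrative choice
    X, y = [list(p) for p in X], list(y)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    n_new = (len(y) - len(min_idx)) - len(min_idx)
    for _ in range(max(n_new, 0)):
        i = rng.choice(min_idx)
        j = rng.choice(nearest(i, X, min_idx, k))
        gap = rng.random()
        X.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
        y.append(minority)
    return X, y


def remove_tomek_links(X, y):
    """Drop both samples of every Tomek link.

    A Tomek link is a pair of samples from different classes that are
    each other's nearest neighbour.
    """
    everyone = range(len(X))
    nn = [nearest(i, X, everyone)[0] for i in everyone]
    drop = {i for i in everyone if nn[nn[i]] == i and y[i] != y[nn[i]]}
    keep = [i for i in everyone if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (hypothetical, not the PD dataset): 6 majority, 2 minority samples.
X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2], [0.2, 0.4], [0.4, 0.1],
     [1.0, 1.0], [1.2, 1.1]]
y = [1, 1, 1, 1, 1, 1, 0, 0]

X_bal, y_bal = smote(X, y, minority=0, k=1)   # oversample the minority class
X_res, y_res = remove_tomek_links(X_bal, y_bal)  # clean borderline pairs
```

In this sketch the oversampling runs first and the cleaning second, so Tomek links created by synthetic points are also removed. A single cleaning pass is used; removing samples can expose new links, which this sketch does not revisit.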
With the SMOTE-TomekLink-resampled dataset, the Accuracy of LR, DT, NB, AB, BE, and RF was 95.0%, 94.0%, 91.0%, 93.0%, 95.0%, and 96.5%, respectively, while the Super Learner Ensemble model achieved an Accuracy of 99.0%. The developed model therefore improved on the base learners, exceeding the best of them (RF) by 2.5 percentage points in Accuracy.
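The six metrics named in the abstract can all be derived from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name binary_metrics and the example counts are hypothetical and are not results from this work.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute the six metrics from confusion-matrix counts.

    tp/fn count positive (PD) cases predicted correctly/incorrectly;
    tn/fp count negative (non-PD) cases predicted correctly/incorrectly.
    Ratios with a zero denominator are reported as 0.0.
    """
    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)          # sensitivity
    specificity = ratio(tn, tn + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Hypothetical counts for an imbalanced test set (48 positive, 12 negative).
metrics = binary_metrics(tp=45, fp=5, fn=3, tn=7)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On counts like these, Accuracy is noticeably higher than Balanced Accuracy and MCC because the majority class dominates the total, which is one reason for reporting MCC and BAS alongside Accuracy on imbalanced data.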