Thesis Title: Identification of Potential Risk Factors of Diabetes and its Prediction Using Machine Learning Approach

Diabetes is a chronic metabolic disease characterized by an elevated level of blood sugar. It acts as a silent killer in the human body, triggering several health complications. Tackling this disease requires accurate diagnosis at an early stage. Thus, the aim of this study is to identify the risk factors associated with diabetes and to classify diabetic patients using advanced machine learning (ML) techniques. A total of 11866 respondents were selected from the 89819 respondents in the Bangladesh Demographic and Health Survey (BDHS) 2017-18 dataset. Risk factors of diabetes were identified through an extensive literature review and with statistical tools such as the t-test and the chi-square test, together with the Boruta and least absolute shrinkage and selection operator (LASSO) feature selection techniques (FST). Ten ML-based classifiers were implemented: logistic regression (LR), random forest (RF), support vector machine-linear (SVM-l), support vector machine-radial (SVM-r), linear discriminant analysis (LDA), naïve Bayes (NB), extreme gradient boosting (XGBoost), neural network (NNET), k-nearest neighbor (KNN), and decision tree (DT). A combination of the Monte-Carlo and k-fold (K2, K5, and K10) cross-validation protocols was implemented with 25 trials. Area under the curve (AUC), accuracy (ACC), and other traditional performance measures were used to compare the models and identify the best one. The prevalence of diabetes in Bangladesh was 10%. Fifteen variables were finally considered in this study: place of residence, division, age, gender, education, marital status, working status, wealth index, smoking status, electricity, arm circumference, systolic blood pressure, diastolic blood pressure, body mass index, and the outcome variable, diabetes status. Boruta selected all of the variables based on importance, whereas LASSO selected 13 variables, excluding electricity and gender.
Among the implemented models, RF with LASSO achieved the highest performance (K2: ACC = 0.84, AUC = 0.98; K5: ACC = 0.89, AUC = 0.99; K10: ACC = 0.90, AUC = 0.99). This result was also validated on the BDHS-2011 dataset. Classifiers gave similar results with Boruta-based and LASSO-based FST, although LASSO with the K10 protocol was less complex and more time-efficient because it used fewer variables. Thus, the combination of LASSO-based FST and the RF classifier is preferable.
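
As a rough illustration of the evaluation protocol described above, the following minimal Python sketch generates repeated k-fold splits and computes ACC and AUC. It is not the thesis code. It assumes that "Monte-Carlo and k-fold with 25 trials" means reshuffling the data and re-partitioning it into k folds 25 times, and the function names (monte_carlo_kfold, accuracy, auc) are hypothetical. The classifier and feature selection steps are left out, so any model could be fitted on each training split.

```python
import random


def monte_carlo_kfold(n_samples, k, n_trials=25, seed=0):
    """Yield (trial, train_idx, test_idx) for n_trials reshuffled k-fold splits.

    Assumption: each trial reshuffles all samples and cuts them into k folds.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for trial in range(n_trials):
        rng.shuffle(indices)  # new random partition for every trial
        folds = [indices[i::k] for i in range(k)]  # k folds of near-equal size
        for i, test_idx in enumerate(folds):
            # Training set: every fold except the held-out one.
            train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
            yield trial, train_idx, test_idx


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy usage: 25 trials of 5-fold splitting over 20 samples.
splits = list(monte_carlo_kfold(n_samples=20, k=5, n_trials=25))
print(len(splits))  # 125 = 25 trials x 5 folds
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Under this reading, K2, K5, and K10 correspond to k = 2, 5, and 10, and the reported ACC and AUC would be averages over all folds and trials. The splits here are not stratified by diabetes status; whether the thesis stratified its folds is not stated in the abstract.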

Details
Role Supervisor
Class / Degree Masters
Students

Hasibul Hasan Shanto (Student ID: MS-202010)

Start Date 2019
End Date 2022