Contact:
Mobile: +61 476 855 648; WhatsApp: +88 01714960969
Thesis Title: Identification of Potential Risk Factors of Diabetes and its Prediction Using Machine Learning Approach
Diabetes is a chronic metabolic disease, characterized
by an elevated level of blood sugar. It acts as a silent killer in the human body,
triggering several health complications. Tackling this disease requires accurate
diagnosis at an early stage. Thus, the aim of this study is to identify the risk
factors associated with diabetes and classify diabetic patients using advanced
machine learning (ML) techniques. A total of 11,866 respondents were drawn
from the 89,819 respondents in the Bangladesh Demographic and Health Survey (BDHS)
2017-18 dataset. Risk factors of diabetes were identified through an extensive
literature review and with statistical tools, namely the t-test and chi-square
test, together with two feature selection techniques (FSTs): Boruta and the least
absolute shrinkage and selection operator (LASSO). Ten ML-based classifiers were
implemented: logistic regression (LR), random forest (RF), support vector
machine-linear (SVM-l), support vector machine-radial (SVM-r), linear
discriminant analysis (LDA), naïve Bayes (NB), extreme gradient boosting
(XGBoost), neural network (NNET), k-nearest neighbor (KNN), and decision tree
(DT). Monte-Carlo and k-fold (K2, K5, and K10) cross-validation protocols were
combined and run over 25 trials. Area under the curve (AUC), accuracy
(ACC), and other traditional performance measures were used to compare and
identify the best model. The prevalence of diabetes in Bangladesh was 10%.
Fifteen variables were finally considered in this study: place of residence,
division, age, gender, education, marital status, working status, wealth index,
smoking status, electricity, arm circumference, systolic blood pressure,
diastolic blood pressure, body mass index, and the outcome variable, diabetes
status. Boruta selected all the variables based on
importance, whereas LASSO selected 13 variables, excluding electricity and gender. Among
the implemented models, RF had the highest performance (for K2: ACC = 0.84, AUC
= 0.98; for K5: ACC = 0.89, AUC = 0.99; for K10: ACC = 0.90, AUC = 0.99) with
LASSO. This result was also validated with the BDHS-2011 dataset. Classifiers with
both Boruta-based and LASSO-based FST gave similar results, although LASSO with the K10
protocol was less complex and more time-efficient because it used fewer
variables. Thus, the combination of LASSO-based FST and the RF classifier is
preferable.
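
The evaluation protocol described in the abstract can be illustrated with a minimal sketch. It assumes that "a combination of Monte-Carlo and k-fold" means repeating k-fold cross-validation on a freshly shuffled partition in each of the 25 trials; the thesis may define the protocol differently. The function names (repeated_kfold_indices, accuracy, auc) are hypothetical and are not taken from the thesis, and no classifier or survey data is included.

```python
import random


def repeated_kfold_indices(n_samples, k, n_trials, seed=0):
    """Yield (trial, fold, train_idx, test_idx) for k-fold CV repeated over trials.

    Assumption: each trial reshuffles the sample order, which is one way to
    read the "Monte-Carlo plus k-fold" protocol.
    """
    rng = random.Random(seed)
    for trial in range(n_trials):
        order = list(range(n_samples))
        rng.shuffle(order)  # new random partition for this trial
        for fold in range(k):
            test_idx = order[fold::k]  # every k-th shuffled index forms one fold
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            yield trial, fold, train_idx, test_idx


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # K10 with 25 trials gives 10 * 25 train/test splits.
    splits = list(repeated_kfold_indices(n_samples=100, k=10, n_trials=25))
    print(len(splits))
```

In a full pipeline, each split would be used to fit a classifier on train_idx and score it on test_idx, and ACC and AUC would then be averaged over all folds and trials.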
Details | |
---|---|
Role | Supervisor |
Class / Degree | Masters |
Students | Hasibul Hasan Shanto (Student ID: MS-202010) |
Start Date | 2019 |
End Date | 2022 |