Room No.: Stat-3162, Ground Floor, Statistics Discipline, Kabi Jibananda Das Academic Building (3rd Academic Building), Khulna University, Khulna-9208, Bangladesh.



    +880 1714960969


Imputation and Analysis of Missing Values Using Different Data Mining Techniques


Background and objectives: Chronic kidney disease (CKD) is a slow, progressive loss of kidney function that imposes a high economic cost on health systems and is an independent risk factor for cardiovascular disease. About 10% of the population worldwide is affected by CKD, and millions die each year because they do not have access to affordable treatment. The main objectives of this study are to impute the missing values of a CKD dataset using different imputation techniques, to classify CKD patients using various data mining techniques, and to compare the classifiers.

Methods and materials: The CKD dataset is taken from the UCI Machine Learning Repository and contains missing values, which are imputed using well-known imputation techniques. Numerical missing values are imputed by the mean, the median and a linear trend, and nominal (categorical) missing values are imputed using a random number generator. The R packages “mice” and “Amelia” are also used to impute missing observations. Four classifiers are used to classify CKD patients: logistic regression (LR), support vector machine (SVM), random forest (RF) and linear discriminant analysis (LDA). 70% of the dataset is taken as the training set and the rest as the test set, and this procedure is repeated 1000 times. The performance of the classifiers is evaluated by accuracy (ACC), sensitivity (SE), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), F-measure and area under the curve (AUC).
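As a rough illustration of the steps described above, the following Python sketch shows mean or median imputation for a numerical column, random imputation for a categorical column, one 70/30 split, and the confusion-matrix metrics. It is not the code used in the study, which relied on R. The function names are hypothetical, and drawing the categorical fill value from the observed categories is an assumption, since the abstract does not say how the random number generator was applied. Linear trend imputation, the mice and Amelia packages, the classifiers and AUC are not covered.

```python
import random
import statistics


def impute_numeric(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


def impute_categorical(values, rng):
    """Replace None entries with a randomly drawn observed category.

    Assumption: the draw is made from the observed entries, so frequent
    categories are chosen more often. The abstract does not specify this.
    """
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]


def split_indices(n, rng, train_fraction=0.7):
    """Shuffle row indices and cut them into training and test parts."""
    indices = list(range(n))
    rng.shuffle(indices)
    cut = round(train_fraction * n)
    return indices[:cut], indices[cut:]


def _ratio(numerator, denominator):
    """Return the ratio, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(y_true, y_pred, positive=1):
    """Compute ACC, SE, SP, PPV, NPV and F-measure from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    se = _ratio(tp, tp + fn)
    ppv = _ratio(tp, tp + fp)
    return {
        "ACC": _ratio(tp + tn, len(pairs)),
        "SE": se,
        "SP": _ratio(tn, tn + fp),
        "PPV": ppv,
        "NPV": _ratio(tn, tn + fn),
        "F": _ratio(2 * ppv * se, ppv + se),
    }
```

To mirror the repeated evaluation, one would call `split_indices` 1000 times with the same `random.Random` instance, fit a classifier on each training part, and average the metrics over the test parts.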

Results: In the dataset, at most 38% of the observations are missing in any one variable. All of the missing value imputation techniques and classifiers performed well. For the dataset without missing values, SVM gives 100% ACC, SE, SP, PPV, NPV, F-measure and AUC. In the comparison of classifiers for MRNG, SVM gives the highest SP (99.47%) and PPV (99.63%). In the comparison of classifiers for MeRNG, SVM gives the highest ACC (98.82%) and AUC (99.96%). When missing values are imputed using the mice package, SVM gives the highest AUC (100%), and when they are imputed using Amelia, SVM gives the highest AUC (99.99%).

Conclusion: We may conclude that all of the missing value imputation techniques performed very well. Among them, Amelia gives higher accuracy, and SVM is the best classifier compared to the others.

Role Supervisor
Class / Degree Masters

(Rafayat Zakia Mim, Student No. MS-162008 Session 2015-2016, Examination: 2018).

Start Date 2017
End Date 2018