Address:
Room No.: Stat-3161, Ground Floor, Statistics Discipline, Kabi Jibananda Das Academic Building, Khulna University, Khulna-9208, Bangladesh.
Email:
office@stat.ku.ac.bd
Contact:
+880 1752385769 ; +880 1970385769
Personal Webpage:
click here
Thesis 01: Feature selection and classification techniques for microarray gene expression data
Background and objectives: Breast cancer is one of the leading and most harmful forms of cancer. According to the WHO, there were about 18.1 million new cancer cases and 9.6 million cancer deaths worldwide in 2018, so effective treatment is essential. Medical research has shown that a large number of genes can be involved in a particular disease, but not all of them contribute equally. The most informative genes therefore need to be identified so that they can be controlled. The main objective of this study is to find the most informative genes using different feature selection techniques and to identify the best classifier.
Materials and methods: The breast cancer data were taken from the Kent Ridge biomedical data repository, USA. The dataset contains 24,481 genes and 97 patient samples, of which 46 are cancer cases and 51 are controls. We used feature selection techniques such as the t-test and the Wilcoxon sign rank sum (WCSRS) test. AdaBoost (AB), artificial neural network (ANN), random forest (RF), k-nearest neighbor (KNN), linear discriminant analysis (LDA), and naive Bayes (NB) were used as classification techniques. We used 70% of the dataset as the training set and the rest as the test set, and repeated this procedure about 1000 times. The performance of the classifiers was evaluated by accuracy (ACC), sensitivity (SE), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), F-measure (FM), and area under the curve (AUC). We also used a simulated dataset to check the validity of our experiment.
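The evaluation step described above can be illustrated with a minimal Python sketch. It is not the thesis code: the function names `classification_metrics` and `random_split` are hypothetical, the metric formulas are the usual confusion-matrix definitions (the abstract does not spell them out), and the split is a plain random 70/30 cut, since the abstract does not say whether the split was stratified. AUC is omitted because it needs classifier scores rather than hard labels.

```python
import random


def classification_metrics(y_true, y_pred, positive=1):
    """Confusion-matrix metrics for a binary classifier (standard definitions)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # Return NaN instead of raising when a denominator is zero.
        return num / den if den else float("nan")

    se = ratio(tp, tp + fn)   # sensitivity (recall)
    ppv = ratio(tp, tp + fp)  # positive predictive value (precision)
    return {
        "ACC": ratio(tp + tn, len(pairs)),
        "SE": se,
        "SP": ratio(tn, tn + fp),
        "PPV": ppv,
        "NPV": ratio(tn, tn + fn),
        "FM": ratio(2 * ppv * se, ppv + se),  # harmonic mean of PPV and SE
    }


def random_split(n_samples, train_fraction=0.7, rng=None):
    """Shuffle sample indices and cut them into train and test index lists."""
    rng = rng or random.Random()
    indices = list(range(n_samples))
    rng.shuffle(indices)
    cut = round(train_fraction * n_samples)
    return indices[:cut], indices[cut:]
```

In a repeated hold-out experiment of the kind described, one would call `random_split` once per repetition, fit a classifier on the training indices, score the test indices with `classification_metrics`, and average each metric over all repetitions.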
Results: For the breast cancer dataset, the t-test selects the largest number of informative genes (2265) at a p-value below 0.05. NB gives the highest accuracy (87.96%) as well as the highest area under the curve (91.18%) when the features are selected using the t-test. We simulated another dataset to validate the results obtained on the breast cancer dataset.
Conclusion: Over the last few years, researchers have started exploring cancer classification using gene expression data. Most previously proposed algorithms for cancer classification report only classification accuracy and do not consider running time, which is the most expensive aspect. Through this study, we hope to better understand the problem of breast cancer classification, which can help in developing more systematic and productive feature selection and classification algorithms.
| Details | |
|---|---|
| Role | Supervisor |
| Class / Degree | Masters |
| Students | Murfia Rahman Muna (ID-232004) |
| Start Date | 1st July, 2023 |
| End Date | 26th September, 2024 |