Vol. 19 No. 2 (2023)

Published: December 31, 2023

Pages: 43-53

Original Article

Using Pearson Correlation and Mutual Information (PC-MI) to Select Features for Accurate Breast Cancer Diagnosis Based on a Soft Voting Classifier

Abstract

Breast cancer is one of the most critical diseases affecting people around the world and is among the most commonly diagnosed cancers. It is a leading cause of cancer death worldwide, and early detection is difficult. In healthcare, where early diagnosis based on machine learning (ML) can help save patients’ lives, better-performing diagnostic procedures are crucial, and ML models have been used to improve the effectiveness of early diagnosis. In this paper, we propose a new feature selection method that combines two filter methods, Pearson correlation and mutual information (PC-MI), to analyse the correlation amongst features and then select the important features before passing them to a classification model. Our method targets early breast cancer prediction and relies on a soft voting classifier that combines three ML models (decision tree, logistic regression and support vector machine) into a single model that carries the strengths of its constituents and improves prediction accuracy. We evaluate our work on the Wisconsin Diagnostic Breast Cancer dataset. The proposed methodology outperforms previous work, achieving 99.3% accuracy, an F1 score of 0.9922, a recall of 0.9846, a precision of 1 and an AUC of 0.9923. Furthermore, the accuracy under 10-fold cross-validation is 98.2%.
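
The two ideas named in the abstract, a Pearson-plus-mutual-information filter and soft voting, can be illustrated with a minimal sketch. The abstract does not specify how the two filters are combined, so the rule used here is an assumption: features are ranked by mutual information with the class label, and a feature is skipped when its absolute Pearson correlation with an already kept feature reaches a threshold. The function names, the equal-width binning, the threshold, the toy feature values and the per-model probabilities are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def mutual_information(x, y, bins=4):
    """MI in nats between a continuous feature (equal-width bins) and labels."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature
    xb = [min(int((v - lo) / width), bins - 1) for v in x]
    n = len(x)
    pxy, px, py = Counter(zip(xb, y)), Counter(xb), Counter(y)
    return sum(c / n * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())


def pc_mi_select(columns, labels, corr_threshold=0.9, top_k=2):
    """Assumed combination rule: rank by MI, skip Pearson-redundant features."""
    ranked = sorted(columns,
                    key=lambda name: mutual_information(columns[name], labels),
                    reverse=True)
    kept = []
    for name in ranked:
        redundant = any(abs(pearson(columns[name], columns[k])) >= corr_threshold
                        for k in kept)
        if not redundant:
            kept.append(name)
        if len(kept) == top_k:
            break
    return kept


def soft_vote(prob_lists):
    """Average per-class probabilities across models and return the argmax."""
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / len(prob_lists)
           for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__), avg


# Toy data (hypothetical values): 0 = benign, 1 = malignant.
columns = {
    "radius":    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "perimeter": [2.1, 4.0, 6.2, 8.1, 9.9, 12.0, 14.1, 16.0],  # near copy of radius
    "texture":   [1.0, 3.0, 2.0, 4.0, 7.0, 5.0, 8.0, 6.0],
    "noise":     [5.0, 1.0, 4.0, 2.0, 5.0, 1.0, 4.0, 2.0],
}
labels = [0, 0, 0, 0, 1, 1, 1, 1]
selected = pc_mi_select(columns, labels)

# Hypothetical class probabilities for one sample from the three model types
# named in the abstract (decision tree, logistic regression, SVM).
probs = [[0.55, 0.45], [0.60, 0.40], [0.05, 0.95]]
label, avg = soft_vote(probs)
print(selected, label, avg)
```

On this toy data the filter keeps `radius` and `texture`, dropping `perimeter` as redundant and `noise` as uninformative. The voting example is chosen so that a simple majority vote would pick class 0 (two of three models), while averaging the probabilities picks class 1, which is the behaviour that distinguishes soft voting from hard voting.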
