best features, where these methods were applied to a WDBC dataset. Seven classification algorithms were applied, and soft and hard voting classifiers were built from these algorithms to achieve the best accuracy. The results of all these classifiers were compared, and the soft voting classifier obtained the highest accuracy of 99% using 21 features selected through correlation analysis and principal component analysis. Haq et al. [12] used the WDBC dataset in their research, and three feature selection methods were applied, namely principal component analysis, relief and autoencoder algorithms. SVM was used as the classification model and applied to the results of all
feature selection methods to compare the results. SVM with principal component analysis, using only 18 features, achieved the highest accuracy of 99%. Huang and Chen [13] used the Variable Importance Measure (VIM) as a method for selecting features from the WDBC and WBC datasets. They developed a new model known as hierarchical clustering random forest (HCRF), which is based on a DT and random forest. Three models were applied, namely AdaBoost, DT and random forest, and their results were compared on both datasets. As a result, the HCRF model obtained the highest accuracy, 97.05% on the WDBC dataset and 97.76% on the WBC dataset.
Jumanto et al. [14] used forward feature selection and random forest for selecting features from the WDBC dataset. A backpropagation ANN was used as the classifier to predict whether a breast cancer tumour is malignant or benign. The results showed that this classifier achieved an accuracy of 98.3%.
Furthermore, we notice that the previous works suffer from limitations in diagnostic accuracy. This could be due to bias introduced while training the AI models as a result of the imbalance of the dataset, or because the feature selection method and the AI model are not well suited to this dataset. Therefore, these works need to improve the accuracy of early diagnosis, which in turn helps preserve the patient's life.
III. METHODOLOGY

In this study, we propose a methodology that uses the voting classifier method, which combines multiple models to produce the best prediction accuracy for the diagnosis of breast cancer. Subsection (C) explains the voting classifier in further detail.
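As a rough illustration of how such a voting classifier can be assembled, the sketch below uses scikit-learn's VotingClassifier with three placeholder base estimators (logistic regression, SVM and random forest); these estimators and their settings are assumptions made only for illustration, not the exact models combined in this study.

# Minimal sketch of a soft voting classifier (illustrative only).
# The base estimators below are assumptions, not the exact models of this study.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

base_models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("svm", SVC(probability=True)),  # probability estimates are required for soft voting
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
]

# voting="soft" averages the predicted class probabilities of the base models;
# voting="hard" would instead take a majority vote over their predicted labels.
voting_clf = VotingClassifier(estimators=base_models, voting="soft")
# voting_clf.fit(X_train, y_train); y_pred = voting_clf.predict(X_test)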
The proposed methodology consists of three main phases: the pre-processing phase, the feature selection phase and the prediction phase (Fig. 1). Before we explain the main phases, we will describe the dataset used in this study.
The Wisconsin Diagnostic Breast Cancer (WDBC) dataset obtained from the UCI ML repository is used in this paper. The University of Wisconsin originally provided a dataset containing two classes: malignant (M) and benign (B). It comprises 569 samples (B = 357 and M = 212) and 32 features. These features display the fundamental properties of the breast cell. Two of these features (id, Unnamed: 32) are not used on the practical side. The diagnosis field uses the remaining 30 features, which contain real values [15]. In the proposed methodology, we will develop a method that combines two filter methods (feature selection based on Pearson correlation and feature selection based on mutual information), which in turn reduces the number of features from 30 to 18 to increase classification accuracy, as will be explained later.
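A minimal sketch of how two such filter methods can be combined is given below; the thresholds, the union of the two selected feature sets and the helper name combined_filter_selection are illustrative assumptions, since the exact combination rule is explained later.

# Sketch of combining two filter methods: Pearson correlation and mutual information.
# Thresholds and the union rule below are illustrative assumptions.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def combined_filter_selection(X: pd.DataFrame, y: pd.Series,
                              corr_threshold: float = 0.5,
                              mi_threshold: float = 0.1) -> list:
    # Filter 1: absolute Pearson correlation of each feature with the (0/1 encoded) target.
    corr_scores = X.corrwith(y).abs()
    corr_selected = set(corr_scores[corr_scores >= corr_threshold].index)

    # Filter 2: mutual information between each feature and the target.
    mi_scores = pd.Series(mutual_info_classif(X, y, random_state=42), index=X.columns)
    mi_selected = set(mi_scores[mi_scores >= mi_threshold].index)

    # Keep features that pass either filter (union of the two sets).
    return sorted(corr_selected | mi_selected)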
Fig. 1. Proposed methodology.

A. Pre-processing Phase

At this phase, we perform a set of initial operations on the dataset to improve the quality of the data and ensure that the classification model works well. The main operations in this phase are cleaning, balancing and label encoding of the dataset.
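The snippet below sketches these operations on the WDBC dataframe; the column names follow the dataset description above, while the simple random oversampling shown for the balancing step is an illustrative assumption rather than necessarily the technique adopted in this study.

# Sketch of the cleaning, label-encoding and balancing steps (illustrative only).
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import resample

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Cleaning: drop the two unused columns noted in the dataset description.
    df = df.drop(columns=["id", "Unnamed: 32"], errors="ignore")

    # Label encoding: map the diagnosis labels to integers (B -> 0, M -> 1).
    df["diagnosis"] = LabelEncoder().fit_transform(df["diagnosis"])

    # Balancing (assumption: simple random oversampling of the minority class).
    majority = df[df["diagnosis"] == 0]
    minority = df[df["diagnosis"] == 1]
    if len(minority) > len(majority):
        majority, minority = minority, majority
    minority_up = resample(minority, replace=True,
                           n_samples=len(majority), random_state=42)
    # Recombine and shuffle the balanced dataframe.
    return pd.concat([majority, minority_up]).sample(frac=1, random_state=42)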
1) Cleaning Dataset: The first process focuses on cleaning the dataset, which involves identifying data errors and then editing, updating or removing data to overcome these errors, so that the data are filtered for the next stage. The cleaning