Hybridization of Swarms for Feature Selection to Model Heart Attack Data

Abstract


INTRODUCTION
In modern society, heart disease is one of the world's deadliest diseases and is projected to become the world's largest disease burden [1]. Heart disease includes coronary heart disease (heart attack), congestive heart failure, stroke, peripheral artery disease, carotid artery disease, and aortic disease [2]. With the rapid development of computer technology and artificial intelligence (AI), machine learning has opened up new approaches to risk assessment in disease prediction. AI systems need the ability to acquire knowledge by themselves, that is, to extract patterns from raw data; this ability is called machine learning (ML) [3] [4].
The introduction of machine learning allows computers to solve many real-world problems and make seemingly subjective decisions. A large amount of data from heart patients has accumulated in electronic health records (EHRs). Still, the busy clinical environment makes integrating and effectively using these data extremely challenging, so the data alone do not adequately support clinical decision-making [5] [6]. In addition, many studies based on biomedical data start from conventional assumptions, that is, they explore the impact of preselected variables on cardiovascular phenotypes [7]. In contrast, AI-based methods can be used without such hypotheses: multiple variables drive data mining and problem discovery to identify the similarities and differences of phenotypes between patients [8]. Therefore, standardizing clinical diagnosis, improving existing treatment methods, finding new drug targets, and achieving data-driven high-quality care at a higher rate are important measures for promoting innovation in the medical field [9].

Al-Rafidain Journal of Computer Sciences and Mathematics (RJCM)
www.csmj.mosuljournals.com
Research studies attempt to model medical record data based on the above motivation. To model those records, data are analyzed and processed, and modeling algorithms and statistical methods are used to reduce the error between the expected and actual results. The potential of the accumulated data and risk variables is determined, the complex and non-linear effects among these variables are explored, and a heart attack prediction model is created from the data sets [10] [11]. Heart attack risk assessment relies on various cardiovascular risk factors to predict an individual's likelihood of having an acute heart attack [12]. In this way, corresponding interventions can be taken to reduce the influence of risk factors, prevent and reduce the occurrence of such clinical events in a timely manner, and improve the health of the whole community, with the goal of finding an appropriate model for predicting heart attacks [13].
In this paper, medical data sets and previous literature surveys were analyzed. We propose a hybridization of the results of the PSO, BAT, and CS algorithms, adopting the precision of the minority class as the fitness function for each swarm, to select the important features. To model the data set with the selected features, CatBoost is proposed as the classifier for heart attack prediction. The proposed feature-selection method has been compared with each of the three swarms individually, and the CatBoost algorithm has been compared with traditional classification algorithms (naive Bayes, decision trees).
The paper is organized as follows: first, the previous literature is surveyed; then the theoretical framework of the selected algorithms and the proposed methodology are presented, including a description of the data set and feature selection using swarm algorithms in addition to the proposed method (HSFS); then the selected machine learning algorithms are applied. Finally, the models are tested and compared across the feature-selection methods using the accuracy and precision measures.

Related works
Kim et al. [14] used the KNHANES-VI dataset and proposed a neural network feature correlation analysis (NN_FCA) method. NN_FCA includes two stages: the first is feature selection, and the second is feature correlation. A neural network algorithm is then used for classification. This method improved the neural network's performance for predicting heart disease; accuracy reached 85.70%.
Kasbe et al. [15] suggested a fuzzy expert system for predicting heart disease. It consisted of three main steps: fuzzification, a rule base, and defuzzification; for defuzzification, the centroid technique was used. The system takes 13 input parameters and one output parameter, using the heart disease dataset from the UCI repository. The system is very easy to use, and patients can operate it themselves. This method achieved an accuracy of 93.33%.
Malav et al. [16] hybridized K-means and artificial neural networks to build a heart disease prediction model, applied to the heart disease UCI dataset, and reached an accuracy of 97%. The results showed that the hybrid system was superior to traditional machine learning algorithms.
Kamboj et al. [17] compared the performance of machine learning algorithms (SVM, KNN, naive Bayes, random forest, logistic regression) for predicting heart disease on a heart disease UCI dataset. The study concluded that KNN was the best classifier, with an accuracy of 87%, compared to the rest of the specified algorithms.
Riaz et al. [18] suggested building a predictive system for detecting heart disease at an early stage using artificial neural networks. They used PCA for feature extraction; PCA improved the accuracy to 97.7%, compared to 94.7% without it.
Shah et al. [19] used the Cleveland dataset from the UCI repository, which comprises 303 records and 76 features. Pre-processing was applied to this dataset, such as handling missing values and removing noise, and only the 14 most important features were used. Supervised machine learning algorithms such as KNN, decision trees, random forest, and naive Bayes were then applied. The KNN algorithm achieved the highest accuracy, equal to 90.78%.
Siva et al. [20] used the heart disease dataset from the UCI repository, performed data pre-processing and feature selection, and applied the selected features to a hybrid random forest with a linear model for predicting heart disease. This method achieved an accuracy of 92%.
Walaa Adel Mahmoud et al. [21] used the Framingham dataset, which is unbalanced. Mean imputation was used to handle missing data and outlier values. The authors applied different classifier algorithms (k-nearest neighbours, support vector machine, decision tree, linear regression, random forest). The accuracy reached 83.95, 84.5, 84.89, and 85.05% for these algorithms, respectively.

Theoretical Framework
Feature selection based on binary swarms
Feature selection aims to find a subset of the original feature set that carries the most effective classification information, selecting as small a feature subset as possible, according to a given algorithm, while achieving the best possible classification [22]. Three binary swarms were used in this study to select features.

Binary particle swarm optimization (PSO)
The main steps of the applied feature selection algorithm are shown in Figure (1). The process begins by generating an array of particles with random positions in the search space. Then pbest and gbest are updated at the end of each iteration: if the solution found by a particle is better than its previous pbest, this solution becomes the new pbest for that particle, and the best pbest among all particles becomes gbest [23].
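The update loop described above can be sketched as follows: a minimal binary PSO with a sigmoid transfer function (a common binarization choice; the paper does not specify its exact variant), run on a toy fitness function. All names and parameter values here are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-np.clip(v, -60, 60)))

def binary_pso(fitness, dim, n_particles=10, iters=20, w=0.7, c1=1.5, c2=1.5):
    """Minimal binary PSO: each bit of a particle marks a selected feature."""
    X = rng.integers(0, 2, size=(n_particles, dim))        # binary positions
    V = rng.uniform(-1, 1, size=(n_particles, dim))        # real-valued velocities
    pbest = X.copy()
    pbest_fit = np.array([fitness(x) for x in X])
    g = pbest[np.argmax(pbest_fit)].copy()                 # global best (gbest)
    for _ in range(iters):
        r1, r2 = rng.random((n_particles, dim)), rng.random((n_particles, dim))
        V = w * V + c1 * r1 * (pbest - X) + c2 * r2 * (g - X)
        # sigmoid transfer maps velocities to bit-flip probabilities
        X = (rng.random((n_particles, dim)) < sigmoid(V)).astype(int)
        fit = np.array([fitness(x) for x in X])
        improved = fit > pbest_fit
        pbest[improved], pbest_fit[improved] = X[improved], fit[improved]
        g = pbest[np.argmax(pbest_fit)].copy()
    return g

# toy fitness: reward selecting the first three "informative" bits, penalize subset size
mask = binary_pso(lambda x: x[:3].sum() - 0.1 * x.sum(), dim=8)
print(mask)
```

In a real run, the fitness would train a classifier on the masked features and return the chosen performance measure instead of this toy score.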

Binary bat
The algorithm is based on the rules bats use for echolocation in the sensing space; bats can differentiate between danger and food [26]. Binary bats fly randomly and quickly, each at a position and a fixed frequency, with varying wavelengths and loudness, to search for prey. They can automatically adjust the wavelength (or frequency) of their emitted pulses and adjust the pulse emission rate depending on the proximity of the target, and loudness can vary in many ways [24]. Figure (2) shows the stages of finding features using the bat algorithm [25].

Binary cuckoo search (BCS)
Another swarm used to select features is BCS. Each host nest is represented algorithmically as an agent carrying a single egg (one-dimensional problem) or several eggs (multidimensional problem). CS begins by randomly placing the nest population in the search space. In each iteration of the algorithm, the nests are updated using a random walk via Lévy flights [26]. Figure (3) shows the stages of finding features using the cuckoo search algorithm.
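The Lévy-flight random walk can be illustrated as follows: a sketch that draws one step with Mantegna's algorithm and binarizes it through a sigmoid transfer. Both are standard choices we assume for illustration; the paper does not state the exact variant it used.

```python
import numpy as np
from math import gamma

rng = np.random.default_rng(7)

def levy_step(dim, beta=1.5):
    """One Levy-flight step (Mantegna's algorithm) for a nest of `dim` components."""
    sigma = (gamma(1 + beta) * np.sin(np.pi * beta / 2) /
             (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0, sigma, dim)   # heavy-tailed numerator
    v = rng.normal(0, 1, dim)
    return u / np.abs(v) ** (1 / beta)

def binarize(nest_real):
    """Sigmoid transfer: map a real-valued nest to a binary feature mask."""
    return (rng.random(nest_real.shape) < 1 / (1 + np.exp(-nest_real))).astype(int)

step = levy_step(10)
mask = binarize(step)   # 1 = feature kept in this candidate nest
print(mask)
```

In the full BCS loop, a fraction pa of the worst nests is abandoned and regenerated each iteration, and nests are ranked by the fitness function.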

Modeling Heart Attack
Naive Bayes
Models built using the naive Bayes algorithm are considered the simplest models: they have no parameters to fit beyond the class and event probabilities, because the probability of the dependent variable is calculated from the probabilities of the observed events. To build a naive Bayes model for classifying the heart attack data, GaussianNB was used [27]. The naive Bayes rule is:

$P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}$

where y is the class and $x_1, \dots, x_n$ are the feature values, assumed conditionally independent given the class.
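As an illustration of the GaussianNB classifier mentioned above, here is a minimal sketch on synthetic data standing in for the patient records (the data, feature count, and target rule are invented for the example):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# synthetic stand-in: 200 rows, 5 numeric features
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # toy binary target (CHD / non-CHD)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = GaussianNB().fit(X_tr, y_tr)         # fits per-class Gaussian likelihoods
print(model.score(X_te, y_te))               # held-out accuracy
```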

Decision trees
A decision tree is a supervised machine learning method used in "Classification" and "Regression" problems. Following the principle of dividing into nodes from top to bottom, the data set is split into smaller and smaller subgroups until the target (class) nodes are reached: the tree starts from a root node containing all records and then splits according to the "Class Label" column (the column on which classification is based). Decision trees have many variants suited to different data sets; in this study, the C4.5 decision tree was used [28] [29]. Feature selection at each node depends on the following measurements.
Information entropy: an index measuring the degree of disorder of the elements:

$Entropy(S) = -\sum_{i=1}^{c} p(i) \log_2 p(i)$

where c is the number of classes and p(i) is the probability of a record belonging to class i.
Information gain: measures the change in information entropy when splitting on an attribute:

$Gain(S, X) = Entropy(S) - Entropy(S, X)$

where Gain(S, X) is the information gain of feature X, Entropy(S) is the information entropy of the entire dataset, and Entropy(S, X) is the information entropy of the dataset conditioned on feature X.
Classification error: $Error(t) = 1 - \max_i p(i \mid t)$, where p(i | t) indicates the probability of a record at node t belonging to class i.
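The entropy and information-gain measures above can be computed directly; a small sketch on a toy label vector (the data are invented for the example):

```python
import numpy as np

def entropy(y):
    """Information entropy of a label vector: -sum p(i) * log2 p(i)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(y, x):
    """Entropy(S) minus the weighted entropy after splitting on feature x."""
    total = entropy(y)
    cond = 0.0
    for v in np.unique(x):
        subset = y[x == v]
        cond += len(subset) / len(y) * entropy(subset)
    return total - cond

y = np.array([1, 1, 0, 0])           # toy class labels
x = np.array(["a", "a", "b", "b"])   # a feature that separates the classes perfectly
print(entropy(y), information_gain(y, x))   # -> 1.0 1.0
```

A perfect split removes all uncertainty, so the gain equals the full entropy of the labels.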

CatBoost
CatBoost is a gradient-boosting algorithm introduced by Prokhorenkova et al. (2018), and its performance has proven highly competitive with other boosting algorithms [30]. CatBoost splits a given data set into random permutations (by default it creates four), and this randomness helps keep the model from overfitting [31]. For a dataset of pairs $(x_k, y_k)$, where $x_k = (x_k^1, \dots, x_k^m)$ is the random vector of m features and $y_k \in \mathbb{R}$ denotes the corresponding label, categorical values are replaced by an ordered target statistic:

$\hat{x}_k^i = \frac{\sum_{j=1}^{k-1} \mathbb{1}\{x_j^i = x_k^i\}\, y_j + a\,P}{\sum_{j=1}^{k-1} \mathbb{1}\{x_j^i = x_k^i\} + a}$

where a > 0 is the corresponding weight and P denotes the prior value.
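CatBoost's ordered (greedy) target statistic with a prior can be illustrated with a simplified, single-permutation sketch; the function and variable names are ours, not the library's internals:

```python
import numpy as np

def ordered_target_statistic(cat_values, labels, prior, a=1.0):
    """Encode a categorical column so that each row's value uses only the
    rows that precede it in the permutation (prevents target leakage)."""
    encoded = np.empty(len(cat_values), dtype=float)
    for k in range(len(cat_values)):
        same = [j for j in range(k) if cat_values[j] == cat_values[k]]  # history only
        num = sum(labels[j] for j in same) + a * prior
        den = len(same) + a
        encoded[k] = num / den
    return encoded

cats = ["m", "f", "m", "m", "f"]   # toy categorical feature
ys   = [1, 0, 1, 0, 1]             # toy binary labels
print(ordered_target_statistic(cats, ys, prior=0.5))
```

Note how the first occurrence of each category falls back to the prior, and later occurrences blend the prior with the running target mean of that category.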

Suggested methodology
To find an appropriate model for predicting heart attacks, a data set containing patient records was used as input, with whether or not a heart attack occurred as the final output. The first stage is data initialization, which includes cleaning the data of anomalies, missing values, and categorical data, and dividing the data set into a training group and a test group with a ratio of 80:20, respectively. The stage of preparing the data for training includes selecting the features: the concept of swarms (so-called binary swarms) was used, each method was applied separately, and the effect of each method's output was measured.
Three swarms were applied (PSO, BAT, BCS), and their results were compared. A new hybrid swarm method, called HSFS, was proposed and compared with each swarm separately. Feature selection is applied to the training data. After the swarms are applied, the selected features are extracted from the training and test data for each method, giving a separate data set per method in addition to the original data set. After creating the full and reduced datasets, the selected algorithms (naive Bayes, decision trees, and CatBoost) are trained to build models. The performance of each model is then measured with respect to the feature-selection method, along with the performance of the system as a whole; in addition, the precision of each class is computed and compared. After the models are built and tested using the performance measures relevant to the study objective (accuracy, precision), the comparison is made, and the importance and applicability of the proposed methods are determined. Figure (4) represents the framework proposed in this paper.
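The pipeline described above (80:20 split, mask-based feature selection, model training, accuracy and precision evaluation) can be sketched end to end. The data here are a synthetic stand-in for the 50-feature heart-attack records, and the binary mask stands in for a swarm's output; all values are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score

rng = np.random.default_rng(1)
# synthetic stand-in for the 50-feature dataset
X = rng.normal(size=(500, 50))
y = (X[:, 0] - X[:, 3] > 1.2).astype(int)      # imbalanced toy target

# 80:20 split, as in the paper
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

mask = np.zeros(50, dtype=bool)                # stand-in for a swarm's binary vector
mask[[0, 3, 7]] = True
model = DecisionTreeClassifier(random_state=0).fit(X_tr[:, mask], y_tr)
pred = model.predict(X_te[:, mask])
print(accuracy_score(y_te, pred), precision_score(y_te, pred, zero_division=0))
```

In the actual study, the mask would come from PSO, BAT, BCS, or HSFS, and the classifier would be one of naive Bayes, decision trees, or CatBoost.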

Heart Attack Dataset
In this paper, NHANES data on adults over 18 years were used. The dataset contains 37,080 records of diverse individuals: 1,300 people with coronary heart disease (heart attack) and 35,780 without. Each person has 50 features and one target value (Non-CHD or CHD). The data set is unbalanced, as people with coronary heart disease represent 3.5% of the total data [32].

Feature selection
To reduce the number of features entered into the prediction models, the binary swarm algorithms (PSO, BAT, BCS) were used in this study to find the features relevant to the accuracy of the prediction model. The total number of features to be reduced is 50. A method was proposed to hybridize the three swarms, taking into account the features selected by all of them. Figure (5) shows the general structure for selecting features using swarms. Depending on the model's performance, the fitness value is returned to the swarm to update the swarm parameters; this process is repeated until the required optimization is reached according to the stopping criteria. After applying the swarms, the PSO algorithm selected 8 features, the BAT algorithm selected 30 features, and CS selected 29 features. Table (2) shows the parameters for each swarm, where n is the population size, m_i the maximum number of iterations, minf the minimization flag, and dim the number of features; for BAT, qmin and qmax are the minimum and maximum step frequencies, loud_A the loudness value, and r the pulse rate; for BCS, alpha and beta are the Lévy-flight arguments and pa the probability of destroying an inferior nest.
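The fitness used by the swarms, the precision of the minority class returned after evaluating a candidate feature subset, might be sketched like this; the wrapper structure and names are our assumptions for illustration:

```python
import numpy as np

def minority_precision(y_true, y_pred, minority=1):
    """Precision for the minority class: TP / (TP + FP)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == minority) & (y_true == minority))
    fp = np.sum((y_pred == minority) & (y_true != minority))
    return tp / (tp + fp) if tp + fp else 0.0

def fitness(mask, evaluate):
    """Swarm fitness: train/evaluate on the features flagged by the binary mask."""
    if mask.sum() == 0:                  # an empty subset is worthless
        return 0.0
    return evaluate(np.flatnonzero(mask))

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
print(minority_precision(y_true, y_pred))   # 2 TP, 1 FP -> 0.666...
```

Here `evaluate` would train the classifier on the masked columns and return `minority_precision` on held-out data.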

Hybrid swarms feature selecting (HSFS)
To find the best features from the dataset, it is suggested to combine the outputs of the three swarms. Each binary swarm finds important features according to its own method, and each swarm finds a different set of features. In this paper, a hybrid of the swarms is proposed, based on merging the outputs of the specified swarm algorithms to find the optimal final feature subset: the binary vector of each swarm is taken, and the vectors of all swarms are combined so that a feature is included in the final set if it was selected by any of the three swarms, whether it appeared in one, two, or all of them.
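The merging rule above amounts to an element-wise OR over the swarms' binary vectors; a minimal sketch with hypothetical masks (the vectors below are invented for the example):

```python
import numpy as np

def hsfs(masks):
    """Hybrid swarms feature selection: a feature is kept if ANY swarm
    selected it (element-wise OR over the binary vectors)."""
    return (np.sum(masks, axis=0) > 0).astype(int)

pso = np.array([1, 0, 0, 1, 0])   # hypothetical binary vector from PSO
bat = np.array([0, 1, 0, 1, 0])   # ... from BAT
bcs = np.array([0, 0, 1, 1, 0])   # ... from BCS
print(hsfs([pso, bat, bcs]))      # -> [1 1 1 1 0]
```

The union can only grow the feature set, which is why HSFS keeps every feature that at least one swarm judged important.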

Training machine learning algorithms and model building
After initializing the data, analyzing it, and selecting features using single and hybrid swarms, the selected machine learning models are trained to determine the efficiency of each feature-selection method and to test the efficiency of the system as a whole. In this paper, three machine learning algorithms (naive Bayes, decision trees, CatBoost) were trained and tested using the training data set, representing 80 percent of the total data.

Comparing the performance of machine learning algorithms using swarms feature selection
To determine the best way to select features, compare it with the proposed hybrid swarm method, and assess the performance of each machine learning algorithm, the following compares the results of those methods and identifies the best model.

Compare models using accuracy
The accuracy measure is a basic pillar of the performance measures for machine learning algorithms [33]. The formula for accuracy is:

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$

where TP, TN, FP, and FN are the true positives, true negatives, false positives, and false negatives, respectively.

Compare models using precision
Precision is the measure adopted in this paper, as it reflects the true balance and shows the actual classification of both categories [33]. The formula for precision is:

$Precision = \frac{TP}{TP + FP}$

Table (5) shows the performance of the models for the minority category, which is the target value. The table shows that the CatBoost algorithm performed best when using the hybrid-swarm feature-selection method, with a precision of 0.56, indicating an improvement in the model's performance with respect to the target value (a person with a heart attack). This method was the best according to the precision measure, followed by the BCS algorithm with a precision of 0.45, which indicates that CatBoost is the best algorithm and that the hybrid swarms increased the model's precision for the target group.

Conclusions
Heart attack prediction is one of the important topics in the health field, and building a predictive model for classifying heart attacks faces many challenges. The most important conclusions of this paper can be summarized as follows:
1- Feature selection is important for reducing the dimensionality of the data set, which improves the performance of machine learning models, and it gives health workers insight into the extent to which each feature is related to the probability of having a heart attack. Swarm algorithms were used to select these features, the performance of machine learning models was compared against the original data set, and a new feature-selection method (hybrid swarms) was proposed. The proposed method showed an improvement in the models' performance in terms of prediction accuracy and data balance.
2- Two types of machine learning algorithms were used: single learners (DT and NB) and gradient-boosting ensemble learning (CatBoost). They were compared using different measures to determine both the overall prediction accuracy and the per-class accuracy, because the data set is unbalanced. The results showed that gradient-boosting ensemble learning yields better results and achieves a better balance for this type of data.