Predicting Bank Loan Risks Using Machine Learning Algorithms

Bank loans play a crucial role in the development of banks' investment business. Nowadays, many risk-related issues are associated with bank loans. With the advent of computerized systems, banks have become able to record borrowers' data according to their own criteria. The resulting volume of borrowers' data is very large, which makes loan management a challenging task. Many studies have used data mining algorithms to classify loans in terms of repayment, or for cases in which loans are not based on customers' financial history. Algorithms of this kind can help banks make loan-granting decisions for their customers. In this paper, the performance of several machine learning algorithms is compared for the purpose of classifying bank loan risks using standard criteria. The Multilayer Perceptron is then selected because it gives the best accuracy compared with the RandomForest, BayesNet, NaiveBayes, and DT J48 algorithms.


I Introduction:
Granting loans is an essential part of the work of any bank. Most of a bank's profit comes from the interest charged on these loans, and most of its capital is tied up in them. These days many banks approve a loan after a verification and validation process, but there is still no guarantee that the chosen applicant is the appropriate one [13,14].
Machine learning algorithms have been used in many areas, including business, business administration, human resource management, and medicine, and have shown good success in data mining and decision support systems [3].
This paper proposes the use of neural networks, as one class of machine learning algorithm, to classify bank loans in terms of risk. The algorithms are trained on banks' data on previous loans, using the characteristics of the borrower, and the performance of the neural network is compared with that of J48 decision trees, random forests, and statistical methods (NaiveBayes, BayesNet).

II Methods:
In this section, we summarize the machine learning algorithms that were used for the purpose of classifying and forecasting loan risks.
A. DT J48 is Weka's implementation of C4.5, a decision tree algorithm developed by Ross Quinlan [1,8]. The tree is built in the same way as in Iterative Dichotomiser 3 (ID3): the attribute for each node is chosen based on the concept of gain, so that the attribute with the highest classification ability (highest gain) becomes the root of the tree, which then branches into child nodes. Each child node in turn chooses, in the same way, the attribute that ranks highest among the remaining attributes at the next level. This splitting continues until the entire tree is built [5,11]. The attribute selection for each node is based on the following three measurements:

$$\mathrm{Entropy}(t) = -\sum_{i=0}^{c-1} p(i \mid t) \log_2 p(i \mid t) \qquad (1)$$

$$\mathrm{Gini}(t) = 1 - \sum_{i=0}^{c-1} \left[ p(i \mid t) \right]^2 \qquad (2)$$

$$\mathrm{Classification\ error}(t) = 1 - \max_{i} \, p(i \mid t) \qquad (3)$$

where $c$ is the number of classes and $p(i \mid t)$ is the probability that a record at node $t$ belongs to class $i$.
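The attribute selection step can be illustrated with a minimal sketch. It assumes that the three measurements are the usual entropy, Gini index, and classification error of a node, and it uses plain information gain as described in the text (C4.5 itself normalizes the gain by the split information, which is omitted here). The function names and the toy labels are illustrative only and do not come from the paper.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Return p(i|t): the fraction of records at a node that belong to each class."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def entropy(labels):
    """Entropy of a node, in bits."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))


def gini(labels):
    """Gini index of a node."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def classification_error(labels):
    """Classification error of a node."""
    return 1.0 - max(class_probabilities(labels))


def information_gain(parent_labels, child_label_groups):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent_labels)
    weighted = sum(len(group) / n * entropy(group) for group in child_label_groups)
    return entropy(parent_labels) - weighted


# Toy node (hypothetical data): 7 repaid loans and 3 defaults,
# split into two child nodes by some candidate attribute.
parent = ["YES"] * 7 + ["NO"] * 3
children = [["YES"] * 6 + ["NO"] * 1, ["YES"] * 1 + ["NO"] * 2]
gain = information_gain(parent, children)
```

The attribute whose split yields the largest `gain` would be placed at the current node, and the same procedure would be repeated on each child.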
B. Random Forest (RF) is based on creating many classification trees from different subsets of the data, using random subsets of the available variables. The overall result of this approach is to create and refine a set of hypotheses represented by trees, and to combine the trees into a "forest of classifiers" whose final decision depends on the results of the individual decision trees. An additional powerful advantage of this approach is that it relies on decentralized group behavior, without any central or hierarchical learning structure [11]. Each tree is built similarly to J48, and the final result depends on the average output of those trees, as in the following equation:

$$RF_i = \frac{1}{N} \sum_{j=1}^{N} T_j(i) \qquad (4)$$

where:
$RF_i$: the end result for feature $i$ from all trees in the RF model.
$T_j(i)$: the output for $i$ in tree $j$.
$N$: the number of trees in the forest.

C. Bayes' theorem is a result from probability theory used to predict the occurrence of a particular event based on the attributes of that event. It is applied by calculating the probability of each attribute and its impact on the occurrence of that event [10]. Bayes' theorem is mathematically represented by the following formula:

$$P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)} \qquad (5)$$

where:
$P(y \mid x)$: the probability that $y$ will occur given that event $x$ occurs.
$P(x \mid y)$: the probability of observing $x$ given that $y$ occurs.
$P(y)$: the overall probability of a result of that class.
$P(x)$: the probability that event $x$ will occur over all events within a particular attribute.

In this paper, two classifiers based on Bayes' theorem are used:
1. BayesNet is a probabilistic graphical model that uses Bayesian reasoning to calculate probabilities. A Bayesian network relies on conditional dependence and causation, and performs inference over random variables by calculating these probabilities according to the influence of each factor.
2. NaiveBayes belongs to the family of probabilistic classifiers based on Bayes' theorem. The model applies a maximum a posteriori decision rule and treats the probability of each attribute independently, without considering the relationships between those attributes.

D. Multilayer Perceptron is a mathematical model that derives its principle from the way neurons work in the human brain. The network consists of a group of artificial neurons linked by connections. Neural networks work on the principle of parallelism, which enables the network to analyze problems with many variables [8,11]. A multilayer neural network is composed of an input layer, which receives the input values for the network, a number of hidden layers that depends on the network structure, and an output layer. In this research we use a network with one hidden layer. This layer is called hidden because it acts as a black box for the user: its inputs are the outputs of the input layer and its outputs are the inputs of the last layer, so neither is visible to the user. Finally, the output layer consists of one node [11]. The mathematical formula of a neuron is the following equation:

$$y_k = \varphi\left( \sum_{j=1}^{m} w_{kj}\, x_j + b_k \right) \qquad (6)$$

where:
$\{x_1, x_2, \ldots, x_m\}$: the input signals.
$\{w_{k1}, w_{k2}, \ldots, w_{km}\}$: the weights of neuron $k$.
$b_k$: the bias, which can be counted as one of the weights.
$\varphi$: the activation function.
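The neuron formula can be made concrete with a short sketch. It assumes that equation (6) has the usual form, an activation function applied to the weighted sum of the inputs plus a bias, and it assumes a sigmoid activation, which the paper does not specify. The function names, weights, and inputs below are hypothetical.

```python
import math


def sigmoid(v):
    """Logistic activation function (an assumption; the paper does not name one)."""
    return 1.0 / (1.0 + math.exp(-v))


def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Output of one neuron: activation(sum_j w_j * x_j + bias)."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)


def mlp_forward(inputs, hidden_layer, output_neuron):
    """Forward pass through a network with one hidden layer and one output node.

    hidden_layer is a list of (weights, bias) pairs, one per hidden neuron;
    output_neuron is a single (weights, bias) pair.
    """
    hidden = [neuron_output(inputs, w, b) for w, b in hidden_layer]
    out_weights, out_bias = output_neuron
    return neuron_output(hidden, out_weights, out_bias)


# Hypothetical network: 3 inputs, 2 hidden neurons, 1 output node.
x = [0.5, -1.0, 2.0]
hidden_layer = [([0.2, 0.4, -0.1], 0.0), ([-0.3, 0.1, 0.5], 0.1)]
output_neuron = ([1.0, -1.0], 0.0)
score = mlp_forward(x, hidden_layer, output_neuron)
```

With a sigmoid output, `score` lies strictly between 0 and 1 and could be thresholded (for example at 0.5) to produce a YES or NO decision. Training, which adjusts the weights and biases, is not shown.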

III Dataset
The dataset used for classification is the "German Credit data" set from the UCI repository, which contains 1000 instances and 11 attributes, as shown in Table (1).

Table 1: German Credit dataset
After the preprocessing step, the dataset has 24 numerical attributes and 1 binary class attribute, because converting a column of categorical data to numeric form can produce more than one column [4]. The dataset is divided into two subsets: 80% of the data is used for training and the remaining 20% for testing. The chosen dataset is available in two formats (original data and numerical data). The numerical dataset was used to compare the various machine learning algorithms.
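The two preprocessing ideas above, expanding a categorical column into several numeric columns and splitting the data 80/20, can be sketched as follows. This is an illustration only: the paper does not state which encoding produced its 24 numerical attributes, so one-hot encoding is an assumption here, and the column names, helper names, and random seed are hypothetical.

```python
import random


def one_hot_encode(rows, categorical_columns):
    """Replace each categorical column with one 0/1 column per distinct value."""
    categories = {
        col: sorted({row[col] for row in rows}) for col in categorical_columns
    }
    encoded = []
    for row in rows:
        new_row = {}
        for col, value in row.items():
            if col in categories:
                for category in categories[col]:
                    new_row[f"{col}={category}"] = 1 if value == category else 0
            else:
                new_row[col] = value
        encoded.append(new_row)
    return encoded


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and split it into training and test subsets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical records: one numeric and one categorical attribute.
records = [
    {"duration": 12, "purpose": "car"},
    {"duration": 24, "purpose": "education"},
    {"duration": 6, "purpose": "car"},
]
numeric_records = one_hot_encode(records, ["purpose"])
train_rows, test_rows = train_test_split(list(range(1000)))
```

Here the single `purpose` column becomes two numeric columns, which is why the attribute count grows after conversion, and a 1000-instance dataset yields 800 training and 200 test instances.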

IV Implementation
The performance comparisons were applied to 1,000 cases, comprising 700 repaid loans and 300 defaulted loans. Weka version 3.8.4 was used for model building and testing [6]. The proposed algorithms were trained by supervised learning on a dataset of 800 loan instances with a binary target (YES, NO): YES in the case of loan repayment and NO in the case of default. After training, a trained model was obtained for each algorithm. The algorithms were then tested on a test set of 200 instances, and the results of each algorithm were obtained and analyzed, as shown in Figure (2).

Figure (2): Proposed framework
The results were analyzed by comparing the performance of each algorithm using several measurements, as shown in Table 2, where neural networks achieved the highest accuracy compared with the other algorithms. In machine learning, standard measurements are used to describe the performance of each algorithm with respect to the target values. In this paper there are two target values: YES in the case of repayment and NO in the case of default. The performance of each algorithm is shown in Table 3. One important metric for measuring the performance of binary classification algorithms is the Receiver Operating Characteristic (ROC) curve, which shows the separability achieved by each algorithm in terms of the true positive rate and the false positive rate. Multilayer neural networks have the potential for higher separation than the other algorithms on this type of dataset, as shown in Figures (3-7).
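The measurements discussed above can be computed from the predictions on the test set. The following minimal sketch derives accuracy, true positive rate, and false positive rate from a confusion matrix, and builds ROC points and the area under the curve by sweeping a decision threshold over the classifier scores. It is not the Weka evaluation used in this paper; the function names and the toy labels and scores are hypothetical.

```python
def confusion_counts(actual, predicted, positive="YES"):
    """Count true positives, false positives, true negatives, false negatives."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def rates(actual, predicted, positive="YES"):
    """Accuracy, true positive rate, and false positive rate."""
    tp, fp, tn, fn = confusion_counts(actual, predicted, positive)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "tpr": tp / (tp + fn) if tp + fn else 0.0,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
    }


def roc_points(actual, scores, positive="YES", negative="NO"):
    """(FPR, TPR) pairs obtained by lowering the threshold over the distinct scores."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        predicted = [positive if s >= threshold else negative for s in scores]
        r = rates(actual, predicted, positive)
        points.append((r["fpr"], r["tpr"]))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points ordered by x."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy test set (hypothetical): true labels and predicted scores for class YES.
actual = ["YES", "YES", "NO", "YES", "NO"]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
auc = area_under_curve(roc_points(actual, scores))
```

An area close to 1 indicates that the classifier ranks repaid loans above defaulted ones almost everywhere, while an area of 0.5 corresponds to no separation.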

Conclusion:
Machine learning algorithms play a significant role in predicting the risks of bank loans and in decision support systems. The choice of the algorithm used to make the decision (whether the borrower will default) is the key to managing decisions when issuing a loan. In this paper, the performance of machine learning algorithms was tested and compared using standard measurements on a dataset of 1000 loans and their repayment status. The results showed that the proposed algorithms can be used for this purpose with acceptable accuracy rates, and that neural networks performed best.