Build a Multi-Layer Perceptron (Backpropagation) for a Bank Marketing Dataset

Amesh Jayaweera
13 min read · May 28, 2021

Predicting whether a client will subscribe to a term deposit (Yes/No).

Photo by Alina Grubnyak on Unsplash

This article is a quick guide to developing an MLP (Backpropagation) model. I will be using the Bank Marketing Dataset from the UCI Machine Learning Repository to build the model. The prerequisites are as follows.

Prerequisites

In this article, I will not focus on explaining the theoretical side of neural networks or their mathematical and statistical background; I assume you are already familiar with them. Instead, I will focus on a step-by-step guide to developing a well-suited MLP model. If you don't have much theoretical or mathematical background, don't worry: I hope to write a follow-up article explaining the theory and mathematics behind neural network fundamentals.

The outline of this article is as follows:

Outline

  1. Domain Knowledge on Bank Marketing
  2. Overview of Dataset
  3. Pre-processing
  4. Perform Feature Engineering
  5. Multi-Layer Perceptron (Backpropagation)
  6. Model Evaluation

Domain Knowledge on Bank Marketing

In bank marketing, term deposits are a way of delivering benefits to both the financial institution (bank) and its customers. A term deposit is a kind of fixed investment in which a customer deposits an amount of cash into a financial institution for an agreed period of time. In other words, the customer invests cash in the institution, and the institution returns interest according to the amount invested and the length of the agreed period.

Here, I use a dataset related to the term deposit campaigns of a banking institution, and use it to predict whether a customer will subscribe to a term deposit or not.

Overview of Dataset

Figure 1

Figure 1 shows an overview of all the columns in our dataset. There are 21 columns, but I will not be considering the duration column, so as a first step we can drop it. That leaves 20 columns: one target variable (y) and 19 features (X), of which 9 are continuous and 10 are categorical. The target variable has only two values (yes/no), so our goal is to create an MLP classifier that predicts whether the client subscribed to a term deposit.
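As a minimal sketch of this first step (the full loading code is in the notebook), something like the following works; the file name bank-additional-full.csv and the ';' separator are assumptions based on the standard UCI distribution of this dataset.

import pandas as pd

# Assumption: the standard UCI file name and ';' separator
df = pd.read_csv('bank-additional-full.csv', sep=';')

# Drop the duration column, as discussed above
df = df.drop(columns=['duration'])

print(df.shape)  # expect 20 columns: 19 features + the target y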

Pre-Processing

Now that we have some idea about the features in our dataset, we can perform pre-processing. Pre-processing is a crucial step when creating learning models, because it directly affects model accuracy and the quality of the output. It is a time-consuming task, but we need to do it for better performance. I will follow these four steps in pre-processing:

  1. Handling Missing Values
  2. Handling Outliers
  3. Feature Coding
  4. Feature Scaling

Handling Missing Values

Figure 2

I recommend reading more about handling missing values in this article: Handling Missing Values.

In our case, there are no null values, so we don't need to spend time handling missing values.
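A quick sketch of how to confirm this, assuming the data frame from the previous step is named df. The second check is optional: the UCI documentation notes that some categorical columns use the label 'unknown' rather than a true null.

# Count missing values per column; all zeros means nothing to impute
print(df.isnull().sum())

# Optional: count 'unknown' labels in the categorical columns
print((df == 'unknown').sum())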

Handling Outliers

I recommend reading more about handling outlier values in this article: Handling Outlier Values.

The next step is handling outliers. We only do outlier handling for continuous variables, because continuous variables have a much wider range of values than categorical variables. So, let's take the continuous variables and check for outliers, as shown in Figure 3 below.

Figure 3

Let's count the number of outliers per column to get an idea of how many outliers each column has. As Figure 4 shows, most of the columns in our dataset contain many outlier points when using the 0.05 and 0.95 percentiles as the minimum and maximum thresholds. So, we cannot simply drop all the outliers; we have to check each feature for contextual, collective, or global anomalies. I did this in the Python notebook attached at the end of this article, so I will not show all of the outlier handling here.

Figure 4
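A sketch of how the per-column counts in Figure 4 could be computed, assuming df is the data frame from above and using the 0.05/0.95 percentile thresholds just mentioned:

# Continuous columns in this dataset (duration already dropped)
cont_cols = ['age', 'campaign', 'pdays', 'previous', 'emp.var.rate',
             'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']

# Count values falling outside the 0.05 and 0.95 percentiles per column
for col in cont_cols:
    low, high = df[col].quantile(0.05), df[col].quantile(0.95)
    n_outliers = ((df[col] < low) | (df[col] > high)).sum()
    print(f'{col}: {n_outliers} outliers')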

Figure 5 below shows the boxplot for the age feature. Age lies between 17 and 100, so there is no contextual outlier. There is one data point that sits slightly apart from the other outliers, but the gap is small, so let's keep it and continue without removing it.

Figure 5

Figure 6 shows a boxplot for the campaign column. We can clearly see one anomalous data point that lies far away from the rest, so we can remove it as a global outlier.

Figure 6

Figure 7 shows the code used to handle the campaign column outlier.

Figure 7

Figure 8 shows the campaign column before and after applying outlier handling.

Figure 8

By comparing the Before and After shapes, you can see that the operation removed only one row, which is good: we always want to minimize data loss during pre-processing. Otherwise, we might fail to capture some patterns or trends in the data, which can lead to underfitting the model.
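The exact filter is in the notebook (Figure 7); a minimal illustrative sketch that removes the single most extreme campaign value as a global outlier could look like this. Dropping only the maximum value is an assumption for illustration, not necessarily the author's exact rule.

print('Before shape:', df.shape)

# Remove the single most extreme campaign value as a global outlier
# (illustrative approach; the notebook applies its own filter)
df = df[df['campaign'] < df['campaign'].max()].reset_index(drop=True)

print('After shape:', df.shape)  # expect exactly one row fewer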

Here, I only showed the outlier handling and reasoning for two features; the outlier handling for the remaining features can be found in the Python notebook.

Feature Coding

Feature coding refers to converting text or labels into numerical data, because our neural network learning algorithms work only with numerical data.

I will recommend reading more about feature coding using this article Feature Coding.

Now, it's time to do feature coding. Before feature coding, we need to select all the categorical features, as shown in Figure 9.

Figure 9

Let’s see the number of unique values in each categorical column.

Figure 10

We can use label encoding for the contact and y variables because they have only two possible values. One-hot encoding can be used for the rest of the categorical features, because they have more than two unique values. If we applied label encoding in that case, some categorical values would receive higher numerical weights, the model would learn unnecessary weight differences, and the algorithm might assume there is a rank or precedence among the categorical values. You can see the code for both label encoding and one-hot encoding in the Python notebook. Figure 11 shows the data frame with some of the columns after applying these feature coding techniques.

Figure 11
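A sketch of the two encoding steps described above (label encoding for the binary contact and y columns, one-hot encoding for the remaining categoricals). The frame names df_cat and df_label mirror the notebook's naming only as an assumption.

from sklearn.preprocessing import LabelEncoder
import pandas as pd

# All categorical (object-typed) columns, including the target y
cat_cols = df.select_dtypes(include='object').columns
df_cat = df[cat_cols].copy()

# Label-encode the two binary columns
for col in ['contact', 'y']:
    df_cat[col] = LabelEncoder().fit_transform(df_cat[col])

# One-hot encode the remaining categorical columns
multi_cols = [c for c in cat_cols if c not in ('contact', 'y')]
df_label = pd.get_dummies(df_cat, columns=multi_cols)

print(df_label.shape)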

Feature Scaling

Feature scaling refers to the methods used to normalize features with large value ranges. This is a necessary step because it directly influences the weights the model learns, and learning is also faster when features are on similar scales. There are many feature scaling techniques, but here we will apply standardization, defined by the following equation:

X_hat = (X - mean(X)) / std(X)

Before feature scaling, we need to join the df_cont and df_label frames together, as follows:

Figure 12

Now we have joined both the categorical and continuous features. Before feature scaling, we need to split our dataset; otherwise, data leakage will happen. Data leakage simply means our model sees the testing (or real-world) data before the training phase. If that happens, the model will perform well during the training and testing phases but may lose performance when working on real-world data. So, from here onwards I will be using the training and testing data separately. Figure 13 shows how to split the dataset. Note one important technical detail after splitting: we need to reset the indexes of X_train, X_test, y_train, and y_test. Otherwise, we can expect misbehaviour later on.

Figure 13
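A sketch of the join, split, and index reset described above; the 80/20 ratio, random_state, and stratification are assumptions, and the actual values are in the notebook.

from sklearn.model_selection import train_test_split
import pandas as pd

# Join the continuous frame (df_cont) with the encoded categoricals (df_label)
df_cont = df.select_dtypes(exclude='object').reset_index(drop=True)
df_all = pd.concat([df_cont, df_label], axis=1)

X = df_all.drop(columns=['y'])
y = df_all['y']

# Split BEFORE scaling to avoid data leakage; ratio and seed are illustrative
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Reset indexes so later row-wise operations stay aligned
X_train, y_train = X_train.reset_index(drop=True), y_train.reset_index(drop=True)
X_test, y_test = X_test.reset_index(drop=True), y_test.reset_index(drop=True)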

Now, for feature scaling, we need to set aside the categorical features and scale only the continuous ones. Figure 14 shows how to do the feature scaling, and Figure 15 shows what the data frame looks like afterwards.

Figure 14
Figure 15
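A sketch of the standardization step behind Figures 14 and 15, fitting the scaler on the training continuous columns only and reusing it on the test set; cont_cols is assumed to be the list of continuous column names from the earlier outlier-counting sketch.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit on the training continuous columns only, then reuse the same scaler on the test set
X_train[cont_cols] = scaler.fit_transform(X_train[cont_cols])
X_test[cont_cols] = scaler.transform(X_test[cont_cols])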

Before standardizing

Figure 16

After standardizing

Figure 17

Figure 16 shows histograms of our data before standardizing, and Figure 17 shows the data after standardizing. With the help of these two figures, we can see that all the continuous features have been brought to the same scale.

Now, we have completed the pre-processing steps.

Feature Engineering

Feature engineering simply refers to selecting the features that are significant for our model. Identifying features that are highly correlated with our target has a huge impact on model performance. I have seen many people skip this step and continue with all the columns without knowing how significant each feature is for the target. But if you skip this step, your model's complexity will increase and the model will try to capture noise as well. This can lead to overfitting during the training phase (and sometimes the testing phase), so the model may fail to generalize to real-world data patterns.

First, we should identify dependent and independent features using a correlation heatmap of the continuous features. Figure 18 shows the heatmap for the continuous features.

Figure 18

If the correlation between two features is near +1, there is a strong positive correlation and we can conclude that the two features are dependent on each other. If the correlation is near -1, there is a strong negative correlation and those two features are also dependent on each other. If the correlation is near 0, we can conclude that the two features do not depend on each other. In our context, we can see that emp.var.rate, cons.price.idx, euribor3m, and nr.employed are dependent on each other, so we can keep one feature from that dependent bucket and continue. We can also see that the other continuous features are independent of each other.
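A sketch of how the heatmap in Figure 18 could be produced; the use of seaborn and the plotting parameters are assumptions, since the notebook may plot it differently.

import matplotlib.pyplot as plt
import seaborn as sns

# Pearson correlations between the continuous features (training data)
corr = X_train[cont_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation between continuous features')
plt.show()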

Dependent Features

  • emp.var.rate, cons.price.idx, euribor3m, and nr.employed

Independent Features

  • age
  • campaign
  • pdays
  • previous
  • cons.conf.idx

Next, we can check the significance of each continuous feature with respect to our target variable y. Figure 19 shows the heatmap used to check this significance.

Figure 19

None of the continuous features has a strong correlation with the target except emp.var.rate, cons.price.idx, euribor3m, and nr.employed. But those four features depend on each other, so we can keep the one with the highest correlation from the dependent bucket, which is nr.employed, and drop the other features that depend on it. age, campaign, pdays, previous, and cons.conf.idx are independent features, but none of them has a strong correlation with the target. Still, it is not a good idea to drop them all, because each has a considerable amount of correlation with the target variable y. So, let's keep all the features except emp.var.rate, cons.price.idx, and euribor3m, and try to reduce our dimensions using PCA.

Note that I am not considering the categorical features for the heatmap, since they add many columns and it is hard to read correlations from a heatmap with that many features. So, here I only considered the continuous features.

We now have 58 features, both continuous and categorical. If we considered all 58 features, our model complexity would be high and the model would likely overfit. So, we can apply PCA for dimensionality reduction instead of relying on the heatmap.

Let’s Apply PCA

PCA is an algorithm that can be used to reduce the number of dimensions used in our model.

Note that PCA does not simply eliminate redundant features; it creates a new set of features that are linear combinations of the input features, obtained by projecting the data onto the eigenvectors of the covariance matrix. Those new variables are called principal components, and all principal components are orthogonal to each other, which avoids redundant information. To select components, we use the eigenvalues associated with the eigenvectors and choose the components that together explain 95% of the variance.

You can read about PCA further using this PCA article.

Figure 20 shows the explained variance of all 58 components. It is recommended to take the number of components whose cumulative explained variance exceeds 95%.

Figure 20

We need 27 components to reach 95% of the explained variance; the remaining 31 components account for only about 5%. Don't take all the components just to increase accuracy: if you do, the model will overfit and fail when performing on real data. On the other hand, if you reduce the number of components too far, you capture less of the variance and the model can underfit. So, we have reduced our model's dimensions from 58 to 27.

Next, we can define PCA with 27 components, as shown in Figure 21. This reduces our X_train and X_test frames to 27 dimensions.

Figure 21
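A sketch of the PCA steps behind Figures 20 and 21: dropping the dependent features, inspecting the cumulative explained variance, and reducing X_train and X_test to 27 dimensions. The variable names X_train_pca and X_test_pca are my own, not necessarily the notebook's.

import numpy as np
from sklearn.decomposition import PCA

# Drop the three dependent features identified from the heatmap
drop_cols = ['emp.var.rate', 'cons.price.idx', 'euribor3m']
X_train = X_train.drop(columns=drop_cols)
X_test = X_test.drop(columns=drop_cols)

# Cumulative explained variance tells us how many components reach 95%
pca_full = PCA().fit(X_train)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
print(int(np.argmax(cum_var >= 0.95)) + 1)  # 27 in the author's run

# Reduce both frames to 27 dimensions
pca = PCA(n_components=27)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)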

Multi-Layer Perceptron (Backpropagation)

We have now completed the pre-processing and feature engineering steps, so we can apply an MLP with backpropagation to our training data.

What is a Multi-Layer Perceptron?

A multilayer perceptron (MLP) is a type of feedforward artificial neural network (ANN). An MLP is trained using backpropagation, a supervised learning technique. Its multiple layers and non-linear activation functions distinguish an MLP from a linear perceptron, and allow it to separate data that is not linearly separable.

I recommend this article to read more about the Multi-Layer Perceptron.

I recommend this article to read more about non-linear activation functions.

First, we need to define our model: the number of hidden layers, the number of neurons in each hidden layer, and the activation function used to train the model. Here, I used two Dense layers, with the sigmoid function in both. Because y has only two possible classes, sigmoid is a good fit; it is very popular in this kind of context, since it outputs values between 0 and 1, so we can easily read the output as a probability over the two classes. For the first (hidden) layer I used 32 neurons, and for the second (output) layer I used 1 neuron. As far as I know, there is no elegant rule for deciding how many hidden layers you need, which activation function to choose for each layer, or how many neurons each layer should have; you decide based on expertise, experience, and experiments. In general, though, experts suggest that two layers are enough for a two-class classification problem.

The next steps are to define and build the model, as shown in Figures 22, 23, and 24.

Note that, in this article, I used the TensorFlow Keras library to build our MLP model with the Sequential API. You can see the full code in the Python notebook.

Define Model

Figure 22
Figure 23

Build Model

Figure 24
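A sketch of the model definition and training described above (Figures 22-24), using the Keras Sequential API; the optimizer, loss, epochs, batch size, and validation split are illustrative assumptions, since the exact values are in the notebook.

import tensorflow as tf

# Two Dense layers with sigmoid activations: 32 units, then a single output unit
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(27,)),              # 27 PCA components
    tf.keras.layers.Dense(32, activation='sigmoid'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# Binary cross-entropy suits the two-class (yes/no) target;
# the optimizer, epochs, and batch size here are illustrative choices
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

history = model.fit(X_train_pca, y_train, epochs=50, batch_size=32,
                    validation_split=0.1, verbose=1)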

Now that we have built our model, it's time to evaluate it.

Model Evaluation

We can evaluate our model using the Keras evaluate method. It gives us 90.228% accuracy on the testing data. That is good accuracy, but there is a problem, which we will discuss under the confusion matrix and the classification report. Figure 25 shows our model's performance.

Figure 25
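A sketch of the evaluation call behind Figure 25, reusing the model and PCA-transformed test data from the earlier sketches:

# evaluate returns [loss, accuracy] because 'accuracy' was the compiled metric
loss, accuracy = model.evaluate(X_test_pca, y_test, verbose=0)
print(f'Test accuracy: {accuracy * 100:.3f}%')  # about 90.2% in the author's run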

Confusion Matrix

Figure 26

Figure 26 shows the confusion matrix for our model. Let me briefly explain what it means. The horizontal axis represents the predicted category or class, and the vertical axis represents the true category or class, i.e. the correct class the prediction belongs to. In our model, 7200 predictions are correctly classified into the no class and 220 predictions are correctly classified into the yes class. In the jargon used in the AI field, the True Negative (TN) count is 7200 and the True Positive (TP) count is 220. Also, 110 predictions are wrongly classified into the yes class and 700 predictions are wrongly classified into the no class, so the False Positive (FP) count is 110 and the False Negative (FN) count is 700. The TP count is therefore much lower than the FN count. I will discuss this issue further in the classification report.
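A sketch of how the confusion matrix in Figure 26 can be produced with scikit-learn; the 0.5 probability threshold is the usual default and an assumption here.

from sklearn.metrics import confusion_matrix

# Turn sigmoid probabilities into hard 0/1 labels with a 0.5 threshold
y_pred = (model.predict(X_test_pca) > 0.5).astype(int).ravel()

# Rows are true classes, columns are predicted classes (scikit-learn's layout)
print(confusion_matrix(y_test, y_pred))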

Classification Report

Figure 27

Figure 27 shows the classification report for our model. I mentioned that the TP count is much lower than the FN count; in practice, this means our model does not perform well on the yes class. You can see that the precision for the 1 (yes) class is 0.67, which is low compared to the 0 (no) class. This happens because of class imbalance in our dataset: it simply means our dataset has far more no examples than yes examples. To overcome this issue we can use the SMOTE technique. I will not focus on SMOTE in this article; I hope to discuss it in depth in a future article. In summary, our model has a macro average precision of 0.79 and a weighted average precision of 0.88.
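A sketch of the classification report behind Figure 27, reusing y_pred from the confusion-matrix snippet above; the target_names labels are my own for readability.

from sklearn.metrics import classification_report

# Per-class precision/recall/F1 plus macro and weighted averages
print(classification_report(y_test, y_pred, target_names=['no (0)', 'yes (1)']))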

You can find the Python notebook here: https://github.com/amesh97/ann

Thank you for reading …

