Loan Approval Prediction with Machine Learning

Kadek Ayu Novita

7 min read · Feb 22, 2021

source: https://nccrepair.com/loan-approval/

A brief description of this project:

Predict the loan status of a customer who wants to apply for a credit loan by comparing 3 Machine Learning Models:

Logistic regression
Random Forest
K-Nearest Neighbors

The features used in this project are:

Loan_ID
Gender
Married
Dependents
Education
Self Employed
Applicant Income
Coapplicant Income
Loan Amount
Loan Amount Term
Credit History
Property Area
Loan_Status

Web app Loan Approval Prediction:

This project is one of the requirements I had to fulfil to complete the bootcamp at Purwadhika Start Up and Coding School. I chose this topic because my background is in banking, which makes it very relevant to me. In banking, finding out whether someone is eligible for a loan means analysing several important variables, such as credit history, income, and loan amount. However, banks often take a long time to analyse customer documents, which causes customers to move to a bank that has an instant approval process.

The goal of this project is to create a simple web app that can be used as a first step to predict whether someone is eligible for a loan. The processing steps are explained below:

1. Gathering the Data

In this project, I am using a dataset from Kaggle that can be downloaded here.

2. Data Pre-Processing

Explanation of the variables in the loan dataset:

Loan_ID: Unique Loan_ID
Gender: Male/Female
Married: Applicant Married (Y/N)
Dependents: Number of Dependants
Education: Graduate/Non Graduate
Self_Employed: Self Employed (Y/N)
ApplicantIncome: Applicant Income
CoapplicantIncome: Coapplicant Income
LoanAmount: Loan amount in thousands
Loan_Amount_Term: Term of loan in months
Credit_History: Credit history meets guidelines
Property_Area: Urban/Semiurban/Rural

Target:

Loan_Status: Loan Approved (Y/N)

2.1. Load and Explore the Dataset
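
As a minimal sketch of this step, the snippet below loads a CSV with pandas and prints a few basic summaries. The three rows are made up for illustration and only mimic the column layout described above; with the real Kaggle file, the path to the downloaded CSV would be passed to read_csv instead.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for the Kaggle CSV file.
csv_text = """Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
LP001,Male,No,0,Graduate,No,5849,0,,360,1,Urban,Y
LP002,Male,Yes,1,Graduate,No,4583,1508,128,360,1,Rural,N
LP003,Female,Yes,0,Not Graduate,Yes,3000,0,66,360,,Urban,Y
"""

# With the real dataset, pass the file path instead of the StringIO object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)           # number of rows and columns
print(df.dtypes)          # data type of each column
print(df.isnull().sum())  # number of missing values per column
```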

2.2. Missing Values Imputation

Some variables have missing values: Gender, Married, Dependents, Self_Employed, LoanAmount, Loan_Amount_Term and Credit_History.

For categorical variables: imputation using the mode (replacing missing values with the most frequent category).

There are very few missing values in the Gender, Married, Dependents, Credit_History and Self_Employed features, so we can fill them using the mode of each feature.
In the Loan_Amount_Term variable, the value 360 occurs most often, so the missing values in this variable are replaced with its mode.

For numerical variables: imputation using the median.

For the LoanAmount variable, the median is used to fill the null values: LoanAmount has outliers, and the mean is highly affected by outliers, so it would not be a proper choice.
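
The imputation rules above can be sketched with pandas as follows. The small DataFrame is invented for illustration (it is not the real dataset), and only three of the affected columns are shown.

```python
import numpy as np
import pandas as pd

# Made-up sample with missing values; not the real dataset.
df = pd.DataFrame({
    "Gender": ["Male", "Male", None, "Female"],
    "Loan_Amount_Term": [360.0, 360.0, np.nan, 180.0],
    "LoanAmount": [100.0, 120.0, np.nan, 700.0],
})

# Mode imputation: fill with the most frequent value of the column.
for col in ["Gender", "Loan_Amount_Term"]:
    df[col] = df[col].fillna(df[col].mode()[0])

# Median imputation: robust to the outlier (700) in LoanAmount.
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].median())

print(df)
```

In this toy example the median of the observed loan amounts is 120, while the mean would be pulled up to about 307 by the single large value, which is the reason given above for preferring the median.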

2.3. Exploratory Data Analysis (EDA)

In this part, I do some data analysis using visualization, including:

Univariate Analysis: examine each variable individually
Bivariate Analysis: examine each variable against the target variable.

2.3.1. Univariate Analysis

Visualize Independent variable (Categorical Feature)

It can be seen from the bar plots that:

80% of the applicants in the dataset are male.
Around 65% of the applicants in the dataset are married.
Around 15% of the applicants in the dataset are self-employed.
Around 85% of the applicants have repaid their debts.

Visualize independent variable (Ordinal Feature)

It can be seen from the bar plots that:

Most of the applicants don’t have any dependents.
Around 80% of the applicants are Graduate.
Most of the applicants are from the Semiurban area.
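
Percentages like the ones read off the bar plots can also be computed directly. The following is a small sketch using value_counts(normalize=True) on invented data; the five rows are chosen only so that the result is easy to check by hand.

```python
import pandas as pd

# Made-up sample: four male applicants and one female applicant.
df = pd.DataFrame({"Gender": ["Male", "Male", "Male", "Male", "Female"]})

# normalize=True returns proportions instead of raw counts.
share = df["Gender"].value_counts(normalize=True)
print(share)

# share.plot.bar() would draw the corresponding bar plot (requires matplotlib).
```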

Visualize independent variable (Numeric Features)

2.3.2. Bivariate Analysis

Categorical Independent Variable vs Target Variable

It can be seen from the Gender vs Loan_Status bar plot that the proportion of male and female applicants is more or less the same for both approved and unapproved loans.

It can be seen from the Dependents vs Loan_Status and Married vs Loan_Status bar plots that:

The proportion of married applicants is higher for the approved loans.
The distribution of applicants with 1 or 3+ dependents is similar across both categories of Loan_Status.

It can be seen from the Education vs Loan_Status and Self_Employed vs Loan_Status bar plots that:

The proportion of graduate applicants is higher for the approved loans.
There is nothing significant we can infer from the Self_Employed vs Loan_Status plot.

It can be seen from the Credit_History vs Loan_Status and Property_Area vs Loan_Status bar plots that:

People with a Credit_History of 1 seem more likely to get their loans approved.
The proportion of loans approved in the semiurban area is higher than in rural or urban areas.
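
One way to get the proportions behind this kind of bar plot is a normalized cross-tabulation. The sketch below uses pd.crosstab on invented data; the numbers are not taken from the real dataset.

```python
import pandas as pd

# Made-up sample: credit history flag and loan decision for six applicants.
df = pd.DataFrame({
    "Credit_History": [1, 1, 1, 1, 0, 0],
    "Loan_Status": ["Y", "Y", "Y", "N", "N", "N"],
})

# normalize="index" turns each row into proportions that sum to 1,
# i.e. the share of approved / not approved loans per credit history value.
table = pd.crosstab(df["Credit_History"], df["Loan_Status"], normalize="index")
print(table)

# table.plot.bar(stacked=True) would draw the stacked bar plot (requires matplotlib).
```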

Numerical Independent Variable vs Target Variable

It can be seen from the ApplicantIncome vs Loan_Status bar plot that:

Applicant income does not appear to affect the chances of loan approval.

It can be seen from the CoapplicantIncome vs Loan_Status bar plot that:

The plot suggests that the chances of loan approval are high when the coapplicant’s income is low, but this does not look right. A possible reason is that most applicants don’t have a coapplicant, so the coapplicant income for such applicants is 0 and loan approval does not depend on it.
So we can make a new variable (Total_Income) that combines the applicant’s and coapplicant’s income to visualize the combined effect of income on loan approval.

It can be seen from the Total_Income vs Loan_Status bar plot that:

The proportion of approved loans for applicants with low Total_Income is much lower than for applicants with average, high, and very high income.
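
A sketch of how Total_Income and its income groups could be built is shown below. The data is invented, and the bin edges and the Total_Income_bin column name are hypothetical choices for illustration; the article does not state which cut-offs were used.

```python
import pandas as pd

# Made-up sample; not the real dataset.
df = pd.DataFrame({
    "ApplicantIncome": [2000, 3500, 5000, 9000],
    "CoapplicantIncome": [0, 1000, 2000, 0],
    "Loan_Status": ["N", "Y", "Y", "Y"],
})

# Combine both incomes into one feature.
df["Total_Income"] = df["ApplicantIncome"] + df["CoapplicantIncome"]

# Hypothetical bin edges; adjust them to the real income distribution.
bins = [0, 2500, 4000, 6000, 81000]
labels = ["Low", "Average", "High", "Very high"]
df["Total_Income_bin"] = pd.cut(df["Total_Income"], bins=bins, labels=labels)

print(df[["Total_Income", "Total_Income_bin", "Loan_Status"]])
```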

It can be seen from the LoanAmount vs Loan_Status bar plot that:

The proportion of approved loans is higher for low and average loan amount.

2.4. Label Encoder

Convert the variable categories into numeric codes such as 0 and 1 so that we can find their correlation with the numerical variables.
Another reason to do so is that some models take only numeric values as input.
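
A minimal sketch of this encoding step with sklearn's LabelEncoder, on invented data. LabelEncoder assigns integer codes in sorted order of the category names, so here Female becomes 0 and Male becomes 1, and N becomes 0 and Y becomes 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample with two categorical columns.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Loan_Status": ["Y", "N", "Y"],
})

# Fit a separate encoder per column and replace the text labels with integers.
for col in ["Gender", "Loan_Status"]:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df)
```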

2.5. Correlation

The most correlated variables are:

TotalIncome — LoanAmount (0.62)
Loan_Status — Credit_History (0.54)
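
Correlations like these can be read from a correlation matrix. The sketch below uses DataFrame.corr() (Pearson correlation by default) on invented, already encoded data, so its values will not match the figures above.

```python
import pandas as pd

# Made-up, already numeric sample; not the real dataset.
df = pd.DataFrame({
    "Total_Income": [3000, 4500, 6000, 8000, 12000],
    "LoanAmount": [80, 120, 130, 200, 260],
    "Credit_History": [1, 0, 1, 1, 0],
    "Loan_Status": [1, 0, 1, 0, 0],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()
print(corr.round(2))

# A heatmap of corr (for example with seaborn) is a common way to visualize it.
```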

2.6. Standardization

Standardization in this project uses StandardScaler() and aims to find and drop the outliers.

Total outliers per variable:

Total income: 13
LoanAmount: 19
Loan Amount Term: 56

After dropping the outliers, the shape of the dataset is (536, 11).
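
The article does not spell out the rule used to flag outliers after standardization. One common approach, shown here purely as an assumption, is to drop rows whose standardized value (z-score) exceeds 3 in absolute value. The data is invented.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up sample: 20 ordinary loan amounts plus one extreme value.
values = [float(v) for v in range(100, 160, 3)] + [700.0]
df = pd.DataFrame({"LoanAmount": values})

# Standardize the column: subtract the mean, divide by the standard deviation.
z = StandardScaler().fit_transform(df[["LoanAmount"]])

# Assumed rule: keep rows where every standardized value lies within 3 std devs.
mask = (np.abs(z) < 3).all(axis=1)
clean = df[mask]

print(len(df), "rows before,", len(clean), "rows after")
```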

3. Building Machine Learning Model

3.1. Splitting Dataset (Train and Test)

Train the model on the training dataset and make predictions for the test dataset.
Use the train_test_split function from sklearn to divide the dataset.
Test size = 20%, random state = 107
Total train (xtr, ytr) dataset = 428
Total test (xval, yval) dataset = 108
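
The split can be sketched as follows. The arrays are placeholders with 536 rows; the 10 feature columns are an assumption (the 11 columns of the dataset minus the target), and the values are not real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the cleaned dataset.
X = np.zeros((536, 10))      # 10 feature columns (assumed)
y = np.array([0, 1] * 268)   # dummy binary target

# 20% of the rows go to the test (validation) set.
xtr, xval, ytr, yval = train_test_split(X, y, test_size=0.2, random_state=107)

print(len(xtr), len(xval))
```

With 536 rows and a test size of 20%, sklearn rounds the test set up to 108 rows, which leaves 428 rows for training.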

3.2. Model Building

In this part, I compare 3 Machine Learning models for predicting loan approval status:

Logistic Regression
Random Forest
K-Nearest Neighbors
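
A rough sketch of such a comparison is shown below. It uses synthetic data from make_classification instead of the loan dataset and default model settings, so the printed scores are not the ones reported in this article. For classifiers, score() returns the mean accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the prepared loan data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
xtr, xval, ytr, yval = train_test_split(X, y, test_size=0.2, random_state=107)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(xtr, ytr)
    scores[name] = model.score(xval, yval)
    print(f"{name}: {scores[name]:.2f}")
```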

3.3. Hyper-parameter Tuning

For the tuning strategy, I am using Grid Search. Let’s see the differences before and after hyper-parameter tuning:

Model Score:

Logistic Regression: 83%
Random Forest: 80%
KNN: 62%

Model Score after Tuning:

Logistic Regression: 83%
Random Forest: 83%
KNN: 62%
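
Grid Search can be sketched with sklearn's GridSearchCV as below. The parameter grid is a hypothetical example for KNN and the data is synthetic; the article does not list the grids that were actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the prepared loan data.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical grid: every combination of these values is tried.
param_grid = {
    "n_neighbors": [3, 5, 7],
    "weights": ["uniform", "distance"],
}

# 5-fold cross-validated search, scored by accuracy.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```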

3.4. Feature Selection

Use the SelectFromModel object from sklearn to automatically select the features.
Identify the most important feature.
Compare the accuracy of the ‘full featured’ classifier to the accuracy of the ‘important featured’ classifier.
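
The three steps above could look roughly like this. The sketch assumes a Random Forest as the model that supplies the feature importances and uses synthetic data, so it only illustrates the mechanics, not the result on the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Synthetic data: 10 features, of which 3 are informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3, random_state=0)
xtr, xval, ytr, yval = train_test_split(X, y, test_size=0.2, random_state=107)

# "Full featured" classifier trained on all features.
full = RandomForestClassifier(random_state=0).fit(xtr, ytr)

# Keep features whose importance is at least the mean importance
# (the default threshold for a model with feature_importances_).
selector = SelectFromModel(full, prefit=True)
xtr_sel = selector.transform(xtr)
xval_sel = selector.transform(xval)

# "Important featured" classifier trained on the selected features only.
reduced = RandomForestClassifier(random_state=0).fit(xtr_sel, ytr)

print("selected features:", xtr_sel.shape[1])
print("full:", full.score(xval, yval), "reduced:", reduced.score(xval_sel, yval))
```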

3.5. Evaluation Metrics

Cross Validation
Confusion Matrix
ROC AUC
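
The three metrics can be computed with sklearn as in the sketch below. It uses Logistic Regression on synthetic data as an example, so the numbers it prints are unrelated to the results of this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the prepared loan data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
xtr, xval, ytr, yval = train_test_split(X, y, test_size=0.2, random_state=107)
model = LogisticRegression(max_iter=1000)

# Cross validation: accuracy on 5 folds of the training data.
cv_scores = cross_val_score(model, xtr, ytr, cv=5, scoring="accuracy")

# Confusion matrix: rows are true classes, columns are predicted classes.
model.fit(xtr, ytr)
cm = confusion_matrix(yval, model.predict(xval))

# ROC AUC: computed from the predicted probability of the positive class.
auc = roc_auc_score(yval, model.predict_proba(xval)[:, 1])

print(cv_scores.mean())
print(cm)
print(auc)
```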