Loan Approval Prediction with Machine Learning
A brief description of this project:
Predict the loan status of a customer who wants to apply for a credit loan by comparing 3 Machine Learning Models:
- Logistic regression
- Random Forest
- K-Nearest Neighbors
The features used in this project are:
- Loan_ID
- Gender
- Married
- Dependents
- Education
- Self Employed
- Applicant Income
- Coapplicant Income
- Loan Amount
- Loan Amount Term
- Credit History
- Property Area
- Loan_Status
Web app Loan Approval Prediction:
This project is one of the requirements I had to fulfil to complete the bootcamp at Purwadhika Start Up and Coding School. I chose this topic because my background is in banking, which makes it very relevant to me. In banking, deciding whether someone is eligible for a loan means analysing several important variables, such as credit history, income, and loan amount. However, banks often take a long time to analyse customer documents, which causes customers to move to a bank that has an instant approval process.
The goal of this project is to create a simple web app that can be used as a first step to predict whether someone is eligible for a loan. The processing steps are explained below:
1. Gathering the Data
In this project, I am using a dataset from Kaggle that can be downloaded here.
2. Data Pre-Processing
Explanation of the variables in the loan dataset:
- Loan_ID: Unique Loan_ID
- Gender: Male/Female
- Married: Applicant Married (Y/N)
- Dependents: Number of dependents
- Education: Graduate/Non Graduate
- Self_Employed: Self Employed (Y/N)
- ApplicantIncome: Applicant Income
- CoapplicantIncome: Coapplicant Income
- LoanAmount: Loan amount in thousands
- Loan_Amount_Term: Term of loan in months
- Credit_History: Credit history meets guidelines
- Property_Area: Urban/Semiurban/Rural
Target:
- Loan_Status: Loan Approved (Y/N)
2.1. Load and Explore the Dataset
2.2. Missing Values Imputation
For categorical variables: imputation using the mode (replacing missing values with the most frequent category).
- There are very few missing values in the Gender, Married, Dependents, Credit_History and Self_Employed features, so we can fill them with the mode of each feature.
- In the Loan_Amount_Term variable, the value 360 occurs most often, so the missing values in this variable are also replaced with its mode.
For numerical variables: imputation using the median.
- The null values in LoanAmount are filled with the median, because LoanAmount has outliers and the mean is highly affected by the presence of outliers.
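As a rough illustration of the two imputation rules above, here is a minimal sketch with pandas. The small DataFrame is made up for the example and is not the Kaggle data; only the column names come from the variable list.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample that mimics a few columns of the loan dataset.
df = pd.DataFrame({
    "Gender": ["Male", "Male", np.nan, "Female", "Male"],
    "Loan_Amount_Term": [360.0, 360.0, 180.0, np.nan, 360.0],
    "LoanAmount": [100.0, 120.0, np.nan, 110.0, 700.0],  # 700 acts as an outlier
})

# Categorical columns (and Loan_Amount_Term): fill with the most frequent value.
for col in ["Gender", "Loan_Amount_Term"]:
    df[col] = df[col].fillna(df[col].mode()[0])

# Numerical column with outliers: fill with the median instead of the mean.
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].median())

print(df)
```

In this sample the mean of LoanAmount would be pulled up to 257.5 by the single value of 700, while the median stays at 115, which is why the median is the safer fill value here.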
2.3. Exploratory Data Analysis (EDA)
In this part, I will do some data analysis using visualization, including:
- Univariate Analysis: examine each variable individually.
- Bivariate Analysis: examine each variable against the target variable.
2.3.1. Univariate Analysis
Visualize Independent variable (Categorical Feature)
It can be seen from the bar plots that:
- 80% of the applicants in the dataset are male.
- Around 65% of the applicants in the dataset are married.
- Around 15% of the applicants in the dataset are self employed.
- Around 85% of the applicants have repaid their debts.
Visualize independent variable (Ordinal Feature)
It can be seen from the bar plots that:
- Most of the applicants don’t have any dependents.
- Around 80% of the applicants are Graduate.
- Most of the applicants are from Semiurban area.
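The percentages read from the bar plots can also be computed directly. The snippet below is only a sketch on a made-up five-row sample, so the numbers it prints are not those of the real dataset.

```python
import pandas as pd

# Made-up sample; the real dataset has many more rows.
df = pd.DataFrame({"Gender": ["Male", "Male", "Male", "Male", "Female"]})

# Share of each category, i.e. the numbers behind a normalized bar plot.
gender_share = df["Gender"].value_counts(normalize=True)
print(gender_share)

# With matplotlib installed, the same series can be drawn as a bar plot:
# gender_share.plot.bar(title="Gender")
```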
Visualize independent variable (Numeric Features)
2.3.2. Bivariate Analysis
Categorical Independent Variable vs Target Variable
- It can be seen from the Gender vs Loan_Status bar plot that the proportion of male and female applicants is more or less the same for both approved and unapproved loans.
It can be seen from the Dependents vs Loan_Status and Married vs Loan_Status bar plots that:
- The proportion of married applicants is higher for the approved loans.
- The distribution of applicants with 1 or 3+ dependents is similar across both categories of Loan_Status.
It can be seen from the Education vs Loan_Status and Self_Employed vs Loan_Status bar plots that:
- The proportion of graduate applicants is higher for the approved loans.
- Nothing significant can be inferred from the Self_Employed vs Loan_Status plot.
It can be seen from the Credit_History vs Loan_Status and Property_Area vs Loan_Status bar plots that:
- Applicants with a Credit_History of 1 seem more likely to get their loans approved.
- The proportion of approved loans is higher in semiurban areas than in rural or urban areas.
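One way to get the proportions behind these categorical vs target bar plots is a row-normalized crosstab. This is a sketch on made-up rows, assuming the column names from the variable list.

```python
import pandas as pd

# Made-up rows; only the column names follow the loan dataset.
df = pd.DataFrame({
    "Credit_History": [1, 1, 1, 1, 0, 0],
    "Loan_Status": ["Y", "Y", "Y", "N", "N", "N"],
})

# Row-normalized contingency table: each row sums to 1, so the "Y" column
# is the approval rate within each Credit_History group.
approval_by_history = pd.crosstab(
    df["Credit_History"], df["Loan_Status"], normalize="index"
)
print(approval_by_history)

# With matplotlib installed, a stacked bar plot of the same table:
# approval_by_history.plot.bar(stacked=True)
```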
Numerical Independent Variable vs Target Variable
It can be seen from the ApplicantIncome vs Loan_Status bar plot that:
- Applicant income does not appear to affect the chances of loan approval.
It can be seen from the CoapplicantIncome vs Loan_Status bar plot that:
- The chances of loan approval look higher when the coapplicant’s income is lower, but this does not look right. A possible reason is that most of the applicants don’t have a coapplicant, so their coapplicant income is 0 and the loan approval does not depend on it.
- So we can make a new variable (Total_Income) that combines the applicant’s and coapplicant’s income to visualize the combined effect of income on loan approval.
It can be seen from the Total_Income vs Loan_Status bar plot that:
- The proportion of approved loans is much lower for applicants with low Total_Income than for applicants with average, high, and very high income.
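The Total_Income variable and its income groups could be built roughly as follows. The bin edges are hypothetical, chosen only for this example, and the rows are made up.

```python
import pandas as pd

# Made-up rows; column names follow the variable list.
df = pd.DataFrame({
    "ApplicantIncome": [1500, 3000, 5000, 9000],
    "CoapplicantIncome": [0, 1000, 0, 2000],
    "Loan_Status": ["N", "Y", "Y", "Y"],
})

# Combine both incomes into one variable.
df["Total_Income"] = df["ApplicantIncome"] + df["CoapplicantIncome"]

# Hypothetical bin edges (assumption, not taken from the project).
bins = [0, 2500, 4000, 6000, float("inf")]
labels = ["Low", "Average", "High", "Very high"]
df["Total_Income_bin"] = pd.cut(df["Total_Income"], bins=bins, labels=labels)

# Approval rate per income group.
approval_by_income = pd.crosstab(
    df["Total_Income_bin"], df["Loan_Status"], normalize="index"
)
print(approval_by_income)
```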
It can be seen from the LoanAmount vs Loan_Status bar plot that:
- The proportion of approved loans is higher for low and average loan amounts.
2.4. Label Encoder
- Convert the variable categories into numbers (0 and 1 for binary variables) so that we can find their correlation with the numerical variables.
- Another reason is that some models only take numeric values as input.
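A minimal sketch of this step with sklearn's LabelEncoder is shown below. The three-row DataFrame and the category spellings in it are made up; one encoder is kept per column so the original labels can be recovered later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows with a few categorical columns.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Married": ["Yes", "No", "Yes"],
    "Loan_Status": ["Y", "N", "Y"],
})

# One encoder per column; LabelEncoder sorts the classes, so the first
# class in sorted order becomes 0 and the second becomes 1.
encoders = {}
for col in ["Gender", "Married", "Loan_Status"]:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

print(df)
print(list(encoders["Loan_Status"].classes_))
```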
2.5. Correlation
The most correlated variables are:
- TotalIncome and LoanAmount (0.62)
- Loan_Status and Credit_History (0.54)
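Correlations like these can be read from a correlation matrix. The sketch below uses made-up numbers, so its values will not match the 0.62 and 0.54 reported above.

```python
import pandas as pd

# Made-up, already label-encoded rows.
df = pd.DataFrame({
    "TotalIncome": [3000, 4500, 6000, 8000, 12000, 5000],
    "LoanAmount": [90, 120, 150, 160, 300, 130],
    "Credit_History": [1, 0, 1, 1, 0, 1],
    "Loan_Status": [1, 0, 1, 1, 0, 0],
})

# Pearson correlation between every pair of columns.
corr = df.corr()
print(corr.round(2))

# Features ranked by absolute correlation with the target.
target_corr = (
    corr["Loan_Status"].drop("Loan_Status").abs().sort_values(ascending=False)
)
print(target_corr)
```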
2.6. Standardization
In this project, the data is standardized with StandardScaler(), and the standardized values are then used to find and drop the outliers.
Total outliers per variable:
- Total income = 13
- LoanAmount = 19
- Loan Amount Term = 56
After dropping the outliers, the shape of the dataset is (536, 11).
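One common way to use standardized values for outlier removal is to drop rows whose z-score is far from 0. The sketch below assumes a cut-off of 3 standard deviations, which is my assumption for the example and not necessarily the threshold used in this project; the data is randomly generated.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Randomly generated loan amounts with one planted extreme value.
rng = np.random.default_rng(0)
df = pd.DataFrame({"LoanAmount": rng.normal(150, 30, size=200)})
df.loc[0, "LoanAmount"] = 900.0

# StandardScaler turns each column into z-scores (mean 0, std 1).
z = StandardScaler().fit_transform(df[["LoanAmount"]])

# Keep only rows where every z-score is within the assumed cut-off of 3.
mask = (np.abs(z) <= 3).all(axis=1)
df_clean = df[mask]

print(df.shape, df_clean.shape)
```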
3. Building Machine Learning Model
3.1. Splitting Dataset (Train and Test)
- Train the model on the training dataset and make predictions for the test dataset.
- Use the train_test_split function from sklearn to split the dataset.
- Test size = 20%, random state = 107
- Total train (xtr, ytr) dataset = 428
- Total test (xval, yval) dataset = 108
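The split sizes above can be reproduced with placeholder data. The arrays below only have the same number of rows as the cleaned dataset (536) and carry no real loan information.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and target with 536 rows.
X = np.arange(536 * 2).reshape(536, 2)
y = np.arange(536) % 2

xtr, xval, ytr, yval = train_test_split(
    X, y, test_size=0.2, random_state=107
)

print(len(xtr), len(xval))  # 428 108
```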
3.2. Model Building
In this part, I will compare 3 Machine Learning models for predicting loan approval status:
- Logistic Regression
- Random Forest
- K-Nearest Neighbours
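A comparison like this can be sketched as a loop over the three classifiers. The data comes from make_classification as a stand-in for the loan dataset, and the model settings are mostly sklearn defaults, so the printed scores say nothing about the real results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the pre-processed loan data.
X, y = make_classification(n_samples=536, n_features=10, random_state=107)
xtr, xval, ytr, yval = train_test_split(X, y, test_size=0.2, random_state=107)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=107),
    "KNN": KNeighborsClassifier(),
}

# Fit each model on the training split and record accuracy on the test split.
scores = {}
for name, model in models.items():
    model.fit(xtr, ytr)
    scores[name] = model.score(xval, yval)
    print(f"{name}: {scores[name]:.2f}")
```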
3.3. Hyper-parameter Tuning
For the tuning strategy, I am using Grid Search. Let’s see the differences before and after hyper-parameter tuning:
Model Score:
- Logistic Regression: 83%
- Random Forest: 80%
- KNN: 62%
Model Score after Tuning:
- Logistic Regression: 83%
- Random Forest: 83%
- KNN: 62%
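For reference, a grid search over a Random Forest could look roughly like this. The parameter grid is hypothetical and kept small for speed, and the data is synthetic, so this is a sketch of the technique rather than the grid used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the pre-processed loan data.
X, y = make_classification(n_samples=300, n_features=10, random_state=107)
xtr, xval, ytr, yval = train_test_split(X, y, test_size=0.2, random_state=107)

# Hypothetical grid (assumption for this example).
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

# Try every combination with 3-fold cross validation on the training split.
search = GridSearchCV(
    RandomForestClassifier(random_state=107),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(xtr, ytr)

print(search.best_params_)
print(search.score(xval, yval))  # accuracy of the refitted best model
```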
3.4. Feature Selection
- Use the SelectFromModel object from sklearn to automatically select the features.
- Identify the most important feature.
- Compare the accuracy of the ‘full featured’ classifier to the accuracy of the ‘important featured’ classifier.
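The three steps above can be sketched as follows. I am assuming a Random Forest as the base estimator and the default importance threshold of SelectFromModel (the mean importance); the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Synthetic data where only a few features are informative.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, random_state=107
)
xtr, xval, ytr, yval = train_test_split(X, y, test_size=0.2, random_state=107)

# "Full featured" classifier.
full = RandomForestClassifier(random_state=107).fit(xtr, ytr)

# Keep features whose importance is at least the mean importance.
selector = SelectFromModel(full, prefit=True)
xtr_sel = selector.transform(xtr)
xval_sel = selector.transform(xval)

# "Important featured" classifier trained on the reduced feature set.
reduced = RandomForestClassifier(random_state=107).fit(xtr_sel, ytr)

full_acc = full.score(xval, yval)
reduced_acc = reduced.score(xval_sel, yval)
print(xtr.shape[1], "features:", full_acc)
print(xtr_sel.shape[1], "features:", reduced_acc)
```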
3.5. Evaluation Metrics
- Cross Validation
- Confusion Matrix
- ROC AUC
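All three metrics are available in sklearn. The sketch below computes them for a Logistic Regression on synthetic data; the number of folds (5) is an assumption for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the pre-processed loan data.
X, y = make_classification(n_samples=300, n_features=10, random_state=107)
xtr, xval, ytr, yval = train_test_split(X, y, test_size=0.2, random_state=107)

model = LogisticRegression(max_iter=1000)

# Cross validation: accuracy on 5 folds of the training split.
cv_scores = cross_val_score(model, xtr, ytr, cv=5, scoring="accuracy")

# Confusion matrix and ROC AUC on the held-out split.
model.fit(xtr, ytr)
cm = confusion_matrix(yval, model.predict(xval))
auc = roc_auc_score(yval, model.predict_proba(xval)[:, 1])

print(cv_scores.mean())
print(cm)
print(auc)
```

ROC AUC is computed from the predicted probability of the positive class rather than from the hard 0/1 predictions, which is why predict_proba is used for it.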