Bank Credit Default Case Study

Exploratory Data Analysis

Business Understanding :

The loan providing companies find it hard to give loans to the people due to their insufficient or non-existent credit history. Because of that, some consumers use it as their advantage by becoming a defaulter. Data is collected about a hypothetical finance bank which specialises in lending various types of loans to urban customers. EDA is used to analyse the patterns present in the data. The bank can use this data to ensure that the applicants capable of repaying the loan are not rejected.

When the bank receives a loan application, the bank has to decide for loan approval based on the applicant’s profile. Two types of risks are associated with the bank’s decision:

  • If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the bank
  • If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the bank.

The data contains the information about the loan application at the time of applying for the loan. It contains two types of scenarios:

  • The client with payment difficulties: he/she had late payment more than X days on at least one of the first Y instalments of the loan in our sample,
  • All other cases: All other cases when the payment is paid on time.

When a client applies for a loan, there are four types of decisions that could be taken by the client/bank):

  • Approved: The bank has approved loan Application
  • Cancelled: The client cancelled the application sometime during approval. Either the client changed her/his mind about the loan or in some cases due to a higher risk of the client he received worse pricing which he did not want.
  • Refused: The bank had rejected the loan (because the client does not meet their requirements etc.).
  • Unused offer: Loan has been cancelled by the client but on different stages of the process.

This case study uses EDA techniques to understand how consumer attributes and loan attributes influence the tendency of default.

Business Objectives

This case study aims to identify patterns which indicate if a client has difficulty paying their installments which may be used for taking actions such as denying the loan, reducing the amount of loan, lending (to risky applicants) at a higher interest rate, etc. This will ensure that the consumers capable of repaying the loan are not rejected.

In other words, we have tried to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default. The bank can utilise this knowledge for its portfolio and risk assessment.

Data Understanding

  1. 'application_data.csv' contains all the information of the client at the time of application. The data is about whether a client has payment difficulties.
  2. 'previous_application.csv' contains information about the client’s previous loan data. It contains the data whether the previous application had been Approved, Cancelled, Refused or Unused offer.

Analysis Approach & Conclusions:

  • The purpose of this analysis is to give insights on credit default analysis and suggest customer parameters which have a clear indication of default.
  • The Analysis is done on current loan application data and previous loan application status data.
  • The current loan application is used to explore for insights on the factors that contribute to payment difficulties / loan default.
  • A merged data set of current loan application and previous application data is analysed to give insights on factors that contribute to Loan Approval, Rejection, Cancellations and unused offers.
  • The risk of default could broadly divided into
  • Risk attributed to Credit Good [Credit Good Risk]
  • Risk attributed to Consumer liabilities / behaviour [Consumer related Risk]
  • Loan application is approved based on below examination,
  • If applicant is new customer, then we look at Credibility and Demographics
  • If applicant is old customer, then we look at Credit history and Previous rejection causes.

The following impressions have been drawn from the analysis

Univariate Analysis

Gender

  • There is a very high proportion of Female applicants (65.8%)
  • Only 7% of total female applicants have payment difficulties compared to 10% of total male applications.
  • It is possible that female applicants are safer borrowers than male.

Owning Realty

  • Applicants who do not own Real-estate seem more likely to face payment difficulties than compared to the group of people owning the Real-estate.

Contact Address vs Work Address

  • Applicants that live and work in same city are at less risk of payment difficulties.

Education

  • Higher the education, lesser is the risk of payment difficulties.

Marital Status

  • Applicants with FAMILY_STATUS as seperated or single or have higher risk of payment difficulties compared to that of Widow and Married.
  • For the purpose of this analysis, civil marriage and married are treated equal. They might be merged into the one category.

Number of Days Employed

  • The applicants with payment difficulties have lesser median days of employment compared to "all other cases". So, greater the DAYS_EMPLOYED, lesser is the risk of payment difficulties.

Bivariate Analysis

Credit Amount and Price of Credit Goods

  • Lower Credit Amount and Lower Price of Credit Goods have higher cases of Payment difficulties

Gender & Marital Status

  • Customers most attractive to the bank by likelihood of default, are 'Female Widows' > 'Married Females' = 'Separated Females' > 'Single Females' > 'Married Males' > 'Male Widowers' > 'Single Males' = 'Separated Males'
  • The bank may focus their marketing campaigns on converting applications of Female Widows, Married Females, Separated Females.
  • May note that the dataset is skewed towards Females than Males.

Goods Price

  • Median of "Approved" Category is 95,208.48. Any amount above this has high risk of getting refused or being cancelled.

Income and Annuity

  • Income Range and Annuity have a mild correlation with Payment difficulties

Education & Client Type

  • Bank should look into the reasons why Repeat customers who are cancelling their offers. It could also focus on loans catered to academic degree holders who are first time customers.

Top 10 correlations for 'No default'

  • AMT_GOODS_PRICE & AMT_CREDIT
  • AMT_GOODS_PRICE & AMT_ANNUITY
  • AMT_ANNUITY & AMT_CREDIT
  • AGE_YEARS & DAYS_EMPLOYED
  • AMT_ANNUITY & AMT_INCOME_TOTAL
  • AMT_GOODS_PRICE & AMT_INCOME_TOTAL
  • AMT_INCOME_TOTAL & AMT_CREDIT
  • AGE_YEARS & EXT_SOURCE_3
  • DAYS_LAST_PHONE_CHANGE & EXT_SOURCE_2
  • DAYS_EMPLOYED & AMT_INCOME_TOTAL

Top 10 correlations for 'Default' are :

  • AMT_GOODS_PRICE & AMT_CREDIT
  • AMT_CREDIT & AMT_ANNUITY
  • AMT_ANNUITY & AMT_GOODS_PRICE
  • AGE_YEARS & DAYS_EMPLOYED
  • EXT_SOURCE_2 & DAYS_LAST_PHONE_CHANGE
  • EXT_SOURCE_3 & AGE_YEARS
  • AGE_YEARS & AMT_GOODS_PRICE
  • AGE_YEARS & AMT_CREDIT
  • AMT_GOODS_PRICE & EXT_SOURCE_2
  • AMT_CREDIT & EXT_SOURCE_2

Look below for a detailed analysis

In [98]:
#importing required Libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import sidetable
from tabulate import tabulate

import warnings
warnings.filterwarnings('ignore')
In [99]:
#setting max number of columns to be displayed to 100, to get a better view.
pd.set_option('display.max_columns',100)
pd.set_option('display.max_rows',100)
from IPython.display import HTML
HTML('''<script>
code_show_err=false; 
function code_toggle_err() {
 if (code_show_err){
 $('div.output_stderr').hide();
 } else {
 $('div.output_stderr').show();
 }
 code_show_err = !code_show_err
} 
$( document ).ready(code_toggle_err);
</script>
''')
Out[99]:

Loading Application Data

In [100]:
application_data=pd.read_csv('application_data.csv')
In [101]:
#checking shape 
application_data.shape
Out[101]:
(307511, 122)
In [102]:
application_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
In [103]:
application_data.describe()
Out[103]:
SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE ... LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,499.00 307,233.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 104,582.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,509.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 134,133.00 306,851.00 246,546.00 151,450.00 127,568.00 157,504.00 103,023.00 92,646.00 143,620.00 152,683.00 154,491.00 98,869.00 124,921.00 97,312.00 153,161.00 93,997.00 137,829.00 151,450.00 127,568.00 157,504.00 103,023.00 ... 97,312.00 153,161.00 93,997.00 137,829.00 151,450.00 127,568.00 157,504.00 103,023.00 92,646.00 143,620.00 152,683.00 154,491.00 98,869.00 124,921.00 97,312.00 153,161.00 93,997.00 137,829.00 159,080.00 306,490.00 306,490.00 306,490.00 306,490.00 307,510.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 265,992.00 265,992.00 265,992.00 265,992.00 265,992.00 265,992.00
mean 278,180.52 0.08 0.42 168,797.92 599,026.00 27,108.57 538,396.21 0.02 -16,037.00 63,815.05 -4,986.12 -2,994.20 12.06 1.00 0.82 0.20 1.00 0.28 0.06 2.15 2.05 2.03 12.06 0.02 0.05 0.04 0.08 0.23 0.18 0.50 0.51 0.51 0.12 0.09 0.98 0.75 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.11 0.09 0.98 0.76 ... 0.11 0.11 0.01 0.03 0.12 0.09 0.98 0.76 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.10 1.42 0.14 1.41 0.10 -962.86 0.00 0.71 0.00 0.02 0.09 0.00 0.08 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.01 0.01 0.03 0.27 0.27 1.90
std 102,790.18 0.27 0.72 237,123.15 402,490.78 14,493.74 369,446.46 0.01 4,363.99 141,275.77 3,522.89 1,509.45 11.94 0.00 0.38 0.40 0.04 0.45 0.23 0.91 0.51 0.50 3.27 0.12 0.22 0.20 0.27 0.42 0.38 0.21 0.19 0.19 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.14 0.16 0.08 0.09 0.11 0.05 0.07 0.11 0.08 0.06 0.11 ... 0.10 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.15 0.16 0.08 0.09 0.11 0.05 0.07 0.11 2.40 0.45 2.38 0.36 826.81 0.01 0.45 0.01 0.12 0.28 0.01 0.27 0.06 0.00 0.06 0.00 0.06 0.05 0.03 0.10 0.02 0.09 0.02 0.02 0.02 0.08 0.11 0.20 0.92 0.79 1.87
min 100,002.00 0.00 0.00 25,650.00 45,000.00 1,615.50 40,500.00 0.00 -25,229.00 -17,912.00 -24,672.00 -7,197.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -4,292.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 189,145.50 0.00 0.00 112,500.00 270,000.00 16,524.00 238,500.00 0.01 -19,682.00 -2,760.00 -7,479.50 -4,299.00 5.00 1.00 1.00 0.00 1.00 0.00 0.00 2.00 2.00 2.00 10.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 0.39 0.37 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.05 0.04 0.98 0.70 ... 0.05 0.04 0.00 0.00 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.04 0.00 0.00 0.00 0.00 -1,570.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
50% 278,202.00 0.00 0.00 147,150.00 513,531.00 24,903.00 450,000.00 0.02 -15,750.00 -1,213.00 -4,504.00 -3,254.00 9.00 1.00 1.00 0.00 1.00 0.00 0.00 2.00 2.00 2.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00 0.51 0.57 0.54 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.08 0.07 0.98 0.76 ... 0.08 0.07 0.00 0.00 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.07 0.00 0.00 0.00 0.00 -757.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
75% 367,142.50 0.00 1.00 202,500.00 808,650.00 34,596.00 679,500.00 0.03 -12,413.00 -289.00 -2,010.00 -1,720.00 15.00 1.00 1.00 0.00 1.00 1.00 0.00 3.00 2.00 2.00 14.00 0.00 0.00 0.00 0.00 0.00 0.00 0.68 0.66 0.67 0.15 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.14 0.11 0.99 0.82 ... 0.13 0.13 0.00 0.02 0.15 0.11 0.99 0.83 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.13 2.00 0.00 2.00 0.00 -274.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00
max 456,255.00 1.00 19.00 117,000,000.00 4,050,000.00 258,025.50 4,050,000.00 0.07 -7,489.00 365,243.00 0.00 0.00 91.00 1.00 1.00 1.00 1.00 1.00 1.00 20.00 3.00 3.00 23.00 1.00 1.00 1.00 1.00 1.00 1.00 0.96 0.85 0.90 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 ... 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 348.00 34.00 344.00 24.00 0.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 4.00 9.00 8.00 27.00 261.00 25.00

8 rows × 106 columns

Data Quality Check

Checking Columns with High Null Percentage

In [104]:
def color_red(value):
    '''
      Colors elements in a dateframe
      green if nulls <50% and red if
      >=50%.
    '''
    if value >=50:
        color = 'red'
    elif value <50:
        color = 'green'

    return 'color: %s' % color

#Finding Percentage of Missing values in each and every column
missing_data = np.round(100*application_data.isnull().sum()/len(application_data),2)
missing_data = pd.DataFrame(data={'Column Name' :missing_data.index, 'Null Percentage' : missing_data.values})
missing_data.style.applymap(color_red,subset=['Null Percentage'])
Out[104]:
Column Name Null Percentage
0 SK_ID_CURR 0.00
1 TARGET 0.00
2 NAME_CONTRACT_TYPE 0.00
3 CODE_GENDER 0.00
4 FLAG_OWN_CAR 0.00
5 FLAG_OWN_REALTY 0.00
6 CNT_CHILDREN 0.00
7 AMT_INCOME_TOTAL 0.00
8 AMT_CREDIT 0.00
9 AMT_ANNUITY 0.00
10 AMT_GOODS_PRICE 0.09
11 NAME_TYPE_SUITE 0.42
12 NAME_INCOME_TYPE 0.00
13 NAME_EDUCATION_TYPE 0.00
14 NAME_FAMILY_STATUS 0.00
15 NAME_HOUSING_TYPE 0.00
16 REGION_POPULATION_RELATIVE 0.00
17 DAYS_BIRTH 0.00
18 DAYS_EMPLOYED 0.00
19 DAYS_REGISTRATION 0.00
20 DAYS_ID_PUBLISH 0.00
21 OWN_CAR_AGE 65.99
22 FLAG_MOBIL 0.00
23 FLAG_EMP_PHONE 0.00
24 FLAG_WORK_PHONE 0.00
25 FLAG_CONT_MOBILE 0.00
26 FLAG_PHONE 0.00
27 FLAG_EMAIL 0.00
28 OCCUPATION_TYPE 31.35
29 CNT_FAM_MEMBERS 0.00
30 REGION_RATING_CLIENT 0.00
31 REGION_RATING_CLIENT_W_CITY 0.00
32 WEEKDAY_APPR_PROCESS_START 0.00
33 HOUR_APPR_PROCESS_START 0.00
34 REG_REGION_NOT_LIVE_REGION 0.00
35 REG_REGION_NOT_WORK_REGION 0.00
36 LIVE_REGION_NOT_WORK_REGION 0.00
37 REG_CITY_NOT_LIVE_CITY 0.00
38 REG_CITY_NOT_WORK_CITY 0.00
39 LIVE_CITY_NOT_WORK_CITY 0.00
40 ORGANIZATION_TYPE 0.00
41 EXT_SOURCE_1 56.38
42 EXT_SOURCE_2 0.21
43 EXT_SOURCE_3 19.83
44 APARTMENTS_AVG 50.75
45 BASEMENTAREA_AVG 58.52
46 YEARS_BEGINEXPLUATATION_AVG 48.78
47 YEARS_BUILD_AVG 66.50
48 COMMONAREA_AVG 69.87
49 ELEVATORS_AVG 53.30
50 ENTRANCES_AVG 50.35
51 FLOORSMAX_AVG 49.76
52 FLOORSMIN_AVG 67.85
53 LANDAREA_AVG 59.38
54 LIVINGAPARTMENTS_AVG 68.35
55 LIVINGAREA_AVG 50.19
56 NONLIVINGAPARTMENTS_AVG 69.43
57 NONLIVINGAREA_AVG 55.18
58 APARTMENTS_MODE 50.75
59 BASEMENTAREA_MODE 58.52
60 YEARS_BEGINEXPLUATATION_MODE 48.78
61 YEARS_BUILD_MODE 66.50
62 COMMONAREA_MODE 69.87
63 ELEVATORS_MODE 53.30
64 ENTRANCES_MODE 50.35
65 FLOORSMAX_MODE 49.76
66 FLOORSMIN_MODE 67.85
67 LANDAREA_MODE 59.38
68 LIVINGAPARTMENTS_MODE 68.35
69 LIVINGAREA_MODE 50.19
70 NONLIVINGAPARTMENTS_MODE 69.43
71 NONLIVINGAREA_MODE 55.18
72 APARTMENTS_MEDI 50.75
73 BASEMENTAREA_MEDI 58.52
74 YEARS_BEGINEXPLUATATION_MEDI 48.78
75 YEARS_BUILD_MEDI 66.50
76 COMMONAREA_MEDI 69.87
77 ELEVATORS_MEDI 53.30
78 ENTRANCES_MEDI 50.35
79 FLOORSMAX_MEDI 49.76
80 FLOORSMIN_MEDI 67.85
81 LANDAREA_MEDI 59.38
82 LIVINGAPARTMENTS_MEDI 68.35
83 LIVINGAREA_MEDI 50.19
84 NONLIVINGAPARTMENTS_MEDI 69.43
85 NONLIVINGAREA_MEDI 55.18
86 FONDKAPREMONT_MODE 68.39
87 HOUSETYPE_MODE 50.18
88 TOTALAREA_MODE 48.27
89 WALLSMATERIAL_MODE 50.84
90 EMERGENCYSTATE_MODE 47.40
91 OBS_30_CNT_SOCIAL_CIRCLE 0.33
92 DEF_30_CNT_SOCIAL_CIRCLE 0.33
93 OBS_60_CNT_SOCIAL_CIRCLE 0.33
94 DEF_60_CNT_SOCIAL_CIRCLE 0.33
95 DAYS_LAST_PHONE_CHANGE 0.00
96 FLAG_DOCUMENT_2 0.00
97 FLAG_DOCUMENT_3 0.00
98 FLAG_DOCUMENT_4 0.00
99 FLAG_DOCUMENT_5 0.00
100 FLAG_DOCUMENT_6 0.00
101 FLAG_DOCUMENT_7 0.00
102 FLAG_DOCUMENT_8 0.00
103 FLAG_DOCUMENT_9 0.00
104 FLAG_DOCUMENT_10 0.00
105 FLAG_DOCUMENT_11 0.00
106 FLAG_DOCUMENT_12 0.00
107 FLAG_DOCUMENT_13 0.00
108 FLAG_DOCUMENT_14 0.00
109 FLAG_DOCUMENT_15 0.00
110 FLAG_DOCUMENT_16 0.00
111 FLAG_DOCUMENT_17 0.00
112 FLAG_DOCUMENT_18 0.00
113 FLAG_DOCUMENT_19 0.00
114 FLAG_DOCUMENT_20 0.00
115 FLAG_DOCUMENT_21 0.00
116 AMT_REQ_CREDIT_BUREAU_HOUR 13.50
117 AMT_REQ_CREDIT_BUREAU_DAY 13.50
118 AMT_REQ_CREDIT_BUREAU_WEEK 13.50
119 AMT_REQ_CREDIT_BUREAU_MON 13.50
120 AMT_REQ_CREDIT_BUREAU_QRT 13.50
121 AMT_REQ_CREDIT_BUREAU_YEAR 13.50
In [105]:
#Finding the columns that have more than or equal to 50% null values and storing it to columns_to_drop.
columns_to_drop=application_data.columns[100*application_data.isnull().sum()/len(application_data)>=50]
columns_to_drop
Out[105]:
Index(['OWN_CAR_AGE', 'EXT_SOURCE_1', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG',
       'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG',
       'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG',
       'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG',
       'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BUILD_MODE',
       'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMIN_MODE',
       'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE',
       'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI',
       'BASEMENTAREA_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI',
       'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI',
       'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI',
       'NONLIVINGAREA_MEDI', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE',
       'WALLSMATERIAL_MODE'],
      dtype='object')
In [106]:
#dropping the columns where the null values are >= 50%.
application_data.drop(labels=columns_to_drop,axis=1,inplace=True)
In [107]:
#verifying if the columns are dropped.
application_data.shape
Out[107]:
(307511, 81)

Imputation Consideration

Finding the best values to impute below columns that have <=13% of null values

In [108]:
#Finding columns with <= 13% missing columns

null_info = pd.DataFrame(100*application_data.isnull().sum()/len(application_data))
null_info.columns = ['Null Percentage']
null_info[(null_info['Null Percentage'] > 0) & (null_info['Null Percentage'] <=13)]
Out[108]:
Null Percentage
AMT_ANNUITY 0.00
AMT_GOODS_PRICE 0.09
NAME_TYPE_SUITE 0.42
CNT_FAM_MEMBERS 0.00
EXT_SOURCE_2 0.21
OBS_30_CNT_SOCIAL_CIRCLE 0.33
DEF_30_CNT_SOCIAL_CIRCLE 0.33
OBS_60_CNT_SOCIAL_CIRCLE 0.33
DEF_60_CNT_SOCIAL_CIRCLE 0.33
DAYS_LAST_PHONE_CHANGE 0.00

Columns To Impute

We have chosen the following columns to consider the best value to impute the missing values with

  • AMT_ANNUITY
  • AMT_GOODS_PRICE
  • NAME_TYPE_SUITE
  • CNT_FAM_MEMBERS
  • OBS_60_CNT_SOCIAL_CIRCLE

AMT_ANNUITY Imputation

In [109]:
# Boxplot to check for outliers
application_data['AMT_ANNUITY'].plot.box()
plt.title('\n Box Plot of AMT_GOODS_PRICE')

# Calculating Quantiles
print('Quantile\tAMT_ANNUITY')
application_data['AMT_ANNUITY'].quantile([0.5,0.8,0.85,0.90,0.95,1])
Quantile	AMT_ANNUITY
Out[109]:
0.50    24,903.00
0.80    37,516.50
0.85    40,806.00
0.90    45,954.00
0.95    53,325.00
1.00   258,025.50
Name: AMT_ANNUITY, dtype: float64

From the above box plot of AMT_ANNUITY, there are a lot of outliers.

Calculating Quantiles confirms the same. There is a huge jump from 95% value to max value .
Hence, Median : 24903 is the best value to impute the missing values in AMT_ANNUITY, since mean is not robust to outliers.

AMT_GOODS_PRICE Imputation

In [110]:
print('Quantile\tAMT_GOODS_PRICE')
application_data['AMT_GOODS_PRICE'].plot.box()
plt.title('\n Box Plot of AMT_GOODS_PRICE')
application_data['AMT_GOODS_PRICE'].quantile([0.5,0.8,0.85,0.90,0.95,1])
Quantile	AMT_GOODS_PRICE
Out[110]:
0.50     450,000.00
0.80     814,500.00
0.85     900,000.00
0.90   1,093,500.00
0.95   1,305,000.00
1.00   4,050,000.00
Name: AMT_GOODS_PRICE, dtype: float64

The above box plot show significant number of outliers. This confirmed by the quantiles.
There is a huge jump from 95 percentile value to Maximum value.
In this case, since there are many outliers, Mean is not a good representation of the data since it is not robust to outliers.

Hence ,Median : 450000 is the best metric to impute missing values

NAME_TYPE_SUITE Imputation :

In [111]:
print('Data type of NAME_TYPE_SUITE : ',application_data['NAME_TYPE_SUITE'].dtype,'\n\n')
print('Category\tNormalized Count\n\n',application_data['NAME_TYPE_SUITE'].value_counts(normalize=True))
data = application_data['NAME_TYPE_SUITE'].value_counts(normalize=True)
plt.bar(data.index,data.values)
# data.hist()
plt.xticks(rotation=90)
plt.ylabel('Normalized Value Counts')
plt.title('\nNAME_TYPE_SUITE');
Data type of NAME_TYPE_SUITE :  object 


Category	Normalized Count

 Unaccompanied     0.81
Family            0.13
Spouse, partner   0.04
Children          0.01
Other_B           0.01
Other_A           0.00
Group of people   0.00
Name: NAME_TYPE_SUITE, dtype: float64

NAME_TYPE_SUITE is a categorical variable.
The best metric to impute missing values is Mode of the data.
From the above plot, 'Unaccompanied' is the Mode

Hence, Unaccompanied is the best value to impute the missing values.

CNT_FAM_MEMBERS Imputation

In [112]:
application_data['CNT_FAM_MEMBERS'] = application_data['CNT_FAM_MEMBERS'].astype('category')

print('Data type of CNT_FAM_MEMBERS : ',application_data['CNT_FAM_MEMBERS'].dtype,'\n\n')

print('Fly Mems | Value Counts\n',application_data['CNT_FAM_MEMBERS'].value_counts(normalize=True))

(application_data['CNT_FAM_MEMBERS'].value_counts()).sort_index().plot(kind='bar')
plt.title('\nNo of Family Members vs Value Counts');
Data type of CNT_FAM_MEMBERS :  category 


Fly Mems | Value Counts
 2.00    0.51
1.00    0.22
3.00    0.17
4.00    0.08
5.00    0.01
6.00    0.00
7.00    0.00
8.00    0.00
9.00    0.00
10.00   0.00
16.00   0.00
12.00   0.00
14.00   0.00
20.00   0.00
11.00   0.00
13.00   0.00
15.00   0.00
Name: CNT_FAM_MEMBERS, dtype: float64

EXT_SOURCE_2 Imputation

In [113]:
print('Data type of EXT_SOURCE_2 : ',application_data['EXT_SOURCE_2'].dtype,'\n\n')
application_data['EXT_SOURCE_2'].plot.box()
plt.title('\nEXT_SOURCE_2');
print('Quantile\tValue')
application_data['EXT_SOURCE_2'].quantile([0.5,0.8,0.85,0.90,0.95,1])
Data type of EXT_SOURCE_2 :  float64 


Quantile	Value
Out[113]:
0.50   0.57
0.80   0.68
0.85   0.70
0.90   0.72
0.95   0.75
1.00   0.85
Name: EXT_SOURCE_2, dtype: float64
In [114]:
round(application_data['EXT_SOURCE_2'].mean(),2)
Out[114]:
0.51

EXT_SOURCE_2 is the credit rating by an external source. It is a numerical continuous variable. The above bar plot and the quantiles show no outliers. In this case Mean and Median are very close. Because there are no outliers, Mean : 0.51 could be used to impute missing values.

Checking for Disguised Missing Values

In [115]:
def cat_value_counts(column_name) : 
    print(tabulate(pd.DataFrame(application_data.stb.freq([column_name])), headers='keys', tablefmt='psql'))
    print(pd.DataFrame(application_data[column_name]).stb.missing(),'\n\n\n')
In [116]:
#Replacing XAN with np.nan in Gender column : 

application_data['CODE_GENDER'] = application_data['CODE_GENDER'].replace('XNA',np.nan)
cat_value_counts('CODE_GENDER')
+----+---------------+---------+-----------+--------------------+----------------------+
|    | CODE_GENDER   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------+---------+-----------+--------------------+----------------------|
|  0 | F             |  202448 |  0.658352 |             202448 |             0.658352 |
|  1 | M             |  105059 |  0.341648 |             307507 |             1        |
+----+---------------+---------+-----------+--------------------+----------------------+
             Missing   Total  Percent
CODE_GENDER        4  307511     0.00 



Since CODE_GENDER is a categorical variable, the missing values could be imputed with Mode : F

In [117]:
# replacing Unknown in NAME_FAMILY_STATUS with np.nan 
application_data['NAME_FAMILY_STATUS'] = application_data['NAME_FAMILY_STATUS'].replace('Unknown',np.nan)
cat_value_counts('NAME_FAMILY_STATUS')
+----+----------------------+---------+-----------+--------------------+----------------------+
|    | NAME_FAMILY_STATUS   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------------+---------+-----------+--------------------+----------------------|
|  0 | Married              |  196432 | 0.638785  |             196432 |             0.638785 |
|  1 | Single / not married |   45444 | 0.147781  |             241876 |             0.786566 |
|  2 | Civil marriage       |   29775 | 0.0968264 |             271651 |             0.883392 |
|  3 | Separated            |   19770 | 0.0642908 |             291421 |             0.947683 |
|  4 | Widow                |   16088 | 0.0523172 |             307509 |             1        |
+----+----------------------+---------+-----------+--------------------+----------------------+
                    Missing   Total  Percent
NAME_FAMILY_STATUS        2  307511     0.00 



Since NAME_FAMILY_STATUS is a categorical variable, the missing values could be imputed with Mode : Married

In [118]:
# replacing XNA values in ORGANIZATION_TYPE 
application_data['ORGANIZATION_TYPE'] = application_data['ORGANIZATION_TYPE'].replace('XNA',np.nan)
cat_value_counts('ORGANIZATION_TYPE')
+----+------------------------+---------+-------------+--------------------+----------------------+
|    | ORGANIZATION_TYPE      |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+------------------------+---------+-------------+--------------------+----------------------|
|  0 | Business Entity Type 3 |   67992 | 0.269663    |              67992 |             0.269663 |
|  1 | Self-employed          |   38412 | 0.152346    |             106404 |             0.422009 |
|  2 | Other                  |   16683 | 0.0661664   |             123087 |             0.488175 |
|  3 | Medicine               |   11193 | 0.0443925   |             134280 |             0.532568 |
|  4 | Business Entity Type 2 |   10553 | 0.0418542   |             144833 |             0.574422 |
|  5 | Government             |   10404 | 0.0412633   |             155237 |             0.615685 |
|  6 | School                 |    8893 | 0.0352705   |             164130 |             0.650956 |
|  7 | Trade: type 7          |    7831 | 0.0310585   |             171961 |             0.682014 |
|  8 | Kindergarten           |    6880 | 0.0272868   |             178841 |             0.709301 |
|  9 | Construction           |    6721 | 0.0266561   |             185562 |             0.735957 |
| 10 | Business Entity Type 1 |    5984 | 0.0237331   |             191546 |             0.75969  |
| 11 | Transport: type 4      |    5398 | 0.021409    |             196944 |             0.781099 |
| 12 | Trade: type 3          |    3492 | 0.0138496   |             200436 |             0.794949 |
| 13 | Industry: type 9       |    3368 | 0.0133578   |             203804 |             0.808307 |
| 14 | Industry: type 3       |    3278 | 0.0130009   |             207082 |             0.821307 |
| 15 | Security               |    3247 | 0.0128779   |             210329 |             0.834185 |
| 16 | Housing                |    2958 | 0.0117317   |             213287 |             0.845917 |
| 17 | Industry: type 11      |    2704 | 0.0107243   |             215991 |             0.856641 |
| 18 | Military               |    2634 | 0.0104467   |             218625 |             0.867088 |
| 19 | Bank                   |    2507 | 0.00994301  |             221132 |             0.877031 |
| 20 | Agriculture            |    2454 | 0.0097328   |             223586 |             0.886764 |
| 21 | Police                 |    2341 | 0.00928463  |             225927 |             0.896049 |
| 22 | Transport: type 2      |    2204 | 0.00874128  |             228131 |             0.90479  |
| 23 | Postal                 |    2157 | 0.00855487  |             230288 |             0.913345 |
| 24 | Security Ministries    |    1974 | 0.00782908  |             232262 |             0.921174 |
| 25 | Trade: type 2          |    1900 | 0.00753559  |             234162 |             0.928709 |
| 26 | Restaurant             |    1811 | 0.0071826   |             235973 |             0.935892 |
| 27 | Services               |    1575 | 0.0062466   |             237548 |             0.942139 |
| 28 | University             |    1327 | 0.00526301  |             238875 |             0.947402 |
| 29 | Industry: type 7       |    1307 | 0.00518369  |             240182 |             0.952585 |
| 30 | Transport: type 3      |    1187 | 0.00470776  |             241369 |             0.957293 |
| 31 | Industry: type 1       |    1039 | 0.00412078  |             242408 |             0.961414 |
| 32 | Hotel                  |     966 | 0.00383125  |             243374 |             0.965245 |
| 33 | Electricity            |     950 | 0.00376779  |             244324 |             0.969013 |
| 34 | Industry: type 4       |     877 | 0.00347827  |             245201 |             0.972491 |
| 35 | Trade: type 6          |     631 | 0.00250261  |             245832 |             0.974994 |
| 36 | Industry: type 5       |     599 | 0.00237569  |             246431 |             0.977369 |
| 37 | Insurance              |     597 | 0.00236776  |             247028 |             0.979737 |
| 38 | Telecom                |     577 | 0.00228844  |             247605 |             0.982026 |
| 39 | Emergency              |     560 | 0.00222101  |             248165 |             0.984247 |
| 40 | Industry: type 2       |     458 | 0.00181647  |             248623 |             0.986063 |
| 41 | Advertising            |     429 | 0.00170146  |             249052 |             0.987765 |
| 42 | Realtor                |     396 | 0.00157057  |             249448 |             0.989335 |
| 43 | Culture                |     379 | 0.00150315  |             249827 |             0.990838 |
| 44 | Industry: type 12      |     369 | 0.00146349  |             250196 |             0.992302 |
| 45 | Trade: type 1          |     348 | 0.0013802   |             250544 |             0.993682 |
| 46 | Mobile                 |     317 | 0.00125725  |             250861 |             0.994939 |
| 47 | Legal Services         |     305 | 0.00120966  |             251166 |             0.996149 |
| 48 | Cleaning               |     260 | 0.00103119  |             251426 |             0.99718  |
| 49 | Transport: type 1      |     201 | 0.000797186 |             251627 |             0.997977 |
| 50 | Industry: type 6       |     112 | 0.000444203 |             251739 |             0.998421 |
| 51 | Industry: type 10      |     109 | 0.000432305 |             251848 |             0.998854 |
| 52 | Religion               |      85 | 0.000337118 |             251933 |             0.999191 |
| 53 | Industry: type 13      |      67 | 0.000265729 |             252000 |             0.999457 |
| 54 | Trade: type 4          |      64 | 0.00025383  |             252064 |             0.99971  |
| 55 | Trade: type 5          |      49 | 0.000194339 |             252113 |             0.999905 |
| 56 | Industry: type 8       |      24 | 9.51863e-05 |             252137 |             1        |
+----+------------------------+---------+-------------+--------------------+----------------------+
                   Missing   Total  Percent
ORGANIZATION_TYPE    55374  307511     0.18 



Since ORGANIZATION_TYPE is a categorical variable, the missing values could be imputed with Mode : Business Entity Type 3

Data Type Checks

In [119]:
pd.DataFrame(application_data.dtypes)
Out[119]:
0
SK_ID_CURR int64
TARGET int64
NAME_CONTRACT_TYPE object
CODE_GENDER object
FLAG_OWN_CAR object
FLAG_OWN_REALTY object
CNT_CHILDREN int64
AMT_INCOME_TOTAL float64
AMT_CREDIT float64
AMT_ANNUITY float64
AMT_GOODS_PRICE float64
NAME_TYPE_SUITE object
NAME_INCOME_TYPE object
NAME_EDUCATION_TYPE object
NAME_FAMILY_STATUS object
NAME_HOUSING_TYPE object
REGION_POPULATION_RELATIVE float64
DAYS_BIRTH int64
DAYS_EMPLOYED int64
DAYS_REGISTRATION float64
DAYS_ID_PUBLISH int64
FLAG_MOBIL int64
FLAG_EMP_PHONE int64
FLAG_WORK_PHONE int64
FLAG_CONT_MOBILE int64
FLAG_PHONE int64
FLAG_EMAIL int64
OCCUPATION_TYPE object
CNT_FAM_MEMBERS category
REGION_RATING_CLIENT int64
REGION_RATING_CLIENT_W_CITY int64
WEEKDAY_APPR_PROCESS_START object
HOUR_APPR_PROCESS_START int64
REG_REGION_NOT_LIVE_REGION int64
REG_REGION_NOT_WORK_REGION int64
LIVE_REGION_NOT_WORK_REGION int64
REG_CITY_NOT_LIVE_CITY int64
REG_CITY_NOT_WORK_CITY int64
LIVE_CITY_NOT_WORK_CITY int64
ORGANIZATION_TYPE object
EXT_SOURCE_2 float64
EXT_SOURCE_3 float64
YEARS_BEGINEXPLUATATION_AVG float64
FLOORSMAX_AVG float64
YEARS_BEGINEXPLUATATION_MODE float64
FLOORSMAX_MODE float64
YEARS_BEGINEXPLUATATION_MEDI float64
FLOORSMAX_MEDI float64
TOTALAREA_MODE float64
EMERGENCYSTATE_MODE object
OBS_30_CNT_SOCIAL_CIRCLE float64
DEF_30_CNT_SOCIAL_CIRCLE float64
OBS_60_CNT_SOCIAL_CIRCLE float64
DEF_60_CNT_SOCIAL_CIRCLE float64
DAYS_LAST_PHONE_CHANGE float64
FLAG_DOCUMENT_2 int64
FLAG_DOCUMENT_3 int64
FLAG_DOCUMENT_4 int64
FLAG_DOCUMENT_5 int64
FLAG_DOCUMENT_6 int64
FLAG_DOCUMENT_7 int64
FLAG_DOCUMENT_8 int64
FLAG_DOCUMENT_9 int64
FLAG_DOCUMENT_10 int64
FLAG_DOCUMENT_11 int64
FLAG_DOCUMENT_12 int64
FLAG_DOCUMENT_13 int64
FLAG_DOCUMENT_14 int64
FLAG_DOCUMENT_15 int64
FLAG_DOCUMENT_16 int64
FLAG_DOCUMENT_17 int64
FLAG_DOCUMENT_18 int64
FLAG_DOCUMENT_19 int64
FLAG_DOCUMENT_20 int64
FLAG_DOCUMENT_21 int64
AMT_REQ_CREDIT_BUREAU_HOUR float64
AMT_REQ_CREDIT_BUREAU_DAY float64
AMT_REQ_CREDIT_BUREAU_WEEK float64
AMT_REQ_CREDIT_BUREAU_MON float64
AMT_REQ_CREDIT_BUREAU_QRT float64
AMT_REQ_CREDIT_BUREAU_YEAR float64
In [120]:
application_data[['DAYS_BIRTH','DAYS_EMPLOYED','DAYS_REGISTRATION','DAYS_ID_PUBLISH']].head()
Out[120]:
DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH
0 -9461 -637 -3,648.00 -2120
1 -16765 -1188 -1,186.00 -291
2 -19046 -225 -4,260.00 -2531
3 -19005 -3039 -9,833.00 -2437
4 -19932 -3038 -4,311.00 -3458

The DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH have negative values, since they are in past with respect to date of application.
To standard size these columns for the purpose of analysis, we could converted them to their absolute values

In [121]:
columns_to_convert = ['DAYS_BIRTH','DAYS_EMPLOYED','DAYS_REGISTRATION','DAYS_ID_PUBLISH']
application_data[columns_to_convert] = application_data[columns_to_convert].abs()
application_data[columns_to_convert].head()
Out[121]:
DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH
0 9,461.00 637.00 3,648.00 2,120.00
1 16,765.00 1,188.00 1,186.00 291.00
2 19,046.00 225.00 4,260.00 2,531.00
3 19,005.00 3,039.00 9,833.00 2,437.00
4 19,932.00 3,038.00 4,311.00 3,458.00
In [122]:
# checking columns with binary values
values_per_column = application_data.nunique().sort_values()
col_values_dtype = pd.DataFrame(index=values_per_column.index, data= {'Unique Values' : values_per_column.values, 'Data Type' : application_data.dtypes})
col_values_dtype
Out[122]:
Unique Values Data Type
FLAG_DOCUMENT_3 2 int64
FLAG_PHONE 2 int64
FLAG_DOCUMENT_4 2 int64
FLAG_DOCUMENT_2 2 int64
REG_REGION_NOT_LIVE_REGION 2 int64
REG_REGION_NOT_WORK_REGION 2 int64
LIVE_REGION_NOT_WORK_REGION 2 int64
REG_CITY_NOT_LIVE_CITY 2 int64
REG_CITY_NOT_WORK_CITY 2 int64
LIVE_CITY_NOT_WORK_CITY 2 int64
FLAG_DOCUMENT_14 2 int64
FLAG_DOCUMENT_13 2 int64
FLAG_DOCUMENT_12 2 int64
FLAG_DOCUMENT_11 2 int64
FLAG_DOCUMENT_10 2 int64
FLAG_DOCUMENT_9 2 int64
FLAG_DOCUMENT_8 2 int64
FLAG_DOCUMENT_7 2 int64
EMERGENCYSTATE_MODE 2 object
FLAG_DOCUMENT_6 2 int64
FLAG_CONT_MOBILE 2 int64
FLAG_WORK_PHONE 2 int64
FLAG_EMAIL 2 int64
FLAG_MOBIL 2 int64
TARGET 2 int64
NAME_CONTRACT_TYPE 2 object
CODE_GENDER 2 object
FLAG_OWN_CAR 2 object
FLAG_OWN_REALTY 2 object
FLAG_DOCUMENT_21 2 int64
FLAG_DOCUMENT_20 2 int64
FLAG_EMP_PHONE 2 int64
FLAG_DOCUMENT_19 2 int64
FLAG_DOCUMENT_5 2 int64
FLAG_DOCUMENT_15 2 int64
FLAG_DOCUMENT_16 2 int64
FLAG_DOCUMENT_17 2 int64
FLAG_DOCUMENT_18 2 int64
REGION_RATING_CLIENT_W_CITY 3 int64
REGION_RATING_CLIENT 3 int64
AMT_REQ_CREDIT_BUREAU_HOUR 5 float64
NAME_FAMILY_STATUS 5 object
NAME_EDUCATION_TYPE 5 object
NAME_HOUSING_TYPE 6 object
NAME_TYPE_SUITE 7 object
WEEKDAY_APPR_PROCESS_START 7 object
NAME_INCOME_TYPE 8 object
AMT_REQ_CREDIT_BUREAU_DAY 9 float64
AMT_REQ_CREDIT_BUREAU_WEEK 9 float64
DEF_60_CNT_SOCIAL_CIRCLE 9 float64
DEF_30_CNT_SOCIAL_CIRCLE 10 float64
AMT_REQ_CREDIT_BUREAU_QRT 11 float64
CNT_CHILDREN 15 int64
CNT_FAM_MEMBERS 17 category
OCCUPATION_TYPE 18 object
HOUR_APPR_PROCESS_START 24 int64
AMT_REQ_CREDIT_BUREAU_MON 24 float64
AMT_REQ_CREDIT_BUREAU_YEAR 25 float64
FLOORSMAX_MODE 25 float64
OBS_60_CNT_SOCIAL_CIRCLE 33 float64
OBS_30_CNT_SOCIAL_CIRCLE 33 float64
FLOORSMAX_MEDI 49 float64
ORGANIZATION_TYPE 57 object
REGION_POPULATION_RELATIVE 81 float64
YEARS_BEGINEXPLUATATION_MODE 221 float64
YEARS_BEGINEXPLUATATION_MEDI 245 float64
YEARS_BEGINEXPLUATATION_AVG 285 float64
FLOORSMAX_AVG 403 float64
EXT_SOURCE_3 814 float64
AMT_GOODS_PRICE 1002 float64
AMT_INCOME_TOTAL 2548 float64
DAYS_LAST_PHONE_CHANGE 3773 float64
TOTALAREA_MODE 5116 float64
AMT_CREDIT 5603 float64
DAYS_ID_PUBLISH 6168 float64
DAYS_EMPLOYED 12574 float64
AMT_ANNUITY 13672 float64
DAYS_REGISTRATION 15688 float64
DAYS_BIRTH 17460 float64
EXT_SOURCE_2 119831 float64
SK_ID_CURR 307511 int64

The above columns are 'categorical' variables with only two values. But they have been read as 'int' datatype
We could convert them into 'categorical' data type

In [123]:
# converting to category data type 
convert_to_cat = col_values_dtype[col_values_dtype['Unique Values']<=8].index
application_data[convert_to_cat] = application_data[convert_to_cat].astype('category')
In [124]:
# check if the columns are converted
values_per_column = application_data.nunique().sort_values()
new_categories  = pd.DataFrame(index=values_per_column.index, data= {'Unique Values' : values_per_column.values, 'Data Type' : application_data.dtypes})
new_categories
Out[124]:
Unique Values Data Type
FLAG_DOCUMENT_3 2 category
FLAG_PHONE 2 category
FLAG_DOCUMENT_4 2 category
FLAG_DOCUMENT_2 2 category
REG_REGION_NOT_LIVE_REGION 2 category
REG_REGION_NOT_WORK_REGION 2 category
LIVE_REGION_NOT_WORK_REGION 2 category
REG_CITY_NOT_LIVE_CITY 2 category
REG_CITY_NOT_WORK_CITY 2 category
LIVE_CITY_NOT_WORK_CITY 2 category
FLAG_DOCUMENT_14 2 category
FLAG_DOCUMENT_13 2 category
FLAG_DOCUMENT_12 2 category
FLAG_DOCUMENT_11 2 category
FLAG_DOCUMENT_10 2 category
FLAG_DOCUMENT_9 2 category
FLAG_DOCUMENT_8 2 category
FLAG_DOCUMENT_7 2 category
EMERGENCYSTATE_MODE 2 category
FLAG_DOCUMENT_6 2 category
FLAG_CONT_MOBILE 2 category
FLAG_WORK_PHONE 2 category
FLAG_EMAIL 2 category
FLAG_MOBIL 2 category
TARGET 2 category
NAME_CONTRACT_TYPE 2 category
CODE_GENDER 2 category
FLAG_OWN_CAR 2 category
FLAG_OWN_REALTY 2 category
FLAG_DOCUMENT_21 2 category
FLAG_DOCUMENT_20 2 category
FLAG_EMP_PHONE 2 category
FLAG_DOCUMENT_19 2 category
FLAG_DOCUMENT_5 2 category
FLAG_DOCUMENT_15 2 category
FLAG_DOCUMENT_16 2 category
FLAG_DOCUMENT_17 2 category
FLAG_DOCUMENT_18 2 category
REGION_RATING_CLIENT_W_CITY 3 category
REGION_RATING_CLIENT 3 category
AMT_REQ_CREDIT_BUREAU_HOUR 5 category
NAME_FAMILY_STATUS 5 category
NAME_EDUCATION_TYPE 5 category
NAME_HOUSING_TYPE 6 category
NAME_TYPE_SUITE 7 category
WEEKDAY_APPR_PROCESS_START 7 category
NAME_INCOME_TYPE 8 category
AMT_REQ_CREDIT_BUREAU_DAY 9 float64
AMT_REQ_CREDIT_BUREAU_WEEK 9 float64
DEF_60_CNT_SOCIAL_CIRCLE 9 float64
DEF_30_CNT_SOCIAL_CIRCLE 10 float64
AMT_REQ_CREDIT_BUREAU_QRT 11 float64
CNT_CHILDREN 15 int64
CNT_FAM_MEMBERS 17 category
OCCUPATION_TYPE 18 object
HOUR_APPR_PROCESS_START 24 int64
AMT_REQ_CREDIT_BUREAU_MON 24 float64
AMT_REQ_CREDIT_BUREAU_YEAR 25 float64
FLOORSMAX_MODE 25 float64
OBS_60_CNT_SOCIAL_CIRCLE 33 float64
OBS_30_CNT_SOCIAL_CIRCLE 33 float64
FLOORSMAX_MEDI 49 float64
ORGANIZATION_TYPE 57 object
REGION_POPULATION_RELATIVE 81 float64
YEARS_BEGINEXPLUATATION_MODE 221 float64
YEARS_BEGINEXPLUATATION_MEDI 245 float64
YEARS_BEGINEXPLUATATION_AVG 285 float64
FLOORSMAX_AVG 403 float64
EXT_SOURCE_3 814 float64
AMT_GOODS_PRICE 1002 float64
AMT_INCOME_TOTAL 2548 float64
DAYS_LAST_PHONE_CHANGE 3773 float64
TOTALAREA_MODE 5116 float64
AMT_CREDIT 5603 float64
DAYS_ID_PUBLISH 6168 float64
DAYS_EMPLOYED 12574 float64
AMT_ANNUITY 13672 float64
DAYS_REGISTRATION 15688 float64
DAYS_BIRTH 17460 float64
EXT_SOURCE_2 119831 float64
SK_ID_CURR 307511 int64

Checking for Outliers

Checking for outliers in the following numerical columns

  • AMT_INCOME_TOTAL
  • AMT_CREDIT
  • AMT_ANNUITY
  • AMT_GOODS_PRICE
  • DAYS_BIRTH
In [125]:
# Adding a new column "AGE_YEARS" using 'DAYS_BIRTH' with age in years
def days_to_years(x) : 
    if x < 0 : 
        x = -1*x 
    return x//365
application_data['AGE_YEARS'] = application_data['DAYS_BIRTH'].apply(days_to_years)
application_data['AGE_YEARS'].describe()
Out[125]:
count   307,511.00
mean         43.44
std          11.95
min          20.00
25%          34.00
50%          43.00
75%          53.00
max          69.00
Name: AGE_YEARS, dtype: float64
In [126]:
# Box plots of the above numerical variables 
outlier_check_col = ["AMT_INCOME_TOTAL","AMT_CREDIT","AMT_ANNUITY","AMT_GOODS_PRICE","DAYS_BIRTH"]

fig,ax = plt.subplots(3,2)
fig.set_figheight(15)
fig.set_figwidth(15)
ax[0,0].set_yscale('log')
ax[0,0].set(ylabel ='Income in Log Scale')
application_data[outlier_check_col[0]].plot.box(ax=ax[0,0],);
application_data[outlier_check_col[1]].plot.box(ax=ax[0,1]);
application_data[outlier_check_col[2]].plot.box(ax=ax[1,0]);
application_data[outlier_check_col[3]].plot.box(ax=ax[1,1]);
ax[2,0].set(ylabel ='Age In Days')
application_data[outlier_check_col[4]].plot.box(ax=ax[2,0]); 
ax[2,1].axis('off')
print('Box Plots of "AMT_INCOME_TOTAL","AMT_CREDIT","AMT_ANNUITY","AMT_GOODS_PRICE","DAYS_BIRTH" \n')
    
Box Plots of "AMT_INCOME_TOTAL","AMT_CREDIT","AMT_ANNUITY","AMT_GOODS_PRICE","DAYS_BIRTH" 

In [127]:
## Calculating quantiles for the above columns
pd.options.display.float_format = '{:,.2f}'.format
for col in outlier_check_col : 
    print(col,'\n',application_data[col].quantile([0.5,0.8,0.85,0.90,0.95,1]),'\n\n')
    
AMT_INCOME_TOTAL 
 0.50       147,150.00
0.80       225,000.00
0.85       234,000.00
0.90       270,000.00
0.95       337,500.00
1.00   117,000,000.00
Name: AMT_INCOME_TOTAL, dtype: float64 


AMT_CREDIT 
 0.50     513,531.00
0.80     900,000.00
0.85   1,024,740.00
0.90   1,133,748.00
0.95   1,350,000.00
1.00   4,050,000.00
Name: AMT_CREDIT, dtype: float64 


AMT_ANNUITY 
 0.50    24,903.00
0.80    37,516.50
0.85    40,806.00
0.90    45,954.00
0.95    53,325.00
1.00   258,025.50
Name: AMT_ANNUITY, dtype: float64 


AMT_GOODS_PRICE 
 0.50     450,000.00
0.80     814,500.00
0.85     900,000.00
0.90   1,093,500.00
0.95   1,305,000.00
1.00   4,050,000.00
Name: AMT_GOODS_PRICE, dtype: float64 


DAYS_BIRTH 
 0.50   15,750.00
0.80   20,474.00
0.85   21,316.00
0.90   22,181.00
0.95   23,204.00
1.00   25,229.00
Name: DAYS_BIRTH, dtype: float64 


Outliers in Numerical Columns

From the above box plots and quantile calculations, we see that
AMT_INCOME_TOTAL has huge jump from 95th Percentile (337,500.00) to Maximum value (117,000,000.00) {346 times 95th Percentile}.
Apart from this jump, there are several values > 1.5 times IQR. All of these are deemed outliers.
To avoid skewed results of analysis, values greater than 95th percentile value may be capped to 95th Percentile value (337,500.00)

AMT_CREDIT also has a huge jump from 95th Percentile (1,350,000.00) to Maximum value (4,050,000.00) {3 times 95th percentile}.
Apart from this jump, there are several values > 1.5 times IQR. All of these are deemed outliers.
To avoid skewed results of analysis, values greater than 95th percentile value may be capped to 95th Percentile value (1,350,000.00)

Similarly , AMT_ANNUITY, AMT_ANNUITY have outliers and they could be capped to 95th percentile values.

DAYS_BIRTH has no outliers

Binning Continuous Variables

In [128]:
# AMT_INCOME_TOTAL
min_income = int(application_data['AMT_INCOME_TOTAL'].min())
max_income = int(application_data['AMT_INCOME_TOTAL'].max())


bins = [0,25000,50000,75000,100000,125000,150000,175000,200000,225000,250000,275000,300000,325000,350000,375000,400000,425000,450000,475000,500000,10000000000]
intervals = ['0-25000', '25000-50000','50000-75000','75000-100000','100000-125000', '125000-150000', '150000-175000','175000-200000',
       '200000-225000','225000-250000','250000-275000','275000-300000','300000-325000','325000-350000','350000-375000',
       '375000-400000','400000-425000','425000-450000','450000-475000','475000-500000','500000 and above']

application_data['AMT_INCOME_CAT']=pd.cut(application_data['AMT_INCOME_TOTAL'],bins,labels=intervals)
print('Income Range\t Count')
print(application_data['AMT_INCOME_CAT'].value_counts())

income_cat = application_data['AMT_INCOME_CAT'].value_counts()
plt.hist(application_data['AMT_INCOME_CAT'])

plt.title('\n Income Range vs No of Applications')
plt.xticks(rotation=90);
Income Range	 Count
125000-150000       47890
100000-125000       43701
200000-225000       40797
75000-100000        39806
150000-175000       34663
175000-200000       29644
50000-75000         19375
250000-275000       12733
225000-250000        7340
300000-325000        6752
350000-375000        4537
25000-50000          4517
275000-300000        4306
425000-450000        3113
500000 and above     2702
325000-350000        2122
400000-425000        1811
375000-400000        1265
475000-500000         312
450000-475000         125
0-25000                 0
Name: AMT_INCOME_CAT, dtype: int64
In [129]:
#AMT_CREDIT
bins = [0,150000,200000,250000,300000,350000,400000,450000,500000,550000,600000,650000,700000,750000,800000,850000,900000,1000000000]
intervals = ['0-150000', '150000-200000','200000-250000', '250000-300000', '300000-350000', '350000-400000','400000-450000',
        '450000-500000','500000-550000','550000-600000','600000-650000','650000-700000','700000-750000','750000-800000',
        '800000-850000','850000-900000','900000 and above']

application_data['AMT_CREDIT_RANGE']=pd.cut(application_data['AMT_CREDIT'],bins=bins,labels=intervals)
application_data['AMT_CREDIT_RANGE'] = application_data['AMT_CREDIT_RANGE'].astype('category')
print('Credit Range\t Count')
credit_range = application_data['AMT_CREDIT_RANGE'].value_counts()
print(credit_range)
plt.hist(application_data['AMT_CREDIT_RANGE'])
plt.xticks(rotation=90)
Credit Range	 Count
900000 and above    58912
250000-300000       31759
200000-250000       23054
500000-550000       22678
400000-450000       18239
0-150000            18159
150000-200000       17985
300000-350000       16205
650000-700000       15051
450000-500000       13799
750000-800000       12380
800000-850000       11559
550000-600000       11554
850000-900000       10233
350000-400000       10133
600000-650000        8998
700000-750000        6813
Name: AMT_CREDIT_RANGE, dtype: int64
Out[129]:
([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16],
 <a list of 17 Text xticklabel objects>)

Data Imbalance

In [130]:
# Target Variable - 1: Client with Payment difficulties, 0 : All other cases 
cat_value_counts('TARGET')
+----+----------+---------+-----------+--------------------+----------------------+
|    |   TARGET |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+----------+---------+-----------+--------------------+----------------------|
|  0 |        0 |  282686 | 0.919271  |             282686 |             0.919271 |
|  1 |        1 |   24825 | 0.0807288 |             307511 |             1        |
+----+----------+---------+-----------+--------------------+----------------------+
        Missing   Total  Percent
TARGET        0  307511     0.00 



From the above, you can see that this data set contains 91.9% of records about Clients with Payment difficulties and 0.08% records about all other cases.
Ratio of classes = .919/0.08 = 11.375
This represents an huge Data Imbalance
The dataset is skewed towards 'Clients with Payment difficulties'

Data Set division wrt TARGET

In [131]:
application_data0=application_data.loc[application_data["TARGET"]==0]
application_data1=application_data.loc[application_data["TARGET"]==1]

Columns Selection for Analysis

In [132]:
columnsForAnalysis = ['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
       'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS',
       'NAME_HOUSING_TYPE',
       'DAYS_EMPLOYED','FLAG_MOBIL', 'FLAG_CONT_MOBILE',
       'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS',
       'REGION_RATING_CLIENT_W_CITY',
                      'REG_CITY_NOT_LIVE_CITY',
       'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
       'ORGANIZATION_TYPE', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
        'DAYS_LAST_PHONE_CHANGE' ,'AGE_YEARS', 'AMT_INCOME_CAT',
       'AMT_CREDIT_RANGE']

Analysis

Univariate Analysis

In [133]:
# LOAN DATA
In [134]:
# function for categorical variable univariate analysis
def cat_univariate_analysis(column_name,figsize=(10,5)) : 
    # print unique values
    print('TARGET 0\n', application_data0[column_name].unique(),'\n')
    print('TARGET 1\n',application_data1[column_name].unique(),'\n')
    
    # column vs target count plot
    plt.figure(figsize=figsize)
    ax = sns.countplot(x=column_name,hue='TARGET',data=application_data)
    title = column_name + ' vs Number of Applications'
    ax.set(title= title)
    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x()+p.get_width()/2,
                height + 10,
                format(height),
                ha="center")
    # Percentages 
    print('All Other Cases (TARGET : 0)')
    print(tabulate(pd.DataFrame(application_data0.stb.freq([column_name])), headers='keys', tablefmt='psql'),'\n')
    print('Clients with Payment Difficulties (TARGET : 1)')
    print(tabulate(pd.DataFrame(application_data1.stb.freq([column_name])), headers='keys', tablefmt='psql'),'\n')
In [135]:
# NAME_CONTRACT_TYPE
cat_univariate_analysis('NAME_CONTRACT_TYPE')
TARGET 0
 [Cash loans, Revolving loans]
Categories (2, object): [Cash loans, Revolving loans] 

TARGET 1
 [Cash loans, Revolving loans]
Categories (2, object): [Cash loans, Revolving loans] 

All Other Cases (TARGET : 0)
+----+----------------------+---------+-----------+--------------------+----------------------+
|    | NAME_CONTRACT_TYPE   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------------+---------+-----------+--------------------+----------------------|
|  0 | Cash loans           |  255011 | 0.9021    |             255011 |               0.9021 |
|  1 | Revolving loans      |   27675 | 0.0979001 |             282686 |               1      |
+----+----------------------+---------+-----------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+----------------------+---------+-----------+--------------------+----------------------+
|    | NAME_CONTRACT_TYPE   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------------+---------+-----------+--------------------+----------------------|
|  0 | Cash loans           |   23221 | 0.935388  |              23221 |             0.935388 |
|  1 | Revolving loans      |    1604 | 0.0646123 |              24825 |             1        |
+----+----------------------+---------+-----------+--------------------+----------------------+ 

Client's with All other cases (Target=0)
  • Out of 'All other cases', 90.21% have taken 'Cash Loans' and 9.79% have taken 'Revolving Loans'
Client's with Payment Difficulties (Target=1)
  • Out of all 'Client's with payment difficulties', 93.54% have taken 'Cash Loans' and 6.46% have taken 'Revolving Loans'
In [136]:
# function for numerical variable univariate analysis

def num_univariate_analysis(column_name,scale='linear') : 
    # boxplot for column vs target
    plt.figure(figsize=(8,6))
    ax = sns.boxplot(x='TARGET', y = column_name, data = application_data)
    title = column_name+' vs Target'
    ax.set(title=title)
    if scale == 'log' :
        plt.yscale('log')
        ax.set(ylabel=column_name + '(Log Scale)')
    # summary statistic
    print('All Other Cases (TARGET : 0)')
    print(application_data0[column_name].describe(),'\n')
    print('Clients with Payment Difficulties (TARGET : 1)')
    print(application_data1[column_name].describe())
In [137]:
# AMT_CREDIT 
num_univariate_analysis('AMT_CREDIT')
All Other Cases (TARGET : 0)
count     282,686.00
mean      602,648.28
std       406,845.91
min        45,000.00
25%       270,000.00
50%       517,788.00
75%       810,000.00
max     4,050,000.00
Name: AMT_CREDIT, dtype: float64 

Clients with Payment Difficulties (TARGET : 1)
count      24,825.00
mean      557,778.53
std       346,433.24
min        45,000.00
25%       284,400.00
50%       497,520.00
75%       733,315.50
max     4,027,680.00
Name: AMT_CREDIT, dtype: float64

Amount of credit

  • The median amount of credit taken by Clients having payment difficulties' is almost same as 'All other cases'.
  • You could notice that the IQR of Target 1 is less than Target 0, which implies that more proportion of client's with payment difficulties take credit amounts centered around the median amount than 'All other cases'
  • Comparing the outliers of Target 1 and 0, we could see that Clients with payment difficulties favour lower amount of credit than all other cases. Conversly, people with higher amount of credit have are more probably to fall in the category 'All other cases' than 'Clients with Payment Difficulties'
In [138]:
#AMT_ANNUITY
num_univariate_analysis('AMT_ANNUITY','log')
All Other Cases (TARGET : 0)
count   282,674.00
mean     27,163.62
std      14,658.31
min       1,615.50
25%      16,456.50
50%      24,876.00
75%      34,749.00
max     258,025.50
Name: AMT_ANNUITY, dtype: float64 

Clients with Payment Difficulties (TARGET : 1)
count    24,825.00
mean     26,481.74
std      12,450.68
min       2,722.50
25%      17,361.00
50%      25,263.00
75%      32,976.00
max     149,211.00
Name: AMT_ANNUITY, dtype: float64

Amount of Annuity

  • There is not much difference in Amount of Annuity between Clients with Payment Difficulties and All other cases.
  • The median and mean are around the same.
  • However, Clients with Payment difficulties have annuity amounts which are less dispersed than All other cases.
In [139]:
#AMT_GOODS_PRICE
num_univariate_analysis('AMT_GOODS_PRICE')
All Other Cases (TARGET : 0)
count     282,429.00
mean      542,736.80
std       373,785.49
min        40,500.00
25%       238,500.00
50%       450,000.00
75%       688,500.00
max     4,050,000.00
Name: AMT_GOODS_PRICE, dtype: float64 

Clients with Payment Difficulties (TARGET : 1)
count      24,804.00
mean      488,972.41
std       311,636.50
min        45,000.00
25%       238,500.00
50%       450,000.00
75%       675,000.00
max     3,600,000.00
Name: AMT_GOODS_PRICE, dtype: float64

Goods Price Amount

  • With respect to AMT_GOODS_PRICE, there is no significant difference between Clients with Payment difficulties and All another Cases.
In [140]:
#REGION_RATING_CLIENT_W_CITY
cat_univariate_analysis('REGION_RATING_CLIENT_W_CITY')
TARGET 0
 [1, 2, 3]
Categories (3, int64): [1, 2, 3] 

TARGET 1
 [2, 3, 1]
Categories (3, int64): [2, 3, 1] 

All Other Cases (TARGET : 0)
+----+-------------------------------+---------+-----------+--------------------+----------------------+
|    |   REGION_RATING_CLIENT_W_CITY |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------------------+---------+-----------+--------------------+----------------------|
|  0 |                             2 |  211314 |  0.747522 |             211314 |             0.747522 |
|  1 |                             3 |   38859 |  0.137463 |             250173 |             0.884985 |
|  2 |                             1 |   32513 |  0.115015 |             282686 |             1        |
+----+-------------------------------+---------+-----------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+-------------------------------+---------+-----------+--------------------+----------------------+
|    |   REGION_RATING_CLIENT_W_CITY |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------------------+---------+-----------+--------------------+----------------------|
|  0 |                             2 |   18170 | 0.731923  |              18170 |             0.731923 |
|  1 |                             3 |    5001 | 0.20145   |              23171 |             0.933374 |
|  2 |                             1 |    1654 | 0.0666264 |              24825 |             1        |
+----+-------------------------------+---------+-----------+--------------------+----------------------+ 

Bank City Rating

The bank has rated cities into three categories [1,2,3]

  • Among 'All other cases' , 74.7% belong to category 2 cities followed by 13.7% in category 3 cities and 11.5% in category 1 cities.
  • 73.1% of Clients with Payment difficulties belong to cities of category 2 followed by 20.1% belonging to category 3 and 6.6% in category 1

One can see that there are higher cases of Payment Difficulties in category 2 cities (20.1%) compared to All other cases (13.7%). Also , there are less cases of Payment difficulties in city category 1 (6.6%) compared to All other cases (11.5%)

In [141]:
#EXT_SOURCE_2
num_univariate_analysis('EXT_SOURCE_2')
All Other Cases (TARGET : 0)
count   282,078.00
mean          0.52
std           0.19
min           0.00
25%           0.41
50%           0.57
75%           0.67
max           0.85
Name: EXT_SOURCE_2, dtype: float64 

Clients with Payment Difficulties (TARGET : 1)
count   24,773.00
mean         0.41
std          0.21
min          0.00
25%          0.24
50%          0.44
75%          0.59
max          0.81
Name: EXT_SOURCE_2, dtype: float64

Client Credit Rating By External Source - 2

  • The median credit rating of clients with Payment difficulties (0.44) is less than those with all other cases (0.57)
  • We can say that there is higher probability of client with less credit rating to have payment difficulties.
  • Further, there are some outliers in 'All other cases'. These could be clients with no credit history and hence their credit ratings are marked as zero.
In [142]:
#EXT_SOURCE_3
num_univariate_analysis('EXT_SOURCE_3')
All Other Cases (TARGET : 0)
count   227,398.00
mean          0.52
std           0.19
min           0.00
25%           0.39
50%           0.55
75%           0.67
max           0.89
Name: EXT_SOURCE_3, dtype: float64 

Clients with Payment Difficulties (TARGET : 1)
count   19,148.00
mean         0.39
std          0.21
min          0.00
25%          0.22
50%          0.38
75%          0.55
max          0.90
Name: EXT_SOURCE_3, dtype: float64

Client Credit Rating By External Source - 3

  • The median credit rating of clients with Payment difficulties (0.38) is less than those with all other cases (0.55)
  • We can say that there is higher probability of client with less credit rating to have payment difficulties.
In [143]:
#CODE_GENDER
cat_univariate_analysis('CODE_GENDER')
TARGET 0
 [F, M, NaN]
Categories (2, object): [F, M] 

TARGET 1
 [M, F]
Categories (2, object): [M, F] 

All Other Cases (TARGET : 0)
+----+---------------+---------+-----------+--------------------+----------------------+
|    | CODE_GENDER   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------+---------+-----------+--------------------+----------------------|
|  0 | F             |  188278 |  0.666042 |             188278 |             0.666042 |
|  1 | M             |   94404 |  0.333958 |             282682 |             1        |
+----+---------------+---------+-----------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+---------------+---------+-----------+--------------------+----------------------+
|    | CODE_GENDER   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------+---------+-----------+--------------------+----------------------|
|  0 | F             |   14170 |  0.570796 |              14170 |             0.570796 |
|  1 | M             |   10655 |  0.429204 |              24825 |             1        |
+----+---------------+---------+-----------+--------------------+----------------------+ 

In [144]:
print(application_data['CODE_GENDER'].value_counts(),'\n')
print('Proportion of Females (Target 0) :', round(100*188278/202448,2))
print('Proportion of Females (Target 1) :', round(100*14170/202448,2))
F    202448
M    105059
Name: CODE_GENDER, dtype: int64 

Proportion of Females (Target 0) : 93.0
Proportion of Females (Target 1) : 7.0
In [145]:
print('Proportion of Males (Target 0) :', round(100*94404/105059,2))
print('Proportion of Males (Target 1) :', round(100*10655/105059,2))
Proportion of Males (Target 0) : 89.86
Proportion of Males (Target 1) : 10.14

Gender

  • There are very high proportion of Female applicants (65.8%)
  • Only 7% of total female applicants have payment difficulties compared to 10% of total male applications.
  • It is possible that female applicants are safer borrowers than male.
In [146]:
def cat_proportions(column_name) : 
    values = application_data[column_name].unique()
    values = values.to_numpy()
    values.tolist()
    data0 = application_data0[column_name].value_counts().to_dict()
    data1 = application_data1[column_name].value_counts().to_dict()
    data = application_data[column_name].value_counts().to_dict()
    

    for i in values : 
        if i in data0 and i in data1 and i in data : 
            print('Proportion of '+ str(i) + ' in Target 0 : ', round(data0[i]*100/data[i],2))
            print('Proportion of '+ str(i) + ' in Target 1 : ', round(data1[i]*100/data[i],2),'\n' )
In [147]:
# FLAG_OWN_CAR
cat_univariate_analysis('FLAG_OWN_CAR')
cat_proportions('FLAG_OWN_CAR')
TARGET 0
 [N, Y]
Categories (2, object): [N, Y] 

TARGET 1
 [N, Y]
Categories (2, object): [N, Y] 

All Other Cases (TARGET : 0)
+----+----------------+---------+-----------+--------------------+----------------------+
|    | FLAG_OWN_CAR   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------+---------+-----------+--------------------+----------------------|
|  0 | N              |  185675 |  0.656824 |             185675 |             0.656824 |
|  1 | Y              |   97011 |  0.343176 |             282686 |             1        |
+----+----------------+---------+-----------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+----------------+---------+-----------+--------------------+----------------------+
|    | FLAG_OWN_CAR   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------+---------+-----------+--------------------+----------------------|
|  0 | N              |   17249 |  0.694824 |              17249 |             0.694824 |
|  1 | Y              |    7576 |  0.305176 |              24825 |             1        |
+----+----------------+---------+-----------+--------------------+----------------------+ 

Proportion of N in Target 0 :  91.5
Proportion of N in Target 1 :  8.5 

Proportion of Y in Target 0 :  92.76
Proportion of Y in Target 1 :  7.24 

In [148]:
application_data['FLAG_OWN_CAR'].value_counts(normalize=True)
Out[148]:
N   0.66
Y   0.34
Name: FLAG_OWN_CAR, dtype: float64

Owning A Car

  • 34% of total customers own a car
  • Among customers who own a car, 92.76% belong to 'All other cases' category and 7.24% have payment difficulties.

  • Among customers with Payment difficulties, 30.5% own a car
  • In 'All Other Cases', 34.3% own a car

  • Customers having cars lower probability of having Payment Difficulties.

    • Customers owning cars in 'Clients with Payment difficulties ' : 8.5%
    • Customers owning cars in 'Clients with Payment difficulties ' 7.24
In [149]:
# FLAG_OWN_REALTY
cat_univariate_analysis('FLAG_OWN_REALTY')
cat_proportions('FLAG_OWN_REALTY')
TARGET 0
 [N, Y]
Categories (2, object): [N, Y] 

TARGET 1
 [Y, N]
Categories (2, object): [Y, N] 

All Other Cases (TARGET : 0)
+----+-------------------+---------+-----------+--------------------+----------------------+
|    | FLAG_OWN_REALTY   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------+---------+-----------+--------------------+----------------------|
|  0 | Y                 |  196329 |  0.694513 |             196329 |             0.694513 |
|  1 | N                 |   86357 |  0.305487 |             282686 |             1        |
+----+-------------------+---------+-----------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+-------------------+---------+-----------+--------------------+----------------------+
|    | FLAG_OWN_REALTY   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------+---------+-----------+--------------------+----------------------|
|  0 | Y                 |   16983 |  0.684109 |              16983 |             0.684109 |
|  1 | N                 |    7842 |  0.315891 |              24825 |             1        |
+----+-------------------+---------+-----------+--------------------+----------------------+ 

Proportion of Y in Target 0 :  92.04
Proportion of Y in Target 1 :  7.96 

Proportion of N in Target 0 :  91.68
Proportion of N in Target 1 :  8.32 

Owning Realty

  • In Overall population of clients, 69.36% of people own real estate.
  • In that group of people owning the real estate, 7.96% of people are having payment difficulties and the rest 92.04% belong to 'other cases'.

  • In Overall population of clients, 30.63% of people do not own Realestate.
  • In that group of people not owning the real estate, 8.32% of people are having payment difficulties and the rest 91.68% belong to 'other cases'.

On closer comparision of above observations, we can infer that people who do not own Real-estate seem more likely to face payment difficulties than compared to the group of people owning the Real-estate.

In [150]:
#NAME_EDUCATION_TYPE

cat_univariate_analysis('NAME_EDUCATION_TYPE')
plt.xticks(rotation=90)
cat_proportions('NAME_EDUCATION_TYPE')
TARGET 0
 [Higher education, Secondary / secondary special, Incomplete higher, Lower secondary, Academic degree]
Categories (5, object): [Higher education, Secondary / secondary special, Incomplete higher, Lower secondary, Academic degree] 

TARGET 1
 [Secondary / secondary special, Incomplete higher, Higher education, Lower secondary, Academic degree]
Categories (5, object): [Secondary / secondary special, Incomplete higher, Higher education, Lower secondary, Academic degree] 

All Other Cases (TARGET : 0)
+----+-------------------------------+---------+-------------+--------------------+----------------------+
|    | NAME_EDUCATION_TYPE           |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------------------+---------+-------------+--------------------+----------------------|
|  0 | Secondary / secondary special |  198867 | 0.703491    |             198867 |             0.703491 |
|  1 | Higher education              |   70854 | 0.250646    |             269721 |             0.954136 |
|  2 | Incomplete higher             |    9405 | 0.0332701   |             279126 |             0.987407 |
|  3 | Lower secondary               |    3399 | 0.0120239   |             282525 |             0.99943  |
|  4 | Academic degree               |     161 | 0.000569537 |             282686 |             1        |
+----+-------------------------------+---------+-------------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+-------------------------------+---------+-------------+--------------------+----------------------+
|    | NAME_EDUCATION_TYPE           |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------------------+---------+-------------+--------------------+----------------------|
|  0 | Secondary / secondary special |   19524 | 0.786465    |              19524 |             0.786465 |
|  1 | Higher education              |    4009 | 0.16149     |              23533 |             0.947956 |
|  2 | Incomplete higher             |     872 | 0.0351259   |              24405 |             0.983082 |
|  3 | Lower secondary               |     417 | 0.0167976   |              24822 |             0.999879 |
|  4 | Academic degree               |       3 | 0.000120846 |              24825 |             1        |
+----+-------------------------------+---------+-------------+--------------------+----------------------+ 

Proportion of Secondary / secondary special in Target 0 :  91.06
Proportion of Secondary / secondary special in Target 1 :  8.94 

Proportion of Higher education in Target 0 :  94.64
Proportion of Higher education in Target 1 :  5.36 

Proportion of Incomplete higher in Target 0 :  91.52
Proportion of Incomplete higher in Target 1 :  8.48 

Proportion of Lower secondary in Target 0 :  89.07
Proportion of Lower secondary in Target 1 :  10.93 

Proportion of Academic degree in Target 0 :  98.17
Proportion of Academic degree in Target 1 :  1.83 

Highest Education

  • 1.24% of the applicants are having education type of 'Lower secondary'
  • Out of that 1.24%, 10.93% of them are likely to face Payment difficulties.

  • 71% of the applicants are having education type of 'Secodary/seconday special' .
  • Out of that 71%, 8.93% of them are likely to face Payment difficulties.

  • 3.34% of the applicants are having education type of 'Incomplete higher'.
  • Out of that 3.34%, 8.48% of them are likely to face Payment difficulties.

  • 24.34% of the applicants are having education type of 'Higher education'.
  • Out of that 24.34%, 5.36% of them are likely to face Payment difficulties.

  • 0.05% of the applicants are having education type of 'Academic degree'.
  • Out of that 0.05%, 1.83% of them are likely to face Payment difficulties.

We can clearly infer from the above observations that higher the education, lesser is the risk of payment difficulties.

In [151]:
#NAME_FAMILY_STATUS

cat_univariate_analysis('NAME_FAMILY_STATUS')
plt.xticks(rotation=90)
cat_proportions('NAME_FAMILY_STATUS')
TARGET 0
 [Married, Single / not married, Civil marriage, Widow, Separated, NaN]
Categories (5, object): [Married, Single / not married, Civil marriage, Widow, Separated] 

TARGET 1
 [Single / not married, Widow, Married, Civil marriage, Separated]
Categories (5, object): [Single / not married, Widow, Married, Civil marriage, Separated] 

All Other Cases (TARGET : 0)
+----+----------------------+---------+-----------+--------------------+----------------------+
|    | NAME_FAMILY_STATUS   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------------+---------+-----------+--------------------+----------------------|
|  0 | Married              |  181582 | 0.64235   |             181582 |             0.64235  |
|  1 | Single / not married |   40987 | 0.144992  |             222569 |             0.787342 |
|  2 | Civil marriage       |   26814 | 0.094855  |             249383 |             0.882197 |
|  3 | Separated            |   18150 | 0.064206  |             267533 |             0.946403 |
|  4 | Widow                |   15151 | 0.0535969 |             282684 |             1        |
+----+----------------------+---------+-----------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+----------------------+---------+-----------+--------------------+----------------------+
|    | NAME_FAMILY_STATUS   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------------+---------+-----------+--------------------+----------------------|
|  0 | Married              |   14850 | 0.598187  |              14850 |             0.598187 |
|  1 | Single / not married |    4457 | 0.179537  |              19307 |             0.777724 |
|  2 | Civil marriage       |    2961 | 0.119275  |              22268 |             0.896999 |
|  3 | Separated            |    1620 | 0.0652568 |              23888 |             0.962256 |
|  4 | Widow                |     937 | 0.0377442 |              24825 |             1        |
+----+----------------------+---------+-----------+--------------------+----------------------+ 

Proportion of Single / not married in Target 0 :  90.19
Proportion of Single / not married in Target 1 :  9.81 

Proportion of Married in Target 0 :  92.44
Proportion of Married in Target 1 :  7.56 

Proportion of Civil marriage in Target 0 :  90.06
Proportion of Civil marriage in Target 1 :  9.94 

Proportion of Widow in Target 0 :  94.18
Proportion of Widow in Target 1 :  5.82 

Proportion of Separated in Target 0 :  91.81
Proportion of Separated in Target 1 :  8.19 

Marital Status

  • 5.2% of applicants have their family status as'Widow'
  • Out of that 5.2%, 5.82% of them are likely to face Payment difficulties.

  • 63.87% of applicants have their family status as'Married'
  • Out of that 63.87%, 7.56% of them are likely to face Payment difficulties.

  • 6.42% of applicants have their family status as'Seperated'
  • Out of that 6.42%, 8.19% of them are likely to face Payment difficulties.

  • 14.77% of applicants have their family status as'Single / not married'
  • Out of that 14.77%, 9.81% of them are likely to face Payment difficulties.

  • 9.68% of applicants have their family status as'Civil marriage'
  • Out of that 9.68%, 9.94% of them are likely to face Payment difficulties.

Based on the above obervations, we can infer that the applicants with NAME_FAMILY_STATUS as seperated or single or have higher risk of payment difficulties compared to that of Widow and Married.

For the purpose of this analysis, civil marriage and married are treated equal. They might be merged into the one category

In [152]:
#NAME_HOUSING_TYPE
cat_univariate_analysis('NAME_HOUSING_TYPE',figsize=(10,5))
plt.xticks(rotation=90)
cat_proportions('NAME_HOUSING_TYPE')
TARGET 0
 [House / apartment, Rented apartment, With parents, Municipal apartment, Office apartment, Co-op apartment]
Categories (6, object): [House / apartment, Rented apartment, With parents, Municipal apartment, Office apartment, Co-op apartment] 

TARGET 1
 [House / apartment, With parents, Municipal apartment, Rented apartment, Office apartment, Co-op apartment]
Categories (6, object): [House / apartment, With parents, Municipal apartment, Rented apartment, Office apartment, Co-op apartment] 

All Other Cases (TARGET : 0)
+----+---------------------+---------+------------+--------------------+----------------------+
|    | NAME_HOUSING_TYPE   |   Count |    Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------------+---------+------------+--------------------+----------------------|
|  0 | House / apartment   |  251596 | 0.890019   |             251596 |             0.890019 |
|  1 | With parents        |   13104 | 0.0463553  |             264700 |             0.936375 |
|  2 | Municipal apartment |   10228 | 0.0361815  |             274928 |             0.972556 |
|  3 | Rented apartment    |    4280 | 0.0151405  |             279208 |             0.987697 |
|  4 | Office apartment    |    2445 | 0.00864917 |             281653 |             0.996346 |
|  5 | Co-op apartment     |    1033 | 0.00365423 |             282686 |             1        |
+----+---------------------+---------+------------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+---------------------+---------+-----------+--------------------+----------------------+
|    | NAME_HOUSING_TYPE   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------------+---------+-----------+--------------------+----------------------|
|  0 | House / apartment   |   21272 | 0.856878  |              21272 |             0.856878 |
|  1 | With parents        |    1736 | 0.0699295 |              23008 |             0.926808 |
|  2 | Municipal apartment |     955 | 0.0384693 |              23963 |             0.965277 |
|  3 | Rented apartment    |     601 | 0.0242095 |              24564 |             0.989486 |
|  4 | Office apartment    |     172 | 0.0069285 |              24736 |             0.996415 |
|  5 | Co-op apartment     |      89 | 0.0035851 |              24825 |             1        |
+----+---------------------+---------+-----------+--------------------+----------------------+ 

Proportion of House / apartment in Target 0 :  92.2
Proportion of House / apartment in Target 1 :  7.8 

Proportion of Rented apartment in Target 0 :  87.69
Proportion of Rented apartment in Target 1 :  12.31 

Proportion of With parents in Target 0 :  88.3
Proportion of With parents in Target 1 :  11.7 

Proportion of Municipal apartment in Target 0 :  91.46
Proportion of Municipal apartment in Target 1 :  8.54 

Proportion of Office apartment in Target 0 :  93.43
Proportion of Office apartment in Target 1 :  6.57 

Proportion of Co-op apartment in Target 0 :  92.07
Proportion of Co-op apartment in Target 1 :  7.93 

Housing Type

  • 0.85% of applicants have their housing type as'Office apartment'
  • Out of that 0.85%, 6.57% of them are likely to face Payment difficulties.

  • 88.73% of applicants have their housing type as'House/ apartment'
  • Out of that 88.73%, 7.8% of them are likely to face Payment difficulties.

  • 0.36% of applicants have their housing type as'Co-op apartment'
  • Out of that 0.36%, 7.93% of them are likely to face Payment difficulties.

  • 3.63% of applicants have their housing type as'Municipal apartment'
  • Out of that 3.63%, 8.54% of them are likely to face Payment difficulties.

  • 4.82% of applicants have their housing type as'with parents'
  • Out of that 4.82%, 11.7% of them are likely to face Payment difficulties.

  • 1.58% of applicants have their housing type as'Rented apartment'
  • Out of that 1.58%, 12.31% of them are likely to face Payment difficulties.

Based on the above obervations, we can infer that the applicants with NAME_HOUSING_TYPE as 'Municipal apartment' or 'with parents' or 'Rented apartment ' have higher risk of payment difficulties compared to that of other housing types.

In [153]:
#AMT_INCOME_CAT
cat_univariate_analysis('AMT_INCOME_CAT',figsize=(20,5))
plt.xticks(rotation=90)
cat_proportions('AMT_INCOME_CAT')
TARGET 0
 [250000-275000, 50000-75000, 125000-150000, 100000-125000, 75000-100000, ..., 325000-350000, 375000-400000, 400000-425000, 450000-475000, 475000-500000]
Length: 20
Categories (20, object): [25000-50000 < 50000-75000 < 75000-100000 < 100000-125000 ... 425000-450000 < 450000-475000 < 475000-500000 < 500000 and above] 

TARGET 1
 [200000-225000, 100000-125000, 125000-150000, 75000-100000, 300000-325000, ..., 225000-250000, 375000-400000, 325000-350000, 450000-475000, 475000-500000]
Length: 20
Categories (20, object): [25000-50000 < 50000-75000 < 75000-100000 < 100000-125000 ... 425000-450000 < 450000-475000 < 475000-500000 < 500000 and above] 

All Other Cases (TARGET : 0)
+----+------------------+---------+-------------+--------------------+----------------------+
|    | AMT_INCOME_CAT   |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+------------------+---------+-------------+--------------------+----------------------|
|  0 | 125000-150000    |   43837 | 0.155073    |              43837 |             0.155073 |
|  1 | 100000-125000    |   39860 | 0.141005    |              83697 |             0.296078 |
|  2 | 200000-225000    |   37595 | 0.132992    |             121292 |             0.42907  |
|  3 | 75000-100000     |   36450 | 0.128942    |             157742 |             0.558011 |
|  4 | 150000-175000    |   31685 | 0.112085    |             189427 |             0.670097 |
|  5 | 175000-200000    |   27190 | 0.0961845   |             216617 |             0.766281 |
|  6 | 50000-75000      |   17849 | 0.0631407   |             234466 |             0.829422 |
|  7 | 250000-275000    |   11846 | 0.0419052   |             246312 |             0.871327 |
|  8 | 225000-250000    |    6814 | 0.0241045   |             253126 |             0.895432 |
|  9 | 300000-325000    |    6342 | 0.0224348   |             259468 |             0.917866 |
| 10 | 350000-375000    |    4282 | 0.0151475   |             263750 |             0.933014 |
| 11 | 25000-50000      |    4174 | 0.0147655   |             267924 |             0.94778  |
| 12 | 275000-300000    |    4000 | 0.01415     |             271924 |             0.961929 |
| 13 | 425000-450000    |    2933 | 0.0103755   |             274857 |             0.972305 |
| 14 | 500000 and above |    2556 | 0.00904183  |             277413 |             0.981347 |
| 15 | 325000-350000    |    1987 | 0.007029    |             279400 |             0.988376 |
| 16 | 400000-425000    |    1696 | 0.00599959  |             281096 |             0.994375 |
| 17 | 375000-400000    |    1180 | 0.00417424  |             282276 |             0.99855  |
| 18 | 475000-500000    |     296 | 0.0010471   |             282572 |             0.999597 |
| 19 | 450000-475000    |     114 | 0.000403274 |             282686 |             1        |
+----+------------------+---------+-------------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+------------------+---------+-------------+--------------------+----------------------+
|    | AMT_INCOME_CAT   |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+------------------+---------+-------------+--------------------+----------------------|
|  0 | 125000-150000    |    4053 | 0.163263    |               4053 |             0.163263 |
|  1 | 100000-125000    |    3841 | 0.154723    |               7894 |             0.317986 |
|  2 | 75000-100000     |    3356 | 0.135186    |              11250 |             0.453172 |
|  3 | 200000-225000    |    3202 | 0.128983    |              14452 |             0.582155 |
|  4 | 150000-175000    |    2978 | 0.11996     |              17430 |             0.702115 |
|  5 | 175000-200000    |    2454 | 0.098852    |              19884 |             0.800967 |
|  6 | 50000-75000      |    1526 | 0.0614703   |              21410 |             0.862437 |
|  7 | 250000-275000    |     887 | 0.0357301   |              22297 |             0.898167 |
|  8 | 225000-250000    |     526 | 0.0211883   |              22823 |             0.919355 |
|  9 | 300000-325000    |     410 | 0.0165156   |              23233 |             0.935871 |
| 10 | 25000-50000      |     343 | 0.0138167   |              23576 |             0.949688 |
| 11 | 275000-300000    |     306 | 0.0123263   |              23882 |             0.962014 |
| 12 | 350000-375000    |     255 | 0.0102719   |              24137 |             0.972286 |
| 13 | 425000-450000    |     180 | 0.00725076  |              24317 |             0.979537 |
| 14 | 500000 and above |     146 | 0.00588117  |              24463 |             0.985418 |
| 15 | 325000-350000    |     135 | 0.00543807  |              24598 |             0.990856 |
| 16 | 400000-425000    |     115 | 0.00463243  |              24713 |             0.995488 |
| 17 | 375000-400000    |      85 | 0.00342397  |              24798 |             0.998912 |
| 18 | 475000-500000    |      16 | 0.000644512 |              24814 |             0.999557 |
| 19 | 450000-475000    |      11 | 0.000443102 |              24825 |             1        |
+----+------------------+---------+-------------+--------------------+----------------------+ 

Proportion of 200000-225000 in Target 0 :  92.15
Proportion of 200000-225000 in Target 1 :  7.85 

Proportion of 250000-275000 in Target 0 :  93.03
Proportion of 250000-275000 in Target 1 :  6.97 

Proportion of 50000-75000 in Target 0 :  92.12
Proportion of 50000-75000 in Target 1 :  7.88 

Proportion of 125000-150000 in Target 0 :  91.54
Proportion of 125000-150000 in Target 1 :  8.46 

Proportion of 100000-125000 in Target 0 :  91.21
Proportion of 100000-125000 in Target 1 :  8.79 

Proportion of 75000-100000 in Target 0 :  91.57
Proportion of 75000-100000 in Target 1 :  8.43 

Proportion of 150000-175000 in Target 0 :  91.41
Proportion of 150000-175000 in Target 1 :  8.59 

Proportion of 350000-375000 in Target 0 :  94.38
Proportion of 350000-375000 in Target 1 :  5.62 

Proportion of 25000-50000 in Target 0 :  92.41
Proportion of 25000-50000 in Target 1 :  7.59 

Proportion of 175000-200000 in Target 0 :  91.72
Proportion of 175000-200000 in Target 1 :  8.28 

Proportion of 425000-450000 in Target 0 :  94.22
Proportion of 425000-450000 in Target 1 :  5.78 

Proportion of 275000-300000 in Target 0 :  92.89
Proportion of 275000-300000 in Target 1 :  7.11 

Proportion of 500000 and above in Target 0 :  94.6
Proportion of 500000 and above in Target 1 :  5.4 

Proportion of 300000-325000 in Target 0 :  93.93
Proportion of 300000-325000 in Target 1 :  6.07 

Proportion of 225000-250000 in Target 0 :  92.83
Proportion of 225000-250000 in Target 1 :  7.17 

Proportion of 325000-350000 in Target 0 :  93.64
Proportion of 325000-350000 in Target 1 :  6.36 

Proportion of 375000-400000 in Target 0 :  93.28
Proportion of 375000-400000 in Target 1 :  6.72 

Proportion of 400000-425000 in Target 0 :  93.65
Proportion of 400000-425000 in Target 1 :  6.35 

Proportion of 450000-475000 in Target 0 :  91.2
Proportion of 450000-475000 in Target 1 :  8.8 

Proportion of 475000-500000 in Target 0 :  94.87
Proportion of 475000-500000 in Target 1 :  5.13 

posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
In [154]:
# We will calculate the pecentage of "Clients with Payment difficulties" for every income category.
t0=application_data0['AMT_INCOME_CAT'].value_counts()
t1=application_data1['AMT_INCOME_CAT'].value_counts()
prop = 100*t1/(t1+t0)
print(tabulate(pd.DataFrame(prop), headers=['AMT_INCOME_CAT','PERCENTAGE'], tablefmt='psql'))
+------------------+--------------+
| AMT_INCOME_CAT   |   PERCENTAGE |
|------------------+--------------|
| 0-25000          |    nan       |
| 25000-50000      |      7.59354 |
| 50000-75000      |      7.87613 |
| 75000-100000     |      8.43089 |
| 100000-125000    |      8.78927 |
| 125000-150000    |      8.46314 |
| 150000-175000    |      8.59129 |
| 175000-200000    |      8.27824 |
| 200000-225000    |      7.84862 |
| 225000-250000    |      7.16621 |
| 250000-275000    |      6.96615 |
| 275000-300000    |      7.10636 |
| 300000-325000    |      6.07227 |
| 325000-350000    |      6.36192 |
| 350000-375000    |      5.62045 |
| 375000-400000    |      6.71937 |
| 400000-425000    |      6.35008 |
| 425000-450000    |      5.7822  |
| 450000-475000    |      8.8     |
| 475000-500000    |      5.12821 |
| 500000 and above |      5.4034  |
+------------------+--------------+

Income Category

  • By the above table of 'AMT_INCOME_CAT' vs 'percentage of customers with payment difficulties' in that range, we can infer that as the income increases, applicant is less likely to face payment difficulties(except in exceptional cases, such as 450000 - 475000).
In [155]:
#CNT_CHILDREN
#Lets convert this into a categorical variable
application_data['CNT_CHILDREN']=application_data['CNT_CHILDREN'].astype('category')
application_data0['CNT_CHILDREN']=application_data0['CNT_CHILDREN'].astype('category')
application_data1['CNT_CHILDREN']=application_data1['CNT_CHILDREN'].astype('category')
In [156]:
#CNT_CHILDREN
cat_univariate_analysis('CNT_CHILDREN',figsize=(20,5))
column_name='CNT_CHILDREN'
cat_proportions('CNT_CHILDREN')
# values = application_data[column_name].unique()
# values=values.dropna()
# values = values.to_numpy()
# values.tolist()
# data0 = application_data0[column_name].value_counts().to_dict()
# data1 = application_data1[column_name].value_counts().to_dict()
# data = application_data[column_name].value_counts().to_dict()

# for i in values[:7]: 
#     print('Proportion of '+ str(i) + ' in Target 0 : ', round(data0[i]*100/data[i],2))
#     print('Proportion of '+ str(i) + ' in Target 1 : ', round(data1[i]*100/data[i],2),'\n' )
TARGET 0
 [0, 1, 2, 3, 4, ..., 8, 12, 10, 19, 14]
Length: 13
Categories (13, int64): [0, 1, 2, 3, ..., 12, 10, 19, 14] 

TARGET 1
 [0, 1, 3, 2, 4, 5, 9, 11, 6]
Categories (9, int64): [0, 1, 3, 2, ..., 5, 9, 11, 6] 

All Other Cases (TARGET : 0)
+----+----------------+---------+-------------+--------------------+----------------------+
|    |   CNT_CHILDREN |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------+---------+-------------+--------------------+----------------------|
|  0 |              0 |  198762 | 0.703119    |             198762 |             0.703119 |
|  1 |              1 |   55665 | 0.196915    |             254427 |             0.900034 |
|  2 |              2 |   24416 | 0.0863715   |             278843 |             0.986405 |
|  3 |              3 |    3359 | 0.0118824   |             282202 |             0.998288 |
|  4 |              4 |     374 | 0.00132302  |             282576 |             0.999611 |
|  5 |              5 |      77 | 0.000272387 |             282653 |             0.999883 |
|  6 |              6 |      15 | 5.30624e-05 |             282668 |             0.999936 |
|  7 |              7 |       7 | 2.47625e-05 |             282675 |             0.999961 |
|  8 |             14 |       3 | 1.06125e-05 |             282678 |             0.999972 |
|  9 |             19 |       2 | 7.07499e-06 |             282680 |             0.999979 |
| 10 |             12 |       2 | 7.07499e-06 |             282682 |             0.999986 |
| 11 |             10 |       2 | 7.07499e-06 |             282684 |             0.999993 |
| 12 |              8 |       2 | 7.07499e-06 |             282686 |             1        |
+----+----------------+---------+-------------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+----------------+---------+-------------+--------------------+----------------------+
|    |   CNT_CHILDREN |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------+---------+-------------+--------------------+----------------------|
|  0 |              0 |   16609 | 0.669043    |              16609 |             0.669043 |
|  1 |              1 |    5454 | 0.219698    |              22063 |             0.888741 |
|  2 |              2 |    2333 | 0.0939778   |              24396 |             0.982719 |
|  3 |              3 |     358 | 0.0144209   |              24754 |             0.99714  |
|  4 |              4 |      55 | 0.00221551  |              24809 |             0.999355 |
|  5 |              5 |       7 | 0.000281974 |              24816 |             0.999637 |
|  6 |              6 |       6 | 0.000241692 |              24822 |             0.999879 |
|  7 |              9 |       2 | 8.05639e-05 |              24824 |             0.99996  |
|  8 |             11 |       1 | 4.0282e-05  |              24825 |             1        |
+----+----------------+---------+-------------+--------------------+----------------------+ 

Proportion of 0 in Target 0 :  92.29
Proportion of 0 in Target 1 :  7.71 

Proportion of 1 in Target 0 :  91.08
Proportion of 1 in Target 1 :  8.92 

Proportion of 2 in Target 0 :  91.28
Proportion of 2 in Target 1 :  8.72 

Proportion of 3 in Target 0 :  90.37
Proportion of 3 in Target 1 :  9.63 

Proportion of 4 in Target 0 :  87.18
Proportion of 4 in Target 1 :  12.82 

Proportion of 5 in Target 0 :  91.67
Proportion of 5 in Target 1 :  8.33 

Proportion of 6 in Target 0 :  71.43
Proportion of 6 in Target 1 :  28.57 

posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
In [157]:
# Calculating the pecentage of "Clients with Payment difficulties" for every income category.
t0=application_data0['CNT_CHILDREN'].value_counts()
t1=application_data1['CNT_CHILDREN'].value_counts()
prop = 100*t1/(t1+t0)

print(tabulate(pd.DataFrame(prop), headers=['No of Children','Percentage'], tablefmt='psql'))
+------------------+--------------+
|   No of Children |   Percentage |
|------------------+--------------|
|                0 |      7.71181 |
|                1 |      8.92358 |
|                2 |      8.72182 |
|                3 |      9.63142 |
|                4 |     12.8205  |
|                5 |      8.33333 |
|                6 |     28.5714  |
|                7 |    nan       |
|                8 |    nan       |
|                9 |    nan       |
|               10 |    nan       |
|               11 |    nan       |
|               12 |    nan       |
|               14 |    nan       |
|               19 |    nan       |
+------------------+--------------+

Number of Children

  • By looking at the plot and above table, we can clearly infer that starting from 0 to 6, applicants having 0 children have lowest risk of facing payment dificulties. With the increase in number of children, the risk of payment difficulties also increase.
  • Above 6, there might be some data quality issues. We can even expect that this is error while taking the data from thr applicants, because the numericals above 6 are not practical in most of the cases.
  • If the CNT_CHILDREN > 6 is true, then there is 0 probabilty that the applicant might face the payment difficulties.
In [158]:
#CNT_FAM_MEMBERS
cat_univariate_analysis('CNT_FAM_MEMBERS',figsize=(20,5))
cat_proportions('CNT_FAM_MEMBERS')
TARGET 0
 [2.00, 1.00, 3.00, 4.00, 5.00, ..., 14.00, 12.00, 20.00, 15.00, 16.00]
Length: 16
Categories (15, float64): [2.00, 1.00, 3.00, 4.00, ..., 12.00, 20.00, 15.00, 16.00] 

TARGET 1
 [1.00, 2.00, 5.00, 3.00, 4.00, ..., 7.00, 10.00, 13.00, 8.00, 11.00]
Length: 11
Categories (11, float64): [1.00, 2.00, 5.00, 3.00, ..., 10.00, 13.00, 8.00, 11.00] 

All Other Cases (TARGET : 0)
+----+-------------------+---------+-------------+--------------------+----------------------+
|    |   CNT_FAM_MEMBERS |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------+---------+-------------+--------------------+----------------------|
|  0 |                 2 |  146348 | 0.517709    |             146348 |             0.517709 |
|  1 |                 1 |   62172 | 0.219935    |             208520 |             0.737643 |
|  2 |                 3 |   47993 | 0.169776    |             256513 |             0.90742  |
|  3 |                 4 |   22561 | 0.07981     |             279074 |             0.98723  |
|  4 |                 5 |    3151 | 0.0111467   |             282225 |             0.998376 |
|  5 |                 6 |     353 | 0.00124874  |             282578 |             0.999625 |
|  6 |                 7 |      75 | 0.000265314 |             282653 |             0.99989  |
|  7 |                 8 |      14 | 4.95253e-05 |             282667 |             0.99994  |
|  8 |                 9 |       6 | 2.12251e-05 |             282673 |             0.999961 |
|  9 |                20 |       2 | 7.07504e-06 |             282675 |             0.999968 |
| 10 |                16 |       2 | 7.07504e-06 |             282677 |             0.999975 |
| 11 |                14 |       2 | 7.07504e-06 |             282679 |             0.999982 |
| 12 |                12 |       2 | 7.07504e-06 |             282681 |             0.999989 |
| 13 |                10 |       2 | 7.07504e-06 |             282683 |             0.999996 |
| 14 |                15 |       1 | 3.53752e-06 |             282684 |             1        |
+----+-------------------+---------+-------------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+-------------------+---------+-------------+--------------------+----------------------+
|    |   CNT_FAM_MEMBERS |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------+---------+-------------+--------------------+----------------------|
|  0 |                 2 |   12009 | 0.483746    |              12009 |             0.483746 |
|  1 |                 1 |    5675 | 0.2286      |              17684 |             0.712346 |
|  2 |                 3 |    4608 | 0.185619    |              22292 |             0.897966 |
|  3 |                 4 |    2136 | 0.0860423   |              24428 |             0.984008 |
|  4 |                 5 |     327 | 0.0131722   |              24755 |             0.99718  |
|  5 |                 6 |      55 | 0.00221551  |              24810 |             0.999396 |
|  6 |                 8 |       6 | 0.000241692 |              24816 |             0.999637 |
|  7 |                 7 |       6 | 0.000241692 |              24822 |             0.999879 |
|  8 |                13 |       1 | 4.0282e-05  |              24823 |             0.999919 |
|  9 |                11 |       1 | 4.0282e-05  |              24824 |             0.99996  |
| 10 |                10 |       1 | 4.0282e-05  |              24825 |             1        |
+----+-------------------+---------+-------------+--------------------+----------------------+ 

Proportion of 1.0 in Target 0 :  91.64
Proportion of 1.0 in Target 1 :  8.36 

Proportion of 2.0 in Target 0 :  92.42
Proportion of 2.0 in Target 1 :  7.58 

Proportion of 3.0 in Target 0 :  91.24
Proportion of 3.0 in Target 1 :  8.76 

Proportion of 4.0 in Target 0 :  91.35
Proportion of 4.0 in Target 1 :  8.65 

Proportion of 5.0 in Target 0 :  90.6
Proportion of 5.0 in Target 1 :  9.4 

Proportion of 6.0 in Target 0 :  86.52
Proportion of 6.0 in Target 1 :  13.48 

Proportion of 9.0 in Target 0 :  100.0
Proportion of 9.0 in Target 1 :  0.0 

Proportion of 7.0 in Target 0 :  92.59
Proportion of 7.0 in Target 1 :  7.41 

Proportion of 8.0 in Target 0 :  70.0
Proportion of 8.0 in Target 1 :  30.0 

Proportion of 10.0 in Target 0 :  66.67
Proportion of 10.0 in Target 1 :  33.33 

Proportion of 13.0 in Target 0 :  0.0
Proportion of 13.0 in Target 1 :  100.0 

Proportion of 14.0 in Target 0 :  100.0
Proportion of 14.0 in Target 1 :  0.0 

Proportion of 12.0 in Target 0 :  100.0
Proportion of 12.0 in Target 1 :  0.0 

Proportion of 20.0 in Target 0 :  100.0
Proportion of 20.0 in Target 1 :  0.0 

Proportion of 15.0 in Target 0 :  100.0
Proportion of 15.0 in Target 1 :  0.0 

Proportion of 16.0 in Target 0 :  100.0
Proportion of 16.0 in Target 1 :  0.0 

Proportion of 11.0 in Target 0 :  0.0
Proportion of 11.0 in Target 1 :  100.0 

posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values

Number of Family members

  • When we look at the plot and proportions above, we can clearly infer that increase in family members is directly proportional to increase in risk of facing payment difficulties.
  • Above 6, we cant infer because there were not enough applicant samples.
In [159]:
#DAYS_EMPLOYED.
num_univariate_analysis('DAYS_EMPLOYED','log')
All Other Cases (TARGET : 0)
count   282,686.00
mean     69,668.81
std     140,983.36
min           0.00
25%         967.00
50%       2,304.00
75%       6,074.00
max     365,243.00
Name: DAYS_EMPLOYED, dtype: float64 

Clients with Payment Difficulties (TARGET : 1)
count    24,825.00
mean     45,587.32
std     118,303.30
min           0.00
25%         677.00
50%       1,458.00
75%       3,280.00
max     365,243.00
Name: DAYS_EMPLOYED, dtype: float64

Number of Days Employed

  • If we observe the above box plot, we can clearly notice the outliers, so lets concentrate on median.
  • The applicants with payment difficulties have lesser median and IQR compared to "all other cases". So, greater the DAYS_EMPLOYED, lesser is the risk of payment difficulties.
In [160]:
#NAME_INCOME_TYPE
cat_univariate_analysis('NAME_INCOME_TYPE',figsize=(15,5))
plt.xticks(rotation=90)
cat_proportions('NAME_INCOME_TYPE')
posx and posy should be finite values
posx and posy should be finite values
TARGET 0
 [State servant, Working, Commercial associate, Pensioner, Unemployed, Student, Businessman, Maternity leave]
Categories (8, object): [State servant, Working, Commercial associate, Pensioner, Unemployed, Student, Businessman, Maternity leave] 

TARGET 1
 [Working, Commercial associate, Pensioner, State servant, Unemployed, Maternity leave]
Categories (6, object): [Working, Commercial associate, Pensioner, State servant, Unemployed, Maternity leave] 

All Other Cases (TARGET : 0)
+----+----------------------+---------+-------------+--------------------+----------------------+
|    | NAME_INCOME_TYPE     |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------------+---------+-------------+--------------------+----------------------|
|  0 | Working              |  143550 | 0.507807    |             143550 |             0.507807 |
|  1 | Commercial associate |   66257 | 0.234384    |             209807 |             0.742191 |
|  2 | Pensioner            |   52380 | 0.185294    |             262187 |             0.927485 |
|  3 | State servant        |   20454 | 0.0723559   |             282641 |             0.999841 |
|  4 | Student              |      18 | 6.36749e-05 |             282659 |             0.999904 |
|  5 | Unemployed           |      14 | 4.95249e-05 |             282673 |             0.999954 |
|  6 | Businessman          |      10 | 3.53749e-05 |             282683 |             0.999989 |
|  7 | Maternity leave      |       3 | 1.06125e-05 |             282686 |             1        |
+----+----------------------+---------+-------------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+----------------------+---------+-------------+--------------------+----------------------+
|    | NAME_INCOME_TYPE     |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------------+---------+-------------+--------------------+----------------------|
|  0 | Working              |   15224 | 0.613253    |              15224 |             0.613253 |
|  1 | Commercial associate |    5360 | 0.215911    |              20584 |             0.829164 |
|  2 | Pensioner            |    2982 | 0.120121    |              23566 |             0.949285 |
|  3 | State servant        |    1249 | 0.0503122   |              24815 |             0.999597 |
|  4 | Unemployed           |       8 | 0.000322256 |              24823 |             0.999919 |
|  5 | Maternity leave      |       2 | 8.05639e-05 |              24825 |             1        |
+----+----------------------+---------+-------------+--------------------+----------------------+ 

Proportion of Working in Target 0 :  90.41
Proportion of Working in Target 1 :  9.59 

Proportion of State servant in Target 0 :  94.25
Proportion of State servant in Target 1 :  5.75 

Proportion of Commercial associate in Target 0 :  92.52
Proportion of Commercial associate in Target 1 :  7.48 

Proportion of Pensioner in Target 0 :  94.61
Proportion of Pensioner in Target 1 :  5.39 

Proportion of Unemployed in Target 0 :  63.64
Proportion of Unemployed in Target 1 :  36.36 

Proportion of Student in Target 0 :  100.0
Proportion of Student in Target 1 :  0.0 

Proportion of Businessman in Target 0 :  100.0
Proportion of Businessman in Target 1 :  0.0 

Proportion of Maternity leave in Target 0 :  60.0
Proportion of Maternity leave in Target 1 :  40.0 

posx and posy should be finite values
posx and posy should be finite values

Income type

  • 52% of applicants have their NAME_INCOME_TYPE as'Working'.
  • Out of that 52%, 9.59% of them are likely to face Payment difficulties.

  • 23% of applicants have their NAME_INCOME_TYPE as'Commercial associate'
  • Out of that 23%, 7.48% of them are likely to face Payment difficulties.

  • 7% of applicants have their NAME_INCOME_TYPE as'State servant'
  • Out of that 7%, 5.75% of them are likely to face Payment difficulties.

  • 18% of applicants have their NAME_INCOME_TYPE as'Pensioner'
  • Out of that 18%, 5.39% of them are likely to face Payment difficulties.

From the above Graph and proportions, we can infer that Pensioners and state servants have lower risk of facing payment difficulties compared to the Working and Commercial associates.

Unemployed,Student,Businessman and Maternity leave cannot be inferred because of the reason that they are not well populated.

In [161]:
#DAYS_LAST_PHONE_CHANGE
application_data['DAYS_LAST_PHONE_CHANGE'] = np.abs(application_data['DAYS_LAST_PHONE_CHANGE'])
application_data0['DAYS_LAST_PHONE_CHANGE'] = np.abs(application_data0['DAYS_LAST_PHONE_CHANGE'])
application_data1['DAYS_LAST_PHONE_CHANGE'] = np.abs(application_data1['DAYS_LAST_PHONE_CHANGE'])
num_univariate_analysis('DAYS_LAST_PHONE_CHANGE')
All Other Cases (TARGET : 0)
count   282,685.00
mean        976.39
std         831.21
min           0.00
25%         281.00
50%         776.00
75%       1,586.00
max       4,292.00
Name: DAYS_LAST_PHONE_CHANGE, dtype: float64 

Clients with Payment Difficulties (TARGET : 1)
count   24,825.00
mean       808.80
std        758.09
min          0.00
25%        194.00
50%        594.00
75%      1,301.00
max      4,070.00
Name: DAYS_LAST_PHONE_CHANGE, dtype: float64

Days Since Phone Number Was Changed.

  • From the above boxplot, we can clearly notice that the median for 'all other cases' is 776 days and the median for 'Clients with payment difficulties' is 594 days.
  • By this, we can clearly infer that,lesser the days since last phone change, greater is the risk of payment difficulties
In [162]:
#FLAG_CONT_MOBILE
cat_univariate_analysis('FLAG_CONT_MOBILE',figsize=(15,5))
cat_proportions('FLAG_CONT_MOBILE')
TARGET 0
 [1, 0]
Categories (2, int64): [1, 0] 

TARGET 1
 [1, 0]
Categories (2, int64): [1, 0] 

All Other Cases (TARGET : 0)
+----+--------------------+---------+------------+--------------------+----------------------+
|    |   FLAG_CONT_MOBILE |   Count |    Percent |   Cumulative Count |   Cumulative Percent |
|----+--------------------+---------+------------+--------------------+----------------------|
|  0 |                  1 |  282157 | 0.998129   |             282157 |             0.998129 |
|  1 |                  0 |     529 | 0.00187133 |             282686 |             1        |
+----+--------------------+---------+------------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+--------------------+---------+------------+--------------------+----------------------+
|    |   FLAG_CONT_MOBILE |   Count |    Percent |   Cumulative Count |   Cumulative Percent |
|----+--------------------+---------+------------+--------------------+----------------------|
|  0 |                  1 |   24780 | 0.998187   |              24780 |             0.998187 |
|  1 |                  0 |      45 | 0.00181269 |              24825 |             1        |
+----+--------------------+---------+------------+--------------------+----------------------+ 

Proportion of 1 in Target 0 :  91.93
Proportion of 1 in Target 1 :  8.07 

Proportion of 0 in Target 0 :  92.16
Proportion of 0 in Target 1 :  7.84 

Flag Whether Mobile Phone Was Reachable

  • From the above plot, we can observe that 0.18% applicants mobile numbers were not reachable.
  • From the people whose mobiles numbers were reachable, 8.07% of them are likely to face Payment difficulties, whereas the people whose mobile numbers were'nt reachable, 7.84% of them are likely to face Payment difficulties.

From these we cannot infer a relationship between FLAG_CONT_MOBILE and TARGET variable.

In [163]:
#FLAG_EMAIL.
cat_univariate_analysis('FLAG_EMAIL',figsize=(15,5))
cat_proportions('FLAG_EMAIL')
TARGET 0
 [0, 1]
Categories (2, int64): [0, 1] 

TARGET 1
 [0, 1]
Categories (2, int64): [0, 1] 

All Other Cases (TARGET : 0)
+----+--------------+---------+-----------+--------------------+----------------------+
|    |   FLAG_EMAIL |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+--------------+---------+-----------+--------------------+----------------------|
|  0 |            0 |  266618 | 0.94316   |             266618 |              0.94316 |
|  1 |            1 |   16068 | 0.0568405 |             282686 |              1       |
+----+--------------+---------+-----------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+--------------+---------+-----------+--------------------+----------------------+
|    |   FLAG_EMAIL |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+--------------+---------+-----------+--------------------+----------------------|
|  0 |            0 |   23451 | 0.944653  |              23451 |             0.944653 |
|  1 |            1 |    1374 | 0.0553474 |              24825 |             1        |
+----+--------------+---------+-----------+--------------------+----------------------+ 

Proportion of 0 in Target 0 :  91.92
Proportion of 0 in Target 1 :  8.08 

Proportion of 1 in Target 0 :  92.12
Proportion of 1 in Target 1 :  7.88 

Whether E-mail Was Provided

  • From the above plot and proportions, we can clearly see that 94.3% of applicants have not given the E-mail address.
  • Out of 94.3%, 8.08% are likely to have payment difficulties.

  • From the above plot and proportions, we can clearly see that 5.7% of applicants have provided the E-mail address.
  • Out of 5.7%, 7.88% are likely to have paymenr difficulties.

From these we cannot infer a relationship between FLAG_EMAIL and TARGET variable because they stand in similar grounds.

In [164]:
#FLAG_MOBIL
cat_univariate_analysis('FLAG_MOBIL',figsize=(15,5))
cat_proportions('FLAG_MOBIL')
TARGET 0
 [1, 0]
Categories (2, int64): [1, 0] 

TARGET 1
 [1]
Categories (1, int64): [1] 

All Other Cases (TARGET : 0)
posx and posy should be finite values
posx and posy should be finite values
+----+--------------+---------+-------------+--------------------+----------------------+
|    |   FLAG_MOBIL |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+--------------+---------+-------------+--------------------+----------------------|
|  0 |            1 |  282685 | 0.999996    |             282685 |             0.999996 |
|  1 |            0 |       1 | 3.53749e-06 |             282686 |             1        |
+----+--------------+---------+-------------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+--------------+---------+-----------+--------------------+----------------------+
|    |   FLAG_MOBIL |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+--------------+---------+-----------+--------------------+----------------------|
|  0 |            1 |   24825 |         1 |              24825 |                    1 |
+----+--------------+---------+-----------+--------------------+----------------------+ 

Proportion of 1 in Target 0 :  91.93
Proportion of 1 in Target 1 :  8.07 

Proportion of 0 in Target 0 :  100.0
Proportion of 0 in Target 1 :  0.0 

Mobile Number

  • 99.9% of them have given their mobile numbers. Out of those 99%, 8.07% have payment difficulties.
  • There is only one applicant who failed to provide the mobile number, since this number is small and negligible, we cannot infer using this data.
In [165]:
#LIVE_CITY_NOT_WORK_CITY
cat_univariate_analysis('LIVE_CITY_NOT_WORK_CITY',figsize=(15,5))
cat_proportions('LIVE_CITY_NOT_WORK_CITY')
TARGET 0
 [0, 1]
Categories (2, int64): [0, 1] 

TARGET 1
 [0, 1]
Categories (2, int64): [0, 1] 

All Other Cases (TARGET : 0)
+----+---------------------------+---------+-----------+--------------------+----------------------+
|    |   LIVE_CITY_NOT_WORK_CITY |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------------------+---------+-----------+--------------------+----------------------|
|  0 |                         0 |  232974 |  0.824144 |             232974 |             0.824144 |
|  1 |                         1 |   49712 |  0.175856 |             282686 |             1        |
+----+---------------------------+---------+-----------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+---------------------------+---------+-----------+--------------------+----------------------+
|    |   LIVE_CITY_NOT_WORK_CITY |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------------------+---------+-----------+--------------------+----------------------|
|  0 |                         0 |   19322 |  0.778328 |              19322 |             0.778328 |
|  1 |                         1 |    5503 |  0.221672 |              24825 |             1        |
+----+---------------------------+---------+-----------+--------------------+----------------------+ 

Proportion of 0 in Target 0 :  92.34
Proportion of 0 in Target 1 :  7.66 

Proportion of 1 in Target 0 :  90.03
Proportion of 1 in Target 1 :  9.97 

Contact Address vs Work Address

  • From the above plot and proportions, we can clearly notice that 82% of them live and work in same city.
  • Out of that 82%, 7.66% of applicants likely to have payment difficulties.

  • From the above plot and proportions, we can clearly notice that 17% of them donot live and work in same city.
  • Out of that 17%, 9.97% of applicants have payment difficulties.

By this, we can clearly infer that, applicants that live and work in same city are at less risk of payment difficulties.

In [166]:
#REG_CITY_NOT_LIVE_CITY.
cat_univariate_analysis('REG_CITY_NOT_LIVE_CITY',figsize=(15,5))
cat_proportions('REG_CITY_NOT_LIVE_CITY')
TARGET 0
 [0, 1]
Categories (2, int64): [0, 1] 

TARGET 1
 [0, 1]
Categories (2, int64): [0, 1] 

All Other Cases (TARGET : 0)
+----+--------------------------+---------+-----------+--------------------+----------------------+
|    |   REG_CITY_NOT_LIVE_CITY |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+--------------------------+---------+-----------+--------------------+----------------------|
|  0 |                        0 |  261586 | 0.925359  |             261586 |             0.925359 |
|  1 |                        1 |   21100 | 0.0746411 |             282686 |             1        |
+----+--------------------------+---------+-----------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+--------------------------+---------+-----------+--------------------+----------------------+
|    |   REG_CITY_NOT_LIVE_CITY |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+--------------------------+---------+-----------+--------------------+----------------------|
|  0 |                        0 |   21886 |  0.881611 |              21886 |             0.881611 |
|  1 |                        1 |    2939 |  0.118389 |              24825 |             1        |
+----+--------------------------+---------+-----------+--------------------+----------------------+ 

Proportion of 0 in Target 0 :  92.28
Proportion of 0 in Target 1 :  7.72 

Proportion of 1 in Target 0 :  87.77
Proportion of 1 in Target 1 :  12.23 

Permanent Address vs Contact Address

  • From the above plot and proportions,92% of the applicant's permanent and contact address are same.
  • Out of that 92%, 7.72% of applicants likely to have payment difficulties.

  • From the above plot and proportions, 8% of the applicant's permanent and contact address are not same.
  • Out of that 8%, 12.23% of applicants likely to have payment difficulties.

By this, we can infer that, applicants that same permanent and contact address are likely to have less risk of payment difficulties.

In [167]:
#REG_CITY_NOT_WORK_CITY.
cat_univariate_analysis('REG_CITY_NOT_WORK_CITY',figsize=(15,5))
cat_proportions('REG_CITY_NOT_WORK_CITY')
TARGET 0
 [0, 1]
Categories (2, int64): [0, 1] 

TARGET 1
 [0, 1]
Categories (2, int64): [0, 1] 

All Other Cases (TARGET : 0)
+----+--------------------------+---------+-----------+--------------------+----------------------+
|    |   REG_CITY_NOT_WORK_CITY |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+--------------------------+---------+-----------+--------------------+----------------------|
|  0 |                        0 |  219339 |   0.77591 |             219339 |              0.77591 |
|  1 |                        1 |   63347 |   0.22409 |             282686 |              1       |
+----+--------------------------+---------+-----------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+--------------------------+---------+-----------+--------------------+----------------------+
|    |   REG_CITY_NOT_WORK_CITY |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+--------------------------+---------+-----------+--------------------+----------------------|
|  0 |                        0 |   17305 |   0.69708 |              17305 |              0.69708 |
|  1 |                        1 |    7520 |   0.30292 |              24825 |              1       |
+----+--------------------------+---------+-----------+--------------------+----------------------+ 

Proportion of 0 in Target 0 :  92.69
Proportion of 0 in Target 1 :  7.31 

Proportion of 1 in Target 0 :  89.39
Proportion of 1 in Target 1 :  10.61 

Permanent Address vs Work Address

  • From the above plot and proportions, we can notice that 77% of applicants permanent address and work address are same. Coming to other side, 23% of applicants permanent address and work address are not same.

  • Looking at the proportions of "Clients with payment difficulties" on both sides, we can infer that the applicants whose permanent adress and work adress matches are less likely to have payment difficulties.

Bivariate Analysis

In [168]:
#AMT_ANNUITY, AMT_INCOME_TOTAL vs TARGET
column_names = ['AMT_INCOME_CAT','AMT_ANNUITY']
plt.figure(figsize=(16,8))
sns.barplot(x = column_names[0],y = column_names[1],hue='TARGET',data = application_data)
plt.title(column_names[0] + ' vs '+ column_names[1] + ' vs Target')
plt.xticks(rotation=90);
In [169]:
plt.figure(figsize=(16,8))
# sns.catplot(x = column_names[0],y =column_names[1],hue='TARGET',data = application_data, kind='violin',height=8,aspect=4);
sns.violinplot(x = 'AMT_INCOME_CAT',y='AMT_ANNUITY',hue='TARGET',data = application_data,split=True, inner="quartile")
plt.title(column_names[0] + ' vs '+ column_names[1] + ' vs Target')
plt.xticks(rotation=90);

Income and Annuity

  • Both plots show a gradual increase in the amount of annuity with increase in income.
  • Also, as the income increases there are higher number of outliers i.e the propensity to take on very high annuities compared to most customers in the same income range.
  • It can be seen that in low income ranges, the median no of "Clients with Payment difficulties' is slightly higher.
  • And the in high income ranges,the median no of "Clients with Payment difficulties' is slightly lower.

Income Range and Annuity have a mild correlation with Payment difficulties

In [170]:
#AMT_CREDIT, AMT_GOODS_PRICE vs TARGET
column_names = ['AMT_CREDIT', 'AMT_GOODS_PRICE']
fig,ax = plt.subplots(1,2)
fig.set_figheight(8)
fig.set_figwidth(15)

ax[0].set(title = column_names[0] + ' vs '+ column_names[1] + ' vs Target');
sns.scatterplot(x=column_names[0],y=column_names[1],hue='TARGET',data=application_data, alpha=0.8,ax=ax[0])
plt.xticks(rotation=90);

plt.title(column_names[0] + ' vs '+ column_names[1] + ' vs Target for AMT_GOODS_PRICE < 1000000');
sns.scatterplot(x=column_names[0],y=column_names[1],hue='TARGET',data=application_data[application_data['AMT_GOODS_PRICE'] <=1000000], alpha=0.8,ax=ax[1])
plt.xticks(rotation=90);

Credit Amount and Price of Credit Goods

  • There is an overall linear relationship between Credit Amount and Price of Credit Goods.
  • Notice that more clients with payment difficulties are the ones who have borrowed lesser money for lesser priced goods.
  • And most outliers do not have payment difficulties.

Lower Credit Amount and Lower Price of Credit Goods have higher cases of Payment difficulties

In [171]:
#NAME_CONTRACT_TYPE vs REGION_RATING_CLIENT_W_CITY vs TARGET

column_names = ['NAME_CONTRACT_TYPE','REGION_RATING_CLIENT_W_CITY']
application_data.groupby(column_names)['TARGET'].value_counts(normalize=True)
Out[171]:
NAME_CONTRACT_TYPE  REGION_RATING_CLIENT_W_CITY  TARGET
Cash loans          1                            0        0.95
                                                 1        0.05
                    2                            0        0.92
                                                 1        0.08
                    3                            0        0.88
                                                 1        0.12
Revolving loans     1                            0        0.97
                                                 1        0.03
                    2                            0        0.94
                                                 1        0.06
                    3                            0        0.93
                                                 1        0.07
Name: TARGET, dtype: float64
In [ ]:
 
In [172]:
sns.catplot(x='REGION_RATING_CLIENT_W_CITY', hue='TARGET', col="NAME_CONTRACT_TYPE", kind="count", data=application_data);

Type of Loan and City Rating

  • In cash loans, Category 3 Cities have 7% more probability of payment difficulties than Category 1 Cities.
  • In Revolving loans, Category 3 Cities have 4% more probability of payment difficulties than Category 1 Cities.

Although this just confirms correlation between the City Rating and default, now we know that city rating has a higher impact on Cash Loans than Revolving Loans

In [173]:
# EXT_SOURCE_2 vs EXT_SOURCE_3 vs TARGET
creditScores = ['EXT_SOURCE_2','EXT_SOURCE_3']
plt.figure(figsize=[8,8])
sns.scatterplot(x=creditScores[0],y = creditScores[1], hue='TARGET',data = application_data, alpha=0.5);
plt.title(creditScores[0] + ' vs '+ creditScores[1]+ ' vs '+ 'TARGET');

Credit Rating

  • Checking Credit Ratings from two different sources confirms that trend of customers with lower credit ratings have a higher default rate.
In [174]:
#FLAG_OWN_CAR vs FLAG_OWN_REALTY vs TARGET
application_data.groupby(['FLAG_OWN_CAR','FLAG_OWN_REALTY'])['TARGET'].value_counts(normalize=True)
Out[174]:
FLAG_OWN_CAR  FLAG_OWN_REALTY  TARGET
N             N                0        0.91
                               1        0.09
              Y                0        0.92
                               1        0.08
Y             N                0        0.93
                               1        0.07
              Y                0        0.93
                               1        0.07
Name: TARGET, dtype: float64

Owning A Car & Realty

From the above table , we see that customers with no car or realty have a high probability of default (9%), followed by customers who don't have a car but own realty (8%)
Since there's not much difference posed by owning a car or realty vs not owning them, we cannot infer any trend in default using these variables

In [175]:
#CODE_GENDER vs NAME_FAMILY_STATUS vs TARGET

# For the sake of this analysis since Civil marriage and Married are same, clubbing them into married category
application_data['NAME_FAMILY_STATUS'].replace('Civil marriage','Married', inplace=True)

ax = sns.catplot(x='NAME_FAMILY_STATUS', hue='TARGET', col="CODE_GENDER", kind="count", data=application_data, height = 5, aspect=2);
In [176]:
# classification over both categories
pd.DataFrame(application_data.groupby(['CODE_GENDER','NAME_FAMILY_STATUS'])['TARGET'].value_counts(normalize=True))
Out[176]:
TARGET
CODE_GENDER NAME_FAMILY_STATUS TARGET
F Married 0 0.93
1 0.07
Separated 0 0.93
1 0.07
Single / not married 0 0.92
1 0.08
Widow 0 0.95
1 0.05
M Married 0 0.91
1 0.09
Separated 0 0.87
1 0.13
Single / not married 0 0.87
1 0.13
Widow 0 0.88
1 0.12
In [177]:
application_data.groupby(['CODE_GENDER','NAME_FAMILY_STATUS'])['TARGET'].value_counts(normalize=True)\
.unstack()\
   .plot( 
    layout=(2,2),
    figsize=(8,6), kind='barh', stacked=True);

Gender & Marital Status

  • Looking at the above values, customers most attractive to the bank by likelihood of default, are
    'Female Widows' > 'Married Females' = 'Separated Females' > 'Single Females' > 'Married Males' > 'Male Widowers' > 'Single Males' = 'Separated Males'

The bank may focus their marketing campaigns on converting applications of Female Widows, Married Females, Separated Females
May note that the dataset is skewed towards Females than Males

In [178]:
# NAME_HOUSING_TYPE vs NAME_INCOME_TYPE vs TARGET
income_housing = pd.DataFrame(application_data.groupby(['NAME_HOUSING_TYPE','NAME_INCOME_TYPE'])['TARGET'].value_counts(normalize=True))
income_housing.columns = ['Normalized Count']
income_housing 
Out[178]:
Normalized Count
NAME_HOUSING_TYPE NAME_INCOME_TYPE TARGET
Co-op apartment Commercial associate 0 0.91
1 0.09
Pensioner 0 0.96
1 0.04
State servant 0 0.99
1 0.01
Working 0 0.91
1 0.09
House / apartment Businessman 0 1.00
Commercial associate 0 0.93
1 0.07
Maternity leave 0 0.60
1 0.40
Pensioner 0 0.95
1 0.05
State servant 0 0.94
1 0.06
Student 0 1.00
Unemployed 0 0.65
1 0.35
Working 0 0.91
1 0.09
Municipal apartment Commercial associate 0 0.92
1 0.08
Pensioner 0 0.93
1 0.07
State servant 0 0.94
1 0.06
Unemployed 1 1.00
Working 0 0.90
1 0.10
Office apartment Commercial associate 0 0.93
1 0.07
Pensioner 0 0.94
1 0.06
State servant 0 0.96
1 0.04
Student 0 1.00
Working 0 0.92
1 0.08
Rented apartment Commercial associate 0 0.88
1 0.12
Pensioner 0 0.93
1 0.07
State servant 0 0.93
1 0.07
Student 0 1.00
Unemployed 0 1.00
Working 0 0.86
1 0.14
With parents Commercial associate 0 0.90
1 0.10
Pensioner 0 0.95
1 0.05
State servant 0 0.91
1 0.09
Student 0 1.00
Unemployed 0 0.67
1 0.33
Working 0 0.87
1 0.13
In [179]:
application_data.groupby(['NAME_HOUSING_TYPE','NAME_INCOME_TYPE'])['TARGET'].value_counts(normalize=True)\
.unstack()\
   .plot( 
    layout=(2,2),
    figsize=(8,6), kind='barh', stacked=True);
In [180]:
income_housing[np.in1d(income_housing.index.get_level_values(2), 1)].sort_values(by='Normalized Count', ascending=True)
Out[180]:
Normalized Count
NAME_HOUSING_TYPE NAME_INCOME_TYPE TARGET
Co-op apartment State servant 1 0.01
Office apartment State servant 1 0.04
Co-op apartment Pensioner 1 0.04
With parents Pensioner 1 0.05
House / apartment Pensioner 1 0.05
State servant 1 0.06
Office apartment Pensioner 1 0.06
Municipal apartment State servant 1 0.06
Rented apartment State servant 1 0.07
Municipal apartment Pensioner 1 0.07
Rented apartment Pensioner 1 0.07
House / apartment Commercial associate 1 0.07
Office apartment Commercial associate 1 0.07
Working 1 0.08
Municipal apartment Commercial associate 1 0.08
Co-op apartment Commercial associate 1 0.09
Working 1 0.09
House / apartment Working 1 0.09
With parents State servant 1 0.09
Commercial associate 1 0.10
Municipal apartment Working 1 0.10
Rented apartment Commercial associate 1 0.12
With parents Working 1 0.13
Rented apartment Working 1 0.14
With parents Unemployed 1 0.33
House / apartment Unemployed 1 0.35
Maternity leave 1 0.40
Municipal apartment Unemployed 1 1.00

Housing Type & Income Type

  • The above tables list the categories in the order of least likelihood of Payment difficulties
  • The top 5 categories are :
    • State servant living in Co-op apartment
    • State servant living in Office apartment
    • Pensioner living in Co-op apartment
    • Pensioner living With parents
    • Pensioner living in House / apartment

The bank may focus their marketing campaigns on converting applications of the above categories
Pensioner living With parents category doesn't seem right and may be further checked if it's an anomaly

In [181]:
#FLAG_EMAIL vs FLAG_MOBIL vs TARGET
pd.DataFrame(application_data.groupby(['FLAG_EMAIL','FLAG_MOBIL'])['TARGET'].value_counts(normalize=True))\
.unstack()\
   .plot( 
    layout=(2,2),
    figsize=(8,6), kind='barh', stacked=True);
In [182]:
pd.DataFrame(application_data.groupby(['FLAG_EMAIL','FLAG_MOBIL'])['TARGET'].value_counts())
Out[182]:
TARGET
FLAG_EMAIL FLAG_MOBIL TARGET
0 0 0 1
1 0 266617
1 23451
1 1 0 16068
1 1374

Email & Mobile Information

  • 8% of customers who have not provided Email but have provided a Mobile Number have had payment difficulties
  • Similarly 8% of customers who provided both Email and Mobile Number have had payment difficulties

Hence we cannot infer any conclusive trend in payment difficulties using these variables

In [183]:
# DAYS_LAST_PHONE_CHANGE vs FLAG_CONT_MOBILE vs TARGET

sns.barplot(x = 'FLAG_CONT_MOBILE', y= 'DAYS_LAST_PHONE_CHANGE', hue='TARGET',data = application_data)
plt.title("FLAG_CONT_MOBILE vs DAYS_LAST_PHONE_CHANGE vs TARGET")
Out[183]:
Text(0.5, 1.0, 'FLAG_CONT_MOBILE vs DAYS_LAST_PHONE_CHANGE vs TARGET')
In [184]:
sns.violinplot(x = 'FLAG_CONT_MOBILE', y= 'DAYS_LAST_PHONE_CHANGE', hue='TARGET', split=True, data = application_data,inner="quartile", height=5, aspect=1)
plt.title("FLAG_CONT_MOBILE vs DAYS_LAST_PHONE_CHANGE vs TARGET")
Out[184]:
Text(0.5, 1.0, 'FLAG_CONT_MOBILE vs DAYS_LAST_PHONE_CHANGE vs TARGET')

Days Since Phone Number Changed & Whether was Phone was Reachable

  • Notice that the number of clients with reachable phones is higher than unreachable phones. (~900 days vs ~350 days)
  • Among clients with reachable phones, Clients with Payment difficulties have median newer phone numbers than All other cases. (See violin plot for FLAG_CONT_MOBILE = 1 )
  • Among clients with unreachable phones, the difference is less (~50 days vs ~ 200 days). In fact, the median age of phone number is same for defaulters and others.

We could conclude that clients with reachable and older phone numbers are better borrowers than other categories.When the phone is unreachable, last date of phone number change doesn't matter.

In [185]:
application_data['DAYS_EMPLOYED'].plot.hist()
Out[185]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd5ecc96090>
In [186]:
#DAYS_EMPLOYED & NAME_EDUCATION_TYPE vs TARGET
fig,ax = plt.subplots(1,2)
fig.set_figheight(5)
fig.set_figwidth(15)

fig.suptitle(t="DAYS_EMPLOYED & NAME_EDUCATION_TYPE vs TARGET");
ax[0].tick_params(axis='x',rotation=90)
sns.barplot(x = 'NAME_EDUCATION_TYPE', y= 'DAYS_EMPLOYED', hue='TARGET',data = application_data,ax=ax[0])

ax[1].tick_params(axis='x',rotation=90)
# ax[1].set_yscale('log')
sns.violinplot(x = 'NAME_EDUCATION_TYPE', y= 'DAYS_EMPLOYED', hue='TARGET', split=True, data = application_data,inner="quartile", height=5, aspect=1, ax=ax[1])
#sns.swarmplot(x = 'NAME_EDUCATION_TYPE', y= 'DAYS_EMPLOYED', hue='TARGET', data = application_data, ax=ax[1])
Out[186]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd608cbf7d0>

Employment Period & Education

  • The barplot between Days Employed and Education clearly shows that the clients holding an Academic Degree and having held employment for a long time have very low probability of default (difference between Target 0 , 1 of ~60000)
  • Among clients with Secondary education, there are two groups, ones who are recently employed and others with long term employment. The recently employed clients in this category have equal probability to default or not. But the clients long term employment have lower probability to default.

The bank may place emphasis on education and employment period. Higher education and long term employment are good indicators of no default.Further clients with secondary education are far more attractive borrowers when they have held long term employment than otherwise

In [187]:
#CNT_CHILDREN & CNT_FAM_MEMBERS vs TARGET
subset = application_data[['CNT_CHILDREN','CNT_FAM_MEMBERS','TARGET']] 
subset = subset.dropna().astype('int')
childrenvsFamily = pd.pivot_table(index='CNT_CHILDREN', columns = 'CNT_FAM_MEMBERS',aggfunc=np.mean,data=subset)
childrenvsFamily
Out[187]:
TARGET
CNT_FAM_MEMBERS 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 20
CNT_CHILDREN
0 0.08 0.07 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
1 nan 0.10 0.09 nan nan nan nan nan nan nan nan nan nan nan nan nan nan
2 nan nan 0.10 0.09 nan nan nan nan nan nan nan nan nan nan nan nan nan
3 nan nan nan 0.12 0.09 nan nan nan nan nan nan nan nan nan nan nan nan
4 nan nan nan nan 0.04 0.13 nan nan nan nan nan nan nan nan nan nan nan
5 nan nan nan nan nan 0.20 0.08 nan nan nan nan nan nan nan nan nan nan
6 nan nan nan nan nan nan 0.00 0.32 nan nan nan nan nan nan nan nan nan
7 nan nan nan nan nan nan nan 0.00 0.00 nan nan nan nan nan nan nan nan
8 nan nan nan nan nan nan nan nan nan 0.00 nan nan nan nan nan nan nan
9 nan nan nan nan nan nan nan nan nan 1.00 1.00 nan nan nan nan nan nan
10 nan nan nan nan nan nan nan nan nan nan nan 0.00 nan nan nan nan nan
11 nan nan nan nan nan nan nan nan nan nan nan nan 1.00 nan nan nan nan
12 nan nan nan nan nan nan nan nan nan nan nan nan nan 0.00 nan nan nan
14 nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0.00 0.00 nan
19 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0.00

Children & Family Members

  • These variables are chosen to verify the amplified liabilities of clients having both more children and more family members.
  • The most attractive sub-categories are the clients with 1 or 2 family members and 0 children.
  • This is followed by clients having 1 or 2 children with 3 or 4 family members.

Note that more than 5 children is deemed an error. This needs further analysis
Small families with less children has low risk of default

In [188]:
# NAME_EDUCATION_TYPE vs AMT_INCOME_TOTAL vs TARGET 

plt.figure(figsize=(10,8))
sns.violinplot(x = 'NAME_EDUCATION_TYPE', y= 'AMT_INCOME_TOTAL', hue='TARGET', split=True, data = application_data,inner="quartile", height=5, aspect=3)
plt.yscale('log')
plt.xticks(rotation=90);

Type of Income vs Amount of Income

  • For the same average income across income types, clients with Secondary Education have the lowest probability of default.
  • Among clients with an Academic Degree, median income of clients with payment difficulties is higher than all other cases.
  • Among clients with Higher Education, Incomplete Higher Education and Lower Secondary education, the median income has no correlation with risk of default.
In [189]:
#'NAME_INCOME_TYPE vs NAME_EDUCATION_TYPE'
subset = application_data[['NAME_INCOME_TYPE','NAME_EDUCATION_TYPE','TARGET']] 
subset = subset.dropna()
subset['TARGET'] = subset['TARGET'].astype('int')
# pivot table for percentage of default for education type vs income type
incomeTypevsEdu = pd.pivot_table(index='NAME_EDUCATION_TYPE', columns = 'NAME_INCOME_TYPE',aggfunc=[np.mean],data=subset)
incomeTypevsEdu
Out[189]:
mean
TARGET
NAME_INCOME_TYPE Businessman Commercial associate Maternity leave Pensioner State servant Student Unemployed Working
NAME_EDUCATION_TYPE
Academic degree nan 0.02 nan 0.00 0.00 nan nan 0.03
Higher education 0.00 0.05 0.00 0.04 0.04 0.00 0.00 0.06
Incomplete higher nan 0.08 nan 0.04 0.07 0.00 0.33 0.09
Lower secondary nan 0.12 nan 0.07 0.08 nan nan 0.15
Secondary / secondary special nan 0.09 1.00 0.06 0.07 0.00 0.54 0.11
In [190]:
# pivot table of the count of defaults for education type vs income type
pd.pivot_table(index='NAME_EDUCATION_TYPE', columns = 'NAME_INCOME_TYPE',aggfunc='count',data=subset)
Out[190]:
TARGET
NAME_INCOME_TYPE Businessman Commercial associate Maternity leave Pensioner State servant Student Unemployed Working
NAME_EDUCATION_TYPE
Academic degree nan 43.00 nan 26.00 23.00 nan nan 72.00
Higher education 10.00 24,025.00 3.00 8,188.00 8,863.00 6.00 6.00 33,762.00
Incomplete higher nan 3,400.00 nan 518.00 770.00 1.00 3.00 5,585.00
Lower secondary nan 460.00 nan 1,529.00 102.00 nan nan 1,725.00
Secondary / secondary special nan 43,689.00 2.00 45,101.00 11,945.00 11.00 13.00 117,630.00

Education Type & Income Type

  • From the above pivot tables we could say that clients with Academic Degree have the least default risk.
  • The highest volume of applications are from clients with Secondary education. Among them, the least default risk is posed by Pensioners.
  • The next highest in terms of volume of applications is from clients with Higher Education. Among these, the least risk is posed by Pensioners and State Servants.
  • Note that there's not enough application from Businessmen with Higher Education to conclusively say they have very low risk of default.
  • Highest default risk is posed by Unemployed clients. Among them unemployed clients with Secondary Education have the higest risk of default.
In [191]:
sns.violinplot(x = 'NAME_INCOME_TYPE', y= 'AMT_INCOME_TOTAL', hue='TARGET', split=True, data = application_data,inner="quartile", height=5, aspect=3)
plt.yscale('log')
plt.xticks(rotation=90);

Correlation Analysis

Correlation For Target : 0

In [192]:
# function to correlate variables
def correlation(dataframe) : 
    cor0=dataframe[columnsForAnalysis].corr()
    type(cor0)
    cor0.where(np.triu(np.ones(cor0.shape),k=1).astype(np.bool))
    cor0=cor0.unstack().reset_index()
    cor0.columns=['VAR1','VAR2','CORR']
    cor0.dropna(subset=['CORR'], inplace=True)
    cor0.CORR=round(cor0['CORR'],2)
    cor0.CORR=cor0.CORR.abs()
    cor0.sort_values(by=['CORR'],ascending=False)
    cor0=cor0[~(cor0['VAR1']==cor0['VAR2'])]
    return pd.DataFrame(cor0.sort_values(by=['CORR'],ascending=False))
In [193]:
# Correlation for Target : 0 
# Absolute values are reported 
pd.set_option('precision', 2)
cor_0 = correlation(application_data0)
cor_0.style.background_gradient(cmap='GnBu').hide_index()
Out[193]:
VAR1 VAR2 CORR
AMT_GOODS_PRICE AMT_CREDIT 0.99
AMT_CREDIT AMT_GOODS_PRICE 0.99
AMT_GOODS_PRICE AMT_ANNUITY 0.78
AMT_ANNUITY AMT_GOODS_PRICE 0.78
AMT_ANNUITY AMT_CREDIT 0.77
AMT_CREDIT AMT_ANNUITY 0.77
AGE_YEARS DAYS_EMPLOYED 0.63
DAYS_EMPLOYED AGE_YEARS 0.63
AMT_ANNUITY AMT_INCOME_TOTAL 0.42
AMT_INCOME_TOTAL AMT_ANNUITY 0.42
AMT_GOODS_PRICE AMT_INCOME_TOTAL 0.35
AMT_INCOME_TOTAL AMT_GOODS_PRICE 0.35
AMT_INCOME_TOTAL AMT_CREDIT 0.34
AMT_CREDIT AMT_INCOME_TOTAL 0.34
AGE_YEARS EXT_SOURCE_3 0.20
EXT_SOURCE_3 AGE_YEARS 0.20
DAYS_LAST_PHONE_CHANGE EXT_SOURCE_2 0.19
EXT_SOURCE_2 DAYS_LAST_PHONE_CHANGE 0.19
DAYS_EMPLOYED AMT_INCOME_TOTAL 0.14
EXT_SOURCE_2 AMT_INCOME_TOTAL 0.14
EXT_SOURCE_2 AMT_GOODS_PRICE 0.14
AMT_INCOME_TOTAL EXT_SOURCE_2 0.14
AMT_INCOME_TOTAL DAYS_EMPLOYED 0.14
AMT_GOODS_PRICE EXT_SOURCE_2 0.14
AMT_ANNUITY EXT_SOURCE_2 0.13
EXT_SOURCE_2 AMT_ANNUITY 0.13
AMT_CREDIT EXT_SOURCE_2 0.13
EXT_SOURCE_2 AMT_CREDIT 0.13
DAYS_EMPLOYED EXT_SOURCE_3 0.11
EXT_SOURCE_3 DAYS_EMPLOYED 0.11
DAYS_EMPLOYED AMT_ANNUITY 0.10
AMT_ANNUITY DAYS_EMPLOYED 0.10
EXT_SOURCE_2 AGE_YEARS 0.08
EXT_SOURCE_2 EXT_SOURCE_3 0.08
EXT_SOURCE_3 EXT_SOURCE_2 0.08
AGE_YEARS DAYS_LAST_PHONE_CHANGE 0.08
AGE_YEARS EXT_SOURCE_2 0.08
DAYS_LAST_PHONE_CHANGE AGE_YEARS 0.08
DAYS_LAST_PHONE_CHANGE AMT_GOODS_PRICE 0.07
DAYS_EMPLOYED AMT_CREDIT 0.07
EXT_SOURCE_3 DAYS_LAST_PHONE_CHANGE 0.07
EXT_SOURCE_3 AMT_INCOME_TOTAL 0.07
AMT_INCOME_TOTAL EXT_SOURCE_3 0.07
AMT_CREDIT DAYS_EMPLOYED 0.07
DAYS_EMPLOYED AMT_GOODS_PRICE 0.07
AMT_CREDIT DAYS_LAST_PHONE_CHANGE 0.07
DAYS_LAST_PHONE_CHANGE AMT_CREDIT 0.07
AMT_GOODS_PRICE DAYS_LAST_PHONE_CHANGE 0.07
AMT_GOODS_PRICE DAYS_EMPLOYED 0.07
DAYS_LAST_PHONE_CHANGE EXT_SOURCE_3 0.07
AGE_YEARS AMT_INCOME_TOTAL 0.06
AMT_ANNUITY DAYS_LAST_PHONE_CHANGE 0.06
DAYS_LAST_PHONE_CHANGE AMT_ANNUITY 0.06
AMT_INCOME_TOTAL AGE_YEARS 0.06
AMT_CREDIT AGE_YEARS 0.05
AGE_YEARS AMT_CREDIT 0.05
AGE_YEARS AMT_GOODS_PRICE 0.04
EXT_SOURCE_3 AMT_GOODS_PRICE 0.04
EXT_SOURCE_3 AMT_CREDIT 0.04
DAYS_LAST_PHONE_CHANGE AMT_INCOME_TOTAL 0.04
AMT_INCOME_TOTAL DAYS_LAST_PHONE_CHANGE 0.04
AMT_GOODS_PRICE EXT_SOURCE_3 0.04
AMT_CREDIT EXT_SOURCE_3 0.04
AMT_GOODS_PRICE AGE_YEARS 0.04
EXT_SOURCE_2 DAYS_EMPLOYED 0.03
EXT_SOURCE_3 AMT_ANNUITY 0.03
DAYS_EMPLOYED EXT_SOURCE_2 0.03
AMT_ANNUITY EXT_SOURCE_3 0.03
DAYS_LAST_PHONE_CHANGE DAYS_EMPLOYED 0.02
DAYS_EMPLOYED DAYS_LAST_PHONE_CHANGE 0.02
AMT_ANNUITY AGE_YEARS 0.01
AGE_YEARS AMT_ANNUITY 0.01
AGE_YEARS SK_ID_CURR 0.00
SK_ID_CURR AMT_INCOME_TOTAL 0.00
DAYS_LAST_PHONE_CHANGE SK_ID_CURR 0.00
EXT_SOURCE_3 SK_ID_CURR 0.00
EXT_SOURCE_2 SK_ID_CURR 0.00
SK_ID_CURR AMT_CREDIT 0.00
AMT_GOODS_PRICE SK_ID_CURR 0.00
AMT_ANNUITY SK_ID_CURR 0.00
AMT_CREDIT SK_ID_CURR 0.00
AMT_INCOME_TOTAL SK_ID_CURR 0.00
SK_ID_CURR AGE_YEARS 0.00
SK_ID_CURR DAYS_LAST_PHONE_CHANGE 0.00
SK_ID_CURR EXT_SOURCE_3 0.00
SK_ID_CURR EXT_SOURCE_2 0.00
SK_ID_CURR DAYS_EMPLOYED 0.00
SK_ID_CURR AMT_GOODS_PRICE 0.00
SK_ID_CURR AMT_ANNUITY 0.00
DAYS_EMPLOYED SK_ID_CURR 0.00

Top 10 correlations for Target : 0 are

  • AMT_GOODS_PRICE & AMT_CREDIT
  • AMT_GOODS_PRICE & AMT_ANNUITY
  • AMT_ANNUITY & AMT_CREDIT
  • AGE_YEARS & DAYS_EMPLOYED
  • AMT_ANNUITY & AMT_INCOME_TOTAL
  • AMT_GOODS_PRICE & AMT_INCOME_TOTAL
  • AMT_INCOME_TOTAL & AMT_CREDIT
  • AGE_YEARS & EXT_SOURCE_3
  • DAYS_LAST_PHONE_CHANGE & EXT_SOURCE_2
  • DAYS_EMPLOYED & AMT_INCOME_TOTAL

Correlation For Target : 1

In [194]:
# Correlation for Target : 1 
# Absolute values are reported 
pd.set_option('precision', 2)
cor_1 = correlation(application_data1)
cor_1.style.background_gradient(cmap='GnBu').hide_index()
Out[194]:
VAR1 VAR2 CORR
AMT_GOODS_PRICE AMT_CREDIT 0.98
AMT_CREDIT AMT_GOODS_PRICE 0.98
AMT_CREDIT AMT_ANNUITY 0.75
AMT_ANNUITY AMT_GOODS_PRICE 0.75
AMT_ANNUITY AMT_CREDIT 0.75
AMT_GOODS_PRICE AMT_ANNUITY 0.75
AGE_YEARS DAYS_EMPLOYED 0.58
DAYS_EMPLOYED AGE_YEARS 0.58
EXT_SOURCE_2 DAYS_LAST_PHONE_CHANGE 0.21
DAYS_LAST_PHONE_CHANGE EXT_SOURCE_2 0.21
EXT_SOURCE_3 AGE_YEARS 0.17
AGE_YEARS EXT_SOURCE_3 0.17
AGE_YEARS AMT_GOODS_PRICE 0.14
AGE_YEARS AMT_CREDIT 0.14
AMT_GOODS_PRICE AGE_YEARS 0.14
AMT_CREDIT AGE_YEARS 0.14
AMT_GOODS_PRICE EXT_SOURCE_2 0.13
EXT_SOURCE_2 AMT_GOODS_PRICE 0.13
AMT_CREDIT EXT_SOURCE_2 0.12
EXT_SOURCE_2 AMT_ANNUITY 0.12
AMT_ANNUITY EXT_SOURCE_2 0.12
DAYS_LAST_PHONE_CHANGE AMT_GOODS_PRICE 0.12
AMT_GOODS_PRICE DAYS_LAST_PHONE_CHANGE 0.12
EXT_SOURCE_2 AMT_CREDIT 0.12
EXT_SOURCE_2 AGE_YEARS 0.11
AGE_YEARS DAYS_LAST_PHONE_CHANGE 0.11
AMT_CREDIT DAYS_LAST_PHONE_CHANGE 0.11
DAYS_LAST_PHONE_CHANGE AMT_CREDIT 0.11
DAYS_LAST_PHONE_CHANGE AGE_YEARS 0.11
AGE_YEARS EXT_SOURCE_2 0.11
DAYS_EMPLOYED EXT_SOURCE_3 0.10
EXT_SOURCE_3 DAYS_EMPLOYED 0.10
AMT_ANNUITY DAYS_EMPLOYED 0.08
EXT_SOURCE_2 EXT_SOURCE_3 0.08
AMT_ANNUITY DAYS_LAST_PHONE_CHANGE 0.08
EXT_SOURCE_3 AMT_CREDIT 0.08
EXT_SOURCE_3 AMT_GOODS_PRICE 0.08
DAYS_LAST_PHONE_CHANGE AMT_ANNUITY 0.08
AMT_CREDIT EXT_SOURCE_3 0.08
EXT_SOURCE_3 EXT_SOURCE_2 0.08
AMT_GOODS_PRICE EXT_SOURCE_3 0.08
DAYS_EMPLOYED AMT_ANNUITY 0.08
EXT_SOURCE_3 DAYS_LAST_PHONE_CHANGE 0.07
DAYS_LAST_PHONE_CHANGE EXT_SOURCE_3 0.07
AMT_ANNUITY AMT_INCOME_TOTAL 0.05
AMT_INCOME_TOTAL AMT_ANNUITY 0.05
AMT_INCOME_TOTAL AMT_CREDIT 0.04
EXT_SOURCE_3 AMT_ANNUITY 0.04
AMT_GOODS_PRICE AMT_INCOME_TOTAL 0.04
AMT_ANNUITY EXT_SOURCE_3 0.04
AMT_CREDIT AMT_INCOME_TOTAL 0.04
AMT_INCOME_TOTAL AMT_GOODS_PRICE 0.04
EXT_SOURCE_3 AMT_INCOME_TOTAL 0.02
AMT_INCOME_TOTAL EXT_SOURCE_3 0.02
EXT_SOURCE_3 SK_ID_CURR 0.01
AGE_YEARS AMT_ANNUITY 0.01
SK_ID_CURR AMT_INCOME_TOTAL 0.01
EXT_SOURCE_2 AMT_INCOME_TOTAL 0.01
AMT_ANNUITY AGE_YEARS 0.01
SK_ID_CURR AMT_ANNUITY 0.01
SK_ID_CURR DAYS_EMPLOYED 0.01
SK_ID_CURR EXT_SOURCE_2 0.01
SK_ID_CURR EXT_SOURCE_3 0.01
AMT_INCOME_TOTAL SK_ID_CURR 0.01
AMT_INCOME_TOTAL DAYS_EMPLOYED 0.01
AMT_INCOME_TOTAL EXT_SOURCE_2 0.01
EXT_SOURCE_2 SK_ID_CURR 0.01
AMT_ANNUITY SK_ID_CURR 0.01
DAYS_EMPLOYED SK_ID_CURR 0.01
AMT_GOODS_PRICE DAYS_EMPLOYED 0.01
DAYS_EMPLOYED AMT_GOODS_PRICE 0.01
DAYS_EMPLOYED AMT_INCOME_TOTAL 0.01
AMT_GOODS_PRICE SK_ID_CURR 0.00
AGE_YEARS SK_ID_CURR 0.00
EXT_SOURCE_2 DAYS_EMPLOYED 0.00
SK_ID_CURR AMT_GOODS_PRICE 0.00
DAYS_EMPLOYED DAYS_LAST_PHONE_CHANGE 0.00
DAYS_EMPLOYED EXT_SOURCE_2 0.00
SK_ID_CURR DAYS_LAST_PHONE_CHANGE 0.00
AGE_YEARS AMT_INCOME_TOTAL 0.00
DAYS_EMPLOYED AMT_CREDIT 0.00
SK_ID_CURR AGE_YEARS 0.00
AMT_CREDIT DAYS_EMPLOYED 0.00
DAYS_LAST_PHONE_CHANGE DAYS_EMPLOYED 0.00
SK_ID_CURR AMT_CREDIT 0.00
AMT_INCOME_TOTAL DAYS_LAST_PHONE_CHANGE 0.00
AMT_INCOME_TOTAL AGE_YEARS 0.00
DAYS_LAST_PHONE_CHANGE SK_ID_CURR 0.00
AMT_CREDIT SK_ID_CURR 0.00
DAYS_LAST_PHONE_CHANGE AMT_INCOME_TOTAL 0.00

Top 10 correlations for Target 1 are :

  • AMT_GOODS_PRICE & AMT_CREDIT
  • AMT_CREDIT & AMT_ANNUITY
  • AMT_ANNUITY & AMT_GOODS_PRICE
  • AGE_YEARS & DAYS_EMPLOYED
  • EXT_SOURCE_2 & DAYS_LAST_PHONE_CHANGE
  • EXT_SOURCE_3 & AGE_YEARS
  • AGE_YEARS & AMT_GOODS_PRICE
  • AGE_YEARS & AMT_CREDIT
  • AMT_GOODS_PRICE & EXT_SOURCE_2
  • AMT_CREDIT & EXT_SOURCE_2
  • Merged data analysis can be found here