Bank Credit Default Case Study¶

Exploratory Data Analysis¶

Business Understanding :¶

The loan providing companies find it hard to give loans to the people due to their insufficient or non-existent credit history. Because of that, some consumers use it as their advantage by becoming a defaulter. Data is collected about a hypothetical finance bank which specialises in lending various types of loans to urban customers. EDA is used to analyse the patterns present in the data. The bank can use this data to ensure that the applicants capable of repaying the loan are not rejected.

When the bank receives a loan application, the bank has to decide for loan approval based on the applicant’s profile. Two types of risks are associated with the bank’s decision:

If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the bank
If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the bank.

The data contains the information about the loan application at the time of applying for the loan. It contains two types of scenarios:

The client with payment difficulties: he/she had late payment more than X days on at least one of the first Y instalments of the loan in our sample,
All other cases: All other cases when the payment is paid on time.

When a client applies for a loan, there are four types of decisions that could be taken by the client/bank):

Approved: The bank has approved loan Application
Cancelled: The client cancelled the application sometime during approval. Either the client changed her/his mind about the loan or in some cases due to a higher risk of the client he received worse pricing which he did not want.
Refused: The bank had rejected the loan (because the client does not meet their requirements etc.).
Unused offer: Loan has been cancelled by the client but on different stages of the process.

This case study uses EDA techniques to understand how consumer attributes and loan attributes influence the tendency of default.

Business Objectives¶

This case study aims to identify patterns which indicate if a client has difficulty paying their installments which may be used for taking actions such as denying the loan, reducing the amount of loan, lending (to risky applicants) at a higher interest rate, etc. This will ensure that the consumers capable of repaying the loan are not rejected.

In other words, we have tried to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default. The bank can utilise this knowledge for its portfolio and risk assessment.

Data Understanding¶

'application_data.csv' contains all the information of the client at the time of application. The data is about whether a client has payment difficulties.
'previous_application.csv' contains information about the client’s previous loan data. It contains the data whether the previous application had been Approved, Cancelled, Refused or Unused offer.

Analysis Approach & Conclusions:¶

The purpose of this analysis is to give insights on credit default analysis and suggest customer parameters which have a clear indication of default.
The Analysis is done on current loan application data and previous loan application status data.
The current loan application is used to explore for insights on the factors that contribute to payment difficulties / loan default.
A merged data set of current loan application and previous application data is analysed to give insights on factors that contribute to Loan Approval, Rejection, Cancellations and unused offers.
The risk of default could broadly divided into
Risk attributed to Credit Good [Credit Good Risk]
Risk attributed to Consumer liabilities / behaviour [Consumer related Risk]
Loan application is approved based on below examination,
If applicant is new customer, then we look at Credibility and Demographics
If applicant is old customer, then we look at Credit history and Previous rejection causes.

The following impressions have been drawn from the analysis

Univariate Analysis¶

Gender

There is a very high proportion of Female applicants (65.8%)
Only 7% of total female applicants have payment difficulties compared to 10% of total male applications.
It is possible that female applicants are safer borrowers than male.

Owning Realty

Applicants who do not own Real-estate seem more likely to face payment difficulties than compared to the group of people owning the Real-estate.

Contact Address vs Work Address

Applicants that live and work in same city are at less risk of payment difficulties.

Education

Higher the education, lesser is the risk of payment difficulties.

Marital Status

Applicants with FAMILY_STATUS as seperated or single or have higher risk of payment difficulties compared to that of Widow and Married.
For the purpose of this analysis, civil marriage and married are treated equal. They might be merged into the one category.

Number of Days Employed

The applicants with payment difficulties have lesser median days of employment compared to "all other cases". So, greater the DAYS_EMPLOYED, lesser is the risk of payment difficulties.

Bivariate Analysis¶

Credit Amount and Price of Credit Goods

Lower Credit Amount and Lower Price of Credit Goods have higher cases of Payment difficulties

Gender & Marital Status

Customers most attractive to the bank by likelihood of default, are 'Female Widows' > 'Married Females' = 'Separated Females' > 'Single Females' > 'Married Males' > 'Male Widowers' > 'Single Males' = 'Separated Males'
The bank may focus their marketing campaigns on converting applications of Female Widows, Married Females, Separated Females.
May note that the dataset is skewed towards Females than Males.

Goods Price

Median of "Approved" Category is 95,208.48. Any amount above this has high risk of getting refused or being cancelled.

Income and Annuity

Income Range and Annuity have a mild correlation with Payment difficulties

Education & Client Type

Bank should look into the reasons why Repeat customers who are cancelling their offers. It could also focus on loans catered to academic degree holders who are first time customers.

Top 10 correlations for 'No default'¶

AMT_GOODS_PRICE & AMT_CREDIT
AMT_GOODS_PRICE & AMT_ANNUITY
AMT_ANNUITY & AMT_CREDIT
AGE_YEARS & DAYS_EMPLOYED
AMT_ANNUITY & AMT_INCOME_TOTAL
AMT_GOODS_PRICE & AMT_INCOME_TOTAL
AMT_INCOME_TOTAL & AMT_CREDIT
AGE_YEARS & EXT_SOURCE_3
DAYS_LAST_PHONE_CHANGE & EXT_SOURCE_2
DAYS_EMPLOYED & AMT_INCOME_TOTAL

Top 10 correlations for 'Default' are :¶

AMT_GOODS_PRICE & AMT_CREDIT
AMT_CREDIT & AMT_ANNUITY
AMT_ANNUITY & AMT_GOODS_PRICE
AGE_YEARS & DAYS_EMPLOYED
EXT_SOURCE_2 & DAYS_LAST_PHONE_CHANGE
EXT_SOURCE_3 & AGE_YEARS
AGE_YEARS & AMT_GOODS_PRICE
AGE_YEARS & AMT_CREDIT
AMT_GOODS_PRICE & EXT_SOURCE_2
AMT_CREDIT & EXT_SOURCE_2

Look below for a detailed analysis¶

#importing required Libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import sidetable
from tabulate import tabulate

import warnings
warnings.filterwarnings('ignore')

#setting max number of columns to be displayed to 100, to get a better view.
pd.set_option('display.max_columns',100)
pd.set_option('display.max_rows',100)
from IPython.display import HTML
HTML('''<script>
code_show_err=false; 
function code_toggle_err() {
 if (code_show_err){
 $('div.output_stderr').hide();
 } else {
 $('div.output_stderr').show();
 }
 code_show_err = !code_show_err
} 
$( document ).ready(code_toggle_err);
</script>
''')

Loading Application Data¶

application_data=pd.read_csv('application_data.csv')

#checking shape 
application_data.shape

(307511, 122)

application_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB

application_data.describe()

Data Quality Check¶

Checking Columns with High Null Percentage¶

def color_red(value):
    '''
      Colors elements in a dateframe
      green if nulls <50% and red if
      >=50%.
    '''
    if value >=50:
        color = 'red'
    elif value <50:
        color = 'green'

    return 'color: %s' % color

#Finding Percentage of Missing values in each and every column
missing_data = np.round(100*application_data.isnull().sum()/len(application_data),2)
missing_data = pd.DataFrame(data={'Column Name' :missing_data.index, 'Null Percentage' : missing_data.values})
missing_data.style.applymap(color_red,subset=['Null Percentage'])

#Finding the columns that have more than or equal to 50% null values and storing it to columns_to_drop.
columns_to_drop=application_data.columns[100*application_data.isnull().sum()/len(application_data)>=50]
columns_to_drop

Index(['OWN_CAR_AGE', 'EXT_SOURCE_1', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG',
       'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG',
       'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG',
       'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG',
       'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BUILD_MODE',
       'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMIN_MODE',
       'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE',
       'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI',
       'BASEMENTAREA_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI',
       'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI',
       'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI',
       'NONLIVINGAREA_MEDI', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE',
       'WALLSMATERIAL_MODE'],
      dtype='object')

#dropping the columns where the null values are >= 50%.
application_data.drop(labels=columns_to_drop,axis=1,inplace=True)

#verifying if the columns are dropped.
application_data.shape

(307511, 81)

Imputation Consideration¶

Finding the best values to impute below columns that have <=13% of null values¶

#Finding columns with <= 13% missing columns

null_info = pd.DataFrame(100*application_data.isnull().sum()/len(application_data))
null_info.columns = ['Null Percentage']
null_info[(null_info['Null Percentage'] > 0) & (null_info['Null Percentage'] <=13)]

Columns To Impute¶

We have chosen the following columns to consider the best value to impute the missing values with

AMT_ANNUITY
AMT_GOODS_PRICE
NAME_TYPE_SUITE
CNT_FAM_MEMBERS
OBS_60_CNT_SOCIAL_CIRCLE

AMT_ANNUITY Imputation¶

# Boxplot to check for outliers
application_data['AMT_ANNUITY'].plot.box()
plt.title('\n Box Plot of AMT_GOODS_PRICE')

# Calculating Quantiles
print('Quantile\tAMT_ANNUITY')
application_data['AMT_ANNUITY'].quantile([0.5,0.8,0.85,0.90,0.95,1])

Quantile	AMT_ANNUITY

0.50    24,903.00
0.80    37,516.50
0.85    40,806.00
0.90    45,954.00
0.95    53,325.00
1.00   258,025.50
Name: AMT_ANNUITY, dtype: float64

From the above box plot of AMT_ANNUITY, there are a lot of outliers.

Calculating Quantiles confirms the same. There is a huge jump from 95% value to max value .
Hence, Median : 24903 is the best value to impute the missing values in AMT_ANNUITY, since mean is not robust to outliers.

AMT_GOODS_PRICE Imputation¶

print('Quantile\tAMT_GOODS_PRICE')
application_data['AMT_GOODS_PRICE'].plot.box()
plt.title('\n Box Plot of AMT_GOODS_PRICE')
application_data['AMT_GOODS_PRICE'].quantile([0.5,0.8,0.85,0.90,0.95,1])

Quantile	AMT_GOODS_PRICE

0.50     450,000.00
0.80     814,500.00
0.85     900,000.00
0.90   1,093,500.00
0.95   1,305,000.00
1.00   4,050,000.00
Name: AMT_GOODS_PRICE, dtype: float64

The above box plot show significant number of outliers. This confirmed by the quantiles.
There is a huge jump from 95 percentile value to Maximum value.
In this case, since there are many outliers, Mean is not a good representation of the data since it is not robust to outliers.

Hence ,Median : 450000 is the best metric to impute missing values

NAME_TYPE_SUITE Imputation :¶

print('Data type of NAME_TYPE_SUITE : ',application_data['NAME_TYPE_SUITE'].dtype,'\n\n')
print('Category\tNormalized Count\n\n',application_data['NAME_TYPE_SUITE'].value_counts(normalize=True))
data = application_data['NAME_TYPE_SUITE'].value_counts(normalize=True)
plt.bar(data.index,data.values)
# data.hist()
plt.xticks(rotation=90)
plt.ylabel('Normalized Value Counts')
plt.title('\nNAME_TYPE_SUITE');

Data type of NAME_TYPE_SUITE :  object 


Category	Normalized Count

 Unaccompanied     0.81
Family            0.13
Spouse, partner   0.04
Children          0.01
Other_B           0.01
Other_A           0.00
Group of people   0.00
Name: NAME_TYPE_SUITE, dtype: float64

NAME_TYPE_SUITE is a categorical variable.
The best metric to impute missing values is Mode of the data.
From the above plot, 'Unaccompanied' is the Mode

Hence, Unaccompanied is the best value to impute the missing values.

CNT_FAM_MEMBERS Imputation¶

application_data['CNT_FAM_MEMBERS'] = application_data['CNT_FAM_MEMBERS'].astype('category')

print('Data type of CNT_FAM_MEMBERS : ',application_data['CNT_FAM_MEMBERS'].dtype,'\n\n')

print('Fly Mems | Value Counts\n',application_data['CNT_FAM_MEMBERS'].value_counts(normalize=True))

(application_data['CNT_FAM_MEMBERS'].value_counts()).sort_index().plot(kind='bar')
plt.title('\nNo of Family Members vs Value Counts');

Data type of CNT_FAM_MEMBERS :  category 


Fly Mems | Value Counts
 2.00    0.51
1.00    0.22
3.00    0.17
4.00    0.08
5.00    0.01
6.00    0.00
7.00    0.00
8.00    0.00
9.00    0.00
10.00   0.00
16.00   0.00
12.00   0.00
14.00   0.00
20.00   0.00
11.00   0.00
13.00   0.00
15.00   0.00
Name: CNT_FAM_MEMBERS, dtype: float64

EXT_SOURCE_2 Imputation¶

print('Data type of EXT_SOURCE_2 : ',application_data['EXT_SOURCE_2'].dtype,'\n\n')
application_data['EXT_SOURCE_2'].plot.box()
plt.title('\nEXT_SOURCE_2');
print('Quantile\tValue')
application_data['EXT_SOURCE_2'].quantile([0.5,0.8,0.85,0.90,0.95,1])

Data type of EXT_SOURCE_2 :  float64 


Quantile	Value

0.50   0.57
0.80   0.68
0.85   0.70
0.90   0.72
0.95   0.75
1.00   0.85
Name: EXT_SOURCE_2, dtype: float64

round(application_data['EXT_SOURCE_2'].mean(),2)

0.51

EXT_SOURCE_2 is the credit rating by an external source. It is a numerical continuous variable. The above bar plot and the quantiles show no outliers. In this case Mean and Median are very close. Because there are no outliers, Mean : 0.51 could be used to impute missing values.

Checking for Disguised Missing Values¶

def cat_value_counts(column_name) : 
    print(tabulate(pd.DataFrame(application_data.stb.freq([column_name])), headers='keys', tablefmt='psql'))
    print(pd.DataFrame(application_data[column_name]).stb.missing(),'\n\n\n')

#Replacing XAN with np.nan in Gender column : 

application_data['CODE_GENDER'] = application_data['CODE_GENDER'].replace('XNA',np.nan)
cat_value_counts('CODE_GENDER')

+----+---------------+---------+-----------+--------------------+----------------------+
|    | CODE_GENDER   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------+---------+-----------+--------------------+----------------------|
|  0 | F             |  202448 |  0.658352 |             202448 |             0.658352 |
|  1 | M             |  105059 |  0.341648 |             307507 |             1        |
+----+---------------+---------+-----------+--------------------+----------------------+
             Missing   Total  Percent
CODE_GENDER        4  307511     0.00

Since CODE_GENDER is a categorical variable, the missing values could be imputed with Mode : F

# replacing Unknown in NAME_FAMILY_STATUS with np.nan 
application_data['NAME_FAMILY_STATUS'] = application_data['NAME_FAMILY_STATUS'].replace('Unknown',np.nan)
cat_value_counts('NAME_FAMILY_STATUS')

+----+----------------------+---------+-----------+--------------------+----------------------+
|    | NAME_FAMILY_STATUS   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------------+---------+-----------+--------------------+----------------------|
|  0 | Married              |  196432 | 0.638785  |             196432 |             0.638785 |
|  1 | Single / not married |   45444 | 0.147781  |             241876 |             0.786566 |
|  2 | Civil marriage       |   29775 | 0.0968264 |             271651 |             0.883392 |
|  3 | Separated            |   19770 | 0.0642908 |             291421 |             0.947683 |
|  4 | Widow                |   16088 | 0.0523172 |             307509 |             1        |
+----+----------------------+---------+-----------+--------------------+----------------------+
                    Missing   Total  Percent
NAME_FAMILY_STATUS        2  307511     0.00

Since NAME_FAMILY_STATUS is a categorical variable, the missing values could be imputed with Mode : Married

# replacing XNA values in ORGANIZATION_TYPE 
application_data['ORGANIZATION_TYPE'] = application_data['ORGANIZATION_TYPE'].replace('XNA',np.nan)
cat_value_counts('ORGANIZATION_TYPE')

+----+------------------------+---------+-------------+--------------------+----------------------+
|    | ORGANIZATION_TYPE      |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+------------------------+---------+-------------+--------------------+----------------------|
|  0 | Business Entity Type 3 |   67992 | 0.269663    |              67992 |             0.269663 |
|  1 | Self-employed          |   38412 | 0.152346    |             106404 |             0.422009 |
|  2 | Other                  |   16683 | 0.0661664   |             123087 |             0.488175 |
|  3 | Medicine               |   11193 | 0.0443925   |             134280 |             0.532568 |
|  4 | Business Entity Type 2 |   10553 | 0.0418542   |             144833 |             0.574422 |
|  5 | Government             |   10404 | 0.0412633   |             155237 |             0.615685 |
|  6 | School                 |    8893 | 0.0352705   |             164130 |             0.650956 |
|  7 | Trade: type 7          |    7831 | 0.0310585   |             171961 |             0.682014 |
|  8 | Kindergarten           |    6880 | 0.0272868   |             178841 |             0.709301 |
|  9 | Construction           |    6721 | 0.0266561   |             185562 |             0.735957 |
| 10 | Business Entity Type 1 |    5984 | 0.0237331   |             191546 |             0.75969  |
| 11 | Transport: type 4      |    5398 | 0.021409    |             196944 |             0.781099 |
| 12 | Trade: type 3          |    3492 | 0.0138496   |             200436 |             0.794949 |
| 13 | Industry: type 9       |    3368 | 0.0133578   |             203804 |             0.808307 |
| 14 | Industry: type 3       |    3278 | 0.0130009   |             207082 |             0.821307 |
| 15 | Security               |    3247 | 0.0128779   |             210329 |             0.834185 |
| 16 | Housing                |    2958 | 0.0117317   |             213287 |             0.845917 |
| 17 | Industry: type 11      |    2704 | 0.0107243   |             215991 |             0.856641 |
| 18 | Military               |    2634 | 0.0104467   |             218625 |             0.867088 |
| 19 | Bank                   |    2507 | 0.00994301  |             221132 |             0.877031 |
| 20 | Agriculture            |    2454 | 0.0097328   |             223586 |             0.886764 |
| 21 | Police                 |    2341 | 0.00928463  |             225927 |             0.896049 |
| 22 | Transport: type 2      |    2204 | 0.00874128  |             228131 |             0.90479  |
| 23 | Postal                 |    2157 | 0.00855487  |             230288 |             0.913345 |
| 24 | Security Ministries    |    1974 | 0.00782908  |             232262 |             0.921174 |
| 25 | Trade: type 2          |    1900 | 0.00753559  |             234162 |             0.928709 |
| 26 | Restaurant             |    1811 | 0.0071826   |             235973 |             0.935892 |
| 27 | Services               |    1575 | 0.0062466   |             237548 |             0.942139 |
| 28 | University             |    1327 | 0.00526301  |             238875 |             0.947402 |
| 29 | Industry: type 7       |    1307 | 0.00518369  |             240182 |             0.952585 |
| 30 | Transport: type 3      |    1187 | 0.00470776  |             241369 |             0.957293 |
| 31 | Industry: type 1       |    1039 | 0.00412078  |             242408 |             0.961414 |
| 32 | Hotel                  |     966 | 0.00383125  |             243374 |             0.965245 |
| 33 | Electricity            |     950 | 0.00376779  |             244324 |             0.969013 |
| 34 | Industry: type 4       |     877 | 0.00347827  |             245201 |             0.972491 |
| 35 | Trade: type 6          |     631 | 0.00250261  |             245832 |             0.974994 |
| 36 | Industry: type 5       |     599 | 0.00237569  |             246431 |             0.977369 |
| 37 | Insurance              |     597 | 0.00236776  |             247028 |             0.979737 |
| 38 | Telecom                |     577 | 0.00228844  |             247605 |             0.982026 |
| 39 | Emergency              |     560 | 0.00222101  |             248165 |             0.984247 |
| 40 | Industry: type 2       |     458 | 0.00181647  |             248623 |             0.986063 |
| 41 | Advertising            |     429 | 0.00170146  |             249052 |             0.987765 |
| 42 | Realtor                |     396 | 0.00157057  |             249448 |             0.989335 |
| 43 | Culture                |     379 | 0.00150315  |             249827 |             0.990838 |
| 44 | Industry: type 12      |     369 | 0.00146349  |             250196 |             0.992302 |
| 45 | Trade: type 1          |     348 | 0.0013802   |             250544 |             0.993682 |
| 46 | Mobile                 |     317 | 0.00125725  |             250861 |             0.994939 |
| 47 | Legal Services         |     305 | 0.00120966  |             251166 |             0.996149 |
| 48 | Cleaning               |     260 | 0.00103119  |             251426 |             0.99718  |
| 49 | Transport: type 1      |     201 | 0.000797186 |             251627 |             0.997977 |
| 50 | Industry: type 6       |     112 | 0.000444203 |             251739 |             0.998421 |
| 51 | Industry: type 10      |     109 | 0.000432305 |             251848 |             0.998854 |
| 52 | Religion               |      85 | 0.000337118 |             251933 |             0.999191 |
| 53 | Industry: type 13      |      67 | 0.000265729 |             252000 |             0.999457 |
| 54 | Trade: type 4          |      64 | 0.00025383  |             252064 |             0.99971  |
| 55 | Trade: type 5          |      49 | 0.000194339 |             252113 |             0.999905 |
| 56 | Industry: type 8       |      24 | 9.51863e-05 |             252137 |             1        |
+----+------------------------+---------+-------------+--------------------+----------------------+
                   Missing   Total  Percent
ORGANIZATION_TYPE    55374  307511     0.18

Since ORGANIZATION_TYPE is a categorical variable, the missing values could be imputed with Mode : Business Entity Type 3

Data Type Checks¶

pd.DataFrame(application_data.dtypes)

application_data[['DAYS_BIRTH','DAYS_EMPLOYED','DAYS_REGISTRATION','DAYS_ID_PUBLISH']].head()

The DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH have negative values, since they are in past with respect to date of application.
To standard size these columns for the purpose of analysis, we could converted them to their absolute values

columns_to_convert = ['DAYS_BIRTH','DAYS_EMPLOYED','DAYS_REGISTRATION','DAYS_ID_PUBLISH']
application_data[columns_to_convert] = application_data[columns_to_convert].abs()
application_data[columns_to_convert].head()

# checking columns with binary values
values_per_column = application_data.nunique().sort_values()
col_values_dtype = pd.DataFrame(index=values_per_column.index, data= {'Unique Values' : values_per_column.values, 'Data Type' : application_data.dtypes})
col_values_dtype

The above columns are 'categorical' variables with only two values. But they have been read as 'int' datatype
We could convert them into 'categorical' data type

# converting to category data type 
convert_to_cat = col_values_dtype[col_values_dtype['Unique Values']<=8].index
application_data[convert_to_cat] = application_data[convert_to_cat].astype('category')

# check if the columns are converted
values_per_column = application_data.nunique().sort_values()
new_categories  = pd.DataFrame(index=values_per_column.index, data= {'Unique Values' : values_per_column.values, 'Data Type' : application_data.dtypes})
new_categories

Checking for Outliers¶

Checking for outliers in the following numerical columns¶

AMT_INCOME_TOTAL
AMT_CREDIT
AMT_ANNUITY
AMT_GOODS_PRICE
DAYS_BIRTH

# Adding a new column "AGE_YEARS" using 'DAYS_BIRTH' with age in years
def days_to_years(x) : 
    if x < 0 : 
        x = -1*x 
    return x//365
application_data['AGE_YEARS'] = application_data['DAYS_BIRTH'].apply(days_to_years)
application_data['AGE_YEARS'].describe()

count   307,511.00
mean         43.44
std          11.95
min          20.00
25%          34.00
50%          43.00
75%          53.00
max          69.00
Name: AGE_YEARS, dtype: float64

# Box plots of the above numerical variables 
outlier_check_col = ["AMT_INCOME_TOTAL","AMT_CREDIT","AMT_ANNUITY","AMT_GOODS_PRICE","DAYS_BIRTH"]

fig,ax = plt.subplots(3,2)
fig.set_figheight(15)
fig.set_figwidth(15)
ax[0,0].set_yscale('log')
ax[0,0].set(ylabel ='Income in Log Scale')
application_data[outlier_check_col[0]].plot.box(ax=ax[0,0],);
application_data[outlier_check_col[1]].plot.box(ax=ax[0,1]);
application_data[outlier_check_col[2]].plot.box(ax=ax[1,0]);
application_data[outlier_check_col[3]].plot.box(ax=ax[1,1]);
ax[2,0].set(ylabel ='Age In Days')
application_data[outlier_check_col[4]].plot.box(ax=ax[2,0]); 
ax[2,1].axis('off')
print('Box Plots of "AMT_INCOME_TOTAL","AMT_CREDIT","AMT_ANNUITY","AMT_GOODS_PRICE","DAYS_BIRTH" \n')

Box Plots of "AMT_INCOME_TOTAL","AMT_CREDIT","AMT_ANNUITY","AMT_GOODS_PRICE","DAYS_BIRTH"

## Calculating quantiles for the above columns
pd.options.display.float_format = '{:,.2f}'.format
for col in outlier_check_col : 
    print(col,'\n',application_data[col].quantile([0.5,0.8,0.85,0.90,0.95,1]),'\n\n')

AMT_INCOME_TOTAL 
 0.50       147,150.00
0.80       225,000.00
0.85       234,000.00
0.90       270,000.00
0.95       337,500.00
1.00   117,000,000.00
Name: AMT_INCOME_TOTAL, dtype: float64 


AMT_CREDIT 
 0.50     513,531.00
0.80     900,000.00
0.85   1,024,740.00
0.90   1,133,748.00
0.95   1,350,000.00
1.00   4,050,000.00
Name: AMT_CREDIT, dtype: float64 


AMT_ANNUITY 
 0.50    24,903.00
0.80    37,516.50
0.85    40,806.00
0.90    45,954.00
0.95    53,325.00
1.00   258,025.50
Name: AMT_ANNUITY, dtype: float64 


AMT_GOODS_PRICE 
 0.50     450,000.00
0.80     814,500.00
0.85     900,000.00
0.90   1,093,500.00
0.95   1,305,000.00
1.00   4,050,000.00
Name: AMT_GOODS_PRICE, dtype: float64 


DAYS_BIRTH 
 0.50   15,750.00
0.80   20,474.00
0.85   21,316.00
0.90   22,181.00
0.95   23,204.00
1.00   25,229.00
Name: DAYS_BIRTH, dtype: float64

Outliers in Numerical Columns¶

From the above box plots and quantile calculations, we see that
AMT_INCOME_TOTAL has huge jump from 95th Percentile (337,500.00) to Maximum value (117,000,000.00) {346 times 95th Percentile}.
Apart from this jump, there are several values > 1.5 times IQR. All of these are deemed outliers.
To avoid skewed results of analysis, values greater than 95th percentile value may be capped to 95th Percentile value (337,500.00)

AMT_CREDIT also has a huge jump from 95th Percentile (1,350,000.00) to Maximum value (4,050,000.00) {3 times 95th percentile}.
Apart from this jump, there are several values > 1.5 times IQR. All of these are deemed outliers.
To avoid skewed results of analysis, values greater than 95th percentile value may be capped to 95th Percentile value (1,350,000.00)

Similarly , AMT_ANNUITY, AMT_ANNUITY have outliers and they could be capped to 95th percentile values.

DAYS_BIRTH has no outliers

Binning Continuous Variables¶

# AMT_INCOME_TOTAL
min_income = int(application_data['AMT_INCOME_TOTAL'].min())
max_income = int(application_data['AMT_INCOME_TOTAL'].max())


bins = [0,25000,50000,75000,100000,125000,150000,175000,200000,225000,250000,275000,300000,325000,350000,375000,400000,425000,450000,475000,500000,10000000000]
intervals = ['0-25000', '25000-50000','50000-75000','75000-100000','100000-125000', '125000-150000', '150000-175000','175000-200000',
       '200000-225000','225000-250000','250000-275000','275000-300000','300000-325000','325000-350000','350000-375000',
       '375000-400000','400000-425000','425000-450000','450000-475000','475000-500000','500000 and above']

application_data['AMT_INCOME_CAT']=pd.cut(application_data['AMT_INCOME_TOTAL'],bins,labels=intervals)
print('Income Range\t Count')
print(application_data['AMT_INCOME_CAT'].value_counts())

income_cat = application_data['AMT_INCOME_CAT'].value_counts()
plt.hist(application_data['AMT_INCOME_CAT'])

plt.title('\n Income Range vs No of Applications')
plt.xticks(rotation=90);

Income Range	 Count
125000-150000       47890
100000-125000       43701
200000-225000       40797
75000-100000        39806
150000-175000       34663
175000-200000       29644
50000-75000         19375
250000-275000       12733
225000-250000        7340
300000-325000        6752
350000-375000        4537
25000-50000          4517
275000-300000        4306
425000-450000        3113
500000 and above     2702
325000-350000        2122
400000-425000        1811
375000-400000        1265
475000-500000         312
450000-475000         125
0-25000                 0
Name: AMT_INCOME_CAT, dtype: int64

#AMT_CREDIT
bins = [0,150000,200000,250000,300000,350000,400000,450000,500000,550000,600000,650000,700000,750000,800000,850000,900000,1000000000]
intervals = ['0-150000', '150000-200000','200000-250000', '250000-300000', '300000-350000', '350000-400000','400000-450000',
        '450000-500000','500000-550000','550000-600000','600000-650000','650000-700000','700000-750000','750000-800000',
        '800000-850000','850000-900000','900000 and above']

application_data['AMT_CREDIT_RANGE']=pd.cut(application_data['AMT_CREDIT'],bins=bins,labels=intervals)
application_data['AMT_CREDIT_RANGE'] = application_data['AMT_CREDIT_RANGE'].astype('category')
print('Credit Range\t Count')
credit_range = application_data['AMT_CREDIT_RANGE'].value_counts()
print(credit_range)
plt.hist(application_data['AMT_CREDIT_RANGE'])
plt.xticks(rotation=90)

Credit Range	 Count
900000 and above    58912
250000-300000       31759
200000-250000       23054
500000-550000       22678
400000-450000       18239
0-150000            18159
150000-200000       17985
300000-350000       16205
650000-700000       15051
450000-500000       13799
750000-800000       12380
800000-850000       11559
550000-600000       11554
850000-900000       10233
350000-400000       10133
600000-650000        8998
700000-750000        6813
Name: AMT_CREDIT_RANGE, dtype: int64

([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16],
 <a list of 17 Text xticklabel objects>)

Data Imbalance¶

# Target Variable - 1: Client with Payment difficulties, 0 : All other cases 
cat_value_counts('TARGET')

+----+----------+---------+-----------+--------------------+----------------------+
|    |   TARGET |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+----------+---------+-----------+--------------------+----------------------|
|  0 |        0 |  282686 | 0.919271  |             282686 |             0.919271 |
|  1 |        1 |   24825 | 0.0807288 |             307511 |             1        |
+----+----------+---------+-----------+--------------------+----------------------+
        Missing   Total  Percent
TARGET        0  307511     0.00

From the above, you can see that this data set contains 91.9% of records about Clients with Payment difficulties and 0.08% records about all other cases.
Ratio of classes = .919/0.08 = 11.375
This represents an huge Data Imbalance
The dataset is skewed towards 'Clients with Payment difficulties'

Data Set division wrt `TARGET`¶

application_data0=application_data.loc[application_data["TARGET"]==0]
application_data1=application_data.loc[application_data["TARGET"]==1]

Columns Selection for Analysis¶

columnsForAnalysis = ['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
       'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS',
       'NAME_HOUSING_TYPE',
       'DAYS_EMPLOYED','FLAG_MOBIL', 'FLAG_CONT_MOBILE',
       'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS',
       'REGION_RATING_CLIENT_W_CITY',
                      'REG_CITY_NOT_LIVE_CITY',
       'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
       'ORGANIZATION_TYPE', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
        'DAYS_LAST_PHONE_CHANGE' ,'AGE_YEARS', 'AMT_INCOME_CAT',
       'AMT_CREDIT_RANGE']

Analysis¶

Univariate Analysis¶

# LOAN DATA

# function for categorical variable univariate analysis
def cat_univariate_analysis(column_name,figsize=(10,5)) : 
    # print unique values
    print('TARGET 0\n', application_data0[column_name].unique(),'\n')
    print('TARGET 1\n',application_data1[column_name].unique(),'\n')
    
    # column vs target count plot
    plt.figure(figsize=figsize)
    ax = sns.countplot(x=column_name,hue='TARGET',data=application_data)
    title = column_name + ' vs Number of Applications'
    ax.set(title= title)
    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x()+p.get_width()/2,
                height + 10,
                format(height),
                ha="center")
    # Percentages 
    print('All Other Cases (TARGET : 0)')
    print(tabulate(pd.DataFrame(application_data0.stb.freq([column_name])), headers='keys', tablefmt='psql'),'\n')
    print('Clients with Payment Difficulties (TARGET : 1)')
    print(tabulate(pd.DataFrame(application_data1.stb.freq([column_name])), headers='keys', tablefmt='psql'),'\n')

# NAME_CONTRACT_TYPE
cat_univariate_analysis('NAME_CONTRACT_TYPE')

TARGET 0
 [Cash loans, Revolving loans]
Categories (2, object): [Cash loans, Revolving loans] 

TARGET 1
 [Cash loans, Revolving loans]
Categories (2, object): [Cash loans, Revolving loans] 

All Other Cases (TARGET : 0)
+----+----------------------+---------+-----------+--------------------+----------------------+
|    | NAME_CONTRACT_TYPE   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------------+---------+-----------+--------------------+----------------------|
|  0 | Cash loans           |  255011 | 0.9021    |             255011 |               0.9021 |
|  1 | Revolving loans      |   27675 | 0.0979001 |             282686 |               1      |
+----+----------------------+---------+-----------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+----------------------+---------+-----------+--------------------+----------------------+
|    | NAME_CONTRACT_TYPE   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------------+---------+-----------+--------------------+----------------------|
|  0 | Cash loans           |   23221 | 0.935388  |              23221 |             0.935388 |
|  1 | Revolving loans      |    1604 | 0.0646123 |              24825 |             1        |
+----+----------------------+---------+-----------+--------------------+----------------------+

Client's with All other cases (Target=0)¶

Out of 'All other cases', 90.21% have taken 'Cash Loans' and 9.79% have taken 'Revolving Loans'

Client's with Payment Difficulties (Target=1)¶

Out of all 'Client's with payment difficulties', 93.54% have taken 'Cash Loans' and 6.46% have taken 'Revolving Loans'

# function for numerical variable univariate analysis

def num_univariate_analysis(column_name,scale='linear') : 
    # boxplot for column vs target
    plt.figure(figsize=(8,6))
    ax = sns.boxplot(x='TARGET', y = column_name, data = application_data)
    title = column_name+' vs Target'
    ax.set(title=title)
    if scale == 'log' :
        plt.yscale('log')
        ax.set(ylabel=column_name + '(Log Scale)')
    # summary statistic
    print('All Other Cases (TARGET : 0)')
    print(application_data0[column_name].describe(),'\n')
    print('Clients with Payment Difficulties (TARGET : 1)')
    print(application_data1[column_name].describe())

# AMT_CREDIT 
num_univariate_analysis('AMT_CREDIT')

All Other Cases (TARGET : 0)
count     282,686.00
mean      602,648.28
std       406,845.91
min        45,000.00
25%       270,000.00
50%       517,788.00
75%       810,000.00
max     4,050,000.00
Name: AMT_CREDIT, dtype: float64 

Clients with Payment Difficulties (TARGET : 1)
count      24,825.00
mean      557,778.53
std       346,433.24
min        45,000.00
25%       284,400.00
50%       497,520.00
75%       733,315.50
max     4,027,680.00
Name: AMT_CREDIT, dtype: float64

Amount of credit¶

The median amount of credit taken by Clients having payment difficulties' is almost same as 'All other cases'.
You could notice that the IQR of Target 1 is less than Target 0, which implies that more proportion of client's with payment difficulties take credit amounts centered around the median amount than 'All other cases'
Comparing the outliers of Target 1 and 0, we could see that Clients with payment difficulties favour lower amount of credit than all other cases. Conversly, people with higher amount of credit have are more probably to fall in the category 'All other cases' than 'Clients with Payment Difficulties'

#AMT_ANNUITY
num_univariate_analysis('AMT_ANNUITY','log')

All Other Cases (TARGET : 0)
count   282,674.00
mean     27,163.62
std      14,658.31
min       1,615.50
25%      16,456.50
50%      24,876.00
75%      34,749.00
max     258,025.50
Name: AMT_ANNUITY, dtype: float64 

Clients with Payment Difficulties (TARGET : 1)
count    24,825.00
mean     26,481.74
std      12,450.68
min       2,722.50
25%      17,361.00
50%      25,263.00
75%      32,976.00
max     149,211.00
Name: AMT_ANNUITY, dtype: float64

Amount of Annuity¶

There is not much difference in Amount of Annuity between Clients with Payment Difficulties and All other cases.
The median and mean are around the same.
However, Clients with Payment difficulties have annuity amounts which are less dispersed than All other cases.

#AMT_GOODS_PRICE
num_univariate_analysis('AMT_GOODS_PRICE')

All Other Cases (TARGET : 0)
count     282,429.00
mean      542,736.80
std       373,785.49
min        40,500.00
25%       238,500.00
50%       450,000.00
75%       688,500.00
max     4,050,000.00
Name: AMT_GOODS_PRICE, dtype: float64 

Clients with Payment Difficulties (TARGET : 1)
count      24,804.00
mean      488,972.41
std       311,636.50
min        45,000.00
25%       238,500.00
50%       450,000.00
75%       675,000.00
max     3,600,000.00
Name: AMT_GOODS_PRICE, dtype: float64

Goods Price Amount¶

With respect to AMT_GOODS_PRICE, there is no significant difference between Clients with Payment difficulties and All another Cases.

#REGION_RATING_CLIENT_W_CITY
cat_univariate_analysis('REGION_RATING_CLIENT_W_CITY')

TARGET 0
 [1, 2, 3]
Categories (3, int64): [1, 2, 3] 

TARGET 1
 [2, 3, 1]
Categories (3, int64): [2, 3, 1] 

All Other Cases (TARGET : 0)
+----+-------------------------------+---------+-----------+--------------------+----------------------+
|    |   REGION_RATING_CLIENT_W_CITY |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------------------+---------+-----------+--------------------+----------------------|
|  0 |                             2 |  211314 |  0.747522 |             211314 |             0.747522 |
|  1 |                             3 |   38859 |  0.137463 |             250173 |             0.884985 |
|  2 |                             1 |   32513 |  0.115015 |             282686 |             1        |
+----+-------------------------------+---------+-----------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+-------------------------------+---------+-----------+--------------------+----------------------+
|    |   REGION_RATING_CLIENT_W_CITY |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------------------+---------+-----------+--------------------+----------------------|
|  0 |                             2 |   18170 | 0.731923  |              18170 |             0.731923 |
|  1 |                             3 |    5001 | 0.20145   |              23171 |             0.933374 |
|  2 |                             1 |    1654 | 0.0666264 |              24825 |             1        |
+----+-------------------------------+---------+-----------+--------------------+----------------------+

Bank City Rating¶

The bank has rated cities into three categories [1,2,3]

Among 'All other cases' , 74.7% belong to category 2 cities followed by 13.7% in category 3 cities and 11.5% in category 1 cities.
73.1% of Clients with Payment difficulties belong to cities of category 2 followed by 20.1% belonging to category 3 and 6.6% in category 1

One can see that there are higher cases of Payment Difficulties in category 2 cities (20.1%) compared to All other cases (13.7%). Also , there are less cases of Payment difficulties in city category 1 (6.6%) compared to All other cases (11.5%)

#EXT_SOURCE_2
num_univariate_analysis('EXT_SOURCE_2')

All Other Cases (TARGET : 0)
count   282,078.00
mean          0.52
std           0.19
min           0.00
25%           0.41
50%           0.57
75%           0.67
max           0.85
Name: EXT_SOURCE_2, dtype: float64 

Clients with Payment Difficulties (TARGET : 1)
count   24,773.00
mean         0.41
std          0.21
min          0.00
25%          0.24
50%          0.44
75%          0.59
max          0.81
Name: EXT_SOURCE_2, dtype: float64

Client Credit Rating By External Source - 2¶

The median credit rating of clients with Payment difficulties (0.44) is less than those with all other cases (0.57)
We can say that there is higher probability of client with less credit rating to have payment difficulties.
Further, there are some outliers in 'All other cases'. These could be clients with no credit history and hence their credit ratings are marked as zero.

#EXT_SOURCE_3
num_univariate_analysis('EXT_SOURCE_3')

All Other Cases (TARGET : 0)
count   227,398.00
mean          0.52
std           0.19
min           0.00
25%           0.39
50%           0.55
75%           0.67
max           0.89
Name: EXT_SOURCE_3, dtype: float64 

Clients with Payment Difficulties (TARGET : 1)
count   19,148.00
mean         0.39
std          0.21
min          0.00
25%          0.22
50%          0.38
75%          0.55
max          0.90
Name: EXT_SOURCE_3, dtype: float64

Client Credit Rating By External Source - 3¶

The median credit rating of clients with Payment difficulties (0.38) is less than those with all other cases (0.55)
We can say that there is higher probability of client with less credit rating to have payment difficulties.

#CODE_GENDER
cat_univariate_analysis('CODE_GENDER')

TARGET 0
 [F, M, NaN]
Categories (2, object): [F, M] 

TARGET 1
 [M, F]
Categories (2, object): [M, F] 

All Other Cases (TARGET : 0)
+----+---------------+---------+-----------+--------------------+----------------------+
|    | CODE_GENDER   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------+---------+-----------+--------------------+----------------------|
|  0 | F             |  188278 |  0.666042 |             188278 |             0.666042 |
|  1 | M             |   94404 |  0.333958 |             282682 |             1        |
+----+---------------+---------+-----------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+---------------+---------+-----------+--------------------+----------------------+
|    | CODE_GENDER   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------+---------+-----------+--------------------+----------------------|
|  0 | F             |   14170 |  0.570796 |              14170 |             0.570796 |
|  1 | M             |   10655 |  0.429204 |              24825 |             1        |
+----+---------------+---------+-----------+--------------------+----------------------+

print(application_data['CODE_GENDER'].value_counts(),'\n')
print('Proportion of Females (Target 0) :', round(100*188278/202448,2))
print('Proportion of Females (Target 1) :', round(100*14170/202448,2))

F    202448
M    105059
Name: CODE_GENDER, dtype: int64 

Proportion of Females (Target 0) : 93.0
Proportion of Females (Target 1) : 7.0

print('Proportion of Males (Target 0) :', round(100*94404/105059,2))
print('Proportion of Males (Target 1) :', round(100*10655/105059,2))

Proportion of Males (Target 0) : 89.86
Proportion of Males (Target 1) : 10.14

Gender¶

There are very high proportion of Female applicants (65.8%)
Only 7% of total female applicants have payment difficulties compared to 10% of total male applications.
It is possible that female applicants are safer borrowers than male.

def cat_proportions(column_name) : 
    values = application_data[column_name].unique()
    values = values.to_numpy()
    values.tolist()
    data0 = application_data0[column_name].value_counts().to_dict()
    data1 = application_data1[column_name].value_counts().to_dict()
    data = application_data[column_name].value_counts().to_dict()
    

    for i in values : 
        if i in data0 and i in data1 and i in data : 
            print('Proportion of '+ str(i) + ' in Target 0 : ', round(data0[i]*100/data[i],2))
            print('Proportion of '+ str(i) + ' in Target 1 : ', round(data1[i]*100/data[i],2),'\n' )

# FLAG_OWN_CAR
cat_univariate_analysis('FLAG_OWN_CAR')
cat_proportions('FLAG_OWN_CAR')

TARGET 0
 [N, Y]
Categories (2, object): [N, Y] 

TARGET 1
 [N, Y]
Categories (2, object): [N, Y] 

All Other Cases (TARGET : 0)
+----+----------------+---------+-----------+--------------------+----------------------+
|    | FLAG_OWN_CAR   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------+---------+-----------+--------------------+----------------------|
|  0 | N              |  185675 |  0.656824 |             185675 |             0.656824 |
|  1 | Y              |   97011 |  0.343176 |             282686 |             1        |
+----+----------------+---------+-----------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+----------------+---------+-----------+--------------------+----------------------+
|    | FLAG_OWN_CAR   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------+---------+-----------+--------------------+----------------------|
|  0 | N              |   17249 |  0.694824 |              17249 |             0.694824 |
|  1 | Y              |    7576 |  0.305176 |              24825 |             1        |
+----+----------------+---------+-----------+--------------------+----------------------+ 

Proportion of N in Target 0 :  91.5
Proportion of N in Target 1 :  8.5 

Proportion of Y in Target 0 :  92.76
Proportion of Y in Target 1 :  7.24

application_data['FLAG_OWN_CAR'].value_counts(normalize=True)

N   0.66
Y   0.34
Name: FLAG_OWN_CAR, dtype: float64

Owning A Car¶

34% of total customers own a car
Among customers who own a car, 92.76% belong to 'All other cases' category and 7.24% have payment difficulties.

Among customers with Payment difficulties, 30.5% own a car
In 'All Other Cases', 34.3% own a car
Customers having cars lower probability of having Payment Difficulties.
- Customers owning cars in 'Clients with Payment difficulties ' : 8.5%
- Customers owning cars in 'Clients with Payment difficulties ' 7.24

# FLAG_OWN_REALTY
cat_univariate_analysis('FLAG_OWN_REALTY')
cat_proportions('FLAG_OWN_REALTY')

TARGET 0
 [N, Y]
Categories (2, object): [N, Y] 

TARGET 1
 [Y, N]
Categories (2, object): [Y, N] 

All Other Cases (TARGET : 0)
+----+-------------------+---------+-----------+--------------------+----------------------+
|    | FLAG_OWN_REALTY   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------+---------+-----------+--------------------+----------------------|
|  0 | Y                 |  196329 |  0.694513 |             196329 |             0.694513 |
|  1 | N                 |   86357 |  0.305487 |             282686 |             1        |
+----+-------------------+---------+-----------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+-------------------+---------+-----------+--------------------+----------------------+
|    | FLAG_OWN_REALTY   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------+---------+-----------+--------------------+----------------------|
|  0 | Y                 |   16983 |  0.684109 |              16983 |             0.684109 |
|  1 | N                 |    7842 |  0.315891 |              24825 |             1        |
+----+-------------------+---------+-----------+--------------------+----------------------+ 

Proportion of Y in Target 0 :  92.04
Proportion of Y in Target 1 :  7.96 

Proportion of N in Target 0 :  91.68
Proportion of N in Target 1 :  8.32

Owning Realty¶

In Overall population of clients, 69.36% of people own real estate.
In that group of people owning the real estate, 7.96% of people are having payment difficulties and the rest 92.04% belong to 'other cases'.

In Overall population of clients, 30.63% of people do not own Realestate.
In that group of people not owning the real estate, 8.32% of people are having payment difficulties and the rest 91.68% belong to 'other cases'.

On closer comparision of above observations, we can infer that people who do not own Real-estate seem more likely to face payment difficulties than compared to the group of people owning the Real-estate.

#NAME_EDUCATION_TYPE

cat_univariate_analysis('NAME_EDUCATION_TYPE')
plt.xticks(rotation=90)
cat_proportions('NAME_EDUCATION_TYPE')

TARGET 0
 [Higher education, Secondary / secondary special, Incomplete higher, Lower secondary, Academic degree]
Categories (5, object): [Higher education, Secondary / secondary special, Incomplete higher, Lower secondary, Academic degree] 

TARGET 1
 [Secondary / secondary special, Incomplete higher, Higher education, Lower secondary, Academic degree]
Categories (5, object): [Secondary / secondary special, Incomplete higher, Higher education, Lower secondary, Academic degree] 

All Other Cases (TARGET : 0)
+----+-------------------------------+---------+-------------+--------------------+----------------------+
|    | NAME_EDUCATION_TYPE           |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------------------+---------+-------------+--------------------+----------------------|
|  0 | Secondary / secondary special |  198867 | 0.703491    |             198867 |             0.703491 |
|  1 | Higher education              |   70854 | 0.250646    |             269721 |             0.954136 |
|  2 | Incomplete higher             |    9405 | 0.0332701   |             279126 |             0.987407 |
|  3 | Lower secondary               |    3399 | 0.0120239   |             282525 |             0.99943  |
|  4 | Academic degree               |     161 | 0.000569537 |             282686 |             1        |
+----+-------------------------------+---------+-------------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+-------------------------------+---------+-------------+--------------------+----------------------+
|    | NAME_EDUCATION_TYPE           |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------------------+---------+-------------+--------------------+----------------------|
|  0 | Secondary / secondary special |   19524 | 0.786465    |              19524 |             0.786465 |
|  1 | Higher education              |    4009 | 0.16149     |              23533 |             0.947956 |
|  2 | Incomplete higher             |     872 | 0.0351259   |              24405 |             0.983082 |
|  3 | Lower secondary               |     417 | 0.0167976   |              24822 |             0.999879 |
|  4 | Academic degree               |       3 | 0.000120846 |              24825 |             1        |
+----+-------------------------------+---------+-------------+--------------------+----------------------+ 

Proportion of Secondary / secondary special in Target 0 :  91.06
Proportion of Secondary / secondary special in Target 1 :  8.94 

Proportion of Higher education in Target 0 :  94.64
Proportion of Higher education in Target 1 :  5.36 

Proportion of Incomplete higher in Target 0 :  91.52
Proportion of Incomplete higher in Target 1 :  8.48 

Proportion of Lower secondary in Target 0 :  89.07
Proportion of Lower secondary in Target 1 :  10.93 

Proportion of Academic degree in Target 0 :  98.17
Proportion of Academic degree in Target 1 :  1.83

Highest Education¶

1.24% of the applicants are having education type of 'Lower secondary'
Out of that 1.24%, 10.93% of them are likely to face Payment difficulties.

71% of the applicants are having education type of 'Secodary/seconday special' .
Out of that 71%, 8.93% of them are likely to face Payment difficulties.

3.34% of the applicants are having education type of 'Incomplete higher'.
Out of that 3.34%, 8.48% of them are likely to face Payment difficulties.

24.34% of the applicants are having education type of 'Higher education'.
Out of that 24.34%, 5.36% of them are likely to face Payment difficulties.

0.05% of the applicants are having education type of 'Academic degree'.
Out of that 0.05%, 1.83% of them are likely to face Payment difficulties.

We can clearly infer from the above observations that higher the education, lesser is the risk of payment difficulties.

#NAME_FAMILY_STATUS

cat_univariate_analysis('NAME_FAMILY_STATUS')
plt.xticks(rotation=90)
cat_proportions('NAME_FAMILY_STATUS')

TARGET 0
 [Married, Single / not married, Civil marriage, Widow, Separated, NaN]
Categories (5, object): [Married, Single / not married, Civil marriage, Widow, Separated] 

TARGET 1
 [Single / not married, Widow, Married, Civil marriage, Separated]
Categories (5, object): [Single / not married, Widow, Married, Civil marriage, Separated] 

All Other Cases (TARGET : 0)
+----+----------------------+---------+-----------+--------------------+----------------------+
|    | NAME_FAMILY_STATUS   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------------+---------+-----------+--------------------+----------------------|
|  0 | Married              |  181582 | 0.64235   |             181582 |             0.64235  |
|  1 | Single / not married |   40987 | 0.144992  |             222569 |             0.787342 |
|  2 | Civil marriage       |   26814 | 0.094855  |             249383 |             0.882197 |
|  3 | Separated            |   18150 | 0.064206  |             267533 |             0.946403 |
|  4 | Widow                |   15151 | 0.0535969 |             282684 |             1        |
+----+----------------------+---------+-----------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+----------------------+---------+-----------+--------------------+----------------------+
|    | NAME_FAMILY_STATUS   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------------+---------+-----------+--------------------+----------------------|
|  0 | Married              |   14850 | 0.598187  |              14850 |             0.598187 |
|  1 | Single / not married |    4457 | 0.179537  |              19307 |             0.777724 |
|  2 | Civil marriage       |    2961 | 0.119275  |              22268 |             0.896999 |
|  3 | Separated            |    1620 | 0.0652568 |              23888 |             0.962256 |
|  4 | Widow                |     937 | 0.0377442 |              24825 |             1        |
+----+----------------------+---------+-----------+--------------------+----------------------+ 

Proportion of Single / not married in Target 0 :  90.19
Proportion of Single / not married in Target 1 :  9.81 

Proportion of Married in Target 0 :  92.44
Proportion of Married in Target 1 :  7.56 

Proportion of Civil marriage in Target 0 :  90.06
Proportion of Civil marriage in Target 1 :  9.94 

Proportion of Widow in Target 0 :  94.18
Proportion of Widow in Target 1 :  5.82 

Proportion of Separated in Target 0 :  91.81
Proportion of Separated in Target 1 :  8.19

Marital Status¶

5.2% of applicants have their family status as'Widow'
Out of that 5.2%, 5.82% of them are likely to face Payment difficulties.

63.87% of applicants have their family status as'Married'
Out of that 63.87%, 7.56% of them are likely to face Payment difficulties.

6.42% of applicants have their family status as'Seperated'
Out of that 6.42%, 8.19% of them are likely to face Payment difficulties.

14.77% of applicants have their family status as'Single / not married'
Out of that 14.77%, 9.81% of them are likely to face Payment difficulties.

9.68% of applicants have their family status as'Civil marriage'
Out of that 9.68%, 9.94% of them are likely to face Payment difficulties.

Based on the above obervations, we can infer that the applicants with NAME_FAMILY_STATUS as seperated or single or have higher risk of payment difficulties compared to that of Widow and Married.

For the purpose of this analysis, civil marriage and married are treated equal. They might be merged into the one category

#NAME_HOUSING_TYPE
cat_univariate_analysis('NAME_HOUSING_TYPE',figsize=(10,5))
plt.xticks(rotation=90)
cat_proportions('NAME_HOUSING_TYPE')

TARGET 0
 [House / apartment, Rented apartment, With parents, Municipal apartment, Office apartment, Co-op apartment]
Categories (6, object): [House / apartment, Rented apartment, With parents, Municipal apartment, Office apartment, Co-op apartment] 

TARGET 1
 [House / apartment, With parents, Municipal apartment, Rented apartment, Office apartment, Co-op apartment]
Categories (6, object): [House / apartment, With parents, Municipal apartment, Rented apartment, Office apartment, Co-op apartment] 

All Other Cases (TARGET : 0)
+----+---------------------+---------+------------+--------------------+----------------------+
|    | NAME_HOUSING_TYPE   |   Count |    Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------------+---------+------------+--------------------+----------------------|
|  0 | House / apartment   |  251596 | 0.890019   |             251596 |             0.890019 |
|  1 | With parents        |   13104 | 0.0463553  |             264700 |             0.936375 |
|  2 | Municipal apartment |   10228 | 0.0361815  |             274928 |             0.972556 |
|  3 | Rented apartment    |    4280 | 0.0151405  |             279208 |             0.987697 |
|  4 | Office apartment    |    2445 | 0.00864917 |             281653 |             0.996346 |
|  5 | Co-op apartment     |    1033 | 0.00365423 |             282686 |             1        |
+----+---------------------+---------+------------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+---------------------+---------+-----------+--------------------+----------------------+
|    | NAME_HOUSING_TYPE   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------------+---------+-----------+--------------------+----------------------|
|  0 | House / apartment   |   21272 | 0.856878  |              21272 |             0.856878 |
|  1 | With parents        |    1736 | 0.0699295 |              23008 |             0.926808 |
|  2 | Municipal apartment |     955 | 0.0384693 |              23963 |             0.965277 |
|  3 | Rented apartment    |     601 | 0.0242095 |              24564 |             0.989486 |
|  4 | Office apartment    |     172 | 0.0069285 |              24736 |             0.996415 |
|  5 | Co-op apartment     |      89 | 0.0035851 |              24825 |             1        |
+----+---------------------+---------+-----------+--------------------+----------------------+ 

Proportion of House / apartment in Target 0 :  92.2
Proportion of House / apartment in Target 1 :  7.8 

Proportion of Rented apartment in Target 0 :  87.69
Proportion of Rented apartment in Target 1 :  12.31 

Proportion of With parents in Target 0 :  88.3
Proportion of With parents in Target 1 :  11.7 

Proportion of Municipal apartment in Target 0 :  91.46
Proportion of Municipal apartment in Target 1 :  8.54 

Proportion of Office apartment in Target 0 :  93.43
Proportion of Office apartment in Target 1 :  6.57 

Proportion of Co-op apartment in Target 0 :  92.07
Proportion of Co-op apartment in Target 1 :  7.93

Housing Type¶

0.85% of applicants have their housing type as'Office apartment'
Out of that 0.85%, 6.57% of them are likely to face Payment difficulties.

88.73% of applicants have their housing type as'House/ apartment'
Out of that 88.73%, 7.8% of them are likely to face Payment difficulties.

0.36% of applicants have their housing type as'Co-op apartment'
Out of that 0.36%, 7.93% of them are likely to face Payment difficulties.

3.63% of applicants have their housing type as'Municipal apartment'
Out of that 3.63%, 8.54% of them are likely to face Payment difficulties.

4.82% of applicants have their housing type as'with parents'
Out of that 4.82%, 11.7% of them are likely to face Payment difficulties.

1.58% of applicants have their housing type as'Rented apartment'
Out of that 1.58%, 12.31% of them are likely to face Payment difficulties.

Based on the above obervations, we can infer that the applicants with NAME_HOUSING_TYPE as 'Municipal apartment' or 'with parents' or 'Rented apartment ' have higher risk of payment difficulties compared to that of other housing types.

#AMT_INCOME_CAT
cat_univariate_analysis('AMT_INCOME_CAT',figsize=(20,5))
plt.xticks(rotation=90)
cat_proportions('AMT_INCOME_CAT')

TARGET 0
 [250000-275000, 50000-75000, 125000-150000, 100000-125000, 75000-100000, ..., 325000-350000, 375000-400000, 400000-425000, 450000-475000, 475000-500000]
Length: 20
Categories (20, object): [25000-50000 < 50000-75000 < 75000-100000 < 100000-125000 ... 425000-450000 < 450000-475000 < 475000-500000 < 500000 and above] 

TARGET 1
 [200000-225000, 100000-125000, 125000-150000, 75000-100000, 300000-325000, ..., 225000-250000, 375000-400000, 325000-350000, 450000-475000, 475000-500000]
Length: 20
Categories (20, object): [25000-50000 < 50000-75000 < 75000-100000 < 100000-125000 ... 425000-450000 < 450000-475000 < 475000-500000 < 500000 and above] 

All Other Cases (TARGET : 0)
+----+------------------+---------+-------------+--------------------+----------------------+
|    | AMT_INCOME_CAT   |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+------------------+---------+-------------+--------------------+----------------------|
|  0 | 125000-150000    |   43837 | 0.155073    |              43837 |             0.155073 |
|  1 | 100000-125000    |   39860 | 0.141005    |              83697 |             0.296078 |
|  2 | 200000-225000    |   37595 | 0.132992    |             121292 |             0.42907  |
|  3 | 75000-100000     |   36450 | 0.128942    |             157742 |             0.558011 |
|  4 | 150000-175000    |   31685 | 0.112085    |             189427 |             0.670097 |
|  5 | 175000-200000    |   27190 | 0.0961845   |             216617 |             0.766281 |
|  6 | 50000-75000      |   17849 | 0.0631407   |             234466 |             0.829422 |
|  7 | 250000-275000    |   11846 | 0.0419052   |             246312 |             0.871327 |
|  8 | 225000-250000    |    6814 | 0.0241045   |             253126 |             0.895432 |
|  9 | 300000-325000    |    6342 | 0.0224348   |             259468 |             0.917866 |
| 10 | 350000-375000    |    4282 | 0.0151475   |             263750 |             0.933014 |
| 11 | 25000-50000      |    4174 | 0.0147655   |             267924 |             0.94778  |
| 12 | 275000-300000    |    4000 | 0.01415     |             271924 |             0.961929 |
| 13 | 425000-450000    |    2933 | 0.0103755   |             274857 |             0.972305 |
| 14 | 500000 and above |    2556 | 0.00904183  |             277413 |             0.981347 |
| 15 | 325000-350000    |    1987 | 0.007029    |             279400 |             0.988376 |
| 16 | 400000-425000    |    1696 | 0.00599959  |             281096 |             0.994375 |
| 17 | 375000-400000    |    1180 | 0.00417424  |             282276 |             0.99855  |
| 18 | 475000-500000    |     296 | 0.0010471   |             282572 |             0.999597 |
| 19 | 450000-475000    |     114 | 0.000403274 |             282686 |             1        |
+----+------------------+---------+-------------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+------------------+---------+-------------+--------------------+----------------------+
|    | AMT_INCOME_CAT   |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+------------------+---------+-------------+--------------------+----------------------|
|  0 | 125000-150000    |    4053 | 0.163263    |               4053 |             0.163263 |
|  1 | 100000-125000    |    3841 | 0.154723    |               7894 |             0.317986 |
|  2 | 75000-100000     |    3356 | 0.135186    |              11250 |             0.453172 |
|  3 | 200000-225000    |    3202 | 0.128983    |              14452 |             0.582155 |
|  4 | 150000-175000    |    2978 | 0.11996     |              17430 |             0.702115 |
|  5 | 175000-200000    |    2454 | 0.098852    |              19884 |             0.800967 |
|  6 | 50000-75000      |    1526 | 0.0614703   |              21410 |             0.862437 |
|  7 | 250000-275000    |     887 | 0.0357301   |              22297 |             0.898167 |
|  8 | 225000-250000    |     526 | 0.0211883   |              22823 |             0.919355 |
|  9 | 300000-325000    |     410 | 0.0165156   |              23233 |             0.935871 |
| 10 | 25000-50000      |     343 | 0.0138167   |              23576 |             0.949688 |
| 11 | 275000-300000    |     306 | 0.0123263   |              23882 |             0.962014 |
| 12 | 350000-375000    |     255 | 0.0102719   |              24137 |             0.972286 |
| 13 | 425000-450000    |     180 | 0.00725076  |              24317 |             0.979537 |
| 14 | 500000 and above |     146 | 0.00588117  |              24463 |             0.985418 |
| 15 | 325000-350000    |     135 | 0.00543807  |              24598 |             0.990856 |
| 16 | 400000-425000    |     115 | 0.00463243  |              24713 |             0.995488 |
| 17 | 375000-400000    |      85 | 0.00342397  |              24798 |             0.998912 |
| 18 | 475000-500000    |      16 | 0.000644512 |              24814 |             0.999557 |
| 19 | 450000-475000    |      11 | 0.000443102 |              24825 |             1        |
+----+------------------+---------+-------------+--------------------+----------------------+ 

Proportion of 200000-225000 in Target 0 :  92.15
Proportion of 200000-225000 in Target 1 :  7.85 

Proportion of 250000-275000 in Target 0 :  93.03
Proportion of 250000-275000 in Target 1 :  6.97 

Proportion of 50000-75000 in Target 0 :  92.12
Proportion of 50000-75000 in Target 1 :  7.88 

Proportion of 125000-150000 in Target 0 :  91.54
Proportion of 125000-150000 in Target 1 :  8.46 

Proportion of 100000-125000 in Target 0 :  91.21
Proportion of 100000-125000 in Target 1 :  8.79 

Proportion of 75000-100000 in Target 0 :  91.57
Proportion of 75000-100000 in Target 1 :  8.43 

Proportion of 150000-175000 in Target 0 :  91.41
Proportion of 150000-175000 in Target 1 :  8.59 

Proportion of 350000-375000 in Target 0 :  94.38
Proportion of 350000-375000 in Target 1 :  5.62 

Proportion of 25000-50000 in Target 0 :  92.41
Proportion of 25000-50000 in Target 1 :  7.59 

Proportion of 175000-200000 in Target 0 :  91.72
Proportion of 175000-200000 in Target 1 :  8.28 

Proportion of 425000-450000 in Target 0 :  94.22
Proportion of 425000-450000 in Target 1 :  5.78 

Proportion of 275000-300000 in Target 0 :  92.89
Proportion of 275000-300000 in Target 1 :  7.11 

Proportion of 500000 and above in Target 0 :  94.6
Proportion of 500000 and above in Target 1 :  5.4 

Proportion of 300000-325000 in Target 0 :  93.93
Proportion of 300000-325000 in Target 1 :  6.07 

Proportion of 225000-250000 in Target 0 :  92.83
Proportion of 225000-250000 in Target 1 :  7.17 

Proportion of 325000-350000 in Target 0 :  93.64
Proportion of 325000-350000 in Target 1 :  6.36 

Proportion of 375000-400000 in Target 0 :  93.28
Proportion of 375000-400000 in Target 1 :  6.72 

Proportion of 400000-425000 in Target 0 :  93.65
Proportion of 400000-425000 in Target 1 :  6.35 

Proportion of 450000-475000 in Target 0 :  91.2
Proportion of 450000-475000 in Target 1 :  8.8 

Proportion of 475000-500000 in Target 0 :  94.87
Proportion of 475000-500000 in Target 1 :  5.13

posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values

# We will calculate the pecentage of "Clients with Payment difficulties" for every income category.
t0=application_data0['AMT_INCOME_CAT'].value_counts()
t1=application_data1['AMT_INCOME_CAT'].value_counts()
prop = 100*t1/(t1+t0)
print(tabulate(pd.DataFrame(prop), headers=['AMT_INCOME_CAT','PERCENTAGE'], tablefmt='psql'))

+------------------+--------------+
| AMT_INCOME_CAT   |   PERCENTAGE |
|------------------+--------------|
| 0-25000          |    nan       |
| 25000-50000      |      7.59354 |
| 50000-75000      |      7.87613 |
| 75000-100000     |      8.43089 |
| 100000-125000    |      8.78927 |
| 125000-150000    |      8.46314 |
| 150000-175000    |      8.59129 |
| 175000-200000    |      8.27824 |
| 200000-225000    |      7.84862 |
| 225000-250000    |      7.16621 |
| 250000-275000    |      6.96615 |
| 275000-300000    |      7.10636 |
| 300000-325000    |      6.07227 |
| 325000-350000    |      6.36192 |
| 350000-375000    |      5.62045 |
| 375000-400000    |      6.71937 |
| 400000-425000    |      6.35008 |
| 425000-450000    |      5.7822  |
| 450000-475000    |      8.8     |
| 475000-500000    |      5.12821 |
| 500000 and above |      5.4034  |
+------------------+--------------+

Income Category¶

By the above table of 'AMT_INCOME_CAT' vs 'percentage of customers with payment difficulties' in that range, we can infer that as the income increases, applicant is less likely to face payment difficulties(except in exceptional cases, such as 450000 - 475000).

#CNT_CHILDREN
#Lets convert this into a categorical variable
application_data['CNT_CHILDREN']=application_data['CNT_CHILDREN'].astype('category')
application_data0['CNT_CHILDREN']=application_data0['CNT_CHILDREN'].astype('category')
application_data1['CNT_CHILDREN']=application_data1['CNT_CHILDREN'].astype('category')

#CNT_CHILDREN
cat_univariate_analysis('CNT_CHILDREN',figsize=(20,5))
column_name='CNT_CHILDREN'
cat_proportions('CNT_CHILDREN')
# values = application_data[column_name].unique()
# values=values.dropna()
# values = values.to_numpy()
# values.tolist()
# data0 = application_data0[column_name].value_counts().to_dict()
# data1 = application_data1[column_name].value_counts().to_dict()
# data = application_data[column_name].value_counts().to_dict()

# for i in values[:7]: 
#     print('Proportion of '+ str(i) + ' in Target 0 : ', round(data0[i]*100/data[i],2))
#     print('Proportion of '+ str(i) + ' in Target 1 : ', round(data1[i]*100/data[i],2),'\n' )

TARGET 0
 [0, 1, 2, 3, 4, ..., 8, 12, 10, 19, 14]
Length: 13
Categories (13, int64): [0, 1, 2, 3, ..., 12, 10, 19, 14] 

TARGET 1
 [0, 1, 3, 2, 4, 5, 9, 11, 6]
Categories (9, int64): [0, 1, 3, 2, ..., 5, 9, 11, 6] 

All Other Cases (TARGET : 0)
+----+----------------+---------+-------------+--------------------+----------------------+
|    |   CNT_CHILDREN |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------+---------+-------------+--------------------+----------------------|
|  0 |              0 |  198762 | 0.703119    |             198762 |             0.703119 |
|  1 |              1 |   55665 | 0.196915    |             254427 |             0.900034 |
|  2 |              2 |   24416 | 0.0863715   |             278843 |             0.986405 |
|  3 |              3 |    3359 | 0.0118824   |             282202 |             0.998288 |
|  4 |              4 |     374 | 0.00132302  |             282576 |             0.999611 |
|  5 |              5 |      77 | 0.000272387 |             282653 |             0.999883 |
|  6 |              6 |      15 | 5.30624e-05 |             282668 |             0.999936 |
|  7 |              7 |       7 | 2.47625e-05 |             282675 |             0.999961 |
|  8 |             14 |       3 | 1.06125e-05 |             282678 |             0.999972 |
|  9 |             19 |       2 | 7.07499e-06 |             282680 |             0.999979 |
| 10 |             12 |       2 | 7.07499e-06 |             282682 |             0.999986 |
| 11 |             10 |       2 | 7.07499e-06 |             282684 |             0.999993 |
| 12 |              8 |       2 | 7.07499e-06 |             282686 |             1        |
+----+----------------+---------+-------------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+----------------+---------+-------------+--------------------+----------------------+
|    |   CNT_CHILDREN |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------+---------+-------------+--------------------+----------------------|
|  0 |              0 |   16609 | 0.669043    |              16609 |             0.669043 |
|  1 |              1 |    5454 | 0.219698    |              22063 |             0.888741 |
|  2 |              2 |    2333 | 0.0939778   |              24396 |             0.982719 |
|  3 |              3 |     358 | 0.0144209   |              24754 |             0.99714  |
|  4 |              4 |      55 | 0.00221551  |              24809 |             0.999355 |
|  5 |              5 |       7 | 0.000281974 |              24816 |             0.999637 |
|  6 |              6 |       6 | 0.000241692 |              24822 |             0.999879 |
|  7 |              9 |       2 | 8.05639e-05 |              24824 |             0.99996  |
|  8 |             11 |       1 | 4.0282e-05  |              24825 |             1        |
+----+----------------+---------+-------------+--------------------+----------------------+ 

Proportion of 0 in Target 0 :  92.29
Proportion of 0 in Target 1 :  7.71 

Proportion of 1 in Target 0 :  91.08
Proportion of 1 in Target 1 :  8.92 

Proportion of 2 in Target 0 :  91.28
Proportion of 2 in Target 1 :  8.72 

Proportion of 3 in Target 0 :  90.37
Proportion of 3 in Target 1 :  9.63 

Proportion of 4 in Target 0 :  87.18
Proportion of 4 in Target 1 :  12.82 

Proportion of 5 in Target 0 :  91.67
Proportion of 5 in Target 1 :  8.33 

Proportion of 6 in Target 0 :  71.43
Proportion of 6 in Target 1 :  28.57

posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values

# Calculating the pecentage of "Clients with Payment difficulties" for every income category.
t0=application_data0['CNT_CHILDREN'].value_counts()
t1=application_data1['CNT_CHILDREN'].value_counts()
prop = 100*t1/(t1+t0)

print(tabulate(pd.DataFrame(prop), headers=['No of Children','Percentage'], tablefmt='psql'))

+------------------+--------------+
|   No of Children |   Percentage |
|------------------+--------------|
|                0 |      7.71181 |
|                1 |      8.92358 |
|                2 |      8.72182 |
|                3 |      9.63142 |
|                4 |     12.8205  |
|                5 |      8.33333 |
|                6 |     28.5714  |
|                7 |    nan       |
|                8 |    nan       |
|                9 |    nan       |
|               10 |    nan       |
|               11 |    nan       |
|               12 |    nan       |
|               14 |    nan       |
|               19 |    nan       |
+------------------+--------------+

Number of Children¶

By looking at the plot and above table, we can clearly infer that starting from 0 to 6, applicants having 0 children have lowest risk of facing payment dificulties. With the increase in number of children, the risk of payment difficulties also increase.
Above 6, there might be some data quality issues. We can even expect that this is error while taking the data from thr applicants, because the numericals above 6 are not practical in most of the cases.
If the CNT_CHILDREN > 6 is true, then there is 0 probabilty that the applicant might face the payment difficulties.

#CNT_FAM_MEMBERS
cat_univariate_analysis('CNT_FAM_MEMBERS',figsize=(20,5))
cat_proportions('CNT_FAM_MEMBERS')

TARGET 0
 [2.00, 1.00, 3.00, 4.00, 5.00, ..., 14.00, 12.00, 20.00, 15.00, 16.00]
Length: 16
Categories (15, float64): [2.00, 1.00, 3.00, 4.00, ..., 12.00, 20.00, 15.00, 16.00] 

TARGET 1
 [1.00, 2.00, 5.00, 3.00, 4.00, ..., 7.00, 10.00, 13.00, 8.00, 11.00]
Length: 11
Categories (11, float64): [1.00, 2.00, 5.00, 3.00, ..., 10.00, 13.00, 8.00, 11.00] 

All Other Cases (TARGET : 0)
+----+-------------------+---------+-------------+--------------------+----------------------+
|    |   CNT_FAM_MEMBERS |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------+---------+-------------+--------------------+----------------------|
|  0 |                 2 |  146348 | 0.517709    |             146348 |             0.517709 |
|  1 |                 1 |   62172 | 0.219935    |             208520 |             0.737643 |
|  2 |                 3 |   47993 | 0.169776    |             256513 |             0.90742  |
|  3 |                 4 |   22561 | 0.07981     |             279074 |             0.98723  |
|  4 |                 5 |    3151 | 0.0111467   |             282225 |             0.998376 |
|  5 |                 6 |     353 | 0.00124874  |             282578 |             0.999625 |
|  6 |                 7 |      75 | 0.000265314 |             282653 |             0.99989  |
|  7 |                 8 |      14 | 4.95253e-05 |             282667 |             0.99994  |
|  8 |                 9 |       6 | 2.12251e-05 |             282673 |             0.999961 |
|  9 |                20 |       2 | 7.07504e-06 |             282675 |             0.999968 |
| 10 |                16 |       2 | 7.07504e-06 |             282677 |             0.999975 |
| 11 |                14 |       2 | 7.07504e-06 |             282679 |             0.999982 |
| 12 |                12 |       2 | 7.07504e-06 |             282681 |             0.999989 |
| 13 |                10 |       2 | 7.07504e-06 |             282683 |             0.999996 |
| 14 |                15 |       1 | 3.53752e-06 |             282684 |             1        |
+----+-------------------+---------+-------------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+-------------------+---------+-------------+--------------------+----------------------+
|    |   CNT_FAM_MEMBERS |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------+---------+-------------+--------------------+----------------------|
|  0 |                 2 |   12009 | 0.483746    |              12009 |             0.483746 |
|  1 |                 1 |    5675 | 0.2286      |              17684 |             0.712346 |
|  2 |                 3 |    4608 | 0.185619    |              22292 |             0.897966 |
|  3 |                 4 |    2136 | 0.0860423   |              24428 |             0.984008 |
|  4 |                 5 |     327 | 0.0131722   |              24755 |             0.99718  |
|  5 |                 6 |      55 | 0.00221551  |              24810 |             0.999396 |
|  6 |                 8 |       6 | 0.000241692 |              24816 |             0.999637 |
|  7 |                 7 |       6 | 0.000241692 |              24822 |             0.999879 |
|  8 |                13 |       1 | 4.0282e-05  |              24823 |             0.999919 |
|  9 |                11 |       1 | 4.0282e-05  |              24824 |             0.99996  |
| 10 |                10 |       1 | 4.0282e-05  |              24825 |             1        |
+----+-------------------+---------+-------------+--------------------+----------------------+ 

Proportion of 1.0 in Target 0 :  91.64
Proportion of 1.0 in Target 1 :  8.36 

Proportion of 2.0 in Target 0 :  92.42
Proportion of 2.0 in Target 1 :  7.58 

Proportion of 3.0 in Target 0 :  91.24
Proportion of 3.0 in Target 1 :  8.76 

Proportion of 4.0 in Target 0 :  91.35
Proportion of 4.0 in Target 1 :  8.65 

Proportion of 5.0 in Target 0 :  90.6
Proportion of 5.0 in Target 1 :  9.4 

Proportion of 6.0 in Target 0 :  86.52
Proportion of 6.0 in Target 1 :  13.48 

Proportion of 9.0 in Target 0 :  100.0
Proportion of 9.0 in Target 1 :  0.0 

Proportion of 7.0 in Target 0 :  92.59
Proportion of 7.0 in Target 1 :  7.41 

Proportion of 8.0 in Target 0 :  70.0
Proportion of 8.0 in Target 1 :  30.0 

Proportion of 10.0 in Target 0 :  66.67
Proportion of 10.0 in Target 1 :  33.33 

Proportion of 13.0 in Target 0 :  0.0
Proportion of 13.0 in Target 1 :  100.0 

Proportion of 14.0 in Target 0 :  100.0
Proportion of 14.0 in Target 1 :  0.0 

Proportion of 12.0 in Target 0 :  100.0
Proportion of 12.0 in Target 1 :  0.0 

Proportion of 20.0 in Target 0 :  100.0
Proportion of 20.0 in Target 1 :  0.0 

Proportion of 15.0 in Target 0 :  100.0
Proportion of 15.0 in Target 1 :  0.0 

Proportion of 16.0 in Target 0 :  100.0
Proportion of 16.0 in Target 1 :  0.0 

Proportion of 11.0 in Target 0 :  0.0
Proportion of 11.0 in Target 1 :  100.0

posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values

Number of Family members¶

When we look at the plot and proportions above, we can clearly infer that increase in family members is directly proportional to increase in risk of facing payment difficulties.
Above 6, we cant infer because there were not enough applicant samples.

#DAYS_EMPLOYED.
num_univariate_analysis('DAYS_EMPLOYED','log')

All Other Cases (TARGET : 0)
count   282,686.00
mean     69,668.81
std     140,983.36
min           0.00
25%         967.00
50%       2,304.00
75%       6,074.00
max     365,243.00
Name: DAYS_EMPLOYED, dtype: float64 

Clients with Payment Difficulties (TARGET : 1)
count    24,825.00
mean     45,587.32
std     118,303.30
min           0.00
25%         677.00
50%       1,458.00
75%       3,280.00
max     365,243.00
Name: DAYS_EMPLOYED, dtype: float64

Number of Days Employed¶

If we observe the above box plot, we can clearly notice the outliers, so lets concentrate on median.
The applicants with payment difficulties have lesser median and IQR compared to "all other cases". So, greater the DAYS_EMPLOYED, lesser is the risk of payment difficulties.

#NAME_INCOME_TYPE
cat_univariate_analysis('NAME_INCOME_TYPE',figsize=(15,5))
plt.xticks(rotation=90)
cat_proportions('NAME_INCOME_TYPE')

posx and posy should be finite values
posx and posy should be finite values

TARGET 0
 [State servant, Working, Commercial associate, Pensioner, Unemployed, Student, Businessman, Maternity leave]
Categories (8, object): [State servant, Working, Commercial associate, Pensioner, Unemployed, Student, Businessman, Maternity leave] 

TARGET 1
 [Working, Commercial associate, Pensioner, State servant, Unemployed, Maternity leave]
Categories (6, object): [Working, Commercial associate, Pensioner, State servant, Unemployed, Maternity leave] 

All Other Cases (TARGET : 0)
+----+----------------------+---------+-------------+--------------------+----------------------+
|    | NAME_INCOME_TYPE     |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------------+---------+-------------+--------------------+----------------------|
|  0 | Working              |  143550 | 0.507807    |             143550 |             0.507807 |
|  1 | Commercial associate |   66257 | 0.234384    |             209807 |             0.742191 |
|  2 | Pensioner            |   52380 | 0.185294    |             262187 |             0.927485 |
|  3 | State servant        |   20454 | 0.0723559   |             282641 |             0.999841 |
|  4 | Student              |      18 | 6.36749e-05 |             282659 |             0.999904 |
|  5 | Unemployed           |      14 | 4.95249e-05 |             282673 |             0.999954 |
|  6 | Businessman          |      10 | 3.53749e-05 |             282683 |             0.999989 |
|  7 | Maternity leave      |       3 | 1.06125e-05 |             282686 |             1        |
+----+----------------------+---------+-------------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+----------------------+---------+-------------+--------------------+----------------------+
|    | NAME_INCOME_TYPE     |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------------+---------+-------------+--------------------+----------------------|
|  0 | Working              |   15224 | 0.613253    |              15224 |             0.613253 |
|  1 | Commercial associate |    5360 | 0.215911    |              20584 |             0.829164 |
|  2 | Pensioner            |    2982 | 0.120121    |              23566 |             0.949285 |
|  3 | State servant        |    1249 | 0.0503122   |              24815 |             0.999597 |
|  4 | Unemployed           |       8 | 0.000322256 |              24823 |             0.999919 |
|  5 | Maternity leave      |       2 | 8.05639e-05 |              24825 |             1        |
+----+----------------------+---------+-------------+--------------------+----------------------+ 

Proportion of Working in Target 0 :  90.41
Proportion of Working in Target 1 :  9.59 

Proportion of State servant in Target 0 :  94.25
Proportion of State servant in Target 1 :  5.75 

Proportion of Commercial associate in Target 0 :  92.52
Proportion of Commercial associate in Target 1 :  7.48 

Proportion of Pensioner in Target 0 :  94.61
Proportion of Pensioner in Target 1 :  5.39 

Proportion of Unemployed in Target 0 :  63.64
Proportion of Unemployed in Target 1 :  36.36 

Proportion of Student in Target 0 :  100.0
Proportion of Student in Target 1 :  0.0 

Proportion of Businessman in Target 0 :  100.0
Proportion of Businessman in Target 1 :  0.0 

Proportion of Maternity leave in Target 0 :  60.0
Proportion of Maternity leave in Target 1 :  40.0

posx and posy should be finite values
posx and posy should be finite values

Income type¶

52% of applicants have their NAME_INCOME_TYPE as'Working'.
Out of that 52%, 9.59% of them are likely to face Payment difficulties.

23% of applicants have their NAME_INCOME_TYPE as'Commercial associate'
Out of that 23%, 7.48% of them are likely to face Payment difficulties.

7% of applicants have their NAME_INCOME_TYPE as'State servant'
Out of that 7%, 5.75% of them are likely to face Payment difficulties.

18% of applicants have their NAME_INCOME_TYPE as'Pensioner'
Out of that 18%, 5.39% of them are likely to face Payment difficulties.

From the above Graph and proportions, we can infer that Pensioners and state servants have lower risk of facing payment difficulties compared to the Working and Commercial associates.

Unemployed,Student,Businessman and Maternity leave cannot be inferred because of the reason that they are not well populated.

#DAYS_LAST_PHONE_CHANGE
application_data['DAYS_LAST_PHONE_CHANGE'] = np.abs(application_data['DAYS_LAST_PHONE_CHANGE'])
application_data0['DAYS_LAST_PHONE_CHANGE'] = np.abs(application_data0['DAYS_LAST_PHONE_CHANGE'])
application_data1['DAYS_LAST_PHONE_CHANGE'] = np.abs(application_data1['DAYS_LAST_PHONE_CHANGE'])
num_univariate_analysis('DAYS_LAST_PHONE_CHANGE')

All Other Cases (TARGET : 0)
count   282,685.00
mean        976.39
std         831.21
min           0.00
25%         281.00
50%         776.00
75%       1,586.00
max       4,292.00
Name: DAYS_LAST_PHONE_CHANGE, dtype: float64 

Clients with Payment Difficulties (TARGET : 1)
count   24,825.00
mean       808.80
std        758.09
min          0.00
25%        194.00
50%        594.00
75%      1,301.00
max      4,070.00
Name: DAYS_LAST_PHONE_CHANGE, dtype: float64

Days Since Phone Number Was Changed.¶

From the above boxplot, we can clearly notice that the median for 'all other cases' is 776 days and the median for 'Clients with payment difficulties' is 594 days.
By this, we can clearly infer that,lesser the days since last phone change, greater is the risk of payment difficulties

#FLAG_CONT_MOBILE
cat_univariate_analysis('FLAG_CONT_MOBILE',figsize=(15,5))
cat_proportions('FLAG_CONT_MOBILE')

TARGET 0
 [1, 0]
Categories (2, int64): [1, 0] 

TARGET 1
 [1, 0]
Categories (2, int64): [1, 0] 

All Other Cases (TARGET : 0)
+----+--------------------+---------+------------+--------------------+----------------------+
|    |   FLAG_CONT_MOBILE |   Count |    Percent |   Cumulative Count |   Cumulative Percent |
|----+--------------------+---------+------------+--------------------+----------------------|
|  0 |                  1 |  282157 | 0.998129   |             282157 |             0.998129 |
|  1 |                  0 |     529 | 0.00187133 |             282686 |             1        |
+----+--------------------+---------+------------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+--------------------+---------+------------+--------------------+----------------------+
|    |   FLAG_CONT_MOBILE |   Count |    Percent |   Cumulative Count |   Cumulative Percent |
|----+--------------------+---------+------------+--------------------+----------------------|
|  0 |                  1 |   24780 | 0.998187   |              24780 |             0.998187 |
|  1 |                  0 |      45 | 0.00181269 |              24825 |             1        |
+----+--------------------+---------+------------+--------------------+----------------------+ 

Proportion of 1 in Target 0 :  91.93
Proportion of 1 in Target 1 :  8.07 

Proportion of 0 in Target 0 :  92.16
Proportion of 0 in Target 1 :  7.84

Flag Whether Mobile Phone Was Reachable¶

From the above plot, we can observe that 0.18% applicants mobile numbers were not reachable.
From the people whose mobiles numbers were reachable, 8.07% of them are likely to face Payment difficulties, whereas the people whose mobile numbers were'nt reachable, 7.84% of them are likely to face Payment difficulties.

From these we cannot infer a relationship between FLAG_CONT_MOBILE and TARGET variable.

#FLAG_EMAIL.
cat_univariate_analysis('FLAG_EMAIL',figsize=(15,5))
cat_proportions('FLAG_EMAIL')

TARGET 0
 [0, 1]
Categories (2, int64): [0, 1] 

TARGET 1
 [0, 1]
Categories (2, int64): [0, 1] 

All Other Cases (TARGET : 0)
+----+--------------+---------+-----------+--------------------+----------------------+
|    |   FLAG_EMAIL |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+--------------+---------+-----------+--------------------+----------------------|
|  0 |            0 |  266618 | 0.94316   |             266618 |              0.94316 |
|  1 |            1 |   16068 | 0.0568405 |             282686 |              1       |
+----+--------------+---------+-----------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+--------------+---------+-----------+--------------------+----------------------+
|    |   FLAG_EMAIL |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+--------------+---------+-----------+--------------------+----------------------|
|  0 |            0 |   23451 | 0.944653  |              23451 |             0.944653 |
|  1 |            1 |    1374 | 0.0553474 |              24825 |             1        |
+----+--------------+---------+-----------+--------------------+----------------------+ 

Proportion of 0 in Target 0 :  91.92
Proportion of 0 in Target 1 :  8.08 

Proportion of 1 in Target 0 :  92.12
Proportion of 1 in Target 1 :  7.88

Whether E-mail Was Provided¶

From the above plot and proportions, we can clearly see that 94.3% of applicants have not given the E-mail address.
Out of 94.3%, 8.08% are likely to have payment difficulties.

From the above plot and proportions, we can clearly see that 5.7% of applicants have provided the E-mail address.
Out of 5.7%, 7.88% are likely to have paymenr difficulties.

From these we cannot infer a relationship between FLAG_EMAIL and TARGET variable because they stand in similar grounds.

#FLAG_MOBIL
cat_univariate_analysis('FLAG_MOBIL',figsize=(15,5))
cat_proportions('FLAG_MOBIL')

TARGET 0
 [1, 0]
Categories (2, int64): [1, 0] 

TARGET 1
 [1]
Categories (1, int64): [1] 

All Other Cases (TARGET : 0)

posx and posy should be finite values
posx and posy should be finite values

+----+--------------+---------+-------------+--------------------+----------------------+
|    |   FLAG_MOBIL |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+--------------+---------+-------------+--------------------+----------------------|
|  0 |            1 |  282685 | 0.999996    |             282685 |             0.999996 |
|  1 |            0 |       1 | 3.53749e-06 |             282686 |             1        |
+----+--------------+---------+-------------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+--------------+---------+-----------+--------------------+----------------------+
|    |   FLAG_MOBIL |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+--------------+---------+-----------+--------------------+----------------------|
|  0 |            1 |   24825 |         1 |              24825 |                    1 |
+----+--------------+---------+-----------+--------------------+----------------------+ 

Proportion of 1 in Target 0 :  91.93
Proportion of 1 in Target 1 :  8.07 

Proportion of 0 in Target 0 :  100.0
Proportion of 0 in Target 1 :  0.0

Mobile Number¶

99.9% of them have given their mobile numbers. Out of those 99%, 8.07% have payment difficulties.
There is only one applicant who failed to provide the mobile number, since this number is small and negligible, we cannot infer using this data.

#LIVE_CITY_NOT_WORK_CITY
cat_univariate_analysis('LIVE_CITY_NOT_WORK_CITY',figsize=(15,5))
cat_proportions('LIVE_CITY_NOT_WORK_CITY')

TARGET 0
 [0, 1]
Categories (2, int64): [0, 1] 

TARGET 1
 [0, 1]
Categories (2, int64): [0, 1] 

All Other Cases (TARGET : 0)
+----+---------------------------+---------+-----------+--------------------+----------------------+
|    |   LIVE_CITY_NOT_WORK_CITY |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------------------+---------+-----------+--------------------+----------------------|
|  0 |                         0 |  232974 |  0.824144 |             232974 |             0.824144 |
|  1 |                         1 |   49712 |  0.175856 |             282686 |             1        |
+----+---------------------------+---------+-----------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+---------------------------+---------+-----------+--------------------+----------------------+
|    |   LIVE_CITY_NOT_WORK_CITY |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------------------+---------+-----------+--------------------+----------------------|
|  0 |                         0 |   19322 |  0.778328 |              19322 |             0.778328 |
|  1 |                         1 |    5503 |  0.221672 |              24825 |             1        |
+----+---------------------------+---------+-----------+--------------------+----------------------+ 

Proportion of 0 in Target 0 :  92.34
Proportion of 0 in Target 1 :  7.66 

Proportion of 1 in Target 0 :  90.03
Proportion of 1 in Target 1 :  9.97

Contact Address vs Work Address¶

From the above plot and proportions, we can clearly notice that 82% of them live and work in same city.
Out of that 82%, 7.66% of applicants likely to have payment difficulties.

From the above plot and proportions, we can clearly notice that 17% of them donot live and work in same city.
Out of that 17%, 9.97% of applicants have payment difficulties.

By this, we can clearly infer that, applicants that live and work in same city are at less risk of payment difficulties.

#REG_CITY_NOT_LIVE_CITY.
cat_univariate_analysis('REG_CITY_NOT_LIVE_CITY',figsize=(15,5))
cat_proportions('REG_CITY_NOT_LIVE_CITY')

TARGET 0
 [0, 1]
Categories (2, int64): [0, 1] 

TARGET 1
 [0, 1]
Categories (2, int64): [0, 1] 

All Other Cases (TARGET : 0)
+----+--------------------------+---------+-----------+--------------------+----------------------+
|    |   REG_CITY_NOT_LIVE_CITY |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+--------------------------+---------+-----------+--------------------+----------------------|
|  0 |                        0 |  261586 | 0.925359  |             261586 |             0.925359 |
|  1 |                        1 |   21100 | 0.0746411 |             282686 |             1        |
+----+--------------------------+---------+-----------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+--------------------------+---------+-----------+--------------------+----------------------+
|    |   REG_CITY_NOT_LIVE_CITY |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+--------------------------+---------+-----------+--------------------+----------------------|
|  0 |                        0 |   21886 |  0.881611 |              21886 |             0.881611 |
|  1 |                        1 |    2939 |  0.118389 |              24825 |             1        |
+----+--------------------------+---------+-----------+--------------------+----------------------+ 

Proportion of 0 in Target 0 :  92.28
Proportion of 0 in Target 1 :  7.72 

Proportion of 1 in Target 0 :  87.77
Proportion of 1 in Target 1 :  12.23

Permanent Address vs Contact Address¶

From the above plot and proportions,92% of the applicant's permanent and contact address are same.
Out of that 92%, 7.72% of applicants likely to have payment difficulties.

From the above plot and proportions, 8% of the applicant's permanent and contact address are not same.
Out of that 8%, 12.23% of applicants likely to have payment difficulties.

By this, we can infer that, applicants that same permanent and contact address are likely to have less risk of payment difficulties.

#REG_CITY_NOT_WORK_CITY.
cat_univariate_analysis('REG_CITY_NOT_WORK_CITY',figsize=(15,5))
cat_proportions('REG_CITY_NOT_WORK_CITY')

TARGET 0
 [0, 1]
Categories (2, int64): [0, 1] 

TARGET 1
 [0, 1]
Categories (2, int64): [0, 1] 

All Other Cases (TARGET : 0)
+----+--------------------------+---------+-----------+--------------------+----------------------+
|    |   REG_CITY_NOT_WORK_CITY |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+--------------------------+---------+-----------+--------------------+----------------------|
|  0 |                        0 |  219339 |   0.77591 |             219339 |              0.77591 |
|  1 |                        1 |   63347 |   0.22409 |             282686 |              1       |
+----+--------------------------+---------+-----------+--------------------+----------------------+ 

Clients with Payment Difficulties (TARGET : 1)
+----+--------------------------+---------+-----------+--------------------+----------------------+
|    |   REG_CITY_NOT_WORK_CITY |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+--------------------------+---------+-----------+--------------------+----------------------|
|  0 |                        0 |   17305 |   0.69708 |              17305 |              0.69708 |
|  1 |                        1 |    7520 |   0.30292 |              24825 |              1       |
+----+--------------------------+---------+-----------+--------------------+----------------------+ 

Proportion of 0 in Target 0 :  92.69
Proportion of 0 in Target 1 :  7.31 

Proportion of 1 in Target 0 :  89.39
Proportion of 1 in Target 1 :  10.61

Permanent Address vs Work Address¶

From the above plot and proportions, we can notice that 77% of applicants permanent address and work address are same. Coming to other side, 23% of applicants permanent address and work address are not same.
Looking at the proportions of "Clients with payment difficulties" on both sides, we can infer that the applicants whose permanent adress and work adress matches are less likely to have payment difficulties.

Bivariate Analysis¶

#AMT_ANNUITY, AMT_INCOME_TOTAL vs TARGET
column_names = ['AMT_INCOME_CAT','AMT_ANNUITY']
plt.figure(figsize=(16,8))
sns.barplot(x = column_names[0],y = column_names[1],hue='TARGET',data = application_data)
plt.title(column_names[0] + ' vs '+ column_names[1] + ' vs Target')
plt.xticks(rotation=90);

plt.figure(figsize=(16,8))
# sns.catplot(x = column_names[0],y =column_names[1],hue='TARGET',data = application_data, kind='violin',height=8,aspect=4);
sns.violinplot(x = 'AMT_INCOME_CAT',y='AMT_ANNUITY',hue='TARGET',data = application_data,split=True, inner="quartile")
plt.title(column_names[0] + ' vs '+ column_names[1] + ' vs Target')
plt.xticks(rotation=90);

Income and Annuity¶

Both plots show a gradual increase in the amount of annuity with increase in income.
Also, as the income increases there are higher number of outliers i.e the propensity to take on very high annuities compared to most customers in the same income range.
It can be seen that in low income ranges, the median no of "Clients with Payment difficulties' is slightly higher.
And the in high income ranges,the median no of "Clients with Payment difficulties' is slightly lower.

Income Range and Annuity have a mild correlation with Payment difficulties

#AMT_CREDIT, AMT_GOODS_PRICE vs TARGET
column_names = ['AMT_CREDIT', 'AMT_GOODS_PRICE']
fig,ax = plt.subplots(1,2)
fig.set_figheight(8)
fig.set_figwidth(15)

ax[0].set(title = column_names[0] + ' vs '+ column_names[1] + ' vs Target');
sns.scatterplot(x=column_names[0],y=column_names[1],hue='TARGET',data=application_data, alpha=0.8,ax=ax[0])
plt.xticks(rotation=90);

plt.title(column_names[0] + ' vs '+ column_names[1] + ' vs Target for AMT_GOODS_PRICE < 1000000');
sns.scatterplot(x=column_names[0],y=column_names[1],hue='TARGET',data=application_data[application_data['AMT_GOODS_PRICE'] <=1000000], alpha=0.8,ax=ax[1])
plt.xticks(rotation=90);

Credit Amount and Price of Credit Goods¶

There is an overall linear relationship between Credit Amount and Price of Credit Goods.
Notice that more clients with payment difficulties are the ones who have borrowed lesser money for lesser priced goods.
And most outliers do not have payment difficulties.

Lower Credit Amount and Lower Price of Credit Goods have higher cases of Payment difficulties

#NAME_CONTRACT_TYPE vs REGION_RATING_CLIENT_W_CITY vs TARGET

column_names = ['NAME_CONTRACT_TYPE','REGION_RATING_CLIENT_W_CITY']
application_data.groupby(column_names)['TARGET'].value_counts(normalize=True)

NAME_CONTRACT_TYPE  REGION_RATING_CLIENT_W_CITY  TARGET
Cash loans          1                            0        0.95
                                                 1        0.05
                    2                            0        0.92
                                                 1        0.08
                    3                            0        0.88
                                                 1        0.12
Revolving loans     1                            0        0.97
                                                 1        0.03
                    2                            0        0.94
                                                 1        0.06
                    3                            0        0.93
                                                 1        0.07
Name: TARGET, dtype: float64

sns.catplot(x='REGION_RATING_CLIENT_W_CITY', hue='TARGET', col="NAME_CONTRACT_TYPE", kind="count", data=application_data);

Type of Loan and City Rating¶

In cash loans, Category 3 Cities have 7% more probability of payment difficulties than Category 1 Cities.
In Revolving loans, Category 3 Cities have 4% more probability of payment difficulties than Category 1 Cities.

Although this just confirms correlation between the City Rating and default, now we know that city rating has a higher impact on Cash Loans than Revolving Loans

# EXT_SOURCE_2 vs EXT_SOURCE_3 vs TARGET
creditScores = ['EXT_SOURCE_2','EXT_SOURCE_3']
plt.figure(figsize=[8,8])
sns.scatterplot(x=creditScores[0],y = creditScores[1], hue='TARGET',data = application_data, alpha=0.5);
plt.title(creditScores[0] + ' vs '+ creditScores[1]+ ' vs '+ 'TARGET');

Credit Rating¶

Checking Credit Ratings from two different sources confirms that trend of customers with lower credit ratings have a higher default rate.

#FLAG_OWN_CAR vs FLAG_OWN_REALTY vs TARGET
application_data.groupby(['FLAG_OWN_CAR','FLAG_OWN_REALTY'])['TARGET'].value_counts(normalize=True)

FLAG_OWN_CAR  FLAG_OWN_REALTY  TARGET
N             N                0        0.91
                               1        0.09
              Y                0        0.92
                               1        0.08
Y             N                0        0.93
                               1        0.07
              Y                0        0.93
                               1        0.07
Name: TARGET, dtype: float64

Owning A Car & Realty¶

From the above table , we see that customers with no car or realty have a high probability of default (9%), followed by customers who don't have a car but own realty (8%)
Since there's not much difference posed by owning a car or realty vs not owning them, we cannot infer any trend in default using these variables

#CODE_GENDER vs NAME_FAMILY_STATUS vs TARGET

# For the sake of this analysis since Civil marriage and Married are same, clubbing them into married category
application_data['NAME_FAMILY_STATUS'].replace('Civil marriage','Married', inplace=True)

ax = sns.catplot(x='NAME_FAMILY_STATUS', hue='TARGET', col="CODE_GENDER", kind="count", data=application_data, height = 5, aspect=2);

# classification over both categories
pd.DataFrame(application_data.groupby(['CODE_GENDER','NAME_FAMILY_STATUS'])['TARGET'].value_counts(normalize=True))

application_data.groupby(['CODE_GENDER','NAME_FAMILY_STATUS'])['TARGET'].value_counts(normalize=True)\
.unstack()\
   .plot( 
    layout=(2,2),
    figsize=(8,6), kind='barh', stacked=True);

Gender & Marital Status¶

Looking at the above values, customers most attractive to the bank by likelihood of default, are
'Female Widows' > 'Married Females' = 'Separated Females' > 'Single Females' > 'Married Males' > 'Male Widowers' > 'Single Males' = 'Separated Males'

The bank may focus their marketing campaigns on converting applications of Female Widows, Married Females, Separated Females
May note that the dataset is skewed towards Females than Males

# NAME_HOUSING_TYPE vs NAME_INCOME_TYPE vs TARGET
income_housing = pd.DataFrame(application_data.groupby(['NAME_HOUSING_TYPE','NAME_INCOME_TYPE'])['TARGET'].value_counts(normalize=True))
income_housing.columns = ['Normalized Count']
income_housing

application_data.groupby(['NAME_HOUSING_TYPE','NAME_INCOME_TYPE'])['TARGET'].value_counts(normalize=True)\
.unstack()\
   .plot( 
    layout=(2,2),
    figsize=(8,6), kind='barh', stacked=True);

income_housing[np.in1d(income_housing.index.get_level_values(2), 1)].sort_values(by='Normalized Count', ascending=True)

Housing Type & Income Type¶

The above tables list the categories in the order of least likelihood of Payment difficulties
The top 5 categories are :
- State servant living in Co-op apartment
- State servant living in Office apartment
- Pensioner living in Co-op apartment
- Pensioner living With parents
- Pensioner living in House / apartment

The bank may focus their marketing campaigns on converting applications of the above categories
Pensioner living With parents category doesn't seem right and may be further checked if it's an anomaly

#FLAG_EMAIL vs FLAG_MOBIL vs TARGET
pd.DataFrame(application_data.groupby(['FLAG_EMAIL','FLAG_MOBIL'])['TARGET'].value_counts(normalize=True))\
.unstack()\
   .plot( 
    layout=(2,2),
    figsize=(8,6), kind='barh', stacked=True);

pd.DataFrame(application_data.groupby(['FLAG_EMAIL','FLAG_MOBIL'])['TARGET'].value_counts())

Email & Mobile Information¶

8% of customers who have not provided Email but have provided a Mobile Number have had payment difficulties
Similarly 8% of customers who provided both Email and Mobile Number have had payment difficulties

Hence we cannot infer any conclusive trend in payment difficulties using these variables

# DAYS_LAST_PHONE_CHANGE vs FLAG_CONT_MOBILE vs TARGET

sns.barplot(x = 'FLAG_CONT_MOBILE', y= 'DAYS_LAST_PHONE_CHANGE', hue='TARGET',data = application_data)
plt.title("FLAG_CONT_MOBILE vs DAYS_LAST_PHONE_CHANGE vs TARGET")

Text(0.5, 1.0, 'FLAG_CONT_MOBILE vs DAYS_LAST_PHONE_CHANGE vs TARGET')

sns.violinplot(x = 'FLAG_CONT_MOBILE', y= 'DAYS_LAST_PHONE_CHANGE', hue='TARGET', split=True, data = application_data,inner="quartile", height=5, aspect=1)
plt.title("FLAG_CONT_MOBILE vs DAYS_LAST_PHONE_CHANGE vs TARGET")

Text(0.5, 1.0, 'FLAG_CONT_MOBILE vs DAYS_LAST_PHONE_CHANGE vs TARGET')

Days Since Phone Number Changed & Whether was Phone was Reachable¶

Notice that the number of clients with reachable phones is higher than unreachable phones. (~900 days vs ~350 days)
Among clients with reachable phones, Clients with Payment difficulties have median newer phone numbers than All other cases. (See violin plot for FLAG_CONT_MOBILE = 1 )
Among clients with unreachable phones, the difference is less (~50 days vs ~ 200 days). In fact, the median age of phone number is same for defaulters and others.

We could conclude that clients with reachable and older phone numbers are better borrowers than other categories.When the phone is unreachable, last date of phone number change doesn't matter.

application_data['DAYS_EMPLOYED'].plot.hist()

<matplotlib.axes._subplots.AxesSubplot at 0x7fd5ecc96090>

#DAYS_EMPLOYED & NAME_EDUCATION_TYPE vs TARGET
fig,ax = plt.subplots(1,2)
fig.set_figheight(5)
fig.set_figwidth(15)

fig.suptitle(t="DAYS_EMPLOYED & NAME_EDUCATION_TYPE vs TARGET");
ax[0].tick_params(axis='x',rotation=90)
sns.barplot(x = 'NAME_EDUCATION_TYPE', y= 'DAYS_EMPLOYED', hue='TARGET',data = application_data,ax=ax[0])

ax[1].tick_params(axis='x',rotation=90)
# ax[1].set_yscale('log')
sns.violinplot(x = 'NAME_EDUCATION_TYPE', y= 'DAYS_EMPLOYED', hue='TARGET', split=True, data = application_data,inner="quartile", height=5, aspect=1, ax=ax[1])
#sns.swarmplot(x = 'NAME_EDUCATION_TYPE', y= 'DAYS_EMPLOYED', hue='TARGET', data = application_data, ax=ax[1])

<matplotlib.axes._subplots.AxesSubplot at 0x7fd608cbf7d0>

Employment Period & Education¶

The barplot between Days Employed and Education clearly shows that the clients holding an Academic Degree and having held employment for a long time have very low probability of default (difference between Target 0 , 1 of ~60000)
Among clients with Secondary education, there are two groups, ones who are recently employed and others with long term employment. The recently employed clients in this category have equal probability to default or not. But the clients long term employment have lower probability to default.

The bank may place emphasis on education and employment period. Higher education and long term employment are good indicators of no default.Further clients with secondary education are far more attractive borrowers when they have held long term employment than otherwise

#CNT_CHILDREN & CNT_FAM_MEMBERS vs TARGET
subset = application_data[['CNT_CHILDREN','CNT_FAM_MEMBERS','TARGET']] 
subset = subset.dropna().astype('int')
childrenvsFamily = pd.pivot_table(index='CNT_CHILDREN', columns = 'CNT_FAM_MEMBERS',aggfunc=np.mean,data=subset)
childrenvsFamily

Children & Family Members¶

These variables are chosen to verify the amplified liabilities of clients having both more children and more family members.
The most attractive sub-categories are the clients with 1 or 2 family members and 0 children.
This is followed by clients having 1 or 2 children with 3 or 4 family members.

Note that more than 5 children is deemed an error. This needs further analysis
Small families with less children has low risk of default

# NAME_EDUCATION_TYPE vs AMT_INCOME_TOTAL vs TARGET 

plt.figure(figsize=(10,8))
sns.violinplot(x = 'NAME_EDUCATION_TYPE', y= 'AMT_INCOME_TOTAL', hue='TARGET', split=True, data = application_data,inner="quartile", height=5, aspect=3)
plt.yscale('log')
plt.xticks(rotation=90);

Type of Income vs Amount of Income¶

For the same average income across income types, clients with Secondary Education have the lowest probability of default.
Among clients with an Academic Degree, median income of clients with payment difficulties is higher than all other cases.
Among clients with Higher Education, Incomplete Higher Education and Lower Secondary education, the median income has no correlation with risk of default.

#'NAME_INCOME_TYPE vs NAME_EDUCATION_TYPE'
subset = application_data[['NAME_INCOME_TYPE','NAME_EDUCATION_TYPE','TARGET']] 
subset = subset.dropna()
subset['TARGET'] = subset['TARGET'].astype('int')
# pivot table for percentage of default for education type vs income type
incomeTypevsEdu = pd.pivot_table(index='NAME_EDUCATION_TYPE', columns = 'NAME_INCOME_TYPE',aggfunc=[np.mean],data=subset)
incomeTypevsEdu

# pivot table of the count of defaults for education type vs income type
pd.pivot_table(index='NAME_EDUCATION_TYPE', columns = 'NAME_INCOME_TYPE',aggfunc='count',data=subset)

Education Type & Income Type¶

From the above pivot tables we could say that clients with Academic Degree have the least default risk.
The highest volume of applications are from clients with Secondary education. Among them, the least default risk is posed by Pensioners.
The next highest in terms of volume of applications is from clients with Higher Education. Among these, the least risk is posed by Pensioners and State Servants.
Note that there's not enough application from Businessmen with Higher Education to conclusively say they have very low risk of default.
Highest default risk is posed by Unemployed clients. Among them unemployed clients with Secondary Education have the higest risk of default.

sns.violinplot(x = 'NAME_INCOME_TYPE', y= 'AMT_INCOME_TOTAL', hue='TARGET', split=True, data = application_data,inner="quartile", height=5, aspect=3)
plt.yscale('log')
plt.xticks(rotation=90);

Correlation Analysis¶

Correlation For Target : 0¶

# function to correlate variables
def correlation(dataframe) : 
    cor0=dataframe[columnsForAnalysis].corr()
    type(cor0)
    cor0.where(np.triu(np.ones(cor0.shape),k=1).astype(np.bool))
    cor0=cor0.unstack().reset_index()
    cor0.columns=['VAR1','VAR2','CORR']
    cor0.dropna(subset=['CORR'], inplace=True)
    cor0.CORR=round(cor0['CORR'],2)
    cor0.CORR=cor0.CORR.abs()
    cor0.sort_values(by=['CORR'],ascending=False)
    cor0=cor0[~(cor0['VAR1']==cor0['VAR2'])]
    return pd.DataFrame(cor0.sort_values(by=['CORR'],ascending=False))

# Correlation for Target : 0 
# Absolute values are reported 
pd.set_option('precision', 2)
cor_0 = correlation(application_data0)
cor_0.style.background_gradient(cmap='GnBu').hide_index()

Top 10 correlations for Target : 0 are¶

AMT_GOODS_PRICE & AMT_CREDIT
AMT_GOODS_PRICE & AMT_ANNUITY
AMT_ANNUITY & AMT_CREDIT
AGE_YEARS & DAYS_EMPLOYED
AMT_ANNUITY & AMT_INCOME_TOTAL
AMT_GOODS_PRICE & AMT_INCOME_TOTAL
AMT_INCOME_TOTAL & AMT_CREDIT
AGE_YEARS & EXT_SOURCE_3
DAYS_LAST_PHONE_CHANGE & EXT_SOURCE_2
DAYS_EMPLOYED & AMT_INCOME_TOTAL

Correlation For Target : 1¶

# Correlation for Target : 1 
# Absolute values are reported 
pd.set_option('precision', 2)
cor_1 = correlation(application_data1)
cor_1.style.background_gradient(cmap='GnBu').hide_index()

Top 10 correlations for Target 1 are :¶

AMT_GOODS_PRICE & AMT_CREDIT
AMT_CREDIT & AMT_ANNUITY
AMT_ANNUITY & AMT_GOODS_PRICE
AGE_YEARS & DAYS_EMPLOYED
EXT_SOURCE_2 & DAYS_LAST_PHONE_CHANGE
EXT_SOURCE_3 & AGE_YEARS
AGE_YEARS & AMT_GOODS_PRICE
AGE_YEARS & AMT_CREDIT
AMT_GOODS_PRICE & EXT_SOURCE_2
AMT_CREDIT & EXT_SOURCE_2

Merged data analysis can be found here

	DAYS_BIRTH	DAYS_EMPLOYED	DAYS_REGISTRATION	DAYS_ID_PUBLISH
0	-9461	-637	-3,648.00	-2120
1	-16765	-1188	-1,186.00	-291
2	-19046	-225	-4,260.00	-2531
3	-19005	-3039	-9,833.00	-2437
4	-19932	-3038	-4,311.00	-3458

	DAYS_BIRTH	DAYS_EMPLOYED	DAYS_REGISTRATION	DAYS_ID_PUBLISH
0	9,461.00	637.00	3,648.00	2,120.00
1	16,765.00	1,188.00	1,186.00	291.00
2	19,046.00	225.00	4,260.00	2,531.00
3	19,005.00	3,039.00	9,833.00	2,437.00
4	19,932.00	3,038.00	4,311.00	3,458.00

	SK_ID_CURR	TARGET	CNT_CHILDREN	AMT_INCOME_TOTAL	AMT_CREDIT	AMT_ANNUITY	AMT_GOODS_PRICE	REGION_POPULATION_RELATIVE	DAYS_BIRTH	DAYS_EMPLOYED	DAYS_REGISTRATION	DAYS_ID_PUBLISH	OWN_CAR_AGE	FLAG_MOBIL	FLAG_EMP_PHONE	FLAG_WORK_PHONE	FLAG_CONT_MOBILE	FLAG_PHONE	FLAG_EMAIL	CNT_FAM_MEMBERS	REGION_RATING_CLIENT	REGION_RATING_CLIENT_W_CITY	HOUR_APPR_PROCESS_START	REG_REGION_NOT_LIVE_REGION	REG_REGION_NOT_WORK_REGION	LIVE_REGION_NOT_WORK_REGION	REG_CITY_NOT_LIVE_CITY	REG_CITY_NOT_WORK_CITY	LIVE_CITY_NOT_WORK_CITY	EXT_SOURCE_1	EXT_SOURCE_2	EXT_SOURCE_3	APARTMENTS_AVG	BASEMENTAREA_AVG	YEARS_BEGINEXPLUATATION_AVG	YEARS_BUILD_AVG	COMMONAREA_AVG	ELEVATORS_AVG	ENTRANCES_AVG	FLOORSMAX_AVG	FLOORSMIN_AVG	LANDAREA_AVG	LIVINGAPARTMENTS_AVG	LIVINGAREA_AVG	NONLIVINGAPARTMENTS_AVG	NONLIVINGAREA_AVG	APARTMENTS_MODE	BASEMENTAREA_MODE	YEARS_BEGINEXPLUATATION_MODE	YEARS_BUILD_MODE	...	LIVINGAPARTMENTS_MODE	LIVINGAREA_MODE	NONLIVINGAPARTMENTS_MODE	NONLIVINGAREA_MODE	APARTMENTS_MEDI	BASEMENTAREA_MEDI	YEARS_BEGINEXPLUATATION_MEDI	YEARS_BUILD_MEDI	COMMONAREA_MEDI	ELEVATORS_MEDI	ENTRANCES_MEDI	FLOORSMAX_MEDI	FLOORSMIN_MEDI	LANDAREA_MEDI	LIVINGAPARTMENTS_MEDI	LIVINGAREA_MEDI	NONLIVINGAPARTMENTS_MEDI	NONLIVINGAREA_MEDI	TOTALAREA_MODE	OBS_30_CNT_SOCIAL_CIRCLE	DEF_30_CNT_SOCIAL_CIRCLE	OBS_60_CNT_SOCIAL_CIRCLE	DEF_60_CNT_SOCIAL_CIRCLE	DAYS_LAST_PHONE_CHANGE	FLAG_DOCUMENT_2	FLAG_DOCUMENT_3	FLAG_DOCUMENT_4	FLAG_DOCUMENT_5	FLAG_DOCUMENT_6	FLAG_DOCUMENT_7	FLAG_DOCUMENT_8	FLAG_DOCUMENT_9	FLAG_DOCUMENT_10	FLAG_DOCUMENT_11	FLAG_DOCUMENT_12	FLAG_DOCUMENT_13	FLAG_DOCUMENT_14	FLAG_DOCUMENT_15	FLAG_DOCUMENT_16	FLAG_DOCUMENT_17	FLAG_DOCUMENT_18	FLAG_DOCUMENT_19	FLAG_DOCUMENT_20	FLAG_DOCUMENT_21	AMT_REQ_CREDIT_BUREAU_HOUR	AMT_REQ_CREDIT_BUREAU_DAY	AMT_REQ_CREDIT_BUREAU_WEEK	AMT_REQ_CREDIT_BUREAU_MON	AMT_REQ_CREDIT_BUREAU_QRT	AMT_REQ_CREDIT_BUREAU_YEAR
count	307,511.00	307,511.00	307,511.00	307,511.00	307,511.00	307,499.00	307,233.00	307,511.00	307,511.00	307,511.00	307,511.00	307,511.00	104,582.00	307,511.00	307,511.00	307,511.00	307,511.00	307,511.00	307,511.00	307,509.00	307,511.00	307,511.00	307,511.00	307,511.00	307,511.00	307,511.00	307,511.00	307,511.00	307,511.00	134,133.00	306,851.00	246,546.00	151,450.00	127,568.00	157,504.00	103,023.00	92,646.00	143,620.00	152,683.00	154,491.00	98,869.00	124,921.00	97,312.00	153,161.00	93,997.00	137,829.00	151,450.00	127,568.00	157,504.00	103,023.00	...	97,312.00	153,161.00	93,997.00	137,829.00	151,450.00	127,568.00	157,504.00	103,023.00	92,646.00	143,620.00	152,683.00	154,491.00	98,869.00	124,921.00	97,312.00	153,161.00	93,997.00	137,829.00	159,080.00	306,490.00	306,490.00	306,490.00	306,490.00	307,510.00	307,511.00	307,511.00	307,511.00	307,511.00	307,511.00	307,511.00	307,511.00	307,511.00	307,511.00	307,511.00	307,511.00	307,511.00	307,511.00	307,511.00	307,511.00	307,511.00	307,511.00	307,511.00	307,511.00	307,511.00	265,992.00	265,992.00	265,992.00	265,992.00	265,992.00	265,992.00
mean	278,180.52	0.08	0.42	168,797.92	599,026.00	27,108.57	538,396.21	0.02	-16,037.00	63,815.05	-4,986.12	-2,994.20	12.06	1.00	0.82	0.20	1.00	0.28	0.06	2.15	2.05	2.03	12.06	0.02	0.05	0.04	0.08	0.23	0.18	0.50	0.51	0.51	0.12	0.09	0.98	0.75	0.04	0.08	0.15	0.23	0.23	0.07	0.10	0.11	0.01	0.03	0.11	0.09	0.98	0.76	...	0.11	0.11	0.01	0.03	0.12	0.09	0.98	0.76	0.04	0.08	0.15	0.23	0.23	0.07	0.10	0.11	0.01	0.03	0.10	1.42	0.14	1.41	0.10	-962.86	0.00	0.71	0.00	0.02	0.09	0.00	0.08	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.01	0.00	0.01	0.00	0.00	0.00	0.01	0.01	0.03	0.27	0.27	1.90
std	102,790.18	0.27	0.72	237,123.15	402,490.78	14,493.74	369,446.46	0.01	4,363.99	141,275.77	3,522.89	1,509.45	11.94	0.00	0.38	0.40	0.04	0.45	0.23	0.91	0.51	0.50	3.27	0.12	0.22	0.20	0.27	0.42	0.38	0.21	0.19	0.19	0.11	0.08	0.06	0.11	0.08	0.13	0.10	0.14	0.16	0.08	0.09	0.11	0.05	0.07	0.11	0.08	0.06	0.11	...	0.10	0.11	0.05	0.07	0.11	0.08	0.06	0.11	0.08	0.13	0.10	0.15	0.16	0.08	0.09	0.11	0.05	0.07	0.11	2.40	0.45	2.38	0.36	826.81	0.01	0.45	0.01	0.12	0.28	0.01	0.27	0.06	0.00	0.06	0.00	0.06	0.05	0.03	0.10	0.02	0.09	0.02	0.02	0.02	0.08	0.11	0.20	0.92	0.79	1.87
min	100,002.00	0.00	0.00	25,650.00	45,000.00	1,615.50	40,500.00	0.00	-25,229.00	-17,912.00	-24,672.00	-7,197.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1.00	1.00	1.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.01	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	...	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	-4,292.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
25%	189,145.50	0.00	0.00	112,500.00	270,000.00	16,524.00	238,500.00	0.01	-19,682.00	-2,760.00	-7,479.50	-4,299.00	5.00	1.00	1.00	0.00	1.00	0.00	0.00	2.00	2.00	2.00	10.00	0.00	0.00	0.00	0.00	0.00	0.00	0.33	0.39	0.37	0.06	0.04	0.98	0.69	0.01	0.00	0.07	0.17	0.08	0.02	0.05	0.05	0.00	0.00	0.05	0.04	0.98	0.70	...	0.05	0.04	0.00	0.00	0.06	0.04	0.98	0.69	0.01	0.00	0.07	0.17	0.08	0.02	0.05	0.05	0.00	0.00	0.04	0.00	0.00	0.00	0.00	-1,570.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
50%	278,202.00	0.00	0.00	147,150.00	513,531.00	24,903.00	450,000.00	0.02	-15,750.00	-1,213.00	-4,504.00	-3,254.00	9.00	1.00	1.00	0.00	1.00	0.00	0.00	2.00	2.00	2.00	12.00	0.00	0.00	0.00	0.00	0.00	0.00	0.51	0.57	0.54	0.09	0.08	0.98	0.76	0.02	0.00	0.14	0.17	0.21	0.05	0.08	0.07	0.00	0.00	0.08	0.07	0.98	0.76	...	0.08	0.07	0.00	0.00	0.09	0.08	0.98	0.76	0.02	0.00	0.14	0.17	0.21	0.05	0.08	0.07	0.00	0.00	0.07	0.00	0.00	0.00	0.00	-757.00	0.00	1.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1.00
75%	367,142.50	0.00	1.00	202,500.00	808,650.00	34,596.00	679,500.00	0.03	-12,413.00	-289.00	-2,010.00	-1,720.00	15.00	1.00	1.00	0.00	1.00	1.00	0.00	3.00	2.00	2.00	14.00	0.00	0.00	0.00	0.00	0.00	0.00	0.68	0.66	0.67	0.15	0.11	0.99	0.82	0.05	0.12	0.21	0.33	0.38	0.09	0.12	0.13	0.00	0.03	0.14	0.11	0.99	0.82	...	0.13	0.13	0.00	0.02	0.15	0.11	0.99	0.83	0.05	0.12	0.21	0.33	0.38	0.09	0.12	0.13	0.00	0.03	0.13	2.00	0.00	2.00	0.00	-274.00	0.00	1.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	3.00
max	456,255.00	1.00	19.00	117,000,000.00	4,050,000.00	258,025.50	4,050,000.00	0.07	-7,489.00	365,243.00	0.00	0.00	91.00	1.00	1.00	1.00	1.00	1.00	1.00	20.00	3.00	3.00	23.00	1.00	1.00	1.00	1.00	1.00	1.00	0.96	0.85	0.90	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	...	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	348.00	34.00	344.00	24.00	0.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	4.00	9.00	8.00	27.00	261.00	25.00

	Column Name	Null Percentage
0	SK_ID_CURR	0.00
1	TARGET	0.00
2	NAME_CONTRACT_TYPE	0.00
3	CODE_GENDER	0.00
4	FLAG_OWN_CAR	0.00
5	FLAG_OWN_REALTY	0.00
6	CNT_CHILDREN	0.00
7	AMT_INCOME_TOTAL	0.00
8	AMT_CREDIT	0.00
9	AMT_ANNUITY	0.00
10	AMT_GOODS_PRICE	0.09
11	NAME_TYPE_SUITE	0.42
12	NAME_INCOME_TYPE	0.00
13	NAME_EDUCATION_TYPE	0.00
14	NAME_FAMILY_STATUS	0.00
15	NAME_HOUSING_TYPE	0.00
16	REGION_POPULATION_RELATIVE	0.00
17	DAYS_BIRTH	0.00
18	DAYS_EMPLOYED	0.00
19	DAYS_REGISTRATION	0.00
20	DAYS_ID_PUBLISH	0.00
21	OWN_CAR_AGE	65.99
22	FLAG_MOBIL	0.00
23	FLAG_EMP_PHONE	0.00
24	FLAG_WORK_PHONE	0.00
25	FLAG_CONT_MOBILE	0.00
26	FLAG_PHONE	0.00
27	FLAG_EMAIL	0.00
28	OCCUPATION_TYPE	31.35
29	CNT_FAM_MEMBERS	0.00
30	REGION_RATING_CLIENT	0.00
31	REGION_RATING_CLIENT_W_CITY	0.00
32	WEEKDAY_APPR_PROCESS_START	0.00
33	HOUR_APPR_PROCESS_START	0.00
34	REG_REGION_NOT_LIVE_REGION	0.00
35	REG_REGION_NOT_WORK_REGION	0.00
36	LIVE_REGION_NOT_WORK_REGION	0.00
37	REG_CITY_NOT_LIVE_CITY	0.00
38	REG_CITY_NOT_WORK_CITY	0.00
39	LIVE_CITY_NOT_WORK_CITY	0.00
40	ORGANIZATION_TYPE	0.00
41	EXT_SOURCE_1	56.38
42	EXT_SOURCE_2	0.21
43	EXT_SOURCE_3	19.83
44	APARTMENTS_AVG	50.75
45	BASEMENTAREA_AVG	58.52
46	YEARS_BEGINEXPLUATATION_AVG	48.78
47	YEARS_BUILD_AVG	66.50
48	COMMONAREA_AVG	69.87
49	ELEVATORS_AVG	53.30
50	ENTRANCES_AVG	50.35
51	FLOORSMAX_AVG	49.76
52	FLOORSMIN_AVG	67.85
53	LANDAREA_AVG	59.38
54	LIVINGAPARTMENTS_AVG	68.35
55	LIVINGAREA_AVG	50.19
56	NONLIVINGAPARTMENTS_AVG	69.43
57	NONLIVINGAREA_AVG	55.18
58	APARTMENTS_MODE	50.75
59	BASEMENTAREA_MODE	58.52
60	YEARS_BEGINEXPLUATATION_MODE	48.78
61	YEARS_BUILD_MODE	66.50
62	COMMONAREA_MODE	69.87
63	ELEVATORS_MODE	53.30
64	ENTRANCES_MODE	50.35
65	FLOORSMAX_MODE	49.76
66	FLOORSMIN_MODE	67.85
67	LANDAREA_MODE	59.38
68	LIVINGAPARTMENTS_MODE	68.35
69	LIVINGAREA_MODE	50.19
70	NONLIVINGAPARTMENTS_MODE	69.43
71	NONLIVINGAREA_MODE	55.18
72	APARTMENTS_MEDI	50.75
73	BASEMENTAREA_MEDI	58.52
74	YEARS_BEGINEXPLUATATION_MEDI	48.78
75	YEARS_BUILD_MEDI	66.50
76	COMMONAREA_MEDI	69.87
77	ELEVATORS_MEDI	53.30
78	ENTRANCES_MEDI	50.35
79	FLOORSMAX_MEDI	49.76
80	FLOORSMIN_MEDI	67.85
81	LANDAREA_MEDI	59.38
82	LIVINGAPARTMENTS_MEDI	68.35
83	LIVINGAREA_MEDI	50.19
84	NONLIVINGAPARTMENTS_MEDI	69.43
85	NONLIVINGAREA_MEDI	55.18
86	FONDKAPREMONT_MODE	68.39
87	HOUSETYPE_MODE	50.18
88	TOTALAREA_MODE	48.27
89	WALLSMATERIAL_MODE	50.84
90	EMERGENCYSTATE_MODE	47.40
91	OBS_30_CNT_SOCIAL_CIRCLE	0.33
92	DEF_30_CNT_SOCIAL_CIRCLE	0.33
93	OBS_60_CNT_SOCIAL_CIRCLE	0.33
94	DEF_60_CNT_SOCIAL_CIRCLE	0.33
95	DAYS_LAST_PHONE_CHANGE	0.00
96	FLAG_DOCUMENT_2	0.00
97	FLAG_DOCUMENT_3	0.00
98	FLAG_DOCUMENT_4	0.00
99	FLAG_DOCUMENT_5	0.00
100	FLAG_DOCUMENT_6	0.00
101	FLAG_DOCUMENT_7	0.00
102	FLAG_DOCUMENT_8	0.00
103	FLAG_DOCUMENT_9	0.00
104	FLAG_DOCUMENT_10	0.00
105	FLAG_DOCUMENT_11	0.00
106	FLAG_DOCUMENT_12	0.00
107	FLAG_DOCUMENT_13	0.00
108	FLAG_DOCUMENT_14	0.00
109	FLAG_DOCUMENT_15	0.00
110	FLAG_DOCUMENT_16	0.00
111	FLAG_DOCUMENT_17	0.00
112	FLAG_DOCUMENT_18	0.00
113	FLAG_DOCUMENT_19	0.00
114	FLAG_DOCUMENT_20	0.00
115	FLAG_DOCUMENT_21	0.00
116	AMT_REQ_CREDIT_BUREAU_HOUR	13.50
117	AMT_REQ_CREDIT_BUREAU_DAY	13.50
118	AMT_REQ_CREDIT_BUREAU_WEEK	13.50
119	AMT_REQ_CREDIT_BUREAU_MON	13.50
120	AMT_REQ_CREDIT_BUREAU_QRT	13.50
121	AMT_REQ_CREDIT_BUREAU_YEAR	13.50

	0
SK_ID_CURR	int64
TARGET	int64
NAME_CONTRACT_TYPE	object
CODE_GENDER	object
FLAG_OWN_CAR	object
FLAG_OWN_REALTY	object
CNT_CHILDREN	int64
AMT_INCOME_TOTAL	float64
AMT_CREDIT	float64
AMT_ANNUITY	float64
AMT_GOODS_PRICE	float64
NAME_TYPE_SUITE	object
NAME_INCOME_TYPE	object
NAME_EDUCATION_TYPE	object
NAME_FAMILY_STATUS	object
NAME_HOUSING_TYPE	object
REGION_POPULATION_RELATIVE	float64
DAYS_BIRTH	int64
DAYS_EMPLOYED	int64
DAYS_REGISTRATION	float64
DAYS_ID_PUBLISH	int64
FLAG_MOBIL	int64
FLAG_EMP_PHONE	int64
FLAG_WORK_PHONE	int64
FLAG_CONT_MOBILE	int64
FLAG_PHONE	int64
FLAG_EMAIL	int64
OCCUPATION_TYPE	object
CNT_FAM_MEMBERS	category
REGION_RATING_CLIENT	int64
REGION_RATING_CLIENT_W_CITY	int64
WEEKDAY_APPR_PROCESS_START	object
HOUR_APPR_PROCESS_START	int64
REG_REGION_NOT_LIVE_REGION	int64
REG_REGION_NOT_WORK_REGION	int64
LIVE_REGION_NOT_WORK_REGION	int64
REG_CITY_NOT_LIVE_CITY	int64
REG_CITY_NOT_WORK_CITY	int64
LIVE_CITY_NOT_WORK_CITY	int64
ORGANIZATION_TYPE	object
EXT_SOURCE_2	float64
EXT_SOURCE_3	float64
YEARS_BEGINEXPLUATATION_AVG	float64
FLOORSMAX_AVG	float64
YEARS_BEGINEXPLUATATION_MODE	float64
FLOORSMAX_MODE	float64
YEARS_BEGINEXPLUATATION_MEDI	float64
FLOORSMAX_MEDI	float64
TOTALAREA_MODE	float64
EMERGENCYSTATE_MODE	object
OBS_30_CNT_SOCIAL_CIRCLE	float64
DEF_30_CNT_SOCIAL_CIRCLE	float64
OBS_60_CNT_SOCIAL_CIRCLE	float64
DEF_60_CNT_SOCIAL_CIRCLE	float64
DAYS_LAST_PHONE_CHANGE	float64
FLAG_DOCUMENT_2	int64
FLAG_DOCUMENT_3	int64
FLAG_DOCUMENT_4	int64
FLAG_DOCUMENT_5	int64
FLAG_DOCUMENT_6	int64
FLAG_DOCUMENT_7	int64
FLAG_DOCUMENT_8	int64
FLAG_DOCUMENT_9	int64
FLAG_DOCUMENT_10	int64
FLAG_DOCUMENT_11	int64
FLAG_DOCUMENT_12	int64
FLAG_DOCUMENT_13	int64
FLAG_DOCUMENT_14	int64
FLAG_DOCUMENT_15	int64
FLAG_DOCUMENT_16	int64
FLAG_DOCUMENT_17	int64
FLAG_DOCUMENT_18	int64
FLAG_DOCUMENT_19	int64
FLAG_DOCUMENT_20	int64
FLAG_DOCUMENT_21	int64
AMT_REQ_CREDIT_BUREAU_HOUR	float64
AMT_REQ_CREDIT_BUREAU_DAY	float64
AMT_REQ_CREDIT_BUREAU_WEEK	float64
AMT_REQ_CREDIT_BUREAU_MON	float64
AMT_REQ_CREDIT_BUREAU_QRT	float64
AMT_REQ_CREDIT_BUREAU_YEAR	float64

	mean
	TARGET
NAME_INCOME_TYPE	Businessman	Commercial associate	Maternity leave	Pensioner	State servant	Student	Unemployed	Working
NAME_EDUCATION_TYPE
Academic degree	nan	0.02	nan	0.00	0.00	nan	nan	0.03
Higher education	0.00	0.05	0.00	0.04	0.04	0.00	0.00	0.06
Incomplete higher	nan	0.08	nan	0.04	0.07	0.00	0.33	0.09
Lower secondary	nan	0.12	nan	0.07	0.08	nan	nan	0.15
Secondary / secondary special	nan	0.09	1.00	0.06	0.07	0.00	0.54	0.11

Bank Credit Default Case Study¶

Exploratory Data Analysis¶

Business Understanding :¶

Business Objectives¶

Data Understanding¶

Analysis Approach & Conclusions:¶

Univariate Analysis¶

Bivariate Analysis¶

Top 10 correlations for 'No default'¶

Top 10 correlations for 'Default' are :¶

Look below for a detailed analysis¶

Loading Application Data¶

Data Quality Check¶

Checking Columns with High Null Percentage¶

Imputation Consideration¶

Finding the best values to impute below columns that have <=13% of null values¶

Columns To Impute¶

AMT_ANNUITY Imputation¶

AMT_GOODS_PRICE Imputation¶

NAME_TYPE_SUITE Imputation :¶

CNT_FAM_MEMBERS Imputation¶

EXT_SOURCE_2 Imputation¶

Checking for Disguised Missing Values¶

Data Type Checks¶

Checking for Outliers¶

Checking for outliers in the following numerical columns¶

Outliers in Numerical Columns¶

Binning Continuous Variables¶

Data Imbalance¶

Data Set division wrt TARGET¶

Columns Selection for Analysis¶

Analysis¶

Univariate Analysis¶

Client's with All other cases (Target=0)¶

Client's with Payment Difficulties (Target=1)¶

Amount of credit¶

Amount of Annuity¶

Goods Price Amount¶

Bank City Rating¶

Client Credit Rating By External Source - 2¶

Client Credit Rating By External Source - 3¶

Gender¶

Owning A Car¶

Owning Realty¶

Highest Education¶

Marital Status¶

Housing Type¶

Income Category¶

Number of Children¶

Number of Family members¶

Number of Days Employed¶

Income type¶

Days Since Phone Number Was Changed.¶

Flag Whether Mobile Phone Was Reachable¶

Whether E-mail Was Provided¶

Mobile Number¶

Contact Address vs Work Address¶

Permanent Address vs Contact Address¶

Permanent Address vs Work Address¶

Bivariate Analysis¶

Income and Annuity¶

Credit Amount and Price of Credit Goods¶

Type of Loan and City Rating¶

Credit Rating¶

Owning A Car & Realty¶

Gender & Marital Status¶

Housing Type & Income Type¶

Email & Mobile Information¶

Days Since Phone Number Changed & Whether was Phone was Reachable¶

Employment Period & Education¶

Children & Family Members¶

Type of Income vs Amount of Income¶

Education Type & Income Type¶

Correlation Analysis¶

Correlation For Target : 0¶

Top 10 correlations for Target : 0 are¶

Correlation For Target : 1¶

Top 10 correlations for Target 1 are :¶

Data Set division wrt `TARGET`¶