Telecom Churn Case Study

Analysis Approach :

  • The telecommunications industry experiences an average annual churn rate of 15-25%. Because it costs 5-10 times more to acquire a new customer than to retain an existing one, customer retention has become even more important than customer acquisition.
  • We are given four months of customer usage data. In this case study, we analyse customer-level data of a leading telecom firm, build predictive models to identify customers at high risk of churn, and identify the main indicators of churn.
  • Churn can be defined in two ways: usage-based churn and revenue-based churn.
  • Usage-based churn: customers who have zero usage, either incoming or outgoing, in terms of calls, internet etc. over a period of time.
  • This case study considers only usage-based churn.
  • In the Indian and Southeast Asian markets, approximately 80% of revenue comes from the top 20% of customers (called high-value customers). Thus, if we can reduce churn among the high-value customers, we will be able to reduce significant revenue leakage. Hence, this case study focuses on high-value customers only.
  • The dataset contains customer-level information for a span of four consecutive months - June, July, August and September. The months are encoded as 6, 7, 8 and 9, respectively.

  • The business objective is to predict the churn in the last (i.e. the ninth) month using the data (features) from the first three months.

  • This is a classification problem, where we need to predict whether a customer is about to churn or not. We built a baseline Logistic Regression model, followed by PCA + Logistic Regression, PCA + Random Forest and PCA + XGBoost.
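
The usage-based churn definition above can be sketched as follows. This is a minimal illustration, not the notebook's own code: it assumes churn is tagged from the month-9 totals `total_ic_mou_9`, `total_og_mou_9`, `vol_2g_mb_9` and `vol_3g_mb_9`. The choice of exactly these four columns, the helper name `tag_churn` and the toy data are assumptions.

```python
import pandas as pd

# Assumed set of month-9 (churn phase) usage columns.
CHURN_PHASE_USAGE_COLS = [
    'total_ic_mou_9', 'total_og_mou_9', 'vol_2g_mb_9', 'vol_3g_mb_9',
]

def tag_churn(df, usage_cols=CHURN_PHASE_USAGE_COLS):
    """Hypothetical helper: 1 if every usage column is zero, else 0."""
    no_usage = (df[usage_cols] == 0).all(axis=1)
    return no_usage.astype(int)

# Toy data for illustration only.
toy = pd.DataFrame({
    'total_ic_mou_9': [0.0, 12.5, 0.0],
    'total_og_mou_9': [0.0, 0.0, 0.0],
    'vol_2g_mb_9': [0.0, 0.0, 3.2],
    'vol_3g_mb_9': [0.0, 0.0, 0.0],
})
toy['churn'] = tag_churn(toy)
```

Only the first toy customer has zero calls and zero data usage in month 9, so only that row is tagged as churned.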

Analysis Steps

Data Cleaning and EDA

  1. We started by importing the necessary packages and libraries.
  2. We loaded the dataset into a dataframe.
  3. We checked the number of columns, their data types, null counts and unique value counts to understand the data and to verify that each column has the correct data type.
  4. We checked for duplicate records (rows) in the data. There were no duplicates.
  5. Since 'mobile_number' is the unique identifier available, we made it the index to retain the identity.
  6. Some columns did not follow the naming standard, so we renamed them to make all variables follow the same naming convention.
  7. After renaming, we converted the columns to their respective data types. Numeric columns with 29 or fewer unique values were treated as categorical and the rest as continuous.
  8. The date columns had 'object' as their data type, so we converted them to the proper datetime format.
  9. Since our analysis focuses on high value customers (HVC), we filtered for them before further analysis. A customer is considered high value if their 'Average_rech_amt' over months 6 and 7 is greater than or equal to the 70th percentile of 'Average_rech_amt'.
  10. We checked for missing values.
  11. We dropped all columns with more than 50% missing values.
  12. We have been given four months of data. Since each month's revenue and usage data is not related to the others, we did a month-wise drill-down on missing values.
  13. Some columns had a similar proportion of missing values, so we looked at their related columns and checked whether they could be imputed with zero.
  14. The 'last_date_of_month' columns had some missing values. Since the value is fully determined by the month, we imputed the last date based on the month.
  15. Some columns had only one unique value and are of no use for the analysis, so we dropped them.
  16. After completing the data preparation tasks, we tagged the churn variable (our target variable).
  17. After imputing, we dropped the churn phase columns (columns belonging to month 9).
  18. After all the above processing, we retained 30,011 rows and 126 columns.
  19. Exploratory Data Analysis
  • The telecom company has many users with negative average revenues in both phases. These users are likely to churn.
  • Most customers prefer the plans of '0' category.
  • Customers with lower 'aon' are more likely to churn than customers with higher 'aon'.
  • Revenue generated by customers who are about to churn is very unstable.
  • Customers whose arpu decreases in the 7th month are more likely to churn than those whose arpu increases.
  • Customers with high total_og_mou in the 6th month and lower total_og_mou in the 7th month are more likely to churn than the rest.
  • Customers with a decrease in total_ic_mou in the 7th month are more likely to churn than the rest.
  • Customers with stable usage of 2g volume throughout months 6 and 7 are less likely to churn.
  • Customers with a fall in usage of 2g volume in the 7th month are more likely to churn.
  • Customers with stable usage of 3g volume throughout months 6 and 7 are less likely to churn.
  • Customers with a fall in consumption of 3g volume in the 7th month are more likely to churn.
  • Customers with lower total_og_mou in the 6th and 8th months are more likely to churn than those with higher total_og_mou.
  • Customers with lower total_og_mou_8 and aon are more likely to churn than those with higher total_og_mou_8 and aon.
  • Customers with lower total_ic_mou_8 are more likely to churn irrespective of aon.
  • Customers with total_ic_mou_8 > 2000 are very unlikely to churn.
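
Observations of this kind can be checked by comparing churn rates across groups. The sketch below is one possible way to do it, assuming a binary `churn` column has already been tagged. The flag name `arpu_fell` and the toy values are hypothetical and do not come from the dataset.

```python
import pandas as pd

# Toy data for illustration only; values are made up.
toy = pd.DataFrame({
    'arpu_6': [300.0, 250.0, 400.0, 120.0, 500.0, 220.0],
    'arpu_7': [150.0, 260.0, 100.0, 130.0, 510.0, 90.0],
    'churn':  [1,     0,     1,     0,     0,     0],
})

# Flag customers whose revenue dropped from month 6 to month 7.
toy['arpu_fell'] = toy['arpu_7'] < toy['arpu_6']

# Mean of a 0/1 column is the churn rate within each group.
churn_rate = toy.groupby('arpu_fell')['churn'].mean()
```

The same pattern applies to `total_og_mou`, `total_ic_mou` or the 2g/3g volume columns by swapping the column names.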
  1. Correlation analysis has been performed.
  2. We have created the derived variables and then removed the variables that were used to derive new ones.
  3. Outlier treatment has been performed. We have looked at the quantiles to understand the spread of Data.
  4. We have capped the upper outliers to 99th percentile.
  5. We checked the categorical variables and the contribution of each class in those variables. Classes with a low contribution were grouped into 'Others'.
  6. Dummy Variables were created.
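
Two of the steps above, the high value customer filter and the capping of upper outliers at the 99th percentile, can be sketched as follows. This is a minimal illustration under stated assumptions: the average recharge amount is taken as the mean of `total_rech_amt_6` and `total_rech_amt_7`, and the helper names `filter_high_value` and `cap_upper_outliers` are hypothetical.

```python
import pandas as pd

def filter_high_value(df, pct=0.70):
    """Keep customers whose average recharge amount over months 6 and 7
    is at or above the given percentile (assumed definition)."""
    avg_rech = (df['total_rech_amt_6'] + df['total_rech_amt_7']) / 2
    cutoff = avg_rech.quantile(pct)
    return df[avg_rech >= cutoff]

def cap_upper_outliers(df, cols, pct=0.99):
    """Cap each listed column at its own upper percentile."""
    out = df.copy()
    for col in cols:
        out[col] = out[col].clip(upper=out[col].quantile(pct))
    return out

# Toy data for illustration only.
amounts = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
toy = pd.DataFrame({'total_rech_amt_6': amounts, 'total_rech_amt_7': amounts})
hvc = filter_high_value(toy)

usage = pd.DataFrame({'x': [float(v) for v in range(1, 101)]})
capped = cap_upper_outliers(usage, ['x'])
```

Capping keeps every row and only pulls extreme values down to the percentile, which is why the row count is unchanged by this step.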

Pre-processing Steps

  1. Train-Test Split has been performed.
  2. The data has high class-imbalance with the ratio of 0.095 (class 1 : class 0).
  3. SMOTE technique has been used to overcome class-imbalance.
  4. Predictor columns have been standardized to mean 0 and standard deviation 1.
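
The standardization step can be sketched with NumPy alone. The sketch assumes that the mean and standard deviation are estimated on the training split and then reused for the test split. The text above does not state this, so treat it as an assumption, and the function names are hypothetical.

```python
import numpy as np

def fit_standardizer(X_train):
    """Estimate per-column mean and standard deviation on the train split."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    # Guard against constant columns to avoid division by zero.
    std = np.where(std == 0, 1.0, std)
    return mean, std

def standardize(X, mean, std):
    """Apply previously fitted statistics to any split."""
    return (X - mean) / std

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

mean, std = fit_standardizer(X_train)
X_train_scaled = standardize(X_train, mean, std)
X_test_scaled = standardize(X_test, mean, std)
```

Reusing the training statistics means the test split is close to, but not exactly, mean 0 and standard deviation 1.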

Modelling

Model 1 : Logistic Regression with RFE & Manual Elimination (Interpretable Model)
The most important predictors of churn, in order of importance, and their coefficients are as follows :

  • loc_ic_t2f_mou_8 -1.2736
  • total_rech_num_8 -1.2033
  • total_rech_num_6 0.6053
  • monthly_3g_8_0 0.3994
  • monthly_2g_8_0 0.3666
  • std_ic_t2f_mou_8 -0.3363
  • std_og_t2f_mou_8 -0.2474
  • const -0.2336
  • monthly_3g_7_0 -0.2099
  • std_ic_t2f_mou_7 0.1532
  • sachet_2g_6_0 -0.1108
  • sachet_2g_7_0 -0.0987
  • sachet_2g_8_0 0.0488
  • sachet_3g_6_0 -0.0399
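
One way to read these coefficients is as odds ratios. In a logistic regression, exp(coefficient) is the multiplicative change in the odds of churn for a one-unit increase in the predictor. The sketch below assumes the predictors were standardized, so that one unit is one standard deviation. Only three of the listed coefficients are used.

```python
import math

# Coefficients copied from the list above.
coefficients = {
    'loc_ic_t2f_mou_8': -1.2736,
    'total_rech_num_8': -1.2033,
    'total_rech_num_6': 0.6053,
}

# exp(beta) < 1 lowers the odds of churn, exp(beta) > 1 raises them.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f'{name}: odds multiplied by {ratio:.2f} per unit increase')
```

Under that assumption, a one-standard-deviation increase in loc_ic_t2f_mou_8 multiplies the odds of churn by roughly 0.28, holding the other predictors fixed.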

PCA : 95% of the variance in the train set is explained by the first 16 principal components, and 100% of the variance is explained by the first 45 principal components.
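
The number of components needed for a given share of variance can be read from the cumulative explained variance ratio. The sketch below computes it with a plain SVD in NumPy on synthetic data. It is an illustration of the idea, not the notebook's PCA code, and `components_for_variance` is a hypothetical helper.

```python
import numpy as np

def components_for_variance(X, threshold=0.95):
    """Return the smallest number of principal components whose
    cumulative explained variance ratio reaches the threshold."""
    X_centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    cumulative = np.cumsum(explained)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative)), cumulative

# Synthetic data: 6 observed columns driven by 2 latent factors plus small noise.
rng = np.random.default_rng(42)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 6))

k, cumulative = components_for_variance(X, threshold=0.95)
```

On the real training set, the same calculation with a threshold of 0.95 is what would yield a count such as the 16 components quoted above.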

Model 2 : PCA + Logistic Regression

    Train Performance :  

    Accuracy : 0.627
    Sensitivity / True Positive Rate / Recall : 0.918
    Specificity / True Negative Rate :  0.599
    Precision / Positive Predictive Value : 0.179
    F1-score : 0.3

    Test Performance :

    Accuracy : 0.086
    Sensitivity / True Positive Rate / Recall : 1.0
    Specificity / True Negative Rate :  0.0
    Precision / Positive Predictive Value : 0.086
    F1-score : 0.158  

Model 3 : PCA + Random Forest Classifier

    Train Performance :  

    Accuracy : 0.882
    Sensitivity / True Positive Rate / Recall : 0.816
    Specificity / True Negative Rate :  0.888
    Precision / Positive Predictive Value : 0.408
    F1-score : 0.544

    Test Performance :

    Accuracy : 0.86
    Sensitivity / True Positive Rate / Recall : 0.80
    Specificity / True Negative Rate :  0.78
    Precision / Positive Predictive Value : 0.37
    F1-score : 0.51

Model 4 : PCA + XGBoost

    Train Performance :  

    Accuracy : 0.873
    Sensitivity / True Positive Rate / Recall : 0.887
    Specificity / True Negative Rate :  0.872
    Precision / Positive Predictive Value : 0.396
    F1-score : 0.548

    Test Performance :

    Accuracy : 0.086
    Sensitivity / True Positive Rate / Recall : 1.0
    Specificity / True Negative Rate :  0.0
    Precision / Positive Predictive Value : 0.086
    F1-score : 0.158
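
For reference, all five metrics reported for each model follow from the four confusion-matrix counts. The sketch below shows the formulas. The function name and the counts in the example are hypothetical and are not taken from the notebook's results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the reported metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # recall / TPR
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # TNR
    precision = tp / (tp + fp) if (tp + fp) else 0.0     # PPV
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        'accuracy': accuracy,
        'sensitivity': sensitivity,
        'specificity': specificity,
        'precision': precision,
        'f1': f1,
    }

# Hypothetical counts: 1000 customers, 86 churners, every customer flagged as churn.
all_flagged = classification_metrics(tp=86, fp=914, tn=0, fn=0)
```

With these hypothetical counts the function returns accuracy 0.086, sensitivity 1.0, specificity 0.0, precision 0.086 and an F1-score of about 0.158. This is the pattern produced whenever every customer is flagged as churn, so sensitivity is best read together with specificity and precision.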

Recommendations :

The following are the strongest indicators of churn :

  • Customers who churn show lower average monthly local incoming calls from fixed line in the action period by 1.27 standard deviations, compared to users who don't churn, when all other factors are held constant. This is the strongest indicator of churn.
  • Customers who churn show a lower number of recharges in the action period by 1.20 standard deviations, when all other factors are held constant. This is the second strongest indicator of churn.
  • Further, customers who churn have done 0.6 standard deviations more recharges in month 6 than non-churn customers. This factor, when coupled with the above factors, is a good indicator of churn.
  • Customers who churn are more likely to be users of 'monthly 2g package-0 / monthly 3g package-0' in the action period (approximately 0.37 to 0.40 standard deviations higher than other packages), when all other factors are held constant.

Based on the above indicators, the recommendations to the telecom company are :

  • Concentrate on users whose incoming calls from fixed line are 1.27 standard deviations lower than average. They are the most likely to churn.
  • Concentrate on users who recharge fewer times (1.2 standard deviations below average) in the 8th month. They are the second most likely to churn.
  • Models with high sensitivity are the best for predicting churn. Use the PCA + Logistic Regression model to predict churn. It has an ROC score of 0.87 and a test sensitivity of 100%.

Analysis

Data Understanding

In [6]:
# Importing Necessary Libraries.
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
import warnings 
warnings.filterwarnings('ignore')

# Setting max display columns and rows.
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
In [7]:
# Reading Dataset into a DataFrame.
data=pd.read_csv('telecom_churn_data.csv')
data.head()
Out[7]:
mobile_number circle_id loc_og_t2o_mou std_og_t2o_mou loc_ic_t2o_mou last_date_of_month_6 last_date_of_month_7 last_date_of_month_8 last_date_of_month_9 arpu_6 arpu_7 arpu_8 arpu_9 onnet_mou_6 onnet_mou_7 onnet_mou_8 onnet_mou_9 offnet_mou_6 offnet_mou_7 offnet_mou_8 offnet_mou_9 roam_ic_mou_6 roam_ic_mou_7 roam_ic_mou_8 roam_ic_mou_9 roam_og_mou_6 roam_og_mou_7 roam_og_mou_8 roam_og_mou_9 loc_og_t2t_mou_6 loc_og_t2t_mou_7 loc_og_t2t_mou_8 loc_og_t2t_mou_9 loc_og_t2m_mou_6 loc_og_t2m_mou_7 loc_og_t2m_mou_8 loc_og_t2m_mou_9 loc_og_t2f_mou_6 loc_og_t2f_mou_7 loc_og_t2f_mou_8 loc_og_t2f_mou_9 loc_og_t2c_mou_6 loc_og_t2c_mou_7 loc_og_t2c_mou_8 loc_og_t2c_mou_9 loc_og_mou_6 loc_og_mou_7 loc_og_mou_8 loc_og_mou_9 std_og_t2t_mou_6 std_og_t2t_mou_7 std_og_t2t_mou_8 std_og_t2t_mou_9 std_og_t2m_mou_6 std_og_t2m_mou_7 std_og_t2m_mou_8 std_og_t2m_mou_9 std_og_t2f_mou_6 std_og_t2f_mou_7 std_og_t2f_mou_8 std_og_t2f_mou_9 std_og_t2c_mou_6 std_og_t2c_mou_7 std_og_t2c_mou_8 std_og_t2c_mou_9 std_og_mou_6 std_og_mou_7 std_og_mou_8 std_og_mou_9 isd_og_mou_6 isd_og_mou_7 isd_og_mou_8 isd_og_mou_9 spl_og_mou_6 spl_og_mou_7 spl_og_mou_8 spl_og_mou_9 og_others_6 og_others_7 og_others_8 og_others_9 total_og_mou_6 total_og_mou_7 total_og_mou_8 total_og_mou_9 loc_ic_t2t_mou_6 loc_ic_t2t_mou_7 loc_ic_t2t_mou_8 loc_ic_t2t_mou_9 loc_ic_t2m_mou_6 loc_ic_t2m_mou_7 loc_ic_t2m_mou_8 loc_ic_t2m_mou_9 loc_ic_t2f_mou_6 loc_ic_t2f_mou_7 loc_ic_t2f_mou_8 loc_ic_t2f_mou_9 loc_ic_mou_6 loc_ic_mou_7 loc_ic_mou_8 loc_ic_mou_9 std_ic_t2t_mou_6 std_ic_t2t_mou_7 std_ic_t2t_mou_8 std_ic_t2t_mou_9 std_ic_t2m_mou_6 std_ic_t2m_mou_7 std_ic_t2m_mou_8 std_ic_t2m_mou_9 std_ic_t2f_mou_6 std_ic_t2f_mou_7 std_ic_t2f_mou_8 std_ic_t2f_mou_9 std_ic_t2o_mou_6 std_ic_t2o_mou_7 std_ic_t2o_mou_8 std_ic_t2o_mou_9 std_ic_mou_6 std_ic_mou_7 std_ic_mou_8 std_ic_mou_9 total_ic_mou_6 total_ic_mou_7 total_ic_mou_8 total_ic_mou_9 spl_ic_mou_6 spl_ic_mou_7 spl_ic_mou_8 spl_ic_mou_9 isd_ic_mou_6 isd_ic_mou_7 isd_ic_mou_8 isd_ic_mou_9 
ic_others_6 ic_others_7 ic_others_8 ic_others_9 total_rech_num_6 total_rech_num_7 total_rech_num_8 total_rech_num_9 total_rech_amt_6 total_rech_amt_7 total_rech_amt_8 total_rech_amt_9 max_rech_amt_6 max_rech_amt_7 max_rech_amt_8 max_rech_amt_9 date_of_last_rech_6 date_of_last_rech_7 date_of_last_rech_8 date_of_last_rech_9 last_day_rch_amt_6 last_day_rch_amt_7 last_day_rch_amt_8 last_day_rch_amt_9 date_of_last_rech_data_6 date_of_last_rech_data_7 date_of_last_rech_data_8 date_of_last_rech_data_9 total_rech_data_6 total_rech_data_7 total_rech_data_8 total_rech_data_9 max_rech_data_6 max_rech_data_7 max_rech_data_8 max_rech_data_9 count_rech_2g_6 count_rech_2g_7 count_rech_2g_8 count_rech_2g_9 count_rech_3g_6 count_rech_3g_7 count_rech_3g_8 count_rech_3g_9 av_rech_amt_data_6 av_rech_amt_data_7 av_rech_amt_data_8 av_rech_amt_data_9 vol_2g_mb_6 vol_2g_mb_7 vol_2g_mb_8 vol_2g_mb_9 vol_3g_mb_6 vol_3g_mb_7 vol_3g_mb_8 vol_3g_mb_9 arpu_3g_6 arpu_3g_7 arpu_3g_8 arpu_3g_9 arpu_2g_6 arpu_2g_7 arpu_2g_8 arpu_2g_9 night_pck_user_6 night_pck_user_7 night_pck_user_8 night_pck_user_9 monthly_2g_6 monthly_2g_7 monthly_2g_8 monthly_2g_9 sachet_2g_6 sachet_2g_7 sachet_2g_8 sachet_2g_9 monthly_3g_6 monthly_3g_7 monthly_3g_8 monthly_3g_9 sachet_3g_6 sachet_3g_7 sachet_3g_8 sachet_3g_9 fb_user_6 fb_user_7 fb_user_8 fb_user_9 aon aug_vbc_3g jul_vbc_3g jun_vbc_3g sep_vbc_3g
0 7000842753 109 0.0 0.0 0.0 6/30/2014 7/31/2014 8/31/2014 9/30/2014 197.385 214.816 213.803 21.100 NaN NaN 0.00 NaN NaN NaN 0.00 NaN NaN NaN 0.00 NaN NaN NaN 0.00 NaN NaN NaN 0.00 NaN NaN NaN 0.00 NaN NaN NaN 0.00 NaN NaN NaN 0.00 NaN NaN NaN 0.00 NaN NaN NaN 0.00 NaN NaN NaN 0.00 NaN NaN NaN 0.00 NaN NaN NaN 0.0 NaN NaN NaN 0.00 NaN NaN NaN 0.0 NaN NaN NaN 0.00 NaN NaN NaN 0.0 NaN 0.00 0.00 0.00 0.00 NaN NaN 0.16 NaN NaN NaN 4.13 NaN NaN NaN 1.15 NaN NaN NaN 5.44 NaN NaN NaN 0.00 NaN NaN NaN 0.00 NaN NaN NaN 0.00 NaN NaN NaN 0.0 NaN NaN NaN 0.00 NaN 0.00 0.00 5.44 0.00 NaN NaN 0.0 NaN NaN NaN 0.0 NaN NaN NaN 0.0 NaN 4 3 2 6 362 252 252 0 252 252 252 0 6/21/2014 7/16/2014 8/8/2014 9/28/2014 252 252 252 0 6/21/2014 7/16/2014 8/8/2014 NaN 1.0 1.0 1.0 NaN 252.0 252.0 252.0 NaN 0.0 0.0 0.0 NaN 1.0 1.0 1.0 NaN 252.0 252.0 252.0 NaN 30.13 1.32 5.75 0.0 83.57 150.76 109.61 0.00 212.17 212.17 212.17 NaN 212.17 212.17 212.17 NaN 0.0 0.0 0.0 NaN 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1.0 1.0 1.0 NaN 968 30.4 0.0 101.20 3.58
1 7001865778 109 0.0 0.0 0.0 6/30/2014 7/31/2014 8/31/2014 9/30/2014 34.047 355.074 268.321 86.285 24.11 78.68 7.68 18.34 15.74 99.84 304.76 53.76 0.0 0.00 0.00 0.00 0.0 0.00 0.00 0.00 23.88 74.56 7.68 18.34 11.51 75.94 291.86 53.76 0.00 0.00 0.00 0.00 0.0 2.91 0.00 0.00 35.39 150.51 299.54 72.11 0.23 4.11 0.00 0.00 0.00 0.46 0.13 0.00 0.00 0.00 0.00 0.0 0.0 0.0 0.0 0.0 0.23 4.58 0.13 0.00 0.0 0.0 0.0 0.0 4.68 23.43 12.76 0.00 0.00 0.0 0.0 0.0 40.31 178.53 312.44 72.11 1.61 29.91 29.23 116.09 17.48 65.38 375.58 56.93 0.00 8.93 3.61 0.00 19.09 104.23 408.43 173.03 0.00 0.00 2.35 0.00 5.90 0.00 12.49 15.01 0.00 0.00 0.00 0.00 0.0 0.0 0.0 0.0 5.90 0.00 14.84 15.01 26.83 104.23 423.28 188.04 0.00 0.0 0.0 0.00 1.83 0.00 0.0 0.00 0.00 0.00 0.0 0.00 4 9 11 5 74 384 283 121 44 154 65 50 6/29/2014 7/31/2014 8/28/2014 9/30/2014 44 23 30 0 NaN 7/25/2014 8/10/2014 NaN NaN 1.0 2.0 NaN NaN 154.0 25.0 NaN NaN 1.0 2.0 NaN NaN 0.0 0.0 NaN NaN 154.0 50.0 NaN 0.00 108.07 365.47 0.0 0.00 0.00 0.00 0.00 NaN 0.00 0.00 NaN NaN 28.61 7.60 NaN NaN 0.0 0.0 NaN 0 1 0 0 0 0 2 0 0 0 0 0 0 0 0 0 NaN 1.0 1.0 NaN 1006 0.0 0.0 0.00 0.00
2 7001625959 109 0.0 0.0 0.0 6/30/2014 7/31/2014 8/31/2014 9/30/2014 167.690 189.058 210.226 290.714 11.54 55.24 37.26 74.81 143.33 220.59 208.36 118.91 0.0 0.00 0.00 38.49 0.0 0.00 0.00 70.94 7.19 28.74 13.58 14.39 29.34 16.86 38.46 28.16 24.11 21.79 15.61 22.24 0.0 135.54 45.76 0.48 60.66 67.41 67.66 64.81 4.34 26.49 22.58 8.76 41.81 67.41 75.53 9.28 1.48 14.76 22.83 0.0 0.0 0.0 0.0 0.0 47.64 108.68 120.94 18.04 0.0 0.0 0.0 0.0 46.56 236.84 96.84 42.08 0.45 0.0 0.0 0.0 155.33 412.94 285.46 124.94 115.69 71.11 67.46 148.23 14.38 15.44 38.89 38.98 99.48 122.29 49.63 158.19 229.56 208.86 155.99 345.41 72.41 71.29 28.69 49.44 45.18 177.01 167.09 118.18 21.73 58.34 43.23 3.86 0.0 0.0 0.0 0.0 139.33 306.66 239.03 171.49 370.04 519.53 395.03 517.74 0.21 0.0 0.0 0.45 0.00 0.85 0.0 0.01 0.93 3.14 0.0 0.36 5 4 2 7 168 315 116 358 86 200 86 100 6/17/2014 7/24/2014 8/14/2014 9/29/2014 0 200 86 0 NaN NaN NaN 9/17/2014 NaN NaN NaN 1.0 NaN NaN NaN 46.0 NaN NaN NaN 1.0 NaN NaN NaN 0.0 NaN NaN NaN 46.0 0.00 0.00 0.00 0.0 0.00 0.00 0.00 8.42 NaN NaN NaN 2.84 NaN NaN NaN 0.0 NaN NaN NaN 0.0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 NaN NaN NaN 1.0 1103 0.0 0.0 4.17 0.00
3 7001204172 109 0.0 0.0 0.0 6/30/2014 7/31/2014 8/31/2014 9/30/2014 221.338 251.102 508.054 389.500 99.91 54.39 310.98 241.71 123.31 109.01 71.68 113.54 0.0 54.86 44.38 0.00 0.0 28.09 39.04 0.00 73.68 34.81 10.61 15.49 107.43 83.21 22.46 65.46 1.91 0.65 4.91 2.06 0.0 0.00 0.00 0.00 183.03 118.68 37.99 83.03 26.23 14.89 289.58 226.21 2.99 1.73 6.53 9.99 0.00 0.00 0.00 0.0 0.0 0.0 0.0 0.0 29.23 16.63 296.11 236.21 0.0 0.0 0.0 0.0 10.96 0.00 18.09 43.29 0.00 0.0 0.0 0.0 223.23 135.31 352.21 362.54 62.08 19.98 8.04 41.73 113.96 64.51 20.28 52.86 57.43 27.09 19.84 65.59 233.48 111.59 48.18 160.19 43.48 66.44 0.00 129.84 1.33 38.56 4.94 13.98 1.18 0.00 0.00 0.00 0.0 0.0 0.0 0.0 45.99 105.01 4.94 143.83 280.08 216.61 53.13 305.38 0.59 0.0 0.0 0.55 0.00 0.00 0.0 0.00 0.00 0.00 0.0 0.80 10 11 18 14 230 310 601 410 60 50 50 50 6/28/2014 7/31/2014 8/31/2014 9/30/2014 30 50 50 30 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.0 0.00 0.00 0.00 0.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN 2491 0.0 0.0 0.00 0.00
4 7000142493 109 0.0 0.0 0.0 6/30/2014 7/31/2014 8/31/2014 9/30/2014 261.636 309.876 238.174 163.426 50.31 149.44 83.89 58.78 76.96 91.88 124.26 45.81 0.0 0.00 0.00 0.00 0.0 0.00 0.00 0.00 50.31 149.44 83.89 58.78 67.64 91.88 124.26 37.89 0.00 0.00 0.00 1.93 0.0 0.00 0.00 0.00 117.96 241.33 208.16 98.61 0.00 0.00 0.00 0.00 9.31 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.0 0.0 0.0 0.0 9.31 0.00 0.00 0.00 0.0 0.0 0.0 0.0 0.00 0.00 0.00 5.98 0.00 0.0 0.0 0.0 127.28 241.33 208.16 104.59 105.68 88.49 233.81 154.56 106.84 109.54 104.13 48.24 1.50 0.00 0.00 0.00 214.03 198.04 337.94 202.81 0.00 0.00 0.86 2.31 1.93 0.25 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.0 0.0 0.0 1.93 0.25 0.86 2.31 216.44 198.29 338.81 205.31 0.00 0.0 0.0 0.18 0.00 0.00 0.0 0.00 0.48 0.00 0.0 0.00 5 6 3 4 196 350 287 200 56 110 110 50 6/26/2014 7/28/2014 8/9/2014 9/28/2014 50 110 110 50 6/4/2014 NaN NaN NaN 1.0 NaN NaN NaN 56.0 NaN NaN NaN 1.0 NaN NaN NaN 0.0 NaN NaN NaN 56.0 NaN NaN NaN 0.00 0.00 0.00 0.0 0.00 0.00 0.00 0.00 0.00 NaN NaN NaN 0.00 NaN NaN NaN 0.0 NaN NaN NaN 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0.0 NaN NaN NaN 1526 0.0 0.0 0.00 0.00
In [8]:
# Checking information about data.
print(data.info())
def metadata_matrix(data) : 
    return pd.DataFrame({
                'Datatype' : data.dtypes.astype(str), 
                'Non_Null_Count': data.count(axis = 0).astype(int), 
                'Null_Count': data.isnull().sum().astype(int), 
                'Null_Percentage': round(data.isnull().sum()/len(data) * 100 , 2), 
                'Unique_Values_Count': data.nunique().astype(int) 
                 }).sort_values(by='Null_Percentage', ascending=False)

metadata_matrix(data)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99999 entries, 0 to 99998
Columns: 226 entries, mobile_number to sep_vbc_3g
dtypes: float64(179), int64(35), object(12)
memory usage: 172.4+ MB
None
Out[8]:
Datatype Non_Null_Count Null_Count Null_Percentage Unique_Values_Count
arpu_3g_6 float64 25153 74846 74.85 7418
night_pck_user_6 float64 25153 74846 74.85 2
total_rech_data_6 float64 25153 74846 74.85 37
arpu_2g_6 float64 25153 74846 74.85 6990
max_rech_data_6 float64 25153 74846 74.85 48
fb_user_6 float64 25153 74846 74.85 2
av_rech_amt_data_6 float64 25153 74846 74.85 887
date_of_last_rech_data_6 object 25153 74846 74.85 30
count_rech_2g_6 float64 25153 74846 74.85 31
count_rech_3g_6 float64 25153 74846 74.85 25
date_of_last_rech_data_7 object 25571 74428 74.43 31
total_rech_data_7 float64 25571 74428 74.43 42
fb_user_7 float64 25571 74428 74.43 2
max_rech_data_7 float64 25571 74428 74.43 48
night_pck_user_7 float64 25571 74428 74.43 2
count_rech_2g_7 float64 25571 74428 74.43 36
av_rech_amt_data_7 float64 25571 74428 74.43 961
arpu_2g_7 float64 25571 74428 74.43 6586
count_rech_3g_7 float64 25571 74428 74.43 28
arpu_3g_7 float64 25571 74428 74.43 7246
total_rech_data_9 float64 25922 74077 74.08 37
count_rech_3g_9 float64 25922 74077 74.08 27
fb_user_9 float64 25922 74077 74.08 2
max_rech_data_9 float64 25922 74077 74.08 50
arpu_3g_9 float64 25922 74077 74.08 8063
date_of_last_rech_data_9 object 25922 74077 74.08 30
night_pck_user_9 float64 25922 74077 74.08 2
arpu_2g_9 float64 25922 74077 74.08 6795
count_rech_2g_9 float64 25922 74077 74.08 32
av_rech_amt_data_9 float64 25922 74077 74.08 945
total_rech_data_8 float64 26339 73660 73.66 46
arpu_3g_8 float64 26339 73660 73.66 7787
fb_user_8 float64 26339 73660 73.66 2
night_pck_user_8 float64 26339 73660 73.66 2
av_rech_amt_data_8 float64 26339 73660 73.66 973
max_rech_data_8 float64 26339 73660 73.66 50
count_rech_3g_8 float64 26339 73660 73.66 29
arpu_2g_8 float64 26339 73660 73.66 6652
count_rech_2g_8 float64 26339 73660 73.66 34
date_of_last_rech_data_8 object 26339 73660 73.66 31
ic_others_9 float64 92254 7745 7.75 1923
std_og_mou_9 float64 92254 7745 7.75 26553
std_og_t2c_mou_9 float64 92254 7745 7.75 1
isd_ic_mou_9 float64 92254 7745 7.75 5557
std_ic_mou_9 float64 92254 7745 7.75 11266
isd_og_mou_9 float64 92254 7745 7.75 1255
spl_og_mou_9 float64 92254 7745 7.75 4095
spl_ic_mou_9 float64 92254 7745 7.75 384
og_others_9 float64 92254 7745 7.75 235
loc_ic_t2t_mou_9 float64 92254 7745 7.75 12993
std_ic_t2o_mou_9 float64 92254 7745 7.75 1
loc_ic_t2m_mou_9 float64 92254 7745 7.75 21484
std_ic_t2f_mou_9 float64 92254 7745 7.75 3090
loc_ic_t2f_mou_9 float64 92254 7745 7.75 7091
loc_ic_mou_9 float64 92254 7745 7.75 27697
std_ic_t2m_mou_9 float64 92254 7745 7.75 8933
std_og_t2f_mou_9 float64 92254 7745 7.75 2295
std_og_t2t_mou_9 float64 92254 7745 7.75 17934
std_ic_t2t_mou_9 float64 92254 7745 7.75 6157
loc_og_mou_9 float64 92254 7745 7.75 25376
roam_og_mou_9 float64 92254 7745 7.75 5882
loc_og_t2m_mou_9 float64 92254 7745 7.75 20141
loc_og_t2f_mou_9 float64 92254 7745 7.75 3758
roam_ic_mou_9 float64 92254 7745 7.75 4827
offnet_mou_9 float64 92254 7745 7.75 30077
loc_og_t2c_mou_9 float64 92254 7745 7.75 2332
loc_og_t2t_mou_9 float64 92254 7745 7.75 12949
std_og_t2m_mou_9 float64 92254 7745 7.75 19052
onnet_mou_9 float64 92254 7745 7.75 23565
onnet_mou_8 float64 94621 5378 5.38 24089
std_ic_t2t_mou_8 float64 94621 5378 5.38 6352
std_ic_mou_8 float64 94621 5378 5.38 11662
loc_ic_t2t_mou_8 float64 94621 5378 5.38 13346
roam_og_mou_8 float64 94621 5378 5.38 6504
std_ic_t2m_mou_8 float64 94621 5378 5.38 9304
loc_ic_mou_8 float64 94621 5378 5.38 28200
std_ic_t2f_mou_8 float64 94621 5378 5.38 3051
roam_ic_mou_8 float64 94621 5378 5.38 5315
std_ic_t2o_mou_8 float64 94621 5378 5.38 1
loc_og_t2t_mou_8 float64 94621 5378 5.38 13336
loc_ic_t2f_mou_8 float64 94621 5378 5.38 7097
offnet_mou_8 float64 94621 5378 5.38 30908
loc_ic_t2m_mou_8 float64 94621 5378 5.38 21886
loc_og_t2m_mou_8 float64 94621 5378 5.38 20544
isd_og_mou_8 float64 94621 5378 5.38 1276
ic_others_8 float64 94621 5378 5.38 1896
og_others_8 float64 94621 5378 5.38 216
spl_ic_mou_8 float64 94621 5378 5.38 102
loc_og_t2f_mou_8 float64 94621 5378 5.38 3807
std_og_t2m_mou_8 float64 94621 5378 5.38 19786
spl_og_mou_8 float64 94621 5378 5.38 4390
std_og_t2c_mou_8 float64 94621 5378 5.38 1
isd_ic_mou_8 float64 94621 5378 5.38 5844
loc_og_t2c_mou_8 float64 94621 5378 5.38 2516
std_og_t2f_mou_8 float64 94621 5378 5.38 2333
std_og_t2t_mou_8 float64 94621 5378 5.38 18291
loc_og_mou_8 float64 94621 5378 5.38 25990
std_og_mou_8 float64 94621 5378 5.38 27491
date_of_last_rech_9 object 95239 4760 4.76 30
std_ic_t2f_mou_6 float64 96062 3937 3.94 3125
ic_others_6 float64 96062 3937 3.94 1817
isd_ic_mou_6 float64 96062 3937 3.94 5521
std_ic_t2m_mou_6 float64 96062 3937 3.94 9308
std_ic_mou_6 float64 96062 3937 3.94 11646
spl_ic_mou_6 float64 96062 3937 3.94 84
std_ic_t2o_mou_6 float64 96062 3937 3.94 1
loc_ic_t2f_mou_6 float64 96062 3937 3.94 7250
loc_ic_t2t_mou_6 float64 96062 3937 3.94 13540
std_og_t2c_mou_6 float64 96062 3937 3.94 1
std_og_t2f_mou_6 float64 96062 3937 3.94 2450
std_og_mou_6 float64 96062 3937 3.94 27502
std_og_t2m_mou_6 float64 96062 3937 3.94 19734
isd_og_mou_6 float64 96062 3937 3.94 1381
std_og_t2t_mou_6 float64 96062 3937 3.94 18244
spl_og_mou_6 float64 96062 3937 3.94 3965
loc_og_mou_6 float64 96062 3937 3.94 26372
og_others_6 float64 96062 3937 3.94 1018
loc_og_t2c_mou_6 float64 96062 3937 3.94 2235
loc_og_t2m_mou_6 float64 96062 3937 3.94 20905
loc_og_t2f_mou_6 float64 96062 3937 3.94 3860
loc_og_t2t_mou_6 float64 96062 3937 3.94 13539
roam_og_mou_6 float64 96062 3937 3.94 8038
std_ic_t2t_mou_6 float64 96062 3937 3.94 6279
onnet_mou_6 float64 96062 3937 3.94 24313
loc_ic_mou_6 float64 96062 3937 3.94 28569
offnet_mou_6 float64 96062 3937 3.94 31140
roam_ic_mou_6 float64 96062 3937 3.94 6512
loc_ic_t2m_mou_6 float64 96062 3937 3.94 22065
loc_og_t2c_mou_7 float64 96140 3859 3.86 2426
roam_ic_mou_7 float64 96140 3859 3.86 5230
loc_og_mou_7 float64 96140 3859 3.86 26091
loc_og_t2t_mou_7 float64 96140 3859 3.86 13411
offnet_mou_7 float64 96140 3859 3.86 31023
loc_og_t2f_mou_7 float64 96140 3859 3.86 3863
std_og_t2t_mou_7 float64 96140 3859 3.86 18567
std_ic_t2t_mou_7 float64 96140 3859 3.86 6481
onnet_mou_7 float64 96140 3859 3.86 24336
std_og_t2m_mou_7 float64 96140 3859 3.86 20018
loc_og_t2m_mou_7 float64 96140 3859 3.86 20637
std_og_t2f_mou_7 float64 96140 3859 3.86 2391
roam_og_mou_7 float64 96140 3859 3.86 6639
std_og_t2c_mou_7 float64 96140 3859 3.86 1
std_ic_t2m_mou_7 float64 96140 3859 3.86 9464
isd_og_mou_7 float64 96140 3859 3.86 1380
ic_others_7 float64 96140 3859 3.86 2002
loc_ic_t2f_mou_7 float64 96140 3859 3.86 7395
loc_ic_t2m_mou_7 float64 96140 3859 3.86 21918
std_ic_mou_7 float64 96140 3859 3.86 11889
loc_ic_t2t_mou_7 float64 96140 3859 3.86 13511
std_ic_t2f_mou_7 float64 96140 3859 3.86 3209
loc_ic_mou_7 float64 96140 3859 3.86 28390
spl_ic_mou_7 float64 96140 3859 3.86 107
og_others_7 float64 96140 3859 3.86 187
spl_og_mou_7 float64 96140 3859 3.86 4396
isd_ic_mou_7 float64 96140 3859 3.86 5789
std_ic_t2o_mou_7 float64 96140 3859 3.86 1
std_og_mou_7 float64 96140 3859 3.86 27951
date_of_last_rech_8 object 96377 3622 3.62 31
date_of_last_rech_7 object 98232 1767 1.77 31
last_date_of_month_9 object 98340 1659 1.66 1
date_of_last_rech_6 object 98392 1607 1.61 30
last_date_of_month_8 object 98899 1100 1.10 1
loc_ic_t2o_mou float64 98981 1018 1.02 1
std_og_t2o_mou float64 98981 1018 1.02 1
loc_og_t2o_mou float64 98981 1018 1.02 1
last_date_of_month_7 object 99398 601 0.60 1
sachet_3g_8 int64 99999 0 0.00 29
jul_vbc_3g float64 99999 0 0.00 14162
aug_vbc_3g float64 99999 0 0.00 14676
aon int64 99999 0 0.00 3489
jun_vbc_3g float64 99999 0 0.00 13312
monthly_2g_9 int64 99999 0 0.00 5
sachet_3g_6 int64 99999 0 0.00 25
vol_3g_mb_9 float64 99999 0 0.00 14472
sachet_3g_7 int64 99999 0 0.00 27
monthly_2g_8 int64 99999 0 0.00 6
monthly_3g_9 int64 99999 0 0.00 11
monthly_3g_8 int64 99999 0 0.00 12
sachet_3g_9 int64 99999 0 0.00 27
monthly_3g_7 int64 99999 0 0.00 15
monthly_3g_6 int64 99999 0 0.00 12
sachet_2g_9 int64 99999 0 0.00 32
sachet_2g_8 int64 99999 0 0.00 34
sachet_2g_7 int64 99999 0 0.00 35
sachet_2g_6 int64 99999 0 0.00 32
monthly_2g_7 int64 99999 0 0.00 6
monthly_2g_6 int64 99999 0 0.00 5
mobile_number int64 99999 0 0.00 99999
vol_3g_mb_8 float64 99999 0 0.00 14960
total_og_mou_9 float64 99999 0 0.00 39160
total_rech_num_7 int64 99999 0 0.00 101
total_rech_num_6 int64 99999 0 0.00 102
total_ic_mou_9 float64 99999 0 0.00 31260
total_ic_mou_8 float64 99999 0 0.00 32128
total_ic_mou_7 float64 99999 0 0.00 32242
total_ic_mou_6 float64 99999 0 0.00 32247
circle_id int64 99999 0 0.00 1
total_og_mou_8 float64 99999 0 0.00 40074
vol_3g_mb_7 float64 99999 0 0.00 14519
total_og_mou_7 float64 99999 0 0.00 40477
total_og_mou_6 float64 99999 0 0.00 40327
arpu_9 float64 99999 0 0.00 79937
arpu_8 float64 99999 0 0.00 83615
arpu_7 float64 99999 0 0.00 85308
arpu_6 float64 99999 0 0.00 85681
last_date_of_month_6 object 99999 0 0.00 1
total_rech_num_8 int64 99999 0 0.00 96
total_rech_num_9 int64 99999 0 0.00 97
total_rech_amt_6 int64 99999 0 0.00 2305
total_rech_amt_7 int64 99999 0 0.00 2329
vol_3g_mb_6 float64 99999 0 0.00 13773
vol_2g_mb_9 float64 99999 0 0.00 13919
vol_2g_mb_8 float64 99999 0 0.00 14994
vol_2g_mb_7 float64 99999 0 0.00 15114
vol_2g_mb_6 float64 99999 0 0.00 15201
last_day_rch_amt_9 int64 99999 0 0.00 185
last_day_rch_amt_8 int64 99999 0 0.00 199
last_day_rch_amt_7 int64 99999 0 0.00 173
last_day_rch_amt_6 int64 99999 0 0.00 186
max_rech_amt_9 int64 99999 0 0.00 201
max_rech_amt_8 int64 99999 0 0.00 213
max_rech_amt_7 int64 99999 0 0.00 183
max_rech_amt_6 int64 99999 0 0.00 202
total_rech_amt_9 int64 99999 0 0.00 2304
total_rech_amt_8 int64 99999 0 0.00 2347
sep_vbc_3g float64 99999 0 0.00 3720

Data Cleaning

In [9]:
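# Checking if there are any duplicate records.
# nunique() counts distinct mobile numbers; value_counts().sum() would only
# return the number of non-null rows and could not reveal duplicates.
data['mobile_number'].nunique()
Out[9]:
99999
  • Since the number of rows is the same as the number of distinct mobile numbers, there is no duplicate data.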
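
A more direct uniqueness check is also available in pandas. The sketch below uses `Series.duplicated()` and `Series.is_unique` on a small toy frame. The toy values are made up for illustration.

```python
import pandas as pd

# Toy data for illustration only: the second number appears twice.
toy = pd.DataFrame({
    'mobile_number': [7000000001, 7000000002, 7000000002],
    'arpu_6': [10.0, 20.0, 20.0],
})

# duplicated() marks every repeat after the first occurrence.
n_duplicate_ids = int(toy['mobile_number'].duplicated().sum())

# is_unique is True only when no value repeats.
ids_are_unique = toy['mobile_number'].is_unique
```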
In [10]:
# mobile_number is a unique identifier 
# Setting mobile_number as the index 
data = data.set_index('mobile_number')
In [11]:
# Renaming columns 
data = data.rename({'jun_vbc_3g' : 'vbc_3g_6', 'jul_vbc_3g' : 'vbc_3g_7', 'aug_vbc_3g' : 'vbc_3g_8', 'sep_vbc_3g' : 'vbc_3g_9'}, axis=1)
In [12]:
# Converting columns into appropriate data types and extracting single-value columns.
# Numeric columns with <= 29 unique values are considered categorical variables.
# The threshold of 29 was chosen by looking at the above metadata_matrix output.

columns=data.columns
change_to_cat=[]
single_value_col=[]
for column in columns:
    unique_value_count=data[column].nunique()
    if unique_value_count==1:
        single_value_col.append(column)
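    # is_numeric_dtype accepts any integer or float width, so the check
    # does not depend on the platform's default 'int' size.
    if 0<unique_value_count<=29 and pd.api.types.is_numeric_dtype(data[column]):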
        change_to_cat.append(column)
print( ' Columns to change to categorical data type : \n' ,pd.DataFrame(change_to_cat), '\n')
 Columns to change to categorical data type : 
                    0
0          circle_id
1     loc_og_t2o_mou
2     std_og_t2o_mou
3     loc_ic_t2o_mou
4   std_og_t2c_mou_6
5   std_og_t2c_mou_7
6   std_og_t2c_mou_8
7   std_og_t2c_mou_9
8   std_ic_t2o_mou_6
9   std_ic_t2o_mou_7
10  std_ic_t2o_mou_8
11  std_ic_t2o_mou_9
12   count_rech_3g_6
13   count_rech_3g_7
14   count_rech_3g_8
15   count_rech_3g_9
16  night_pck_user_6
17  night_pck_user_7
18  night_pck_user_8
19  night_pck_user_9
20      monthly_2g_6
21      monthly_2g_7
22      monthly_2g_8
23      monthly_2g_9
24      monthly_3g_6
25      monthly_3g_7
26      monthly_3g_8
27      monthly_3g_9
28       sachet_3g_6
29       sachet_3g_7
30       sachet_3g_8
31       sachet_3g_9
32         fb_user_6
33         fb_user_7
34         fb_user_8
35         fb_user_9 

In [13]:
# Converting all the above columns having <=29 unique values into categorical data type.
data[change_to_cat]=data[change_to_cat].astype('category')
In [14]:
# Converting *sachet* variables to categorical data type 
sachet_columns = data.filter(regex='.*sachet.*', axis=1).columns.values
data[sachet_columns] = data[sachet_columns].astype('category')
In [15]:
#Changing datatype of date variables to datetime.
columns=data.columns
col_with_date=[]
import re
for column in columns:
    x = re.findall("^date", column)
    if x:
        col_with_date.append(column)
data[col_with_date].dtypes
Out[15]:
date_of_last_rech_6         object
date_of_last_rech_7         object
date_of_last_rech_8         object
date_of_last_rech_9         object
date_of_last_rech_data_6    object
date_of_last_rech_data_7    object
date_of_last_rech_data_8    object
date_of_last_rech_data_9    object
dtype: object
In [16]:
# Checking the date format
data[col_with_date].head()
Out[16]:
date_of_last_rech_6 date_of_last_rech_7 date_of_last_rech_8 date_of_last_rech_9 date_of_last_rech_data_6 date_of_last_rech_data_7 date_of_last_rech_data_8 date_of_last_rech_data_9
mobile_number
7000842753 6/21/2014 7/16/2014 8/8/2014 9/28/2014 6/21/2014 7/16/2014 8/8/2014 NaN
7001865778 6/29/2014 7/31/2014 8/28/2014 9/30/2014 NaN 7/25/2014 8/10/2014 NaN
7001625959 6/17/2014 7/24/2014 8/14/2014 9/29/2014 NaN NaN NaN 9/17/2014
7001204172 6/28/2014 7/31/2014 8/31/2014 9/30/2014 NaN NaN NaN NaN
7000142493 6/26/2014 7/28/2014 8/9/2014 9/28/2014 6/4/2014 NaN NaN NaN
  • Let's convert the above columns to the datetime data type.
In [17]:
for col in col_with_date:
    data[col]=pd.to_datetime(data[col], format="%m/%d/%Y")
data[col_with_date].head()
Out[17]:
date_of_last_rech_6 date_of_last_rech_7 date_of_last_rech_8 date_of_last_rech_9 date_of_last_rech_data_6 date_of_last_rech_data_7 date_of_last_rech_data_8 date_of_last_rech_data_9
mobile_number
7000842753 2014-06-21 2014-07-16 2014-08-08 2014-09-28 2014-06-21 2014-07-16 2014-08-08 NaT
7001865778 2014-06-29 2014-07-31 2014-08-28 2014-09-30 NaT 2014-07-25 2014-08-10 NaT
7001625959 2014-06-17 2014-07-24 2014-08-14 2014-09-29 NaT NaT NaT 2014-09-17
7001204172 2014-06-28 2014-07-31 2014-08-31 2014-09-30 NaT NaT NaT NaT
7000142493 2014-06-26 2014-07-28 2014-08-09 2014-09-28 2014-06-04 NaT NaT NaT
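The conversion above raises an error if any value fails to match the format. A more defensive variant, sketched on a made-up frame, passes `errors="coerce"` so that malformed strings become NaT. The frame `toy` and the string "not a date" are assumptions for illustration; nothing in the notebook output suggests the real data contains malformed dates.

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    "date_of_last_rech_6": ["6/21/2014", "6/29/2014", None],
    "date_of_last_rech_data_6": ["6/4/2014", None, "not a date"],
    "arpu_6": [1.0, 2.0, 3.0],
})

# Same selection rule as the notebook: column names starting with "date".
date_cols = [c for c in toy.columns if c.startswith("date")]

for col in date_cols:
    # errors="coerce" maps unparseable values to NaT instead of raising.
    toy[col] = pd.to_datetime(toy[col], format="%m/%d/%Y", errors="coerce")

print(toy[date_cols])
```

Coercion hides bad values silently, so it is worth comparing the null counts before and after the conversion.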

Filtering High Value Customers

  • Customers are high-value if their average recharge amount over June and July is greater than or equal to the 70th percentile of that average across all customers.
In [18]:
#Deriving Average recharge amount of June and July.
data['Average_rech_amt_6n7']=(data['total_rech_amt_6']+data['total_rech_amt_7'])/2
In [19]:
# Filtering HIGH VALUE CUSTOMERS (Average_rech_amt_6n7 >= 70th percentile of Average_rech_amt_6n7)
data=data[(data['Average_rech_amt_6n7']>= data['Average_rech_amt_6n7'].quantile(0.7))]
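As a worked example of the percentile filter, the sketch below applies the same rule to ten made-up customers. The frame `toy` and the column name `avg_rech_amt_6n7` are assumptions for illustration. With pandas' default linear interpolation, the 70th percentile of the values 100 to 1000 is 730, so three of the ten customers are kept.

```python
import pandas as pd

# Ten made-up customers with identical June and July recharge amounts.
amounts = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
toy = pd.DataFrame({"total_rech_amt_6": amounts, "total_rech_amt_7": amounts})

# Average recharge amount over the two months.
toy["avg_rech_amt_6n7"] = (toy["total_rech_amt_6"] + toy["total_rech_amt_7"]) / 2

# 70th percentile cutoff (linear interpolation by default).
cutoff = toy["avg_rech_amt_6n7"].quantile(0.7)

# Keep customers at or above the cutoff.
high_value = toy[toy["avg_rech_amt_6n7"] >= cutoff]

print(cutoff, len(high_value))
```

Because the comparison is `>=`, ties at the cutoff are kept, which is one reason the retained share can be slightly above 30% (the later outputs in this notebook show 30,011 rows).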

Missing Values

In [20]:
#Checking for missing values.
missing_values = metadata_matrix(data)[['Datatype', 'Null_Percentage']].sort_values(by='Null_Percentage', ascending=False)
missing_values
Out[20]:
Datatype Null_Percentage
av_rech_amt_data_6 float64 62.02
count_rech_2g_6 float64 62.02
arpu_2g_6 float64 62.02
max_rech_data_6 float64 62.02
night_pck_user_6 category 62.02
date_of_last_rech_data_6 datetime64[ns] 62.02
total_rech_data_6 float64 62.02
arpu_3g_6 float64 62.02
fb_user_6 category 62.02
count_rech_3g_6 category 62.02
av_rech_amt_data_9 float64 61.81
count_rech_2g_9 float64 61.81
night_pck_user_9 category 61.81
arpu_3g_9 float64 61.81
arpu_2g_9 float64 61.81
fb_user_9 category 61.81
date_of_last_rech_data_9 datetime64[ns] 61.81
total_rech_data_9 float64 61.81
count_rech_3g_9 category 61.81
max_rech_data_9 float64 61.81
count_rech_2g_7 float64 61.14
count_rech_3g_7 category 61.14
arpu_2g_7 float64 61.14
arpu_3g_7 float64 61.14
av_rech_amt_data_7 float64 61.14
max_rech_data_7 float64 61.14
fb_user_7 category 61.14
total_rech_data_7 float64 61.14
date_of_last_rech_data_7 datetime64[ns] 61.14
night_pck_user_7 category 61.14
av_rech_amt_data_8 float64 60.83
count_rech_3g_8 category 60.83
total_rech_data_8 float64 60.83
arpu_3g_8 float64 60.83
max_rech_data_8 float64 60.83
date_of_last_rech_data_8 datetime64[ns] 60.83
arpu_2g_8 float64 60.83
fb_user_8 category 60.83
night_pck_user_8 category 60.83
count_rech_2g_8 float64 60.83
loc_og_t2t_mou_9 float64 5.68
ic_others_9 float64 5.68
isd_ic_mou_9 float64 5.68
og_others_9 float64 5.68
loc_og_t2f_mou_9 float64 5.68
roam_ic_mou_9 float64 5.68
loc_og_mou_9 float64 5.68
std_og_t2f_mou_9 float64 5.68
loc_og_t2m_mou_9 float64 5.68
std_og_t2m_mou_9 float64 5.68
loc_og_t2c_mou_9 float64 5.68
std_og_t2t_mou_9 float64 5.68
std_ic_t2o_mou_9 category 5.68
std_ic_mou_9 float64 5.68
spl_ic_mou_9 float64 5.68
std_ic_t2f_mou_9 float64 5.68
roam_og_mou_9 float64 5.68
std_ic_t2m_mou_9 float64 5.68
offnet_mou_9 float64 5.68
std_og_mou_9 float64 5.68
spl_og_mou_9 float64 5.68
loc_ic_t2t_mou_9 float64 5.68
onnet_mou_9 float64 5.68
loc_ic_t2m_mou_9 float64 5.68
loc_ic_t2f_mou_9 float64 5.68
std_og_t2c_mou_9 category 5.68
loc_ic_mou_9 float64 5.68
std_ic_t2t_mou_9 float64 5.68
isd_og_mou_9 float64 5.68
std_og_t2t_mou_8 float64 3.13
std_og_t2c_mou_8 category 3.13
std_og_t2f_mou_8 float64 3.13
std_og_mou_8 float64 3.13
roam_og_mou_8 float64 3.13
isd_og_mou_8 float64 3.13
loc_og_t2t_mou_8 float64 3.13
spl_ic_mou_8 float64 3.13
std_og_t2m_mou_8 float64 3.13
ic_others_8 float64 3.13
offnet_mou_8 float64 3.13
og_others_8 float64 3.13
isd_ic_mou_8 float64 3.13
roam_ic_mou_8 float64 3.13
spl_og_mou_8 float64 3.13
loc_og_t2f_mou_8 float64 3.13
std_ic_t2m_mou_8 float64 3.13
std_ic_t2f_mou_8 float64 3.13
std_ic_t2t_mou_8 float64 3.13
loc_og_t2c_mou_8 float64 3.13
loc_ic_mou_8 float64 3.13
onnet_mou_8 float64 3.13
loc_og_t2m_mou_8 float64 3.13
loc_ic_t2f_mou_8 float64 3.13
std_ic_t2o_mou_8 category 3.13
loc_og_mou_8 float64 3.13
loc_ic_t2m_mou_8 float64 3.13
std_ic_mou_8 float64 3.13
loc_ic_t2t_mou_8 float64 3.13
date_of_last_rech_9 datetime64[ns] 2.89
date_of_last_rech_8 datetime64[ns] 1.98
last_date_of_month_9 object 1.20
loc_og_mou_6 float64 1.05
std_ic_t2m_mou_6 float64 1.05
roam_og_mou_6 float64 1.05
std_ic_t2t_mou_6 float64 1.05
loc_ic_mou_6 float64 1.05
roam_ic_mou_6 float64 1.05
loc_ic_t2f_mou_6 float64 1.05
loc_ic_t2m_mou_6 float64 1.05
std_og_t2t_mou_6 float64 1.05
onnet_mou_6 float64 1.05
loc_ic_t2t_mou_6 float64 1.05
offnet_mou_6 float64 1.05
og_others_6 float64 1.05
loc_og_t2t_mou_6 float64 1.05
isd_og_mou_6 float64 1.05
std_og_t2m_mou_6 float64 1.05
loc_og_t2f_mou_6 float64 1.05
spl_ic_mou_6 float64 1.05
std_ic_mou_6 float64 1.05
isd_ic_mou_6 float64 1.05
loc_og_t2m_mou_6 float64 1.05
std_ic_t2o_mou_6 category 1.05
spl_og_mou_6 float64 1.05
ic_others_6 float64 1.05
std_ic_t2f_mou_6 float64 1.05
loc_og_t2c_mou_6 float64 1.05
std_og_mou_6 float64 1.05
std_og_t2f_mou_6 float64 1.05
std_og_t2c_mou_6 category 1.05
roam_ic_mou_7 float64 1.01
loc_og_t2c_mou_7 float64 1.01
loc_og_t2f_mou_7 float64 1.01
loc_og_t2m_mou_7 float64 1.01
loc_og_t2t_mou_7 float64 1.01
roam_og_mou_7 float64 1.01
std_ic_t2t_mou_7 float64 1.01
offnet_mou_7 float64 1.01
onnet_mou_7 float64 1.01
std_ic_t2f_mou_7 float64 1.01
std_ic_mou_7 float64 1.01
loc_ic_t2f_mou_7 float64 1.01
std_ic_t2m_mou_7 float64 1.01
loc_og_mou_7 float64 1.01
loc_ic_t2t_mou_7 float64 1.01
std_og_t2t_mou_7 float64 1.01
std_og_t2c_mou_7 category 1.01
std_og_mou_7 float64 1.01
isd_og_mou_7 float64 1.01
spl_og_mou_7 float64 1.01
og_others_7 float64 1.01
spl_ic_mou_7 float64 1.01
loc_ic_t2m_mou_7 float64 1.01
loc_ic_mou_7 float64 1.01
ic_others_7 float64 1.01
std_og_t2m_mou_7 float64 1.01
isd_ic_mou_7 float64 1.01
std_ic_t2o_mou_7 category 1.01
std_og_t2f_mou_7 float64 1.01
last_date_of_month_8 object 0.52
loc_og_t2o_mou category 0.38
loc_ic_t2o_mou category 0.38
date_of_last_rech_7 datetime64[ns] 0.38
std_og_t2o_mou category 0.38
date_of_last_rech_6 datetime64[ns] 0.21
last_date_of_month_7 object 0.10
vol_3g_mb_6 float64 0.00
arpu_6 float64 0.00
total_rech_amt_8 int64 0.00
total_rech_amt_7 int64 0.00
total_rech_amt_6 int64 0.00
total_rech_num_9 int64 0.00
last_date_of_month_6 object 0.00
vol_3g_mb_8 float64 0.00
arpu_7 float64 0.00
arpu_8 float64 0.00
arpu_9 float64 0.00
total_og_mou_6 float64 0.00
total_og_mou_7 float64 0.00
vol_3g_mb_7 float64 0.00
max_rech_amt_9 int64 0.00
vol_2g_mb_9 float64 0.00
vol_2g_mb_8 float64 0.00
vol_2g_mb_7 float64 0.00
vol_2g_mb_6 float64 0.00
last_day_rch_amt_9 int64 0.00
last_day_rch_amt_8 int64 0.00
last_day_rch_amt_7 int64 0.00
last_day_rch_amt_6 int64 0.00
max_rech_amt_8 int64 0.00
max_rech_amt_7 int64 0.00
max_rech_amt_6 int64 0.00
total_rech_amt_9 int64 0.00
total_ic_mou_6 float64 0.00
total_og_mou_8 float64 0.00
vbc_3g_8 float64 0.00
total_ic_mou_7 float64 0.00
total_ic_mou_8 float64 0.00
sachet_3g_9 category 0.00
sachet_3g_7 category 0.00
vbc_3g_9 float64 0.00
vbc_3g_6 float64 0.00
vbc_3g_7 float64 0.00
aon int64 0.00
sachet_3g_6 category 0.00
monthly_3g_8 category 0.00
monthly_3g_9 category 0.00
sachet_3g_8 category 0.00
monthly_3g_7 category 0.00
sachet_2g_9 category 0.00
sachet_2g_8 category 0.00
sachet_2g_7 category 0.00
sachet_2g_6 category 0.00
monthly_2g_9 category 0.00
monthly_2g_8 category 0.00
monthly_2g_7 category 0.00
monthly_2g_6 category 0.00
monthly_3g_6 category 0.00
circle_id category 0.00
vol_3g_mb_9 float64 0.00
total_og_mou_9 float64 0.00
total_rech_num_8 int64 0.00
total_rech_num_7 int64 0.00
total_rech_num_6 int64 0.00
total_ic_mou_9 float64 0.00
Average_rech_amt_6n7 float64 0.00
In [21]:
# Columns with high missing values , > 50%
metadata = metadata_matrix(data)
condition = metadata['Null_Percentage'] > 50 
high_missing_values = metadata[condition]
high_missing_values
Out[21]:
Datatype Non_Null_Count Null_Count Null_Percentage Unique_Values_Count
av_rech_amt_data_6 float64 11397 18614 62.02 794
count_rech_3g_6 category 11397 18614 62.02 25
count_rech_2g_6 float64 11397 18614 62.02 30
arpu_2g_6 float64 11397 18614 62.02 4503
max_rech_data_6 float64 11397 18614 62.02 43
night_pck_user_6 category 11397 18614 62.02 2
date_of_last_rech_data_6 datetime64[ns] 11397 18614 62.02 30
total_rech_data_6 float64 11397 18614 62.02 36
arpu_3g_6 float64 11397 18614 62.02 4875
fb_user_6 category 11397 18614 62.02 2
max_rech_data_9 float64 11461 18550 61.81 48
count_rech_3g_9 category 11461 18550 61.81 27
fb_user_9 category 11461 18550 61.81 2
total_rech_data_9 float64 11461 18550 61.81 35
date_of_last_rech_data_9 datetime64[ns] 11461 18550 61.81 30
av_rech_amt_data_9 float64 11461 18550 61.81 812
arpu_2g_9 float64 11461 18550 61.81 3846
arpu_3g_9 float64 11461 18550 61.81 4800
night_pck_user_9 category 11461 18550 61.81 2
count_rech_2g_9 float64 11461 18550 61.81 29
fb_user_7 category 11662 18349 61.14 2
date_of_last_rech_data_7 datetime64[ns] 11662 18349 61.14 31
total_rech_data_7 float64 11662 18349 61.14 40
night_pck_user_7 category 11662 18349 61.14 2
max_rech_data_7 float64 11662 18349 61.14 46
count_rech_2g_7 float64 11662 18349 61.14 35
arpu_3g_7 float64 11662 18349 61.14 4860
av_rech_amt_data_7 float64 11662 18349 61.14 863
arpu_2g_7 float64 11662 18349 61.14 4219
count_rech_3g_7 category 11662 18349 61.14 28
night_pck_user_8 category 11754 18257 60.83 2
fb_user_8 category 11754 18257 60.83 2
arpu_2g_8 float64 11754 18257 60.83 3854
count_rech_2g_8 float64 11754 18257 60.83 33
date_of_last_rech_data_8 datetime64[ns] 11754 18257 60.83 31
av_rech_amt_data_8 float64 11754 18257 60.83 837
arpu_3g_8 float64 11754 18257 60.83 4769
total_rech_data_8 float64 11754 18257 60.83 45
count_rech_3g_8 category 11754 18257 60.83 29
max_rech_data_8 float64 11754 18257 60.83 47
In [22]:
# Dropping above columns with high missing values 
high_missing_value_columns = high_missing_values.index 
data.drop(columns=high_missing_value_columns, inplace=True)
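The same threshold-based drop can be sketched without the `metadata_matrix` helper, using `isnull().mean()` to get the null share per column. The frame `toy` is made up for illustration; the 50% threshold is the one used above.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    "arpu_6": [1.0, 2.0, 3.0, 4.0],
    "fb_user_6": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "onnet_mou_6": [np.nan, 5.0, 6.0, 7.0],      # 25% missing
})

# Percentage of null values per column.
null_pct = toy.isnull().mean() * 100

# Columns above the 50% threshold.
high_missing_cols = null_pct[null_pct > 50].index

# drop() returns a new frame; the original is left unchanged.
reduced = toy.drop(columns=high_missing_cols)

print(list(high_missing_cols), list(reduced.columns))
```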
In [23]:
# Looking at remaining columns with missing values 
metadata_matrix(data)
Out[23]:
Datatype Non_Null_Count Null_Count Null_Percentage Unique_Values_Count
std_ic_t2o_mou_9 category 28307 1704 5.68 1
spl_og_mou_9 float64 28307 1704 5.68 2966
isd_og_mou_9 float64 28307 1704 5.68 908
roam_ic_mou_9 float64 28307 1704 5.68 3370
std_og_mou_9 float64 28307 1704 5.68 15900
roam_og_mou_9 float64 28307 1704 5.68 4004
std_ic_t2f_mou_9 float64 28307 1704 5.68 1971
std_og_t2c_mou_9 category 28307 1704 5.68 1
loc_og_t2t_mou_9 float64 28307 1704 5.68 10360
std_og_t2f_mou_9 float64 28307 1704 5.68 1595
std_ic_mou_9 float64 28307 1704 5.68 7745
loc_og_t2m_mou_9 float64 28307 1704 5.68 15585
std_og_t2m_mou_9 float64 28307 1704 5.68 12445
loc_og_t2f_mou_9 float64 28307 1704 5.68 3111
std_og_t2t_mou_9 float64 28307 1704 5.68 11141
loc_ic_mou_9 float64 28307 1704 5.68 18018
loc_og_t2c_mou_9 float64 28307 1704 5.68 1576
offnet_mou_9 float64 28307 1704 5.68 20452
loc_og_mou_9 float64 28307 1704 5.68 18207
spl_ic_mou_9 float64 28307 1704 5.68 287
std_ic_t2m_mou_9 float64 28307 1704 5.68 6168
loc_ic_t2f_mou_9 float64 28307 1704 5.68 4611
ic_others_9 float64 28307 1704 5.68 1284
loc_ic_t2m_mou_9 float64 28307 1704 5.68 15194
loc_ic_t2t_mou_9 float64 28307 1704 5.68 9407
std_ic_t2t_mou_9 float64 28307 1704 5.68 4280
isd_ic_mou_9 float64 28307 1704 5.68 3329
og_others_9 float64 28307 1704 5.68 132
onnet_mou_9 float64 28307 1704 5.68 16674
std_og_mou_8 float64 29073 938 3.13 16864
std_og_t2m_mou_8 float64 29073 938 3.13 13326
og_others_8 float64 29073 938 3.13 133
loc_ic_t2f_mou_8 float64 29073 938 3.13 4705
std_og_t2t_mou_8 float64 29073 938 3.13 11781
loc_og_mou_8 float64 29073 938 3.13 18885
std_ic_t2o_mou_8 category 29073 938 3.13 1
loc_ic_t2m_mou_8 float64 29073 938 3.13 15598
std_ic_t2m_mou_8 float64 29073 938 3.13 6420
std_ic_t2t_mou_8 float64 29073 938 3.13 4486
std_og_t2f_mou_8 float64 29073 938 3.13 1627
std_ic_t2f_mou_8 float64 29073 938 3.13 1941
spl_og_mou_8 float64 29073 938 3.13 3238
loc_ic_t2t_mou_8 float64 29073 938 3.13 9671
std_og_t2c_mou_8 category 29073 938 3.13 1
isd_og_mou_8 float64 29073 938 3.13 940
loc_ic_mou_8 float64 29073 938 3.13 18573
roam_ic_mou_8 float64 29073 938 3.13 3655
isd_ic_mou_8 float64 29073 938 3.13 3493
onnet_mou_8 float64 29073 938 3.13 17604
loc_og_t2c_mou_8 float64 29073 938 3.13 1730
spl_ic_mou_8 float64 29073 938 3.13 85
loc_og_t2f_mou_8 float64 29073 938 3.13 3124
std_ic_mou_8 float64 29073 938 3.13 8033
roam_og_mou_8 float64 29073 938 3.13 4382
ic_others_8 float64 29073 938 3.13 1259
loc_og_t2m_mou_8 float64 29073 938 3.13 16165
loc_og_t2t_mou_8 float64 29073 938 3.13 10772
offnet_mou_8 float64 29073 938 3.13 21513
date_of_last_rech_9 datetime64[ns] 29145 866 2.89 30
date_of_last_rech_8 datetime64[ns] 29417 594 1.98 31
last_date_of_month_9 object 29651 360 1.20 1
std_ic_mou_6 float64 29695 316 1.05 8391
offnet_mou_6 float64 29695 316 1.05 22454
std_ic_t2f_mou_6 float64 29695 316 1.05 2033
isd_ic_mou_6 float64 29695 316 1.05 3429
ic_others_6 float64 29695 316 1.05 1227
onnet_mou_6 float64 29695 316 1.05 18813
std_ic_t2m_mou_6 float64 29695 316 1.05 6680
loc_ic_t2t_mou_6 float64 29695 316 1.05 9872
loc_ic_t2m_mou_6 float64 29695 316 1.05 16015
loc_ic_t2f_mou_6 float64 29695 316 1.05 4817
loc_ic_mou_6 float64 29695 316 1.05 19133
std_ic_t2t_mou_6 float64 29695 316 1.05 4608
og_others_6 float64 29695 316 1.05 862
spl_og_mou_6 float64 29695 316 1.05 3053
roam_ic_mou_6 float64 29695 316 1.05 4338
spl_ic_mou_6 float64 29695 316 1.05 78
std_og_t2t_mou_6 float64 29695 316 1.05 12777
loc_og_t2c_mou_6 float64 29695 316 1.05 1658
std_og_t2m_mou_6 float64 29695 316 1.05 14518
loc_og_t2f_mou_6 float64 29695 316 1.05 3252
std_og_t2f_mou_6 float64 29695 316 1.05 1773
loc_og_t2m_mou_6 float64 29695 316 1.05 16747
std_ic_t2o_mou_6 category 29695 316 1.05 1
std_og_t2c_mou_6 category 29695 316 1.05 1
std_og_mou_6 float64 29695 316 1.05 18325
loc_og_t2t_mou_6 float64 29695 316 1.05 11151
isd_og_mou_6 float64 29695 316 1.05 1113
roam_og_mou_6 float64 29695 316 1.05 5174
loc_og_mou_6 float64 29695 316 1.05 19691
isd_ic_mou_7 float64 29708 303 1.01 3639
std_ic_t2f_mou_7 float64 29708 303 1.01 2075
std_ic_t2m_mou_7 float64 29708 303 1.01 6747
std_ic_t2o_mou_7 category 29708 303 1.01 1
ic_others_7 float64 29708 303 1.01 1371
spl_ic_mou_7 float64 29708 303 1.01 93
std_ic_t2t_mou_7 float64 29708 303 1.01 4706
std_ic_mou_7 float64 29708 303 1.01 8543
loc_ic_t2f_mou_7 float64 29708 303 1.01 4897
og_others_7 float64 29708 303 1.01 123
loc_ic_mou_7 float64 29708 303 1.01 19030
std_og_t2f_mou_7 float64 29708 303 1.01 1714
onnet_mou_7 float64 29708 303 1.01 18938
roam_ic_mou_7 float64 29708 303 1.01 3649
roam_og_mou_7 float64 29708 303 1.01 4431
loc_og_t2t_mou_7 float64 29708 303 1.01 11154
loc_og_t2m_mou_7 float64 29708 303 1.01 16872
loc_og_t2f_mou_7 float64 29708 303 1.01 3267
loc_og_t2c_mou_7 float64 29708 303 1.01 1750
loc_og_mou_7 float64 29708 303 1.01 19880
std_og_t2t_mou_7 float64 29708 303 1.01 12983
std_og_t2m_mou_7 float64 29708 303 1.01 14589
offnet_mou_7 float64 29708 303 1.01 22650
std_og_t2c_mou_7 category 29708 303 1.01 1
loc_ic_t2t_mou_7 float64 29708 303 1.01 9961
isd_og_mou_7 float64 29708 303 1.01 1125
spl_og_mou_7 float64 29708 303 1.01 3399
std_og_mou_7 float64 29708 303 1.01 18445
loc_ic_t2m_mou_7 float64 29708 303 1.01 16068
last_date_of_month_8 object 29854 157 0.52 1
std_og_t2o_mou category 29897 114 0.38 1
loc_ic_t2o_mou category 29897 114 0.38 1
date_of_last_rech_7 datetime64[ns] 29897 114 0.38 31
loc_og_t2o_mou category 29897 114 0.38 1
date_of_last_rech_6 datetime64[ns] 29949 62 0.21 30
last_date_of_month_7 object 29980 31 0.10 1
sachet_3g_6 category 30011 0 0.00 25
monthly_2g_8 category 30011 0 0.00 6
vol_2g_mb_8 float64 30011 0 0.00 7310
vol_2g_mb_9 float64 30011 0 0.00 6984
vol_2g_mb_6 float64 30011 0 0.00 7809
sachet_3g_9 category 30011 0 0.00 27
sachet_3g_8 category 30011 0 0.00 29
monthly_3g_9 category 30011 0 0.00 11
vol_3g_mb_6 float64 30011 0 0.00 7043
vol_3g_mb_7 float64 30011 0 0.00 7440
vol_3g_mb_8 float64 30011 0 0.00 7151
vol_3g_mb_9 float64 30011 0 0.00 7016
monthly_2g_6 category 30011 0 0.00 5
monthly_2g_7 category 30011 0 0.00 6
monthly_2g_9 category 30011 0 0.00 5
sachet_3g_7 category 30011 0 0.00 27
sachet_2g_6 category 30011 0 0.00 30
sachet_2g_7 category 30011 0 0.00 34
sachet_2g_8 category 30011 0 0.00 34
sachet_2g_9 category 30011 0 0.00 29
vbc_3g_9 float64 30011 0 0.00 2171
monthly_3g_8 category 30011 0 0.00 12
monthly_3g_7 category 30011 0 0.00 15
vbc_3g_6 float64 30011 0 0.00 6864
vbc_3g_7 float64 30011 0 0.00 7318
vbc_3g_8 float64 30011 0 0.00 7291
aon int64 30011 0 0.00 3321
monthly_3g_6 category 30011 0 0.00 12
vol_2g_mb_7 float64 30011 0 0.00 7813
circle_id category 30011 0 0.00 1
last_day_rch_amt_9 int64 30011 0 0.00 170
last_day_rch_amt_8 int64 30011 0 0.00 179
last_date_of_month_6 object 30011 0 0.00 1
arpu_6 float64 30011 0 0.00 29261
arpu_7 float64 30011 0 0.00 29260
arpu_8 float64 30011 0 0.00 28405
arpu_9 float64 30011 0 0.00 27327
total_og_mou_6 float64 30011 0 0.00 24607
total_og_mou_7 float64 30011 0 0.00 24913
total_og_mou_8 float64 30011 0 0.00 23644
total_og_mou_9 float64 30011 0 0.00 22615
total_ic_mou_6 float64 30011 0 0.00 20602
total_ic_mou_7 float64 30011 0 0.00 20711
total_ic_mou_8 float64 30011 0 0.00 20096
total_ic_mou_9 float64 30011 0 0.00 19437
total_rech_num_6 int64 30011 0 0.00 102
total_rech_num_7 int64 30011 0 0.00 101
total_rech_num_8 int64 30011 0 0.00 96
total_rech_num_9 int64 30011 0 0.00 96
total_rech_amt_6 int64 30011 0 0.00 2241
total_rech_amt_7 int64 30011 0 0.00 2265
total_rech_amt_8 int64 30011 0 0.00 2299
total_rech_amt_9 int64 30011 0 0.00 2248
max_rech_amt_6 int64 30011 0 0.00 170
max_rech_amt_7 int64 30011 0 0.00 151
max_rech_amt_8 int64 30011 0 0.00 182
max_rech_amt_9 int64 30011 0 0.00 186
last_day_rch_amt_6 int64 30011 0 0.00 158
last_day_rch_amt_7 int64 30011 0 0.00 149
Average_rech_amt_6n7 float64 30011 0 0.00 3025
  • The data contains information for 4 months: 6, 7, 8 and 9.
  • For the purpose of missing value treatment, each month's revenue and usage data is treated as unrelated to the other months.
  • Hence, missing value treatment can be performed month-wise.
In [24]:
# Month 6 
In [25]:
sixth_month_columns = []
for column in data.columns:
    x = re.search("6$", column)
    if x:
        sixth_month_columns.append(column)
# missing_values.loc[sixth_month_columns].sort_values(by='Null_Percentage', ascending=False)
metadata = metadata_matrix(data)
condition = metadata.index.isin(sixth_month_columns)
sixth_month_metadata = metadata[condition]
sixth_month_metadata
Out[25]:
Datatype Non_Null_Count Null_Count Null_Percentage Unique_Values_Count
std_ic_mou_6 float64 29695 316 1.05 8391
offnet_mou_6 float64 29695 316 1.05 22454
std_ic_t2f_mou_6 float64 29695 316 1.05 2033
isd_ic_mou_6 float64 29695 316 1.05 3429
ic_others_6 float64 29695 316 1.05 1227
onnet_mou_6 float64 29695 316 1.05 18813
std_ic_t2m_mou_6 float64 29695 316 1.05 6680
loc_ic_t2t_mou_6 float64 29695 316 1.05 9872
loc_ic_t2m_mou_6 float64 29695 316 1.05 16015
loc_ic_t2f_mou_6 float64 29695 316 1.05 4817
loc_ic_mou_6 float64 29695 316 1.05 19133
std_ic_t2t_mou_6 float64 29695 316 1.05 4608
og_others_6 float64 29695 316 1.05 862
spl_og_mou_6 float64 29695 316 1.05 3053
roam_ic_mou_6 float64 29695 316 1.05 4338
spl_ic_mou_6 float64 29695 316 1.05 78
std_og_t2t_mou_6 float64 29695 316 1.05 12777
loc_og_t2c_mou_6 float64 29695 316 1.05 1658
std_og_t2m_mou_6 float64 29695 316 1.05 14518
loc_og_t2f_mou_6 float64 29695 316 1.05 3252
std_og_t2f_mou_6 float64 29695 316 1.05 1773
loc_og_t2m_mou_6 float64 29695 316 1.05 16747
std_ic_t2o_mou_6 category 29695 316 1.05 1
std_og_t2c_mou_6 category 29695 316 1.05 1
std_og_mou_6 float64 29695 316 1.05 18325
loc_og_t2t_mou_6 float64 29695 316 1.05 11151
isd_og_mou_6 float64 29695 316 1.05 1113
roam_og_mou_6 float64 29695 316 1.05 5174
loc_og_mou_6 float64 29695 316 1.05 19691
date_of_last_rech_6 datetime64[ns] 29949 62 0.21 30
sachet_3g_6 category 30011 0 0.00 25
vol_2g_mb_6 float64 30011 0 0.00 7809
vol_3g_mb_6 float64 30011 0 0.00 7043
monthly_2g_6 category 30011 0 0.00 5
sachet_2g_6 category 30011 0 0.00 30
vbc_3g_6 float64 30011 0 0.00 6864
monthly_3g_6 category 30011 0 0.00 12
last_date_of_month_6 object 30011 0 0.00 1
arpu_6 float64 30011 0 0.00 29261
total_og_mou_6 float64 30011 0 0.00 24607
total_ic_mou_6 float64 30011 0 0.00 20602
total_rech_num_6 int64 30011 0 0.00 102
total_rech_amt_6 int64 30011 0 0.00 2241
max_rech_amt_6 int64 30011 0 0.00 170
last_day_rch_amt_6 int64 30011 0 0.00 158
  • Note that all the *_mou columns have exactly 1.05% of rows with missing values.
  • This is an indicator of meaningful missing values.
  • Further note that *_mou columns record minutes of usage, which apply only to customers using calling plans. It is probable that these 1.05% of customers are not using calling plans.
  • This can be confirmed by looking at 'total_og_mou_6' and 'total_ic_mou_6' for the rows where the *_mou columns are missing. If these totals are zero for a customer, then all *_mou columns should be zero too.
In [26]:
#  columns with meaningful missing in 6th month 
sixth_month_meaningful_missing_condition = sixth_month_metadata['Null_Percentage'] == 1.05
sixth_month_meaningful_missing_cols = sixth_month_metadata[sixth_month_meaningful_missing_condition].index.values
sixth_month_meaningful_missing_cols
Out[26]:
array(['std_ic_mou_6', 'offnet_mou_6', 'std_ic_t2f_mou_6', 'isd_ic_mou_6',
       'ic_others_6', 'onnet_mou_6', 'std_ic_t2m_mou_6',
       'loc_ic_t2t_mou_6', 'loc_ic_t2m_mou_6', 'loc_ic_t2f_mou_6',
       'loc_ic_mou_6', 'std_ic_t2t_mou_6', 'og_others_6', 'spl_og_mou_6',
       'roam_ic_mou_6', 'spl_ic_mou_6', 'std_og_t2t_mou_6',
       'loc_og_t2c_mou_6', 'std_og_t2m_mou_6', 'loc_og_t2f_mou_6',
       'std_og_t2f_mou_6', 'loc_og_t2m_mou_6', 'std_ic_t2o_mou_6',
       'std_og_t2c_mou_6', 'std_og_mou_6', 'loc_og_t2t_mou_6',
       'isd_og_mou_6', 'roam_og_mou_6', 'loc_og_mou_6'], dtype=object)
In [27]:
# Flag the rows where every sixth-month *_mou column listed above is null
missing_rows = data[sixth_month_meaningful_missing_cols].isnull().all(axis=1)

print('Total outgoing mou for each customer with missing *_mou data is ', data.loc[missing_rows,'total_og_mou_6'].unique()[0])
print('Total incoming mou for each customer with missing *_mou data is ', data.loc[missing_rows,'total_ic_mou_6'].unique()[0])
Total outgoing mou for each customer with missing *_mou data is  0.0
Total incoming mou for each customer with missing *_mou data is  0.0
  • Hence, these could be imputed with 0
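Printing `.unique()[0]` shows only the first distinct value, so it does not by itself prove that every affected customer has zero total usage. A stricter check, sketched below on a made-up frame, tests all rows before imputing. The frame `toy` and the two-column list `mou_cols` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    "onnet_mou_6": [np.nan, 12.5, np.nan],
    "offnet_mou_6": [np.nan, 3.0, np.nan],
    "total_og_mou_6": [0.0, 15.5, 0.0],
    "total_ic_mou_6": [0.0, 4.0, 0.0],
})
mou_cols = ["onnet_mou_6", "offnet_mou_6"]
total_cols = ["total_og_mou_6", "total_ic_mou_6"]

# Rows where every *_mou column is missing.
missing_rows = toy[mou_cols].isnull().all(axis=1)

# True only if both totals are zero for every such row.
totals_all_zero = toy.loc[missing_rows, total_cols].eq(0).all().all()

# Impute with 0 only when the check holds.
if totals_all_zero:
    toy[mou_cols] = toy[mou_cols].fillna(0)

print(totals_all_zero, toy[mou_cols].isnull().sum().sum())
```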
In [28]:
# Imputation
data[sixth_month_meaningful_missing_cols] = data[sixth_month_meaningful_missing_cols].fillna(0)

metadata = metadata_matrix(data)

# Remaining Missing Values
metadata.iloc[metadata.index.isin(sixth_month_columns)]
Out[28]:
Datatype Non_Null_Count Null_Count Null_Percentage Unique_Values_Count
date_of_last_rech_6 datetime64[ns] 29949 62 0.21 30
monthly_2g_6 category 30011 0 0.00 5
vbc_3g_6 float64 30011 0 0.00 6864
max_rech_amt_6 int64 30011 0 0.00 170
sachet_3g_6 category 30011 0 0.00 25
sachet_2g_6 category 30011 0 0.00 30
vol_2g_mb_6 float64 30011 0 0.00 7809
monthly_3g_6 category 30011 0 0.00 12
vol_3g_mb_6 float64 30011 0 0.00 7043
last_day_rch_amt_6 int64 30011 0 0.00 158
total_rech_amt_6 int64 30011 0 0.00 2241
loc_og_t2m_mou_6 float64 30011 0 0.00 16747
isd_og_mou_6 float64 30011 0 0.00 1113
std_og_mou_6 float64 30011 0 0.00 18325
std_og_t2c_mou_6 category 30011 0 0.00 1
std_og_t2f_mou_6 float64 30011 0 0.00 1773
std_og_t2m_mou_6 float64 30011 0 0.00 14518
std_og_t2t_mou_6 float64 30011 0 0.00 12777
loc_og_mou_6 float64 30011 0 0.00 19691
loc_og_t2c_mou_6 float64 30011 0 0.00 1658
loc_og_t2f_mou_6 float64 30011 0 0.00 3252
loc_og_t2t_mou_6 float64 30011 0 0.00 11151
roam_og_mou_6 float64 30011 0 0.00 5174
roam_ic_mou_6 float64 30011 0 0.00 4338
offnet_mou_6 float64 30011 0 0.00 22454
onnet_mou_6 float64 30011 0 0.00 18813
arpu_6 float64 30011 0 0.00 29261
last_date_of_month_6 object 30011 0 0.00 1
spl_og_mou_6 float64 30011 0 0.00 3053
og_others_6 float64 30011 0 0.00 862
total_og_mou_6 float64 30011 0 0.00 24607
total_rech_num_6 int64 30011 0 0.00 102
ic_others_6 float64 30011 0 0.00 1227
isd_ic_mou_6 float64 30011 0 0.00 3429
spl_ic_mou_6 float64 30011 0 0.00 78
total_ic_mou_6 float64 30011 0 0.00 20602
std_ic_mou_6 float64 30011 0 0.00 8391
std_ic_t2o_mou_6 category 30011 0 0.00 1
std_ic_t2f_mou_6 float64 30011 0 0.00 2033
std_ic_t2m_mou_6 float64 30011 0 0.00 6680
std_ic_t2t_mou_6 float64 30011 0 0.00 4608
loc_ic_mou_6 float64 30011 0 0.00 19133
loc_ic_t2f_mou_6 float64 30011 0 0.00 4817
loc_ic_t2m_mou_6 float64 30011 0 0.00 16015
loc_ic_t2t_mou_6 float64 30011 0 0.00 9872
  • It looks like 0.21% of customers have a missing date of last recharge. Let's look at the 'recharge'-related columns for these customers.
In [29]:
# Looking at 'recharge' related 6th month columns for customers with missing 'date_of_last_rech_6' 
condition = data['date_of_last_rech_6'].isnull()
data[condition].filter(regex='.*rech.*6$', axis=1).head()
Out[29]:
total_rech_num_6 total_rech_amt_6 max_rech_amt_6 date_of_last_rech_6
mobile_number
7001588448 0 0 0 NaT
7001223277 0 0 0 NaT
7000721536 0 0 0 NaT
7001490351 0 0 0 NaT
7000665415 0 0 0 NaT
In [30]:
data[condition].filter(regex='.*rech.*6$', axis=1).nunique()
Out[30]:
total_rech_num_6       1
total_rech_amt_6       1
max_rech_amt_6         1
date_of_last_rech_6    0
dtype: int64
  • Notice that each recharge-related column for customers with missing 'date_of_last_rech_6' has just one unique value. From the first few rows of the output, we see that this value is 0.
  • Hence, 'date_of_last_rech_6' is missing because no recharges were made in this month.
  • These are meaningful missing values.
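The reasoning above can be turned into an explicit check rather than read off `head()` and `nunique()`. The sketch below uses a made-up frame (`toy` is an assumption, not the case-study data) and tests both directions: a missing date implies zero recharges, and zero recharges imply a missing date.

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    "total_rech_num_6": [0, 3, 0],
    "total_rech_amt_6": [0, 450, 0],
    "max_rech_amt_6": [0, 200, 0],
    "date_of_last_rech_6": pd.to_datetime(
        [pd.NaT, pd.Timestamp("2014-06-21"), pd.NaT]
    ),
})
rech_cols = ["total_rech_num_6", "total_rech_amt_6", "max_rech_amt_6"]

# Customers with no recorded recharge date.
no_date = toy["date_of_last_rech_6"].isnull()

# Missing date implies all recharge columns are zero.
missing_means_no_recharge = toy.loc[no_date, rech_cols].eq(0).all().all()

# Converse: zero recharges implies the date is missing.
no_recharge = toy["total_rech_num_6"].eq(0)
no_recharge_means_missing = toy.loc[no_recharge, "date_of_last_rech_6"].isnull().all()

print(missing_means_no_recharge, no_recharge_means_missing)
```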
In [31]:
# Check for missing values in 6th month variables
metadata = metadata_matrix(data)
metadata[metadata.index.isin(sixth_month_columns)]
Out[31]:
Datatype Non_Null_Count Null_Count Null_Percentage Unique_Values_Count
date_of_last_rech_6 datetime64[ns] 29949 62 0.21 30
monthly_2g_6 category 30011 0 0.00 5
vbc_3g_6 float64 30011 0 0.00 6864
max_rech_amt_6 int64 30011 0 0.00 170
sachet_3g_6 category 30011 0 0.00 25
sachet_2g_6 category 30011 0 0.00 30
vol_2g_mb_6 float64 30011 0 0.00 7809
monthly_3g_6 category 30011 0 0.00 12
vol_3g_mb_6 float64 30011 0 0.00 7043
last_day_rch_amt_6 int64 30011 0 0.00 158
total_rech_amt_6 int64 30011 0 0.00 2241
loc_og_t2m_mou_6 float64 30011 0 0.00 16747
isd_og_mou_6 float64 30011 0 0.00 1113
std_og_mou_6 float64 30011 0 0.00 18325
std_og_t2c_mou_6 category 30011 0 0.00 1
std_og_t2f_mou_6 float64 30011 0 0.00 1773
std_og_t2m_mou_6 float64 30011 0 0.00 14518
std_og_t2t_mou_6 float64 30011 0 0.00 12777
loc_og_mou_6 float64 30011 0 0.00 19691
loc_og_t2c_mou_6 float64 30011 0 0.00 1658
loc_og_t2f_mou_6 float64 30011 0 0.00 3252
loc_og_t2t_mou_6 float64 30011 0 0.00 11151
roam_og_mou_6 float64 30011 0 0.00 5174
roam_ic_mou_6 float64 30011 0 0.00 4338
offnet_mou_6 float64 30011 0 0.00 22454
onnet_mou_6 float64 30011 0 0.00 18813
arpu_6 float64 30011 0 0.00 29261
last_date_of_month_6 object 30011 0 0.00 1
spl_og_mou_6 float64 30011 0 0.00 3053
og_others_6 float64 30011 0 0.00 862
total_og_mou_6 float64 30011 0 0.00 24607
total_rech_num_6 int64 30011 0 0.00 102
ic_others_6 float64 30011 0 0.00 1227
isd_ic_mou_6 float64 30011 0 0.00 3429
spl_ic_mou_6 float64 30011 0 0.00 78
total_ic_mou_6 float64 30011 0 0.00 20602
std_ic_mou_6 float64 30011 0 0.00 8391
std_ic_t2o_mou_6 category 30011 0 0.00 1
std_ic_t2f_mou_6 float64 30011 0 0.00 2033
std_ic_t2m_mou_6 float64 30011 0 0.00 6680
std_ic_t2t_mou_6 float64 30011 0 0.00 4608
loc_ic_mou_6 float64 30011 0 0.00 19133
loc_ic_t2f_mou_6 float64 30011 0 0.00 4817
loc_ic_t2m_mou_6 float64 30011 0 0.00 16015
loc_ic_t2t_mou_6 float64 30011 0 0.00 9872
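  • No missing values remain to be treated in the 6th month columns; the nulls left in 'date_of_last_rech_6' are meaningful and are left as NaT here.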
In [32]:
# Month : 7 
seventh_month_columns = data.filter(regex='7$', axis=1).columns
seventh_month_columns
Out[32]:
Index(['last_date_of_month_7', 'arpu_7', 'onnet_mou_7', 'offnet_mou_7',
       'roam_ic_mou_7', 'roam_og_mou_7', 'loc_og_t2t_mou_7',
       'loc_og_t2m_mou_7', 'loc_og_t2f_mou_7', 'loc_og_t2c_mou_7',
       'loc_og_mou_7', 'std_og_t2t_mou_7', 'std_og_t2m_mou_7',
       'std_og_t2f_mou_7', 'std_og_t2c_mou_7', 'std_og_mou_7', 'isd_og_mou_7',
       'spl_og_mou_7', 'og_others_7', 'total_og_mou_7', 'loc_ic_t2t_mou_7',
       'loc_ic_t2m_mou_7', 'loc_ic_t2f_mou_7', 'loc_ic_mou_7',
       'std_ic_t2t_mou_7', 'std_ic_t2m_mou_7', 'std_ic_t2f_mou_7',
       'std_ic_t2o_mou_7', 'std_ic_mou_7', 'total_ic_mou_7', 'spl_ic_mou_7',
       'isd_ic_mou_7', 'ic_others_7', 'total_rech_num_7', 'total_rech_amt_7',
       'max_rech_amt_7', 'date_of_last_rech_7', 'last_day_rch_amt_7',
       'vol_2g_mb_7', 'vol_3g_mb_7', 'monthly_2g_7', 'sachet_2g_7',
       'monthly_3g_7', 'sachet_3g_7', 'vbc_3g_7', 'Average_rech_amt_6n7'],
      dtype='object')
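Note that the pattern '7$' also matches the derived column 'Average_rech_amt_6n7', which appears at the end of the list above although it is not a 7th-month variable. A stricter pattern that requires an underscore before the month digit avoids this, as sketched below. The frame `toy` holds only a few of the column names for illustration.

```python
import pandas as pd

# Empty toy frame carrying a handful of the notebook's column names.
toy = pd.DataFrame(columns=[
    "arpu_7", "onnet_mou_7", "vbc_3g_7", "Average_rech_amt_6n7", "arpu_6",
])

# Loose pattern: any name ending in "7".
loose = toy.filter(regex="7$", axis=1).columns.tolist()

# Strict pattern: names ending in "_7" only.
strict = toy.filter(regex="_7$", axis=1).columns.tolist()

print(loose)
print(strict)
```

The extra column has no missing values, so it does not change the imputation in this section, but the stricter pattern keeps month-wise column lists clean for later steps.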
In [33]:
seventh_month_metadata = metadata[metadata.index.isin(seventh_month_columns)]
seventh_month_metadata
Out[33]:
Datatype Non_Null_Count Null_Count Null_Percentage Unique_Values_Count
loc_ic_t2t_mou_7 float64 29708 303 1.01 9961
og_others_7 float64 29708 303 1.01 123
loc_ic_t2f_mou_7 float64 29708 303 1.01 4897
loc_ic_t2m_mou_7 float64 29708 303 1.01 16068
loc_ic_mou_7 float64 29708 303 1.01 19030
std_ic_t2t_mou_7 float64 29708 303 1.01 4706
std_ic_t2f_mou_7 float64 29708 303 1.01 2075
std_ic_t2o_mou_7 category 29708 303 1.01 1
std_ic_mou_7 float64 29708 303 1.01 8543
spl_ic_mou_7 float64 29708 303 1.01 93
isd_ic_mou_7 float64 29708 303 1.01 3639
ic_others_7 float64 29708 303 1.01 1371
std_ic_t2m_mou_7 float64 29708 303 1.01 6747
isd_og_mou_7 float64 29708 303 1.01 1125
spl_og_mou_7 float64 29708 303 1.01 3399
std_og_t2f_mou_7 float64 29708 303 1.01 1714
onnet_mou_7 float64 29708 303 1.01 18938
offnet_mou_7 float64 29708 303 1.01 22650
roam_ic_mou_7 float64 29708 303 1.01 3649
roam_og_mou_7 float64 29708 303 1.01 4431
loc_og_t2t_mou_7 float64 29708 303 1.01 11154
loc_og_t2f_mou_7 float64 29708 303 1.01 3267
loc_og_t2c_mou_7 float64 29708 303 1.01 1750
loc_og_mou_7 float64 29708 303 1.01 19880
std_og_t2t_mou_7 float64 29708 303 1.01 12983
std_og_t2m_mou_7 float64 29708 303 1.01 14589
loc_og_t2m_mou_7 float64 29708 303 1.01 16872
std_og_t2c_mou_7 category 29708 303 1.01 1
std_og_mou_7 float64 29708 303 1.01 18445
date_of_last_rech_7 datetime64[ns] 29897 114 0.38 31
last_date_of_month_7 object 29980 31 0.10 1
vol_2g_mb_7 float64 30011 0 0.00 7813
max_rech_amt_7 int64 30011 0 0.00 151
vbc_3g_7 float64 30011 0 0.00 7318
sachet_3g_7 category 30011 0 0.00 27
total_rech_amt_7 int64 30011 0 0.00 2265
monthly_2g_7 category 30011 0 0.00 6
sachet_2g_7 category 30011 0 0.00 34
last_day_rch_amt_7 int64 30011 0 0.00 149
monthly_3g_7 category 30011 0 0.00 15
vol_3g_mb_7 float64 30011 0 0.00 7440
total_rech_num_7 int64 30011 0 0.00 101
arpu_7 float64 30011 0 0.00 29260
total_og_mou_7 float64 30011 0 0.00 24913
total_ic_mou_7 float64 30011 0 0.00 20711
Average_rech_amt_6n7 float64 30011 0 0.00 3025
  • Note that all the *_mou columns have exactly 1.01% of rows with missing values.
  • The identical count in every column indicates that these are meaningful missing values.
  • Further note that the *_mou columns record minutes of usage, which apply only to customers using calling plans. It is probable that this 1.01% of customers did not use calling plans in month 7.
  • This can be confirmed by looking at 'total_og_mou_7' and 'total_ic_mou_7' for the rows where the *_mou columns are missing. If both totals are zero for a customer, then all of that customer's *_mou columns should be zero too.
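The check described above can also be written without an explicit loop. The following is a minimal sketch on toy data; the frame `toy`, its values and the two-column `detail_cols` list are assumptions made for illustration, not part of the original dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): customer "c" has no call usage in month 7
toy = pd.DataFrame(
    {
        "onnet_mou_7": [10.5, 0.0, np.nan],
        "offnet_mou_7": [3.2, 7.1, np.nan],
        "total_og_mou_7": [13.7, 7.1, 0.0],
        "total_ic_mou_7": [4.0, 2.5, 0.0],
    },
    index=["a", "b", "c"],
)

detail_cols = ["onnet_mou_7", "offnet_mou_7"]
total_cols = ["total_og_mou_7", "total_ic_mou_7"]

# True for rows where every detail column is missing
all_missing = toy[detail_cols].isnull().all(axis=1)

# True only if both totals are zero for every such row
totals_are_zero = bool(toy.loc[all_missing, total_cols].eq(0).all().all())

print(all_missing.sum(), totals_are_zero)
```

Unlike printing only the first unique value, `eq(0).all()` fails visibly if any affected customer has a non-zero total.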
In [34]:
# Columns with meaningful missing values in the 7th month
seventh_month_meaningful_missing_condition = seventh_month_metadata['Null_Percentage'] == 1.01
seventh_month_meaningful_missing_cols = seventh_month_metadata[seventh_month_meaningful_missing_condition].index.values
seventh_month_meaningful_missing_cols
Out[34]:
array(['loc_ic_t2t_mou_7', 'og_others_7', 'loc_ic_t2f_mou_7',
       'loc_ic_t2m_mou_7', 'loc_ic_mou_7', 'std_ic_t2t_mou_7',
       'std_ic_t2f_mou_7', 'std_ic_t2o_mou_7', 'std_ic_mou_7',
       'spl_ic_mou_7', 'isd_ic_mou_7', 'ic_others_7', 'std_ic_t2m_mou_7',
       'isd_og_mou_7', 'spl_og_mou_7', 'std_og_t2f_mou_7', 'onnet_mou_7',
       'offnet_mou_7', 'roam_ic_mou_7', 'roam_og_mou_7',
       'loc_og_t2t_mou_7', 'loc_og_t2f_mou_7', 'loc_og_t2c_mou_7',
       'loc_og_mou_7', 'std_og_t2t_mou_7', 'std_og_t2m_mou_7',
       'loc_og_t2m_mou_7', 'std_og_t2c_mou_7', 'std_og_mou_7'],
      dtype=object)
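Selecting columns by comparing a rounded float such as `Null_Percentage == 1.01` works here, but it depends on how the percentage was rounded. A possible alternative, sketched below, is to match on the integer `Null_Count` instead. The small `toy_metadata` frame is an assumption that copies four rows from the table above; it is not produced by `metadata_matrix`.

```python
import pandas as pd

# Toy metadata in the same layout as the table above (four rows copied from it)
toy_metadata = pd.DataFrame(
    {
        "Null_Count": [303, 303, 114, 0],
        "Null_Percentage": [1.01, 1.01, 0.38, 0.00],
    },
    index=["onnet_mou_7", "offnet_mou_7", "date_of_last_rech_7", "arpu_7"],
)

# Integer comparison avoids any dependence on float rounding
shared_null_count = 303
meaningful_missing_cols = toy_metadata.index[
    toy_metadata["Null_Count"] == shared_null_count
].tolist()

print(meaningful_missing_cols)
```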
In [35]:
# Looking at all 7th month columns where rows of *_mou are null
condition = data[seventh_month_meaningful_missing_cols].isnull()

# Rows that are null in every one of the above columns
missing_rows = pd.Series([True]*data.shape[0], index = data.index)
for column in seventh_month_meaningful_missing_cols : 
    missing_rows = missing_rows & data[column].isnull()

print('Total outgoing mou for each customer with missing *_mou data is ', data.loc[missing_rows,'total_og_mou_7'].unique()[0])
print('Total incoming mou for each customer with missing *_mou data is ', data.loc[missing_rows,'total_ic_mou_7'].unique()[0])
Total outgoing mou for each customer with missing *_mou data is  0.0
Total incoming mou for each customer with missing *_mou data is  0.0
  • Hence, these missing values can be imputed with 0.
In [36]:
# Imputation
data[seventh_month_meaningful_missing_cols] = data[seventh_month_meaningful_missing_cols].fillna(0)

metadata = metadata_matrix(data)

# Remaining Missing Values
metadata.iloc[metadata.index.isin(seventh_month_columns)]
Out[36]:
Datatype Non_Null_Count Null_Count Null_Percentage Unique_Values_Count
date_of_last_rech_7 datetime64[ns] 29897 114 0.38 31
last_date_of_month_7 object 29980 31 0.10 1
total_rech_num_7 int64 30011 0 0.00 101
ic_others_7 float64 30011 0 0.00 1371
isd_ic_mou_7 float64 30011 0 0.00 3639
spl_ic_mou_7 float64 30011 0 0.00 93
total_rech_amt_7 int64 30011 0 0.00 2265
sachet_2g_7 category 30011 0 0.00 34
monthly_3g_7 category 30011 0 0.00 15
sachet_3g_7 category 30011 0 0.00 27
vbc_3g_7 float64 30011 0 0.00 7318
max_rech_amt_7 int64 30011 0 0.00 151
last_day_rch_amt_7 int64 30011 0 0.00 149
vol_2g_mb_7 float64 30011 0 0.00 7813
monthly_2g_7 category 30011 0 0.00 6
vol_3g_mb_7 float64 30011 0 0.00 7440
loc_ic_t2f_mou_7 float64 30011 0 0.00 4897
total_ic_mou_7 float64 30011 0 0.00 20711
loc_og_t2t_mou_7 float64 30011 0 0.00 11154
std_og_t2m_mou_7 float64 30011 0 0.00 14589
std_og_t2t_mou_7 float64 30011 0 0.00 12983
loc_og_mou_7 float64 30011 0 0.00 19880
loc_og_t2c_mou_7 float64 30011 0 0.00 1750
loc_og_t2f_mou_7 float64 30011 0 0.00 3267
loc_og_t2m_mou_7 float64 30011 0 0.00 16872
roam_og_mou_7 float64 30011 0 0.00 4431
roam_ic_mou_7 float64 30011 0 0.00 3649
offnet_mou_7 float64 30011 0 0.00 22650
onnet_mou_7 float64 30011 0 0.00 18938
arpu_7 float64 30011 0 0.00 29260
std_og_t2f_mou_7 float64 30011 0 0.00 1714
std_og_t2c_mou_7 category 30011 0 0.00 1
loc_ic_t2m_mou_7 float64 30011 0 0.00 16068
std_ic_mou_7 float64 30011 0 0.00 8543
std_ic_t2o_mou_7 category 30011 0 0.00 1
std_ic_t2f_mou_7 float64 30011 0 0.00 2075
std_ic_t2m_mou_7 float64 30011 0 0.00 6747
std_ic_t2t_mou_7 float64 30011 0 0.00 4706
loc_ic_mou_7 float64 30011 0 0.00 19030
loc_ic_t2t_mou_7 float64 30011 0 0.00 9961
total_og_mou_7 float64 30011 0 0.00 24913
og_others_7 float64 30011 0 0.00 123
spl_og_mou_7 float64 30011 0 0.00 3399
isd_og_mou_7 float64 30011 0 0.00 1125
std_og_mou_7 float64 30011 0 0.00 18445
Average_rech_amt_6n7 float64 30011 0 0.00 3025
  • It looks like 0.38% of customers have a missing date of last recharge. Let's look at the recharge-related columns for these customers.
In [37]:
# Looking at 'recharge' related 7th month columns for customers with missing 'date_of_last_rech_7' 
condition = data['date_of_last_rech_7'].isnull()
data[condition].filter(regex='.*rech.*7$', axis=1).head()
Out[37]:
total_rech_num_7 total_rech_amt_7 max_rech_amt_7 date_of_last_rech_7 Average_rech_amt_6n7
mobile_number
7000369789 0 0 0 NaT 393.0
7001967148 0 0 0 NaT 500.5
7000066601 0 0 0 NaT 490.0
7001189556 0 0 0 NaT 523.5
7002024450 0 0 0 NaT 493.0
In [38]:
data[condition].filter(regex='.*rech.*7$', axis=1).nunique()
Out[38]:
total_rech_num_7         1
total_rech_amt_7         1
max_rech_amt_7           1
date_of_last_rech_7      0
Average_rech_amt_6n7    90
dtype: int64
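  • Notice that the month 7 recharge columns ('total_rech_num_7', 'total_rech_amt_7' and 'max_rech_amt_7') have just one unique value for customers with a missing 'date_of_last_rech_7'. From the first few rows of the output, we see that this value is 0. 'Average_rech_amt_6n7' still varies, presumably because it also reflects month 6 recharges.
  • Hence, 'date_of_last_rech_7' is missing because no recharges were made in this month.
  • These are meaningful missing values.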
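The reasoning above can be expressed as a direct comparison of two boolean masks. This is a minimal sketch on toy data with hypothetical values. It also tests the converse (every customer with zero recharges has a missing date), which is an additional assumption not checked in the notebook.

```python
import pandas as pd

# Toy data (hypothetical values): two customers made no recharge in month 7
toy = pd.DataFrame(
    {
        "date_of_last_rech_7": pd.to_datetime(
            ["2014-07-25", None, "2014-07-30", None]
        ),
        "total_rech_num_7": [3, 0, 5, 0],
        "total_rech_amt_7": [250, 0, 600, 0],
    }
)

no_date = toy["date_of_last_rech_7"].isnull()
no_recharge = toy["total_rech_num_7"].eq(0)

# If the missing dates are meaningful, the two masks should coincide
masks_agree = bool((no_date == no_recharge).all())

print(int(no_date.sum()), masks_agree)
```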
In [39]:
# Month : 8 
In [40]:
eighth_month_columns = data.filter(regex="8$", axis=1).columns
metadata = metadata_matrix(data)
condition = metadata.index.isin(eighth_month_columns)
eighth_month_metadata = metadata[condition]
eighth_month_metadata
Out[40]:
Datatype Non_Null_Count Null_Count Null_Percentage Unique_Values_Count
std_og_t2c_mou_8 category 29073 938 3.13 1
std_og_mou_8 float64 29073 938 3.13 16864
isd_og_mou_8 float64 29073 938 3.13 940
loc_ic_mou_8 float64 29073 938 3.13 18573
std_og_t2m_mou_8 float64 29073 938 3.13 13326
loc_ic_t2m_mou_8 float64 29073 938 3.13 15598
loc_og_mou_8 float64 29073 938 3.13 18885
std_og_t2t_mou_8 float64 29073 938 3.13 11781
std_og_t2f_mou_8 float64 29073 938 3.13 1627
loc_ic_t2f_mou_8 float64 29073 938 3.13 4705
loc_og_t2c_mou_8 float64 29073 938 3.13 1730
ic_others_8 float64 29073 938 3.13 1259
loc_og_t2m_mou_8 float64 29073 938 3.13 16165
spl_og_mou_8 float64 29073 938 3.13 3238
roam_ic_mou_8 float64 29073 938 3.13 3655
std_ic_mou_8 float64 29073 938 3.13 8033
spl_ic_mou_8 float64 29073 938 3.13 85
std_ic_t2o_mou_8 category 29073 938 3.13 1
onnet_mou_8 float64 29073 938 3.13 17604
loc_og_t2f_mou_8 float64 29073 938 3.13 3124
offnet_mou_8 float64 29073 938 3.13 21513
std_ic_t2f_mou_8 float64 29073 938 3.13 1941
og_others_8 float64 29073 938 3.13 133
loc_ic_t2t_mou_8 float64 29073 938 3.13 9671
std_ic_t2m_mou_8 float64 29073 938 3.13 6420
std_ic_t2t_mou_8 float64 29073 938 3.13 4486
roam_og_mou_8 float64 29073 938 3.13 4382
isd_ic_mou_8 float64 29073 938 3.13 3493
loc_og_t2t_mou_8 float64 29073 938 3.13 10772
date_of_last_rech_8 datetime64[ns] 29417 594 1.98 31
last_date_of_month_8 object 29854 157 0.52 1
total_rech_num_8 int64 30011 0 0.00 96
total_rech_amt_8 int64 30011 0 0.00 2299
last_day_rch_amt_8 int64 30011 0 0.00 179
sachet_2g_8 category 30011 0 0.00 34
monthly_3g_8 category 30011 0 0.00 12
sachet_3g_8 category 30011 0 0.00 29
vbc_3g_8 float64 30011 0 0.00 7291
monthly_2g_8 category 30011 0 0.00 6
max_rech_amt_8 int64 30011 0 0.00 182
total_ic_mou_8 float64 30011 0 0.00 20096
vol_2g_mb_8 float64 30011 0 0.00 7310
vol_3g_mb_8 float64 30011 0 0.00 7151
arpu_8 float64 30011 0 0.00 28405
total_og_mou_8 float64 30011 0 0.00 23644
In [41]:
# Columns with meaningful missing values in the 8th month
eighth_month_meaningful_missing_condition = eighth_month_metadata['Null_Percentage'] == 3.13
eighth_month_meaningful_missing_cols = eighth_month_metadata[eighth_month_meaningful_missing_condition].index.values
eighth_month_meaningful_missing_cols
Out[41]:
array(['std_og_t2c_mou_8', 'std_og_mou_8', 'isd_og_mou_8', 'loc_ic_mou_8',
       'std_og_t2m_mou_8', 'loc_ic_t2m_mou_8', 'loc_og_mou_8',
       'std_og_t2t_mou_8', 'std_og_t2f_mou_8', 'loc_ic_t2f_mou_8',
       'loc_og_t2c_mou_8', 'ic_others_8', 'loc_og_t2m_mou_8',
       'spl_og_mou_8', 'roam_ic_mou_8', 'std_ic_mou_8', 'spl_ic_mou_8',
       'std_ic_t2o_mou_8', 'onnet_mou_8', 'loc_og_t2f_mou_8',
       'offnet_mou_8', 'std_ic_t2f_mou_8', 'og_others_8',
       'loc_ic_t2t_mou_8', 'std_ic_t2m_mou_8', 'std_ic_t2t_mou_8',
       'roam_og_mou_8', 'isd_ic_mou_8', 'loc_og_t2t_mou_8'], dtype=object)
In [42]:
# Looking at all 8th month columns where rows of *_mou are null
condition = data[eighth_month_meaningful_missing_cols].isnull()

# Rows that are null in every one of the above columns
missing_rows = pd.Series([True]*data.shape[0], index = data.index)
for column in eighth_month_meaningful_missing_cols : 
    missing_rows = missing_rows & data[column].isnull()

print('Total outgoing mou for each customer with missing *_mou data is ', data.loc[missing_rows,'total_og_mou_8'].unique()[0])
print('Total incoming mou for each customer with missing *_mou data is ', data.loc[missing_rows,'total_ic_mou_8'].unique()[0])
Total outgoing mou for each customer with missing *_mou data is  0.0
Total incoming mou for each customer with missing *_mou data is  0.0
In [43]:
# Imputation
data[eighth_month_meaningful_missing_cols] = data[eighth_month_meaningful_missing_cols].fillna(0)

metadata = metadata_matrix(data)

# Remaining Missing Values
metadata.iloc[metadata.index.isin(eighth_month_columns)]
Out[43]:
Datatype Non_Null_Count Null_Count Null_Percentage Unique_Values_Count
date_of_last_rech_8 datetime64[ns] 29417 594 1.98 31
last_date_of_month_8 object 29854 157 0.52 1
spl_ic_mou_8 float64 30011 0 0.00 85
total_rech_num_8 int64 30011 0 0.00 96
std_ic_t2f_mou_8 float64 30011 0 0.00 1941
ic_others_8 float64 30011 0 0.00 1259
std_ic_t2o_mou_8 category 30011 0 0.00 1
std_ic_mou_8 float64 30011 0 0.00 8033
total_ic_mou_8 float64 30011 0 0.00 20096
isd_ic_mou_8 float64 30011 0 0.00 3493
sachet_2g_8 category 30011 0 0.00 34
monthly_3g_8 category 30011 0 0.00 12
sachet_3g_8 category 30011 0 0.00 29
vbc_3g_8 float64 30011 0 0.00 7291
monthly_2g_8 category 30011 0 0.00 6
total_rech_amt_8 int64 30011 0 0.00 2299
max_rech_amt_8 int64 30011 0 0.00 182
last_day_rch_amt_8 int64 30011 0 0.00 179
vol_2g_mb_8 float64 30011 0 0.00 7310
vol_3g_mb_8 float64 30011 0 0.00 7151
std_ic_t2m_mou_8 float64 30011 0 0.00 6420
loc_og_t2m_mou_8 float64 30011 0 0.00 16165
loc_og_t2f_mou_8 float64 30011 0 0.00 3124
loc_og_t2c_mou_8 float64 30011 0 0.00 1730
loc_og_mou_8 float64 30011 0 0.00 18885
std_og_t2t_mou_8 float64 30011 0 0.00 11781
loc_og_t2t_mou_8 float64 30011 0 0.00 10772
onnet_mou_8 float64 30011 0 0.00 17604
arpu_8 float64 30011 0 0.00 28405
roam_og_mou_8 float64 30011 0 0.00 4382
offnet_mou_8 float64 30011 0 0.00 21513
roam_ic_mou_8 float64 30011 0 0.00 3655
std_og_t2m_mou_8 float64 30011 0 0.00 13326
loc_ic_t2t_mou_8 float64 30011 0 0.00 9671
loc_ic_t2m_mou_8 float64 30011 0 0.00 15598
loc_ic_t2f_mou_8 float64 30011 0 0.00 4705
loc_ic_mou_8 float64 30011 0 0.00 18573
std_ic_t2t_mou_8 float64 30011 0 0.00 4486
total_og_mou_8 float64 30011 0 0.00 23644
og_others_8 float64 30011 0 0.00 133
std_og_t2f_mou_8 float64 30011 0 0.00 1627
std_og_t2c_mou_8 category 30011 0 0.00 1
std_og_mou_8 float64 30011 0 0.00 16864
isd_og_mou_8 float64 30011 0 0.00 940
spl_og_mou_8 float64 30011 0 0.00 3238
In [44]:
# Looking at 'recharge' related 8th month columns for customers with missing 'date_of_last_rech_8' 
condition = data['date_of_last_rech_8'].isnull()
data[condition].filter(regex='.*rech.*8$', axis=1).head()
Out[44]:
total_rech_num_8 total_rech_amt_8 max_rech_amt_8 date_of_last_rech_8
mobile_number
7000340381 0 0 0 NaT
7000608224 0 0 0 NaT
7000369789 0 0 0 NaT
7000248548 0 0 0 NaT
7001967063 0 0 0 NaT
In [45]:
data[condition].filter(regex='.*rech.*8$', axis=1).nunique()
Out[45]:
total_rech_num_8       1
total_rech_amt_8       1
max_rech_amt_8         1
date_of_last_rech_8    0
dtype: int64
In [46]:
# Month : 9
In [47]:
ninth_month_columns = data.filter(regex="9$", axis=1).columns
metadata = metadata_matrix(data)
condition = metadata.index.isin(ninth_month_columns)
ninth_month_metadata = metadata[condition]
ninth_month_metadata
Out[47]:
Datatype Non_Null_Count Null_Count Null_Percentage Unique_Values_Count
std_og_t2c_mou_9 category 28307 1704 5.68 1
spl_ic_mou_9 float64 28307 1704 5.68 287
loc_og_t2m_mou_9 float64 28307 1704 5.68 15585
og_others_9 float64 28307 1704 5.68 132
loc_og_t2c_mou_9 float64 28307 1704 5.68 1576
isd_ic_mou_9 float64 28307 1704 5.68 3329
loc_og_t2t_mou_9 float64 28307 1704 5.68 10360
spl_og_mou_9 float64 28307 1704 5.68 2966
loc_ic_t2t_mou_9 float64 28307 1704 5.68 9407
loc_og_mou_9 float64 28307 1704 5.68 18207
roam_og_mou_9 float64 28307 1704 5.68 4004
std_ic_mou_9 float64 28307 1704 5.68 7745
loc_ic_t2m_mou_9 float64 28307 1704 5.68 15194
roam_ic_mou_9 float64 28307 1704 5.68 3370
std_og_t2t_mou_9 float64 28307 1704 5.68 11141
offnet_mou_9 float64 28307 1704 5.68 20452
loc_ic_t2f_mou_9 float64 28307 1704 5.68 4611
std_ic_t2f_mou_9 float64 28307 1704 5.68 1971
isd_og_mou_9 float64 28307 1704 5.68 908
std_og_mou_9 float64 28307 1704 5.68 15900
std_og_t2f_mou_9 float64 28307 1704 5.68 1595
ic_others_9 float64 28307 1704 5.68 1284
std_ic_t2t_mou_9 float64 28307 1704 5.68 4280
std_ic_t2o_mou_9 category 28307 1704 5.68 1
loc_og_t2f_mou_9 float64 28307 1704 5.68 3111
std_og_t2m_mou_9 float64 28307 1704 5.68 12445
loc_ic_mou_9 float64 28307 1704 5.68 18018
std_ic_t2m_mou_9 float64 28307 1704 5.68 6168
onnet_mou_9 float64 28307 1704 5.68 16674
date_of_last_rech_9 datetime64[ns] 29145 866 2.89 30
last_date_of_month_9 object 29651 360 1.20 1
total_rech_num_9 int64 30011 0 0.00 96
total_ic_mou_9 float64 30011 0 0.00 19437
monthly_3g_9 category 30011 0 0.00 11
monthly_2g_9 category 30011 0 0.00 5
sachet_2g_9 category 30011 0 0.00 29
sachet_3g_9 category 30011 0 0.00 27
vbc_3g_9 float64 30011 0 0.00 2171
vol_3g_mb_9 float64 30011 0 0.00 7016
total_rech_amt_9 int64 30011 0 0.00 2248
max_rech_amt_9 int64 30011 0 0.00 186
last_day_rch_amt_9 int64 30011 0 0.00 170
vol_2g_mb_9 float64 30011 0 0.00 6984
arpu_9 float64 30011 0 0.00 27327
total_og_mou_9 float64 30011 0 0.00 22615
In [48]:
# Columns with meaningful missing values in the 9th month
ninth_month_meaningful_missing_condition = ninth_month_metadata['Null_Percentage'] == 5.68
ninth_month_meaningful_missing_cols = ninth_month_metadata[ninth_month_meaningful_missing_condition].index.values
ninth_month_meaningful_missing_cols
Out[48]:
array(['std_og_t2c_mou_9', 'spl_ic_mou_9', 'loc_og_t2m_mou_9',
       'og_others_9', 'loc_og_t2c_mou_9', 'isd_ic_mou_9',
       'loc_og_t2t_mou_9', 'spl_og_mou_9', 'loc_ic_t2t_mou_9',
       'loc_og_mou_9', 'roam_og_mou_9', 'std_ic_mou_9',
       'loc_ic_t2m_mou_9', 'roam_ic_mou_9', 'std_og_t2t_mou_9',
       'offnet_mou_9', 'loc_ic_t2f_mou_9', 'std_ic_t2f_mou_9',
       'isd_og_mou_9', 'std_og_mou_9', 'std_og_t2f_mou_9', 'ic_others_9',
       'std_ic_t2t_mou_9', 'std_ic_t2o_mou_9', 'loc_og_t2f_mou_9',
       'std_og_t2m_mou_9', 'loc_ic_mou_9', 'std_ic_t2m_mou_9',
       'onnet_mou_9'], dtype=object)
In [49]:
# Looking at all 9th month columns where rows of *_mou are null
condition = data[ninth_month_meaningful_missing_cols].isnull()

# Rows that are null in every one of the above columns
missing_rows = pd.Series([True]*data.shape[0], index = data.index)
for column in ninth_month_meaningful_missing_cols : 
    missing_rows = missing_rows & data[column].isnull()

print('Total outgoing mou for each customer with missing *_mou data is ', data.loc[missing_rows,'total_og_mou_9'].unique()[0])
print('Total incoming mou for each customer with missing *_mou data is ', data.loc[missing_rows,'total_ic_mou_9'].unique()[0])
Total outgoing mou for each customer with missing *_mou data is  0.0
Total incoming mou for each customer with missing *_mou data is  0.0
In [50]:
# Imputation
data[ninth_month_meaningful_missing_cols] = data[ninth_month_meaningful_missing_cols].fillna(0)

metadata = metadata_matrix(data)

# Remaining Missing Values
metadata.iloc[metadata.index.isin(ninth_month_columns)]
Out[50]:
Datatype Non_Null_Count Null_Count Null_Percentage Unique_Values_Count
date_of_last_rech_9 datetime64[ns] 29145 866 2.89 30
last_date_of_month_9 object 29651 360 1.20 1
spl_ic_mou_9 float64 30011 0 0.00 287
total_ic_mou_9 float64 30011 0 0.00 19437
std_ic_mou_9 float64 30011 0 0.00 7745
isd_ic_mou_9 float64 30011 0 0.00 3329
ic_others_9 float64 30011 0 0.00 1284
loc_ic_mou_9 float64 30011 0 0.00 18018
std_ic_t2t_mou_9 float64 30011 0 0.00 4280
std_ic_t2m_mou_9 float64 30011 0 0.00 6168
std_ic_t2f_mou_9 float64 30011 0 0.00 1971
std_ic_t2o_mou_9 category 30011 0 0.00 1
total_rech_amt_9 int64 30011 0 0.00 2248
total_rech_num_9 int64 30011 0 0.00 96
monthly_3g_9 category 30011 0 0.00 11
monthly_2g_9 category 30011 0 0.00 5
sachet_2g_9 category 30011 0 0.00 29
sachet_3g_9 category 30011 0 0.00 27
vbc_3g_9 float64 30011 0 0.00 2171
max_rech_amt_9 int64 30011 0 0.00 186
vol_3g_mb_9 float64 30011 0 0.00 7016
last_day_rch_amt_9 int64 30011 0 0.00 170
vol_2g_mb_9 float64 30011 0 0.00 6984
loc_ic_t2f_mou_9 float64 30011 0 0.00 4611
loc_og_t2t_mou_9 float64 30011 0 0.00 10360
loc_og_t2m_mou_9 float64 30011 0 0.00 15585
loc_og_t2f_mou_9 float64 30011 0 0.00 3111
loc_og_t2c_mou_9 float64 30011 0 0.00 1576
loc_og_mou_9 float64 30011 0 0.00 18207
roam_og_mou_9 float64 30011 0 0.00 4004
onnet_mou_9 float64 30011 0 0.00 16674
arpu_9 float64 30011 0 0.00 27327
offnet_mou_9 float64 30011 0 0.00 20452
roam_ic_mou_9 float64 30011 0 0.00 3370
std_og_t2t_mou_9 float64 30011 0 0.00 11141
spl_og_mou_9 float64 30011 0 0.00 2966
og_others_9 float64 30011 0 0.00 132
total_og_mou_9 float64 30011 0 0.00 22615
loc_ic_t2t_mou_9 float64 30011 0 0.00 9407
loc_ic_t2m_mou_9 float64 30011 0 0.00 15194
isd_og_mou_9 float64 30011 0 0.00 908
std_og_t2m_mou_9 float64 30011 0 0.00 12445
std_og_t2f_mou_9 float64 30011 0 0.00 1595
std_og_t2c_mou_9 category 30011 0 0.00 1
std_og_mou_9 float64 30011 0 0.00 15900
In [51]:
# Looking at 'recharge' related 9th month columns for customers with missing 'date_of_last_rech_9' 
condition = data['date_of_last_rech_9'].isnull()
data[condition].filter(regex='.*rech.*9$', axis=1).head()
Out[51]:
total_rech_num_9 total_rech_amt_9 max_rech_amt_9 date_of_last_rech_9
mobile_number
7000340381 0 0 0 NaT
7000854899 0 0 0 NaT
7000369789 0 0 0 NaT
7001967063 0 0 0 NaT
7000066601 0 0 0 NaT
In [52]:
data[condition].filter(regex='.*rech.*9$', axis=1).nunique()
Out[52]:
total_rech_num_9       1
total_rech_amt_9       1
max_rech_amt_9         1
date_of_last_rech_9    0
dtype: int64
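The checks for months 7, 8 and 9 repeat the same steps: find rows where the usage columns are all missing, confirm the totals are zero, then fill with 0. One way to avoid the repetition is a small helper, sketched below. The function name `impute_zero_usage`, the toy frame and its values are hypothetical and only illustrate the pattern.

```python
import numpy as np
import pandas as pd


def impute_zero_usage(df, cols, month):
    """Hypothetical helper: verify zero totals, then fill missing `cols` with 0."""
    totals = [f"total_og_mou_{month}", f"total_ic_mou_{month}"]
    # Rows where every listed usage column is missing
    all_missing = df[cols].isnull().all(axis=1)
    # Refuse to impute if any such row reports non-zero total usage
    if not df.loc[all_missing, totals].eq(0).all().all():
        raise ValueError(f"month {month}: non-zero totals where usage is missing")
    out = df.copy()
    out[cols] = out[cols].fillna(0)
    return out


# Toy data with hypothetical values for months 8 and 9
toy = pd.DataFrame(
    {
        "onnet_mou_8": [12.0, np.nan],
        "offnet_mou_8": [5.5, np.nan],
        "total_og_mou_8": [17.5, 0.0],
        "total_ic_mou_8": [9.0, 0.0],
        "onnet_mou_9": [np.nan, 4.0],
        "offnet_mou_9": [np.nan, 1.0],
        "total_og_mou_9": [0.0, 5.0],
        "total_ic_mou_9": [0.0, 2.0],
    }
)

for month in (8, 9):
    usage_cols = [f"onnet_mou_{month}", f"offnet_mou_{month}"]
    toy = impute_zero_usage(toy, usage_cols, month)

print(toy.isnull().sum().sum())
```

Raising an error instead of printing makes the zero-total assumption explicit before any value is overwritten.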
In [53]:
# Imputing "last_date_of_month_*"
In [54]:
print('Missing Value Percentage in last_date_of_month columns : \n', 100*data.filter(regex='last_date_of_month_.*', axis=1).isnull().sum() / data.shape[0], '\n')
print('The unique values in last_date_of_month_6 : ' , data['last_date_of_month_6'].unique())
print('The unique values in last_date_of_month_7 : ' , data['last_date_of_month_7'].unique())
print('The unique values in last_date_of_month_8 : ' , data['last_date_of_month_8'].unique())
print('The unique values in last_date_of_month_9 : ' , data['last_date_of_month_9'].unique())
Missing Value Percentage in last_date_of_month columns : 
 last_date_of_month_6    0.000000
last_date_of_month_7    0.103295
last_date_of_month_8    0.523142
last_date_of_month_9    1.199560
dtype: float64 

The unique values in last_date_of_month_6 :  ['6/30/2014']
The unique values in last_date_of_month_7 :  ['7/31/2014' nan]
The unique values in last_date_of_month_8 :  ['8/31/2014' nan]
The unique values in last_date_of_month_9 :  ['9/30/2014' nan]
  • The last date of the month is the last calendar date of that month; it is independent of the churn data.
  • Let's impute these missing values using the mode.
In [55]:
# Imputing last_date_of_month_* values
data['last_date_of_month_7'] = data['last_date_of_month_7'].fillna(data['last_date_of_month_7'].mode()[0])
data['last_date_of_month_8'] = data['last_date_of_month_8'].fillna(data['last_date_of_month_8'].mode()[0])
data['last_date_of_month_9'] = data['last_date_of_month_9'].fillna(data['last_date_of_month_9'].mode()[0])
In [56]:
data['last_date_of_month_7'].unique()
Out[56]:
array(['7/31/2014'], dtype=object)
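The same mode imputation can be applied to every `last_date_of_month_*` column in a loop rather than one line per month. This is a minimal sketch on a toy frame; the frame and its row count are assumptions, while the date strings match the unique values printed above.

```python
import numpy as np
import pandas as pd

# Toy frame: each column holds a single date string plus some missing entries
toy = pd.DataFrame(
    {
        "last_date_of_month_7": ["7/31/2014", np.nan, "7/31/2014"],
        "last_date_of_month_8": ["8/31/2014", "8/31/2014", np.nan],
    }
)

for col in toy.filter(regex="^last_date_of_month_").columns:
    # mode() ignores missing values, so [0] is the single observed date
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy.isnull().sum().sum())
```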
In [57]:
metadata = metadata_matrix(data)
metadata
Out[57]:
Datatype Non_Null_Count Null_Count Null_Percentage Unique_Values_Count
date_of_last_rech_9 datetime64[ns] 29145 866 2.89 30
date_of_last_rech_8 datetime64[ns] 29417 594 1.98 31
loc_og_t2o_mou category 29897 114 0.38 1
date_of_last_rech_7 datetime64[ns] 29897 114 0.38 31
std_og_t2o_mou category 29897 114 0.38 1
loc_ic_t2o_mou category 29897 114 0.38 1
date_of_last_rech_6 datetime64[ns] 29949 62 0.21 30
isd_ic_mou_6 float64 30011 0 0.00 3429
total_ic_mou_6 float64 30011 0 0.00 20602
total_ic_mou_7 float64 30011 0 0.00 20711
total_ic_mou_8 float64 30011 0 0.00 20096
total_ic_mou_9 float64 30011 0 0.00 19437
spl_ic_mou_6 float64 30011 0 0.00 78
spl_ic_mou_7 float64 30011 0 0.00 93
spl_ic_mou_8 float64 30011 0 0.00 85
spl_ic_mou_9 float64 30011 0 0.00 287
total_rech_num_6 int64 30011 0 0.00 102
ic_others_9 float64 30011 0 0.00 1284
std_ic_mou_8 float64 30011 0 0.00 8033
isd_ic_mou_7 float64 30011 0 0.00 3639
isd_ic_mou_8 float64 30011 0 0.00 3493
isd_ic_mou_9 float64 30011 0 0.00 3329
ic_others_6 float64 30011 0 0.00 1227
ic_others_7 float64 30011 0 0.00 1371
ic_others_8 float64 30011 0 0.00 1259
std_ic_mou_9 float64 30011 0 0.00 7745
std_ic_mou_7 float64 30011 0 0.00 8543
total_rech_num_8 int64 30011 0 0.00 96
std_ic_t2m_mou_7 float64 30011 0 0.00 6747
loc_ic_mou_6 float64 30011 0 0.00 19133
loc_ic_mou_7 float64 30011 0 0.00 19030
loc_ic_mou_8 float64 30011 0 0.00 18573
loc_ic_mou_9 float64 30011 0 0.00 18018
std_ic_t2t_mou_6 float64 30011 0 0.00 4608
std_ic_t2t_mou_7 float64 30011 0 0.00 4706
std_ic_t2t_mou_8 float64 30011 0 0.00 4486
std_ic_t2t_mou_9 float64 30011 0 0.00 4280
std_ic_t2m_mou_6 float64 30011 0 0.00 6680
std_ic_t2m_mou_8 float64 30011 0 0.00 6420
std_ic_mou_6 float64 30011 0 0.00 8391
std_ic_t2m_mou_9 float64 30011 0 0.00 6168
std_ic_t2f_mou_6 float64 30011 0 0.00 2033
std_ic_t2f_mou_7 float64 30011 0 0.00 2075
std_ic_t2f_mou_8 float64 30011 0 0.00 1941
std_ic_t2f_mou_9 float64 30011 0 0.00 1971
std_ic_t2o_mou_6 category 30011 0 0.00 1
std_ic_t2o_mou_7 category 30011 0 0.00 1
std_ic_t2o_mou_8 category 30011 0 0.00 1
std_ic_t2o_mou_9 category 30011 0 0.00 1
total_rech_num_7 int64 30011 0 0.00 101
circle_id category 30011 0 0.00 1
total_rech_num_9 int64 30011 0 0.00 96
monthly_3g_9 category 30011 0 0.00 11
monthly_2g_9 category 30011 0 0.00 5
sachet_2g_6 category 30011 0 0.00 30
sachet_2g_7 category 30011 0 0.00 34
sachet_2g_8 category 30011 0 0.00 34
sachet_2g_9 category 30011 0 0.00 29
monthly_3g_6 category 30011 0 0.00 12
monthly_3g_7 category 30011 0 0.00 15
monthly_3g_8 category 30011 0 0.00 12
sachet_3g_6 category 30011 0 0.00 25
monthly_2g_7 category 30011 0 0.00 6
sachet_3g_7 category 30011 0 0.00 27
sachet_3g_8 category 30011 0 0.00 29
sachet_3g_9 category 30011 0 0.00 27
aon int64 30011 0 0.00 3321
vbc_3g_8 float64 30011 0 0.00 7291
vbc_3g_7 float64 30011 0 0.00 7318
vbc_3g_6 float64 30011 0 0.00 6864
vbc_3g_9 float64 30011 0 0.00 2171
monthly_2g_8 category 30011 0 0.00 6
monthly_2g_6 category 30011 0 0.00 5
total_rech_amt_6 int64 30011 0 0.00 2241
last_day_rch_amt_7 int64 30011 0 0.00 149
loc_ic_t2f_mou_9 float64 30011 0 0.00 4611
total_rech_amt_8 int64 30011 0 0.00 2299
total_rech_amt_9 int64 30011 0 0.00 2248
max_rech_amt_6 int64 30011 0 0.00 170
max_rech_amt_7 int64 30011 0 0.00 151
max_rech_amt_8 int64 30011 0 0.00 182
max_rech_amt_9 int64 30011 0 0.00 186
last_day_rch_amt_6 int64 30011 0 0.00 158
last_day_rch_amt_8 int64 30011 0 0.00 179
vol_3g_mb_9 float64 30011 0 0.00 7016
last_day_rch_amt_9 int64 30011 0 0.00 170
vol_2g_mb_6 float64 30011 0 0.00 7809
vol_2g_mb_7 float64 30011 0 0.00 7813
vol_2g_mb_8 float64 30011 0 0.00 7310
vol_2g_mb_9 float64 30011 0 0.00 6984
vol_3g_mb_6 float64 30011 0 0.00 7043
vol_3g_mb_7 float64 30011 0 0.00 7440
vol_3g_mb_8 float64 30011 0 0.00 7151
total_rech_amt_7 int64 30011 0 0.00 2265
loc_ic_t2f_mou_7 float64 30011 0 0.00 4897
loc_ic_t2f_mou_8 float64 30011 0 0.00 4705
roam_og_mou_7 float64 30011 0 0.00 4431
roam_og_mou_9 float64 30011 0 0.00 4004
loc_og_t2t_mou_6 float64 30011 0 0.00 11151
loc_og_t2t_mou_7 float64 30011 0 0.00 11154
loc_og_t2t_mou_8 float64 30011 0 0.00 10772
loc_og_t2t_mou_9 float64 30011 0 0.00 10360
loc_og_t2m_mou_6 float64 30011 0 0.00 16747
loc_og_t2m_mou_7 float64 30011 0 0.00 16872
loc_og_t2m_mou_8 float64 30011 0 0.00 16165
loc_og_t2m_mou_9 float64 30011 0 0.00 15585
loc_og_t2f_mou_6 float64 30011 0 0.00 3252
loc_og_t2f_mou_7 float64 30011 0 0.00 3267
loc_og_t2f_mou_8 float64 30011 0 0.00 3124
loc_og_t2f_mou_9 float64 30011 0 0.00 3111
loc_og_t2c_mou_6 float64 30011 0 0.00 1658
loc_og_t2c_mou_7 float64 30011 0 0.00 1750
loc_og_t2c_mou_8 float64 30011 0 0.00 1730
loc_og_t2c_mou_9 float64 30011 0 0.00 1576
loc_og_mou_6 float64 30011 0 0.00 19691
loc_og_mou_7 float64 30011 0 0.00 19880
roam_og_mou_8 float64 30011 0 0.00 4382
roam_og_mou_6 float64 30011 0 0.00 5174
loc_og_mou_9 float64 30011 0 0.00 18207
roam_ic_mou_9 float64 30011 0 0.00 3370
last_date_of_month_6 object 30011 0 0.00 1
last_date_of_month_7 object 30011 0 0.00 1
last_date_of_month_8 object 30011 0 0.00 1
last_date_of_month_9 object 30011 0 0.00 1
arpu_6 float64 30011 0 0.00 29261
arpu_7 float64 30011 0 0.00 29260
arpu_8 float64 30011 0 0.00 28405
arpu_9 float64 30011 0 0.00 27327
onnet_mou_6 float64 30011 0 0.00 18813
onnet_mou_7 float64 30011 0 0.00 18938
onnet_mou_8 float64 30011 0 0.00 17604
onnet_mou_9 float64 30011 0 0.00 16674
offnet_mou_6 float64 30011 0 0.00 22454
offnet_mou_7 float64 30011 0 0.00 22650
offnet_mou_8 float64 30011 0 0.00 21513
offnet_mou_9 float64 30011 0 0.00 20452
roam_ic_mou_6 float64 30011 0 0.00 4338
roam_ic_mou_7 float64 30011 0 0.00 3649
roam_ic_mou_8 float64 30011 0 0.00 3655
loc_og_mou_8 float64 30011 0 0.00 18885
std_og_t2t_mou_6 float64 30011 0 0.00 12777
loc_ic_t2f_mou_6 float64 30011 0 0.00 4817
isd_og_mou_9 float64 30011 0 0.00 908
spl_og_mou_7 float64 30011 0 0.00 3399
spl_og_mou_8 float64 30011 0 0.00 3238
spl_og_mou_9 float64 30011 0 0.00 2966
og_others_6 float64 30011 0 0.00 862
og_others_7 float64 30011 0 0.00 123
og_others_8 float64 30011 0 0.00 133
og_others_9 float64 30011 0 0.00 132
total_og_mou_6 float64 30011 0 0.00 24607
total_og_mou_7 float64 30011 0 0.00 24913
total_og_mou_8 float64 30011 0 0.00 23644
total_og_mou_9 float64 30011 0 0.00 22615
loc_ic_t2t_mou_6 float64 30011 0 0.00 9872
loc_ic_t2t_mou_7 float64 30011 0 0.00 9961
loc_ic_t2t_mou_8 float64 30011 0 0.00 9671
loc_ic_t2t_mou_9 float64 30011 0 0.00 9407
loc_ic_t2m_mou_6 float64 30011 0 0.00 16015
loc_ic_t2m_mou_7 float64 30011 0 0.00 16068
loc_ic_t2m_mou_8 float64 30011 0 0.00 15598
loc_ic_t2m_mou_9 float64 30011 0 0.00 15194
spl_og_mou_6 float64 30011 0 0.00 3053
isd_og_mou_8 float64 30011 0 0.00 940
std_og_t2t_mou_7 float64 30011 0 0.00 12983
isd_og_mou_7 float64 30011 0 0.00 1125
std_og_t2t_mou_8 float64 30011 0 0.00 11781
std_og_t2t_mou_9 float64 30011 0 0.00 11141
std_og_t2m_mou_6 float64 30011 0 0.00 14518
std_og_t2m_mou_7 float64 30011 0 0.00 14589
std_og_t2m_mou_8 float64 30011 0 0.00 13326
std_og_t2m_mou_9 float64 30011 0 0.00 12445
std_og_t2f_mou_6 float64 30011 0 0.00 1773
std_og_t2f_mou_7 float64 30011 0 0.00 1714
std_og_t2f_mou_8 float64 30011 0 0.00 1627
std_og_t2f_mou_9 float64 30011 0 0.00 1595
std_og_t2c_mou_6 category 30011 0 0.00 1
std_og_t2c_mou_7 category 30011 0 0.00 1
std_og_t2c_mou_8 category 30011 0 0.00 1
std_og_t2c_mou_9 category 30011 0 0.00 1
std_og_mou_6 float64 30011 0 0.00 18325
std_og_mou_7 float64 30011 0 0.00 18445
std_og_mou_8 float64 30011 0 0.00 16864
std_og_mou_9 float64 30011 0 0.00 15900
isd_og_mou_6 float64 30011 0 0.00 1113
Average_rech_amt_6n7 float64 30011 0 0.00 3025
In [58]:
print(data[data['date_of_last_rech_6'].isnull()][['date_of_last_rech_6','total_rech_amt_6','total_rech_num_6']].nunique())
print(data[data['date_of_last_rech_7'].isnull()][['date_of_last_rech_7','total_rech_amt_7','total_rech_num_7']].nunique())
print(data[data['date_of_last_rech_8'].isnull()][['date_of_last_rech_8','total_rech_amt_8','total_rech_num_8']].nunique())
print(data[data['date_of_last_rech_9'].isnull()][['date_of_last_rech_9','total_rech_amt_9','total_rech_num_9']].nunique())
date_of_last_rech_6    0
total_rech_amt_6       1
total_rech_num_6       1
dtype: int64
date_of_last_rech_7    0
total_rech_amt_7       1
total_rech_num_7       1
dtype: int64
date_of_last_rech_8    0
total_rech_amt_8       1
total_rech_num_8       1
dtype: int64
date_of_last_rech_9    0
total_rech_amt_9       1
total_rech_num_9       1
dtype: int64
In [59]:
print("\n",data[data['date_of_last_rech_6'].isnull()][['total_rech_amt_6','total_rech_num_6']].head())
print("\n",data[data['date_of_last_rech_7'].isnull()][['total_rech_amt_7','total_rech_num_7']].head())
print("\n",data[data['date_of_last_rech_8'].isnull()][['total_rech_amt_8','total_rech_num_8']].head())
print("\n",data[data['date_of_last_rech_9'].isnull()][['total_rech_amt_9','total_rech_num_9']].head())
                total_rech_amt_6  total_rech_num_6
mobile_number                                    
7001588448                    0                 0
7001223277                    0                 0
7000721536                    0                 0
7001490351                    0                 0
7000665415                    0                 0

                total_rech_amt_7  total_rech_num_7
mobile_number                                    
7000369789                    0                 0
7001967148                    0                 0
7000066601                    0                 0
7001189556                    0                 0
7002024450                    0                 0

                total_rech_amt_8  total_rech_num_8
mobile_number                                    
7000340381                    0                 0
7000608224                    0                 0
7000369789                    0                 0
7000248548                    0                 0
7001967063                    0                 0

                total_rech_amt_9  total_rech_num_9
mobile_number                                    
7000340381                    0                 0
7000854899                    0                 0
7000369789                    0                 0
7001967063                    0                 0
7000066601                    0                 0
  • The 'date_of_last_rech' columns for June, July, August and September have no value where the user made no recharges during that month.

Dropping columns with one unique value.

In [60]:
metadata=metadata_matrix(data)
singular_value_cols=metadata[metadata['Unique_Values_Count']==1].index.values
#data.loc[metadata_matrix(data)['Unique_Values_Count']==1].index
In [61]:
#Dropping singular value columns.
data.drop(columns=singular_value_cols,inplace=True)
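'metadata_matrix' is a helper defined earlier in the notebook. A roughly equivalent way to find single-valued columns with plain pandas is sketched below on a made-up frame. The column 'constant_col' is hypothetical, and counting NaN as a value (dropna=False) is an assumption, since the helper's handling of nulls is not shown here.

```python
import pandas as pd

# Hypothetical toy frame: 'constant_col' holds a single value in every row.
toy = pd.DataFrame({
    'constant_col': [109, 109, 109],
    'arpu_6': [197.4, 34.0, 167.7],
    'aon': [968, 1006, 1103],
})

# nunique() counts distinct values per column; dropna=False treats NaN as a value.
singular_value_cols = toy.columns[toy.nunique(dropna=False) == 1]
reduced = toy.drop(columns=singular_value_cols)

print(list(singular_value_cols))
print(list(reduced.columns))
```

A column with one unique value has zero variance, so it cannot help any model separate churners from non-churners.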
In [62]:
# Dropping date columns 
# since they are not usage related columns and can't be used for modelling 
date_columns = data.filter(regex='^date.*').columns
data.drop(columns=date_columns, inplace=True)
metadata_matrix(data)
Out[62]:
Datatype Non_Null_Count Null_Count Null_Percentage Unique_Values_Count
arpu_6 float64 30011 0 0.0 29261
total_ic_mou_6 float64 30011 0 0.0 20602
total_ic_mou_8 float64 30011 0 0.0 20096
total_ic_mou_9 float64 30011 0 0.0 19437
spl_ic_mou_6 float64 30011 0 0.0 78
spl_ic_mou_7 float64 30011 0 0.0 93
spl_ic_mou_8 float64 30011 0 0.0 85
spl_ic_mou_9 float64 30011 0 0.0 287
isd_ic_mou_6 float64 30011 0 0.0 3429
isd_ic_mou_7 float64 30011 0 0.0 3639
isd_ic_mou_8 float64 30011 0 0.0 3493
isd_ic_mou_9 float64 30011 0 0.0 3329
ic_others_6 float64 30011 0 0.0 1227
ic_others_7 float64 30011 0 0.0 1371
ic_others_8 float64 30011 0 0.0 1259
ic_others_9 float64 30011 0 0.0 1284
total_rech_num_6 int64 30011 0 0.0 102
total_rech_num_7 int64 30011 0 0.0 101
total_rech_num_8 int64 30011 0 0.0 96
total_ic_mou_7 float64 30011 0 0.0 20711
std_ic_mou_9 float64 30011 0 0.0 7745
total_rech_amt_6 int64 30011 0 0.0 2241
std_ic_mou_8 float64 30011 0 0.0 8033
loc_ic_mou_7 float64 30011 0 0.0 19030
loc_ic_mou_8 float64 30011 0 0.0 18573
loc_ic_mou_9 float64 30011 0 0.0 18018
std_ic_t2t_mou_6 float64 30011 0 0.0 4608
std_ic_t2t_mou_7 float64 30011 0 0.0 4706
std_ic_t2t_mou_8 float64 30011 0 0.0 4486
std_ic_t2t_mou_9 float64 30011 0 0.0 4280
std_ic_t2m_mou_6 float64 30011 0 0.0 6680
std_ic_t2m_mou_7 float64 30011 0 0.0 6747
std_ic_t2m_mou_8 float64 30011 0 0.0 6420
std_ic_t2m_mou_9 float64 30011 0 0.0 6168
std_ic_t2f_mou_6 float64 30011 0 0.0 2033
std_ic_t2f_mou_7 float64 30011 0 0.0 2075
std_ic_t2f_mou_8 float64 30011 0 0.0 1941
std_ic_t2f_mou_9 float64 30011 0 0.0 1971
std_ic_mou_6 float64 30011 0 0.0 8391
std_ic_mou_7 float64 30011 0 0.0 8543
total_rech_num_9 int64 30011 0 0.0 96
total_rech_amt_7 int64 30011 0 0.0 2265
arpu_7 float64 30011 0 0.0 29260
monthly_2g_8 category 30011 0 0.0 6
sachet_2g_6 category 30011 0 0.0 30
sachet_2g_7 category 30011 0 0.0 34
sachet_2g_8 category 30011 0 0.0 34
sachet_2g_9 category 30011 0 0.0 29
monthly_3g_6 category 30011 0 0.0 12
monthly_3g_7 category 30011 0 0.0 15
monthly_3g_8 category 30011 0 0.0 12
monthly_3g_9 category 30011 0 0.0 11
sachet_3g_6 category 30011 0 0.0 25
sachet_3g_7 category 30011 0 0.0 27
sachet_3g_8 category 30011 0 0.0 29
sachet_3g_9 category 30011 0 0.0 27
aon int64 30011 0 0.0 3321
vbc_3g_8 float64 30011 0 0.0 7291
vbc_3g_7 float64 30011 0 0.0 7318
vbc_3g_6 float64 30011 0 0.0 6864
vbc_3g_9 float64 30011 0 0.0 2171
monthly_2g_9 category 30011 0 0.0 5
monthly_2g_7 category 30011 0 0.0 6
total_rech_amt_8 int64 30011 0 0.0 2299
monthly_2g_6 category 30011 0 0.0 5
total_rech_amt_9 int64 30011 0 0.0 2248
max_rech_amt_6 int64 30011 0 0.0 170
max_rech_amt_7 int64 30011 0 0.0 151
max_rech_amt_8 int64 30011 0 0.0 182
max_rech_amt_9 int64 30011 0 0.0 186
last_day_rch_amt_6 int64 30011 0 0.0 158
last_day_rch_amt_7 int64 30011 0 0.0 149
last_day_rch_amt_8 int64 30011 0 0.0 179
last_day_rch_amt_9 int64 30011 0 0.0 170
vol_2g_mb_6 float64 30011 0 0.0 7809
vol_2g_mb_7 float64 30011 0 0.0 7813
vol_2g_mb_8 float64 30011 0 0.0 7310
vol_2g_mb_9 float64 30011 0 0.0 6984
vol_3g_mb_6 float64 30011 0 0.0 7043
vol_3g_mb_7 float64 30011 0 0.0 7440
vol_3g_mb_8 float64 30011 0 0.0 7151
vol_3g_mb_9 float64 30011 0 0.0 7016
loc_ic_mou_6 float64 30011 0 0.0 19133
loc_ic_t2f_mou_9 float64 30011 0 0.0 4611
loc_ic_t2f_mou_8 float64 30011 0 0.0 4705
loc_og_t2t_mou_7 float64 30011 0 0.0 11154
loc_og_t2t_mou_9 float64 30011 0 0.0 10360
loc_og_t2m_mou_6 float64 30011 0 0.0 16747
loc_og_t2m_mou_7 float64 30011 0 0.0 16872
loc_og_t2m_mou_8 float64 30011 0 0.0 16165
loc_og_t2m_mou_9 float64 30011 0 0.0 15585
loc_og_t2f_mou_6 float64 30011 0 0.0 3252
loc_og_t2f_mou_7 float64 30011 0 0.0 3267
loc_og_t2f_mou_8 float64 30011 0 0.0 3124
loc_og_t2f_mou_9 float64 30011 0 0.0 3111
loc_og_t2c_mou_6 float64 30011 0 0.0 1658
loc_og_t2c_mou_7 float64 30011 0 0.0 1750
loc_og_t2c_mou_8 float64 30011 0 0.0 1730
loc_og_t2c_mou_9 float64 30011 0 0.0 1576
loc_og_mou_6 float64 30011 0 0.0 19691
loc_og_mou_7 float64 30011 0 0.0 19880
loc_og_mou_8 float64 30011 0 0.0 18885
loc_og_mou_9 float64 30011 0 0.0 18207
loc_og_t2t_mou_8 float64 30011 0 0.0 10772
loc_og_t2t_mou_6 float64 30011 0 0.0 11151
loc_ic_t2f_mou_7 float64 30011 0 0.0 4897
roam_og_mou_9 float64 30011 0 0.0 4004
arpu_8 float64 30011 0 0.0 28405
arpu_9 float64 30011 0 0.0 27327
onnet_mou_6 float64 30011 0 0.0 18813
onnet_mou_7 float64 30011 0 0.0 18938
onnet_mou_8 float64 30011 0 0.0 17604
onnet_mou_9 float64 30011 0 0.0 16674
offnet_mou_6 float64 30011 0 0.0 22454
offnet_mou_7 float64 30011 0 0.0 22650
offnet_mou_8 float64 30011 0 0.0 21513
offnet_mou_9 float64 30011 0 0.0 20452
roam_ic_mou_6 float64 30011 0 0.0 4338
roam_ic_mou_7 float64 30011 0 0.0 3649
roam_ic_mou_8 float64 30011 0 0.0 3655
roam_ic_mou_9 float64 30011 0 0.0 3370
roam_og_mou_6 float64 30011 0 0.0 5174
roam_og_mou_7 float64 30011 0 0.0 4431
roam_og_mou_8 float64 30011 0 0.0 4382
std_og_t2t_mou_6 float64 30011 0 0.0 12777
std_og_t2t_mou_7 float64 30011 0 0.0 12983
std_og_t2t_mou_8 float64 30011 0 0.0 11781
std_og_t2t_mou_9 float64 30011 0 0.0 11141
og_others_6 float64 30011 0 0.0 862
og_others_7 float64 30011 0 0.0 123
og_others_8 float64 30011 0 0.0 133
og_others_9 float64 30011 0 0.0 132
total_og_mou_6 float64 30011 0 0.0 24607
total_og_mou_7 float64 30011 0 0.0 24913
total_og_mou_8 float64 30011 0 0.0 23644
total_og_mou_9 float64 30011 0 0.0 22615
loc_ic_t2t_mou_6 float64 30011 0 0.0 9872
loc_ic_t2t_mou_7 float64 30011 0 0.0 9961
loc_ic_t2t_mou_8 float64 30011 0 0.0 9671
loc_ic_t2t_mou_9 float64 30011 0 0.0 9407
loc_ic_t2m_mou_6 float64 30011 0 0.0 16015
loc_ic_t2m_mou_7 float64 30011 0 0.0 16068
loc_ic_t2m_mou_8 float64 30011 0 0.0 15598
loc_ic_t2m_mou_9 float64 30011 0 0.0 15194
loc_ic_t2f_mou_6 float64 30011 0 0.0 4817
spl_og_mou_9 float64 30011 0 0.0 2966
spl_og_mou_8 float64 30011 0 0.0 3238
spl_og_mou_7 float64 30011 0 0.0 3399
std_og_t2f_mou_9 float64 30011 0 0.0 1595
std_og_t2m_mou_6 float64 30011 0 0.0 14518
std_og_t2m_mou_7 float64 30011 0 0.0 14589
std_og_t2m_mou_8 float64 30011 0 0.0 13326
std_og_t2m_mou_9 float64 30011 0 0.0 12445
std_og_t2f_mou_6 float64 30011 0 0.0 1773
std_og_t2f_mou_7 float64 30011 0 0.0 1714
std_og_t2f_mou_8 float64 30011 0 0.0 1627
std_og_mou_6 float64 30011 0 0.0 18325
spl_og_mou_6 float64 30011 0 0.0 3053
std_og_mou_7 float64 30011 0 0.0 18445
std_og_mou_8 float64 30011 0 0.0 16864
std_og_mou_9 float64 30011 0 0.0 15900
isd_og_mou_6 float64 30011 0 0.0 1113
isd_og_mou_7 float64 30011 0 0.0 1125
isd_og_mou_8 float64 30011 0 0.0 940
isd_og_mou_9 float64 30011 0 0.0 908
Average_rech_amt_6n7 float64 30011 0 0.0 3025

Tagging Churn (TARGET variable)

In [63]:
data['Churn'] = 0
churned_customers = data.query('total_og_mou_9 == 0 & total_ic_mou_9 == 0 & vol_2g_mb_9 == 0 &  vol_3g_mb_9 == 0').index
data.loc[churned_customers,'Churn']=1
data['Churn'] = data['Churn'].astype('category')
In [64]:
# Churn proportions
data['Churn'].value_counts(normalize=True).to_frame()
Out[64]:
Churn
0 0.913598
1 0.086402
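The tagging rule can also be written as a boolean mask, which makes the "all four usage measures are zero" condition explicit. This is a sketch on made-up rows, not the notebook's own code; the mobile numbers and usage values are invented.

```python
import pandas as pd

# Hypothetical month-9 usage for three customers.
toy = pd.DataFrame({
    'total_og_mou_9': [0.0, 12.5, 0.0],
    'total_ic_mou_9': [0.0, 40.1, 0.0],
    'vol_2g_mb_9':    [0.0, 0.0, 5.2],
    'vol_3g_mb_9':    [0.0, 0.0, 0.0],
}, index=pd.Index([7000000001, 7000000002, 7000000003], name='mobile_number'))

usage_cols_9 = ['total_og_mou_9', 'total_ic_mou_9', 'vol_2g_mb_9', 'vol_3g_mb_9']

# A customer is tagged as churned only if every usage measure in month 9 is zero.
toy['Churn'] = (toy[usage_cols_9] == 0).all(axis=1).astype(int)

print(toy['Churn'].tolist())
```

The third customer made no calls but used some 2G data, so the rule does not tag them as churned. With only about 8.6% churners in the real data, the classes are clearly imbalanced, which is worth keeping in mind when choosing evaluation metrics.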

Dropping Churn Phase Columns

In [65]:
churn_phase_columns = data.filter(regex='9$').columns
data.drop(columns=churn_phase_columns, inplace=True)
print('Retained Columns')
data.columns.to_frame(index=False)
Retained Columns
Out[65]:
0
0 arpu_6
1 arpu_7
2 arpu_8
3 onnet_mou_6
4 onnet_mou_7
5 onnet_mou_8
6 offnet_mou_6
7 offnet_mou_7
8 offnet_mou_8
9 roam_ic_mou_6
10 roam_ic_mou_7
11 roam_ic_mou_8
12 roam_og_mou_6
13 roam_og_mou_7
14 roam_og_mou_8
15 loc_og_t2t_mou_6
16 loc_og_t2t_mou_7
17 loc_og_t2t_mou_8
18 loc_og_t2m_mou_6
19 loc_og_t2m_mou_7
20 loc_og_t2m_mou_8
21 loc_og_t2f_mou_6
22 loc_og_t2f_mou_7
23 loc_og_t2f_mou_8
24 loc_og_t2c_mou_6
25 loc_og_t2c_mou_7
26 loc_og_t2c_mou_8
27 loc_og_mou_6
28 loc_og_mou_7
29 loc_og_mou_8
30 std_og_t2t_mou_6
31 std_og_t2t_mou_7
32 std_og_t2t_mou_8
33 std_og_t2m_mou_6
34 std_og_t2m_mou_7
35 std_og_t2m_mou_8
36 std_og_t2f_mou_6
37 std_og_t2f_mou_7
38 std_og_t2f_mou_8
39 std_og_mou_6
40 std_og_mou_7
41 std_og_mou_8
42 isd_og_mou_6
43 isd_og_mou_7
44 isd_og_mou_8
45 spl_og_mou_6
46 spl_og_mou_7
47 spl_og_mou_8
48 og_others_6
49 og_others_7
50 og_others_8
51 total_og_mou_6
52 total_og_mou_7
53 total_og_mou_8
54 loc_ic_t2t_mou_6
55 loc_ic_t2t_mou_7
56 loc_ic_t2t_mou_8
57 loc_ic_t2m_mou_6
58 loc_ic_t2m_mou_7
59 loc_ic_t2m_mou_8
60 loc_ic_t2f_mou_6
61 loc_ic_t2f_mou_7
62 loc_ic_t2f_mou_8
63 loc_ic_mou_6
64 loc_ic_mou_7
65 loc_ic_mou_8
66 std_ic_t2t_mou_6
67 std_ic_t2t_mou_7
68 std_ic_t2t_mou_8
69 std_ic_t2m_mou_6
70 std_ic_t2m_mou_7
71 std_ic_t2m_mou_8
72 std_ic_t2f_mou_6
73 std_ic_t2f_mou_7
74 std_ic_t2f_mou_8
75 std_ic_mou_6
76 std_ic_mou_7
77 std_ic_mou_8
78 total_ic_mou_6
79 total_ic_mou_7
80 total_ic_mou_8
81 spl_ic_mou_6
82 spl_ic_mou_7
83 spl_ic_mou_8
84 isd_ic_mou_6
85 isd_ic_mou_7
86 isd_ic_mou_8
87 ic_others_6
88 ic_others_7
89 ic_others_8
90 total_rech_num_6
91 total_rech_num_7
92 total_rech_num_8
93 total_rech_amt_6
94 total_rech_amt_7
95 total_rech_amt_8
96 max_rech_amt_6
97 max_rech_amt_7
98 max_rech_amt_8
99 last_day_rch_amt_6
100 last_day_rch_amt_7
101 last_day_rch_amt_8
102 vol_2g_mb_6
103 vol_2g_mb_7
104 vol_2g_mb_8
105 vol_3g_mb_6
106 vol_3g_mb_7
107 vol_3g_mb_8
108 monthly_2g_6
109 monthly_2g_7
110 monthly_2g_8
111 sachet_2g_6
112 sachet_2g_7
113 sachet_2g_8
114 monthly_3g_6
115 monthly_3g_7
116 monthly_3g_8
117 sachet_3g_6
118 sachet_3g_7
119 sachet_3g_8
120 aon
121 vbc_3g_8
122 vbc_3g_7
123 vbc_3g_6
124 Average_rech_amt_6n7
125 Churn
In [66]:
print('Retained number of rows:', data.shape[0])
print('Retained number of columns:', data.shape[1])
Retained number of rows: 30011
Retained number of columns: 126
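The month-9 columns have to go because the target was derived from them; keeping them would leak the label into the features. The pattern '9$' matches any column name that ends in the digit 9. That is fine for the columns listed above, but a slightly stricter pattern is sketched here on made-up column names ('hypothetical_col_19' is invented for illustration).

```python
import pandas as pd

# Empty frame with illustrative column names only.
toy = pd.DataFrame(columns=['arpu_8', 'arpu_9', 'vbc_3g_9', 'aon', 'hypothetical_col_19'])

# '9$' matches every name ending in 9; '_9$' requires the month suffix '_9'.
loose = toy.filter(regex='9$').columns.tolist()
strict = toy.filter(regex='_9$').columns.tolist()

print(loose)
print(strict)
```

The stricter pattern would only matter if a non-month column happened to end in 9.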

Exploratory Data Analysis

Summary Statistics

In [67]:
data.describe()
Out[67]:
arpu_6 arpu_7 arpu_8 onnet_mou_6 onnet_mou_7 onnet_mou_8 offnet_mou_6 offnet_mou_7 offnet_mou_8 roam_ic_mou_6 roam_ic_mou_7 roam_ic_mou_8 roam_og_mou_6 roam_og_mou_7 roam_og_mou_8 loc_og_t2t_mou_6 loc_og_t2t_mou_7 loc_og_t2t_mou_8 loc_og_t2m_mou_6 loc_og_t2m_mou_7 loc_og_t2m_mou_8 loc_og_t2f_mou_6 loc_og_t2f_mou_7 loc_og_t2f_mou_8 loc_og_t2c_mou_6 loc_og_t2c_mou_7 loc_og_t2c_mou_8 loc_og_mou_6 loc_og_mou_7 loc_og_mou_8 std_og_t2t_mou_6 std_og_t2t_mou_7 std_og_t2t_mou_8 std_og_t2m_mou_6 std_og_t2m_mou_7 std_og_t2m_mou_8 std_og_t2f_mou_6 std_og_t2f_mou_7 std_og_t2f_mou_8 std_og_mou_6 std_og_mou_7 std_og_mou_8 isd_og_mou_6 isd_og_mou_7 isd_og_mou_8 spl_og_mou_6 spl_og_mou_7 spl_og_mou_8 og_others_6 og_others_7 og_others_8 total_og_mou_6 total_og_mou_7 total_og_mou_8 loc_ic_t2t_mou_6 loc_ic_t2t_mou_7 loc_ic_t2t_mou_8 loc_ic_t2m_mou_6 loc_ic_t2m_mou_7 loc_ic_t2m_mou_8 loc_ic_t2f_mou_6 loc_ic_t2f_mou_7 loc_ic_t2f_mou_8 loc_ic_mou_6 loc_ic_mou_7 loc_ic_mou_8 std_ic_t2t_mou_6 std_ic_t2t_mou_7 std_ic_t2t_mou_8 std_ic_t2m_mou_6 std_ic_t2m_mou_7 std_ic_t2m_mou_8 std_ic_t2f_mou_6 std_ic_t2f_mou_7 std_ic_t2f_mou_8 std_ic_mou_6 std_ic_mou_7 std_ic_mou_8 total_ic_mou_6 total_ic_mou_7 total_ic_mou_8 spl_ic_mou_6 spl_ic_mou_7 spl_ic_mou_8 isd_ic_mou_6 isd_ic_mou_7 isd_ic_mou_8 ic_others_6 ic_others_7 ic_others_8 total_rech_num_6 total_rech_num_7 total_rech_num_8 total_rech_amt_6 total_rech_amt_7 total_rech_amt_8 max_rech_amt_6 max_rech_amt_7 max_rech_amt_8 last_day_rch_amt_6 last_day_rch_amt_7 last_day_rch_amt_8 vol_2g_mb_6 vol_2g_mb_7 vol_2g_mb_8 vol_3g_mb_6 vol_3g_mb_7 vol_3g_mb_8 aon vbc_3g_8 vbc_3g_7 vbc_3g_6 Average_rech_amt_6n7
count 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.00000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.00000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.00000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000 30011.000000
mean 587.284404 589.135427 534.857433 296.034461 304.343206 267.600412 417.933372 423.924375 375.021691 17.412764 13.522114 13.25627 29.321648 22.036003 21.469272 94.680696 95.729729 87.139995 181.279583 181.271524 167.591199 6.97933 7.097268 6.494314 1.567160 1.862229 1.712739 282.948414 284.107492 261.233938 189.753131 199.877508 172.196408 203.097767 213.411914 179.568790 2.010766 2.034241 1.789728 394.865994 415.327988 353.558826 2.264425 2.207400 2.029314 5.916364 7.425487 6.885193 0.692507 0.047600 0.059131 686.697541 709.124730 623.774684 68.749054 70.311351 65.936968 159.613810 160.813032 153.628517 15.595629 16.510023 14.706512 243.968340 247.644401 234.281577 16.229350 16.893723 15.051559 32.015163 33.477150 30.434765 2.874506 2.992948 2.680925 51.122992 53.36786 48.170990 307.512073 314.875472 295.426531 0.066731 0.018066 0.027660 11.156530 12.360190 11.700835 1.188803 1.476889 1.237756 12.121322 11.913465 10.225317 697.365833 695.962880 613.638799 171.414048 175.661058 162.869348 104.485655 105.287128 95.653294 78.859009 78.171382 69.209105 258.392681 278.093737 269.864111 1264.064776 129.439626 135.127102 121.360548 696.664356
std 442.722413 462.897814 492.259586 460.775592 481.780488 466.560947 470.588583 486.525332 477.489377 79.152657 76.303736 74.55207 118.570414 97.925249 106.244774 236.849265 248.132623 234.721938 250.132066 240.722132 234.862468 22.66552 22.588864 20.220028 6.889317 9.255645 7.397562 379.985249 375.837282 366.539171 409.716719 428.119476 410.033964 413.489240 437.941904 416.752834 12.457422 13.350441 11.700376 606.508681 637.446710 616.219690 45.918087 45.619381 44.794926 18.621373 23.065743 22.893414 2.281325 2.741786 3.320320 660.356820 685.071178 685.983313 158.647160 167.315954 155.702334 222.001036 219.432004 217.026349 45.827009 49.478371 43.714061 312.805586 315.468343 307.043800 78.862358 84.691403 72.433104 101.084965 105.806605 105.308898 19.928472 20.511317 20.269535 140.504104 149.17944 140.965196 361.159561 369.654489 360.343153 0.194273 0.181944 0.116574 67.258387 76.992293 74.928607 13.987003 15.406483 12.889879 9.543550 9.605532 9.478572 539.325984 562.143146 601.821630 174.703215 181.545389 172.605809 142.767207 141.148386 145.260363 277.445058 280.331857 268.494284 866.195376 855.682340 859.299266 975.263117 390.478591 408.024394 389.726031 488.782088
min -2258.709000 -2014.045000 -945.808000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 180.000000 0.000000 0.000000 0.000000 368.500000
25% 364.161000 365.004500 289.609500 41.110000 40.950000 27.010000 137.335000 135.680000 95.695000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 8.320000 9.130000 5.790000 30.290000 33.580000 22.420000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 51.010000 56.710000 38.270000 0.000000 0.000000 0.000000 1.600000 1.330000 0.000000 0.000000 0.000000 0.000000 5.950000 5.555000 1.780000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 266.170000 275.045000 188.790000 8.290000 9.460000 6.810000 33.460000 38.130000 29.660000 0.000000 0.000000 0.000000 56.700000 63.535000 49.985000 0.000000 0.000000 0.000000 0.450000 0.480000 0.000000 0.000000 0.000000 0.000000 2.630000 2.78000 1.430000 89.975000 98.820000 78.930000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6.000000 6.000000 4.000000 432.000000 426.500000 309.000000 110.000000 110.000000 67.000000 30.000000 27.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 480.000000 0.000000 0.000000 0.000000 450.000000
50% 495.682000 493.561000 452.091000 125.830000 125.460000 99.440000 282.190000 281.940000 240.940000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 32.590000 33.160000 28.640000 101.240000 104.340000 89.810000 0.33000 0.400000 0.160000 0.000000 0.000000 0.000000 166.310000 170.440000 148.280000 12.830000 13.350000 5.930000 37.730000 37.530000 23.660000 0.000000 0.000000 0.000000 126.010000 131.730000 72.890000 0.000000 0.000000 0.000000 0.210000 0.780000 0.490000 0.000000 0.000000 0.000000 510.230000 525.580000 435.330000 29.130000 30.130000 26.840000 93.940000 96.830000 89.810000 1.960000 2.210000 1.850000 151.060000 154.830000 142.840000 1.050000 1.200000 0.560000 7.080000 7.460000 5.710000 0.000000 0.000000 0.000000 15.030000 16.11000 12.560000 205.240000 211.190000 193.440000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 9.000000 9.000000 8.000000 584.000000 581.000000 520.000000 120.000000 128.000000 130.000000 110.000000 98.000000 50.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 914.000000 0.000000 0.000000 0.000000 568.500000
75% 703.922000 700.788000 671.150000 353.310000 359.925000 297.735000 523.125000 532.695000 482.610000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 91.460000 91.480000 84.670000 240.165000 239.485000 223.590000 5.09000 5.260000 4.680000 0.000000 0.100000 0.050000 374.475000 375.780000 348.310000 178.085000 191.380000 132.820000 211.210000 223.010000 164.725000 0.000000 0.000000 0.000000 573.090000 615.150000 481.030000 0.000000 0.000000 0.000000 5.160000 7.110000 6.380000 0.000000 0.000000 0.000000 899.505000 931.050000 833.100000 73.640000 74.680000 70.330000 202.830000 203.485000 196.975000 12.440000 13.035000 11.605000 315.500000 316.780000 302.110000 10.280000 10.980000 8.860000 27.540000 29.235000 25.330000 0.180000 0.260000 0.130000 47.540000 50.36000 43.410000 393.680000 396.820000 380.410000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.060000 0.000000 0.060000 15.000000 15.000000 13.000000 837.000000 835.000000 790.000000 200.000000 200.000000 198.000000 120.000000 130.000000 130.000000 14.450000 14.960000 9.620000 0.000000 2.080000 0.000000 1924.000000 1.600000 1.990000 0.000000 795.500000
max 27731.088000 35145.834000 33543.624000 7376.710000 8157.780000 10752.560000 8362.360000 9667.130000 14007.340000 2613.310000 3813.290000 4169.81000 3775.110000 2812.040000 5337.040000 6431.330000 7400.660000 10752.560000 4729.740000 4557.140000 4961.330000 1466.03000 1196.430000 928.490000 342.860000 569.710000 351.830000 10643.380000 7674.780000 11039.910000 7366.580000 8133.660000 8014.430000 8314.760000 9284.740000 13950.040000 628.560000 544.630000 516.910000 8432.990000 10936.730000 13980.060000 5900.660000 5490.280000 5681.540000 1023.210000 1265.790000 1390.880000 100.610000 370.130000 394.930000 10674.030000 11365.310000 14043.060000 6351.440000 5709.590000 4003.210000 4693.860000 4388.730000 5738.460000 1678.410000 1983.010000 1588.530000 6496.110000 6466.740000 5748.810000 5459.560000 5800.930000 4309.290000 4630.230000 3470.380000 5645.860000 1351.110000 1136.080000 1394.890000 5459.630000 6745.76000 5957.140000 6798.640000 7279.080000 5990.710000 19.760000 21.330000 6.230000 3965.690000 4747.910000 4100.380000 1344.140000 1495.940000 1209.860000 307.000000 138.000000 196.000000 35190.000000 40335.000000 45320.000000 4010.000000 4010.000000 4449.000000 4010.000000 4010.000000 4449.000000 10285.900000 7873.550000 11117.610000 45735.400000 28144.120000 30036.060000 4321.000000 12916.220000 9165.600000 11166.210000 37762.500000
  • The minimum arpu is negative in all three months (-2258.71, -2014.05 and -945.81), so the telecom company has users with negative average revenue in both phases. These users may be at higher risk of churn.
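The summary table only shows the minimum, not how many users have negative revenue. A sketch of counting them follows; the frame 'toy' and its values are made up, and only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical ARPU values for four customers.
toy = pd.DataFrame({
    'arpu_6': [364.2, -5.0, 703.9, 495.7],
    'arpu_7': [365.0, 120.0, -20.5, 493.6],
    'arpu_8': [289.6, -1.2, 671.2, 452.1],
})
arpu_cols = ['arpu_6', 'arpu_7', 'arpu_8']

# Number of customers with negative ARPU in each month.
negative_counts = (toy[arpu_cols] < 0).sum()

# Number of customers with negative ARPU in at least one month.
any_negative = int((toy[arpu_cols] < 0).any(axis=1).sum())

print(negative_counts.to_dict())
print(any_negative)
```

Cross-tabulating the any-negative flag against 'Churn' on the real data would show whether these users actually churn more often.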
In [68]:
categorical_columns = data.dtypes[data.dtypes == 'category'].index.values
print('Mode : ')
data[categorical_columns].mode().T
Mode : 
Out[68]:
0
monthly_2g_6 0
monthly_2g_7 0
monthly_2g_8 0
sachet_2g_6 0
sachet_2g_7 0
sachet_2g_8 0
monthly_3g_6 0
monthly_3g_7 0
monthly_3g_8 0
sachet_3g_6 0
sachet_3g_7 0
sachet_3g_8 0
Churn 0
  • The most frequent value in every monthly and sachet column is 0, which suggests that most customers did not buy these packs.

Univariate Analysis

In [69]:
churned_customers = data[data['Churn'] == 1]
non_churned_customers = data[data['Churn'] == 0]

Age on Network

In [70]:
plt.figure(figsize=(12,8))
sns.violinplot(x='aon', y='Churn', data=data)
plt.title('Age on Network vs Churn')
plt.show()
  • Customers with a lower 'aon' are more likely to churn than customers with a higher 'aon'.
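A numeric summary can back up what the violin plot shows. The sketch below compares the median 'aon' of the two groups on made-up values; the numbers are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical age-on-network values (in days) with churn labels.
toy = pd.DataFrame({
    'aon':   [200, 350, 1500, 2400, 300, 3000],
    'Churn': [1, 1, 0, 0, 1, 0],
})
toy['Churn'] = toy['Churn'].astype('category')

# Median age on network per churn group.
aon_summary = toy.groupby('Churn', observed=True)['aon'].median()

print(aon_summary.to_dict())
```

The median is used rather than the mean because tenure distributions tend to be skewed by a small number of very long-standing customers.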
In [71]:
# function for numerical variable univariate analysis
from tabulate import tabulate
def num_univariate_analysis(column_names, scale='linear'):
    # violin plot of each of the first three columns vs the target
    fig = plt.figure(figsize=(16,8))
    for position, column in enumerate(column_names[:3], start=1):
        ax = fig.add_subplot(1, 3, position)
        sns.violinplot(x='Churn', y=column, data=data, ax=ax)
        ax.set(title=column + ' vs Churn')
        if scale == 'log':
            # set the scale on this subplot explicitly instead of relying on the current axes
            ax.set_yscale('log')
            ax.set(ylabel=column + ' (Log Scale)')

    # summary statistics for churned and non-churned customers
    print('Customers who churned (Churn : 1)')
    print(churned_customers[column_names].describe())

    print('\nCustomers who did not churn (Churn : 0)')
    print(non_churned_customers[column_names].describe(),'\n')
In [72]:
# function for categorical variable univariate analysis
!pip install sidetable
import sidetable
def cat_univariate_analysis(column_names, figsize=(16,4)):
    # count plot of each of the first three columns, split by the target
    fig = plt.figure(figsize=figsize)
    for position, column in enumerate(column_names[:3], start=1):
        ax = fig.add_subplot(1, 3, position)
        sns.countplot(x=column, hue='Churn', data=data, ax=ax)
        ax.set(title=column + ' vs No of Churned Customers')
        ax.legend(loc='upper right')

    # frequency tables (count and percent) for each group
    groups = [('Customers who churned (Churn : 1)', churned_customers),
              ('\nCustomers who did not churn (Churn : 0)', non_churned_customers)]
    for label, group in groups:
        print(label)
        for column in column_names[:3]:
            print(tabulate(pd.DataFrame(group.stb.freq([column])), headers='keys', tablefmt='psql'),'\n')

arpu_6, arpu_7, arpu_8

In [73]:
columns = ['arpu_6','arpu_7','arpu_8']
num_univariate_analysis(columns,'log')
Customers who churned (Churn : 1)
             arpu_6        arpu_7       arpu_8
count   2593.000000   2593.000000  2593.000000
mean     678.716970    550.511946   243.063343
std      551.792864    517.241221   378.843531
min     -209.465000   -158.963000   -37.887000
25%      396.507000    289.641000     0.000000
50%      573.396000    464.674000   101.894000
75%      819.460000    691.588000   351.028000
max    11505.508000  13224.119000  5228.826000

Customers who did not churn (Churn : 0)
             arpu_6        arpu_7        arpu_8
count  27418.000000  27418.000000  27418.000000
mean     578.637360    592.788162    562.453248
std      429.988265    457.265996    492.802655
min    -2258.709000  -2014.045000   -945.808000
25%      362.218000    369.610500    319.118500
50%      489.324000    496.182500    471.024000
75%      690.891750    701.418000    690.921000
max    27731.088000  35145.834000  33543.624000 

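  • The plots and summary statistics show that revenue from customers who go on to churn is unstable: their mean arpu falls from about 679 in month 6 to 551 in month 7 and 243 in month 8, while it stays between roughly 562 and 593 for customers who do not churn.
  • Customers whose arpu decreases in the 7th month are more likely to churn than those whose arpu increases.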
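The size of the decline can be made explicit as a month-over-month percentage change. The sketch below uses the group means printed in the summary tables above; the variable names are introduced here for illustration.

```python
import pandas as pd

# Mean ARPU per month, copied from the summary tables above.
mean_arpu = pd.DataFrame(
    {'arpu_6': [678.716970, 578.637360],
     'arpu_7': [550.511946, 592.788162],
     'arpu_8': [243.063343, 562.453248]},
    index=pd.Index([1, 0], name='Churn'))

# Percentage change of the group mean from one month to the next.
pct_7_vs_6 = (mean_arpu['arpu_7'] / mean_arpu['arpu_6'] - 1) * 100
pct_8_vs_7 = (mean_arpu['arpu_8'] / mean_arpu['arpu_7'] - 1) * 100

print(pct_7_vs_6.round(1).to_dict())
print(pct_8_vs_7.round(1).to_dict())
```

For churners the mean drops by roughly 19% from month 6 to month 7 and by more than half from month 7 to month 8, while for non-churners it moves by only a few percent. A per-customer version of the same ratio could serve as a derived feature.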

total_og_mou_6, total_og_mou_7, total_og_mou_8

In [74]:
columns = ['total_og_mou_6', 'total_og_mou_7', 'total_og_mou_8']
num_univariate_analysis(columns)
Customers who churned (Churn : 1)
       total_og_mou_6  total_og_mou_7  total_og_mou_8
count     2593.000000     2593.000000     2593.000000
mean       867.961342      677.868909      225.083741
std        852.697688      786.961399      471.672718
min          0.000000        0.000000        0.000000
25%        277.880000      110.090000        0.000000
50%        658.360000      466.910000        0.000000
75%       1209.040000      926.760000      255.810000
max       8488.360000     8285.640000     5206.210000

Customers who did not churn (Churn : 0)
       total_og_mou_6  total_og_mou_7  total_og_mou_8
count    27418.000000    27418.000000    27418.000000
mean       669.554896      712.080684      661.480046
std        636.531612      674.580516      691.079113
min          0.000000        0.000000        0.000000
25%        265.682500      284.500000      227.970000
50%        500.410000      529.935000      470.475000
75%        872.070000      931.197500      866.045000
max      10674.030000    11365.310000    14043.060000 

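  • Customers with high total_og_mou in the 6th month and lower total_og_mou in the 7th month are more likely to churn than the rest. The mean for churners falls from about 868 to 678 and then to 225 across months 6 to 8, against 670, 712 and 661 for non-churners.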

total_ic_mou_6, total_ic_mou_7, total_ic_mou_8

In [75]:
columns = ['total_ic_mou_6', 'total_ic_mou_7', 'total_ic_mou_8']
num_univariate_analysis(columns)
Customers who churned (Churn : 1)
       total_ic_mou_6  total_ic_mou_7  total_ic_mou_8
count     2593.000000     2593.000000     2593.000000
mean       241.954404      193.341076       68.807042
std        360.836586      318.183813      154.450340
min          0.000000        0.000000        0.000000
25%         49.460000       27.890000        0.000000
50%        137.330000       99.980000        0.000000
75%        289.510000      235.740000       70.290000
max       6633.180000     5137.560000     1859.280000

Customers who did not churn (Churn : 0)
       total_ic_mou_6  total_ic_mou_7  total_ic_mou_8
count    27418.000000    27418.000000    27418.000000
mean       313.712052      326.369333      316.858595
std        360.580253      372.112086      366.818717
min          0.000000        0.000000        0.000000
25%         94.460000      107.802500       98.265000
50%        212.160000      222.290000      212.360000
75%        401.602500      410.182500      402.270000
max       6798.640000     7279.080000     5990.710000 

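  • Customers whose total_ic_mou decreases in the 7th month are more likely to churn than the rest. The mean for churners falls from about 242 to 193 and then to 69 across months 6 to 8, while it stays between roughly 314 and 326 for non-churners.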

vol_2g_mb_6, vol_2g_mb_7, vol_2g_mb_8

In [76]:
columns = ['vol_2g_mb_6', 'vol_2g_mb_7', 'vol_2g_mb_8']
num_univariate_analysis(columns, 'log')
Customers who churned (Churn : 1)
       vol_2g_mb_6  vol_2g_mb_7  vol_2g_mb_8
count  2593.000000  2593.000000  2593.000000
mean     60.775588    49.054393    15.283185
std     243.084276   219.485813   120.975111
min       0.000000     0.000000     0.000000
25%       0.000000     0.000000     0.000000
50%       0.000000     0.000000     0.000000
75%       0.000000     0.000000     0.000000
max    4017.160000  3430.730000  3349.190000

Customers who did not churn (Churn : 0)
        vol_2g_mb_6   vol_2g_mb_7   vol_2g_mb_8
count  27418.000000  27418.000000  27418.000000
mean      80.569210     80.925060     74.309036
std      280.420463    285.265125    277.889339
min        0.000000      0.000000      0.000000
25%        0.000000      0.000000      0.000000
50%        0.000000      0.000000      0.000000
75%       16.937500     18.267500     14.245000
max    10285.900000   7873.550000  11117.610000 

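  • Customers with stable 2g data usage across months 6 and 7 are less likely to churn (the mean for non-churners stays near 81 MB in both months).
  • Customers whose 2g data consumption falls in the 7th month are more likely to churn (the mean for churners drops from about 61 MB to 49 MB and then to 15 MB in month 8).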
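The 25% and 50% quantiles of 2g volume are zero in both groups, so the quartiles alone say little. One complementary view is the share of customers with any data usage per month, sketched here on made-up values (the frame 'toy' is illustrative).

```python
import pandas as pd

# Hypothetical 2G volumes (MB) for six customers with churn labels.
toy = pd.DataFrame({
    'vol_2g_mb_6': [0.0, 120.5, 0.0, 30.2, 0.0, 15.0],
    'vol_2g_mb_7': [0.0, 80.0, 0.0, 28.9, 0.0, 0.0],
    'vol_2g_mb_8': [0.0, 0.0, 0.0, 31.4, 0.0, 0.0],
    'Churn':       [1, 1, 0, 0, 0, 1],
})
vol_cols = ['vol_2g_mb_6', 'vol_2g_mb_7', 'vol_2g_mb_8']

# Percentage of customers in each churn group with non-zero usage per month.
active_share = (toy[vol_cols] > 0).groupby(toy['Churn']).mean() * 100

print(active_share.round(2))
```

A falling share of active data users in the churn group would point the same way as the falling means, without being dominated by a few heavy users.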

vol_3g_mb_6, vol_3g_mb_7, vol_3g_mb_8, monthly_3g_6

In [77]:
columns = ['vol_3g_mb_6', 'vol_3g_mb_7', 'vol_3g_mb_8', 'monthly_3g_6']
num_univariate_analysis(columns, 'log')
Customers who churned (Churn : 1)
       vol_3g_mb_6   vol_3g_mb_7   vol_3g_mb_8
count  2593.000000   2593.000000   2593.000000
mean    188.395461    157.714254     56.776880
std     715.327843    690.773561    446.532769
min       0.000000      0.000000      0.000000
25%       0.000000      0.000000      0.000000
50%       0.000000      0.000000      0.000000
75%       0.000000      0.000000      0.000000
max    9400.120000  15115.510000  13440.720000

Customers who did not churn (Churn : 0)
        vol_3g_mb_6   vol_3g_mb_7   vol_3g_mb_8
count  27418.000000  27418.000000  27418.000000
mean     265.012522    289.478375    290.016390
std      878.846885    868.808831    885.821105
min        0.000000      0.000000      0.000000
25%        0.000000      0.000000      0.000000
50%        0.000000      0.000000      0.000000
75%        0.000000     35.855000     27.120000
max    45735.400000  28144.120000  30036.060000 

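  • Customers whose 3G data volume stays stable or grows across months 6 and 7 are less likely to churn: the non-churn mean is 265.0 MB in month 6 and 289.5 MB in month 7.
  • Customers whose 3G data volume falls in the 7th month are more likely to churn: the churn mean drops from 188.4 MB in month 6 to 157.7 MB in month 7, and to 56.8 MB in month 8.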

monthly_2g_6, monthly_2g_7, monthly_2g_8

In [78]:
columns = ['monthly_2g_6', 'monthly_2g_7', 'monthly_2g_8']
cat_univariate_analysis(columns)
Customers who churned (Churn : 1)
+----+----------------+---------+------------+--------------------+----------------------+
|    |   monthly_2g_6 |   count |    percent |   cumulative_count |   cumulative_percent |
|----+----------------+---------+------------+--------------------+----------------------|
|  0 |              0 |    2454 | 94.6394    |               2454 |              94.6394 |
|  1 |              1 |     126 |  4.85924   |               2580 |              99.4987 |
|  2 |              2 |      11 |  0.424219  |               2591 |              99.9229 |
|  3 |              4 |       2 |  0.0771307 |               2593 |             100      |
+----+----------------+---------+------------+--------------------+----------------------+ 

+----+----------------+---------+-----------+--------------------+----------------------+
|    |   monthly_2g_7 |   count |   percent |   cumulative_count |   cumulative_percent |
|----+----------------+---------+-----------+--------------------+----------------------|
|  0 |              0 |    2477 | 95.5264   |               2477 |              95.5264 |
|  1 |              1 |     104 |  4.0108   |               2581 |              99.5372 |
|  2 |              2 |      12 |  0.462784 |               2593 |             100      |
+----+----------------+---------+-----------+--------------------+----------------------+ 

+----+----------------+---------+------------+--------------------+----------------------+
|    |   monthly_2g_8 |   count |    percent |   cumulative_count |   cumulative_percent |
|----+----------------+---------+------------+--------------------+----------------------|
|  0 |              0 |    2555 | 98.5345    |               2555 |              98.5345 |
|  1 |              1 |      37 |  1.42692   |               2592 |              99.9614 |
|  2 |              2 |       1 |  0.0385654 |               2593 |             100      |
+----+----------------+---------+------------+--------------------+----------------------+ 


Customers who did not churn (Churn : 0)
+----+----------------+---------+------------+--------------------+----------------------+
|    |   monthly_2g_6 |   count |    percent |   cumulative_count |   cumulative_percent |
|----+----------------+---------+------------+--------------------+----------------------|
|  0 |              0 |   24228 | 88.3653    |              24228 |              88.3653 |
|  1 |              1 |    2825 | 10.3035    |              27053 |              98.6688 |
|  2 |              2 |     334 |  1.21818   |              27387 |              99.8869 |
|  3 |              3 |      26 |  0.0948282 |              27413 |              99.9818 |
|  4 |              4 |       5 |  0.0182362 |              27418 |             100      |
+----+----------------+---------+------------+--------------------+----------------------+ 

+----+----------------+---------+-------------+--------------------+----------------------+
|    |   monthly_2g_7 |   count |     percent |   cumulative_count |   cumulative_percent |
|----+----------------+---------+-------------+--------------------+----------------------|
|  0 |              0 |   24079 | 87.8219     |              24079 |              87.8219 |
|  1 |              1 |    2909 | 10.6098     |              26988 |              98.4317 |
|  2 |              2 |     394 |  1.43701    |              27382 |              99.8687 |
|  3 |              3 |      29 |  0.10577    |              27411 |              99.9745 |
|  4 |              4 |       5 |  0.0182362  |              27416 |              99.9927 |
|  5 |              5 |       2 |  0.00729448 |              27418 |             100      |
+----+----------------+---------+-------------+--------------------+----------------------+ 

+----+----------------+---------+-------------+--------------------+----------------------+
|    |   monthly_2g_8 |   count |     percent |   cumulative_count |   cumulative_percent |
|----+----------------+---------+-------------+--------------------+----------------------|
|  0 |              0 |   24383 | 88.9306     |              24383 |              88.9306 |
|  1 |              1 |    2724 |  9.93508    |              27107 |              98.8657 |
|  2 |              2 |     282 |  1.02852    |              27389 |              99.8942 |
|  3 |              3 |      22 |  0.0802393  |              27411 |              99.9745 |
|  4 |              4 |       5 |  0.0182362  |              27416 |              99.9927 |
|  5 |              5 |       2 |  0.00729448 |              27418 |             100      |
+----+----------------+---------+-------------+--------------------+----------------------+ 
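
The definition of `cat_univariate_analysis` is not shown in this section. As a rough sketch only, a table with the same columns (count, percent, cumulative_count, cumulative_percent) can be built from `value_counts`. The function name `frequency_table` and the toy column are assumptions for illustration, not the notebook's own helper or data.

```python
import pandas as pd


def frequency_table(series: pd.Series) -> pd.DataFrame:
    """Count, percent and cumulative columns for one categorical column."""
    counts = series.value_counts()  # sorted by count, largest first
    table = pd.DataFrame(
        {
            series.name: counts.index,
            "count": counts.to_numpy(),
        }
    )
    table["percent"] = table["count"] / table["count"].sum() * 100
    table["cumulative_count"] = table["count"].cumsum()
    table["cumulative_percent"] = table["percent"].cumsum()
    return table


# Toy column for illustration only; not the case-study data.
toy = pd.Series([0, 0, 0, 1, 1, 2, 0, 0, 1, 0], name="monthly_2g_6")
print(frequency_table(toy))
```

Sorting by count rather than by category value is one plausible reason why rows in frequency tables like these are not always in ascending category order.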

monthly_3g_6, monthly_3g_7, monthly_3g_8

In [79]:
columns = ['monthly_3g_6', 'monthly_3g_7', 'monthly_3g_8']
cat_univariate_analysis(columns)
Customers who churned (Churn : 1)
+----+----------------+---------+------------+--------------------+----------------------+
|    |   monthly_3g_6 |   count |    percent |   cumulative_count |   cumulative_percent |
|----+----------------+---------+------------+--------------------+----------------------|
|  0 |              0 |    2352 | 90.7057    |               2352 |              90.7057 |
|  1 |              1 |     170 |  6.55611   |               2522 |              97.2619 |
|  2 |              2 |      49 |  1.8897    |               2571 |              99.1516 |
|  3 |              3 |      13 |  0.50135   |               2584 |              99.6529 |
|  4 |              5 |       4 |  0.154261  |               2588 |              99.8072 |
|  5 |              4 |       4 |  0.154261  |               2592 |              99.9614 |
|  6 |              6 |       1 |  0.0385654 |               2593 |             100      |
+----+----------------+---------+------------+--------------------+----------------------+ 

+----+----------------+---------+------------+--------------------+----------------------+
|    |   monthly_3g_7 |   count |    percent |   cumulative_count |   cumulative_percent |
|----+----------------+---------+------------+--------------------+----------------------|
|  0 |              0 |    2399 | 92.5183    |               2399 |              92.5183 |
|  1 |              1 |     136 |  5.24489   |               2535 |              97.7632 |
|  2 |              2 |      48 |  1.85114   |               2583 |              99.6143 |
|  3 |              3 |       9 |  0.347088  |               2592 |              99.9614 |
|  4 |              5 |       1 |  0.0385654 |               2593 |             100      |
+----+----------------+---------+------------+--------------------+----------------------+ 

+----+----------------+---------+------------+--------------------+----------------------+
|    |   monthly_3g_8 |   count |    percent |   cumulative_count |   cumulative_percent |
|----+----------------+---------+------------+--------------------+----------------------|
|  0 |              0 |    2524 | 97.339     |               2524 |              97.339  |
|  1 |              1 |      56 |  2.15966   |               2580 |              99.4987 |
|  2 |              2 |       8 |  0.308523  |               2588 |              99.8072 |
|  3 |              3 |       4 |  0.154261  |               2592 |              99.9614 |
|  4 |              4 |       1 |  0.0385654 |               2593 |             100      |
+----+----------------+---------+------------+--------------------+----------------------+ 


Customers who did not churn (Churn : 0)
+----+----------------+---------+-------------+--------------------+----------------------+
|    |   monthly_3g_6 |   count |     percent |   cumulative_count |   cumulative_percent |
|----+----------------+---------+-------------+--------------------+----------------------|
|  0 |              0 |   24080 | 87.8255     |              24080 |              87.8255 |
|  1 |              1 |    2371 |  8.6476     |              26451 |              96.4731 |
|  2 |              2 |     648 |  2.36341    |              27099 |              98.8365 |
|  3 |              3 |     194 |  0.707564   |              27293 |              99.5441 |
|  4 |              4 |      70 |  0.255307   |              27363 |              99.7994 |
|  5 |              5 |      28 |  0.102123   |              27391 |              99.9015 |
|  6 |              6 |      10 |  0.0364724  |              27401 |              99.938  |
|  7 |              7 |       9 |  0.0328252  |              27410 |              99.9708 |
|  8 |              8 |       3 |  0.0109417  |              27413 |              99.9818 |
|  9 |             11 |       2 |  0.00729448 |              27415 |              99.9891 |
| 10 |              9 |       2 |  0.00729448 |              27417 |              99.9964 |
| 11 |             14 |       1 |  0.00364724 |              27418 |             100      |
+----+----------------+---------+-------------+--------------------+----------------------+ 

+----+----------------+---------+-------------+--------------------+----------------------+
|    |   monthly_3g_7 |   count |     percent |   cumulative_count |   cumulative_percent |
|----+----------------+---------+-------------+--------------------+----------------------|
|  0 |              0 |   23962 | 87.3951     |              23962 |              87.3951 |
|  1 |              1 |    2330 |  8.49807    |              26292 |              95.8932 |
|  2 |              2 |     774 |  2.82296    |              27066 |              98.7162 |
|  3 |              3 |     198 |  0.722153   |              27264 |              99.4383 |
|  4 |              4 |      68 |  0.248012   |              27332 |              99.6863 |
|  5 |              5 |      38 |  0.138595   |              27370 |              99.8249 |
|  6 |              6 |      23 |  0.0838865  |              27393 |              99.9088 |
|  7 |              7 |      10 |  0.0364724  |              27403 |              99.9453 |
|  8 |              8 |       5 |  0.0182362  |              27408 |              99.9635 |
|  9 |              9 |       4 |  0.014589   |              27412 |              99.9781 |
| 10 |             11 |       2 |  0.00729448 |              27414 |              99.9854 |
| 11 |             16 |       1 |  0.00364724 |              27415 |              99.9891 |
| 12 |             14 |       1 |  0.00364724 |              27416 |              99.9927 |
| 13 |             12 |       1 |  0.00364724 |              27417 |              99.9964 |
| 14 |             10 |       1 |  0.00364724 |              27418 |             100      |
+----+----------------+---------+-------------+--------------------+----------------------+ 

+----+----------------+---------+-------------+--------------------+----------------------+
|    |   monthly_3g_8 |   count |     percent |   cumulative_count |   cumulative_percent |
|----+----------------+---------+-------------+--------------------+----------------------|
|  0 |              0 |   24002 | 87.541      |              24002 |              87.541  |
|  1 |              1 |    2347 |  8.56007    |              26349 |              96.1011 |
|  2 |              2 |     728 |  2.65519    |              27077 |              98.7563 |
|  3 |              3 |     193 |  0.703917   |              27270 |              99.4602 |
|  4 |              4 |      86 |  0.313663   |              27356 |              99.7739 |
|  5 |              5 |      30 |  0.109417   |              27386 |              99.8833 |
|  6 |              6 |      14 |  0.0510613  |              27400 |              99.9343 |
|  7 |              7 |       9 |  0.0328252  |              27409 |              99.9672 |
|  8 |              9 |       3 |  0.0109417  |              27412 |              99.9781 |
|  9 |              8 |       3 |  0.0109417  |              27415 |              99.9891 |
| 10 |             10 |       2 |  0.00729448 |              27417 |              99.9964 |
| 11 |             16 |       1 |  0.00364724 |              27418 |             100      |
+----+----------------+---------+-------------+--------------------+----------------------+ 

sachet_3g_6, sachet_3g_7, sachet_3g_8

In [1]:
columns = ['sachet_3g_6', 'sachet_3g_7','sachet_3g_8']
print(data[columns].dtypes)
cat_univariate_analysis(columns)

vbc_3g_6, vbc_3g_7, vbc_3g_8 (originally jun_vbc_3g, jul_vbc_3g, aug_vbc_3g)

In [81]:
columns = [ 'vbc_3g_6', 'vbc_3g_7','vbc_3g_8']
num_univariate_analysis(columns, 'log')
Customers who churned (Churn : 1)
          vbc_3g_6     vbc_3g_7     vbc_3g_8
count  2593.000000  2593.000000  2593.000000
mean     81.564601    71.143880    32.610659
std     320.898511   284.882601   197.998246
min       0.000000     0.000000     0.000000
25%       0.000000     0.000000     0.000000
50%       0.000000     0.000000     0.000000
75%       0.000000     0.000000     0.000000
max    6931.810000  4908.270000  5738.740000

Customers who did not churn (Churn : 0)
           vbc_3g_6      vbc_3g_7      vbc_3g_8
count  27418.000000  27418.000000  27418.000000
mean     125.124167    141.178182    138.597023
std      395.413666    417.292310    402.761779
min        0.000000      0.000000      0.000000
25%        0.000000      0.000000      0.000000
50%        0.000000      0.000000      0.000000
75%        0.000000      9.940000     17.675000
max    11166.210000   9165.600000  12916.220000 

Bivariate Analysis

In [96]:
data.head()
Out[96]:
arpu_6 arpu_7 arpu_8 onnet_mou_6 onnet_mou_7 onnet_mou_8 offnet_mou_6 offnet_mou_7 offnet_mou_8 roam_ic_mou_6 roam_ic_mou_7 roam_ic_mou_8 roam_og_mou_6 roam_og_mou_7 roam_og_mou_8 loc_og_t2t_mou_6 loc_og_t2t_mou_7 loc_og_t2t_mou_8 loc_og_t2m_mou_6 loc_og_t2m_mou_7 loc_og_t2m_mou_8 loc_og_t2f_mou_6 loc_og_t2f_mou_7 loc_og_t2f_mou_8 loc_og_t2c_mou_6 loc_og_t2c_mou_7 loc_og_t2c_mou_8 loc_og_mou_6 loc_og_mou_7 loc_og_mou_8 std_og_t2t_mou_6 std_og_t2t_mou_7 std_og_t2t_mou_8 std_og_t2m_mou_6 std_og_t2m_mou_7 std_og_t2m_mou_8 std_og_t2f_mou_6 std_og_t2f_mou_7 std_og_t2f_mou_8 std_og_mou_6 std_og_mou_7 std_og_mou_8 isd_og_mou_6 isd_og_mou_7 isd_og_mou_8 spl_og_mou_6 spl_og_mou_7 spl_og_mou_8 og_others_6 og_others_7 og_others_8 total_og_mou_6 total_og_mou_7 total_og_mou_8 loc_ic_t2t_mou_6 loc_ic_t2t_mou_7 loc_ic_t2t_mou_8 loc_ic_t2m_mou_6 loc_ic_t2m_mou_7 loc_ic_t2m_mou_8 loc_ic_t2f_mou_6 loc_ic_t2f_mou_7 loc_ic_t2f_mou_8 loc_ic_mou_6 loc_ic_mou_7 loc_ic_mou_8 std_ic_t2t_mou_6 std_ic_t2t_mou_7 std_ic_t2t_mou_8 std_ic_t2m_mou_6 std_ic_t2m_mou_7 std_ic_t2m_mou_8 std_ic_t2f_mou_6 std_ic_t2f_mou_7 std_ic_t2f_mou_8 std_ic_mou_6 std_ic_mou_7 std_ic_mou_8 total_ic_mou_6 total_ic_mou_7 total_ic_mou_8 spl_ic_mou_6 spl_ic_mou_7 spl_ic_mou_8 isd_ic_mou_6 isd_ic_mou_7 isd_ic_mou_8 ic_others_6 ic_others_7 ic_others_8 total_rech_num_6 total_rech_num_7 total_rech_num_8 total_rech_amt_6 total_rech_amt_7 total_rech_amt_8 max_rech_amt_6 max_rech_amt_7 max_rech_amt_8 last_day_rch_amt_6 last_day_rch_amt_7 last_day_rch_amt_8 vol_2g_mb_6 vol_2g_mb_7 vol_2g_mb_8 vol_3g_mb_6 vol_3g_mb_7 vol_3g_mb_8 monthly_2g_6 monthly_2g_7 monthly_2g_8 sachet_2g_6 sachet_2g_7 sachet_2g_8 monthly_3g_6 monthly_3g_7 monthly_3g_8 sachet_3g_6 sachet_3g_7 sachet_3g_8 aon vbc_3g_8 vbc_3g_7 vbc_3g_6 Average_rech_amt_6n7 Churn
mobile_number
7000701601 1069.180 1349.850 3171.480 57.84 54.68 52.29 453.43 567.16 325.91 16.23 33.49 31.64 23.74 12.59 38.06 51.39 31.38 40.28 308.63 447.38 162.28 62.13 55.14 53.23 0.0 0.0 0.00 422.16 533.91 255.79 4.30 23.29 12.01 49.89 31.76 49.14 6.66 20.08 16.68 60.86 75.14 77.84 0.0 0.18 10.01 4.50 0.00 6.50 0.00 0.0 0.0 487.53 609.24 350.16 58.14 32.26 27.31 217.56 221.49 121.19 152.16 101.46 39.53 427.88 355.23 188.04 36.89 11.83 30.39 91.44 126.99 141.33 52.19 34.24 22.21 180.54 173.08 193.94 626.46 558.04 428.74 0.21 0.0 0.0 2.06 14.53 31.59 15.74 15.19 15.14 5 5 7 1580 790 3638 1580 790 1580 0 0 779 0.0 0.0 0.00 0.0 0.00 0.00 0 0 0 0 0 0 0 0 0 0 0 0 802 57.74 19.38 18.74 1185.0 1
7001524846 378.721 492.223 137.362 413.69 351.03 35.08 94.66 80.63 136.48 0.00 0.00 0.00 0.00 0.00 0.00 297.13 217.59 12.49 80.96 70.58 50.54 0.00 0.00 0.00 0.0 0.0 7.15 378.09 288.18 63.04 116.56 133.43 22.58 13.69 10.04 75.69 0.00 0.00 0.00 130.26 143.48 98.28 0.0 0.00 0.00 0.00 0.00 10.23 0.00 0.0 0.0 508.36 431.66 171.56 23.84 9.84 0.31 57.58 13.98 15.48 0.00 0.00 0.00 81.43 23.83 15.79 0.00 0.58 0.10 22.43 4.08 0.65 0.00 0.00 0.00 22.43 4.66 0.75 103.86 28.49 16.54 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.00 19 21 14 437 601 120 90 154 30 50 0 10 0.0 356.0 0.03 0.0 750.95 11.94 0 1 0 0 1 3 0 0 0 0 0 0 315 21.03 910.65 122.16 519.0 0
7002191713 492.846 205.671 593.260 501.76 108.39 534.24 413.31 119.28 482.46 23.53 144.24 72.11 7.98 35.26 1.44 49.63 6.19 36.01 151.13 47.28 294.46 4.54 0.00 23.51 0.0 0.0 0.49 205.31 53.48 353.99 446.41 85.98 498.23 255.36 52.94 156.94 0.00 0.00 0.00 701.78 138.93 655.18 0.0 0.00 1.29 0.00 0.00 4.78 0.00 0.0 0.0 907.09 192.41 1015.26 67.88 7.58 52.58 142.88 18.53 195.18 4.81 0.00 7.49 215.58 26.11 255.26 115.68 38.29 154.58 308.13 29.79 317.91 0.00 0.00 1.91 423.81 68.09 474.41 968.61 172.58 1144.53 0.45 0.0 0.0 245.28 62.11 393.39 83.48 16.24 21.44 6 4 11 507 253 717 110 110 130 110 50 0 0.0 0.0 0.02 0.0 0.00 0.00 0 0 0 0 0 3 0 0 0 0 0 0 2607 0.00 0.00 0.00 380.0 0
7000875565 430.975 299.869 187.894 50.51 74.01 70.61 296.29 229.74 162.76 0.00 2.83 0.00 0.00 17.74 0.00 42.61 65.16 67.38 273.29 145.99 128.28 0.00 4.48 10.26 0.0 0.0 0.00 315.91 215.64 205.93 7.89 2.58 3.23 22.99 64.51 18.29 0.00 0.00 0.00 30.89 67.09 21.53 0.0 0.00 0.00 0.00 3.26 5.91 0.00 0.0 0.0 346.81 286.01 233.38 41.33 71.44 28.89 226.81 149.69 150.16 8.71 8.68 32.71 276.86 229.83 211.78 68.79 78.64 6.33 18.68 73.08 73.93 0.51 0.00 2.18 87.99 151.73 82.44 364.86 381.56 294.46 0.00 0.0 0.0 0.00 0.00 0.23 0.00 0.00 0.00 10 6 2 570 348 160 110 110 130 100 100 130 0.0 0.0 0.00 0.0 0.00 0.00 0 0 0 0 0 0 0 0 0 0 0 0 511 0.00 2.45 21.89 459.0 0
7000187447 690.008 18.980 25.499 1185.91 9.28 7.79 61.64 0.00 5.54 0.00 4.76 4.81 0.00 8.46 13.34 38.99 0.00 0.00 58.54 0.00 0.00 0.00 0.00 0.00 0.0 0.0 0.00 97.54 0.00 0.00 1146.91 0.81 0.00 1.55 0.00 0.00 0.00 0.00 0.00 1148.46 0.81 0.00 0.0 0.00 0.00 2.58 0.00 0.00 0.93 0.0 0.0 1249.53 0.81 0.00 34.54 0.00 0.00 47.41 2.31 0.00 0.00 0.00 0.00 81.96 2.31 0.00 8.63 0.00 0.00 1.28 0.00 0.00 0.00 0.00 0.00 9.91 0.00 0.00 91.88 2.31 0.00 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.00 19 2 4 816 0 30 110 0 30 30 0 0 0.0 0.0 0.00 0.0 0.00 0.00 0 0 0 0 0 0 0 0 0 0 0 0 667 0.00 0.00 0.00 408.0 0

'total_og_mou_6' vs 'total_og_mou_8' with respect to Churn.

In [123]:
sns.scatterplot(x=data['total_og_mou_6'],y=data['total_og_mou_8'],hue=data['Churn'])
Out[123]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ffc15cc7190>

'total_og_mou_7' vs 'total_og_mou_8' with respect to Churn.

In [122]:
sns.scatterplot(x=data['total_og_mou_7'],y=data['total_og_mou_8'],hue=data['Churn'])
Out[122]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ffc1a6f59a0>
  • Customers with lower total_og_mou in the 6th and 8th months are more likely to churn than those with higher total_og_mou.

'aon' vs 'total_og_mou_8' with respect to Churn.

In [119]:
sns.scatterplot(x=data['aon'],y=data['total_og_mou_8'],hue=data['Churn'])
Out[119]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ffc128bd790>
  • Customers with lower total_og_mou_8 and lower aon are more likely to churn than those with higher total_og_mou_8 and aon.

'aon' vs 'total_ic_mou_8' with respect to Churn.

In [120]:
sns.scatterplot(x=data['aon'],y=data['total_ic_mou_8'],hue=data['Churn'])
Out[120]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ffc197fbdc0>
  • Customers with low total_ic_mou_8 are more likely to churn, irrespective of aon.
  • Customers with total_ic_mou_8 > 2000 are very unlikely to churn.
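
Observations read off a scatterplot can be checked numerically by binning the usage column and computing the churn rate in each bin. The following is a minimal sketch under stated assumptions: the helper name `churn_rate_by_bin`, the bin edges, and the toy rows are all illustrative and are not taken from the case-study data.

```python
import pandas as pd


def churn_rate_by_bin(df, column, bins, target="Churn"):
    """Share of churned customers within each interval of `column`."""
    binned = pd.cut(df[column], bins=bins, include_lowest=True)
    return df.groupby(binned, observed=False)[target].mean()


# Toy data for illustration only; not the case-study dataset.
toy = pd.DataFrame(
    {
        "total_ic_mou_8": [0, 10, 50, 300, 800, 1500, 2500, 3000],
        "Churn": [1, 1, 0, 0, 1, 0, 0, 0],
    }
)
rates = churn_rate_by_bin(toy, "total_ic_mou_8", bins=[0, 100, 2000, 4000])
print(rates)
```

Applied to the real dataframe, the same call with a cut-off at 2000 would show whether the churn rate above that level is in fact close to zero.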

'max_rech_amt_6' vs 'max_rech_amt_8' with respect to 'Churn'.

In [124]:
sns.scatterplot(x=data['max_rech_amt_6'],y=data['max_rech_amt_8'],hue=data['Churn'])
Out[124]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ffc1b2ad970>

Correlation Analysis

In [186]:
# function to correlate variables
def correlation(dataframe):
    # every column except the target; a list keeps the column order stable
    columns_for_analysis = [col for col in dataframe.columns if col != 'Churn']
    cor0 = dataframe[columns_for_analysis].corr()

    # keep only the upper triangle (k=1 skips the diagonal) so that each
    # pair of variables appears exactly once
    upper = np.triu(np.ones(cor0.shape), k=1).astype(bool)
    cor0 = cor0.where(upper)

    # long format: one row per (VAR1, VAR2) pair
    cor0 = cor0.unstack().reset_index()
    cor0.columns = ['VAR1', 'VAR2', 'CORR']
    cor0 = cor0.dropna(subset=['CORR'])

    # report absolute correlations rounded to two decimals
    cor0['CORR'] = cor0['CORR'].abs().round(2)
    return cor0.sort_values(by='CORR', ascending=False)
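
Masking the correlation matrix to its upper triangle is a compact way to list each pair of variables once and to drop self-correlations. The following is a self-contained sketch on toy data. The name `top_correlations` and the toy columns are hypothetical, and the 0.4 default simply mirrors the threshold used in this analysis.

```python
import numpy as np
import pandas as pd


def top_correlations(df: pd.DataFrame, threshold: float = 0.4) -> pd.DataFrame:
    """Absolute pairwise correlations above `threshold`, one row per pair."""
    corr = df.corr().abs()
    # k=1 excludes the diagonal, so self-correlations are dropped as well.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna().reset_index()
    pairs.columns = ["VAR1", "VAR2", "CORR"]
    pairs = pairs[pairs["CORR"] > threshold]
    return pairs.sort_values("CORR", ascending=False, ignore_index=True)


# Toy data: b is an exact multiple of a, c is only weakly related to both.
toy = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": [2, 4, 6, 8, 10, 12],
        "c": [1, -1, -1, 1, 1, -1],
    }
)
print(top_correlations(toy))
```

Only the (a, b) pair survives the threshold here, and it appears once rather than as both (a, b) and (b, a).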
In [187]:
# Correlations for Churn : 0 - non-churned customers
# Absolute values are reported
pd.set_option('display.precision', 2)
cor_0 = correlation(non_churned_customers)

# keep only correlations above 0.4
condition = cor_0['CORR'] > 0.4
cor_0 = cor_0[condition]
# Styler.hide(axis='index') replaces the deprecated hide_index() (pandas >= 1.4)
cor_0.style.background_gradient(cmap='GnBu').hide(axis='index')
Out[187]:
VAR1 VAR2 CORR
isd_og_mou_8 isd_og_mou_7 0.96
isd_og_mou_8 isd_og_mou_6 0.95
arpu_8 total_rech_amt_8 0.95
isd_og_mou_7 isd_og_mou_6 0.95
arpu_6 total_rech_amt_6 0.94
total_rech_amt_7 arpu_7 0.94
Average_rech_amt_6n7 arpu_7 0.91
total_rech_amt_7 Average_rech_amt_6n7 0.91
total_ic_mou_6 loc_ic_mou_6 0.90
Average_rech_amt_6n7 total_rech_amt_6 0.90
arpu_6 Average_rech_amt_6n7 0.89
loc_ic_mou_8 total_ic_mou_8 0.89
loc_ic_mou_7 total_ic_mou_7 0.88
loc_ic_mou_7 loc_ic_mou_8 0.85
std_og_t2t_mou_8 onnet_mou_8 0.85
loc_ic_mou_8 loc_ic_t2m_mou_8 0.85
loc_ic_t2m_mou_6 loc_ic_mou_6 0.85
std_og_t2m_mou_8 offnet_mou_8 0.85
std_og_t2t_mou_7 onnet_mou_7 0.84
std_og_mou_8 total_og_mou_8 0.84
loc_og_mou_8 loc_og_mou_7 0.84
std_ic_mou_8 std_ic_t2m_mou_8 0.84
std_og_t2t_mou_6 onnet_mou_6 0.84
offnet_mou_7 std_og_t2m_mou_7 0.84
total_og_mou_7 std_og_mou_7 0.83
loc_ic_mou_7 loc_ic_mou_6 0.83
total_ic_mou_7 total_ic_mou_8 0.83
loc_og_t2t_mou_8 loc_og_t2t_mou_7 0.83
loc_ic_mou_7 loc_ic_t2m_mou_7 0.83
loc_ic_t2m_mou_7 loc_ic_t2m_mou_8 0.82
loc_og_t2f_mou_8 loc_og_t2f_mou_7 0.82
loc_og_t2m_mou_8 loc_og_t2m_mou_7 0.82
onnet_mou_7 onnet_mou_8 0.82
std_ic_t2m_mou_6 std_ic_mou_6 0.82
std_og_t2t_mou_7 std_og_t2t_mou_8 0.82
std_ic_t2m_mou_7 std_ic_mou_7 0.81
loc_ic_t2t_mou_6 loc_ic_t2t_mou_7 0.81
std_og_mou_8 std_og_mou_7 0.81
offnet_mou_6 std_og_t2m_mou_6 0.81
total_ic_mou_6 total_ic_mou_7 0.81
loc_ic_t2t_mou_8 loc_ic_t2t_mou_7 0.81
total_og_mou_6 std_og_mou_6 0.80
loc_og_mou_6 loc_og_mou_7 0.80
loc_ic_t2m_mou_6 loc_ic_t2m_mou_7 0.80
loc_og_t2t_mou_6 loc_og_t2t_mou_7 0.80
std_og_t2m_mou_8 std_og_t2m_mou_7 0.79
loc_og_t2f_mou_7 loc_og_t2f_mou_6 0.79
loc_ic_t2f_mou_7 loc_ic_t2f_mou_8 0.79
loc_og_mou_6 loc_og_t2m_mou_6 0.79
total_rech_num_8 total_rech_num_7 0.78
loc_og_t2m_mou_7 loc_og_t2m_mou_6 0.78
arpu_8 Average_rech_amt_6n7 0.78
offnet_mou_7 offnet_mou_8 0.78
loc_og_t2t_mou_8 loc_og_mou_8 0.77
arpu_7 total_rech_amt_8 0.77
arpu_8 arpu_7 0.77
arpu_8 total_rech_amt_7 0.77
loc_og_t2m_mou_8 loc_og_mou_8 0.77
total_og_mou_7 total_og_mou_8 0.77
std_og_t2f_mou_7 std_og_t2f_mou_8 0.77
loc_og_mou_7 loc_og_t2t_mou_7 0.77
total_ic_mou_8 loc_ic_t2m_mou_8 0.76
std_ic_mou_8 std_ic_mou_7 0.76
loc_ic_t2m_mou_6 total_ic_mou_6 0.76
isd_ic_mou_7 isd_ic_mou_6 0.75
isd_ic_mou_8 isd_ic_mou_7 0.75
loc_og_mou_6 loc_og_t2t_mou_6 0.75
std_ic_mou_6 std_ic_mou_7 0.75
loc_ic_mou_7 total_ic_mou_8 0.75
loc_og_t2m_mou_7 loc_og_mou_7 0.75
loc_ic_mou_8 loc_ic_mou_6 0.75
total_ic_mou_7 loc_ic_mou_8 0.75
Average_rech_amt_6n7 total_rech_amt_8 0.75
loc_ic_t2f_mou_6 loc_ic_t2f_mou_7 0.75
vol_3g_mb_8 vol_3g_mb_7 0.75
std_og_t2m_mou_6 std_og_t2m_mou_7 0.75
std_og_mou_8 std_og_t2m_mou_8 0.75
std_og_t2m_mou_6 std_og_mou_6 0.74
std_og_mou_8 std_og_t2t_mou_8 0.74
std_ic_t2f_mou_7 std_ic_t2f_mou_6 0.74
std_og_t2m_mou_7 std_og_mou_7 0.74
loc_ic_mou_7 total_ic_mou_6 0.74
loc_ic_t2m_mou_7 total_ic_mou_7 0.74
std_og_mou_6 std_og_t2t_mou_6 0.74
std_ic_t2t_mou_6 std_ic_t2t_mou_7 0.74
std_ic_t2t_mou_8 std_ic_t2t_mou_7 0.73
loc_og_t2f_mou_8 loc_og_t2f_mou_6 0.73
std_og_t2t_mou_7 std_og_t2t_mou_6 0.73
std_ic_t2m_mou_7 std_ic_t2m_mou_8 0.73
onnet_mou_7 onnet_mou_6 0.73
total_rech_amt_7 total_rech_amt_8 0.73
loc_og_mou_6 loc_og_mou_8 0.73
total_ic_mou_6 total_ic_mou_8 0.73
std_og_t2t_mou_7 std_og_mou_7 0.73
std_og_mou_6 std_og_mou_7 0.73
total_ic_mou_7 loc_ic_mou_6 0.72
std_ic_t2f_mou_7 std_ic_t2f_mou_8 0.72
offnet_mou_6 offnet_mou_7 0.72
loc_ic_t2m_mou_6 loc_ic_t2m_mou_8 0.72
loc_ic_t2m_mou_7 loc_ic_mou_8 0.72
total_og_mou_8 offnet_mou_8 0.72
std_ic_t2m_mou_7 std_ic_t2m_mou_6 0.72
loc_og_t2t_mou_8 loc_og_t2t_mou_6 0.72
total_og_mou_8 onnet_mou_8 0.71
vbc_3g_8 vbc_3g_7 0.71
vol_3g_mb_6 vol_3g_mb_7 0.71
std_og_t2f_mou_6 std_og_t2f_mou_7 0.71
total_og_mou_7 onnet_mou_7 0.71
ic_others_8 ic_others_7 0.71
total_og_mou_6 offnet_mou_6 0.70
total_rech_amt_6 arpu_7 0.70
total_og_mou_7 offnet_mou_7 0.70
std_ic_t2t_mou_7 std_ic_mou_7 0.70
arpu_6 arpu_7 0.70
total_og_mou_6 onnet_mou_6 0.70
loc_ic_t2t_mou_6 loc_ic_t2t_mou_8 0.70
loc_ic_mou_7 loc_ic_t2m_mou_8 0.70
last_day_rch_amt_8 max_rech_amt_8 0.69
std_og_t2t_mou_7 onnet_mou_8 0.69
vbc_3g_7 vbc_3g_6 0.69
loc_og_t2m_mou_8 loc_og_t2m_mou_6 0.69
loc_ic_t2m_mou_7 loc_ic_mou_6 0.69
total_rech_num_6 total_rech_num_7 0.69
vol_2g_mb_7 vol_2g_mb_8 0.69
std_og_t2t_mou_8 onnet_mou_7 0.69
loc_ic_mou_7 loc_ic_t2m_mou_6 0.68
arpu_6 total_rech_amt_7 0.68
loc_ic_mou_7 loc_ic_t2t_mou_7 0.68
total_ic_mou_6 loc_ic_mou_8 0.68
std_ic_t2f_mou_8 std_ic_t2f_mou_6 0.67
ic_others_7 ic_others_6 0.67
loc_ic_t2t_mou_6 loc_ic_mou_6 0.67
loc_ic_t2t_mou_8 loc_ic_mou_8 0.67
vol_2g_mb_7 vol_2g_mb_6 0.67
loc_ic_t2f_mou_6 loc_ic_t2f_mou_8 0.67
total_og_mou_6 total_og_mou_7 0.67
vol_3g_mb_8 vol_3g_mb_6 0.67
std_ic_t2t_mou_6 std_ic_mou_6 0.67
total_ic_mou_8 loc_ic_mou_6 0.66
std_ic_t2t_mou_8 std_ic_mou_8 0.66
total_og_mou_7 std_og_mou_8 0.66
std_og_t2m_mou_8 offnet_mou_7 0.66
std_ic_mou_8 std_ic_mou_6 0.66
vbc_3g_7 vol_3g_mb_7 0.65
max_rech_amt_6 last_day_rch_amt_6 0.65
std_og_t2m_mou_7 offnet_mou_8 0.65
std_og_t2f_mou_6 std_og_t2f_mou_8 0.65
loc_og_t2t_mou_8 loc_og_mou_7 0.65
arpu_6 total_rech_amt_8 0.64
arpu_6 arpu_8 0.64
total_og_mou_8 std_og_mou_7 0.64
std_og_t2m_mou_8 total_og_mou_8 0.64
loc_ic_t2m_mou_6 loc_ic_mou_8 0.64
roam_og_mou_6 roam_ic_mou_6 0.64
std_ic_t2m_mou_7 std_ic_mou_8 0.64
loc_og_mou_8 loc_og_t2t_mou_7 0.64
loc_ic_t2m_mou_7 total_ic_mou_8 0.64
total_rech_amt_7 total_rech_amt_6 0.64
total_ic_mou_7 loc_ic_t2m_mou_8 0.63
vol_3g_mb_6 vbc_3g_6 0.63
vbc_3g_8 vol_3g_mb_8 0.63
total_ic_mou_6 loc_ic_t2m_mou_7 0.63
roam_ic_mou_7 roam_og_mou_7 0.63
onnet_mou_8 onnet_mou_6 0.63
loc_og_t2m_mou_8 loc_og_mou_7 0.63
arpu_8 total_rech_amt_6 0.63
std_og_t2t_mou_8 std_og_t2t_mou_6 0.63
loc_ic_t2m_mou_8 loc_ic_mou_6 0.63
loc_og_t2t_mou_6 loc_og_mou_7 0.63
total_og_mou_7 std_og_t2m_mou_7 0.63
ic_others_8 ic_others_6 0.63
std_ic_t2m_mou_7 std_ic_mou_6 0.63
loc_og_mou_8 loc_og_t2m_mou_7 0.63
std_ic_t2m_mou_6 std_ic_t2m_mou_8 0.63
total_rech_amt_6 total_rech_amt_8 0.63
isd_ic_mou_8 isd_ic_mou_6 0.62
std_og_t2t_mou_8 std_og_mou_7 0.62
total_og_mou_8 std_og_t2t_mou_8 0.61
loc_og_t2m_mou_6 loc_og_mou_7 0.61
onnet_mou_7 std_og_t2t_mou_6 0.61
vbc_3g_8 vbc_3g_6 0.61
loc_og_mou_6 loc_og_t2m_mou_7 0.61
std_og_mou_8 onnet_mou_8 0.61
std_ic_t2m_mou_8 std_ic_mou_7 0.61
last_day_rch_amt_7 max_rech_amt_7 0.61
std_og_t2m_mou_6 offnet_mou_7 0.61
std_og_mou_8 std_og_mou_6 0.61
max_rech_amt_6 max_rech_amt_8 0.60
std_og_mou_8 std_og_t2m_mou_7 0.60
total_rech_num_8 total_rech_num_6 0.60
total_ic_mou_7 loc_ic_t2t_mou_7 0.60
loc_ic_t2m_mou_6 total_ic_mou_7 0.60
roam_og_mou_8 roam_ic_mou_8 0.60
std_og_t2t_mou_7 onnet_mou_6 0.60
total_og_mou_7 std_og_t2t_mou_7 0.60
std_og_mou_8 std_og_t2t_mou_7 0.60
loc_og_mou_6 loc_og_t2t_mou_7 0.60
total_og_mou_6 std_og_t2m_mou_6 0.60
std_og_mou_8 offnet_mou_8 0.60
std_og_t2m_mou_8 std_og_t2m_mou_6 0.60
std_og_mou_6 onnet_mou_6 0.59
loc_ic_t2t_mou_8 total_ic_mou_8 0.59
onnet_mou_7 std_og_mou_7 0.59
total_ic_mou_6 loc_ic_t2t_mou_6 0.59
std_og_t2m_mou_8 std_og_mou_7 0.59
offnet_mou_6 offnet_mou_8 0.59
std_ic_t2t_mou_6 std_ic_t2t_mou_8 0.59
loc_ic_mou_7 loc_ic_t2t_mou_8 0.59
total_og_mou_7 onnet_mou_8 0.58
std_ic_t2m_mou_6 std_ic_mou_7 0.58
total_og_mou_6 std_og_t2t_mou_6 0.58
offnet_mou_6 std_og_t2m_mou_7 0.58
roam_og_mou_8 roam_og_mou_7 0.58
total_ic_mou_6 loc_ic_t2m_mou_8 0.58
spl_og_mou_7 spl_og_mou_8 0.57
total_og_mou_7 std_og_mou_6 0.57
offnet_mou_7 std_og_mou_7 0.57
loc_ic_mou_7 loc_ic_t2t_mou_6 0.57
loc_og_t2t_mou_8 loc_og_mou_6 0.57
std_og_t2m_mou_6 std_og_mou_7 0.56
max_rech_amt_7 max_rech_amt_8 0.56
loc_ic_t2m_mou_6 total_ic_mou_8 0.56
spl_og_mou_7 spl_og_mou_6 0.56
roam_ic_mou_7 roam_ic_mou_8 0.56
std_ic_t2m_mou_8 std_ic_mou_6 0.56
loc_og_mou_8 loc_og_t2t_mou_6 0.56
loc_ic_mou_8 loc_ic_t2t_mou_7 0.56
loc_og_mou_8 loc_og_t2m_mou_6 0.56
total_og_mou_8 onnet_mou_7 0.56
total_og_mou_6 total_og_mou_8 0.55
loc_og_t2m_mou_8 loc_og_mou_6 0.55
std_ic_t2t_mou_6 std_ic_mou_7 0.55
loc_ic_t2t_mou_7 loc_ic_mou_6 0.55
total_og_mou_7 offnet_mou_8 0.54
std_og_t2t_mou_7 std_og_mou_6 0.54
std_ic_mou_8 std_ic_t2m_mou_6 0.54
offnet_mou_6 std_og_mou_6 0.54
std_og_mou_6 std_og_t2m_mou_7 0.54
total_og_mou_8 offnet_mou_7 0.54
spl_og_mou_7 loc_og_t2c_mou_7 0.53
std_og_t2t_mou_6 std_og_mou_7 0.53
total_og_mou_6 std_og_mou_7 0.53
std_og_t2t_mou_6 onnet_mou_8 0.53
std_ic_t2t_mou_8 std_ic_mou_7 0.53
loc_og_t2c_mou_8 loc_og_t2c_mou_7 0.53
Average_rech_amt_6n7 isd_og_mou_7 0.53
std_og_t2t_mou_8 onnet_mou_6 0.52
vol_2g_mb_8 vol_2g_mb_6 0.52
isd_og_mou_7 arpu_7 0.52
loc_ic_t2t_mou_8 loc_ic_mou_6 0.51
arpu_8 total_og_mou_8 0.51
total_ic_mou_7 loc_ic_t2t_mou_8 0.51
vol_3g_mb_7 vbc_3g_6 0.51
vbc_3g_8 vol_3g_mb_7 0.51
total_og_mou_6 arpu_6 0.51
total_rech_amt_7 isd_og_mou_7 0.50
roam_og_mou_7 roam_og_mou_6 0.50
onnet_mou_8 std_og_mou_7 0.50
Average_rech_amt_6n7 isd_og_mou_6 0.50
loc_ic_t2t_mou_6 total_ic_mou_7 0.50
loc_ic_t2t_mou_6 loc_ic_mou_8 0.50
std_ic_t2t_mou_7 std_ic_mou_6 0.50
max_rech_amt_6 max_rech_amt_7 0.50
isd_og_mou_8 Average_rech_amt_6n7 0.50
std_ic_mou_8 std_ic_t2t_mou_7 0.50
loc_ic_t2m_mou_6 loc_og_t2m_mou_6 0.50
total_og_mou_6 total_rech_amt_6 0.49
loc_ic_t2t_mou_7 total_ic_mou_8 0.49
total_og_mou_7 std_og_t2m_mou_8 0.49
isd_og_mou_8 total_rech_amt_8 0.49
total_og_mou_7 std_og_t2t_mou_8 0.49
total_og_mou_8 total_rech_amt_8 0.49
loc_og_t2m_mou_7 loc_ic_t2m_mou_7 0.49
total_ic_mou_6 loc_ic_t2t_mou_7 0.49
vbc_3g_7 vol_3g_mb_8 0.49
std_og_mou_8 onnet_mou_7 0.49
loc_og_t2m_mou_8 loc_ic_t2m_mou_8 0.49
total_og_mou_7 onnet_mou_6 0.48
total_og_mou_7 offnet_mou_6 0.48
std_og_t2m_mou_6 offnet_mou_8 0.48
total_og_mou_7 arpu_7 0.48
max_rech_amt_8 total_rech_amt_8 0.48
isd_og_mou_8 arpu_7 0.48
total_og_mou_8 std_og_t2m_mou_7 0.48
arpu_6 isd_og_mou_6 0.48
total_og_mou_6 onnet_mou_7 0.48
isd_og_mou_7 total_rech_amt_8 0.48
isd_og_mou_8 arpu_8 0.48
loc_og_t2c_mou_6 spl_og_mou_6 0.48
offnet_mou_6 std_og_t2m_mou_8 0.47
vol_3g_mb_8 vbc_3g_6 0.47
total_rech_amt_6 isd_og_mou_6 0.47
onnet_mou_7 loc_og_t2t_mou_7 0.47
std_og_t2t_mou_8 std_og_mou_6 0.47
arpu_6 isd_og_mou_7 0.47
total_og_mou_6 offnet_mou_7 0.47
offnet_mou_6 loc_og_t2m_mou_6 0.47
std_og_t2t_mou_7 total_og_mou_8 0.47
loc_og_t2t_mou_6 onnet_mou_6 0.47
arpu_8 offnet_mou_8 0.47
roam_ic_mou_7 roam_ic_mou_6 0.46
arpu_7 isd_og_mou_6 0.46
offnet_mou_8 total_rech_amt_8 0.46
total_og_mou_7 total_rech_amt_7 0.46
spl_og_mou_8 loc_og_t2c_mou_8 0.46
isd_og_mou_8 arpu_6 0.46
isd_og_mou_7 total_rech_amt_6 0.46
std_og_mou_8 std_og_t2m_mou_6 0.46
arpu_8 isd_og_mou_7 0.46
std_ic_mou_8 total_ic_mou_8 0.46
total_rech_amt_7 max_rech_amt_7 0.46
vbc_3g_7 vol_3g_mb_6 0.46
total_og_mou_8 std_og_mou_6 0.46
loc_og_t2t_mou_8 onnet_mou_8 0.46
arpu_6 offnet_mou_6 0.46
isd_og_mou_8 total_rech_amt_7 0.46
offnet_mou_6 total_rech_amt_6 0.45
total_ic_mou_6 loc_ic_t2t_mou_8 0.45
total_ic_mou_7 std_ic_mou_7 0.45
std_og_mou_8 offnet_mou_7 0.45
isd_og_mou_8 total_rech_amt_6 0.45
total_og_mou_7 std_og_t2m_mou_6 0.45
std_og_mou_8 std_og_t2t_mou_6 0.45
loc_og_mou_6 loc_ic_mou_6 0.45
total_ic_mou_6 std_ic_mou_6 0.44
loc_og_t2m_mou_8 loc_ic_mou_8 0.44
loc_og_t2m_mou_8 offnet_mou_8 0.44
total_og_mou_6 loc_og_mou_6 0.44
total_og_mou_6 std_og_mou_8 0.44
isd_og_mou_6 total_rech_amt_8 0.44
std_og_mou_7 offnet_mou_8 0.44
std_og_t2m_mou_8 std_og_mou_6 0.44
loc_og_t2m_mou_6 loc_ic_mou_6 0.44
std_ic_t2t_mou_6 std_ic_mou_8 0.44
loc_og_mou_8 loc_ic_mou_8 0.44
offnet_mou_7 arpu_7 0.44
arpu_8 isd_og_mou_6 0.43
loc_og_t2m_mou_7 loc_ic_t2m_mou_8 0.43
max_rech_amt_6 total_rech_amt_6 0.43
loc_og_t2m_mou_8 loc_ic_t2m_mou_7 0.43
total_rech_amt_7 isd_og_mou_6 0.43
vbc_3g_8 vol_3g_mb_6 0.43
total_rech_amt_7 offnet_mou_7 0.43
loc_ic_t2t_mou_6 total_ic_mou_8 0.43
onnet_mou_7 std_og_mou_6 0.43
loc_og_mou_8 total_og_mou_8 0.42
loc_ic_t2m_mou_7 loc_og_t2m_mou_6 0.42
total_og_mou_6 Average_rech_amt_6n7 0.42
loc_ic_mou_7 loc_og_mou_7 0.42
loc_ic_mou_7 loc_og_t2m_mou_7 0.42
total_og_mou_7 Average_rech_amt_6n7 0.42
last_day_rch_amt_6 max_rech_amt_8 0.42
std_ic_t2t_mou_8 std_ic_mou_6 0.42
loc_ic_t2m_mou_6 loc_og_t2m_mou_7 0.42
loc_og_t2m_mou_7 offnet_mou_7 0.42
total_og_mou_6 onnet_mou_8 0.42
spl_og_mou_8 spl_og_mou_6 0.41
last_day_rch_amt_8 max_rech_amt_7 0.41
offnet_mou_6 Average_rech_amt_6n7 0.41
max_rech_amt_6 last_day_rch_amt_8 0.41
loc_ic_t2m_mou_6 loc_og_mou_6 0.41
total_og_mou_8 onnet_mou_6 0.41
In [188]:
# Correlations for Churn : 1  - churned customers
# Absolute values are reported
pd.set_option('display.precision', 2)
cor_1 = correlation(churned_customers)

# Keep only pairs whose absolute correlation is above 0.40
condition = cor_1['CORR'] > 0.4
cor_1 = cor_1[condition]
cor_1.style.background_gradient(cmap='GnBu').hide_index()
Out[188]:
VAR1 VAR2 CORR
og_others_7 og_others_8 1.00
arpu_8 total_rech_amt_8 0.96
arpu_6 total_rech_amt_6 0.95
std_og_mou_8 total_og_mou_8 0.95
total_rech_amt_7 arpu_7 0.95
std_og_t2t_mou_7 onnet_mou_7 0.95
total_og_mou_7 std_og_mou_7 0.94
og_others_8 loc_og_t2f_mou_6 0.93
std_og_t2t_mou_8 onnet_mou_8 0.93
loc_og_t2f_mou_7 loc_og_t2f_mou_6 0.93
og_others_7 loc_og_t2f_mou_6 0.93
total_og_mou_6 std_og_mou_6 0.92
offnet_mou_6 std_og_t2m_mou_6 0.92
offnet_mou_7 std_og_t2m_mou_7 0.92
std_og_t2t_mou_6 onnet_mou_6 0.92
std_ic_mou_8 std_ic_t2m_mou_8 0.92
loc_og_t2f_mou_7 og_others_8 0.91
loc_og_t2f_mou_7 og_others_7 0.91
loc_ic_mou_8 loc_ic_t2m_mou_8 0.90
loc_ic_t2m_mou_6 loc_ic_mou_6 0.90
loc_ic_mou_8 total_ic_mou_8 0.89
loc_og_t2m_mou_8 loc_og_mou_8 0.88
total_ic_mou_6 loc_ic_mou_6 0.87
std_og_t2m_mou_8 offnet_mou_8 0.87
loc_ic_mou_7 total_ic_mou_7 0.86
loc_ic_mou_7 loc_ic_t2m_mou_7 0.84
loc_og_t2m_mou_7 loc_og_mou_7 0.84
std_ic_t2m_mou_7 std_ic_mou_7 0.82
total_ic_mou_8 loc_ic_t2m_mou_8 0.81
std_og_mou_8 std_og_t2t_mou_8 0.79
std_ic_t2t_mou_6 std_ic_t2t_mou_7 0.78
Average_rech_amt_6n7 arpu_7 0.77
loc_og_mou_6 loc_og_t2m_mou_6 0.77
loc_ic_t2m_mou_6 total_ic_mou_6 0.77
std_ic_t2m_mou_6 std_ic_mou_6 0.77
total_rech_amt_7 Average_rech_amt_6n7 0.76
Average_rech_amt_6n7 total_rech_amt_6 0.76
loc_og_mou_6 loc_og_t2t_mou_6 0.75
total_og_mou_8 std_og_t2t_mou_8 0.75
std_og_t2m_mou_7 std_og_mou_7 0.74
std_og_mou_8 onnet_mou_8 0.74
total_og_mou_8 onnet_mou_8 0.74
arpu_6 Average_rech_amt_6n7 0.73
loc_og_t2t_mou_8 loc_og_t2t_mou_7 0.73
loc_ic_t2t_mou_8 loc_ic_mou_8 0.73
loc_ic_t2t_mou_6 loc_ic_mou_6 0.72
max_rech_amt_6 last_day_rch_amt_6 0.72
std_og_t2m_mou_6 std_og_mou_6 0.72
std_ic_t2t_mou_6 std_ic_mou_6 0.72
roam_ic_mou_7 roam_ic_mou_8 0.72
total_og_mou_7 offnet_mou_7 0.72
loc_ic_t2m_mou_7 total_ic_mou_7 0.72
std_og_mou_8 std_og_t2m_mou_8 0.71
total_og_mou_8 offnet_mou_8 0.70
last_day_rch_amt_8 max_rech_amt_8 0.70
loc_og_mou_7 loc_og_t2t_mou_7 0.69
std_og_t2t_mou_7 std_og_mou_7 0.69
total_og_mou_7 std_og_t2m_mou_7 0.69
loc_ic_mou_7 loc_ic_t2t_mou_7 0.69
loc_og_t2t_mou_8 loc_og_mou_8 0.68
total_og_mou_6 offnet_mou_6 0.68
std_og_mou_6 std_og_t2t_mou_6 0.68
std_og_t2m_mou_8 total_og_mou_8 0.68
max_rech_amt_8 total_rech_amt_8 0.68
spl_og_mou_7 loc_og_t2c_mou_7 0.68
loc_ic_t2f_mou_6 loc_ic_t2f_mou_7 0.67
vol_3g_mb_8 vol_3g_mb_7 0.67
std_og_t2t_mou_7 std_og_t2t_mou_6 0.67
total_og_mou_6 std_og_t2m_mou_6 0.67
offnet_mou_7 std_og_mou_7 0.66
total_og_mou_7 onnet_mou_7 0.66
onnet_mou_7 std_og_mou_7 0.65
loc_og_mou_8 loc_ic_mou_8 0.65
std_ic_t2t_mou_7 std_ic_mou_7 0.65
loc_og_t2m_mou_8 loc_ic_t2m_mou_8 0.65
roam_og_mou_8 roam_og_mou_7 0.65
total_og_mou_6 onnet_mou_6 0.65
total_og_mou_7 std_og_t2t_mou_7 0.64
loc_ic_t2t_mou_8 total_ic_mou_8 0.64
onnet_mou_7 onnet_mou_6 0.64
loc_og_mou_8 loc_ic_t2m_mou_8 0.64
onnet_mou_7 std_og_t2t_mou_6 0.63
loc_og_mou_8 loc_og_mou_7 0.63
total_ic_mou_6 loc_ic_t2t_mou_6 0.63
offnet_mou_6 std_og_mou_6 0.63
roam_og_mou_6 roam_ic_mou_6 0.63
std_og_mou_8 offnet_mou_8 0.63
roam_ic_mou_7 roam_ic_mou_6 0.62
arpu_8 max_rech_amt_8 0.62
std_og_t2m_mou_6 std_og_t2m_mou_7 0.62
total_og_mou_6 std_og_t2t_mou_6 0.62
vol_3g_mb_6 vbc_3g_6 0.62
onnet_mou_7 onnet_mou_8 0.62
loc_ic_t2f_mou_7 loc_ic_t2f_mou_8 0.62
loc_og_mou_8 total_ic_mou_8 0.61
vbc_3g_8 vbc_3g_7 0.61
loc_og_t2m_mou_7 loc_og_t2m_mou_6 0.61
std_og_t2t_mou_7 onnet_mou_6 0.61
roam_og_mou_7 roam_og_mou_6 0.61
std_og_t2t_mou_7 std_og_t2t_mou_8 0.61
std_og_mou_6 onnet_mou_6 0.61
std_og_t2t_mou_8 onnet_mou_7 0.60
std_ic_mou_6 std_ic_mou_7 0.60
loc_ic_t2m_mou_6 loc_ic_t2m_mou_7 0.60
arpu_8 total_og_mou_8 0.60
std_og_t2f_mou_7 std_og_t2f_mou_8 0.60
isd_og_mou_8 isd_og_mou_7 0.60
loc_og_mou_6 loc_og_mou_7 0.59
loc_og_t2m_mou_8 loc_og_t2m_mou_7 0.59
last_day_rch_amt_7 max_rech_amt_7 0.59
arpu_8 offnet_mou_8 0.59
std_og_mou_8 std_og_mou_7 0.58
total_og_mou_8 total_rech_amt_8 0.58
loc_og_t2m_mou_7 loc_ic_t2m_mou_7 0.58
loc_og_t2m_mou_8 loc_ic_mou_8 0.58
loc_ic_mou_7 loc_ic_mou_8 0.58
std_og_t2m_mou_8 std_og_t2m_mou_7 0.58
total_ic_mou_7 loc_ic_t2t_mou_7 0.58
offnet_mou_8 total_rech_amt_8 0.58
std_og_mou_6 std_og_mou_7 0.58
spl_og_mou_8 loc_og_t2c_mou_8 0.57
loc_ic_mou_7 loc_ic_mou_6 0.57
isd_ic_mou_7 isd_ic_mou_6 0.57
offnet_mou_6 offnet_mou_7 0.57
offnet_mou_7 offnet_mou_8 0.57
vol_3g_mb_6 vol_3g_mb_7 0.57
isd_og_mou_7 arpu_7 0.57
loc_ic_t2m_mou_7 loc_ic_t2m_mou_8 0.57
total_rech_num_8 total_og_mou_8 0.57
loc_og_t2t_mou_6 loc_og_t2t_mou_7 0.56
std_og_t2t_mou_7 onnet_mou_8 0.56
total_og_mou_7 total_og_mou_8 0.56
vbc_3g_7 vol_3g_mb_7 0.56
total_rech_amt_7 isd_og_mou_7 0.56
std_ic_mou_8 total_ic_mou_8 0.56
loc_ic_t2t_mou_8 loc_ic_t2t_mou_7 0.56
loc_og_t2c_mou_6 spl_og_mou_6 0.56
std_og_t2m_mou_6 offnet_mou_7 0.55
ic_others_7 ic_others_6 0.55
total_og_mou_7 std_og_mou_8 0.55
total_ic_mou_6 total_ic_mou_7 0.55
total_rech_num_8 total_rech_amt_8 0.55
arpu_8 total_rech_num_8 0.54
loc_ic_t2m_mou_7 loc_ic_mou_6 0.54
std_ic_t2t_mou_8 std_ic_mou_8 0.54
loc_og_t2c_mou_6 loc_og_t2c_mou_7 0.54
std_og_t2m_mou_8 offnet_mou_7 0.54
total_rech_num_8 total_rech_num_7 0.54
total_ic_mou_7 std_ic_mou_7 0.54
std_ic_t2t_mou_7 std_ic_mou_6 0.54
loc_ic_mou_7 loc_ic_t2m_mou_6 0.54
offnet_mou_6 std_og_t2m_mou_7 0.54
std_og_mou_8 total_rech_num_8 0.54
loc_og_t2m_mou_8 total_ic_mou_8 0.54
total_ic_mou_7 total_ic_mou_8 0.54
std_ic_t2t_mou_8 std_ic_t2t_mou_7 0.54
total_ic_mou_6 std_ic_mou_6 0.53
vol_2g_mb_7 vol_2g_mb_6 0.53
vbc_3g_7 vbc_3g_6 0.53
arpu_6 isd_og_mou_6 0.53
total_og_mou_8 std_og_mou_7 0.52
std_ic_mou_8 std_ic_mou_7 0.52
total_rech_amt_6 isd_og_mou_6 0.52
loc_og_mou_8 loc_og_t2m_mou_7 0.52
arpu_8 std_og_mou_8 0.51
loc_og_t2m_mou_8 loc_og_mou_7 0.51
loc_ic_t2m_mou_7 loc_og_mou_7 0.51
loc_og_t2m_mou_6 loc_og_mou_7 0.51
arpu_8 total_rech_amt_7 0.51
total_og_mou_7 arpu_7 0.51
total_og_mou_6 total_og_mou_7 0.51
roam_ic_mou_7 roam_og_mou_7 0.51
loc_ic_t2m_mou_7 loc_ic_mou_8 0.51
total_og_mou_7 std_og_mou_6 0.51
loc_ic_t2m_mou_6 loc_og_t2m_mou_6 0.50
std_ic_t2m_mou_7 std_ic_t2m_mou_8 0.50
std_ic_t2m_mou_7 std_ic_t2m_mou_6 0.50
total_og_mou_6 std_og_mou_7 0.50
total_rech_amt_7 max_rech_amt_7 0.50
std_og_t2m_mou_7 offnet_mou_8 0.50
arpu_8 onnet_mou_8 0.50
onnet_mou_8 total_rech_amt_8 0.50
loc_ic_mou_7 loc_og_mou_7 0.50
total_ic_mou_7 loc_ic_mou_8 0.50
std_og_mou_8 total_rech_amt_8 0.50
loc_ic_mou_7 loc_ic_t2m_mou_8 0.50
last_day_rch_amt_8 total_rech_amt_8 0.50
std_og_t2f_mou_7 loc_og_t2f_mou_7 0.50
loc_og_t2t_mou_8 loc_og_mou_7 0.50
arpu_8 arpu_7 0.50
loc_ic_mou_7 total_ic_mou_6 0.49
std_ic_t2t_mou_6 std_ic_t2t_mou_8 0.49
loc_ic_mou_7 loc_og_t2m_mou_7 0.49
loc_ic_mou_7 total_ic_mou_8 0.49
vol_2g_mb_7 vol_2g_mb_8 0.49
vbc_3g_8 vol_3g_mb_8 0.49
loc_ic_mou_8 loc_ic_t2f_mou_8 0.49
arpu_7 total_rech_amt_8 0.48
std_og_t2f_mou_7 og_others_8 0.48
loc_og_t2t_mou_8 loc_ic_t2t_mou_8 0.48
vol_3g_mb_8 vol_3g_mb_6 0.48
std_ic_t2t_mou_6 std_ic_mou_7 0.48
isd_og_mou_7 isd_ic_mou_7 0.48
isd_ic_mou_8 isd_ic_mou_7 0.48
std_ic_t2m_mou_8 total_ic_mou_8 0.48
total_og_mou_7 total_rech_amt_7 0.48
std_og_t2f_mou_7 og_others_7 0.48
std_og_t2f_mou_7 loc_og_t2f_mou_6 0.48
loc_og_mou_8 total_og_mou_8 0.47
total_rech_num_8 onnet_mou_8 0.47
total_ic_mou_6 loc_ic_t2m_mou_7 0.47
loc_ic_mou_7 loc_ic_t2t_mou_8 0.47
total_og_mou_6 arpu_6 0.47
std_og_mou_8 std_og_t2t_mou_7 0.46
Average_rech_amt_6n7 isd_og_mou_7 0.46
spl_og_mou_7 spl_og_mou_8 0.46
loc_ic_t2f_mou_6 loc_ic_t2f_mou_8 0.46
roam_og_mou_8 last_day_rch_amt_8 0.46
total_og_mou_7 total_rech_num_7 0.46
total_og_mou_6 total_rech_amt_6 0.46
total_ic_mou_7 loc_ic_mou_6 0.46
std_ic_t2m_mou_7 std_ic_mou_8 0.46
loc_og_mou_8 loc_og_t2t_mou_7 0.46
max_rech_amt_6 total_rech_amt_6 0.46
arpu_8 roam_og_mou_8 0.45
loc_og_t2m_mou_6 loc_ic_mou_6 0.45
total_rech_num_8 std_og_t2t_mou_8 0.45
loc_og_mou_6 loc_og_t2t_mou_7 0.45
std_ic_t2m_mou_8 std_ic_mou_7 0.45
std_ic_t2m_mou_7 total_ic_mou_7 0.45
total_rech_num_6 total_rech_num_7 0.45
offnet_mou_7 arpu_7 0.45
loc_og_mou_6 loc_ic_mou_6 0.45
std_ic_t2f_mou_7 std_ic_t2f_mou_6 0.45
max_rech_amt_7 max_rech_amt_8 0.45
total_rech_amt_7 total_rech_amt_8 0.45
loc_og_mou_6 loc_og_t2m_mou_7 0.45
std_og_t2m_mou_8 std_og_mou_7 0.44
arpu_8 Average_rech_amt_6n7 0.44
total_ic_mou_7 loc_og_mou_7 0.44
std_og_mou_8 onnet_mou_7 0.44
loc_og_t2c_mou_8 loc_og_t2c_mou_7 0.44
roam_og_mou_8 roam_ic_mou_8 0.44
loc_ic_t2m_mou_6 total_ic_mou_7 0.44
roam_og_mou_8 total_rech_amt_8 0.44
arpu_8 last_day_rch_amt_8 0.44
ic_others_6 isd_ic_mou_6 0.44
loc_og_t2m_mou_8 offnet_mou_8 0.44
loc_ic_t2m_mou_7 total_ic_mou_8 0.43
std_og_t2t_mou_8 std_og_mou_7 0.43
loc_og_mou_8 offnet_mou_8 0.43
total_og_mou_6 total_rech_num_6 0.43
total_og_mou_8 onnet_mou_7 0.43
std_ic_t2m_mou_6 total_ic_mou_6 0.43
total_rech_num_8 offnet_mou_8 0.43
spl_og_mou_7 spl_og_mou_6 0.43
total_ic_mou_8 loc_ic_t2f_mou_8 0.43
std_ic_t2f_mou_7 std_og_t2f_mou_7 0.43
loc_og_t2t_mou_8 loc_ic_mou_8 0.42
loc_ic_t2m_mou_6 loc_og_mou_6 0.42
loc_ic_mou_7 loc_ic_t2f_mou_7 0.42
total_ic_mou_7 loc_ic_t2m_mou_8 0.42
max_rech_amt_6 max_rech_amt_8 0.42
loc_og_mou_8 loc_ic_t2t_mou_8 0.42
std_og_t2t_mou_7 std_og_mou_6 0.42
std_ic_t2m_mou_6 std_ic_mou_7 0.42
arpu_8 total_ic_mou_8 0.42
loc_og_t2m_mou_8 loc_ic_t2m_mou_7 0.42
last_day_rch_amt_7 max_rech_amt_8 0.42
arpu_6 offnet_mou_6 0.42
total_rech_amt_7 offnet_mou_7 0.42
loc_og_t2m_mou_7 total_ic_mou_7 0.42
total_og_mou_7 std_og_t2m_mou_8 0.42
Average_rech_amt_6n7 total_rech_amt_8 0.42
std_og_mou_7 arpu_7 0.41
std_og_mou_6 std_og_t2m_mou_7 0.41
total_rech_num_7 std_og_mou_7 0.41
total_og_mou_7 offnet_mou_8 0.41
spl_ic_mou_8 spl_ic_mou_6 0.41
std_og_t2t_mou_6 std_og_mou_7 0.41
offnet_mou_6 total_rech_amt_6 0.41
loc_og_t2t_mou_8 loc_og_t2t_mou_6 0.41
std_og_t2t_mou_7 total_og_mou_8 0.41
total_og_mou_8 total_ic_mou_8 0.41
std_og_t2m_mou_6 std_og_mou_7 0.41
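
The `correlation` helper called in the cell above is not shown in this section. As a rough illustration only, a VAR1 / VAR2 / CORR listing like the one above could be produced along the following lines. The function name `pairwise_abs_corr` and the toy data are hypothetical, and the notebook's own helper may be implemented differently.

```python
import numpy as np
import pandas as pd

def pairwise_abs_corr(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: unique column pairs with absolute Pearson correlation."""
    corr = df.corr().abs()
    cols = corr.columns
    # Indices of the strict upper triangle: each pair once, no self-pairs
    i, j = np.triu_indices(len(cols), k=1)
    pairs = pd.DataFrame({
        'VAR1': cols[i],
        'VAR2': cols[j],
        'CORR': corr.to_numpy()[i, j],
    })
    return pairs.sort_values('CORR', ascending=False, ignore_index=True)

# Toy data (not from the telecom dataset)
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 4.0, 6.0, 8.0],    # exact multiple of 'a'
    'c': [1.0, -1.0, -1.0, 1.0],  # uncorrelated with 'a' and 'b'
})

strong = pairwise_abs_corr(toy)
strong = strong[strong['CORR'] > 0.4]  # same threshold as the cell above
print(strong)
```

Listing only the upper triangle avoids reporting each pair twice (once as A-B and once as B-A) and drops the trivial correlation of a column with itself.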

Data Preparation

Derived Variables

In [189]:
# Derived variables: month 8 value minus the mean of months 6 and 7, to measure the change in usage and revenue

# Usage 
data['delta_vol_2g'] = data['vol_2g_mb_8'] - data['vol_2g_mb_6'].add(data['vol_2g_mb_7']).div(2)
data['delta_vol_3g'] = data['vol_3g_mb_8'] - data['vol_3g_mb_6'].add(data['vol_3g_mb_7']).div(2)
data['delta_total_og_mou'] = data['total_og_mou_8'] - data['total_og_mou_6'].add(data['total_og_mou_7']).div(2)
data['delta_total_ic_mou'] = data['total_ic_mou_8'] - data['total_ic_mou_6'].add(data['total_ic_mou_7']).div(2)
data['delta_vbc_3g'] = data['vbc_3g_8'] - data['vbc_3g_6'].add(data['vbc_3g_7']).div(2)

# Revenue 
data['delta_arpu'] = data['arpu_8'] - data['arpu_6'].add(data['arpu_7']).div(2)
data['delta_total_rech_amt'] = data['total_rech_amt_8'] - data['total_rech_amt_6'].add(data['total_rech_amt_7']).div(2)
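
To make the arithmetic in the cell above concrete, here is a small sketch on made-up numbers. The helper `delta_vs_prior_mean` and the toy values are assumptions for illustration and are not part of the notebook. It applies the same formula: the month 8 value minus the average of months 6 and 7.

```python
import pandas as pd

def delta_vs_prior_mean(df: pd.DataFrame, prefix: str) -> pd.Series:
    """Hypothetical helper: month 8 value minus the mean of months 6 and 7."""
    baseline = df[f'{prefix}_6'].add(df[f'{prefix}_7']).div(2)
    return df[f'{prefix}_8'] - baseline

# Two made-up customers: one whose revenue drops in August, one who stays flat
toy = pd.DataFrame({
    'arpu_6': [100.0, 300.0],
    'arpu_7': [200.0, 300.0],
    'arpu_8': [50.0, 300.0],
})

toy['delta_arpu'] = delta_vs_prior_mean(toy, 'arpu')
print(toy['delta_arpu'].tolist())  # [-100.0, 0.0]
```

A negative delta means the month 8 value fell below the customer's average for the two earlier months, and a value near zero means it stayed roughly level.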
In [190]:
# Removing the variables used for derivation
data.drop(columns=[
    'vol_2g_mb_8', 'vol_2g_mb_6', 'vol_2g_mb_7',
    'vol_3g_mb_8', 'vol_3g_mb_6', 'vol_3g_mb_7',
    'total_og_mou_8', 'total_og_mou_6', 'total_og_mou_7',
    'total_ic_mou_8', 'total_ic_mou_6', 'total_ic_mou_7',
    'vbc_3g_8', 'vbc_3g_6', 'vbc_3g_7',
    'arpu_8', 'arpu_6', 'arpu_7',
    'total_rech_amt_8', 'total_rech_amt_6', 'total_rech_amt_7',
], inplace=True)

Outlier Treatment

In [191]:
# Looking at quantiles from 0.90 to 1. 
data.quantile(np.arange(0.9,1.01,0.01)).style.bar()
Out[191]:
onnet_mou_6 onnet_mou_7 onnet_mou_8 offnet_mou_6 offnet_mou_7 offnet_mou_8 roam_ic_mou_6 roam_ic_mou_7 roam_ic_mou_8 roam_og_mou_6 roam_og_mou_7 roam_og_mou_8 loc_og_t2t_mou_6 loc_og_t2t_mou_7 loc_og_t2t_mou_8 loc_og_t2m_mou_6 loc_og_t2m_mou_7 loc_og_t2m_mou_8 loc_og_t2f_mou_6 loc_og_t2f_mou_7 loc_og_t2f_mou_8 loc_og_t2c_mou_6 loc_og_t2c_mou_7 loc_og_t2c_mou_8 loc_og_mou_6 loc_og_mou_7 loc_og_mou_8 std_og_t2t_mou_6 std_og_t2t_mou_7 std_og_t2t_mou_8 std_og_t2m_mou_6 std_og_t2m_mou_7 std_og_t2m_mou_8 std_og_t2f_mou_6 std_og_t2f_mou_7 std_og_t2f_mou_8 std_og_mou_6 std_og_mou_7 std_og_mou_8 isd_og_mou_6 isd_og_mou_7 isd_og_mou_8 spl_og_mou_6 spl_og_mou_7 spl_og_mou_8 og_others_6 og_others_7 og_others_8 loc_ic_t2t_mou_6 loc_ic_t2t_mou_7 loc_ic_t2t_mou_8 loc_ic_t2m_mou_6 loc_ic_t2m_mou_7 loc_ic_t2m_mou_8 loc_ic_t2f_mou_6 loc_ic_t2f_mou_7 loc_ic_t2f_mou_8 loc_ic_mou_6 loc_ic_mou_7 loc_ic_mou_8 std_ic_t2t_mou_6 std_ic_t2t_mou_7 std_ic_t2t_mou_8 std_ic_t2m_mou_6 std_ic_t2m_mou_7 std_ic_t2m_mou_8 std_ic_t2f_mou_6 std_ic_t2f_mou_7 std_ic_t2f_mou_8 std_ic_mou_6 std_ic_mou_7 std_ic_mou_8 spl_ic_mou_6 spl_ic_mou_7 spl_ic_mou_8 isd_ic_mou_6 isd_ic_mou_7 isd_ic_mou_8 ic_others_6 ic_others_7 ic_others_8 total_rech_num_6 total_rech_num_7 total_rech_num_8 max_rech_amt_6 max_rech_amt_7 max_rech_amt_8 last_day_rch_amt_6 last_day_rch_amt_7 last_day_rch_amt_8 aon Average_rech_amt_6n7 delta_vol_2g delta_vol_3g delta_total_og_mou delta_total_ic_mou delta_vbc_3g delta_arpu delta_total_rech_amt
0.9 794.98 824.38 723.61 915.58 935.69 853.79 32.73 18.36 18.68 64.48 41.20 37.11 207.93 207.84 196.91 435.16 437.49 416.66 18.38 18.66 16.96 4.04 4.84 4.45 661.74 657.38 633.34 630.53 663.79 567.34 604.41 645.88 531.26 2.20 2.18 1.73 1140.93 1177.18 1057.29 0.00 0.00 0.00 15.93 19.51 18.04 2.26 0.00 0.00 154.88 156.61 148.14 368.54 364.54 360.54 39.23 41.04 37.19 559.28 558.99 549.79 34.73 36.01 32.14 73.38 75.28 68.58 4.36 4.58 3.94 115.91 118.66 108.38 0.28 0.00 0.00 15.01 18.30 15.33 1.16 1.59 1.23 23.00 23.00 21.00 297.00 300.00 252.00 250.00 250.00 225.00 2846.00 1118.00 29.84 170.07 345.07 147.30 69.83 257.31 319.00
0.91 848.97 878.35 783.49 966.74 984.02 899.29 39.69 23.28 23.39 78.43 50.01 46.44 225.96 224.87 213.83 461.10 461.81 441.84 20.28 20.68 18.84 4.68 5.51 5.11 703.11 692.67 669.63 686.26 722.84 622.13 658.47 695.77 583.42 2.91 2.80 2.28 1195.61 1244.40 1125.28 0.00 0.00 0.00 17.54 21.28 19.69 2.54 0.00 0.00 165.79 168.03 159.84 390.64 387.11 382.20 43.59 45.39 41.21 593.13 589.65 580.54 38.21 39.91 35.93 80.41 81.93 75.54 5.21 5.49 4.71 125.98 129.29 118.24 0.30 0.00 0.00 18.34 21.84 18.83 1.44 1.94 1.51 24.00 24.00 22.00 325.00 330.00 289.00 250.00 250.00 250.00 2910.10 1156.00 39.88 227.15 377.46 161.80 95.33 278.90 345.50
0.92 909.05 941.99 848.96 1031.39 1038.09 953.35 48.71 29.68 29.64 93.60 60.97 57.59 247.94 244.78 232.33 490.63 488.04 468.83 22.56 23.14 20.93 5.45 6.26 5.86 742.96 735.69 711.57 750.31 786.39 680.10 713.49 760.98 640.57 3.74 3.71 3.01 1268.83 1315.08 1201.29 0.13 0.05 0.00 19.26 23.39 21.78 2.86 0.00 0.00 180.18 181.49 173.59 415.89 412.03 405.97 48.65 50.66 46.19 629.64 624.36 614.45 42.73 44.58 39.99 88.27 90.41 83.44 6.33 6.61 5.75 138.32 142.16 130.55 0.33 0.00 0.03 22.58 26.94 23.58 1.78 2.38 1.86 25.00 25.00 23.00 350.00 350.00 330.00 250.00 250.00 250.00 2981.20 1202.00 53.66 289.30 419.97 177.35 127.50 303.51 375.00
0.93 990.48 1016.15 920.96 1094.77 1103.93 1017.35 60.42 37.28 37.90 111.15 75.00 72.45 275.51 271.70 254.64 523.56 519.80 500.38 25.25 26.00 23.51 6.34 7.15 6.79 794.01 786.73 759.45 812.08 856.34 753.44 777.69 828.18 706.23 4.89 4.70 4.00 1358.41 1404.59 1283.20 0.33 0.25 0.00 21.33 25.84 24.07 3.23 0.00 0.00 195.66 196.99 188.25 444.94 439.30 434.35 54.59 57.50 51.92 671.69 667.07 654.41 48.03 50.82 45.64 98.00 99.68 93.47 7.84 8.08 7.08 153.30 158.86 146.87 0.36 0.00 0.11 28.18 33.33 29.95 2.20 2.93 2.38 27.00 27.00 25.00 350.00 398.00 350.00 252.00 252.00 250.00 3055.30 1257.00 71.44 361.57 466.79 195.08 166.36 331.83 408.50
0.9400000000000001 1066.85 1097.12 1007.56 1168.09 1186.36 1096.62 73.96 49.44 48.29 137.40 94.73 90.27 307.64 305.87 282.50 563.70 559.41 537.10 29.13 29.75 26.89 7.36 8.45 7.84 855.97 849.42 817.15 888.23 933.58 842.48 856.37 907.95 786.16 6.23 6.11 5.26 1456.41 1503.09 1385.91 0.63 0.51 0.23 23.92 28.49 26.83 3.64 0.00 0.00 216.28 214.47 207.73 478.56 478.91 472.15 62.96 65.76 58.84 717.00 711.74 704.89 55.49 58.04 52.58 110.61 113.20 105.54 9.66 9.89 8.77 174.34 179.84 165.92 0.40 0.00 0.21 35.80 40.97 36.57 2.81 3.73 2.98 28.00 28.00 26.00 400.00 455.00 398.00 252.00 252.00 252.00 3107.00 1317.70 97.46 463.36 524.65 218.73 217.42 366.26 447.20
0.9500000000000001 1153.97 1208.17 1115.66 1271.47 1286.28 1188.46 94.59 63.34 62.80 168.46 119.34 114.80 348.62 346.90 324.14 614.99 608.01 585.06 33.59 34.09 31.31 8.69 9.95 9.33 935.51 920.12 883.25 986.24 1029.29 936.49 960.80 1004.26 886.56 8.16 7.92 7.18 1558.50 1624.81 1518.82 1.10 1.01 0.55 26.81 32.15 30.23 4.14 0.00 0.00 243.94 238.62 232.50 520.55 518.65 516.67 72.61 76.05 67.56 773.27 781.18 767.31 64.94 66.55 61.56 126.66 130.41 121.88 12.24 12.31 10.98 200.64 205.16 191.95 0.43 0.06 0.25 46.45 51.98 46.48 3.63 4.83 3.93 30.00 30.00 28.00 500.00 500.00 455.00 252.00 274.00 252.00 3179.00 1406.00 129.68 562.66 604.55 245.97 284.40 404.58 499.00
0.9600000000000001 1282.78 1344.04 1256.34 1406.07 1407.78 1305.32 120.08 83.43 82.12 211.03 153.97 145.54 411.69 412.46 380.74 674.30 670.99 646.48 39.84 40.05 37.61 10.61 11.86 11.45 1025.57 1016.38 975.80 1099.72 1146.44 1066.04 1101.07 1136.32 998.28 11.43 10.87 9.59 1707.60 1766.85 1672.44 2.20 2.28 1.09 31.43 37.11 34.33 4.78 0.00 0.00 276.09 270.15 265.31 578.33 574.86 573.49 85.30 89.25 77.93 847.56 854.66 848.82 77.15 81.35 74.18 151.66 153.16 144.74 15.67 15.88 14.48 237.74 241.25 224.12 0.46 0.13 0.25 60.59 67.46 61.77 4.98 6.51 5.31 32.00 32.00 31.00 505.00 550.00 500.00 330.00 339.00 300.00 3264.00 1508.50 185.21 705.70 704.27 282.57 356.70 458.40 555.30
0.9700000000000001 1444.23 1497.25 1441.53 1578.82 1585.02 1481.57 155.13 117.54 112.07 270.52 203.66 188.86 508.01 500.20 458.64 758.99 749.79 734.08 49.38 49.09 46.36 13.04 14.68 14.14 1163.16 1143.62 1101.16 1243.36 1308.19 1235.72 1262.71 1308.35 1162.84 16.44 15.26 13.64 1904.26 1950.83 1877.19 5.01 5.36 2.73 37.80 44.44 40.55 5.56 0.00 0.00 320.64 322.92 308.49 655.65 646.22 642.58 103.65 109.78 96.64 959.39 962.29 941.13 97.62 101.69 94.54 184.59 187.18 176.96 20.81 21.78 19.62 290.90 295.51 279.29 0.53 0.20 0.36 82.75 91.07 85.44 7.03 9.05 7.58 35.00 36.00 34.00 550.00 550.00 550.00 398.00 398.00 379.00 3424.70 1633.85 262.23 895.25 843.24 334.26 461.60 529.37 644.00
0.9800000000000001 1694.68 1772.62 1700.24 1837.93 1838.39 1739.01 221.26 166.28 165.81 363.12 282.49 266.53 668.59 660.28 596.48 885.83 868.35 853.83 62.69 63.06 60.11 17.15 19.23 18.67 1372.78 1338.79 1306.65 1458.71 1520.56 1463.19 1518.64 1558.41 1413.14 24.58 23.23 21.17 2174.34 2312.91 2165.26 12.99 13.21 7.75 48.05 56.14 51.15 6.84 0.00 0.00 414.27 407.34 392.53 775.61 765.13 748.24 132.06 143.85 125.21 1136.29 1136.25 1114.04 132.11 138.47 131.28 245.19 253.11 234.79 30.02 31.45 28.10 367.71 388.61 362.84 0.58 0.30 0.50 129.13 135.43 127.01 10.84 13.99 11.54 40.00 40.00 39.00 655.00 750.00 619.00 500.00 500.00 500.00 3632.00 1834.70 392.11 1207.69 1051.14 431.92 621.75 649.14 779.30
0.9900000000000001 2166.37 2220.37 2188.50 2326.29 2410.10 2211.64 349.35 292.54 288.49 543.71 448.13 432.74 1076.24 1059.88 956.50 1147.05 1112.66 1092.59 90.88 91.06 86.68 24.86 28.24 28.87 1806.94 1761.43 1689.07 1885.20 1919.19 1938.13 1955.61 2112.66 1905.81 44.39 43.89 38.88 2744.49 2874.65 2800.87 41.25 40.43 31.24 71.36 79.87 74.11 9.31 0.00 0.00 625.35 648.79 621.67 1026.44 1009.29 976.09 197.17 205.25 185.62 1484.99 1515.87 1459.55 215.64 231.15 215.20 393.73 408.58 372.61 53.39 56.59 49.41 577.89 616.89 563.89 0.68 0.51 0.61 239.60 240.13 249.89 20.71 25.26 21.53 48.00 48.00 46.00 1000.00 1000.00 951.00 655.00 655.00 619.00 3651.00 2216.30 654.31 1878.12 1465.10 619.69 929.64 864.34 1036.40
1.0 7376.71 8157.78 10752.56 8362.36 9667.13 14007.34 2613.31 3813.29 4169.81 3775.11 2812.04 5337.04 6431.33 7400.66 10752.56 4729.74 4557.14 4961.33 1466.03 1196.43 928.49 342.86 569.71 351.83 10643.38 7674.78 11039.91 7366.58 8133.66 8014.43 8314.76 9284.74 13950.04 628.56 544.63 516.91 8432.99 10936.73 13980.06 5900.66 5490.28 5681.54 1023.21 1265.79 1390.88 100.61 370.13 394.93 6351.44 5709.59 4003.21 4693.86 4388.73 5738.46 1678.41 1983.01 1588.53 6496.11 6466.74 5748.81 5459.56 5800.93 4309.29 4630.23 3470.38 5645.86 1351.11 1136.08 1394.89 5459.63 6745.76 5957.14 19.76 21.33 6.23 3965.69 4747.91 4100.38 1344.14 1495.94 1209.86 307.00 138.00 196.00 4010.00 4010.00 4449.00 4010.00 4010.00 4449.00 4321.00 37762.50 8062.30 15646.39 12768.70 4862.62 8254.62 12808.62 14344.50
In [192]:
# Looking at percentage change in quantiles from 0.90 to 1. 
data.quantile(np.arange(0.9,1.01,0.01)).pct_change().mul(100).style.bar()
Out[192]:
onnet_mou_6 onnet_mou_7 onnet_mou_8 offnet_mou_6 offnet_mou_7 offnet_mou_8 roam_ic_mou_6 roam_ic_mou_7 roam_ic_mou_8 roam_og_mou_6 roam_og_mou_7 roam_og_mou_8 loc_og_t2t_mou_6 loc_og_t2t_mou_7 loc_og_t2t_mou_8 loc_og_t2m_mou_6 loc_og_t2m_mou_7 loc_og_t2m_mou_8 loc_og_t2f_mou_6 loc_og_t2f_mou_7 loc_og_t2f_mou_8 loc_og_t2c_mou_6 loc_og_t2c_mou_7 loc_og_t2c_mou_8 loc_og_mou_6 loc_og_mou_7 loc_og_mou_8 std_og_t2t_mou_6 std_og_t2t_mou_7 std_og_t2t_mou_8 std_og_t2m_mou_6 std_og_t2m_mou_7 std_og_t2m_mou_8 std_og_t2f_mou_6 std_og_t2f_mou_7 std_og_t2f_mou_8 std_og_mou_6 std_og_mou_7 std_og_mou_8 isd_og_mou_6 isd_og_mou_7 isd_og_mou_8 spl_og_mou_6 spl_og_mou_7 spl_og_mou_8 og_others_6 og_others_7 og_others_8 loc_ic_t2t_mou_6 loc_ic_t2t_mou_7 loc_ic_t2t_mou_8 loc_ic_t2m_mou_6 loc_ic_t2m_mou_7 loc_ic_t2m_mou_8 loc_ic_t2f_mou_6 loc_ic_t2f_mou_7 loc_ic_t2f_mou_8 loc_ic_mou_6 loc_ic_mou_7 loc_ic_mou_8 std_ic_t2t_mou_6 std_ic_t2t_mou_7 std_ic_t2t_mou_8 std_ic_t2m_mou_6 std_ic_t2m_mou_7 std_ic_t2m_mou_8 std_ic_t2f_mou_6 std_ic_t2f_mou_7 std_ic_t2f_mou_8 std_ic_mou_6 std_ic_mou_7 std_ic_mou_8 spl_ic_mou_6 spl_ic_mou_7 spl_ic_mou_8 isd_ic_mou_6 isd_ic_mou_7 isd_ic_mou_8 ic_others_6 ic_others_7 ic_others_8 total_rech_num_6 total_rech_num_7 total_rech_num_8 max_rech_amt_6 max_rech_amt_7 max_rech_amt_8 last_day_rch_amt_6 last_day_rch_amt_7 last_day_rch_amt_8 aon Average_rech_amt_6n7 delta_vol_2g delta_vol_3g delta_total_og_mou delta_total_ic_mou delta_vbc_3g delta_arpu delta_total_rech_amt
0.9 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
0.91 6.79 6.55 8.27 5.59 5.17 5.33 21.27 26.80 25.22 21.64 21.39 25.13 8.67 8.20 8.59 5.96 5.56 6.04 10.34 10.83 11.08 15.84 13.88 14.88 6.25 5.37 5.73 8.84 8.90 9.66 8.94 7.72 9.82 32.27 28.44 31.79 4.79 5.71 6.43 nan nan nan 10.11 9.09 9.16 12.39 nan nan 7.05 7.29 7.90 6.00 6.19 6.01 11.11 10.60 10.81 6.05 5.48 5.59 10.03 10.84 11.79 9.58 8.84 10.15 19.50 19.89 19.54 8.69 8.96 9.10 7.14 nan nan 22.19 19.35 22.84 24.14 22.01 22.76 4.35 4.35 4.76 9.43 10.00 14.68 0.00 0.00 11.11 2.25 3.40 33.68 33.56 9.39 9.84 36.51 8.39 8.31
0.92 7.08 7.25 8.36 6.69 5.49 6.01 22.72 27.49 26.73 19.34 21.90 24.03 9.73 8.85 8.65 6.41 5.68 6.11 11.24 11.91 11.09 16.45 13.57 14.71 5.67 6.21 6.26 9.33 8.79 9.32 8.36 9.37 9.79 28.52 32.50 32.02 6.12 5.68 6.76 inf inf nan 9.81 9.92 10.60 12.60 nan nan 8.68 8.01 8.60 6.46 6.44 6.22 11.60 11.61 12.08 6.15 5.89 5.84 11.83 11.70 11.30 9.77 10.35 10.46 21.50 20.38 22.12 9.79 9.95 10.41 10.00 nan inf 23.12 23.35 25.20 23.61 22.68 23.18 4.17 4.17 4.55 7.69 6.06 14.19 0.00 0.00 0.00 2.44 3.98 34.55 27.36 11.26 9.61 33.76 8.82 8.54
0.93 8.96 7.87 8.48 6.14 6.34 6.71 24.03 25.62 27.86 18.75 23.02 25.80 11.12 11.00 9.60 6.71 6.51 6.73 11.91 12.32 12.33 16.33 14.27 15.79 6.87 6.94 6.73 8.23 8.89 10.78 9.00 8.83 10.25 30.75 26.77 32.76 7.06 6.81 6.82 153.85 400.00 nan 10.76 10.46 10.50 12.94 nan nan 8.59 8.54 8.44 6.99 6.62 6.99 12.21 13.49 12.40 6.68 6.84 6.50 12.40 13.98 14.13 11.02 10.25 12.01 23.85 22.24 23.09 10.83 11.75 12.50 9.09 nan 266.67 24.79 23.73 27.02 23.60 23.24 27.96 8.00 8.00 8.70 0.00 13.71 6.06 0.80 0.80 0.00 2.49 4.58 33.13 24.98 11.15 10.00 30.47 9.33 8.93
0.9400000000000001 7.71 7.97 9.40 6.70 7.47 7.79 22.41 32.61 27.41 23.62 26.30 24.59 11.66 12.58 10.94 7.67 7.62 7.34 15.38 14.43 14.39 16.09 18.10 15.52 7.80 7.97 7.60 9.38 9.02 11.82 10.12 9.63 11.32 27.40 29.92 31.63 7.21 7.01 8.00 90.91 104.00 inf 12.12 10.26 11.49 12.69 nan nan 10.54 8.87 10.35 7.56 9.02 8.70 15.33 14.36 13.33 6.75 6.70 7.71 15.53 14.22 15.21 12.87 13.57 12.92 23.21 22.45 23.84 13.72 13.21 12.97 11.11 nan 90.91 27.04 22.91 22.11 27.73 27.17 25.21 3.70 3.70 4.00 14.29 14.32 13.71 0.00 0.00 0.80 1.69 4.83 36.42 28.15 12.40 12.12 30.69 10.38 9.47
0.9500000000000001 8.17 10.12 10.73 8.85 8.42 8.38 27.89 28.10 30.03 22.61 25.97 27.18 13.32 13.41 14.74 9.10 8.69 8.93 15.33 14.58 16.43 18.07 17.72 18.94 9.29 8.32 8.09 11.03 10.25 11.16 12.19 10.61 12.77 30.98 29.62 36.50 7.01 8.10 9.59 74.60 98.04 139.13 12.09 12.85 12.67 13.74 nan nan 12.79 11.26 11.92 8.77 8.30 9.43 15.33 15.65 14.82 7.85 9.76 8.85 17.04 14.66 17.08 14.51 15.21 15.48 26.71 24.42 25.23 15.09 14.08 15.69 7.50 inf 19.05 29.73 26.89 27.12 29.18 29.49 31.88 7.14 7.14 7.69 25.00 9.89 14.32 0.00 8.73 0.00 2.32 6.70 33.06 21.43 15.23 12.45 30.81 10.46 11.58
0.9600000000000001 11.16 11.25 12.61 10.59 9.45 9.83 26.95 31.73 30.77 25.27 29.02 26.77 18.09 18.90 17.46 9.64 10.36 10.50 18.58 17.51 20.13 22.09 19.26 22.68 9.63 10.46 10.48 11.51 11.38 13.83 14.60 13.15 12.60 40.07 37.27 33.57 9.57 8.74 10.11 99.64 125.74 98.55 17.23 15.43 13.57 15.46 nan nan 13.18 13.22 14.11 11.10 10.84 11.00 17.48 17.37 15.35 9.61 9.41 10.62 18.80 22.23 20.50 19.74 17.44 18.76 28.04 28.98 31.88 18.49 17.59 16.76 6.98 116.67 0.00 30.45 29.77 32.87 37.19 34.78 35.01 6.67 6.67 10.71 1.00 10.00 9.89 30.95 23.72 19.05 2.67 7.29 42.82 25.42 16.49 14.88 25.42 13.30 11.28
0.9700000000000001 12.59 11.40 14.74 12.29 12.59 13.50 29.19 40.89 36.48 28.19 32.27 29.77 23.40 21.27 20.46 12.56 11.74 13.55 23.95 22.57 23.25 22.90 23.79 23.54 13.42 12.52 12.85 13.06 14.11 15.92 14.68 15.14 16.48 43.87 40.36 42.23 11.52 10.41 12.24 128.14 135.09 150.00 20.27 19.74 18.11 16.32 nan nan 16.14 19.53 16.28 13.37 12.41 12.05 21.51 22.99 24.01 13.19 12.59 10.87 26.52 25.01 27.45 21.71 22.21 22.26 32.77 37.18 35.52 22.36 22.49 24.61 15.22 53.85 44.00 36.58 35.00 38.32 41.18 39.03 42.86 9.38 12.50 9.68 8.91 0.00 10.00 20.61 17.40 26.33 4.92 8.31 41.59 26.86 19.73 18.29 29.41 15.48 15.97
0.9800000000000001 17.34 18.39 17.95 16.41 15.99 17.38 42.63 41.47 47.95 34.23 38.71 41.12 31.61 32.00 30.06 16.71 15.81 16.31 26.97 28.45 29.66 31.50 30.95 32.04 18.02 17.07 18.66 17.32 16.23 18.41 20.27 19.11 21.53 49.46 52.20 55.19 14.18 18.56 15.35 159.20 146.46 183.88 27.11 26.34 26.15 23.02 nan nan 29.20 26.14 27.24 18.30 18.40 16.44 27.41 31.04 29.57 18.44 18.08 18.37 35.34 36.17 38.87 32.83 35.22 32.68 44.29 44.39 43.18 26.40 31.51 29.92 9.43 50.00 38.89 56.05 48.71 48.66 54.17 54.57 52.27 14.29 11.11 14.71 19.09 36.36 12.55 25.63 25.63 31.93 6.05 12.29 49.53 34.90 24.65 29.22 34.70 22.63 21.01
0.9900000000000001 27.83 25.26 28.72 26.57 31.10 27.18 57.89 75.93 73.99 49.73 58.63 62.36 60.97 60.52 60.36 29.49 28.14 27.96 44.96 44.40 44.20 44.96 46.86 54.64 31.63 31.57 29.27 29.24 26.22 32.46 28.77 35.57 34.86 80.59 88.95 83.68 26.22 24.29 29.35 217.65 206.02 303.10 48.50 42.27 44.88 36.08 nan nan 50.95 59.27 58.38 32.34 31.91 30.45 49.30 42.69 48.24 30.69 33.41 31.01 63.22 66.93 63.92 60.58 61.42 58.70 77.82 79.94 75.85 57.16 58.74 55.41 17.24 70.00 22.00 85.55 77.30 96.75 91.03 80.54 86.54 20.00 20.00 17.95 52.67 33.33 53.63 31.00 31.00 23.80 0.52 20.80 66.87 55.51 39.38 43.47 49.52 33.15 32.99
1.0 240.51 267.41 391.32 259.47 301.11 533.35 648.04 1203.51 1345.42 594.33 527.51 1133.30 497.57 598.26 1024.15 312.34 309.57 354.09 1513.24 1213.96 971.17 1279.33 1917.74 1118.63 489.03 335.71 553.61 290.76 323.81 313.51 325.17 339.48 631.98 1316.15 1141.04 1229.43 207.27 280.45 399.13 14204.63 13481.40 18086.75 1333.97 1484.83 1776.73 980.90 inf inf 915.66 780.03 543.95 357.30 334.83 487.90 751.25 866.13 755.80 337.45 326.60 293.87 2431.76 2409.61 1902.47 1076.01 749.39 1415.23 2430.88 1907.56 2723.15 844.75 993.52 956.44 2805.88 4082.35 921.31 1555.13 1877.27 1540.89 6390.92 5822.64 5519.41 539.58 187.50 326.09 301.00 301.00 367.82 512.21 512.21 618.74 18.35 1603.85 1132.18 733.09 771.52 684.69 787.93 1381.89 1284.07
In [193]:
# Columns with outliers: the maximum exceeds the 99th percentile by more than 100%
pct_change_99_1 = data.quantile(np.arange(0.9,1.01,0.01)).pct_change().mul(100).iloc[-1]
outlier_condition = pct_change_99_1 > 100
columns_with_outliers = pct_change_99_1[outlier_condition].index.values
print('Columns with outliers :\n', columns_with_outliers)
Columns with outliers :
 ['onnet_mou_6' 'onnet_mou_7' 'onnet_mou_8' 'offnet_mou_6' 'offnet_mou_7'
 'offnet_mou_8' 'roam_ic_mou_6' 'roam_ic_mou_7' 'roam_ic_mou_8'
 'roam_og_mou_6' 'roam_og_mou_7' 'roam_og_mou_8' 'loc_og_t2t_mou_6'
 'loc_og_t2t_mou_7' 'loc_og_t2t_mou_8' 'loc_og_t2m_mou_6'
 'loc_og_t2m_mou_7' 'loc_og_t2m_mou_8' 'loc_og_t2f_mou_6'
 'loc_og_t2f_mou_7' 'loc_og_t2f_mou_8' 'loc_og_t2c_mou_6'
 'loc_og_t2c_mou_7' 'loc_og_t2c_mou_8' 'loc_og_mou_6' 'loc_og_mou_7'
 'loc_og_mou_8' 'std_og_t2t_mou_6' 'std_og_t2t_mou_7' 'std_og_t2t_mou_8'
 'std_og_t2m_mou_6' 'std_og_t2m_mou_7' 'std_og_t2m_mou_8'
 'std_og_t2f_mou_6' 'std_og_t2f_mou_7' 'std_og_t2f_mou_8' 'std_og_mou_6'
 'std_og_mou_7' 'std_og_mou_8' 'isd_og_mou_6' 'isd_og_mou_7'
 'isd_og_mou_8' 'spl_og_mou_6' 'spl_og_mou_7' 'spl_og_mou_8' 'og_others_6'
 'og_others_7' 'og_others_8' 'loc_ic_t2t_mou_6' 'loc_ic_t2t_mou_7'
 'loc_ic_t2t_mou_8' 'loc_ic_t2m_mou_6' 'loc_ic_t2m_mou_7'
 'loc_ic_t2m_mou_8' 'loc_ic_t2f_mou_6' 'loc_ic_t2f_mou_7'
 'loc_ic_t2f_mou_8' 'loc_ic_mou_6' 'loc_ic_mou_7' 'loc_ic_mou_8'
 'std_ic_t2t_mou_6' 'std_ic_t2t_mou_7' 'std_ic_t2t_mou_8'
 'std_ic_t2m_mou_6' 'std_ic_t2m_mou_7' 'std_ic_t2m_mou_8'
 'std_ic_t2f_mou_6' 'std_ic_t2f_mou_7' 'std_ic_t2f_mou_8' 'std_ic_mou_6'
 'std_ic_mou_7' 'std_ic_mou_8' 'spl_ic_mou_6' 'spl_ic_mou_7'
 'spl_ic_mou_8' 'isd_ic_mou_6' 'isd_ic_mou_7' 'isd_ic_mou_8' 'ic_others_6'
 'ic_others_7' 'ic_others_8' 'total_rech_num_6' 'total_rech_num_7'
 'total_rech_num_8' 'max_rech_amt_6' 'max_rech_amt_7' 'max_rech_amt_8'
 'last_day_rch_amt_6' 'last_day_rch_amt_7' 'last_day_rch_amt_8'
 'Average_rech_amt_6n7' 'delta_vol_2g' 'delta_vol_3g' 'delta_total_og_mou'
 'delta_total_ic_mou' 'delta_vbc_3g' 'delta_arpu' 'delta_total_rech_amt']
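The flagging rule used in the cell above can be illustrated on a toy frame. This is a minimal sketch, not the notebook's data: the column names `steady` and `spiky` are made up, and only the 99th and 100th percentiles are computed, since those are the two rows the rule compares.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): 'steady' grows evenly, 'spiky' ends in one extreme value.
toy = pd.DataFrame({
    'steady': np.arange(1, 101, dtype=float),                  # 1, 2, ..., 100
    'spiky': np.r_[np.arange(1, 100, dtype=float), 1000.0],    # 1..99, then 1000
})

# Percentage change from the 99th to the 100th percentile of each column.
q = toy.quantile([0.99, 1.0])
pct_jump = q.pct_change().mul(100).iloc[-1]

# A column is flagged when its maximum is more than double its 99th percentile.
flagged = pct_jump[pct_jump > 100].index.to_list()
print(pct_jump.round(2).to_dict())
print(flagged)
```

When the 99th percentile is 0 and the maximum is positive, the change is infinite, so the column is flagged. That is consistent with the `inf` entries for `og_others_7` and `og_others_8` in the percentile table.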
In [194]:
# Cap outliers at the 99th percentile of each flagged column
rows = []
for col in columns_with_outliers:
    outlier_threshold = data[col].quantile(0.99)
    condition = data[col] > outlier_threshold
    rows.append({'Column': col, 'Outlier Threshold': outlier_threshold, 'Outliers replaced': int(condition.sum())})
    data.loc[condition, col] = outlier_threshold
# DataFrame.append was removed in pandas 2.0, so the summary is built from a list of rows
outlier_treatment = pd.DataFrame(rows, columns=['Column', 'Outlier Threshold', 'Outliers replaced'])
outlier_treatment
   
Out[194]:
Column Outlier Threshold Outliers replaced
0 onnet_mou_6 2166.37 301
1 onnet_mou_7 2220.37 301
2 onnet_mou_8 2188.50 301
3 offnet_mou_6 2326.29 301
4 offnet_mou_7 2410.10 301
5 offnet_mou_8 2211.64 301
6 roam_ic_mou_6 349.35 301
7 roam_ic_mou_7 292.54 301
8 roam_ic_mou_8 288.49 301
9 roam_og_mou_6 543.71 301
10 roam_og_mou_7 448.13 301
11 roam_og_mou_8 432.74 301
12 loc_og_t2t_mou_6 1076.24 301
13 loc_og_t2t_mou_7 1059.88 301
14 loc_og_t2t_mou_8 956.50 301
15 loc_og_t2m_mou_6 1147.05 301
16 loc_og_t2m_mou_7 1112.66 301
17 loc_og_t2m_mou_8 1092.59 301
18 loc_og_t2f_mou_6 90.88 301
19 loc_og_t2f_mou_7 91.06 301
20 loc_og_t2f_mou_8 86.68 300
21 loc_og_t2c_mou_6 24.86 301
22 loc_og_t2c_mou_7 28.24 301
23 loc_og_t2c_mou_8 28.87 301
24 loc_og_mou_6 1806.94 301
25 loc_og_mou_7 1761.43 301
26 loc_og_mou_8 1689.07 301
27 std_og_t2t_mou_6 1885.20 301
28 std_og_t2t_mou_7 1919.19 301
29 std_og_t2t_mou_8 1938.13 301
30 std_og_t2m_mou_6 1955.61 301
31 std_og_t2m_mou_7 2112.66 301
32 std_og_t2m_mou_8 1905.81 301
33 std_og_t2f_mou_6 44.39 301
34 std_og_t2f_mou_7 43.89 301
35 std_og_t2f_mou_8 38.88 301
36 std_og_mou_6 2744.49 301
37 std_og_mou_7 2874.65 301
38 std_og_mou_8 2800.87 301
39 isd_og_mou_6 41.25 301
40 isd_og_mou_7 40.43 301
41 isd_og_mou_8 31.24 300
42 spl_og_mou_6 71.36 301
43 spl_og_mou_7 79.87 301
44 spl_og_mou_8 74.11 301
45 og_others_6 9.31 301
46 og_others_7 0.00 164
47 og_others_8 0.00 180
48 loc_ic_t2t_mou_6 625.35 301
49 loc_ic_t2t_mou_7 648.79 301
50 loc_ic_t2t_mou_8 621.67 301
51 loc_ic_t2m_mou_6 1026.44 301
52 loc_ic_t2m_mou_7 1009.29 301
53 loc_ic_t2m_mou_8 976.09 301
54 loc_ic_t2f_mou_6 197.17 301
55 loc_ic_t2f_mou_7 205.25 301
56 loc_ic_t2f_mou_8 185.62 301
57 loc_ic_mou_6 1484.99 301
58 loc_ic_mou_7 1515.87 301
59 loc_ic_mou_8 1459.55 301
60 std_ic_t2t_mou_6 215.64 301
61 std_ic_t2t_mou_7 231.15 301
62 std_ic_t2t_mou_8 215.20 301
63 std_ic_t2m_mou_6 393.73 301
64 std_ic_t2m_mou_7 408.58 301
65 std_ic_t2m_mou_8 372.61 301
66 std_ic_t2f_mou_6 53.39 301
67 std_ic_t2f_mou_7 56.59 300
68 std_ic_t2f_mou_8 49.41 301
69 std_ic_mou_6 577.89 301
70 std_ic_mou_7 616.89 301
71 std_ic_mou_8 563.89 301
72 spl_ic_mou_6 0.68 278
73 spl_ic_mou_7 0.51 295
74 spl_ic_mou_8 0.61 293
75 isd_ic_mou_6 239.60 301
76 isd_ic_mou_7 240.13 301
77 isd_ic_mou_8 249.89 301
78 ic_others_6 20.71 301
79 ic_others_7 25.26 301
80 ic_others_8 21.53 300
81 total_rech_num_6 48.00 283
82 total_rech_num_7 48.00 283
83 total_rech_num_8 46.00 287
84 max_rech_amt_6 1000.00 169
85 max_rech_amt_7 1000.00 204
86 max_rech_amt_8 951.00 289
87 last_day_rch_amt_6 655.00 284
88 last_day_rch_amt_7 655.00 300
89 last_day_rch_amt_8 619.00 283
90 Average_rech_amt_6n7 2216.30 301
91 delta_vol_2g 654.31 301
92 delta_vol_3g 1878.12 301
93 delta_total_og_mou 1465.10 301
94 delta_total_ic_mou 619.69 301
95 delta_vbc_3g 929.64 301
96 delta_arpu 864.34 301
97 delta_total_rech_amt 1036.40 301
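The same capping can be expressed with `Series.clip`. The sketch below uses a made-up column `mou` rather than the notebook's data, and is only meant to show the mechanics of capping at the 99th percentile and counting the affected rows.

```python
import pandas as pd

# Toy column (hypothetical values) with one extreme observation
toy = pd.DataFrame({'mou': [10.0, 20.0, 30.0, 40.0, 5000.0]})

# 99th percentile of the column, used as the upper cap
threshold = toy['mou'].quantile(0.99)

# Rows strictly above the cap are the ones that get replaced
n_capped = int((toy['mou'] > threshold).sum())

# clip(upper=...) replaces every value above the cap with the cap itself
toy['mou_capped'] = toy['mou'].clip(upper=threshold)
print(threshold, n_capped)
```

With 30,011 rows in the dataset, roughly 1% of rows lie above the 99th percentile, which matches the 301 replacements reported for most columns. Columns with fewer replacements have values tied at the threshold, which are not strictly greater than it.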
In [195]:
categorical = data.dtypes == 'category'
categorical_vars = data.columns[categorical].to_list()
ind_categorical_vars = set(categorical_vars) - {'Churn'} #independent categorical variables
ind_categorical_vars
Out[195]:
{'monthly_2g_6',
 'monthly_2g_7',
 'monthly_2g_8',
 'monthly_3g_6',
 'monthly_3g_7',
 'monthly_3g_8',
 'sachet_2g_6',
 'sachet_2g_7',
 'sachet_2g_8',
 'sachet_3g_6',
 'sachet_3g_7',
 'sachet_3g_8'}

Grouping Categories with Low Contribution

In [196]:
# Group categories that contribute at most 1% of a column's rows into "Others"
for col in ind_categorical_vars:
    category_counts = 100 * data[col].value_counts(normalize=True)
    print('\n', tabulate(pd.DataFrame(category_counts), headers='keys', tablefmt='psql'), '\n')
    low_count_categories = category_counts[category_counts <= 1].index.to_list()
    print(f"Replaced {low_count_categories} in {col} with category : Others")
    # Assign the result back; an in-place replace on a column selection may not update `data`
    data[col] = data[col].replace(low_count_categories, 'Others')
    
 +----+---------------+
|    |   sachet_3g_6 |
|----+---------------|
|  0 |   93.4091     |
|  1 |    4.35507    |
|  2 |    1.04295    |
|  3 |    0.396521   |
|  4 |    0.219919   |
|  5 |    0.123288   |
|  6 |    0.089967   |
|  7 |    0.0866349  |
|  8 |    0.0499817  |
|  9 |    0.0499817  |
| 10 |    0.0366532  |
| 11 |    0.0266569  |
| 15 |    0.0166606  |
| 12 |    0.0133284  |
| 19 |    0.0133284  |
| 13 |    0.00999633 |
| 14 |    0.00999633 |
| 18 |    0.00999633 |
| 23 |    0.00999633 |
| 16 |    0.00666422 |
| 22 |    0.00666422 |
| 29 |    0.00666422 |
| 28 |    0.00333211 |
| 17 |    0.00333211 |
| 21 |    0.00333211 |
+----+---------------+ 

Replaced [3, 4, 5, 6, 7, 8, 9, 10, 11, 15, 12, 19, 13, 14, 18, 23, 16, 22, 29, 28, 17, 21] in sachet_3g_6 with category : Others

 +----+----------------+
|    |   monthly_2g_7 |
|----+----------------|
|  0 |    88.4876     |
|  1 |    10.0397     |
|  2 |     1.35284    |
|  3 |     0.0966312  |
|  4 |     0.0166606  |
|  5 |     0.00666422 |
+----+----------------+ 

Replaced [3, 4, 5] in monthly_2g_7 with category : Others

 +----+----------------+
|    |   monthly_2g_8 |
|----+----------------|
|  0 |    89.7604     |
|  1 |     9.19996    |
|  2 |     0.942988   |
|  3 |     0.0733065  |
|  4 |     0.0166606  |
|  5 |     0.00666422 |
+----+----------------+ 

Replaced [2, 3, 4, 5] in monthly_2g_8 with category : Others

 +----+---------------+
|    |   sachet_3g_8 |
|----+---------------|
|  0 |   94.2388     |
|  1 |    3.52537    |
|  2 |    0.839692   |
|  3 |    0.429842   |
|  4 |    0.243244   |
|  5 |    0.219919   |
|  6 |    0.0866349  |
|  7 |    0.0766386  |
|  8 |    0.0733065  |
|  9 |    0.0399853  |
| 12 |    0.0366532  |
| 13 |    0.0333211  |
| 10 |    0.0333211  |
| 11 |    0.0199927  |
| 14 |    0.0199927  |
| 15 |    0.0166606  |
| 16 |    0.00999633 |
| 17 |    0.00666422 |
| 18 |    0.00666422 |
| 20 |    0.00666422 |
| 21 |    0.00666422 |
| 23 |    0.00666422 |
| 38 |    0.00333211 |
| 19 |    0.00333211 |
| 25 |    0.00333211 |
| 27 |    0.00333211 |
| 29 |    0.00333211 |
| 30 |    0.00333211 |
| 41 |    0.00333211 |
+----+---------------+ 

Replaced [2, 3, 4, 5, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15, 16, 17, 18, 20, 21, 23, 38, 19, 25, 27, 29, 30, 41] in sachet_3g_8 with category : Others

 +----+----------------+
|    |   monthly_3g_7 |
|----+----------------|
|  0 |    87.8378     |
|  1 |     8.21699    |
|  2 |     2.739      |
|  3 |     0.689747   |
|  4 |     0.226584   |
|  5 |     0.129952   |
|  6 |     0.0766386  |
|  7 |     0.0333211  |
|  8 |     0.0166606  |
|  9 |     0.0133284  |
| 11 |     0.00666422 |
| 16 |     0.00333211 |
| 14 |     0.00333211 |
| 12 |     0.00333211 |
| 10 |     0.00333211 |
+----+----------------+ 

Replaced [3, 4, 5, 6, 7, 8, 9, 11, 16, 14, 12, 10] in monthly_3g_7 with category : Others

 +----+---------------+
|    |   sachet_2g_6 |
|----+---------------|
|  0 |   82.5631     |
|  1 |    7.87378    |
|  2 |    3.3621     |
|  3 |    2.0126     |
|  4 |    1.32951    |
|  5 |    0.703076   |
|  6 |    0.509813   |
|  7 |    0.356536   |
|  8 |    0.286562   |
|  9 |    0.239912   |
| 10 |    0.17327    |
| 12 |    0.146613   |
| 11 |    0.0999633  |
| 13 |    0.0566459  |
| 14 |    0.0533138  |
| 15 |    0.0433175  |
| 17 |    0.0366532  |
| 18 |    0.029989   |
| 19 |    0.029989   |
| 16 |    0.0233248  |
| 22 |    0.0133284  |
| 20 |    0.00999633 |
| 21 |    0.00999633 |
| 24 |    0.00999633 |
| 25 |    0.00999633 |
| 39 |    0.00333211 |
| 27 |    0.00333211 |
| 30 |    0.00333211 |
| 32 |    0.00333211 |
| 34 |    0.00333211 |
| 28 |    0          |
| 42 |    0          |
+----+---------------+ 

Replaced [5, 6, 7, 8, 9, 10, 12, 11, 13, 14, 15, 17, 18, 19, 16, 22, 20, 21, 24, 25, 39, 27, 30, 32, 34, 28, 42] in sachet_2g_6 with category : Others

 +----+----------------+
|    |   monthly_2g_6 |
|----+----------------|
|  0 |     88.9074    |
|  1 |      9.83306   |
|  2 |      1.14958   |
|  3 |      0.0866349 |
|  4 |      0.0233248 |
+----+----------------+ 

Replaced [3, 4] in monthly_2g_6 with category : Others

 +----+---------------+
|    |   sachet_2g_7 |
|----+---------------|
|  0 |   81.8033     |
|  1 |    7.24068    |
|  2 |    3.34877    |
|  3 |    1.96595    |
|  4 |    1.50945    |
|  5 |    1.20622    |
|  6 |    0.843024   |
|  7 |    0.543134   |
|  8 |    0.403185   |
| 10 |    0.239912   |
|  9 |    0.219919   |
| 11 |    0.159941   |
| 12 |    0.0966312  |
| 14 |    0.0799707  |
| 13 |    0.0666422  |
| 15 |    0.0499817  |
| 16 |    0.0366532  |
| 18 |    0.0333211  |
| 17 |    0.029989   |
| 20 |    0.0266569  |
| 19 |    0.0233248  |
| 21 |    0.00999633 |
| 26 |    0.00999633 |
| 27 |    0.00999633 |
| 22 |    0.00666422 |
| 23 |    0.00666422 |
| 30 |    0.00666422 |
| 42 |    0.00333211 |
| 24 |    0.00333211 |
| 25 |    0.00333211 |
| 29 |    0.00333211 |
| 32 |    0.00333211 |
| 35 |    0.00333211 |
| 48 |    0.00333211 |
| 28 |    0          |
+----+---------------+ 

Replaced [6, 7, 8, 10, 9, 11, 12, 14, 13, 15, 16, 18, 17, 20, 19, 21, 26, 27, 22, 23, 30, 42, 24, 25, 29, 32, 35, 48, 28] in sachet_2g_7 with category : Others

 +----+---------------+
|    |   sachet_3g_7 |
|----+---------------|
|  0 |   93.4757     |
|  1 |    4.10849    |
|  2 |    1.03962    |
|  3 |    0.383193   |
|  4 |    0.239912   |
|  5 |    0.219919   |
|  6 |    0.139949   |
|  7 |    0.059978   |
|  9 |    0.0533138  |
|  8 |    0.0466496  |
| 11 |    0.0433175  |
| 10 |    0.0333211  |
| 12 |    0.0333211  |
| 15 |    0.0166606  |
| 14 |    0.0166606  |
| 13 |    0.0133284  |
| 18 |    0.0133284  |
| 19 |    0.00999633 |
| 20 |    0.00999633 |
| 22 |    0.00999633 |
| 17 |    0.00666422 |
| 21 |    0.00666422 |
| 24 |    0.00666422 |
| 33 |    0.00333211 |
| 16 |    0.00333211 |
| 31 |    0.00333211 |
| 35 |    0.00333211 |
+----+---------------+ 

Replaced [3, 4, 5, 6, 7, 9, 8, 11, 10, 12, 15, 14, 13, 18, 19, 20, 22, 17, 21, 24, 33, 16, 31, 35] in sachet_3g_7 with category : Others

 +----+----------------+
|    |   monthly_3g_8 |
|----+----------------|
|  0 |    88.3876     |
|  1 |     8.00706    |
|  2 |     2.45243    |
|  3 |     0.656426   |
|  4 |     0.289894   |
|  5 |     0.0999633  |
|  6 |     0.0466496  |
|  7 |     0.029989   |
|  9 |     0.00999633 |
|  8 |     0.00999633 |
| 10 |     0.00666422 |
| 16 |     0.00333211 |
+----+----------------+ 

Replaced [3, 4, 5, 6, 7, 9, 8, 10, 16] in monthly_3g_8 with category : Others

 +----+----------------+
|    |   monthly_3g_6 |
|----+----------------|
|  0 |    88.0744     |
|  1 |     8.4669     |
|  2 |     2.32248    |
|  3 |     0.689747   |
|  4 |     0.246576   |
|  5 |     0.106628   |
|  6 |     0.0366532  |
|  7 |     0.029989   |
|  8 |     0.00999633 |
| 11 |     0.00666422 |
|  9 |     0.00666422 |
| 14 |     0.00333211 |
+----+----------------+ 

Replaced [3, 4, 5, 6, 7, 8, 11, 9, 14] in monthly_3g_6 with category : Others

 +----+---------------+
|    |   sachet_2g_8 |
|----+---------------|
|  0 |   79.7274     |
|  1 |    8.87008    |
|  2 |    3.25881    |
|  3 |    2.19253    |
|  4 |    1.81267    |
|  5 |    1.44947    |
|  6 |    0.88301    |
|  7 |    0.459831   |
|  8 |    0.313218   |
|  9 |    0.249908   |
| 10 |    0.169938   |
| 11 |    0.123288   |
| 12 |    0.113292   |
| 14 |    0.0766386  |
| 15 |    0.0566459  |
| 13 |    0.0499817  |
| 16 |    0.0433175  |
| 18 |    0.0266569  |
| 17 |    0.0233248  |
| 19 |    0.0233248  |
| 20 |    0.0133284  |
| 34 |    0.00666422 |
| 29 |    0.00666422 |
| 27 |    0.00666422 |
| 24 |    0.00666422 |
| 22 |    0.00666422 |
| 21 |    0.00666422 |
| 23 |    0.00333211 |
| 25 |    0.00333211 |
| 26 |    0.00333211 |
| 31 |    0.00333211 |
| 32 |    0.00333211 |
| 33 |    0.00333211 |
| 44 |    0.00333211 |
+----+---------------+ 

Replaced [6, 7, 8, 9, 10, 11, 12, 14, 15, 13, 16, 18, 17, 19, 20, 34, 29, 27, 24, 22, 21, 23, 25, 26, 31, 32, 33, 44] in sachet_2g_8 with category : Others
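The grouping rule can be checked on a small example. The sketch below uses a hypothetical column `sachet_toy` with made-up counts. It follows the same rule as the loop above (share of rows at or below 1%), but uses `map` instead of `replace` so it does not depend on the column's dtype.

```python
import pandas as pd

# Toy column of pack counts per customer (hypothetical values, not the notebook's data)
packs = pd.Series([0] * 180 + [1] * 16 + [2] * 3 + [3] * 1, name='sachet_toy')

# Percentage share of each category: 90%, 8%, 1.5% and 0.5%
share = packs.value_counts(normalize=True).mul(100)

# Categories at or below the 1% cut-off
rare = share[share <= 1].index.to_list()

# Replace rare categories with the label 'Others'
grouped = packs.map(lambda v: 'Others' if v in rare else v)

print(rare)
print(grouped.value_counts().to_dict())
```

Only category 3 falls under the cut-off here. Category 2, at 1.5%, is kept as its own level.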

Creating Dummy Variables

In [197]:
cat_cols = list(ind_categorical_vars)  # newer pandas versions reject a set as a column indexer
dummy_vars = pd.get_dummies(data[cat_cols], drop_first=False, prefix=cat_cols, prefix_sep='_')
dummy_vars.head()
Out[197]:
sachet_3g_6_0 sachet_3g_6_1 sachet_3g_6_2 sachet_3g_6_Others monthly_2g_7_0 monthly_2g_7_1 monthly_2g_7_2 monthly_2g_7_Others monthly_2g_8_0 monthly_2g_8_1 monthly_2g_8_Others sachet_3g_8_0 sachet_3g_8_1 sachet_3g_8_Others monthly_3g_7_0 monthly_3g_7_1 monthly_3g_7_2 monthly_3g_7_Others sachet_2g_6_0 sachet_2g_6_1 sachet_2g_6_2 sachet_2g_6_3 sachet_2g_6_4 sachet_2g_6_Others monthly_2g_6_0 monthly_2g_6_1 monthly_2g_6_2 monthly_2g_6_Others sachet_2g_7_0 sachet_2g_7_1 sachet_2g_7_2 sachet_2g_7_3 sachet_2g_7_4 sachet_2g_7_5 sachet_2g_7_Others sachet_3g_7_0 sachet_3g_7_1 sachet_3g_7_2 sachet_3g_7_Others monthly_3g_8_0 monthly_3g_8_1 monthly_3g_8_2 monthly_3g_8_Others monthly_3g_6_0 monthly_3g_6_1 monthly_3g_6_2 monthly_3g_6_Others sachet_2g_8_0 sachet_2g_8_1 sachet_2g_8_2 sachet_2g_8_3 sachet_2g_8_4 sachet_2g_8_5 sachet_2g_8_Others
mobile_number
7000701601 1 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0
7001524846 1 0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0
7002191713 1 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0
7000875565 1 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0
7000187447 1 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0
In [198]:
reference_cols = dummy_vars.filter(regex='.*Others$').columns.to_list() # Using category 'Others' in each column as reference. 
dummy_vars.drop(columns=reference_cols, inplace=True)
reference_cols
Out[198]:
['sachet_3g_6_Others',
 'monthly_2g_7_Others',
 'monthly_2g_8_Others',
 'sachet_3g_8_Others',
 'monthly_3g_7_Others',
 'sachet_2g_6_Others',
 'monthly_2g_6_Others',
 'sachet_2g_7_Others',
 'sachet_3g_7_Others',
 'monthly_3g_8_Others',
 'monthly_3g_6_Others',
 'sachet_2g_8_Others']
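Dropping the `Others` dummy makes it the reference level: a row that belongs to `Others` is encoded as all zeros across the remaining dummies of that variable. A minimal sketch with a hypothetical column `monthly_toy`:

```python
import pandas as pd

# Toy categorical column (hypothetical), already grouped so that rare levels are 'Others'
toy = pd.DataFrame({'monthly_toy': ['0', '1', 'Others', '0']})

# One dummy per level, including 'Others'
dummies = pd.get_dummies(toy, columns=['monthly_toy'], prefix_sep='_')

# Identify and drop the 'Others' dummy so it acts as the reference category
reference = dummies.filter(regex='_Others$').columns.to_list()
dummies = dummies.drop(columns=reference)

print(reference)
print(dummies.astype(int))
```

Choosing the reference explicitly, rather than relying on `drop_first=True`, keeps the most common level (0 packs) as a named column whose coefficient can be interpreted later.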
In [199]:
# concatenating dummy variables with original 'data'
data.drop(columns=ind_categorical_vars, inplace=True) # dropping original categorical columns
data = pd.concat([data, dummy_vars], axis=1)
data.head()
Out[199]:
onnet_mou_6 onnet_mou_7 onnet_mou_8 offnet_mou_6 offnet_mou_7 offnet_mou_8 roam_ic_mou_6 roam_ic_mou_7 roam_ic_mou_8 roam_og_mou_6 roam_og_mou_7 roam_og_mou_8 loc_og_t2t_mou_6 loc_og_t2t_mou_7 loc_og_t2t_mou_8 loc_og_t2m_mou_6 loc_og_t2m_mou_7 loc_og_t2m_mou_8 loc_og_t2f_mou_6 loc_og_t2f_mou_7 loc_og_t2f_mou_8 loc_og_t2c_mou_6 loc_og_t2c_mou_7 loc_og_t2c_mou_8 loc_og_mou_6 loc_og_mou_7 loc_og_mou_8 std_og_t2t_mou_6 std_og_t2t_mou_7 std_og_t2t_mou_8 std_og_t2m_mou_6 std_og_t2m_mou_7 std_og_t2m_mou_8 std_og_t2f_mou_6 std_og_t2f_mou_7 std_og_t2f_mou_8 std_og_mou_6 std_og_mou_7 std_og_mou_8 isd_og_mou_6 isd_og_mou_7 isd_og_mou_8 spl_og_mou_6 spl_og_mou_7 spl_og_mou_8 og_others_6 og_others_7 og_others_8 loc_ic_t2t_mou_6 loc_ic_t2t_mou_7 loc_ic_t2t_mou_8 loc_ic_t2m_mou_6 loc_ic_t2m_mou_7 loc_ic_t2m_mou_8 loc_ic_t2f_mou_6 loc_ic_t2f_mou_7 loc_ic_t2f_mou_8 loc_ic_mou_6 loc_ic_mou_7 loc_ic_mou_8 std_ic_t2t_mou_6 std_ic_t2t_mou_7 std_ic_t2t_mou_8 std_ic_t2m_mou_6 std_ic_t2m_mou_7 std_ic_t2m_mou_8 std_ic_t2f_mou_6 std_ic_t2f_mou_7 std_ic_t2f_mou_8 std_ic_mou_6 std_ic_mou_7 std_ic_mou_8 spl_ic_mou_6 spl_ic_mou_7 spl_ic_mou_8 isd_ic_mou_6 isd_ic_mou_7 isd_ic_mou_8 ic_others_6 ic_others_7 ic_others_8 total_rech_num_6 total_rech_num_7 total_rech_num_8 max_rech_amt_6 max_rech_amt_7 max_rech_amt_8 last_day_rch_amt_6 last_day_rch_amt_7 last_day_rch_amt_8 aon Average_rech_amt_6n7 Churn delta_vol_2g delta_vol_3g delta_total_og_mou delta_total_ic_mou delta_vbc_3g delta_arpu delta_total_rech_amt sachet_3g_6_0 sachet_3g_6_1 sachet_3g_6_2 monthly_2g_7_0 monthly_2g_7_1 monthly_2g_7_2 monthly_2g_8_0 monthly_2g_8_1 sachet_3g_8_0 sachet_3g_8_1 monthly_3g_7_0 monthly_3g_7_1 monthly_3g_7_2 sachet_2g_6_0 sachet_2g_6_1 sachet_2g_6_2 sachet_2g_6_3 sachet_2g_6_4 monthly_2g_6_0 monthly_2g_6_1 monthly_2g_6_2 sachet_2g_7_0 sachet_2g_7_1 sachet_2g_7_2 sachet_2g_7_3 sachet_2g_7_4 sachet_2g_7_5 sachet_3g_7_0 sachet_3g_7_1 sachet_3g_7_2 monthly_3g_8_0 monthly_3g_8_1 monthly_3g_8_2 monthly_3g_6_0 monthly_3g_6_1 monthly_3g_6_2 sachet_2g_8_0 sachet_2g_8_1 sachet_2g_8_2 sachet_2g_8_3 sachet_2g_8_4 sachet_2g_8_5
mobile_number
7000701601 57.84 54.68 52.29 453.43 567.16 325.91 16.23 33.49 31.64 23.74 12.59 38.06 51.39 31.38 40.28 308.63 447.38 162.28 62.13 55.14 53.23 0.0 0.0 0.00 422.16 533.91 255.79 4.30 23.29 12.01 49.89 31.76 49.14 6.66 20.08 16.68 60.86 75.14 77.84 0.0 0.18 10.01 4.50 0.00 6.50 0.00 0.0 0.0 58.14 32.26 27.31 217.56 221.49 121.19 152.16 101.46 39.53 427.88 355.23 188.04 36.89 11.83 30.39 91.44 126.99 141.33 52.19 34.24 22.21 180.54 173.08 193.94 0.21 0.0 0.0 2.06 14.53 31.59 15.74 15.19 15.14 5.0 5.0 7.0 1000.0 790.0 951.0 0.0 0.0 619.0 802 1185.0 1 0.00 0.00 -198.22 -163.51 38.68 864.34 1036.4 1 0 0 1 0 0 1 0 1 0 1 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0
7001524846 413.69 351.03 35.08 94.66 80.63 136.48 0.00 0.00 0.00 0.00 0.00 0.00 297.13 217.59 12.49 80.96 70.58 50.54 0.00 0.00 0.00 0.0 0.0 7.15 378.09 288.18 63.04 116.56 133.43 22.58 13.69 10.04 75.69 0.00 0.00 0.00 130.26 143.48 98.28 0.0 0.00 0.00 0.00 0.00 10.23 0.00 0.0 0.0 23.84 9.84 0.31 57.58 13.98 15.48 0.00 0.00 0.00 81.43 23.83 15.79 0.00 0.58 0.10 22.43 4.08 0.65 0.00 0.00 0.00 22.43 4.66 0.75 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.00 19.0 21.0 14.0 90.0 154.0 30.0 50.0 0.0 10.0 315 519.0 0 -177.97 -363.54 -298.45 -49.63 -495.38 -298.11 -399.0 1 0 0 0 1 0 1 0 1 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 1 0 0
7002191713 501.76 108.39 534.24 413.31 119.28 482.46 23.53 144.24 72.11 7.98 35.26 1.44 49.63 6.19 36.01 151.13 47.28 294.46 4.54 0.00 23.51 0.0 0.0 0.49 205.31 53.48 353.99 446.41 85.98 498.23 255.36 52.94 156.94 0.00 0.00 0.00 701.78 138.93 655.18 0.0 0.00 1.29 0.00 0.00 4.78 0.00 0.0 0.0 67.88 7.58 52.58 142.88 18.53 195.18 4.81 0.00 7.49 215.58 26.11 255.26 115.68 38.29 154.58 308.13 29.79 317.91 0.00 0.00 1.91 423.81 68.09 474.41 0.45 0.0 0.0 239.60 62.11 249.89 20.71 16.24 21.44 6.0 4.0 11.0 110.0 110.0 130.0 110.0 50.0 0.0 2607 380.0 0 0.02 0.00 465.51 573.93 0.00 244.00 337.0 1 0 0 1 0 0 1 0 1 0 1 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 1 0 0
7000875565 50.51 74.01 70.61 296.29 229.74 162.76 0.00 2.83 0.00 0.00 17.74 0.00 42.61 65.16 67.38 273.29 145.99 128.28 0.00 4.48 10.26 0.0 0.0 0.00 315.91 215.64 205.93 7.89 2.58 3.23 22.99 64.51 18.29 0.00 0.00 0.00 30.89 67.09 21.53 0.0 0.00 0.00 0.00 3.26 5.91 0.00 0.0 0.0 41.33 71.44 28.89 226.81 149.69 150.16 8.71 8.68 32.71 276.86 229.83 211.78 68.79 78.64 6.33 18.68 73.08 73.93 0.51 0.00 2.18 87.99 151.73 82.44 0.00 0.0 0.0 0.00 0.00 0.23 0.00 0.00 0.00 10.0 6.0 2.0 110.0 110.0 130.0 100.0 100.0 130.0 511 459.0 0 0.00 0.00 -83.03 -78.75 -12.17 -177.53 -299.0 1 0 0 1 0 0 1 0 1 0 1 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0
7000187447 1185.91 9.28 7.79 61.64 0.00 5.54 0.00 4.76 4.81 0.00 8.46 13.34 38.99 0.00 0.00 58.54 0.00 0.00 0.00 0.00 0.00 0.0 0.0 0.00 97.54 0.00 0.00 1146.91 0.81 0.00 1.55 0.00 0.00 0.00 0.00 0.00 1148.46 0.81 0.00 0.0 0.00 0.00 2.58 0.00 0.00 0.93 0.0 0.0 34.54 0.00 0.00 47.41 2.31 0.00 0.00 0.00 0.00 81.96 2.31 0.00 8.63 0.00 0.00 1.28 0.00 0.00 0.00 0.00 0.00 9.91 0.00 0.00 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.00 19.0 2.0 4.0 110.0 0.0 30.0 30.0 0.0 0.0 667 408.0 0 0.00 0.00 -625.17 -47.09 0.00 -329.00 -378.0 1 0 0 1 0 0 1 0 1 0 1 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0
In [200]:
dummy_cols = dummy_vars.columns.to_list()
data[dummy_cols] = data[dummy_cols].astype('category')
In [201]:
data.shape
Out[201]:
(30011, 142)

The following section covers:

  • Train-Test Split
  • Class Imbalance
  • Standardization
  • Modelling
    • Model 1 : Logistic Regression with RFE & Manual Elimination ( Interpretable Model )
    • Model 2 : PCA + Logistic Regression
    • Model 3 : PCA + Random Forest Classifier
    • Model 4 : PCA + XGBoost

Train-Test Split

In [3]:
y = data.pop('Churn') # Predicted / Target Variable
X = data # Predictor variables
In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.7, random_state=42)

Class Imbalance

In [5]:
y.value_counts(normalize=True).to_frame()
Out[5]:
Churn
0 0.913598
1 0.086402
In [6]:
# Ratio of classes 
class_0 = y[y == 0].count()
class_1 = y[y == 1].count()

print(f'Class Imbalance Ratio : {round(class_1/class_0,3)}')
Class Imbalance Ratio : 0.095
  • To account for the class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) is applied to the training set.
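The reported ratio follows directly from the class shares printed above. The short check below re-derives it in plain Python; the two shares are copied from the notebook output, and the variable names are chosen here for illustration.

```python
# Class shares as printed in the notebook output
share_not_churn = 0.913598
share_churn = 0.086402

# Minority-to-majority ratio, the quantity the notebook reports
ratio = share_churn / share_not_churn
print(round(ratio, 3))  # 0.095

# Equivalent reading: non-churners per churner
majority_per_minority = share_not_churn / share_churn
print(round(majority_per_minority, 1))  # 10.6
```

In other words, there is roughly one churner for every 10 to 11 non-churners, so a model that always predicts "no churn" would already be about 91% accurate. Accuracy alone is therefore a weak metric for this problem.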

Using SMOTE

In [7]:
#!pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
smt = SMOTE(random_state=42, k_neighbors=5)

# Resample only the training set; the test set keeps its original class distribution
X_train_resampled, y_train_resampled = smt.fit_resample(X_train, y_train)
X_train_resampled.head()
Out[7]:
onnet_mou_6 onnet_mou_7 onnet_mou_8 offnet_mou_6 offnet_mou_7 offnet_mou_8 roam_ic_mou_6 roam_ic_mou_7 roam_ic_mou_8 roam_og_mou_6 roam_og_mou_7 roam_og_mou_8 loc_og_t2t_mou_6 loc_og_t2t_mou_7 loc_og_t2t_mou_8 loc_og_t2m_mou_6 loc_og_t2m_mou_7 loc_og_t2m_mou_8 loc_og_t2f_mou_6 loc_og_t2f_mou_7 loc_og_t2f_mou_8 loc_og_t2c_mou_6 loc_og_t2c_mou_7 loc_og_t2c_mou_8 loc_og_mou_6 loc_og_mou_7 loc_og_mou_8 std_og_t2t_mou_6 std_og_t2t_mou_7 std_og_t2t_mou_8 std_og_t2m_mou_6 std_og_t2m_mou_7 std_og_t2m_mou_8 std_og_t2f_mou_6 std_og_t2f_mou_7 std_og_t2f_mou_8 std_og_mou_6 std_og_mou_7 std_og_mou_8 isd_og_mou_6 isd_og_mou_7 isd_og_mou_8 spl_og_mou_6 spl_og_mou_7 spl_og_mou_8 og_others_6 og_others_7 og_others_8 loc_ic_t2t_mou_6 loc_ic_t2t_mou_7 loc_ic_t2t_mou_8 loc_ic_t2m_mou_6 loc_ic_t2m_mou_7 loc_ic_t2m_mou_8 loc_ic_t2f_mou_6 loc_ic_t2f_mou_7 loc_ic_t2f_mou_8 loc_ic_mou_6 loc_ic_mou_7 loc_ic_mou_8 std_ic_t2t_mou_6 std_ic_t2t_mou_7 std_ic_t2t_mou_8 std_ic_t2m_mou_6 std_ic_t2m_mou_7 std_ic_t2m_mou_8 std_ic_t2f_mou_6 std_ic_t2f_mou_7 std_ic_t2f_mou_8 std_ic_mou_6 std_ic_mou_7 std_ic_mou_8 spl_ic_mou_6 spl_ic_mou_7 spl_ic_mou_8 isd_ic_mou_6 isd_ic_mou_7 isd_ic_mou_8 ic_others_6 ic_others_7 ic_others_8 total_rech_num_6 total_rech_num_7 total_rech_num_8 max_rech_amt_6 max_rech_amt_7 max_rech_amt_8 last_day_rch_amt_6 last_day_rch_amt_7 last_day_rch_amt_8 aon Average_rech_amt_6n7 delta_vol_2g delta_vol_3g delta_total_og_mou delta_total_ic_mou delta_vbc_3g delta_arpu delta_total_rech_amt sachet_3g_6_0 sachet_3g_6_1 sachet_3g_6_2 monthly_2g_7_0 monthly_2g_7_1 monthly_2g_7_2 monthly_2g_8_0 monthly_2g_8_1 sachet_3g_8_0 sachet_3g_8_1 monthly_3g_7_0 monthly_3g_7_1 monthly_3g_7_2 sachet_2g_6_0 sachet_2g_6_1 sachet_2g_6_2 sachet_2g_6_3 sachet_2g_6_4 monthly_2g_6_0 monthly_2g_6_1 monthly_2g_6_2 sachet_2g_7_0 sachet_2g_7_1 sachet_2g_7_2 sachet_2g_7_3 sachet_2g_7_4 sachet_2g_7_5 sachet_3g_7_0 sachet_3g_7_1 sachet_3g_7_2 monthly_3g_8_0 monthly_3g_8_1 monthly_3g_8_2 monthly_3g_6_0 monthly_3g_6_1 monthly_3g_6_2 sachet_2g_8_0 sachet_2g_8_1 sachet_2g_8_2 sachet_2g_8_3 sachet_2g_8_4 sachet_2g_8_5
0 53.01 52.64 37.48 316.01 195.74 68.36 0.0 0.0 0.0 0.0 0.0 0.0 53.01 52.64 37.48 282.38 171.64 44.51 31.59 17.38 19.43 0.0 0.0 0.00 366.99 241.68 101.43 0.00 0.00 0.00 0.00 2.11 0.00 2.03 4.59 4.41 2.03 6.71 4.41 0.00 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.0 18.41 40.79 11.79 292.99 191.98 85.89 6.26 1.21 10.39 317.68 233.99 108.09 0.00 0.00 0.00 0.66 0.00 0.00 5.61 1.53 2.76 6.28 1.53 2.76 0.00 0.0 0.00 0.00 0.00 9.55 0.00 0.00 0.00 6.0 5.0 4.0 198.0 198.0 198.0 110.0 130.0 130.0 1423 483.0 -791.7700 1077.750 -202.870 -159.335 71.085 -172.4995 -155.0 1 0 0 0 1 0 0 1 1 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0
1 91.39 216.14 150.58 504.19 301.98 434.41 0.0 0.0 0.0 0.0 0.0 0.0 40.36 36.21 27.73 37.26 36.73 59.61 0.00 0.00 0.00 0.0 0.0 0.58 77.63 72.94 87.34 51.03 179.93 122.84 465.96 265.24 356.44 0.00 0.00 0.00 516.99 445.18 479.29 0.96 0.0 3.89 0.0 0.0 14.45 0.0 0.0 0.0 104.39 31.98 35.83 154.11 147.88 243.53 0.00 0.76 0.00 258.51 180.63 279.36 4.03 2.99 0.46 6.36 12.31 3.91 0.00 0.00 0.00 10.39 15.31 4.38 0.58 0.0 0.25 19.66 21.96 86.63 0.23 0.56 1.04 8.0 11.0 12.0 110.0 130.0 130.0 0.0 130.0 0.0 189 454.0 0.0000 0.000 28.130 117.745 0.000 48.6160 -94.0 1 0 0 1 0 0 1 0 1 0 1 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0
2 11.96 14.13 0.40 1.51 0.00 0.00 0.0 0.0 0.0 0.0 0.0 0.0 11.96 14.13 0.40 1.51 0.00 0.00 0.00 0.00 0.00 0.0 0.0 0.00 13.48 14.13 0.40 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.0 20.58 20.39 97.66 36.84 21.58 18.66 5.48 0.73 1.43 62.91 42.71 117.76 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.0 3.0 4.0 252.0 252.0 252.0 252.0 0.0 252.0 2922 403.0 -44.6300 -5.525 -13.405 64.950 0.000 75.3940 151.0 1 0 0 1 0 0 1 0 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 1 0 0 0 0 0
3 532.66 537.31 738.21 49.03 71.64 39.43 0.0 0.0 0.0 0.0 0.0 0.0 24.46 19.79 37.74 41.26 47.86 39.43 1.19 4.04 0.00 0.0 0.0 0.00 66.93 71.71 77.18 508.19 517.51 700.46 6.56 18.24 0.00 0.00 1.48 0.00 514.76 537.24 700.46 0.00 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.0 19.86 28.81 20.24 66.08 94.18 67.54 51.74 68.16 50.08 137.69 191.16 137.88 18.83 14.56 1.28 1.08 20.89 6.83 0.00 3.08 3.05 19.91 38.54 11.16 0.00 0.0 0.00 0.00 5.28 7.49 0.00 0.00 0.00 10.0 13.0 12.0 145.0 150.0 145.0 0.0 150.0 0.0 1128 521.0 -10.1500 -108.195 182.315 -39.760 0.000 192.8075 207.0 1 0 0 1 0 0 1 0 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 0
4 122.68 105.51 149.33 302.23 211.44 264.11 0.0 0.0 0.0 0.0 0.0 0.0 122.68 105.51 149.33 301.04 194.06 257.14 0.00 0.66 0.51 0.0 0.0 0.00 423.73 300.24 406.99 0.00 0.00 0.00 1.18 15.75 6.44 0.00 0.96 0.00 1.18 16.71 6.44 0.00 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.0 228.54 198.24 231.13 412.99 392.98 353.86 81.76 89.69 88.74 723.31 680.93 673.74 0.00 0.00 1.05 8.14 5.33 0.70 11.83 6.58 10.44 19.98 11.91 12.19 0.00 0.0 0.00 0.43 0.00 0.48 0.00 0.00 0.00 5.0 5.0 4.0 325.0 154.0 164.0 325.0 154.0 164.0 2453 721.0 654.3125 -686.915 42.505 -31.855 -433.700 -55.1110 -105.0 1 0 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 0 0 0 0
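SMOTE creates each synthetic minority sample by interpolating between a minority point and one of its nearest minority neighbours. The sketch below shows only that interpolation step in plain Python. The helper `smote_point` and the two example vectors are hypothetical, and the neighbour search that imbalanced-learn performs is left out.

```python
import random

def smote_point(x, neighbor, rng):
    """Return one synthetic sample on the line segment between x and neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)

# Two minority-class customers: (minutes of usage, a 0/1 dummy column)
x = [100.0, 1.0]
neighbor = [200.0, 0.0]

synthetic = smote_point(x, neighbor, rng)
print(synthetic)
```

Because every feature is interpolated, a 0/1 dummy column can take a fractional value in a synthetic row when the two neighbours differ on it. imbalanced-learn offers `SMOTENC` for mixed numerical and categorical features, which may be worth considering for this dataset. The summary statistics further down report 38,374 training rows after resampling, which is consistent with both classes being brought to the same size.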

Standardizing Columns

In [8]:
# Columns with numerical data (explicit widths, since 'int' can mean int32 on some platforms)
numerical_vars = data.select_dtypes(include=['int64', 'float64']).columns.to_list()
In [9]:
# Standard scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() 

# Fit and transform train set 
X_train_resampled[numerical_vars] = scaler.fit_transform(X_train_resampled[numerical_vars])

# Transform test set
X_test[numerical_vars] = scaler.transform(X_test[numerical_vars])
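The scaler is fitted on the training set and only applied to the test set, so no test-set statistics leak into training. The NumPy sketch below reproduces that pattern by hand on made-up numbers. It uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical one-feature train and test sets
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

# Statistics are estimated on the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# The same mean and std are then applied to both sets
train_z = (train - mean) / std
test_z = (test - mean) / std

print(train_z.ravel(), test_z.ravel())
```

The standardized test values need not have mean 0 or standard deviation 1, since they are expressed relative to the training distribution. A column that is constant in the training set has zero standard deviation and carries no information after scaling.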
In [10]:
# summary statistics of standardized variables
round(X_train_resampled.describe(),2)
Out[10]:
onnet_mou_6 onnet_mou_7 onnet_mou_8 offnet_mou_6 offnet_mou_7 offnet_mou_8 roam_ic_mou_6 roam_ic_mou_7 roam_ic_mou_8 roam_og_mou_6 roam_og_mou_7 roam_og_mou_8 loc_og_t2t_mou_6 loc_og_t2t_mou_7 loc_og_t2t_mou_8 loc_og_t2m_mou_6 loc_og_t2m_mou_7 loc_og_t2m_mou_8 loc_og_t2f_mou_6 loc_og_t2f_mou_7 loc_og_t2f_mou_8 loc_og_t2c_mou_6 loc_og_t2c_mou_7 loc_og_t2c_mou_8 loc_og_mou_6 loc_og_mou_7 loc_og_mou_8 std_og_t2t_mou_6 std_og_t2t_mou_7 std_og_t2t_mou_8 std_og_t2m_mou_6 std_og_t2m_mou_7 std_og_t2m_mou_8 std_og_t2f_mou_6 std_og_t2f_mou_7 std_og_t2f_mou_8 std_og_mou_6 std_og_mou_7 std_og_mou_8 isd_og_mou_6 isd_og_mou_7 isd_og_mou_8 spl_og_mou_6 spl_og_mou_7 spl_og_mou_8 og_others_6 og_others_7 og_others_8 loc_ic_t2t_mou_6 loc_ic_t2t_mou_7 loc_ic_t2t_mou_8 loc_ic_t2m_mou_6 loc_ic_t2m_mou_7 loc_ic_t2m_mou_8 loc_ic_t2f_mou_6 loc_ic_t2f_mou_7 loc_ic_t2f_mou_8 loc_ic_mou_6 loc_ic_mou_7 loc_ic_mou_8 std_ic_t2t_mou_6 std_ic_t2t_mou_7 std_ic_t2t_mou_8 std_ic_t2m_mou_6 std_ic_t2m_mou_7 std_ic_t2m_mou_8 std_ic_t2f_mou_6 std_ic_t2f_mou_7 std_ic_t2f_mou_8 std_ic_mou_6 std_ic_mou_7 std_ic_mou_8 spl_ic_mou_6 spl_ic_mou_7 spl_ic_mou_8 isd_ic_mou_6 isd_ic_mou_7 isd_ic_mou_8 ic_others_6 ic_others_7 ic_others_8 total_rech_num_6 total_rech_num_7 total_rech_num_8 max_rech_amt_6 max_rech_amt_7 max_rech_amt_8 last_day_rch_amt_6 last_day_rch_amt_7 last_day_rch_amt_8 aon Average_rech_amt_6n7 delta_vol_2g delta_vol_3g delta_total_og_mou delta_total_ic_mou delta_vbc_3g delta_arpu delta_total_rech_amt sachet_3g_6_0 sachet_3g_6_1 sachet_3g_6_2 monthly_2g_7_0 monthly_2g_7_1 monthly_2g_7_2 monthly_2g_8_0 monthly_2g_8_1 sachet_3g_8_0 sachet_3g_8_1 monthly_3g_7_0 monthly_3g_7_1 monthly_3g_7_2 sachet_2g_6_0 sachet_2g_6_1 sachet_2g_6_2 sachet_2g_6_3 sachet_2g_6_4 monthly_2g_6_0 monthly_2g_6_1 monthly_2g_6_2 sachet_2g_7_0 sachet_2g_7_1 sachet_2g_7_2 sachet_2g_7_3 sachet_2g_7_4 sachet_2g_7_5 sachet_3g_7_0 sachet_3g_7_1 sachet_3g_7_2 monthly_3g_8_0 monthly_3g_8_1 monthly_3g_8_2 monthly_3g_6_0 monthly_3g_6_1 monthly_3g_6_2 sachet_2g_8_0 sachet_2g_8_1 sachet_2g_8_2 sachet_2g_8_3 sachet_2g_8_4 sachet_2g_8_5
count 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.0 38374.0 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00 38374.00
mean -0.00 -0.00 0.00 0.00 -0.00 -0.00 -0.00 -0.00 0.00 -0.00 -0.00 0.00 -0.00 -0.00 0.00 0.00 -0.00 0.00 0.00 -0.00 -0.00 -0.00 0.00 -0.00 0.00 -0.00 -0.00 -0.00 -0.00 -0.00 -0.00 -0.00 0.00 0.00 0.00 0.00 -0.00 0.00 0.00 -0.00 -0.00 0.00 0.00 0.00 0.00 -0.00 0.0 0.0 0.00 -0.00 -0.00 -0.00 0.00 -0.00 0.00 0.00 0.00 -0.00 -0.00 -0.00 0.00 0.00 -0.00 0.00 0.00 0.00 0.00 0.00 -0.00 -0.00 0.00 -0.00 -0.00 -0.00 -0.00 -0.00 0.00 0.00 -0.00 -0.00 0.00 -0.00 0.00 -0.00 -0.00 -0.00 0.00 0.00 -0.00 0.00 0.00 -0.00 0.00 0.00 -0.00 0.00 -0.00 0.00 -0.00 0.00 -0.00 -0.00 -0.00 -0.00 -0.00 0.00 0.00 -0.00 0.00 0.00 0.00 0.00 0.00 -0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.00 -0.00 -0.00 0.00 0.00 -0.00 -0.00 0.00 -0.00 -0.00 0.00 -0.00 -0.00 -0.00 -0.00 -0.00 -0.00 0.00 -0.00 0.00 0.00
std 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.0 0.0 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
min -0.73 -0.68 -0.53 -0.94 -0.89 -0.70 -0.31 -0.32 -0.33 -0.33 -0.36 -0.36 -0.50 -0.49 -0.42 -0.75 -0.73 -0.59 -0.38 -0.38 -0.33 -0.37 -0.37 -0.31 -0.76 -0.74 -0.60 -0.57 -0.54 -0.40 -0.60 -0.57 -0.43 -0.22 -0.22 -0.19 -0.79 -0.74 -0.53 -0.20 -0.18 -0.15 -0.51 -0.53 -0.43 -0.44 0.0 0.0 -0.61 -0.57 -0.48 -0.77 -0.75 -0.61 -0.40 -0.39 -0.35 -0.80 -0.77 -0.63 -0.45 -0.43 -0.33 -0.51 -0.47 -0.39 -0.26 -0.26 -0.22 -0.56 -0.52 -0.41 -0.47 -0.21 -0.20 -0.28 -0.27 -0.22 -0.27 -0.25 -0.22 -1.50 -1.37 -1.02 -1.08 -1.04 -0.86 -0.93 -0.85 -0.64 -1.01 -0.91 -28.14 -27.16 -11.65 -14.66 -22.68 -14.96 -14.68 -3.43 -0.16 -0.08 -3.10 -0.25 -0.09 -3.78 -0.23 -4.62 -0.14 -2.78 -0.23 -0.13 -1.95 -0.22 -0.14 -0.11 -0.09 -2.99 -0.24 -0.08 -1.99 -0.21 -0.14 -0.10 -0.09 -0.08 -3.44 -0.16 -0.07 -3.33 -0.22 -0.12 -2.66 -0.23 -0.12 -2.11 -0.23 -0.14 -0.11 -0.10 -0.09
25% -0.63 -0.60 -0.52 -0.66 -0.65 -0.66 -0.31 -0.32 -0.33 -0.33 -0.36 -0.36 -0.45 -0.45 -0.42 -0.63 -0.63 -0.59 -0.38 -0.38 -0.33 -0.37 -0.37 -0.31 -0.63 -0.63 -0.60 -0.57 -0.54 -0.40 -0.59 -0.56 -0.43 -0.22 -0.22 -0.19 -0.77 -0.72 -0.53 -0.20 -0.18 -0.15 -0.51 -0.53 -0.43 -0.44 0.0 0.0 -0.53 -0.51 -0.48 -0.62 -0.61 -0.61 -0.40 -0.39 -0.35 -0.63 -0.62 -0.63 -0.45 -0.43 -0.33 -0.49 -0.46 -0.39 -0.26 -0.26 -0.22 -0.51 -0.48 -0.41 -0.47 -0.21 -0.20 -0.28 -0.27 -0.22 -0.27 -0.25 -0.22 -0.68 -0.66 -0.63 -0.40 -0.45 -0.70 -0.64 -0.65 -0.64 -0.73 -0.68 0.12 0.11 -0.41 -0.22 0.08 -0.51 -0.48 0.29 -0.16 -0.08 0.32 -0.25 -0.09 0.26 -0.23 0.22 -0.14 0.36 -0.23 -0.13 0.51 -0.22 -0.14 -0.11 -0.09 0.33 -0.24 -0.08 0.50 -0.21 -0.14 -0.10 -0.09 -0.08 0.29 -0.16 -0.07 0.30 -0.22 -0.12 0.38 -0.23 -0.12 0.47 -0.23 -0.14 -0.11 -0.10 -0.09
50% -0.42 -0.41 -0.40 -0.33 -0.33 -0.36 -0.31 -0.32 -0.33 -0.33 -0.36 -0.36 -0.32 -0.32 -0.35 -0.37 -0.37 -0.43 -0.38 -0.38 -0.33 -0.37 -0.37 -0.31 -0.36 -0.37 -0.43 -0.50 -0.48 -0.40 -0.45 -0.45 -0.41 -0.22 -0.22 -0.19 -0.42 -0.45 -0.49 -0.20 -0.18 -0.15 -0.43 -0.43 -0.43 -0.44 0.0 0.0 -0.34 -0.33 -0.37 -0.34 -0.35 -0.40 -0.36 -0.35 -0.35 -0.34 -0.35 -0.40 -0.38 -0.36 -0.33 -0.34 -0.34 -0.35 -0.26 -0.26 -0.22 -0.33 -0.33 -0.35 -0.47 -0.21 -0.20 -0.28 -0.27 -0.22 -0.27 -0.25 -0.22 -0.25 -0.30 -0.32 -0.33 -0.25 -0.07 -0.16 -0.34 -0.43 -0.39 -0.35 0.15 0.11 0.23 0.13 0.08 0.08 0.03 0.29 -0.16 -0.08 0.32 -0.25 -0.09 0.26 -0.23 0.22 -0.14 0.36 -0.23 -0.13 0.51 -0.22 -0.14 -0.11 -0.09 0.33 -0.24 -0.08 0.50 -0.21 -0.14 -0.10 -0.09 -0.08 0.29 -0.16 -0.07 0.30 -0.22 -0.12 0.38 -0.23 -0.12 0.47 -0.23 -0.14 -0.11 -0.10 -0.09
75% 0.20 0.15 0.01 0.27 0.26 0.23 -0.27 -0.25 -0.22 -0.28 -0.21 -0.20 0.01 0.00 -0.04 0.23 0.22 0.16 -0.13 -0.13 -0.20 -0.24 -0.21 -0.31 0.24 0.24 0.17 0.08 0.01 -0.20 0.10 0.07 -0.10 -0.22 -0.22 -0.19 0.45 0.39 0.07 -0.20 -0.18 -0.15 0.05 0.07 -0.05 -0.11 0.0 0.0 0.09 0.06 0.03 0.20 0.20 0.17 -0.09 -0.11 -0.15 0.23 0.21 0.20 -0.02 -0.04 -0.14 0.02 -0.00 -0.07 -0.25 -0.26 -0.22 0.05 0.03 -0.04 -0.14 -0.21 -0.20 -0.26 -0.26 -0.22 -0.21 -0.23 -0.22 0.37 0.40 0.28 0.06 0.02 0.22 0.14 0.34 0.58 0.39 0.30 0.15 0.11 0.50 0.36 0.08 0.58 0.58 0.29 -0.16 -0.08 0.32 -0.25 -0.09 0.26 -0.23 0.22 -0.14 0.36 -0.23 -0.13 0.51 -0.22 -0.14 -0.11 -0.09 0.33 -0.24 -0.08 0.50 -0.21 -0.14 -0.10 -0.09 -0.08 0.29 -0.16 -0.07 0.30 -0.22 -0.12 0.38 -0.23 -0.12 0.47 -0.23 -0.14 -0.11 -0.10 -0.09
max 4.09 4.46 5.67 4.02 4.45 5.24 6.11 6.09 6.19 5.41 5.44 5.67 7.06 7.45 7.71 5.26 5.34 5.79 7.25 7.30 7.57 6.32 6.47 7.42 5.35 5.53 5.83 4.02 4.36 5.89 4.04 4.64 5.99 8.74 8.89 9.37 3.56 3.93 5.21 7.82 8.52 10.18 6.06 5.90 6.82 5.40 0.0 0.0 6.65 6.93 7.46 5.33 5.53 5.87 7.24 7.12 7.65 5.24 5.53 5.85 6.78 7.35 8.26 6.96 7.26 7.84 8.18 8.35 9.00 6.70 7.11 7.69 4.48 8.02 7.72 7.97 7.96 9.30 8.11 8.48 9.21 4.09 4.28 4.93 5.68 5.62 6.02 5.45 5.54 5.69 3.66 4.46 4.05 4.24 2.96 3.13 4.48 2.84 2.84 0.29 6.28 12.99 0.32 4.07 11.56 0.26 4.32 0.22 7.05 0.36 4.35 7.92 0.51 4.53 7.29 9.20 11.42 0.33 4.12 12.50 0.50 4.78 7.36 9.71 10.80 11.92 0.29 6.43 13.42 0.30 4.57 8.36 0.38 4.28 8.04 0.47 4.29 7.35 9.30 10.13 10.99

Modelling

Model 1 : Interpretable Model : Logistic Regression

Baseline Logistic Regression Model

In [11]:
from sklearn.linear_model import LogisticRegression


baseline_model = LogisticRegression(random_state=100, class_weight='balanced') # class_weight='balanced' reweights the classes to offset the churn imbalance
baseline_model = baseline_model.fit(X_train, y_train)

y_train_pred = baseline_model.predict_proba(X_train)[:,1]
y_test_pred  = baseline_model.predict_proba(X_test)[:,1]
In [12]:
y_train_pred = pd.Series(y_train_pred, index=X_train.index) # convert train and test predictions to Series to preserve the index
y_test_pred = pd.Series(y_test_pred,index = X_test.index)
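
As a rough illustration of what class_weight='balanced' does, the sketch below restates scikit-learn's documented heuristic, n_samples / (n_classes * class_count), in plain Python. The class counts (19187 non-churn, 1820 churn) are the row sums of the train confusion matrix reported under Baseline Performance; the variable names are illustrative only.

```python
# Class counts in the (non-resampled) training set, read off the train confusion matrix rows
class_counts = {0: 16001 + 3186, 1: 326 + 1494}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# 'balanced' heuristic: weight_c = n_samples / (n_classes * count_c)
class_weights = {label: n_samples / (n_classes * count)
                 for label, count in class_counts.items()}

for label, weight in class_weights.items():
    print(f'class {label}: weight = {weight:.3f}')
```

Under these counts a churner weighs roughly ten times as much as a non-churner in the loss, which is consistent with the high recall and low precision reported for the baseline.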

Baseline Performance

In [13]:
# Function for Baseline Performance Metrics
import math
def model_metrics(matrix) :
    # matrix is a 2x2 confusion matrix in scikit-learn's layout: [[TN, FP], [FN, TP]]
    TN = matrix[0][0]
    TP = matrix[1][1]
    FP = matrix[0][1]
    FN = matrix[1][0]
    accuracy = round((TP + TN)/float(TP+TN+FP+FN),3)
    print('Accuracy :' ,accuracy )
    sensitivity = round(TP/float(FN + TP),3)
    print('Sensitivity / True Positive Rate / Recall :', sensitivity)
    specificity = round(TN/float(TN + FP),3)
    print('Specificity / True Negative Rate : ', specificity)
    precision = round(TP/float(TP + FP),3)
    print('Precision / Positive Predictive Value :', precision)
    print('F1-score :', round(2*precision*sensitivity/(precision + sensitivity),3))
In [14]:
# Prediction at threshold of 0.5 
classification_threshold = 0.5 
    
y_train_pred_classified = y_train_pred.map(lambda x : 1 if x > classification_threshold else 0)
y_test_pred_classified = y_test_pred.map(lambda x : 1 if x > classification_threshold else 0)
In [15]:
from sklearn.metrics import confusion_matrix
train_matrix = confusion_matrix(y_train, y_train_pred_classified)
print('Confusion Matrix for train:\n', train_matrix)
test_matrix = confusion_matrix(y_test, y_test_pred_classified)
print('\nConfusion Matrix for test: \n', test_matrix)
Confusion Matrix for train:
 [[16001  3186]
 [  326  1494]]

Confusion Matrix for test: 
 [[6090 2141]
 [ 149  624]]
In [16]:
# Baseline Model Performance : 

print('Train Performance : \n')
model_metrics(train_matrix)

print('\n\nTest Performance : \n')
model_metrics(test_matrix)
Train Performance : 

Accuracy : 0.833
Sensitivity / True Positive Rate / Recall : 0.821
Specificity / True Negative Rate :  0.834
Precision / Positive Predictive Value : 0.319
F1-score : 0.459


Test Performance : 

Accuracy : 0.746
Sensitivity / True Positive Rate / Recall : 0.807
Specificity / True Negative Rate :  0.74
Precision / Positive Predictive Value : 0.226
F1-score : 0.353
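
The reported test metrics can be cross-checked by hand from the test confusion matrix above. The sketch below is a plain-Python restatement of model_metrics (the helper name metrics_from_matrix is hypothetical) and assumes scikit-learn's layout [[TN, FP], [FN, TP]].

```python
def metrics_from_matrix(matrix):
    """Return the same rounded metrics that model_metrics prints, as a dict."""
    (tn, fp), (fn, tp) = matrix
    accuracy = round((tp + tn) / (tp + tn + fp + fn), 3)
    sensitivity = round(tp / (tp + fn), 3)
    specificity = round(tn / (tn + fp), 3)
    precision = round(tp / (tp + fp), 3)
    # F1 is computed from the rounded precision and sensitivity, as in model_metrics
    f1 = round(2 * precision * sensitivity / (precision + sensitivity), 3)
    return {
        'accuracy': accuracy,
        'sensitivity': sensitivity,
        'specificity': specificity,
        'precision': precision,
        'f1': f1,
    }

# Test confusion matrix at the 0.5 threshold, copied from the output above
test_metrics = metrics_from_matrix([[6090, 2141], [149, 624]])
print(test_metrics)
```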

Baseline Performance - Finding Optimum Probability Cutoff

In [17]:
# Specificity / Sensitivity Tradeoff 

# Classification at probability thresholds between 0 and 1 
y_train_pred_thres = pd.DataFrame(index=X_train.index)
thresholds = [float(x)/10 for x in range(10)]

def thresholder(x, thresh) :
    if x > thresh : 
        return 1 
    else : 
        return 0

    
for i in thresholds:
    y_train_pred_thres[i]= y_train_pred.map(lambda x : thresholder(x,i))
y_train_pred_thres.head()
Out[17]:
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
mobile_number
7000166926 1 1 1 1 1 0 0 0 0 0
7001343085 1 1 1 0 0 0 0 0 0 0
7001863283 1 1 0 0 0 0 0 0 0 0
7002275981 1 1 1 0 0 0 0 0 0 0
7001086221 1 0 0 0 0 0 0 0 0 0
In [18]:
# Sensitivity, specificity and accuracy for each threshold
metrics_df = pd.DataFrame(columns=['sensitivity', 'specificity', 'accuracy'])

# Function for calculation of metrics for each threshold
def model_metrics_thres(matrix) :
    TN = matrix[0][0]
    TP = matrix[1][1]
    FP = matrix[0][1]
    FN = matrix[1][0]
    accuracy = round((TP + TN)/float(TP+TN+FP+FN),3)
    sensitivity = round(TP/float(FN + TP),3)
    specificity = round(TN/float(TN + FP),3)
    return sensitivity,specificity,accuracy

# Generating a data frame of metrics for each threshold.
# Rows are collected in a list because DataFrame.append was removed in pandas 2.0.
metrics_rows = []
for column in y_train_pred_thres.columns.to_list() : 
    confusion = confusion_matrix(y_train, y_train_pred_thres.loc[:,column])
    sensitivity,specificity,accuracy = model_metrics_thres(confusion)
    metrics_rows.append({ 
        'sensitivity' : sensitivity,
        'specificity' : specificity,
        'accuracy' : accuracy
    })
    
metrics_df = pd.DataFrame(metrics_rows, columns=['sensitivity', 'specificity', 'accuracy'], index=thresholds)
metrics_df
Out[18]:
sensitivity specificity accuracy
0.0 1.000 0.000 0.087
0.1 0.974 0.345 0.399
0.2 0.947 0.523 0.560
0.3 0.910 0.658 0.680
0.4 0.868 0.763 0.772
0.5 0.821 0.834 0.833
0.6 0.770 0.883 0.873
0.7 0.677 0.921 0.899
0.8 0.493 0.953 0.913
0.9 0.234 0.981 0.916
In [19]:
metrics_df.plot(kind='line', figsize=(24,8), grid=True, xticks=np.arange(0,1,0.02),
                title='Specificity-Sensitivity TradeOff');
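
One way to read the crossover point off the Out[18] table without the plot is to interpolate linearly between the two thresholds where sensitivity minus specificity changes sign. The sketch below does this for the baseline values; the function name crossover_cutoff is hypothetical, and linear interpolation between grid points is an assumption, not necessarily how the cutoff used in this notebook was chosen.

```python
def crossover_cutoff(grid, sensitivity, specificity):
    """Linearly interpolate the threshold where sensitivity equals specificity."""
    gaps = [se - sp for se, sp in zip(sensitivity, specificity)]
    for i in range(len(gaps) - 1):
        if gaps[i] == 0:
            return grid[i]
        if gaps[i] > 0 > gaps[i + 1]:
            # Fraction of the interval at which the gap reaches zero
            frac = gaps[i] / (gaps[i] - gaps[i + 1])
            return grid[i] + frac * (grid[i + 1] - grid[i])
    raise ValueError('sensitivity and specificity do not cross on this grid')

# Values copied from the baseline metrics table (Out[18])
grid = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
sensitivity = [1.000, 0.974, 0.947, 0.910, 0.868, 0.821, 0.770, 0.677, 0.493, 0.234]
specificity = [0.000, 0.345, 0.523, 0.658, 0.763, 0.834, 0.883, 0.921, 0.953, 0.981]

baseline_cutoff = crossover_cutoff(grid, sensitivity, specificity)
print(round(baseline_cutoff, 2))
```

On these numbers the interpolated crossover lies between 0.4 and 0.5, at about 0.49.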

Baseline Performance at Optimum Cutoff

In [20]:
optimum_cutoff = 0.49
y_train_pred_final = y_train_pred.map(lambda x : 1 if x > optimum_cutoff else 0)
y_test_pred_final = y_test_pred.map(lambda x : 1 if x > optimum_cutoff else 0)

train_matrix = confusion_matrix(y_train, y_train_pred_final)
print('Confusion Matrix for train:\n', train_matrix)
test_matrix = confusion_matrix(y_test, y_test_pred_final)
print('\nConfusion Matrix for test: \n', test_matrix)
Confusion Matrix for train:
 [[15888  3299]
 [  318  1502]]

Confusion Matrix for test: 
 [[1329 6902]
 [  16  757]]
In [21]:
print('Train Performance: \n')
model_metrics(train_matrix)

print('\n\nTest Performance : \n')
model_metrics(test_matrix)
Train Performance: 

Accuracy : 0.828
Sensitivity / True Positive Rate / Recall : 0.825
Specificity / True Negative Rate :  0.828
Precision / Positive Predictive Value : 0.313
F1-score : 0.454


Test Performance : 

Accuracy : 0.232
Sensitivity / True Positive Rate / Recall : 0.979
Specificity / True Negative Rate :  0.161
Precision / Positive Predictive Value : 0.099
F1-score : 0.18
In [22]:
# ROC_AUC score 
from sklearn.metrics import roc_auc_score
print('ROC AUC score for Train : ',round(roc_auc_score(y_train, y_train_pred),3), '\n' )
print('ROC AUC score for Test : ',round(roc_auc_score(y_test, y_test_pred),3) )
ROC AUC score for Train :  0.891 

ROC AUC score for Test :  0.838

Feature Selection using RFE

In [23]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state=100 , class_weight='balanced')
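rfe = RFE(lr, n_features_to_select=15) # keyword argument; recent scikit-learn releases no longer accept it positionally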
results = rfe.fit(X_train,y_train)
results.support_
Out[23]:
array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False,  True,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,  True, False, False, False, False, False, False,
       False, False, False, False,  True,  True, False, False, False,
       False, False, False, False, False, False, False, False, False,
        True, False,  True, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
        True, False, False,  True, False, False,  True, False, False,
       False,  True, False, False,  True, False, False, False, False,
        True, False, False,  True, False, False, False, False, False,
       False, False, False,  True, False, False, False, False, False,
        True, False, False, False, False, False])
In [24]:
# DataFrame with features supported by RFE
rfe_support = pd.DataFrame({'Column' : X_train.columns.to_list(),   # columns of the frame RFE was fit on
                            'Rank' : rfe.ranking_,
                            'Support' : rfe.support_}).sort_values(by='Rank', ascending=True)
rfe_support
Out[24]:
Column Rank Support
99 sachet_3g_6_0 1 True
120 sachet_2g_7_0 1 True
102 monthly_2g_7_0 1 True
135 sachet_2g_8_0 1 True
81 total_rech_num_6 1 True
129 monthly_3g_8_0 1 True
105 monthly_2g_8_0 1 True
83 total_rech_num_8 1 True
117 monthly_2g_6_0 1 True
68 std_ic_t2f_mou_8 1 True
67 std_ic_t2f_mou_7 1 True
112 sachet_2g_6_0 1 True
109 monthly_3g_7_0 1 True
56 loc_ic_t2f_mou_8 1 True
35 std_og_t2f_mou_8 1 True
40 isd_og_mou_7 2 False
53 loc_ic_t2m_mou_8 3 False
19 loc_og_t2f_mou_7 4 False
62 std_ic_t2t_mou_8 5 False
61 std_ic_t2t_mou_7 6 False
107 sachet_3g_8_0 7 False
41 isd_og_mou_8 8 False
89 last_day_rch_amt_8 9 False
11 roam_og_mou_8 10 False
132 monthly_3g_6_0 11 False
39 isd_og_mou_6 12 False
79 ic_others_7 13 False
50 loc_ic_t2t_mou_8 14 False
7 roam_ic_mou_7 15 False
58 loc_ic_mou_7 16 False
71 std_ic_mou_8 17 False
75 isd_ic_mou_6 18 False
33 std_og_t2f_mou_6 19 False
38 std_og_mou_8 20 False
66 std_ic_t2f_mou_6 21 False
29 std_og_t2t_mou_8 22 False
32 std_og_t2m_mou_8 23 False
78 ic_others_6 24 False
44 spl_og_mou_8 25 False
97 delta_arpu 26 False
85 max_rech_amt_7 27 False
70 std_ic_mou_7 28 False
64 std_ic_t2m_mou_7 29 False
30 std_og_t2m_mou_6 30 False
42 spl_og_mou_6 31 False
27 std_og_t2t_mou_6 32 False
18 loc_og_t2f_mou_6 33 False
60 std_ic_t2t_mou_6 34 False
36 std_og_mou_6 35 False
51 loc_ic_t2m_mou_6 36 False
15 loc_og_t2m_mou_6 37 False
94 delta_total_og_mou 38 False
69 std_ic_mou_6 39 False
65 std_ic_t2m_mou_8 40 False
2 onnet_mou_8 41 False
55 loc_ic_t2f_mou_7 42 False
28 std_og_t2t_mou_7 43 False
13 loc_og_t2t_mou_7 44 False
1 onnet_mou_7 45 False
9 roam_og_mou_6 46 False
21 loc_og_t2c_mou_6 47 False
14 loc_og_t2t_mou_8 48 False
84 max_rech_amt_6 49 False
26 loc_og_mou_8 50 False
8 roam_ic_mou_8 51 False
10 roam_og_mou_7 52 False
48 loc_ic_t2t_mou_6 53 False
57 loc_ic_mou_6 54 False
6 roam_ic_mou_6 55 False
106 monthly_2g_8_1 56 False
87 last_day_rch_amt_6 57 False
49 loc_ic_t2t_mou_7 58 False
98 delta_total_rech_amt 59 False
88 last_day_rch_amt_7 60 False
34 std_og_t2f_mou_7 61 False
126 sachet_3g_7_0 62 False
23 loc_og_t2c_mou_8 63 False
103 monthly_2g_7_1 64 False
118 monthly_2g_6_1 65 False
92 delta_vol_2g 66 False
16 loc_og_t2m_mou_7 67 False
4 offnet_mou_7 68 False
43 spl_og_mou_7 69 False
130 monthly_3g_8_1 70 False
20 loc_og_t2f_mou_8 71 False
17 loc_og_t2m_mou_8 72 False
63 std_ic_t2m_mou_6 73 False
93 delta_vol_3g 74 False
76 isd_ic_mou_7 75 False
24 loc_og_mou_6 76 False
12 loc_og_t2t_mou_6 77 False
54 loc_ic_t2f_mou_6 78 False
0 onnet_mou_6 79 False
3 offnet_mou_6 80 False
77 isd_ic_mou_8 81 False
5 offnet_mou_8 82 False
22 loc_og_t2c_mou_7 83 False
95 delta_total_ic_mou 84 False
52 loc_ic_t2m_mou_7 85 False
59 loc_ic_mou_8 86 False
90 aon 87 False
74 spl_ic_mou_8 88 False
136 sachet_2g_8_1 89 False
121 sachet_2g_7_1 90 False
113 sachet_2g_6_1 91 False
108 sachet_3g_8_1 92 False
80 ic_others_8 93 False
137 sachet_2g_8_2 94 False
138 sachet_2g_8_3 95 False
114 sachet_2g_6_2 96 False
123 sachet_2g_7_3 97 False
133 monthly_3g_6_1 98 False
125 sachet_2g_7_5 99 False
131 monthly_3g_8_2 100 False
119 monthly_2g_6_2 101 False
25 loc_og_mou_7 102 False
104 monthly_2g_7_2 103 False
110 monthly_3g_7_1 104 False
100 sachet_3g_6_1 105 False
139 sachet_2g_8_4 106 False
134 monthly_3g_6_2 107 False
111 monthly_3g_7_2 108 False
37 std_og_mou_7 109 False
31 std_og_t2m_mou_7 110 False
140 sachet_2g_8_5 111 False
101 sachet_3g_6_2 112 False
72 spl_ic_mou_6 113 False
86 max_rech_amt_8 114 False
73 spl_ic_mou_7 115 False
96 delta_vbc_3g 116 False
82 total_rech_num_7 117 False
115 sachet_2g_6_3 118 False
124 sachet_2g_7_4 119 False
127 sachet_3g_7_1 120 False
91 Average_rech_amt_6n7 121 False
45 og_others_6 122 False
116 sachet_2g_6_4 123 False
128 sachet_3g_7_2 124 False
122 sachet_2g_7_2 125 False
47 og_others_8 126 False
46 og_others_7 127 False
In [25]:
# RFE Selected columns
rfe_selected_columns = rfe_support.loc[rfe_support['Rank'] == 1,'Column'].to_list()
rfe_selected_columns
Out[25]:
['sachet_3g_6_0',
 'sachet_2g_7_0',
 'monthly_2g_7_0',
 'sachet_2g_8_0',
 'total_rech_num_6',
 'monthly_3g_8_0',
 'monthly_2g_8_0',
 'total_rech_num_8',
 'monthly_2g_6_0',
 'std_ic_t2f_mou_8',
 'std_ic_t2f_mou_7',
 'sachet_2g_6_0',
 'monthly_3g_7_0',
 'loc_ic_t2f_mou_8',
 'std_og_t2f_mou_8']
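
The selection above relies on RFE marking every retained feature with rank 1, while eliminated features get larger ranks the earlier they were dropped. The following is a minimal sketch of that behaviour on synthetic data; the dataset, the variable names and the choice of 3 features out of 8 are assumptions for illustration and have nothing to do with the telecom data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real feature matrix (illustration only)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=100
)

demo_rfe = RFE(
    LogisticRegression(class_weight='balanced', max_iter=1000, random_state=100),
    n_features_to_select=3,
)
demo_rfe.fit(X_demo, y_demo)

# support_ flags the retained features; ranking_ is 1 for each of them
kept = [i for i, keep in enumerate(demo_rfe.support_) if keep]
print('Retained feature indices:', kept)
print('Ranking:', demo_rfe.ranking_.tolist())
```

With the default step of 1, one feature is dropped per iteration, so the eliminated features receive the distinct ranks 2, 3, and so on.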

Logistic Regression with RFE Selected Columns

Model I

In [26]:
# Logistic Regression Model with RFE columns
import statsmodels.api as sm 

# Note: the SMOTE-resampled train set is used with statsmodels.api.GLM, since GLM does not support class_weight
logr = sm.GLM(y_train_resampled,(sm.add_constant(X_train_resampled[rfe_selected_columns])), family = sm.families.Binomial())
logr_fit = logr.fit()
logr_fit.summary()
Out[26]:
Generalized Linear Model Regression Results
Dep. Variable: Churn No. Observations: 38374
Model: GLM Df Residuals: 38358
Model Family: Binomial Df Model: 15
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -19485.
Date: Mon, 30 Nov 2020 Deviance: 38969.
Time: 21:57:09 Pearson chi2: 2.80e+05
No. Iterations: 7
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.2334 0.015 -15.657 0.000 -0.263 -0.204
sachet_3g_6_0 -0.0396 0.014 -2.886 0.004 -0.066 -0.013
sachet_2g_7_0 -0.0980 0.016 -6.201 0.000 -0.129 -0.067
monthly_2g_7_0 0.0096 0.016 0.594 0.552 -0.022 0.041
sachet_2g_8_0 0.0489 0.015 3.359 0.001 0.020 0.077
total_rech_num_6 0.6047 0.017 35.547 0.000 0.571 0.638
monthly_3g_8_0 0.3993 0.017 23.439 0.000 0.366 0.433
monthly_2g_8_0 0.3697 0.018 21.100 0.000 0.335 0.404
total_rech_num_8 -1.2013 0.019 -62.378 0.000 -1.239 -1.164
monthly_2g_6_0 -0.0194 0.015 -1.262 0.207 -0.050 0.011
std_ic_t2f_mou_8 -0.3364 0.026 -12.792 0.000 -0.388 -0.285
std_ic_t2f_mou_7 0.1535 0.019 8.148 0.000 0.117 0.190
sachet_2g_6_0 -0.1117 0.016 -6.847 0.000 -0.144 -0.080
monthly_3g_7_0 -0.2094 0.017 -12.602 0.000 -0.242 -0.177
loc_ic_t2f_mou_8 -1.2743 0.038 -33.599 0.000 -1.349 -1.200
std_og_t2f_mou_8 -0.2476 0.021 -11.621 0.000 -0.289 -0.206

Logistic Regression with Manual Feature Elimination

In [27]:
# Using p-values and VIF for manual feature elimination

from statsmodels.stats.outliers_influence import variance_inflation_factor
def vif(X_train_resampled, logr_fit, selected_columns) : 
    # Use the columns passed in rather than the global rfe_selected_columns list
    X_selected = X_train_resampled[selected_columns]
    vif_df = pd.DataFrame()
    vif_df['Features'] = selected_columns
    vif_df['VIF'] = [variance_inflation_factor(X_selected.values, i) for i in range(X_selected.shape[1])]
    vif_df['VIF'] = round(vif_df['VIF'], 2)
    vif_df = vif_df.set_index('Features')
    vif_df['P-value'] = round(logr_fit.pvalues,4)  # aligned on the feature names in the index
    vif_df = vif_df.sort_values(by = ["VIF",'P-value'], ascending = [False,False])
    return vif_df

vif(X_train_resampled, logr_fit, rfe_selected_columns)
Out[27]:
VIF P-value
Features
std_ic_t2f_mou_8 1.66 0.0000
sachet_2g_6_0 1.64 0.0000
sachet_2g_7_0 1.57 0.0000
std_ic_t2f_mou_7 1.56 0.0000
monthly_2g_7_0 1.54 0.5524
monthly_3g_7_0 1.54 0.0000
monthly_3g_8_0 1.52 0.0000
monthly_2g_8_0 1.43 0.0000
monthly_2g_6_0 1.38 0.2069
sachet_2g_8_0 1.36 0.0008
total_rech_num_6 1.27 0.0000
total_rech_num_8 1.25 0.0000
std_og_t2f_mou_8 1.20 0.0000
sachet_3g_6_0 1.12 0.0039
loc_ic_t2f_mou_8 1.09 0.0000
  • 'monthly_2g_7_0' has a very high p-value (0.5524). Hence, this feature can be eliminated.
In [28]:
selected_columns = rfe_selected_columns # same list object, so the removal below also updates rfe_selected_columns
selected_columns.remove('monthly_2g_7_0')
selected_columns
Out[28]:
['sachet_3g_6_0',
 'sachet_2g_7_0',
 'sachet_2g_8_0',
 'total_rech_num_6',
 'monthly_3g_8_0',
 'monthly_2g_8_0',
 'total_rech_num_8',
 'monthly_2g_6_0',
 'std_ic_t2f_mou_8',
 'std_ic_t2f_mou_7',
 'sachet_2g_6_0',
 'monthly_3g_7_0',
 'loc_ic_t2f_mou_8',
 'std_og_t2f_mou_8']

Model II

In [29]:
logr2 = sm.GLM(y_train_resampled,(sm.add_constant(X_train_resampled[selected_columns])), family = sm.families.Binomial())
logr2_fit = logr2.fit()
logr2_fit.summary()
Out[29]:
Generalized Linear Model Regression Results
Dep. Variable: Churn No. Observations: 38374
Model: GLM Df Residuals: 38359
Model Family: Binomial Df Model: 14
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -19485.
Date: Mon, 30 Nov 2020 Deviance: 38970.
Time: 21:57:09 Pearson chi2: 2.80e+05
No. Iterations: 7
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.2335 0.015 -15.662 0.000 -0.263 -0.204
sachet_3g_6_0 -0.0395 0.014 -2.881 0.004 -0.066 -0.013
sachet_2g_7_0 -0.0982 0.016 -6.217 0.000 -0.129 -0.067
sachet_2g_8_0 0.0491 0.015 3.372 0.001 0.021 0.078
total_rech_num_6 0.6049 0.017 35.566 0.000 0.572 0.638
monthly_3g_8_0 0.4000 0.017 23.521 0.000 0.367 0.433
monthly_2g_8_0 0.3733 0.016 22.696 0.000 0.341 0.406
total_rech_num_8 -1.2012 0.019 -62.375 0.000 -1.239 -1.163
monthly_2g_6_0 -0.0163 0.014 -1.128 0.259 -0.045 0.012
std_ic_t2f_mou_8 -0.3361 0.026 -12.784 0.000 -0.388 -0.285
std_ic_t2f_mou_7 0.1532 0.019 8.136 0.000 0.116 0.190
sachet_2g_6_0 -0.1111 0.016 -6.823 0.000 -0.143 -0.079
monthly_3g_7_0 -0.2098 0.017 -12.633 0.000 -0.242 -0.177
loc_ic_t2f_mou_8 -1.2749 0.038 -33.622 0.000 -1.349 -1.201
std_og_t2f_mou_8 -0.2476 0.021 -11.620 0.000 -0.289 -0.206
In [30]:
# vif and p-values
vif(X_train_resampled, logr2_fit, selected_columns)
Out[30]:
VIF P-value
Features
std_ic_t2f_mou_8 1.66 0.0000
sachet_2g_6_0 1.63 0.0000
sachet_2g_7_0 1.57 0.0000
std_ic_t2f_mou_7 1.56 0.0000
monthly_3g_7_0 1.54 0.0000
monthly_3g_8_0 1.52 0.0000
sachet_2g_8_0 1.36 0.0007
total_rech_num_6 1.27 0.0000
total_rech_num_8 1.25 0.0000
monthly_2g_8_0 1.23 0.0000
monthly_2g_6_0 1.21 0.2595
std_og_t2f_mou_8 1.20 0.0000
sachet_3g_6_0 1.12 0.0040
loc_ic_t2f_mou_8 1.09 0.0000
  • 'monthly_2g_6_0' has a very high p-value (0.2595). Hence, this feature can be eliminated.
In [31]:
selected_columns.remove('monthly_2g_6_0')
selected_columns
Out[31]:
['sachet_3g_6_0',
 'sachet_2g_7_0',
 'sachet_2g_8_0',
 'total_rech_num_6',
 'monthly_3g_8_0',
 'monthly_2g_8_0',
 'total_rech_num_8',
 'std_ic_t2f_mou_8',
 'std_ic_t2f_mou_7',
 'sachet_2g_6_0',
 'monthly_3g_7_0',
 'loc_ic_t2f_mou_8',
 'std_og_t2f_mou_8']

Model III

In [32]:
logr3 = sm.GLM(y_train_resampled,(sm.add_constant(X_train_resampled[selected_columns])), family = sm.families.Binomial())
logr3_fit = logr3.fit()
logr3_fit.summary()
Out[32]:
Generalized Linear Model Regression Results
Dep. Variable: Churn No. Observations: 38374
Model: GLM Df Residuals: 38360
Model Family: Binomial Df Model: 13
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -19486.
Date: Mon, 30 Nov 2020 Deviance: 38971.
Time: 21:57:10 Pearson chi2: 2.79e+05
No. Iterations: 7
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.2336 0.015 -15.667 0.000 -0.263 -0.204
sachet_3g_6_0 -0.0399 0.014 -2.916 0.004 -0.067 -0.013
sachet_2g_7_0 -0.0987 0.016 -6.249 0.000 -0.130 -0.068
sachet_2g_8_0 0.0488 0.015 3.354 0.001 0.020 0.077
total_rech_num_6 0.6053 0.017 35.581 0.000 0.572 0.639
monthly_3g_8_0 0.3994 0.017 23.494 0.000 0.366 0.433
monthly_2g_8_0 0.3666 0.015 23.953 0.000 0.337 0.397
total_rech_num_8 -1.2033 0.019 -62.720 0.000 -1.241 -1.166
std_ic_t2f_mou_8 -0.3363 0.026 -12.788 0.000 -0.388 -0.285
std_ic_t2f_mou_7 0.1532 0.019 8.137 0.000 0.116 0.190
sachet_2g_6_0 -0.1108 0.016 -6.810 0.000 -0.143 -0.079
monthly_3g_7_0 -0.2099 0.017 -12.640 0.000 -0.242 -0.177
loc_ic_t2f_mou_8 -1.2736 0.038 -33.621 0.000 -1.348 -1.199
std_og_t2f_mou_8 -0.2474 0.021 -11.617 0.000 -0.289 -0.206
In [33]:
# vif and p-values
vif(X_train_resampled, logr3_fit, selected_columns)
Out[33]:
VIF P-value
Features
std_ic_t2f_mou_8 1.66 0.0000
sachet_2g_6_0 1.63 0.0000
sachet_2g_7_0 1.57 0.0000
std_ic_t2f_mou_7 1.56 0.0000
monthly_3g_7_0 1.54 0.0000
monthly_3g_8_0 1.52 0.0000
sachet_2g_8_0 1.36 0.0008
total_rech_num_6 1.27 0.0000
total_rech_num_8 1.24 0.0000
std_og_t2f_mou_8 1.20 0.0000
sachet_3g_6_0 1.12 0.0035
loc_ic_t2f_mou_8 1.09 0.0000
monthly_2g_8_0 1.03 0.0000
  • All remaining features have low p-values (< 0.05) and low VIF (< 5).
  • This model could be used as the interpretable logistic regression model.
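
For intuition on the VIF < 5 rule of thumb: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features, so a VIF of 5 corresponds to R_j^2 = 0.8. In the special case of only two features, R^2 reduces to the squared Pearson correlation, which the plain-Python sketch below uses. The toy data and helper names are assumptions for illustration; the notebook itself uses statsmodels' variance_inflation_factor on all selected columns.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def vif_two_features(x, y):
    """VIF of either feature when the model has exactly these two features."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r ** 2)

# Toy features whose correlation is 0.8
feature_a = [1, 2, 3, 4, 5]
feature_b = [2, 1, 4, 3, 5]

print(round(vif_two_features(feature_a, feature_b), 2))
```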

Final Logistic Regression Model with RFE and Manual Elimination

In [34]:
logr3_fit.summary()
Out[34]:
Generalized Linear Model Regression Results
Dep. Variable: Churn No. Observations: 38374
Model: GLM Df Residuals: 38360
Model Family: Binomial Df Model: 13
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -19486.
Date: Mon, 30 Nov 2020 Deviance: 38971.
Time: 21:57:10 Pearson chi2: 2.79e+05
No. Iterations: 7
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.2336 0.015 -15.667 0.000 -0.263 -0.204
sachet_3g_6_0 -0.0399 0.014 -2.916 0.004 -0.067 -0.013
sachet_2g_7_0 -0.0987 0.016 -6.249 0.000 -0.130 -0.068
sachet_2g_8_0 0.0488 0.015 3.354 0.001 0.020 0.077
total_rech_num_6 0.6053 0.017 35.581 0.000 0.572 0.639
monthly_3g_8_0 0.3994 0.017 23.494 0.000 0.366 0.433
monthly_2g_8_0 0.3666 0.015 23.953 0.000 0.337 0.397
total_rech_num_8 -1.2033 0.019 -62.720 0.000 -1.241 -1.166
std_ic_t2f_mou_8 -0.3363 0.026 -12.788 0.000 -0.388 -0.285
std_ic_t2f_mou_7 0.1532 0.019 8.137 0.000 0.116 0.190
sachet_2g_6_0 -0.1108 0.016 -6.810 0.000 -0.143 -0.079
monthly_3g_7_0 -0.2099 0.017 -12.640 0.000 -0.242 -0.177
loc_ic_t2f_mou_8 -1.2736 0.038 -33.621 0.000 -1.348 -1.199
std_og_t2f_mou_8 -0.2474 0.021 -11.617 0.000 -0.289 -0.206
In [35]:
selected_columns
Out[35]:
['sachet_3g_6_0',
 'sachet_2g_7_0',
 'sachet_2g_8_0',
 'total_rech_num_6',
 'monthly_3g_8_0',
 'monthly_2g_8_0',
 'total_rech_num_8',
 'std_ic_t2f_mou_8',
 'std_ic_t2f_mou_7',
 'sachet_2g_6_0',
 'monthly_3g_7_0',
 'loc_ic_t2f_mou_8',
 'std_og_t2f_mou_8']
In [36]:
# Prediction 
y_train_pred_lr = logr3_fit.predict(sm.add_constant(X_train_resampled[selected_columns]))
y_train_pred_lr.head()
Out[36]:
0    0.118916
1    0.343873
2    0.381230
3    0.015277
4    0.001595
dtype: float64
In [37]:
y_test_pred_lr = logr3_fit.predict(sm.add_constant(X_test[selected_columns]))
y_test_pred_lr.head()
Out[37]:
mobile_number
7002242818    0.013556
7000517161    0.903162
7002162382    0.247123
7002152271    0.330787
7002058655    0.056105
dtype: float64

Performance

Finding Optimum Probability Cutoff

In [38]:
# Specificity / Sensitivity Tradeoff 

# Classification at probability thresholds between 0 and 1 
y_train_pred_thres = pd.DataFrame(index=X_train_resampled.index)
thresholds = [float(x)/10 for x in range(10)]

def thresholder(x, thresh) :
    if x > thresh : 
        return 1 
    else : 
        return 0

    
for i in thresholds:
    y_train_pred_thres[i]= y_train_pred_lr.map(lambda x : thresholder(x,i))
y_train_pred_thres.head()
Out[38]:
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
0 1 1 0 0 0 0 0 0 0 0
1 1 1 1 1 0 0 0 0 0 0
2 1 1 1 1 0 0 0 0 0 0
3 1 0 0 0 0 0 0 0 0 0
4 1 0 0 0 0 0 0 0 0 0
In [39]:
# DataFrame for Performance metrics at each threshold

# Rows are collected in a list because DataFrame.append was removed in pandas 2.0
logr_metrics_rows = []
for column in y_train_pred_thres.columns.to_list() : 
    confusion = confusion_matrix(y_train_resampled, y_train_pred_thres.loc[:,column])
    sensitivity,specificity,accuracy = model_metrics_thres(confusion)
    logr_metrics_rows.append({ 
        'sensitivity' : sensitivity,
        'specificity' : specificity,
        'accuracy' : accuracy
    })
    
logr_metrics_df = pd.DataFrame(logr_metrics_rows, columns=['sensitivity', 'specificity', 'accuracy'], index=thresholds)
logr_metrics_df
Out[39]:
sensitivity specificity accuracy
0.0 1.000 0.000 0.500
0.1 0.976 0.224 0.600
0.2 0.947 0.351 0.649
0.3 0.916 0.472 0.694
0.4 0.864 0.598 0.731
0.5 0.794 0.722 0.758
0.6 0.703 0.841 0.772
0.7 0.550 0.930 0.740
0.8 0.310 0.975 0.642
0.9 0.095 0.994 0.544
In [40]:
logr_metrics_df.plot(kind='line', figsize=(24,8), grid=True, xticks=np.arange(0,1,0.02),
                title='Specificity-Sensitivity TradeOff');
  • The optimum probability cutoff for the logistic regression model is 0.53, approximately where the sensitivity and specificity curves cross.
In [41]:
optimum_cutoff = 0.53
y_train_pred_lr_final = y_train_pred_lr.map(lambda x : 1 if x > optimum_cutoff else 0)
y_test_pred_lr_final = y_test_pred_lr.map(lambda x : 1 if x > optimum_cutoff else 0)

train_matrix = confusion_matrix(y_train_resampled, y_train_pred_lr_final)
print('Confusion Matrix for train:\n', train_matrix)
test_matrix = confusion_matrix(y_test, y_test_pred_lr_final)
print('\nConfusion Matrix for test: \n', test_matrix)
Confusion Matrix for train:
 [[14531  4656]
 [ 4411 14776]]

Confusion Matrix for test: 
 [[6313 1918]
 [ 191  582]]
In [42]:
print('Train Performance: \n')
model_metrics(train_matrix)

print('\n\nTest Performance : \n')
model_metrics(test_matrix)
Train Performance: 

Accuracy : 0.764
Sensitivity / True Positive Rate / Recall : 0.77
Specificity / True Negative Rate :  0.757
Precision / Positive Predictive Value : 0.76
F1-score : 0.765


Test Performance : 

Accuracy : 0.766
Sensitivity / True Positive Rate / Recall : 0.753
Specificity / True Negative Rate :  0.767
Precision / Positive Predictive Value : 0.233
F1-score : 0.356
In [43]:
# ROC_AUC score 
print('ROC AUC score for Train : ',round(roc_auc_score(y_train_resampled, y_train_pred_lr),3), '\n' )
print('ROC AUC score for Test : ',round(roc_auc_score(y_test, y_test_pred_lr),3) )
ROC AUC score for Train :  0.843 

ROC AUC score for Test :  0.828
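
The gap between train precision (0.76) and test precision (0.233), despite similar sensitivity and specificity, is largely a prevalence effect: the SMOTE-resampled train set is balanced (19187 rows per class), while churners make up only 773 of the 9004 test customers. The sketch below applies Bayes' rule to the reported rates; the function name is hypothetical and the inputs are the rounded figures printed above.

```python
def precision_from_rates(sensitivity, specificity, prevalence):
    """Positive predictive value implied by sensitivity, specificity and class prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Balanced (resampled) train set versus the imbalanced test set
train_precision = precision_from_rates(0.770, 0.757, prevalence=0.5)
test_precision = precision_from_rates(0.753, 0.767, prevalence=773 / 9004)

print(round(train_precision, 2), round(test_precision, 2))
```

Both values land close to the precision figures printed by model_metrics, which suggests the drop on the test set reflects class imbalance rather than a loss of discriminative power.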

Model 1 : Logistic Regression (Interpretable Model Summary)

In [44]:
from io import StringIO

lr_summary_html = logr3_fit.summary().tables[1].as_html()
# Wrap the HTML in StringIO: passing a literal HTML string to read_html is deprecated in recent pandas
lr_results = pd.read_html(StringIO(lr_summary_html), header=0, index_col=0)[0]
coef_column = lr_results.columns[0]
print('Most important predictors of churn, in order of importance, with their coefficients : \n')
lr_results.sort_values(by=coef_column, key=lambda x: abs(x), ascending=False)['coef']
Most important predictors of churn, in order of importance, with their coefficients : 

Out[44]:
loc_ic_t2f_mou_8   -1.2736
total_rech_num_8   -1.2033
total_rech_num_6    0.6053
monthly_3g_8_0      0.3994
monthly_2g_8_0      0.3666
std_ic_t2f_mou_8   -0.3363
std_og_t2f_mou_8   -0.2474
const              -0.2336
monthly_3g_7_0     -0.2099
std_ic_t2f_mou_7    0.1532
sachet_2g_6_0      -0.1108
sachet_2g_7_0      -0.0987
sachet_2g_8_0       0.0488
sachet_3g_6_0      -0.0399
Name: coef, dtype: float64
  • The above model could be used as the interpretable model for predicting telecom churn.
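
Because this is a logit model, each coefficient can be read as a multiplicative effect on the odds of churn via exp(coef). The sketch below does this for three coefficients copied from the summary above. Treating a unit change as one standard deviation assumes the features were standardised before fitting, which the summary statistics earlier in this section suggest; that reading is an assumption rather than something the model output states.

```python
import math

# Coefficients copied from the final model summary (Out[44])
coefficients = {
    'loc_ic_t2f_mou_8': -1.2736,
    'total_rech_num_8': -1.2033,
    'total_rech_num_6': 0.6053,
}

# exp(coef) is the factor by which the odds of churn change per unit increase
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f'{name}: odds ratio = {ratio:.2f}')
```

An odds ratio below 1 (for example for total_rech_num_8) means higher values are associated with lower odds of churn, holding the other features fixed; a ratio above 1 means the opposite.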

PCA

In [45]:
from sklearn.decomposition import PCA 
pca = PCA(random_state = 42) 
pca.fit(X_train) # note that pca is fit on original train set instead of resampled train set. 
pca.components_
Out[45]:
array([[ 1.64887430e-01,  1.93987506e-01,  1.67239205e-01, ...,
         1.43967238e-06, -1.55704675e-06, -1.88892194e-06],
       [ 6.48591961e-02,  9.55966684e-02,  1.20775174e-01, ...,
        -2.12841595e-06, -1.47944145e-06, -3.90881587e-07],
       [ 2.38415388e-01,  2.73645507e-01,  2.38436263e-01, ...,
        -1.25598531e-06, -4.37900299e-07,  6.19889336e-07],
       ...,
       [ 1.68015588e-06,  1.93600851e-06, -1.82065762e-06, ...,
         4.25473944e-03,  2.56738368e-03,  3.51118176e-03],
       [ 0.00000000e+00, -1.11533905e-16,  1.57807487e-16, ...,
         1.73764144e-15,  6.22907679e-16,  1.45339158e-16],
       [ 0.00000000e+00,  4.98537742e-16, -6.02718139e-16, ...,
         1.27514583e-15,  1.25772226e-15,  3.41773342e-16]])
In [46]:
pca.explained_variance_ratio_
Out[46]:
array([2.72067612e-01, 1.62438240e-01, 1.20827535e-01, 1.06070063e-01,
       9.11349433e-02, 4.77504400e-02, 2.63978655e-02, 2.56843982e-02,
       1.91789343e-02, 1.68045932e-02, 1.55523468e-02, 1.31676589e-02,
       1.04552128e-02, 7.72970448e-03, 7.22746863e-03, 6.14494838e-03,
       5.62073089e-03, 5.44579273e-03, 4.59009989e-03, 4.38488162e-03,
       3.46703626e-03, 3.27941490e-03, 2.78099200e-03, 2.13444270e-03,
       2.07542043e-03, 1.89794720e-03, 1.41383936e-03, 1.30240760e-03,
       1.15369576e-03, 1.05262500e-03, 9.64293417e-04, 9.16686049e-04,
       8.84067044e-04, 7.62966236e-04, 6.61794767e-04, 5.69667265e-04,
       5.12585166e-04, 5.04441248e-04, 4.82396680e-04, 4.46889495e-04,
       4.36441254e-04, 4.10389488e-04, 3.51844810e-04, 3.12626195e-04,
       2.51673027e-04, 2.34723896e-04, 1.96950034e-04, 1.71296745e-04,
       1.59882693e-04, 1.48330353e-04, 1.45919483e-04, 1.08583729e-04,
       1.04038518e-04, 8.90621848e-05, 8.53009223e-05, 7.60704088e-05,
       7.57150133e-05, 6.16615717e-05, 6.07777411e-05, 5.70517541e-05,
       5.36161089e-05, 5.28495367e-05, 5.14887086e-05, 4.73768570e-05,
       4.71283394e-05, 4.11523975e-05, 4.10392906e-05, 2.86090257e-05,
       2.19793282e-05, 1.58203581e-05, 1.50969788e-05, 1.42865579e-05,
       1.34537530e-05, 1.33026062e-05, 1.10239870e-05, 8.27539516e-06,
       7.55845974e-06, 6.45372276e-06, 6.22570067e-06, 3.42288900e-06,
       3.20804681e-06, 3.09270863e-06, 2.86608967e-06, 2.44898003e-06,
       2.08230568e-06, 1.85144734e-06, 1.64714248e-06, 1.45630245e-06,
       1.35265729e-06, 1.05472047e-06, 9.89133015e-07, 8.65864423e-07,
       7.45065121e-07, 3.66727807e-07, 6.49277820e-08, 6.13357428e-08,
       4.35995018e-08, 2.28152900e-08, 2.00441141e-08, 1.84235145e-08,
       1.66102335e-08, 1.47870989e-08, 1.23390691e-08, 1.12094165e-08,
       1.09702422e-08, 9.51924270e-09, 8.61596309e-09, 7.38051070e-09,
       7.15370081e-09, 6.29095319e-09, 5.00739371e-09, 4.68791660e-09,
       4.23376173e-09, 4.04558169e-09, 3.75847771e-09, 3.71213838e-09,
       3.32806929e-09, 3.23527525e-09, 3.12734302e-09, 2.82062311e-09,
       2.72602311e-09, 2.66103741e-09, 2.46562734e-09, 2.20243536e-09,
       2.15044476e-09, 1.59498492e-09, 1.47087974e-09, 1.06159357e-09,
       9.33938436e-10, 8.10080735e-10, 8.04656028e-10, 6.12994365e-10,
       4.82074297e-10, 4.02577318e-10, 3.58059984e-10, 3.28374076e-10,
       3.03687605e-10, 7.12091816e-11, 6.13978255e-11, 1.04375208e-33,
       1.04375208e-33])

Scree Plot

In [47]:
var_cum = np.cumsum(pca.explained_variance_ratio_)
plt.figure(figsize=(20,8))
sns.set_style('darkgrid')
sns.lineplot(x=np.arange(1, len(var_cum) + 1), y=var_cum)
plt.xticks(np.arange(0,140,5))
plt.axhline(0.95,color='r')
plt.axhline(1.0,color='r')
plt.axvline(15,color='b')
plt.axvline(45,color='b')
plt.text(10,0.96,'0.95')

plt.title('Scree Plot of Telecom Churn Train Set');
  • From the above scree plot, roughly 95% of the variance in the train set is explained by the first 16 principal components, and nearly all of it (over 99.5%) by the first 45 principal components.
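
The number of components needed for a given share of variance can also be read off programmatically instead of from the plot. This is a minimal sketch on illustrative ratios, not the notebook's `pca.explained_variance_ratio_`; `components_for_variance` is a hypothetical helper.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, threshold):
    """Smallest number of leading components whose cumulative ratio >= threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted returns the first index where cumulative >= threshold.
    return int(np.searchsorted(cumulative, threshold, side="left") + 1)

toy_ratios = np.array([0.5, 0.3, 0.15, 0.05])  # illustrative values only
print(components_for_variance(toy_ratios, 0.90))
```

In the notebook the same helper could be called on `pca.explained_variance_ratio_`. scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components`, which selects the number of components by explained variance.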
In [48]:
# Perform PCA using the first 45 components
pca_final = PCA(n_components=45, random_state=42)
transformed_data = pca_final.fit_transform(X_train)
X_train_pca = pd.DataFrame(transformed_data, columns=["PC_"+str(x) for x in range(1,46)], index = X_train.index)
data_train_pca = pd.concat([X_train_pca, y_train], axis=1)

data_train_pca.head()
Out[48]:
PC_1 PC_2 PC_3 PC_4 PC_5 PC_6 PC_7 PC_8 PC_9 PC_10 PC_11 PC_12 PC_13 PC_14 PC_15 PC_16 PC_17 PC_18 PC_19 PC_20 PC_21 PC_22 PC_23 PC_24 PC_25 PC_26 PC_27 PC_28 PC_29 PC_30 PC_31 PC_32 PC_33 PC_34 PC_35 PC_36 PC_37 PC_38 PC_39 PC_40 PC_41 PC_42 PC_43 PC_44 PC_45 Churn
mobile_number
7000166926 -907.572208 -342.923676 13.094442 58.813506 -95.616159 -1050.535219 254.648987 -31.445039 305.140339 -216.814250 95.825021 231.408291 -111.002572 -2.007256 444.977249 31.541681 573.831941 -278.539708 30.768637 -36.915195 -0.293915 -83.574447 -13.960479 -60.930941 -53.208613 56.049658 -17.776675 -12.624526 14.149393 -30.559156 26.064776 -1.080160 -19.814893 -3.293546 -2.717923 7.470255 22.686838 28.696686 -14.312037 4.959030 -8.652543 2.473147 17.080399 -21.824778 -8.062901 0
7001343085 573.898045 -902.385767 -424.839214 -331.153508 -148.987005 -36.955710 -134.445130 265.325388 -92.070929 -164.203586 25.105150 -36.980621 164.785936 -222.908959 -12.573878 -50.569424 -44.767869 -62.984835 -18.100729 -86.239469 -115.399141 -45.776518 16.345395 -21.497140 -10.541281 -71.754047 29.230830 -20.880178 -0.690183 3.220864 -21.223298 65.500636 -39.719437 50.424623 10.586150 43.055219 0.209259 -66.107880 13.583016 25.823444 52.037618 -3.272773 8.493995 19.449057 -38.779466 0
7001863283 -1538.198366 514.032564 846.865497 57.032319 -1126.228705 -84.209511 -44.422495 -88.158881 -58.411887 50.518811 3.052703 -229.100202 -109.215465 -3.253782 7.045279 -85.645393 54.536446 -52.292779 20.978943 -90.806167 96.348659 24.280381 -52.425262 42.430049 -40.627473 -12.715890 -4.331719 -4.092290 50.339358 -0.777645 -35.146663 -121.580965 98.868473 -34.068010 -8.941074 22.920757 1.669933 52.644942 -8.542762 9.087643 -18.403853 3.672076 26.073078 27.246371 19.603368 0
7002275981 486.830772 -224.929803 1130.460535 -496.189015 6.009139 81.106845 -148.667431 170.280911 -7.375197 -99.556793 -159.659135 -14.186219 -98.682096 213.233743 -34.920639 -17.212430 29.644778 4.941994 2.799763 -49.580528 -88.567855 16.809461 -9.471018 4.383889 29.532189 38.211558 32.465761 -5.316497 -60.149577 12.593305 20.988200 80.709846 -50.975160 -3.712583 65.002407 -57.837280 -8.312631 -5.931175 -5.053131 -5.667538 -12.102225 -14.690148 -32.215573 12.517731 -20.158820 0
7001086221 -1420.949314 794.071749 99.221352 155.118564 145.349456 784.723580 -10.947301 609.724272 -172.482377 -42.796400 59.174124 -162.912577 -112.219187 -55.108445 17.303261 -152.111164 -611.929832 181.577435 -211.358075 -77.180329 116.282095 83.488753 -26.254488 128.490023 -69.085253 4.854304 -128.278573 44.328867 -6.470515 -28.782209 14.618174 -31.359379 27.331179 -25.948771 8.941634 -34.840913 -21.933848 17.941556 -0.866531 -19.428832 -5.321193 6.319611 -11.398376 41.907093 -8.296132 0
In [49]:
## Plotting principal components 
sns.pairplot(data=data_train_pca, x_vars=["PC_1"], y_vars=["PC_2"], hue="Churn", height=8);

Model 2 : PCA + Logistic Regression Model

In [50]:
# X,y Split
y_train_pca = data_train_pca.pop('Churn')
X_train_pca = data_train_pca

# Transforming test set with pca ( 45 components)
X_test_pca = pca_final.transform(X_test)

# Logistic Regression
lr_pca = LogisticRegression(random_state=100, class_weight='balanced')
lr_pca.fit(X_train_pca,y_train_pca ) 
Out[50]:
LogisticRegression(class_weight='balanced', random_state=100)
In [51]:
# y_train predictions
y_train_pred_lr_pca = lr_pca.predict(X_train_pca)
y_train_pred_lr_pca[:5]
Out[51]:
array([1, 0, 0, 0, 0])
In [52]:
# Test Prediction
X_test_pca = pca_final.transform(X_test)
y_test_pred_lr_pca = lr_pca.predict(X_test_pca)
y_test_pred_lr_pca[:5]
Out[52]:
array([1, 1, 1, 1, 1])

Baseline Performance

In [53]:
train_matrix = confusion_matrix(y_train, y_train_pred_lr_pca)
test_matrix = confusion_matrix(y_test, y_test_pred_lr_pca)

print('Train Performance :\n')
model_metrics(train_matrix)

print('\nTest Performance :\n')
model_metrics(test_matrix)
Train Performance :

Accuracy : 0.645
Sensitivity / True Positive Rate / Recall : 0.905
Specificity / True Negative Rate :  0.62
Precision / Positive Predictive Value : 0.184
F1-score : 0.306

Test Performance :

Accuracy : 0.086
Sensitivity / True Positive Rate / Recall : 1.0
Specificity / True Negative Rate :  0.0
Precision / Positive Predictive Value : 0.086
F1-score : 0.158
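
The test figures above follow a recognisable pattern: recall of 1.0, specificity of 0.0, and accuracy equal to precision. That is what the standard confusion-matrix formulas give when every customer is predicted as churn. The sketch below reproduces the pattern with hypothetical counts (914 non-churners, 86 churners); `metrics_from_confusion` is an illustrative helper and not the notebook's `model_metrics`.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Standard binary metrics; assumes every denominator is non-zero."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)           # sensitivity / true positive rate
    specificity = tn / (tn + fp)      # true negative rate
    precision = tp / (tp + fp)        # positive predictive value
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }

# Hypothetical test set where the model predicts churn for everyone.
always_churn = metrics_from_confusion(tn=0, fp=914, fn=0, tp=86)
for name, value in always_churn.items():
    print(f"{name}: {value:.3f}")
```

When accuracy and precision both equal the churn rate, the classifier carries no information on that set, whatever its recall.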

Hyperparameter Tuning

In [54]:
# Creating a Logistic regression model using pca transformed train set
from sklearn.pipeline import Pipeline
lr_pca = LogisticRegression(random_state=100, class_weight='balanced')
In [55]:
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV , StratifiedKFold
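params = {
    # Note: 'l1' is not supported by the default lbfgs solver, so those candidates fail to fit and are scored as NaN
    'penalty' : ['l1','l2','none'], 
    # Note: C is the inverse regularization strength and should be positive; it is ignored when penalty='none'
    'C' : [0,1,2,3,4,5,10,50]
}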
folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=100)

search = GridSearchCV(cv=folds, estimator = lr_pca, param_grid=params,scoring='roc_auc', verbose=True, n_jobs=-1)
search.fit(X_train_pca, y_train_pca)
Fitting 4 folds for each of 24 candidates, totalling 96 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    4.0s
[Parallel(n_jobs=-1)]: Done  96 out of  96 | elapsed:    6.9s finished
Out[55]:
GridSearchCV(cv=StratifiedKFold(n_splits=4, random_state=100, shuffle=True),
             estimator=LogisticRegression(class_weight='balanced',
                                          random_state=100),
             n_jobs=-1,
             param_grid={'C': [0, 1, 2, 3, 4, 5, 10, 50],
                         'penalty': ['l1', 'l2', 'none']},
             scoring='roc_auc', verbose=True)
In [56]:
# Optimum Hyperparameters
print('Best ROC-AUC score :', search.best_score_)
print('Best Parameters :', search.best_params_)
Best ROC-AUC score : 0.8763924253372933
Best Parameters : {'C': 0, 'penalty': 'none'}
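
The default `lbfgs` solver does not support the `l1` penalty, so one way to keep every candidate in a grid like this valid is to pick a solver that handles both penalties. Wrapping the steps in a `Pipeline` also refits the preprocessing inside each cross-validation fold. The following is a minimal sketch on synthetic data from `make_classification`, not the notebook's data; the step names, the `liblinear` solver and the grid values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the churn data (about 10% positives).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

pipe = Pipeline([
    ("scale", StandardScaler()),                      # fit on each training fold only
    ("pca", PCA(n_components=10, random_state=42)),
    ("lr", LogisticRegression(solver="liblinear",     # supports both l1 and l2
                              class_weight="balanced",
                              random_state=100)),
])

param_grid = {"lr__penalty": ["l1", "l2"], "lr__C": [0.01, 0.1, 1, 10]}
folds_demo = StratifiedKFold(n_splits=4, shuffle=True, random_state=100)

demo_search = GridSearchCV(pipe, param_grid=param_grid, scoring="roc_auc", cv=folds_demo)
demo_search.fit(X_demo, y_demo)
print(demo_search.best_params_, round(demo_search.best_score_, 3))
```

Because the whole pipeline is the estimator, calling `demo_search.predict` on new rows applies the same scaling and projection that were learned on the training data.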
In [57]:
# Modelling using the best LR-PCA estimator 
lr_pca_best = search.best_estimator_
lr_pca_best_fit = lr_pca_best.fit(X_train_pca, y_train_pca)

# Prediction on Train set
y_train_pred_lr_pca_best = lr_pca_best_fit.predict(X_train_pca)
y_train_pred_lr_pca_best[:5]
Out[57]:
array([1, 1, 0, 0, 0])
In [58]:
# Prediction on test set
y_test_pred_lr_pca_best = lr_pca_best_fit.predict(X_test_pca)
y_test_pred_lr_pca_best[:5]
Out[58]:
array([1, 1, 1, 1, 1])
In [59]:
## Model Performance after Hyper Parameter Tuning

train_matrix = confusion_matrix(y_train, y_train_pred_lr_pca_best)
test_matrix = confusion_matrix(y_test, y_test_pred_lr_pca_best)

print('Train Performance :\n')
model_metrics(train_matrix)

print('\nTest Performance :\n')
model_metrics(test_matrix)
Train Performance :

Accuracy : 0.627
Sensitivity / True Positive Rate / Recall : 0.918
Specificity / True Negative Rate :  0.599
Precision / Positive Predictive Value : 0.179
F1-score : 0.3

Test Performance :

Accuracy : 0.086
Sensitivity / True Positive Rate / Recall : 1.0
Specificity / True Negative Rate :  0.0
Precision / Positive Predictive Value : 0.086
F1-score : 0.158

Model 3 : PCA + Random Forest

In [60]:
from sklearn.ensemble import RandomForestClassifier

# creating a random forest classifier using pca output

pca_rf = RandomForestClassifier(random_state=42, class_weight= {0 : class_1/(class_0 + class_1) , 1 : class_0/(class_0 + class_1) } , oob_score=True, n_jobs=-1,verbose=1)
pca_rf
Out[60]:
RandomForestClassifier(class_weight={0: 0.08640165272733331,
                                     1: 0.9135983472726666},
                       n_jobs=-1, oob_score=True, random_state=42, verbose=1)
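
The `class_weight` dictionary above weights each class by the share of the opposite class, so the rarer churn class receives the larger weight. A small sketch with hypothetical counts (`n_non_churn` and `n_churn` are illustrative and not the notebook's actual `class_0` and `class_1`):

```python
def inverse_share_weights(n_non_churn, n_churn):
    """Weight each class by the other class's share of the data."""
    total = n_non_churn + n_churn
    return {0: n_churn / total, 1: n_non_churn / total}

weights = inverse_share_weights(n_non_churn=9000, n_churn=1000)
print(weights)
# Ratio of the churn weight to the non-churn weight.
print(weights[1] / weights[0])
```

The ratio of the two weights equals `n_non_churn / n_churn`, which is the same quantity (`class_0/class_1`) that the notebook later passes to XGBoost as `scale_pos_weight`.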
In [68]:
# Hyper parameter Tuning
params = {
    'n_estimators'  : [30,40,50,100],
    'max_depth' : [3,4,5,6,7],
    'min_samples_leaf' : [15,20,25,30]
}
folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
pca_rf_model_search = GridSearchCV(estimator=pca_rf, param_grid=params, 
                                   cv=folds, scoring='roc_auc', verbose=True, n_jobs=-1 )

pca_rf_model_search.fit(X_train_pca, y_train)
Fitting 4 folds for each of 80 candidates, totalling 320 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   23.2s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  2.7min
[Parallel(n_jobs=-1)]: Done 320 out of 320 | elapsed:  5.5min finished
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    1.2s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    2.6s finished
Out[68]:
GridSearchCV(cv=StratifiedKFold(n_splits=4, random_state=42, shuffle=True),
             estimator=RandomForestClassifier(class_weight={0: 0.08640165272733331,
                                                            1: 0.9135983472726666},
                                              n_jobs=-1, oob_score=True,
                                              random_state=42, verbose=1),
             n_jobs=-1,
             param_grid={'max_depth': [3, 4, 5, 6, 7],
                         'min_samples_leaf': [15, 20, 25, 30],
                         'n_estimators': [30, 40, 50, 100]},
             scoring='roc_auc', verbose=True)
In [69]:
# Optimum Hyperparameters
print('Best ROC-AUC score :', pca_rf_model_search.best_score_)
print('Best Parameters :', pca_rf_model_search.best_params_)
Best ROC-AUC score : 0.8861621751601011
Best Parameters : {'max_depth': 7, 'min_samples_leaf': 20, 'n_estimators': 100}
In [70]:
# Modelling using the best PCA-RandomForest Estimator 
pca_rf_best = pca_rf_model_search.best_estimator_
pca_rf_best_fit = pca_rf_best.fit(X_train_pca, y_train)

# Prediction on Train set
y_train_pred_pca_rf_best = pca_rf_best_fit.predict(X_train_pca)
y_train_pred_pca_rf_best[:5]
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    1.1s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    2.7s finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.1s finished
Out[70]:
array([0, 0, 0, 0, 0])
In [71]:
# Prediction on test set
y_test_pred_pca_rf_best = pca_rf_best_fit.predict(X_test_pca)
y_test_pred_pca_rf_best[:5]
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.1s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.1s finished
Out[71]:
array([0, 0, 0, 0, 0])
In [72]:
## PCA - RandomForest Model Performance - Hyper Parameter Tuned

train_matrix = confusion_matrix(y_train, y_train_pred_pca_rf_best)
test_matrix = confusion_matrix(y_test, y_test_pred_pca_rf_best)

print('Train Performance :\n')
model_metrics(train_matrix)

print('\nTest Performance :\n')
model_metrics(test_matrix)
Train Performance :

Accuracy : 0.882
Sensitivity / True Positive Rate / Recall : 0.816
Specificity / True Negative Rate :  0.888
Precision / Positive Predictive Value : 0.408
F1-score : 0.544

Test Performance :

Accuracy : 0.86
Sensitivity / True Positive Rate / Recall : 0.80
Specificity / True Negative Rate :  0.78
Precision / Positive Predictive Value : 0.37
F1-score : 0.51
In [67]:
## out-of-bag (OOB) score: accuracy estimated on the samples left out of each tree's bootstrap (not an error rate)
pca_rf_best_fit.oob_score_
Out[67]:
0.8625220164707003

Model 4 : PCA + XGBoost

In [74]:
import xgboost as xgb
pca_xgb = xgb.XGBClassifier(
    random_state=42,
    scale_pos_weight=class_0/class_1,  # scale_pos_weight takes care of class imbalance
    tree_method='hist',
    objective='binary:logistic',
)
pca_xgb.fit(X_train_pca, y_train)
Out[74]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=42,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=10.573852680293097,
              subsample=1, tree_method='hist', validate_parameters=1,
              verbosity=None)
In [75]:
print('Baseline Train AUC Score')
roc_auc_score(y_train, pca_xgb.predict_proba(X_train_pca)[:, 1])
Baseline Train AUC Score
Out[75]:
0.9999996277241286
In [76]:
print('Baseline Test AUC Score')
roc_auc_score(y_test, pca_xgb.predict_proba(X_test_pca)[:, 1])
Baseline Test AUC Score
Out[76]:
0.46093390352284136
In [77]:
## Hyper parameter Tuning
parameters = {
              'learning_rate': [0.1, 0.2, 0.3],
              'gamma' : [10,20,50],
              'max_depth': [2,3,4],
              'min_child_weight': [25,50],
              'n_estimators': [150,200,500]}
pca_xgb_search = GridSearchCV(estimator=pca_xgb , param_grid=parameters,scoring='roc_auc', cv=folds, n_jobs=-1, verbose=1)
pca_xgb_search.fit(X_train_pca, y_train)
Fitting 4 folds for each of 162 candidates, totalling 648 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   28.3s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  4.8min
[Parallel(n_jobs=-1)]: Done 648 out of 648 | elapsed:  8.0min finished
Out[77]:
GridSearchCV(cv=StratifiedKFold(n_splits=4, random_state=42, shuffle=True),
             estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                     colsample_bylevel=1, colsample_bynode=1,
                                     colsample_bytree=1, gamma=0, gpu_id=-1,
                                     importance_type='gain',
                                     interaction_constraints='',
                                     learning_rate=0.300000012,
                                     max_delta_step=0, max_depth=6,
                                     min_child_weight=1, missing=nan,
                                     monotone_...
                                     n_estimators=100, n_jobs=0,
                                     num_parallel_tree=1, random_state=42,
                                     reg_alpha=0, reg_lambda=1,
                                     scale_pos_weight=10.573852680293097,
                                     subsample=1, tree_method='hist',
                                     validate_parameters=1, verbosity=None),
             n_jobs=-1,
             param_grid={'gamma': [10, 20, 50],
                         'learning_rate': [0.1, 0.2, 0.3],
                         'max_depth': [2, 3, 4], 'min_child_weight': [25, 50],
                         'n_estimators': [150, 200, 500]},
             scoring='roc_auc', verbose=1)
In [78]:
# Optimum Hyperparameters
print('Best ROC-AUC score :', pca_xgb_search.best_score_)
print('Best Parameters :', pca_xgb_search.best_params_)
Best ROC-AUC score : 0.8955777259491308
Best Parameters : {'gamma': 10, 'learning_rate': 0.1, 'max_depth': 2, 'min_child_weight': 50, 'n_estimators': 500}
In [79]:
# Modelling using the best PCA-XGBoost Estimator 
pca_xgb_best = pca_xgb_search.best_estimator_
pca_xgb_best_fit = pca_xgb_best.fit(X_train_pca, y_train)

# Prediction on Train set
y_train_pred_pca_xgb_best = pca_xgb_best_fit.predict(X_train_pca)
y_train_pred_pca_xgb_best[:5]
Out[79]:
array([0, 0, 0, 0, 0])
In [84]:
X_train_pca.head()
Out[84]:
PC_1 PC_2 PC_3 PC_4 PC_5 PC_6 PC_7 PC_8 PC_9 PC_10 PC_11 PC_12 PC_13 PC_14 PC_15 PC_16 PC_17 PC_18 PC_19 PC_20 PC_21 PC_22 PC_23 PC_24 PC_25 PC_26 PC_27 PC_28 PC_29 PC_30 PC_31 PC_32 PC_33 PC_34 PC_35 PC_36 PC_37 PC_38 PC_39 PC_40 PC_41 PC_42 PC_43 PC_44 PC_45
mobile_number
7000166926 -907.572208 -342.923676 13.094442 58.813506 -95.616159 -1050.535219 254.648987 -31.445039 305.140339 -216.814250 95.825021 231.408291 -111.002572 -2.007256 444.977249 31.541681 573.831941 -278.539708 30.768637 -36.915195 -0.293915 -83.574447 -13.960479 -60.930941 -53.208613 56.049658 -17.776675 -12.624526 14.149393 -30.559156 26.064776 -1.080160 -19.814893 -3.293546 -2.717923 7.470255 22.686838 28.696686 -14.312037 4.959030 -8.652543 2.473147 17.080399 -21.824778 -8.062901
7001343085 573.898045 -902.385767 -424.839214 -331.153508 -148.987005 -36.955710 -134.445130 265.325388 -92.070929 -164.203586 25.105150 -36.980621 164.785936 -222.908959 -12.573878 -50.569424 -44.767869 -62.984835 -18.100729 -86.239469 -115.399141 -45.776518 16.345395 -21.497140 -10.541281 -71.754047 29.230830 -20.880178 -0.690183 3.220864 -21.223298 65.500636 -39.719437 50.424623 10.586150 43.055219 0.209259 -66.107880 13.583016 25.823444 52.037618 -3.272773 8.493995 19.449057 -38.779466
7001863283 -1538.198366 514.032564 846.865497 57.032319 -1126.228705 -84.209511 -44.422495 -88.158881 -58.411887 50.518811 3.052703 -229.100202 -109.215465 -3.253782 7.045279 -85.645393 54.536446 -52.292779 20.978943 -90.806167 96.348659 24.280381 -52.425262 42.430049 -40.627473 -12.715890 -4.331719 -4.092290 50.339358 -0.777645 -35.146663 -121.580965 98.868473 -34.068010 -8.941074 22.920757 1.669933 52.644942 -8.542762 9.087643 -18.403853 3.672076 26.073078 27.246371 19.603368
7002275981 486.830772 -224.929803 1130.460535 -496.189015 6.009139 81.106845 -148.667431 170.280911 -7.375197 -99.556793 -159.659135 -14.186219 -98.682096 213.233743 -34.920639 -17.212430 29.644778 4.941994 2.799763 -49.580528 -88.567855 16.809461 -9.471018 4.383889 29.532189 38.211558 32.465761 -5.316497 -60.149577 12.593305 20.988200 80.709846 -50.975160 -3.712583 65.002407 -57.837280 -8.312631 -5.931175 -5.053131 -5.667538 -12.102225 -14.690148 -32.215573 12.517731 -20.158820
7001086221 -1420.949314 794.071749 99.221352 155.118564 145.349456 784.723580 -10.947301 609.724272 -172.482377 -42.796400 59.174124 -162.912577 -112.219187 -55.108445 17.303261 -152.111164 -611.929832 181.577435 -211.358075 -77.180329 116.282095 83.488753 -26.254488 128.490023 -69.085253 4.854304 -128.278573 44.328867 -6.470515 -28.782209 14.618174 -31.359379 27.331179 -25.948771 8.941634 -34.840913 -21.933848 17.941556 -0.866531 -19.428832 -5.321193 6.319611 -11.398376 41.907093 -8.296132
In [85]:
# Prediction on test set
X_test_pca = pca_final.transform(X_test)
X_test_pca = pd.DataFrame(X_test_pca, index=X_test.index, columns = X_train_pca.columns)
y_test_pred_pca_xgb_best = pca_xgb_best_fit.predict(X_test_pca)
y_test_pred_pca_xgb_best[:5]
Out[85]:
array([1, 1, 1, 1, 1])
In [86]:
## PCA - XGBOOST [Hyper parameter tuned] Model Performance

train_matrix = confusion_matrix(y_train, y_train_pred_pca_xgb_best)
test_matrix = confusion_matrix(y_test, y_test_pred_pca_xgb_best)

print('Train Performance :\n')
model_metrics(train_matrix)

print('\nTest Performance :\n')
model_metrics(test_matrix)
Train Performance :

Accuracy : 0.873
Sensitivity / True Positive Rate / Recall : 0.887
Specificity / True Negative Rate :  0.872
Precision / Positive Predictive Value : 0.396
F1-score : 0.548

Test Performance :

Accuracy : 0.086
Sensitivity / True Positive Rate / Recall : 1.0
Specificity / True Negative Rate :  0.0
Precision / Positive Predictive Value : 0.086
F1-score : 0.158
In [87]:
## PCA - XGBOOST [Hyper parameter tuned] Model Performance
print('Train AUC Score')
print(roc_auc_score(y_train, pca_xgb_best.predict_proba(X_train_pca)[:, 1]))
print('Test AUC Score')
print(roc_auc_score(y_test, pca_xgb_best.predict_proba(X_test_pca)[:, 1]))
Train AUC Score
0.9442462043611259
Test AUC Score
0.6353301334697982

Recommendations

In [88]:
print('Most Important Predictors of churn , in the order of importance are : ')
lr_results.sort_values(by=coef_column, key=lambda x: abs(x), ascending=False)['coef']
Most Important Predictors of churn , in the order of importance are : 
Out[88]:
loc_ic_t2f_mou_8   -1.2736
total_rech_num_8   -1.2033
total_rech_num_6    0.6053
monthly_3g_8_0      0.3994
monthly_2g_8_0      0.3666
std_ic_t2f_mou_8   -0.3363
std_og_t2f_mou_8   -0.2474
const              -0.2336
monthly_3g_7_0     -0.2099
std_ic_t2f_mou_7    0.1532
sachet_2g_6_0      -0.1108
sachet_2g_7_0      -0.0987
sachet_2g_8_0       0.0488
sachet_3g_6_0      -0.0399
Name: coef, dtype: float64

From the above, the following are the strongest indicators of churn:

  • Local incoming minutes of usage from fixed lines in the action period (loc_ic_t2f_mou_8) has the largest coefficient in magnitude (-1.27): a one standard deviation increase lowers the log-odds of churn by 1.27 when all other factors are held constant. In other words, customers with lower incoming usage from fixed lines are more likely to churn. This is the strongest indicator of churn.
  • The number of recharges in the action period (total_rech_num_8) has a coefficient of -1.20: a one standard deviation increase lowers the log-odds of churn by 1.20 when all other factors are held constant, so customers who recharge less often are more likely to churn. This is the second strongest indicator of churn.
  • The number of recharges in month 6 (total_rech_num_6) has a positive coefficient of 0.61. A higher recharge count in month 6, coupled with a lower count in the action period, is a good indicator of churn.
  • Customers flagged as 'monthly 2g package-0' or 'monthly 3g package-0' in the action period are more likely to churn (coefficients of 0.37 and 0.40 respectively), when all other factors are held constant.

Based on the above indicators, the recommendations to the telecom company are:

  • Concentrate on users whose incoming calls from fixed lines in the action period are well below average. They are the most likely to churn.
  • Concentrate on users who recharge fewer times than average in the 8th month. They are the second most likely to churn.
  • Models with high sensitivity are preferred for predicting churn. Use the PCA + Logistic Regression model to predict churn. It has a cross-validated ROC-AUC of 0.88 and a test sensitivity of 100%. Note that this comes with a test specificity of 0.0 (every test customer is flagged as a churner), so the test-set results should be verified before deployment.
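
Sensitivity on its own can be maximised trivially by predicting churn for everyone, so it is usually read together with specificity. As a hedged illustration with toy labels and probabilities, the sketch below sweeps the decision threshold on predicted probabilities and picks the one that maximises Youden's J (sensitivity + specificity - 1). `best_threshold` is a hypothetical helper, and Youden's J is one common criterion rather than the notebook's method.

```python
import numpy as np

def best_threshold(y_true, y_prob):
    """Return (threshold, sensitivity, specificity) maximising Youden's J."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    best = None
    for t in np.unique(y_prob):
        pred = y_prob >= t                              # predict churn at or above t
        sensitivity = np.mean(pred[y_true == 1])        # share of churners caught
        specificity = np.mean(~pred[y_true == 0])       # share of non-churners cleared
        j = sensitivity + specificity - 1
        if best is None or j > best[0]:
            best = (j, float(t), float(sensitivity), float(specificity))
    return best[1:]

# Toy data (hypothetical): 3 churners among 8 customers.
toy_y = [0, 0, 0, 0, 1, 0, 1, 1]
toy_p = [0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.9]
threshold, sens, spec = best_threshold(toy_y, toy_p)
print(threshold, sens, spec)
```

If missing a churner costs more than contacting a loyal customer, a criterion that weights sensitivity more heavily than specificity may be a better fit than the unweighted J.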