Lead Scoring for X Education

Problem Statement

An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses.

The company markets its courses on several websites and search engines such as Google. Once people land on the website, they might browse the courses, fill in a form for a course, or watch some videos. When they fill in a form with their email address or phone number, they are classified as a lead. The company also gets leads through past referrals. Once leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X Education is around 30%.

Although X Education gets a lot of leads, its lead conversion rate is very poor. For example, if they acquire 100 leads in a day, only about 30 of them are converted. To make this process more efficient, the company wishes to identify the most promising leads, also known as ‘Hot Leads’. If this set of leads is identified successfully, the lead conversion rate should go up, because the sales team will focus on communicating with the promising leads rather than calling everyone. A typical lead conversion process can be represented using the following funnel:

X Education has appointed us to help them select the most promising leads, i.e. the leads that are most likely to convert into paying customers. The company requires us to build a model wherein we need to assign a lead score to each of the leads such that the customers with higher lead score have a higher conversion chance and the customers with lower lead score have a lower conversion chance. The CEO, in particular, has given a ballpark of the target lead conversion rate to be around 80%.

Business Goals

  • X Education wants to improve its lead conversion rate.
  • Rather than pursuing leads at random, the company wants to create a pool of Hot Leads for the sales team to focus on.
  • They have tasked us with scoring their leads between 0 and 100 based on the probability of conversion, 100 being the most likely to convert and 0 the least likely.

Analysis Approach & Conclusions

Lead scoring is a class probability estimation problem, a form of classification problem. The target variable in the data set has two classes: 0 (Unconverted) and 1 (Converted). The objective is to model the probability p that each lead belongs to the class Converted. Since there are just two classes, the probability of belonging to the class Unconverted is (1 - p). The relationship between the conversion probability of each lead and its characteristics is modelled using logistic regression. The leads are then scored on a scale of 0-100, 100 being the most probable conversion candidate.

The final solution has been provided in two parts.

1. Scoring the leads provided by the company in the order of probability of conversion (0-100)
2. Insights into the relationship between the characteristics of a lead and the log odds of conversion that could help the company score leads in the future.


A logistic regression model is built on the lead features. To arrive at the list of features that significantly affect conversion probability, a mixed feature elimination approach is followed: the 25 most important features are obtained through Recursive Feature Elimination (RFE) and then reduced to 15 by manually dropping features based on p-values and Variance Inflation Factor (VIF). The dataset is randomly divided into a train set and a test set (70-30 split).
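The VIF part of this elimination step can be illustrated with a minimal sketch. It is not the notebook's own implementation: the synthetic columns `a`, `b`, `c`, the helper name `vif` and the NumPy-only approach are assumptions made for illustration. Each column is regressed on the others and VIF = 1 / (1 - R²), so a column that is nearly a copy of another gets a large VIF and becomes a candidate for removal.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of a 2-D array (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # regress column j on the rest, with intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic features (illustrative only): c is nearly a copy of a
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
X = np.column_stack([a, b, c])

# Random 70-30 split of the rows, mirroring the split described above
idx = rng.permutation(len(X))
cut = int(0.7 * len(X))
X_train, X_test = X[idx[:cut]], X[idx[cut:]]

vifs = vif(X_train)
print(dict(zip(["a", "b", "c"], vifs.round(2))))
```

The cut-off above which a feature is dropped is a judgement call that this section does not state, so none is hard-coded here.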

The final relationship between the log odds of conversion and the lead features is

logOdds(Conversion Probability) = -0.6469
    - 1.5426 * Do Not Email
    - 1.2699 * Unknown Occupation
    - 0.9057 * No Specialization
    - 0.8704 * Hospitality Management
    - 0.6584 * Outside India
    + 1.7923 * SMS Sent
    + 1.1749 * Other Last Activity
    + 2.3769 * Working Professional
    - 0.8614 * Olark Chat Conversation
    + 5.3886 * Welingak Website
    + 3.0246 * Reference
    + 1.1876 * Olark Chat
    - 1.0250 * Landing Page Submission
    + 1.1253 * Total Time Spent on Website
    + 0.6106 * Email Opened

where Total Time Spent on Website is standardized to $\mu=0,\sigma=1$
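
As a worked illustration of how this equation turns into a 0-100 lead score, the sketch below plugs the published coefficients into the logistic function p = 1 / (1 + e^(-z)) and scales by 100. The function name `lead_score`, the rounding to an integer and the example lead are assumptions for illustration, not the notebook's own scoring code.

```python
import math

# Intercept and coefficients copied from the fitted equation
INTERCEPT = -0.6469
COEF = {
    "Do Not Email": -1.5426,
    "Unknown Occupation": -1.2699,
    "No Specialization": -0.9057,
    "Hospitality Management": -0.8704,
    "Outside India": -0.6584,
    "SMS Sent": 1.7923,
    "Other Last Activity": 1.1749,
    "Working Professional": 2.3769,
    "Olark Chat Conversation": -0.8614,
    "Welingak Website": 5.3886,
    "Reference": 3.0246,
    "Olark Chat": 1.1876,
    "Landing Page Submission": -1.0250,
    "Total Time Spent on Website": 1.1253,   # standardized value expected
    "Email Opened": 0.6106,
}

def lead_score(features):
    """Return a 0-100 score from a dict of feature values; absent features count as 0."""
    z = INTERCEPT + sum(COEF[name] * value for name, value in features.items())
    p = 1.0 / (1.0 + math.exp(-z))           # log odds -> probability
    return round(100 * p)

# Hypothetical lead: working professional, SMS sent, time on site one
# standard deviation above the mean
example = {"Working Professional": 1, "SMS Sent": 1, "Total Time Spent on Website": 1.0}
baseline = lead_score({})                    # every indicator 0, average time on site
print(baseline, lead_score(example))
```

A lead with every indicator at 0 and average time on the website is scored from the intercept alone, which is a convenient sanity check against the overall conversion rate.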

Interpreting the top 6 features affecting conversion probability (each coefficient is an additive change in log odds, holding the other features fixed; the odds are multiplied by e raised to the coefficient):

  • Leads from Welingak Website have log odds of conversion 5.39 higher than those from Google (odds multiplied by about 219).
  • Leads through Reference have log odds of conversion 3.02 higher than those from Google (odds multiplied by about 20.6).
  • Working Professional leads have log odds of conversion 2.38 higher than Businessman leads (odds multiplied by about 10.8).
  • Leads with SMS Sent have log odds of conversion 1.79 higher than those with no SMS sent (odds multiplied by about 6.0).
  • Leads with Do Not Email have log odds of conversion 1.54 lower than leads who would like email updates (odds multiplied by about 0.21).
  • Leads with Unknown Occupation have log odds of conversion 1.27 lower than Businessman leads (odds multiplied by about 0.28).
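
A logistic regression coefficient β shifts the log odds by β, which is the same as multiplying the odds by e^β. The short sketch below converts the six coefficients listed above into odds multipliers; the dictionary name `top_coefficients` is only illustrative.

```python
import math

# Coefficients of the six strongest features, copied from the fitted equation
top_coefficients = {
    "Welingak Website": 5.3886,
    "Reference": 3.0246,
    "Working Professional": 2.3769,
    "SMS Sent": 1.7923,
    "Do Not Email": -1.5426,
    "Unknown Occupation": -1.2699,
}

# exp(beta) is the factor by which the odds change when the indicator goes from 0 to 1
odds_ratios = {name: math.exp(beta) for name, beta in top_coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name:22s} odds multiplied by {ratio:.2f}")
```

Multipliers above 1 raise the odds of conversion and multipliers below 1 lower them, always relative to the reference category of the corresponding variable.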

Lead Scores :

  • The score sheet can be generated by running the code in the cell named Score Sheet for X Education in the analysis notebook.

At an optimum cut-off probability of 0.36, model performance is as follows.

Model Performance on Training Set :

  • Accuracy : 81.7%
  • Sensitivity / Recall: 80.393 %
  • Specificity : 81.772 %
  • Precision / Positive Predictive Value : 72.924 %
  • False Positive Rate : 18.228 %
  • AUC Score : 0.81

Model Performance for Test Set :

  • Accuracy : 79.593 %
  • Sensitivity / Recall : 77.605 %
  • Specificity : 80.81%
  • Precision / Positive Predictive Value : 71.224 %
  • False Positive Rate : 19.19 %
  • AUC Score : 0.79
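
The metrics listed above all follow from the confusion matrix obtained at the chosen cut-off. The sketch below shows the arithmetic on a small made-up sample; the function name, the toy labels and probabilities, and the choice to treat a probability equal to the cut-off as a positive prediction are assumptions, not the notebook's code or data.

```python
def classification_metrics(y_true, y_prob, cutoff=0.36):
    """Confusion-matrix metrics for predicted probabilities at a given cut-off."""
    y_pred = [1 if p >= cutoff else 0 for p in y_prob]   # assumption: >= counts as positive
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),          # recall: share of converted leads caught
        "specificity": tn / (tn + fp),          # share of non-converted leads rejected
        "precision": tp / (tp + fp),            # share of predicted conversions that are real
        "false_positive_rate": fp / (fp + tn),  # equals 1 - specificity
    }

# Toy data (illustrative only)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.4, 0.2, 0.1, 0.5, 0.3, 0.35, 0.7]
metrics = classification_metrics(y_true, y_prob)
print(metrics)
```

Lowering the cut-off raises sensitivity at the cost of specificity, which is why the cut-off is chosen where the two are roughly balanced.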

KS statistic :

  • The maximum KS statistic is 59.76, reached in the 5th decile.
  • The model discriminates well between converted and non-converted leads: the KS statistic already reaches 58.11 in the 4th decile, which exceeds 40%. Hence, this is a reasonably good model.

Gain :

  • Instead of pursuing leads randomly, pursuing the top 40% leads scored by the model would let the sales team reach 80% of leads likely to convert.

Lift :

  • The model outperforms a random model by at least 2 times in identifying the top 40% potentially convertible leads.
  • As opposed to 10% conversions from 10% leads pursued randomly, pursuing the top 10% leads scored by this model would lead to 24% conversions.
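
KS, gain and lift are all read off the same decile table: leads are sorted by predicted probability, split into ten equal groups, and the cumulative share of converted and non-converted leads is tracked. The sketch below builds such a table on synthetic data; the function name `decile_table`, the column names and the simulated leads are assumptions for illustration, not the notebook's own code or results.

```python
import numpy as np
import pandas as pd

def decile_table(y_true, y_prob):
    """Cumulative gain, lift and KS per decile (decile 1 = highest scores)."""
    df = pd.DataFrame({"converted": y_true, "prob": y_prob})
    df = df.sort_values("prob", ascending=False).reset_index(drop=True)
    df["decile"] = np.arange(len(df)) * 10 // len(df) + 1
    t = df.groupby("decile")["converted"].agg(leads="count", converted="sum")
    t["not_converted"] = t["leads"] - t["converted"]
    # Gain: cumulative % of all converted leads captured up to this decile
    t["gain"] = 100 * t["converted"].cumsum() / t["converted"].sum()
    t["cum_not_converted_pct"] = 100 * t["not_converted"].cumsum() / t["not_converted"].sum()
    # KS: gap between the two cumulative distributions
    t["ks"] = t["gain"] - t["cum_not_converted_pct"]
    # Lift: gain relative to the % of leads contacted (a random model has lift 1)
    t["lift"] = t["gain"] / (100 * t["leads"].cumsum() / t["leads"].sum())
    return t

# Synthetic leads (illustrative only): conversion is more likely at high scores
rng = np.random.default_rng(0)
prob = rng.uniform(size=1000)
converted = (rng.uniform(size=1000) < prob).astype(int)

table = decile_table(converted, prob)
print(table[["gain", "ks", "lift"]].round(2))
```

In such a table, the gain at decile 4 is the share of conversions reached by pursuing the top 40% of leads, and the largest value in the `ks` column is the KS statistic.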

Note :

  • Incorrect data types have been corrected
  • Columns with high missing values have been dropped.
  • Columns which do not explain variability in the model have been dropped.
  • Columns containing the sales team's notes, such as Tags, where the classes are not mutually exclusive, have been dropped.
  • Features with low missing values have been imputed with the most frequent values.
  • Categories in a feature with less than 1% contribution have been grouped together to reduce the number of levels.
  • Inconsistencies in Categories have been corrected.
  • 97.5 % of the leads provided by the company have been used for analysis.
  • Class imbalance = 0.6
  • Indicator variables have been created for all categorical variables with the first category as the reference.
  • Continuous variables have been standardized to $\mu=0,\sigma=1$ before modelling.
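
Several of the preparation steps noted above (mode imputation, grouping of rare categories, indicator variables with the first category as reference, standardization) can be sketched on a toy frame as follows. The data, the helper name `group_rare`, the label `Other` and the use of the population standard deviation are assumptions for illustration; the notebook's own cells may differ.

```python
import pandas as pd

# Toy data (illustrative only): one rare category and one missing value
df = pd.DataFrame({
    "Lead Source": ["Google"] * 120 + ["Olark Chat"] * 78 + ["bing", None],
    "Total Time Spent on Website": list(range(200)),
})

def group_rare(series, min_share=0.01, label="Other"):
    """Replace categories that make up less than min_share of the rows with one label."""
    share = series.value_counts(normalize=True)
    rare = share[share < min_share].index
    return series.where(~series.isin(rare), label)

# 1. Impute the few missing values with the most frequent category
df["Lead Source"] = df["Lead Source"].fillna(df["Lead Source"].mode()[0])

# 2. Group categories with under 1% contribution
df["Lead Source"] = group_rare(df["Lead Source"])

# 3. Indicator variables; drop_first=True makes the first category the reference
encoded = pd.get_dummies(df, columns=["Lead Source"], drop_first=True)

# 4. Standardize the continuous variable to mean 0, standard deviation 1
#    (population standard deviation, ddof=0, is an assumption here)
col = "Total Time Spent on Website"
encoded[col] = (encoded[col] - encoded[col].mean()) / encoded[col].std(ddof=0)

print(encoded.head())
```

With `drop_first=True` the dropped category (here `Google`, the first in sorted order) becomes the baseline against which the other indicator coefficients are interpreted.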
In [151]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 

import seaborn as sns 
sns.set_style('whitegrid')

import warnings
warnings.filterwarnings('ignore')

!pip install tabulate
from tabulate import tabulate 

import sidetable

!jt -t grade3 -f roboto -fs 12 -cellw 100%
Requirement already satisfied: tabulate in /Users/jayanth/opt/anaconda3/lib/python3.7/site-packages (0.8.7)
In [152]:
# Pretty-print a Series or DataFrame as a psql-style table
def tab(ser):
    print(tabulate(pd.DataFrame(ser), headers='keys', tablefmt="psql"))

Importing Data

In [153]:
# importing the dataset
leads = pd.read_csv('./leads.csv')

# Inspecting a few column heads at a time (five columns per table)
for i in range(0, leads.shape[1], 5):
    # column index i+4 must exist for a full group of five
    if i + 4 < leads.shape[1]:
        print('Columns : ', i, ' to ', i + 4)
    else:
        print('Columns : ', i, ' to last')
    tab(leads.iloc[:, i:i + 5].head())
    print('\n')
Columns :  0  to  4
+----+--------------------------------------+---------------+-------------------------+----------------+----------------+
|    | Prospect ID                          |   Lead Number | Lead Origin             | Lead Source    | Do Not Email   |
|----+--------------------------------------+---------------+-------------------------+----------------+----------------|
|  0 | 7927b2df-8bba-4d29-b9a2-b6e0beafe620 |        660737 | API                     | Olark Chat     | No             |
|  1 | 2a272436-5132-4136-86fa-dcc88c88f482 |        660728 | API                     | Organic Search | No             |
|  2 | 8cc8c611-a219-4f35-ad23-fdfd2656bd8a |        660727 | Landing Page Submission | Direct Traffic | No             |
|  3 | 0cc2df48-7cf4-4e39-9de9-19797f9b38cc |        660719 | Landing Page Submission | Direct Traffic | No             |
|  4 | 3256f628-e534-4826-9d63-4a8b88782852 |        660681 | Landing Page Submission | Google         | No             |
+----+--------------------------------------+---------------+-------------------------+----------------+----------------+


Columns :  5  to  9
+----+---------------+-------------+---------------+-------------------------------+------------------------+
|    | Do Not Call   |   Converted |   TotalVisits |   Total Time Spent on Website |   Page Views Per Visit |
|----+---------------+-------------+---------------+-------------------------------+------------------------|
|  0 | No            |           0 |             0 |                             0 |                    0   |
|  1 | No            |           0 |             5 |                           674 |                    2.5 |
|  2 | No            |           1 |             2 |                          1532 |                    2   |
|  3 | No            |           0 |             1 |                           305 |                    1   |
|  4 | No            |           1 |             2 |                          1428 |                    1   |
+----+---------------+-------------+---------------+-------------------------------+------------------------+


Columns :  10  to  14
+----+-------------------------+-----------+-------------------------+--------------------------------------+-----------------------------------+
|    | Last Activity           | Country   | Specialization          | How did you hear about X Education   | What is your current occupation   |
|----+-------------------------+-----------+-------------------------+--------------------------------------+-----------------------------------|
|  0 | Page Visited on Website | nan       | Select                  | Select                               | Unemployed                        |
|  1 | Email Opened            | India     | Select                  | Select                               | Unemployed                        |
|  2 | Email Opened            | India     | Business Administration | Select                               | Student                           |
|  3 | Unreachable             | India     | Media and Advertising   | Word Of Mouth                        | Unemployed                        |
|  4 | Converted to Lead       | India     | Select                  | Other                                | Unemployed                        |
+----+-------------------------+-----------+-------------------------+--------------------------------------+-----------------------------------+


Columns :  15  to  19
+----+-------------------------------------------------+----------+------------+---------------------+----------------------+
|    | What matters most to you in choosing a course   | Search   | Magazine   | Newspaper Article   | X Education Forums   |
|----+-------------------------------------------------+----------+------------+---------------------+----------------------|
|  0 | Better Career Prospects                         | No       | No         | No                  | No                   |
|  1 | Better Career Prospects                         | No       | No         | No                  | No                   |
|  2 | Better Career Prospects                         | No       | No         | No                  | No                   |
|  3 | Better Career Prospects                         | No       | No         | No                  | No                   |
|  4 | Better Career Prospects                         | No       | No         | No                  | No                   |
+----+-------------------------------------------------+----------+------------+---------------------+----------------------+


Columns :  20  to  24
+----+-------------+-------------------------+---------------------------+------------------------------------------+-------------------------------------+
|    | Newspaper   | Digital Advertisement   | Through Recommendations   | Receive More Updates About Our Courses   | Tags                                |
|----+-------------+-------------------------+---------------------------+------------------------------------------+-------------------------------------|
|  0 | No          | No                      | No                        | No                                       | Interested in other courses         |
|  1 | No          | No                      | No                        | No                                       | Ringing                             |
|  2 | No          | No                      | No                        | No                                       | Will revert after reading the email |
|  3 | No          | No                      | No                        | No                                       | Ringing                             |
|  4 | No          | No                      | No                        | No                                       | Will revert after reading the email |
+----+-------------+-------------------------+---------------------------+------------------------------------------+-------------------------------------+


Columns :  25  to  29
+----+------------------+-------------------------------------+-----------------------------+----------------+--------+
|    | Lead Quality     | Update me on Supply Chain Content   | Get updates on DM Content   | Lead Profile   | City   |
|----+------------------+-------------------------------------+-----------------------------+----------------+--------|
|  0 | Low in Relevance | No                                  | No                          | Select         | Select |
|  1 | nan              | No                                  | No                          | Select         | Select |
|  2 | Might be         | No                                  | No                          | Potential Lead | Mumbai |
|  3 | Not Sure         | No                                  | No                          | Select         | Mumbai |
|  4 | Might be         | No                                  | No                          | Select         | Mumbai |
+----+------------------+-------------------------------------+-----------------------------+----------------+--------+


Columns :  30  to  34
+----+-------------------------------+------------------------------+-------------------------------+------------------------------+--------------------------------------------+
|    | Asymmetrique Activity Index   | Asymmetrique Profile Index   |   Asymmetrique Activity Score |   Asymmetrique Profile Score | I agree to pay the amount through cheque   |
|----+-------------------------------+------------------------------+-------------------------------+------------------------------+--------------------------------------------|
|  0 | 02.Medium                     | 02.Medium                    |                            15 |                           15 | No                                         |
|  1 | 02.Medium                     | 02.Medium                    |                            15 |                           15 | No                                         |
|  2 | 02.Medium                     | 01.High                      |                            14 |                           20 | No                                         |
|  3 | 02.Medium                     | 01.High                      |                            13 |                           17 | No                                         |
|  4 | 02.Medium                     | 01.High                      |                            15 |                           18 | No                                         |
+----+-------------------------------+------------------------------+-------------------------------+------------------------------+--------------------------------------------+


Columns :  35  to last
+----+------------------------------------------+-------------------------+
|    | A free copy of Mastering The Interview   | Last Notable Activity   |
|----+------------------------------------------+-------------------------|
|  0 | No                                       | Modified                |
|  1 | No                                       | Email Opened            |
|  2 | Yes                                      | Email Opened            |
|  3 | No                                       | Modified                |
|  4 | No                                       | Modified                |
+----+------------------------------------------+-------------------------+


In [154]:
# dataset information 
leads.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9240 entries, 0 to 9239
Data columns (total 37 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   Prospect ID                                    9240 non-null   object 
 1   Lead Number                                    9240 non-null   int64  
 2   Lead Origin                                    9240 non-null   object 
 3   Lead Source                                    9204 non-null   object 
 4   Do Not Email                                   9240 non-null   object 
 5   Do Not Call                                    9240 non-null   object 
 6   Converted                                      9240 non-null   int64  
 7   TotalVisits                                    9103 non-null   float64
 8   Total Time Spent on Website                    9240 non-null   int64  
 9   Page Views Per Visit                           9103 non-null   float64
 10  Last Activity                                  9137 non-null   object 
 11  Country                                        6779 non-null   object 
 12  Specialization                                 7802 non-null   object 
 13  How did you hear about X Education             7033 non-null   object 
 14  What is your current occupation                6550 non-null   object 
 15  What matters most to you in choosing a course  6531 non-null   object 
 16  Search                                         9240 non-null   object 
 17  Magazine                                       9240 non-null   object 
 18  Newspaper Article                              9240 non-null   object 
 19  X Education Forums                             9240 non-null   object 
 20  Newspaper                                      9240 non-null   object 
 21  Digital Advertisement                          9240 non-null   object 
 22  Through Recommendations                        9240 non-null   object 
 23  Receive More Updates About Our Courses         9240 non-null   object 
 24  Tags                                           5887 non-null   object 
 25  Lead Quality                                   4473 non-null   object 
 26  Update me on Supply Chain Content              9240 non-null   object 
 27  Get updates on DM Content                      9240 non-null   object 
 28  Lead Profile                                   6531 non-null   object 
 29  City                                           7820 non-null   object 
 30  Asymmetrique Activity Index                    5022 non-null   object 
 31  Asymmetrique Profile Index                     5022 non-null   object 
 32  Asymmetrique Activity Score                    5022 non-null   float64
 33  Asymmetrique Profile Score                     5022 non-null   float64
 34  I agree to pay the amount through cheque       9240 non-null   object 
 35  A free copy of Mastering The Interview         9240 non-null   object 
 36  Last Notable Activity                          9240 non-null   object 
dtypes: float64(4), int64(3), object(30)
memory usage: 2.6+ MB
  • This data set has a total of 9240 records, each with 37 columns: the target and 36 other columns, two of which (Prospect ID and Lead Number) are identifiers.
  • Each record represents the characteristics of a lead and whether the lead was converted.
  • Converted column indicates whether the particular lead was converted to a client. This is our target variable.

Data Cleaning

Incorrect data types

In [155]:
# 'Converted' is a binary categorical variable but the info shows it is `int64`. Converting to `category` data type 
leads['Converted'] = leads['Converted'].astype('category')

Duplicates

In [156]:
# Checking for any duplicate leads / prospects
# duplicated() flags repeats of an earlier value; summing the boolean mask counts them

duplicate_prospect_ids = leads['Prospect ID'].duplicated().sum()
duplicate_lead_no = leads['Lead Number'].duplicated().sum()
print('No of Duplicate Prospect IDs : ', duplicate_prospect_ids)
print('No of Duplicate Lead Nos : ', duplicate_lead_no)
No of Duplicate Prospect IDs :  0
No of Duplicate Lead Nos :  0
  • There are no duplicate Prospect IDs or Lead Numbers.
  • Since these are identification columns not required for analysis, they can be popped and kept aside for re-identification at a later step.

Separating ID columns

In [157]:
# Popping Prospect ID and Lead Number columns for later use
prospect_ids = leads.pop('Prospect ID')
lead_no = leads.pop('Lead Number')

Missing Values

In [158]:
# Null values in each Column
nulls = pd.DataFrame(100*leads.isnull().sum()/leads.shape[0])
nulls.columns = ['Null Percentage']

# Sorting null percentages in descending order and highlighting null % > 45 
nulls[nulls['Null Percentage'] !=0].sort_values(by ='Null Percentage', ascending=False).style.applymap(lambda x : 'color : red' if x > 45 else '')
Out[158]:
                                               Null Percentage
Lead Quality                                         51.590909
Asymmetrique Profile Score                           45.649351
Asymmetrique Activity Score                          45.649351
Asymmetrique Profile Index                           45.649351
Asymmetrique Activity Index                          45.649351
Tags                                                 36.287879
Lead Profile                                         29.318182
What matters most to you in choosing a course        29.318182
What is your current occupation                      29.112554
Country                                              26.634199
How did you hear about X Education                   23.885281
Specialization                                       15.562771
City                                                 15.367965
TotalVisits                                           1.482684
Page Views Per Visit                                  1.482684
Last Activity                                         1.114719
Lead Source                                           0.389610
  • More than 45% of the leads have missing values in Lead Quality, Asymmetrique Profile Score, Asymmetrique Activity Score, Asymmetrique Profile Index and Asymmetrique Activity Index.
  • Further, the data in these columns is filled by the sales team and the values depend heavily on the team's judgement. These columns are not good candidates for modelling since the values are subjective.
  • Hence these columns could be dropped.
In [159]:
# Dropping columns with null percentage >= 45
high_null_col = nulls[nulls['Null Percentage'] >=45].index
leads.drop(columns=high_null_col, inplace=True)
In [160]:
# Rows Missing Target Variable 
print('Number of rows with missing Target Variable : ',leads['Converted'].isnull().sum())
Number of rows with missing Target Variable :  0
  • No rows with missing target value
In [161]:
# Rows missing more than 50% of values 
highNullRowsCondition = leads.isnull().sum(axis=1)/leads.shape[1] > 0.5
leads[highNullRowsCondition].index
Out[161]:
Int64Index([], dtype='int64')
  • No rows missing more than 50% of values.

Disguised missing Values

In [162]:
# Categorical columns
condition = leads.dtypes == 'object'
categoricalColumns = leads.dtypes[condition].index.values
categoricalColumns
Out[162]:
array(['Lead Origin', 'Lead Source', 'Do Not Email', 'Do Not Call',
       'Last Activity', 'Country', 'Specialization',
       'How did you hear about X Education',
       'What is your current occupation',
       'What matters most to you in choosing a course', 'Search',
       'Magazine', 'Newspaper Article', 'X Education Forums', 'Newspaper',
       'Digital Advertisement', 'Through Recommendations',
       'Receive More Updates About Our Courses', 'Tags',
       'Update me on Supply Chain Content', 'Get updates on DM Content',
       'Lead Profile', 'City', 'I agree to pay the amount through cheque',
       'A free copy of Mastering The Interview', 'Last Notable Activity'],
      dtype=object)
In [163]:
# value counts of each label in a categorical feature 
def cat_value_counts(column_name) : 
    '''
    prints unique values and value counts of each label in categorical column
    '''
    print(tabulate(pd.DataFrame(leads.stb.freq([column_name])), headers='keys', tablefmt='psql'))
    print(pd.DataFrame(leads[column_name]).stb.missing(),'\n\n\n')
In [164]:
# Looking at value counts of each label in categorical variables
for col in sorted(categoricalColumns) : 
    print(col)
    cat_value_counts(col)
A free copy of Mastering The Interview
+----+------------------------------------------+---------+-----------+--------------------+----------------------+
|    | A free copy of Mastering The Interview   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+------------------------------------------+---------+-----------+--------------------+----------------------|
|  0 | No                                       |    6352 |  0.687446 |               6352 |             0.687446 |
|  1 | Yes                                      |    2888 |  0.312554 |               9240 |             1        |
+----+------------------------------------------+---------+-----------+--------------------+----------------------+
                                        Missing  Total  Percent
A free copy of Mastering The Interview        0   9240      0.0 



City
+----+-----------------------------+---------+------------+--------------------+----------------------+
|    | City                        |   Count |    Percent |   Cumulative Count |   Cumulative Percent |
|----+-----------------------------+---------+------------+--------------------+----------------------|
|  0 | Mumbai                      |    3222 | 0.41202    |               3222 |             0.41202  |
|  1 | Select                      |    2249 | 0.287596   |               5471 |             0.699616 |
|  2 | Thane & Outskirts           |     752 | 0.0961637  |               6223 |             0.79578  |
|  3 | Other Cities                |     686 | 0.0877238  |               6909 |             0.883504 |
|  4 | Other Cities of Maharashtra |     457 | 0.0584399  |               7366 |             0.941944 |
|  5 | Other Metro Cities          |     380 | 0.0485934  |               7746 |             0.990537 |
|  6 | Tier II Cities              |      74 | 0.00946292 |               7820 |             1        |
+----+-----------------------------+---------+------------+--------------------+----------------------+
      Missing  Total  Percent
City     1420   9240  0.15368 



Country
+----+----------------------+---------+-------------+--------------------+----------------------+
|    | Country              |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------------+---------+-------------+--------------------+----------------------|
|  0 | India                |    6492 | 0.957663    |               6492 |             0.957663 |
|  1 | United States        |      69 | 0.0101785   |               6561 |             0.967842 |
|  2 | United Arab Emirates |      53 | 0.00781826  |               6614 |             0.97566  |
|  3 | Singapore            |      24 | 0.00354035  |               6638 |             0.9792   |
|  4 | Saudi Arabia         |      21 | 0.0030978   |               6659 |             0.982298 |
|  5 | United Kingdom       |      15 | 0.00221272  |               6674 |             0.984511 |
|  6 | Australia            |      13 | 0.00191769  |               6687 |             0.986429 |
|  7 | Qatar                |      10 | 0.00147514  |               6697 |             0.987904 |
|  8 | Hong Kong            |       7 | 0.0010326   |               6704 |             0.988936 |
|  9 | Bahrain              |       7 | 0.0010326   |               6711 |             0.989969 |
| 10 | Oman                 |       6 | 0.000885086 |               6717 |             0.990854 |
| 11 | France               |       6 | 0.000885086 |               6723 |             0.991739 |
| 12 | unknown              |       5 | 0.000737572 |               6728 |             0.992477 |
| 13 | South Africa         |       4 | 0.000590058 |               6732 |             0.993067 |
| 14 | Nigeria              |       4 | 0.000590058 |               6736 |             0.993657 |
| 15 | Kuwait               |       4 | 0.000590058 |               6740 |             0.994247 |
| 16 | Germany              |       4 | 0.000590058 |               6744 |             0.994837 |
| 17 | Canada               |       4 | 0.000590058 |               6748 |             0.995427 |
| 18 | Sweden               |       3 | 0.000442543 |               6751 |             0.99587  |
| 19 | Uganda               |       2 | 0.000295029 |               6753 |             0.996165 |
| 20 | Philippines          |       2 | 0.000295029 |               6755 |             0.99646  |
| 21 | Netherlands          |       2 | 0.000295029 |               6757 |             0.996755 |
| 22 | Italy                |       2 | 0.000295029 |               6759 |             0.99705  |
| 23 | Ghana                |       2 | 0.000295029 |               6761 |             0.997345 |
| 24 | China                |       2 | 0.000295029 |               6763 |             0.99764  |
| 25 | Belgium              |       2 | 0.000295029 |               6765 |             0.997935 |
| 26 | Bangladesh           |       2 | 0.000295029 |               6767 |             0.99823  |
| 27 | Asia/Pacific Region  |       2 | 0.000295029 |               6769 |             0.998525 |
| 28 | Vietnam              |       1 | 0.000147514 |               6770 |             0.998672 |
| 29 | Tanzania             |       1 | 0.000147514 |               6771 |             0.99882  |
| 30 | Switzerland          |       1 | 0.000147514 |               6772 |             0.998967 |
| 31 | Sri Lanka            |       1 | 0.000147514 |               6773 |             0.999115 |
| 32 | Russia               |       1 | 0.000147514 |               6774 |             0.999262 |
| 33 | Malaysia             |       1 | 0.000147514 |               6775 |             0.99941  |
| 34 | Liberia              |       1 | 0.000147514 |               6776 |             0.999557 |
| 35 | Kenya                |       1 | 0.000147514 |               6777 |             0.999705 |
| 36 | Indonesia            |       1 | 0.000147514 |               6778 |             0.999852 |
| 37 | Denmark              |       1 | 0.000147514 |               6779 |             1        |
+----+----------------------+---------+-------------+--------------------+----------------------+
         Missing  Total   Percent
Country     2461   9240  0.266342 



Digital Advertisement
+----+-------------------------+---------+-----------+--------------------+----------------------+
|    | Digital Advertisement   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------------+---------+-----------+--------------------+----------------------|
|  0 | No                      |    9236 | 0.999567  |               9236 |             0.999567 |
|  1 | Yes                     |       4 | 0.0004329 |               9240 |             1        |
+----+-------------------------+---------+-----------+--------------------+----------------------+
                       Missing  Total  Percent
Digital Advertisement        0   9240      0.0 



Do Not Call
+----+---------------+---------+------------+--------------------+----------------------+
|    | Do Not Call   |   Count |    Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------+---------+------------+--------------------+----------------------|
|  0 | No            |    9238 | 0.999784   |               9238 |             0.999784 |
|  1 | Yes           |       2 | 0.00021645 |               9240 |             1        |
+----+---------------+---------+------------+--------------------+----------------------+
             Missing  Total  Percent
Do Not Call        0   9240      0.0 



Do Not Email
+----+----------------+---------+-----------+--------------------+----------------------+
|    | Do Not Email   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------+---------+-----------+--------------------+----------------------|
|  0 | No             |    8506 | 0.920563  |               8506 |             0.920563 |
|  1 | Yes            |     734 | 0.0794372 |               9240 |             1        |
+----+----------------+---------+-----------+--------------------+----------------------+
              Missing  Total  Percent
Do Not Email        0   9240      0.0 



Get updates on DM Content
+----+-----------------------------+---------+-----------+--------------------+----------------------+
|    | Get updates on DM Content   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+-----------------------------+---------+-----------+--------------------+----------------------|
|  0 | No                          |    9240 |         1 |               9240 |                    1 |
+----+-----------------------------+---------+-----------+--------------------+----------------------+
                           Missing  Total  Percent
Get updates on DM Content        0   9240      0.0 



How did you hear about X Education
+----+--------------------------------------+---------+------------+--------------------+----------------------+
|    | How did you hear about X Education   |   Count |    Percent |   Cumulative Count |   Cumulative Percent |
|----+--------------------------------------+---------+------------+--------------------+----------------------|
|  0 | Select                               |    5043 | 0.717048   |               5043 |             0.717048 |
|  1 | Online Search                        |     808 | 0.114887   |               5851 |             0.831935 |
|  2 | Word Of Mouth                        |     348 | 0.049481   |               6199 |             0.881416 |
|  3 | Student of SomeSchool                |     310 | 0.0440779  |               6509 |             0.925494 |
|  4 | Other                                |     186 | 0.0264468  |               6695 |             0.951941 |
|  5 | Multiple Sources                     |     152 | 0.0216124  |               6847 |             0.973553 |
|  6 | Advertisements                       |      70 | 0.00995308 |               6917 |             0.983506 |
|  7 | Social Media                         |      67 | 0.00952652 |               6984 |             0.993033 |
|  8 | Email                                |      26 | 0.00369686 |               7010 |             0.99673  |
|  9 | SMS                                  |      23 | 0.0032703  |               7033 |             1        |
+----+--------------------------------------+---------+------------+--------------------+----------------------+
                                    Missing  Total   Percent
How did you hear about X Education     2207   9240  0.238853 



I agree to pay the amount through cheque
+----+--------------------------------------------+---------+-----------+--------------------+----------------------+
|    | I agree to pay the amount through cheque   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+--------------------------------------------+---------+-----------+--------------------+----------------------|
|  0 | No                                         |    9240 |         1 |               9240 |                    1 |
+----+--------------------------------------------+---------+-----------+--------------------+----------------------+
                                          Missing  Total  Percent
I agree to pay the amount through cheque        0   9240      0.0 



Last Activity
+----+------------------------------+---------+-------------+--------------------+----------------------+
|    | Last Activity                |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+------------------------------+---------+-------------+--------------------+----------------------|
|  0 | Email Opened                 |    3437 | 0.376163    |               3437 |             0.376163 |
|  1 | SMS Sent                     |    2745 | 0.300427    |               6182 |             0.67659  |
|  2 | Olark Chat Conversation      |     973 | 0.10649     |               7155 |             0.78308  |
|  3 | Page Visited on Website      |     640 | 0.0700449   |               7795 |             0.853125 |
|  4 | Converted to Lead            |     428 | 0.0468425   |               8223 |             0.899967 |
|  5 | Email Bounced                |     326 | 0.0356791   |               8549 |             0.935646 |
|  6 | Email Link Clicked           |     267 | 0.0292218   |               8816 |             0.964868 |
|  7 | Form Submitted on Website    |     116 | 0.0126956   |               8932 |             0.977564 |
|  8 | Unreachable                  |      93 | 0.0101784   |               9025 |             0.987742 |
|  9 | Unsubscribed                 |      61 | 0.00667615  |               9086 |             0.994418 |
| 10 | Had a Phone Conversation     |      30 | 0.00328335  |               9116 |             0.997702 |
| 11 | Approached upfront           |       9 | 0.000985006 |               9125 |             0.998687 |
| 12 | View in browser link Clicked |       6 | 0.000656671 |               9131 |             0.999343 |
| 13 | Email Received               |       2 | 0.00021889  |               9133 |             0.999562 |
| 14 | Email Marked Spam            |       2 | 0.00021889  |               9135 |             0.999781 |
| 15 | Visited Booth in Tradeshow   |       1 | 0.000109445 |               9136 |             0.999891 |
| 16 | Resubscribed to emails       |       1 | 0.000109445 |               9137 |             1        |
+----+------------------------------+---------+-------------+--------------------+----------------------+
               Missing  Total   Percent
Last Activity      103   9240  0.011147 



Last Notable Activity
+----+------------------------------+---------+-------------+--------------------+----------------------+
|    | Last Notable Activity        |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+------------------------------+---------+-------------+--------------------+----------------------|
|  0 | Modified                     |    3407 | 0.368723    |               3407 |             0.368723 |
|  1 | Email Opened                 |    2827 | 0.305952    |               6234 |             0.674675 |
|  2 | SMS Sent                     |    2172 | 0.235065    |               8406 |             0.90974  |
|  3 | Page Visited on Website      |     318 | 0.0344156   |               8724 |             0.944156 |
|  4 | Olark Chat Conversation      |     183 | 0.0198052   |               8907 |             0.963961 |
|  5 | Email Link Clicked           |     173 | 0.0187229   |               9080 |             0.982684 |
|  6 | Email Bounced                |      60 | 0.00649351  |               9140 |             0.989177 |
|  7 | Unsubscribed                 |      47 | 0.00508658  |               9187 |             0.994264 |
|  8 | Unreachable                  |      32 | 0.0034632   |               9219 |             0.997727 |
|  9 | Had a Phone Conversation     |      14 | 0.00151515  |               9233 |             0.999242 |
| 10 | Email Marked Spam            |       2 | 0.00021645  |               9235 |             0.999459 |
| 11 | View in browser link Clicked |       1 | 0.000108225 |               9236 |             0.999567 |
| 12 | Resubscribed to emails       |       1 | 0.000108225 |               9237 |             0.999675 |
| 13 | Form Submitted on Website    |       1 | 0.000108225 |               9238 |             0.999784 |
| 14 | Email Received               |       1 | 0.000108225 |               9239 |             0.999892 |
| 15 | Approached upfront           |       1 | 0.000108225 |               9240 |             1        |
+----+------------------------------+---------+-------------+--------------------+----------------------+
                       Missing  Total  Percent
Last Notable Activity        0   9240      0.0 



Lead Origin
+----+-------------------------+---------+-------------+--------------------+----------------------+
|    | Lead Origin             |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------------+---------+-------------+--------------------+----------------------|
|  0 | Landing Page Submission |    4886 | 0.528788    |               4886 |             0.528788 |
|  1 | API                     |    3580 | 0.387446    |               8466 |             0.916234 |
|  2 | Lead Add Form           |     718 | 0.0777056   |               9184 |             0.993939 |
|  3 | Lead Import             |      55 | 0.00595238  |               9239 |             0.999892 |
|  4 | Quick Add Form          |       1 | 0.000108225 |               9240 |             1        |
+----+-------------------------+---------+-------------+--------------------+----------------------+
             Missing  Total  Percent
Lead Origin        0   9240      0.0 



Lead Profile
+----+-----------------------------+---------+------------+--------------------+----------------------+
|    | Lead Profile                |   Count |    Percent |   Cumulative Count |   Cumulative Percent |
|----+-----------------------------+---------+------------+--------------------+----------------------|
|  0 | Select                      |    4146 | 0.634819   |               4146 |             0.634819 |
|  1 | Potential Lead              |    1613 | 0.246976   |               5759 |             0.881795 |
|  2 | Other Leads                 |     487 | 0.0745674  |               6246 |             0.956362 |
|  3 | Student of SomeSchool       |     241 | 0.0369009  |               6487 |             0.993263 |
|  4 | Lateral Student             |      24 | 0.00367478 |               6511 |             0.996938 |
|  5 | Dual Specialization Student |      20 | 0.00306232 |               6531 |             1        |
+----+-----------------------------+---------+------------+--------------------+----------------------+
              Missing  Total   Percent
Lead Profile     2709   9240  0.293182 



Lead Source
+----+-------------------+---------+-------------+--------------------+----------------------+
|    | Lead Source       |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------+---------+-------------+--------------------+----------------------|
|  0 | Google            |    2868 | 0.311604    |               2868 |             0.311604 |
|  1 | Direct Traffic    |    2543 | 0.276293    |               5411 |             0.587897 |
|  2 | Olark Chat        |    1755 | 0.190678    |               7166 |             0.778575 |
|  3 | Organic Search    |    1154 | 0.12538     |               8320 |             0.903955 |
|  4 | Reference         |     534 | 0.0580183   |               8854 |             0.961973 |
|  5 | Welingak Website  |     142 | 0.0154281   |               8996 |             0.977401 |
|  6 | Referral Sites    |     125 | 0.0135811   |               9121 |             0.990982 |
|  7 | Facebook          |      55 | 0.00597566  |               9176 |             0.996958 |
|  8 | bing              |       6 | 0.00065189  |               9182 |             0.99761  |
|  9 | google            |       5 | 0.000543242 |               9187 |             0.998153 |
| 10 | Click2call        |       4 | 0.000434594 |               9191 |             0.998588 |
| 11 | Social Media      |       2 | 0.000217297 |               9193 |             0.998805 |
| 12 | Press_Release     |       2 | 0.000217297 |               9195 |             0.999022 |
| 13 | Live Chat         |       2 | 0.000217297 |               9197 |             0.999239 |
| 14 | youtubechannel    |       1 | 0.000108648 |               9198 |             0.999348 |
| 15 | welearnblog_Home  |       1 | 0.000108648 |               9199 |             0.999457 |
| 16 | testone           |       1 | 0.000108648 |               9200 |             0.999565 |
| 17 | blog              |       1 | 0.000108648 |               9201 |             0.999674 |
| 18 | WeLearn           |       1 | 0.000108648 |               9202 |             0.999783 |
| 19 | Pay per Click Ads |       1 | 0.000108648 |               9203 |             0.999891 |
| 20 | NC_EDM            |       1 | 0.000108648 |               9204 |             1        |
+----+-------------------+---------+-------------+--------------------+----------------------+
             Missing  Total   Percent
Lead Source       36   9240  0.003896 



Magazine
+----+------------+---------+-----------+--------------------+----------------------+
|    | Magazine   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+------------+---------+-----------+--------------------+----------------------|
|  0 | No         |    9240 |         1 |               9240 |                    1 |
+----+------------+---------+-----------+--------------------+----------------------+
          Missing  Total  Percent
Magazine        0   9240      0.0 



Newspaper
+----+-------------+---------+-------------+--------------------+----------------------+
|    | Newspaper   |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------+---------+-------------+--------------------+----------------------|
|  0 | No          |    9239 | 0.999892    |               9239 |             0.999892 |
|  1 | Yes         |       1 | 0.000108225 |               9240 |             1        |
+----+-------------+---------+-------------+--------------------+----------------------+
           Missing  Total  Percent
Newspaper        0   9240      0.0 



Newspaper Article
+----+---------------------+---------+------------+--------------------+----------------------+
|    | Newspaper Article   |   Count |    Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------------+---------+------------+--------------------+----------------------|
|  0 | No                  |    9238 | 0.999784   |               9238 |             0.999784 |
|  1 | Yes                 |       2 | 0.00021645 |               9240 |             1        |
+----+---------------------+---------+------------+--------------------+----------------------+
                   Missing  Total  Percent
Newspaper Article        0   9240      0.0 



Receive More Updates About Our Courses
+----+------------------------------------------+---------+-----------+--------------------+----------------------+
|    | Receive More Updates About Our Courses   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+------------------------------------------+---------+-----------+--------------------+----------------------|
|  0 | No                                       |    9240 |         1 |               9240 |                    1 |
+----+------------------------------------------+---------+-----------+--------------------+----------------------+
                                        Missing  Total  Percent
Receive More Updates About Our Courses        0   9240      0.0 



Search
+----+----------+---------+------------+--------------------+----------------------+
|    | Search   |   Count |    Percent |   Cumulative Count |   Cumulative Percent |
|----+----------+---------+------------+--------------------+----------------------|
|  0 | No       |    9226 | 0.998485   |               9226 |             0.998485 |
|  1 | Yes      |      14 | 0.00151515 |               9240 |             1        |
+----+----------+---------+------------+--------------------+----------------------+
        Missing  Total  Percent
Search        0   9240      0.0 



Specialization
+----+-----------------------------------+---------+------------+--------------------+----------------------+
|    | Specialization                    |   Count |    Percent |   Cumulative Count |   Cumulative Percent |
|----+-----------------------------------+---------+------------+--------------------+----------------------|
|  0 | Select                            |    1942 | 0.248911   |               1942 |             0.248911 |
|  1 | Finance Management                |     976 | 0.125096   |               2918 |             0.374007 |
|  2 | Human Resource Management         |     848 | 0.10869    |               3766 |             0.482697 |
|  3 | Marketing Management              |     838 | 0.107408   |               4604 |             0.590105 |
|  4 | Operations Management             |     503 | 0.0644706  |               5107 |             0.654576 |
|  5 | Business Administration           |     403 | 0.0516534  |               5510 |             0.706229 |
|  6 | IT Projects Management            |     366 | 0.046911   |               5876 |             0.75314  |
|  7 | Supply Chain Management           |     349 | 0.0447321  |               6225 |             0.797872 |
|  8 | Banking, Investment And Insurance |     338 | 0.0433222  |               6563 |             0.841195 |
|  9 | Travel and Tourism                |     203 | 0.026019   |               6766 |             0.867214 |
| 10 | Media and Advertising             |     203 | 0.026019   |               6969 |             0.893233 |
| 11 | International Business            |     178 | 0.0228147  |               7147 |             0.916047 |
| 12 | Healthcare Management             |     159 | 0.0203794  |               7306 |             0.936427 |
| 13 | Hospitality Management            |     114 | 0.0146116  |               7420 |             0.951038 |
| 14 | E-COMMERCE                        |     112 | 0.0143553  |               7532 |             0.965393 |
| 15 | Retail Management                 |     100 | 0.0128172  |               7632 |             0.978211 |
| 16 | Rural and Agribusiness            |      73 | 0.00935658 |               7705 |             0.987567 |
| 17 | E-Business                        |      57 | 0.00730582 |               7762 |             0.994873 |
| 18 | Services Excellence               |      40 | 0.00512689 |               7802 |             1        |
+----+-----------------------------------+---------+------------+--------------------+----------------------+
                Missing  Total   Percent
Specialization     1438   9240  0.155628 



Tags
+----+---------------------------------------------------+---------+-------------+--------------------+----------------------+
|    | Tags                                              |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------------------------------------------+---------+-------------+--------------------+----------------------|
|  0 | Will revert after reading the email               |    2072 | 0.351962    |               2072 |             0.351962 |
|  1 | Ringing                                           |    1203 | 0.204349    |               3275 |             0.556311 |
|  2 | Interested in other courses                       |     513 | 0.0871412   |               3788 |             0.643452 |
|  3 | Already a student                                 |     465 | 0.0789876   |               4253 |             0.722439 |
|  4 | Closed by Horizzon                                |     358 | 0.060812    |               4611 |             0.783251 |
|  5 | switched off                                      |     240 | 0.0407678   |               4851 |             0.824019 |
|  6 | Busy                                              |     186 | 0.031595    |               5037 |             0.855614 |
|  7 | Lost to EINS                                      |     175 | 0.0297265   |               5212 |             0.885341 |
|  8 | Not doing further education                       |     145 | 0.0246305   |               5357 |             0.909971 |
|  9 | Interested  in full time MBA                      |     117 | 0.0198743   |               5474 |             0.929845 |
| 10 | Graduation in progress                            |     111 | 0.0188551   |               5585 |             0.948701 |
| 11 | invalid number                                    |      83 | 0.0140989   |               5668 |             0.962799 |
| 12 | Diploma holder (Not Eligible)                     |      63 | 0.0107015   |               5731 |             0.973501 |
| 13 | wrong number given                                |      47 | 0.00798369  |               5778 |             0.981485 |
| 14 | opp hangup                                        |      33 | 0.00560557  |               5811 |             0.98709  |
| 15 | number not provided                               |      27 | 0.00458638  |               5838 |             0.991677 |
| 16 | in touch with EINS                                |      12 | 0.00203839  |               5850 |             0.993715 |
| 17 | Lost to Others                                    |       7 | 0.00118906  |               5857 |             0.994904 |
| 18 | Want to take admission but has financial problems |       6 | 0.00101919  |               5863 |             0.995923 |
| 19 | Still Thinking                                    |       6 | 0.00101919  |               5869 |             0.996942 |
| 20 | Interested in Next batch                          |       5 | 0.000849329 |               5874 |             0.997792 |
| 21 | In confusion whether part time or DLP             |       5 | 0.000849329 |               5879 |             0.998641 |
| 22 | Lateral student                                   |       3 | 0.000509597 |               5882 |             0.999151 |
| 23 | University not recognized                         |       2 | 0.000339732 |               5884 |             0.99949  |
| 24 | Shall take in the next coming month               |       2 | 0.000339732 |               5886 |             0.99983  |
| 25 | Recognition issue (DEC approval)                  |       1 | 0.000169866 |               5887 |             1        |
+----+---------------------------------------------------+---------+-------------+--------------------+----------------------+
      Missing  Total   Percent
Tags     3353   9240  0.362879 



Through Recommendations
+----+---------------------------+---------+-------------+--------------------+----------------------+
|    | Through Recommendations   |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------------------+---------+-------------+--------------------+----------------------|
|  0 | No                        |    9233 | 0.999242    |               9233 |             0.999242 |
|  1 | Yes                       |       7 | 0.000757576 |               9240 |             1        |
+----+---------------------------+---------+-------------+--------------------+----------------------+
                         Missing  Total  Percent
Through Recommendations        0   9240      0.0 



Update me on Supply Chain Content
+----+-------------------------------------+---------+-----------+--------------------+----------------------+
|    | Update me on Supply Chain Content   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------------------------+---------+-----------+--------------------+----------------------|
|  0 | No                                  |    9240 |         1 |               9240 |                    1 |
+----+-------------------------------------+---------+-----------+--------------------+----------------------+
                                   Missing  Total  Percent
Update me on Supply Chain Content        0   9240      0.0 



What is your current occupation
+----+-----------------------------------+---------+------------+--------------------+----------------------+
|    | What is your current occupation   |   Count |    Percent |   Cumulative Count |   Cumulative Percent |
|----+-----------------------------------+---------+------------+--------------------+----------------------|
|  0 | Unemployed                        |    5600 | 0.854962   |               5600 |             0.854962 |
|  1 | Working Professional              |     706 | 0.107786   |               6306 |             0.962748 |
|  2 | Student                           |     210 | 0.0320611  |               6516 |             0.994809 |
|  3 | Other                             |      16 | 0.00244275 |               6532 |             0.997252 |
|  4 | Housewife                         |      10 | 0.00152672 |               6542 |             0.998779 |
|  5 | Businessman                       |       8 | 0.00122137 |               6550 |             1        |
+----+-----------------------------------+---------+------------+--------------------+----------------------+
                                 Missing  Total   Percent
What is your current occupation     2690   9240  0.291126 



What matters most to you in choosing a course
+----+-------------------------------------------------+---------+-------------+--------------------+----------------------+
|    | What matters most to you in choosing a course   |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------------------------------------+---------+-------------+--------------------+----------------------|
|  0 | Better Career Prospects                         |    6528 | 0.999541    |               6528 |             0.999541 |
|  1 | Flexibility & Convenience                       |       2 | 0.000306232 |               6530 |             0.999847 |
|  2 | Other                                           |       1 | 0.000153116 |               6531 |             1        |
+----+-------------------------------------------------+---------+-------------+--------------------+----------------------+
                                               Missing  Total   Percent
What matters most to you in choosing a course     2709   9240  0.293182 



X Education Forums
+----+----------------------+---------+-------------+--------------------+----------------------+
|    | X Education Forums   |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------------+---------+-------------+--------------------+----------------------|
|  0 | No                   |    9239 | 0.999892    |               9239 |             0.999892 |
|  1 | Yes                  |       1 | 0.000108225 |               9240 |             1        |
+----+----------------------+---------+-------------+--------------------+----------------------+
                    Missing  Total  Percent
X Education Forums        0   9240      0.0 



  • The following columns contain the label Select, which is a disguised missing value:
    • Specialization
    • Lead Profile
    • City
    • How did you hear about X Education
  • Select is the default option in online forms, so this value most likely means that the lead did not choose any option.
  • We replace these values with np.nan.
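One way to confirm which columns carry the placeholder before replacing it is to count its occurrences per column. The sketch below is only illustrative: `demo` is a hypothetical miniature of the leads frame with made-up rows, and the variable names are not from the notebook.

```python
import pandas as pd

# Hypothetical miniature of the leads frame; column names follow the notebook,
# the rows are made up for illustration only.
demo = pd.DataFrame({
    'Specialization': ['Select', 'Finance Management', 'Select', None],
    'City': ['Mumbai', 'Select', 'Select', 'Select'],
    'Lead Origin': ['API', 'API', 'Lead Add Form', 'API'],
})

# Count how often the placeholder label appears in each column.
select_counts = (demo == 'Select').sum()

# Keep only the columns where the placeholder actually occurs.
select_columns = select_counts[select_counts > 0].index.tolist()

print(select_counts)
print(select_columns)
```

Running the same comparison on the full `leads` frame should list the four columns named above, assuming no other column uses Select as a genuine label.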
In [165]:
# Replacing Select with NaN value
leads.replace({'Select' : np.nan},inplace=True)
In [166]:
# Looking at Missing Values again 

# Null values in each Column
nulls = pd.DataFrame(100*leads.isnull().sum()/leads.shape[0])
nulls.columns = ['Null Percentage']

# Sorting null percentages in descending order and highlighting null % > 50 
nulls[nulls['Null Percentage'] !=0].sort_values(by ='Null Percentage', ascending=False).style.applymap(lambda x : 'color : red' if x > 50 else '')
Out[166]:
Null Percentage
How did you hear about X Education 78.463203
Lead Profile 74.188312
City 39.707792
Specialization 36.580087
Tags 36.287879
What matters most to you in choosing a course 29.318182
What is your current occupation 29.112554
Country 26.634199
TotalVisits 1.482684
Page Views Per Visit 1.482684
Last Activity 1.114719
Lead Source 0.389610
  • Lead Profile and How did you hear about X Education have a very high percentage of nulls (about 74% and 78% respectively), so we drop these columns.
In [167]:
leads.drop(columns=['Lead Profile','How did you hear about X Education'],inplace=True)
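
The drop above names the two columns explicitly. The same decision could also be expressed as a rule, dropping every column whose null percentage exceeds a cutoff. This is a minimal sketch, assuming the 50% cutoff used for highlighting earlier; `drop_sparse_columns` and `demo` are hypothetical names with made-up data, not part of the notebook.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, threshold=50.0):
    """Return (trimmed frame, dropped column names) for columns above the null threshold."""
    # isnull().mean() gives the fraction of nulls per column.
    null_pct = 100 * df.isnull().mean()
    to_drop = null_pct[null_pct > threshold].index.tolist()
    return df.drop(columns=to_drop), to_drop

# Made-up rows for illustration only.
demo = pd.DataFrame({
    'Lead Profile': [np.nan, np.nan, np.nan, 'Potential Lead'],  # 75% null
    'City': ['Mumbai', np.nan, 'Thane', 'Mumbai'],                # 25% null
    'Lead Origin': ['API', 'API', 'Lead Add Form', 'API'],        # 0% null
})

trimmed, dropped = drop_sparse_columns(demo)
print(dropped)
print(trimmed.columns.tolist())
```

Returning the list of dropped names alongside the frame keeps the decision visible in the notebook output.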

Imputation

In [168]:
# Sorting null percentages in ascending order and highlighting null % < 16
def lowNulls() : 
    nulls = pd.DataFrame(100*leads.isnull().sum()/leads.shape[0])
    nulls.columns = ['Null Percentage']
    return nulls[nulls['Null Percentage'] !=0].sort_values(by ='Null Percentage', ascending=True).style.applymap(lambda x : 'color : green' if x < 16 else '')

lowNulls()
Out[168]:
Null Percentage
Lead Source 0.389610
Last Activity 1.114719
TotalVisits 1.482684
Page Views Per Visit 1.482684
Country 26.634199
What is your current occupation 29.112554
What matters most to you in choosing a course 29.318182
Tags 36.287879
Specialization 36.580087
City 39.707792
  • 'Lead Source', 'Last Activity', 'TotalVisits' and 'Page Views Per Visit' each have less than 2% missing values, so the affected rows could be dropped.
  • Columns with a higher share of missing values are imputed on a case-by-case basis.
  • Missing values are imputed with the statistic most representative of the feature's distribution.
  • For categorical features, that statistic is the most frequently occurring label, i.e. the mode.
  • For continuous features, it is the median if outliers are present and the mean otherwise, so the choice depends on whether the feature has outliers.
  • About 27% of the values in the Country column are missing.
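The imputation rules above can be sketched as a small helper. This is an illustration under stated assumptions, not the notebook's method: outliers are detected with the 1.5 × IQR rule (one common convention), `has_outliers`, `representative_value` and `demo` are hypothetical names, and the rows are made up.

```python
import numpy as np
import pandas as pd

def has_outliers(s, k=1.5):
    """Flag outliers with the k * IQR rule (an assumed convention)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return bool(((s < q1 - k * iqr) | (s > q3 + k * iqr)).any())

def representative_value(s):
    """Mode for categorical data; median for numeric data with outliers, else mean."""
    s = s.dropna()
    if not pd.api.types.is_numeric_dtype(s):
        return s.mode().iloc[0]
    return s.median() if has_outliers(s) else s.mean()

# Made-up rows for illustration only.
demo = pd.DataFrame({
    'Country': ['India', 'India', None, 'Qatar', 'India'],
    'TotalVisits': [2.0, 3.0, np.nan, 4.0, 251.0],         # 251 is an outlier
    'Page Views Per Visit': [2.0, 2.0, 3.0, np.nan, 3.0],  # no outliers
})

filled = demo.copy()
for col in filled.columns:
    # Assign the result back instead of using inplace=True on a column.
    filled[col] = filled[col].fillna(representative_value(filled[col]))

print(filled)
```

Here Country is filled with its mode, TotalVisits with its median because of the outlier, and Page Views Per Visit with its mean.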
In [169]:
# Country Imputation : 
leads.stb.freq(['Country'])
Out[169]:
Country Count Percent Cumulative Count Cumulative Percent
0 India 6492 0.957663 6492 0.957663
1 United States 69 0.010178 6561 0.967842
2 United Arab Emirates 53 0.007818 6614 0.975660
3 Singapore 24 0.003540 6638 0.979200
4 Saudi Arabia 21 0.003098 6659 0.982298
5 United Kingdom 15 0.002213 6674 0.984511
6 Australia 13 0.001918 6687 0.986429
7 Qatar 10 0.001475 6697 0.987904
8 Hong Kong 7 0.001033 6704 0.988936
9 Bahrain 7 0.001033 6711 0.989969
10 Oman 6 0.000885 6717 0.990854
11 France 6 0.000885 6723 0.991739
12 unknown 5 0.000738 6728 0.992477
13 South Africa 4 0.000590 6732 0.993067
14 Nigeria 4 0.000590 6736 0.993657
15 Kuwait 4 0.000590 6740 0.994247
16 Germany 4 0.000590 6744 0.994837
17 Canada 4 0.000590 6748 0.995427
18 Sweden 3 0.000443 6751 0.995870
19 Uganda 2 0.000295 6753 0.996165
20 Philippines 2 0.000295 6755 0.996460
21 Netherlands 2 0.000295 6757 0.996755
22 Italy 2 0.000295 6759 0.997050
23 Ghana 2 0.000295 6761 0.997345
24 China 2 0.000295 6763 0.997640
25 Belgium 2 0.000295 6765 0.997935
26 Bangladesh 2 0.000295 6767 0.998230
27 Asia/Pacific Region 2 0.000295 6769 0.998525
28 Vietnam 1 0.000148 6770 0.998672
29 Tanzania 1 0.000148 6771 0.998820
30 Switzerland 1 0.000148 6772 0.998967
31 Sri Lanka 1 0.000148 6773 0.999115
32 Russia 1 0.000148 6774 0.999262
33 Malaysia 1 0.000148 6775 0.999410
34 Liberia 1 0.000148 6776 0.999557
35 Kenya 1 0.000148 6777 0.999705
36 Indonesia 1 0.000148 6778 0.999852
37 Denmark 1 0.000148 6779 1.000000
  • Since about 96% of leads with a known country come from India, it is probable that most of the missing values are also from India.
In [170]:
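# Imputing missing values in the Country feature with "India"
# (assignment form avoids the chained inplace fillna, which newer pandas versions warn about)
leads['Country'] = leads['Country'].fillna('India')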
  • Specialization feature has 36% of missing values.
  • Since no single label dominates this feature, imputing with an existing label would mislead the analysis.
  • Hence, we could impute missing values with a new label 'No Specialization'
In [171]:
# Imputing null values with a new label, "No Specialization".
leads['Specialization'] = leads['Specialization'].fillna("No Specialization")
print('Missing values in Specialization feature ', leads['Specialization'].isnull().sum())
Missing values in Specialization feature  0
In [172]:
leads['Specialization'].value_counts()
Out[172]:
No Specialization                    3380
Finance Management                    976
Human Resource Management             848
Marketing Management                  838
Operations Management                 503
Business Administration               403
IT Projects Management                366
Supply Chain Management               349
Banking, Investment And Insurance     338
Travel and Tourism                    203
Media and Advertising                 203
International Business                178
Healthcare Management                 159
Hospitality Management                114
E-COMMERCE                            112
Retail Management                     100
Rural and Agribusiness                 73
E-Business                             57
Services Excellence                    40
Name: Specialization, dtype: int64
  • About 39% of values in City column are missing
In [173]:
# Imputation of missing cities 
leads.stb.freq(['City'])
Out[173]:
City Count Percent Cumulative Count Cumulative Percent
0 Mumbai 3222 0.578352 3222 0.578352
1 Thane & Outskirts 752 0.134985 3974 0.713337
2 Other Cities 686 0.123138 4660 0.836475
3 Other Cities of Maharashtra 457 0.082032 5117 0.918507
4 Other Metro Cities 380 0.068210 5497 0.986717
5 Tier II Cities 74 0.013283 5571 1.000000
In [174]:
# Missing Cities vs Country
condition_india = leads['Country'] == 'India'
print('Total Missing City values :', leads['City'].isnull().sum())
print('Missing City values in leads from India : ',leads.loc[condition_india,'City'].isnull().sum())
Total Missing City values : 3669
Missing City values in leads from India :  3609
  • 3609 out of 3669 leads with a missing City label are from India.
  • As can be seen from the value counts of the City feature, about 58% of the leads with a known city come from Mumbai.
  • Hence, we could impute the missing City value for leads from India with Mumbai.
In [175]:
# Replacing Null Cities in India with Mumbai
condition = (leads['City'].isnull()) & condition_india
leads.loc[condition,'City'] = 'Mumbai'
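The masked assignment above only touches rows where both conditions hold. A minimal sketch on a toy frame (the data below is made up for illustration, not taken from the leads dataset) shows that missing cities outside India stay missing:

```python
import pandas as pd

# Toy frame: one missing city in India, one missing city elsewhere.
toy = pd.DataFrame({
    "Country": ["India", "India", "United States", "India"],
    "City": ["Mumbai", None, None, "Thane & Outskirts"],
})

is_india = toy["Country"] == "India"
missing_city = toy["City"].isnull()

# Only rows that are BOTH from India AND missing a city are filled.
toy.loc[missing_city & is_india, "City"] = "Mumbai"

remaining_missing = int(toy["City"].isnull().sum())
print(toy)
print("Missing City values left:", remaining_missing)
```

On the real data the same logic leaves 3669 − 3609 = 60 City values unfilled, which matches the 60 missing City values reported in the missing-value table in the "Dropping Unnecessary Columns" section.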
  • 29% of values in What is your current occupation column are missing.
  • Let's look at the distribution of levels in this column
In [176]:
tab(leads.stb.freq(['What is your current occupation']))
tab(leads['What is your current occupation'].reset_index().stb.missing())
+----+-----------------------------------+---------+------------+--------------------+----------------------+
|    | What is your current occupation   |   Count |    Percent |   Cumulative Count |   Cumulative Percent |
|----+-----------------------------------+---------+------------+--------------------+----------------------|
|  0 | Unemployed                        |    5600 | 0.854962   |               5600 |             0.854962 |
|  1 | Working Professional              |     706 | 0.107786   |               6306 |             0.962748 |
|  2 | Student                           |     210 | 0.0320611  |               6516 |             0.994809 |
|  3 | Other                             |      16 | 0.00244275 |               6532 |             0.997252 |
|  4 | Housewife                         |      10 | 0.00152672 |               6542 |             0.998779 |
|  5 | Businessman                       |       8 | 0.00122137 |               6550 |             1        |
+----+-----------------------------------+---------+------------+--------------------+----------------------+
+---------------------------------+-----------+---------+-----------+
|                                 |   Missing |   Total |   Percent |
|---------------------------------+-----------+---------+-----------|
| What is your current occupation |      2690 |    9240 |  0.291126 |
| index                           |         0 |    9240 |  0        |
+---------------------------------+-----------+---------+-----------+
  • Since the problem statement says the company sells its courses to industry professionals, this is an extremely important variable.
  • So to keep the analysis unbiased, we could impute missing values with a new level for now.
  • Let's replace missing values in What is your current occupation with 'Unknown Occupation'
In [177]:
# Filling missing values with a new level, 'Unknown Occupation'
leads['What is your current occupation'] = leads['What is your current occupation'].fillna('Unknown Occupation')
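To see why a new level distorts this feature less than mode imputation would, the counts from the frequency table above can be plugged into a short calculation. This is an illustrative sketch only; the variable names are made up here, and the counts are copied from the output of cell In [176].

```python
# Counts of known occupations and of missing values, from the tables above.
counts = {
    "Unemployed": 5600,
    "Working Professional": 706,
    "Student": 210,
    "Other": 16,
    "Housewife": 10,
    "Businessman": 8,
}
missing = 2690
total = sum(counts.values()) + missing  # all leads, known + missing

# Option A: mode imputation assigns every missing value to 'Unemployed'.
share_mode = (counts["Unemployed"] + missing) / total

# Option B: a new level keeps 'Unemployed' at its observed count.
share_sentinel = counts["Unemployed"] / total
share_unknown = missing / total

print(f"Total leads: {total}")
print(f"'Unemployed' share with mode imputation: {share_mode:.3f}")
print(f"'Unemployed' share with a new level:     {share_sentinel:.3f}")
print(f"'Unknown Occupation' share:              {share_unknown:.3f}")
```

Mode imputation would label roughly 90% of all leads as Unemployed, while the new level keeps the 29% of unanswered rows visibly separate.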
In [178]:
# Missing Values in `What matters most to you in choosing a course`
ftr = 'What matters most to you in choosing a course'
tab(leads.stb.freq([ftr]))
tab(leads[ftr].reset_index().stb.missing())
+----+-------------------------------------------------+---------+-------------+--------------------+----------------------+
|    | What matters most to you in choosing a course   |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------------------------------------+---------+-------------+--------------------+----------------------|
|  0 | Better Career Prospects                         |    6528 | 0.999541    |               6528 |             0.999541 |
|  1 | Flexibility & Convenience                       |       2 | 0.000306232 |               6530 |             0.999847 |
|  2 | Other                                           |       1 | 0.000153116 |               6531 |             1        |
+----+-------------------------------------------------+---------+-------------+--------------------+----------------------+
+-----------------------------------------------+-----------+---------+-----------+
|                                               |   Missing |   Total |   Percent |
|-----------------------------------------------+-----------+---------+-----------|
| What matters most to you in choosing a course |      2709 |    9240 |  0.293182 |
| index                                         |         0 |    9240 |  0        |
+-----------------------------------------------+-----------+---------+-----------+
  • Among leads who answered this question, almost all (>99%) chose Better Career Prospects.
  • Since this might be a very important feature from the analysis perspective, to keep the analysis unbiased, instead of imputation with an existing label, let's impute with a new label for now.
  • Let's replace missing values in What matters most to you in choosing a course with 'Unknown Target' for now.
In [179]:
# Filling missing values with a new label, 'Unknown Target'
leads[ftr] = leads[ftr].fillna('Unknown Target')

Incorrect Labels

In [180]:
# Looking at labels in each Categorical Variable to check for incorrect labels. 
categoricalFeatures = leads.dtypes[leads.dtypes == 'object'].index.values
print('Categorical Features : ', categoricalFeatures,'\n\n')
for feature in categoricalFeatures : 
    print('Levels in ',feature,' are ' , leads[feature].unique(),'\n\n')
Categorical Features :  ['Lead Origin' 'Lead Source' 'Do Not Email' 'Do Not Call' 'Last Activity'
 'Country' 'Specialization' 'What is your current occupation'
 'What matters most to you in choosing a course' 'Search' 'Magazine'
 'Newspaper Article' 'X Education Forums' 'Newspaper'
 'Digital Advertisement' 'Through Recommendations'
 'Receive More Updates About Our Courses' 'Tags'
 'Update me on Supply Chain Content' 'Get updates on DM Content' 'City'
 'I agree to pay the amount through cheque'
 'A free copy of Mastering The Interview' 'Last Notable Activity'] 


Levels in  Lead Origin  are  ['API' 'Landing Page Submission' 'Lead Add Form' 'Lead Import'
 'Quick Add Form'] 


Levels in  Lead Source  are  ['Olark Chat' 'Organic Search' 'Direct Traffic' 'Google' 'Referral Sites'
 'Welingak Website' 'Reference' 'google' 'Facebook' nan 'blog'
 'Pay per Click Ads' 'bing' 'Social Media' 'WeLearn' 'Click2call'
 'Live Chat' 'welearnblog_Home' 'youtubechannel' 'testone' 'Press_Release'
 'NC_EDM'] 


Levels in  Do Not Email  are  ['No' 'Yes'] 


Levels in  Do Not Call  are  ['No' 'Yes'] 


Levels in  Last Activity  are  ['Page Visited on Website' 'Email Opened' 'Unreachable'
 'Converted to Lead' 'Olark Chat Conversation' 'Email Bounced'
 'Email Link Clicked' 'Form Submitted on Website' 'Unsubscribed'
 'Had a Phone Conversation' 'View in browser link Clicked' nan
 'Approached upfront' 'SMS Sent' 'Visited Booth in Tradeshow'
 'Resubscribed to emails' 'Email Received' 'Email Marked Spam'] 


Levels in  Country  are  ['India' 'Russia' 'Kuwait' 'Oman' 'United Arab Emirates' 'United States'
 'Australia' 'United Kingdom' 'Bahrain' 'Ghana' 'Singapore' 'Qatar'
 'Saudi Arabia' 'Belgium' 'France' 'Sri Lanka' 'China' 'Canada'
 'Netherlands' 'Sweden' 'Nigeria' 'Hong Kong' 'Germany'
 'Asia/Pacific Region' 'Uganda' 'Kenya' 'Italy' 'South Africa' 'Tanzania'
 'unknown' 'Malaysia' 'Liberia' 'Switzerland' 'Denmark' 'Philippines'
 'Bangladesh' 'Vietnam' 'Indonesia'] 


Levels in  Specialization  are  ['No Specialization' 'Business Administration' 'Media and Advertising'
 'Supply Chain Management' 'IT Projects Management' 'Finance Management'
 'Travel and Tourism' 'Human Resource Management' 'Marketing Management'
 'Banking, Investment And Insurance' 'International Business' 'E-COMMERCE'
 'Operations Management' 'Retail Management' 'Services Excellence'
 'Hospitality Management' 'Rural and Agribusiness' 'Healthcare Management'
 'E-Business'] 


Levels in  What is your current occupation  are  ['Unemployed' 'Student' 'Unknown Occupation' 'Working Professional'
 'Businessman' 'Other' 'Housewife'] 


Levels in  What matters most to you in choosing a course  are  ['Better Career Prospects' 'Unknown Target' 'Flexibility & Convenience'
 'Other'] 


Levels in  Search  are  ['No' 'Yes'] 


Levels in  Magazine  are  ['No'] 


Levels in  Newspaper Article  are  ['No' 'Yes'] 


Levels in  X Education Forums  are  ['No' 'Yes'] 


Levels in  Newspaper  are  ['No' 'Yes'] 


Levels in  Digital Advertisement  are  ['No' 'Yes'] 


Levels in  Through Recommendations  are  ['No' 'Yes'] 


Levels in  Receive More Updates About Our Courses  are  ['No'] 


Levels in  Tags  are  ['Interested in other courses' 'Ringing'
 'Will revert after reading the email' nan 'Lost to EINS'
 'In confusion whether part time or DLP' 'Busy' 'switched off'
 'in touch with EINS' 'Already a student' 'Diploma holder (Not Eligible)'
 'Graduation in progress' 'Closed by Horizzon' 'number not provided'
 'opp hangup' 'Not doing further education' 'invalid number'
 'wrong number given' 'Interested  in full time MBA' 'Still Thinking'
 'Lost to Others' 'Shall take in the next coming month' 'Lateral student'
 'Interested in Next batch' 'Recognition issue (DEC approval)'
 'Want to take admission but has financial problems'
 'University not recognized'] 


Levels in  Update me on Supply Chain Content  are  ['No'] 


Levels in  Get updates on DM Content  are  ['No'] 


Levels in  City  are  ['Mumbai' 'Thane & Outskirts' 'Other Metro Cities' nan 'Other Cities'
 'Other Cities of Maharashtra' 'Tier II Cities'] 


Levels in  I agree to pay the amount through cheque  are  ['No'] 


Levels in  A free copy of Mastering The Interview  are  ['No' 'Yes'] 


Levels in  Last Notable Activity  are  ['Modified' 'Email Opened' 'Page Visited on Website' 'Email Bounced'
 'Email Link Clicked' 'Unreachable' 'Unsubscribed'
 'Had a Phone Conversation' 'Olark Chat Conversation' 'SMS Sent'
 'Approached upfront' 'Resubscribed to emails'
 'View in browser link Clicked' 'Form Submitted on Website'
 'Email Received' 'Email Marked Spam'] 


  • Google appears twice in 'Lead Source' with different capitalization ('Google' and 'google').
In [181]:
# Replacing 'google' with 'Google'
leads['Lead Source'] = leads['Lead Source'].str.replace("google", "Google")
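Duplicates like this were spotted by eye in the printed levels. As a hedged sketch, labels that differ only by capitalization could also be found programmatically. The toy series and the names `by_lower` and `case_variants` below are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for a categorical column with a case-inconsistent label.
sources = pd.Series(["Google", "google", "bing", "Facebook", "google", None])

# Group the distinct labels by their lowercase form.
by_lower = {}
for label in sources.dropna().unique():
    by_lower.setdefault(label.lower(), []).append(label)

# Keep only lowercase keys that map to more than one spelling.
case_variants = {key: sorted(variants)
                 for key, variants in by_lower.items() if len(variants) > 1}
print(case_variants)

# Exact-match replacement of the inconsistent spelling.
cleaned = sources.replace({"google": "Google"})
print(cleaned.dropna().unique())
```

Unlike `str.replace`, which substitutes substrings, `Series.replace` with a dict only changes values that match the key exactly, so labels that merely contain "google" would be left alone.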

Cleaning Categorical Features

Dropping Unnecessary Columns

In [182]:
# Missing Values and Value Counts for all categorical Variables 
tab(leads.stb.missing())
print('Value Counts of each Feature : \n')
for feature in sorted(categoricalFeatures) : 
    tab(leads.stb.freq([feature]))
    
+-----------------------------------------------+-----------+---------+------------+
|                                               |   Missing |   Total |    Percent |
|-----------------------------------------------+-----------+---------+------------|
| Tags                                          |      3353 |    9240 | 0.362879   |
| TotalVisits                                   |       137 |    9240 | 0.0148268  |
| Page Views Per Visit                          |       137 |    9240 | 0.0148268  |
| Last Activity                                 |       103 |    9240 | 0.0111472  |
| City                                          |        60 |    9240 | 0.00649351 |
| Lead Source                                   |        36 |    9240 | 0.0038961  |
| Lead Origin                                   |         0 |    9240 | 0          |
| X Education Forums                            |         0 |    9240 | 0          |
| A free copy of Mastering The Interview        |         0 |    9240 | 0          |
| I agree to pay the amount through cheque      |         0 |    9240 | 0          |
| Get updates on DM Content                     |         0 |    9240 | 0          |
| Update me on Supply Chain Content             |         0 |    9240 | 0          |
| Receive More Updates About Our Courses        |         0 |    9240 | 0          |
| Through Recommendations                       |         0 |    9240 | 0          |
| Digital Advertisement                         |         0 |    9240 | 0          |
| Newspaper                                     |         0 |    9240 | 0          |
| Magazine                                      |         0 |    9240 | 0          |
| Newspaper Article                             |         0 |    9240 | 0          |
| Search                                        |         0 |    9240 | 0          |
| What matters most to you in choosing a course |         0 |    9240 | 0          |
| What is your current occupation               |         0 |    9240 | 0          |
| Specialization                                |         0 |    9240 | 0          |
| Country                                       |         0 |    9240 | 0          |
| Total Time Spent on Website                   |         0 |    9240 | 0          |
| Converted                                     |         0 |    9240 | 0          |
| Do Not Call                                   |         0 |    9240 | 0          |
| Do Not Email                                  |         0 |    9240 | 0          |
| Last Notable Activity                         |         0 |    9240 | 0          |
+-----------------------------------------------+-----------+---------+------------+
Value Counts of each Feature : 

+----+------------------------------------------+---------+-----------+--------------------+----------------------+
|    | A free copy of Mastering The Interview   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+------------------------------------------+---------+-----------+--------------------+----------------------|
|  0 | No                                       |    6352 |  0.687446 |               6352 |             0.687446 |
|  1 | Yes                                      |    2888 |  0.312554 |               9240 |             1        |
+----+------------------------------------------+---------+-----------+--------------------+----------------------+
+----+-----------------------------+---------+-----------+--------------------+----------------------+
|    | City                        |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+-----------------------------+---------+-----------+--------------------+----------------------|
|  0 | Mumbai                      |    6831 | 0.744118  |               6831 |             0.744118 |
|  1 | Thane & Outskirts           |     752 | 0.0819172 |               7583 |             0.826035 |
|  2 | Other Cities                |     686 | 0.0747277 |               8269 |             0.900763 |
|  3 | Other Cities of Maharashtra |     457 | 0.0497821 |               8726 |             0.950545 |
|  4 | Other Metro Cities          |     380 | 0.0413943 |               9106 |             0.991939 |
|  5 | Tier II Cities              |      74 | 0.008061  |               9180 |             1        |
+----+-----------------------------+---------+-----------+--------------------+----------------------+
+----+----------------------+---------+-------------+--------------------+----------------------+
|    | Country              |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------------+---------+-------------+--------------------+----------------------|
|  0 | India                |    8953 | 0.968939    |               8953 |             0.968939 |
|  1 | United States        |      69 | 0.00746753  |               9022 |             0.976407 |
|  2 | United Arab Emirates |      53 | 0.00573593  |               9075 |             0.982143 |
|  3 | Singapore            |      24 | 0.0025974   |               9099 |             0.98474  |
|  4 | Saudi Arabia         |      21 | 0.00227273  |               9120 |             0.987013 |
|  5 | United Kingdom       |      15 | 0.00162338  |               9135 |             0.988636 |
|  6 | Australia            |      13 | 0.00140693  |               9148 |             0.990043 |
|  7 | Qatar                |      10 | 0.00108225  |               9158 |             0.991126 |
|  8 | Hong Kong            |       7 | 0.000757576 |               9165 |             0.991883 |
|  9 | Bahrain              |       7 | 0.000757576 |               9172 |             0.992641 |
| 10 | Oman                 |       6 | 0.000649351 |               9178 |             0.99329  |
| 11 | France               |       6 | 0.000649351 |               9184 |             0.993939 |
| 12 | unknown              |       5 | 0.000541126 |               9189 |             0.994481 |
| 13 | South Africa         |       4 | 0.0004329   |               9193 |             0.994913 |
| 14 | Nigeria              |       4 | 0.0004329   |               9197 |             0.995346 |
| 15 | Kuwait               |       4 | 0.0004329   |               9201 |             0.995779 |
| 16 | Germany              |       4 | 0.0004329   |               9205 |             0.996212 |
| 17 | Canada               |       4 | 0.0004329   |               9209 |             0.996645 |
| 18 | Sweden               |       3 | 0.000324675 |               9212 |             0.99697  |
| 19 | Uganda               |       2 | 0.00021645  |               9214 |             0.997186 |
| 20 | Philippines          |       2 | 0.00021645  |               9216 |             0.997403 |
| 21 | Netherlands          |       2 | 0.00021645  |               9218 |             0.997619 |
| 22 | Italy                |       2 | 0.00021645  |               9220 |             0.997835 |
| 23 | Ghana                |       2 | 0.00021645  |               9222 |             0.998052 |
| 24 | China                |       2 | 0.00021645  |               9224 |             0.998268 |
| 25 | Belgium              |       2 | 0.00021645  |               9226 |             0.998485 |
| 26 | Bangladesh           |       2 | 0.00021645  |               9228 |             0.998701 |
| 27 | Asia/Pacific Region  |       2 | 0.00021645  |               9230 |             0.998918 |
| 28 | Vietnam              |       1 | 0.000108225 |               9231 |             0.999026 |
| 29 | Tanzania             |       1 | 0.000108225 |               9232 |             0.999134 |
| 30 | Switzerland          |       1 | 0.000108225 |               9233 |             0.999242 |
| 31 | Sri Lanka            |       1 | 0.000108225 |               9234 |             0.999351 |
| 32 | Russia               |       1 | 0.000108225 |               9235 |             0.999459 |
| 33 | Malaysia             |       1 | 0.000108225 |               9236 |             0.999567 |
| 34 | Liberia              |       1 | 0.000108225 |               9237 |             0.999675 |
| 35 | Kenya                |       1 | 0.000108225 |               9238 |             0.999784 |
| 36 | Indonesia            |       1 | 0.000108225 |               9239 |             0.999892 |
| 37 | Denmark              |       1 | 0.000108225 |               9240 |             1        |
+----+----------------------+---------+-------------+--------------------+----------------------+
+----+-------------------------+---------+-----------+--------------------+----------------------+
|    | Digital Advertisement   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------------+---------+-----------+--------------------+----------------------|
|  0 | No                      |    9236 | 0.999567  |               9236 |             0.999567 |
|  1 | Yes                     |       4 | 0.0004329 |               9240 |             1        |
+----+-------------------------+---------+-----------+--------------------+----------------------+
+----+---------------+---------+------------+--------------------+----------------------+
|    | Do Not Call   |   Count |    Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------+---------+------------+--------------------+----------------------|
|  0 | No            |    9238 | 0.999784   |               9238 |             0.999784 |
|  1 | Yes           |       2 | 0.00021645 |               9240 |             1        |
+----+---------------+---------+------------+--------------------+----------------------+
+----+----------------+---------+-----------+--------------------+----------------------+
|    | Do Not Email   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------+---------+-----------+--------------------+----------------------|
|  0 | No             |    8506 | 0.920563  |               8506 |             0.920563 |
|  1 | Yes            |     734 | 0.0794372 |               9240 |             1        |
+----+----------------+---------+-----------+--------------------+----------------------+
+----+-----------------------------+---------+-----------+--------------------+----------------------+
|    | Get updates on DM Content   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+-----------------------------+---------+-----------+--------------------+----------------------|
|  0 | No                          |    9240 |         1 |               9240 |                    1 |
+----+-----------------------------+---------+-----------+--------------------+----------------------+
+----+--------------------------------------------+---------+-----------+--------------------+----------------------+
|    | I agree to pay the amount through cheque   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+--------------------------------------------+---------+-----------+--------------------+----------------------|
|  0 | No                                         |    9240 |         1 |               9240 |                    1 |
+----+--------------------------------------------+---------+-----------+--------------------+----------------------+
+----+------------------------------+---------+-------------+--------------------+----------------------+
|    | Last Activity                |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+------------------------------+---------+-------------+--------------------+----------------------|
|  0 | Email Opened                 |    3437 | 0.376163    |               3437 |             0.376163 |
|  1 | SMS Sent                     |    2745 | 0.300427    |               6182 |             0.67659  |
|  2 | Olark Chat Conversation      |     973 | 0.10649     |               7155 |             0.78308  |
|  3 | Page Visited on Website      |     640 | 0.0700449   |               7795 |             0.853125 |
|  4 | Converted to Lead            |     428 | 0.0468425   |               8223 |             0.899967 |
|  5 | Email Bounced                |     326 | 0.0356791   |               8549 |             0.935646 |
|  6 | Email Link Clicked           |     267 | 0.0292218   |               8816 |             0.964868 |
|  7 | Form Submitted on Website    |     116 | 0.0126956   |               8932 |             0.977564 |
|  8 | Unreachable                  |      93 | 0.0101784   |               9025 |             0.987742 |
|  9 | Unsubscribed                 |      61 | 0.00667615  |               9086 |             0.994418 |
| 10 | Had a Phone Conversation     |      30 | 0.00328335  |               9116 |             0.997702 |
| 11 | Approached upfront           |       9 | 0.000985006 |               9125 |             0.998687 |
| 12 | View in browser link Clicked |       6 | 0.000656671 |               9131 |             0.999343 |
| 13 | Email Received               |       2 | 0.00021889  |               9133 |             0.999562 |
| 14 | Email Marked Spam            |       2 | 0.00021889  |               9135 |             0.999781 |
| 15 | Visited Booth in Tradeshow   |       1 | 0.000109445 |               9136 |             0.999891 |
| 16 | Resubscribed to emails       |       1 | 0.000109445 |               9137 |             1        |
+----+------------------------------+---------+-------------+--------------------+----------------------+
+----+------------------------------+---------+-------------+--------------------+----------------------+
|    | Last Notable Activity        |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+------------------------------+---------+-------------+--------------------+----------------------|
|  0 | Modified                     |    3407 | 0.368723    |               3407 |             0.368723 |
|  1 | Email Opened                 |    2827 | 0.305952    |               6234 |             0.674675 |
|  2 | SMS Sent                     |    2172 | 0.235065    |               8406 |             0.90974  |
|  3 | Page Visited on Website      |     318 | 0.0344156   |               8724 |             0.944156 |
|  4 | Olark Chat Conversation      |     183 | 0.0198052   |               8907 |             0.963961 |
|  5 | Email Link Clicked           |     173 | 0.0187229   |               9080 |             0.982684 |
|  6 | Email Bounced                |      60 | 0.00649351  |               9140 |             0.989177 |
|  7 | Unsubscribed                 |      47 | 0.00508658  |               9187 |             0.994264 |
|  8 | Unreachable                  |      32 | 0.0034632   |               9219 |             0.997727 |
|  9 | Had a Phone Conversation     |      14 | 0.00151515  |               9233 |             0.999242 |
| 10 | Email Marked Spam            |       2 | 0.00021645  |               9235 |             0.999459 |
| 11 | View in browser link Clicked |       1 | 0.000108225 |               9236 |             0.999567 |
| 12 | Resubscribed to emails       |       1 | 0.000108225 |               9237 |             0.999675 |
| 13 | Form Submitted on Website    |       1 | 0.000108225 |               9238 |             0.999784 |
| 14 | Email Received               |       1 | 0.000108225 |               9239 |             0.999892 |
| 15 | Approached upfront           |       1 | 0.000108225 |               9240 |             1        |
+----+------------------------------+---------+-------------+--------------------+----------------------+
+----+-------------------------+---------+-------------+--------------------+----------------------+
|    | Lead Origin             |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------------+---------+-------------+--------------------+----------------------|
|  0 | Landing Page Submission |    4886 | 0.528788    |               4886 |             0.528788 |
|  1 | API                     |    3580 | 0.387446    |               8466 |             0.916234 |
|  2 | Lead Add Form           |     718 | 0.0777056   |               9184 |             0.993939 |
|  3 | Lead Import             |      55 | 0.00595238  |               9239 |             0.999892 |
|  4 | Quick Add Form          |       1 | 0.000108225 |               9240 |             1        |
+----+-------------------------+---------+-------------+--------------------+----------------------+
+----+-------------------+---------+-------------+--------------------+----------------------+
|    | Lead Source       |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------+---------+-------------+--------------------+----------------------|
|  0 | Google            |    2873 | 0.312147    |               2873 |             0.312147 |
|  1 | Direct Traffic    |    2543 | 0.276293    |               5416 |             0.58844  |
|  2 | Olark Chat        |    1755 | 0.190678    |               7171 |             0.779118 |
|  3 | Organic Search    |    1154 | 0.12538     |               8325 |             0.904498 |
|  4 | Reference         |     534 | 0.0580183   |               8859 |             0.962516 |
|  5 | Welingak Website  |     142 | 0.0154281   |               9001 |             0.977944 |
|  6 | Referral Sites    |     125 | 0.0135811   |               9126 |             0.991525 |
|  7 | Facebook          |      55 | 0.00597566  |               9181 |             0.997501 |
|  8 | bing              |       6 | 0.00065189  |               9187 |             0.998153 |
|  9 | Click2call        |       4 | 0.000434594 |               9191 |             0.998588 |
| 10 | Social Media      |       2 | 0.000217297 |               9193 |             0.998805 |
| 11 | Press_Release     |       2 | 0.000217297 |               9195 |             0.999022 |
| 12 | Live Chat         |       2 | 0.000217297 |               9197 |             0.999239 |
| 13 | youtubechannel    |       1 | 0.000108648 |               9198 |             0.999348 |
| 14 | welearnblog_Home  |       1 | 0.000108648 |               9199 |             0.999457 |
| 15 | testone           |       1 | 0.000108648 |               9200 |             0.999565 |
| 16 | blog              |       1 | 0.000108648 |               9201 |             0.999674 |
| 17 | WeLearn           |       1 | 0.000108648 |               9202 |             0.999783 |
| 18 | Pay per Click Ads |       1 | 0.000108648 |               9203 |             0.999891 |
| 19 | NC_EDM            |       1 | 0.000108648 |               9204 |             1        |
+----+-------------------+---------+-------------+--------------------+----------------------+
+----+------------+---------+-----------+--------------------+----------------------+
|    | Magazine   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+------------+---------+-----------+--------------------+----------------------|
|  0 | No         |    9240 |         1 |               9240 |                    1 |
+----+------------+---------+-----------+--------------------+----------------------+
+----+-------------+---------+-------------+--------------------+----------------------+
|    | Newspaper   |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------+---------+-------------+--------------------+----------------------|
|  0 | No          |    9239 | 0.999892    |               9239 |             0.999892 |
|  1 | Yes         |       1 | 0.000108225 |               9240 |             1        |
+----+-------------+---------+-------------+--------------------+----------------------+
+----+---------------------+---------+------------+--------------------+----------------------+
|    | Newspaper Article   |   Count |    Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------------+---------+------------+--------------------+----------------------|
|  0 | No                  |    9238 | 0.999784   |               9238 |             0.999784 |
|  1 | Yes                 |       2 | 0.00021645 |               9240 |             1        |
+----+---------------------+---------+------------+--------------------+----------------------+
+----+------------------------------------------+---------+-----------+--------------------+----------------------+
|    | Receive More Updates About Our Courses   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+------------------------------------------+---------+-----------+--------------------+----------------------|
|  0 | No                                       |    9240 |         1 |               9240 |                    1 |
+----+------------------------------------------+---------+-----------+--------------------+----------------------+
+----+----------+---------+------------+--------------------+----------------------+
|    | Search   |   Count |    Percent |   Cumulative Count |   Cumulative Percent |
|----+----------+---------+------------+--------------------+----------------------|
|  0 | No       |    9226 | 0.998485   |               9226 |             0.998485 |
|  1 | Yes      |      14 | 0.00151515 |               9240 |             1        |
+----+----------+---------+------------+--------------------+----------------------+
+----+-----------------------------------+---------+------------+--------------------+----------------------+
|    | Specialization                    |   Count |    Percent |   Cumulative Count |   Cumulative Percent |
|----+-----------------------------------+---------+------------+--------------------+----------------------|
|  0 | No Specialization                 |    3380 | 0.365801   |               3380 |             0.365801 |
|  1 | Finance Management                |     976 | 0.105628   |               4356 |             0.471429 |
|  2 | Human Resource Management         |     848 | 0.0917749  |               5204 |             0.563203 |
|  3 | Marketing Management              |     838 | 0.0906926  |               6042 |             0.653896 |
|  4 | Operations Management             |     503 | 0.0544372  |               6545 |             0.708333 |
|  5 | Business Administration           |     403 | 0.0436147  |               6948 |             0.751948 |
|  6 | IT Projects Management            |     366 | 0.0396104  |               7314 |             0.791558 |
|  7 | Supply Chain Management           |     349 | 0.0377706  |               7663 |             0.829329 |
|  8 | Banking, Investment And Insurance |     338 | 0.0365801  |               8001 |             0.865909 |
|  9 | Travel and Tourism                |     203 | 0.0219697  |               8204 |             0.887879 |
| 10 | Media and Advertising             |     203 | 0.0219697  |               8407 |             0.909848 |
| 11 | International Business            |     178 | 0.0192641  |               8585 |             0.929113 |
| 12 | Healthcare Management             |     159 | 0.0172078  |               8744 |             0.94632  |
| 13 | Hospitality Management            |     114 | 0.0123377  |               8858 |             0.958658 |
| 14 | E-COMMERCE                        |     112 | 0.0121212  |               8970 |             0.970779 |
| 15 | Retail Management                 |     100 | 0.0108225  |               9070 |             0.981602 |
| 16 | Rural and Agribusiness            |      73 | 0.00790043 |               9143 |             0.989502 |
| 17 | E-Business                        |      57 | 0.00616883 |               9200 |             0.995671 |
| 18 | Services Excellence               |      40 | 0.004329   |               9240 |             1        |
+----+-----------------------------------+---------+------------+--------------------+----------------------+
+----+---------------------------------------------------+---------+-------------+--------------------+----------------------+
|    | Tags                                              |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------------------------------------------+---------+-------------+--------------------+----------------------|
|  0 | Will revert after reading the email               |    2072 | 0.351962    |               2072 |             0.351962 |
|  1 | Ringing                                           |    1203 | 0.204349    |               3275 |             0.556311 |
|  2 | Interested in other courses                       |     513 | 0.0871412   |               3788 |             0.643452 |
|  3 | Already a student                                 |     465 | 0.0789876   |               4253 |             0.722439 |
|  4 | Closed by Horizzon                                |     358 | 0.060812    |               4611 |             0.783251 |
|  5 | switched off                                      |     240 | 0.0407678   |               4851 |             0.824019 |
|  6 | Busy                                              |     186 | 0.031595    |               5037 |             0.855614 |
|  7 | Lost to EINS                                      |     175 | 0.0297265   |               5212 |             0.885341 |
|  8 | Not doing further education                       |     145 | 0.0246305   |               5357 |             0.909971 |
|  9 | Interested  in full time MBA                      |     117 | 0.0198743   |               5474 |             0.929845 |
| 10 | Graduation in progress                            |     111 | 0.0188551   |               5585 |             0.948701 |
| 11 | invalid number                                    |      83 | 0.0140989   |               5668 |             0.962799 |
| 12 | Diploma holder (Not Eligible)                     |      63 | 0.0107015   |               5731 |             0.973501 |
| 13 | wrong number given                                |      47 | 0.00798369  |               5778 |             0.981485 |
| 14 | opp hangup                                        |      33 | 0.00560557  |               5811 |             0.98709  |
| 15 | number not provided                               |      27 | 0.00458638  |               5838 |             0.991677 |
| 16 | in touch with EINS                                |      12 | 0.00203839  |               5850 |             0.993715 |
| 17 | Lost to Others                                    |       7 | 0.00118906  |               5857 |             0.994904 |
| 18 | Want to take admission but has financial problems |       6 | 0.00101919  |               5863 |             0.995923 |
| 19 | Still Thinking                                    |       6 | 0.00101919  |               5869 |             0.996942 |
| 20 | Interested in Next batch                          |       5 | 0.000849329 |               5874 |             0.997792 |
| 21 | In confusion whether part time or DLP             |       5 | 0.000849329 |               5879 |             0.998641 |
| 22 | Lateral student                                   |       3 | 0.000509597 |               5882 |             0.999151 |
| 23 | University not recognized                         |       2 | 0.000339732 |               5884 |             0.99949  |
| 24 | Shall take in the next coming month               |       2 | 0.000339732 |               5886 |             0.99983  |
| 25 | Recognition issue (DEC approval)                  |       1 | 0.000169866 |               5887 |             1        |
+----+---------------------------------------------------+---------+-------------+--------------------+----------------------+
+----+---------------------------+---------+-------------+--------------------+----------------------+
|    | Through Recommendations   |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------------------+---------+-------------+--------------------+----------------------|
|  0 | No                        |    9233 | 0.999242    |               9233 |             0.999242 |
|  1 | Yes                       |       7 | 0.000757576 |               9240 |             1        |
+----+---------------------------+---------+-------------+--------------------+----------------------+
+----+-------------------------------------+---------+-----------+--------------------+----------------------+
|    | Update me on Supply Chain Content   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------------------------+---------+-----------+--------------------+----------------------|
|  0 | No                                  |    9240 |         1 |               9240 |                    1 |
+----+-------------------------------------+---------+-----------+--------------------+----------------------+
+----+-----------------------------------+---------+-------------+--------------------+----------------------+
|    | What is your current occupation   |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+-----------------------------------+---------+-------------+--------------------+----------------------|
|  0 | Unemployed                        |    5600 | 0.606061    |               5600 |             0.606061 |
|  1 | Unknown Occupation                |    2690 | 0.291126    |               8290 |             0.897186 |
|  2 | Working Professional              |     706 | 0.0764069   |               8996 |             0.973593 |
|  3 | Student                           |     210 | 0.0227273   |               9206 |             0.99632  |
|  4 | Other                             |      16 | 0.0017316   |               9222 |             0.998052 |
|  5 | Housewife                         |      10 | 0.00108225  |               9232 |             0.999134 |
|  6 | Businessman                       |       8 | 0.000865801 |               9240 |             1        |
+----+-----------------------------------+---------+-------------+--------------------+----------------------+
+----+-------------------------------------------------+---------+-------------+--------------------+----------------------+
|    | What matters most to you in choosing a course   |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------------------------------------+---------+-------------+--------------------+----------------------|
|  0 | Better Career Prospects                         |    6528 | 0.706494    |               6528 |             0.706494 |
|  1 | Unknown Target                                  |    2709 | 0.293182    |               9237 |             0.999675 |
|  2 | Flexibility & Convenience                       |       2 | 0.00021645  |               9239 |             0.999892 |
|  3 | Other                                           |       1 | 0.000108225 |               9240 |             1        |
+----+-------------------------------------------------+---------+-------------+--------------------+----------------------+
+----+----------------------+---------+-------------+--------------------+----------------------+
|    | X Education Forums   |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------------+---------+-------------+--------------------+----------------------|
|  0 | No                   |    9239 | 0.999892    |               9239 |             0.999892 |
|  1 | Yes                  |       1 | 0.000108225 |               9240 |             1        |
+----+----------------------+---------+-------------+--------------------+----------------------+
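  • Let's look for columns in which more than 99% of leads share the same level.
  • Such variables explain almost none of the variability in the data.
  • Hence, these columns are unnecessary for the analysis and can be dropped.
  • 'What matters most to you in choosing a course' is dropped as well: apart from 'Better Career Prospects' (70.6%) and 'Unknown Target' (29.3%), its levels cover only 3 leads.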
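The dominant-level check described above can be automated. The following is a minimal sketch, not the notebook's own method: the helper name `near_constant_columns` and the toy frame are assumptions made for illustration. It flags columns whose most frequent level covers more than a given share of non-missing rows.

```python
import pandas as pd


def near_constant_columns(df, threshold=0.99):
    """Return the dominant-level share of every column where it exceeds `threshold`.

    Hypothetical helper for illustration; missing values are ignored
    because value_counts() drops them by default.
    """
    flagged = {}
    for col in df.columns:
        shares = df[col].value_counts(normalize=True)
        if not shares.empty and shares.max() > threshold:
            flagged[col] = shares.max()
    return pd.Series(flagged, dtype=float)


# Toy frame (made-up data, not the leads dataset)
toy = pd.DataFrame({
    "Magazine": ["No"] * 1000,
    "Search": ["No"] * 998 + ["Yes"] * 2,
    "Lead Origin": ["API"] * 400 + ["Landing Page Submission"] * 600,
})

result = near_constant_columns(toy, threshold=0.99)
print(result)
```

Columns returned by such a helper are candidates for dropping; whether to drop them remains a judgement call, as with the occupation-related column above.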
In [183]:
# Dropping columns dominated by a single label, since these do not explain any variability in the dataset

invariableCol = ['Digital Advertisement','Do Not Call','Get updates on DM Content','Magazine','Newspaper','Newspaper Article','Receive More Updates About Our Courses','Search',
'Update me on Supply Chain Content','Through Recommendations',
'I agree to pay the amount through cheque',"What matters most to you in choosing a course",'X Education Forums']
leads.drop(columns=invariableCol, inplace=True)
In [184]:
# Tags feature
tab(leads.stb.freq(['Tags']))
+----+---------------------------------------------------+---------+-------------+--------------------+----------------------+
|    | Tags                                              |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------------------------------------------+---------+-------------+--------------------+----------------------|
|  0 | Will revert after reading the email               |    2072 | 0.351962    |               2072 |             0.351962 |
|  1 | Ringing                                           |    1203 | 0.204349    |               3275 |             0.556311 |
|  2 | Interested in other courses                       |     513 | 0.0871412   |               3788 |             0.643452 |
|  3 | Already a student                                 |     465 | 0.0789876   |               4253 |             0.722439 |
|  4 | Closed by Horizzon                                |     358 | 0.060812    |               4611 |             0.783251 |
|  5 | switched off                                      |     240 | 0.0407678   |               4851 |             0.824019 |
|  6 | Busy                                              |     186 | 0.031595    |               5037 |             0.855614 |
|  7 | Lost to EINS                                      |     175 | 0.0297265   |               5212 |             0.885341 |
|  8 | Not doing further education                       |     145 | 0.0246305   |               5357 |             0.909971 |
|  9 | Interested  in full time MBA                      |     117 | 0.0198743   |               5474 |             0.929845 |
| 10 | Graduation in progress                            |     111 | 0.0188551   |               5585 |             0.948701 |
| 11 | invalid number                                    |      83 | 0.0140989   |               5668 |             0.962799 |
| 12 | Diploma holder (Not Eligible)                     |      63 | 0.0107015   |               5731 |             0.973501 |
| 13 | wrong number given                                |      47 | 0.00798369  |               5778 |             0.981485 |
| 14 | opp hangup                                        |      33 | 0.00560557  |               5811 |             0.98709  |
| 15 | number not provided                               |      27 | 0.00458638  |               5838 |             0.991677 |
| 16 | in touch with EINS                                |      12 | 0.00203839  |               5850 |             0.993715 |
| 17 | Lost to Others                                    |       7 | 0.00118906  |               5857 |             0.994904 |
| 18 | Want to take admission but has financial problems |       6 | 0.00101919  |               5863 |             0.995923 |
| 19 | Still Thinking                                    |       6 | 0.00101919  |               5869 |             0.996942 |
| 20 | Interested in Next batch                          |       5 | 0.000849329 |               5874 |             0.997792 |
| 21 | In confusion whether part time or DLP             |       5 | 0.000849329 |               5879 |             0.998641 |
| 22 | Lateral student                                   |       3 | 0.000509597 |               5882 |             0.999151 |
| 23 | University not recognized                         |       2 | 0.000339732 |               5884 |             0.99949  |
| 24 | Shall take in the next coming month               |       2 | 0.000339732 |               5886 |             0.99983  |
| 25 | Recognition issue (DEC approval)                  |       1 | 0.000169866 |               5887 |             1        |
+----+---------------------------------------------------+---------+-------------+--------------------+----------------------+
  • The Tags column holds remarks by the sales team. This is a subjective variable based on the team's judgement and cannot be used for analysis, since the labels might change or might not always be available.
  • Also, it has a lot of levels that don't seem to be mutually exclusive classes.
  • Let's drop this feature for this analysis.
In [185]:
# dropping Tags feature 
leads.drop(columns=['Tags'], inplace=True)
  • Last Notable Activity and Last Activity seem to have similar levels.
  • Let's look at the possibility of dropping one of them.
In [186]:
# Last Notable Activity vs Last Activity
leads_copy = leads.copy()
leads_copy['Converted'] = leads_copy['Converted'].astype('int')
tab(leads_copy.stb.freq(['Last Notable Activity'],value='Converted'))
tab(leads_copy.stb.freq(['Last Activity'],value='Converted'))
+----+--------------------------+-------------+------------+------------------------+----------------------+
|    | Last Notable Activity    |   Converted |    Percent |   Cumulative Converted |   Cumulative Percent |
|----+--------------------------+-------------+------------+------------------------+----------------------|
|  0 | SMS Sent                 |        1508 | 0.423477   |                   1508 |             0.423477 |
|  1 | Email Opened             |        1044 | 0.293176   |                   2552 |             0.716653 |
|  2 | Modified                 |         783 | 0.219882   |                   3335 |             0.936535 |
|  3 | Page Visited on Website  |          93 | 0.0261163  |                   3428 |             0.962651 |
|  4 | Email Link Clicked       |          45 | 0.0126369  |                   3473 |             0.975288 |
|  5 | Olark Chat Conversation  |          25 | 0.0070205  |                   3498 |             0.982308 |
|  6 | Unreachable              |          22 | 0.00617804 |                   3520 |             0.988486 |
|  7 | Unsubscribed             |          14 | 0.00393148 |                   3534 |             0.992418 |
|  8 | Had a Phone Conversation |          13 | 0.00365066 |                   3547 |             0.996069 |
|  9 | Email Bounced            |           9 | 0.00252738 |                   3556 |             0.998596 |
| 10 | Email Marked Spam        |           2 | 0.00056164 |                   3558 |             0.999158 |
| 11 | Resubscribed to emails   |           1 | 0.00028082 |                   3559 |             0.999438 |
| 12 | Email Received           |           1 | 0.00028082 |                   3560 |             0.999719 |
| 13 | Approached upfront       |           1 | 0.00028082 |                   3561 |             1        |
+----+--------------------------+-------------+------------+------------------------+----------------------+
+----+------------------------------+-------------+-------------+------------------------+----------------------+
|    | Last Activity                |   Converted |     Percent |   Cumulative Converted |   Cumulative Percent |
|----+------------------------------+-------------+-------------+------------------------+----------------------|
|  0 | SMS Sent                     |        1727 | 0.496264    |                   1727 |             0.496264 |
|  1 | Email Opened                 |        1253 | 0.360057    |                   2980 |             0.856322 |
|  2 | Page Visited on Website      |         151 | 0.0433908   |                   3131 |             0.899713 |
|  3 | Olark Chat Conversation      |          84 | 0.0241379   |                   3215 |             0.923851 |
|  4 | Email Link Clicked           |          73 | 0.020977    |                   3288 |             0.944828 |
|  5 | Converted to Lead            |          54 | 0.0155172   |                   3342 |             0.960345 |
|  6 | Unreachable                  |          31 | 0.00890805  |                   3373 |             0.969253 |
|  7 | Form Submitted on Website    |          28 | 0.00804598  |                   3401 |             0.977299 |
|  8 | Email Bounced                |          26 | 0.00747126  |                   3427 |             0.98477  |
|  9 | Had a Phone Conversation     |          22 | 0.00632184  |                   3449 |             0.991092 |
| 10 | Unsubscribed                 |          16 | 0.0045977   |                   3465 |             0.99569  |
| 11 | Approached upfront           |           9 | 0.00258621  |                   3474 |             0.998276 |
| 12 | Email Received               |           2 | 0.000574713 |                   3476 |             0.998851 |
| 13 | Email Marked Spam            |           2 | 0.000574713 |                   3478 |             0.999425 |
| 14 | View in browser link Clicked |           1 | 0.000287356 |                   3479 |             0.999713 |
| 15 | Resubscribed to emails       |           1 | 0.000287356 |                   3480 |             1        |
+----+------------------------------+-------------+-------------+------------------------+----------------------+
  • Last Activity has more levels than Last Notable Activity.
  • Last Notable Activity seems like a column derived by the sales team from Last Activity.
  • Since this insight might not be available for a new lead, let's drop Last Notable Activity.
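One way to quantify how closely two categorical columns track each other is to measure row-wise agreement and cross-tabulate their levels. The sketch below runs on a small made-up frame (not the leads data), and the variable names are illustrative assumptions rather than part of the original notebook.

```python
import pandas as pd

# Toy frame (made-up data, not the leads dataset)
toy = pd.DataFrame({
    "Last Activity": ["SMS Sent", "Email Opened", "Email Opened",
                      "Converted to Lead", "SMS Sent"],
    "Last Notable Activity": ["SMS Sent", "Email Opened", "Modified",
                              "Modified", "SMS Sent"],
})

# Share of rows where both columns carry the same label
agreement = (toy["Last Activity"] == toy["Last Notable Activity"]).mean()

# Cross-tabulation: how the levels of one column map onto the other
overlap = pd.crosstab(toy["Last Activity"], toy["Last Notable Activity"])

# Levels that appear only in Last Activity
only_in_last_activity = set(toy["Last Activity"]) - set(toy["Last Notable Activity"])

print("Agreement:", agreement)
print(overlap)
print("Only in Last Activity:", only_in_last_activity)
```

A high agreement rate would support keeping only one of the two columns; a low one would suggest they carry different information.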
In [187]:
leads.drop(columns = ['Last Notable Activity'], inplace=True)
In [188]:
# Looking at Missing Values again 
leads.stb.missing()
Out[188]:
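+----------------------------------------+-----------+---------+-----------+
|                                        |   Missing |   Total |   Percent |
|----------------------------------------+-----------+---------+-----------|
| TotalVisits                            |       137 |    9240 |  0.014827 |
| Page Views Per Visit                   |       137 |    9240 |  0.014827 |
| Last Activity                          |       103 |    9240 |  0.011147 |
| City                                   |        60 |    9240 |  0.006494 |
| Lead Source                            |        36 |    9240 |  0.003896 |
| Lead Origin                            |         0 |    9240 |  0.000000 |
| Do Not Email                           |         0 |    9240 |  0.000000 |
| Converted                              |         0 |    9240 |  0.000000 |
| Total Time Spent on Website            |         0 |    9240 |  0.000000 |
| Country                                |         0 |    9240 |  0.000000 |
| Specialization                         |         0 |    9240 |  0.000000 |
| What is your current occupation        |         0 |    9240 |  0.000000 |
| A free copy of Mastering The Interview |         0 |    9240 |  0.000000 |
+----------------------------------------+-----------+---------+-----------+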
  • From the above, each column has less than 2% missing values. We assume these are missing completely at random, so the affected rows can be dropped without materially affecting the analysis.
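Per-column missing percentages can add up across rows, so it can help to measure the share of rows that contain at least one missing value before dropping them. The following is a minimal sketch on a made-up frame; the data and variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up data, not the leads dataset)
toy = pd.DataFrame({
    "TotalVisits": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "City": ["Mumbai", "Pune", None, "Mumbai", "Pune",
             "Mumbai", "Pune", "Mumbai", "Pune", "Mumbai"],
})

# Share of missing values per column
missing_share = toy.isna().mean()

# Share of rows with at least one missing value (what dropna() would remove)
rows_lost_share = toy.isna().any(axis=1).mean()

cleaned = toy.dropna()

print(missing_share)
print("Share of rows dropped:", rows_lost_share)
print("Rows kept:", len(cleaned))
```

In this notebook the row count goes from 9240 to 9014 after dropping, so 226 rows (about 2.4%) are removed.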
In [189]:
leads.dropna(inplace=True)
leads.stb.missing()
Out[189]:
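+----------------------------------------+-----------+---------+-----------+
|                                        |   Missing |   Total |   Percent |
|----------------------------------------+-----------+---------+-----------|
| Lead Origin                            |         0 |    9014 |       0.0 |
| Lead Source                            |         0 |    9014 |       0.0 |
| Do Not Email                           |         0 |    9014 |       0.0 |
| Converted                              |         0 |    9014 |       0.0 |
| TotalVisits                            |         0 |    9014 |       0.0 |
| Total Time Spent on Website            |         0 |    9014 |       0.0 |
| Page Views Per Visit                   |         0 |    9014 |       0.0 |
| Last Activity                          |         0 |    9014 |       0.0 |
| Country                                |         0 |    9014 |       0.0 |
| Specialization                         |         0 |    9014 |       0.0 |
| What is your current occupation        |         0 |    9014 |       0.0 |
| City                                   |         0 |    9014 |       0.0 |
| A free copy of Mastering The Interview |         0 |    9014 |       0.0 |
+----------------------------------------+-----------+---------+-----------+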

Grouping Labels with fewer leads

Country
In [190]:
# Country distribution
leads.stb.freq(['Country'])
Out[190]:
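+----+----------------------+---------+-----------+--------------------+----------------------+
|    | Country              |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------------+---------+-----------+--------------------+----------------------|
|  0 | India                |    8787 |  0.974817 |               8787 |             0.974817 |
|  1 | United States        |      51 |  0.005658 |               8838 |             0.980475 |
|  2 | United Arab Emirates |      44 |  0.004881 |               8882 |             0.985356 |
|  3 | Saudi Arabia         |      21 |  0.002330 |               8903 |             0.987686 |
|  4 | Singapore            |      17 |  0.001886 |               8920 |             0.989572 |
|  5 | United Kingdom       |      12 |  0.001331 |               8932 |             0.990903 |
|  6 | Australia            |      11 |  0.001220 |               8943 |             0.992123 |
|  7 | Qatar                |       8 |  0.000888 |               8951 |             0.993011 |
|  8 | Bahrain              |       7 |  0.000777 |               8958 |             0.993787 |
|  9 | Hong Kong            |       6 |  0.000666 |               8964 |             0.994453 |
| 10 | France               |       6 |  0.000666 |               8970 |             0.995119 |
| 11 | Oman                 |       5 |  0.000555 |               8975 |             0.995673 |
| 12 | Nigeria              |       4 |  0.000444 |               8979 |             0.996117 |
| 13 | Kuwait               |       4 |  0.000444 |               8983 |             0.996561 |
| 14 | Germany              |       4 |  0.000444 |               8987 |             0.997005 |
| 15 | South Africa         |       3 |  0.000333 |               8990 |             0.997337 |
| 16 | Canada               |       3 |  0.000333 |               8993 |             0.997670 |
| 17 | Philippines          |       2 |  0.000222 |               8995 |             0.997892 |
| 18 | Netherlands          |       2 |  0.000222 |               8997 |             0.998114 |
| 19 | Belgium              |       2 |  0.000222 |               8999 |             0.998336 |
| 20 | Bangladesh           |       2 |  0.000222 |               9001 |             0.998558 |
| 21 | Vietnam              |       1 |  0.000111 |               9002 |             0.998669 |
| 22 | Uganda               |       1 |  0.000111 |               9003 |             0.998780 |
| 23 | Tanzania             |       1 |  0.000111 |               9004 |             0.998891 |
| 24 | Switzerland          |       1 |  0.000111 |               9005 |             0.999002 |
| 25 | Sweden               |       1 |  0.000111 |               9006 |             0.999112 |
| 26 | Malaysia             |       1 |  0.000111 |               9007 |             0.999223 |
| 27 | Liberia              |       1 |  0.000111 |               9008 |             0.999334 |
| 28 | Kenya                |       1 |  0.000111 |               9009 |             0.999445 |
| 29 | Italy                |       1 |  0.000111 |               9010 |             0.999556 |
| 30 | Indonesia            |       1 |  0.000111 |               9011 |             0.999667 |
| 31 | Ghana                |       1 |  0.000111 |               9012 |             0.999778 |
| 32 | Denmark              |       1 |  0.000111 |               9013 |             0.999889 |
| 33 | China                |       1 |  0.000111 |               9014 |             1.000000 |
+----+----------------------+---------+-----------+--------------------+----------------------+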
  • We see that leads from India make up about 97.5% of all leads. The other countries collectively make up the remaining 2.5%, and no single one of them contributes more than 1%.
  • To reduce the number of levels, let us group the minority labels into a new label called 'Outside India'.
In [191]:
# Grouping Countries with very low lead count into 'Outside India' 

leadsByCountry = leads['Country'].value_counts(normalize=True)
lowLeadCountries = leadsByCountry[leadsByCountry <= 0.01].index

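# Assign the result back instead of using inplace=True on a column selection,
# which may not update `leads` in recent pandas versions
leads['Country'] = leads['Country'].replace(list(lowLeadCountries), 'Outside India')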
leads.stb.freq(['Country'])
Out[191]:
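+----+---------------+---------+-----------+--------------------+----------------------+
|    | Country       |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------+---------+-----------+--------------------+----------------------|
|  0 | India         |    8787 |  0.974817 |               8787 |             0.974817 |
|  1 | Outside India |     227 |  0.025183 |               9014 |             1.000000 |
+----+---------------+---------+-----------+--------------------+----------------------+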
Lead Origin
In [192]:
feature = 'Lead Origin'
leads.stb.freq([feature])
Out[192]:
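+----+-------------------------+---------+-----------+--------------------+----------------------+
|    | Lead Origin             |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------------+---------+-----------+--------------------+----------------------|
|  0 | Landing Page Submission |    4870 |  0.540271 |               4870 |             0.540271 |
|  1 | API                     |    3533 |  0.391946 |               8403 |             0.932217 |
|  2 | Lead Add Form           |     581 |  0.064455 |               8984 |             0.996672 |
|  3 | Lead Import             |      30 |  0.003328 |               9014 |             1.000000 |
+----+-------------------------+---------+-----------+--------------------+----------------------+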
  • We see that Lead Add Form (about 6.4%) and Lead Import (about 0.3%) are the smallest lead origins; together they account for less than 7% of leads.
  • Let's group them into a level called 'Other Lead Origins'.
In [193]:
# Grouping lead origins 
leadOriginsToGroup = ["Lead Add Form","Lead Import"]
leads[feature] = leads[feature].replace(leadOriginsToGroup, ['Other Lead Origins']*2)
leads.stb.freq([feature])
Out[193]:
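+----+-------------------------+---------+-----------+--------------------+----------------------+
|    | Lead Origin             |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------------+---------+-----------+--------------------+----------------------|
|  0 | Landing Page Submission |    4870 |  0.540271 |               4870 |             0.540271 |
|  1 | API                     |    3533 |  0.391946 |               8403 |             0.932217 |
|  2 | Other Lead Origins      |     611 |  0.067783 |               9014 |             1.000000 |
+----+-------------------------+---------+-----------+--------------------+----------------------+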
Lead Source
In [194]:
feature = 'Lead Source'
leads.stb.freq([feature])
Out[194]:
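+----+-------------------+---------+-----------+--------------------+----------------------+
|    | Lead Source       |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------+---------+-----------+--------------------+----------------------|
|  0 | Google            |    2857 |  0.316951 |               2857 |             0.316951 |
|  1 | Direct Traffic    |    2528 |  0.280453 |               5385 |             0.597404 |
|  2 | Olark Chat        |    1753 |  0.194475 |               7138 |             0.791879 |
|  3 | Organic Search    |    1131 |  0.125471 |               8269 |             0.917351 |
|  4 | Reference         |     443 |  0.049146 |               8712 |             0.966497 |
|  5 | Welingak Website  |     129 |  0.014311 |               8841 |             0.980808 |
|  6 | Referral Sites    |     120 |  0.013313 |               8961 |             0.994120 |
|  7 | Facebook          |      31 |  0.003439 |               8992 |             0.997559 |
|  8 | bing              |       6 |  0.000666 |               8998 |             0.998225 |
|  9 | Click2call        |       4 |  0.000444 |               9002 |             0.998669 |
| 10 | Social Media      |       2 |  0.000222 |               9004 |             0.998891 |
| 11 | Press_Release     |       2 |  0.000222 |               9006 |             0.999112 |
| 12 | Live Chat         |       2 |  0.000222 |               9008 |             0.999334 |
| 13 | welearnblog_Home  |       1 |  0.000111 |               9009 |             0.999445 |
| 14 | testone           |       1 |  0.000111 |               9010 |             0.999556 |
| 15 | blog              |       1 |  0.000111 |               9011 |             0.999667 |
| 16 | WeLearn           |       1 |  0.000111 |               9012 |             0.999778 |
| 17 | Pay per Click Ads |       1 |  0.000111 |               9013 |             0.999889 |
| 18 | NC_EDM            |       1 |  0.000111 |               9014 |             1.000000 |
+----+-------------------+---------+-----------+--------------------+----------------------+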
  • Lead sources #7 to #18 each contribute less than 1% of leads (about 0.6% together).
  • Let's group these into a new label called 'Other Lead Sources'.
In [195]:
# Grouping lead Sources
labelCounts = leads[feature].value_counts(normalize=True)

# labels with less than 1% contribution
labelsToGroup = labelCounts[labelCounts < 0.01].index.values

leads[feature] = leads[feature].replace(labelsToGroup, ['Other '+feature+'s']*len(labelsToGroup))

leads.stb.freq([feature])
Out[195]:
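+----+--------------------+---------+-----------+--------------------+----------------------+
|    | Lead Source        |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+--------------------+---------+-----------+--------------------+----------------------|
|  0 | Google             |    2857 |  0.316951 |               2857 |             0.316951 |
|  1 | Direct Traffic     |    2528 |  0.280453 |               5385 |             0.597404 |
|  2 | Olark Chat         |    1753 |  0.194475 |               7138 |             0.791879 |
|  3 | Organic Search     |    1131 |  0.125471 |               8269 |             0.917351 |
|  4 | Reference          |     443 |  0.049146 |               8712 |             0.966497 |
|  5 | Welingak Website   |     129 |  0.014311 |               8841 |             0.980808 |
|  6 | Referral Sites     |     120 |  0.013313 |               8961 |             0.994120 |
|  7 | Other Lead Sources |      53 |  0.005880 |               9014 |             1.000000 |
+----+--------------------+---------+-----------+--------------------+----------------------+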
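The grouping pattern applied to Lead Source (find the labels below a share threshold, then replace them with a single label) can be wrapped in a small helper. This is only a sketch: `group_rare_levels` and the toy series are hypothetical and not part of the original notebook.

```python
import pandas as pd


def group_rare_levels(series, threshold=0.01, other_label="Other"):
    """Replace every level whose share is below `threshold` with `other_label`.

    Hypothetical helper for illustration. Shares are computed on
    non-missing values; missing values are left untouched.
    """
    shares = series.value_counts(normalize=True)
    rare = shares[shares < threshold].index
    # Keep values that are not rare, replace the rest with the catch-all label
    return series.where(~series.isin(rare), other_label)


# Toy series (made-up data, not the leads dataset): 200 rows in total
toy = pd.Series(["Google"] * 150 + ["Direct Traffic"] * 48 + ["bing"] + ["blog"])

grouped = group_rare_levels(toy, threshold=0.01, other_label="Other Lead Sources")
print(grouped.value_counts())
```

Passing the threshold explicitly makes the cut-off visible at each call site, which helps keep the comments, the prose and the code in agreement.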
Last Activity
In [196]:
feature = 'Last Activity'
leads.stb.freq([feature])
Out[196]:
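+----+------------------------------+---------+-----------+--------------------+----------------------+
|    | Last Activity                |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+------------------------------+---------+-----------+--------------------+----------------------|
|  0 | Email Opened                 |    3417 |  0.379077 |               3417 |             0.379077 |
|  1 | SMS Sent                     |    2701 |  0.299645 |               6118 |             0.678722 |
|  2 | Olark Chat Conversation      |     963 |  0.106834 |               7081 |             0.785556 |
|  3 | Page Visited on Website      |     637 |  0.070668 |               7718 |             0.856224 |
|  4 | Converted to Lead            |     422 |  0.046816 |               8140 |             0.903040 |
|  5 | Email Bounced                |     304 |  0.033725 |               8444 |             0.936765 |
|  6 | Email Link Clicked           |     266 |  0.029510 |               8710 |             0.966275 |
|  7 | Form Submitted on Website    |     115 |  0.012758 |               8825 |             0.979033 |
|  8 | Unreachable                  |      89 |  0.009874 |               8914 |             0.988906 |
|  9 | Unsubscribed                 |      58 |  0.006434 |               8972 |             0.995341 |
| 10 | Had a Phone Conversation     |      25 |  0.002773 |               8997 |             0.998114 |
| 11 | View in browser link Clicked |       6 |  0.000666 |               9003 |             0.998780 |
| 12 | Approached upfront           |       5 |  0.000555 |               9008 |             0.999334 |
| 13 | Email Received               |       2 |  0.000222 |               9010 |             0.999556 |
| 14 | Email Marked Spam            |       2 |  0.000222 |               9012 |             0.999778 |
| 15 | Visited Booth in Tradeshow   |       1 |  0.000111 |               9013 |             0.999889 |
| 16 | Resubscribed to emails       |       1 |  0.000111 |               9014 |             1.000000 |
+----+------------------------------+---------+-----------+--------------------+----------------------+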
  • Labels #8 to #16 each contribute less than 1% of leads (about 2.1% of leads together).
  • Let's group these into a new label called 'Other Last Activity'.
In [197]:
# Grouping Last Activity
labelCounts = leads[feature].value_counts(normalize=True)

# labels with less than 1% contribution
labelsToGroup = labelCounts[labelCounts < 0.01].index.values

leads[feature] = leads[feature].replace(labelsToGroup, ['Other '+feature]*len(labelsToGroup))

leads.stb.freq([feature])
Out[197]:
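+----+---------------------------+---------+-----------+--------------------+----------------------+
|    | Last Activity             |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------------------+---------+-----------+--------------------+----------------------|
|  0 | Email Opened              |    3417 |  0.379077 |               3417 |             0.379077 |
|  1 | SMS Sent                  |    2701 |  0.299645 |               6118 |             0.678722 |
|  2 | Olark Chat Conversation   |     963 |  0.106834 |               7081 |             0.785556 |
|  3 | Page Visited on Website   |     637 |  0.070668 |               7718 |             0.856224 |
|  4 | Converted to Lead         |     422 |  0.046816 |               8140 |             0.903040 |
|  5 | Email Bounced             |     304 |  0.033725 |               8444 |             0.936765 |
|  6 | Email Link Clicked        |     266 |  0.029510 |               8710 |             0.966275 |
|  7 | Other Last Activity       |     189 |  0.020967 |               8899 |             0.987242 |
|  8 | Form Submitted on Website |     115 |  0.012758 |               9014 |             1.000000 |
+----+---------------------------+---------+-----------+--------------------+----------------------+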
Specialization
In [198]:
feature = 'Specialization'
leads.stb.freq([feature])
Out[198]:
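+----+-----------------------------------+---------+-----------+--------------------+----------------------+
|    | Specialization                    |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+-----------------------------------+---------+-----------+--------------------+----------------------|
|  0 | No Specialization                 |    3230 |  0.358331 |               3230 |             0.358331 |
|  1 | Finance Management                |     959 |  0.106390 |               4189 |             0.464722 |
|  2 | Human Resource Management         |     836 |  0.092745 |               5025 |             0.557466 |
|  3 | Marketing Management              |     822 |  0.091191 |               5847 |             0.648658 |
|  4 | Operations Management             |     498 |  0.055247 |               6345 |             0.703905 |
|  5 | Business Administration           |     397 |  0.044043 |               6742 |             0.747948 |
|  6 | IT Projects Management            |     366 |  0.040604 |               7108 |             0.788551 |
|  7 | Supply Chain Management           |     344 |  0.038163 |               7452 |             0.826714 |
|  8 | Banking, Investment And Insurance |     335 |  0.037164 |               7787 |             0.863878 |
|  9 | Travel and Tourism                |     202 |  0.022410 |               7989 |             0.886288 |
| 10 | Media and Advertising             |     202 |  0.022410 |               8191 |             0.908698 |
| 11 | International Business            |     176 |  0.019525 |               8367 |             0.928223 |
| 12 | Healthcare Management             |     156 |  0.017306 |               8523 |             0.945529 |
| 13 | E-COMMERCE                        |     111 |  0.012314 |               8634 |             0.957843 |
| 14 | Hospitality Management            |     110 |  0.012203 |               8744 |             0.970047 |
| 15 | Retail Management                 |     100 |  0.011094 |               8844 |             0.981140 |
| 16 | Rural and Agribusiness            |      73 |  0.008099 |               8917 |             0.989239 |
| 17 | E-Business                        |      57 |  0.006323 |               8974 |             0.995562 |
| 18 | Services Excellence               |      40 |  0.004438 |               9014 |             1.000000 |
+----+-----------------------------------+---------+-----------+--------------------+----------------------+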
  • Labels #15 to #18 are the smallest Specialization categories; together they account for about 3% of leads. The 0.012121 cut-off used in the next cell keeps Hospitality Management (1.22%) and E-COMMERCE (1.23%) as separate levels.
  • Let's group these into a new label called 'Other Specialization'.
In [199]:
# Grouping Specialization
labelCounts = leads[feature].value_counts(normalize=True)

# labels contributing at most ~1.2% of leads each (the cut-off sits just below Hospitality Management at 1.22%)
labelsToGroup = labelCounts[labelCounts <=0.012121].index.values

leads[feature] = leads[feature].replace(labelsToGroup, ['Other '+feature]*len(labelsToGroup))

leads.stb.freq([feature])
Out[199]:
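+----+-----------------------------------+---------+-----------+--------------------+----------------------+
|    | Specialization                    |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+-----------------------------------+---------+-----------+--------------------+----------------------|
|  0 | No Specialization                 |    3230 |  0.358331 |               3230 |             0.358331 |
|  1 | Finance Management                |     959 |  0.106390 |               4189 |             0.464722 |
|  2 | Human Resource Management         |     836 |  0.092745 |               5025 |             0.557466 |
|  3 | Marketing Management              |     822 |  0.091191 |               5847 |             0.648658 |
|  4 | Operations Management             |     498 |  0.055247 |               6345 |             0.703905 |
|  5 | Business Administration           |     397 |  0.044043 |               6742 |             0.747948 |
|  6 | IT Projects Management            |     366 |  0.040604 |               7108 |             0.788551 |
|  7 | Supply Chain Management           |     344 |  0.038163 |               7452 |             0.826714 |
|  8 | Banking, Investment And Insurance |     335 |  0.037164 |               7787 |             0.863878 |
|  9 | Other Specialization              |     270 |  0.029953 |               8057 |             0.893832 |
| 10 | Travel and Tourism                |     202 |  0.022410 |               8259 |             0.916241 |
| 11 | Media and Advertising             |     202 |  0.022410 |               8461 |             0.938651 |
| 12 | International Business            |     176 |  0.019525 |               8637 |             0.958176 |
| 13 | Healthcare Management             |     156 |  0.017306 |               8793 |             0.975483 |
| 14 | E-COMMERCE                        |     111 |  0.012314 |               8904 |             0.987797 |
| 15 | Hospitality Management            |     110 |  0.012203 |               9014 |             1.000000 |
+----+-----------------------------------+---------+-----------+--------------------+----------------------+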
  • The cleaning tasks are complete. No missing values remain.
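
The grouping cells for each feature repeat the same three steps: compute label shares, pick the labels under a cut-off, and replace them with a single 'Other' label. A minimal sketch of that pattern as a reusable helper is shown here. The function name `group_rare_labels` and the toy data are hypothetical and not part of the notebook, and `Series.where` is used in place of `replace`.

```python
import pandas as pd

def group_rare_labels(series, threshold, other_label):
    """Replace labels whose share of rows is below `threshold` with `other_label`."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < threshold].index
    # Keep values that are not rare; everything else becomes `other_label`
    return series.where(~series.isin(rare), other_label)

# Toy data (not the leads dataset): C and D each hold 1% of the rows
demo = pd.Series(['A'] * 90 + ['B'] * 8 + ['C'] + ['D'])
grouped = group_rare_labels(demo, threshold=0.02, other_label='Other')
print(grouped.value_counts())
```

With a helper like this, the cut-off passed as `threshold` and the percentage quoted in the comment cannot drift apart as easily.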

Retained Data

In [200]:
# Columns retained 
print('Retained Columns\n\n', leads.columns.values)
Retained Columns

 ['Lead Origin' 'Lead Source' 'Do Not Email' 'Converted' 'TotalVisits'
 'Total Time Spent on Website' 'Page Views Per Visit' 'Last Activity'
 'Country' 'Specialization' 'What is your current occupation' 'City'
 'A free copy of Mastering The Interview']
In [201]:
# Retained rows
print('Retained rows : ',leads.shape[0]) 
print("Ratio of retained rows", 100*leads.shape[0]/9240)
Retained rows :  9014
Ratio of retained rows 97.55411255411255

Data Imbalance

In [202]:
leads.stb.freq(['Converted'])
Out[202]:
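+----+-------------+---------+-----------+--------------------+----------------------+
|    |   Converted |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------+---------+-----------+--------------------+----------------------|
|  0 |           0 |    5595 |  0.620701 |               5595 |             0.620701 |
|  1 |           1 |    3419 |  0.379299 |               9014 |             1.000000 |
+----+-------------+---------+-----------+--------------------+----------------------+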
In [203]:
converted_cond = leads['Converted'] == 1
imbalance = leads[converted_cond].shape[0]/leads[~converted_cond].shape[0]
print('Class Imbalance : Converted /Un-converted =', np.round(imbalance,3))
Class Imbalance : Converted /Un-converted = 0.611
  • From the above, this data set contains 38% converted leads and 62% un-converted leads.
  • Ratio of classes (converted / un-converted) = 0.61
  • The dataset is skewed towards un-converted leads.
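
As a quick check of the figures in these bullets, the class shares and the class ratio can be recomputed directly from the two counts in the frequency table. This is only an arithmetic sketch; the variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the 'Converted' frequency table
counts = pd.Series({0: 5595, 1: 3419}, name='Converted')

shares = counts / counts.sum()   # fraction of leads in each class
ratio = counts[1] / counts[0]    # converted / un-converted

print(shares.round(3))
print('Converted / Un-converted =', round(ratio, 3))
```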

Univariate Analysis

In [204]:
def categoricalUAn(column,figsize=[8,8]) : 
    
    ''' Function for categorical univariate analysis '''
    print('Types of ' + column)
    tab(leads.stb.freq([column]))
    
    converted = leads[leads['Converted'] == 1]
    unconverted = leads[leads['Converted'] == 0]
      
    print(column + ' for Converted Leads')
    
    tab(converted.stb.freq([column]))
    
    print(column + ' for Un-Converted Leads')
    
    tab(unconverted.stb.freq([column]))
    
    print(column + ' vs Conversion Rate')
    
    tab((converted[column].value_counts()) / (converted[column].value_counts() + unconverted[column].value_counts()))
    
    # bar plot
    plt.figure(figsize=figsize)
    ax = sns.countplot(y=column,hue='Converted',data=leads)
    title = column + ' vs Lead Conversion'
    ax.set(title= title)

    
    

Lead Origin

In [205]:
column = 'Lead Origin'
categoricalUAn(column,figsize=[8,8])
Types of Lead Origin
+----+-------------------------+---------+-----------+--------------------+----------------------+
|    | Lead Origin             |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------------+---------+-----------+--------------------+----------------------|
|  0 | Landing Page Submission |    4870 | 0.540271  |               4870 |             0.540271 |
|  1 | API                     |    3533 | 0.391946  |               8403 |             0.932217 |
|  2 | Other Lead Origins      |     611 | 0.0677834 |               9014 |             1        |
+----+-------------------------+---------+-----------+--------------------+----------------------+
Lead Origin for Converted Leads
+----+-------------------------+---------+-----------+--------------------+----------------------+
|    | Lead Origin             |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------------+---------+-----------+--------------------+----------------------|
|  0 | Landing Page Submission |    1765 |  0.516233 |               1765 |             0.516233 |
|  1 | API                     |    1101 |  0.322024 |               2866 |             0.838257 |
|  2 | Other Lead Origins      |     553 |  0.161743 |               3419 |             1        |
+----+-------------------------+---------+-----------+--------------------+----------------------+
Lead Origin for Un-Converted Leads
+----+-------------------------+---------+-----------+--------------------+----------------------+
|    | Lead Origin             |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+-------------------------+---------+-----------+--------------------+----------------------|
|  0 | Landing Page Submission |    3105 | 0.55496   |               3105 |             0.55496  |
|  1 | API                     |    2432 | 0.434674  |               5537 |             0.989634 |
|  2 | Other Lead Origins      |      58 | 0.0103664 |               5595 |             1        |
+----+-------------------------+---------+-----------+--------------------+----------------------+
Lead Origin vs Conversion Rate
+-------------------------+---------------+
|                         |   Lead Origin |
|-------------------------+---------------|
| Landing Page Submission |      0.362423 |
| API                     |      0.311633 |
| Other Lead Origins      |      0.905074 |
+-------------------------+---------------+
  • Leads from Landing Page Submission followed by API make up 93% of all leads.
  • Interestingly, the 6.8% of leads that come from other origins have the highest conversion rate, 90.5%.

Lead Source

In [206]:
column = 'Lead Source'
categoricalUAn(column,figsize=[8,8])
Types of Lead Source
+----+--------------------+---------+------------+--------------------+----------------------+
|    | Lead Source        |   Count |    Percent |   Cumulative Count |   Cumulative Percent |
|----+--------------------+---------+------------+--------------------+----------------------|
|  0 | Google             |    2857 | 0.316951   |               2857 |             0.316951 |
|  1 | Direct Traffic     |    2528 | 0.280453   |               5385 |             0.597404 |
|  2 | Olark Chat         |    1753 | 0.194475   |               7138 |             0.791879 |
|  3 | Organic Search     |    1131 | 0.125471   |               8269 |             0.917351 |
|  4 | Reference          |     443 | 0.0491458  |               8712 |             0.966497 |
|  5 | Welingak Website   |     129 | 0.0143111  |               8841 |             0.980808 |
|  6 | Referral Sites     |     120 | 0.0133126  |               8961 |             0.99412  |
|  7 | Other Lead Sources |      53 | 0.00587974 |               9014 |             1        |
+----+--------------------+---------+------------+--------------------+----------------------+
Lead Source for Converted Leads
+----+--------------------+---------+------------+--------------------+----------------------+
|    | Lead Source        |   Count |    Percent |   Cumulative Count |   Cumulative Percent |
|----+--------------------+---------+------------+--------------------+----------------------|
|  0 | Google             |    1142 | 0.334016   |               1142 |             0.334016 |
|  1 | Direct Traffic     |     815 | 0.238374   |               1957 |             0.57239  |
|  2 | Olark Chat         |     448 | 0.131032   |               2405 |             0.703422 |
|  3 | Organic Search     |     428 | 0.125183   |               2833 |             0.828605 |
|  4 | Reference          |     410 | 0.119918   |               3243 |             0.948523 |
|  5 | Welingak Website   |     127 | 0.0371454  |               3370 |             0.985668 |
|  6 | Referral Sites     |      31 | 0.00906698 |               3401 |             0.994735 |
|  7 | Other Lead Sources |      18 | 0.0052647  |               3419 |             1        |
+----+--------------------+---------+------------+--------------------+----------------------+
Lead Source for Un-Converted Leads
+----+--------------------+---------+-------------+--------------------+----------------------+
|    | Lead Source        |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+--------------------+---------+-------------+--------------------+----------------------|
|  0 | Google             |    1715 | 0.306524    |               1715 |             0.306524 |
|  1 | Direct Traffic     |    1713 | 0.306166    |               3428 |             0.61269  |
|  2 | Olark Chat         |    1305 | 0.233244    |               4733 |             0.845934 |
|  3 | Organic Search     |     703 | 0.125648    |               5436 |             0.971582 |
|  4 | Referral Sites     |      89 | 0.0159071   |               5525 |             0.987489 |
|  5 | Other Lead Sources |      35 | 0.00625559  |               5560 |             0.993744 |
|  6 | Reference          |      33 | 0.00589812  |               5593 |             0.999643 |
|  7 | Welingak Website   |       2 | 0.000357462 |               5595 |             1        |
+----+--------------------+---------+-------------+--------------------+----------------------+
Lead Source vs Conversion Rate
+--------------------+---------------+
|                    |   Lead Source |
|--------------------+---------------|
| Direct Traffic     |      0.322389 |
| Google             |      0.39972  |
| Olark Chat         |      0.255562 |
| Organic Search     |      0.378426 |
| Other Lead Sources |      0.339623 |
| Reference          |      0.925508 |
| Referral Sites     |      0.258333 |
| Welingak Website   |      0.984496 |
+--------------------+---------------+
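  • Most leads come from Google (32%), followed by Direct Traffic (28%) and Olark Chat (19%). The same three sources also supply the most converted leads (33%, 24% and 13% of all conversions).
  • Leads through Reference have a very high conversion rate (93%), and Welingak Website is higher still (98%, on only 129 leads).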
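
Reading share of leads and conversion rate from separate tables makes it easy to mix the two up. One possible way to see both side by side is a single grouped summary, sketched here on toy data. `conversion_summary` is a hypothetical helper, not a function from the notebook, and it assumes the target column is coded 0/1 so that its mean equals the conversion rate.

```python
import pandas as pd

def conversion_summary(df, column, target='Converted'):
    """Per-label lead count, share of all leads and conversion rate."""
    summary = df.groupby(column)[target].agg(leads='count', conversion_rate='mean')
    summary['share_of_leads'] = summary['leads'] / len(df)
    return summary.sort_values('conversion_rate', ascending=False)

# Toy data (not the leads dataset)
demo = pd.DataFrame({
    'Lead Source': ['Google'] * 6 + ['Reference'] * 4,
    'Converted':   [1, 1, 0, 0, 0, 0] + [1, 1, 1, 0],
})
summary = conversion_summary(demo, 'Lead Source')
print(summary)
```

Keeping the lead count next to the rate also makes it visible when a high rate rests on very few leads.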

Do not Email

In [207]:
feature = 'Do Not Email'
categoricalUAn(feature,figsize=[8,8])
Types of Do Not Email
+----+----------------+---------+-----------+--------------------+----------------------+
|    | Do Not Email   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------+---------+-----------+--------------------+----------------------|
|  0 | No             |    8311 | 0.92201   |               8311 |              0.92201 |
|  1 | Yes            |     703 | 0.0779898 |               9014 |              1       |
+----+----------------+---------+-----------+--------------------+----------------------+
Do Not Email for Converted Leads
+----+----------------+---------+-----------+--------------------+----------------------+
|    | Do Not Email   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------+---------+-----------+--------------------+----------------------|
|  0 | No             |    3315 | 0.969582  |               3315 |             0.969582 |
|  1 | Yes            |     104 | 0.0304183 |               3419 |             1        |
+----+----------------+---------+-----------+--------------------+----------------------+
Do Not Email for Un-Converted Leads
+----+----------------+---------+-----------+--------------------+----------------------+
|    | Do Not Email   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+----------------+---------+-----------+--------------------+----------------------|
|  0 | No             |    4996 |   0.89294 |               4996 |              0.89294 |
|  1 | Yes            |     599 |   0.10706 |               5595 |              1       |
+----+----------------+---------+-----------+--------------------+----------------------+
Do Not Email vs Conversion Rate
+-----+----------------+
|     |   Do Not Email |
|-----+----------------|
| No  |       0.398869 |
| Yes |       0.147937 |
+-----+----------------+
  • 92% of leads have not opted out of emails from the company (Do Not Email = No).
  • These leads convert far more often (40%) than leads who opted out (15%).

Last Activity

In [208]:
# 'Last Activity'
feature = 'Last Activity'
categoricalUAn(feature,figsize=[8,8])
Types of Last Activity
+----+---------------------------+---------+-----------+--------------------+----------------------+
|    | Last Activity             |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------------------+---------+-----------+--------------------+----------------------|
|  0 | Email Opened              |    3417 | 0.379077  |               3417 |             0.379077 |
|  1 | SMS Sent                  |    2701 | 0.299645  |               6118 |             0.678722 |
|  2 | Olark Chat Conversation   |     963 | 0.106834  |               7081 |             0.785556 |
|  3 | Page Visited on Website   |     637 | 0.0706679 |               7718 |             0.856224 |
|  4 | Converted to Lead         |     422 | 0.0468161 |               8140 |             0.90304  |
|  5 | Email Bounced             |     304 | 0.0337253 |               8444 |             0.936765 |
|  6 | Email Link Clicked        |     266 | 0.0295097 |               8710 |             0.966275 |
|  7 | Other Last Activity       |     189 | 0.0209674 |               8899 |             0.987242 |
|  8 | Form Submitted on Website |     115 | 0.0127579 |               9014 |             1        |
+----+---------------------------+---------+-----------+--------------------+----------------------+
Last Activity for Converted Leads
+----+---------------------------+---------+------------+--------------------+----------------------+
|    | Last Activity             |   Count |    Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------------------+---------+------------+--------------------+----------------------|
|  0 | SMS Sent                  |    1697 | 0.496344   |               1697 |             0.496344 |
|  1 | Email Opened              |    1246 | 0.364434   |               2943 |             0.860778 |
|  2 | Page Visited on Website   |     150 | 0.0438725  |               3093 |             0.90465  |
|  3 | Olark Chat Conversation   |      84 | 0.0245686  |               3177 |             0.929219 |
|  4 | Other Last Activity       |      74 | 0.0216438  |               3251 |             0.950863 |
|  5 | Email Link Clicked        |      72 | 0.0210588  |               3323 |             0.971922 |
|  6 | Converted to Lead         |      53 | 0.0155016  |               3376 |             0.987423 |
|  7 | Form Submitted on Website |      27 | 0.00789705 |               3403 |             0.99532  |
|  8 | Email Bounced             |      16 | 0.00467973 |               3419 |             1        |
+----+---------------------------+---------+------------+--------------------+----------------------+
Last Activity for Un-Converted Leads
+----+---------------------------+---------+-----------+--------------------+----------------------+
|    | Last Activity             |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------------------+---------+-----------+--------------------+----------------------|
|  0 | Email Opened              |    2171 | 0.388025  |               2171 |             0.388025 |
|  1 | SMS Sent                  |    1004 | 0.179446  |               3175 |             0.567471 |
|  2 | Olark Chat Conversation   |     879 | 0.157105  |               4054 |             0.724576 |
|  3 | Page Visited on Website   |     487 | 0.087042  |               4541 |             0.811618 |
|  4 | Converted to Lead         |     369 | 0.0659517 |               4910 |             0.877569 |
|  5 | Email Bounced             |     288 | 0.0514745 |               5198 |             0.929044 |
|  6 | Email Link Clicked        |     194 | 0.0346738 |               5392 |             0.963718 |
|  7 | Other Last Activity       |     115 | 0.0205541 |               5507 |             0.984272 |
|  8 | Form Submitted on Website |      88 | 0.0157283 |               5595 |             1        |
+----+---------------------------+---------+-----------+--------------------+----------------------+
Last Activity vs Conversion Rate
+---------------------------+-----------------+
|                           |   Last Activity |
|---------------------------+-----------------|
| Converted to Lead         |       0.125592  |
| Email Bounced             |       0.0526316 |
| Email Link Clicked        |       0.270677  |
| Email Opened              |       0.364647  |
| Form Submitted on Website |       0.234783  |
| Olark Chat Conversation   |       0.0872274 |
| Other Last Activity       |       0.391534  |
| Page Visited on Website   |       0.235479  |
| SMS Sent                  |       0.628286  |
+---------------------------+-----------------+
  • The most common last activity is opening an email (38% of leads).
  • Among leads whose last activity is opening an email, 36% are converted.
  • Only about 5% of leads have 'Converted to Lead' as their last activity.
  • 'SMS Sent' as last activity has the highest conversion rate (63%).
  • 'Email Bounced' as last activity has the lowest conversion rate (5.3%).

Country

In [209]:
feature = 'Country'
categoricalUAn(feature,figsize=[8,8])
Types of Country
+----+---------------+---------+-----------+--------------------+----------------------+
|    | Country       |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------+---------+-----------+--------------------+----------------------|
|  0 | India         |    8787 |  0.974817 |               8787 |             0.974817 |
|  1 | Outside India |     227 |  0.025183 |               9014 |             1        |
+----+---------------+---------+-----------+--------------------+----------------------+
Country for Converted Leads
+----+---------------+---------+-----------+--------------------+----------------------+
|    | Country       |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------+---------+-----------+--------------------+----------------------|
|  0 | India         |    3351 | 0.980111  |               3351 |             0.980111 |
|  1 | Outside India |      68 | 0.0198889 |               3419 |             1        |
+----+---------------+---------+-----------+--------------------+----------------------+
Country for Un-Converted Leads
+----+---------------+---------+-----------+--------------------+----------------------+
|    | Country       |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+---------------+---------+-----------+--------------------+----------------------|
|  0 | India         |    5436 | 0.971582  |               5436 |             0.971582 |
|  1 | Outside India |     159 | 0.0284182 |               5595 |             1        |
+----+---------------+---------+-----------+--------------------+----------------------+
Country vs Conversion Rate
+---------------+-----------+
|               |   Country |
|---------------+-----------|
| India         |  0.381359 |
| Outside India |  0.299559 |
+---------------+-----------+
  • Most leads come from India (97%).
  • Of these, 38% are converted, compared with 30% of leads from outside India.

Specialization

In [210]:
feature = 'Specialization'
categoricalUAn(feature)
Types of Specialization
+----+-----------------------------------+---------+-----------+--------------------+----------------------+
|    | Specialization                    |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+-----------------------------------+---------+-----------+--------------------+----------------------|
|  0 | No Specialization                 |    3230 | 0.358331  |               3230 |             0.358331 |
|  1 | Finance Management                |     959 | 0.10639   |               4189 |             0.464722 |
|  2 | Human Resource Management         |     836 | 0.0927446 |               5025 |             0.557466 |
|  3 | Marketing Management              |     822 | 0.0911915 |               5847 |             0.648658 |
|  4 | Operations Management             |     498 | 0.0552474 |               6345 |             0.703905 |
|  5 | Business Administration           |     397 | 0.0440426 |               6742 |             0.747948 |
|  6 | IT Projects Management            |     366 | 0.0406035 |               7108 |             0.788551 |
|  7 | Supply Chain Management           |     344 | 0.0381629 |               7452 |             0.826714 |
|  8 | Banking, Investment And Insurance |     335 | 0.0371644 |               7787 |             0.863878 |
|  9 | Other Specialization              |     270 | 0.0299534 |               8057 |             0.893832 |
| 10 | Travel and Tourism                |     202 | 0.0224096 |               8259 |             0.916241 |
| 11 | Media and Advertising             |     202 | 0.0224096 |               8461 |             0.938651 |
| 12 | International Business            |     176 | 0.0195252 |               8637 |             0.958176 |
| 13 | Healthcare Management             |     156 | 0.0173064 |               8793 |             0.975483 |
| 14 | E-COMMERCE                        |     111 | 0.0123142 |               8904 |             0.987797 |
| 15 | Hospitality Management            |     110 | 0.0122032 |               9014 |             1        |
+----+-----------------------------------+---------+-----------+--------------------+----------------------+
Specialization for Converted Leads
+----+-----------------------------------+---------+-----------+--------------------+----------------------+
|    | Specialization                    |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+-----------------------------------+---------+-----------+--------------------+----------------------|
|  0 | No Specialization                 |     888 | 0.259725  |                888 |             0.259725 |
|  1 | Finance Management                |     422 | 0.123428  |               1310 |             0.383153 |
|  2 | Marketing Management              |     397 | 0.116116  |               1707 |             0.499269 |
|  3 | Human Resource Management         |     379 | 0.110851  |               2086 |             0.61012  |
|  4 | Operations Management             |     234 | 0.0684411 |               2320 |             0.678561 |
|  5 | Business Administration           |     175 | 0.0511846 |               2495 |             0.729746 |
|  6 | Banking, Investment And Insurance |     164 | 0.0479672 |               2659 |             0.777713 |
|  7 | Supply Chain Management           |     147 | 0.042995  |               2806 |             0.820708 |
|  8 | IT Projects Management            |     140 | 0.0409476 |               2946 |             0.861655 |
|  9 | Other Specialization              |      97 | 0.0283709 |               3043 |             0.890026 |
| 10 | Media and Advertising             |      84 | 0.0245686 |               3127 |             0.914595 |
| 11 | Healthcare Management             |      76 | 0.0222287 |               3203 |             0.936824 |
| 12 | Travel and Tourism                |      71 | 0.0207663 |               3274 |             0.95759  |
| 13 | International Business            |      62 | 0.018134  |               3336 |             0.975724 |
| 14 | Hospitality Management            |      44 | 0.0128693 |               3380 |             0.988593 |
| 15 | E-COMMERCE                        |      39 | 0.0114068 |               3419 |             1        |
+----+-----------------------------------+---------+-----------+--------------------+----------------------+
Specialization for Un-Converted Leads
+----+-----------------------------------+---------+-----------+--------------------+----------------------+
|    | Specialization                    |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+-----------------------------------+---------+-----------+--------------------+----------------------|
|  0 | No Specialization                 |    2342 | 0.418588  |               2342 |             0.418588 |
|  1 | Finance Management                |     537 | 0.0959786 |               2879 |             0.514567 |
|  2 | Human Resource Management         |     457 | 0.0816801 |               3336 |             0.596247 |
|  3 | Marketing Management              |     425 | 0.0759607 |               3761 |             0.672207 |
|  4 | Operations Management             |     264 | 0.047185  |               4025 |             0.719392 |
|  5 | IT Projects Management            |     226 | 0.0403932 |               4251 |             0.759786 |
|  6 | Business Administration           |     222 | 0.0396783 |               4473 |             0.799464 |
|  7 | Supply Chain Management           |     197 | 0.03521   |               4670 |             0.834674 |
|  8 | Other Specialization              |     173 | 0.0309205 |               4843 |             0.865594 |
|  9 | Banking, Investment And Insurance |     171 | 0.030563  |               5014 |             0.896157 |
| 10 | Travel and Tourism                |     131 | 0.0234138 |               5145 |             0.919571 |
| 11 | Media and Advertising             |     118 | 0.0210903 |               5263 |             0.940661 |
| 12 | International Business            |     114 | 0.0203753 |               5377 |             0.961037 |
| 13 | Healthcare Management             |      80 | 0.0142985 |               5457 |             0.975335 |
| 14 | E-COMMERCE                        |      72 | 0.0128686 |               5529 |             0.988204 |
| 15 | Hospitality Management            |      66 | 0.0117962 |               5595 |             1        |
+----+-----------------------------------+---------+-----------+--------------------+----------------------+
Specialization vs Conversion Rate
+-----------------------------------+------------------+
|                                   |   Specialization |
|-----------------------------------+------------------|
| Banking, Investment And Insurance |         0.489552 |
| Business Administration           |         0.440806 |
| E-COMMERCE                        |         0.351351 |
| Finance Management                |         0.440042 |
| Healthcare Management             |         0.487179 |
| Hospitality Management            |         0.4      |
| Human Resource Management         |         0.453349 |
| IT Projects Management            |         0.382514 |
| International Business            |         0.352273 |
| Marketing Management              |         0.482968 |
| Media and Advertising             |         0.415842 |
| No Specialization                 |         0.274923 |
| Operations Management             |         0.46988  |
| Other Specialization              |         0.359259 |
| Supply Chain Management           |         0.427326 |
| Travel and Tourism                |         0.351485 |
+-----------------------------------+------------------+
  • Specialization is missing for 36% of leads.
  • We have mapped those missing values to 'No Specialization'. There might be two reasons for this:
    • The lead might be a fresher.
    • The lead did not fill in the field.
  • Among all the specializations, 'Banking, Investment And Insurance' has the highest conversion rate (49.0%), while 'No Specialization' has the lowest (27.5%).

What is your current occupation

In [211]:
feature = 'What is your current occupation'
categoricalUAn(feature,figsize=[8,8])
Types of What is your current occupation
+----+-----------------------------------+---------+-------------+--------------------+----------------------+
|    | What is your current occupation   |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+-----------------------------------+---------+-------------+--------------------+----------------------|
|  0 | Unemployed                        |    5445 | 0.60406     |               5445 |             0.60406  |
|  1 | Unknown Occupation                |    2656 | 0.294653    |               8101 |             0.898713 |
|  2 | Working Professional              |     675 | 0.0748835   |               8776 |             0.973597 |
|  3 | Student                           |     206 | 0.0228533   |               8982 |             0.99645  |
|  4 | Other                             |      15 | 0.00166408  |               8997 |             0.998114 |
|  5 | Housewife                         |       9 | 0.000998447 |               9006 |             0.999112 |
|  6 | Businessman                       |       8 | 0.000887508 |               9014 |             1        |
+----+-----------------------------------+---------+-------------+--------------------+----------------------+
What is your current occupation for Converted Leads
+----+-----------------------------------+---------+------------+--------------------+----------------------+
|    | What is your current occupation   |   Count |    Percent |   Cumulative Count |   Cumulative Percent |
|----+-----------------------------------+---------+------------+--------------------+----------------------|
|  0 | Unemployed                        |    2336 | 0.683241   |               2336 |             0.683241 |
|  1 | Working Professional              |     620 | 0.18134    |               2956 |             0.86458  |
|  2 | Unknown Occupation                |     366 | 0.107049   |               3322 |             0.971629 |
|  3 | Student                           |      74 | 0.0216438  |               3396 |             0.993273 |
|  4 | Other                             |       9 | 0.00263235 |               3405 |             0.995905 |
|  5 | Housewife                         |       9 | 0.00263235 |               3414 |             0.998538 |
|  6 | Businessman                       |       5 | 0.00146242 |               3419 |             1        |
+----+-----------------------------------+---------+------------+--------------------+----------------------+
What is your current occupation for Un-Converted Leads
+----+-----------------------------------+---------+-------------+--------------------+----------------------+
|    | What is your current occupation   |   Count |     Percent |   Cumulative Count |   Cumulative Percent |
|----+-----------------------------------+---------+-------------+--------------------+----------------------|
|  0 | Unemployed                        |    3109 | 0.555675    |               3109 |             0.555675 |
|  1 | Unknown Occupation                |    2290 | 0.409294    |               5399 |             0.964969 |
|  2 | Student                           |     132 | 0.0235925   |               5531 |             0.988561 |
|  3 | Working Professional              |      55 | 0.00983021  |               5586 |             0.998391 |
|  4 | Other                             |       6 | 0.00107239  |               5592 |             0.999464 |
|  5 | Businessman                       |       3 | 0.000536193 |               5595 |             1        |
+----+-----------------------------------+---------+-------------+--------------------+----------------------+
What is your current occupation vs Conversion Rate
+----------------------+-----------------------------------+
|                      |   What is your current occupation |
|----------------------+-----------------------------------|
| Businessman          |                          0.625    |
| Housewife            |                        nan        |
| Other                |                          0.6      |
| Student              |                          0.359223 |
| Unemployed           |                          0.429017 |
| Unknown Occupation   |                          0.137801 |
| Working Professional |                          0.918519 |
+----------------------+-----------------------------------+
  • Although the conversion rate for Working Professional is the highest (91.9%), they make up only 7.4% of all leads. 60% of leads are Unemployed, followed by 29% with an unknown occupation.
  • Among all the converted leads, Unemployed and Working Professionals top the list.
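  • Conversion for the Housewife segment is 100%: all 9 Housewife leads converted, so the segment is absent from the un-converted table (the rate table shows nan for it, presumably for that reason).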
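
The conversion rate per category is the share of converted leads within that category. A minimal sketch of the calculation on a toy frame follows; the data and the column name `occupation` are invented for illustration, and this is not the notebook's `categoricalUAn` helper.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "occupation": ["Unemployed", "Unemployed", "Student", "Housewife", "Housewife"],
    "Converted":  [1, 0, 0, 1, 1],
})

# The mean of a 0/1 target within a group is that group's conversion rate
conversion_rate = toy.groupby("occupation")["Converted"].mean()

# Group sizes, to judge how much weight a rate deserves
group_size = toy.groupby("occupation")["Converted"].size()

summary = pd.DataFrame({"leads": group_size, "conversion_rate": conversion_rate})
print(summary)
```

Computed this way, a segment with no un-converted leads gets a rate of 1.0 rather than nan, and the `leads` column makes it obvious when a rate rests on only a handful of rows.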

City

In [212]:
feature = 'City'
categoricalUAn(feature)
Types of City
+----+-----------------------------+---------+------------+--------------------+----------------------+
|    | City                        |   Count |    Percent |   Cumulative Count |   Cumulative Percent |
|----+-----------------------------+---------+------------+--------------------+----------------------|
|  0 | Mumbai                      |    6692 | 0.742401   |               6692 |             0.742401 |
|  1 | Thane & Outskirts           |     745 | 0.0826492  |               7437 |             0.82505  |
|  2 | Other Cities                |     680 | 0.0754382  |               8117 |             0.900488 |
|  3 | Other Cities of Maharashtra |     446 | 0.0494786  |               8563 |             0.949967 |
|  4 | Other Metro Cities          |     377 | 0.0418238  |               8940 |             0.991791 |
|  5 | Tier II Cities              |      74 | 0.00820945 |               9014 |             1        |
+----+-----------------------------+---------+------------+--------------------+----------------------+
City for Converted Leads
+----+-----------------------------+---------+------------+--------------------+----------------------+
|    | City                        |   Count |    Percent |   Cumulative Count |   Cumulative Percent |
|----+-----------------------------+---------+------------+--------------------+----------------------|
|  0 | Mumbai                      |    2440 | 0.713659   |               2440 |             0.713659 |
|  1 | Thane & Outskirts           |     332 | 0.0971044  |               2772 |             0.810763 |
|  2 | Other Cities                |     272 | 0.0795554  |               3044 |             0.890319 |
|  3 | Other Cities of Maharashtra |     196 | 0.0573267  |               3240 |             0.947646 |
|  4 | Other Metro Cities          |     154 | 0.0450424  |               3394 |             0.992688 |
|  5 | Tier II Cities              |      25 | 0.00731208 |               3419 |             1        |
+----+-----------------------------+---------+------------+--------------------+----------------------+
City for Un-Converted Leads
+----+-----------------------------+---------+------------+--------------------+----------------------+
|    | City                        |   Count |    Percent |   Cumulative Count |   Cumulative Percent |
|----+-----------------------------+---------+------------+--------------------+----------------------|
|  0 | Mumbai                      |    4252 | 0.759964   |               4252 |             0.759964 |
|  1 | Thane & Outskirts           |     413 | 0.0738159  |               4665 |             0.83378  |
|  2 | Other Cities                |     408 | 0.0729223  |               5073 |             0.906702 |
|  3 | Other Cities of Maharashtra |     250 | 0.0446828  |               5323 |             0.951385 |
|  4 | Other Metro Cities          |     223 | 0.039857   |               5546 |             0.991242 |
|  5 | Tier II Cities              |      49 | 0.00875782 |               5595 |             1        |
+----+-----------------------------+---------+------------+--------------------+----------------------+
City vs Conversion Rate
+-----------------------------+----------+
|                             |     City |
|-----------------------------+----------|
| Mumbai                      | 0.364614 |
| Thane & Outskirts           | 0.445638 |
| Other Cities                | 0.4      |
| Other Cities of Maharashtra | 0.439462 |
| Other Metro Cities          | 0.408488 |
| Tier II Cities              | 0.337838 |
+-----------------------------+----------+
  • Most leads (74%) come from 'Mumbai', with a decent conversion rate of 36.5%.
  • Leads from Thane & Outskirts make up 8.3% of all leads, with a conversion rate of 44.6%.

A free copy of Mastering The Interview.

In [213]:
feature = 'A free copy of Mastering The Interview'
categoricalUAn(feature)
Types of A free copy of Mastering The Interview
+----+------------------------------------------+---------+-----------+--------------------+----------------------+
|    | A free copy of Mastering The Interview   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+------------------------------------------+---------+-----------+--------------------+----------------------|
|  0 | No                                       |    6126 |  0.679609 |               6126 |             0.679609 |
|  1 | Yes                                      |    2888 |  0.320391 |               9014 |             1        |
+----+------------------------------------------+---------+-----------+--------------------+----------------------+
A free copy of Mastering The Interview for Converted Leads
+----+------------------------------------------+---------+-----------+--------------------+----------------------+
|    | A free copy of Mastering The Interview   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+------------------------------------------+---------+-----------+--------------------+----------------------|
|  0 | No                                       |    2389 |  0.698742 |               2389 |             0.698742 |
|  1 | Yes                                      |    1030 |  0.301258 |               3419 |             1        |
+----+------------------------------------------+---------+-----------+--------------------+----------------------+
A free copy of Mastering The Interview for Un-Converted Leads
+----+------------------------------------------+---------+-----------+--------------------+----------------------+
|    | A free copy of Mastering The Interview   |   Count |   Percent |   Cumulative Count |   Cumulative Percent |
|----+------------------------------------------+---------+-----------+--------------------+----------------------|
|  0 | No                                       |    3737 |  0.667918 |               3737 |             0.667918 |
|  1 | Yes                                      |    1858 |  0.332082 |               5595 |             1        |
+----+------------------------------------------+---------+-----------+--------------------+----------------------+
A free copy of Mastering The Interview vs Conversion Rate
+-----+------------------------------------------+
|     |   A free copy of Mastering The Interview |
|-----+------------------------------------------|
| No  |                                 0.389977 |
| Yes |                                 0.356648 |
+-----+------------------------------------------+
  • 68% of the leads said "No" for a free copy of 'Mastering The Interview'.
  • The conversion rate of leads who said "No" (39.0%) is slightly higher than that of leads who said "Yes" (35.7%).
In [214]:
def num_univariate_analysis(column_name, scale='linear'):
    # Split the leads by outcome so each group can be summarised separately
    converted = leads[leads['Converted'] == 1]
    unconverted = leads[leads['Converted'] == 0]

    plt.figure(figsize=(8,6))
    # orient='h' makes `Converted` the grouping axis and column_name the value axis
    ax = sns.boxplot(x=column_name, y='Converted', data=leads, orient='h')
    title = 'Boxplot of ' + column_name + ' vs Conversion'
    ax.set(title=title)
    if scale == 'log':
        # The numeric variable is on the x-axis, so scale and label that axis
        ax.set_xscale('log')
        ax.set(xlabel=column_name + ' (Log Scale)')

    print("Spread for range of "+column_name+" that were Converted")
    tab(converted[column_name].describe())
    print("Spread for range of "+column_name+" that were not converted")
    tab(unconverted[column_name].describe())

TotalVisits

In [215]:
column_name = 'TotalVisits'
num_univariate_analysis(column_name,scale='log')
Spread for range of TotalVisits that were Converted
+-------+---------------+
|       |   TotalVisits |
|-------+---------------|
| count |    3419       |
| mean  |       3.65575 |
| std   |       5.57527 |
| min   |       0       |
| 25%   |       0       |
| 50%   |       3       |
| 75%   |       5       |
| max   |     251       |
+-------+---------------+
Spread for range of TotalVisits that were not converted
+-------+---------------+
|       |   TotalVisits |
|-------+---------------|
| count |    5595       |
| mean  |       3.33262 |
| std   |       4.37298 |
| min   |       0       |
| 25%   |       1       |
| 50%   |       3       |
| 75%   |       4       |
| max   |     141       |
+-------+---------------+
  • Looks like Total Visits have a lot of outliers among both Converted and Un-converted leads.
  • Let's take a look at the quantiles between 90 and 100.
In [216]:
# Looking at Quantiles
tab(leads[column_name].quantile(np.linspace(.90,1,20)))
+----------+---------------+
|          |   TotalVisits |
|----------+---------------|
| 0.9      |             7 |
| 0.905263 |             7 |
| 0.910526 |             8 |
| 0.915789 |             8 |
| 0.921053 |             8 |
| 0.926316 |             8 |
| 0.931579 |             9 |
| 0.936842 |             9 |
| 0.942105 |             9 |
| 0.947368 |             9 |
| 0.952632 |            10 |
| 0.957895 |            10 |
| 0.963158 |            11 |
| 0.968421 |            11 |
| 0.973684 |            12 |
| 0.978947 |            13 |
| 0.984211 |            14 |
| 0.989474 |            17 |
| 0.994737 |            20 |
| 1        |           251 |
+----------+---------------+
  • From the above, it is clear that outliers exist and these might skew the analyses.
  • For now, let's cap values above the 99th percentile at the 99th percentile value (soft range capping).
In [217]:
# Capping outliers to 99th percentile value 
cap = leads[column_name].quantile(.99)
condition = leads[column_name] > cap 
leads.loc[condition, column_name] = cap 

Total Time Spent on Website.

In [218]:
column = 'Total Time Spent on Website'
num_univariate_analysis(column)
Spread for range of Total Time Spent on Website that were Converted
+-------+-------------------------------+
|       |   Total Time Spent on Website |
|-------+-------------------------------|
| count |                      3419     |
| mean  |                       732.945 |
| std   |                       614.476 |
| min   |                         0     |
| 25%   |                         0     |
| 50%   |                       826     |
| 75%   |                      1265.5   |
| max   |                      2253     |
+-------+-------------------------------+
Spread for range of Total Time Spent on Website that were not converted
+-------+-------------------------------+
|       |   Total Time Spent on Website |
|-------+-------------------------------|
| count |                      5595     |
| mean  |                       329.919 |
| std   |                       432.757 |
| min   |                         0     |
| 25%   |                        14     |
| 50%   |                       178     |
| 75%   |                       393.5   |
| max   |                      2272     |
+-------+-------------------------------+
  • 'Total Time Spent on Website' has many outliers.
  • Let's look at the quantiles to confirm this.
In [219]:
tab(leads[column].quantile(np.linspace(0.75,1,25)))
+----------+-------------------------------+
|          |   Total Time Spent on Website |
|----------+-------------------------------|
| 0.75     |                       924     |
| 0.760417 |                       953.635 |
| 0.770833 |                       991     |
| 0.78125  |                      1022.41  |
| 0.791667 |                      1054     |
| 0.802083 |                      1087     |
| 0.8125   |                      1115.06  |
| 0.822917 |                      1143     |
| 0.833333 |                      1177.83  |
| 0.84375  |                      1208     |
| 0.854167 |                      1238.6   |
| 0.864583 |                      1271     |
| 0.875    |                      1296.38  |
| 0.885417 |                      1328     |
| 0.895833 |                      1360     |
| 0.90625  |                      1392     |
| 0.916667 |                      1434     |
| 0.927083 |                      1468     |
| 0.9375   |                      1503     |
| 0.947917 |                      1549     |
| 0.958333 |                      1592.46  |
| 0.96875  |                      1647     |
| 0.979167 |                      1720.23  |
| 0.989583 |                      1830.34  |
| 1        |                      2272     |
+----------+-------------------------------+
In [220]:
leads[column].quantile(np.linspace(0.75,1,50)).plot()
Out[220]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc566866790>
In [221]:
# Capping `Total Time Spent on Website` values to 99th percentile 
cap = leads[column].quantile(.99)
condition = leads[column] > cap 
leads.loc[condition, column] = cap 

Page Views Per Visit.

In [222]:
column = 'Page Views Per Visit'
num_univariate_analysis(column) 
Spread for range of Page Views Per Visit that were Converted
+-------+------------------------+
|       |   Page Views Per Visit |
|-------+------------------------|
| count |             3419       |
| mean  |                2.36407 |
| std   |                2.10862 |
| min   |                0       |
| 25%   |                0       |
| 50%   |                2       |
| 75%   |                3.5     |
| max   |               15       |
+-------+------------------------+
Spread for range of Page Views Per Visit that were not converted
+-------+------------------------+
|       |   Page Views Per Visit |
|-------+------------------------|
| count |             5595       |
| mean  |                2.36962 |
| std   |                2.17789 |
| min   |                0       |
| 25%   |                1       |
| 50%   |                2       |
| 75%   |                3       |
| max   |               55       |
+-------+------------------------+
  • 'Page Views Per Visit' has many outliers.
  • Let's look at the quantiles to confirm this.
In [223]:
leads[column].quantile(np.linspace(0.75,1,30)).plot()
Out[223]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc54ad01890>
  • There is a sudden jump between 99th percentile and maximum value.
  • Let's cap the values at the 99th percentile to avoid skewing the analysis.
In [224]:
# Capping `Page Views Per Visit` values to 99th percentile 
cap = leads[column].quantile(.99)
condition = leads[column] > cap 
leads.loc[condition, column] = cap 
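
The three capping cells above repeat the same pattern. As a sketch, the step could be wrapped in a small helper built on `Series.clip`; the name `cap_at_quantile` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def cap_at_quantile(series, q=0.99):
    """Return a copy of `series` with values above its q-th quantile set to that quantile."""
    cap = series.quantile(q)
    return series.clip(upper=cap)

# Toy data: 99 ordinary values and one extreme value
visits = pd.Series(list(range(1, 100)) + [251], dtype=float)
capped = cap_at_quantile(visits, q=0.99)

print("max before:", visits.max())
print("max after :", capped.max())
```

On the notebook's data this would be used as `leads[column] = cap_at_quantile(leads[column])`. Values at or below the cap are left untouched, so only the extreme tail changes.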

Bivariate Analysis

In [225]:
leads.columns.values
Out[225]:
array(['Lead Origin', 'Lead Source', 'Do Not Email', 'Converted',
       'TotalVisits', 'Total Time Spent on Website',
       'Page Views Per Visit', 'Last Activity', 'Country',
       'Specialization', 'What is your current occupation', 'City',
       'A free copy of Mastering The Interview'], dtype=object)
In [226]:
continuous_vars = ['TotalVisits', 'Page Views Per Visit', 'Total Time Spent on Website']

TotalVisits vs A free copy of Mastering The Interview

In [227]:
plt.figure(figsize=[8,8])
sns.barplot(x=continuous_vars[0], y = 'A free copy of Mastering The Interview', data=leads, hue='Converted')
Out[227]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc5668199d0>
  • Leads with a higher number of Total Visits also tend to opt for a free copy of Mastering The Interview.
  • Incidentally, these are leads with a higher conversion rate.
  • The offer of 'A free copy of Mastering The Interview' appears to attract more convertible leads to the website.

Lead Source vs Country

In [228]:
# sns.barplot(x='Lead Source', y = 'Country', hue='Converted', data=leads)
leads.groupby(['Country','Lead Source'])['Converted'].value_counts(normalize=True)\
.unstack()\
   .plot( 
    layout=(2,2),
    figsize=(14,12), kind='barh', stacked=True);
  • Most leads from India through Reference Sources have very high conversion rate.
  • Leads from outside of India from other Lead sources do not convert at all.

Occupation vs City

In [229]:
x = "What is your current occupation"
y = 'City'

leads.groupby([x,y])['Converted'].value_counts(normalize=True)\
.unstack()\
   .plot( 
    layout=(2,2),
    figsize=(14,12), kind='barh', stacked=True);
  • Working Professionals in Other Cities of Maharashtra have higher conversion rates than those from Mumbai, Thane and other cities.
  • Businessmen from Mumbai and Thane & Outskirts are poorer leads than those from Tier II Cities and Other Cities.

Last Activity vs Country

In [230]:
x = "Country"
y = 'Last Activity'

leads.groupby([x,y])['Converted'].value_counts(normalize=True)\
.unstack()\
   .plot( 
    layout=(2,2),
    figsize=(14,12), kind='barh', stacked=True);
  • SMS and Emails are more favourable for conversion over Website Visits outside of India.
  • Leads from outside of India who click email links have a higher conversion rate compared to those from India.

Data Preparation

Mapping Binary Variables to 0 / 1

In [231]:
binary_var = ['Do Not Email', 'A free copy of Mastering The Interview']
leads[binary_var] = leads[binary_var].replace({'Yes' : 1, 'No' : 0})

Creating Indicator Variables

In [232]:
categoricalCol = ['Lead Origin', 'Lead Source','Last Activity', 'Country', 'Specialization',
       'What is your current occupation', 'City'] 

print('Levels in Each Categorical Variable\n')
for col in sorted(categoricalCol) : 
    print(col, leads[col].unique(), '\n')

# Creating dummy variables
leadOriginDummies = pd.get_dummies(leads['Lead Origin'], drop_first=True)
leadSourceDummies = pd.get_dummies(leads['Lead Source'], drop_first=True)
lastActivityDummies = pd.get_dummies(leads['Last Activity'], drop_first=True)
countryDummies = pd.get_dummies(leads['Country'] ,drop_first=True)
specDummies = pd.get_dummies(leads['Specialization'],drop_first=True)
occupationDummies = pd.get_dummies(leads[ 'What is your current occupation'],drop_first=True)
cityDummies = pd.get_dummies(leads[ 'City'],drop_first=True)

# adding dummy variables to leads dataframe
leads = pd.concat([leads, leadOriginDummies,leadSourceDummies,lastActivityDummies, countryDummies, specDummies, occupationDummies, cityDummies], axis=1)

# dropping categorical columns 
leads.drop(columns = categoricalCol, inplace=True)


print('Final Columns')
leads.columns
Levels in Each Categorical Variable

City ['Mumbai' 'Thane & Outskirts' 'Other Metro Cities' 'Other Cities'
 'Other Cities of Maharashtra' 'Tier II Cities'] 

Country ['India' 'Outside India'] 

Last Activity ['Page Visited on Website' 'Email Opened' 'Other Last Activity'
 'Converted to Lead' 'Olark Chat Conversation' 'Email Link Clicked'
 'Form Submitted on Website' 'Email Bounced' 'SMS Sent'] 

Lead Origin ['API' 'Landing Page Submission' 'Other Lead Origins'] 

Lead Source ['Olark Chat' 'Organic Search' 'Direct Traffic' 'Google' 'Referral Sites'
 'Reference' 'Welingak Website' 'Other Lead Sources'] 

Specialization ['No Specialization' 'Business Administration' 'Media and Advertising'
 'Supply Chain Management' 'IT Projects Management' 'Finance Management'
 'Travel and Tourism' 'Human Resource Management' 'Marketing Management'
 'Banking, Investment And Insurance' 'International Business' 'E-COMMERCE'
 'Operations Management' 'Other Specialization' 'Hospitality Management'
 'Healthcare Management'] 

What is your current occupation ['Unemployed' 'Student' 'Unknown Occupation' 'Working Professional'
 'Businessman' 'Other' 'Housewife'] 

Final Columns
Out[232]:
Index(['Do Not Email', 'Converted', 'TotalVisits',
       'Total Time Spent on Website', 'Page Views Per Visit',
       'A free copy of Mastering The Interview', 'Landing Page Submission',
       'Other Lead Origins', 'Google', 'Olark Chat', 'Organic Search',
       'Other Lead Sources', 'Reference', 'Referral Sites', 'Welingak Website',
       'Email Bounced', 'Email Link Clicked', 'Email Opened',
       'Form Submitted on Website', 'Olark Chat Conversation',
       'Other Last Activity', 'Page Visited on Website', 'SMS Sent',
       'Outside India', 'Business Administration', 'E-COMMERCE',
       'Finance Management', 'Healthcare Management', 'Hospitality Management',
       'Human Resource Management', 'IT Projects Management',
       'International Business', 'Marketing Management',
       'Media and Advertising', 'No Specialization', 'Operations Management',
       'Other Specialization', 'Supply Chain Management', 'Travel and Tourism',
       'Housewife', 'Other', 'Student', 'Unemployed', 'Unknown Occupation',
       'Working Professional', 'Other Cities', 'Other Cities of Maharashtra',
       'Other Metro Cities', 'Thane & Outskirts', 'Tier II Cities'],
      dtype='object')
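
The final column list has no 'API', 'India' or 'Mumbai' columns because `drop_first=True` drops the first level of each variable in sorted order, and that level becomes the reference category. A small sketch on made-up rows (the toy frame is an assumption, not the notebook's data):

```python
import pandas as pd

# Toy rows, invented for illustration
toy = pd.DataFrame({
    "Country": ["India", "Outside India", "India"],
    "Lead Origin": ["API", "Landing Page Submission", "Other Lead Origins"],
})

# drop_first=True removes the first level (in sorted order) of each variable
country_dummies = pd.get_dummies(toy["Country"], drop_first=True)
origin_dummies = pd.get_dummies(toy["Lead Origin"], drop_first=True)

print(list(country_dummies.columns))
print(list(origin_dummies.columns))
```

A row with zeros in all indicators of a variable therefore belongs to the dropped reference level, for example 'India' for Country.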

Correlation

In [233]:
# Top Correlations
def correlation(dataframe):
    # Pairwise correlations in long format: one row per (VAR1, VAR2)
    cor0 = dataframe.corr()
    cor0 = cor0.unstack().reset_index()
    cor0.columns = ['VAR1', 'VAR2', 'CORR']
    cor0.dropna(subset=['CORR'], inplace=True)
    # Rank by strength only: the sign of the correlation is discarded here
    cor0.CORR = round(cor0['CORR'], 2)
    cor0.CORR = cor0.CORR.abs()
    # Drop self-correlations. Each pair still appears twice, as (A, B) and (B, A),
    # which is why the callers below take every second row
    cor0 = cor0[~(cor0['VAR1'] == cor0['VAR2'])]
    return pd.DataFrame(cor0.sort_values(by=['CORR'], ascending=False))
In [234]:
#Correlations for Converted Leads 
convertedCondition= leads['Converted']==1
print('Correlations for Converted Leads')
correlation(leads[convertedCondition])[1:30:2].style.background_gradient(cmap='GnBu').hide_index()
Correlations for Converted Leads
Out[234]:
VAR1 VAR2 CORR
Other Lead Origins Reference 0.840000
SMS Sent Email Opened 0.750000
Unemployed Working Professional 0.690000
Page Views Per Visit TotalVisits 0.680000
No Specialization Landing Page Submission 0.600000
Page Views Per Visit Landing Page Submission 0.560000
Landing Page Submission A free copy of Mastering The Interview 0.540000
Unemployed Unknown Occupation 0.510000
Total Time Spent on Website Page Views Per Visit 0.500000
Other Lead Origins Page Views Per Visit 0.480000
Other Lead Origins Total Time Spent on Website 0.460000
Other Lead Origins Landing Page Submission 0.450000
Welingak Website Other Lead Origins 0.450000
Total Time Spent on Website TotalVisits 0.440000
TotalVisits Total Time Spent on Website 0.440000
  • Among converted leads, 'Other Lead Origins' and 'Reference' show the strongest association (0.84 in absolute value), so these two indicators carry largely overlapping information.
In [235]:
#Correlations for un-Converted Leads 
unconvertedCondition=leads['Converted']==0
print('Correlations for Non-Converted Leads')
correlation(leads[unconvertedCondition])[1:30:2].style.background_gradient(cmap='GnBu').hide_index()
Correlations for Non-Converted Leads
Out[235]:
VAR1 VAR2 CORR
Unemployed Unknown Occupation 0.930000
Landing Page Submission No Specialization 0.870000
Reference Other Lead Origins 0.750000
TotalVisits Page Views Per Visit 0.730000
Email Bounced Do Not Email 0.650000
Olark Chat Page Views Per Visit 0.630000
Olark Chat Landing Page Submission 0.620000
A free copy of Mastering The Interview No Specialization 0.590000
A free copy of Mastering The Interview Landing Page Submission 0.580000
No Specialization Olark Chat 0.570000
Olark Chat TotalVisits 0.520000
Page Views Per Visit Landing Page Submission 0.500000
Olark Chat Conversation Olark Chat 0.500000
No Specialization Page Views Per Visit 0.490000
Other Lead Origins Other Lead Sources 0.480000
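  • From the above, Unknown Occupation and Unemployed have the strongest correlation (0.93 in absolute value) for non-converted leads.
  • These are mutually exclusive levels of the same variable and together cover about 96% of non-converted leads, so when one indicator is 0 the other is almost always 1. The value reflects this redundancy rather than similar conversion behaviour.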
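
The `correlation` helper above lists every pair twice and reports absolute values, so the sign is not visible in the tables. A sketch of an alternative that keeps one row per pair and retains the sign is given below; the function name `top_correlations` and the toy indicator columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def top_correlations(df, n=5):
    """List each pair of columns once, ordered by absolute correlation."""
    corr = df.corr()
    # Keep only the strict upper triangle so (A, B) and (B, A) are not both listed
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().reset_index()
    pairs.columns = ["VAR1", "VAR2", "CORR"]
    pairs = pairs.dropna(subset=["CORR"])
    pairs["ABS_CORR"] = pairs["CORR"].abs()
    return pairs.sort_values("ABS_CORR", ascending=False).head(n).reset_index(drop=True)

# Toy indicator columns: every row is either Unemployed or Unknown Occupation
toy = pd.DataFrame({
    "Unemployed":         [1, 1, 1, 0, 0, 1, 0, 1],
    "Unknown Occupation": [0, 0, 0, 1, 1, 0, 1, 0],
    "Do Not Email":       [0, 1, 0, 0, 1, 0, 0, 1],
})
result = top_correlations(toy)
print(result)
```

In this toy case the two occupation indicators are exact complements, so their signed correlation is negative even though its absolute value is the largest in the table.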

Train-Test Split

In [236]:
from sklearn.model_selection import train_test_split
y = leads.pop('Converted')
X = leads
In [237]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=100)

Standardizing Continuous Variables

In [238]:
continuous_vars
Out[238]:
['TotalVisits', 'Page Views Per Visit', 'Total Time Spent on Website']
In [239]:
from sklearn.preprocessing import StandardScaler 
scaler = StandardScaler()

# fitting and transforming train set
X_train[continuous_vars] = scaler.fit_transform(X_train[continuous_vars])

# Transforming test set for later use
X_test[continuous_vars] = scaler.transform(X_test[continuous_vars])
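
The scaler is fitted on the training set only, so the test set is transformed with the training mean and standard deviation and no information from the test set leaks into the model. A minimal sketch on synthetic data (the toy frames below are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = pd.DataFrame({"TotalVisits": rng.poisson(3, size=200).astype(float)})
test = pd.DataFrame({"TotalVisits": rng.poisson(3, size=50).astype(float)})

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)   # learns mean and std from train
test_scaled = scaler.transform(test)         # reuses the train statistics

# The same transformation written out by hand (StandardScaler uses ddof=0)
manual = (test["TotalVisits"] - train["TotalVisits"].mean()) / train["TotalVisits"].std(ddof=0)

print("train mean after scaling:", train_scaled.mean())
print("test mean after scaling :", test_scaled.mean())
```

The scaled test set generally does not have mean exactly 0 or standard deviation exactly 1, which is expected.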

Modelling

Recursive Feature Elimination

In [240]:
print('No of features : ', len(X_train.columns)) 
No of features :  49
  • Currently, the dataset has 49 features.
  • We shall follow a mixed feature elimination approach.
  • We could use Recursive Feature Elimination for coarse elimination to 25 columns
  • This is followed by manual elimination of features with high p-value / VIF.
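
To make the coarse step concrete, here is a sketch of RFE on synthetic data; the dataset from `make_classification` and the choice of 4 features are assumptions for illustration only. RFE repeatedly fits the estimator and drops the weakest feature until the requested number remains.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which 3 are informative
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
selector.fit(X_toy, y_toy)

# support_ flags the kept features; ranking_ is 1 for kept features and
# grows with how early a feature was eliminated
print("kept   :", selector.support_)
print("ranking:", selector.ranking_)
```

All retained features share rank 1, so the ranking says nothing about their relative importance. That is why p-values and VIF are used for the remaining manual elimination.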
In [241]:
# RFE 
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
minFeatures = 25
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=minFeatures)
rfe = rfe.fit(X_train, y_train)
In [242]:
# Columns selected by RFE : 
RFE_features = pd.DataFrame( {'feature' : X_train.columns, 'rank' : rfe.ranking_, 'support' : rfe.support_})
condition = RFE_features['support'] == True 
rfe_features = RFE_features[condition].sort_values(by='rank',ascending=True )
print('Features selected by RFE\n')
rfe_features 
Features selected by RFE

Out[242]:
feature rank support
0 Do Not Email 1 True
42 Unknown Occupation 1 True
41 Unemployed 1 True
40 Student 1 True
38 Housewife 1 True
33 No Specialization 1 True
32 Media and Advertising 1 True
27 Hospitality Management 1 True
22 Outside India 1 True
21 SMS Sent 1 True
19 Other Last Activity 1 True
43 Working Professional 1 True
18 Olark Chat Conversation 1 True
15 Email Link Clicked 1 True
13 Welingak Website 1 True
11 Reference 1 True
10 Other Lead Sources 1 True
8 Olark Chat 1 True
6 Other Lead Origins 1 True
5 Landing Page Submission 1 True
3 Page Views Per Visit 1 True
2 Total Time Spent on Website 1 True
1 TotalVisits 1 True
16 Email Opened 1 True
48 Tier II Cities 1 True
In [243]:
rfeFeatures = rfe_features['feature'].values

Manual Feature Elimination

Model 1

In [244]:
### Multicollinearity 
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
def vif(X) :
    # Add the intercept so each VIF is computed against a model with a constant
    df = sm.add_constant(X)
    vif_values = [variance_inflation_factor(df.values,i) for i in range(df.shape[1])]
    vif_frame = pd.DataFrame({'vif' : vif_values},index = df.columns).reset_index()
    tab(vif_frame.sort_values(by='vif',ascending=False))
In [245]:
# Model 1
import statsmodels.api as sm 
features = rfe_features['feature'].values
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
print("VIF for X_train")
vif(X_train)
logm1.fit().summary()
VIF for X_train
+----+-----------------------------+-----------+
|    | index                       |       vif |
|----+-----------------------------+-----------|
|  0 | const                       | 498.666   |
|  3 | Unemployed                  | 117.314   |
|  2 | Unknown Occupation          | 102.767   |
| 19 | Other Lead Origins          |  34.8794  |
| 12 | Working Professional        |  34.6746  |
| 16 | Reference                   |  26.0024  |
|  4 | Student                     |  11.9606  |
| 15 | Welingak Website            |   8.73583 |
| 20 | Landing Page Submission     |   3.48279 |
|  6 | No Specialization           |   3.05316 |
| 17 | Other Lead Sources          |   2.84805 |
| 21 | Page Views Per Visit        |   2.65403 |
| 24 | Email Opened                |   2.47549 |
| 18 | Olark Chat                  |   2.45889 |
| 10 | SMS Sent                    |   2.30595 |
| 23 | TotalVisits                 |   2.0918  |
| 13 | Olark Chat Conversation     |   1.93429 |
|  5 | Housewife                   |   1.46972 |
| 22 | Total Time Spent on Website |   1.35898 |
|  1 | Do Not Email                |   1.2055  |
| 14 | Email Link Clicked          |   1.20382 |
| 11 | Other Last Activity         |   1.11873 |
|  9 | Outside India               |   1.02551 |
|  7 | Media and Advertising       |   1.01922 |
|  8 | Hospitality Management      |   1.0113  |
| 25 | Tier II Cities              |   1.00967 |
+----+-----------------------------+-----------+
Out[245]:
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 6309
Model: GLM Df Residuals: 6283
Model Family: Binomial Df Model: 25
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2503.0
Date: Mon, 07 Sep 2020 Deviance: 5005.9
Time: 22:02:03 Pearson chi2: 6.36e+03
No. Iterations: 20
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.3145 0.675 -0.466 0.641 -1.638 1.009
Do Not Email -1.4730 0.189 -7.775 0.000 -1.844 -1.102
Unknown Occupation -1.8781 0.668 -2.813 0.005 -3.187 -0.570
Unemployed -0.6028 0.664 -0.908 0.364 -1.904 0.698
Student -0.6044 0.699 -0.864 0.387 -1.975 0.766
Housewife 19.9950 1.12e+04 0.002 0.999 -2.2e+04 2.21e+04
No Specialization -0.8514 0.127 -6.698 0.000 -1.101 -0.602
Media and Advertising -0.2951 0.247 -1.196 0.232 -0.779 0.189
Hospitality Management -0.9043 0.333 -2.715 0.007 -1.557 -0.252
Outside India -0.6829 0.236 -2.899 0.004 -1.145 -0.221
SMS Sent 2.0121 0.128 15.767 0.000 1.762 2.262
Other Last Activity 1.3061 0.262 4.991 0.000 0.793 1.819
Working Professional 1.8232 0.688 2.651 0.008 0.475 3.171
Olark Chat Conversation -0.6947 0.202 -3.437 0.001 -1.091 -0.298
Email Link Clicked 0.4905 0.236 2.079 0.038 0.028 0.953
Welingak Website 4.3758 1.134 3.858 0.000 2.153 6.599
Reference 2.0165 0.892 2.261 0.024 0.268 3.765
Other Lead Sources -0.1279 0.830 -0.154 0.878 -1.755 1.500
Olark Chat 1.2173 0.141 8.613 0.000 0.940 1.494
Other Lead Origins 1.0385 0.869 1.196 0.232 -0.664 2.741
Landing Page Submission -0.9141 0.131 -6.962 0.000 -1.171 -0.657
Page Views Per Visit -0.2670 0.056 -4.727 0.000 -0.378 -0.156
Total Time Spent on Website 1.1245 0.042 26.804 0.000 1.042 1.207
TotalVisits 0.3212 0.050 6.432 0.000 0.223 0.419
Email Opened 0.7814 0.125 6.271 0.000 0.537 1.026
Tier II Cities -0.4419 0.425 -1.040 0.298 -1.275 0.391
  • Unemployed has the highest VIF (117.3) and is not significant (p = 0.364). Let's drop this feature.
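
The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. Occupation indicators that together cover nearly every row almost determine each other, which is what inflates their VIF. A sketch with NumPy only, on made-up data (the proportions and the helper name `vif_from_r2` are assumptions):

```python
import numpy as np

def vif_from_r2(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the others plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
n = 2000
# Toy occupation levels: two large levels plus a small remainder
occupation = rng.choice(["Unemployed", "Unknown", "Other"], size=n, p=[0.60, 0.37, 0.03])
unemployed = (occupation == "Unemployed").astype(float)
unknown = (occupation == "Unknown").astype(float)
noise = rng.normal(size=n)  # an unrelated feature for comparison

X = np.column_stack([unemployed, unknown, noise])
vifs = [vif_from_r2(X, j) for j in range(X.shape[1])]
print(dict(zip(["Unemployed", "Unknown", "noise"], np.round(vifs, 2))))
```

The two indicators get a large VIF while the unrelated feature stays close to 1. Dropping one of the indicators removes the redundancy, which matches the drop in VIF seen after `Unemployed` is removed.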

Model 2

In [246]:
# Model 2 : Removing `Unemployed`
column_to_remove = 'Unemployed'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
print("VIF for X_train")
vif(X_train)
logm1.fit().summary()
VIF for X_train
+----+-----------------------------+----------+
|    | index                       |      vif |
|----+-----------------------------+----------|
| 18 | Other Lead Origins          | 34.8784  |
| 15 | Reference                   | 26.0024  |
|  0 | const                       | 19.6719  |
| 14 | Welingak Website            |  8.73573 |
| 19 | Landing Page Submission     |  3.4731  |
|  5 | No Specialization           |  3.04661 |
| 16 | Other Lead Sources          |  2.84805 |
| 20 | Page Views Per Visit        |  2.65403 |
| 23 | Email Opened                |  2.47543 |
| 17 | Olark Chat                  |  2.45888 |
|  9 | SMS Sent                    |  2.30562 |
| 22 | TotalVisits                 |  2.09129 |
| 12 | Olark Chat Conversation     |  1.93402 |
| 21 | Total Time Spent on Website |  1.35834 |
|  1 | Do Not Email                |  1.20547 |
| 13 | Email Link Clicked          |  1.2038  |
|  2 | Unknown Occupation          |  1.17383 |
| 11 | Working Professional        |  1.15381 |
| 10 | Other Last Activity         |  1.11849 |
|  8 | Outside India               |  1.02545 |
|  3 | Student                     |  1.02534 |
|  6 | Media and Advertising       |  1.01899 |
|  7 | Hospitality Management      |  1.01126 |
|  4 | Housewife                   |  1.01026 |
| 24 | Tier II Cities              |  1.00967 |
+----+-----------------------------+----------+
Out[246]:
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 6309
Model: GLM Df Residuals: 6284
Model Family: Binomial Df Model: 24
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2503.4
Date: Mon, 07 Sep 2020 Deviance: 5006.8
Time: 22:02:04 Pearson chi2: 6.36e+03
No. Iterations: 21
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.9101 0.163 -5.598 0.000 -1.229 -0.591
Do Not Email -1.4697 0.189 -7.761 0.000 -1.841 -1.099
Unknown Occupation -1.2777 0.091 -14.023 0.000 -1.456 -1.099
Student -0.0044 0.229 -0.019 0.985 -0.453 0.444
Housewife 21.5945 1.85e+04 0.001 0.999 -3.63e+04 3.63e+04
No Specialization -0.8575 0.127 -6.754 0.000 -1.106 -0.609
Media and Advertising -0.2953 0.247 -1.198 0.231 -0.779 0.188
Hospitality Management -0.9066 0.333 -2.722 0.006 -1.559 -0.254
Outside India -0.6853 0.236 -2.909 0.004 -1.147 -0.224
SMS Sent 2.0111 0.128 15.755 0.000 1.761 2.261
Other Last Activity 1.3069 0.261 5.001 0.000 0.795 1.819
Working Professional 2.4233 0.191 12.700 0.000 2.049 2.797
Olark Chat Conversation -0.6896 0.202 -3.415 0.001 -1.085 -0.294
Email Link Clicked 0.4893 0.236 2.074 0.038 0.027 0.952
Welingak Website 4.3797 1.134 3.861 0.000 2.157 6.603
Reference 2.0184 0.892 2.263 0.024 0.270 3.766
Other Lead Sources -0.1269 0.830 -0.153 0.879 -1.754 1.500
Olark Chat 1.2189 0.141 8.624 0.000 0.942 1.496
Other Lead Origins 1.0344 0.869 1.191 0.234 -0.668 2.737
Landing Page Submission -0.9202 0.131 -7.016 0.000 -1.177 -0.663
Page Views Per Visit -0.2668 0.056 -4.723 0.000 -0.377 -0.156
Total Time Spent on Website 1.1251 0.042 26.818 0.000 1.043 1.207
TotalVisits 0.3227 0.050 6.466 0.000 0.225 0.421
Email Opened 0.7824 0.125 6.278 0.000 0.538 1.027
Tier II Cities -0.4432 0.425 -1.043 0.297 -1.276 0.389
  • Other Lead Origins has a very high VIF (34.88), the highest among all features.
  • Let's drop this variable.

Model 3

In [247]:
# Model 3 : Removing `Other Lead Origins`
column_to_remove = 'Other Lead Origins'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
print("VIF for X_train")
vif(X_train)
logm1.fit().summary()
VIF for X_train
+----+-----------------------------+----------+
|    | index                       |      vif |
|----+-----------------------------+----------|
|  0 | const                       | 19.5742  |
| 18 | Landing Page Submission     |  3.44401 |
|  5 | No Specialization           |  3.03467 |
| 19 | Page Views Per Visit        |  2.64774 |
| 22 | Email Opened                |  2.4731  |
| 17 | Olark Chat                  |  2.45033 |
|  9 | SMS Sent                    |  2.30433 |
| 21 | TotalVisits                 |  2.09083 |
| 12 | Olark Chat Conversation     |  1.93393 |
| 15 | Reference                   |  1.64559 |
| 20 | Total Time Spent on Website |  1.35735 |
|  1 | Do Not Email                |  1.20544 |
| 13 | Email Link Clicked          |  1.20372 |
|  2 | Unknown Occupation          |  1.17275 |
| 11 | Working Professional        |  1.15369 |
| 14 | Welingak Website            |  1.14402 |
| 10 | Other Last Activity         |  1.11839 |
| 16 | Other Lead Sources          |  1.03494 |
|  8 | Outside India               |  1.02524 |
|  3 | Student                     |  1.02486 |
|  6 | Media and Advertising       |  1.01894 |
|  7 | Hospitality Management      |  1.01112 |
|  4 | Housewife                   |  1.01026 |
| 23 | Tier II Cities              |  1.00966 |
+----+-----------------------------+----------+
Out[247]:
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 6309
Model: GLM Df Residuals: 6285
Model Family: Binomial Df Model: 23
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2504.2
Date: Mon, 07 Sep 2020 Deviance: 5008.4
Time: 22:02:04 Pearson chi2: 6.37e+03
No. Iterations: 21
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.8922 0.162 -5.511 0.000 -1.209 -0.575
Do Not Email -1.4710 0.189 -7.770 0.000 -1.842 -1.100
Unknown Occupation -1.2816 0.091 -14.076 0.000 -1.460 -1.103
Student -0.0129 0.228 -0.056 0.955 -0.461 0.435
Housewife 21.5902 1.85e+04 0.001 0.999 -3.63e+04 3.63e+04
No Specialization -0.8719 0.127 -6.891 0.000 -1.120 -0.624
Media and Advertising -0.2936 0.247 -1.190 0.234 -0.777 0.190
Hospitality Management -0.9011 0.333 -2.705 0.007 -1.554 -0.248
Outside India -0.6916 0.235 -2.940 0.003 -1.153 -0.231
SMS Sent 2.0143 0.127 15.802 0.000 1.764 2.264
Other Last Activity 1.3096 0.261 5.011 0.000 0.797 1.822
Working Professional 2.4221 0.191 12.693 0.000 2.048 2.796
Olark Chat Conversation -0.6896 0.202 -3.416 0.001 -1.085 -0.294
Email Link Clicked 0.4914 0.236 2.084 0.037 0.029 0.954
Welingak Website 5.4015 0.741 7.294 0.000 3.950 6.853
Reference 3.0328 0.258 11.775 0.000 2.528 3.538
Other Lead Sources 0.6953 0.383 1.815 0.070 -0.056 1.446
Olark Chat 1.2083 0.141 8.581 0.000 0.932 1.484
Landing Page Submission -0.9391 0.130 -7.204 0.000 -1.195 -0.684
Page Views Per Visit -0.2697 0.056 -4.780 0.000 -0.380 -0.159
Total Time Spent on Website 1.1240 0.042 26.816 0.000 1.042 1.206
TotalVisits 0.3212 0.050 6.437 0.000 0.223 0.419
Email Opened 0.7860 0.124 6.318 0.000 0.542 1.030
Tier II Cities -0.4421 0.425 -1.041 0.298 -1.275 0.391
  • Housewife has a p-value of 0.999, and its standard error (1.85e+04) is far larger than its coefficient (21.59), so the coefficient is not statistically significant. Let's drop this feature.
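
A coefficient of about 21.6 with a standard error of 1.85e+04 is the pattern a logistic fit typically shows when a dummy variable (almost) perfectly predicts the outcome in the training data. The notebook does not confirm that this is the case for Housewife, so the following is only a hedged illustration on made-up data: when every row with the dummy set to 1 is a conversion, the log-likelihood keeps improving as the coefficient grows, so no finite estimate exists. The function name and the data are hypothetical.

```python
import math

def log_likelihood(b0, b1, rows):
    """Bernoulli log-likelihood of a logistic model with one dummy feature.

    rows is a list of (dummy, converted) pairs; b0 is the intercept and
    b1 is the coefficient of the dummy.
    """
    total = 0.0
    for dummy, converted in rows:
        z = b0 + b1 * dummy
        # log(sigmoid(z)) and log(1 - sigmoid(z)), using log1p for accuracy
        log_p = -math.log1p(math.exp(-z))
        log_not_p = -math.log1p(math.exp(z))
        total += log_p if converted else log_not_p
    return total

# Made-up data: all 5 rows with dummy == 1 converted (perfect prediction),
# while the rows with dummy == 0 are mixed.
rows = [(1, 1)] * 5 + [(0, 1)] * 30 + [(0, 0)] * 65

b0 = math.log(30 / 65)  # intercept matching the conversion odds when dummy == 0
for b1 in (1, 5, 10, 20):
    print(f"b1 = {b1:>2}: log-likelihood = {log_likelihood(b0, b1, rows):.6f}")
```

Each larger value of `b1` gives a slightly higher log-likelihood, so the optimiser keeps pushing the coefficient upward until it stops on its iteration or tolerance limit.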

Model 4

In [248]:
# Model 4 : Removing `Housewife`
column_to_remove = 'Housewife'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
print("VIF for X_train")
vif(X_train)
logm1.fit().summary()
VIF for X_train
+----+-----------------------------+----------+
|    | index                       |      vif |
|----+-----------------------------+----------|
|  0 | const                       | 19.5671  |
| 17 | Landing Page Submission     |  3.44232 |
|  4 | No Specialization           |  3.03251 |
| 18 | Page Views Per Visit        |  2.64773 |
| 21 | Email Opened                |  2.47231 |
| 16 | Olark Chat                  |  2.4503  |
|  8 | SMS Sent                    |  2.30422 |
| 20 | TotalVisits                 |  2.09075 |
| 11 | Olark Chat Conversation     |  1.93383 |
| 14 | Reference                   |  1.64154 |
| 19 | Total Time Spent on Website |  1.35679 |
|  1 | Do Not Email                |  1.20544 |
| 12 | Email Link Clicked          |  1.20292 |
|  2 | Unknown Occupation          |  1.17264 |
| 10 | Working Professional        |  1.15248 |
| 13 | Welingak Website            |  1.14402 |
|  9 | Other Last Activity         |  1.11839 |
| 15 | Other Lead Sources          |  1.03492 |
|  3 | Student                     |  1.02478 |
|  7 | Outside India               |  1.02446 |
|  5 | Media and Advertising       |  1.01783 |
|  6 | Hospitality Management      |  1.01111 |
| 22 | Tier II Cities              |  1.00965 |
+----+-----------------------------+----------+
Out[248]:
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 6309
Model: GLM Df Residuals: 6286
Model Family: Binomial Df Model: 22
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2506.1
Date: Mon, 07 Sep 2020 Deviance: 5012.2
Time: 22:02:05 Pearson chi2: 6.39e+03
No. Iterations: 7
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.8894 0.162 -5.494 0.000 -1.207 -0.572
Do Not Email -1.4718 0.189 -7.770 0.000 -1.843 -1.101
Unknown Occupation -1.2830 0.091 -14.091 0.000 -1.461 -1.105
Student -0.0165 0.229 -0.072 0.942 -0.465 0.432
No Specialization -0.8767 0.127 -6.929 0.000 -1.125 -0.629
Media and Advertising -0.2871 0.246 -1.169 0.243 -0.769 0.194
Hospitality Management -0.9050 0.333 -2.715 0.007 -1.558 -0.252
Outside India -0.6828 0.234 -2.919 0.004 -1.141 -0.224
SMS Sent 2.0146 0.128 15.799 0.000 1.765 2.264
Other Last Activity 1.3101 0.261 5.011 0.000 0.798 1.823
Working Professional 2.4187 0.191 12.674 0.000 2.045 2.793
Olark Chat Conversation -0.6880 0.202 -3.407 0.001 -1.084 -0.292
Email Link Clicked 0.4981 0.235 2.117 0.034 0.037 0.959
Welingak Website 5.4013 0.741 7.294 0.000 3.950 6.853
Reference 3.0546 0.257 11.869 0.000 2.550 3.559
Other Lead Sources 0.6922 0.383 1.807 0.071 -0.059 1.443
Olark Chat 1.2075 0.141 8.575 0.000 0.931 1.483
Landing Page Submission -0.9416 0.130 -7.228 0.000 -1.197 -0.686
Page Views Per Visit -0.2701 0.056 -4.788 0.000 -0.381 -0.160
Total Time Spent on Website 1.1252 0.042 26.851 0.000 1.043 1.207
TotalVisits 0.3203 0.050 6.421 0.000 0.223 0.418
Email Opened 0.7897 0.124 6.347 0.000 0.546 1.034
Tier II Cities -0.4440 0.425 -1.045 0.296 -1.277 0.389
  • Student has the highest p-value of all features (0.942), well above 0.05. Let's drop this feature.

Model 5

In [249]:
# Model 5 : Removing `Student`
column_to_remove = 'Student'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
print("VIF for X_train")
vif(X_train)
logm1.fit().summary()
VIF for X_train
+----+-----------------------------+----------+
|    | index                       |      vif |
|----+-----------------------------+----------|
|  0 | const                       | 19.5487  |
| 16 | Landing Page Submission     |  3.43698 |
|  3 | No Specialization           |  3.02919 |
| 17 | Page Views Per Visit        |  2.64719 |
| 20 | Email Opened                |  2.47218 |
| 15 | Olark Chat                  |  2.44998 |
|  7 | SMS Sent                    |  2.30157 |
| 19 | TotalVisits                 |  2.09067 |
| 10 | Olark Chat Conversation     |  1.93303 |
| 13 | Reference                   |  1.63987 |
| 18 | Total Time Spent on Website |  1.35679 |
|  1 | Do Not Email                |  1.20485 |
| 11 | Email Link Clicked          |  1.2029  |
|  2 | Unknown Occupation          |  1.15459 |
|  9 | Working Professional        |  1.14906 |
| 12 | Welingak Website            |  1.14371 |
|  8 | Other Last Activity         |  1.11789 |
| 14 | Other Lead Sources          |  1.03492 |
|  6 | Outside India               |  1.02427 |
|  4 | Media and Advertising       |  1.01768 |
|  5 | Hospitality Management      |  1.01055 |
| 21 | Tier II Cities              |  1.00946 |
+----+-----------------------------+----------+
Out[249]:
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 6309
Model: GLM Df Residuals: 6287
Model Family: Binomial Df Model: 21
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2506.1
Date: Mon, 07 Sep 2020 Deviance: 5012.2
Time: 22:02:05 Pearson chi2: 6.39e+03
No. Iterations: 7
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.8898 0.162 -5.500 0.000 -1.207 -0.573
Do Not Email -1.4718 0.189 -7.771 0.000 -1.843 -1.101
Unknown Occupation -1.2824 0.091 -14.134 0.000 -1.460 -1.105
No Specialization -0.8767 0.127 -6.929 0.000 -1.125 -0.629
Media and Advertising -0.2868 0.246 -1.168 0.243 -0.768 0.195
Hospitality Management -0.9053 0.333 -2.717 0.007 -1.559 -0.252
Outside India -0.6826 0.234 -2.919 0.004 -1.141 -0.224
SMS Sent 2.0148 0.127 15.806 0.000 1.765 2.265
Other Last Activity 1.3105 0.261 5.014 0.000 0.798 1.823
Working Professional 2.4193 0.191 12.687 0.000 2.046 2.793
Olark Chat Conversation -0.6883 0.202 -3.410 0.001 -1.084 -0.293
Email Link Clicked 0.4981 0.235 2.117 0.034 0.037 0.959
Welingak Website 5.4017 0.741 7.294 0.000 3.950 6.853
Reference 3.0541 0.257 11.873 0.000 2.550 3.558
Other Lead Sources 0.6922 0.383 1.807 0.071 -0.059 1.443
Olark Chat 1.2072 0.141 8.577 0.000 0.931 1.483
Landing Page Submission -0.9418 0.130 -7.231 0.000 -1.197 -0.687
Page Views Per Visit -0.2701 0.056 -4.788 0.000 -0.381 -0.160
Total Time Spent on Website 1.1252 0.042 26.851 0.000 1.043 1.207
TotalVisits 0.3204 0.050 6.421 0.000 0.223 0.418
Email Opened 0.7896 0.124 6.347 0.000 0.546 1.033
Tier II Cities -0.4436 0.425 -1.044 0.297 -1.276 0.389
  • Tier II Cities has a p-value (0.297) above the 0.05 significance level, and it is the highest among all the p-values.
  • Let's remove this feature.

Model 6

In [250]:
# Model 6 : Removing `Tier II Cities`
column_to_remove = 'Tier II Cities'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
print("VIF for X_train")
vif(X_train)
logm1.fit().summary()
VIF for X_train
+----+-----------------------------+----------+
|    | index                       |      vif |
|----+-----------------------------+----------|
|  0 | const                       | 19.5461  |
| 16 | Landing Page Submission     |  3.43294 |
|  3 | No Specialization           |  3.02879 |
| 17 | Page Views Per Visit        |  2.64718 |
| 20 | Email Opened                |  2.47182 |
| 15 | Olark Chat                  |  2.44996 |
|  7 | SMS Sent                    |  2.30139 |
| 19 | TotalVisits                 |  2.09046 |
| 10 | Olark Chat Conversation     |  1.93285 |
| 13 | Reference                   |  1.63987 |
| 18 | Total Time Spent on Website |  1.35679 |
|  1 | Do Not Email                |  1.20341 |
| 11 | Email Link Clicked          |  1.20289 |
|  2 | Unknown Occupation          |  1.15454 |
|  9 | Working Professional        |  1.14885 |
| 12 | Welingak Website            |  1.14371 |
|  8 | Other Last Activity         |  1.11724 |
| 14 | Other Lead Sources          |  1.0349  |
|  6 | Outside India               |  1.02425 |
|  4 | Media and Advertising       |  1.0176  |
|  5 | Hospitality Management      |  1.01026 |
+----+-----------------------------+----------+
Out[250]:
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 6309
Model: GLM Df Residuals: 6288
Model Family: Binomial Df Model: 20
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2506.6
Date: Mon, 07 Sep 2020 Deviance: 5013.3
Time: 22:02:06 Pearson chi2: 6.38e+03
No. Iterations: 7
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.8916 0.162 -5.511 0.000 -1.209 -0.574
Do Not Email -1.4762 0.189 -7.805 0.000 -1.847 -1.106
Unknown Occupation -1.2818 0.091 -14.129 0.000 -1.460 -1.104
No Specialization -0.8758 0.127 -6.923 0.000 -1.124 -0.628
Media and Advertising -0.2821 0.246 -1.149 0.251 -0.763 0.199
Hospitality Management -0.9112 0.333 -2.740 0.006 -1.563 -0.259
Outside India -0.6825 0.234 -2.919 0.004 -1.141 -0.224
SMS Sent 2.0155 0.127 15.809 0.000 1.766 2.265
Other Last Activity 1.3159 0.261 5.035 0.000 0.804 1.828
Working Professional 2.4165 0.191 12.684 0.000 2.043 2.790
Olark Chat Conversation -0.6867 0.202 -3.402 0.001 -1.082 -0.291
Email Link Clicked 0.4992 0.235 2.123 0.034 0.038 0.960
Welingak Website 5.4036 0.741 7.296 0.000 3.952 6.855
Reference 3.0553 0.257 11.877 0.000 2.551 3.559
Other Lead Sources 0.6941 0.383 1.812 0.070 -0.057 1.445
Olark Chat 1.2077 0.141 8.582 0.000 0.932 1.483
Landing Page Submission -0.9467 0.130 -7.273 0.000 -1.202 -0.692
Page Views Per Visit -0.2700 0.056 -4.790 0.000 -0.381 -0.160
Total Time Spent on Website 1.1249 0.042 26.851 0.000 1.043 1.207
TotalVisits 0.3212 0.050 6.441 0.000 0.223 0.419
Email Opened 0.7907 0.124 6.355 0.000 0.547 1.035
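  • Page Views Per Visit is eliminated next. Note that its p-value in the summary above is 0.000, so the removal is not driven by a high p-value; the feature with the highest p-value here is Media and Advertising (0.251), which is dropped in Model 8.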

Model 7

In [251]:
# Model 7 : Removing `Page Views Per Visit`
column_to_remove = 'Page Views Per Visit'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
print("VIF for X_train")
vif(X_train)
logm1.fit().summary()
VIF for X_train
+----+-----------------------------+----------+
|    | index                       |      vif |
|----+-----------------------------+----------|
|  0 | const                       | 19.521   |
| 16 | Landing Page Submission     |  3.40467 |
|  3 | No Specialization           |  3.02277 |
| 19 | Email Opened                |  2.43525 |
|  7 | SMS Sent                    |  2.2499  |
| 15 | Olark Chat                  |  2.22665 |
| 10 | Olark Chat Conversation     |  1.92174 |
| 13 | Reference                   |  1.56753 |
| 18 | TotalVisits                 |  1.49299 |
| 17 | Total Time Spent on Website |  1.35607 |
|  1 | Do Not Email                |  1.20207 |
| 11 | Email Link Clicked          |  1.20085 |
|  2 | Unknown Occupation          |  1.15454 |
|  9 | Working Professional        |  1.14798 |
| 12 | Welingak Website            |  1.12183 |
|  8 | Other Last Activity         |  1.11356 |
| 14 | Other Lead Sources          |  1.03154 |
|  6 | Outside India               |  1.02316 |
|  4 | Media and Advertising       |  1.0176  |
|  5 | Hospitality Management      |  1.01022 |
+----+-----------------------------+----------+
Out[251]:
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 6309
Model: GLM Df Residuals: 6289
Model Family: Binomial Df Model: 19
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2518.3
Date: Mon, 07 Sep 2020 Deviance: 5036.6
Time: 22:02:06 Pearson chi2: 6.35e+03
No. Iterations: 7
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.8590 0.161 -5.346 0.000 -1.174 -0.544
Do Not Email -1.5025 0.189 -7.940 0.000 -1.873 -1.132
Unknown Occupation -1.2733 0.090 -14.083 0.000 -1.451 -1.096
No Specialization -0.8357 0.126 -6.656 0.000 -1.082 -0.590
Media and Advertising -0.2901 0.244 -1.187 0.235 -0.769 0.189
Hospitality Management -0.9063 0.332 -2.732 0.006 -1.556 -0.256
Outside India -0.6629 0.234 -2.828 0.005 -1.122 -0.203
SMS Sent 1.9118 0.125 15.332 0.000 1.667 2.156
Other Last Activity 1.2298 0.260 4.738 0.000 0.721 1.739
Working Professional 2.4225 0.191 12.702 0.000 2.049 2.796
Olark Chat Conversation -0.7552 0.201 -3.762 0.000 -1.149 -0.362
Email Link Clicked 0.4387 0.234 1.875 0.061 -0.020 0.897
Welingak Website 5.6036 0.739 7.587 0.000 4.156 7.051
Reference 3.2759 0.253 12.946 0.000 2.780 3.772
Other Lead Sources 0.8276 0.379 2.186 0.029 0.086 1.570
Olark Chat 1.4051 0.135 10.431 0.000 1.141 1.669
Landing Page Submission -0.9926 0.129 -7.685 0.000 -1.246 -0.739
Total Time Spent on Website 1.1253 0.042 26.905 0.000 1.043 1.207
TotalVisits 0.1958 0.042 4.614 0.000 0.113 0.279
Email Opened 0.7106 0.123 5.797 0.000 0.470 0.951
  • Media and Advertising has a high p-value (0.235). Let's drop this feature.

Model 8

In [252]:
# Model 8 : Removing `Media and Advertising`
column_to_remove = 'Media and Advertising'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
print("VIF for X_train")
vif(X_train)
logm1.fit().summary()
VIF for X_train
+----+-----------------------------+----------+
|    | index                       |      vif |
|----+-----------------------------+----------|
|  0 | const                       | 19.4607  |
| 15 | Landing Page Submission     |  3.40463 |
|  3 | No Specialization           |  3.01069 |
| 18 | Email Opened                |  2.43523 |
|  6 | SMS Sent                    |  2.24839 |
| 14 | Olark Chat                  |  2.22661 |
|  9 | Olark Chat Conversation     |  1.92165 |
| 12 | Reference                   |  1.56581 |
| 17 | TotalVisits                 |  1.49255 |
| 16 | Total Time Spent on Website |  1.35565 |
|  1 | Do Not Email                |  1.202   |
| 10 | Email Link Clicked          |  1.20054 |
|  2 | Unknown Occupation          |  1.15412 |
|  8 | Working Professional        |  1.14797 |
| 11 | Welingak Website            |  1.12171 |
|  7 | Other Last Activity         |  1.11347 |
| 13 | Other Lead Sources          |  1.03154 |
|  5 | Outside India               |  1.02306 |
|  4 | Hospitality Management      |  1.00959 |
+----+-----------------------------+----------+
Out[252]:
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 6309
Model: GLM Df Residuals: 6290
Model Family: Binomial Df Model: 18
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2519.0
Date: Mon, 07 Sep 2020 Deviance: 5038.0
Time: 22:02:07 Pearson chi2: 6.34e+03
No. Iterations: 7
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.8673 0.160 -5.404 0.000 -1.182 -0.553
Do Not Email -1.5036 0.189 -7.945 0.000 -1.874 -1.133
Unknown Occupation -1.2714 0.090 -14.070 0.000 -1.448 -1.094
No Specialization -0.8261 0.125 -6.595 0.000 -1.072 -0.581
Hospitality Management -0.8963 0.332 -2.703 0.007 -1.546 -0.246
Outside India -0.6576 0.234 -2.809 0.005 -1.116 -0.199
SMS Sent 1.9074 0.125 15.311 0.000 1.663 2.152
Other Last Activity 1.2318 0.259 4.749 0.000 0.723 1.740
Working Professional 2.4169 0.191 12.682 0.000 2.043 2.790
Olark Chat Conversation -0.7593 0.201 -3.783 0.000 -1.153 -0.366
Email Link Clicked 0.4421 0.234 1.890 0.059 -0.016 0.901
Welingak Website 5.6066 0.739 7.591 0.000 4.159 7.054
Reference 3.2810 0.253 12.970 0.000 2.785 3.777
Other Lead Sources 0.8257 0.378 2.182 0.029 0.084 1.567
Olark Chat 1.4068 0.135 10.444 0.000 1.143 1.671
Landing Page Submission -0.9929 0.129 -7.688 0.000 -1.246 -0.740
Total Time Spent on Website 1.1264 0.042 26.929 0.000 1.044 1.208
TotalVisits 0.1962 0.042 4.623 0.000 0.113 0.279
Email Opened 0.7096 0.123 5.791 0.000 0.469 0.950
  • Email Link Clicked has a p-value of 0.059, slightly above the 0.05 threshold. Let's drop this feature.

Model 9

In [253]:
# Model 9 : Removing `Email Link Clicked`
column_to_remove = 'Email Link Clicked'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
print("VIF for X_train")
vif(X_train)
logm1.fit().summary()
VIF for X_train
+----+-----------------------------+----------+
|    | index                       |      vif |
|----+-----------------------------+----------|
|  0 | const                       | 18.2371  |
| 14 | Landing Page Submission     |  3.40436 |
|  3 | No Specialization           |  3.01066 |
| 13 | Olark Chat                  |  2.21092 |
| 17 | Email Opened                |  2.09586 |
|  6 | SMS Sent                    |  1.97156 |
|  9 | Olark Chat Conversation     |  1.75591 |
| 11 | Reference                   |  1.56268 |
| 16 | TotalVisits                 |  1.49255 |
| 15 | Total Time Spent on Website |  1.35564 |
|  1 | Do Not Email                |  1.16781 |
|  2 | Unknown Occupation          |  1.1541  |
|  8 | Working Professional        |  1.14791 |
| 10 | Welingak Website            |  1.12098 |
|  7 | Other Last Activity         |  1.09682 |
| 12 | Other Lead Sources          |  1.0315  |
|  5 | Outside India               |  1.02286 |
|  4 | Hospitality Management      |  1.00955 |
+----+-----------------------------+----------+
Out[253]:
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 6309
Model: GLM Df Residuals: 6291
Model Family: Binomial Df Model: 17
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2520.8
Date: Mon, 07 Sep 2020 Deviance: 5041.5
Time: 22:02:07 Pearson chi2: 6.34e+03
No. Iterations: 7
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.7799 0.153 -5.111 0.000 -1.079 -0.481
Do Not Email -1.5328 0.188 -8.150 0.000 -1.901 -1.164
Unknown Occupation -1.2696 0.090 -14.058 0.000 -1.447 -1.093
No Specialization -0.8248 0.125 -6.587 0.000 -1.070 -0.579
Hospitality Management -0.8909 0.332 -2.687 0.007 -1.541 -0.241
Outside India -0.6555 0.235 -2.795 0.005 -1.115 -0.196
SMS Sent 1.8157 0.113 16.026 0.000 1.594 2.038
Other Last Activity 1.1473 0.255 4.500 0.000 0.648 1.647
Working Professional 2.4103 0.190 12.668 0.000 2.037 2.783
Olark Chat Conversation -0.8576 0.193 -4.441 0.000 -1.236 -0.479
Welingak Website 5.6297 0.738 7.632 0.000 4.184 7.076
Reference 3.3059 0.253 13.064 0.000 2.810 3.802
Other Lead Sources 0.8192 0.378 2.167 0.030 0.078 1.560
Olark Chat 1.4212 0.135 10.561 0.000 1.157 1.685
Landing Page Submission -0.9890 0.129 -7.664 0.000 -1.242 -0.736
Total Time Spent on Website 1.1247 0.042 26.931 0.000 1.043 1.207
TotalVisits 0.1953 0.042 4.613 0.000 0.112 0.278
Email Opened 0.6161 0.111 5.564 0.000 0.399 0.833
  • All remaining coefficients are statistically significant (p-values below 0.05).
  • For further elimination, let's use the absolute magnitude of each coefficient as a measure of the variable's importance: larger magnitudes are treated as more important than smaller ones.
  • By this reasoning, TotalVisits has the smallest coefficient (0.1953). Let's drop it.
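
The ranking by coefficient magnitude can be reproduced with a few lines of plain Python. This is a minimal sketch using the coefficients copied from the Model 9 summary above (constant excluded); the names `model9_coefs` and `weakest_feature` are illustrative. The comparison is only meaningful if the predictors are on comparable scales, which is assumed here.

```python
# Coefficients copied from the Model 9 summary (constant excluded).
model9_coefs = {
    "Do Not Email": -1.5328,
    "Unknown Occupation": -1.2696,
    "No Specialization": -0.8248,
    "Hospitality Management": -0.8909,
    "Outside India": -0.6555,
    "SMS Sent": 1.8157,
    "Other Last Activity": 1.1473,
    "Working Professional": 2.4103,
    "Olark Chat Conversation": -0.8576,
    "Welingak Website": 5.6297,
    "Reference": 3.3059,
    "Other Lead Sources": 0.8192,
    "Olark Chat": 1.4212,
    "Landing Page Submission": -0.9890,
    "Total Time Spent on Website": 1.1247,
    "TotalVisits": 0.1953,
    "Email Opened": 0.6161,
}

# Rank features by absolute coefficient, smallest first.
ranked = sorted(model9_coefs.items(), key=lambda item: abs(item[1]))
for feature, coef in ranked[:3]:
    print(f"{feature:<28} |coef| = {abs(coef):.4f}")

weakest_feature = ranked[0][0]
print("Candidate to drop:", weakest_feature)
```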

Model 10

In [254]:
# Model 10 : Removing `TotalVisits`
column_to_remove = 'TotalVisits'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
logm1 = logm1.fit()
print("VIF for X_train")
vif(X_train)
logm1.summary()
VIF for X_train
+----+-----------------------------+----------+
|    | index                       |      vif |
|----+-----------------------------+----------|
|  0 | const                       | 17.9489  |
| 14 | Landing Page Submission     |  3.4038  |
|  3 | No Specialization           |  2.97226 |
| 16 | Email Opened                |  2.09586 |
| 13 | Olark Chat                  |  1.98518 |
|  6 | SMS Sent                    |  1.97039 |
|  9 | Olark Chat Conversation     |  1.7559  |
| 11 | Reference                   |  1.46711 |
| 15 | Total Time Spent on Website |  1.34366 |
|  1 | Do Not Email                |  1.1672  |
|  2 | Unknown Occupation          |  1.15404 |
|  8 | Working Professional        |  1.14782 |
| 10 | Welingak Website            |  1.10012 |
|  7 | Other Last Activity         |  1.09606 |
| 12 | Other Lead Sources          |  1.02698 |
|  5 | Outside India               |  1.0226  |
|  4 | Hospitality Management      |  1.00954 |
+----+-----------------------------+----------+
Out[254]:
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 6309
Model: GLM Df Residuals: 6292
Model Family: Binomial Df Model: 16
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2531.3
Date: Mon, 07 Sep 2020 Deviance: 5062.7
Time: 22:02:07 Pearson chi2: 6.40e+03
No. Iterations: 7
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.6821 0.151 -4.531 0.000 -0.977 -0.387
Do Not Email -1.5442 0.188 -8.229 0.000 -1.912 -1.176
Unknown Occupation -1.2691 0.090 -14.078 0.000 -1.446 -1.092
No Specialization -0.8891 0.124 -7.144 0.000 -1.133 -0.645
Hospitality Management -0.8856 0.332 -2.667 0.008 -1.536 -0.235
Outside India -0.6636 0.233 -2.845 0.004 -1.121 -0.206
SMS Sent 1.7926 0.113 15.916 0.000 1.572 2.013
Other Last Activity 1.1818 0.254 4.646 0.000 0.683 1.680
Working Professional 2.3799 0.189 12.584 0.000 2.009 2.751
Olark Chat Conversation -0.8591 0.192 -4.474 0.000 -1.236 -0.483
Welingak Website 5.4168 0.736 7.360 0.000 3.974 6.859
Reference 3.0622 0.247 12.402 0.000 2.578 3.546
Other Lead Sources 0.6577 0.377 1.744 0.081 -0.081 1.397
Olark Chat 1.2170 0.126 9.662 0.000 0.970 1.464
Landing Page Submission -0.9929 0.129 -7.702 0.000 -1.246 -0.740
Total Time Spent on Website 1.1330 0.042 27.181 0.000 1.051 1.215
Email Opened 0.6052 0.110 5.490 0.000 0.389 0.821
  • Other Lead Sources has a high p-value (0.081). Let's drop this variable.

Model 11 - Final Model

In [255]:
# Model 11 : Removing `Other Lead Sources`
column_to_remove = 'Other Lead Sources'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm_final = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
logm_final = logm_final.fit()
print("VIF for X_train")
vif(X_train)
logm_final.summary()
VIF for X_train
+----+-----------------------------+----------+
|    | index                       |      vif |
|----+-----------------------------+----------|
|  0 | const                       | 17.7441  |
| 13 | Landing Page Submission     |  3.3594  |
|  3 | No Specialization           |  2.96428 |
| 15 | Email Opened                |  2.09302 |
|  6 | SMS Sent                    |  1.97027 |
| 12 | Olark Chat                  |  1.96234 |
|  9 | Olark Chat Conversation     |  1.75584 |
| 11 | Reference                   |  1.45497 |
| 14 | Total Time Spent on Website |  1.3339  |
|  1 | Do Not Email                |  1.16719 |
|  2 | Unknown Occupation          |  1.1536  |
|  8 | Working Professional        |  1.14772 |
| 10 | Welingak Website            |  1.09768 |
|  7 | Other Last Activity         |  1.09593 |
|  5 | Outside India               |  1.02258 |
|  4 | Hospitality Management      |  1.0094  |
+----+-----------------------------+----------+
Out[255]:
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 6309
Model: GLM Df Residuals: 6293
Model Family: Binomial Df Model: 15
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2532.8
Date: Mon, 07 Sep 2020 Deviance: 5065.5
Time: 22:02:08 Pearson chi2: 6.39e+03
No. Iterations: 7
Covariance Type: nonrobust
Feature                          coef  std err        z    P>|z|   [0.025   0.975]
const                         -0.6469    0.149   -4.338    0.000   -0.939   -0.355
Do Not Email                  -1.5426    0.188   -8.222    0.000   -1.910   -1.175
Unknown Occupation            -1.2699    0.090  -14.094    0.000   -1.446   -1.093
No Specialization             -0.9057    0.124   -7.281    0.000   -1.150   -0.662
Hospitality Management        -0.8704    0.332   -2.621    0.009   -1.521   -0.220
Outside India                 -0.6584    0.233   -2.823    0.005   -1.116   -0.201
SMS Sent                       1.7923    0.113   15.927    0.000    1.572    2.013
Other Last Activity            1.1749    0.254    4.622    0.000    0.677    1.673
Working Professional           2.3769    0.189   12.576    0.000    2.006    2.747
Olark Chat Conversation       -0.8614    0.192   -4.488    0.000   -1.238   -0.485
Welingak Website               5.3886    0.736    7.323    0.000    3.946    6.831
Reference                      3.0246    0.246   12.302    0.000    2.543    3.506
Olark Chat                     1.1876    0.125    9.530    0.000    0.943    1.432
Landing Page Submission       -1.0250    0.128   -8.018    0.000   -1.276   -0.774
Total Time Spent on Website    1.1253    0.041   27.204    0.000    1.044    1.206
Email Opened                   0.6106    0.110    5.545    0.000    0.395    0.826
  • From the above, the remaining features are all statistically significant and do not show any multicollinearity (the highest feature VIF is 3.36).
  • Hence, we can use Model 11 as our final model.
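
Because the model is fitted on the logit scale, each coefficient is a change in log-odds. Exponentiating it gives the multiplicative change in the odds of conversion for a one-unit increase in that feature, holding the others fixed. The sketch below applies this to the Model 11 coefficients from the summary above; the dictionary names are illustrative, and what "one unit" means depends on how each feature was encoded or scaled earlier in the notebook.

```python
import math

# Coefficients copied from the Model 11 summary (log-odds scale, constant excluded).
final_coefs = {
    "Do Not Email": -1.5426,
    "Unknown Occupation": -1.2699,
    "No Specialization": -0.9057,
    "Hospitality Management": -0.8704,
    "Outside India": -0.6584,
    "SMS Sent": 1.7923,
    "Other Last Activity": 1.1749,
    "Working Professional": 2.3769,
    "Olark Chat Conversation": -0.8614,
    "Welingak Website": 5.3886,
    "Reference": 3.0246,
    "Olark Chat": 1.1876,
    "Landing Page Submission": -1.0250,
    "Total Time Spent on Website": 1.1253,
    "Email Opened": 0.6106,
}

# Odds ratio = exp(coefficient); > 1 raises the odds, < 1 lowers them.
odds_ratios = {name: math.exp(coef) for name, coef in final_coefs.items()}
for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<28} odds ratio = {ratio:8.3f}")
```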

Final Features

In [256]:
finalFeatures = X_train.columns.values
print('The Final Feature for Modelling are :', finalFeatures)
The Final Feature for Modelling are : ['Do Not Email' 'Unknown Occupation' 'No Specialization'
 'Hospitality Management' 'Outside India' 'SMS Sent' 'Other Last Activity'
 'Working Professional' 'Olark Chat Conversation' 'Welingak Website'
 'Reference' 'Olark Chat' 'Landing Page Submission'
 'Total Time Spent on Website' 'Email Opened']

Predictions

Predictions on Train set

In [257]:
X_train_sm = sm.add_constant(X_train)
y_train_pred = logm_final.predict(X_train_sm)
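
For a binomial GLM with a logit link, `predict` returns the sigmoid of the linear predictor (the constant plus the coefficient-weighted features). The sketch below reproduces that calculation by hand for two hypothetical leads, using a few coefficients from the Model 11 summary. The helper `conversion_probability` and the example leads are illustrative and not part of the notebook.

```python
import math

# Constant and a subset of coefficients from the Model 11 summary.
CONST = -0.6469
COEFS = {
    "SMS Sent": 1.7923,
    "Working Professional": 2.3769,
    "Do Not Email": -1.5426,
}

def conversion_probability(lead):
    """Sigmoid of the linear predictor; features absent from `lead` count as 0."""
    z = CONST + sum(COEFS[name] * value for name, value in lead.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical leads: one with every feature at 0, one with two positive signals.
baseline = conversion_probability({})
engaged = conversion_probability({"SMS Sent": 1, "Working Professional": 1})

print(f"All features zero:               {baseline:.4f}")
print(f"SMS sent + working professional: {engaged:.4f}")
```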

Actual Conversions vs Conversion Predictions

In [258]:
# Creating a data frame with converted vs converted probabilities
y_train_pred_final = pd.DataFrame({'Converted':y_train.values, 'Converted_Prob':y_train_pred})
y_train_pred_final['CustID'] = y_train.index
y_train_pred_final.head(10)
Out[258]:
Converted Converted_Prob CustID
4948 1 0.401461 4948
5938 1 0.318706 5938
5688 1 0.745966 5688
5381 0 0.002848 5381
4742 1 0.801898 4742
5811 0 0.167979 5811
898 0 0.088098 898
5316 0 0.028318 5316
7381 0 0.251634 7381
1211 0 0.023542 1211

Predictions with cut off = 0.5

In [259]:
#Creating new column 'predicted' with 1 if Converted_Prob > 0.5 else 0
y_train_pred_final['predicted'] = y_train_pred_final.Converted_Prob.map(lambda x: 1 if x > 0.5 else 0)

# Let's see the head
y_train_pred_final.head(10)
Out[259]:
Converted Converted_Prob CustID predicted
4948 1 0.401461 4948 0
5938 1 0.318706 5938 0
5688 1 0.745966 5688 1
5381 0 0.002848 5381 0
4742 1 0.801898 4742 1
5811 0 0.167979 5811 0
898 0 0.088098 898 0
5316 0 0.028318 5316 0
7381 0 0.251634 7381 0
1211 0 0.023542 1211 0

Confusion Matrix

In [260]:
from sklearn import metrics
# Confusion matrix 
confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.predicted )
print(confusion)
[[3462  455]
 [ 699 1693]]

Confusion Matrix for Train Set

$\frac{Predicted}{Actual}$ Not Converted Converted
Not Converted 3462 455
Converted 699 1693

Accuracy of the Model

In [261]:
# Let's check the overall accuracy.
accuracy = metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.predicted)
print('Accuracy on Train set : ', round(100*accuracy,3),'%')
Accuracy on Train set :  81.709 %

Metrics beyond simple accuracy

In [262]:
TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
sensitivity = TP/(FN + TP)
specificity = TN/(FP + TN)
falsePositiveRate = FP/(FP + TN)
positivePredictivePower = TP/(TP +FP )
negativePredictivePower = TN/(TN + FN)
print('sensitivity / Recall: ', round(100*sensitivity,3),'%')
print('specificity : ',  round(100*specificity,3),'%')
print('False Positive Rate : ',  round(100*falsePositiveRate,3),'%')
print('Precision / Positive Predictive Power : ',  round(100*positivePredictivePower,3),'%')
print('Negative Predictive Power : ',  round(100*negativePredictivePower,3),'%')
sensitivity / Recall:  70.778 %
specificity :  88.384 %
False Positive Rate :  11.616 %
Precision / Positive Predictive Power :  78.818 %
Negative Predictive Power :  83.201 %
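
The figures above follow directly from the four cells of the confusion matrix. As a minimal, dependency-free sketch (the helper name `rates_from_confusion` is an illustrative assumption, not part of the notebook), the same numbers can be recomputed by hand:

```python
def rates_from_confusion(tn, fp, fn, tp):
    """Return common classification rates (as fractions) from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),            # recall / true positive rate
        "specificity": tn / (tn + fp),            # true negative rate
        "false_positive_rate": fp / (fp + tn),    # 1 - specificity
        "precision": tp / (tp + fp),              # positive predictive value
        "negative_predictive_value": tn / (tn + fn),
    }

# Counts copied from the train-set confusion matrix at cut-off 0.5:
# [[TN, FP], [FN, TP]] = [[3462, 455], [699, 1693]]
train_rates = rates_from_confusion(tn=3462, fp=455, fn=699, tp=1693)
for name, value in train_rates.items():
    print(f"{name}: {100 * value:.3f} %")
```

Keeping the counts as named arguments avoids mixing up the row/column order of `confusion_matrix`, where rows are actual classes and columns are predicted classes.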

Plotting ROC Curve.

In [263]:
def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(5, 5))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.show()

    return None
In [264]:
draw_roc(y_train_pred_final.Converted, y_train_pred_final.Converted_Prob)

Finding Optimal Cutoff Point

  • The optimal cutoff probability is the one at which sensitivity and specificity are balanced, i.e. where the two curves cross.
In [265]:
# Let's create columns with different probability cutoffs 
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_train_pred_final[i]= y_train_pred_final.Converted_Prob.map(lambda x: 1 if x > i else 0)
y_train_pred_final.head()
Out[265]:
Converted Converted_Prob CustID predicted 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
4948 1 0.401461 4948 0 1 1 1 1 1 0 0 0 0 0
5938 1 0.318706 5938 0 1 1 1 1 0 0 0 0 0 0
5688 1 0.745966 5688 1 1 1 1 1 1 1 1 1 0 0
5381 0 0.002848 5381 0 1 0 0 0 0 0 0 0 0 0
4742 1 0.801898 4742 1 1 1 1 1 1 1 1 1 1 0
In [266]:
# Now let's calculate accuracy sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])

# TP = confusion[1,1] # true positive 
# TN = confusion[0,0] # true negatives
# FP = confusion[0,1] # false positives
# FN = confusion[1,0] # false negatives

num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final[i] )
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(cutoff_df)
     prob  accuracy     sensi     speci
0.0   0.0  0.379141  1.000000  0.000000
0.1   0.1  0.632905  0.974498  0.424304
0.2   0.2  0.758599  0.923077  0.658157
0.3   0.3  0.793311  0.869983  0.746490
0.4   0.4  0.818830  0.781355  0.841716
0.5   0.5  0.817087  0.707776  0.883840
0.6   0.6  0.808052  0.634197  0.914220
0.7   0.7  0.782533  0.513796  0.946643
0.8   0.8  0.755112  0.405518  0.968598
0.9   0.9  0.715961  0.273829  0.985959
In [267]:
# Let's plot accuracy sensitivity and specificity for various cutoff probabilities.

fig,ax = plt.subplots()
fig.set_figwidth(30)
fig.set_figheight(10)
plots=['accuracy','sensi','speci']
ax.set_xticks(np.linspace(0,1,50))
ax.set_title('Finding Optimal Cutoff')
sns.lineplot(x='prob',y=plots[0] , data=cutoff_df,ax=ax)
sns.lineplot(x='prob',y=plots[1] , data=cutoff_df,ax=ax)
sns.lineplot(x='prob',y=plots[2] , data=cutoff_df,ax=ax)

ax.set_xlabel('Probabilities')
ax.set_ylabel('Accuracy,Sensitivity,Specificity')
ax.legend(["Accuracy",'Sensitivity','Specificity'])
# cutoff_df.plot.line(, figure=[10,10])
plt.show()
  • From the curve above, 0.36 is the optimum cutoff probability.
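
The crossing point can also be estimated numerically rather than read off the plot. The sketch below is only an approximation: it linearly interpolates between the 0.1-spaced rows of the cutoff table, and the function name `crossing_cutoff` is an assumption made for illustration. It lands at roughly 0.37, close to the 0.36 chosen above.

```python
# (cutoff, sensitivity, specificity) rows copied from the cutoff table
rows = [
    (0.3, 0.869983, 0.746490),
    (0.4, 0.781355, 0.841716),
    (0.5, 0.707776, 0.883840),
]

def crossing_cutoff(rows):
    """Linearly interpolate the cutoff at which sensitivity equals specificity."""
    for (c0, se0, sp0), (c1, se1, sp1) in zip(rows, rows[1:]):
        d0, d1 = se0 - sp0, se1 - sp1      # gap between the two curves
        if d0 >= 0 >= d1 and d0 != d1:     # sign change: curves cross in this interval
            return c0 + (c1 - c0) * d0 / (d0 - d1)
    return None                            # no crossing found on this grid

estimated_cutoff = crossing_cutoff(rows)
print(round(estimated_cutoff, 3))
```

A finer grid of cutoffs (for example steps of 0.01) would reduce the interpolation error.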
In [268]:
y_train_pred_final['final_predicted'] = y_train_pred_final.Converted_Prob.map( lambda x: 1 if x > 0.36 else 0)

y_train_pred_final.head()
Out[268]:
Converted Converted_Prob CustID predicted 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 final_predicted
4948 1 0.401461 4948 0 1 1 1 1 1 0 0 0 0 0 1
5938 1 0.318706 5938 0 1 1 1 1 0 0 0 0 0 0 0
5688 1 0.745966 5688 1 1 1 1 1 1 1 1 1 0 0 1
5381 0 0.002848 5381 0 1 0 0 0 0 0 0 0 0 0 0
4742 1 0.801898 4742 1 1 1 1 1 1 1 1 1 1 0 1
In [269]:
# Let's check the overall accuracy.
accu = metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)
print('Accuracy on Train set at Optimum Cut Off : ', round(100*accu,3),'%')
Accuracy on Train set at Optimum Cut Off :  81.249 %
In [270]:
confusion2 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.final_predicted )
confusion2
Out[270]:
array([[3203,  714],
       [ 469, 1923]])
In [271]:
TP = confusion2[1,1] # true positive 
TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives
sensitivity = TP/(FN + TP)
specificity = TN/(FP + TN)
falsePositiveRate = FP/(FP + TN)
positivePredictivePower = TP/(TP +FP )
negativePredictivePower = TN/(TN + FN)
print('sensitivity / Recall: ', round(100*sensitivity,3),'%')
print('specificity : ',  round(100*specificity,3),'%')
print('False Positive Rate : ',  round(100*falsePositiveRate,3),'%')
print('Precision / Positive Predictive Power : ',  round(100*positivePredictivePower,3),'%')
print('Negative Predictive Power : ',  round(100*negativePredictivePower,3),'%')
sensitivity / Recall:  80.393 %
specificity :  81.772 %
False Positive Rate :  18.228 %
Precision / Positive Predictive Power :  72.924 %
Negative Predictive Power :  87.228 %
In [272]:
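## ROC curve at the cut-off probability of 0.36
# Note: hard 0/1 predictions are passed here, so the curve has a single operating point
# and the reported area equals (sensitivity + specificity) / 2, not the AUC of the probabilities.
draw_roc(y_train_pred_final.Converted, y_train_pred_final.final_predicted)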

Precision and Recall

In [273]:
#Looking at the confusion matrix again
confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.predicted )
confusion
Out[273]:
array([[3462,  455],
       [ 699, 1693]])
  • Precision : TP / (TP + FP)
  • Recall : TP / (TP + FN)
In [274]:
print('Precision :', confusion[1,1]/(confusion[0,1]+confusion[1,1]))
print('Recall :', confusion[1,1]/(confusion[1,0]+confusion[1,1]))
Precision : 0.7881750465549349
Recall : 0.7077759197324415
In [301]:
#Doing the same using the sklearn.
from sklearn.metrics import precision_score, recall_score
print('Precision : ', precision_score(y_train_pred_final.Converted, y_train_pred_final.predicted))
print('Recall :', recall_score(y_train_pred_final.Converted, y_train_pred_final.predicted))
Precision :  0.7881750465549349
Recall : 0.7077759197324415

Precision and Recall Tradeoff

In [276]:
from sklearn.metrics import precision_recall_curve

p, r, thresholds = precision_recall_curve(y_train_pred_final.Converted, y_train_pred_final.Converted_Prob)
plt.plot(thresholds, p[:-1], "g-")
plt.plot(thresholds, r[:-1], "r-")
plt.show()
  • The cut off point from precision-recall curve is ~0.4.
  • Note that we have used the cut off obtained from 'Sensitivity-Specificity' trade off to predict conversions in this analysis.
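
To make the trade-off concrete, the sketch below recomputes precision and recall from the two train-set confusion matrices reported earlier (cut-offs 0.5 and 0.36). The helper `precision_recall` is an illustrative name, not something defined in the notebook.

```python
def precision_recall(tn, fp, fn, tp):
    """Return (precision, recall) from confusion-matrix counts."""
    return tp / (tp + fp), tp / (tp + fn)

# Train-set confusion matrices from above, as (TN, FP, FN, TP)
by_cutoff = {
    0.50: (3462, 455, 699, 1693),
    0.36: (3203, 714, 469, 1923),
}

tradeoff = {cutoff: precision_recall(*cm) for cutoff, cm in by_cutoff.items()}
for cutoff, (precision, recall) in tradeoff.items():
    print(f"cutoff={cutoff:.2f}  precision={precision:.3f}  recall={recall:.3f}")
```

Lowering the cut-off from 0.5 to 0.36 raises recall while lowering precision, which is the trade-off the curve visualises.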

Predictions on Test set

In [277]:
X_test_sm = sm.add_constant(X_test[finalFeatures])
y_test_pred = logm_final.predict(X_test_sm)

Actual Conversions vs Conversion Probability

In [278]:
# predicted conversions vs actual conversions and customer ID
y_test_predictions = pd.DataFrame({'Converted' :y_test, 'Conversion Probability' : y_test_pred, 'CustID' : y_test.index})
y_test_predictions.head()
Out[278]:
Converted Conversion Probability CustID
1260 0 0.116132 1260
2104 1 0.318706 2104
7105 1 0.982061 7105
8916 0 0.420480 8916
2822 0 0.029267 2822
In [279]:
# predictions with optimal cut off = 0.36
cutoff=0.36
y_test_predictions['Predicted'] = y_test_predictions[
 'Conversion Probability'
].map(lambda x : 1 if x > cutoff else 0 ) 

Confusion Matrix

In [280]:
confusion = metrics.confusion_matrix(y_test_predictions['Converted'], y_test_predictions['Predicted'])
confusion
Out[280]:
array([[1356,  322],
       [ 230,  797]])
In [281]:
TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives

Accuracy

In [282]:
print('Accuracy on Test set : ', round(100*(TP + TN)/(TP + TN + FP + FN),3),'%')
Accuracy on Test set :  79.593 %

Metrics Beyond Simple Accuracy

In [283]:
sensitivity = TP/(FN + TP)
specificity = TN/(FP + TN)
falsePositiveRate = FP/(FP + TN)
positivePredictivePower = TP/(TP +FP )
negativePredictivePower = TN/(TN + FN)
print('sensitivity / Recall: ', round(100*sensitivity,3),'%')
print('specificity : ',  round(100*specificity,3),'%')
print('False Positive Rate : ',  round(100*falsePositiveRate,3),'%')
print('Precision / Positive Predictive Power : ',  round(100*positivePredictivePower,3),'%')
print('Negative Predictive Power : ',  round(100*negativePredictivePower,3),'%')
sensitivity / Recall:  77.605 %
specificity :  80.81 %
False Positive Rate :  19.19 %
Precision / Positive Predictive Power :  71.224 %
Negative Predictive Power :  85.498 %
In [284]:
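## ROC curve at the cut-off probability of 0.36
draw_roc(y_test_predictions['Converted'],y_test_predictions['Predicted'])
  • Note the area reported is 0.79 on the test set. Because 0/1 predictions (not probabilities) are passed to draw_roc, this area equals the mean of sensitivity and specificity at the cut-off.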
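
When `draw_roc` receives 0/1 predictions instead of probabilities, the ROC "curve" is just the polyline (0,0) → (FPR, TPR) → (1,1). The sketch below (helper name `single_point_roc_area` is an assumption for illustration) shows that the area under that polyline equals the mean of sensitivity and specificity, using the test-set confusion matrix above.

```python
def single_point_roc_area(fpr, tpr):
    """Trapezoidal area under the polyline (0,0) -> (fpr,tpr) -> (1,1)."""
    left = 0.5 * fpr * tpr                 # triangle under the first segment
    right = 0.5 * (1 - fpr) * (tpr + 1)    # trapezoid under the second segment
    return left + right

# Test-set confusion matrix at cut-off 0.36: [[1356, 322], [230, 797]]
tn, fp, fn, tp = 1356, 322, 230, 797
tpr = tp / (tp + fn)                       # sensitivity
fpr = fp / (fp + tn)                       # 1 - specificity

area = single_point_roc_area(fpr, tpr)
balanced_accuracy = (tpr + (1 - fpr)) / 2
print(round(area, 4), round(balanced_accuracy, 4))
```

Passing the predicted probabilities to `draw_roc` instead would trace the full curve and give the threshold-independent AUC.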

Lead Scoring

In [285]:
# merging final predictions with leads dataset
conversionProb = pd.concat([y_test_predictions['Conversion Probability'],y_train_pred_final['Converted_Prob']],axis=0)
conversionProb = pd.DataFrame({'Conversion Probability' : conversionProb}, index=conversionProb.index)
leads = pd.concat([leads,conversionProb],axis=1)
leads['Prospect ID'] = prospect_ids
leads['Lead No'] = lead_no
leads['Converted'] = y
In [286]:
# Verifying prediction accuracy
leads['Predicted'] = leads['Conversion Probability'].map(lambda x : 1 if x > 0.36 else 0)
In [287]:
confusion = metrics.confusion_matrix(leads['Converted'], leads['Predicted'])
In [288]:
TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
acc = metrics.accuracy_score(leads['Converted'], leads['Predicted'])
print('Accuracy : ', round(100*acc,3),'%')
sensitivity = TP/(FN + TP)
specificity = TN/(FP + TN)
falsePositiveRate = FP/(FP + TN)
falseNegativeRate = FN/(FN + TP)
positivePredictivePower = TP/(TP +FP )
negativePredictivePower = TN/(TN + FN)
print('sensitivity : ', round(100*sensitivity,3),'%')
print('specificity : ',  round(100*specificity,3),'%')
print('False Positive Rate : ',  round(100*falsePositiveRate,3),'%')
print('False Negative Rate : ',  round(100*falseNegativeRate,3),'%')
print('Positive Predictive Power / Precision : ',  round(100*positivePredictivePower,3),'%')
print('Negative Predictive Power : ',  round(100*negativePredictivePower,3),'%')
Accuracy :  80.752 %
sensitivity :  79.555 %
specificity :  81.483 %
False Positive Rate :  18.517 %
False Negative Rate :  20.445 %
Positive Predictive Power / Precision :  72.417 %
Negative Predictive Power :  86.706 %
In [289]:
## ROC curve
draw_roc(leads['Converted'], leads['Predicted'])
In [290]:
# Lead Scores 
leads['Lead Score'] = leads['Conversion Probability']*100
leads[['Prospect ID','Lead No','Lead Score']].sort_values(by='Lead Score', ascending=False)[:10]
Out[290]:
Prospect ID Lead No Lead Score
7219 ed62264f-7666-4bf9-9cb6-5b9a825f1e67 594038 99.911806
7234 7e2819e8-97f0-416b-bcb6-45ef14f0e11a 593962 99.855878
2378 8ff353ab-1207-4608-a8cc-8172ea7c12eb 636860 99.830268
7327 95d1590f-7c47-4f40-9806-388f4472d3a4 593208 99.825999
2497 e5fb32dd-b3b7-4fbf-972d-c13d2cfc6866 635761 99.822078
5671 623bc6c9-9184-4437-b38f-d374be49d1a3 606508 99.822078
7094 9ec1cafe-b019-498e-b246-7ab06167d72c 595141 99.819952
7187 f33166e8-d8d3-4e8c-b9d0-8a1922c35910 594369 99.817589
7420 2caa32d0-50b7-4d29-b31f-2528b06d7bc8 592625 99.816186
8120 bf4a03bc-b747-45a6-a6b5-659afa3bf3ac 587853 99.815060

Score Sheet for X Education

In [291]:
# Run the following to generate a sheet containing lead information provided by the company and corresponding scores 
leads.to_csv('lead_scores.csv')

KS Statistic

In [292]:
# Gain Chart 
y_test_predictions = y_test_predictions.sort_values(by='Conversion Probability', ascending=False)
y_test_predictions['decile'] = pd.qcut(y_test_predictions['Conversion Probability'],10,labels=range(10,0,-1))
y_test_predictions['Converted'] = y_test_predictions['Converted'].astype('int')
y_test_predictions['Un Converted'] = 1 - y_test_predictions['Converted']
y_test_predictions.head()
Out[292]:
Converted Conversion Probability CustID Predicted decile Un Converted
7327 1 0.998260 7327 1 1 0
7420 1 0.998162 7420 1 1 0
4613 1 0.997809 4613 1 1 0
6243 1 0.997674 6243 1 1 0
7324 1 0.997402 7324 1 1 0
In [293]:
df1 = pd.pivot_table(data=y_test_predictions,index=['decile'],values=['Converted','Un Converted','Conversion Probability'],
                     aggfunc={'Converted':[np.sum],
                              'Un Converted':[np.sum],
                              'Conversion Probability' : [np.min,np.max]})
df1 = df1.reset_index()
df1.columns = ['Decile','Max Prob', 'Min Prob','Converted Count','Un Converted Count']
df1 = df1.sort_values(by='Decile', ascending=False)
df1['Total Leads'] = df1['Converted Count'] + df1['Un Converted Count']
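# Note: despite its name, this column is the ratio of converted to unconverted leads (the odds), not converted / total
df1['Conversion Rate'] = df1['Converted Count'] / df1['Un Converted Count']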
converted_sum = df1['Converted Count'].sum()
unconverted_sum = df1['Un Converted Count'].sum()
df1['Converted %'] = df1['Converted Count'] / converted_sum
df1['Un Converted %'] = df1['Un Converted Count'] / unconverted_sum
df1.head()
Out[293]:
Decile Max Prob Min Prob Converted Count Un Converted Count Total Leads Conversion Rate Converted % Un Converted %
9 1 0.998260 0.910258 257 14 271 18.357143 0.250243 0.008343
8 2 0.909066 0.781567 224 46 270 4.869565 0.218111 0.027414
7 3 0.780757 0.564123 182 89 271 2.044944 0.177215 0.053039
6 4 0.563613 0.380673 118 152 270 0.776316 0.114898 0.090584
5 5 0.380229 0.251101 113 157 270 0.719745 0.110029 0.093564
In [294]:
df1['ks_stats'] = np.round(((df1['Converted Count'] / df1['Converted Count'].sum()).cumsum() -(df1['Un Converted Count'] / df1['Un Converted Count'].sum()).cumsum()), 4) * 100
df1
Out[294]:
Decile Max Prob Min Prob Converted Count Un Converted Count Total Leads Conversion Rate Converted % Un Converted % ks_stats
9 1 0.998260 0.910258 257 14 271 18.357143 0.250243 0.008343 24.19
8 2 0.909066 0.781567 224 46 270 4.869565 0.218111 0.027414 43.26
7 3 0.780757 0.564123 182 89 271 2.044944 0.177215 0.053039 55.68
6 4 0.563613 0.380673 118 152 270 0.776316 0.114898 0.090584 58.11
5 5 0.380229 0.251101 113 157 270 0.719745 0.110029 0.093564 59.76
4 6 0.250676 0.149579 64 207 271 0.309179 0.062317 0.123361 53.65
3 7 0.149564 0.113845 37 233 270 0.158798 0.036027 0.138856 43.37
2 8 0.113644 0.069387 22 249 271 0.088353 0.021422 0.148391 30.67
1 9 0.069378 0.029278 8 256 264 0.031250 0.007790 0.152563 16.19
0 10 0.029267 0.002231 2 275 277 0.007273 0.001947 0.163886 0.00
  • The maximum KS statistic is 59.76, reached in the 5th decile.
  • This model discriminates well between converted and non-converted leads, since the KS statistic already exceeds 40% by the 2nd decile (58.11 in the 4th decile). Hence, this is a reasonably good model.
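
The KS column can be reproduced from the decile counts alone. The following is a small sketch using only the standard library, with the converted and unconverted counts copied from the table above (decile 1 first); the variable names are illustrative.

```python
from itertools import accumulate

# Counts per decile, highest-probability decile first (copied from the table)
converted = [257, 224, 182, 118, 113, 64, 37, 22, 8, 2]
unconverted = [14, 46, 89, 152, 157, 207, 233, 249, 256, 275]

# Cumulative share of each class captured up to and including each decile
cum_conv = [c / sum(converted) for c in accumulate(converted)]
cum_unconv = [u / sum(unconverted) for u in accumulate(unconverted)]

# KS per decile = gap between the two cumulative distributions, in percent
ks_by_decile = [100 * (c - u) for c, u in zip(cum_conv, cum_unconv)]
ks_max = max(ks_by_decile)
ks_decile = ks_by_decile.index(ks_max) + 1

print(f"max KS = {ks_max:.2f} at decile {ks_decile}")
```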

Gain Chart

In [295]:
df1['Cum Conversion %'] = np.round(((df1['Converted Count'] / df1['Converted Count'].sum()).cumsum()), 4) * 100
df1
Out[295]:
Decile Max Prob Min Prob Converted Count Un Converted Count Total Leads Conversion Rate Converted % Un Converted % ks_stats Cum Conversion %
9 1 0.998260 0.910258 257 14 271 18.357143 0.250243 0.008343 24.19 25.02
8 2 0.909066 0.781567 224 46 270 4.869565 0.218111 0.027414 43.26 46.84
7 3 0.780757 0.564123 182 89 271 2.044944 0.177215 0.053039 55.68 64.56
6 4 0.563613 0.380673 118 152 270 0.776316 0.114898 0.090584 58.11 76.05
5 5 0.380229 0.251101 113 157 270 0.719745 0.110029 0.093564 59.76 87.05
4 6 0.250676 0.149579 64 207 271 0.309179 0.062317 0.123361 53.65 93.28
3 7 0.149564 0.113845 37 233 270 0.158798 0.036027 0.138856 43.37 96.88
2 8 0.113644 0.069387 22 249 271 0.088353 0.021422 0.148391 30.67 99.03
1 9 0.069378 0.029278 8 256 264 0.031250 0.007790 0.152563 16.19 99.81
0 10 0.029267 0.002231 2 275 277 0.007273 0.001947 0.163886 0.00 100.00
In [296]:
df1['Base %'] = np.arange(10,110,10)
df1 = df1.set_index('Decile')
In [297]:
df1
Out[297]:
Max Prob Min Prob Converted Count Un Converted Count Total Leads Conversion Rate Converted % Un Converted % ks_stats Cum Conversion % Base %
Decile
1 0.998260 0.910258 257 14 271 18.357143 0.250243 0.008343 24.19 25.02 10
2 0.909066 0.781567 224 46 270 4.869565 0.218111 0.027414 43.26 46.84 20
3 0.780757 0.564123 182 89 271 2.044944 0.177215 0.053039 55.68 64.56 30
4 0.563613 0.380673 118 152 270 0.776316 0.114898 0.090584 58.11 76.05 40
5 0.380229 0.251101 113 157 270 0.719745 0.110029 0.093564 59.76 87.05 50
6 0.250676 0.149579 64 207 271 0.309179 0.062317 0.123361 53.65 93.28 60
7 0.149564 0.113845 37 233 270 0.158798 0.036027 0.138856 43.37 96.88 70
8 0.113644 0.069387 22 249 271 0.088353 0.021422 0.148391 30.67 99.03 80
9 0.069378 0.029278 8 256 264 0.031250 0.007790 0.152563 16.19 99.81 90
10 0.029267 0.002231 2 275 277 0.007273 0.001947 0.163886 0.00 100.00 100
In [298]:
### Gain chart 
# Plot the model's cumulative gain first so that the legend order matches the plotted lines
plot_columns =['Cum Conversion %','Base %']
plt.plot(df1[plot_columns]);
plt.xticks(df1.index);
plt.title('Gain chart');
plt.xlabel('Decile')
plt.ylabel('Cumulative Conversion %')
plt.legend(('Our Model','Random Model'));
  • Instead of pursuing leads randomly, pursuing the top 40% of leads scored by the model would let the sales team reach about 76% of the leads that convert (about 87% with the top 50%).
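
As a rough sketch of how the gain figures are derived, the snippet below rebuilds the cumulative gain from the converted counts per decile and looks up how many leads are needed to reach a target share of conversions. It treats every decile as exactly 10% of leads, as the `Base %` column does; the function names are assumptions for illustration.

```python
# Converted leads per decile, highest-probability decile first (from the table above)
converted_per_decile = [257, 224, 182, 118, 113, 64, 37, 22, 8, 2]

def cumulative_gain(counts):
    """Return [(percent of leads contacted, percent of conversions captured), ...]."""
    total = sum(counts)
    running, gain = 0, []
    for decile, count in enumerate(counts, start=1):
        running += count
        gain.append((10 * decile, 100 * running / total))
    return gain

def leads_needed(gain, target_pct):
    """Smallest share of leads (in %) whose cumulative gain reaches target_pct."""
    for base_pct, captured_pct in gain:
        if captured_pct >= target_pct:
            return base_pct
    return None

gain = cumulative_gain(converted_per_decile)
print(gain[3])                 # gain after contacting the top 40% of leads
print(leads_needed(gain, 80))  # share of leads needed to capture 80% of conversions
```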

Lift Chart

In [299]:
df1['Lift'] = df1['Cum Conversion %'] / df1['Base %']
df1['Baseline'] = 1
df1
Out[299]:
Max Prob Min Prob Converted Count Un Converted Count Total Leads Conversion Rate Converted % Un Converted % ks_stats Cum Conversion % Base % Lift Baseline
Decile
1 0.998260 0.910258 257 14 271 18.357143 0.250243 0.008343 24.19 25.02 10 2.502000 1
2 0.909066 0.781567 224 46 270 4.869565 0.218111 0.027414 43.26 46.84 20 2.342000 1
3 0.780757 0.564123 182 89 271 2.044944 0.177215 0.053039 55.68 64.56 30 2.152000 1
4 0.563613 0.380673 118 152 270 0.776316 0.114898 0.090584 58.11 76.05 40 1.901250 1
5 0.380229 0.251101 113 157 270 0.719745 0.110029 0.093564 59.76 87.05 50 1.741000 1
6 0.250676 0.149579 64 207 271 0.309179 0.062317 0.123361 53.65 93.28 60 1.554667 1
7 0.149564 0.113845 37 233 270 0.158798 0.036027 0.138856 43.37 96.88 70 1.384000 1
8 0.113644 0.069387 22 249 271 0.088353 0.021422 0.148391 30.67 99.03 80 1.237875 1
9 0.069378 0.029278 8 256 264 0.031250 0.007790 0.152563 16.19 99.81 90 1.109000 1
10 0.029267 0.002231 2 275 277 0.007273 0.001947 0.163886 0.00 100.00 100 1.000000 1
In [300]:
# Lift chart 
plot_columns =['Lift', 'Baseline']
plt.plot(df1[plot_columns]);
plt.xticks(df1.index);
plt.title('Lift chart');
plt.xlabel('Decile')
plt.ylabel('Lift')
plt.legend(('Our Model','Random Model'));
  • The model outperforms a random model by at least 2 times when selecting the top 30% of leads, and by about 1.9 times for the top 40%.
  • As opposed to 10% of conversions from 10% of leads pursued randomly, pursuing the top 10% of leads scored by this model would capture about 25% of conversions.
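
Lift is simply the cumulative gain divided by the share of leads contacted. A minimal sketch, using the `Cum Conversion %` values copied from the table above:

```python
# Cumulative conversion % per decile, copied from the table (decile 1 first)
cum_conversion_pct = [25.02, 46.84, 64.56, 76.05, 87.05,
                      93.28, 96.88, 99.03, 99.81, 100.00]
base_pct = [10 * decile for decile in range(1, 11)]   # random model: 10%, 20%, ...

# Lift = how many times more conversions than random selection at the same effort
lift = [c / b for c, b in zip(cum_conversion_pct, base_pct)]

for decile, value in enumerate(lift, start=1):
    print(f"top {10 * decile:3d}% of leads: lift = {value:.3f}")
```

Lift necessarily falls to 1 at the last decile, because contacting every lead captures every conversion regardless of the ordering.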

Conclusion

A logistic regression model is built on lead features. To arrive at the features that significantly affect conversion probability, a mixed feature elimination approach is followed: the 25 most important features are selected through Recursive Feature Elimination and then reduced to 15 using p-values and VIF. The dataset is randomly divided into train and test sets (70-30 split).

The final relationship between log Odds of Conversion Probability and lead features is

logOdds(Conversion Probability) = -0.6469
    - 1.5426 * Do Not Email
    - 1.2699 * Unknown Occupation
    - 0.9057 * No Specialization
    - 0.8704 * Hospitality Management
    - 0.6584 * Outside India
    + 1.7923 * SMS Sent
    + 1.1749 * Other Last Activity
    + 2.3769 * Working Professional
    - 0.8614 * Olark Chat Conversation
    + 5.3886 * Welingak Website
    + 3.0246 * Reference
    + 1.1876 * Olark Chat
    - 1.0250 * Landing Page Submission
    + 1.1253 * Total Time Spent on Website
    + 0.6106 * Email Opened

where Total Time Spent on Website is standardized to $\mu=0,\sigma=1$
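
To illustrate how this relationship turns into a lead score, the sketch below applies the logistic function to the log odds and scales the probability to 0-100, as the Lead Score column does. The coefficients are copied from the equation above; the function name `lead_score` and the example lead are hypothetical, invented purely for illustration.

```python
import math

INTERCEPT = -0.6469
COEF = {
    "Do Not Email": -1.5426, "Unknown Occupation": -1.2699,
    "No Specialization": -0.9057, "Hospitality Management": -0.8704,
    "Outside India": -0.6584, "SMS Sent": 1.7923,
    "Other Last Activity": 1.1749, "Working Professional": 2.3769,
    "Olark Chat Conversation": -0.8614, "Welingak Website": 5.3886,
    "Reference": 3.0246, "Olark Chat": 1.1876,
    "Landing Page Submission": -1.0250,
    "Total Time Spent on Website": 1.1253, "Email Opened": 0.6106,
}

def lead_score(features):
    """Lead score (0-100) for a dict of feature values; missing features count as 0."""
    log_odds = INTERCEPT + sum(COEF[name] * value for name, value in features.items())
    probability = 1 / (1 + math.exp(-log_odds))   # logistic (sigmoid) function
    return 100 * probability

# Baseline lead: all indicators 0 and average (standardized = 0) time on the website
baseline = lead_score({})

# Hypothetical lead: referred working professional, time on site one std. dev. above the mean
hypothetical = lead_score({
    "Reference": 1,
    "Working Professional": 1,
    "Total Time Spent on Website": 1.0,
})

print(round(baseline, 2), round(hypothetical, 2))
```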

Interpreting Top 6 features affecting Conversion Probability :

  • A lead from Welingak Website has log odds of conversion 5.39 higher than a lead from Google, all other features held constant, i.e. its odds are multiplied by exp(5.39) ≈ 219.
  • Leads through Reference have log odds 3.02 higher than those from Google (odds × ≈ 20.6).
  • Working Professional leads have log odds 2.38 higher than Businessman leads (odds × ≈ 10.8).
  • Leads with SMS Sent have log odds 1.79 higher than those with no SMS sent (odds × ≈ 6.0).
  • Leads with Do Not Email have log odds 1.54 lower than leads who would like email updates (odds × ≈ 0.21).
  • Leads with Unknown Occupation have log odds 1.27 lower than Businessman leads (odds × ≈ 0.28).
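
A logistic regression coefficient is an additive change in log odds, so exponentiating it gives the factor by which the odds of conversion are multiplied. The short sketch below does this for the six coefficients discussed above (values copied from the model equation; variable names are illustrative).

```python
import math

# Coefficients of the six features interpreted above
coefficients = {
    "Welingak Website": 5.3886,
    "Reference": 3.0246,
    "Working Professional": 2.3769,
    "SMS Sent": 1.7923,
    "Do Not Email": -1.5426,
    "Unknown Occupation": -1.2699,
}

# exp(coefficient) = multiplicative effect on the odds, other features held constant
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: odds multiplied by {ratio:.2f}")
```

Ratios above 1 increase the odds of conversion relative to the reference category, and ratios below 1 decrease them.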

Lead Scores :

  • The score sheet can be generated by running the cell under "Score Sheet for X Education", which writes lead_scores.csv.

At an optimum cut-off probability of 0.36, model performance is as follows.

Model Performance on Training Set :

  • Accuracy : 81.249 %
  • Sensitivity / Recall: 80.393 %
  • Specificity : 81.772 %
  • Precision / Positive Predictive Power : 72.924 %
  • False Positive Rate : 18.228 %
  • AUC Score : 0.81 (area from the 0/1 predictions at the cut-off, i.e. the mean of sensitivity and specificity)

Model Performance for Test Set :

  • Accuracy : 79.593 %
  • Sensitivity / Recall : 77.605 %
  • Specificity : 80.81%
  • Precision / Positive Predictive Power : 71.224 %
  • False Positive Rate : 19.19 %
  • AUC Score : 0.79 (area from the 0/1 predictions at the cut-off, i.e. the mean of sensitivity and specificity)

KS statistic :

  • Max KS Statistic is 59.76 for 5th decile
  • This model discriminates between Converted and Non-converted leads well since KS Statistic in 4th decile (58.11) is greater than 40%. Hence, this is a reasonably good model.

Gain :

  • Instead of pursuing leads randomly, pursuing the top 40% of leads scored by the model would let the sales team reach about 76% of the leads that convert (about 87% with the top 50%).

Lift :

  • The model outperforms a random model by at least 2 times when selecting the top 30% of leads, and by about 1.9 times for the top 40%.
  • As opposed to 10% of conversions from 10% of leads pursued randomly, pursuing the top 10% of leads scored by this model would capture about 25% of conversions.

Note :

  • Incorrect data types have been corrected
  • Columns with high missing values have been dropped.
  • Columns which do not explain variability in the model have been dropped.
  • Columns containing the sales team's notes, such as Tags, where the classes are not mutually exclusive, have been dropped.
  • Features with low missing values have been imputed with the most frequent values.
  • Categories in a feature with less than 1% contribution have been grouped together to reduce the number of levels.
  • Inconsistencies in Categories have been corrected.
  • 97.5 % of the leads provided by the company have been used for analysis.
  • Class imbalance (ratio of converted to non-converted leads) = 0.6
  • Indicator variables have been created for all categorical variables with the first category as the reference.
  • Continuous variables have been standardized ($\mu = 0, \sigma = 1$) before modelling.
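
For reference, standardization as described in the last note can be sketched as follows. This is a generic z-score illustration, not the notebook's own preprocessing code; the sample values are made up, and the population standard deviation is assumed here.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)          # population standard deviation (ddof = 0)
    return [(v - mu) / sigma for v in values]

# Made-up time-on-site values, purely for illustration
times = [0, 120, 300, 900, 1500]
z_scores = standardize(times)

print([round(z, 3) for z in z_scores])
```

After this transformation, a coefficient such as the one for Total Time Spent on Website describes the change in log odds per one standard deviation increase in the original variable.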