An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses.
The company markets its courses on several websites and search engines such as Google. Once these professionals land on the website, they might browse the courses, fill out a form for a course, or watch some videos. When they fill out a form providing their email address or phone number, they are classified as a lead. The company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, and so on. Through this process some of the leads get converted, while most do not. The typical lead conversion rate at X Education is around 30%.
Now, although X Education gets a lot of leads, its lead conversion rate is very poor: if it acquires, say, 100 leads in a day, only about 30 of them are converted. To make this process more efficient, the company wishes to identify the most promising leads, also known as 'Hot Leads'. If this set of leads is identified successfully, the lead conversion rate should go up, as the sales team will focus on communicating with the potential leads rather than calling everyone. A typical lead conversion process can be pictured as a funnel: many leads enter at the top, and only a small fraction emerge as paying customers at the bottom.
X Education has appointed us to help select the most promising leads, i.e. the leads most likely to convert into paying customers. The company requires a model that assigns a lead score to each lead, such that customers with a higher lead score have a higher chance of conversion and those with a lower lead score have a lower chance. The CEO, in particular, has given a ballpark target lead conversion rate of around 80%.
Lead scoring is a class probability estimation problem, a form of classification problem. The target variable in the data set has two classes: 0 (Un-converted) and 1 (Converted). The objective is to model the probability p that each lead belongs to the Converted class. Since there are just two classes, the probability of belonging to the Un-converted class is (1 - p). The relationship between each lead's conversion probability and its characteristics is modelled using Logistic Regression, and the leads are scored on a scale of 0-100, with 100 being the most probable conversion candidate.
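For concreteness, a minimal sketch of how a modelled probability becomes a 0-100 lead score (illustrative only; the actual coefficients are estimated below with Logistic Regression):
import numpy as np

def lead_score(log_odds):
    '''Map a lead's log-odds to a conversion probability p, then to a 0-100 score.'''
    p = 1 / (1 + np.exp(-log_odds))   # inverse logit
    return round(100 * p)

# e.g. log-odds of 0.8 -> p ~ 0.69 -> lead score ~ 69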
The final solution has been provided in two parts.
1. Scoring the leads provided by the company in the order of probability of conversion (0-100)
2. Insights into the relationship between the characteristics of a lead and the log-odds of conversion, which could help the company score leads in the future.
A logistic regression model is created using lead features. To arrive at the list of features which significantly affect conversion probability, a mixed feature elimination approach is followed: the 25 most important features are obtained through Recursive Feature Elimination (RFE) and then reduced to 15 via a p-value / VIF approach. The dataset is randomly divided into training and test sets (70:30 split).
The final relationship between the log-odds of conversion probability and the lead features is:

$$
\begin{aligned}
\text{logOdds(Conversion)} = -0.6469 &- 1.5426 \cdot \text{Do Not Email} - 1.2699 \cdot \text{Unknown Occupation} \\
&- 0.9057 \cdot \text{No Specialization} - 0.8704 \cdot \text{Hospitality Management} \\
&- 0.6584 \cdot \text{Outside India} + 1.7923 \cdot \text{SMS Sent} \\
&+ 1.1749 \cdot \text{Other Last Activity} + 2.3769 \cdot \text{Working Professional} \\
&- 0.8614 \cdot \text{Olark Chat Conversation} + 5.3886 \cdot \text{Welingak Website} \\
&+ 3.0246 \cdot \text{Reference} + 1.1876 \cdot \text{Olark Chat} \\
&- 1.0250 \cdot \text{Landing Page Submission} + 1.1253 \cdot \text{Total Time Spent on Website} \\
&+ 0.6106 \cdot \text{Email Opened}
\end{aligned}
$$

where Total Time Spent on Website is standardized to $\mu=0,\sigma=1$.
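To make the formula concrete, a small sketch (a hypothetical lead, not taken from the data set) evaluating the log-odds and implied probability for a working professional who was sent an SMS, with all other indicators at 0 and website time at its standardized mean:
import numpy as np
# intercept + Working Professional + SMS Sent; remaining indicators are 0 and
# Total Time Spent on Website is 0 (its standardized mean)
log_odds = -0.6469 + 2.3769 + 1.7923
p = 1 / (1 + np.exp(-log_odds))
print(round(log_odds, 4), round(p, 3))   # 3.5223 0.971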
Interpreting the top 6 features affecting conversion probability (each coefficient shifts the log-odds additively, relative to the reference level):

- Leads from the `Welingak Website` have log-odds of conversion higher by 5.39 than those from `Google`.
- Leads from `Reference` have log-odds higher by 3.02 than those from `Google`.
- `Working Professional` leads have log-odds higher by 2.38 than `Businessman` leads.
- Leads with `SMS Sent` have log-odds higher by 1.79 than those with no SMS sent.
- Leads with `Do Not Email` have log-odds lower by 1.54 than leads who would like email updates.
- Leads with `Unknown Occupation` have log-odds lower by 1.27 than `Businessman` leads.
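Because the model is linear in log-odds, each coefficient multiplies the conversion odds by exp(coefficient); a quick sketch converting the fitted coefficients above into odds multipliers:
import numpy as np
# odds multipliers implied by the fitted log-odds coefficients
for name, coef in [('Welingak Website', 5.3886), ('Reference', 3.0246),
                   ('Working Professional', 2.3769), ('SMS Sent', 1.7923)]:
    print(name, ': odds x', round(np.exp(coef), 1))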
Lead Scores: a score sheet for X Education can be generated by running the corresponding cell in the analysis notebook. At an optimum cut-off probability of 0.36, model performance is as follows.
Model performance on the training and test sets, together with the KS statistic, gain and lift charts, is reported in the corresponding sections of the notebook below.
Note: Columns like `Tags`, where the classes are not mutually exclusive, have been dropped.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import warnings
warnings.filterwarnings('ignore')
!pip install tabulate
from tabulate import tabulate
import sidetable
!jt -t grade3 -f roboto -fs 12 -cellw 100%
# to table-print a dataframe
def tab(ser):
    print(tabulate(pd.DataFrame(ser), headers='keys', tablefmt="psql"))
# importing the dataset
leads = pd.read_csv('./leads.csv')
# Inspecting few column heads at a time
for i in range(0, leads.shape[1], 5):
    if i+4 <= leads.shape[1]:
        print('Columns : ', i, ' to ', i+4)
    else:
        print('Columns : ', i, ' to last')
    tab(leads.iloc[:, i:i+5].head())
    print('\n')
# dataset information
leads.info()
The `Converted` column indicates whether the particular lead was converted to a client. This is our target variable.
# 'Converted' is a binary categorical variable but info() shows it as `int64`. Converting to `category` data type
leads['Converted'] = leads['Converted'].astype('category')
# Checking for any duplicate leads / prospects
duplicate_prospect_ids = leads['Prospect ID'].duplicated().sum()
duplicate_lead_no = leads['Lead Number'].duplicated().sum()
print('No of Duplicate Prospect IDs : ', duplicate_prospect_ids)
print('No of Duplicate Lead Nos : ', duplicate_lead_no)
# Popping Prospect ID and Lead Number columns for later use
prospect_ids = leads.pop('Prospect ID')
lead_no = leads.pop('Lead Number')
# Null values in each Column
nulls = pd.DataFrame(100*leads.isnull().sum()/leads.shape[0])
nulls.columns = ['Null Percentage']
# Sorting null percentages in descending order and highlighting null % > 45
nulls[nulls['Null Percentage'] !=0].sort_values(by ='Null Percentage', ascending=False).style.applymap(lambda x : 'color : red' if x > 45 else '')
`Lead Quality`, `Asymmetrique Profile Score`, `Asymmetrique Activity Score`, `Asymmetrique Profile Index` and `Asymmetrique Activity Index` have more than 45% missing values.
# Dropping columns with null percentage > 45
high_null_col = nulls[nulls['Null Percentage'] >=45].index
leads.drop(columns=high_null_col, inplace=True)
# Rows Missing Target Variable
print('Number of rows with missing Target Variable : ',leads['Converted'].isnull().sum())
# Rows missing more than 50% of values
highNullRowsCondition = leads.isnull().sum(axis=1)/leads.shape[1] > 0.5
leads[highNullRowsCondition].index
# Categorical columns
condition = leads.dtypes == 'object'
categoricalColumns = leads.dtypes[condition].index.values
categoricalColumns
# value counts of each label in a categorical feature
def cat_value_counts(column_name):
    '''
    Prints unique values and value counts of each label in a categorical column.
    '''
    print(tabulate(pd.DataFrame(leads.stb.freq([column_name])), headers='keys', tablefmt='psql'))
    print(pd.DataFrame(leads[column_name]).stb.missing(), '\n\n\n')
# Looking at value counts of each label in categorical variables
for col in sorted(categoricalColumns):
    print(col)
    cat_value_counts(col)
Several columns contain the label `Select`, which is a disguised missing value: `Select` is the default option in online forms, so this value likely means the lead hasn't selected any option. We replace it with `np.nan`. The affected columns are `Specialization`, `Lead Profile`, `City` and `How did you hear about X Education`.
# Replacing Select with NaN value
leads.replace({'Select' : np.nan},inplace=True)
# Looking at Missing Values again
# Null values in each Column
nulls = pd.DataFrame(100*leads.isnull().sum()/leads.shape[0])
nulls.columns = ['Null Percentage']
# Sorting null percentages in descending order and highlighting null % > 50
nulls[nulls['Null Percentage'] !=0].sort_values(by ='Null Percentage', ascending=False).style.applymap(lambda x : 'color : red' if x > 50 else '')
`Lead Profile` & `How did you hear about X Education` have a very high percentage of nulls. Let's drop these columns.
leads.drop(columns=['Lead Profile','How did you hear about X Education'], inplace=True)
# Sorting null percentages in ascending order and highlighting null % < 16
def lowNulls():
    nulls = pd.DataFrame(100*leads.isnull().sum()/leads.shape[0])
    nulls.columns = ['Null Percentage']
    return nulls[nulls['Null Percentage'] != 0].sort_values(by='Null Percentage', ascending=True).style.applymap(lambda x: 'color : green' if x < 16 else '')
lowNulls()
# Country Imputation :
leads.stb.freq(['Country'])
# Imputing missing values in Country feature with "India"
leads['Country'].fillna('India', inplace=True)
The `Specialization` feature has 36% missing values.
# Imputing null values by filling with "No Specialization"
leads['Specialization'].fillna("No Specialization",inplace=True)
print('Missing values in Specialization feature ', leads['Specialization'].isnull().sum())
leads['Specialization'].value_counts()
A large share of values in the `City` column are missing.
# Imputation of missing cities
leads.stb.freq(['City'])
# Missing Cities vs Country
condition_india = leads['Country'] == 'India'
print('Total Missing City values :', leads['City'].isnull().sum())
print('Missing City values in leads from India : ',leads.loc[condition_india,'City'].isnull().sum())
Within the known values of the `City` feature, 60% of the leads come from Mumbai. We replace the missing `City` values for leads from India with `Mumbai`.
# Replacing Null Cities in India with Mumbai
condition = (leads['City'].isnull()) & condition_india
leads.loc[condition,'City'] = 'Mumbai'
Some values in the `What is your current occupation` column are missing.
tab(leads.stb.freq(['What is your current occupation']))
tab(leads['What is your current occupation'].reset_index().stb.missing())
We fill the missing values in `What is your current occupation` with 'Unknown Occupation'.
leads['What is your current occupation'].fillna('Unknown Occupation', inplace=True)
# Missing Values in `What matters most to you in choosing a course`
ftr = 'What matters most to you in choosing a course'
tab(leads.stb.freq([ftr]))
tab(leads[ftr].reset_index().stb.missing())
Almost all leads chose `Better Career Prospects`, excluding leads who haven't filled in this feature. We fill the missing values in `What matters most to you in choosing a course` with 'Unknown Target' for now.
# Filling missing values with 'Unknown Target'
leads[ftr].fillna('Unknown Target', inplace=True)
# Looking at labels in each Categorical Variable to check for incorrect labels.
categoricalFeatures = leads.dtypes[leads.dtypes == 'object'].index.values
print('Categorical Features : ', categoricalFeatures,'\n\n')
for feature in categoricalFeatures :
print('Levels in ',feature,' are ' , leads[feature].unique(),'\n\n')
# Replacing 'google' with 'Google'
leads['Lead Source']=leads['Lead Source'].str.replace("google","Google")
# Missing Values and Value Counts for all categorical Variables
tab(leads.stb.missing())
print('Value Counts of each Feature : \n')
for feature in sorted(categoricalFeatures) :
tab(leads.stb.freq([feature]))
# Dropping columns having only one label - since these do not explain any variability in the dataset
invariableCol = ['Digital Advertisement','Do Not Call','Get updates on DM Content','Magazine','Newspaper','Newspaper Article','Receive More Updates About Our Courses','Search',
'Update me on Supply Chain Content','Through Recommendations',
'I agree to pay the amount through cheque',"What matters most to you in choosing a course",'X Education Forums']
leads.drop(columns=invariableCol, inplace=True)
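As a sanity check on the hand-curated list above, strictly single-label columns can also be detected programmatically; a sketch (near-constant columns dominated by one label, like several of those dropped above, would still need a frequency check such as `leads.stb.freq`):
# columns whose non-null values are all identical
single_label_cols = [col for col in leads.columns if leads[col].nunique(dropna=True) <= 1]
print('Single-label columns :', single_label_cols)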
# Tags feature
tab(leads.stb.freq(['Tags']))
The `Tags` column shows remarks by the sales team. This is a subjective variable based on the judgement of the team and cannot be used for analysis, since the labels might change or might not always be available.
# dropping Tags feature
leads.drop(columns=['Tags'], inplace=True)
`Last Notable Activity` & `Last Activity` seem to have similar levels.
# Last Notable Activity vs Last Activity
leads_copy = leads.copy()
leads_copy['Converted'] = leads_copy['Converted'].astype('int')
tab(leads_copy.stb.freq(['Last Notable Activity'], value='Converted'))
tab(leads_copy.stb.freq(['Last Activity'], value='Converted'))
`Last Activity` has more levels than `Last Notable Activity`, and `Last Notable Activity` seems to be a column derived by the sales team from `Last Activity`. We drop `Last Notable Activity`.
leads.drop(columns=['Last Notable Activity'], inplace=True)
# Looking at Missing Values again
leads.stb.missing()
leads.dropna(inplace=True)
leads.stb.missing()
# Country distribution
leads.stb.freq(['Country'])
# Grouping Countries with very low lead count into 'Outside India'
leadsByCountry = leads['Country'].value_counts(normalize=True)
lowLeadCountries = leadsByCountry[leadsByCountry <= 0.01].index
leads['Country'].replace(lowLeadCountries,'Outside India',inplace=True)
leads.stb.freq(['Country'])
feature = 'Lead Origin'
leads.stb.freq([feature])
`Lead Add Form` and `Lead Import` are less than 1% of all origins.
# Grouping lead origins
leadOriginsToGroup = ["Lead Add Form","Lead Import"]
leads[feature] = leads[feature].replace(leadOriginsToGroup, ['Other Lead Origins']*2)
leads.stb.freq([feature])
feature = 'Lead Source'
leads.stb.freq([feature])
# Grouping lead Sources
labelCounts = leads[feature].value_counts(normalize=True)
# labels with less than 1% contribution
labelsToGroup = labelCounts[labelCounts < 0.01].index.values
leads[feature] = leads[feature].replace(labelsToGroup, ['Other '+feature+'s']*len(labelsToGroup))
leads.stb.freq([feature])
feature = 'Last Activity'
leads.stb.freq([feature])
# Grouping Last Activity
labelCounts = leads[feature].value_counts(normalize=True)
# labels with less than 1% contribution
labelsToGroup = labelCounts[labelCounts < 0.01].index.values
leads[feature] = leads[feature].replace(labelsToGroup, ['Other '+feature]*len(labelsToGroup))
leads.stb.freq([feature])
feature = 'Specialization'
leads.stb.freq([feature])
# Grouping Specialization
labelCounts = leads[feature].value_counts(normalize=True)
# labels with less than ~1.2% contribution
labelsToGroup = labelCounts[labelCounts <= 0.012121].index.values
leads[feature] = leads[feature].replace(labelsToGroup, ['Other '+feature]*len(labelsToGroup))
leads.stb.freq([feature])
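The same rare-label grouping pattern has now been applied to `Country`, `Lead Origin`, `Lead Source`, `Last Activity` and `Specialization`; a small helper (a sketch; `group_rare_labels` is not part of the original notebook) would consolidate it:
def group_rare_labels(series, threshold=0.01, other_label=None):
    '''Replace labels whose relative frequency is below `threshold` with one "other" label.'''
    freqs = series.value_counts(normalize=True)
    rare = freqs[freqs < threshold].index
    other = other_label or ('Other ' + (series.name or 'labels'))
    return series.replace(rare, other)

# e.g. leads['Lead Source'] = group_rare_labels(leads['Lead Source'], threshold=0.01)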
# Columns retained
print('Retained Columns\n\n', leads.columns.values)
# Retained rows
print('Retained rows : ',leads.shape[0])
print("Ratio of retained rows", 100*leads.shape[0]/9240)
leads.stb.freq(['Converted'])
converted_cond = leads['Converted'] == 1
imbalance = leads[converted_cond].shape[0]/leads[~converted_cond].shape[0]
print('Class Imbalance : Converted /Un-converted =', np.round(imbalance,3))
def categoricalUAn(column, figsize=[8, 8]):
    '''Function for categorical univariate analysis'''
    print('Types of ' + column)
    tab(leads.stb.freq([column]))
    converted = leads[leads['Converted'] == 1]
    unconverted = leads[leads['Converted'] == 0]
    print(column + ' for Converted Leads')
    tab(converted.stb.freq([column]))
    print(column + ' for Un-Converted Leads')
    tab(unconverted.stb.freq([column]))
    print(column + ' vs Conversion Rate')
    tab((converted[column].value_counts()) / (converted[column].value_counts() + unconverted[column].value_counts()))
    # bar plot
    plt.figure(figsize=figsize)
    ax = sns.countplot(y=column, hue='Converted', data=leads)
    title = column + ' vs Lead Conversion'
    ax.set(title=title)
column = 'Lead Origin'
categoricalUAn(column,figsize=[8,8])
`Landing Page Submission` followed by `API` make up 93% of all leads.
column = 'Lead Source'
categoricalUAn(column,figsize=[8,8])
Most leads come from `Google` (31%), followed by `Direct Traffic` (28%) and `Olark Chat` (19%). Leads from `Reference` have a very high conversion rate (91%).
feature = 'Do Not Email'
categoricalUAn(feature,figsize=[8,8])
The vast majority of leads have `Do Not Email` = No.
# 'Last Activity'
feature = 'Last Activity'
categoricalUAn(feature,figsize=[8,8])
Converted to Lead
feature = 'Country'
categoricalUAn(feature,figsize=[8,8])
feature = 'Specialization'
categoricalUAn(feature)
feature = 'What is your current occupation'
categoricalUAn(feature,figsize=[8,8])
feature = 'City'
categoricalUAn(feature)
feature = 'A free copy of Mastering The Interview'
categoricalUAn(feature)
def num_univariate_analysis(column_name, scale='linear'):
    '''Function for numerical univariate analysis'''
    converted = leads[leads['Converted'] == 1]
    unconverted = leads[leads['Converted'] == 0]
    plt.figure(figsize=(8, 6))
    ax = sns.boxplot(x=column_name, y='Converted', data=leads)
    title = 'Boxplot of ' + column_name + ' vs Conversion'
    ax.set(title=title)
    if scale == 'log':
        ax.set_xscale('log')
        ax.set(xlabel=column_name + ' (Log Scale)')
    print('Spread for range of ' + column_name + ' that were Converted')
    tab(converted[column_name].describe())
    print('Spread for range of ' + column_name + ' that were not converted')
    tab(unconverted[column_name].describe())
column_name = 'TotalVisits'
num_univariate_analysis(column_name,scale='log')
`TotalVisits` has a lot of outliers among both converted and un-converted leads.
# Looking at quantiles
tab(leads[column_name].quantile(np.linspace(.90,1,20)))
We handle these outliers with soft range capping.
# Capping outliers to the 99th percentile value
cap = leads[column_name].quantile(.99)
condition = leads[column_name] > cap
leads.loc[condition, column_name] = cap
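The same 99th-percentile cap is applied to two more columns below; a small helper (a sketch; `cap_at_percentile` is not in the original notebook) keeps the rule in one place:
def cap_at_percentile(df, column, q=0.99):
    '''Soft range capping: clip values in `column` at its q-th quantile.'''
    cap = df[column].quantile(q)
    df.loc[df[column] > cap, column] = cap
    return df

# e.g. leads = cap_at_percentile(leads, 'Page Views Per Visit')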
column = 'Total Time Spent on Website'
num_univariate_analysis(column)
tab(leads[column].quantile(np.linspace(0.75,1,25)))
leads[column].quantile(np.linspace(0.75,1,50)).plot()
# Capping `Total Time Spent on Website` values to 99th percentile
cap = leads[column].quantile(.99)
condition = leads[column] > cap
leads.loc[condition, column] = cap
column = 'Page Views Per Visit'
num_univariate_analysis(column)
leads[column].quantile(np.linspace(0.75,1,30)).plot()
# Capping `Page Views Per Visit` values to 99th percentile
cap = leads[column].quantile(.99)
condition = leads[column] > cap
leads.loc[condition, column] = cap
leads.columns.values
continuous_vars = ['TotalVisits', 'Page Views Per Visit', 'Total Time Spent on Website']
plt.figure(figsize=[8,8])
sns.barplot(x=continuous_vars[0], y = 'A free copy of Mastering The Interview', data=leads, hue='Converted')
# sns.barplot(x='Lead Source', y = 'Country', hue='Converted', data=leads)
leads.groupby(['Country','Lead Source'])['Converted'].value_counts(normalize=True)\
.unstack()\
.plot(
layout=(2,2),
figsize=(14,12), kind='barh', stacked=True);
x = "What is your current occupation"
y = 'City'
leads.groupby([x,y])['Converted'].value_counts(normalize=True)\
.unstack()\
.plot(
layout=(2,2),
figsize=(14,12), kind='barh', stacked=True);
x = "Country"
y = 'Last Activity'
leads.groupby([x,y])['Converted'].value_counts(normalize=True)\
.unstack()\
.plot(
layout=(2,2),
figsize=(14,12), kind='barh', stacked=True);
binary_var = ['Do Not Email', 'A free copy of Mastering The Interview']
leads[binary_var] = leads[binary_var].replace({'Yes' : 1, 'No' : 0})
categoricalCol = ['Lead Origin', 'Lead Source','Last Activity', 'Country', 'Specialization',
'What is your current occupation', 'City']
print('Levels in Each Categorical Variable\n')
for col in sorted(categoricalCol):
    print(col, leads[col].unique(), '\n')
# Creating dummy variables
leadOriginDummies = pd.get_dummies(leads['Lead Origin'], drop_first=True)
leadSourceDummies = pd.get_dummies(leads['Lead Source'], drop_first=True)
lastActivityDummies = pd.get_dummies(leads['Last Activity'], drop_first=True)
countryDummies = pd.get_dummies(leads['Country'] ,drop_first=True)
specDummies = pd.get_dummies(leads['Specialization'],drop_first=True)
occupationDummies = pd.get_dummies(leads[ 'What is your current occupation'],drop_first=True)
cityDummies = pd.get_dummies(leads[ 'City'],drop_first=True)
# adding dummy variables to leads dataframe
leads = pd.concat([leads, leadOriginDummies,leadSourceDummies,lastActivityDummies, countryDummies, specDummies, occupationDummies, cityDummies], axis=1)
# dropping categorical columns
leads.drop(columns = categoricalCol, inplace=True)
print('Final Columns')
leads.columns
# Top Correlations
def correlation(dataframe):
    '''Returns feature pairs sorted by absolute correlation (upper triangle only).'''
    cor0 = dataframe.corr()
    # keep only the upper triangle so each pair appears once
    cor0 = cor0.where(np.triu(np.ones(cor0.shape), k=1).astype(bool))
    cor0 = cor0.unstack().reset_index()
    cor0.columns = ['VAR1', 'VAR2', 'CORR']
    cor0.dropna(subset=['CORR'], inplace=True)
    cor0.CORR = round(cor0['CORR'], 2)
    cor0.CORR = cor0.CORR.abs()
    cor0 = cor0[~(cor0['VAR1'] == cor0['VAR2'])]
    return pd.DataFrame(cor0.sort_values(by=['CORR'], ascending=False))
#Correlations for Converted Leads
convertedCondition= leads['Converted']==1
print('Correlations for Converted Leads')
correlation(leads[convertedCondition])[:15].style.background_gradient(cmap='GnBu').hide_index()
#Correlations for un-Converted Leads
unconvertedCondition=leads['Converted']==0
print('Correlations for Non-Converted Leads')
correlation(leads[unconvertedCondition])[:15].style.background_gradient(cmap='GnBu').hide_index()
`Unknown Occupation` and `Unemployed` are highly correlated for non-converted leads.
from sklearn.model_selection import train_test_split
y = leads.pop('Converted')
X = leads
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=100)
continuous_vars
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# fitting and transforming train set
X_train[continuous_vars] = scaler.fit_transform(X_train[continuous_vars])
# Transforming test set for later use
X_test[continuous_vars] = scaler.transform(X_test[continuous_vars])
print('No of features : ', len(X_train.columns))
# RFE
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
minFeatures = 25
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=minFeatures)
rfe = rfe.fit(X_train, y_train)
# Columns selected by RFE :
RFE_features = pd.DataFrame( {'feature' : X_train.columns, 'rank' : rfe.ranking_, 'support' : rfe.support_})
condition = RFE_features['support'] == True
rfe_features = RFE_features[condition].sort_values(by='rank',ascending=True )
print('Features selected by RFE\n')
rfe_features
rfeFeatures = rfe_features['feature'].values
### Multicollinearity
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
def vif(X):
    '''Print the VIF of each feature (a constant is added so VIFs are intercept-adjusted).'''
    df = sm.add_constant(X)
    vifs = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
    vif_frame = pd.DataFrame({'vif': vifs}, index=df.columns).reset_index()
    tab(vif_frame.sort_values(by='vif', ascending=False))
# Model 1
import statsmodels.api as sm
features = rfe_features['feature'].values
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
print("VIF for X_train")
vif(X_train)
logm1.fit().summary()
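Models 2 through 11 below repeat the same cycle: drop one feature, refit, re-check VIFs and p-values. A small wrapper (a sketch, not part of the original notebook) expresses one elimination step:
def drop_and_refit(X, y, column_to_remove):
    '''One backward-elimination step: drop a feature, refit the GLM, report VIF and summary.'''
    X = X[X.columns[X.columns != column_to_remove]]
    model = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial()).fit()
    print('VIF after removing', column_to_remove)
    vif(X)
    print(model.summary())
    return X, model

# e.g. X_train, logm2 = drop_and_refit(X_train, y_train, 'Unemployed')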
`Unemployed` has the highest VIF. Let's drop this feature.
# Model 2 : Removing `Unemployed`
column_to_remove = 'Unemployed'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
print("VIF for X_train")
vif(X_train)
logm1.fit().summary()
`Other Lead Origins` has a very high VIF.
# Model 3 : Removing `Other Lead Origins`
column_to_remove = 'Other Lead Origins'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
print("VIF for X_train")
vif(X_train)
logm1.fit().summary()
`Housewife` has a high p-value, so its coefficient is insignificant. Let's drop it.
# Model 4 : Removing `Housewife`
column_to_remove = 'Housewife'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
print("VIF for X_train")
vif(X_train)
logm1.fit().summary()
`Student` has a p-value above 0.05, the highest among all p-values. Let's drop this feature.
# Model 5 : Removing `Student`
column_to_remove = 'Student'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
print("VIF for X_train")
vif(X_train)
logm1.fit().summary()
`Tier II Cities` has a p-value above the significance level and the highest among all p-values.
# Model 6 : Removing `Tier II Cities`
column_to_remove = 'Tier II Cities'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
print("VIF for X_train")
vif(X_train)
logm1.fit().summary()
`Page Views Per Visit` has a high p-value. Let's eliminate it.
# Model 7 : Removing `Page Views Per Visit`
column_to_remove = 'Page Views Per Visit'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
print("VIF for X_train")
vif(X_train)
logm1.fit().summary()
`Media and Advertising` has a high p-value. Let's drop this feature.
# Model 8 : Removing `Media and Advertising`
column_to_remove = 'Media and Advertising'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
print("VIF for X_train")
vif(X_train)
logm1.fit().summary()
`Email Link Clicked` has a high p-value of 0.059. Let's drop this feature.
# Model 9 : Removing `Email Link Clicked`
column_to_remove = 'Email Link Clicked'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
print("VIF for X_train")
vif(X_train)
logm1.fit().summary()
`TotalVisits` has the smallest coefficient. Let's drop it.
# Model 10 : Removing `TotalVisits`
column_to_remove = 'TotalVisits'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
logm1 = logm1.fit()
print("VIF for X_train")
vif(X_train)
logm1.summary()
`Other Lead Sources` has a high p-value. Let's drop this variable.
# Model 11 : Removing `Other Lead Sources`
column_to_remove = 'Other Lead Sources'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm_final = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
logm_final = logm_final.fit()
print("VIF for X_train")
vif(X_train)
logm_final.summary()
finalFeatures = X_train.columns.values
print('The final features for modelling are :', finalFeatures)
X_train_sm = sm.add_constant(X_train)
y_train_pred = logm_final.predict(X_train_sm)
# Creating a data frame with converted vs converted probabilities
y_train_pred_final = pd.DataFrame({'Converted':y_train.values, 'Converted_Prob':y_train_pred})
y_train_pred_final['CustID'] = y_train.index
y_train_pred_final.head(10)
#Creating new column 'predicted' with 1 if Converted_Prob > 0.5 else 0
y_train_pred_final['predicted'] = y_train_pred_final.Converted_Prob.map(lambda x: 1 if x > 0.5 else 0)
# Let's see the head
y_train_pred_final.head(10)
from sklearn import metrics
# Confusion matrix
confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.predicted )
print(confusion)
Confusion Matrix for Train Set

Actual \ Predicted | Not Converted | Converted |
---|---|---|
Not Converted | 3462 | 455 |
Converted | 699 | 1693 |
# Let's check the overall accuracy.
accuracy = metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.predicted)
print('Accuracy on Train set : ', round(100*accuracy,3),'%')
TP = confusion[1,1] # true positive
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
sensitivity = TP/(FN + TP)
specificity = TN/(FP + TN)
falsePositiveRate = FP/(FP + TN)
positivePredictivePower = TP/(TP +FP )
negativePredictivePower = TN/(TN + FN)
print('sensitivity / Recall: ', round(100*sensitivity,3),'%')
print('specificity : ', round(100*specificity,3),'%')
print('False Positive Rate : ', round(100*falsePositiveRate,3),'%')
print('Precision / Positive Predictive Power : ', round(100*positivePredictivePower,3),'%')
print('Negative Predictive Power : ', round(100*negativePredictivePower,3),'%')
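As a cross-check (optional; not part of the original flow), sklearn's `classification_report` reproduces the per-class precision and recall in one call:
from sklearn.metrics import classification_report
# per-class precision / recall / f1 at the default 0.5 cutoff
print(classification_report(y_train_pred_final.Converted, y_train_pred_final.predicted))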
def draw_roc(actual, probs):
    fpr, tpr, thresholds = metrics.roc_curve(actual, probs, drop_intermediate=False)
    auc_score = metrics.roc_auc_score(actual, probs)
    plt.figure(figsize=(5, 5))
    plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.show()
    return None
draw_roc(y_train_pred_final.Converted, y_train_pred_final.Converted_Prob)
# Let's create columns with different probability cutoffs
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_train_pred_final[i] = y_train_pred_final.Converted_Prob.map(lambda x: 1 if x > i else 0)
y_train_pred_final.head()
# Now let's calculate accuracy sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])
# TP = confusion[1,1] # true positive
# TN = confusion[0,0] # true negatives
# FP = confusion[0,1] # false positives
# FN = confusion[1,0] # false negatives
num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final[i])
    total1 = sum(sum(cm1))
    accuracy = (cm1[0,0] + cm1[1,1])/total1
    speci = cm1[0,0]/(cm1[0,0] + cm1[0,1])
    sensi = cm1[1,1]/(cm1[1,0] + cm1[1,1])
    cutoff_df.loc[i] = [i, accuracy, sensi, speci]
print(cutoff_df)
# Let's plot accuracy sensitivity and specificity for various cutoff probabilities.
fig,ax = plt.subplots()
fig.set_figwidth(30)
fig.set_figheight(10)
plots=['accuracy','sensi','speci']
ax.set_xticks(np.linspace(0,1,50))
ax.set_title('Finding Optimal Cutoff')
sns.lineplot(x='prob',y=plots[0] , data=cutoff_df,ax=ax)
sns.lineplot(x='prob',y=plots[1] , data=cutoff_df,ax=ax)
sns.lineplot(x='prob',y=plots[2] , data=cutoff_df,ax=ax)
ax.set_xlabel('Probabilites')
ax.set_ylabel('Accuracy,Sensitivity,Specificity')
ax.legend(["Accuracy",'Sensitivity','Specificity'])
# cutoff_df.plot.line(, figure=[10,10])
plt.show()
y_train_pred_final['final_predicted'] = y_train_pred_final.Converted_Prob.map( lambda x: 1 if x > 0.36 else 0)
y_train_pred_final.head()
# Let's check the overall accuracy.
accu = metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)
print('Accuracy on Train set at Optimum Cut Off : ', round(100*accu,3),'%')
confusion2 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.final_predicted )
confusion2
TP = confusion2[1,1] # true positive
TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives
sensitivity = TP/(FN + TP)
specificity = TN/(FP + TN)
falsePositiveRate = FP/(FP + TN)
positivePredictivePower = TP/(TP +FP )
negativePredictivePower = TN/(TN + FN)
print('sensitivity / Recall: ', round(100*sensitivity,3),'%')
print('specificity : ', round(100*specificity,3),'%')
print('False Positive Rate : ', round(100*falsePositiveRate,3),'%')
print('Precision / Positive Predictive Power : ', round(100*positivePredictivePower,3),'%')
print('Negative Predictive Power : ', round(100*negativePredictivePower,3),'%')
## ROC curve for cut off probability of 0.36
draw_roc(y_train_pred_final.Converted, y_train_pred_final.final_predicted)
#Looking at the confusion matrix again
confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.predicted )
confusion
print('Precision :', confusion[1,1]/(confusion[0,1]+confusion[1,1]))
print('Recall :', confusion[1,1]/(confusion[1,0]+confusion[1,1]))
#Doing the same using the sklearn.
from sklearn.metrics import precision_score, recall_score
print('Precision : ', precision_score(y_train_pred_final.Converted, y_train_pred_final.predicted))
print('Recall :', recall_score(y_train_pred_final.Converted, y_train_pred_final.predicted))
from sklearn.metrics import precision_recall_curve
p, r, thresholds = precision_recall_curve(y_train_pred_final.Converted, y_train_pred_final.Converted_Prob)
plt.plot(thresholds, p[:-1], "g-")
plt.plot(thresholds, r[:-1], "r-")
plt.show()
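The crossover of the precision and recall curves suggests an alternative cut-off; a short sketch (not in the original notebook) to read it off numerically:
# threshold at which precision and recall are (approximately) equal
crossover_idx = np.argmin(np.abs(p[:-1] - r[:-1]))
print('Precision ~ Recall at threshold :', round(thresholds[crossover_idx], 3))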
X_test_sm = sm.add_constant(X_test[finalFeatures])
y_test_pred = logm_final.predict(X_test_sm)
# predicted conversions vs actual conversions and customer ID
y_test_predictions = pd.DataFrame({'Converted' :y_test, 'Conversion Probability' : y_test_pred, 'CustID' : y_test.index})
y_test_predictions.head()
# predictions with optimal cut-off = 0.36
cutoff = 0.36
y_test_predictions['Predicted'] = y_test_predictions[
'Conversion Probability'
].map(lambda x : 1 if x > cutoff else 0 )
confusion = metrics.confusion_matrix(y_test_predictions['Converted'], y_test_predictions['Predicted'])
confusion
TP = confusion[1,1] # true positive
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
print('Accuracy on Test set : ', round(100*(TP + TN)/(TP + TN + FP + FN),3),'%')
sensitivity = TP/(FN + TP)
specificity = TN/(FP + TN)
falsePositiveRate = FP/(FP + TN)
positivePredictivePower = TP/(TP +FP )
negativePredictivePower = TN/(TN + FN)
print('sensitivity / Recall: ', round(100*sensitivity,3),'%')
print('specificity : ', round(100*specificity,3),'%')
print('False Positive Rate : ', round(100*falsePositiveRate,3),'%')
print('Precision / Positive Predictive Power : ', round(100*positivePredictivePower,3),'%')
print('Negative Predictive Power : ', round(100*negativePredictivePower,3),'%')
## ROC curve for cut-off probability of 0.36
draw_roc(y_test_predictions['Converted'],y_test_predictions['Predicted'])
# merging final predictions with leads dataset
conversionProb = pd.concat([y_test_predictions['Conversion Probability'],y_train_pred_final['Converted_Prob']],axis=0)
conversionProb = pd.DataFrame({'Conversion Probability' : conversionProb}, index=conversionProb.index)
leads = pd.concat([leads,conversionProb],axis=1)
leads['Prospect ID'] = prospect_ids
leads['Lead No'] = lead_no
leads['Converted'] = y
# Verifying prediction accuracy
leads['Predicted'] = leads['Conversion Probability'].map(lambda x : 1 if x > 0.36 else 0)
confusion = metrics.confusion_matrix(leads['Converted'], leads['Predicted'])
TP = confusion[1,1] # true positive
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
acc = metrics.accuracy_score(leads['Converted'], leads['Predicted'])
print('Accuracy : ', round(100*acc,3),'%')
sensitivity = TP/(FN + TP)
specificity = TN/(FP + TN)
falsePositiveRate = FP/(FP + TN)
falseNegativeRate = FN/(FN + TP)
positivePredictivePower = TP/(TP +FP )
negativePredictivePower = TN/(TN + FN)
print('sensitivity : ', round(100*sensitivity,3),'%')
print('specificity : ', round(100*specificity,3),'%')
print('False Positive Rate : ', round(100*falsePositiveRate,3),'%')
print('False Negative Rate : ', round(100*falseNegativeRate,3),'%')
print('Positive Predictive Power / Precision : ', round(100*positivePredictivePower,3),'%')
print('Negative Predictive Power : ', round(100*negativePredictivePower,3),'%')
## ROC curve
draw_roc(leads['Converted'], leads['Predicted'])
# Lead Scores
leads['Lead Score'] = leads['Conversion Probability']*100
leads[['Prospect ID','Lead No','Lead Score']].sort_values(by='Lead Score', ascending=False)[:10]
# Run the following to generate a sheet containing lead information provided by the company and corresponding scores
leads.to_csv('lead_scores.csv')
# Gain Chart
y_test_predictions = y_test_predictions.sort_values(by='Conversion Probability', ascending=False)
y_test_predictions['decile'] = pd.qcut(y_test_predictions['Conversion Probability'],10,labels=range(10,0,-1))
y_test_predictions['Converted'] = y_test_predictions['Converted'].astype('int')
y_test_predictions['Un Converted'] = 1 - y_test_predictions['Converted']
y_test_predictions.head()
df1 = pd.pivot_table(data=y_test_predictions,index=['decile'],values=['Converted','Un Converted','Conversion Probability'],
aggfunc={'Converted':[np.sum],
'Un Converted':[np.sum],
'Conversion Probability' : [np.min,np.max]})
df1 = df1.reset_index()
df1.columns = ['Decile','Max Prob', 'Min Prob','Converted Count','Un Converted Count']
df1 = df1.sort_values(by='Decile', ascending=False)
df1['Total Leads'] = df1['Converted Count'] + df1['Un Converted Count']
df1['Conversion Rate'] = df1['Converted Count'] / df1['Total Leads']
converted_sum = df1['Converted Count'].sum()
unconverted_sum = df1['Un Converted Count'].sum()
df1['Converted %'] = df1['Converted Count'] / converted_sum
df1['Un Converted %'] = df1['Un Converted Count'] / unconverted_sum
df1.head()
df1['ks_stats'] = np.round(((df1['Converted Count'] / df1['Converted Count'].sum()).cumsum() -(df1['Un Converted Count'] / df1['Un Converted Count'].sum()).cumsum()), 4) * 100
df1
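The KS statistic is the maximum separation between the cumulative converted and un-converted distributions; reading it off the table (a sketch):
# maximum gap between the two cumulative distributions
print('KS statistic :', df1['ks_stats'].max(), '%')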
df1['Cum Conversion %'] = np.round(((df1['Converted Count'] / df1['Converted Count'].sum()).cumsum()), 4) * 100
df1
df1['Base %'] = np.arange(10,110,10)
df1 = df1.set_index('Decile')
df1
### Gain chart
plot_columns =['Base %','Cum Conversion %']
plt.plot(df1[plot_columns]);
plt.xticks(df1.index);
plt.title('Gain chart');
plt.xlabel('Decile')
plt.ylabel('Cumulative Conversion %')
plt.legend(('Our Model','Random Model'));
df1['Lift'] = df1['Cum Conversion %'] / df1['Base %']
df1['Baseline'] = 1
df1
# Lift chart
plot_columns =['Lift', 'Baseline']
plt.plot(df1[plot_columns]);
plt.xticks(df1.index);
plt.title('Lift chart');
plt.xlabel('Decile')
plt.ylabel('Lift')
plt.legend(('Our Model','Random Model'));