An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses.
The company markets its courses on several websites and search engines such as Google. Once these professionals land on the website, they might browse the courses, fill out a form for a course, or watch some videos. When they fill out a form providing their email address or phone number, they are classified as a lead. The company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, and so on. Through this process some of the leads get converted, while most do not. The typical lead conversion rate at X Education is around 30%.
Now, although X Education gets a lot of leads, its lead conversion rate is very poor: if it acquires, say, 100 leads in a day, only about 30 of them are converted. To make this process more efficient, the company wishes to identify the most promising leads, also known as 'Hot Leads'. If this set of leads is identified successfully, the lead conversion rate should go up, as the sales team will focus on communicating with the potential leads rather than calling everyone. A typical lead conversion process can be pictured as a funnel: many leads enter at the top, and only a small fraction emerge as paying customers at the bottom.
X Education has appointed us to help select the most promising leads, i.e. the leads most likely to convert into paying customers. The company requires a model that assigns a lead score to each lead, such that customers with a higher lead score have a higher chance of conversion and those with a lower lead score have a lower chance. The CEO, in particular, has given a ballpark target lead conversion rate of around 80%.
Lead scoring is a class probability estimation problem, a form of classification problem. The target variable in the data set has two classes: 0 (Un-converted) and 1 (Converted). The objective is to model the probability p that each lead belongs to the Converted class. Since there are just two classes, the probability of belonging to the Un-converted class is (1 - p). The relationship between each lead's conversion probability and its characteristics is modelled using Logistic Regression, and the leads are scored on a scale of 0-100, with 100 being the most probable conversion candidate.
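For concreteness, a minimal sketch of how a modelled probability becomes a 0-100 lead score (illustrative only; the actual coefficients are estimated below with Logistic Regression):
import numpy as np

def lead_score(log_odds):
    '''Map a lead's log-odds to a conversion probability p, then to a 0-100 score.'''
    p = 1 / (1 + np.exp(-log_odds))   # inverse logit
    return round(100 * p)

# e.g. log-odds of 0.8 -> p ~ 0.69 -> lead score ~ 69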
The final solution has been provided in two parts.
1. Scoring the leads provided by the company in the order of probability of conversion (0-100)
2. Insights into the relationship between the characteristics of a lead and the log-odds of conversion, which could help the company score leads in the future.
A logistic regression model is created using lead features. To arrive at the list of features which significantly affect conversion probability, a mixed feature elimination approach is followed: the 25 most important features are obtained through Recursive Feature Elimination (RFE) and then reduced to 15 via a p-value / VIF approach. The dataset is randomly divided into training and test sets (70:30 split).
The final relationship between the log-odds of conversion probability and the lead features is:

$$
\begin{aligned}
\text{logOdds(Conversion)} = -0.6469 &- 1.5426 \cdot \text{Do Not Email} - 1.2699 \cdot \text{Unknown Occupation} \\
&- 0.9057 \cdot \text{No Specialization} - 0.8704 \cdot \text{Hospitality Management} \\
&- 0.6584 \cdot \text{Outside India} + 1.7923 \cdot \text{SMS Sent} \\
&+ 1.1749 \cdot \text{Other Last Activity} + 2.3769 \cdot \text{Working Professional} \\
&- 0.8614 \cdot \text{Olark Chat Conversation} + 5.3886 \cdot \text{Welingak Website} \\
&+ 3.0246 \cdot \text{Reference} + 1.1876 \cdot \text{Olark Chat} \\
&- 1.0250 \cdot \text{Landing Page Submission} + 1.1253 \cdot \text{Total Time Spent on Website} \\
&+ 0.6106 \cdot \text{Email Opened}
\end{aligned}
$$

where Total Time Spent on Website is standardized to $\mu=0,\sigma=1$.
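To make the formula concrete, a small sketch (a hypothetical lead, not taken from the data set) evaluating the log-odds and implied probability for a working professional who was sent an SMS, with all other indicators at 0 and website time at its standardized mean:
import numpy as np
# intercept + Working Professional + SMS Sent; remaining indicators are 0 and
# Total Time Spent on Website is 0 (its standardized mean)
log_odds = -0.6469 + 2.3769 + 1.7923
p = 1 / (1 + np.exp(-log_odds))
print(round(log_odds, 4), round(p, 3))   # 3.5223 0.971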
Interpreting the top 6 features affecting conversion probability (each coefficient shifts the log-odds additively, relative to the reference level):

- Leads from the `Welingak Website` have log-odds of conversion higher by 5.39 than those from `Google`.
- Leads from `Reference` have log-odds higher by 3.02 than those from `Google`.
- `Working Professional` leads have log-odds higher by 2.38 than `Businessman` leads.
- Leads with `SMS Sent` have log-odds higher by 1.79 than those with no SMS sent.
- Leads with `Do Not Email` have log-odds lower by 1.54 than leads who would like email updates.
- Leads with `Unknown Occupation` have log-odds lower by 1.27 than `Businessman` leads.
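Because the model is linear in log-odds, each coefficient multiplies the conversion odds by exp(coefficient); a quick sketch converting the fitted coefficients above into odds multipliers:
import numpy as np
# odds multipliers implied by the fitted log-odds coefficients
for name, coef in [('Welingak Website', 5.3886), ('Reference', 3.0246),
                   ('Working Professional', 2.3769), ('SMS Sent', 1.7923)]:
    print(name, ': odds x', round(np.exp(coef), 1))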
Lead Scores: a score sheet for X Education can be generated by running the corresponding cell in the analysis notebook. At an optimum cut-off probability of 0.36, model performance is as follows.
Model performance on the training and test sets, together with the KS statistic, gain and lift charts, is reported in the corresponding sections of the notebook below.
Note: Columns like `Tags`, where the classes are not mutually exclusive, have been dropped.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import warnings
warnings.filterwarnings('ignore')
!pip install tabulate
from tabulate import tabulate
import sidetable
!jt -t grade3 -f roboto -fs 12 -cellw 100%
# to table-print a dataframe
def tab(ser):
    print(tabulate(pd.DataFrame(ser), headers='keys', tablefmt="psql"))
# importing the dataset
leads = pd.read_csv('./leads.csv')
# Inspecting few column heads at a time
for i in range(0, leads.shape[1], 5):
    if i+4 <= leads.shape[1]:
        print('Columns : ', i, ' to ', i+4)
    else:
        print('Columns : ', i, ' to last')
    tab(leads.iloc[:, i:i+5].head())
    print('\n')
# dataset information
leads.info()
The `Converted` column indicates whether the particular lead was converted to a client. This is our target variable.
# 'Converted' is a binary categorical variable but info() shows it as `int64`. Converting to `category` data type
leads['Converted'] = leads['Converted'].astype('category')
# Checking for any duplicate leads / prospects
duplicate_prospect_ids = leads['Prospect ID'].duplicated().sum()
duplicate_lead_no = leads['Lead Number'].duplicated().sum()
print('No of Duplicate Prospect IDs : ', duplicate_prospect_ids)
print('No of Duplicate Lead Nos : ', duplicate_lead_no)
# Popping Prospect ID and Lead Number columns for later use
prospect_ids = leads.pop('Prospect ID')
lead_no = leads.pop('Lead Number')
# Null values in each Column
nulls = pd.DataFrame(100*leads.isnull().sum()/leads.shape[0])
nulls.columns = ['Null Percentage']
# Sorting null percentages in descending order and highlighting null % > 45
nulls[nulls['Null Percentage'] !=0].sort_values(by ='Null Percentage', ascending=False).style.applymap(lambda x : 'color : red' if x > 45 else '')
`Lead Quality`, `Asymmetrique Profile Score`, `Asymmetrique Activity Score`, `Asymmetrique Profile Index` and `Asymmetrique Activity Index` have more than 45% missing values.
# Dropping columns with null percentage > 45
high_null_col = nulls[nulls['Null Percentage'] >=45].index
leads.drop(columns=high_null_col, inplace=True)
# Rows Missing Target Variable
print('Number of rows with missing Target Variable : ',leads['Converted'].isnull().sum())
# Rows missing more than 50% of values
highNullRowsCondition = leads.isnull().sum(axis=1)/leads.shape[1] > 0.5
leads[highNullRowsCondition].index
# Categorical columns
condition = leads.dtypes == 'object'
categoricalColumns = leads.dtypes[condition].index.values
categoricalColumns
# value counts of each label in a categorical feature
def cat_value_counts(column_name):
    '''
    Prints unique values and value counts of each label in a categorical column.
    '''
    print(tabulate(pd.DataFrame(leads.stb.freq([column_name])), headers='keys', tablefmt='psql'))
    print(pd.DataFrame(leads[column_name]).stb.missing(), '\n\n\n')
# Looking at value counts of each label in categorical variables
for col in sorted(categoricalColumns):
    print(col)
    cat_value_counts(col)
Several columns contain the label `Select`, which is a disguised missing value: `Select` is the default option in online forms, so this value likely means the lead hasn't selected any option. We replace it with `np.nan`. The affected columns are `Specialization`, `Lead Profile`, `City` and `How did you hear about X Education`.
# Replacing Select with NaN value
leads.replace({'Select' : np.nan},inplace=True)
# Looking at Missing Values again
# Null values in each Column
nulls = pd.DataFrame(100*leads.isnull().sum()/leads.shape[0])
nulls.columns = ['Null Percentage']
# Sorting null percentages in descending order and highlighting null % > 50
nulls[nulls['Null Percentage'] !=0].sort_values(by ='Null Percentage', ascending=False).style.applymap(lambda x : 'color : red' if x > 50 else '')
`Lead Profile` & `How did you hear about X Education` have a very high percentage of nulls. Let's drop these columns.
leads.drop(columns=['Lead Profile','How did you hear about X Education'], inplace=True)
# Sorting null percentages in ascending order and highlighting null % < 16
def lowNulls():
    nulls = pd.DataFrame(100*leads.isnull().sum()/leads.shape[0])
    nulls.columns = ['Null Percentage']
    return nulls[nulls['Null Percentage'] != 0].sort_values(by='Null Percentage', ascending=True).style.applymap(lambda x: 'color : green' if x < 16 else '')
lowNulls()
# Country Imputation :
leads.stb.freq(['Country'])
# Imputing missing values in Country feature with "India"
leads['Country'].fillna('India', inplace=True)
The `Specialization` feature has 36% missing values.
# Imputing null values by filling with "No Specialization"
leads['Specialization'].fillna("No Specialization",inplace=True)
print('Missing values in Specialization feature ', leads['Specialization'].isnull().sum())
leads['Specialization'].value_counts()
A large share of values in the `City` column are missing.
# Imputation of missing cities
leads.stb.freq(['City'])
# Missing Cities vs Country
condition_india = leads['Country'] == 'India'
print('Total Missing City values :', leads['City'].isnull().sum())
print('Missing City values in leads from India : ',leads.loc[condition_india,'City'].isnull().sum())
Within the known values of the `City` feature, 60% of the leads come from Mumbai. We replace the missing `City` values for leads from India with `Mumbai`.
# Replacing Null Cities in India with Mumbai
condition = (leads['City'].isnull()) & condition_india
leads.loc[condition,'City'] = 'Mumbai'
Some values in the `What is your current occupation` column are missing.
tab(leads.stb.freq(['What is your current occupation']))
tab(leads['What is your current occupation'].reset_index().stb.missing())
We fill the missing values in `What is your current occupation` with 'Unknown Occupation'.
leads['What is your current occupation'].fillna('Unknown Occupation', inplace=True)
# Missing Values in `What matters most to you in choosing a course`
ftr = 'What matters most to you in choosing a course'
tab(leads.stb.freq([ftr]))
tab(leads[ftr].reset_index().stb.missing())
Almost all leads chose `Better Career Prospects`, excluding leads who haven't filled in this feature. We fill the missing values in `What matters most to you in choosing a course` with 'Unknown Target' for now.
# Filling missing values with 'Unknown Target'
leads[ftr].fillna('Unknown Target', inplace=True)
# Looking at labels in each Categorical Variable to check for incorrect labels.
categoricalFeatures = leads.dtypes[leads.dtypes == 'object'].index.values
print('Categorical Features : ', categoricalFeatures,'\n\n')
for feature in categoricalFeatures :
print('Levels in ',feature,' are ' , leads[feature].unique(),'\n\n')
# Replacing 'google' with 'Google'
leads['Lead Source']=leads['Lead Source'].str.replace("google","Google")
# Missing Values and Value Counts for all categorical Variables
tab(leads.stb.missing())
print('Value Counts of each Feature : \n')
for feature in sorted(categoricalFeatures) :
tab(leads.stb.freq([feature]))
# Dropping columns having only one label - since these do not explain any variability in the dataset
invariableCol = ['Digital Advertisement','Do Not Call','Get updates on DM Content','Magazine','Newspaper','Newspaper Article','Receive More Updates About Our Courses','Search',
'Update me on Supply Chain Content','Through Recommendations',
'I agree to pay the amount through cheque',"What matters most to you in choosing a course",'X Education Forums']
leads.drop(columns=invariableCol, inplace=True)
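As a sanity check on the hand-curated list above, strictly single-label columns can also be detected programmatically; a sketch (near-constant columns dominated by one label, like several of those dropped above, would still need a frequency check such as `leads.stb.freq`):
# columns whose non-null values are all identical
single_label_cols = [col for col in leads.columns if leads[col].nunique(dropna=True) <= 1]
print('Single-label columns :', single_label_cols)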
# Tags feature
tab(leads.stb.freq(['Tags']))
The `Tags` column shows remarks by the sales team. This is a subjective variable based on the judgement of the team and cannot be used for analysis, since the labels might change or might not always be available.
# dropping Tags feature
leads.drop(columns=['Tags'], inplace=True)
`Last Notable Activity` & `Last Activity` seem to have similar levels.
# Last Notable Activity vs Last Activity
leads_copy = leads.copy()
leads_copy['Converted'] = leads_copy['Converted'].astype('int')
tab(leads_copy.stb.freq(['Last Notable Activity'], value='Converted'))
tab(leads_copy.stb.freq(['Last Activity'], value='Converted'))
`Last Activity` has more levels than `Last Notable Activity`, and `Last Notable Activity` seems to be a column derived by the sales team from `Last Activity`. We drop `Last Notable Activity`.
leads.drop(columns=['Last Notable Activity'], inplace=True)
# Looking at Missing Values again
leads.stb.missing()
leads.dropna(inplace=True)
leads.stb.missing()
# Country distribution
leads.stb.freq(['Country'])
# Grouping Countries with very low lead count into 'Outside India'
leadsByCountry = leads['Country'].value_counts(normalize=True)
lowLeadCountries = leadsByCountry[leadsByCountry <= 0.01].index
leads['Country'].replace(lowLeadCountries,'Outside India',inplace=True)
leads.stb.freq(['Country'])
feature = 'Lead Origin'
leads.stb.freq([feature])
`Lead Add Form` and `Lead Import` are less than 1% of all origins.
# Grouping lead origins
leadOriginsToGroup = ["Lead Add Form","Lead Import"]
leads[feature] = leads[feature].replace(leadOriginsToGroup, ['Other Lead Origins']*2)
leads.stb.freq([feature])
feature = 'Lead Source'
leads.stb.freq([feature])
# Grouping lead Sources
labelCounts = leads[feature].value_counts(normalize=True)
# labels with less than 1% contribution
labelsToGroup = labelCounts[labelCounts < 0.01].index.values
leads[feature] = leads[feature].replace(labelsToGroup, ['Other '+feature+'s']*len(labelsToGroup))
leads.stb.freq([feature])
feature = 'Last Activity'
leads.stb.freq([feature])
# Grouping Last Activity
labelCounts = leads[feature].value_counts(normalize=True)
# labels with less than 1% contribution
labelsToGroup = labelCounts[labelCounts < 0.01].index.values
leads[feature] = leads[feature].replace(labelsToGroup, ['Other '+feature]*len(labelsToGroup))
leads.stb.freq([feature])
feature = 'Specialization'
leads.stb.freq([feature])
# Grouping Specialization
labelCounts = leads[feature].value_counts(normalize=True)
# labels with less than ~1.2% contribution
labelsToGroup = labelCounts[labelCounts <= 0.012121].index.values
leads[feature] = leads[feature].replace(labelsToGroup, ['Other '+feature]*len(labelsToGroup))
leads.stb.freq([feature])
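The same rare-label grouping pattern has now been applied to `Country`, `Lead Origin`, `Lead Source`, `Last Activity` and `Specialization`; a small helper (a sketch; `group_rare_labels` is not part of the original notebook) would consolidate it:
def group_rare_labels(series, threshold=0.01, other_label=None):
    '''Replace labels whose relative frequency is below `threshold` with one "other" label.'''
    freqs = series.value_counts(normalize=True)
    rare = freqs[freqs < threshold].index
    other = other_label or ('Other ' + (series.name or 'labels'))
    return series.replace(rare, other)

# e.g. leads['Lead Source'] = group_rare_labels(leads['Lead Source'], threshold=0.01)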
# Columns retained
print('Retained Columns\n\n', leads.columns.values)
# Retained rows
print('Retained rows : ',leads.shape[0])
print("Ratio of retained rows", 100*leads.shape[0]/9240)
leads.stb.freq(['Converted'])
converted_cond = leads['Converted'] == 1
imbalance = leads[converted_cond].shape[0]/leads[~converted_cond].shape[0]
print('Class Imbalance : Converted /Un-converted =', np.round(imbalance,3))
def categoricalUAn(column, figsize=[8, 8]):
    '''Function for categorical univariate analysis'''
    print('Types of ' + column)
    tab(leads.stb.freq([column]))
    converted = leads[leads['Converted'] == 1]
    unconverted = leads[leads['Converted'] == 0]
    print(column + ' for Converted Leads')
    tab(converted.stb.freq([column]))
    print(column + ' for Un-Converted Leads')
    tab(unconverted.stb.freq([column]))
    print(column + ' vs Conversion Rate')
    tab((converted[column].value_counts()) / (converted[column].value_counts() + unconverted[column].value_counts()))
    # bar plot
    plt.figure(figsize=figsize)
    ax = sns.countplot(y=column, hue='Converted', data=leads)
    title = column + ' vs Lead Conversion'
    ax.set(title=title)
column = 'Lead Origin'
categoricalUAn(column,figsize=[8,8])
`Landing Page Submission` followed by `API` make up 93% of all leads.
column = 'Lead Source'
categoricalUAn(column,figsize=[8,8])
Most leads come from `Google` (31%), followed by `Direct Traffic` (28%) and `Olark Chat` (19%). Leads from `Reference` have a very high conversion rate (91%).
feature = 'Do Not Email'
categoricalUAn(feature,figsize=[8,8])
The vast majority of leads have `Do Not Email` = No.
# 'Last Activity'
feature = 'Last Activity'
categoricalUAn(feature,figsize=[8,8])
Converted to Lead
feature = 'Country'
categoricalUAn(feature,figsize=[8,8])
feature = 'Specialization'
categoricalUAn(feature)
feature = 'What is your current occupation'
categoricalUAn(feature,figsize=[8,8])
feature = 'City'
categoricalUAn(feature)
feature = 'A free copy of Mastering The Interview'
categoricalUAn(feature)
def num_univariate_analysis(column_name, scale='linear'):
    '''Function for numerical univariate analysis'''
    converted = leads[leads['Converted'] == 1]
    unconverted = leads[leads['Converted'] == 0]
    plt.figure(figsize=(8, 6))
    ax = sns.boxplot(x=column_name, y='Converted', data=leads)
    title = 'Boxplot of ' + column_name + ' vs Conversion'
    ax.set(title=title)
    if scale == 'log':
        ax.set_xscale('log')
        ax.set(xlabel=column_name + ' (Log Scale)')
    print('Spread for range of ' + column_name + ' that were Converted')
    tab(converted[column_name].describe())
    print('Spread for range of ' + column_name + ' that were not converted')
    tab(unconverted[column_name].describe())
column_name = 'TotalVisits'
num_univariate_analysis(column_name,scale='log')
`TotalVisits` has a lot of outliers among both converted and un-converted leads.
# Looking at quantiles
tab(leads[column_name].quantile(np.linspace(.90,1,20)))
We handle these outliers with soft range capping.
# Capping outliers to the 99th percentile value
cap = leads[column_name].quantile(.99)
condition = leads[column_name] > cap
leads.loc[condition, column_name] = cap
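The same 99th-percentile cap is applied to two more columns below; a small helper (a sketch; `cap_at_percentile` is not in the original notebook) keeps the rule in one place:
def cap_at_percentile(df, column, q=0.99):
    '''Soft range capping: clip values in `column` at its q-th quantile.'''
    cap = df[column].quantile(q)
    df.loc[df[column] > cap, column] = cap
    return df

# e.g. leads = cap_at_percentile(leads, 'Page Views Per Visit')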
column = 'Total Time Spent on Website'
num_univariate_analysis(column)
tab(leads[column].quantile(np.linspace(0.75,1,25)))
leads[column].quantile(np.linspace(0.75,1,50)).plot()
# Capping `Total Time Spent on Website` values to 99th percentile
cap = leads[column].quantile(.99)
condition = leads[column] > cap
leads.loc[condition, column] = cap
column = 'Page Views Per Visit'
num_univariate_analysis(column)
leads[column].quantile(np.linspace(0.75,1,30)).plot()
# Capping `Page Views Per Visit` values to 99th percentile
cap = leads[column].quantile(.99)
condition = leads[column] > cap
leads.loc[condition, column] = cap
leads.columns.values
continuous_vars = ['TotalVisits', 'Page Views Per Visit', 'Total Time Spent on Website']
plt.figure(figsize=[8,8])
sns.barplot(x=continuous_vars[0], y = 'A free copy of Mastering The Interview', data=leads, hue='Converted')
# sns.barplot(x='Lead Source', y = 'Country', hue='Converted', data=leads)
leads.groupby(['Country','Lead Source'])['Converted'].value_counts(normalize=True)\
.unstack()\
.plot(
layout=(2,2),
figsize=(14,12), kind='barh', stacked=True);
x = "What is your current occupation"
y = 'City'
leads.groupby([x,y])['Converted'].value_counts(normalize=True)\
.unstack()\
.plot(
layout=(2,2),
figsize=(14,12), kind='barh', stacked=True);
x = "Country"
y = 'Last Activity'
leads.groupby([x,y])['Converted'].value_counts(normalize=True)\
.unstack()\
.plot(
layout=(2,2),
figsize=(14,12), kind='barh', stacked=True);
binary_var = ['Do Not Email', 'A free copy of Mastering The Interview']
leads[binary_var] = leads[binary_var].replace({'Yes' : 1, 'No' : 0})
categoricalCol = ['Lead Origin', 'Lead Source','Last Activity', 'Country', 'Specialization',
'What is your current occupation', 'City']
print('Levels in Each Categorical Variable\n')
for col in sorted(categoricalCol):
    print(col, leads[col].unique(), '\n')
# Creating dummy variables
leadOriginDummies = pd.get_dummies(leads['Lead Origin'], drop_first=True)
leadSourceDummies = pd.get_dummies(leads['Lead Source'], drop_first=True)
lastActivityDummies = pd.get_dummies(leads['Last Activity'], drop_first=True)
countryDummies = pd.get_dummies(leads['Country'] ,drop_first=True)
specDummies = pd.get_dummies(leads['Specialization'],drop_first=True)
occupationDummies = pd.get_dummies(leads[ 'What is your current occupation'],drop_first=True)
cityDummies = pd.get_dummies(leads[ 'City'],drop_first=True)
# adding dummy variables to leads dataframe
leads = pd.concat([leads, leadOriginDummies,leadSourceDummies,lastActivityDummies, countryDummies, specDummies, occupationDummies, cityDummies], axis=1)
# dropping categorical columns
leads.drop(columns = categoricalCol, inplace=True)
print('Final Columns')
leads.columns
# Top Correlations
def correlation(dataframe):
    '''Returns feature pairs sorted by absolute correlation (upper triangle only).'''
    cor0 = dataframe.corr()
    # keep only the upper triangle so each pair appears once
    cor0 = cor0.where(np.triu(np.ones(cor0.shape), k=1).astype(bool))
    cor0 = cor0.unstack().reset_index()
    cor0.columns = ['VAR1', 'VAR2', 'CORR']
    cor0.dropna(subset=['CORR'], inplace=True)
    cor0.CORR = round(cor0['CORR'], 2)
    cor0.CORR = cor0.CORR.abs()
    cor0 = cor0[~(cor0['VAR1'] == cor0['VAR2'])]
    return pd.DataFrame(cor0.sort_values(by=['CORR'], ascending=False))
#Correlations for Converted Leads
convertedCondition= leads['Converted']==1
print('Correlations for Converted Leads')
correlation(leads[convertedCondition])[:15].style.background_gradient(cmap='GnBu').hide_index()
#Correlations for un-Converted Leads
unconvertedCondition=leads['Converted']==0
print('Correlations for Non-Converted Leads')
correlation(leads[unconvertedCondition])[:15].style.background_gradient(cmap='GnBu').hide_index()
`Unknown Occupation` and `Unemployed` are highly correlated for non-converted leads.
from sklearn.model_selection import train_test_split
y = leads.pop('Converted')
X = leads
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=100)
continuous_vars
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# fitting and transforming train set
X_train[continuous_vars] = scaler.fit_transform(X_train[continuous_vars])
# Transforming test set for later use
X_test[continuous_vars] = scaler.transform(X_test[continuous_vars])
print('No of features : ', len(X_train.columns))
# RFE
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
minFeatures = 25
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=minFeatures)
rfe = rfe.fit(X_train, y_train)
# Columns selected by RFE :
RFE_features = pd.DataFrame( {'feature' : X_train.columns, 'rank' : rfe.ranking_, 'support' : rfe.support_})
condition = RFE_features['support'] == True
rfe_features = RFE_features[condition].sort_values(by='rank',ascending=True )
print('Features selected by RFE\n')
rfe_features
rfeFeatures = rfe_features['feature'].values
### Multicollinearity
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
def vif(X):
    '''Print the VIF of each feature (a constant is added so VIFs are intercept-adjusted).'''
    df = sm.add_constant(X)
    vifs = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
    vif_frame = pd.DataFrame({'vif': vifs}, index=df.columns).reset_index()
    tab(vif_frame.sort_values(by='vif', ascending=False))
# Model 1
import statsmodels.api as sm
features = rfe_features['feature'].values
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
print("VIF for X_train")
vif(X_train)
logm1.fit().summary()
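Models 2 through 11 below repeat the same cycle: drop one feature, refit, re-check VIFs and p-values. A small wrapper (a sketch, not part of the original notebook) expresses one elimination step:
def drop_and_refit(X, y, column_to_remove):
    '''One backward-elimination step: drop a feature, refit the GLM, report VIF and summary.'''
    X = X[X.columns[X.columns != column_to_remove]]
    model = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial()).fit()
    print('VIF after removing', column_to_remove)
    vif(X)
    print(model.summary())
    return X, model

# e.g. X_train, logm2 = drop_and_refit(X_train, y_train, 'Unemployed')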
`Unemployed` has the highest VIF. Let's drop this feature.
# Model 2 : Removing `Unemployed`
column_to_remove = 'Unemployed'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
print("VIF for X_train")
vif(X_train)
logm1.fit().summary()
`Other Lead Origins` has a very high VIF.
# Model 3 : Removing `Other Lead Origins`
column_to_remove = 'Other Lead Origins'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
print("VIF for X_train")
vif(X_train)
logm1.fit().summary()
`Housewife` has a high p-value, so its coefficient is insignificant. Let's drop it.
# Model 4 : Removing `Housewife`
column_to_remove = 'Housewife'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
print("VIF for X_train")
vif(X_train)
logm1.fit().summary()
`Student` has a p-value above 0.05, the highest among all p-values. Let's drop this feature.
# Model 5 : Removing `Student`
column_to_remove = 'Student'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
print("VIF for X_train")
vif(X_train)
logm1.fit().summary()
`Tier II Cities` has a p-value above the significance level and the highest among all p-values.
# Model 6 : Removing `Tier II Cities`
column_to_remove = 'Tier II Cities'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
print("VIF for X_train")
vif(X_train)
logm1.fit().summary()
`Page Views Per Visit` has a high p-value. Let's eliminate it.
# Model 7 : Removing `Page Views Per Visit`
column_to_remove = 'Page Views Per Visit'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
print("VIF for X_train")
vif(X_train)
logm1.fit().summary()
`Media and Advertising` has a high p-value. Let's drop this feature.
# Model 8 : Removing `Media and Advertising`
column_to_remove = 'Media and Advertising'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
print("VIF for X_train")
vif(X_train)
logm1.fit().summary()
`Email Link Clicked` has a high p-value of 0.059. Let's drop this feature.
# Model 9 : Removing `Email Link Clicked`
column_to_remove = 'Email Link Clicked'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
print("VIF for X_train")
vif(X_train)
logm1.fit().summary()
`TotalVisits` has the smallest coefficient. Let's drop it.
# Model 10 : Removing `TotalVisits`
column_to_remove = 'TotalVisits'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
logm1 = logm1.fit()
print("VIF for X_train")
vif(X_train)
logm1.summary()
`Other Lead Sources` has a high p-value. Let's drop this variable.
# Model 11 : Removing `Other Lead Sources`
column_to_remove = 'Other Lead Sources'
features = X_train.columns[X_train.columns !=column_to_remove]
X_train = X_train[features]
logm_final = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
logm_final = logm_final.fit()
print("VIF for X_train")
vif(X_train)
logm_final.summary()
finalFeatures = X_train.columns.values
print('The final features for modelling are :', finalFeatures)
X_train_sm = sm.add_constant(X_train)
y_train_pred = logm_final.predict(X_train_sm)
# Creating a data frame with converted vs converted probabilities
y_train_pred_final = pd.DataFrame({'Converted':y_train.values, 'Converted_Prob':y_train_pred})
y_train_pred_final['CustID'] = y_train.index
y_train_pred_final.head(10)
#Creating new column 'predicted' with 1 if Converted_Prob > 0.5 else 0
y_train_pred_final['predicted'] = y_train_pred_final.Converted_Prob.map(lambda x: 1 if x > 0.5 else 0)
# Let's see the head
y_train_pred_final.head(10)
from sklearn import metrics
# Confusion matrix
confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.predicted )
print(confusion)
Confusion Matrix for Train Set

Actual \ Predicted | Not Converted | Converted |
---|---|---|
Not Converted | 3462 | 455 |
Converted | 699 | 1693 |
# Let's check the overall accuracy.
accuracy = metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.predicted)
print('Accuracy on Train set : ', round(100*accuracy,3),'%')
TP = confusion[1,1] # true positive
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
sensitivity = TP/(FN + TP)
specificity = TN/(FP + TN)
falsePositiveRate = FP/(FP + TN)
positivePredictivePower = TP/(TP +FP )
negativePredictivePower = TN/(TN + FN)
print('sensitivity / Recall: ', round(100*sensitivity,3),'%')
print('specificity : ', round(100*specificity,3),'%')
print('False Positive Rate : ', round(100*falsePositiveRate,3),'%')
print('Precision / Positive Predictive Power : ', round(100*positivePredictivePower,3),'%')
print('Negative Predictive Power : ', round(100*negativePredictivePower,3),'%')
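As a cross-check (optional; not part of the original flow), sklearn's `classification_report` reproduces the per-class precision and recall in one call:
from sklearn.metrics import classification_report
# per-class precision / recall / f1 at the default 0.5 cutoff
print(classification_report(y_train_pred_final.Converted, y_train_pred_final.predicted))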
def draw_roc(actual, probs):
    fpr, tpr, thresholds = metrics.roc_curve(actual, probs, drop_intermediate=False)
    auc_score = metrics.roc_auc_score(actual, probs)
    plt.figure(figsize=(5, 5))
    plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.show()
    return None
draw_roc(y_train_pred_final.Converted, y_train_pred_final.Converted_Prob)
# Let's create columns with different probability cutoffs
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_train_pred_final[i] = y_train_pred_final.Converted_Prob.map(lambda x: 1 if x > i else 0)
y_train_pred_final.head()
# Now let's calculate accuracy sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])
# TP = confusion[1,1] # true positive
# TN = confusion[0,0] # true negatives
# FP = confusion[0,1] # false positives
# FN = confusion[1,0] # false negatives
num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final[i])
    total1 = sum(sum(cm1))
    accuracy = (cm1[0,0] + cm1[1,1])/total1
    speci = cm1[0,0]/(cm1[0,0] + cm1[0,1])
    sensi = cm1[1,1]/(cm1[1,0] + cm1[1,1])
    cutoff_df.loc[i] = [i, accuracy, sensi, speci]
print(cutoff_df)
# Let's plot accuracy sensitivity and specificity for various cutoff probabilities.
fig,ax = plt.subplots()
fig.set_figwidth(30)
fig.set_figheight(10)
plots=['accuracy','sensi','speci']
ax.set_xticks(np.linspace(0,1,50))
ax.set_title('Finding Optimal Cutoff')
sns.lineplot(x='prob',y=plots[0] , data=cutoff_df,ax=ax)
sns.lineplot(x='prob',y=plots[1] , data=cutoff_df,ax=ax)
sns.lineplot(x='prob',y=plots[2] , data=cutoff_df,ax=ax)
ax.set_xlabel('Probabilites')
ax.set_ylabel('Accuracy,Sensitivity,Specificity')
ax.legend(["Accuracy",'Sensitivity','Specificity'])
# cutoff_df.plot.line(, figure=[10,10])
plt.show()
y_train_pred_final['final_predicted'] = y_train_pred_final.Converted_Prob.map( lambda x: 1 if x > 0.36 else 0)
y_train_pred_final.head()
# Let's check the overall accuracy.
accu = metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)
print('Accuracy on Train set at Optimum Cut Off : ', round(100*accu,3),'%')
confusion2 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.final_predicted )
confusion2
TP = confusion2[1,1] # true positive
TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives
sensitivity = TP/(FN + TP)
specificity = TN/(FP + TN)
falsePositiveRate = FP/(FP + TN)
positivePredictivePower = TP/(TP +FP )
negativePredictivePower = TN/(TN + FN)
print('sensitivity / Recall: ', round(100*sensitivity,3),'%')
print('specificity : ', round(100*specificity,3),'%')
print('False Positive Rate : ', round(100*falsePositiveRate,3),'%')
print('Precision / Positive Predictive Power : ', round(100*positivePredictivePower,3),'%')
print('Negative Predictive Power : ', round(100*negativePredictivePower,3),'%')
## ROC curve for cut off probability of 0.36
draw_roc(y_train_pred_final.Converted, y_train_pred_final.final_predicted)
#Looking at the confusion matrix again
confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.predicted )
confusion
print('Precision :', confusion[1,1]/(confusion[0,1]+confusion[1,1]))
print('Recall :', confusion[1,1]/(confusion[1,0]+confusion[1,1]))
#Doing the same using the sklearn.
from sklearn.metrics import precision_score, recall_score
print('Precision : ', precision_score(y_train_pred_final.Converted, y_train_pred_final.predicted))
print('Recall :', recall_score(y_train_pred_final.Converted, y_train_pred_final.predicted))
from sklearn.metrics import precision_recall_curve
p, r, thresholds = precision_recall_curve(y_train_pred_final.Converted, y_train_pred_final.Converted_Prob)
plt.plot(thresholds, p[:-1], "g-")
plt.plot(thresholds, r[:-1], "r-")
plt.show()
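The crossover of the precision and recall curves suggests an alternative cut-off; a short sketch (not in the original notebook) to read it off numerically:
# threshold at which precision and recall are (approximately) equal
crossover_idx = np.argmin(np.abs(p[:-1] - r[:-1]))
print('Precision ~ Recall at threshold :', round(thresholds[crossover_idx], 3))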
X_test_sm = sm.add_constant(X_test[finalFeatures])
y_test_pred = logm_final.predict(X_test_sm)
# predicted conversions vs actual conversions and customer ID
y_test_predictions = pd.DataFrame({'Converted' :y_test, 'Conversion Probability' : y_test_pred, 'CustID' : y_test.index})
y_test_predictions.head()
# predictions with optimal cut-off = 0.36
cutoff = 0.36
y_test_predictions['Predicted'] = y_test_predictions[
'Conversion Probability'
].map(lambda x : 1 if x > cutoff else 0 )
confusion = metrics.confusion_matrix(y_test_predictions['Converted'], y_test_predictions['Predicted'])
confusion
TP = confusion[1,1] # true positive
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
print('Accuracy on Test set : ', round(100*(TP + TN)/(TP + TN + FP + FN),3),'%')
sensitivity = TP/(FN + TP)
specificity = TN/(FP + TN)
falsePositiveRate = FP/(FP + TN)
positivePredictivePower = TP/(TP +FP )
negativePredictivePower = TN/(TN + FN)
print('sensitivity / Recall: ', round(100*sensitivity,3),'%')
print('specificity : ', round(100*specificity,3),'%')
print('False Positive Rate : ', round(100*falsePositiveRate,3),'%')
print('Precision / Positive Predictive Power : ', round(100*positivePredictivePower,3),'%')
print('Negative Predictive Power : ', round(100*negativePredictivePower,3),'%')
## ROC curve for cut-off probability of 0.36
draw_roc(y_test_predictions['Converted'],y_test_predictions['Predicted'])
# merging final predictions with leads dataset
conversionProb = pd.concat([y_test_predictions['Conversion Probability'],y_train_pred_final['Converted_Prob']],axis=0)
conversionProb = pd.DataFrame({'Conversion Probability' : conversionProb}, index=conversionProb.index)
leads = pd.concat([leads,conversionProb],axis=1)
leads['Prospect ID'] = prospect_ids
leads['Lead No'] = lead_no
leads['Converted'] = y
# Verifying prediction accuracy
leads['Predicted'] = leads['Conversion Probability'].map(lambda x : 1 if x > 0.36 else 0)
confusion = metrics.confusion_matrix(leads['Converted'], leads['Predicted'])
TP = confusion[1,1] # true positive
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
acc = metrics.accuracy_score(leads['Converted'], leads['Predicted'])
print('Accuracy : ', round(100*acc,3),'%')
sensitivity = TP/(FN + TP)
specificity = TN/(FP + TN)
falsePositiveRate = FP/(FP + TN)
falseNegativeRate = FN/(FN + TP)
positivePredictivePower = TP/(TP +FP )
negativePredictivePower = TN/(TN + FN)
print('sensitivity : ', round(100*sensitivity,3),'%')
print('specificity : ', round(100*specificity,3),'%')
print('False Positive Rate : ', round(100*falsePositiveRate,3),'%')
print('False Negative Rate : ', round(100*falseNegativeRate,3),'%')
print('Positive Predictive Power / Precision : ', round(100*positivePredictivePower,3),'%')
print('Negative Predictive Power : ', round(100*negativePredictivePower,3),'%')
## ROC curve
draw_roc(leads['Converted'], leads['Predicted'])
# Lead Scores
leads['Lead Score'] = leads['Conversion Probability']*100
leads[['Prospect ID','Lead No','Lead Score']].sort_values(by='Lead Score', ascending=False)[:10]
# Run the following to generate a sheet containing lead information provided by the company and corresponding scores
leads.to_csv('lead_scores.csv')
# Gain Chart
y_test_predictions = y_test_predictions.sort_values(by='Conversion Probability', ascending=False)
y_test_predictions['decile'] = pd.qcut(y_test_predictions['Conversion Probability'],10,labels=range(10,0,-1))
y_test_predictions['Converted'] = y_test_predictions['Converted'].astype('int')
y_test_predictions['Un Converted'] = 1 - y_test_predictions['Converted']
y_test_predictions.head()
df1 = pd.pivot_table(data=y_test_predictions,index=['decile'],values=['Converted','Un Converted','Conversion Probability'],
aggfunc={'Converted':[np.sum],
'Un Converted':[np.sum],
'Conversion Probability' : [np.min,np.max]})
df1 = df1.reset_index()
df1.columns = ['Decile','Max Prob', 'Min Prob','Converted Count','Un Converted Count']
df1 = df1.sort_values(by='Decile', ascending=False)
df1['Total Leads'] = df1['Converted Count'] + df1['Un Converted Count']
df1['Conversion Rate'] = df1['Converted Count'] / df1['Total Leads']
converted_sum = df1['Converted Count'].sum()
unconverted_sum = df1['Un Converted Count'].sum()
df1['Converted %'] = df1['Converted Count'] / converted_sum
df1['Un Converted %'] = df1['Un Converted Count'] / unconverted_sum
df1.head()
df1['ks_stats'] = np.round(((df1['Converted Count'] / df1['Converted Count'].sum()).cumsum() -(df1['Un Converted Count'] / df1['Un Converted Count'].sum()).cumsum()), 4) * 100
df1
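The KS statistic is the maximum separation between the cumulative converted and un-converted distributions; reading it off the table (a sketch):
# maximum gap between the two cumulative distributions
print('KS statistic :', df1['ks_stats'].max(), '%')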
df1['Cum Conversion %'] = np.round(((df1['Converted Count'] / df1['Converted Count'].sum()).cumsum()), 4) * 100
df1
df1['Base %'] = np.arange(10,110,10)
df1 = df1.set_index('Decile')
df1
### Gain chart
plot_columns =['Base %','Cum Conversion %']
plt.plot(df1[plot_columns]);
plt.xticks(df1.index);
plt.title('Gain chart');
plt.xlabel('Decile')
plt.ylabel('Cumulative Conversion %')
plt.legend(('Our Model','Random Model'));
df1['Lift'] = df1['Cum Conversion %'] / df1['Base %']
df1['Baseline'] = 1
df1
# Lift chart
plot_columns =['Lift', 'Baseline']
plt.plot(df1[plot_columns]);
plt.xticks(df1.index);
plt.title('Lift chart');
plt.xlabel('Decile')
plt.ylabel('Lift')
plt.legend(('Our Model','Random Model'));