A bike-sharing system is a service in which bikes are made available for shared use to individuals on a short-term basis, for a price or for free. Many bike-share systems allow people to borrow a bike from a "dock", which is usually computer-controlled: the user enters payment information and the system unlocks a bike. The bike can then be returned to another dock belonging to the same system.
BoomBikes, a US bike-sharing provider, has recently suffered considerable dips in its revenue due to the ongoing COVID-19 pandemic. The company is finding it very difficult to sustain itself in the current market scenario, so it has decided to come up with a mindful business plan to accelerate its revenue as soon as the ongoing lockdown ends and the economy returns to a healthy state.
To that end, BoomBikes aspires to understand the demand for shared bikes once the nationwide quarantine due to COVID-19 ends. This will help the company prepare to cater to people's needs once the situation improves, stand out from other service providers, and make huge profits.
They have contracted a consulting company to identify the factors on which the demand for these shared bikes depends, specifically in the American market. The company wants to know which variables are significant in predicting the demand for shared bikes and how well those variables describe the demand.
The goal is to model the demand for shared bikes using the available independent variables. Management will use the model to understand exactly how demand varies with different features, so the business strategy can be adjusted to meet demand levels and customer expectations. The model will also help management understand the demand dynamics of a new market.
This problem can be solved using multiple linear regression analysis. The company requires a twofold solution: identifying the variables that significantly drive demand, and quantifying how well those variables explain it.
The analysis follows a mixed feature-selection approach: 15 features are first selected algorithmically using Recursive Feature Elimination (RFE), and the selection is then refined manually by examining multicollinearity, the statistical significance of individual features, and the overall fit of the model. The 10 most significant features for understanding demand are reported.
The data set is randomly divided into training and test data.
The final model, built on the training data set, explains 84% of the variability in demand and achieves an R-squared of 0.81 on the test data.
The final relationship between demand and predictors is as follows.
cnt = 2392.0791
      + 1946.7864 * yr
      + 444.4907 * Saturday
      + 466.0136 * winter
      - 890.3115 * july
      - 1063.6669 * spring
      + 296.8008 * workingday
      - 1749.8275 * hum
      + 4471.6602 * temp
      - 1110.3191 * windspeed
      - 1273.7519 * light snow/rain

where `temp`, `windspeed` and `hum` are normalized (min-max scaled).
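As a worked example of reading this equation, the short sketch below plugs in a hypothetical day; the indicator and normalized values are made up purely for illustration and are not taken from the data.
# hypothetical example : second year (yr=1), a non-Saturday working day in winter, moderate weather
coefs = {'yr': 1946.7864, 'Saturday': 444.4907, 'winter': 466.0136, 'july': -890.3115,
         'spring': -1063.6669, 'workingday': 296.8008, 'hum': -1749.8275,
         'temp': 4471.6602, 'windspeed': -1110.3191, 'light snow/rain': -1273.7519}
x = {'yr': 1, 'Saturday': 0, 'winter': 1, 'july': 0, 'spring': 0, 'workingday': 1,
     'hum': 0.5, 'temp': 0.6, 'windspeed': 0.3, 'light snow/rain': 0}
predicted_cnt = 2392.0791 + sum(coefs[k] * x[k] for k in coefs)
print(round(predicted_cnt))   # predicted number of rentals for this hypothetical day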
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.outliers_influence import variance_inflation_factor
import warnings
warnings.filterwarnings('ignore')
data = pd.read_csv('./day.csv')
data.head()
data.info()
# dropping `instant`,`dteday`,`casual`,`registered`
data = data.drop(columns=['instant','dteday','casual','registered'])
These variables were dropped because `instant` is just the serial number of each record, `dteday` is redundant since the information needed for the analysis is already contained in `mnth` and `yr`, and `casual` + `registered` = `cnt`, so keeping them would leak the target.
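As a quick sanity check of the last point, one can verify on the raw file (i.e. before the columns are dropped) that `casual` + `registered` always equals `cnt`; this is an illustrative sketch, not part of the original analysis.
# sanity check on the raw data : casual + registered should equal cnt for every record
raw = pd.read_csv('./day.csv')
print((raw['casual'] + raw['registered'] == raw['cnt']).all())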
# summary statistics of numerical variables
data[['temp','atemp','hum','windspeed']].describe()
# Scatter Plots of Continuous variables vs 'cnt'
sns.set_style("whitegrid")
sns.pairplot(data=data,x_vars=['temp','atemp','hum','windspeed'],y_vars='cnt',kind='scatter',height=5,aspect=1);
## Dropping outliers in continuous variables
# outliers in temp
data = data.drop(index = data[(data['temp'] > 15) & (data['temp'] < 20) & (data['cnt'] < 100)].index)
data = data.drop(index = data[(data['temp'] > 25) & (data['temp'] < 30) & (data['cnt'] < 2000)].index)
# outliers in atemp
data = data.drop(index = data[(data['atemp'] > 20) & (data['atemp'] < 25) & (data['cnt'] < 100)].index)
data = data.drop(index = data[(data['atemp'] > 30) & (data['atemp'] < 35) & (data['cnt'] < 2000)].index)
#outliers in hum
data = data.drop(index = data[(data['hum'] < 20)].index)
#outliers in windspeed
data = data.drop(index = data[(data['windspeed'] > 30)].index)
# Looking at correlation with continuous variables
correlation = data[['temp','atemp','hum','windspeed','cnt']].corr()['cnt'].apply(lambda x : round(x,4))
correlation = pd.DataFrame(correlation).sort_values(by='cnt',ascending=False)
correlation.drop(index=['cnt'],inplace=True)
# visualise the correlations with a colour gradient
correlation.style.background_gradient(cmap='GnBu')
Adjusted temperature (`atemp`) has the highest positive correlation with `cnt`, followed by temperature (`temp`); `hum` has the lowest correlation.
# correlation between `temp` and `atemp`
data[['temp','atemp']].corr()
The correlation between `temp` and `atemp` is almost 1, so one of them can be dropped. `atemp` represents the adjusted ("feels-like") temperature, a compound measure of temperature, humidity and windspeed (Ref: UK Meteorological Dept). Being a compound variable, `atemp` might introduce bias into the model; instead we can use `temp`, `hum` and `windspeed` directly. Hence `atemp` is dropped. It also makes business sense to keep `temp` and calculate the adjusted temperature from it when needed.
# dropping `atemp`
data = data.drop(columns=['atemp'])
data[['temp','hum','windspeed']].corr()
With `atemp` dropped, the correlations among the remaining continuous variables `temp`, `hum` and `windspeed` are checked.
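To illustrate the point that `atemp` is essentially a compound of the other weather variables, one could regress it on `temp`, `hum` and `windspeed`. The sketch below is purely illustrative and works off the raw file, since `atemp` has already been dropped from `data`.
# illustrative sketch : how much of atemp is explained by temp, hum and windspeed?
import statsmodels.api as sm
raw = pd.read_csv('./day.csv')
atemp_model = sm.OLS(raw['atemp'], sm.add_constant(raw[['temp','hum','windspeed']])).fit()
print('R-squared of atemp ~ temp + hum + windspeed :', round(atemp_model.rsquared, 4))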
# Converting variables into categorical type
data[['season','weathersit','mnth']] = data[['season','weathersit','mnth']].astype('category')
# Unique values in each categorical variable / [To check for disguised missing values]
cat_vars = ['season','yr','mnth','holiday','weekday','workingday','weathersit']
for i in cat_vars :
    print('Unique values in ',i, data[i].unique())
# Replacing numbers with labels
season_labels = {
1 : 'spring',
2 : 'summer',
3 : 'fall',
4 : 'winter'
}
mnth_labels = {
1 : 'january',
2 : 'february',
3 : 'march',
4 : 'april',
5 : 'may',
6 : 'june',
7 : 'july',
8 : 'august',
9 : 'september',
10 : 'october',
11 : 'november',
12 : 'december'
}
weekday_labels = { # considering the first row of dteday to be 01-01-2011
0 : 'Sunday',
1 : 'Monday',
2 : 'Tuesday',
3 : 'Wednesday',
4 : 'Thursday',
5 : 'Friday',
6 : 'Saturday'
}
weathersit_labels = {
1 : 'clear',
2 : 'cloudy',
3 : 'light snow/rain'
}
# replacing numerals with labels
data['season'] = data['season'].replace(season_labels)
data['mnth'] = data['mnth'].replace(mnth_labels)
data['weekday'] = data['weekday'].replace(weekday_labels)
data['weathersit'] = data['weathersit'].replace(weathersit_labels)
data.head()
cat_vars = ['season','yr','mnth','holiday','weekday', 'workingday','weathersit']
data1 = data[cat_vars]
data1.loc[:,'cnt'] = data['cnt'].values
data1[['yr','holiday','workingday']] = data1[['yr','holiday','workingday']].astype('category')
plot_dim = [3,3]
fig,axs = plt.subplots(*plot_dim)
fig.set_figheight(15)
fig.set_figwidth(20)
for i in range(plot_dim[0]) :
    for j in range(plot_dim[1]) :
        axs[i,j].set(title = cat_vars[i*plot_dim[1]+j])
        sns.boxplot(data=data1,x='cnt',y=cat_vars[i*plot_dim[1]+j],width=0.4,ax=axs[i,j])
        if i*plot_dim[1]+j == 6 :
            break
axs[2,1].set_axis_off()
axs[2,2].set_axis_off()
# Dropping outliers in Categorical Variables
data = data.drop(index = data[(data['season'] == 'spring') & (data['cnt'] > 7000)].index)
# correlation among variables
plt.figure(figsize=[10,10])
sns.heatmap(data.corr(),cmap='GnBu',center=0,annot=True)
Among the predictors, the strongest correlation with `cnt` is seen for `temp`, followed by `yr`.
# creating indicator variable columns
season_indicators = pd.get_dummies(data['season'],drop_first=True)
mnth_indicators = pd.get_dummies(data['mnth'],drop_first=True)
weekday_indicators = pd.get_dummies(data['weekday'],drop_first=True)
weathersit_indicators = pd.get_dummies(data['weathersit'],drop_first=True)
# adding indicator variable columns to the dataset . Dropping original columns
data = pd.concat([data,season_indicators,mnth_indicators,weekday_indicators,weathersit_indicators],axis=1)
data = data.drop(columns=['season','mnth','weekday','weathersit'])
data.head()
data.columns
Since `drop_first=True` is used, the first label of each variable is dropped and acts as the reference category:

| Variable | Reference Label |
|---|---|
| season | fall |
| mnth | april |
| weekday | Friday |
| weathersit | clear |
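A minimal illustration of this reference-category encoding, using a toy series (purely illustrative, not part of the analysis):
demo = pd.Series(['spring', 'summer', 'fall', 'winter'])
# 'fall' gets no column of its own : a row with all indicators equal to 0 represents fall
print(pd.get_dummies(demo, drop_first=True))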
from sklearn.model_selection import train_test_split
dtrain,dtest = train_test_split(data,train_size=0.7,test_size=0.3,random_state=120)
# normalization of continuous variables
from sklearn.preprocessing import MinMaxScaler
numerical_scaler = MinMaxScaler()
num_vars = ['temp','hum','windspeed']
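# fit the scaler on the training data only; the same fitted scaler is reused later to transform the test set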
numerical_scaler.fit(dtrain[num_vars])
dtrain[num_vars] = numerical_scaler.transform(dtrain[num_vars])
y_train = dtrain.pop('cnt')
X_train = dtrain
y_train.head()
X_train.head()
X_train.columns
Approach
# Selecting 15 Features using RFE
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
lr_estimator = LinearRegression()
rfe = RFE(lr_estimator,n_features_to_select=15, step=1)
selector = rfe.fit(X_train,y_train)
# RFE Feature Ranking
rfe_ranking = pd.DataFrame({'rank' : selector.ranking_, 'support': selector.support_, 'features' : X_train.columns}).sort_values(by='rank',ascending=True)
rfe_ranking
# Selected Features
selected_features = rfe_ranking.loc[rfe_ranking['rank'] == 1,'features'].values
selected_features
# Following a stepwise elimination
import statsmodels.api as sm
def ols_fit(y,X) :
    X_train_sm = sm.add_constant(X)
    model = sm.OLS(y,X_train_sm).fit()
    print(model.summary())
    return model

def vif(X) :
    df = sm.add_constant(X)
    vif = [variance_inflation_factor(df.values,i) for i in range(df.shape[1])]
    vif_frame = pd.DataFrame({'vif' : vif[0:]},index = df.columns).reset_index()
    print(vif_frame.sort_values(by='vif',ascending=False))
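As a quick illustrative check of what VIF measures (assuming `temp` is among the RFE-selected features at this point): the VIF of a predictor equals 1 / (1 - R²) of that predictor regressed on all the other predictors, including the constant.
# manual VIF vs library VIF for 'temp'
X_chk = sm.add_constant(X_train[selected_features])
r2_temp = sm.OLS(X_chk['temp'], X_chk.drop(columns=['temp'])).fit().rsquared
print('manual VIF for temp :', round(1 / (1 - r2_temp), 4))
print('library VIF for temp:', round(variance_inflation_factor(X_chk.values, list(X_chk.columns).index('temp')), 4))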
features_1 = selected_features
ols_fit(y_train,X_train[features_1])
Dropping `holiday` because of its high p-value.
del_feature = 'holiday'
selected_features = selected_features[selected_features!=del_feature]
ols_fit(y_train,X_train[selected_features])
Dropping `Sunday` because of its high p-value.
del_feature = 'Sunday'
selected_features = selected_features[selected_features!=del_feature]
ols_fit(y_train,X_train[selected_features])
Dropping `january` because this information might also be contained in `winter`.
del_feature = 'january'
selected_features = selected_features[selected_features!=del_feature]
ols_fit(y_train,X_train[selected_features])
Dropping `december` because this information might also be contained in `winter`.
del_feature = 'december'
selected_features = selected_features[selected_features!=del_feature]
ols_fit(y_train,X_train[selected_features])
Dropping `november` because this information might also be contained in `winter`.
del_feature = 'november'
selected_features = selected_features[selected_features!=del_feature]
final_model = ols_fit(y_train,X_train[selected_features])
vif(X_train[selected_features])
final_model = ols_fit(y_train,X_train[selected_features])
# Residual Analysis of Trained Data
X_train_sm = sm.add_constant(X_train[selected_features])
y_train_pred = final_model.predict(X_train_sm)
fig,ax = plt.subplots(1,2)
fig.set_figheight(8)
fig.set_figwidth(16)
ax[0].set(title='Frequency Distribution of Residuals')
sns.distplot(y_train-y_train_pred, bins=30, ax=ax[0])
ax[1].set(title='Predicted Values vs Residuals')
sns.regplot(y_train_pred,y_train-y_train_pred,ax=ax[1])
plt.show()
# Mean of Residuals
(y_train-y_train_pred).mean()
# Verifying the normality of distribution of residuals
mean = (y_train-y_train_pred).mean()
std = (y_train-y_train_pred).std()
ref_normal = np.random.normal(mean,std,(y_train-y_train_pred).shape[0])
percs = np.linspace(0,100,21)
qn_ref_normal = np.percentile(ref_normal, percs)
qn_residual = np.percentile(y_train - y_train_pred , percs)
plt.plot(qn_ref_normal,qn_residual, ls="", marker="o")
x = np.linspace(np.min((qn_ref_normal.min(),qn_residual.min())), np.max((qn_ref_normal.max(),qn_residual.max())))
plt.plot(x,x, color="k", ls="--")
plt.title('Q-Q Plot : Reference Normal vs Distribution of Residuals ')
plt.show()
# lag plot to assess independence of data points
from pandas.plotting import lag_plot
lag_plot(y_train-y_train_pred)
Hence, the assumptions of linear regression (residuals normally distributed with zero mean, no pattern against predicted values, and no autocorrelation) are satisfied by this model.
y_test = dtest.pop('cnt')
X_test = dtest
X_test[num_vars] = numerical_scaler.transform(X_test[num_vars])
X_test = X_test[selected_features]
X_test = sm.add_constant(X_test)
y_test_pred = final_model.predict(X_test)
# Plotting Actual vs Predicted No of rentals
fig,ax = plt.subplots()
fig.set_figheight(8)
fig.set_figwidth(20)
l1,=ax.plot(range(len(y_test)),y_test)
l2, = ax.plot(range(len(y_test_pred)),y_test_pred)
plt.legend([l1,l2],['Actual','Predicted'])
plt.title('Predicted vs Actual No of Rentals');
plt.ylabel('No of Bike Rentals')
plt.xticks([])
plt.show()
plt.figure(figsize=[8,8])
plt.scatter(y_test,y_test_pred);
plt.title('Predicted vs Actual No of Rentals');
from sklearn.metrics import mean_squared_error,r2_score
mse = mean_squared_error(y_test, y_test_pred)
rsquared_test = r2_score(y_test, y_test_pred)
rsquared_train = r2_score(y_train, y_train_pred)
print('R-squared for train data:',round(rsquared_train,2))
print('R-squared for test data:',round(rsquared_test,2))
print('Mean Squared Error',round(mse,3))
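For completeness, the adjusted R-squared on the test data could also be reported using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of test observations and p the number of predictors; a short sketch:
n, p = len(y_test), len(selected_features)
adj_r2_test = 1 - (1 - rsquared_test) * (n - 1) / (n - p - 1)
print('Adjusted R-squared for test data:', round(adj_r2_test, 2))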
# R-square using cross validation
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
clr = cross_val_score(lr,X_train[selected_features],y_train,cv=10, scoring='r2')
clr
print("R-square at 0.95 confidence level : %0.2f (+/- %0.2f)" % (clr.mean(), clr.std() * 2))
selected_features
# standardizing numerical variables
from sklearn.preprocessing import StandardScaler
reg_features = selected_features
scaler = StandardScaler()
X_std_input = X_train[selected_features]   # avoid overwriting the `data` frame used earlier
scaler.fit(X_std_input[['temp','windspeed','hum']])
std_X_train = pd.DataFrame(data = scaler.transform(X_std_input[['temp','windspeed','hum']]), columns=['temp','windspeed','hum'])
# add the indicator/binary features as they are, without overwriting the standardized columns
for i in reg_features :
    if i not in ['temp','windspeed','hum'] :
        std_X_train[i] = X_std_input[i].values
reshaped_y_train = y_train.values.reshape(-1,1)
# Fitting linear regression model
std_model = lr.fit(std_X_train, reshaped_y_train)
# Coefficients and intercept
result = pd.DataFrame(data = std_model.coef_, columns = std_X_train.columns, index=['MLR Coefficients']).T
result = result.sort_values(by='MLR Coefficients',ascending=False)
print('\nIntercept :',std_model.intercept_)
result
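To read the relative influence of the predictors directly off the fitted model, the coefficients can be ranked by absolute magnitude:
# rank predictors by the absolute size of their coefficients
print(result['MLR Coefficients'].abs().sort_values(ascending=False))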
Going by the magnitude of the coefficients, `temp` influences demand the most, followed by `yr` and `hum`, when all other modelled parameters are held unchanged.